language models

In the last chapter, we looked at models that use numerical data to understand the relationships between different aspects of a data set (inferential model use) and models that make predictions based on numerical data (predictive model use). In this chapter, we will explore a set of models called language models that transform non-numerical data (such as written text or protein sequences) into the numerical domain, enabling the non-numerical data to be analyzed using the techniques we have already covered. Language models are algorithms that are trained on large amounts of text (or, in the case of protein language models, many sequences) and can perform a variety of tasks related to their training data. In particular, we will focus on embedding models, which convert language data into numerical data. An embedding is a numerical representation of data that captures its essential features in a lower-dimensional space or in a different domain. In the context of language models, embeddings transform text, such as words or sentences, into vectors of numbers, enabling machine learning models and other statistical methods to process and analyze the data more effectively.

A basic form of an embedding model is a neural network called an autoencoder. Autoencoders consist of two main parts: an encoder and a decoder. The encoder takes the input data and compresses it into a lower-dimensional representation, called an embedding. The decoder then reconstructs the original input from this embedding, and the reconstruction is compared against the original input. The model (encoder and decoder together) is then iteratively optimized to minimize a loss function that measures the difference between the original input and its reconstruction. The result is an embedding model whose embeddings capture the important features of the original input.
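To make the encoder and decoder concrete, below is a minimal sketch of a linear autoencoder written in base R. This is an illustration of the training loop only, using the built-in iris measurements as stand-in input data; it is not one of the functions used later in this chapter. The encoder compresses each four-measurement sample into a two-dimensional embedding, the decoder reconstructs the measurements from that embedding, and gradient descent gradually reduces the reconstruction error:

set.seed(1)
X <- scale(as.matrix(iris[, 1:4]))                  # input data: 4 measurements per sample
k <- 2                                              # size of the embedding (the "bottleneck")
W_enc <- matrix(rnorm(4 * k, sd = 0.1), nrow = 4)   # encoder weights
W_dec <- matrix(rnorm(k * 4, sd = 0.1), nrow = k)   # decoder weights
lr <- 0.05                                          # learning rate

for (i in 1:5000) {
  Z <- X %*% W_enc                                  # encode: compress each sample to k numbers
  X_hat <- Z %*% W_dec                              # decode: reconstruct the original input
  err <- X_hat - X                                  # reconstruction error
  W_dec <- W_dec - lr * t(Z) %*% err / nrow(X)      # gradient descent on the squared error
  W_enc <- W_enc - lr * t(X) %*% (err %*% t(W_dec)) / nrow(X)
}

mean((X %*% W_enc %*% W_dec - X)^2)                 # mean squared reconstruction error after training
head(X %*% W_enc)                                   # the two-dimensional embeddings themselves

The language models used in this chapter work on the same principle, just with much larger, non-linear networks and with text or protein sequences (rather than flower measurements) as the input.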

pre-reading

Please read over the following:

  • Text Embeddings: Comprehensive Guide. In this article, Mariya Mansurova explores the evolution, applications, and visualization of text embeddings. Beginning with early methods like Bag of Words and TF-IDF, she traces how embeddings have advanced to capture semantic meaning, highlighting milestones such as word2vec and transformer-based models like BERT and Sentence-BERT. Mansurova explains how these embeddings transform text into vectors that computers can analyze for tasks like clustering, classification, and anomaly detection. She provides practical examples using tools like OpenAI’s embedding models and dimensionality reduction techniques, making the article a useful resource for both theoretical and hands-on understanding of text embeddings.

  • ESM3: Simulating 500 million years of evolution with a language model. In this 2024 blog article, EvolutionaryScale introduces ESM3, a language model trained on billions of protein sequences. The article explores how ESM3 marks a major advance in computational biology by enabling researchers to reason over protein sequences, structures, and functions. With massive datasets and powerful computational resources, ESM3 can generate entirely new proteins, including esmGFP, a green fluorescent protein that differs significantly from known natural variants. The article highlights the model’s potential to transform fields like medicine, synthetic biology, and environmental sustainability by making protein design programmable. It also stresses the importance of responsible development and transparency, with the model and code being made openly available to accelerate scientific progress.

text embeddings

Here, we will create text embeddings using publication data from PubMed. Text embeddings are numerical representations of text that preserve important information and allow us to apply mathematical and statistical analyses to textual data. Below, we use a series of functions to obtain titles and abstracts from PubMed, create embeddings for their titles, and analyze them using principal component analysis.

First, we use the searchPubMed function to extract relevant publications from PubMed based on specific search terms. This function interacts with the PubMed website via a tool called an API. An API, or Application Programming Interface, is a set of rules that allows different software programs to communicate with each other. In this case, the API allows our code to access data from the PubMed database directly, without needing to manually search through the website. An API key is a unique identifier that allows you to authenticate yourself when using an API. It acts like a password, giving you permission to access the API services. Here, I am reading my API key from a local file. You can obtain one by signing up for an NCBI account at https://pubmed.ncbi.nlm.nih.gov/. Once you have an API key, pass it to the searchPubMed function along with your search terms. Here, I am using “beta-amyrin synthase,” “friedelin synthase,” “Sorghum bicolor,” and “cuticular wax biosynthesis.” I also specify that I want the results sorted according to relevance (as opposed to sorting by date) and that I only want three results per term (the top three most relevant hits) to be returned:

search_results <- searchPubMed(
  search_terms = c("beta-amyrin synthase", "friedelin synthase", "sorghum bicolor", "cuticular wax biosynthesis"),
  pubmed_api_key = readLines("/Users/bust0037/Documents/Science/Websites/pubmed_api_key.txt"),
  retmax_per_term = 3,
  sort = "relevance"
)
## Error encountered. Attempt 1 of 3. Retrying in 5 seconds...
## Error encountered. Attempt 1 of 3. Retrying in 5 seconds...
## Error encountered. Attempt 2 of 3. Retrying in 5 seconds...
## Error encountered. Attempt 3 of 3. Retrying in 5 seconds...
## Failed to fetch data for term: sorghum bicolor after 3 attempts.
colnames(search_results)
## [1] "entry_number" "term"         "date"        
## [4] "journal"      "title"        "doi"         
## [7] "abstract"
select(search_results, term, title)
## # A tibble: 9 × 2
##   term                       title                          
##   <chr>                      <chr>                          
## 1 beta-amyrin synthase       β-Amyrin synthase from Conyza …
## 2 beta-amyrin synthase       Ginsenosides in Panax genus an…
## 3 beta-amyrin synthase       β-Amyrin biosynthesis: catalyt…
## 4 friedelin synthase         Friedelin Synthase from Mayten…
## 5 friedelin synthase         Friedelin in Maytenus ilicifol…
## 6 friedelin synthase         Functional characterization of…
## 7 cuticular wax biosynthesis Regulatory mechanisms underlyi…
## 8 cuticular wax biosynthesis Update on Cuticular Wax Biosyn…
## 9 cuticular wax biosynthesis Advances in Biosynthesis, Regu…

From the output, you can see that we’ve retrieved records for several publications, each containing information such as the title, journal, abstract, and the search term used. (Note that the query for “sorghum bicolor” failed after three attempts, so only the other three terms are represented.) This gives us a dataset that we can analyze further to gain insights into the relationships between different research topics.

Next, we use the embedText function to create embeddings for the titles of the extracted publications. Just like PubMed, the Hugging Face API requires an API key, which acts as a unique identifier and grants you access to their services. You can obtain an API key by signing up at https://huggingface.co and following the instructions to generate your own key. Once you have your API key, you will need to specify it when using the embedText function. In the example below, I am reading the key from a local file for convenience.

To set up the embedText function, provide the dataset containing the text you want to embed (in this case, search_results, the output from the PubMed search above), the column with the text (title), and your Hugging Face API key. This function will then generate numerical embeddings for each of the publication titles. By default, the embeddings are generated using a pre-trained embedding language model called ‘BAAI/bge-small-en-v1.5’, available through the Hugging Face API at https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5. This model is designed to create compact, informative numerical representations of text, making it suitable for a wide range of downstream tasks, such as clustering or similarity analysis. If you would like to know more about the model and its capabilities, you can visit the Hugging Face website at https://huggingface.co, where you will find detailed documentation and additional resources.

search_results_embedded <- embedText(
  df = search_results,
  column_name = "title",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)
search_results_embedded[1:3,1:10]
## # A tibble: 3 × 10
##   entry_number term  date       journal title doi   abstract
##          <dbl> <chr> <date>     <chr>   <chr> <chr> <chr>   
## 1            1 beta… 2019-11-20 FEBS o… β-Am… 10.1… Conyza …
## 2            2 beta… 2024-04-03 Acta p… Gins… 10.1… Ginseno…
## 3            3 beta… 2019-12-10 Organi… β-Am… 10.1… The enz…
## # ℹ 3 more variables: embedding_1 <dbl>, embedding_2 <dbl>,
## #   embedding_3 <dbl>

The output of the embedText function is a data frame that contains the columns of the original input plus 384 new columns holding the embedding values for each title. These embeddings capture the key features of each publication title. If you wish, you can use the left_join function from the dplyr package in R, as shown below, to join the embedded output back to the original search_results data frame by title. The merged dataset can then be used for further analyses on both the original metadata (such as titles and journals) and the generated embeddings.

search_results <- left_join(search_results, search_results_embedded, by = "title")
search_results[1:3, 1:10]
## # A tibble: 3 × 10
##   entry_number.x term.x     date.x     journal.x title doi.x
##            <dbl> <chr>      <date>     <chr>     <chr> <chr>
## 1              1 beta-amyr… 2019-11-20 FEBS ope… β-Am… 10.1…
## 2              2 beta-amyr… 2024-04-03 Acta pha… Gins… 10.1…
## 3              3 beta-amyr… 2019-12-10 Organic … β-Am… 10.1…
## # ℹ 4 more variables: abstract.x <chr>,
## #   entry_number.y <dbl>, term.y <chr>, date.y <date>
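Before running a PCA, we can get an intuitive feel for what the embeddings encode by comparing pairs of titles directly. Below is a minimal sketch that computes cosine similarity, a standard measure of how alike two embedding vectors are; it uses the search_results_embedded object created above and a small helper function defined here (not one of the functions provided with this chapter). Titles on related topics will generally score higher than titles on unrelated topics:

embed_cols <- grep("embed", colnames(search_results_embedded))    # locate the embedding columns
emb <- as.matrix(search_results_embedded[, embed_cols])

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine_sim(emb[1, ], emb[2, ])   # two "beta-amyrin synthase" titles
cosine_sim(emb[1, ], emb[7, ])   # a "beta-amyrin synthase" vs. a "cuticular wax biosynthesis" title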

To examine the relationships between the publication titles, we perform PCA on the text embeddings. We use the runMatrixAnalysis function, specifying PCA as the analysis type and indicating which columns contain the embedding values. The grep function searches the column names of the search_results data frame for those containing the word ‘embed’; these are the embedding columns, which are passed to the PCA as the columns with values for single analytes. We then visualize the results as a scatter plot, with each point representing a publication title, colored by the search term it corresponds to. We have seen many PCA plots over the course of our explorations, but note that this one is different: it represents relationships between the meanings of text passages (!) rather than relationships between samples for which we have made many numerical measurements.

runMatrixAnalysis(
  data = search_results_embedded,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(search_results)[grep("embed", colnames(search_results))],
  columns_w_sample_ID_info = c("title", "journal", "term")
) %>%
  ggplot() +
    geom_label_repel(
      aes(x = Dim.1, y = Dim.2, label = str_wrap(title, width = 35)),
      size = 2, min.segment.length = 0.5, force = 50
    ) +  
    geom_point(aes(x = Dim.1, y = Dim.2, fill = term), shape = 21, size = 5, alpha = 0.7) +
    scale_fill_brewer(palette = "Set1") +
    scale_x_continuous(expand = c(0,1)) +
    scale_y_continuous(expand = c(0,5)) +
    theme_minimal()

We can also use embeddings to examine data that are not full sentences but rather just lists of terms, such as the descriptions of odors in the beer_components dataset:

n <- 31

# randomly select n of the unique odor descriptors from the beer_components dataset
odor <- data.frame(
  sample = seq(1, n, 1),
  odor = dropNA(unique(beer_components$analyte_odor))[sample(1:96, n)]
)

out <- embedText(
  odor, column_name = "odor",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)

runMatrixAnalysis(
  data = out,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(out)[grep("embed", colnames(out))],
  columns_w_sample_ID_info = c("sample", "odor")
) -> pca_out

pca_out$color <- rgb(
  scales::rescale(pca_out$Dim.1, to = c(0, 1)),
  0,
  scales::rescale(pca_out$Dim.2, to = c(0, 1))
)

ggplot(pca_out) +
  geom_label_repel(
    aes(x = Dim.1, y = Dim.2, label = str_wrap(odor, width = 35)),
    size = 2, min.segment.length = 0.5, force = 25
  ) +  
  geom_point(aes(x = Dim.1, y = Dim.2), fill = pca_out$color, shape = 21, size = 3, alpha = 0.7) +
  theme_minimal()

protein embeddings

Autoencoders can be trained to accept various types of inputs, such as text (as shown above), images, audio, videos, sensor data, and sequence-based information like peptides and DNA. Protein language models convert protein sequences into numerical representations that can be used for a variety of downstream tasks, such as structure prediction or function annotation. Protein language models, like their text counterparts, are trained on large datasets of protein sequences to learn meaningful patterns and relationships within the sequence data.

Protein language models offer several advantages over traditional approaches, such as multiple sequence alignments (MSAs). One major disadvantage of MSAs is that they are computationally expensive and become increasingly slow as the number of sequences grows. While language models are also computationally demanding, they are primarily resource-intensive during the training phase, whereas applying a trained language model is much faster. Additionally, protein language models can capture both local and global sequence features, allowing them to identify complex relationships that span across different parts of a sequence. Furthermore, unlike MSAs, which rely on evolutionary information, protein language models can be applied to proteins without homologous sequences, making them suitable for analyzing sequences where little evolutionary data is available. This flexibility broadens the scope of proteins that can be effectively studied using these models.

Beyond the benefits described above, protein language models have an additional, highly important capability: the ability to capture information about connections between elements in their input, even if those elements are very distant from each other in the sequence. This capability is achieved through the use of a model architecture called a transformer, which is a more sophisticated version of an autoencoder. For example, amino acids that are far apart in the primary sequence may be very close in the 3D, folded protein structure. Proximate amino acids in 3D space can play crucial roles in protein stability, enzyme catalysis, or binding interactions, depending on their spatial arrangement and interactions with other residues. Embedding models with transformer architecture can effectively capture these functionally important relationships.

By adding an “attention mechanism” to an autoencoder, we can create a simple form of a transformer. The attention mechanism operates within the encoder and decoder, allowing each element of the input (e.g., an amino acid) to compare itself to every other element and generating attention scores that weigh how much attention one amino acid should pay to another. This helps capture both local and long-range dependencies in protein sequences, enabling the model to focus on important regions regardless of their position in the sequence. Attention is beneficial because it captures interactions between distant amino acids, weighs relationships in ways that reflect protein folding and interactions, adjusts its focus across sequences of varying lengths, captures different types of relationships (such as hydrophobic interactions or secondary structures), and produces contextualized embeddings that reflect the broader sequence environment rather than just local motifs.
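To make attention scores concrete, below is a toy scaled dot-product attention calculation in base R for a hypothetical five-residue sequence. Each residue is represented here by a four-dimensional vector, and the query, key, and value matrices are random stand-ins for what a real transformer learns during training:

set.seed(1)
d <- 4                                      # dimensions per residue representation
Q <- matrix(rnorm(5 * d), nrow = 5)         # queries (one row per residue)
K <- matrix(rnorm(5 * d), nrow = 5)         # keys
V <- matrix(rnorm(5 * d), nrow = 5)         # values

scores <- Q %*% t(K) / sqrt(d)              # how strongly residue i "attends to" residue j
attention <- t(apply(scores, 1, function(x) exp(x) / sum(exp(x))))   # softmax: each row sums to 1
round(attention, 2)                         # the attention weights

output <- attention %*% V                   # a context-aware representation of each residue

Note that the attention weights connect every residue to every other residue, regardless of how far apart they are in the primary sequence, which is what allows transformer-based models to capture the kinds of long-range interactions described above.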

In this section, we will explore how to generate embeddings for protein sequences using a pre-trained protein language model and demonstrate how these embeddings can be used to analyze and visualize protein data effectively. First, we need some data. You can use the OSC_sequences object provided by the source() code, though you can also use the searchNCBI() function to retrieve your own sequences. For example:

searchNCBI(search_term = "oxidosqualene cyclase", retmax = 100)
## AAStringSet object of length 100:
##       width seq                         names               
##   [1]   766 MVANSTGRDASA...SWALQGNGIEKS sp|A0A0E0SP71.1|E...
##   [2]   726 MAAYAPLELPAS...AHRYLKELHAKK sp|D7NJ68.1|ERG7_...
##   [3]   766 MWRLRTGSSTVD...RYKLFGKKNMYI sp|A0A125SXN2.1|L...
##   [4]   756 MWKLKIAEGSPG...ALGEYRQKIFHS sp|A0A125SXN1.1|L...
##   [5]   755 MWSLKIAEGGGP...ALGAYQCKVLGH sp|A0A125SXN3.1|L...
##   ...   ... ...
##  [96]   532 MSISTPIGTSYA...ACTEADVERTDA WP_376415580.1 pr...
##  [97]   532 MRSISRQATPIR...TAALELLQTRRP WP_376406699.1 pr...
##  [98]   431 MNTVRRGAAALA...VGIGFLVSGRKK WP_376404849.1 pr...
##  [99]   402 MNVRRSAAVLAA...FLFTGRKKRQQS WP_376387762.1 pr...
## [100]   419 MSVRRRAAALAI...FLLSGRRKNQQL WP_376368104.1 MU...

Once we have some sequences, we can embed them with the embedAminoAcids() function. An example is below. Note that we need to provide a biolm API key and the amino acid sequences as an AAStringSet object:

embedded_OSCs <- embedAminoAcids(
  amino_acid_stringset = OSC_sequences,
  biolm_api_key = readLines("/Users/bust0037/Documents/Science/Websites/biolm_api_key.txt")
)
embedded_OSCs$product <- tolower(gsub(".*_", "", embedded_OSCs$name))
embedded_OSCs <- select(embedded_OSCs, name, product, everything())

Nice! Once we’ve got the embeddings, we can run a PCA to visualize them in 2D space:

runMatrixAnalysis(
  data = embedded_OSCs,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(embedded_OSCs)[3:dim(embedded_OSCs)[2]],
  columns_w_sample_ID_info = c("name", "product")
) %>%
  ggplot() +
    geom_jitter(
      aes(x = Dim.1, y = Dim.2, fill = product),
      shape = 21, size = 5, height = 2, width = 2, alpha = 0.6
    ) +
    theme_minimal()

further reading

  • creating knowledge graphs with LLMs. This blog post explains how to create knowledge graphs from text using OpenAI functions combined with LangChain and Neo4j. It highlights how large language models (LLMs) have made information extraction more accessible, providing step-by-step instructions for setting up a pipeline to extract structured information and construct a graph from unstructured data.

  • creating RAG systems with LLMs. This article provides a technical overview of implementing complex Retrieval Augmented Generation (RAG) systems, focusing on key concepts like chunking, query augmentation, document hierarchies, and knowledge graphs. It highlights the challenges in data retrieval, multi-hop reasoning, and query planning, while also discussing opportunities to improve RAG infrastructure for more accurate and efficient information extraction.

  • using protein embeddings in biochemical research. This study presents a machine learning pipeline that successfully identifies and characterizes terpene synthases (TPSs), a challenging task due to the limited availability of labeled protein sequences. By combining a curated TPS dataset, advanced structural domain segmentation, and language model techniques, the authors discovered novel TPSs, including the first active enzymes in Archaea, significantly improving the accuracy of substrate prediction across TPS classes.

exercises

  1. Recreate the PubMed search and subsequent analysis described in this chapter using search terms related to research you are involved in or interested in. Use multiple search terms and retrieve publications spanning several years (you may need to set sort = "date"). Embed the titles or abstracts and visualize changes in clustering over time, either with a PCA or by plotting the embeddings against publication date on the x-axis. Discuss how the research trends you observe might reflect broader changes in the scientific community or societal challenges. Below is an example to help you:
search_results_ex <- searchPubMed(
  search_terms = c("oxidosqualene cyclase", "chemotaxonomy", "protein engineering"),
  pubmed_api_key = readLines("/Users/bust0037/Documents/Science/Websites/pubmed_api_key.txt"),
  retmax_per_term = 50,
  sort = "date"
)
## Error encountered. Attempt 1 of 3. Retrying in 5 seconds...

search_results_ex_embed <- embedText(
  search_results_ex, column_name = "abstract",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)

runMatrixAnalysis(
  data = search_results_ex_embed,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(search_results_ex_embed)[grep("embed", colnames(search_results_ex_embed))],
  columns_w_sample_ID_info = c("title", "journal", "term", "date")
) -> search_results_ex_embed_pca

search_results_ex_embed_pca %>%
    ggplot() +
      geom_point(aes(x = Dim.1, y = date, fill = date, shape = term), size = 5, alpha = 0.7) +
    scale_shape_manual(values = c(21, 22, 23)) +
    scale_fill_viridis() +
    scale_x_continuous(expand = c(0,1)) +
    scale_y_continuous(expand = c(0.1,0)) +
    theme_minimal()
  2. Using the hops_components dataset, determine whether there are any major clusters of hops grouped by aroma. To do this, compute embeddings for the hop_aroma column of the dataset, then use a dimensionality reduction (PCA, if you like) to determine whether any clear clusters are present.

  3. Using the OSC_sequences dataset provided