embedding models

To run the analyses in this chapter, you will need five things.

  1. Please ensure that your computer can run the following R script. It may prompt you to install additional R packages.
source("https://thebustalab.github.io/phylochemistry/language_model_analysis.R")
## Loading packages...
## Loading functions...
## Done!!
  2. Please create an account at https://pubmed.ncbi.nlm.nih.gov/ and obtain an API key (Login > Account Settings > API Key Management).
  3. Please create an account at https://huggingface.co and obtain an API key (Login > Settings > Access Tokens; configure your access token/key to “Make calls to the serverless Inference API” and “Make calls to Inference Endpoints”).
  4. Please create an account at https://biolm.ai/ and obtain an API key (Login > Account > API Tokens).
  5. Please create an account at https://build.nvidia.com/ and obtain an API key (you may also need to create an NVIDIA cloud account if prompted). To get the API key, go to https://build.nvidia.com/meta/esm2-650m, switch “input” to python, and click “Get API Key” > Generate Key.

Keep your API keys (long sequences of numbers and letters, like a password) handy for use in these analyses.

In the last chapter, we looked at models that use numerical data to understand the relationships between different aspects of a data set (inferential model use) and models that make predictions based on numerical data (predictive model use). In this chapter, we will explore a set of models called language models that transform non-numerical data (such as written text or protein sequences) into the numerical domain, enabling the non-numerical data to be analyzed using the techniques we have already covered. Language models are algorithms that are trained on large amounts of text (or, in the case of protein language models, many sequences) and can perform a variety of tasks related to their training data. In particular, we will focus on embedding models, which convert language data into numerical data. An embedding is a numerical representation of data that captures its essential features in a lower-dimensional space or in a different domain. In the context of language models, embeddings transform text, such as words or sentences, into vectors of numbers, enabling machine learning models and other statistical methods to process and analyze the data more effectively.

A basic form of an embedding model is a neural network called an autoencoder. Autoencoders consist of two main parts: an encoder and a decoder. The encoder takes the input data and compresses it into a lower-dimensional representation, called an embedding. The decoder then reconstructs the original input from this embedding, and the output from the decoder is compared against the original input. The encoder and decoder are then iteratively optimized to minimize a loss function that measures the difference between the original input and its reconstruction. The result is an embedding model that creates meaningful embeddings capturing the important aspects of the original input.
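To make this concrete, below is a minimal sketch of a linear autoencoder written in base R. It is purely illustrative (the toy data, dimensions, learning rate, and training loop are arbitrary choices, and real embedding models are far larger, non-linear, and transformer-based), but it shows the essential cycle: encode, decode, compare the reconstruction to the input, and update the weights.

set.seed(1)

# toy input data: 50 samples, each with 10 numeric features
X <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

d_in  <- ncol(X) # input dimension (10)
d_emb <- 2       # embedding (bottleneck) dimension

# randomly initialized encoder and decoder weight matrices
W_enc <- matrix(rnorm(d_in * d_emb, sd = 0.1), d_in, d_emb)
W_dec <- matrix(rnorm(d_emb * d_in, sd = 0.1), d_emb, d_in)

learning_rate <- 0.01
for (i in 1:2000) {
  Z     <- X %*% W_enc   # encode: compress each sample to 2 dimensions
  X_hat <- Z %*% W_dec   # decode: reconstruct the original 10 dimensions
  error <- X_hat - X     # reconstruction error (drives the loss)
  # gradient descent on the mean squared reconstruction error
  grad_dec <- t(Z) %*% error / nrow(X)
  grad_enc <- t(X) %*% (error %*% t(W_dec)) / nrow(X)
  W_dec <- W_dec - learning_rate * grad_dec
  W_enc <- W_enc - learning_rate * grad_enc
}

embeddings <- X %*% W_enc # each row is a 2-dimensional embedding of one sample
head(embeddings)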

pre-reading

Please read over the following:

  • Text Embeddings: Comprehensive Guide. In her article, “Text Embeddings: Comprehensive Guide”, Mariya Mansurova explores the evolution, applications, and visualization of text embeddings. Beginning with early methods like Bag of Words and TF-IDF, she traces how embeddings have advanced to capture semantic meaning, highlighting significant milestones such as word2vec and transformer-based models like BERT and Sentence-BERT. Mansurova explains how these embeddings transform text into vectors that computers can analyze for tasks like clustering, classification, and anomaly detection. She provides practical examples using tools like OpenAI’s embedding models and dimensionality reduction techniques, making this article an in-depth resource for both theoretical and hands-on understanding of text embeddings.

  • ESM3: Simulating 500 million years of evolution with a language model. The 2024 blog article “ESM3: Simulating 500 million years of evolution with a language model” by EvolutionaryScale introduces ESM3, a revolutionary language model trained on billions of protein sequences. This article explores how ESM3 marks a major advancement in computational biology by enabling researchers to reason over protein sequences, structures, and functions. With massive datasets and powerful computational resources, ESM3 can generate entirely new proteins, including esmGFP, a green fluorescent protein that differs significantly from known natural variants. The article highlights the model’s potential to transform fields like medicine, synthetic biology, and environmental sustainability by making protein design programmable. Please note the “Open Model” section of the blog, which highlights applications of ESM models in the natural sciences.

text embeddings

Here, we will create text embeddings using publication data from PubMed. Text embeddings are numerical representations of text that preserve important information and allow us to apply mathematical and statistical analyses to textual data. Below, we use a series of functions to obtain titles and abstracts from PubMed, create embeddings for their titles, and analyze them using principal component analysis.

First, we use the searchPubMed function to extract relevant publications from PubMed based on specific search terms. This function interacts with the PubMed website via a tool called an API. An API, or Application Programming Interface, is a set of rules that allows different software programs to communicate with each other. In this case, the API allows our code to access data from the PubMed database directly, without needing to manually search through the website. An API key is a unique identifier that allows you to authenticate yourself when using an API. It acts like a password, giving you permission to access the API services. Here, I am reading my API key from a local file; you can obtain one by signing up for an NCBI account at https://pubmed.ncbi.nlm.nih.gov/. Once you have an API key, pass it to the searchPubMed function along with your search terms. Here I am using “beta-amyrin synthase,” “friedelin synthase,” “Sorghum bicolor,” and “cuticular wax biosynthesis.” I also specify that I want the results sorted by relevance (as opposed to by date) and that only three results per term (the top three most relevant hits) should be returned:

search_results <- searchPubMed(
  search_terms = c("beta-amyrin synthase", "friedelin synthase", "sorghum bicolor", "cuticular wax biosynthesis"),
  pubmed_api_key = readLines("/Users/bust0037/Documents/Science/Websites/pubmed_api_key.txt"),
  retmax_per_term = 3,
  sort = "relevance"
)
colnames(search_results)
## [1] "entry_number" "term"         "date"        
## [4] "journal"      "title"        "doi"         
## [7] "abstract"
select(search_results, term, title)
## # A tibble: 12 × 2
##    term                       title                         
##    <chr>                      <chr>                         
##  1 beta-amyrin synthase       β-Amyrin synthase from Conyza…
##  2 beta-amyrin synthase       Ginsenosides in Panax genus a…
##  3 beta-amyrin synthase       β-Amyrin biosynthesis: cataly…
##  4 friedelin synthase         Friedelin Synthase from Mayte…
##  5 friedelin synthase         Friedelin in Maytenus ilicifo…
##  6 friedelin synthase         Functional characterization o…
##  7 sorghum bicolor            Sorghum (Sorghum bicolor).    
##  8 sorghum bicolor            Molecular Breeding of Sorghum…
##  9 sorghum bicolor            Proton-Coupled Electron Trans…
## 10 cuticular wax biosynthesis Regulatory mechanisms underly…
## 11 cuticular wax biosynthesis Update on Cuticular Wax Biosy…
## 12 cuticular wax biosynthesis Advances in Biosynthesis, Reg…

From the output here, you can see that we’ve retrieved records for various publications, each containing information such as the title, journal, and search term used. This gives us a dataset that we can further analyze to gain insights into the relationships between different research topics.

Next, we use the embedText function to create embeddings for the titles of the extracted publications. Just like PubMed, the Hugging Face API requires an API key, which acts as a unique identifier and grants you access to their services. You can obtain an API key by signing up at https://huggingface.co and following the instructions to generate your own key. Once you have your API key, you will need to specify it when using the embedText function. In the example below, I am reading the key from a local file for convenience.

To set up the embedText function, provide the dataset containing the text you want to embed (in this case, search_results, the output from the PubMed search above), the column with the text (title), and your Hugging Face API key. This function will then generate numerical embeddings for each of the publication titles. By default, the embeddings are generated using a pre-trained embedding language model called ‘BAAI/bge-small-en-v1.5’, available through the Hugging Face API at https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5. This model is designed to create compact, informative numerical representations of text, making it suitable for a wide range of downstream tasks, such as clustering or similarity analysis. If you would like to know more about the model and its capabilities, you can visit the Hugging Face website at https://huggingface.co, where you will find detailed documentation and additional resources.

search_results_embedded <- embedText(
  df = search_results,
  column_name = "title",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)
search_results_embedded[1:3,1:10]
## # A tibble: 3 × 10
##   entry_number term  date       journal title doi   abstract
##          <dbl> <chr> <date>     <chr>   <chr> <chr> <chr>   
## 1            1 beta… 2019-11-20 FEBS o… β-Am… 10.1… Conyza …
## 2            2 beta… 2024-04-03 Acta p… Gins… 10.1… Ginseno…
## 3            3 beta… 2019-12-10 Organi… β-Am… 10.1… The enz…
## # ℹ 3 more variables: embedding_1 <dbl>, embedding_2 <dbl>,
## #   embedding_3 <dbl>

The output of the embedText function is a data frame in which 384 appended columns contain the embedding variables. These embeddings capture the features of each publication title and can be visualized like bar codes:

search_results_embedded %>%
  pivot_longer(
    cols = grep("embed",colnames(search_results_embedded)),
    names_to = "embedding_variable",
    values_to = "value"
  ) %>%
  ggplot() +
    geom_tile(aes(x = embedding_variable, y = factor(entry_number), fill = value)) +
    scale_y_discrete(name = "article") +
    scale_fill_gradient(low = "white", high = "black") +
    theme(
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank()
    )

To examine the relationships between the publication titles, we perform PCA on the text embeddings. We use the runMatrixAnalysis function, specifying PCA as the analysis type and indicating which columns contain the embedding values. We visualize the results using a scatter plot, with each point representing a publication title, colored by the search term it corresponds to. The grep function is used here to find all column names in the search_results_embedded data frame that contain the word ‘embed’; these columns hold the embedding values and are passed as the columns with values for single analytes for the PCA, enabling the visualization below. While we’ve seen lots of PCA plots over the course of our explorations, note that this one is different: it represents relationships between the meanings of text passages (!) as opposed to relationships between samples for which we have made many measurements of numerical attributes.

runMatrixAnalysis(
  data = search_results_embedded,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(search_results_embedded)[grep("embed", colnames(search_results_embedded))],
  columns_w_sample_ID_info = c("title", "journal", "term")
) %>%
  ggplot() +
    geom_label_repel(
      aes(x = Dim.1, y = Dim.2, label = str_wrap(title, width = 35)),
      size = 2, min.segment.length = 0.5, force = 50
    ) +  
    geom_point(aes(x = Dim.1, y = Dim.2, fill = term), shape = 21, size = 5, alpha = 0.7) +
    scale_fill_brewer(palette = "Set1") +
    scale_x_continuous(expand = c(0,1)) +
    scale_y_continuous(expand = c(0,5)) +
    theme_minimal()

We can also use embeddings to examine data that are not full sentences but rather just lists of terms, such as the descriptions of odors in the beer_components dataset:

n <- 31

odor <- data.frame(
  sample = seq(1,n,1),
  odor = dropNA(unique(beer_components$analyte_odor))[sample(1:96, n)]
)

out <- embedText(
  odor, column_name = "odor",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)

runMatrixAnalysis(
  data = out,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(out)[grep("embed", colnames(out))],
  columns_w_sample_ID_info = c("sample", "odor")
) -> pca_out
## Replacing NAs in your data with mean

pca_out$color <- rgb(
  scales::rescale(pca_out$Dim.1, to = c(0, 1)),
  0,
  scales::rescale(pca_out$Dim.2, to = c(0, 1))
)

ggplot(pca_out) +
  geom_label_repel(
    aes(x = Dim.1, y = Dim.2, label = str_wrap(odor, width = 35)),
    size = 2, min.segment.length = 0.5, force = 25
  ) +  
  geom_point(aes(x = Dim.1, y = Dim.2), fill = pca_out$color, shape = 21, size = 3, alpha = 0.7) +
  # scale_x_continuous(expand = c(1,0)) +
  # scale_y_continuous(expand = c(1,0)) +
  theme_minimal()

protein embeddings

Autoencoders can be trained to accept various types of inputs, such as text (as shown above), images, audio, videos, sensor data, and sequence-based information like peptides and DNA. Protein language models convert protein sequences into numerical representations that can be used for a variety of downstream tasks, such as structure prediction or function annotation. Protein language models, like their text counterparts, are trained on large datasets of protein sequences to learn meaningful patterns and relationships within the sequence data.

Protein language models offer several advantages over traditional approaches, such as multiple sequence alignments (MSAs). One major disadvantage of MSAs is that they are computationally expensive and become increasingly slow as the number of sequences grows. While language models are also computationally demanding, they are primarily resource-intensive during the training phase, whereas applying a trained language model is much faster. Additionally, protein language models can capture both local and global sequence features, allowing them to identify complex relationships that span across different parts of a sequence. Furthermore, unlike MSAs, which rely on evolutionary information, protein language models can be applied to proteins without homologous sequences, making them suitable for analyzing sequences where little evolutionary data is available. This flexibility broadens the scope of proteins that can be effectively studied using these models.

Beyond the benefits described above, protein language models have an additional, highly important capability: the ability to capture information about connections between elements in their input, even if those elements are very distant from each other in the sequence. This capability is achieved through the use of a model architecture called a transformer, which is a more sophisticated version of an autoencoder. For example, amino acids that are far apart in the primary sequence may be very close in the 3D, folded protein structure. Proximate amino acids in 3D space can play crucial roles in protein stability, enzyme catalysis, or binding interactions, depending on their spatial arrangement and interactions with other residues. Embedding models with transformer architecture can effectively capture these functionally important relationships.

By adding a mechanism called an “attention mechanism” to an autoencoder, we can create a simple form of a transformer. The attention mechanism works within the encoder and decoder, allowing each element of the input (e.g., an amino acid) to compare itself to every other element, generating attention scores that weigh how much attention one amino acid should give to another. This mechanism helps capture both local and long-range dependencies in protein sequences, enabling the model to focus on important areas regardless of their position in the sequence. Attention is beneficial because it captures interactions between distant amino acids, weighs relationships to account for protein folding and interactions, adjusts focus across sequences of varying lengths, captures different types of relationships like hydrophobic interactions or secondary structures, and provides contextualized embeddings that reflect the broader sequence environment rather than just local motifs. For more on attention mechanisms, check out the further reading section of this chapter.
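To illustrate the idea, here is a minimal sketch of scaled dot-product attention in base R. It is only a toy (the “sequence”, the vector dimensions, and the randomly initialized projection matrices are invented for illustration; real protein language models use learned weights, multiple attention heads, and many stacked layers), but it shows how each position in a sequence produces a weight for every other position and then mixes information accordingly.

set.seed(1)

# toy "sequence": 5 residues, each represented by a 4-dimensional vector
X <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)

d_k <- 4 # dimension of the query/key/value vectors

# projection matrices (learned during training in a real model)
W_q <- matrix(rnorm(4 * d_k, sd = 0.5), 4, d_k)
W_k <- matrix(rnorm(4 * d_k, sd = 0.5), 4, d_k)
W_v <- matrix(rnorm(4 * d_k, sd = 0.5), 4, d_k)

Q <- X %*% W_q # queries
K <- X %*% W_k # keys
V <- X %*% W_v # values

# attention scores: how much each position should attend to every other position
scores <- Q %*% t(K) / sqrt(d_k)

# softmax each row so that the weights for each position sum to one
weights <- exp(scores) / rowSums(exp(scores))

# contextualized output: each position becomes a weighted mix of all positions
output <- weights %*% V

round(weights, 2) # rows: attending position; columns: attended-to position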

In this section, we will explore how to generate embeddings for protein sequences using a pre-trained protein language model and demonstrate how these embeddings can be used to analyze and visualize protein data effectively. First, we need some data. You can use the OSC_sequences object provided by the source() code, though you can also use the searchNCBI() function to retrieve your own sequences. For example:

ncbi_results <- searchNCBI(search_term = "oxidosqualene cyclase", retmax = 100)
ncbi_results
## AAStringSet object of length 100:
##       width seq                         names               
##   [1]   427 MRLLAQLTDDPW...VAALHLACVVSR WP_396422334.1 pr...
##   [2]   323 MQKLMIAAVLGA...SGGPAGAPQLTC WP_396420191.1 pr...
##   [3]   541 MRLAPMTAGLPR...PLATAPLTAASP WP_396323561.1 pr...
##   [4]   431 MLTAARLGAAAL...VLSIQRKRGPKP WP_396319534.1 pr...
##   [5]   533 MTTGEIEMAGTG...LALTGFDNDETP WP_396315749.1 pr...
##   ...   ... ...
##  [96]   414 MNVRRSAAALAA...IMLSGRRKKNQL WP_390546632.1 pr...
##  [97]   415 MNVRRSAAALAA...IMLSGRRKKNQL WP_390541893.1 pr...
##  [98]   929 MSAALLTFGASA...ARTRRDPAEEDR WP_390539473.1 pr...
##  [99]   436 MNTVRRGAAALA...VGIGFLVSGRKK WP_390528608.1 pr...
## [100]   994 MGTAELAERRTG...ARTRRNPAEEDR WP_390523159.1 pr...
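Before embedding, it can be useful to check how long your sequences are, since the esm2 model used below truncates sequences longer than 1022 amino acids (a limit noted again in the next paragraph). A quick check, assuming your sequences are stored in an AAStringSet (such as the ncbi_results object above or the OSC_sequences object) and using the width() and names() accessors from the Biostrings package:

# flag sequences that exceed the esm2 length limit of 1022 amino acids
too_long <- width(ncbi_results) > 1022
sum(too_long)                 # how many sequences would be truncated
names(ncbi_results)[too_long] # which sequences they are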

Once you have some sequences, we can embed them with the function embedAminoAcids(). An example is below. Note that we need to provide either a BioLM API key or an NVIDIA API key and specify which platform we wish to use. We also need to provide the amino acid sequences as an AAStringSet object. If you use the NVIDIA platform, the esm2-650m model will be used (note: esm2 truncates sequences longer than 1022 amino acids). If you use BioLM, you can choose from a number of models.

embedded_OSCs <- embedAminoAcids(
  amino_acid_stringset = OSC_sequences,
  biolm_api_key = readLines("/Users/bust0037/Documents/Science/Websites/biolm_api_key.txt"),
  nvidia_api_key = readLines("/Users/bust0037/Documents/Science/Websites/nvidia_api_key.txt"),
  platform = "nvidia"
)
embedded_OSCs$product <- tolower(gsub(".*_", "", embedded_OSCs$name))
embedded_OSCs <- select(embedded_OSCs, name, product, everything())
embedded_OSCs[1:3,1:4]
## # A tibble: 3 × 4
##   name                   product     embedding_1 embedding_2
##   <chr>                  <chr>             <dbl>       <dbl>
## 1 ABK76265.1_beta-amyrin beta-amyrin     0.00905 -0.00000746
## 2 ABL07607.1_beta-amyrin beta-amyrin     0.00468  0.00122   
## 3 ABY90140.2_beta-amyrin beta-amyrin     0.0186  -0.00662

Nice! Once we’ve got the embeddings, we can run a PCA to visualize them in 2D space:

runMatrixAnalysis(
  data = embedded_OSCs,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(embedded_OSCs)[3:dim(embedded_OSCs)[2]],
  columns_w_sample_ID_info = c("name", "product")
) %>%
  ggplot() +
    geom_jitter(
      aes(x = Dim.1, y = Dim.2, fill = product),
      shape = 21, size = 5, height = 2, width = 2, alpha = 0.6
    ) +
    theme_minimal()

further reading

  • creating knowledge graphs with LLMs. This blog post explains how to create knowledge graphs from text using OpenAI functions combined with LangChain and Neo4j. It highlights how large language models (LLMs) have made information extraction more accessible, providing step-by-step instructions for setting up a pipeline to extract structured information and construct a graph from unstructured data.

  • creating RAG systems with LLMs. This article provides a technical overview of implementing complex Retrieval Augmented Generation (RAG) systems, focusing on key concepts like chunking, query augmentation, document hierarchies, and knowledge graphs. It highlights the challenges in data retrieval, multi-hop reasoning, and query planning, while also discussing opportunities to improve RAG infrastructure for more accurate and efficient information extraction.

  • using protein embeddings in biochemical research. This study presents a machine learning pipeline that successfully identifies and characterizes terpene synthases (TPSs), a challenging task due to the limited availability of labeled protein sequences. By combining a curated TPS dataset, advanced structural domain segmentation, and language model techniques, the authors discovered novel TPSs, including the first active enzymes in Archaea, significantly improving the accuracy of substrate prediction across TPS classes.

  • attention mechanisms and transformers explained. This Financial Times article explains the development and workings of large language models (LLMs), emphasizing their foundation on the transformer model created by Google researchers in 2017. These models use self-attention mechanisms to understand context, allowing them to respond to subtle relationships between elements in their input, even if those elements are far from one another in the linear input sequence.

  • other types of protein language models. A sampling of related models, grouped by task:
    3D protein structure prediction: deepmind / alphafold2-multimer predicts the 3D structure of protein complexes from amino acid sequences; deepmind / alphafold2 predicts the 3D structure of single proteins from amino acid sequences; meta / esmfold predicts the 3D structure of proteins based on amino acid sequences.
    Protein embedding generation: meta / esm2-650m generates protein embeddings from amino acid sequences.
    Protein sequence design: ipd / proteinmpnn predicts amino acid sequences for given protein backbone structures.
    Generative protein design: ipd / rfdiffusion is a generative model for designing protein backbones, particularly for protein binder design.
    Molecule-protein interaction prediction: mit / diffdock predicts the 3D interactions between molecules and proteins (docking simulations).

exercises

  1. Recreate the PubMed search and subsequent analysis described in this chapter using search terms that relate to research you are involved in or are interested in. Use multiple search terms and retrieve publications over a period of several years (you may need to set sort = “date”). Embed the titles and visualize the changes in clustering over time using PCA or an x-axis that is the date. Discuss how research trends might evolve and reflect broader changes in the scientific community or societal challenges. Below is an example to help you:
search_results_ex <- searchPubMed(
  search_terms = c("oxidosqualene cyclase", "chemotaxonomy", "protein engineering"),
  pubmed_api_key = readLines("/Users/bust0037/Documents/Science/Websites/pubmed_api_key.txt"),
  retmax_per_term = 50,
  sort = "date"
)

search_results_ex_embed <- embedText(
  search_results_ex, column_name = "abstract",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)

runMatrixAnalysis(
  data = search_results_ex_embed,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(search_results_ex_embed)[grep("embed", colnames(search_results_ex_embed))],
  columns_w_sample_ID_info = c("title", "journal", "term", "date")
) -> search_results_ex_embed_pca

search_results_ex_embed_pca %>%
    ggplot() +
      geom_point(aes(x = Dim.1, y = date, fill = date, shape = term), size = 5, alpha = 0.7) +
    scale_shape_manual(values = c(21, 22, 23)) +
    scale_fill_viridis() +
    scale_x_continuous(expand = c(0,1)) +
    scale_y_continuous(expand = c(0.1,0)) +
    theme_minimal()
  2. Using the hops_components dataset, determine whether there are any major clusters of hops that are grouped by aroma. To do this, compute embeddings for the hop_aroma column of the dataset, then use a dimensionality reduction technique (PCA, if you like) to determine whether any clear clusters are present.

  3. Generate and visualize a set of protein embeddings. You can use the OSC_sequences dataset provided by the source() command, or you can create your own protein sequence dataset using the searchNCBI() function.