embedding models

To run the analyses in this chapter, you will need five things.

  1. Please ensure that your computer can run the following R script. It may prompt you to install additional R packages.
source("https://thebustalab.github.io/phylochemistry/language_model_analysis.R")
## Loading packages...
## Loading functions...
## Done!!
  2. Please create an account at https://pubmed.ncbi.nlm.nih.gov/ and obtain an API key (Login > Account Settings > API Key Management).
  3. Please create an account at https://huggingface.co and obtain an API key (Login > Settings > Access Tokens; configure your access token/key to “Make calls to the serverless Inference API” and “Make calls to Inference Endpoints”).
  4. Please create an account at https://biolm.ai/ and obtain an API key (Login > Account > API Tokens).
  5. Please create an account at https://build.nvidia.com/ and obtain an API key (you may also need to create an NVIDIA cloud account if prompted). To get the API key, go to https://build.nvidia.com/meta/esm2-650m, switch “input” to python, and click “Get API Key” > Generate Key.

Keep your API keys (long sequences of numbers and letters, like a password) handy for use in these analyses.

In the last chapter, we looked at models that use numerical data to understand the relationships between different aspects of a data set (inferential model use) and models that make predictions based on numerical data (predictive model use). In this chapter, we will explore a set of models called language models that transform non-numerical data (such as written text or protein sequences) into the numerical domain, enabling the non-numerical data to be analyzed using the techniques we have already covered. Language models are algorithms that are trained on large amounts of text (or, in the case of protein language models, many sequences) and can perform a variety of tasks related to their training data. In particular, we will focus on embedding models, which convert language data into numerical data. An embedding is a numerical representation of data that captures its essential features in a lower-dimensional space or in a different domain. In the context of language models, embeddings transform text, such as words or sentences, into vectors of numbers, enabling machine learning models and other statistical methods to process and analyze the data more effectively.

A basic form of an embedding model is a neural network called an autoencoder. Autoencoders consist of two main parts: an encoder and a decoder. The encoder takes the input data and compresses it into a lower-dimensional representation, called an embedding. The decoder then reconstructs the original input from this embedding, and the output from the decoder is compared against the original input. The encoder and decoder are then iteratively optimized to minimize a loss function that measures the difference between the original input and its reconstruction. The result is an embedding model that creates meaningful embeddings capturing the important aspects of the original input.
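To make this concrete, below is a minimal sketch of a linear autoencoder written in base R. It is purely illustrative (the toy data, dimensions, learning rate, and training loop are arbitrary choices, and real embedding models are far larger, non-linear, and transformer-based), but it shows the essential cycle: encode, decode, compare the reconstruction to the input, and update the weights.

set.seed(1)

# toy input data: 50 samples, each with 10 numeric features
X <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

d_in  <- ncol(X) # input dimension (10)
d_emb <- 2       # embedding (bottleneck) dimension

# randomly initialized encoder and decoder weight matrices
W_enc <- matrix(rnorm(d_in * d_emb, sd = 0.1), d_in, d_emb)
W_dec <- matrix(rnorm(d_emb * d_in, sd = 0.1), d_emb, d_in)

learning_rate <- 0.01
for (i in 1:2000) {
  Z     <- X %*% W_enc   # encode: compress each sample to 2 dimensions
  X_hat <- Z %*% W_dec   # decode: reconstruct the original 10 dimensions
  error <- X_hat - X     # reconstruction error (drives the loss)
  # gradient descent on the mean squared reconstruction error
  grad_dec <- t(Z) %*% error / nrow(X)
  grad_enc <- t(X) %*% (error %*% t(W_dec)) / nrow(X)
  W_dec <- W_dec - learning_rate * grad_dec
  W_enc <- W_enc - learning_rate * grad_enc
}

embeddings <- X %*% W_enc # each row is a 2-dimensional embedding of one sample
head(embeddings)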

pre-reading

Please read over the following:

  • Text Embeddings: Comprehensive Guide. In her article, “Text Embeddings: Comprehensive Guide”, Mariya Mansurova explores the evolution, applications, and visualization of text embeddings. Beginning with early methods like Bag of Words and TF-IDF, she traces how embeddings have advanced to capture semantic meaning, highlighting significant milestones such as word2vec and transformer-based models like BERT and Sentence-BERT. Mansurova explains how these embeddings transform text into vectors that computers can analyze for tasks like clustering, classification, and anomaly detection. She provides practical examples using tools like OpenAI’s embedding models and dimensionality reduction techniques, making this article an in-depth resource for both theoretical and hands-on understanding of text embeddings.

  • ESM3: Simulating 500 million years of evolution with a language model. The 2024 blog article “ESM3: Simulating 500 million years of evolution with a language model” by EvolutionaryScale introduces ESM3, a revolutionary language model trained on billions of protein sequences. This article explores how ESM3 marks a major advancement in computational biology by enabling researchers to reason over protein sequences, structures, and functions. With massive datasets and powerful computational resources, ESM3 can generate entirely new proteins, including esmGFP, a green fluorescent protein that differs significantly from known natural variants. The article highlights the model’s potential to transform fields like medicine, synthetic biology, and environmental sustainability by making protein design programmable. Please note the “Open Model” section of the blog, which highlights applications of ESM models in the natural sciences.

text embeddings

Here, we will create text embeddings using publication data from PubMed. Text embeddings are numerical representations of text that preserve important information and allow us to apply mathematical and statistical analyses to textual data. Below, we use a series of functions to obtain titles and abstracts from PubMed, create embeddings for their titles, and analyze them using principal component analysis.

First, we use the searchPubMed function to extract relevant publications from PubMed based on specific search terms. This function interacts with the PubMed website via a tool called an API. An API, or Application Programming Interface, is a set of rules that allows different software programs to communicate with each other. In this case, the API allows our code to access data from the PubMed database directly, without needing to manually search through the website. An API key is a unique identifier that allows you to authenticate yourself when using an API. It acts like a password, giving you permission to access the API services. Here, I am reading my API key from a local file; you can obtain one by signing up for an NCBI account at https://pubmed.ncbi.nlm.nih.gov/. Once you have an API key, pass it to the searchPubMed function along with your search terms. Here I am using “beta-amyrin synthase,” “friedelin synthase,” “Sorghum bicolor,” and “cuticular wax biosynthesis.” I also specify that I want the results sorted by relevance (as opposed to by date) and that only three results per term (the top three most relevant hits) should be returned:

search_results <- searchPubMed(
  search_terms = c("beta-amyrin synthase", "friedelin synthase", "sorghum bicolor", "cuticular wax biosynthesis"),
  pubmed_api_key = readLines("/Users/bust0037/Documents/Science/Websites/pubmed_api_key.txt"),
  retmax_per_term = 3,
  sort = "relevance"
)
colnames(search_results)
## [1] "entry_number" "term"         "date"        
## [4] "journal"      "title"        "doi"         
## [7] "abstract"
select(search_results, term, title)
## # A tibble: 12 × 2
##    term                       title                         
##    <chr>                      <chr>                         
##  1 beta-amyrin synthase       β-Amyrin synthase from Conyza…
##  2 beta-amyrin synthase       Ginsenosides in Panax genus a…
##  3 beta-amyrin synthase       β-Amyrin biosynthesis: cataly…
##  4 friedelin synthase         Friedelin Synthase from Mayte…
##  5 friedelin synthase         Friedelin in Maytenus ilicifo…
##  6 friedelin synthase         Functional characterization o…
##  7 sorghum bicolor            Sorghum (Sorghum bicolor).    
##  8 sorghum bicolor            Molecular Breeding of Sorghum…
##  9 sorghum bicolor            Proton-Coupled Electron Trans…
## 10 cuticular wax biosynthesis Regulatory mechanisms underly…
## 11 cuticular wax biosynthesis Update on Cuticular Wax Biosy…
## 12 cuticular wax biosynthesis Advances in Biosynthesis, Reg…

From the output here, you can see that we’ve retrieved records for various publications, each containing information such as the title, journal, and search term used. This gives us a dataset that we can further analyze to gain insights into the relationships between different research topics.

Next, we use the embedText function to create embeddings for the titles of the extracted publications. Just like PubMed, the Hugging Face API requires an API key, which acts as a unique identifier and grants you access to their services. You can obtain an API key by signing up at https://huggingface.co and following the instructions to generate your own key. Once you have your API key, you will need to specify it when using the embedText function. In the example below, I am reading the key from a local file for convenience.

To set up the embedText function, provide the dataset containing the text you want to embed (in this case, search_results, the output from the PubMed search above), the column with the text (title), and your Hugging Face API key. This function will then generate numerical embeddings for each of the publication titles. By default, the embeddings are generated using a pre-trained embedding language model called ‘BAAI/bge-small-en-v1.5’, available through the Hugging Face API at https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5. This model is designed to create compact, informative numerical representations of text, making it suitable for a wide range of downstream tasks, such as clustering or similarity analysis. If you would like to know more about the model and its capabilities, you can visit the Hugging Face website at https://huggingface.co, where you will find detailed documentation and additional resources.

search_results_embedded <- embedText(
  df = search_results,
  column_name = "title",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)
search_results_embedded[1:3,1:10]
## # A tibble: 3 × 10
##   entry_number term  date       journal title doi   abstract
##          <dbl> <chr> <date>     <chr>   <chr> <chr> <chr>   
## 1            1 beta… 2019-11-20 FEBS o… β-Am… 10.1… Conyza …
## 2            2 beta… 2024-04-03 Acta p… Gins… 10.1… Ginseno…
## 3            3 beta… 2019-12-10 Organi… β-Am… 10.1… The enz…
## # ℹ 3 more variables: embedding_1 <dbl>, embedding_2 <dbl>,
## #   embedding_3 <dbl>

The output of the embedText function is a data frame in which 384 appended columns contain the embedding variables. These embeddings capture the features of each publication title and can be visualized like bar codes:

search_results_embedded %>%
  pivot_longer(
    cols = grep("embed",colnames(search_results_embedded)),
    names_to = "embedding_variable",
    values_to = "value"
  ) %>%
  ggplot() +
    geom_tile(aes(x = embedding_variable, y = factor(entry_number), fill = value)) +
    scale_y_discrete(name = "article") +
    scale_fill_gradient(low = "white", high = "black") +
    theme(
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank()
    )

To examine the relationships between the publication titles, we perform PCA on the text embeddings. We use the runMatrixAnalysis function, specifying PCA as the analysis type and indicating which columns contain the embedding values. We visualize the results using a scatter plot, with each point representing a publication title, colored by the search term it corresponds to. The grep function is used here to find all column names in the search_results_embedded data frame that contain the word ‘embed’; these columns hold the embedding values and are passed as the columns with values for single analytes for the PCA, enabling the visualization below. While we’ve seen lots of PCA plots over the course of our explorations, note that this one is different: it represents relationships between the meanings of text passages (!) as opposed to relationships between samples for which we have made many measurements of numerical attributes.

runMatrixAnalysis(
  data = search_results_embedded,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(search_results_embedded)[grep("embed", colnames(search_results_embedded))],
  columns_w_sample_ID_info = c("title", "journal", "term")
) %>%
  ggplot() +
    geom_label_repel(
      aes(x = Dim.1, y = Dim.2, label = str_wrap(title, width = 35)),
      size = 2, min.segment.length = 0.5, force = 50
    ) +  
    geom_point(aes(x = Dim.1, y = Dim.2, fill = term), shape = 21, size = 5, alpha = 0.7) +
    scale_fill_brewer(palette = "Set1") +
    scale_x_continuous(expand = c(0,1)) +
    scale_y_continuous(expand = c(0,5)) +
    theme_minimal()

We can also use embeddings to examine data that are not full sentences but rather just lists of terms, such as the descriptions of odors in the beer_components dataset:

n <- 31

odor <- data.frame(
  sample = seq(1,n,1),
  odor = dropNA(unique(beer_components$analyte_odor))[sample(1:96, n)]
)

out <- embedText(
  odor, column_name = "odor",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)

runMatrixAnalysis(
  data = out,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(out)[grep("embed", colnames(out))],
  columns_w_sample_ID_info = c("sample", "odor")
) -> pca_out
## Replacing NAs in your data with mean

pca_out$color <- rgb(
  scales::rescale(pca_out$Dim.1, to = c(0, 1)),
  0,
  scales::rescale(pca_out$Dim.2, to = c(0, 1))
)

ggplot(pca_out) +
  geom_label_repel(
    aes(x = Dim.1, y = Dim.2, label = str_wrap(odor, width = 35)),
    size = 2, min.segment.length = 0.5, force = 25
  ) +  
  geom_point(aes(x = Dim.1, y = Dim.2), fill = pca_out$color, shape = 21, size = 3, alpha = 0.7) +
  # scale_x_continuous(expand = c(1,0)) +
  # scale_y_continuous(expand = c(1,0)) +
  theme_minimal()

protein embeddings

Autoencoders can be trained to accept various types of inputs, such as text (as shown above), images, audio, videos, sensor data, and sequence-based information like peptides and DNA. Protein language models convert protein sequences into numerical representations that can be used for a variety of downstream tasks, such as structure prediction or function annotation. Protein language models, like their text counterparts, are trained on large datasets of protein sequences to learn meaningful patterns and relationships within the sequence data.

Protein language models offer several advantages over traditional approaches, such as multiple sequence alignments (MSAs). One major disadvantage of MSAs is that they are computationally expensive and become increasingly slow as the number of sequences grows. While language models are also computationally demanding, they are primarily resource-intensive during the training phase, whereas applying a trained language model is much faster. Additionally, protein language models can capture both local and global sequence features, allowing them to identify complex relationships that span across different parts of a sequence. Furthermore, unlike MSAs, which rely on evolutionary information, protein language models can be applied to proteins without homologous sequences, making them suitable for analyzing sequences where little evolutionary data is available. This flexibility broadens the scope of proteins that can be effectively studied using these models.

Beyond the benefits described above, protein language models have an additional, highly important capability: the ability to capture information about connections between elements in their input, even if those elements are very distant from each other in the sequence. This capability is achieved through the use of a model architecture called a transformer, which is a more sophisticated version of an autoencoder. For example, amino acids that are far apart in the primary sequence may be very close in the 3D, folded protein structure. Proximate amino acids in 3D space can play crucial roles in protein stability, enzyme catalysis, or binding interactions, depending on their spatial arrangement and interactions with other residues. Embedding models with transformer architecture can effectively capture these functionally important relationships.

By adding a mechanism called an “attention mechanism” to an autoencoder, we can create a simple form of a transformer. The attention mechanism works within the encoder and decoder, allowing each element of the input (e.g., an amino acid) to compare itself to every other element, generating attention scores that weigh how much attention one amino acid should give to another. This mechanism helps capture both local and long-range dependencies in protein sequences, enabling the model to focus on important areas regardless of their position in the sequence. Attention is beneficial because it captures interactions between distant amino acids, weighs relationships to account for protein folding and interactions, adjusts focus across sequences of varying lengths, captures different types of relationships like hydrophobic interactions or secondary structures, and provides contextualized embeddings that reflect the broader sequence environment rather than just local motifs. For more on attention mechanisms, check out the further reading section of this chapter.
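To illustrate the idea, here is a minimal sketch of scaled dot-product attention in base R. It is only a toy (the “sequence”, the vector dimensions, and the randomly initialized projection matrices are invented for illustration; real protein language models use learned weights, multiple attention heads, and many stacked layers), but it shows how each position in a sequence produces a weight for every other position and then mixes information accordingly.

set.seed(1)

# toy "sequence": 5 residues, each represented by a 4-dimensional vector
X <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)

d_k <- 4 # dimension of the query/key/value vectors

# projection matrices (learned during training in a real model)
W_q <- matrix(rnorm(4 * d_k, sd = 0.5), 4, d_k)
W_k <- matrix(rnorm(4 * d_k, sd = 0.5), 4, d_k)
W_v <- matrix(rnorm(4 * d_k, sd = 0.5), 4, d_k)

Q <- X %*% W_q # queries
K <- X %*% W_k # keys
V <- X %*% W_v # values

# attention scores: how much each position should attend to every other position
scores <- Q %*% t(K) / sqrt(d_k)

# softmax each row so that the weights for each position sum to one
weights <- exp(scores) / rowSums(exp(scores))

# contextualized output: each position becomes a weighted mix of all positions
output <- weights %*% V

round(weights, 2) # rows: attending position; columns: attended-to position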

In this section, we will explore how to generate embeddings for protein sequences using a pre-trained protein language model and demonstrate how these embeddings can be used to analyze and visualize protein data effectively. First, we need some data. You can use the OSC_sequences object provided by the source() code, though you can also use the searchNCBI() function to retrieve your own sequences. For example:

ncbi_results <- searchNCBI(search_term = "oxidosqualene cyclase", retmax = 100)
ncbi_results
## AAStringSet object of length 100:
##       width seq                         names               
##   [1]   427 MRLLAQLTDDPW...VAALHLACVVSR WP_396422334.1 pr...
##   [2]   323 MQKLMIAAVLGA...SGGPAGAPQLTC WP_396420191.1 pr...
##   [3]   541 MRLAPMTAGLPR...PLATAPLTAASP WP_396323561.1 pr...
##   [4]   431 MLTAARLGAAAL...VLSIQRKRGPKP WP_396319534.1 pr...
##   [5]   533 MTTGEIEMAGTG...LALTGFDNDETP WP_396315749.1 pr...
##   ...   ... ...
##  [96]   414 MNVRRSAAALAA...IMLSGRRKKNQL WP_390546632.1 pr...
##  [97]   415 MNVRRSAAALAA...IMLSGRRKKNQL WP_390541893.1 pr...
##  [98]   929 MSAALLTFGASA...ARTRRDPAEEDR WP_390539473.1 pr...
##  [99]   436 MNTVRRGAAALA...VGIGFLVSGRKK WP_390528608.1 pr...
## [100]   994 MGTAELAERRTG...ARTRRNPAEEDR WP_390523159.1 pr...
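Before embedding, it can be useful to check how long your sequences are, since the esm2 model used below truncates sequences longer than 1022 amino acids (a limit noted again in the next paragraph). A quick check, assuming your sequences are stored in an AAStringSet (such as the ncbi_results object above or the OSC_sequences object) and using the width() and names() accessors from the Biostrings package:

# flag sequences that exceed the esm2 length limit of 1022 amino acids
too_long <- width(ncbi_results) > 1022
sum(too_long)                 # how many sequences would be truncated
names(ncbi_results)[too_long] # which sequences they are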

Once you have some sequences, we can embed them with the function embedAminoAcids(). An example is below. Note that we need to provide either a BioLM API key or an NVIDIA API key and specify which platform we wish to use. We also need to provide the amino acid sequences as an AAStringSet object. If you use the NVIDIA platform, the esm2-650m model will be used (note: esm2 truncates sequences longer than 1022 amino acids). If you use BioLM, you can choose from a number of models.

embedded_OSCs <- embedAminoAcids(
  amino_acid_stringset = OSC_sequences,
  biolm_api_key = readLines("/Users/bust0037/Documents/Science/Websites/biolm_api_key.txt"),
  nvidia_api_key = readLines("/Users/bust0037/Documents/Science/Websites/nvidia_api_key.txt"),
  platform = "nvidia"
)
embedded_OSCs$product <- tolower(gsub(".*_", "", embedded_OSCs$name))
embedded_OSCs <- select(embedded_OSCs, name, product, everything())
embedded_OSCs[1:3,1:4]
## # A tibble: 3 × 4
##   name                   product     embedding_1 embedding_2
##   <chr>                  <chr>             <dbl>       <dbl>
## 1 ABK76265.1_beta-amyrin beta-amyrin     0.00905 -0.00000746
## 2 ABL07607.1_beta-amyrin beta-amyrin     0.00468  0.00122   
## 3 ABY90140.2_beta-amyrin beta-amyrin     0.0186  -0.00662

Nice! Once we’ve got the embeddings, we can run a PCA to visualize them in 2D space:

runMatrixAnalysis(
  data = embedded_OSCs,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(embedded_OSCs)[3:dim(embedded_OSCs)[2]],
  columns_w_sample_ID_info = c("name", "product")
) %>%
  ggplot() +
    geom_jitter(
      aes(x = Dim.1, y = Dim.2, fill = product),
      shape = 21, size = 5, height = 2, width = 2, alpha = 0.6
    ) +
    theme_minimal()

further reading

  • creating knowledge graphs with LLMs. This blog post explains how to create knowledge graphs from text using OpenAI functions combined with LangChain and Neo4j. It highlights how large language models (LLMs) have made information extraction more accessible, providing step-by-step instructions for setting up a pipeline to extract structured information and construct a graph from unstructured data.

  • creating RAG systems with LLMs. This article provides a technical overview of implementing complex Retrieval Augmented Generation (RAG) systems, focusing on key concepts like chunking, query augmentation, document hierarchies, and knowledge graphs. It highlights the challenges in data retrieval, multi-hop reasoning, and query planning, while also discussing opportunities to improve RAG infrastructure for more accurate and efficient information extraction.

  • using protein embeddings in biochemical research. This study presents a machine learning pipeline that successfully identifies and characterizes terpene synthases (TPSs), a challenging task due to the limited availability of labeled protein sequences. By combining a curated TPS dataset, advanced structural domain segmentation, and language model techniques, the authors discovered novel TPSs, including the first active enzymes in Archaea, significantly improving the accuracy of substrate prediction across TPS classes.

  • attention mechanisms and transformers explained. This Financial Times article explains the development and workings of large language models (LLMs), emphasizing their foundation on the transformer model created by Google researchers in 2017. These models use self-attention mechanisms to understand context, allowing them to respond to subtle relationships between elements in their input, even if those elements are far from one another in the linear input sequence.

  • other types of protein language models. A sampling of related models, grouped by task:
    3D protein structure prediction: deepmind / alphafold2-multimer predicts the 3D structure of protein complexes from amino acid sequences; deepmind / alphafold2 predicts the 3D structure of single proteins from amino acid sequences; meta / esmfold predicts the 3D structure of proteins based on amino acid sequences.
    Protein embedding generation: meta / esm2-650m generates protein embeddings from amino acid sequences.
    Protein sequence design: ipd / proteinmpnn predicts amino acid sequences for given protein backbone structures.
    Generative protein design: ipd / rfdiffusion is a generative model for designing protein backbones, particularly for protein binder design.
    Molecule-protein interaction prediction: mit / diffdock predicts the 3D interactions between molecules and proteins (docking simulations).

exercises

  1. Recreate the PubMed search and subsequent analysis described in this chapter using search terms that relate to research you are involved in or are interested in. Use multiple search terms and retrieve publications over a period of several years (you may need to set sort = “date”). Embed the titles and visualize the changes in clustering over time using PCA or an x-axis that is the date. Discuss how research trends might evolve and reflect broader changes in the scientific community or societal challenges. Below is an example to help you:
search_results_ex <- searchPubMed(
  search_terms = c("oxidosqualene cyclase", "chemotaxonomy", "protein engineering"),
  pubmed_api_key = readLines("/Users/bust0037/Documents/Science/Websites/pubmed_api_key.txt"),
  retmax_per_term = 50,
  sort = "date"
)

search_results_ex_embed <- embedText(
  search_results_ex, column_name = "abstract",
  hf_api_key = readLines("/Users/bust0037/Documents/Science/Websites/hf_api_key.txt")
)

runMatrixAnalysis(
  data = search_results_ex_embed,
  analysis = "pca",
  columns_w_values_for_single_analyte = colnames(search_results_ex_embed)[grep("embed", colnames(search_results_ex_embed))],
  columns_w_sample_ID_info = c("title", "journal", "term", "date")
) -> search_results_ex_embed_pca

search_results_ex_embed_pca %>%
    ggplot() +
      geom_point(aes(x = Dim.1, y = date, fill = date, shape = term), size = 5, alpha = 0.7) +
    scale_shape_manual(values = c(21, 22, 23)) +
    scale_fill_viridis() +
    scale_x_continuous(expand = c(0,1)) +
    scale_y_continuous(expand = c(0.1,0)) +
    theme_minimal()
  2. Using the hops_components dataset, determine whether there are any major clusters of hops that are grouped by aroma. To do this, compute embeddings for the hop_aroma column of the dataset, then use a dimensionality reduction technique (PCA, if you like) to determine whether any clear clusters are present.

  3. Generate and visualize a set of protein embeddings. You can use the OSC_sequences dataset provided by the source() command, or you can create your own protein sequence dataset using the searchNCBI() function.