dimensional reduction
![](https://thebustalab.github.io/integrated_bioanalytics/images/dimensionality.png)
In the previous chapters, we looked at how to explore our data sets by visualizing many variables and manually identifying trends. Sometimes, we encounter data sets with so many variables, that it is not reasonable to manually select certain variables with which to create plots and manually search for trends. In these cases, we need dimensionality reduction - a set of techniques that helps us identify which variables are driving differences among our samples. In this course, we will conduct dimensionality reduction useing runMatrixAnalysis()
, a function that is loaded into your R Session when you run the source() command.
Matrix analyses can be a bit tricky to set up. There are two things that we can do to help us with this: (i) we will use a template for runMatrixAnalysis()
(see below) and (ii) it is critical that we think about our data in terms of samples and analytes. Let’s consider our Alaska lakes data set:
alaska_lake_data
## # A tibble: 220 × 7
## lake park water_temp pH element mg_per_L
## <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Devil_Mountain_L… BELA 6.46 7.69 C 3.4
## 2 Devil_Mountain_L… BELA 6.46 7.69 N 0.028
## 3 Devil_Mountain_L… BELA 6.46 7.69 P 0
## 4 Devil_Mountain_L… BELA 6.46 7.69 Cl 10.4
## 5 Devil_Mountain_L… BELA 6.46 7.69 S 0.62
## 6 Devil_Mountain_L… BELA 6.46 7.69 F 0.04
## 7 Devil_Mountain_L… BELA 6.46 7.69 Br 0.02
## 8 Devil_Mountain_L… BELA 6.46 7.69 Na 8.92
## 9 Devil_Mountain_L… BELA 6.46 7.69 K 1.2
## 10 Devil_Mountain_L… BELA 6.46 7.69 Ca 5.73
## # ℹ 210 more rows
## # ℹ 1 more variable: element_type <chr>
We can see that this dataset is comprised of measurements of various analytes (i.e. several chemical elements, as well as water_temp, and pH), in different samples (i.e. lakes). We need to tell the runMatrixAnalysis()
function how each column relates to this samples and analytes structure. See the image below for an explanation.
![](https://thebustalab.github.io/integrated_bioanalytics/images/runMatrixAnalysis1.png)
pca
“Which analytes are driving differences among my samples?” “Which analytes in my data set are correlated?”
theory
PCA looks at all the variance in a high dimensional data set and chooses new axes within that data set that align with the directions containing highest variance. These new axes are called principal components. Let’s look at an example:
![](https://thebustalab.github.io/integrated_bioanalytics/images/PCA.png)
In the example above, the three dimensional space can be reduced to a two dimensional space with the principal components analysis. New axes (principal components) are selected (bold arrows on left) that become the x and y axes in the principal components space (right).
We can run and visualize principal components analyses using the runMatrixAnalysis()
function as in the example below. As you can see in the output, the command provides the sample_IDs, sample information, then the coordinates for each sample in the 2D projection (the “PCA plot”) and the raw data, in case you wish to do further processing.
AK_lakes_pca <- runMatrixAnalysis(
data = alaska_lake_data,
analysis = c("pca"),
column_w_names_of_multiple_analytes = "element",
column_w_values_for_multiple_analytes = "mg_per_L",
columns_w_values_for_single_analyte = c("water_temp", "pH"),
columns_w_additional_analyte_info = "element_type",
columns_w_sample_ID_info = c("lake", "park"),
scale_variance = TRUE
)
head(AK_lakes_pca)
## # A tibble: 6 × 18
## sample_unique_ID lake park Dim.1 Dim.2 water_temp
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Devil_Mountain_Lake_… Devi… BELA 0.229 -0.861 6.46
## 2 Imuruk_Lake_BELA Imur… BELA -1.17 -1.62 17.4
## 3 Kuzitrin_Lake_BELA Kuzi… BELA -0.918 -1.15 8.06
## 4 Lava_Lake_BELA Lava… BELA 0.219 -1.60 20.2
## 5 North_Killeak_Lake_B… Nort… BELA 9.46 0.450 11.3
## 6 White_Fish_Lake_BELA Whit… BELA 4.17 -0.972 12.0
## # ℹ 12 more variables: pH <dbl>, C <dbl>, N <dbl>, P <dbl>,
## # Cl <dbl>, S <dbl>, F <dbl>, Br <dbl>, Na <dbl>,
## # K <dbl>, Ca <dbl>, Mg <dbl>
Let’s plot the 2D projection of the Alaska lakes data:
ggplot(data = AK_lakes_pca, aes(x = Dim.1, y = Dim.2)) +
geom_point(aes(fill = park), shape = 21, size = 4, alpha = 0.8) +
geom_label_repel(aes(label = lake), alpha = 0.5) +
theme_classic()
![](index_files/figure-html/unnamed-chunk-121-1.png)
Great! In this plot we can see that White Fish Lake and North Killeak Lake, both in BELA park, are quite different from the other parks (they are separated from the others along dimension 1, i.e. the first principal component). At the same time, Wild Lake, Iniakuk Lake, Walker Lake, and several other lakes in GAAR park are different from all the others (they are separated from the others along dimension 2, i.e. the second principal component).
Important question: what makes the lakes listed above different from the others? Certainly some aspect of their chemistry, since that’s the data that this analysis is built upon, but how do we determine which analyte(s) are driving the differences among the lakes that we see in the PCA plot?
ordination plots
Let’s look at how to access the information about which analytes are major contributors to each principal component. This is important because it will tell you which analytes are associated with particular dimensions, and by extension, which analytes are associated with (and are markers for) particular groups in the PCA plot. This can be determined using an ordination plot. Let’s look at an example. We can obtain the ordination plot information using runMatrixAnalysis()
with analysis = "pca_ord"
:
## # A tibble: 6 × 3
## analyte Dim.1 Dim.2
## <chr> <dbl> <dbl>
## 1 water_temp 0.0750 -0.261
## 2 pH 0.686 0.0185
## 3 C 0.290 -0.242
## 4 N 0.00435 0.714
## 5 P 0.473 -0.0796
## 6 Cl 0.953 0.0148
We can now visualize the ordination plot using our standard ggplot plotting techniques. Note the use of geom_label_repel()
and filter()
to label certain segments in the ordination plot. You do not need to use geom_label_repel()
, you could use the built in geom_label()
, but geom_label_repel()
can make labelling your segments easier.
AK_lakes_pca_ord <- runMatrixAnalysis(
data = alaska_lake_data,
analysis = c("pca_ord"),
column_w_names_of_multiple_analytes = "element",
column_w_values_for_multiple_analytes = "mg_per_L",
columns_w_values_for_single_analyte = c("water_temp", "pH"),
columns_w_additional_analyte_info = "element_type",
columns_w_sample_ID_info = c("lake", "park")
)
head(AK_lakes_pca_ord)
## # A tibble: 6 × 3
## analyte Dim.1 Dim.2
## <chr> <dbl> <dbl>
## 1 water_temp 0.0750 -0.261
## 2 pH 0.686 0.0185
## 3 C 0.290 -0.242
## 4 N 0.00435 0.714
## 5 P 0.473 -0.0796
## 6 Cl 0.953 0.0148
ggplot(AK_lakes_pca_ord) +
geom_segment(aes(x = 0, y = 0, xend = Dim.1, yend = Dim.2, color = analyte), size = 1) +
geom_circle(aes(x0 = 0, y0 = 0, r = 1)) +
geom_label_repel(
data = filter(AK_lakes_pca_ord, Dim.1 > 0.9, Dim.2 < 0.1, Dim.2 > -0.1),
aes(x = Dim.1, y = Dim.2, label = analyte), xlim = c(1,1.5)
) +
geom_label_repel(
data = filter(AK_lakes_pca_ord, Dim.2 > 0.5),
aes(x = Dim.1, y = Dim.2, label = analyte), direction = "y", ylim = c(1,1.5)
) +
coord_cartesian(xlim = c(-1,1.5), ylim = c(-1,1.5)) +
theme_bw()
![](index_files/figure-html/unnamed-chunk-123-1.png)
Great! Here is how to read the ordination plot:
When considering one analyte’s vector: the vector’s projected value on an axis shows how much its variance is aligned with that principal component.
When considering two analyte vectors: the angle between two vectors indicates how correlated those two variables are. If they point in the same direction, they are highly correlated. If they meet each other at 90 degrees, they are not very correlated. If they meet at ~180 degrees, they are negatively correlated. If say that one analyte is “1.9” with respect to dimension 2 and another is “-1.9” with respect to dimension 2. Let’s also say that these vectors are ~“0” with respect to dimension 1.
With the ordination plot above, we can now see that the abundances of K, Cl, Br, and Na are the major contributors of variance to the first principal component (or the first dimension). The abundances of these elements are what make White Fish Lake and North Killeak Lake different from the other lakes. We can also see that the abundances of N, S, and Ca are the major contributors to variance in the second dimension, which means that these elements ar what set Wild Lake, Iniakuk Lake, Walker Lake, and several other lakes in GAAR park apart from the rest of the lakes in the data set. It slightly easier to understand this if we look at an overlay of the two plots, which is often called a “biplot”:
AK_lakes_pca <- runMatrixAnalysis(
data = alaska_lake_data,
analysis = c("pca"),
column_w_names_of_multiple_analytes = "element",
column_w_values_for_multiple_analytes = "mg_per_L",
columns_w_values_for_single_analyte = c("water_temp", "pH"),
columns_w_additional_analyte_info = "element_type",
columns_w_sample_ID_info = c("lake", "park"),
scale_variance = TRUE
)
## Replacing NAs in your data with mean
AK_lakes_pca_ord <- runMatrixAnalysis(
data = alaska_lake_data,
analysis = c("pca_ord"),
column_w_names_of_multiple_analytes = "element",
column_w_values_for_multiple_analytes = "mg_per_L",
columns_w_values_for_single_analyte = c("water_temp", "pH"),
columns_w_additional_analyte_info = "element_type",
columns_w_sample_ID_info = c("lake", "park")
)
## Replacing NAs in your data with mean
ggplot() +
geom_point(
data = AK_lakes_pca,
aes(x = Dim.1, y = Dim.2, fill = park), shape = 21, size = 4, alpha = 0.8
) +
# geom_label_repel(aes(label = lake), alpha = 0.5) +
geom_segment(
data = AK_lakes_pca_ord,
aes(x = 0, y = 0, xend = Dim.1, yend = Dim.2, color = analyte),
size = 1
) +
scale_color_manual(values = discrete_palette) +
theme_classic()
![](index_files/figure-html/unnamed-chunk-124-1.png)
Note that you do not have to plot ordination data as a circular layout of segments. Sometimes it is much easier to plot (and interpret!) alternatives:
AK_lakes_pca_ord %>%
ggplot(aes(x = Dim.1, y = analyte)) +
geom_point(aes(fill = analyte), shape = 22, size = 3) +
scale_fill_manual(values = discrete_palette) +
theme_bw()
![](index_files/figure-html/unnamed-chunk-125-1.png)
principal components
We also can access information about the how much of the variance in the data set is explained by each principal component, and we can plot that using ggplot:
AK_lakes_pca_dim <- runMatrixAnalysis(
data = alaska_lake_data,
analysis = c("pca_dim"),
column_w_names_of_multiple_analytes = "element",
column_w_values_for_multiple_analytes = "mg_per_L",
columns_w_values_for_single_analyte = c("water_temp", "pH"),
columns_w_additional_analyte_info = "element_type",
columns_w_sample_ID_info = c("lake", "park")
)
head(AK_lakes_pca_dim)
## # A tibble: 6 × 2
## principal_component percent_variance_explained
## <dbl> <dbl>
## 1 1 48.8
## 2 2 18.6
## 3 3 11.6
## 4 4 7.88
## 5 5 4.68
## 6 6 3.33
ggplot(
data = AK_lakes_pca_dim,
aes(x = principal_component, y = percent_variance_explained)
) +
geom_line() +
geom_point() +
theme_bw()
![](index_files/figure-html/unnamed-chunk-126-1.png)
Cool! We can see that the first principal component retains nearly 50% of the variance in the original dataset, while the second dimension contains only about 20%. We can derive an important notion about PCA visualization from this: the scales on the two axes need to be the same for distances between points in the x and y directions to be comparable. This can be accomplished using coord_fixed()
as an addition to your ggplots.
tsne and umap
set.seed(235)
runMatrixAnalysis(
data = metabolomics_data,
analysis = "pca",
column_w_names_of_multiple_analytes = NULL,
column_w_values_for_multiple_analytes = NULL,
columns_w_values_for_single_analyte = colnames(metabolomics_data)[c(3:126)],
columns_w_additional_analyte_info = NULL,
columns_w_sample_ID_info = colnames(metabolomics_data)[c(1:2)],
na_replacement = "mean"
) -> pca_data
## Replacing NAs in your data with mean
pca_data$technique <- "pca_data"
colnames(pca_data) <- gsub("\\.", "_", colnames(pca_data))
pca_data$Dim_1 <- as.numeric(scale(pca_data$Dim_1))
pca_data$Dim_2 <- as.numeric(scale(pca_data$Dim_2))
runMatrixAnalysis(
data = metabolomics_data,
analysis = "umap",
column_w_names_of_multiple_analytes = NULL,
column_w_values_for_multiple_analytes = NULL,
columns_w_values_for_single_analyte = colnames(metabolomics_data)[c(3:126)],
columns_w_additional_analyte_info = NULL,
columns_w_sample_ID_info = colnames(metabolomics_data)[c(1:2)],
na_replacement = "mean"
) -> umap_data
## Replacing NAs in your data with mean
umap_data$technique <- "umap_data"
umap_data$Dim_1 <- as.numeric(scale(umap_data$Dim_1))
umap_data$Dim_2 <- as.numeric(scale(umap_data$Dim_2))
runMatrixAnalysis(
data = metabolomics_data,
analysis = "tsne",
column_w_names_of_multiple_analytes = NULL,
column_w_values_for_multiple_analytes = NULL,
columns_w_values_for_single_analyte = colnames(metabolomics_data)[c(3:126)],
columns_w_additional_analyte_info = NULL,
columns_w_sample_ID_info = colnames(metabolomics_data)[c(1:2)],
na_replacement = "mean"
) -> tsne_data
## Replacing NAs in your data with mean
tsne_data$technique <- "tsne_data"
tsne_data$Dim_1 <- as.numeric(scale(tsne_data$Dim_1))
tsne_data$Dim_2 <- as.numeric(scale(tsne_data$Dim_2))
data <- rbind(pca_data, umap_data, tsne_data)
p1 <- ggplot(data) +
geom_point(aes(x = Dim_1, y = Dim_2, fill = patient_status), shape = 21, size= 4) +
facet_grid(technique~., scales = "free") +
scale_fill_brewer(palette = "Set1")
p2 <- ggplot(data) +
geom_point(aes(x = Dim_1, y = Dim_2, fill = patient_status), shape = 21, size= 4) +
facet_grid(technique~., scales = "free") +
scale_fill_brewer(palette = "Set1")
p1 + p2
![](index_files/figure-html/unnamed-chunk-130-1.png)
further reading
Here is a video that nicely explains PCA: https://www.youtube.com/watch?v=FgakZw6K1QQ&list=PLblh5JKOoLUIcdlgu78MnlATeyx4cEVeR
https://datavizpyr.com/how-to-make-umap-plot-in-r/
https://datavizpyr.com/how-to-make-tsne-plot-in-r/