hierarchical clustering

“Which of my samples are most closely related?”
3.1 clustering
So far we have been looking at how to plot raw data, summarize data, and reduce a data set’s dimensionality. It’s time to look at how to identify relationships between the samples in our data sets. For example: in the Alaska lakes dataset, which lake is most similar, chemically speaking, to Lake Narvakrak? Answering this requires calculating numeric distances between samples based on their chemical properties. For this, the first thing we need is a distance matrix:
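As a warm-up, here is a minimal sketch in base R, using made-up numbers rather than the lakes data: the Euclidean distance between two samples is the square root of the summed squared differences of their measurements, and dist() computes that value for every pair of rows in a table.
# Two hypothetical samples measured for three analytes (toy values)
sample_a <- c(pH = 6.2, Ca = 10, Mg = 3)
sample_b <- c(pH = 7.9, Ca = 12, Mg = 5)

# Euclidean distance "by hand"
sqrt(sum((sample_a - sample_b)^2))

# dist() does the same calculation for every pair of rows
dist(rbind(sample_a, sample_b))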

Please note that we can get distance matrices directly from runMatrixAnalyses() by specifying analysis = "dist":
alaska_lake_data %>%
  select(-element_type) %>%
  pivot_wider(names_from = "element", values_from = "mg_per_L") -> alaska_lake_data_wide

dist <- runMatrixAnalyses(
  data = alaska_lake_data_wide,
  analysis = c("dist"),
  columns_w_values_for_single_analyte = colnames(alaska_lake_data_wide)[3:15],
  columns_w_sample_ID_info = c("lake", "park")
)
## Replacing NAs in your data with mean
as.matrix(dist)[1:3,1:3]
##                          Devil_Mountain_Lake_BELA Imuruk_Lake_BELA Kuzitrin_Lake_BELA
## Devil_Mountain_Lake_BELA                 0.000000         3.672034           1.663147
## Imuruk_Lake_BELA                         3.672034         0.000000           3.062381
## Kuzitrin_Lake_BELA                       1.663147         3.062381           0.000000
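That matrix already answers our opening question. As a quick sketch (it assumes the row name for Lake Narvakrak contains "Narvakrak"; adjust the pattern if your sample IDs differ), we can pull out that lake's row and look for the smallest non-zero distance:
# Which lake is chemically most similar to Lake Narvakrak?
dist_matrix <- as.matrix(dist)
narvakrak_distances <- dist_matrix[grep("Narvakrak", rownames(dist_matrix)), ]
sort(narvakrak_distances)[2] # position 1 is the lake's zero distance to itself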
There is more that we can do with distance matrices though, lots more. Let's start by looking at an example of hierarchical clustering. For this, we just need to tell runMatrixAnalyses() to use analysis = "hclust":
AK_lakes_clustered <- runMatrixAnalyses(
  data = alaska_lake_data_wide,
  analysis = c("hclust"),
  columns_w_values_for_single_analyte = colnames(alaska_lake_data_wide)[3:15],
  columns_w_sample_ID_info = c("lake", "park")
)
AK_lakes_clustered
## # A tibble: 38 × 26
## sample_unique_ID lake park parent node branch.length
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 Devil_Mountain_La… Devi… BELA 23 1 7.81
## 2 Imuruk_Lake_BELA Imur… BELA 25 2 6.01
## 3 Kuzitrin_Lake_BELA Kuzi… BELA 24 3 3.27
## 4 Lava_Lake_BELA Lava… BELA 32 4 3.14
## 5 North_Killeak_Lak… Nort… BELA 38 5 256.
## 6 White_Fish_Lake_B… Whit… BELA 38 6 0.828
## 7 Iniakuk_Lake_GAAR Inia… GAAR 36 7 2.59
## 8 Kurupa_Lake_GAAR Kuru… GAAR 30 8 6.00
## 9 Lake_Matcharak_GA… Lake… GAAR 35 9 2.09
## 10 Lake_Selby_GAAR Lake… GAAR 29 10 3.49
## # ℹ 28 more rows
## # ℹ 20 more variables: label <chr>, isTip <lgl>, x <dbl>,
## # y <dbl>, branch <dbl>, angle <dbl>, bootstrap <lgl>,
## # water_temp <dbl>, pH <dbl>, C <dbl>, N <dbl>, P <dbl>,
## # Cl <dbl>, S <dbl>, F <dbl>, Br <dbl>, Na <dbl>,
## # K <dbl>, Ca <dbl>, Mg <dbl>
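It works! For intuition, here is roughly what happened under the hood, sketched in base R. Note that the mean imputation and scaling shown here are assumptions for illustration; they are not necessarily identical to what runMatrixAnalyses() does internally.
# Base-R sketch: cluster the lakes on their analyte values
analyte_values <- as.data.frame(alaska_lake_data_wide[3:15])
analyte_values[] <- lapply(analyte_values, function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)) # mean-impute NAs
base_clusters <- hclust(dist(scale(analyte_values))) # distance matrix -> agglomerative clustering

# Draw a quick dendrogram and cut it into, say, four groups
plot(base_clusters, labels = alaska_lake_data_wide$lake)
cutree(base_clusters, k = 4)
Each merge in a dendrogram joins the two closest clusters; the tibble above is that same kind of tree, reshaped so it can be handed straight to ggtree.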
Now we can plot our cluster diagram with a ggplot add-on called ggtree. We've seen that ggplot takes a “data” argument (i.e. ggplot(data = <some_data>) + geom_*() etc.). In contrast, ggtree takes an argument called tr, though if you're using the runMatrixAnalyses() function, you can treat these two (data and tr) the same, so use: ggtree(tr = <output_from_runMatrixAnalyses>) + geom_*() etc.
Note that ggtree also comes with several great new geoms: geom_tiplab() and geom_tippoint(). Let's try those out:
library(ggtree)

AK_lakes_clustered %>%
  ggtree() +
  geom_tiplab() +
  geom_tippoint() +
  theme_classic() +
  scale_x_continuous(limits = c(0,700))

Cool! Though that plot could use some tweaking… let’s try:
AK_lakes_clustered %>%
  ggtree() +
  geom_tiplab(aes(label = lake), offset = 10, align = TRUE) +
  geom_tippoint(shape = 21, aes(fill = park), size = 4) +
  scale_x_continuous(limits = c(0,600)) +
  scale_fill_brewer(palette = "Set1") +
  # theme_classic() +
  theme(
    legend.position = c(0.4,0.2)
  )

Very nice! Since North Killeak Lake and White Fish Lake are so different from the others, we could re-analyze the data with those two removed:
alaska_lake_data_wide %>%
  filter(!lake %in% c("North_Killeak_Lake", "White_Fish_Lake")) -> alaska_lake_data_wide_filtered

runMatrixAnalyses(
  data = alaska_lake_data_wide_filtered,
  analysis = c("hclust"),
  columns_w_values_for_single_analyte = colnames(alaska_lake_data_wide_filtered)[3:15],
  columns_w_sample_ID_info = c("lake", "park")
) %>%
  ggtree() +
  geom_tiplab(aes(label = lake), offset = 10, align = TRUE) +
  geom_tippoint(shape = 21, aes(fill = park), size = 4) +
  scale_x_continuous(limits = c(0,100)) +
  scale_fill_brewer(palette = "Set1") +
  # theme_classic() +
  theme(
    legend.position = c(0.05,0.9)
  )
## Replacing NAs in your data with mean

Annotating trees
Overlaying sample traits on a ggtree-based plot is straightforward when we combine ggtree with ggplot2. We begin by running the hierarchical clustering analysis and keeping its output for later plotting.
hclust_out <- runMatrixAnalyses(
  data = chemical_blooms,
  analysis = c("hclust"),
  columns_w_values_for_single_analyte = colnames(chemical_blooms)[2:10],
  columns_w_sample_ID_info = "label"
)
## ! The tree contained negative edge lengths. If you want to
## ignore the edges, you can set
## `options(ignore.negative.edge=TRUE)`, then re-run ggtree.
The object returned by runMatrixAnalyses() already contains the branch coordinates, so we can pass it directly to ggtree() and add tip labels while keeping the tree readable. Note that we deliberately control the y-axis here; that will be key for aligning the tree with other plots later.
tree_plot <- ggtree(hclust_out) +
  geom_tiplab(size = 2, align = TRUE) +
  scale_x_continuous(limits = c(0, 300)) +
  scale_y_continuous(limits = c(0, 80)) +
  theme_classic()
## Scale for y is already present.
## Adding another scale for y, which will replace the existing
## scale.
tree_plot

Next, reshape the tip-level measurements to long form so each chemical becomes its own column of tiles. Because we reuse the y coordinate supplied by ggtree, the tiles inherit the same vertical order as the tips in the tree. Note that we remove the other columns in the hclust output for simplicity; they are only needed if we want to draw the full tree. We also give the y-axis here the same bounds (limits) as the tree we made previously.
heat_plot <- hclust_out %>%
  filter(isTip) %>%
  select(-parent, -node, -branch.length, -label, -isTip, -x, -branch, -angle, -bootstrap) %>%
  pivot_longer(cols = 3:11, names_to = "chemical", values_to = "abundance") %>%
  ggplot(aes(x = chemical, y = y, fill = abundance)) +
  geom_tile() +
  scale_y_continuous(limits = c(0, 80)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
heat_plot
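Before combining the two panels, it can be reassuring to confirm that they really do share y positions. In a rectangular ggtree layout, each tip sits at an integer y value, and those are the same values the tiles use. A quick check (just a sketch that inspects columns we have already seen in hclust_out):
# Quick sanity check: tips occupy integer y positions from 1 up to the number of samples
hclust_out %>%
  filter(isTip) %>%
  select(node, x, y) %>%
  arrange(y) %>%
  head()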

With matching y scales, plot_grid() from the cowplot package can align the tree and the heat map so the tiles line up with the corresponding samples. Using align = "h" snaps the two panels together horizontally, and axis = "tb" keeps their heights consistent.
library(cowplot)

plot_grid(tree_plot, heat_plot, axis = "tb", align = "h")

Note: if we were to instead build the heat map directly from the raw chemical_blooms table, the rows would fall back to alphabetical order and the heat map would no longer match the dendrogram ordering:
chemical_blooms %>%
  pivot_longer(cols = 2:10, names_to = "chemical", values_to = "abundance") %>%
  ggplot(aes(x = chemical, y = label, fill = abundance)) +
  geom_tile()
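If you do want to build the heat map straight from chemical_blooms but keep the dendrogram's ordering, one option (a sketch; it assumes the tip labels stored in hclust_out match the values in the label column of chemical_blooms) is to turn label into a factor whose levels follow the tip order:
# Recover the tip order from the clustering output (bottom to top along y)
tip_order <- hclust_out %>%
  filter(isTip) %>%
  arrange(y) %>%
  pull(label)

# Use that order for the y axis of the raw-table heat map
chemical_blooms %>%
  pivot_longer(cols = 2:10, names_to = "chemical", values_to = "abundance") %>%
  mutate(label = factor(label, levels = tip_order)) %>%
  ggplot(aes(x = chemical, y = label, fill = abundance)) +
  geom_tile()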

further reading
annotating phylogenetic trees with ggtree. This free, book-length reference from the ggtree authors walks through annotating dendrograms and phylogenetic trees in R, explaining how to layer metadata, add tip labels, and customize themes—perfect for polishing the cluster plots we generated in this chapter.
hierarchical clustering in R. UC Business Analytics’ tutorial provides a gentle, R-focused walkthrough of agglomerative clustering, covering distance matrices, linkage choices, dendrogram interpretation, and cluster cutting with reproducible code examples. Note that this text uses base R instead of our in-class functions.