phylogenies

buildTree

This function is a swiss army knife for tree building. It takes as input alignments or existing phylogenies from which to derive a phylogeny of interest, it can use neighbor-joining or maximum liklihood methods (with model optimization), it can run bootstrap replicates, and it can calculate ancestral sequence states. To illustrate, let’s look at some examples:

newick input

Let’s use the Busta lab’s plant phylogeny [derived from Qian et al., 2016] to build a phylogeny with five species in it.

tree <- buildTree(
  scaffold_type = "newick",
  scaffold = "https://thebustalab.github.io/data/plant_phylogeny.newick",
  members = c("Sorghum_bicolor", "Zea_mays", "Setaria_viridis", "Arabidopsis_thaliana", "Amborella_trichopoda")
)
## Pro tip: most tree read/write functions reset node numbers.
## Fortify your tree and save it as a csv file to preserve node numbering.
## Do not save your tree as a newick or nexus file.

tree
## 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## 
## Tip labels:
##   Amborella_trichopoda, Zea_mays, Sorghum_bicolor, Setaria_viridis, Arabidopsis_thaliana
## Node labels:
##   , , , 
## 
## Rooted; includes branch lengths.

plot(tree)

Cool! We got our phylogeny. What happens if we want to build a phylogeny that has a species on it that isn’t in our scaffold? For example, what if we want to build a phylogeny that includes Arabidopsis neglecta? We can include that name in our list of members:

tree <- buildTree(
  scaffold_type = "newick",
  scaffold_in_path = "https://thebustalab.github.io/data/plant_phylogeny.newick",
  members = c("Sorghum_bicolor", "Zea_mays", "Setaria_viridis", "Arabidopsis_neglecta", "Amborella_trichopoda")
)
## Scaffold newick tip Arabidopsis_thaliana substituted with Arabidopsis_neglecta
## Pro tip: most tree read/write functions reset node numbers.
## Fortify your tree and save it as a csv file to preserve node numbering.
## Do not save your tree as a newick or nexus file.

tree
## 
## Phylogenetic tree with 5 tips and 4 internal nodes.
## 
## Tip labels:
##   Amborella_trichopoda, Zea_mays, Sorghum_bicolor, Setaria_viridis, Arabidopsis_neglecta
## Node labels:
##   , , , 
## 
## Rooted; includes branch lengths.

plot(tree)

Note that buildTree informs us: “Scaffold newick tip Arabidopsis_thaliana substituted with Arabidopsis_neglecta”. This means that Arabidopsis neglecta was grafted onto the tip originally occupied by Arabidopsis thaliana. This behaviour is useful when operating on a large phylogenetic scale (i.e. where exact phylogeny topology is not critical below the family level). However, if a person is interested in using an existing newick tree as a scaffold for a phylogeny where genus-level topology is critical, then beware! Your scaffold may not be appropriate if you see that message. When operating at the genus level, you probably want to use sequence data to build your phylogeny anyway. So let’s look at how to do that:

alignment input

Arguments in this case are:

  • “scaffold_type”: “amin_alignment” or “nucl_alignment” for amino acids or nucleotides.
  • “scaffold_in_path”: path to the fasta file that contains the alignment from which you want to build a tree.
  • “ml”: Logical, TRUE if you want to use maximum liklihood, FALSE if not, in which case neighbor joining will ne used.
  • “model_test”: if you say TRUE to “ml”, should buildTree test different maximum liklihood models and then use the “best” one?
  • “bootstrap”: TRUE or FALSE, whether you want bootstrap values on the nodes.
  • “ancestral_states”: TRUE or FALSE, should buildTree() compute the ancestral sequence at each node?
  • “root”: NULL, or the name of an accession that should form the root of the tree.
buildTree(
  scaffold_type = "amin_alignment",
  scaffold_in_path = "/path_to/a_folder_for_alignments/all_amin_seqs.fa",
  ml = FALSE, 
  model_test = FALSE,
  bootstrap = FALSE,
  ancestral_states = FALSE,
  root = NULL
)

plotting trees

There are several approaches to plotting trees. A simple one is using the base plot function:

test_tree_small <- buildTree(
  scaffold_type = "newick",
  scaffold_in_path = "https://thebustalab.github.io/data/plant_phylogeny.newick",
  members = c("Sorghum_bicolor", "Zea_mays", "Setaria_viridis")
)
## Pro tip: most tree read/write functions reset node numbers.
## Fortify your tree and save it as a csv file to preserve node numbering.
## Do not save your tree as a newick or nexus file.

plot(test_tree_small)

Though this can get messy when there are lots of tip labels:

set.seed(122)
test_tree_big <- buildTree(
  scaffold_type = "newick",
  scaffold_in_path = "https://thebustalab.github.io/data/plant_phylogeny.newick",
  members = plant_species$Genus_species[abs(floor(rnorm(60)*100000))]
)

plot(test_tree_big)

One solution is to use ggtree, which by default doesn’t show tip labels. plot can do that too, but ggtree does a bunch of other useful things, so I recommend that:

ggtree(test_tree_big)

Another convenient fucntion is ggplot’s fortify. This will convert your phylo object into a data frame:

test_tree_big_fortified <- fortify(test_tree_big)
test_tree_big_fortified
## # A tbl_tree abstraction: 101 × 9
## # which can be converted to treedata or phylo 
## # via as.treedata or as.phylo
##    parent  node branch.length label isTip     x     y branch
##     <int> <int>         <dbl> <chr> <lgl> <dbl> <dbl>  <dbl>
##  1     54     1          83.0 Wolf… TRUE   188.     1   147.
##  2     54     2          83.0 Spat… TRUE   188.     2   147.
##  3     55     3         138.  Dios… TRUE   188.     3   120.
##  4     58     4          42.7 Bulb… TRUE   188.     5   167.
##  5     58     5          42.7 Ober… TRUE   188.     6   167.
##  6     59     6          32.0 Poma… TRUE   188.     7   172.
##  7     59     7          32.0 Teli… TRUE   188.     8   172.
##  8     56     8         135.  Cala… TRUE   188.     4   121.
##  9     61     9         147.  Pepe… TRUE   188.     9   115.
## 10     62    10         121.  Endl… TRUE   188.    10   128.
## # ℹ 91 more rows
## # ℹ 1 more variable: angle <dbl>

ggtree can still plot this dataframe, and it allows metadata to be stored in a human readable format by using mutating joins (explained below). This metadata can be plotted with standard ggplot geoms, and these dataframes can also conveniently be saved as .csv files:


## Note that "plant_species" comes with the phylochemistry source.

test_tree_big_fortified_w_data <- left_join(test_tree_big_fortified, plant_species, by = c("label" = "Genus_species"))

test_tree_big_fortified_w_data
## # A tbl_tree abstraction: 101 × 14
## # which can be converted to treedata or phylo 
## # via as.treedata or as.phylo
##    parent  node branch.length label isTip     x     y branch
##     <int> <int>         <dbl> <chr> <lgl> <dbl> <dbl>  <dbl>
##  1     54     1          83.0 Wolf… TRUE   188.     1   147.
##  2     54     2          83.0 Spat… TRUE   188.     2   147.
##  3     55     3         138.  Dios… TRUE   188.     3   120.
##  4     58     4          42.7 Bulb… TRUE   188.     5   167.
##  5     58     5          42.7 Ober… TRUE   188.     6   167.
##  6     59     6          32.0 Poma… TRUE   188.     7   172.
##  7     59     7          32.0 Teli… TRUE   188.     8   172.
##  8     56     8         135.  Cala… TRUE   188.     4   121.
##  9     61     9         147.  Pepe… TRUE   188.     9   115.
## 10     62    10         121.  Endl… TRUE   188.    10   128.
## # ℹ 91 more rows
## # ℹ 6 more variables: angle <dbl>, Phylum <chr>,
## #   Order <chr>, Family <chr>, Genus <chr>, species <chr>

ggtree(test_tree_big_fortified_w_data) + 
  geom_point(
    data = filter(test_tree_big_fortified_w_data, isTip == TRUE),
    aes(x = x, y = y, fill = Order), size = 3, shape = 21, color = "black") +
  geom_text(
    data = filter(test_tree_big_fortified_w_data, isTip == TRUE),
    aes(x = x, y = y, label = y), size = 2, color = "white") +
  geom_tiplab(aes(label = label), offset = 10, size = 2) +
  theme_void() +
  scale_fill_manual(values = discrete_palette) +
  coord_cartesian(xlim = c(0,280)) +
  theme(
    legend.position = c(0.15, 0.75)
  )

collapseTree

Sometimes we want to view a tree at a higher level of taxonomical organization, or some other higher level. This can be done easily using the collapseTree function. It takes two arguments: an un-fortified tree (tree), and a two-column data frame (associations). In the first column of the data frame are all the tip labels of the tree, and in the second column are the higher level of organization to which each tip belongs. The function will prune the tree so that only one member of the higher level of organization is included in the output. For example, let’s look at the tree from the previous section at the family level:

collapseTree(
  tree = test_tree_big,
  associations = data.frame(
    tip.label = test_tree_big$tip.label,
    family = plant_species$Family[match(test_tree_big$tip.label, plant_species$Genus_species)]
  )
) -> test_tree_big_families

ggtree(test_tree_big_families) + geom_tiplab() + coord_cartesian(xlim = c(0,300))

trees and traits

To plot traits alongside a tree, we can use ggtree in combination with ggplot. Here is an example. First, we make the tree:

chemical_bloom_tree <- buildTree(
  scaffold_type = "newick",
  scaffold_in_path = "http://thebustalab.github.io/data/angiosperms.newick",
  members = unique(chemical_blooms$label)
)
## Scaffold newick tip Sabal_pumos substituted with Sabal_palmetto
## Scaffold newick tip Iris_lazica substituted with Iris_sp
## Scaffold newick tip Iris_lacustris substituted with Iris_germanica
## Scaffold newick tip Allium_textile substituted with Allium_sp
## Scaffold newick tip Allium_subhirsutum substituted with Allium_brevistylum
## Scaffold newick tip Ornithogalum_saundersiae substituted with Ornithogalum_candicans
## Scaffold newick tip Hosta_plantaginea substituted with Hosta_sp
## Scaffold newick tip Agave_striata substituted with Agave_cerulata
## Scaffold newick tip Agave_tequilana substituted with Agave_chrysantha
## Scaffold newick tip Aristolochia_serpentaria substituted with Aristolochia_labiata
## Scaffold newick tip Thalictrum_clavatum substituted with Thalictrum_rochebrunianum
## Scaffold newick tip Delphinium_pictum substituted with Delphinium_elatum
## Scaffold newick tip Ferocactus_recurvus substituted with Ferocactus_wislizeni
## Scaffold newick tip Opuntia_articulata substituted with Opuntia_sp
## Scaffold newick tip Opuntia_quimilo substituted with Opuntia_robusta
## Scaffold newick tip Cylindropuntia_fulgida substituted with Cylindropuntia_bigelovii
## Scaffold newick tip Cylindropuntia_prolifera substituted with Cylindropuntia_versicolor
## Scaffold newick tip Cylindropuntia_echinocarpa substituted with Cylindropuntia_ramosissima
## Scaffold newick tip Kirengeshoma_palmata substituted with Kirengeshoma_Palmata
## Scaffold newick tip Lavandula_bipinnata substituted with Lavandula_sp
## Scaffold newick tip Rudbeckia_hirta substituted with Rudbeckia_occidentalis
## Scaffold newick tip Crassula_campestris substituted with Crassula_ovata
## Scaffold newick tip Crassula_tillaea substituted with Crassula_deceptor
## Scaffold newick tip Crassula_alata substituted with Crassula_arborescens
## Scaffold newick tip Crassula_colligata substituted with Crassula_perfoliata
## Scaffold newick tip Hylotelephium_erythrostictum substituted with Hylotelephium_sp
## Scaffold newick tip Echeveria_setosa substituted with Echeveria_pulidonis
## Scaffold newick tip Bryophyllum_pinnatum substituted with Bryophyllum_fedtschenkoi
## Scaffold newick tip Kalanchoe_linearifolia substituted with Kalanchoe_marginata
## Scaffold newick tip Kalanchoe_tomentosa substituted with Kalanchoe_luciae
## Scaffold newick tip Kalanchoe_beharensis substituted with Kalanchoe_thrysiflora
## Scaffold newick tip Iliamna_latibracteata substituted with Iliamna_rivularis
## Scaffold newick tip Euphorbia_lathyris substituted with Euphorbia_resinifera
## Scaffold newick tip Quercus_valdinervosa substituted with Quercus_muehlenbergii
## Scaffold newick tip Rubus_repens substituted with Rubus_sp
## Pro tip: most tree read/write functions reset node numbers.
## Fortify your tree and save it as a csv file to preserve node numbering.
## Do not save your tree as a newick or nexus file.

Next we join the tree with the data:

data <- left_join(fortify(chemical_bloom_tree), chemical_blooms)
## Joining with `by = join_by(label)`
head(data)
## # A tibble: 6 × 18
##   parent  node branch.length label  isTip     x     y branch
##    <int> <int>         <dbl> <chr>  <lgl> <dbl> <dbl>  <dbl>
## 1     80     1         290.  Ginkg… TRUE   352.     1   207.
## 2     81     2         267.  Picea… TRUE   352.     2   219.
## 3     81     3         267.  Cupre… TRUE   352.     3   219.
## 4     84     4         135.  Eryth… TRUE   352.     5   285.
## 5     86     5          16.2 Iris_… TRUE   352.     6   344.
## 6     86     6          16.2 Iris_… TRUE   352.     7   344.
## # ℹ 10 more variables: angle <dbl>, Alkanes <dbl>,
## #   Sec_Alcohols <dbl>, Others <dbl>, Fatty_acids <dbl>,
## #   Alcohols <dbl>, Triterpenoids <dbl>, Ketones <dbl>,
## #   Other_compounds <dbl>, Aldehydes <dbl>

Now we can plot the tree:

tree_plot <- ggtree(data) +
  geom_tiplab(
    align = TRUE, hjust = 1, offset = 350,
    geom = "label", label.size = 0, size = 3
  ) +
  scale_x_continuous(limits = c(0,750))

IMPORTANT! When we plot the traits, we need to reorder whatever is on the shared axis (in this case, the y axis) so that it matches the order of the tree. In this case, we need to reorder the species names so that they match the order of the tree. We can do this by using the reorder function, which takes two arguments: the thing to be reordered, and the thing to be reordered by. In this case, we want to reorder the species names by their y coordinate on the tree. We can do this by using the y column of the data frame that we created when we fortified the tree. We can then plot the traits:

trait_plot <- ggplot(
    data = pivot_longer(
      filter(data, isTip == TRUE),
      cols = 10:18, names_to = "compound", values_to = "abundance"
    ),
    aes(x = compound, y = reorder(label, y), size = abundance)
  ) +
  geom_point() +
  scale_y_discrete(name = "") +
  theme(
    plot.margin = unit(c(1,1,1,1), "cm")
  )

Finally, we can plot the two plots together using plot_grid. It is important to manually inspect the tree tips and the y axis text to make sure that everything lines up. We don’t want to be plotting the abundance of one species on the y axis of another species. In this case, everything looks good:

plot_grid(
  tree_plot,
  trait_plot,
  nrow = 1, align = "h", axis = "tb"
)

Once our manual inspection is complete, we can make a new version of the plot in which the y axis text is removed from the trait plot and we can reduce the margin on the left side of the trait plot to make it look nicer:

tree_plot <- ggtree(data) +
  geom_tiplab(
    align = TRUE, hjust = 1, offset = 350,
    geom = "label", label.size = 0, size = 3
  ) +
  scale_x_continuous(limits = c(0,750))

trait_plot <- ggplot(
    data = pivot_longer(
      filter(data, isTip == TRUE),
      cols = 10:18, names_to = "compound", values_to = "abundance"
    ),
    aes(x = compound, y = reorder(label, y), size = abundance)
  ) +
  geom_point() +
  scale_y_discrete(name = "") +
  theme(
    axis.text.y = element_blank(),
    plot.margin = unit(c(1,1,1,-1.5), "cm")
  )

plot_grid(
  tree_plot,
  trait_plot,
  nrow = 1, align = "h", axis = "tb"
)