alignments

Multiple sequence alignments form the bridge between raw sequences and downstream phylogenetic analysis. In this chapter we first use alignSequences() to assemble consistent nucleotide, amino-acid, or codon alignments. This is easily accomplished from the monolist generated by polyBlast(), if your sequences were generated that way. Then we turn to analyzeAlignment() to diagnose gap-heavy or poorly conserved sites, interactively trim them away, and export a cleaned alignment for tree building.
alignSequences
There are, of course, many tools for aligning sequences. alignSequences(), from the phylochemistry toolkit, is designed to be flexible: it will align nucleotide, amino-acid, or codon sequences, and it can restrict the alignment to any subset of records in your BLAST monolist. If you already ran polyBlast(), most of the inputs outlined below should be ready to go. The function writes the aligned FASTA to alignment_out_path and returns nothing (invisibly).
- "monolist": a data frame where each row represents a hit of interest. It must contain an accession column whose values match individual FASTA files in sequences_of_interest_directory_path. The helper readMonolist() returns this object from the CSV generated by polyBlast().
- "subset": the name of a logical column in the monolist (for example subset_all). Only rows where that column is TRUE are aligned. Create additional subset_* columns if you want multiple alignment configurations.
- "alignment_out_path": full file path (including filename) for the alignment that will be written, e.g. /tmp/my_alignment.fa. Existing files at this path are overwritten.
- "sequences_of_interest_directory_path": directory that stores one FASTA per accession, as produced by polyBlast(). The function normalises the path internally, so either absolute or relative locations are fine.
- "input_sequence_type": set to "nucl" when the subject FASTA files contain nucleotide sequences and "amin" for amino-acid sequences. Codon alignments start from nucleotides, while amino-acid alignments can start from either input type (translation happens automatically when needed).
- "mode": chooses the alignment strategy. "nucl_align" performs a straight nucleotide multiple sequence alignment. "amin_align" aligns amino-acid sequences (translated from nucleotides if required). "codon_align" translates to protein, aligns there, and then projects the alignment back to nucleotides. "fragment_align" is reserved for fragment-to-fragment workflows and currently expects a base fragment defined via base_fragment.
- "base_fragment": optional path to a FASTA file containing the fragment that other sequences should be aligned against. Only used when mode = "fragment_align".
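As a minimal sketch of the subset mechanism described above, the following builds a toy monolist in memory and adds two logical subset_* columns. The accessions and the length column are hypothetical and exist only for illustration; a real monolist would come from readMonolist().

```r
# Toy monolist for illustration only. The text above requires an `accession`
# column; `length` is a hypothetical column assumed for this sketch.
monolist <- data.frame(
  accession = c("seq_A", "seq_B", "seq_C"),
  length = c(1200, 450, 980),
  stringsAsFactors = FALSE
)

# One logical column per alignment configuration.
monolist$subset_all <- TRUE                      # align every hit
monolist$subset_long <- monolist$length >= 900   # align only the longer hits

# Accessions that would be selected by subset = "subset_long".
monolist$accession[monolist$subset_long]
```

Either column name could then be passed as the subset argument, as in the call that follows.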
alignSequences(
  monolist = readMonolist("/path_to/a_csv_file_that_will_list_all_blast_hits.csv"),
  subset = "subset_all",
  alignment_out_path = "/path_to/a_folder_for_alignments/subset_all_amin_seqs_aligned.fa",
  sequences_of_interest_directory_path = "/path_to/a_folder_for_hit_sequences/",
  input_sequence_type = "amin",
  mode = "amin_align",
  base_fragment = NULL
)
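Because alignSequences() returns nothing, one way to confirm that the call succeeded is to read the written FASTA back and check that every sequence has the same width. The sketch below uses only base R and a toy file written to a temporary path; read_fasta() is a hypothetical helper defined here, not part of the phylochemistry toolkit. In practice demo_path would be replaced by your alignment_out_path.

```r
# Minimal FASTA reader (base R only): returns a named character vector.
# Hypothetical helper for this sketch; assumes the file starts with a header.
read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]               # drop blank lines
  is_header <- startsWith(lines, ">")
  headers <- sub("^>", "", lines[is_header])
  record <- cumsum(is_header)                 # record index for every line
  chunks <- split(lines[!is_header], record[!is_header])
  seqs <- vapply(chunks, paste, character(1), collapse = "")
  names(seqs) <- headers[as.integer(names(chunks))]
  seqs
}

# Toy alignment standing in for the file written to alignment_out_path.
demo_path <- tempfile(fileext = ".fa")
writeLines(c(">seq_A", "MK-LV", "AG", ">seq_B", "MKQLVAG"), demo_path)

aligned <- read_fasta(demo_path)
widths <- nchar(aligned)

# In a valid alignment every row has the same number of columns.
stopifnot(length(unique(widths)) == 1)
```

If Biostrings is installed, its FASTA readers would serve the same purpose; the base R version is used here only to keep the sketch dependency-free.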
analyzeAlignment
Once you have an alignment on disk, analyzeAlignment() helps you decide which positions to keep before phylogeny building. It reads a FASTA alignment, profiles each column for gap content and conservation, and launches a Shiny interface that lets you tune thresholds while inspecting the original and trimmed alignments alongside neighbour-joining trees. The function returns the trimmed alignment (as a DNAStringSet or AAStringSet) invisibly and also writes it to <alignment_in_path>_trimmed.
- "alignment_in_path": path to the FASTA alignment to review. This is typically the file produced by alignSequences(), e.g. /tmp/subset_all_amin_seqs_aligned.fa.
- "type": "DNA" for nucleotide alignments or "AA" for amino acids. This determines how the alignment is read, which tree scaffold is built, and which Biostrings container is used when writing the trimmed alignment.
- "jupyter": set to TRUE when launching from a Jupyter environment so the app claims an available internal port and prints the connection URL. Leave at the default FALSE for regular RStudio or terminal sessions.
The interface plots gap percentages and conservation across the alignment, highlights the positions that survive the current filters, and previews the trimmed tree. When you click Return Trimmed Alignment & Close App, the filtered alignment is written to <alignment_in_path>_trimmed (the input path with _trimmed appended) and returned to the caller.
analyzeAlignment(
  alignment_in_path = "/path_to/a_folder_for_alignments/subset_all_amin_seqs_aligned.fa",
  type = "AA",
  jupyter = FALSE
)
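To make the column filtering concrete, here is a rough base R sketch of gap-based and conservation-based trimming on a toy alignment. It is not the implementation used by analyzeAlignment(): the conservation measure (the share of the most common non-gap residue in a column), the threshold names, and the threshold values are all assumptions chosen for illustration.

```r
# Toy amino-acid alignment; all sequences have the same width.
aln <- c(
  seq_A = "MK-LVAG-",
  seq_B = "MKQLVAG-",
  seq_C = "MR-LIAGT",
  seq_D = "MK-LVSG-"
)

# Rows are sequences, columns are alignment positions.
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# Fraction of gap characters in each column.
gap_fraction <- colMeans(aln_matrix == "-")

# Assumed conservation measure: share of the most common non-gap residue.
conservation <- apply(aln_matrix, 2, function(column) {
  residues <- column[column != "-"]
  if (length(residues) == 0) return(0)
  max(table(residues)) / length(residues)
})

# Hypothetical thresholds, analogous to sliders in an interactive app.
max_gap_fraction <- 0.5
min_conservation <- 0.5
keep <- gap_fraction <= max_gap_fraction & conservation >= min_conservation

# Rebuild each sequence from the surviving columns.
trimmed <- apply(aln_matrix[, keep, drop = FALSE], 1, paste, collapse = "")
trimmed
```

In this toy case the two gap-heavy columns (positions 3 and 8) are dropped and the remaining six are kept. Tightening min_conservation would also remove the variable positions.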