genome assembly
For additional background, refer to: https://github.com/dithiii/ant-pipeline/blob/main/README.md.
2.3 assembly
2.3.1 equipment
Genome assembly requires substantial computing resources, and since genomes differ in size, the resources required also differ between assemblies. Before running a genome assembly, determine what computing resources are available. Some helpful commands for investigating these resources on Linux machines:
- Assessing RAM (it is recommended to assign about 75% of available RAM to the assembly process; see the snippet after this list):
grep MemTotal /proc/meminfo
- Assessing CPU resources (note that “threads per CPU” can denote the availability of hyperthreading):
lscpu
- Assessing disk/storage space (Make sure you are running your assembly on a disk with lots of open space. Ideally > 2TB):
df -h
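To translate the 75% RAM guideline into an actual number, a one-liner like the following works (a minimal sketch; MemTotal in /proc/meminfo is reported in kB):
awk '/MemTotal/ {printf "%.0f GB\n", $2*0.75/1024/1024}' /proc/meminfo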
2.3.2 assembly software
2.3.2.1 abyss
abyss-pe k=111 name=SS1 B=10G in='SS1_1.fa SS1_2.fa'
As a rough guideline, for 2x150 bp reads at 40x coverage, the right k value is often around 70 to 90; for 2x250 bp reads at 40x coverage, around 110 to 140.
A good value for B depends on a number of factors, but primarily on the size of the genome being assembled. General guidelines: P. glauca (~20 Gbp): B=500G; H. sapiens (~3.1 Gbp): B=50G; C. elegans (~101 Mbp): B=2G. Specifying more than necessary is fine, though. An example for a human-scale genome is shown below.
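Putting those guidelines together, an ABySS call for a human-scale genome assembled from 2x250 bp reads might look like this (a sketch; the read file names are placeholders):
abyss-pe k=120 name=hsapiens B=50G in='reads_1.fa reads_2.fa'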
2.3.2.2 canu
docker pull staphb/canu-racon
For assembly on the BustaLab storage box, navigate to the directory that contains your reads. Merge all reads into one file using:
cat *.fastq > all_reads.fastq
Then assemble with Canu. We suggest creating a file that contains the Canu call; you can create the file using nano. In it, try something like:
sudo docker run -u $(id -u) -v /data/ben_diatom2/basecalled_reads/:/canu-racon_wd staphb/canu-racon canu -p n_frust2assembly -d /canu-racon_wd/ -genomeSize=150m -nanopore /canu-racon_wd/all_reads.fastq -minReadLength=1000 -correctedErrorRate=0.12 -minOverlapLength=500 -useGrid=false -minInputCoverage=0.5 -maxInputCoverage=100 -stopOnLowCoverage=0.5 -corMemory=48 -corThreads=4 -hapMemory=48 -hapThreads=4 -merylMemory=48 -merylThreads=4 -batMemory=48 -batThreads=4
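Canu runs for a long time, so it is worth detaching it from your terminal. Assuming the call above is saved in a file named run_canu.sh (a name we are making up here), a common pattern is:
nohup bash run_canu.sh > canu.log 2>&1 &
tail -f canu.log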
Notes on Canu options:
Defaults:
- minReadLength=1000
- minOverlapLength=500 bp
- correctedErrorRate=0.144 (for Nanopore reads)
- stopOnLowCoverage=10
Options that are essentially only speed optimizations:
- For over 30x coverage with Nanopore flip-flop (R9.4 or R10.3) data, try:
corMhapOptions="--threshold 0.8 --ordered-sketch-size 1000 --ordered-kmer-size 14" correctedErrorRate=0.105
- For over 60x coverage, two recommendations have been made: one is to decrease correctedErrorRate slightly (by ~1%); the other is to use 12% (correctedErrorRate=0.12).
- Increasing minReadLength can improve run time, and increasing minOverlapLength can improve assembly quality, but increasing either too much will quickly degrade assemblies.
2.3.3 k-mer-based metrics
Canu will take some time to run. As it goes along, you can both check on its progress and learn about the genome you are assembling from some intermediate results. Take the k-mer data found in the .histogram files (i.e. in correction/0-mercounts/x.histogram, trimming/0-mercounts/x.histogram, unitigging/0-mercounts/x.histogram) and process them with canuHistogramToKmerTable(), as shown below. You can upload the output to http://qb.cshl.edu/genomescope/genomescope2.0/. This will give you approximate genome size, ploidy, heterozygosity, repeat content, and read error rate. All good stuff to know!
canuHistogramToKmerTable(
file_in_path = "/Users/bust0037/Desktop/n_frust3assemblyB.ms22.histogram",
file_out_path = "/Users/bust0037/Desktop/n_frust3assemblyB.ms22.histogram_table"
)
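If canuHistogramToKmerTable() is not available, the same table can usually be produced in the shell. This is a sketch that assumes the first two whitespace-separated columns of the .histogram file are k-mer multiplicity and count (check your file first); it keeps only purely numeric rows:
awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $1 "\t" $2}' n_frust3assemblyB.ms22.histogram > n_frust3assemblyB.ms22.histogram_table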
Also check on this tutorial:
2.3.3.1 merqury
docker pull quay.io/chai/merqury
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/round2_pass_reads_assembly/:/merqury/ quay.io/chai/merqury:latest merqury.sh
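A full Merqury run needs a meryl database of the reads first. A minimal sketch following the Merqury README (run inside the container or with a local install; k=21 and the file names here are placeholders):
meryl k=21 count output reads.meryl all_reads.fastq
merqury.sh reads.meryl k_fed.contigs.fasta merqury_out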
Reference: Rhie A, et al. (2020) Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. bioRxiv. https://www.biorxiv.org/content/10.1101/2020.03.15.992941v1
2.4 evaluating contigs
Can these be merged into a single wrapper that can be run after each step in assembly/polishing/scaffolding etc.? A sketch of such a wrapper appears at the end of this section.
2.4.1 BUSCO
- Real BUSCO input will be /home/bust0037/data1/comparative_genomics/Kfedtschenkoi_382_v1.0.fa
- Note BUSCO uses current working directory for input and output
docker pull ezlabgva/busco:v5.3.2_cv1
sudo docker run -u $(id -u) -v /home/bust0037/data1/comparative_genomics/:/busco_wd ezlabgva/busco:v5.3.2_cv1 busco -i k_fed.contigs.fa -o busco_our_kfed -m genome -l eudicots_odb10
2.4.2 quast
Quast provides a score called ALE: assembly likelihood estimate.
docker pull longas/quast
sudo docker run -u $(id -u) -v /home/bust0037/:/tmp/work/quast_results longas/quast:latest quast.py -h
sudo docker run -u $(id -u) -v /home/bust0037/ben_test/_ben_genomes/:/tmp/work/quast_results longas/quast:latest quast.py --fragmented /tmp/work/quast_results/ben_pre_n_frust.contigs.fasta --nanopore /tmp/work/quast_results/all_pass_reads.fastq --space-efficient --memory-efficient --fast
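Returning to the question at the top of this section, here is a minimal wrapper sketch that runs BUSCO and QUAST on a given assembly (it assumes the docker images above have been pulled; the script name and argument handling are our own invention):
#!/bin/bash
# usage: bash evaluate_assembly.sh /path/to/assembly_dir contigs.fa
DIR=$1
FA=$2
# BUSCO completeness (lineage as in the example above)
sudo docker run -u $(id -u) -v ${DIR}:/busco_wd ezlabgva/busco:v5.3.2_cv1 busco -i ${FA} -o busco_${FA%%.*} -m genome -l eudicots_odb10
# QUAST contiguity metrics
sudo docker run -u $(id -u) -v ${DIR}:/tmp/work/quast_results longas/quast:latest quast.py /tmp/work/quast_results/${FA}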
2.5 polishing contigs
2.5.1 medaka
Pull the docker image:
docker pull ontresearch/medaka
Run medaka:
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/round2_pass_reads_assembly/:/medaka/ ontresearch/medaka:latest medaka_consensus -i medaka/all_reads.fastq -d medaka/k_fed.contigs.fasta -b 50
Good documentation here: https://labs.epi2me.io/notebooks/Introduction_to_how_ONT's_medaka_works.html
Medaka will:
* Map all your raw reads to the draft assembly. Look for feedback like: [M::worker_pipeline::2924.325*0.25] mapped 75823 sequences.
* Run its neural-network consensus step over the mapped reads, posting updates in the form of: 21.7% Done (88.2/406.1 Mbases) in 6815.1s.
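When it finishes, the polished assembly is written to the output directory (medaka/consensus.fasta by default; -o changes the directory). A quick sanity check that the contig count is unchanged, using the file names from the run above:
grep -c ">" k_fed.contigs.fasta
grep -c ">" medaka/consensus.fasta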
2.6 methylation with remora
Can we call methylation status on our Kalanchoe genomes? Yes: we can use Remora. Watch for a new Guppy release and MinKNOW integration; for now, Remora is available through Bonito:
bonito basecaller dna_r10.4_e8.1_sup@v3.4 /data/reads --modified-bases 5mC --reference ref.mmi > basecalls_with_mods.bam
bonito basecaller dna_r10.4_e8.1_sup@v3.4 /data/reads --modified-bases 5mC --reference consensus.fasta > basecalls_with_mods.bam
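The modification calls end up in the BAM's MM/ML tags. A quick way to confirm they are present (assuming samtools is installed):
samtools view basecalls_with_mods.bam | head -n 1 | tr '\t' '\n' | grep -E '^(MM|ML)'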
2.7 scaffolding assembly
2.7.1 RagTag
- Install Python (or upgrade Python) <- can this be done using docker?
- Install Bioconda:
1. Install Miniconda: curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh …
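Once installed, a minimal RagTag scaffolding call looks like this (a sketch; the reference and contig file names are placeholders):
ragtag.py scaffold reference.fasta contigs.fasta -o ragtag_output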
2.8 annotation
2.8.1 augustus
see: https://hub.docker.com/r/pegi3s/autoaugustus/
docker pull pegi3s/autoaugustus
sudo docker run --rm -v $(pwd):$(pwd) pegi3s/autoaugustus augustus --species=help
sudo docker run --rm -v $(pwd):/augustus/ pegi3s/autoaugustus augustus --species=tomato --progress=true /augustus/consensus.fasta > consensus-predictions.gff
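A quick count of predicted genes in the resulting GFF (rows whose third column is the "gene" feature):
awk -F'\t' '$3 == "gene"' consensus-predictions.gff | wc -l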
2.9 final assessment
For background on N50 and related metrics, see: https://www.molecularecologist.com/2017/03/29/whats-n50/
2.9.1 mosdepth
sudo docker pull quay.io/biocontainers/mosdepth:0.2.4--he527e40_0
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth -h
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth -n --fast-mode --by 1000 /mosdepth_wd/mosdepth_out /mosdepth_wd/calls_to_draft.bam
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd gfanz/mosdepth -n --fast-mode --by 1000 /mosdepth_wd/mosdepth_out /mosdepth_wd/calls_to_draft.bam
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth --quantize 0:1:4:100:200: --fast-mode --by 1000 /mosdepth_wd/mosdepth_out /mosdepth_wd/calls_to_draft.bam
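mosdepth writes its results next to the output prefix; per-chromosome and overall mean coverage end up in a summary table (file name follows mosdepth's <prefix>.mosdepth.summary.txt convention):
cat mosdepth_out.mosdepth.summary.txt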
- set up channels
/home/bust0037/miniconda3/bin/conda config --add channels defaults
/home/bust0037/miniconda3/bin/conda config --add channels bioconda
/home/bust0037/miniconda3/bin/conda config --add channels conda-forge
- install packages
/home/bust0037/miniconda3/bin/conda install sibeliaz
/home/bust0037/miniconda3/bin/conda install -c bioconda ragtag
- run your stuff
/home/bust0037/miniconda3/bin/conda run sibeliaz -n k_fed.contigs.scaffolded.fasta KlaxifloraFTBG2000359A_699_v3.0.fa
/home/bust0037/miniconda3/bin/conda run ragtag.py
/home/bust0037/meryl-1.3/bin/meryl
/home/bust0037/merqury-1.3/merqury.sh
sh $MERQURY/best_k.sh <genome-size>
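For example, using the ~150 Mbp genome size from the Canu call earlier, this prints the recommended k:
sh $MERQURY/best_k.sh 150000000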
Let’s look at some examples. For these examples, we will use some fasta files stored in a Google Drive folder. The commented-out R code below finds runs of N characters (assembly gaps) in each scaffold and plots their positions along the chromosomes (it assumes a readFasta() that returns a DNAStringSet, e.g. from ShortRead/Biostrings, plus dplyr, ggplot2, viridis, statebins, and progress):
# reads <- readFasta("https://drive.google.com/file/d/1r6E0U5LyYwjWenxy9yqh5QQ2mq1umWOW/view?usp=sharing")
# post <- readFasta("/Users/bust0037/Desktop/ragtag.scaffold.fasta")
# n_chroms <- 18
# pb <- progress::progress_bar$new(total = n_chroms)
# out <- list()
# for (i in 1:n_chroms) {
# pb$tick()
# dat <- strsplit(substr(as.character(post[i]), 1, 50000000), "")[[1]]
# b <- rle(dat)
# # Create a data frame
# dt <- data.frame(number = b$values, lengths = b$lengths, scaff = i)
# # Get the end
# dt$end <- cumsum(dt$lengths)
# # Get the start
# dt$start <- dt$end - dt$lengths + 1
# # Select columns
# dt <- dt[, c("number", "start", "end", "scaff")]
# # Sort rows
# dt <- dt[order(dt$number), ]
# dt %>%
# filter(number == "N") -> N_dat
# out[[i]] <- N_dat
# }
# out <- do.call(rbind, out)
# chroms <- data.frame(
# lengths = post@ranges@width[1:n_chroms],
# scaff = seq(1,n_chroms,1)
# )
# ggplot() +
# statebins:::geom_rrect(data = chroms, aes(xmin = 0, xmax = lengths, ymin = -1, ymax = 1, fill = scaff), color = "black") +
# geom_rect(data = out, aes(xmin = start, xmax = end, ymin = -0.95, ymax = 0.95), color = "white", fill = "white", size = 0.08) +
# facet_grid(scaff~.) +
# scale_fill_viridis(end = 0.8) +
# theme_classic()
# ggplot() +
# geom_rect(data = filter(chroms, scaff == 1 | scaff == 2), aes(xmin = 0, xmax = lengths, ymin = -1, ymax = 1, fill = scaff), color = "black") +
# geom_rect(data = filter(out, scaff == 1 | scaff == 2), aes(xmin = start, xmax = end, ymin = -0.95, ymax = 0.95), color = "white", fill = "white", size = 0.08) +
# facet_grid(scaff~.) +
# scale_y_continuous(limits = c(-2,2)) +
# scale_fill_viridis(end = 0.8) +
# theme_classic() +
# coord_polar()