genome assembly
For additional background, refer to: https://github.com/dithiii/ant-pipeline/blob/main/README.md.
2.3 assembly
2.3.1 equipment
Genome assembly requires substantial computing resources, and since genomes differ in size, the resources required also differ between assemblies. Before running a genome assembly, determine what computing resources are available. Some helpful commands for investigating these resources on Linux machines:
- Assessing RAM (it is recommended to assign about 75% of available RAM to the assembly process; see the snippet after this list):
grep MemTotal /proc/meminfo
- Assessing CPU resources (note that “threads per CPU” can denote the availability of hyperthreading):
lscpu
- Assessing disk/storage space (Make sure you are running your assembly on a disk with lots of open space. Ideally > 2TB):
df -h
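To translate the 75% RAM guideline into an actual number, a one-liner like the following works (a minimal sketch; MemTotal in /proc/meminfo is reported in kB):
awk '/MemTotal/ {printf "%.0f GB\n", $2*0.75/1024/1024}' /proc/meminfo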
2.3.2 assembly software
2.3.2.1 abyss
abyss-pe k=111 name=SS1 B=10G in='SS1_1.fa SS1_2.fa'
As a rough guideline, for 2x150 bp reads at 40x coverage, the right k value is often around 70 to 90; for 2x250 bp reads at 40x coverage, around 110 to 140.
A good value for B depends on a number of factors, but primarily on the size of the genome being assembled. General guidelines: P. glauca (~20 Gbp): B=500G; H. sapiens (~3.1 Gbp): B=50G; C. elegans (~101 Mbp): B=2G. Specifying more than necessary is fine, though. An example for a human-scale genome is shown below.
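Putting those guidelines together, an ABySS call for a human-scale genome assembled from 2x250 bp reads might look like this (a sketch; the read file names are placeholders):
abyss-pe k=120 name=hsapiens B=50G in='reads_1.fa reads_2.fa'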
2.3.2.2 canu
docker pull staphb/canu-racon
For assembly on the BustaLab storage box, navigate to the directory that contains your reads. Merge all reads into one file using:
cat *.fastq > all_reads.fastq
Then assemble with Canu. We suggest creating a file that contains the Canu call; you can create the file using nano. In it, try something like:
sudo docker run -u $(id -u) -v /data/ben_diatom2/basecalled_reads/:/canu-racon_wd staphb/canu-racon canu -p n_frust2assembly -d /canu-racon_wd/ -genomeSize=150m -nanopore /canu-racon_wd/all_reads.fastq -minReadLength=1000 -correctedErrorRate=0.12 -minOverlapLength=500 -useGrid=false -minInputCoverage=0.5 -maxInputCoverage=100 -stopOnLowCoverage=0.5 -corMemory=48 -corThreads=4 -hapMemory=48 -hapThreads=4 -merylMemory=48 -merylThreads=4 -batMemory=48 -batThreads=4
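Canu runs for a long time, so it is worth detaching it from your terminal. Assuming the call above is saved in a file named run_canu.sh (a name we are making up here), a common pattern is:
nohup bash run_canu.sh > canu.log 2>&1 &
tail -f canu.log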
Notes on Canu options:
Defaults:
- minReadLength=1000
- minOverlapLength=500 bp
- correctedErrorRate=0.144 (for Nanopore reads)
- stopOnLowCoverage=10
Options that are essentially only speed optimizations:
- For over 30x coverage with Nanopore flip-flop (R9.4 or R10.3) data, try:
corMhapOptions="--threshold 0.8 --ordered-sketch-size 1000 --ordered-kmer-size 14" correctedErrorRate=0.105
- For over 60x coverage, two recommendations have been made: one is to decrease correctedErrorRate slightly (by ~1%); the other is to use 12% (correctedErrorRate=0.12).
- Increasing minReadLength can improve run time, and increasing minOverlapLength can improve assembly quality, but increasing either too much will quickly degrade assemblies.
2.3.3 k-mer-based metrics
Canu will take some time to run. As it goes along, you can both check on its progress and learn about the genome you are assembling from some intermediate results. Take the k-mer data found in the .histogram files (i.e. in correction/0-mercounts/x.histogram, trimming/0-mercounts/x.histogram, unitigging/0-mercounts/x.histogram) and process them with canuHistogramToKmerTable(), as shown below. You can upload the output to http://qb.cshl.edu/genomescope/genomescope2.0/. This will give you approximate genome size, ploidy, heterozygosity, repeat content, and read error rate. All good stuff to know!
canuHistogramToKmerTable(
file_in_path = "/Users/bust0037/Desktop/n_frust3assemblyB.ms22.histogram",
file_out_path = "/Users/bust0037/Desktop/n_frust3assemblyB.ms22.histogram_table"
)
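If canuHistogramToKmerTable() is not available, the same table can usually be produced in the shell. This is a sketch that assumes the first two whitespace-separated columns of the .histogram file are k-mer multiplicity and count (check your file first); it keeps only purely numeric rows:
awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {print $1 "\t" $2}' n_frust3assemblyB.ms22.histogram > n_frust3assemblyB.ms22.histogram_table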
Also check on this tutorial:
2.3.3.1 merqury
docker pull quay.io/chai/merqury
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/round2_pass_reads_assembly/:/merqury/ quay.io/chai/merqury:latest merqury.sh
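A full Merqury run needs a meryl database of the reads first. A minimal sketch following the Merqury README (run inside the container or with a local install; k=21 and the file names here are placeholders):
meryl k=21 count output reads.meryl all_reads.fastq
merqury.sh reads.meryl k_fed.contigs.fasta merqury_out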
Reference: Rhie A, et al. (2020) Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. bioRxiv. https://www.biorxiv.org/content/10.1101/2020.03.15.992941v1
2.4 evaluating contigs
Can these be merged into a single wrapper that can be run after each step in assembly/polishing/scaffolding etc.? A sketch of such a wrapper appears at the end of this section.
2.4.1 BUSCO
- Real BUSCO input will be /home/bust0037/data1/comparative_genomics/Kfedtschenkoi_382_v1.0.fa
- Note BUSCO uses current working directory for input and output
docker pull ezlabgva/busco:v5.3.2_cv1
sudo docker run -u $(id -u) -v /home/bust0037/data1/comparative_genomics/:/busco_wd ezlabgva/busco:v5.3.2_cv1 busco -i k_fed.contigs.fa -o busco_our_kfed -m genome -l eudicots_odb10
2.4.2 quast
Quast provides a score called ALE: assembly likelihood estimate.
docker pull longas/quast
sudo docker run -u $(id -u) -v /home/bust0037/:/tmp/work/quast_results longas/quast:latest quast.py -h
sudo docker run -u $(id -u) -v /home/bust0037/ben_test/_ben_genomes/:/tmp/work/quast_results longas/quast:latest quast.py --fragmented /tmp/work/quast_results/ben_pre_n_frust.contigs.fasta --nanopore /tmp/work/quast_results/all_pass_reads.fastq --space-efficient --memory-efficient --fast
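Returning to the question at the top of this section, here is a minimal wrapper sketch that runs BUSCO and QUAST on a given assembly (it assumes the docker images above have been pulled; the script name and argument handling are our own invention):
#!/bin/bash
# usage: bash evaluate_assembly.sh /path/to/assembly_dir contigs.fa
DIR=$1
FA=$2
# BUSCO completeness (lineage as in the example above)
sudo docker run -u $(id -u) -v ${DIR}:/busco_wd ezlabgva/busco:v5.3.2_cv1 busco -i ${FA} -o busco_${FA%%.*} -m genome -l eudicots_odb10
# QUAST contiguity metrics
sudo docker run -u $(id -u) -v ${DIR}:/tmp/work/quast_results longas/quast:latest quast.py /tmp/work/quast_results/${FA}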
2.5 polishing contigs
2.5.1 medaka
Pull the docker image:
docker pull ontresearch/medaka
Run medaka:
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/round2_pass_reads_assembly/:/medaka/ ontresearch/medaka:latest medaka_consensus -i medaka/all_reads.fastq -d medaka/k_fed.contigs.fasta -b 50
Good documentation here: https://labs.epi2me.io/notebooks/Introduction_to_how_ONT's_medaka_works.html
Medaka will:
* Map all your raw reads to the draft assembly. Look for feedback like: [M::worker_pipeline::2924.325*0.25] mapped 75823 sequences.
* Run its neural-network consensus step over the mapped reads, posting updates in the form of: 21.7% Done (88.2/406.1 Mbases) in 6815.1s.
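When it finishes, the polished assembly is written to the output directory (medaka/consensus.fasta by default; -o changes the directory). A quick sanity check that the contig count is unchanged, using the file names from the run above:
grep -c ">" k_fed.contigs.fasta
grep -c ">" medaka/consensus.fasta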
2.6 methylation with remora
Can we call methylation status on our Kalanchoe genomes? Yes: we can use Remora. Watch for a new Guppy release and MinKNOW integration; for now, Remora is available through Bonito:
bonito basecaller dna_r10.4_e8.1_sup@v3.4 /data/reads --modified-bases 5mC --reference ref.mmi > basecalls_with_mods.bam
bonito basecaller dna_r10.4_e8.1_sup@v3.4 /data/reads --modified-bases 5mC --reference consensus.fasta > basecalls_with_mods.bam
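The modification calls end up in the BAM's MM/ML tags. A quick way to confirm they are present (assuming samtools is installed):
samtools view basecalls_with_mods.bam | head -n 1 | tr '\t' '\n' | grep -E '^(MM|ML)'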
2.7 scaffolding assembly
2.7.1 RagTag
- Install Python (or upgrade Python) <- can this be done using docker?
- Install Bioconda:
1. Install Miniconda: curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh …
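Once installed, a minimal RagTag scaffolding call looks like this (a sketch; the reference and contig file names are placeholders):
ragtag.py scaffold reference.fasta contigs.fasta -o ragtag_output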
2.8 annotation
2.8.1 augustus
see: https://hub.docker.com/r/pegi3s/autoaugustus/
docker pull pegi3s/autoaugustus
sudo docker run --rm -v $(pwd):$(pwd) pegi3s/autoaugustus augustus --species=help
sudo docker run --rm -v $(pwd):/augustus/ pegi3s/autoaugustus augustus --species=tomato --progress=true /augustus/consensus.fasta > consensus-predictions.gff
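A quick count of predicted genes in the resulting GFF (rows whose third column is the "gene" feature):
awk -F'\t' '$3 == "gene"' consensus-predictions.gff | wc -l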
2.9 final assessment
For background on N50 and related metrics, see: https://www.molecularecologist.com/2017/03/29/whats-n50/
2.9.1 mosdepth
sudo docker pull quay.io/biocontainers/mosdepth:0.2.4--he527e40_0
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth -h
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth -n --fast-mode --by 1000 /mosdepth_wd/mosdepth_out /mosdepth_wd/calls_to_draft.bam
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd gfanz/mosdepth -n --fast-mode --by 1000 /mosdepth_wd/mosdepth_out /mosdepth_wd/calls_to_draft.bam
sudo docker run -u $(id -u) -v /home/bust0037/data1/Kalanchoe_DNASeq/rounds_1and2_pass_assembly/:/mosdepth_wd quay.io/biocontainers/mosdepth:0.2.4--he527e40_0 mosdepth --quantize 0:1:4:100:200: --fast-mode --by 1000 /mosdepth_wd/mosdepth_out /mosdepth_wd/calls_to_draft.bam
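mosdepth writes its results next to the output prefix; per-chromosome and overall mean coverage end up in a summary table (file name follows mosdepth's <prefix>.mosdepth.summary.txt convention):
cat mosdepth_out.mosdepth.summary.txt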
- set up channels
/home/bust0037/miniconda3/bin/conda config --add channels defaults
/home/bust0037/miniconda3/bin/conda config --add channels bioconda
/home/bust0037/miniconda3/bin/conda config --add channels conda-forge
- install packages
/home/bust0037/miniconda3/bin/conda install sibeliaz
/home/bust0037/miniconda3/bin/conda install -c bioconda ragtag
- run your stuff
/home/bust0037/miniconda3/bin/conda run sibeliaz -n k_fed.contigs.scaffolded.fasta KlaxifloraFTBG2000359A_699_v3.0.fa
/home/bust0037/miniconda3/bin/conda run ragtag.py
/home/bust0037/meryl-1.3/bin/meryl
/home/bust0037/merqury-1.3/merqury.sh
sh $MERQURY/best_k.sh <genome-size>
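For example, using the ~150 Mbp genome size from the Canu call earlier, this prints the recommended k:
sh $MERQURY/best_k.sh 150000000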
Let’s look at some examples. For these examples, we will use some fasta files stored in a Google Drive folder. The commented-out R code below finds runs of N characters (assembly gaps) in each scaffold and plots their positions along the chromosomes (it assumes a readFasta() that returns a DNAStringSet, e.g. from ShortRead/Biostrings, plus dplyr, ggplot2, viridis, statebins, and progress):
# reads <- readFasta("https://drive.google.com/file/d/1r6E0U5LyYwjWenxy9yqh5QQ2mq1umWOW/view?usp=sharing")
# post <- readFasta("/Users/bust0037/Desktop/ragtag.scaffold.fasta")
# n_chroms <- 18
# pb <- progress::progress_bar$new(total = n_chroms)
# out <- list()
# for (i in 1:n_chroms) {
# pb$tick()
# dat <- strsplit(substr(as.character(post[i]), 1, 50000000), "")[[1]]
# b <- rle(dat)
# # Create a data frame
# dt <- data.frame(number = b$values, lengths = b$lengths, scaff = i)
# # Get the end
# dt$end <- cumsum(dt$lengths)
# # Get the start
# dt$start <- dt$end - dt$lengths + 1
# # Select columns
# dt <- dt[, c("number", "start", "end", "scaff")]
# # Sort rows
# dt <- dt[order(dt$number), ]
# dt %>%
# filter(number == "N") -> N_dat
# out[[i]] <- N_dat
# }
# out <- do.call(rbind, out)
# chroms <- data.frame(
# lengths = post@ranges@width[1:n_chroms],
# scaff = seq(1,n_chroms,1)
# )
# ggplot() +
# statebins:::geom_rrect(data = chroms, aes(xmin = 0, xmax = lengths, ymin = -1, ymax = 1, fill = scaff), color = "black") +
# geom_rect(data = out, aes(xmin = start, xmax = end, ymin = -0.95, ymax = 0.95), color = "white", fill = "white", size = 0.08) +
# facet_grid(scaff~.) +
# scale_fill_viridis(end = 0.8) +
# theme_classic()
# ggplot() +
# geom_rect(data = filter(chroms, scaff == 1 | scaff == 2), aes(xmin = 0, xmax = lengths, ymin = -1, ymax = 1, fill = scaff), color = "black") +
# geom_rect(data = filter(out, scaff == 1 | scaff == 2), aes(xmin = start, xmax = end, ymin = -0.95, ymax = 0.95), color = "white", fill = "white", size = 0.08) +
# facet_grid(scaff~.) +
# scale_y_continuous(limits = c(-2,2)) +
# scale_fill_viridis(end = 0.8) +
# theme_classic() +
# coord_polar()