MaSuRCA genome assembly package: November 2018

Friday, November 30, 2018

Beta SLURM support in MaSuRCA 3.3.0b

I have been working to implement grid execution in MaSuRCA for a while, and MaSuRCA has supported SGE for nearly a year now. SLURM support lagged behind because of an important difference between job submission in SGE and SLURM: SLURM lacks the "sync" option, with which the submit command in a script does not return until all jobs have exited. There are ways around this, and we are still testing what works best within the MaSuRCA framework. For now, one can run the four most computationally intensive parts of MaSuRCA on the grid:

1. mega-reads correction, first pass
2. mega-reads correction, second pass (the grid is sometimes not necessary for this one)
3. overlap-based trimming in CABOG
4. overlapping in CABOG

It works as follows. One runs assemble.sh; the run prepares the batch jobs and exits, instructing the user to run a given sbatch command with given parameters. Once all sbatch jobs are done, one should re-run the assemble.sh script (it would not hurt to re-generate it first, just in case). The script checks whether all jobs completed successfully and then proceeds. If some jobs have failed, the script exits, instructing the user to run the sbatch command again.

To configure the grid run, add the following options to the config file. Hint: these options are already present in the example config file created upon installation of MaSuRCA.

USE_GRID=1
GRID_QUEUE=your_slurm_partition
GRID_ENGINE=SLURM
GRID_BATCH_SIZE=<amount of Nanopore/PacBio sequence to use per batch> (default 300000000, which creates 10 batches for every 1x of mammalian genome coverage)
NUM_THREADS=32

The sbatch command will run one job per 32-core node in this case. If your nodes have fewer or more cores, adjust accordingly.
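
As a rough illustration of how GRID_BATCH_SIZE relates to the number of batches, here is a minimal Python sketch. It assumes batches are formed by simply dividing the total amount of long-read sequence by the batch size; the function name and the 3 Gbp genome size are illustrative assumptions, and the actual batching inside MaSuRCA may differ.

```python
import math

def estimate_batches(genome_size_bp, coverage=1.0, batch_size_bp=300_000_000):
    """Estimate the number of grid batches for a given long-read coverage.

    Assumption: total long-read bases = genome size * coverage, split into
    batches of batch_size_bp bases each (the last batch may be smaller).
    """
    total_long_read_bp = genome_size_bp * coverage
    return math.ceil(total_long_read_bp / batch_size_bp)

# 1x coverage of a ~3 Gbp mammalian genome with the default batch size
print(estimate_batches(3_000_000_000))

# Halving the batch size doubles the number of (smaller) batches
print(estimate_batches(3_000_000_000, batch_size_bp=150_000_000))
```

With the default of 300000000 this gives 10 batches per 1x of a 3 Gbp genome, matching the figure quoted for the option above.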

If grid jobs fail in mega-reads pass 2 (jf_aligner jobs), the nodes most likely do not have enough memory. To solve this, reduce GRID_BATCH_SIZE, re-generate assemble.sh, remove the mr_pass2 folder and re-run.

To get the 3.3.0b version, please go to GitHub:
https://github.com/alekseyzimin/masurca/blob/master/MaSuRCA-3.3.0b.tar.gz

Thanks to Daniela Puiu, the software engineer at the Center for Computational Biology, for implementing SLURM support in CABOG and helping me test MaSuRCA!

Tuesday, November 20, 2018

MaSuRCA 3.2.9 hotfix

I decided to release a minor hotfix to MaSuRCA 3.2.9 on 11/20/2018, keeping the version number the same but with a small change: the final output file final.genome.scf.fasta now contains scaffolds. The previous version split the scaffolds and wrote contigs to that file. To recover the scaffolds from 3.2.8 assemblies you can run:

<MaSuRCA path>/bin/recover_scaffolds.pl < final.genome.scf.fasta >final.genome.scf.scaffolds.fasta

The NA12878 assembly of newly basecalled release3+4 Nanopore+Illumina data (see previous posts) now has the following statistics (N50 computed with 3.1Gbp human genome size):

Total sequence: 2,886,141,146
Number of scaffolds: 3,240
Longest scaffold: 56,745,809
Scaffold N50: 14,764,183
Scaffold L50: 63
Number of contigs: 3,448
Longest contig: 51,762,539
Contig N50: 9,580,052
Contig L50: 88
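
For readers unfamiliar with N50 and L50 computed against a fixed genome size, here is a minimal Python sketch of the usual definition. The function name and the toy lengths are illustrative assumptions; this is not the code used to produce the numbers above.

```python
def n50_l50(lengths, genome_size=None):
    """Return (N50, L50) for a list of sequence lengths.

    If genome_size is given, the 50% threshold is half of that size rather
    than half of the total assembled length.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    # Walk from the longest sequence down until half of the total is covered
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return 0, 0  # the sequences cover less than half of genome_size

toy_lengths = [50, 40, 30, 20, 10]
print(n50_l50(toy_lengths))                   # threshold from assembled length
print(n50_l50(toy_lengths, genome_size=200))  # threshold from a fixed genome size
```

Using a fixed genome size larger than the assembled length raises the threshold, so the reported N50 can only stay the same or decrease.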

The assembly is posted here:

ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.scaffolds.fa

The consensus quality stats:
Substitution Errors: 49,520
Insertion/Deletion Errors: 115,752
Consensus Quality: 99.9943% (or about 0.6 errors per 10,000 bases)
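
The quality percentage can be reproduced from the error counts. The sketch below assumes quality is 100 x (1 - errors / total bases), with the total sequence of the assembly as the denominator; that assumption matches the reported figure, but the exact denominator used by the evaluation script is not stated here.

```python
def consensus_quality(substitutions, indels, total_bases):
    """Consensus quality in percent, assuming every error affects one base."""
    errors = substitutions + indels
    return 100.0 * (1.0 - errors / total_bases)

# Error counts and total sequence as reported for the NA12878 assembly
quality = consensus_quality(49_520, 115_752, 2_886_141_146)
print(f"{quality:.4f}")  # 99.9943
```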

Thursday, November 15, 2018

MaSuRCA 3.2.9 release: >99.99% consensus quality of MaSuRCA assemblies from Nanopore+Illumina data

This is a maintenance release with small bugfixes and speed/memory usage improvements. Some scripts have been converted from Perl to C++ to improve performance and reduce memory usage on very large data sets (10Gbp+ genomes).

There is a new script:

evaluate_consensus_error_rate.sh (run with -h to get usage)

This script follows the guidelines and procedure for consensus quality evaluation described in https://www.nature.com/articles/nbt.4060 . It uses bwa to map Illumina data to the assembly, and then freebayes to call the variants. bwa, samtools and freebayes must be installed and available on the PATH. A site is called a consensus error if the assembly sequence there (e.g. a SNP or indel allele) is not supported by any Illumina reads while one or more alternatives are supported by at least 3 Illumina reads. The script thus estimates the total number of errors in the genome assembly consensus and computes the sequence quality. The output is a <>.report file, where <> is the name of the input assembly fasta file.
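
The error-calling rule can be written out as a small Python sketch. This is only an illustration of the rule as described, not the script's code: the function name and the read-count inputs are hypothetical, and the real script works from freebayes output.

```python
MIN_ALT_SUPPORT = 3  # minimum Illumina reads supporting an alternative allele

def is_consensus_error(assembly_allele_reads, alt_allele_reads):
    """Apply the rule to one site.

    assembly_allele_reads: number of Illumina reads supporting the assembly
    alt_allele_reads: list of read counts, one per alternative allele
    """
    if assembly_allele_reads > 0:
        return False  # at least one read agrees with the assembly
    return any(n >= MIN_ALT_SUPPORT for n in alt_allele_reads)

print(is_consensus_error(0, [5]))     # True: no support for assembly, alt has 5 reads
print(is_consensus_error(0, [2]))     # False: alt support is below the threshold
print(is_consensus_error(1, [10]))    # False: one read supports the assembly
```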

According to this evaluation, MaSuRCA assemblies have very high consensus quality. In my experiments, a 30x PacBio + 100x Illumina assembly of A. thaliana had 99.9972% quality (about 3 errors per 100,000 bases). The human NA12878 assembly from 37x Nanopore + 100x Illumina data, described in the previous post and available here ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.fa (Nanopore data from https://www.nature.com/articles/nbt.4060 and Illumina data from the GIAB project), had 99.9913% quality (less than 1 error per 10,000 bases).

Note that no "polishing" is required for MaSuRCA assemblies. The sequence can be used as output, with no additional processing. Polishing with Pilon should be used with caution, as it can have adverse effects on assemblies. Here is a quote from Adam Phillippy's blog:

"...Finally, a note of caution on Illumina polishing with Pilon. While it can improve consensus statistics overall, it can worsen the assembly in some regions, especially complex repetitive sequence like the MHC. If using Pilon, we recommend limiting the allowable edits and focusing on the primary nanopore error mode (indels)...."

The entire blog post can be found here: https://genomeinformatics.github.io/na12878update/