Monday, July 23, 2018

New version MaSuRCA 3.2.7 -- significant speed and quality enhancements

Today I am releasing a new version of MaSuRCA, version 3.2.7.  This version has two significant updates over the 3.2.6 version.

The first update concerns regions in the Long High Error (LHE) read data that are not captured by Illumina reads because of high or low GC content, or that are too repetitive to be corrected reliably. In all previous versions of MaSuRCA I collected these regions and used them, without correction or consensus, to join the Illumina-corrected sections of LHE reads.  Each such section had to be present in multiple corrected reads, with about the same length in each, to be used. It was then up to the CABOG assembler to put them together correctly and to call the consensus.  Now the consensus for these regions is computed prior to assembly, using the high coverage by the LHE reads. This led to much better contiguity in these regions.  Their consensus quality has also improved, because previously the CABOG assembler discarded or split mega-reads that contained uncorrected segments, which reduced the coverage of these sections.
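
To make the old acceptance rule concrete (a section had to appear in multiple corrected reads with about the same length), here is a toy Python sketch. It is not MaSuRCA code: the function name, the `min_reads` threshold, and the 10% length tolerance are all hypothetical values chosen only for illustration.

```python
from statistics import median


def usable_segment_lengths(lengths, min_reads=2, tolerance=0.1):
    """Return the observed lengths that agree with the median length.

    `lengths` holds the length of one uncorrected section as seen in each
    corrected read. `min_reads` and `tolerance` are hypothetical thresholds,
    not values taken from MaSuRCA. An empty list means "do not use".
    """
    if not lengths:
        return []
    mid = median(lengths)
    # Keep only observations whose length is within `tolerance` of the median.
    agreeing = [n for n in lengths if abs(n - mid) <= tolerance * mid]
    # The section is usable only if enough reads agree on its length.
    return agreeing if len(agreeing) >= min_reads else []


# Three reads agree on roughly 1 kb; the 2.5 kb outlier is ignored.
print(usable_segment_lengths([1000, 1020, 990, 2500]))
# A section seen in a single read is rejected.
print(usable_segment_lengths([1000]))
```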

The second update concerns performance.  Until now, running the CABOG assembler, which is part of MaSuRCA, required the overlap-based trimming (OBT) module in CABOG. Thus the overlapper, the routine that computes the overlaps between the reads, had to be run twice -- first to compute the overlaps for trimming, and then to re-compute them on the trimmed reads.  I have implemented an efficient trimmer for the mega-reads that runs in almost no additional time and makes the OBT module in CABOG unnecessary. This reduced the CABOG run time by almost half.

These updates not only reduced the run times but also led to significant improvements in assembly quality.  For example, among other data sets, I re-assembled the data used in our recent publication "First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (formerly Scedosporium prolificans)" (http://www.g3journal.org/content/early/2017/09/29/g3.117.300107).

The published version of the genome was assembled with MaSuRCA version 3.2.2.  Here is the improvement in assembly quality between version 3.2.2 and version 3.2.7.  The N50 contig size more than doubled.

ver.3.2.2 -- basis for the published assembly; the best assembly achieved at the time among all assemblers that we tested
Sequence: 37,087,688
N50 contig: 1,509,237
N50 scaffold: 1,509,237
Number of scaffolds: 156

ver.3.2.7 assembly
Sequence: 36,892,617
N50 contig: 3,157,388
Number of contigs: 68 (equal to the number of scaffolds)
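
For readers unfamiliar with the metric, here is a minimal Python sketch of how N50 is commonly computed, followed by the ratio of the two N50 contig values reported above. This is not part of MaSuRCA; the function name `n50` and the toy contig lengths are made up for illustration, and only the two N50 values come from the statistics above.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty list")


# Toy example with made-up contig lengths (total = 32, half = 16).
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))

# Ratio of the N50 contig sizes reported for ver.3.2.7 and ver.3.2.2.
ratio = 3_157_388 / 1_509_237
print(round(ratio, 2))
```

The ratio comes out a little above 2, consistent with the N50 contig size more than doubling.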

The new release has been tested on yeast, Arabidopsis, and human data. It is available on GitHub:  https://github.com/alekseyzimin/masurca/blob/master/MaSuRCA-3.2.7.tar.gz and in the "Releases" section: https://github.com/alekseyzimin/masurca/releases