Thursday, December 17, 2020

MaSuRCA 4.0.0 incorporates new techniques that speed up assemblies of mammalian genomes by a factor of six

MaSuRCA has served the research community well for the past few years. Many low-cost sequencing and assembly projects have benefited from its reduced coverage-depth requirements for long-read sequencing data and from its high-quality output assemblies. One of the complaints about MaSuRCA has been speed, and indeed a mammalian-sized ~3Gbp genome usually took about a month to assemble, even on research computing clusters. For the past few months I have been concentrating on developing new, efficient algorithms for correcting long, high-error reads such as the ones produced by Oxford Nanopore MinION/PromethION and PacBio SMRT CLR sequencing.

The new MaSuRCA 4.0.0 improves the run time for mammalian genome assemblies by at least a factor of six by eliminating the second pass of correction and replacing it with the new ultrafast k-unitig pre-correction algorithm. The k-unitig pre-correction reduces the error rate of the input reads by up to 50% with a very small time investment: less than 1 day on a single 32-64 core server for a mammalian genome data set. This in turn allows the use of longer k-mers (17 vs 15) in the main mega-read building phase. Longer k-mers increase the specificity of the super-read alignments to the long reads and thus reduce the complexity of the super-read graph, which has a major impact on the run time.
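To give a rough feel for why two extra bases matter, the sketch below estimates how often a single fixed k-mer would occur by chance in a ~3Gbp genome. It assumes a uniformly random sequence and ignores the reverse strand, which real genomes do not satisfy, so the numbers are only illustrative. The function name and constant are made up for this example and are not part of MaSuRCA.

```python
# Back-of-the-envelope estimate, NOT MaSuRCA code.
# Assumption: the genome is a uniformly random string over {A, C, G, T}.

GENOME_SIZE = 3_000_000_000  # ~3Gbp mammalian genome, as in the post


def expected_random_hits(genome_size: int, k: int) -> float:
    """Expected number of chance occurrences of one fixed k-mer."""
    return genome_size / 4 ** k


for k in (15, 17):
    hits = expected_random_hits(GENOME_SIZE, k)
    print(f"k={k}: {4 ** k:,} possible k-mers, ~{hits:.2f} chance hits per k-mer")
```

Under this simplified model, moving from k=15 to k=17 cuts the expected number of chance matches by a factor of 16 (4 squared), from a few per k-mer to well below one.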

In direct run time comparisons on the human HG01243 Illumina/Nanopore data set, with about 62x genome coverage by Illumina reads and about 47x coverage by ultralong Nanopore reads (read N50 >80kb), the new MaSuRCA completed the assembly in less than 5 days, whereas the previous MaSuRCA (version 3.4.2) required 32 days on the same 256-core computer cluster. The assembly results were comparable between the two runs, with the new version producing a slightly more contiguous assembly. In experiments with an A. thaliana (genome size ~120Mbp) data set of 40x coverage by PacBio RSII reads combined with 100x coverage, the new MaSuRCA produces a better assembly, with fewer misassemblies and an N50 contig size of 6.2Mbp vs 5.45Mbp for the previous version. Overall, I expect the new MaSuRCA to deliver very similar or slightly better assembly results in significantly less time.
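For readers less familiar with the N50 contig metric quoted above, here is a minimal sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly size. The contig lengths in the usage example are invented toy values, not data from these assemblies, and the function is not taken from MaSuRCA.

```python
# Illustrative N50 calculation, NOT MaSuRCA code.

def n50(contig_lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for positive lengths")


# Hypothetical toy assembly (lengths in bp)
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

A higher N50 at a similar total assembly size and with no increase in misassemblies indicates a more contiguous assembly, which is the sense in which the 6.2Mbp and 5.45Mbp figures are compared.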

The pre-release of the new MaSuRCA is available on GitHub at:

https://github.com/alekseyzimin/masurca/raw/master/MaSuRCA-4.0.0.tar.gz

This version will be published as the official release once I complete testing on multiple large genome data sets. In the meantime, if you would like to install and try the new release and you find any problems with installation or usage, please report them on the GitHub issues board:

Issues · alekseyzimin/masurca (github.com)