Thursday, September 6, 2018

MaSuRCA 3.2.8 release with better assemblies of large genomes; 37x Nanopore + Illumina yields 8.4Mbp NG50 assembly size for human NA12878

Today I am releasing an update to MaSuRCA assembler, version 3.2.8.  The release is available from the MaSuRCA github releases page:

https://github.com/alekseyzimin/masurca/releases/tag/3.2.8

This version produces much more contiguous (in some cases by a factor of two or three) assemblies of complex genomes, such as mammalian or plant genomes.  It does just as well on small genomes such as insects, small plant genomes, or fungal genomes. The run time has increased somewhat from the 3.2.7 version because I have re-introduced the overlap-based-trimming module during assembly by default.  Here is the list of major improvements:
-- reworked the joining algorithm for incorporating long high error read sequence into the corrected reads where the sequence could not be corrected by Illumina data
-- cleaned up the code and the output/error messages
-- added final gapclosing step for scaffold gaps spanned by long high error reads
-- bugfixes, such as error in executing do_consensus.sh on some systems
-- re-enabled overlap based trimming in CABOG assembler and reduced the default coverage input for correction to 25x; if you have more than 25x coverage, the assembler will use 25x coverage in the longest reads

The changes made significant impact on contiguity and correctness of large mammalian and plant genome assemblies, for some of my test assemblies now N50 contig increased from ~300Kbp to ~950Kbp on 20x Pacbio + 100x Illumina data set. The run time has increased about 10% over 3.2.7 release but still faster than 3.2.6 release. 
This version produced a very good assembly of NA12878 human genome from the combined Release 3 (30x nanopore reads) and Release 4 (7x ultra-long nanopore reads) https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md , and Illumina paired end data (100x 2x250bp reads from GIAB project). Previously, nanopore-only assembly of these data followed by polishing the consensus with Illumina data have been published in Nature Biotechnology https://www.nature.com/articles/nbt.4060. This assembly contained 2,867Mb of sequence and had NG50 contig size of 6.4Mbp. G=3,098,794,149 bp (3.1 Gbp) was used to compute the NG statistic.  
The MaSuRCA 3.2.8 assembly has NG50 contig size of 8.4Mbp with 2,877Mbp of sequence in only 3501 contigs, and it is available here:
Please communicate any issues that you may encounter with MaSuRCA 3.2.8 through github "Issues" forum https://github.com/alekseyzimin/masurca/issues