Thursday, September 6, 2018

MaSuRCA 3.2.8 release with better assemblies of large genomes; 37x Nanopore + Illumina yields 8.4Mbp NG50 assembly size for human NA12878

Today I am releasing an update to the MaSuRCA assembler, version 3.2.8. The release is available from the MaSuRCA GitHub releases page:

https://github.com/alekseyzimin/masurca/releases/tag/3.2.8

This version produces much more contiguous assemblies (in some cases by a factor of two or three) of complex genomes, such as mammalian or plant genomes. It does just as well on small genomes, such as insect, small plant, or fungal genomes. The run time has increased somewhat from version 3.2.7 because I have re-introduced the overlap-based trimming module during assembly by default. Here is the list of major improvements:
-- reworked the joining algorithm for incorporating long high-error read sequence into the corrected reads where the sequence could not be corrected by Illumina data
-- cleaned up the code and the output/error messages
-- added a final gap-closing step for scaffold gaps spanned by long high-error reads
-- bug fixes, such as an error in executing do_consensus.sh on some systems
-- re-enabled overlap-based trimming in the CABOG assembler and reduced the default long-read coverage used as input for correction to 25x; if you have more than 25x coverage, the assembler will use only the longest reads, up to 25x coverage
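
To illustrate the 25x rule in the last item, here is a minimal Python sketch of picking the longest reads up to a coverage target. This is not the MaSuRCA code; the function name `select_longest_reads` and its arguments are made up for illustration, and it assumes coverage is simply total read length divided by genome size.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage=25.0):
    """Return the lengths of the longest reads that together reach the target coverage.

    Hypothetical helper for illustration only; it works on read lengths,
    not on actual sequence records.
    """
    budget = target_coverage * genome_size  # total bases we want to keep
    selected = []
    total = 0
    # Walk through the reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break  # target coverage reached; drop the remaining shorter reads
        selected.append(length)
        total += length
    return selected


if __name__ == "__main__":
    # Toy example: a 10 bp "genome" and a 10x target, i.e. a budget of 100 bases.
    print(select_longest_reads([10, 50, 30, 20, 40], genome_size=10, target_coverage=10))
```

If the data set has less than the target coverage, the sketch keeps every read, which matches the behavior described above for inputs below 25x.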

The changes had a significant impact on the contiguity and correctness of large mammalian and plant genome assemblies: for some of my test assemblies, the contig N50 increased from ~300Kbp to ~950Kbp on a 20x PacBio + 100x Illumina data set. The run time has increased by about 10% over the 3.2.7 release, but it is still faster than the 3.2.6 release.
This version produced a very good assembly of the NA12878 human genome from the combined Release 3 (30x nanopore reads) and Release 4 (7x ultra-long nanopore reads) data https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md and Illumina paired-end data (100x coverage in 2x250bp reads from the GIAB project). A nanopore-only assembly of these data, followed by polishing of the consensus with Illumina data, was previously published in Nature Biotechnology https://www.nature.com/articles/nbt.4060. That published assembly contained 2,867Mbp of sequence and had an NG50 contig size of 6.4Mbp. A genome size of G=3,098,794,149 bp (3.1 Gbp) was used to compute the NG statistic.
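
For readers unfamiliar with the NG statistic, here is a minimal Python sketch of how NG50 is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the genome size G. The function name `ng50` is my own for this example, and the toy contig lengths are invented; this is not the script used to produce the numbers in this post.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative sum reaches genome_size / 2.

    Returns 0 if the contigs together cover less than half of the genome.
    """
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


if __name__ == "__main__":
    toy_contigs = [5, 4, 3, 2, 1]  # invented lengths, total of 15
    print(ng50(toy_contigs, genome_size=20))  # NG50 against a genome size of 20
    print(ng50(toy_contigs, genome_size=sum(toy_contigs)))  # N50 uses the assembly size instead
```

The only difference from N50 is the denominator: N50 uses the total assembly size, while NG50 uses a fixed genome size, which makes assemblies of different total length comparable.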
The MaSuRCA 3.2.8 assembly has an NG50 contig size of 8.4Mbp with 2,877Mbp of sequence in only 3,501 contigs, and it is available here:
Please report any issues that you encounter with MaSuRCA 3.2.8 through the GitHub "Issues" forum https://github.com/alekseyzimin/masurca/issues

3 comments:

  1. Can you share the config file you used?

  2. Sure, here are the parameters I used:

    PARAMETERS
    JF_SIZE=30000000000
    CA_PARAMETERS=frgCorrConcurrency=8 ovlCorrConcurrency=24
    USE_GRID=1
    GRID_BATCH_SIZE=1000000000
    GRID_QUEUE=all.q
    USE_LINKING_MATES=0
    LHE_COVERAGE=25
    NUM_THREADS=64
    END

  3. The Illumina data were from the GIAB project, runs:

    131219_D00360_005_BH814YADXX
    131219_D00360_006_AH81VLADXX
    140115_D00360_0009_AH8962ADXX
    140207_D00360_0013_AH8G92ADXX
