This release brings major speed improvements for hybrid assemblies. Thanks to the new k-unitig pre-correction algorithm, the mega-reads algorithm, which corrects long, high-error reads from the Oxford Nanopore or Pacific Biosciences platforms, is now about 6 times faster for large genomes. The new algorithm also eliminates the need to run a second pass of mega-reads, which lowers memory requirements. Together, these changes substantially reduce run times, especially for big genomes. It is now possible to run a hybrid assembly of a human genome, starting from ~60x Illumina paired-end data and ~30x Oxford Nanopore data, in less than 6 days on a small computing cluster with ~200 CPU cores. Bigger clusters allow assembly run times of as little as 2-3 days. The MaSuRCA hybrid technique outputs a high-quality consensus that does not require any polishing. MaSuRCA assemblies can therefore be used without any additional post-processing for downstream steps, such as gene annotation, in any genome project.
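To put the human-genome figures above in concrete terms, here is a rough sizing sketch. It is not part of MaSuRCA. The genome size of ~3.1 Gbp is an assumption that these notes do not state, and the helper names `bases_for_coverage` and `core_hours` are hypothetical. The core-hour figure is only a ceiling that assumes every core is busy for the full run.

```python
# Back-of-the-envelope sizing for the human hybrid assembly described above.
# ASSUMPTION: haploid human genome size of ~3.1 Gbp (not stated in these notes).
GENOME_SIZE_BP = 3_100_000_000


def bases_for_coverage(coverage: float, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Total sequenced bases needed to reach the given fold coverage."""
    return coverage * genome_size_bp


def core_hours(days: float, cores: int) -> float:
    """Upper bound on CPU core-hours, assuming all cores are busy for the whole run."""
    return days * 24 * cores


illumina_gbp = bases_for_coverage(60) / 1e9  # ~60x Illumina paired-end data
nanopore_gbp = bases_for_coverage(30) / 1e9  # ~30x Oxford Nanopore data
budget = core_hours(6, 200)                  # "less than 6 days" on ~200 cores

print(f"Illumina (~60x): ~{illumina_gbp:.0f} Gbp")
print(f"Nanopore (~30x): ~{nanopore_gbp:.0f} Gbp")
print(f"Compute ceiling (6 days x 200 cores): {budget:,.0f} core-hours")
```

Under these assumptions the input is roughly 186 Gbp of Illumina data and 93 Gbp of Nanopore data, and the run fits within 28,800 core-hours.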
This release also improves compatibility with the SLURM scheduler by eliminating the second pass of mega-reads, which was unstable on some systems. There are also many stability and efficiency improvements. Here are some highlights:
- worked around the "consensus sequence mismatch" error in POLCA that occurred rarely in complex sequence regions
- chromosome scaffolder now picks low and high coverage thresholds automatically based on mapped read coverage
- improved chromosome scaffolder speed and accuracy using more efficient algorithms
- updated the version of MUMmer to 4.0.0rc1
- fixed a bug that sometimes caused minor under-reporting of actual errors in the POLCA report file
- updated the reference-assisted assembly code for better performance when multiple references are used