Tuesday, January 26, 2021

MaSuRCA 4.0.1 release

This release contains major improvements in the speed of hybrid assemblies. Thanks to the new k-unitig pre-correction algorithm, the mega-reads algorithm, which corrects long, high-error reads from Oxford Nanopore or Pacific Biosciences platforms, is now about 6 times faster for large genomes. The new algorithm also eliminates the need to run a second pass of mega-reads, which lowers memory requirements. Together these changes substantially reduce run times, especially for big genomes. It is now possible to run a hybrid assembly of a human genome starting with ~60x Illumina paired-end data and ~30x Oxford Nanopore data in less than 6 days on a small computing cluster with ~200 CPU cores. Bigger clusters allow assembly run times of as little as 2-3 days. The MaSuRCA hybrid technique outputs a high-quality consensus that does not require any polishing. Thus MaSuRCA assemblies can be used without any additional post-processing for downstream steps, such as gene annotation, in any genome project.
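
To put the human genome figures above into cluster-budget terms, here is a rough back-of-the-envelope calculation. It assumes all ~200 cores stay allocated for the full 6 days, so it is an upper bound on the allocation rather than a measured value; the variable names are only illustrative.

```shell
# Figures from the announcement: under 6 days on ~200 CPU cores.
DAYS=6
CORES=200

# Upper bound on the allocation, assuming every core is held for the whole run.
CORE_HOURS=$(( DAYS * 24 * CORES ))
echo "at most ${CORE_HOURS} core-hours"   # prints "at most 28800 core-hours"
```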

This release also improves compatibility with the SLURM scheduler by eliminating the second pass of mega-reads, which was unstable on some systems. There are also many stability and efficiency improvements. Here are some highlights:

  1. worked around the "consensus sequence mismatch" error in POLCA that occurred rarely in complex sequence regions
  2. chromosome scaffolder now picks low and high coverage thresholds automatically based on mapped read coverage
  3. improved chromosome scaffolder speed and accuracy using more efficient algorithms
  4. updated the version of MUMmer to 4.0.0rc1
  5. fixed a bug that sometimes caused minor under-reporting of actual errors in the POLCA report file
  6. updated code for reference-assisted assembly for better performance when multiple references are used

The release is available here:

https://github.com/alekseyzimin/masurca/releases/download/v4.0.1/MaSuRCA-4.0.1.tar.gz
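
For convenience, the download can be scripted along these lines. This is a minimal sketch, assuming curl and tar are available; the helper name fetch_masurca is hypothetical and not part of MaSuRCA, and the URL is simply the one given above rebuilt from the version number.

```shell
# Version and URL as given in this announcement.
VERSION=4.0.1
TARBALL="MaSuRCA-${VERSION}.tar.gz"
URL="https://github.com/alekseyzimin/masurca/releases/download/v${VERSION}/${TARBALL}"

# Hypothetical helper: download the release tarball and unpack it
# in the current directory. Requires network access when called.
fetch_masurca() {
  curl -fL -o "$TARBALL" "$URL" && tar -xzf "$TARBALL"
}

echo "$URL"
```

Calling fetch_masurca performs the actual download and extraction. After unpacking, follow the build instructions in the README shipped inside the archive.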