MaSuRCA genome assembly package: MaSuRCA 3.2.9 release: >99.99% consensus quality of MaSuRCA assemblies from Nanopore+Illumina data

This is a maintenance release with small bugfixes and speed/memory usage improvements. Some scripts have been converted from Perl to C++ to improve performance and reduce memory usage on very large data sets (10Gbp+ genomes).

There is a new script:

evaluate_consensus_error_rate.sh (run with -h to get usage)

This script follows the guidelines and procedure for consensus quality evaluation described in the Nanopore human genome paper (https://www.nature.com/articles/nbt.4060). It uses bwa to map Illumina data to the assembly, and then freebayes to call the variants. Bwa, samtools and freebayes must be installed and available on the PATH. A variant site in the assembly consensus (e.g. a SNP or an indel) is called an error if the consensus sequence at that site is not supported by any Illumina reads, while one or more alternatives are supported by at least 3 Illumina reads. The script thus estimates the total number of errors in the genome assembly consensus and computes the sequence quality. The output is a <>.report file, where <> is the name of the input assembly fasta file.
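The error criterion described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the code of evaluate_consensus_error_rate.sh: the `Site` structure and the function names are hypothetical, and counting each flagged site as exactly one error is an assumption (the actual script may count errors differently, for example per base of an indel).

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_ALT_SUPPORT = 3  # threshold stated in the post


@dataclass
class Site:
    """Read support at one variant site (hypothetical structure for illustration)."""
    consensus_support: int   # Illumina reads agreeing with the assembly consensus
    alt_supports: List[int]  # Illumina reads supporting each alternative allele


def is_consensus_error(site: Site, min_alt_support: int = MIN_ALT_SUPPORT) -> bool:
    """True if no read supports the consensus and some alternative has enough support."""
    return site.consensus_support == 0 and any(
        n >= min_alt_support for n in site.alt_supports
    )


def consensus_quality(sites: Iterable[Site], assembly_length: int) -> float:
    """Percent quality, assuming one error per flagged site (an assumption)."""
    errors = sum(is_consensus_error(s) for s in sites)
    return 100.0 * (1.0 - errors / assembly_length)
```

For example, a site with 0 reads supporting the consensus and 5 reads supporting an alternative is flagged, while a site with 2 reads supporting the consensus is not flagged, no matter how many reads support the alternative.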

According to this evaluation, MaSuRCA assemblies have very high consensus quality. In my experiments, a 30x PacBio + 100x Illumina assembly of A. thaliana had 99.9972% quality (about 3 errors per 100,000 bases). The human NA12878 assembly from 37x Nanopore + 100x Illumina data, described in the previous post and available here ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.fa (Nanopore data from https://www.nature.com/articles/nbt.4060 and Illumina data from the GIAB project), had 99.9913% quality (less than 1 error per 10,000 bases).
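To make the relation between the quality percentages and the error counts in parentheses explicit, here is a small Python sketch of the arithmetic. The function name `errors_per_window` is made up for this illustration; the only inputs are the two percentages quoted above.

```python
def errors_per_window(quality_percent: float, window: int) -> float:
    """Expected number of consensus errors in `window` bases at the given quality."""
    error_rate = (100.0 - quality_percent) / 100.0  # fraction of erroneous bases
    return error_rate * window


# A. thaliana, 99.9972% quality, per 100,000 bases
print(round(errors_per_window(99.9972, 100_000), 2))
# Human NA12878, 99.9913% quality, per 10,000 bases
print(round(errors_per_window(99.9913, 10_000), 2))
```

The first value comes out at about 2.8, which rounds to the 3 errors per 100,000 bases quoted for A. thaliana, and the second at about 0.87, that is, less than 1 error per 10,000 bases for NA12878.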

Note that no "polishing" is required for MaSuRCA assemblies. The sequence can be used as output, with no additional processing. Polishing with Pilon should be used with caution, because it can have adverse effects on assemblies. Here is a quote from Adam Phillippy's blog:

"...Finally, a note of caution on Illumina polishing with Pilon. While it can improve consensus statistics overall, it can worsen the assembly in some regions, especially complex repetitive sequence like the MHC. If using Pilon, we recommend limiting the allowable edits and focusing on the primary nanopore error mode (indels)...."

The entire blog post can be found here: https://genomeinformatics.github.io/na12878update/

Thursday, November 15, 2018
