Monday, September 16, 2019

MaSuRCA-polish tool in MaSuRCA 3.3.4

There is a new tool available in MaSuRCA: MaSuRCA-polish, a tool for evaluating and polishing assembly consensus quality. The tool is designed to detect and correct single-base and short insertion/deletion errors in assembled genomes using Illumina data.  It is partially based on the error evaluation method described in (Jain et al, 2018).  It works best for assemblies with at least 99% consensus quality, because it relies on mapping Illumina reads to the assembly, and mapping accuracy decreases as the assembly error rate increases.

The tool first uses the bwa mem aligner (Li et al, 2009) to map the Illumina reads to the assembly.  It then uses the alignments to find short sequence variants with the freebayes tool (Garrison et al, 2012).  Any variant that has no allele agreeing with the consensus, and that has at least one alternative allele with a frequency of 3 or more, is counted as an error in the consensus.  Then, at every location where an error is detected, the consensus allele is replaced with the alternative that has the highest count.  The code is parallel, and it runs in less than 24 hours on a human genome with 30x Illumina data on a 48-core AMD Opteron server with 128 GB of RAM.  The tool is distributed with the MaSuRCA assembler (https://github.com/alekseyzimin/masurca) (Zimin et al, 2013).
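The error-calling and replacement rule described above can be illustrated with a minimal sketch. This is not the tool's actual code: the `Variant` record, the `best_fix` and `polish` functions, and the reading of "no allele that agrees with the consensus" as "the called genotype does not include allele 0" are all assumptions made for illustration. Only the threshold of 3 and the choice of the highest-count alternative come from the text. The sketch also assumes variants do not overlap.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_ALT_COUNT = 3  # threshold stated in the text


@dataclass
class Variant:
    """Hypothetical, simplified stand-in for one variant call."""
    pos: int                   # 0-based start position in the consensus
    ref: str                   # consensus allele at that position
    alts: List[str]            # alternative alleles
    alt_counts: List[int]      # reads supporting each alternative
    genotype: Tuple[int, ...]  # called allele indices; 0 means the consensus allele


def best_fix(v: Variant) -> Optional[str]:
    """Return the replacement allele if the variant counts as a consensus error."""
    # Assumed interpretation: the consensus is only wrong if no called allele matches it.
    if 0 in v.genotype:
        return None
    # At least one alternative must reach the minimum count.
    if not v.alts or max(v.alt_counts) < MIN_ALT_COUNT:
        return None
    # Pick the alternative with the highest count (first one wins on ties).
    best = max(range(len(v.alts)), key=lambda i: v.alt_counts[i])
    return v.alts[best]


def polish(consensus: str, variants: List[Variant]) -> str:
    """Apply all accepted fixes to the consensus string."""
    out = consensus
    # Work right to left so that insertions/deletions do not shift
    # the coordinates of variants that have not been applied yet.
    for v in sorted(variants, key=lambda x: x.pos, reverse=True):
        fix = best_fix(v)
        if fix is None:
            continue
        if out[v.pos:v.pos + len(v.ref)] != v.ref:
            raise ValueError(f"consensus does not match ref allele at {v.pos}")
        out = out[:v.pos] + fix + out[v.pos + len(v.ref):]
    return out


if __name__ == "__main__":
    calls = [
        Variant(0, "A", ["G"], [8], (0, 1)),    # consensus allele still called: kept
        Variant(2, "G", ["T"], [10], (1, 1)),   # substitution error: fixed
        Variant(4, "AC", ["A"], [5], (1, 1)),   # one-base deletion: fixed
        Variant(6, "G", ["A"], [2], (1, 1)),    # below the count threshold: kept
    ]
    print(polish("ACGTACGT", calls))
```

Applying the fixes from the end of the sequence backwards is a design choice of the sketch: it keeps every remaining coordinate valid without tracking a running offset.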

We applied the MaSuRCA-polish tool to an assembly of the human HG002 genome using a 30x coverage subset of the 65x total coverage. The tool found 138,386 single-base substitutions and 410,116 insertion/deletion errors in the assembly, and corrected them.  To evaluate the error rate after correction we used a different 30x coverage subset of the Illumina data. The tool was able to correct about 95% of the substitution errors and about 96% of the insertion/deletion errors, reducing the overall consensus error rate by approximately a factor of 20.
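As a rough back-of-the-envelope check on these figures, the sketch below combines the reported counts and correction rates. It assumes that the errors found in the first subset stand in for the total error count and that the rounded percentages are exact, neither of which the text states, so the result is only indicative. The function names are hypothetical.

```python
# Figures quoted in the text.
SUBS_FOUND = 138_386
INDELS_FOUND = 410_116
SUBS_FIXED_FRACTION = 0.95    # "about 95%"
INDELS_FIXED_FRACTION = 0.96  # "about 96%"


def remaining(found: int, fixed_fraction: float) -> float:
    """Errors expected to remain after correction."""
    return found * (1.0 - fixed_fraction)


def reduction_factor() -> float:
    """Ratio of errors before correction to errors after correction."""
    before = SUBS_FOUND + INDELS_FOUND
    after = (remaining(SUBS_FOUND, SUBS_FIXED_FRACTION)
             + remaining(INDELS_FOUND, INDELS_FIXED_FRACTION))
    return before / after


if __name__ == "__main__":
    print(f"approximate reduction factor: {reduction_factor():.1f}")
```

Under these assumptions the ratio comes out in the low twenties, in line with the approximate factor of 20 quoted above.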

The usage is:

masurca-polish.sh -a <assembly_contigs_or_scaffolds> -r <'Illumina_reads_fastq1 Illumina_reads_fastq2'> -t <number_of_threads> [-n] <optional:do not fix errors that are found>

Note that a beta version of the tool was called evaluate_consensus_error_rate.sh in earlier MaSuRCA releases.

References
1. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, Malla S, et al. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology. 2018 Apr;36(4):338.
2. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754-60. [PMID: 19451168]
3. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907. 2012 Jul 17.
4. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013 Aug 29;29(21):2669-77.