Monday, September 16, 2019

MaSuRCA-polish tool in MaSuRCA 3.3.4

There is a new tool available in MaSuRCA: MaSuRCA-polish, an assembly consensus quality evaluation and polishing tool. The tool is designed to detect and correct single-base and short insertion/deletion errors in assembled genomes using Illumina data.  The tool is partially based on the error evaluation method described in (Jain et al, 2018).  It works best for assemblies with at least 99% consensus quality, because it relies on mapping Illumina reads to the assembly, and mapping accuracy decreases as the assembly error rate increases.

The tool first uses the bwa mem aligner (Li et al, 2009) to map the Illumina reads to the assembly.  It then uses the alignments to find short sequence variants with the freebayes tool (Garrison et al, 2012).  Any variant that has no allele that agrees with the consensus, and where at least one alternative allele has a frequency (count) of 3 or more, is counted as an error in the consensus.  Then, for every location where an error is detected, the consensus allele is replaced with the highest-count alternative.  The code is parallel, and it runs in less than 24 hours on a human genome with 30x Illumina data on a 48-core AMD Opteron server with 128GB of RAM.  The tool is distributed with the MaSuRCA assembler (https://github.com/alekseyzimin/masurca) (Zimin et al, 2013).
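
The error-calling rule described above can be sketched as a small shell filter. This is only an illustration under stated assumptions, not the code shipped with MaSuRCA: the tab-separated input format, the function name pick_corrections, and the reading of "no allele that agrees with the consensus" as "zero reads support the consensus allele" are all assumptions made for this example.

```shell
# Hypothetical input (NOT the real freebayes VCF): one variant per line, tab-separated:
#   contig  position  consensus_allele  consensus_count  alt_alleles  alt_counts
# alt_alleles and alt_counts are comma-separated lists of equal length.
# Output: contig, position, consensus allele, replacement allele (tab-separated)
# for every site that the rule counts as a consensus error.
pick_corrections() {
  awk -F'\t' -v min_count="${1:-3}" '
    {
      n = split($5, alleles, ",")
      split($6, counts, ",")
      best = 0
      best_allele = ""
      # Find the alternative allele with the highest count.
      for (i = 1; i <= n; i++) {
        if (counts[i] + 0 > best) {
          best = counts[i] + 0
          best_allele = alleles[i]
        }
      }
      # Assumed rule: no read supports the consensus allele,
      # and the best alternative is seen at least min_count times.
      if ($4 + 0 == 0 && best >= min_count)
        print $1 "\t" $2 "\t" $3 "\t" best_allele
    }'
}
```

With this sketch, a site whose consensus allele has zero supporting reads and whose alternatives have counts 2 and 4 would be reported with the count-4 allele as the replacement, while a site whose best alternative has a count of 2 would be left alone.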

We applied the MaSuRCA-polish tool to an assembly of the human HG002 genome, using a 30x coverage subset of the 65x total coverage. The tool found 138,386 single-base substitutions and 410,116 insertion/deletion errors in the assembly, and corrected them.  To evaluate the error rate after correction, we used a different 30x coverage subset of the Illumina data. The tool was able to correct about 95% of the substitution errors and about 96% of the insertion/deletion errors, reducing the overall consensus error rate by approximately a factor of 20.
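
As a rough check of the "factor of 20" figure, the arithmetic can be redone from the numbers quoted above. The sketch below assumes that the 95% and 96% correction rates apply directly to the 138,386 and 410,116 errors found; since both percentages are approximate, the result is approximate too. The function names are made up for this example.

```shell
# Errors left after correction, given errors found ($1) and fraction corrected ($2).
residual_errors() {
  awk -v n="$1" -v f="$2" 'BEGIN { printf "%.0f\n", n * (1 - f) }'
}

# Ratio of errors before ($1) to errors after ($2), to one decimal place.
reduction_factor() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

found=$((138386 + 410116))                    # substitutions + indels found
sub_left=$(residual_errors 138386 0.95)       # substitutions left uncorrected
indel_left=$(residual_errors 410116 0.96)     # indels left uncorrected
left=$((sub_left + indel_left))
echo "errors found: $found, errors left: $left, reduction: $(reduction_factor "$found" "$left")x"
```

With the rounded percentages this gives a reduction of a little over 20x, which is in line with the figure stated above.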

The usage is:

masurca-polish.sh -a <assembly_fasta> -r <'Illumina_reads_fastq1 Illumina_reads_fastq'> -t <number_of_threads> [-n] <optional:do not fix errors that are found>

Note that a beta version of the tool was called evaluate_consensus_error_rate.sh in earlier MaSuRCA releases.

References
1. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, Malla S. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature biotechnology. 2018 Apr;36(4):338.
2. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754-60. [PMID: 19451168]
3. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907. 2012 Jul 17.
4. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013 Aug 29;29(21):2669-77.

Monday, July 29, 2019

Test Illumina+Pacbio data set for MaSuRCA

There is a new FTP site with a relatively small data set for the 12Mbp genome of S. cerevisiae W303, intended for testing a MaSuRCA installation.

ftp://ftp.ccb.jhu.edu/pub/alekseyz/test_data/

This data set should assemble into fewer than 30 scaffolds with an N50 of over 650Kbp.  The data and config files are provided, along with a masurca.sh script that shows the command lines.  The data set is a subset of the data available at

http://schatzlab.cshl.edu/data/ectools/
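
To compare a test assembly against the expected figures above (under 30 scaffolds, N50 over 650Kbp), something like the following sketch could be used. It is not part of MaSuRCA; the function names are made up for this example, and the path of the scaffolds FASTA file has to be supplied by the user.

```shell
# Count the sequences (header lines) in the FASTA file "$1".
count_seqs() {
  grep -c '^>' "$1"
}

# Print the N50 of the sequences in the FASTA file "$1":
# the length L such that sequences of length >= L cover at least half of the total.
n50() {
  awk '/^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
       { len += length($0) }                        # sequence line: accumulate length
       END { if (len) print len }' "$1" |
  sort -rn |
  awk '{ lens[NR] = $1; total += $1 }
       END {
         for (i = 1; i <= NR; i++) {
           run += lens[i]
           if (2 * run >= total) { print lens[i]; exit }
         }
       }'
}
```

For example, `count_seqs <scaffolds.fasta>` and `n50 <scaffolds.fasta>` should report fewer than 30 sequences and a value above 650000 if the installation works as expected.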

Thursday, March 7, 2019

MaSuRCA version 3.3.1

Today I am releasing a new version of MaSuRCA assembler, 3.3.1.  This version has no new features, only performance improvements and bugfixes. The release is available from the usual github download page:
https://github.com/alekseyzimin/masurca/releases/tag/v3.3.1

I am currently working on MaSuRCA version 4.  This version will replace the CABOG assembler with Flye (https://github.com/fenderglass/Flye) for hybrid assembly of Illumina paired-end + Oxford Nanopore/PacBio long reads. This will result in significant performance improvements, as Flye takes about 1 day on a 64-core server to assemble an error-corrected human 20x data set, while CABOG takes about a week on a 300-core cluster to do the same task.

You can use Flye now to assemble the error-corrected reads output by MaSuRCA.  To do that, you can stop the assembly after one of the following files has been generated:
mr.41.15.15.0.02.1.fa -- for nanopore assemblies
mr.41.15.17.0.029.1.fa -- for pacbio assemblies

and then run the Flye assembler as follows:

GS=`cat ESTIMATED_GENOME_SIZE.txt` && <flye_path>/bin/flye --nano-corr <mr.41.15.15.0.02.1.fa or mr.41.15.17.0.029.1.fa> -t <number_of_threads> -g $GS -m 2000 -o flye_assembly -i 0
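
The manual steps above can be wrapped in a small helper, sketched below. This is an illustration only: the function names are made up, and it assumes that the corrected-read file and ESTIMATED_GENOME_SIZE.txt sit in the same MaSuRCA run directory. The Flye options are exactly those of the command line above. Note that a non-empty file is not proof that MaSuRCA has finished writing it, so the check here is only a convenience.

```shell
# Print the path of the corrected-read file found in directory "$1" (default: .).
# Checks the nanopore file name first, then the pacbio one; fails if neither
# exists as a non-empty file.
pick_corrected_reads() {
  dir="${1:-.}"
  for f in mr.41.15.15.0.02.1.fa mr.41.15.17.0.029.1.fa; do
    if [ -s "$dir/$f" ]; then
      printf '%s\n' "$dir/$f"
      return 0
    fi
  done
  return 1
}

# Run Flye on the corrected reads.
#   $1 = Flye install directory, $2 = number of threads,
#   $3 = MaSuRCA run directory (assumed to hold both input files; default: .)
run_flye() {
  reads=$(pick_corrected_reads "${3:-.}") || { echo "no corrected reads found" >&2; return 1; }
  gs=$(cat "${3:-.}/ESTIMATED_GENOME_SIZE.txt") || return 1
  "$1/bin/flye" --nano-corr "$reads" -t "$2" -g "$gs" -m 2000 -o flye_assembly -i 0
}
```

A call would then look like `run_flye <flye_path> <number_of_threads> <masurca_run_directory>`.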