Monday, September 16, 2019

MaSuRCA-polish tool in MaSuRCA 3.3.4

There is a new tool available in MaSuRCA, MaSuRCA-polish tool assembly consensus quality evaluation/polishing tool. The tool is designed to detect and correct single-base and short insertion/deletion errors in assembled genomes using Illumina data.  The tool is is partially based on the error evaluation method described in (Jain et al, 2018).  It works best for assemblies with at least 99% consensus quality, as it is based on mapping Illumina reads to the assembly, and mapping accuracy decreases as assembly error rate increases.  

The tool first uses bwa mem (Li et al, 2009) aligner to map the Illumina reads to the assembly.  It then uses the alignments to find short sequence variants using freebayes tool (Garrison et al, 2012).  Any variant that has no allele that agrees with the consensus and where there is at least one alternative allele with frequency 3 or more is counted as an error in the consensus.  Then, for every location where we detect an error, we replace the consensus allele with the highest count alternative.  The code is parallel and it runs in less than 24 hours on a human genome with 30x Illumina data on 48 core AMD Opteron server with 128Gb of RAM.  The tool is distributed with MaSuRCA assembler (https://github.com/alekseyzimin/masurca) (Zimin at al, 2013). 

We applied the MaSuRCA-polish tool to the assembly of human HG002 genome using a 30x coverage subset of the 65x total coverage. The tool found 138,386 single base substitutions and 410,116 insertion/deletion errors in the assembly, and performed correction.  To evaluate the error rate after correction we used a different 30x coverage subset of the Illumina data. The tool was able to correct about 95% of the substitution errors and about 96% of insertion/deletion errors, reducing the overall consensus error rate by approximately a factor of 20. 

The usage is:

masurca-polish.sh -a -r <'Illumina_reads_fastq1 Illumina_reads_fastq'> -t [-n] <optional:do not fix errors that are found>

Note that a beta version of the tool was called evaluate_consensus_error_rate.sh in earlier MaSuRCA releases.

References
1. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, Malla S. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature biotechnology. 2018 Apr;36(4):338.
2. Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]
3. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907. 2012 Jul 17.
4. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013 Aug 29;29(21):2669-77.

Monday, July 29, 2019

Test Illumina+Pacbio data set for MaSuRCA

There is new ftp site for a relatively small data set for 12Mbp genome of S.cerevisiae W303 to test MaSuRCA installation.

ftp://ftp.ccb.jhu.edu/pub/alekseyz/test_data/

This data set should assemble in under 30 scaffolds with N50 of over 650Kbp.  The data and config files are provided along with masurca.sh script that shows the command lines.  The data set is a subset of data available at

http://schatzlab.cshl.edu/data/ectools/

Thursday, March 7, 2019

MaSuRCA version 3.3.1

Today I am releasing a new version of MaSuRCA assembler, 3.3.1.  This version has no new features, only performance improvements and bugfixes. The release is available from the usual github download page:
https://github.com/alekseyzimin/masurca/releases/tag/v3.3.1

I am currently working on the MaSuRCA 4 version.  This version will replace CABOG assembler with Flye (https://github.com/fenderglass/Flye) for hybrid assembly of Illumina paired end + Oxford Nanopore/PacBio long reads. This will result in significant performance improvements, as Flye takes about 1 day on a 64-core server to assemble error-corrected human 20x data set, and CABOG takes about a week on 300-core cluster to do the same task. 

You can use Fly now to assemble error corrected reads output by MaSuRCA.  To do that you can stop the assembly after the following file has been generated:
mr.41.15.15.0.02.1.fa -- for nanopore assemblies
mr.41.15.17.0.029.1.fa -- for pacbio assemblies

and then run the Flye assembler as follows:

GS=`cat ESTIMATED_GENOME_SIZE.txt` && <flye_path>/bin/flye --nano-corr <mr.41.15.15.0.02.1.fa or mr.41.15.17.0.029.1.fa> -t <number_of-threads> -g $GS -m 2000 -o flye_assembly -i 0


Friday, November 30, 2018

Beta SLURM support in MaSuRCA 3.3.0b

I have been working to implement grid execution in MaSuRCA for a while, and MaSuRCA supported SGE for nearly a year now.  The implementation of support for SLURM was behind due to an important difference between job submission in SGE and SLURM: absence of "sync" option, where the submit command in a script would not return until all jobs have exited.  There are ways around it and we are still testing what would work best in the framework of MaSuRCA.  For now one can run the four most computation intensive parts of MaSuRCA on the grid:

1. mega-reads correction 1st pass
2. mega-reads correction second pass (sometimes grid is not necessary for this one)
3. overlap-based trimming in CABOG
4. overlapping in CABOG

The way this works, one runs assemble.sh and then the run prepares batch jobs and exits, instructing user to run a given sbatch command with given parameters.  Once all sbatch jobs are done, one should re-run assemble.sh script (would not hurt to re-generate it just in case).  The scrip checks if all jobs completed successfully and then proceeds.  If some jobs have failed, the script exits, instructing the user to run sbatch command again.

To configure the grid run one must add the following options to the config file.  Hint: these options are already present in the example config file created upon installation of MaSuRCA

USE_GRID=1
GRID_QUEUE=your_slurm_partition
GRID_ENGINE=SLURM
GRID_BATCH_SIZE=<amount of Nanopore/Pacbio sequence to use per batch>, 300000000 default, will create 10 batches for every 1x of mammalian genome coverage
NUM_THREADS=32

The sbatch command will run one job per 32-core node in this case.  If your nodes have fewer or more cores, adjust accordingly.

If grid jobs fail in mega-reads pass2 (jf_aligner jobs) it is likely due to not enough memory on the nodes.  To solve this, reduce GRID_BATCH_SIZE, re-generate assemble.sh , remove mr_pass2 folder and re-run.

To get the 3.3.0b version, please go to github
https://github.com/alekseyzimin/masurca/blob/master/MaSuRCA-3.3.0b.tar.gz

Thanks to Daniela Puiu, the software engineer at the Center for Computational Biology for implementing SLURM support in CABOG and helping me test MaSuRCA!


Tuesday, November 20, 2018

MaSuRCA 3.2.9 hotfix

I decided to release a minor hotfix to MaSuRCA 3.2.9 on 11/20/2019 keeping version number the same, but with a small change: now the final output file final.genome.scf.fasta contains scaffolds.  Previous version split scaffolds and output contigs in that file. To recover the scaffolds from 3.2.8 assemblies you can run:

<MaSuRCA path>/bin/recover_scaffolds.pl < final.genome.scf.fasta >final.genome.scf.scaffolds.fasta

The NA12878 assembly of newly basecalled release3+4 Nanopore+Illumina data (see previous posts) now has the following statistics (N50 computed with 3.1Gbp human genome size):

Total sequence: 2,886,141,146
Number of scaffolds: 3,240
Longest scaffold: 56,745,809
Scaffold N50: 14,764,183
Scaffold L50: 63
Number of contigs: 3,448
Longest contig: 51,762,539
Contig N50: 9,580,052
Contig L50: 88

The assembly is posted here:

ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.scaffolds.fa

The consensus quality stats:
Substitution Errors: 49,520
Insertion/Deletion Errors: 115,752
Consensus Quality: 99.9943 (or 0.4 errors per 10000 bases)



Thursday, November 15, 2018

MaSuRCA 3.2.9 release: >99.99% consensus quality of MaSuRCA assemblies from Nanopore+Illumina data

This is a maintenance release with small bugfixes and speed/memory usage improvements. Some scripts have been converted from Perl to C++ to improve performance and reduce memory usage on very large data sets (10Gbp+ genomes). 

There is a new script:

evaluate_consensus_error_rate.sh (run with -h to get usage)

This script follows guidelines and procedure of consensus quality evaluation described in https://www.nature.com/articles/nbt.4060 . It uses bwa to map Illumina data to the assembly, and then freebayes to get the variants. Bwa, samtools and freebayes must be installed and available on the PATH. Any assembly consensus variants (e.g. SNPs, indels) that are not supported by any Illumina reads, but where there is one or more alternatives that are supported by at least 3 Illumina reads are called errors in the consensus. The script thus estimates the total number of errors in the genome assembly consensus and computes sequence quality. The output is <>.report file, where <> is the name of the input assembly fasta file.

According to this evaluation, MaSuRCA assemblies have very high consensus quality; in my experiments 30x Pacbio+ 100x Illumina assembly of A.thaliana had 99.9972% quality (3 errors per 100,000 bases).  Human NA12878 assembly from 37x Nanopore+100x Illumina data described in the previous post and available here ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.fa (nanopore data from https://www.nature.com/articles/nbt.4060 and Illumina data from GIAB project), had 99.9913% quality (less than 1 error per 10,000 bases).  

Note that there is no "polishing" required for MaSuRCA assemblies.  The sequence can be used as output with no additional processing.  Polishing with Pilon should be used with caution and can lead to adverse effects on assemblies.  Here is a quote from Adam Phillippy blog:

"...Finally, a note of caution on Illumina polishing with Pilon. While it can improve consensus statistics overall, it can worsen the assembly in some regions, especially complex repetitive sequence like the MHC. If using Pilon, we recommend limiting the allowable edits and focusing on the primary nanopore error mode (indels)...."

The entire blog post can be found here:  https://genomeinformatics.github.io/na12878update/

Thursday, September 6, 2018

MaSuRCA 3.2.8 release with better assemblies of large genomes; 37x Nanopore + Illumina yields 8.4Mbp NG50 assembly size for human NA12878

Today I am releasing an update to MaSuRCA assembler, version 3.2.8.  The release is available from the MaSuRCA github releases page:

https://github.com/alekseyzimin/masurca/releases/tag/3.2.8

This version produces much more contiguous (in some cases by a factor of two or three) assemblies of complex genomes, such as mammalian or plant genomes.  It does just as well on small genomes such as insects, small plant genomes, or fungal genomes. The run time has increased somewhat from the 3.2.7 version because I have re-introduced the overlap-based-trimming module during assembly by default.  Here is the list of major improvements:
-- reworked the joining algorithm for incorporating long high error read sequence into the corrected reads where the sequence could not be corrected by Illumina data
-- cleaned up the code and the output/error messages
-- added final gapclosing step for scaffold gaps spanned by long high error reads
-- bugfixes, such as error in executing do_consensus.sh on some systems
-- re-enabled overlap based trimming in CABOG assembler and reduced the default coverage input for correction to 25x; if you have more than 25x coverage, the assembler will use 25x coverage in the longest reads

The changes made significant impact on contiguity and correctness of large mammalian and plant genome assemblies, for some of my test assemblies now N50 contig increased from ~300Kbp to ~950Kbp on 20x Pacbio + 100x Illumina data set. The run time has increased about 10% over 3.2.7 release but still faster than 3.2.6 release. 
This version produced a very good assembly of NA12878 human genome from the combined Release 3 (30x nanopore reads) and Release 4 (7x ultra-long nanopore reads) https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md , and Illumina paired end data (100x 2x250bp reads from GIAB project). Previously, nanopore-only assembly of these data followed by polishing the consensus with Illumina data have been published in Nature Biotechnology https://www.nature.com/articles/nbt.4060. This assembly contained 2,867Mb of sequence and had NG50 contig size of 6.4Mbp. G=3,098,794,149 bp (3.1 Gbp) was used to compute the NG statistic.  
The MaSuRCA 3.2.8 assembly has NG50 contig size of 8.4Mbp with 2,877Mbp of sequence in only 3501 contigs, and it is available here:
Please communicate any issues that you may encounter with MaSuRCA 3.2.8 through github "Issues" forum https://github.com/alekseyzimin/masurca/issues

Monday, July 23, 2018

New version MaSuRCA 3.2.7 -- significant speed and quality enhancements

Today I am releasing a new version of MaSuRCA, version 3.2.7.  This version has two significant updates over the 3.2.6 version.

The first update has to do with regions in the Long High Error (LHE) read data that are not captured by Illumina reads due to high or low GC content or are too repetitive to be corrected reliably. In all previous versions of MaSuRCA I collected these regions and used them without correction or consensus in joining the Illumina-corrected sections of LHE reads.  Each such section had to be present in multiple corrected reads and had to have about the same length to be used. It was up to the CABOG assembler then to put them together correctly and to call consensus.  Now the consensus for these regions is done prior to assembly, using high coverage by the LHE reads. This led to much better contiguity and quality of these regions.  The consensus quality of these regions has also improved, because CABOG assembler discarded or split mega-reads that contained uncorrected segments, which reduced the coverage of these sections.

The second update has to do with performance.  So far, in running the CABOG assembler, which is part of MaSuRCA it was necessary to utilize overlap-based trimming module (OBT) in CABOG. Thus the overlapper, the routine that computes the overlaps between the reads had to be run twice -- first to compute the overlaps for trimming, and then to re-compute them on the trimmed reads.  I have implemented an efficient version of the trimmer for the mega-reads that runs in almost no additional time and allows to avoid using the OBT module in CABOG. This reduced the CABOG run time by almost half.
These updates not only reduced the run times but also led to significant updates in the assembly quality.  For example, among other data sets, I re-assembled the data used in our recent publication "First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (formerly Scedosporium prolificans)" (http://www.g3journal.org/content/early/2017/09/29/g3.117.300107).
The published version of the genome was assembled with MaSuRCA version 3.2.2.  Here is improvement in the assembly quality between version 3.2.2 and version 3.2.7.  The N50 contig size more than doubled.

ver.3.2.2 -- basis for the published assembly, was the best assembly achieved then among all assemblers that we tested
Sequence: 37,087,688
N50 contig: 1,509,237
N50 scaffold: 1,509,237
Number of scaffolds: 156

ver.3.2.7 assembly
Sequence: 36,892,617
N50 contig: 3,157,388
Number of contigs: 68 == Number of scaffolds

The new release has been tested on yeast, arabidopsis and human and it is available on github:  https://github.com/alekseyzimin/masurca/blob/master/MaSuRCA-3.2.7.tar.gz and in the "Releases" section: https://github.com/alekseyzimin/masurca/releases

Wednesday, May 2, 2018

MaSuRCA 3.2.6 official

I have released the official 3.2.6 version of MaSuRCA.  It is available on the ftp site here:
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.6.tar.gz, and also on github.

Upgrading is easy, simply remove the 3.2.4 (or older) version and install this one.  Please use this version going forward. Big thanks to all users who reported errors and bugs!  

Please see this post for the list of improvements in 3.2.6 version: http://masurca.blogspot.com/2018/03/pre-release-version-masurca-326beta.html 

Thursday, April 19, 2018

Reporting issues with MaSuRCA on github

MaSuRCA is now on github.  Github has an excellent system for reporting bugs/issues with the software.  I encourage all users of MaSuRCA to utilize this resource and report issues here

https://github.com/alekseyzimin/masurca/issues

Also if you are having a problem, please check the github issues page to see if the problem has been addressed already.