Thursday, September 6, 2018

MaSuRCA 3.2.8 release with better assemblies of large genomes; 37x Nanopore + Illumina yields 8.4Mbp NG50 assembly size for human NA12878

Today I am releasing an update to the MaSuRCA assembler, version 3.2.8.  The release is available from the MaSuRCA github releases page:

https://github.com/alekseyzimin/masurca/releases/tag/3.2.8

This version produces much more contiguous (in some cases by a factor of two or three) assemblies of complex genomes, such as mammalian or plant genomes.  It does just as well on small genomes such as insects, small plant genomes, or fungal genomes. The run time has increased somewhat from the 3.2.7 version because I have re-introduced the overlap-based-trimming module during assembly by default.  Here is the list of major improvements:
-- reworked the joining algorithm for incorporating long high error read sequence into the corrected reads where the sequence could not be corrected by Illumina data
-- cleaned up the code and the output/error messages
-- added final gapclosing step for scaffold gaps spanned by long high error reads
-- bugfixes, such as error in executing do_consensus.sh on some systems
-- re-enabled overlap based trimming in CABOG assembler and reduced the default coverage input for correction to 25x; if you have more than 25x coverage, the assembler will use 25x coverage in the longest reads
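
To illustrate the last item, here is a minimal sketch of what "use 25x coverage in the longest reads" could look like. It is only an illustration, not the code MaSuRCA runs: the function name, its parameters, and the rule of stopping once the coverage budget is reached are assumptions made for this example.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage=25.0):
    """Pick the longest reads until their total length reaches
    target_coverage * genome_size (hypothetical helper, for illustration)."""
    budget = target_coverage * genome_size
    selected = []
    total = 0
    # Walk the reads from longest to shortest.
    for length in sorted(read_lengths, reverse=True):
        if total >= budget:
            break  # coverage budget reached; shorter reads are left out
        selected.append(length)
        total += length
    return selected
```

If the data set has less than the target coverage, every read is kept, which matches the statement that the cap only applies when more than 25x is available.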

The changes had a significant impact on the contiguity and correctness of large mammalian and plant genome assemblies: for some of my test assemblies, the N50 contig size increased from ~300Kbp to ~950Kbp on a 20x PacBio + 100x Illumina data set. The run time has increased about 10% over the 3.2.7 release, but it is still faster than the 3.2.6 release.
This version produced a very good assembly of the NA12878 human genome from the combined Release 3 (30x nanopore reads) and Release 4 (7x ultra-long nanopore reads) data https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md , and Illumina paired-end data (100x 2x250bp reads from the GIAB project). Previously, a nanopore-only assembly of these data, followed by polishing of the consensus with Illumina data, was published in Nature Biotechnology https://www.nature.com/articles/nbt.4060. That assembly contained 2,867Mb of sequence and had an NG50 contig size of 6.4Mbp. G=3,098,794,149 bp (3.1 Gbp) was used to compute the NG statistic.
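
For readers unfamiliar with the NG statistic, the sketch below shows the standard definition: NG50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the genome size G. N50 is the same computation with G replaced by the total assembly size. The function names are chosen for this example and are not part of MaSuRCA.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative length, from longest
    to shortest, first reaches half of genome_size. Returns None if the
    contigs together cover less than half of genome_size."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


def n50(contig_lengths):
    """N50 is NG50 computed against the total assembly size."""
    return ng50(contig_lengths, sum(contig_lengths))
```

For the comparison above, the call would be ng50(lengths, 3098794149) for both assemblies, so that the two values are measured against the same G rather than against each assembly's own size.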
The MaSuRCA 3.2.8 assembly has NG50 contig size of 8.4Mbp with 2,877Mbp of sequence in only 3501 contigs, and it is available here:
Please communicate any issues that you may encounter with MaSuRCA 3.2.8 through the github "Issues" forum https://github.com/alekseyzimin/masurca/issues

Monday, July 23, 2018

New version MaSuRCA 3.2.7 -- significant speed and quality enhancements

Today I am releasing a new version of MaSuRCA, version 3.2.7.  This version has two significant updates over the 3.2.6 version.

The first update has to do with regions in the Long High Error (LHE) read data that are not captured by Illumina reads due to high or low GC content, or that are too repetitive to be corrected reliably. In all previous versions of MaSuRCA I collected these regions and used them, without correction or consensus, to join the Illumina-corrected sections of LHE reads.  Each such section had to be present in multiple corrected reads and had to have about the same length to be used. It was then up to the CABOG assembler to put them together correctly and to call the consensus.  Now the consensus for these regions is computed prior to assembly, using the high coverage by the LHE reads. This led to much better contiguity in these regions.  Their consensus quality has also improved, because previously the CABOG assembler discarded or split mega-reads that contained uncorrected segments, which reduced the coverage of these sections.

The second update has to do with performance.  So far, in running the CABOG assembler, which is part of MaSuRCA, it was necessary to use the overlap-based trimming (OBT) module in CABOG. Thus the overlapper, the routine that computes the overlaps between the reads, had to be run twice -- first to compute the overlaps for trimming, and then to re-compute them on the trimmed reads.  I have implemented an efficient version of the trimmer for the mega-reads that runs in almost no additional time and makes it possible to avoid using the OBT module in CABOG. This reduced the CABOG run time by almost half.
These updates not only reduced the run times but also led to significant improvements in the assembly quality.  For example, among other data sets, I re-assembled the data used in our recent publication "First Draft Genome Sequence of the Pathogenic Fungus Lomentospora prolificans (formerly Scedosporium prolificans)" (http://www.g3journal.org/content/early/2017/09/29/g3.117.300107).
The published version of the genome was assembled with MaSuRCA version 3.2.2.  Here is the improvement in the assembly quality between version 3.2.2 and version 3.2.7.  The N50 contig size more than doubled.

ver.3.2.2 -- basis for the published assembly, was the best assembly achieved then among all assemblers that we tested
Sequence: 37,087,688
N50 contig: 1,509,237
N50 scaffold: 1,509,237
Number of scaffolds: 156

ver.3.2.7 assembly
Sequence: 36,892,617
N50 contig: 3,157,388
Number of contigs: 68 == Number of scaffolds
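
As a quick check of the "more than doubled" statement, the arithmetic on the numbers listed above can be written out as follows. The variable names are chosen for this example only.

```python
# Values copied from the two assembly summaries above.
n50_v322 = 1_509_237   # N50 contig, ver.3.2.2
n50_v327 = 3_157_388   # N50 contig, ver.3.2.7
scaffolds_v322 = 156
scaffolds_v327 = 68

# Fold change in N50 contig size between the two versions.
n50_ratio = n50_v327 / n50_v322
# Reduction in the number of scaffolds.
scaffold_drop = scaffolds_v322 - scaffolds_v327

print(f"N50 contig ratio: {n50_ratio:.2f}x")
print(f"Scaffolds reduced by: {scaffold_drop}")
```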

The new release has been tested on yeast, arabidopsis and human and it is available on github:  https://github.com/alekseyzimin/masurca/blob/master/MaSuRCA-3.2.7.tar.gz and in the "Releases" section: https://github.com/alekseyzimin/masurca/releases

Wednesday, May 2, 2018

MaSuRCA 3.2.6 official

I have released the official 3.2.6 version of MaSuRCA.  It is available on the ftp site here:
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.6.tar.gz, and also on github.

Upgrading is easy: simply remove the 3.2.4 (or older) version and install this one.  Please use this version going forward. Big thanks to all users who reported errors and bugs!

Please see this post for the list of improvements in 3.2.6 version: http://masurca.blogspot.com/2018/03/pre-release-version-masurca-326beta.html 

Thursday, April 19, 2018

Reporting issues with MaSuRCA on github

MaSuRCA is now on github.  Github has an excellent system for reporting bugs/issues with the software.  I encourage all users of MaSuRCA to use this resource and report issues here:

https://github.com/alekseyzimin/masurca/issues

Also if you are having a problem, please check the github issues page to see if the problem has been addressed already.

Thursday, April 5, 2018

Tree tobacco plant assembled with MaSuRCA

I am glad to see published assemblies of novel genomes that used MaSuRCA.  Here is a recent data note published in BMC Research Notes:
https://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-018-3127-x

This is an assembly of tree tobacco (Nicotiana glauca) from Illumina-only data (a 350bp fragment paired-end library and a ~4,000bp fragment mate pair library) that yielded an N50 contig size of about 31Kbp.  The assembly size (~3.2Gbp) was bigger than the estimated genome size (~2Gbp), which points to relatively high heterozygosity of the plant.

Monday, March 26, 2018

Pre-release version MaSuRCA 3.2.6beta

Over the past several weeks I have been working on improving stability of MaSuRCA.  I thank all users who reported problems to me and I have been addressing these problems in the code. The improved pre-release version of MaSuRCA 3.2.6beta is posted here:

ftp://ftp.genome.umd.edu/pub/MaSuRCA/beta/MaSuRCA-3.2.6b.tar.gz

This is a maintenance release.  There are no new features from 3.2.4 version, but there are many stability and performance improvements based on the feedback from the users (AGAIN BIG THANKS EVERYONE!!!) and my own use of MaSuRCA with the assemblies that I run.

List of major improvements:

1. workaround for an occasional failure in overlap correction
2. workaround for a unitig consensus failure in Illumina-only assemblies
3. performance and stability improvements for running mega-reads on an SGE grid
4. cleaned up the code and improved re-starting assemblies with Illumina-only data
5. updated version of MUMmer4 included
6. improved compilation and install script on platforms where @ is present in the PWD
7. fixed bugs and improved performance of the assembly polishing code
8. speed and stability improvements to the Oxford Nanopore correction code
9. fixed bug that resulted in gap filling running in endless loop

The complete list of bugfixes and improvements for masurca and its submodules can be found on github https://github.com/alekseyzimin

I would like this release to be a stable point before I continue adding new features.  Please let me know in the comments if you have any issues with this release.  I will remove the beta status after 2 weeks of testing and post it as an official release.

Monday, January 22, 2018

MaSuRCA is now on github



MaSuRCA has a new home on github at https://github.com/alekseyzimin/masurca. MaSuRCA combines jellyfish, QuORUM, and other modules into one repository. The individual modules are submodules in the repository. The master branch of the masurca repository tracks the latest working commits. To check out and compile MaSuRCA, do the following:

git clone https://github.com/alekseyzimin/masurca

cd masurca

git submodule init

git submodule update

make

MaSuRCA will compile under build/inst/bin/

To create a distribution, run make install. This will create the MaSuRCA-3.2.4.tar.gz distributable tarball.

EDIT: to compile MaSuRCA from development tree, you will need the following dependencies:
swig and yaggo (http://www.swig.org/ and https://github.com/gmarcais/yaggo). Both must be available on the path.

Please post all questions and bug reports under "issues" in github: https://github.com/alekseyzimin/masurca/issues

Friday, January 12, 2018

New MaSuRCA version 3.2.4

I have just finished testing a new release of MaSuRCA, version 3.2.4. The major improvement in this version is the ability to run the hybrid assembly (Illumina+PacBio/Oxford Nanopore data) on a grid.  At this point only SGE is supported; I am working on SLURM support, which will be implemented shortly. Other improvements include:

1. gzipped fasta/fastq input files of PacBio/Oxford Nanopore reads supported
2. general speed and accuracy improvements
3. minor bugfixes based on user feedback

The new version is designed to allow mammalian genome assembly on a grid of computers with 128GB of RAM.

The new release is available here
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.4.tar.gz

I am now updating the MaSuRCA manual to reflect the new options for grid execution, and I will upload it later today.