Friday, January 12, 2018

New MaSuRCA version 3.2.4

I have just finished testing a new release of MaSuRCA, version 3.2.4. The major improvement in this version is the ability to run the hybrid assembly (Illumina + PacBio/Oxford Nanopore data) on a grid. At this point only SGE is supported; I am working on SLURM support, which will be implemented shortly. Other improvements include:

1. gzipped fasta/fastq input files of PacBio/Oxford Nanopore reads are now supported
2. general speed and accuracy improvements
3. minor bugfixes based on user feedback

The new version is designed to allow mammalian genome assembly on a grid of computers with 128GB of RAM.

The new release is available here

I am now updating the MaSuRCA manual to reflect the new options for grid execution, and I will upload it later today.

Thursday, November 2, 2017

News article about Bread wheat genome assembled with MaSuRCA and Falcon

According to this article in Nature News, we scooped the IWGSC (International Wheat Genome Sequencing Consortium) in publishing the bread wheat assembly, the most complete and contiguous to date.

Wednesday, October 25, 2017

Note on ploidy estimation in MaSuRCA

Since version 3.2.2, MaSuRCA uses modified algorithms and settings for assembly of heterozygous diploid/polyploid genomes. For this reason there is a ploidy setting that is auto-computed and saved in PLOIDY.txt. Valid values for PLOIDY are 1 and 2. Editing this file forces the assembler to use the ploidy indicated there.

Ploidy 1 means haploid and ploidy 2 means diploid. This is a gross over-simplification that is used in the assembler for the time being.

Ploidy for non-clonal genomes is always 2, but for the internal algorithms ploidy 1 means that the genome is relatively inbred and ploidy 2 means that it is relatively outbred. The reasoning is that in most genomes a proportion of the sequence is conserved between the two haplotypes, and a proportion is divergent. I treat ploidy as a measure of the ratio of the total amount of unique sequence in the genome to the haploid genome size. This is a number between 1 and 2: 1 means no divergence (the homologous chromosomes are identical) and 2 means the two haplotypes are 100% different. At this time I do not treat this as a floating parameter between 1 and 2; instead I set a threshold in the middle based on a heuristic computation. This is an over-simplification, and I will introduce a refinement of this parameter in later versions.
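To make the ratio concrete, here is a minimal Python sketch of the idea described above. It is not MaSuRCA code: the function names are made up for illustration, and the cutoff of 1.5 is only an assumption standing in for "a threshold in the middle"; the heuristic the assembler actually uses is not described in this post.

```python
def ploidy_ratio(unique_bases: int, haploid_genome_size: int) -> float:
    """Ratio of total unique sequence in the genome to haploid genome size."""
    if haploid_genome_size <= 0:
        raise ValueError("haploid genome size must be positive")
    return unique_bases / haploid_genome_size


def ploidy_setting(ratio: float, threshold: float = 1.5) -> int:
    """Collapse the ratio to 1 or 2.

    threshold=1.5 is an assumed placeholder, not the assembler's value.
    """
    if not 1.0 <= ratio <= 2.0:
        raise ValueError("ratio is expected to lie between 1 and 2")
    return 2 if ratio >= threshold else 1


# Toy genome: 1 Mb haploid size, 90% of it shared between haplotypes and
# 10% divergent. Unique sequence = 900 kb shared + 2 x 100 kb divergent.
unique = 900_000 + 2 * 100_000
ratio = ploidy_ratio(unique, 1_000_000)
print(ratio, ploidy_setting(ratio))
```

In this toy case the ratio is 1.1, so the genome would be treated as relatively inbred (ploidy 1) under the assumed cutoff.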

Triticum aestivum (bread wheat) assembly paper is out.

The Triticum aestivum (bread wheat) assembly paper has just appeared in GigaScience. It took 100 CPU-years to assemble the 16Gb bread wheat genome with MaSuRCA and Falcon!!!

Wednesday, September 27, 2017

MaSuRCA hybrid assembly strategy and recent results on Illumina and Oxford Nanopore data video presentation

I have just uploaded to YouTube my presentation on the latest MaSuRCA mega-reads results on assembly of Illumina and Oxford Nanopore MinION human genome data. In this presentation I describe a de novo human genome assembly of the NA12878 data set with an N50 contig size of over 1Mb from $10000 worth of sequencing data, and outline the MaSuRCA mega-reads strategy.

Friday, September 15, 2017

New version MaSuRCA 3.2.3

I have just finished testing the new version of MaSuRCA, version 3.2.3. The only notable new feature in this version is gap closing for assemblies that use PacBio/Oxford Nanopore data. The other changes are all improvements related to stability, usability and speed:

1. Added scaffold gap closing for hybrid assemblies that use PacBio/Oxford Nanopore
2. Improved the speed and stability of filter for Illumina mate pairs
3. Estimated genome size and ploidy are now saved and can be read from the ESTIMATED_GENOME_SIZE.txt and PLOIDY.txt files.
4. Nucmer now runs multi-threaded when SoapDenovo2 is used as contigger/scaffolder, for filtering out redundant small contigs after gap closing.
5. updated MUMmer to the latest version
6. many small performance improvements to avoid re-running steps if they have been run on assembler re-start

The new version is available from my ftp site:

Tuesday, July 11, 2017

FTP server back online

I had an ftp server outage for a few days, because of two disk failures.  I replaced the disks and everything is back online as of today.

Thursday, June 8, 2017

MaSuRCA assembly of NA12878 low-coverage (7x) Release 4 MinION data set

Recently, the release 4 MinION ultra-long read data set for the human NA12878 genome was posted to GitHub: 14 flowcells, "...23140190547 bases in 1415868 reads, predominantly using the new ultra-long read protocol". This is another amazing data set from MinION by Oxford Nanopore Technologies. I estimate the cost of this data at less than $8000. Combined with 100x Illumina data, this becomes a ~$10000 human genome data set. I was interested in MaSuRCA performance on this data set alone (without using the rel3 data). The long reads from this data set cover the human genome at about 7x.
MaSuRCA mega-reads yielded 21.3Gb of sequence in corrected mega-reads with an average size of 13,406bp. Compare that to the original MinION read stats: 23.1Gb of sequence with an average size of 16,343bp. The longest MinION read before correction was 1,537,349bp!!! After correction, the longest mega-read was still a quite respectable 432,973bp. I am looking into why the mega-reads were split and whether there is any way to increase the mega-read lengths in this low-coverage data set.
UPDATE:  the assembly finished a few days ago.
NG50 contig size is 921,462 and NG50 scaffold size is 1,592,643 with 2,844,483,168 bases in the assembly.  ALL scaffold gaps are spanned with MinION reads by design of the algorithm. I am looking into ways of creating MinION sequence consensus to fill them.
The assembly is available here:
It would be interesting to see what 10x Genomics data can bring to this project. For the mega-reads correction, a 10x Illumina 2x150bp data set sequenced as paired ends works as well as the WGS Illumina paired-end data set. The cost increase for getting 10x data vs regular WGS Illumina reads is fairly small. The additional utility of the barcodes on the reads may enable us to scaffold the resulting contigs, potentially into chromosome-sized scaffolds, and to phase haplotypes. The GIAB project has 10x data for NA12878 and I will be looking into an assembly with these data.
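The read statistics quoted in this post can be checked with a few lines of arithmetic. The sketch below uses the base and read counts from the rel4 announcement quoted above; the 3Gb genome size is an assumption (the same round figure I use for NG50 elsewhere on this blog), and the function names are only illustrative.

```python
TOTAL_BASES = 23_140_190_547   # from the rel4 announcement
READ_COUNT = 1_415_868         # from the rel4 announcement
GENOME_SIZE = 3_000_000_000    # assumed round human genome size (3Gb)


def mean_read_length(total_bases: int, read_count: int) -> float:
    """Average read length in bases."""
    return total_bases / read_count


def coverage(total_bases: int, genome_size: int) -> float:
    """Average depth of coverage of the genome by the reads."""
    return total_bases / genome_size


mean_len = mean_read_length(TOTAL_BASES, READ_COUNT)
depth = coverage(TOTAL_BASES, GENOME_SIZE)
print(f"mean read length: {mean_len:,.0f} bp")
print(f"coverage: {depth:.1f}x")
```

With these inputs the mean read length comes out at about 16,343bp, matching the figure above, and the coverage at between 7x and 8x, depending on the genome size one assumes.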

Friday, May 19, 2017

Please upgrade to MaSuRCA release 3.2.2

Dear Users, I want to stress one more time that the MaSuRCA 3.2.2 release brings significant improvements in speed (>2x), accuracy and disk usage footprint (reduced by about a factor of 10), as well as many bugfixes, compared to the previous 3.2.2_RCX and 3.2.1_XXXXXXXX versions. Please avoid using the previously released beta versions now that a release is available. I urge you to take a moment to update your installation of MaSuRCA to the latest version available here:

Monday, May 15, 2017

Human NA12878 hybrid 30x MinION+100x Illumina assembly by MaSuRCA 3.2.2

As a final test of the release version of the MaSuRCA 3.2.2 assembler, I created an assembly of the human NA12878 data set from ~30x coverage of Oxford Nanopore data (rel3) and ~100x coverage of Illumina data (GIAB project). The data is described in detail at the end of this post.

The hybrid assembly took about 50,000 CPU-hours on my AMD Opteron 6000-series 400-core cluster. The assembly had the following quantitative statistics (G=3Gb):

Sequence in scaffolds: 2.88Gb
NG50 scaffold size: 5.04Mb
NG50 contig size: 4.06Mb
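
For readers unfamiliar with the metric, NG50 is the length of the sequence at which the cumulative length, counted from the longest sequence down, first reaches half of the assumed genome size G (N50 uses the assembly's own total length instead of G). Below is a minimal Python sketch of that standard definition; it is not the script I used to compute the numbers above, and the contig lengths are a toy example.

```python
from typing import Iterable


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """Length of the sequence at which the running total, taken from the
    longest sequence down, first reaches half of genome_size.
    Returns 0 if the sequences never add up to half of genome_size."""
    half = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50 is NG50 with the assembly's own total length as the genome size."""
    lengths = list(lengths)
    return ng50(lengths, sum(lengths))


contigs = [50, 40, 30, 20, 10]          # toy contig lengths
print(ng50(contigs, genome_size=200))   # 30
print(n50(contigs))                     # 40
```

Because G is fixed, NG50 values are directly comparable between assemblies of the same genome even when the assemblies differ in total size.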

Update 05/23/2017:  Just for the sake of comparison here are the stats of the MaSuRCA assembly that used only the Illumina data:

Sequence in scaffolds: 2.81Gb
NG50 scaffold size: 0.065Mb
NG50 contig size: 0.068Mb

Addition of the nanopore data makes a huge difference in the assembly contiguity.

Compared to the available Canu Nanopore-only assembly of the rel3 data, the MaSuRCA assembly has a 37% bigger N50 contig size and about 9% more sequence:

The MaSuRCA assembly aligns to the GRCh38.p10 human reference (this is not the same human) at an average identity of 99.73% for 1-to-1 alignments.

The 1-to-1 alignments to the Illumina-only assembly of NA12878 have an average identity of 99.96%, implying a consensus error rate of about 4 errors per 10,000 bases. Below is an alignment plot, created by mummerplot, of the biggest scaffold in the assembly (22.4Mb) aligning to the corresponding GRCh38 chromosome; it shows excellent large-scale agreement:

The assembly is available here:

I am now working on validation of the structure of the assembly and classification of consensus errors.

Data used. For this assembly I used the publicly available Oxford Nanopore MinION data set rel3:
and about 100x coverage by Illumina 2x150bp HiSeq 2500 paired-end reads from the Genome In A Bottle project: .
I used the following Illumina runs from that FTP site: