MaSuRCA genome assembly package: 2017

Thursday, November 2, 2017

News article about Bread wheat genome assembled with MaSuRCA and Falcon

According to this article in Nature News, we scooped the IWGSC (International Wheat Genome Sequencing Consortium) in publishing the bread wheat assembly, which is the most complete and contiguous to date.

http://www.nature.com/news/small-group-scoops-international-effort-to-sequence-huge-wheat-genome-1.22924

Wednesday, October 25, 2017

Note on ploidy estimation in MaSuRCA

Since version 3.2.2, MaSuRCA uses modified algorithms and settings for assembly of heterozygous diploid/polyploid genomes. The ploidy setting is auto-computed and saved in PLOIDY.txt. Valid values for PLOIDY are 1 and 2. Editing this file forces the assembler to use the ploidy you specify.

Ploidy 1 means haploid and ploidy 2 means diploid. This is a gross over-simplification that is used in the assembler for the time being.

Ploidy for non-clonal genomes is always 2, but for the internal algorithms ploidy 1 means that the genome is relatively inbred and ploidy 2 means that it is relatively outbred. The reasoning is that in most genomes a proportion of the sequence is conserved between the two haplotypes and a proportion is divergent. I treat ploidy as a measurement of the ratio of the total amount of unique sequence in the genome to the haploid genome size. This is a number between 1 and 2: 1 means no divergence (the homologous chromosomes are identical) and 2 means the two haplotypes are 100% different. At this time I do not treat this as a floating parameter between 1 and 2; instead I set a threshold in the middle based on a heuristic computation. This is an over-simplification and I will introduce a refinement of this parameter in later versions.
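As a rough illustration of the ratio described above, here is a minimal Python sketch. The function names, the example genome sizes and the 1.5 cut-off are all assumptions made for illustration; the post only says the threshold sits "in the middle" and is computed heuristically, so this is not the code MaSuRCA runs.

```python
def divergence_ratio(unique_sequence_bp: int, haploid_genome_bp: int) -> float:
    """Ratio of total unique sequence to haploid genome size, clamped to [1, 2]."""
    if haploid_genome_bp <= 0:
        raise ValueError("haploid genome size must be positive")
    ratio = unique_sequence_bp / haploid_genome_bp
    return min(2.0, max(1.0, ratio))


def ploidy_setting(ratio: float, threshold: float = 1.5) -> int:
    """Collapse the continuous ratio to the two accepted values, 1 or 2.

    The 1.5 default is an illustrative midpoint, not MaSuRCA's actual heuristic.
    """
    return 2 if ratio >= threshold else 1


# Hypothetical genomes, both with a 100 Mb haploid size.
inbred = divergence_ratio(105_000_000, 100_000_000)   # little divergent sequence
outbred = divergence_ratio(170_000_000, 100_000_000)  # a lot of divergent sequence
print(ploidy_setting(inbred), ploidy_setting(outbred))
```

With these made-up inputs the first genome falls below the cut-off and the second above it, which mirrors the inbred/outbred distinction described in the post.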

Triticum aestivum (Bread Wheat) assembly paper is out.

The Triticum aestivum (Bread Wheat) assembly paper has just appeared in GigaScience. It took 100 CPU-years to assemble the 16Gb Bread Wheat genome with MaSuRCA and Falcon!!!
https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/gix097/4561661/The-first-near-complete-assembly-of-the-hexaploid

Wednesday, September 27, 2017

MaSuRCA hybrid assembly strategy and recent results on Illumina and Oxford Nanopore data video presentation

I have just uploaded to YouTube (https://youtu.be/BEsDkjYKHCE) my presentation on the latest MaSuRCA mega-reads results on assembly of Illumina and Oxford Nanopore MinION human genome data. In this presentation I describe a de novo human genome assembly of the NA12878 data set with an N50 contig size of over 1Mb from $10000 worth of sequencing data, and I outline the MaSuRCA mega-reads strategy.

Friday, September 15, 2017

New version MaSuRCA 3.2.3

I have just finished testing the new version of MaSuRCA, version 3.2.3. The only notable new feature in this version is gap closing for assemblies that use PacBio/Oxford Nanopore data. The other changes are all improvements related to stability, usability and speed:

1. Added scaffold gap closing for hybrid assemblies that use PacBio/Oxford Nanopore
2. Improved the speed and stability of filter for Illumina mate pairs
3. Ploidy and estimated genome size are now saved and can be read from the PLOIDY.txt and ESTIMATED_GENOME_SIZE.txt files
4. Nucmer now runs multi-threaded when SOAPdenovo2 is used as contigger/scaffolder, for filtering out redundant small contigs after gap closing
5. Updated MUMmer to the latest version
6. Many small performance improvements to avoid re-running steps that have already completed when the assembler is restarted

The new version is available from my ftp site:

ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/

Tuesday, July 11, 2017

FTP server back online

I had an ftp server outage for a few days, because of two disk failures. I replaced the disks and everything is back online as of today.

Thursday, June 8, 2017

MaSuRCA assembly of NA12878 low-coverage (7x) Release 4 MinION data set

Recently, the release 4 MinION ultra-long read data set for the human NA12878 genome was posted to GitHub: https://github.com/nanopore-wgs-consortium/NA12878. It comprises 14 flowcells, "...23140190547 bases in 1415868 reads, predominantly using the new ultra-long read protocol". This is another amazing data set from MinION by Oxford Nanopore Technologies. I estimate the cost of this data at less than $8000. Combined with 100x Illumina data, this becomes a ~$10000 human genome data set. I was interested in MaSuRCA performance on this data set alone (without using the rel3 data). The long reads from this data set cover the human genome at about 7x.
MaSuRCA mega-reads yielded 21.3Gb of sequence in corrected mega-reads with an average size of 13,406bp. Compare this to the original MinION read stats: 23.1Gb of sequence with an average size of 16,343bp. The longest MinION read before correction was 1,537,349bp!!! After correction, the longest mega-read was still a quite respectable 432,973bp. I am looking into why the mega-reads were split and whether there is any way to increase the mega-read lengths in this low-coverage data set.
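The read statistics quoted above can be checked with a few lines of arithmetic. This Python sketch uses only the base and read counts from the post; the genome size of 3.1Gb is my assumption for illustration (the post does not say which value was used for the "about 7x" estimate), and the function names are hypothetical.

```python
def mean_read_length(total_bases: int, read_count: int) -> float:
    """Average read length in bases."""
    return total_bases / read_count


def coverage(total_bases: int, genome_size: int) -> float:
    """Average depth of coverage for a genome of the given size."""
    return total_bases / genome_size


RAW_BASES = 23_140_190_547    # from the release 4 description
RAW_READS = 1_415_868         # from the release 4 description
GENOME_SIZE = 3_100_000_000   # assumed human genome size, not stated in the post

print(round(mean_read_length(RAW_BASES, RAW_READS)))  # 16343
print(coverage(RAW_BASES, GENOME_SIZE))

# Fraction of raw sequence retained in corrected mega-reads (21.3Gb of 23.1Gb).
retained = 21.3 / 23.1
print(f"{retained:.1%}")
```

The coverage lands between 7x and 8x for any reasonable human genome size, consistent with the rough figure in the post.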
UPDATE: the assembly finished a few days ago.
NG50 contig size is 921,462 and NG50 scaffold size is 1,592,643 with 2,844,483,168 bases in the assembly. ALL scaffold gaps are spanned with MinION reads by design of the algorithm. I am looking into ways of creating MinION sequence consensus to fill them.
The assembly is available here:
ftp://ftp.genome.umd.edu/pub/NA12878/assembly.7xlong.fa
It would be interesting to see what 10x Genomics data can bring to this project. For the mega-reads correction, a 10x Genomics Illumina 2x150bp data set sequenced as paired ends works as well as the WGS Illumina paired-end data set. The cost increase for getting 10x Genomics data versus regular WGS Illumina reads is fairly small. The additional utility of the barcodes on the reads may enable us to scaffold the resulting contigs, potentially into chromosome-sized scaffolds, and to phase haplotypes. The GIAB project has 10x Genomics data for NA12878 and I will be looking into an assembly with these data.

Friday, May 19, 2017

Please upgrade to MaSuRCA release 3.2.2

Dear Users, I wanted to stress one more time that there have been significant improvements in speed (>2x), accuracy and disk usage footprint (reduced by about a factor of 10), as well as many bugfixes, in the MaSuRCA 3.2.2 release compared to the previous 3.2.2_RCX and 3.2.1_XXXXXXXX versions. Please avoid using previously released beta versions now that a release is available. I urge you to take a moment to update your installation of MaSuRCA to the latest version, available here: ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.2.tar.gz

Monday, May 15, 2017

Human NA12878 hybrid 30x MinION+100x Illumina assembly by MaSuRCA 3.2.2

As a final test of the release version of the MaSuRCA 3.2.2 assembler, I created an assembly of the human NA12878 data set from ~30x coverage of Oxford Nanopore data (rel3, https://github.com/nanopore-wgs-consortium/NA12878) and ~100x coverage Illumina data (GIAB project). The data is described in detail at the end of this post.

The hybrid assembly took about 50,000 CPU-hours on my AMD Opteron 6000-series 400-core cluster. The assembly had the following quantitative statistics (G=3Gb):

Sequence in scaffolds: 2.88Gb

NG50 scaffold size: 5.04Mb

NG50 contig size: 4.06Mb

Update 05/23/2017: Just for the sake of comparison here are the stats of the MaSuRCA assembly that used only the Illumina data:

Sequence in scaffolds: 2.81Gb

NG50 scaffold size: 0.065Mb

NG50 contig size: 0.068Mb

Addition of the nanopore data makes a huge difference in the assembly contiguity.
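For readers unfamiliar with the statistic, the NG50 values quoted above are computed against a fixed genome size G (3Gb here) rather than against the total assembly size, which makes assemblies of different total length comparable. Below is a minimal Python sketch of the standard definition; the function names and the toy lengths are illustrative, and this is not the script used to produce the numbers in this post.

```python
from typing import Iterable


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """Largest length L such that sequences of length >= L cover at least
    half of genome_size. Returns 0 if the assembly is too small to reach it."""
    half = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50 is the same statistic computed against the assembly's own total size."""
    lengths = list(lengths)
    return ng50(lengths, sum(lengths))


toy_contigs = [50, 40, 30, 20, 10]   # toy lengths, total 150
print(n50(toy_contigs))              # 40
print(ng50(toy_contigs, 200))        # 30
```

When the assembly is smaller than the assumed genome size, as in the second call, NG50 comes out lower than N50.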

Compared to the available Canu Nanopore-only assembly of the rel3 data, the MaSuRCA assembly has a 37% bigger N50 contig size and about 9% more sequence:
https://genomeinformatics.github.io/NA12878-nanopore-assembly/

The MaSuRCA assembly aligns to the GRCh38.p10 human reference (this is not the same human) at an average identity of 99.73% for 1-to-1 alignments.

The 1-to-1 alignments to the Illumina-only assembly of NA12878 have an average identity of 99.96%, implying a consensus error rate of 4 errors per 10,000 bases. Below is an alignment plot, created by mummerplot, of the biggest scaffold in the assembly (22.4Mb) aligned to the corresponding GRCh38 chromosome; it shows excellent large-scale agreement:

The assembly is available here:
ftp://ftp.genome.umd.edu/pub/NA12878/assembly.fa

I am now working on validation of the structure of the assembly and classification of consensus errors.

Data used. For this assembly I used the publicly available Oxford Nanopore MinION data set rel3:
https://github.com/nanopore-wgs-consortium/NA12878
and about 100x coverage by Illumina 2x150bp HiSeq 2500 paired-end reads from the Genome In A Bottle project:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x .
I used the following Illumina runs from that FTP site:
131219_D00360_005_BH814YADXX
131219_D00360_005_AN81VLADXX
140115_D00360_0009_AH8962ADXX
140207_D00360_0013_AH8G92ADXX

Thursday, May 4, 2017

New version MaSuRCA release 3.2.2

Today I posted a new version that I can finally call a "release" version 3.2.2. You can get it here ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.2.tar.gz . This version has numerous speed and stability improvements based on feedback from the users and my own experience running assembly experiments.

Here are the major changes, in addition to many bugfixes based on the user feedback:

1. based on the user feedback, disabled stones and Extend Clear Ranges steps in CABOG, because they have been failing and/or causing long run times without significant benefit on assemblies with long read PacBio/MinION data.

2. introduced speedups in creation of the mega-reads: a long read that yields a single mega-read in pass 1 is not re-processed in pass 2. This reduced the mega-read correction time by about 25%.

3. introduced contained reads filter before CABOG; by nature of the mega-reads creation from PacBio/MinION reads, many mega-reads end up being exactly contained in other mega-reads and thus they do not contribute any new information in the assembly. I found an efficient way to remove these exact containees, reducing the coverage and the number of reads that CABOG has to deal with.
This improved the run time of CABOG by about a factor of 4. Now a 120Mb plant genome with 30x PacBio coverage and 100x Illumina coverage assembles in about 8 hours on a single 48-core AMD Opteron server; 12Mb yeast data with about the same level of coverage assembles in under 1 hour on the same server.

4. reworked the way assembly failures are reported, making the error messages more informative

5. fixed bugs in the SOAPdenovo2 assembly module thanks to input from Ruibang Luo, author of SOAPdenovo2.

6. removed dependency on "parallel" that has been causing problems for many users
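To make the contained-reads filter in item 3 concrete, here is a deliberately naive Python sketch of the idea: drop any read whose sequence, or its reverse complement, appears exactly inside a longer read that has already been kept. This quadratic scan is only an illustration and is not the efficient method used in MaSuRCA; the function names and toy reads are hypothetical.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_exact_containees(reads: List[str]) -> List[str]:
    """Keep only reads that are not exact substrings (on either strand)
    of a longer or identical read that was already kept."""
    kept: List[str] = []
    # Longest first, so every possible container is examined before its containees.
    for read in sorted(reads, key=len, reverse=True):
        rc = reverse_complement(read)
        if any(read in longer or rc in longer for longer in kept):
            continue
        kept.append(read)
    return kept


toy_reads = ["ACGTTGCA", "GTTG", "CAAC", "TTTT"]
print(drop_exact_containees(toy_reads))  # ['ACGTTGCA', 'TTTT']
```

In the toy input, "GTTG" is contained directly and "CAAC" is contained as a reverse complement, so both are removed without losing any sequence information.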

Thanks to all users who reported their failures and successes to me. Your feedback is extremely valuable; it helps me make MaSuRCA more stable and easier to use!

Wednesday, March 8, 2017

Release Candidate 1 version 3.2.2_RC1

In the past several weeks I have been using MaSuRCA to create improved assemblies of several genomes of varying difficulty and I was able to detect and fix several performance issues and bugs:

1. run time too long in scaffolder and consensus when using Illumina linking mate pairs or other mated reads such as Sanger or 454
2. failure when combining single end and paired end libraries when USE_LINKING_MATES is turned on
3. accuracy and contiguity of the assembly does not improve uniformly with more PacBio data
4. overlap filtering step after the unitigger takes too long for large data sets
5. mapping in duplicate/haplotype filtering is faster, because I excluded the singleton unitigs from the mapping
6. Illumina mate pairs are now properly used in scaffolding of assemblies that contain PacBio or Nanopore reads

Thanks to all MaSuRCA users who reported their issues to me and I hope I was able to address most of them.

The resulting package is much faster and more accurate. Here is an example of an assembly of an Arabidopsis thaliana data set with 50x PacBio data, aligned to the finished genome. Note that the finished sequence is a different species but there should not be any large structural differences. The plot shows a line for each contig, with start and end positions corresponding to the alignment positions in the finished genome (X axis) and in the assembly (Y axis). There are three off-diagonal matches, which correspond to three small misassemblies. The N50 contig size of this assembly is 8Mb, which is close to the roughly 10Mb one can achieve with 100x PacBio data. The NGA50 (N50 after breaking at misassemblies) is 7.1Mbp.

I also changed the way super-reads are mapped to the PacBio and Nanopore reads. The problem was in the connectivity of the super-reads: to snap two super-reads together to create a mega-read, an overlap of K or more bases is needed, and we try to maximize K to create longer super-reads, e.g. K could be 127. For example, suppose we have super-reads A_B_C, C_D and D_E_F. To connect super-reads A_B_C and D_E_F into the mega-read A_B_C_D_E_F, both A_B_C and D_E_F need to have exact overlaps with C_D. So if for some reason C_D is lost in the mapping, a gap may be created. But the gap does not need to be there, because A_B_C and D_E_F may still overlap exactly, with an overlap shorter than K. Now the super-reads are pre-processed to reduce K to 41, which effectively reduces the minimum overlap length for creating mega-reads from super-reads while still keeping the principle of the algorithm intact.
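The effect of lowering the minimum overlap can be shown with a toy Python sketch. It joins two sequences when the suffix of one exactly matches the prefix of the other over at least a given number of bases. The sequences, the thresholds of 8 and 4 (standing in for 127 and 41) and the function names are illustrative assumptions; the real mega-reads code works on super-reads built from k-unitigs, not on plain strings like these.

```python
from typing import Optional


def longest_exact_overlap(left: str, right: str, min_overlap: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`,
    provided it is at least `min_overlap` (>= 1) bases long; otherwise 0."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def join(left: str, right: str, min_overlap: int) -> Optional[str]:
    """Merge two sequences across an exact overlap, or return None if the
    overlap is shorter than `min_overlap`."""
    size = longest_exact_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


left_read = "AAACCCGGGTT"
right_read = "GGGTTACGTAC"            # shares the 5-base overlap GGGTT with left_read

print(join(left_read, right_read, 8))  # None: overlap is shorter than the threshold
print(join(left_read, right_read, 4))  # AAACCCGGGTTACGTAC
```

With the stricter threshold the two reads stay separate and a gap would remain between them; with the relaxed threshold the same exact overlap is enough to connect them.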

I feel that MaSuRCA mega-reads is now mature enough and well enough tested to give it Release Candidate status. The new MaSuRCA is available here:

ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/MaSuRCA-3.2.2_RC1.tar.gz

Please post any issues with the new version in comments to this blog post.

Thursday, January 19, 2017

New version 3.2.1_01032017

I just posted a new version of the assembler. The new version includes the following changes:

1. improved ploidy detection for heterozygous genome assembly; now the ploidy is auto-detected and different settings are used for heterozygous genomes to filter the alternative haplotype
2. improvement in speed and correctness of mega-reads correction of PacBio/nanopore reads; now we require all gaps in mega-reads that are filled with raw PacBio/nanopore reads to have coverage of at least 3x, even for high coverage long read data sets
3. changed local alignment parameters in CA8 overlap/scaffold/consensus to speed up the alignments without apparent loss of sensitivity

Overall the new version should be 10-20% faster and more correct. For heterozygous genomes, one haplotype is output as the final assembly in the dedup.genome.scf.fasta file. Both haplotypes are still available in 9-terminator/genome.scf.fasta.

The new version is posted in
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/