Thursday, June 8, 2017

MaSuRCA assembly of NA12878 low-coverage (7x) Release 4 MinION data set

Recently, the release 4 MinION ultra-long read data set for the human NA12878 genome was posted to GitHub: 14 flowcells, "...23140190547 bases in 1415868 reads, predominantly using the new ultra-long read protocol". This is another amazing data set from the MinION by Oxford Nanopore Technologies.  I estimate the cost of this data at less than $8000. Combined with 100x Illumina data, this becomes a ~$10000 human genome data set.  I was interested in MaSuRCA's performance on this data set alone (without using the rel3 data).  The long reads from this data set cover the human genome at about 7x.
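The 7x figure and the average read length can be checked with simple arithmetic on the numbers quoted above. The sketch below is only a sanity check, not part of MaSuRCA: the base and read counts come from the release announcement, while the genome size of 3.1 Gb is my assumption (the announcement does not state one), and the function names are made up for illustration.

```python
# Figures quoted in the release 4 announcement.
TOTAL_BASES = 23_140_190_547
TOTAL_READS = 1_415_868

# Assumption: approximate human genome size in bases (not given in the post).
GENOME_SIZE = 3_100_000_000


def mean_read_length(total_bases, total_reads):
    """Average read length in bases."""
    return total_bases / total_reads


def coverage(total_bases, genome_size):
    """Average depth of coverage: sequenced bases per genome base."""
    return total_bases / genome_size


if __name__ == "__main__":
    print(f"mean read length: {mean_read_length(TOTAL_BASES, TOTAL_READS):,.0f} bp")
    print(f"coverage: {coverage(TOTAL_BASES, GENOME_SIZE):.1f}x")
```

With a 3.1 Gb genome this comes out between 7x and 8x, in line with the "about 7x" estimate; a different assumed genome size shifts the number slightly.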
MaSuRCA yielded 21.3Gb of sequence in corrected mega-reads with an average size of 13,406bp.  Compare that to the original MinION read stats: 23.1Gb of sequence with an average size of 16,343bp. The longest MinION read before correction was 1,537,349bp!  After correction, the longest mega-read was still a quite respectable 432,973bp.  I am looking into why the mega-reads were split and whether there is any way to increase the mega-read lengths in this low-coverage data set.
UPDATE:  the assembly finished a few days ago.
NG50 contig size is 921,462bp and NG50 scaffold size is 1,592,643bp, with 2,844,483,168 bases in the assembly.  ALL scaffold gaps are spanned by MinION reads by design of the algorithm. I am looking into ways of building a consensus from the MinION sequence to fill them.
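For readers unfamiliar with the metric: NG50 is the length L such that contigs (or scaffolds) of length L or longer add up to at least half of the estimated genome size, whereas N50 uses half of the assembly size instead. Here is a minimal sketch of that definition. The function name `ng50` and the toy lengths are illustrative only, and the genome size used for the numbers reported above is not stated in this post.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of a list of sequence lengths.

    Sequences are taken from longest to shortest until their cumulative
    length reaches half of genome_size; the length of the sequence that
    crosses that threshold is the NG50. Returns 0 if the sequences
    together cover less than half of genome_size.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy example with made-up contig lengths (not real assembly data).
    contigs = [50, 40, 30, 20, 10]
    print("NG50 for genome size 200:", ng50(contigs, 200))
    # Passing the assembly size as genome_size gives the ordinary N50.
    print("N50:", ng50(contigs, sum(contigs)))
```

Because the threshold is tied to the genome size and not to the assembly size, NG50 values are comparable between assemblies of the same genome even when their total lengths differ.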
The assembly is available here:
It would be interesting to see what 10x Genomics data can bring to this project.  For the mega-reads correction, a 10x Genomics library sequenced as Illumina 2x150bp paired ends works as well as a WGS Illumina paired-end data set.  The cost increase for getting 10x Genomics data vs regular WGS Illumina reads is fairly small. The additional utility of the barcodes on the reads may enable us to scaffold the resulting contigs, potentially into chromosome-sized scaffolds, and to phase haplotypes. The GIAB project has 10x Genomics data for NA12878, and I will be looking into an assembly with these data.