1. run time too long in scaffolder and consensus when using Illumina linking mate pairs or other mater reads such as Sanger or 454
2. failure when combining single end and paired end paired end libraries when USE_LINKING_MATES is turned on
3. accuracy and contiguity of the assembly does not improve uniformly with more PacBio data
4. overlap filtering step after the unitigger takes too long for large data sets
5. mapping in duplicate/haplotype filtering is faster, because I excluded the singleton unitigs from the mapping
6. Illumina mate pairs are now properly used in scaffolding of assemblies that contain PacBio or Nanopore reads
Thanks to all MaSuRCA users who reported their issues to me and I hope I was able to address most of them.
I also changed the way super-reads are mapped to the PacBio and Nanopore reads. The problem was in connectivity of the super-reads -- to snap two super-reads together to create a mega-read, an overlap of K or more is needed, and we try to maximize K to create longer super-reads e.g. K could be 127bp. For example we have super-reads A_B_C, C_D and D_E_F. To connect super-reads A_B_C and D_E_F to create the mega-read A_B_C_D_E_F, we need to have A_B_C and D_E_F to have exact overlaps with C_D. So if for some reason C_D is lost in the mapping a gap may be created. But the gap does not need to be there because A_B_C and D_E_F may still overlap exactly, with the overlap shorter than K. Now the super-reads are pre-processed to reduce K to 41 which effectively reduces the minimum overlap length for creating mega-reads from super-reads while still keeping the principle of the algorithm intact.
I feel that MaSuRCA mega-reads is not mature enough and tested to give it Release Candidate status. The new MaSuRCA is available here:
Please post any issues with the new version in comments to this blog post.