Wednesday, March 8, 2017

Release Candidate 1 version 3.2.2_RC1

In the past several weeks I have been using MaSuRCA to create improved assemblies of several genomes of varying difficulty and I was able to detect and fix several performance issues and bugs:

1. run time too long in scaffolder and consensus when using Illumina linking mate pairs or other mated reads such as Sanger or 454
2. failure when combining single-end and paired-end libraries when USE_LINKING_MATES is turned on
3. accuracy and contiguity of the assembly does not improve uniformly with more PacBio data
4. overlap filtering step after the unitigger takes too long for large data sets
5. mapping in duplicate/haplotype filtering is faster, because I excluded the singleton unitigs from the mapping
6. Illumina mate pairs are now properly used in scaffolding of assemblies that contain PacBio or Nanopore reads

Thanks to all MaSuRCA users who reported their issues to me and I hope I was able to address most of them.

The resulting package is much faster and more accurate. Here is an example of an assembly of an Arabidopsis thaliana data set with 50x PacBio data, aligned to the finished genome. Note that the finished sequence is from a different species, but there should not be any large structural differences. The plot shows a line for each contig, with start and end positions corresponding to the alignment positions in the finished genome (X axis) and in the assembly (Y axis). There are three off-diagonal matches, which correspond to three small misassemblies. The N50 contig size of this assembly is 8Mb, which is close to the roughly 10Mb one can achieve with 100x PacBio data. The NGA50 (N50 after breaking at misassemblies) is 7.1Mbp.
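For readers unfamiliar with these statistics, here is a minimal Python sketch of how an N50 can be computed, and how breaking contigs at misassembly points lowers it. The function names and the toy contig lengths are made up for illustration; this is not the code used to produce the numbers above, and NGA50 as reported by evaluation tools is additionally based on aligned blocks and the reference genome size.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L cover half the total."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def break_contig(length, breakpoints):
    """Split one contig length into pieces at the given breakpoint positions."""
    edges = [0] + sorted(breakpoints) + [length]
    return [end - start for start, end in zip(edges, edges[1:]) if end > start]


# Toy example (illustrative numbers only): four contigs, the largest one
# containing a single misassembly at position 6.
contigs = [10, 4, 3, 3]
print(n50(contigs))                    # 10
pieces = break_contig(10, [6]) + [4, 3, 3]
print(n50(pieces))                     # 4
```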

I also changed the way super-reads are mapped to the PacBio and Nanopore reads. The problem was in the connectivity of the super-reads: to snap two super-reads together to create a mega-read, an overlap of K or more is needed, and we try to maximize K to create longer super-reads, e.g. K could be 127bp. For example, suppose we have super-reads A_B_C, C_D and D_E_F. To connect super-reads A_B_C and D_E_F into the mega-read A_B_C_D_E_F, both A_B_C and D_E_F need to have exact overlaps with C_D. So if for some reason C_D is lost in the mapping, a gap may be created. But the gap does not need to be there, because A_B_C and D_E_F may still overlap exactly, with an overlap shorter than K. Now the super-reads are pre-processed to reduce K to 41, which effectively reduces the minimum overlap length for creating mega-reads from super-reads while still keeping the principle of the algorithm intact.
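The effect of lowering the minimum overlap can be illustrated with a small Python sketch. It works on plain sequence strings with a toy threshold, and the function names are hypothetical; the actual MaSuRCA implementation operates on super-reads and is not shown here.

```python
def longest_exact_overlap(left, right, min_overlap):
    """Longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def join_reads(left, right, min_overlap):
    """Merge two reads on their exact overlap, or return None (a gap)."""
    size = longest_exact_overlap(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Toy sequences sharing a 4-base exact overlap ("TGCA").
a = "ACGTTGCA"
b = "TGCAGGTC"
print(join_reads(a, b, min_overlap=6))  # None: overlap too short, gap remains
print(join_reads(a, b, min_overlap=4))  # ACGTTGCAGGTC
```

With the larger threshold the two reads cannot be connected directly and would need an intermediate read; with the smaller threshold they snap together on their own, which is the situation described above for A_B_C and D_E_F.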

I feel that MaSuRCA mega-reads is now mature enough and tested enough to give it Release Candidate status.  The new MaSuRCA is available here:

Please post any issues with the new version in comments to this blog post.

Thursday, January 19, 2017

New version 3.2.1_01032017

I just posted a new version of the assembler.  The new version includes the following changes:

1. improved ploidy detection for heterozygous genome assembly; now the ploidy is auto-detected and different settings are used for heterozygous genomes to filter the alternative haplotype
2. improvement in speed and correctness of mega-reads correction of PacBio/nanopore reads; now we require all gaps in mega-reads that are filled with raw PacBio/nanopore reads to have >=3 coverage even for high coverage long read data sets
3. changed local alignment parameters in CA8 overlap/scaffold/consensus to speed up the alignments without apparent loss of sensitivity

Overall the new version should be 10-20% faster and more correct.  For heterozygous genomes, one haplotype is output as the final assembly in the dedup.genome.scf.fasta file.  Both haplotypes are still available in 9-terminator/genome.scf.fasta.

The new version is posted in

Tuesday, December 13, 2016

ovlMemory error in hybrid Pacbio/MinION assembly

Please note that if you have the option ovlMemory set in the CA_PARAMETERS section of the config file, the Illumina+Pacbio/MinION assembly will give an error in CABOG.  This is normal behavior: this option is not compatible with CABOG version 8, which is used for the long-read assembly.  I suggest removing any additional options that you may have in the CA_PARAMETERS section of the configuration file, then re-generating and re-running the assembly; it will then proceed without error.
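If you prefer to clean up the configuration file with a script, a minimal Python sketch could look like this. It assumes the options are written on a single line of the form "CA_PARAMETERS = name=value name=value"; the function name and the sample option values are illustrative only. It removes just the one named option, so any other additional options still need to be reviewed by hand.

```python
def drop_ca_option(config_text, option="ovlMemory"):
    """Remove one option from the CA_PARAMETERS line of a config file's text."""
    cleaned = []
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "CA_PARAMETERS":
            # Keep every name=value token except the one being dropped.
            kept = [tok for tok in value.split()
                    if tok.split("=", 1)[0] != option]
            line = (key.rstrip() + " = " + " ".join(kept)).rstrip()
        cleaned.append(line)
    return "\n".join(cleaned)


# Illustrative config fragment (values are made up).
sample = "CA_PARAMETERS = cgwErrorRate=0.15 ovlMemory=4GB\nNUM_THREADS = 16"
print(drop_ca_option(sample))
```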

Comments are welcome

I welcome any comments on the posts.  You can use comments to report bugs on the new versions or ask questions about the new features.  For now all comments are moderated, that is I have to look at every comment and approve it before it appears on the public blog.  This way I can make sure I have looked at all comments.

New version 3.2.1_12132016

I just posted a new MaSuRCA version on the alternative download site:  This version improves speed by running delta-filter in parallel after Nucmer in the deduplication of contigs and unitigs.  The new script is called parallel_delta-filter and its usage is:

parallel_delta-filter delta-prefix "switches" num_processes

The output is the filtered delta-prefix.fdelta file.  delta-prefix is the delta file name without the .delta extension.  "switches" are the command line parameters for delta-filter, e.g. "-q -i 95 -l 100".  num_processes is the number of processes that you wish to run.
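As a sketch of how the three arguments fit together, the Python snippet below builds the argument list for a call to the script. The helper name, the prefix "asm" and the switch values are placeholders; the point is that the delta-filter switches are passed as one quoted argument and that the result lands in the prefix with .fdelta appended.

```python
import shlex


def delta_filter_command(delta_prefix, switches, num_processes):
    """Build the argument list and expected output name for parallel_delta-filter."""
    if delta_prefix.endswith(".delta"):
        raise ValueError("pass the delta file name without the .delta extension")
    argv = ["parallel_delta-filter", delta_prefix, switches, str(num_processes)]
    return argv, delta_prefix + ".fdelta"


# Placeholder values: asm.delta is a hypothetical Nucmer output file.
argv, output_name = delta_filter_command("asm", "-q -l 100", 8)
print(shlex.join(argv))  # parallel_delta-filter asm '-q -l 100' 8
print(output_name)       # asm.fdelta
```

The list can be handed to subprocess.run(argv, check=True) once the script is on your PATH.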

Friday, December 9, 2016

Alternative download site

MaSuRCA source code is always available from  If the site is not available for some reason, the latest version is available at

Documentation update

MaSuRCA documentation has been updated.  The  link on the website points to the correct version of the documentation.  The Quick Start Guide is available at

Thursday, December 8, 2016

New version 3.2.1_12082016

An updated version has been posted: 3.2.1_12082016.  This version adds de-duplication of scaffolds (removal of redundant scaffolds from the assembly) as the final step.  If you need to run the de-duplication manually, you can do so by running "bin/ <CA assembly folder name> genome <NUM_THREADS>" .

Wednesday, December 7, 2016

New version 3.2.1_12062016

A new version of MaSuRCA (3.2.1_12062016) is available.  This version has an extensive list of new features and updates from the previous 3.2.1_10202016 release.  The improvements include:

1. faster unitig consensus (use pbutgcns along with utgcns)
2. faster contig consensus (use pbutgcns along with ctgcns)
3. new feature: deduplication of heterozygous genomes at the unitig stage; one haplotype is removed, and each contig in the assembly is output as single haplotype; however not all contigs represent the same haplotype -- this feature is under development
4. overlap filtering to detect and prevent repeat-induced misassemblies in contigs
5. bugfixes and small improvements in speed and accuracy in Celera Assembler -- the main assembly engine in MaSuRCA


Welcome to the official MaSuRCA development blog maintained by Aleksey Zimin.  This blog is dedicated to announcements and support of the MaSuRCA whole-genome de novo assembly package.

At this time MaSuRCA supports Illumina, Sanger, 454, Oxford Nanopore and PacBio data.  Please refer to the MaSuRCA quick start manual before using MaSuRCA.

Please supply either Pacbio or Nanopore reads (not both) as PACBIO=file.fa or NANOPORE=file.fa in the configuration file. All Pacbio or Nanopore reads must be in a single fasta file!
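If your long reads are spread over several fasta files, they need to be combined first. A minimal Python sketch, with a hypothetical function name and file names, could be:

```python
def merge_fasta(input_paths, output_path, chunk_size=1 << 20):
    """Concatenate FASTA files into one file, streaming in chunks.

    A newline is added after any input that does not end with one, so a
    header line never gets glued to the previous file's last sequence line.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            last_byte = b"\n"
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    last_byte = chunk[-1:]
            if last_byte != b"\n":
                out.write(b"\n")


# Example call (file names are placeholders):
# merge_fasta(["run1.fa", "run2.fa"], "all_pacbio.fa")
```

The merged file is then the one to list as PACBIO=... or NANOPORE=... in the configuration file.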

To obtain the latest SOURCE CODE of MaSuRCA and compilation instructions, see this page. BINARIES for various Linux distributions are NOT YET AVAILABLE and they are coming soon.

The package is freely available under GPL open source license at  If you use MaSuRCA, please cite the following publications: