Tuesday, March 23, 2021

Making MaSuRCA easier to use for small projects

MaSuRCA is a flexible and versatile assembler. It can produce high-quality assemblies from Illumina reads and long, high-error reads. For big projects that use data from many Illumina, Nanopore or PacBio instrument runs and may need adjustments to assembly parameters, the configuration file is a convenient way to input all data and set all parameters. The configuration file is well commented, and a sample (sr_config_example.txt) is created during installation. However, many small projects, such as assemblies of bacteria and small eukaryotes, use data from a single Illumina run and sometimes a single Nanopore or PacBio flowcell. Creating or editing a configuration file for such small projects is bothersome, so I created a simpler way to run MaSuRCA for them. It is available in the latest MaSuRCA version, 4.0.3, and it runs the full MaSuRCA assembly pipeline with default settings.

Releases · alekseyzimin/masurca (github.com)

If your project uses data from a single Illumina run that produced either one file of single-end reads or two files of paired end reads, and optionally a single file containing long Nanopore or PacBio reads, you can skip creating a configuration file and use a simple command-line interface to run MaSuRCA. The options are described in the usage message displayed by the -h or --help switch. There are three command-line switches: -t specifies the number of threads to use, -i specifies the paths to the Illumina reads files (for paired end reads, the two files separated by a comma), and -r specifies the path to the long reads file. For example:

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa

will run assembly with only Illumina paired end reads from files /path_to/pe_R1.fa (forward) and /path_to/pe_R2.fa (reverse). An example of the hybrid assembly:

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa -r /path_to/nanopore.fastq.gz

This command will run a hybrid assembly, correcting Nanopore reads with Illumina data first. Illumina paired end reads files must be in fastq format and can be gzipped. Nanopore/PacBio data files for the -r option can be fasta or fastq and can also be gzipped.
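The input rules above can be checked before a run is launched. The following is a minimal sketch and not part of MaSuRCA: the helper names read_format and build_masurca_cmd, the MASURCA_BIN variable, and the idea of guessing the format from the first character of a file are all assumptions made for illustration. Only the -t, -i and -r switches come from the text above. The helper prints the command line instead of executing it.

```shell
#!/bin/sh
# Hypothetical pre-flight helper (not shipped with MaSuRCA).
# MASURCA_BIN is an assumed variable; the default is the placeholder path used above.
MASURCA_BIN="${MASURCA_BIN:-/path_to_MaSuRCA/bin/masurca}"

# Guess the format of a reads file from its first character:
# '@' starts a fastq record, '>' starts a fasta record.
read_format() {
  case "$1" in
    *.gz) rf_first=$(gzip -cd "$1" 2>/dev/null | head -c 1) ;;
    *)    rf_first=$(head -c 1 "$1" 2>/dev/null) ;;
  esac
  case "$rf_first" in
    "@") echo fastq ;;
    ">") echo fasta ;;
    *)   echo unknown ;;
  esac
}

# Usage: build_masurca_cmd THREADS ILLUMINA_FILES [LONG_READS_FILE]
# ILLUMINA_FILES is one file, or two files separated by a comma.
# Paths containing spaces or commas are not handled in this sketch.
build_masurca_cmd() {
  bm_threads=$1
  bm_illumina=$2
  bm_long=${3:-}

  # Reject an empty list, stray commas, or more than two files.
  case "$bm_illumina" in
    ""|,*|*,|*,*,*)
      echo "error: -i takes one file or two comma-separated files" >&2
      return 1 ;;
  esac

  bm_first=${bm_illumina%%,*}
  bm_second=
  case "$bm_illumina" in
    *,*) bm_second=${bm_illumina#*,} ;;
  esac

  # Illumina reads must be fastq (optionally gzipped).
  for bm_f in "$bm_first" "$bm_second"; do
    [ -n "$bm_f" ] || continue
    if [ "$(read_format "$bm_f")" != fastq ]; then
      echo "error: $bm_f does not look like a fastq file" >&2
      return 1
    fi
  done

  bm_cmd="$MASURCA_BIN -t $bm_threads -i $bm_illumina"

  # Long reads may be fasta or fastq (optionally gzipped).
  if [ -n "$bm_long" ]; then
    if [ "$(read_format "$bm_long")" = unknown ]; then
      echo "error: $bm_long is neither fasta nor fastq" >&2
      return 1
    fi
    bm_cmd="$bm_cmd -r $bm_long"
  fi

  echo "$bm_cmd"
}
```

Calling build_masurca_cmd 32 pe_R1.fastq,pe_R2.fastq nanopore.fastq.gz prints the hybrid assembly command if all three files pass the checks. The sketch inspects file content rather than file names, so it does not depend on any particular file extension.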

Tuesday, January 26, 2021

MaSuRCA 4.0.1 release

This release contains major improvements in the speed of hybrid assemblies. Thanks to the new k-unitig pre-correction algorithm, the mega-reads algorithm, which corrects long high-error reads from Oxford Nanopore or Pacific Biosciences platforms, is now about 6 times faster for large genomes. The new algorithm also eliminates the need to run a second pass of mega-reads, which lowers memory requirements. Together these changes bring major improvements in run times, especially for big genomes. It is now possible to run a hybrid assembly of a human genome starting with ~60x Illumina paired end data and ~30x Oxford Nanopore data in less than 6 days on a small computing cluster with ~200 CPU cores. Bigger clusters allow assembly run times of as little as 2-3 days. The MaSuRCA hybrid technique outputs a high-quality consensus that does not require any polishing. Thus MaSuRCA assemblies can be used without any additional post-processing for downstream steps, such as gene annotation, in any genome project.

This release also improves compatibility with the SLURM scheduler by eliminating the second pass of mega-reads, which was unstable on some systems. There are also many stability and efficiency improvements. Here are some highlights:

  1. worked around the "consensus sequence mismatch" error in POLCA that occurred rarely in complex sequence regions
  2. chromosome scaffolder now picks low and high coverage thresholds automatically based on mapped read coverage
  3. improved chromosome scaffolder speed and accuracy using more efficient algorithms
  4. updated the version of MUMmer to 4.0.0rc1
  5. fixed a bug that sometimes caused minor under-reporting of actual errors in POLCA report file
  6. updated code for reference-assisted assembly for better performance when multiple references are used

The release is available here:

https://github.com/alekseyzimin/masurca/releases/download/v4.0.1/MaSuRCA-4.0.1.tar.gz
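For scripted setups, fetching the release can be sketched as follows. This is an illustration under stated assumptions, not an official installer: the function names masurca_release_url and fetch_masurca are hypothetical, and the sketch assumes that other releases follow the same URL pattern as the 4.0.1 link above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of MaSuRCA).

# Build the tarball URL for a given version.
# Assumption: every release uses the same naming pattern as the 4.0.1 link.
masurca_release_url() {
  echo "https://github.com/alekseyzimin/masurca/releases/download/v$1/MaSuRCA-$1.tar.gz"
}

# Download the tarball into the current directory and unpack it.
# Nothing is downloaded until this function is called explicitly.
fetch_masurca() {
  fm_version=$1
  fm_url=$(masurca_release_url "$fm_version")
  # -L follows the redirect GitHub issues for release assets;
  # -O saves the file under its remote name.
  curl -L -O "$fm_url" && tar -xzf "MaSuRCA-$fm_version.tar.gz"
}
```

Running fetch_masurca 4.0.1 downloads and unpacks the tarball. The build and installation steps are described in the README inside the unpacked distribution.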