Tuesday, March 23, 2021

Making MaSuRCA easier to use for small projects

MaSuRCA is a flexible and versatile assembler. It can produce high-quality assemblies from Illumina reads and from long, high-error reads. Big projects often combine data from many Illumina/Nanopore/PacBio instrument runs and may need adjustments to assembly parameters. For those, the configuration file is a convenient way to input all data and set all parameters. The configuration file is well commented, and a sample is created during installation (sr_config_example.txt).

However, many small projects, such as assemblies of bacterial and small eukaryotic genomes, use data from a single Illumina run and sometimes a single Nanopore or PacBio flowcell. Creating or editing a configuration file for such small projects is a nuisance, so I created a new, simpler way to run MaSuRCA for them. It is available in the latest MaSuRCA release, version 4.0.3, and it runs the full MaSuRCA assembly pipeline with default settings.

Releases · alekseyzimin/masurca (github.com)

If your project uses data from a single Illumina run that produced either one file of single-end reads or two files of paired-end reads, and optionally a single file of long Nanopore or PacBio reads, you can skip creating a configuration file and run MaSuRCA through a simple command-line interface. The options are described in the usage message, which is displayed with the -h or --help switch. There are three command-line switches: -i, -t and -r. The -i switch specifies the paths to the Illumina read files (for paired-end data, the two files separated by a comma), -t specifies the number of threads to use, and -r specifies the path to the long reads file. For example:

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa

will run an assembly using only the Illumina paired-end reads from files /path_to/pe_R1.fa (forward) and /path_to/pe_R2.fa (reverse). An example of a hybrid assembly:

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa -r /path_to/nanopore.fastq.gz

This command will run a hybrid assembly, correcting the Nanopore reads with the Illumina data first. Illumina paired-end read files must be in fastq format and can be gzipped. Nanopore/PacBio data files for the -r option can be in fasta or fastq format and can also be gzipped.
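
If you run several small projects, the two command forms above can be wrapped in a small shell helper. This is only a sketch: the function name build_masurca_cmd and the MASURCA_BIN variable are my own illustrative names, not part of MaSuRCA, and the helper only prints the command so you can inspect it before running it. It assumes paired-end Illumina input and an optional long reads file, as in the examples above.

```shell
# Hypothetical helper (not part of MaSuRCA): prints the masurca command line
# for a small project so it can be reviewed before it is run.
# MASURCA_BIN is an assumed variable pointing at the masurca executable.
MASURCA_BIN="${MASURCA_BIN:-/path_to_MaSuRCA/bin/masurca}"

# Usage: build_masurca_cmd THREADS FORWARD_READS REVERSE_READS [LONG_READS]
build_masurca_cmd() {
    threads="$1"
    forward="$2"
    reverse="$3"
    long_reads="${4:-}"   # optional Nanopore/PacBio file for -r

    # Illumina paired-end files are passed to -i separated by a comma.
    cmd="$MASURCA_BIN -t $threads -i $forward,$reverse"

    # Add -r only when a long reads file was given (hybrid assembly).
    if [ -n "$long_reads" ]; then
        cmd="$cmd -r $long_reads"
    fi

    printf '%s\n' "$cmd"
}

# Illumina-only assembly
build_masurca_cmd 32 /path_to/pe_R1.fa /path_to/pe_R2.fa
# Hybrid assembly
build_masurca_cmd 32 /path_to/pe_R1.fa /path_to/pe_R2.fa /path_to/nanopore.fastq.gz
```

Once the printed command looks right, it can be pasted into the terminal or into a job script. Paths containing spaces would need extra quoting, which this sketch does not handle.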
