MaSuRCA genome assembly package

Friday, April 22, 2022

SAMBA scaffolder included in MaSuRCA makes assemblies more contiguous (and correct)

Imagine a situation where you have a genome of an organism that you have been studying, assembled from the previous generation sequencing data. You have spent time to produce this assembly, curated it, and it looks good, but contiguity is not up to today's standards. A way to improve the existing assembly, without starting anew, is to upgrade it using additional long-read data. If you still have DNA from the same individual that was used to produce data for the original assembly, it is relatively straightforward (and cheap) now to produce additional long read data with Oxford Nanopore MinIion or PromethIon instrument. You can then use SAMBA to quickly scaffold and gap-fill your existing draft assembly with the additional long-read data. This will result in substantial improvements in contiguity, and likely correctness. SAMBA can use the long reads to check existing contigs for misassemblies using long read alignments, break at suspected misassembly locations, and then scaffold the contigs and fill in the sequence for all spanned gaps in the scaffolds. This yields both much bigger and more structurally correct contigs. SAMBA is free open-source software included with MaSuRCA version 4.0.9 and up: Releases · alekseyzimin/masurca · GitHub

SAMBA is published in PLoS Computational biology: Zimin AV, Salzberg SL. The SAMBA tool uses long reads to improve the contiguity of genome assemblies. PLoS computational biology. 2022 Feb 4;18(2):e1009860.

The invocation of SAMBA is as follows:

samba.sh [options]
-r <contigs or scaffolds in fasta format> 
-q <long reads or another assembly used to scaffold in fasta or fastq format, can be gzipped> 
-t <number of threads> 
-d <scaffolding data type: ont, pbclr or asm, default:ont> 
-m <minimum matching length, default:5000> 
-o <maximum overhang, default:1000> 
-a <optional: allowed merges file in the format per line: contig1 contig2, only pairs of contigs listed will be considered for merging, useful for intrascaffold gap filling>
-v verbose flag
-h|--help|-u|--usage this message

SAMBA installs with MaSuRCA and requires no external dependencies. The only parameter that is worth modifying is -m or the minimum matching length. 2000-2500 is a good value for small eukaryotic genomes 100-400Mb in size, 5000 is the default best value for large eukaryotic genomes (2-3Gbp), and 9000-10000 is the best value for large highly repetitive plant genomes (5Gbp+).

MaSuRCA also provides a wrapper script for SAMBA that allows to use SAMBA to close intra-scaffold gaps in an assembly. The usage is as follows:

close_scaffold_gaps.sh [options]
-r <scaffolds to gapclose> MANDATORY
-q <sequences used for closing gaps,  can be long reads or another assembly, in fasta or fastq format, can ge gzipped> MANDATORY
-t <number of threads, default:1>
-i <identity% default:98>
-m <minimum match length on the two sides of the gap, default:2500>
-o <max overhang, default:1000>
-v verbose flag
-h|--help|-u|--usage this message

The above script will split the scaffolds into contigs and then run SAMBA only allowing intrascaffold gaps to be filled. Contig "flips" inside the scaffold are allowed, making this a great tool to gapfill and fix assemblies scaffolded with HiC data, because HiC scaffolding sometimes incorrectly flips contigs in scaffolds.

Tuesday, March 23, 2021

Making MaSuRCA easier to use for small projects

MaSuRCA is a very flexible and versatile assembler. It is capable of producing high quality assemblies from Illumina and long high error reads. For big projects that utilize data from many Illumina/Nanopore/PacBio instrument runs and may need adjustments to assembly parameters, one can use the configuration file to conveniently input all data and set all parameters. The configuration file is well commented and a sample is created during installation (sr_config_example.txt). However, many small projects, such as assemblies of small bacteria and small eukaryotes use data from a single Illumina run and sometimes a single Nanopore or PacBio flowcell. It is bothersome to create or edit a configuration file for such small projects. This is why I created a new simpler way to run MaSuRCA for such projects. This is now available in the latest MaSuRCA version 4.0.3. This runs the full version of the MaSuRCA assembly pipeline with default settings.

Releases · alekseyzimin/masurca (github.com)

If your project uses data from a single Illumina run that produced either a file of single-end reads or two files for paired end reads, and optionally a single file containing long Nanopore or PacBio reads, you can skip creating a configuration file and use a simple command-line interface to run MaSuRCA. The options are described in the usage message that is displayed by using -h or --help switch. There are three command line switches, -i, -t and -r. -t specifies the number of threads to use, -i specifies the names and paths to Illumina paired end reads files and -r specifies the name and the path to the long reads file. For example:

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa

will run assembly with only Illumina paired end reads from files /path_to/pe_R1.fa (forward) and /path_to/pe_R2.fa (reverse). An example of the hybrid assembly:

/path_to_MaSuRCA/bin/masurca -t 32 -i /path_to/pe_R1.fa,/path_to/pe_R2.fa -r /path_to/nanopore.fastq.gz

This command will run a hybrid assembly, correcting Nanopore reads with Illumina data first. Ilumina paired end reads files must be fastq, can be gzipped, and Nanopore/PacBio data files for the -r option can be fasta or fastq and can be gzipped.

Tuesday, January 26, 2021

MaSuRCA 4.0.1 release

This release contains major improvements in speed of hybrid assemblies. Thanks to the new k-unitig pre-correction algorithm, the speed of mega-reads algorithm, which corrects the long high error reads from Oxford Nanopore or Pacific Biosciences platforms, increased by about a factor of 6 for large genomes. The new algorithm eliminated the need to run second pass of the mega-reads, resulting in lower memory requirements. This resulted in major improvements of run times, especially for big genomes. It is now possible to run an hybrid assembly of a human genome starting with ~60x Illumina paired end data and ~30x Oxford Nanopore data in less than 6 days on a small computing cluster with ~200 CPU-cores. Bigger clusters will allow for assembly run times of as little as 2-3 days. MaSuRCA hybrid technique outputs high quality consensus that does not require any polishing. Thus MaSuRCA assemblies can be used without any additional post-processing for downstream steps, such as gene annotation, in any genome project.

This release also improves compatibility with SLURM scheduler, by eliminating the second pass of mega-reads that was unstable on some systems. There are many stability and efficiency improvements. Here are some highlights:

worked around the "consensus sequence mismatch" error in POLCA that occurred rarely in complex sequence regions
chromosome scaffolder now picks low and high coverage thresholds automatically based on mapped read coverage
improved chromosome scaffolder speed and accuracy using more efficient algorithms
updated the version of MUMmer to 4.0.0rc1
fixed a bug that sometimes caused minor under-reporting of actual errors in POLCA report file
updated code for reference-assisted assembly for better performance when multiple references are used

The release is available here:

https://github.com/alekseyzimin/masurca/releases/download/v4.0.1/MaSuRCA-4.0.1.tar.gz

Thursday, December 17, 2020

MaSuRCA 4.0.0 incorporates new techniques that speed up assemblies of mammalian genomes by a factor of six

MaSuRCA has been serving the research community well for the past few years, with many low-cost sequencing and assembly projects benefiting from its reduced requirements on coverage depth for long read sequencing data and high quality output assemblies. One of the complaints about MaSuRCA has been speed -- and indeed a mammalian sized ~3Gbp genome usually took about a month to assemble even on research computing clusters. For the past few months I have been concentrating on developing new efficient algorithms for correction of long high error reads such as the ones produced by Oxford Nanopore MinION/PromethION and PacBio SMRT CLR sequencing.

The new MaSuRCA 4.0.0 improves the run time for the mammalian genome assemblies by at least a factor of six by eliminating the second pass of correction and replacing it with the new ultrafast k-unitig pre-correction algorithm. The k-unitig pre-correction reduces the error rate of the input reads by up to 50% with very small time investment, less than 1 day on a single 32-64 core server for a mammalian genome data set, which in turn allows for use of longer k-mers in (17 vs 15) in the main mega-read building phase. Use of longer mers increases specificity of the super-reads alignments to the long reads and thus reduces the complexity of the super-read graph, which has major impact on the run time.

In direct run time comparisons of the assembly times for human HG01243 Illumina/Nanopore data set with about 62x genome coverage by Illumina reads and about 47 x coverage by ultralong Nanopore reads (read N50 >80kb) new MaSuRCA completed the assembly in less than 5 days, where the previous MaSuRCA (version 3.4.2) required 32 days on the same 256-core computer cluster. The assembly results were comparable between the two runs, with the new version producing slightly more contiguous assembly. On experiments with A.thaliana (genome size ~120Mbp) data set of 40x coverage by PacBio RSII reads combined with 100x coverage the new MaSuRCA produces better assembly with fewer misassemblies and N50 contig size of 6.2Mbp vs N50 contig of 5.45Mbp for the previous version. Overall I expect the new MaSuRCA to deliver very similar or slightly better assembly results in significantly less time.

The pre-release of the new MaSuRCA is available on github at

https://github.com/alekseyzimin/masurca/raw/master/MaSuRCA-4.0.0.tar.gz

This version will be published as the official release once I complete the testing on multiple large genome data sets. In meanwhile if you would like to install and try the new release and find any problems with installation or usage, please report them on the github issues board

Issues · alekseyzimin/masurca (github.com)

Monday, September 16, 2019

MaSuRCA-polish tool in MaSuRCA 3.3.4

There is a new tool available in MaSuRCA, MaSuRCA-polish tool assembly consensus quality evaluation/polishing tool. The tool is designed to detect and correct single-base and short insertion/deletion errors in assembled genomes using Illumina data. The tool is is partially based on the error evaluation method described in (Jain et al, 2018). It works best for assemblies with at least 99% consensus quality, as it is based on mapping Illumina reads to the assembly, and mapping accuracy decreases as assembly error rate increases.

The tool first uses bwa mem (Li et al, 2009) aligner to map the Illumina reads to the assembly. It then uses the alignments to find short sequence variants using freebayes tool (Garrison et al, 2012). Any variant that has no allele that agrees with the consensus and where there is at least one alternative allele with frequency 3 or more is counted as an error in the consensus. Then, for every location where we detect an error, we replace the consensus allele with the highest count alternative. The code is parallel and it runs in less than 24 hours on a human genome with 30x Illumina data on 48 core AMD Opteron server with 128Gb of RAM. The tool is distributed with MaSuRCA assembler (https://github.com/alekseyzimin/masurca) (Zimin at al, 2013).

We applied the MaSuRCA-polish tool to the assembly of human HG002 genome using a 30x coverage subset of the 65x total coverage. The tool found 138,386 single base substitutions and 410,116 insertion/deletion errors in the assembly, and performed correction. To evaluate the error rate after correction we used a different 30x coverage subset of the Illumina data. The tool was able to correct about 95% of the substitution errors and about 96% of insertion/deletion errors, reducing the overall consensus error rate by approximately a factor of 20.

The usage is:

masurca-polish.sh -a -r <'Illumina_reads_fastq1 Illumina_reads_fastq'> -t [-n] <optional:do not fix errors that are found>

Note that a beta version of the tool was called evaluate_consensus_error_rate.sh in earlier MaSuRCA releases.

References
1. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA, Tyson JR, Beggs AD, Dilthey AT, Fiddes IT, Malla S. Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature biotechnology. 2018 Apr;36(4):338.

2. Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25:1754-60. [PMID: 19451168]

3. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907. 2012 Jul 17.

4. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics. 2013 Aug 29;29(21):2669-77.

Monday, July 29, 2019

Test Illumina+Pacbio data set for MaSuRCA

There is new ftp site for a relatively small data set for 12Mbp genome of S.cerevisiae W303 to test MaSuRCA installation.

ftp://ftp.ccb.jhu.edu/pub/alekseyz/test_data/

This data set should assemble in under 30 scaffolds with N50 of over 650Kbp. The data and config files are provided along with masurca.sh script that shows the command lines. The data set is a subset of data available at

http://schatzlab.cshl.edu/data/ectools/

Thursday, March 7, 2019

MaSuRCA version 3.3.1

Today I am releasing a new version of MaSuRCA assembler, 3.3.1. This version has no new features, only performance improvements and bugfixes. The release is available from the usual github download page:
https://github.com/alekseyzimin/masurca/releases/tag/v3.3.1

I am currently working on the MaSuRCA 4 version. This version will replace CABOG assembler with Flye (https://github.com/fenderglass/Flye) for hybrid assembly of Illumina paired end + Oxford Nanopore/PacBio long reads. This will result in significant performance improvements, as Flye takes about 1 day on a 64-core server to assemble error-corrected human 20x data set, and CABOG takes about a week on 300-core cluster to do the same task.

You can use Fly now to assemble error corrected reads output by MaSuRCA. To do that you can stop the assembly after the following file has been generated:
mr.41.15.15.0.02.1.fa -- for nanopore assemblies
mr.41.15.17.0.029.1.fa -- for pacbio assemblies

and then run the Flye assembler as follows:

GS=`cat ESTIMATED_GENOME_SIZE.txt` && <flye_path>/bin/flye --nano-corr <mr.41.15.15.0.02.1.fa or mr.41.15.17.0.029.1.fa> -t <number_of-threads> -g $GS -m 2000 -o flye_assembly -i 0

Friday, November 30, 2018

Beta SLURM support in MaSuRCA 3.3.0b

I have been working to implement grid execution in MaSuRCA for a while, and MaSuRCA supported SGE for nearly a year now. The implementation of support for SLURM was behind due to an important difference between job submission in SGE and SLURM: absence of "sync" option, where the submit command in a script would not return until all jobs have exited. There are ways around it and we are still testing what would work best in the framework of MaSuRCA. For now one can run the four most computation intensive parts of MaSuRCA on the grid:

1. mega-reads correction 1st pass
2. mega-reads correction second pass (sometimes grid is not necessary for this one)
3. overlap-based trimming in CABOG
4. overlapping in CABOG

The way this works, one runs assemble.sh and then the run prepares batch jobs and exits, instructing user to run a given sbatch command with given parameters. Once all sbatch jobs are done, one should re-run assemble.sh script (would not hurt to re-generate it just in case). The scrip checks if all jobs completed successfully and then proceeds. If some jobs have failed, the script exits, instructing the user to run sbatch command again.

To configure the grid run one must add the following options to the config file. Hint: these options are already present in the example config file created upon installation of MaSuRCA

USE_GRID=1
GRID_QUEUE=your_slurm_partition
GRID_ENGINE=SLURM
GRID_BATCH_SIZE=<amount of Nanopore/Pacbio sequence to use per batch>, 300000000 default, will create 10 batches for every 1x of mammalian genome coverage
NUM_THREADS=32

The sbatch command will run one job per 32-core node in this case. If your nodes have fewer or more cores, adjust accordingly.

If grid jobs fail in mega-reads pass2 (jf_aligner jobs) it is likely due to not enough memory on the nodes. To solve this, reduce GRID_BATCH_SIZE, re-generate assemble.sh , remove mr_pass2 folder and re-run.

To get the 3.3.0b version, please go to github
https://github.com/alekseyzimin/masurca/blob/master/MaSuRCA-3.3.0b.tar.gz

Thanks to Daniela Puiu, the software engineer at the Center for Computational Biology for implementing SLURM support in CABOG and helping me test MaSuRCA!

Tuesday, November 20, 2018

MaSuRCA 3.2.9 hotfix

I decided to release a minor hotfix to MaSuRCA 3.2.9 on 11/20/2019 keeping version number the same, but with a small change: now the final output file final.genome.scf.fasta contains scaffolds. Previous version split scaffolds and output contigs in that file. To recover the scaffolds from 3.2.8 assemblies you can run:

<MaSuRCA path>/bin/recover_scaffolds.pl < final.genome.scf.fasta >final.genome.scf.scaffolds.fasta

The NA12878 assembly of newly basecalled release3+4 Nanopore+Illumina data (see previous posts) now has the following statistics (N50 computed with 3.1Gbp human genome size):

Total sequence: 2,886,141,146
Number of scaffolds: 3,240
Longest scaffold: 56,745,809
Scaffold N50: 14,764,183
Scaffold L50: 63
Number of contigs: 3,448
Longest contig: 51,762,539
Contig N50: 9,580,052
Contig L50: 88

The assembly is posted here:

ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.scaffolds.fa

The consensus quality stats:
Substitution Errors: 49,520
Insertion/Deletion Errors: 115,752
Consensus Quality: 99.9943 (or 0.4 errors per 10000 bases)

Thursday, November 15, 2018

MaSuRCA 3.2.9 release: >99.99% consensus quality of MaSuRCA assemblies from Nanopore+Illumina data

This is a maintenance release with small bugfixes and speed/memory usage improvements. Some scripts have been converted from Perl to C++ to improve performance and reduce memory usage on very large data sets (10Gbp+ genomes).

There is a new script:

evaluate_consensus_error_rate.sh (run with -h to get usage)

This script follows guidelines and procedure of consensus quality evaluation described in https://www.nature.com/articles/nbt.4060 . It uses bwa to map Illumina data to the assembly, and then freebayes to get the variants. Bwa, samtools and freebayes must be installed and available on the PATH. Any assembly consensus variants (e.g. SNPs, indels) that are not supported by any Illumina reads, but where there is one or more alternatives that are supported by at least 3 Illumina reads are called errors in the consensus. The script thus estimates the total number of errors in the genome assembly consensus and computes sequence quality. The output is <>.report file, where <> is the name of the input assembly fasta file.

According to this evaluation, MaSuRCA assemblies have very high consensus quality; in my experiments 30x Pacbio+ 100x Illumina assembly of A.thaliana had 99.9972% quality (3 errors per 100,000 bases). Human NA12878 assembly from 37x Nanopore+100x Illumina data described in the previous post and available here ftp://ftp.ccb.jhu.edu/pub/alekseyz/na12878/na12878_MaSuRCA_3.2.8_nanopore_wgs_consortium_37x_GIAB_100x.fa (nanopore data from https://www.nature.com/articles/nbt.4060 and Illumina data from GIAB project), had 99.9913% quality (less than 1 error per 10,000 bases).

Note that there is no "polishing" required for MaSuRCA assemblies. The sequence can be used as output with no additional processing. Polishing with Pilon should be used with caution and can lead to adverse effects on assemblies. Here is a quote from Adam Phillippy blog:

"...Finally, a note of caution on Illumina polishing with Pilon. While it can improve consensus statistics overall, it can worsen the assembly in some regions, especially complex repetitive sequence like the MHC. If using Pilon, we recommend limiting the allowable edits and focusing on the primary nanopore error mode (indels)...."

The entire blog post can be found here: https://genomeinformatics.github.io/na12878update/