Imagine that you have a genome assembly of an organism you have been studying, produced from previous-generation sequencing data. You have spent time producing and curating this assembly, and it looks good, but its contiguity is not up to today's standards. One way to improve the existing assembly without starting anew is to upgrade it with additional long-read data. If you still have DNA from the same individual that was used for the original assembly, it is now relatively straightforward (and cheap) to produce additional long-read data on an Oxford Nanopore MinION or PromethION instrument. You can then use SAMBA to quickly scaffold and gap-fill your existing draft assembly with the additional long-read data. This will result in substantial improvements in contiguity, and likely in correctness. SAMBA uses long-read alignments to check existing contigs for misassemblies, breaks the contigs at suspected misassembly locations, and then scaffolds the contigs and fills in the sequence for all spanned gaps in the scaffolds. This yields contigs that are both much bigger and more structurally correct. SAMBA is free open-source software included with MaSuRCA version 4.0.9 and up, available from the Releases page of alekseyzimin/masurca on GitHub.
SAMBA is published in PLoS Computational Biology: Zimin AV, Salzberg SL. The SAMBA tool uses long reads to improve the contiguity of genome assemblies. PLoS Computational Biology. 2022 Feb 4;18(2):e1009860.
The invocation of SAMBA is as follows:
samba.sh [options]
-r <contigs or scaffolds in fasta format>
-q <long reads or another assembly used to scaffold in fasta or fastq format, can be gzipped>
-t <number of threads>
-d <scaffolding data type: ont, pbclr or asm, default:ont>
-m <minimum matching length, default:5000>
-o <maximum overhang, default:1000>
-a <optional: allowed merges file in the format per line: contig1 contig2, only pairs of contigs listed will be considered for merging, useful for intrascaffold gap filling>
-v verbose flag
-h|--help|-u|--usage this message
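As an illustration of how the options above combine, here is a minimal sketch of a typical run. The file names draft.fasta and ont_reads.fastq.gz and the thread count are hypothetical placeholders, and the values for -d, -m and -o simply restate the defaults from the usage message. The snippet prints the command and runs it only if samba.sh is on the PATH and both input files exist.

```shell
# Hypothetical inputs; replace with your own files (names without spaces).
ASSEMBLY=draft.fasta          # contigs or scaffolds to improve (-r)
READS=ont_reads.fastq.gz      # long reads used for scaffolding (-q)
THREADS=16                    # number of threads (-t)

# Assemble the command line from the documented options.
SAMBA_CMD="samba.sh -r $ASSEMBLY -q $READS -t $THREADS -d ont -m 5000 -o 1000"

# Always show the command; execute it only when it can actually run.
echo "$SAMBA_CMD"
if command -v samba.sh >/dev/null 2>&1 && [ -f "$ASSEMBLY" ] && [ -f "$READS" ]; then
    # Unquoted on purpose so the string splits into separate arguments.
    $SAMBA_CMD
fi
```

To scaffold with another assembly instead of reads, the same command would take that assembly as -q and use -d asm.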
SAMBA installs with MaSuRCA and requires no external dependencies. The only parameter worth modifying is -m, the minimum matching length. A value of 2000-2500 works well for small eukaryotic genomes (100-400 Mbp), the default of 5000 is best for large eukaryotic genomes (2-3 Gbp), and 9000-10000 is best for large, highly repetitive plant genomes (5 Gbp and up).
MaSuRCA also provides a wrapper script that uses SAMBA to close intra-scaffold gaps in an assembly. The usage is as follows:
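The guidance on -m can be turned into a small helper. This is only a sketch: the function name suggest_min_match is hypothetical, and the cutoffs for genome sizes that fall between or below the ranges named above (under 100 Mbp, 400 Mbp to 2 Gbp, 3 to 5 Gbp) are assumptions made here, not recommendations from the SAMBA authors. It also does not account for repeat content, which matters for the highest setting.

```shell
# Suggest a -m value from an approximate genome size given in Mbp.
# Cutoffs outside the documented ranges are assumptions of this sketch.
suggest_min_match() {
    size_mbp=$1
    if [ "$size_mbp" -le 400 ]; then
        echo 2500      # small eukaryotic genomes (documented: 100-400 Mbp)
    elif [ "$size_mbp" -ge 5000 ]; then
        echo 10000     # very large, repetitive genomes (documented: 5 Gbp and up)
    else
        echo 5000      # SAMBA default (documented for 2-3 Gbp)
    fi
}

# Example: a 3 Gbp genome keeps the default.
suggest_min_match 3000   # prints 5000
```

The result can then be passed to SAMBA as the value of -m.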
close_scaffold_gaps.sh [options]
-r <scaffolds to gapclose> MANDATORY
-q <sequences used for closing gaps, can be long reads or another assembly, in fasta or fastq format, can be gzipped> MANDATORY
-t <number of threads, default:1>
-i <identity% default:98>
-m <minimum match length on the two sides of the gap, default:2500>
-o <max overhang, default:1000>
-v verbose flag
-h|--help|-u|--usage this message
The above script splits the scaffolds into contigs and then runs SAMBA, allowing only intra-scaffold gaps to be filled. Contig "flips" inside a scaffold are allowed, which makes this a great tool for gap-filling and fixing assemblies scaffolded with Hi-C data, because Hi-C scaffolding sometimes incorrectly flips contigs within scaffolds.
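A gap-closing run on a Hi-C scaffolded assembly might look like the sketch below. The file names hic_scaffolds.fasta and ont_reads.fastq.gz and the thread count are hypothetical, and the values for -i, -m and -o restate the defaults from the usage message. As a precaution, the command is printed first and executed only if close_scaffold_gaps.sh is on the PATH and both inputs exist.

```shell
# Hypothetical inputs; replace with your own files (names without spaces).
SCAFFOLDS=hic_scaffolds.fasta   # scaffolds whose gaps should be closed (-r)
READS=ont_reads.fastq.gz        # long reads or another assembly (-q)
THREADS=16                      # number of threads (-t)

# Assemble the command line from the documented options.
GAPCLOSE_CMD="close_scaffold_gaps.sh -r $SCAFFOLDS -q $READS -t $THREADS -i 98 -m 2500 -o 1000"

# Always show the command; execute it only when it can actually run.
echo "$GAPCLOSE_CMD"
if command -v close_scaffold_gaps.sh >/dev/null 2>&1 && [ -f "$SCAFFOLDS" ] && [ -f "$READS" ]; then
    # Unquoted on purpose so the string splits into separate arguments.
    $GAPCLOSE_CMD
fi
```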