Thursday, January 19, 2017

New version 3.2.1_01032017

I just posted a new version of the assembler.  The new version includes the following changes:

1. improved ploidy detection for heterozygous genome assembly; the ploidy is now auto-detected and different settings are used for heterozygous genomes to filter out the alternative haplotype
2. improved speed and correctness of mega-reads correction of PacBio/nanopore reads; all gaps in mega-reads that are filled with raw PacBio/nanopore reads must now have coverage >=3, even for high-coverage long-read data sets
3. changed local alignment parameters in CA8 overlap/scaffold/consensus to speed up the alignments without apparent loss of sensitivity

Overall the new version should be 10-20% faster and more correct.  For heterozygous genomes, one haplotype is output as the final assembly in the dedup.genome.scf.fasta file.  Both haplotypes are still available in 9-terminator/genome.scf.fasta.

The new version is posted in
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/

Tuesday, December 13, 2016

ovlMemory error in hybrid Pacbio/MinION assembly

Please note that if you have the option ovlMemory set in the CA_PARAMETERS section of the config file, the Illumina+PacBio/MinION assembly will give an error in CABOG.  This is normal behavior: the option is not compatible with CABOG version 8, which is used for the long-read assembly.  I suggest removing any additional options that you may have in the CA_PARAMETERS section of the configuration file, then re-generating assemble.sh and running the assembly again.  The assembly will then proceed without error.
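A quick check of the configuration file before re-generating assemble.sh could look like the sketch below.  The function name check_ca_parameters is made up for illustration, and the sketch assumes that CA_PARAMETERS sits on a single line of the form "CA_PARAMETERS = key=value ...".

```shell
#!/bin/sh
# Hypothetical helper: report whether ovlMemory is set on the CA_PARAMETERS line.
# Assumption: CA_PARAMETERS occupies one uncommented line of the config file.
check_ca_parameters() {
    config_file="$1"
    if [ ! -r "$config_file" ]; then
        echo "cannot read $config_file" >&2
        return 2
    fi
    # Look only at the CA_PARAMETERS line, then search it for ovlMemory.
    if grep -E '^[[:space:]]*CA_PARAMETERS' "$config_file" | grep -q 'ovlMemory'; then
        echo "ovlMemory found in CA_PARAMETERS: remove it before a hybrid assembly"
        return 1
    fi
    echo "CA_PARAMETERS does not set ovlMemory"
    return 0
}
```

For example, "check_ca_parameters config.txt" returns 0 when the file is safe to use, where config.txt is a placeholder for your own configuration file name.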

Comments are welcome

I welcome any comments on the posts.  You can use comments to report bugs in the new versions or to ask questions about the new features.  For now all comments are moderated, that is, I have to look at every comment and approve it before it appears on the public blog.  This way I can make sure I have looked at all comments.

New version 3.2.1_12132016

I just posted a new MaSuRCA version on the alternative download site: ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/.  This version improves speed by running delta-filter in parallel after Nucmer in the deduplication of contigs and unitigs.  The new script is called parallel_delta-filter.sh and its usage is:


parallel_delta-filter.sh delta-prefix "switches" num_processes


The output is the filtered delta-prefix.fdelta file.  delta-prefix is the delta file name without the .delta extension.  "switches" are the command line parameters for delta-filter, e.g. "-q -I 95 -l 100".  num_processes is the number of processes that you wish to run.
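As an illustration of the arguments described above, here is a minimal sketch of a wrapper.  The function names delta_prefix and run_parallel_filter are hypothetical, and the sketch assumes that parallel_delta-filter.sh is on your PATH.

```shell
#!/bin/sh
# Strip a trailing .delta to get the prefix, e.g. genome.delta -> genome
delta_prefix() {
    printf '%s\n' "${1%.delta}"
}

# Hypothetical wrapper: run the filter and name the expected output file.
run_parallel_filter() {
    prefix=$(delta_prefix "$1")
    switches="$2"        # passed on as ONE quoted string
    num_processes="$3"
    # Assumption: parallel_delta-filter.sh is found on PATH.
    parallel_delta-filter.sh "$prefix" "$switches" "$num_processes" || return 1
    echo "filtered alignments written to ${prefix}.fdelta"
}
```

A call such as run_parallel_filter genome.delta "-q -I 95 -l 100" 8 would then filter genome.delta with 8 processes; the file name and the process count are arbitrary examples.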

Friday, December 9, 2016

Alternative download site

MaSuRCA source code is always available from http://www.genome.umd.edu.  If the site is not available for some reason, the latest version is available at ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/

Documentation update

MaSuRCA documentation has been updated.  The link on the website points to the correct version of the documentation.  The Quick Start Guide is available at ftp://ftp.genome.umd.edu/pub/MaSuRCA/MaSuRCA_QuickStartGuide.pdf



Thursday, December 8, 2016

New version 3.2.1_12082016

An updated version has been posted: 3.2.1_12082016.  This version adds de-duplication of scaffolds (removal of redundant scaffolds from the assembly) as the final step.  If you need to run the de-duplication manually, you can do so by running "bin/deduplicate_contigs.sh <CA assembly folder name> genome <NUM_THREADS>".
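A manual run could be wrapped as in the sketch below.  The function name run_dedup is made up for illustration, and the sketch assumes that you run it from the directory in which bin/deduplicate_contigs.sh resolves.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual de-duplication command quoted above.
# Assumption: the current directory contains bin/deduplicate_contigs.sh.
run_dedup() {
    ca_dir="$1"          # CA assembly folder name
    num_threads="$2"     # number of threads to use
    if [ ! -d "$ca_dir" ]; then
        echo "no such assembly folder: $ca_dir" >&2
        return 1
    fi
    bin/deduplicate_contigs.sh "$ca_dir" genome "$num_threads"
}
```

For example, run_dedup CA 16 would de-duplicate an assembly folder named CA using 16 threads; both values are placeholders for your own.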

Wednesday, December 7, 2016

New version 3.2.1_12062016

A new version of MaSuRCA (3.2.1_12062016) is available.  This version has an extensive list of new features and updates from the previous 3.2.1_10202016 release.  The improvements include:

1. faster unitig consensus (use pbutgcns along with utgcns)
2. faster contig consensus (use pbutgcns along with ctgcns)
3. new feature: deduplication of heterozygous genomes at the unitig stage; one haplotype is removed, and each contig in the assembly is output as a single haplotype; however, not all contigs represent the same haplotype -- this feature is under development
4. overlap filtering to detect and prevent repeat-induced misassemblies in contigs
5. bugfixes and small improvements in speed and accuracy in Celera Assembler -- the main assembly engine in MaSuRCA

Welcome

Welcome to the official MaSuRCA development blog maintained by Aleksey Zimin.  This blog is dedicated to announcements about and support of the MaSuRCA whole-genome de novo assembly package.

At this time MaSuRCA supports Illumina, Sanger, 454, Oxford Nanopore and PacBio data.  Please refer to the MaSuRCA Quick Start Guide http://www.genome.umd.edu/docs/MaSuRCA_QuickStartGuide.pdf before using MaSuRCA.

Please supply either Pacbio or Nanopore reads (not both) as PACBIO=file.fa or NANOPORE=file.fa in the configuration file. All Pacbio or Nanopore reads must be in a single fasta file!
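If your long reads are spread over several fasta files, they can be combined into one file first.  The sketch below is one way to do that; the function name merge_fasta and the file names in the usage note are illustrative only.

```shell
#!/bin/sh
# Concatenate several long-read fasta files into a single file.
# Assumption: every input file ends with a newline, so that records do not
# run together at the file boundaries.
merge_fasta() {
    out="$1"
    shift
    cat "$@" > "$out"
    # Print the number of sequences (header lines) as a sanity check.
    grep -c '^>' "$out"
}
```

For example, merge_fasta all_pacbio.fa run1.fa run2.fa writes all_pacbio.fa and prints the total number of reads; the combined file can then be given as PACBIO=all_pacbio.fa in the configuration file.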

To obtain the latest SOURCE CODE of MaSuRCA and compilation instructions, see this page. BINARIES for various Linux distributions are NOT YET AVAILABLE, but they are coming soon.

The package is freely available under the GPL open source license at http://www.genome.umd.edu.  If you use MaSuRCA, please cite the following publications:

http://bioinformatics.oxfordjournals.org/content/early/2013/08/29/bioinformatics.btt476.abstract

http://biorxiv.org/content/early/2016/07/26/066100