MaSuRCA genome assembly package: New version 3.2.1

Thursday, January 19, 2017

New version 3.2.1_01032017

I just posted a new version of the assembler. The new version includes the following changes:

1. improved ploidy detection for heterozygous genome assembly; ploidy is now auto-detected, and different settings are used for heterozygous genomes to filter out the alternative haplotype
2. improved speed and correctness of mega-reads correction of PacBio/nanopore reads; all gaps in mega-reads that are filled with raw PacBio/nanopore reads must now have >=3x coverage, even for high-coverage long-read data sets
3. changed local alignment parameters in CA8 overlap/scaffold/consensus to speed up the alignments without apparent loss of sensitivity

Overall, the new version should be 10-20% faster and more accurate. For heterozygous genomes, one haplotype is output as the final assembly in the dedup.genome.scf.fasta file. Both haplotypes are still available in 9-terminator/genome.scf.fasta.
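For downstream scripts, choosing between the two output files can be sketched as below. This is a minimal illustration, not part of MaSuRCA: the function name pick_final_assembly is hypothetical, and it assumes both file paths are relative to the same assembly output directory, which you pass in.

```python
from pathlib import Path


def pick_final_assembly(assembly_dir):
    """Return the scaffold FASTA to use, or None if neither file exists.

    Hypothetical helper. Assumes both paths named in the release notes
    are relative to the same directory, given as assembly_dir.
    """
    root = Path(assembly_dir)
    candidates = [
        # Single-haplotype output written for heterozygous genomes.
        root / "dedup.genome.scf.fasta",
        # Full scaffold output, which still contains both haplotypes.
        root / "9-terminator" / "genome.scf.fasta",
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None
```

Preferring the dedup file when it exists matches the note above that it is the final assembly for heterozygous genomes; the 9-terminator file serves as the fallback.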

The new version is posted at
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/

8 comments:

Unknown, January 24, 2017 at 12:53 PM
How would you recommend using 2x250 overlapping Illumina reads in MaSuRCA? Would it be best to overlap/merge them and provide the resulting FASTA file as an SE data source? Or provide them as PE reads with a negative insert size?

Alê, January 26, 2017 at 8:15 AM
Dear Aleksey,
Is there any way to speed up the cgw step? I'm assembling a small (320 Mb), polyploid and heterozygous plant genome, and the cgw step normally takes 40-50 days to run.

lolseal, February 2, 2017 at 2:54 PM
Hi Aleksey,

I'm working on including PacBio data in an assembly. When I run masurca with the PacBio data line in my configuration file, the assemble.sh script doesn't include any runCA lines; it stops at the 'createSuperReadsForDirectory.perl' line. When I comment out the PACBIO line and run masurca, I get a much longer assemble.sh script that includes the usual runCA lines near the end.

Just wanted to check before I run this for a few weeks - is this expected behavior?

lolseal, February 2, 2017 at 2:55 PM
Another question: what is the best way to distribute MaSuRCA processing across multiple compute nodes in a cluster? Are there any steps that are particularly amenable to parallelization?

Thanks!
David

Aleksey Zimin, February 6, 2017 at 4:22 PM
MaSuRCA supports processing over a cluster in the runCA part. You can pass all relevant SGE parameters under CA_parameters. With PacBio data there will be no runCA commands in assemble.sh; there is a call to mega_reads_assemble.... instead.