Thursday, January 19, 2017

New version 3.2.1_01032017

I just posted a new version of the assembler.  The new version includes the following changes:

1. improved ploidy detection for heterozygous genome assembly; the ploidy is now auto-detected, and different settings are used for heterozygous genomes to filter out the alternative haplotype
2. improved speed and correctness of mega-reads correction of PacBio/nanopore reads; all gaps in mega-reads that are filled with raw PacBio/nanopore reads must now have >=3x coverage, even for high-coverage long-read data sets
3. changed local alignment parameters in CA8 overlap/scaffold/consensus to speed up the alignments without apparent loss of sensitivity

Overall, the new version should be 10-20% faster and more accurate.  For heterozygous genomes, one haplotype is output as the final assembly in the dedup.genome.scf.fasta file.  Both haplotypes are still available in 9-terminator/genome.scf.fasta.

The new version is posted at
ftp://ftp.genome.umd.edu/pub/MaSuRCA/latest/

8 comments:

  1. How would you recommend utilizing 2X250 overlapping Illumina reads in MaSuRCA? Would it be best to overlap/merge them and provide the resulting fasta file as a SE data source? Or provide them as PE reads with a negative insert size?

    Replies
    1. Are these reads really from 250 bp fragments? If, for example, the 250 bp reads overlap by 50 bp, the fragments are 450 bp long (2 x 250 - 50). Then you can supply these as
      PE=pe 450 50 file file
      Almost all of these will be joined automatically.
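
The arithmetic in the reply above can be sketched in a few lines of Python. This is only an illustration: the helper names `fragment_length` and `pe_line` and the read file names are made up for this example and are not part of MaSuRCA, and the second number (50) is simply copied from the reply rather than computed.

```python
def fragment_length(read_length: int, overlap: int) -> int:
    """Fragment length spanned by two equal-length reads that overlap in the middle."""
    if overlap < 0 or overlap > read_length:
        raise ValueError("overlap must be between 0 and read_length")
    return 2 * read_length - overlap


def pe_line(prefix: str, mean: int, second_value: int, fwd: str, rev: str) -> str:
    """Format a PE entry the way it is written in the reply above (hypothetical helper)."""
    return f"PE={prefix} {mean} {second_value} {fwd} {rev}"


# 250 bp reads overlapping by 50 bp, as in the example from the reply
mean = fragment_length(250, 50)

# reads_R1.fastq / reads_R2.fastq are placeholder file names
print(pe_line("pe", mean, 50, "reads_R1.fastq", "reads_R2.fastq"))
```

If your reads overlap by a different amount, only the first number changes; for instance a 100 bp overlap gives 2 x 250 - 100 = 400.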

  2. Dear Aleksey,
    Is there any way to speed up the cgw step? I'm assembling a small (320 Mb), polyploid and heterozygous plant genome, and the cgw step normally takes 40-50 days to run.

    Replies
    1. Yes, please watch for a new version next week (February 10+). I made some speed improvements to cgw.

    2. Looking forward to it. Is there a particular date?
      Can I use the latest version (01202017) on the ftp site for production?

  3. Hi Aleksey,

    I'm working on including PacBio data in an assembly. When I run masurca with the pacbio data line in my configuration file, the assemble.sh script doesn't include any runCA lines - it stops at the 'createSuperReadsForDirectory.perl' line. When I comment out the PACBIO line and run masurca I get a much longer assemble.sh script including the usual runCA lines near the end.

    Just wanted to check before I run this for a few weeks - is this expected behavior?

  4. Another question - what is the best way to distribute Masurca processing across multiple compute nodes in a cluster? Are there any steps that are particularly amenable to parallelization?

    Thanks!
    David

  5. MaSuRCA supports processing over a cluster in the runCA part. You can pass all relevant SGE parameters under CA_parameters. With PacBio data there will be no runCA commands in assemble.sh; there is a call to mega_reads_assemble.... instead.
