Wednesday, October 25, 2017

Note on ploidy estimation in MaSuRCA

Since version 3.2.2, MaSuRCA uses modified algorithms and settings for the assembly of heterozygous diploid/polyploid genomes. As a result, there is a ploidy setting that is computed automatically and saved in PLOIDY.txt. Valid values for PLOIDY are 1 and 2. Editing this file forces the assembler to use the ploidy indicated.

Ploidy 1 means haploid and ploidy 2 means diploid. This is a gross over-simplification that is used in the assembler for the time being.

Ploidy for non-clonal genomes is always 2, but for the internal algorithms ploidy 1 means that the genome is relatively inbred and ploidy 2 means that it is relatively outbred. The reasoning is that in most genomes a proportion of the sequence is conserved between the two haplotypes, and a proportion is divergent. I treat ploidy as a measure of the ratio (total amount of unique sequence in the genome) / (haploid genome size). This is a number between 1 and 2: 1 means no divergence (the homologous chromosomes are identical) and 2 means the two haplotypes are 100% different. At this time I do not treat this as a floating parameter between 1 and 2; instead, I set a threshold in the middle based on a heuristic computation. This is an over-simplification, and I will introduce a refinement of this parameter in later versions.
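
To make the ratio and the threshold concrete, here is a minimal sketch in Python. It is not the code used inside MaSuRCA: the function names, the example genome sizes, and the threshold value of 1.5 are assumptions chosen purely for illustration (the text above only says the threshold sits "in the middle" and is derived heuristically).

```python
def ploidy_ratio(unique_bp: int, haploid_bp: int) -> float:
    """Ratio of total unique sequence to haploid genome size.

    Hypothetical helper: both inputs are assumed to be estimates in base pairs.
    """
    if haploid_bp <= 0:
        raise ValueError("haploid genome size must be positive")
    return unique_bp / haploid_bp


def ploidy_setting(ratio: float, threshold: float = 1.5) -> int:
    """Collapse the continuous ratio into the two allowed settings, 1 or 2.

    The default threshold of 1.5 is an assumption for illustration only.
    """
    if not 1.0 <= ratio <= 2.0:
        raise ValueError("ratio is expected to lie between 1 and 2")
    return 2 if ratio >= threshold else 1


# A relatively inbred genome: 1.0 Gb haploid size, 1.2 Gb of unique sequence.
inbred = ploidy_ratio(1_200_000_000, 1_000_000_000)
# A relatively outbred genome: 1.0 Gb haploid size, 1.8 Gb of unique sequence.
outbred = ploidy_ratio(1_800_000_000, 1_000_000_000)

print(ploidy_setting(inbred), ploidy_setting(outbred))
```

With these made-up numbers the first genome falls below the assumed threshold and the second falls above it, so they map to settings 1 and 2 respectively.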

Triticum aestivum (bread wheat) assembly paper is out

The Triticum aestivum (bread wheat) assembly paper has just appeared in GigaScience. It took 100 CPU-years to assemble the 16 Gb bread wheat genome with MaSuRCA and Falcon!
https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/gix097/4561661/The-first-near-complete-assembly-of-the-hexaploid