Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Comparative Study
. 2013 Jan 29;14(1):R8.
doi: 10.1186/gb-2013-14-1-r8.

CGAL: computing genome assembly likelihoods

Comparative Study

CGAL: computing genome assembly likelihoods

Atif Rahman et al. Genome Biol. .

Abstract

Assembly algorithms have been extensively benchmarked using simulated data so that results can be compared to ground truth. However, in de novo assembly, only crude metrics such as contig number and size are typically used to evaluate assembly quality. We present CGAL, a novel likelihood-based approach to assembly assessment in the absence of a ground truth. We show that likelihood is more accurate than other metrics currently used for evaluating assemblies, and describe its application to the optimization and comparison of assembly algorithms. Our methods are implemented in software that is freely available at http://bio.math.berkeley.edu/cgal/.

PubMed Disclaimer

Figures

Figure 1
Figure 1
A generative graphical model for sequencing. N paired-end reads are generated independently from a genome. Here, F denotes the distribution of fragment lengths, S is the distribution of start sites of reads and E stands for error parameters.
Figure 2
Figure 2
Hash length vs log likelihood for E. coli. Log likelihoods of assemblies of E. coli reads are shown on the y-axis. Assemblies are generated using different assemblers for varying k-mer length, which is shown on the x-axis. The dotted line corresponds to the log likelihood of the reference.
Figure 3
Figure 3
Hash length vs difference from reference for E. coli. The differences between assemblies and the reference are shown on the y-axis where the difference refers to the numbers of bases in the reference not covered by the assembly or differ between the reference and the assembly.
Figure 4
Figure 4
Log likelihood vs N50 scaffold length for E. coli. Log likelihoods are shown on the x-axis and N50 scaffold lengths are shown on the y-axis. Each circle corresponds to an assembly generated using an assembler for some hash length and the sizes of the circles correspond to similarity with reference. The R2 values are: (i) log likelihood vs similarity: 0.9372048, (ii) log likelihood vs N50 scaffold length: 0.44011, (iii) N50 scaffold length vs similarity: 0.3216882.
Figure 5
Figure 5
Log likelihood vs numbers of mis-assembly features and suspicious regions for E. coli. Log likelihoods are shown on the x-axis and numbers of mis-assembly features and suspicious regions reported by amosvalidate are shown on the y-axis. Each symbol corresponds to an assembly generated using an assembler for some hash length and the sizes of the symbols correspond to similarity with reference. The R2 values are: (i) log likelihood vs number of mis-assembly features: 0.8922, (ii) log likelihood vs number of suspicious regions: 0.9039, (iii) similarity vs number of mis-assembly features: 0.8211, (iv) similarity vs number of suspicious regions: 0.7723.
Figure 6
Figure 6
Hash length vs log likelihood for E. coli data from CLC bio. Log likelihoods of assemblies of E. coli reads from CLC bio are shown on the y-axis. Assemblies are generated using different assemblers for varying k-mer length, which is shown on the x-axis. The yellow dotted line corresponds to the log likelihood of the reference provided and the gray dotted line corresponds to the log likelihood of the strain we believe the reads were generated from.
Figure 7
Figure 7
Hash length vs log likelihood for G. clavigera. Log likelihoods of assemblies of G. clavigera reads are shown on the y-axis. Assemblies are generated using different assemblers for varying k-mer length, which is shown on the x-axis. The dotted lines correspond to the log likelihoods of the assemblies generated using Sanger, 454 and Illumina data.
Figure 8
Figure 8
Hash length vs difference from reference for G. clavigera. Differences between assemblies and the reference are shown on the y-axis where difference refers to the numbers of bases in the reference not covered by the assembly or differ between the reference and the assembly.
Figure 9
Figure 9
Log likelihood vs N50 scaffold length for G. clavigera. Log likelihoods are shown on the x-axis and N50 scaffold lengths are shown on the y-axis. Each circle corresponds to an assembly generated using an assembler for some hash length and the sizes of the circles correspond to similarity with reference. The R2 values are: (i) log likelihood vs similarity: 0.4545793, (ii) log likelihood vs N50 scaffold length: 0.002397233, (iii) N50 scaffold length vs similarity: 0.006084032.
Figure 10
Figure 10
Coverage vs log likelihood for Assemblathon 1 entries. Coverage is shown on the x-axis and log likelihood is shown on the y-axis. Each circle corresponds to an assembly. The R2 value is 0.989972.

Similar articles

Cited by

References

    1. Medvedev P, Brudno M. Maximum likelihood genome assembly. Journal of Computational Biology. 2009;16:1101–1116. doi: 10.1089/cmb.2009.0047. - DOI - PMC - PubMed
    1. Medvedev P, Georgiou K, Myers G, Brudno M. In: Algorithms in Bioinformatics, Volume 4645 of Lecture Notes in Computer Science. Giancarlo R, Hannenhalli S, editor. Berlin/Heidelberg: Springer; 2007. Computability of models for sequence assembly. pp. 289–301.http://dx.doi.org/10.1007/978-3-540-74126-8_27 - DOI
    1. Nagarajan N, Pop M. Parametric complexity of sequence assembly: theory and applications to next generation sequencing. Journal of computational biology. 2009;16:897–908. doi: 10.1089/cmb.2009.0005. - DOI - PubMed
    1. Earl DA, Bradnam K, St John J, Darling A, Lin D, Faas J, Yu HOK, Vince B, Zerbino DR, Diekhans M, Nguyen N, Nuwantha P, Sung AWK, Ning Z, Haimel M, Simpson JT, Fronseca NA, Birol N, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelly DR, Phillippy AM, Koren S. et al.Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research. 2011;12:2224–2241. - PMC - PubMed
    1. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marcais G, Pop M, Yorke JA. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Research. 2012;3:557–567. - PMC - PubMed

Publication types