Review
Annu Rev Anim Biosci. 2019 Feb 15:7:41-64.
doi: 10.1146/annurev-animal-020518-115005. Epub 2018 Oct 31.

Whole-Genome Alignment and Comparative Annotation

Joel Armstrong et al. Annu Rev Anim Biosci.

Abstract

Rapidly improving sequencing technology, coupled with computational developments in sequence assembly, is making reference-quality genome assembly economical. Hundreds of vertebrate genome assemblies are now publicly available, and projects are being proposed to sequence thousands of additional species in the next few years. Such dense sampling of the tree of life should give an unprecedented understanding of evolution and allow a detailed determination of the events that led to the wealth of biodiversity around us. To gain this knowledge, these new genomes must be compared through genome alignment (at the sequence level) and comparative annotation (at the gene level). However, different alignment and annotation methods have different characteristics; before starting a comparative genomics analysis, it is important to understand the nature of the chosen methods and the biases and limitations inherent in them. This review is intended as a technical but high-level overview of the field that should provide this understanding. We briefly survey the state of the genome alignment and comparative annotation fields and their potential future directions in a new, large-scale era of comparative genomics.

Keywords: comparative genomics; genome alignment; genome annotation.

Figures

Figure 1
An example of how different heuristics affect a genome alignment. All panels are dotplots: A line with positive slope indicates an alignment from the positive strand of sequence 1 to the positive strand of sequence 2, and a negative slope indicates an alignment from the positive strand of sequence 1 to the negative strand of sequence 2. Solid blue lines represent alignments, and red dashed lines represent where alignments have been missed. (a) The true alignment between the two sequences. (b) The same alignment if a single-copy aligner perfectly recovered the true alignment, except for the ignored duplication. (c) The same alignment according to a global or approximately global aligner: No edit operations except insertions, deletions, and substitutions are allowed, so substantial alignment is missing.
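The slope convention used in these dotplots can be illustrated with a minimal sketch. It is not taken from the review or from any particular aligner: it assumes exact k-mer matches stand in for alignments, and the function names `reverse_complement` and `dotplot_matches` are hypothetical. Forward-strand matches fall along lines of positive slope; matches to the reverse complement fall along lines of negative slope.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def dotplot_matches(seq1, seq2, k=4):
    """Collect exact k-mer matches between seq1 and seq2.

    Returns (forward, reverse), two lists of (i, j) start positions,
    both in positive-strand coordinates of seq1 and seq2.
    forward: seq1[i:i+k] equals seq2[j:j+k] (points on a positive slope).
    reverse: the reverse complement of seq1[i:i+k] equals seq2[j:j+k]
             (points on a negative slope).
    Illustrative only: real aligners use seeds plus extension and chaining.
    """
    # Index every k-mer of seq2 by its start position.
    index = {}
    for j in range(len(seq2) - k + 1):
        index.setdefault(seq2[j:j + k], []).append(j)

    forward, reverse = [], []
    for i in range(len(seq1) - k + 1):
        kmer = seq1[i:i + k]
        forward.extend((i, j) for j in index.get(kmer, []))
        reverse.extend((i, j) for j in index.get(reverse_complement(kmer), []))
    return forward, reverse
```

Comparing a sequence with itself yields forward matches along the main diagonal, while comparing it with its own reverse complement yields reverse matches whose `j` decreases as `i` increases.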
Figure 2
A diagram showing the difference between a reference-biased and a reference-free multiple alignment. In a human-biased multiple alignment, any large regions that are deleted in human, or inserted somewhere else in the tree, cannot be aligned.
Figure 3
An example of how progressive genome alignment works, focused on aligners like VISTA-LAGAN (SuperMap) (36) and progressiveCactus (40), which reconstruct ancestral genomes as input for further alignment steps. (a) A large guide tree (usually the species tree), which may include many species, is divided up into smaller local alignment problems of a few genomes each. (b) A diagram of what occurs within each subproblem. Each subproblem is focused on reconstructing a single ancestral genome, which is then used as input for subproblems further up the tree. Ingroup genomes (children of the ancestor in question) and, optionally, outgroup genomes (nondescendants of the ancestor) are aligned together. A plausible ancestral reconstruction is generated for use in later subproblems.
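The decomposition described in this caption can be sketched as a post-order walk over the guide tree. This is a simplified illustration under stated assumptions, not the procedure used by VISTA-LAGAN or progressiveCactus: the tree encoding (a dict from ancestor name to child names), the function names, and the outgroup rule (all leaf genomes under the ancestor's siblings) are hypothetical choices made for the example.

```python
def leaves_under(tree, node):
    """Return the leaf genomes below node (or [node] if it is a leaf)."""
    children = tree.get(node)
    if not children:
        return [node]
    leaves = []
    for child in children:
        leaves.extend(leaves_under(tree, child))
    return leaves


def plan_subproblems(tree, root):
    """Split a guide tree into one subproblem per ancestral genome.

    tree maps each internal (ancestral) node to its list of children.
    Returns a list of (ancestor, ingroups, outgroups) tuples in post-order,
    so every ancestor appears after the ancestors it depends on.
    Assumption: outgroups are the leaves under the ancestor's siblings.
    """
    parent = {c: p for p, children in tree.items() for c in children}
    plan = []

    def visit(node):
        for child in tree.get(node, []):
            visit(child)
        if node in tree:  # internal node: one ancestral genome to reconstruct
            ingroups = list(tree[node])
            outgroups = []
            if node in parent:
                for sibling in tree[parent[node]]:
                    if sibling != node:
                        outgroups.extend(leaves_under(tree, sibling))
            plan.append((node, ingroups, outgroups))

    visit(root)
    return plan


# Hypothetical three-species guide tree used only for illustration.
example_tree = {"root": ["anc1", "mouse"], "anc1": ["human", "chimp"]}
example_plan = plan_subproblems(example_tree, "root")
```

Because the plan is in post-order, the reconstruction for `anc1` is scheduled before the `root` subproblem that takes it as input.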
Figure 4
Comparing RNA sequencing (RNA-seq) expression quantification across different species with Comparative Annotation Toolkit (CAT). Kallisto (109) protein-coding gene-level expression for chimpanzee induced pluripotent stem cell (iPSC) RNA-seq is compared with human across all of the chimpanzee annotation and assembly combinations as well as when mapped directly to human. In all cases, the x-axis is the transcripts per million of human iPSC data mapped to GRCh38 annotated with GENCODE V27. The highest correlation (Pearson r = 0.96) is seen when comparing Clint (panTro6) annotated with CAT to GRCh38. The value p is the p-value of observing the Pearson correlation.
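For readers unfamiliar with the statistic reported here, the Pearson correlation between two per-gene expression vectors can be computed as in the sketch below. It assumes the two lists hold expression values for the same genes in the same order; the function name `pearson_r` is hypothetical, and the sketch does not reproduce any preprocessing the authors may have applied.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that per-gene expression in one species is close to a linear function of expression in the other.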

References

    1. Needleman SB, Wunsch CD. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443–53
    2. Smith T, Waterman M. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195–97
    3. Bray N, Dubchak I, Pachter L. 2003. AVID: a global alignment program. Genome Res. 13:97–102
    4. Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. 2018. MUMmer4: a fast and versatile genome alignment system. PLOS Comput. Biol. 14:e1005944
    5. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, et al. 2003. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13:721–31
