. 2011 Mar 1;27(5):595-603.

doi: 10.1093/bioinformatics/btq713. Epub 2011 Jan 13.

AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

Alexej Abyzov¹, Mark Gerstein

Affiliations

PMID: 21233167
PMCID: PMC3042181
DOI: 10.1093/bioinformatics/btq713

AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

Alexej Abyzov et al. Bioinformatics. 2011.

. 2011 Mar 1;27(5):595-603.

doi: 10.1093/bioinformatics/btq713. Epub 2011 Jan 13.

Authors

Alexej Abyzov¹, Mark Gerstein

Affiliation

¹ Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06520, USA.

PMID: 21233167
PMCID: PMC3042181
DOI: 10.1093/bioinformatics/btq713

Abstract

Motivation: Defining the precise location of structural variations (SVs) at single-nucleotide breakpoint resolution is an important problem, as it is a prerequisite for classifying SVs, evaluating their functional impact and reconstructing personal genome sequences. Given approximate breakpoint locations and a bridging assembly or split read, the problem essentially reduces to finding a correct sequence alignment. Classical algorithms for alignment and their generalizations guarantee finding the optimal (in terms of scoring) global or local alignment of two sequences. However, they cannot generally be applied to finding the biologically correct alignment of genomic sequences containing SVs because of the need to simultaneously span the SV (e.g. make a large gap) and perform precise local alignments at the flanking ends.

Results: Here, we formulate the computations involved in this problem and describe a dynamic-programming algorithm for its solution. Specifically, our algorithm, called AGE for Alignment with Gap Excision, finds the optimal solution by simultaneously aligning the 5' and 3' ends of two given sequences and introducing a 'large-gap jump' between the local end alignments to maximize the total alignment score. We also describe extensions allowing the application of AGE to tandem duplications, inversions and complex events involving two large gaps. We develop a memory-efficient implementation of AGE (allowing application to long contigs) and make it available as a downloadable software package. Finally, we applied AGE for breakpoint determination and standardization in the 1000 Genomes Project by aligning locally assembled contigs to the human genome.

Availability and implementation: AGE is freely available at http://sv.gersteinlab.org/age.

PubMed Disclaimer

Figures

**Fig. 1.**
Schematics of the expected optimal alignment around a structural variation (left) and alignments produced by global Needleman–Wunsch (NW) and local Smith–Waterman (SW) algorithms (right). The structural variation, i.e. deletion, is in red. In (B), the deletion is accompanied by a small insertion (blue). Throughout the figure, alignable flanking regions are shown in green and orange. Both SW and NW algorithms generally cannot arrive at a biologically correct alignment.

**Fig. 2.**
Sequence similarity (see A) around SV breakpoints (shades of green) can mislead local and/or global alignment(s) to produce incorrect alignment (see B). Conceptually, to produce correct alignment one has to find an optimal jump between overlapping local alignments. However, local alignment calculation and jump finding have to be done simultaneously rather than successively to guarantee finding the optimal alignment (Supplementary Fig. S4).

**Fig. 3.**
Schematics of the algorithm. (A) Two alignment score matrices for the alignment of the 5′ ends (S^L matrix) and the 3′ ends (S^R matrix) are constructed *a la* the Smith–Waterman local alignment. In this example, scoring is as follows: match = 1, mismatch = −1, gap open penalty = 4 and gap extend penalty = 2. Orange arrows represent trace-back information. The best alignment maximizes the sum of the maximum in the leading submatrix (highlighted buff) in S^L and the maximum in the paired trailing submatrix (highlighted cyan) in S^R. (B) Matrices are converted so that each cell contains the maximum score of the leading submatrix in S^L and the trailing submatrix in S^R. The location of score maxima is traced (blue arrows), just like the score is traced in scoring matrices. Maximum score sum (red) can now be found in one pass through the matrices. (C) Alignment is constructed by, first, tracing back the maximum location (red arrows) in each matrix and then, tracing back alignments for the 5′ and 3′ ends (green arrow). The resulting alignment is the sum of the alignments at the 5′ and 3′ ends, with the unaligned region in-between. The maximum score can be redundant (bold rectangles). However, the resulting alignment will be the same.

**Fig. 4.**
Schematics for aligning sequences with tandem duplication and inversion. Color gradient reflects the directionality of sequences from the 5′-end to the 3′-end. (A) Aligning a contig/read spanning a duplication breakpoint can be no different than aligning sequences with a deletion/insertion. (B) For contigs spanning other duplication breakpoints, the order of the aligned fragments is different in the two se-quences. (C) Optimal split-alignment for sequences with inversions is calculated by aligning the 5′-end of the first sequence to the 5′-end of the second one and aligning the 3′-end of the first sequence to the 3′-end of the reverse complement of the second one.

**Fig. 5.**
Comparison of assembled contig alignments in the region of predicted deletions. The first line in each alignment is the sequence for the genomic region, while the second is for the contig sequence. Nucleotide numbering is sequential, starting from one in both compared sequences. Each alignment is accompanied by a schematic representation underneath. (A) The predicted deletion is chr20:2,969,769-2,970,056. The contig that is 614 bp in length has been aligned by the AGE, GAP3, CrossMatch and Blat programs to the predicted region of deletion, which is extended by 1 kb in each direction, i.e. from 2,968,769-2,971,056. The first sequence (genomic region) has two pairs of homologous sequences: orange to yellow and dark green to light green. AGE alignment clearly identifies a large unaligned region, confirms a predicted deletion, and derives deletion breakpoints as chr20:2,969,756-2,970,052 (coordinates are for the first and the last deleted bases). Note that the resulting breakpoints are in excellent agreement (within 13 bp) with the prediction. No other program was able to produce the correct alignment. (B) Predicted deletion is chr8:118,292,728-118,292,987. The contig of 530 bp in length has been aligned by the AGE and GAP3 programs to the predicted region of deletion, which is extended by 1 kb in each direction, i.e. from 118 291 728 to 118 293 987. AGE alignment clearly identifies a large unaligned region, confirms a predicted deletion, and derives deletion breakpoints as chr8:118,292,711-118,292,990 (coordinates are for first and last deleted bases). GAP3 is not able to align the left flanking sequence, as the penalty for a long gap outweighs the matches at the left flanking sequence. All coordinates are for human hg18 reference.

See this image and copyright information in PMC

References

1. Abyzov A, et al. CNVnator: an approach to discover, genotype and characterize typical and atypical cnvs from family and population genome sequencing. Genome Res. 2011 (submitted) - PMC - PubMed
1. Chao KM, et al. Recent developments in linear-space alignment methods: a survey. J. Comput. Biol. 1994;1:271–291. - PubMed
1. Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. - PMC - PubMed
1. Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982;162:705–708. - PubMed
1. Hirschberg DS. A linear space algorithm for computing maximal common subsequences. Commun. ACM. 1975;18:341–343.

Publication types

Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

Affiliation

AGE: defining breakpoints of genomic structural variants at single-nucleotide resolution, through optimal alignments with gap excision

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources

Miscellaneous