Design optimization methods for genomic DNA tiling arrays

Paul Bertone¹, Valery Trifonov, Joel S Rozowsky, Falk Schubert, Olof Emanuelsson, John Karro, Ming-Yang Kao, Michael Snyder, Mark Gerstein

Affiliations

PMID: 16365382
PMCID: PMC1361723
DOI: 10.1101/gr.4452906

Design optimization methods for genomic DNA tiling arrays

Paul Bertone et al. Genome Res. 2006 Feb.

. 2006 Feb;16(2):271-81.

doi: 10.1101/gr.4452906. Epub 2005 Dec 19.

Authors

Paul Bertone¹, Valery Trifonov, Joel S Rozowsky, Falk Schubert, Olof Emanuelsson, John Karro, Ming-Yang Kao, Michael Snyder, Mark Gerstein

Affiliation

¹ Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CN 06520, USA. P50 HG02357

PMID: 16365382
PMCID: PMC1361723
DOI: 10.1101/gr.4452906

Abstract

A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.

PubMed Disclaimer

Figures

**Figure 1.**
(*Left*) Evolution of genomic tiling arrays. Representing large spans of genomic DNA with bacterial artificial chromosome (BAC) clones facilitates global experimentation using relatively few array features, at the expense of low-tiling resolution. Higher-resolution designs using PCR products or oligonucleotides allow precise mapping of transcripts and regulatory elements, but require labor-intensive or technologically sophisticated approaches to implement. (*Upper right*) Linear feature tiling with gapped and end-to-end oligonucleotide placement. (*Lower right*) Overlapping tiles using fractional offset (e.g., one 25-mer probe placed every 5 nt) and single-base offset placement. The latter strategy provides a finer-resolution tiling of the genomic sequence, and can give a more precise indication of where hybridizing sequences are located on the chromosome.

**Figure 2.**
The problem of sequence similarity in tiling genomic DNA. (A) The level of similarity of oligonucleotide sequences to the remainder of the genome is represented by descending bars, where longer bars indicate more redundant sequences. If the redundancy exceeds a given threshold, indicated by the dashed line, the sequence is omitted from the tile path (B). Avoiding redundant or repetitive sequences inhibits adequate tiling of the sequence (C). Here, the level of non-repetitive sequence coverage decreases as the minimum tile size increases. At this point it also becomes necessary to use approximations that identify instances of known DNA transposons, retroelements, satellites, and other repetitive sequences, rather than calculating an explicit measure of sequence similarity. (D) In order to recover a higher percentage of non-repetitive DNA, tiling algorithms can be devised that incorporate some redundant sequences (gray) in an optimal fashion, which balances the cost of inclusion against the gain in sequence coverage.

**Figure 3.**
Representation of repetitive and non-repetitive sequences in genomic DNA. (A) Repetitive (gray) and non-repetitive (blue) segments are oriented vertically; the length of each subsequence is reflected by the height of the corresponding bar. (B) Repeat-masked region of human chromosome 10 showing alternating segments of repetitive and non-repetitive DNA. The high level of sequence fragmentation is clear, as is the wide range of sizes in both repetitive and non-repetitive segments. The horizontal line (red) indicates a segment length of 300 bp; a large number of non-repetitive sequences below this threshold are omitted when using naive tiling methods that simply avoid repeats. (C) Finer-resolution window of chromosome 10 before and after optimal sequence tiling. (Blue bars) Non-repetitive sequence that is covered in each case, (red bars) non-repetitive sequence that is lost. Non-repetitive sequences below the minimum size threshold are omitted when the sequence is tiled in a straightforward manner (*left* panel). Many of these are recovered after optimal tiling methods are applied (*right* panel). Note that a small number of repetitive nucleotides (short blue bars extending below the horizontal line) are included in the tile path to increase the overall sequence coverage.

**Figure 4.**
(A) Graphical representation of repetitive and non-repetitive segments in repeat-masked DNA. (B) In the naive tiling case, many small non-repetitive regions might be lost (yellow). (C) These can be recovered by using partitioning methods that generate an optimal tile path over the sequence.

**Figure 5.**
Many different partitionings share common subparts. To compute any partitioning with a split at k, the best partitioning for (*i, k*) and for (*k, j*) must be known. Since there are many ways to partition the sequence with a split at k, we only need to recursively evaluate a subpartitioning for subword (*i, j*) and (*k, j*) once. In all cases where we need the optimal solution for these subwords again, we refer to the pre-computed result instead of considering all further possible partitionings of that subword.

See this image and copyright information in PMC

References

1. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215 403-410. - PubMed
1. Bao, Z. and Eddy, S.R. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12 1269-1276. - PMC - PubMed
1. Benson, G. 1999. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27 573-580. - PMC - PubMed
1. Berman, P., Bertone, P., Dasgupta, B., Gerstein, M., Kao, M.Y., and Snyder, M. 2004. Fast optimal genome tiling with applications to microarray design and homology search. J. Comput. Biol. 11 766-785. - PubMed
1. Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M., Weissman, S., et al. 2004. Global identification of human transcribed sequences with genome tiling arrays. Science 306 2242-2246. - PubMed

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Grants and funding

P50 HG002357/HG/NHGRI NIH HHS/United States

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Design optimization methods for genomic DNA tiling arrays

Affiliation

Design optimization methods for genomic DNA tiling arrays

Authors

Affiliation

Abstract

Figures

References

MeSH terms

Grants and funding

LinkOut - more resources

Full Text Sources

Other Literature Sources