Repetitive elements may comprise over two-thirds of the human genome

A P Jason de Koning¹, Wanjun Gu, Todd A Castoe, Mark A Batzer, David D Pollock

PMID: 22144907
PMCID: PMC3228813
DOI: 10.1371/journal.pgen.1002384

A P Jason de Koning et al. PLoS Genet. 2011 Dec;7(12):e1002384. doi: 10.1371/journal.pgen.1002384. Epub 2011 Dec 1.

Affiliation

¹ Department of Biochemistry and Molecular Genetics, School of Medicine, University of Colorado, Aurora, Colorado, USA.

Abstract

Transposable elements (TEs) are conventionally identified in eukaryotic genomes by alignment to consensus element sequences. Using this approach, about half of the human genome has been previously identified as TEs and low-complexity repeats. We recently developed a highly sensitive alternative de novo strategy, P-clouds, that instead searches for clusters of high-abundance oligonucleotides that are related in sequence space (oligo "clouds"). We show here that P-clouds predicts >840 Mbp of additional repetitive sequences in the human genome, thus suggesting that 66%-69% of the human genome is repetitive or repeat-derived. To investigate this remarkable difference, we conducted detailed analyses of the ability of both P-clouds and a commonly used conventional approach, RepeatMasker (RM), to detect different-sized fragments of the highly abundant human Alu and MIR SINEs. RM can have surprisingly low sensitivity for even moderately long fragments, in contrast to P-clouds, which has good sensitivity down to small fragment sizes (∼25 bp). Because short fragments have a high intrinsic probability of being false positives, we performed a probabilistic annotation that reflects this fact. We further developed "element-specific" P-clouds (ESPs) to identify novel Alu and MIR SINE elements, and using them we identified ∼100 Mbp of previously unannotated human elements. ESP estimates of new MIR sequences are in good agreement with RM-based predictions of the amount that RM missed. These results highlight the need for combined, probabilistic genome annotation approaches and suggest that the human genome consists of substantially more repetitive sequence than previously believed.
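
The idea of grouping high-abundance oligonucleotides that are related in sequence space can be illustrated with a small toy sketch. This is not the P-clouds implementation: the function names, the one-mismatch neighbor rule, and the two thresholds (`core_min`, `outer_min`) are assumptions chosen only to make the principle concrete.

```python
from collections import Counter, deque


def count_oligos(seq, k):
    """Count every overlapping oligo of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def neighbors(oligo):
    """Yield all oligos that differ from `oligo` at exactly one position."""
    for i, base in enumerate(oligo):
        for alt in "ACGT":
            if alt != base:
                yield oligo[:i] + alt + oligo[i + 1:]


def build_clouds(counts, core_min, outer_min):
    """Group abundant oligos into clouds (toy rule, assumed for illustration).

    An unassigned oligo seen at least `core_min` times seeds a new cloud.
    The cloud then grows through one-mismatch neighbors that are seen at
    least `outer_min` times.
    """
    assigned = {}  # oligo -> index of the cloud it belongs to
    clouds = []    # list of sets of oligos
    for seed, n in counts.most_common():
        if n < core_min:
            break  # remaining oligos are too rare to seed a cloud
        if seed in assigned:
            continue
        cloud_id = len(clouds)
        members = {seed}
        assigned[seed] = cloud_id
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            for nb in neighbors(current):
                if nb not in assigned and counts.get(nb, 0) >= outer_min:
                    assigned[nb] = cloud_id
                    members.add(nb)
                    queue.append(nb)
        clouds.append(members)
    return clouds, assigned


def annotate(seq, k, assigned):
    """Mark every position covered by at least one oligo that is in a cloud."""
    mask = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in assigned:
            for j in range(i, i + k):
                mask[j] = True
    return mask
```

In this sketch, a call such as `build_clouds(count_oligos(genome, 16), core_min=200, outer_min=10)` followed by `annotate(genome, 16, assigned)` would flag candidate repeat-derived positions; the oligo length and threshold values here are placeholders, not parameters taken from the study.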

Figures

**Figure 1. Principles of repeat identification using P-clouds.**
A) True data distribution representing divergence within a TE family from a master element sequence (center). B) Consensus sequence based search throws away information by collapsing observed data to a single sequence. C) *P-clouds* clusters related high-abundance oligos, thus providing better coverage of sequence space.

**Figure 2. P-clouds and RepeatMasker annotation of the repeat structure of the human genome.**
Results are displayed as a percentage of the ungapped genome assembly length. A) Consensus results prior to this study indicate that <50% of the genome is repetitive (*RepeatMasker*). B) Analysis using *P-clouds* suggests more than two-thirds of the genome is repetitive or repeat-derived.

**Figure 3. Percentage of previously identified transposable elements annotated by *P-clouds*.**
A) The percentage of nucleotides and repeats for each family or repeat classification group. B) The number of nucleotides annotated or missed.

**Figure 4. Percentage of Alu elements in different Alu subfamilies not annotated by *P-clouds* analysis.**
Displayed are elements for which no portion was annotated. The relative age of *Alu* subfamilies increases from left to right.

**Figure 5. Percent detection success for fragments of known full-length *SINE* elements.**
A) *Alu* regions. B) *MIR* regions. Identification success is displayed as a running average of 10 bp starting positions.

**Figure 6. MIR element-specific *P-clouds* detect the short fragments that *RepeatMasker* cannot.**
A) Predicted true distribution of *MIR* fragments in the human genome, using observed *RepeatMasker* results and *RepeatMasker's* sensitivity estimates from Figure 5B. B) Novel P-clouds annotations on the RepeatMasked portion of the human genome, minus predicted false positives from dinucleotide simulations (see text).
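
One simple way to read the prediction in Figure 6A is as a sensitivity correction: if a method detects only a fraction of the fragments of a given length, the observed count can be divided by that fraction to estimate the true count. The sketch below assumes this reading; the record does not state the authors' exact procedure, and the function names, bin keys, and all numbers in the usage note are hypothetical.

```python
def predict_true_counts(observed, sensitivity):
    """Estimate true fragment counts per length bin as observed / sensitivity.

    observed:    dict mapping a fragment-length bin to the number detected
    sensitivity: dict mapping the same bins to detection rates in (0, 1]
    """
    predicted = {}
    for length_bin, n_observed in observed.items():
        s = sensitivity.get(length_bin, 0.0)
        # A bin with zero (or missing) sensitivity cannot be corrected,
        # so report None instead of dividing by zero.
        predicted[length_bin] = n_observed / s if s > 0 else None
    return predicted


def predict_missed(observed, sensitivity):
    """Estimate how many fragments per bin went undetected."""
    predicted = predict_true_counts(observed, sensitivity)
    return {
        length_bin: (None if total is None else total - observed[length_bin])
        for length_bin, total in predicted.items()
    }
```

For example, with made-up inputs `observed = {50: 100, 100: 30}` and `sensitivity = {50: 0.5, 100: 0.25}`, the predicted true counts are 200 and 120, so 100 and 90 fragments would be estimated as missed.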
