RepARK--de novo creation of repeat libraries from whole-genome NGS reads

Philipp Koch¹, Matthias Platzer², Bryan R Downie²

Affiliations

¹ Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany philippk@fli-leibniz.de.
² Genome Analysis, Leibniz Institute for Age Research - Fritz Lipmann Institute, Beutenbergstr. 11, 07745 Jena, Germany.

PMID: 24634442
PMCID: PMC4027187
DOI: 10.1093/nar/gku210


Philipp Koch et al. Nucleic Acids Res. 2014 May.

Nucleic Acids Res. 2014 May;42(9):e80.

doi: 10.1093/nar/gku210. Epub 2014 Mar 14.


Abstract

Generation of repeat libraries is a critical step for analysis of complex genomes. In the era of next-generation sequencing (NGS), such libraries are usually produced using a whole-genome shotgun (WGS) derived reference sequence whose completeness greatly influences the quality of derived repeat libraries. We describe here a de novo repeat assembly method, RepARK (Repetitive motif detection by Assembly of Repetitive K-mers), which avoids potential biases by using abundant k-mers of NGS WGS reads without requiring a reference genome. For validation, repeat consensuses derived from simulated and real Drosophila melanogaster NGS WGS reads were compared to repeat libraries generated by four established methods. RepARK is orders of magnitude faster than the other methods and generates libraries that are: (i) composed almost entirely of repetitive motifs, (ii) more comprehensive and (iii) almost completely annotated by TEclass. Additionally, we show that the RepARK method is applicable to complex genomes like human and can even serve as a diagnostic tool to identify repetitive sequences contaminating NGS datasets.


Figures

**Figure 1.**
Workflow of the repeat library creation pipeline RepARK. WGS sequencing reads (a) contain unique (black) and repetitive (red) fractions of the genome. K-mers of all reads (b) are counted and the threshold for frequent k-mers is determined. These abundant k-mers are isolated (c) and assembled by a *de novo* genome assembly program (such as Velvet) into repeat consensus sequences (d).
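Steps (b) and (c) of this workflow can be illustrated with a minimal Python sketch. It is not the RepARK implementation: the function names `count_kmers` and `abundant_kmers` are hypothetical, reverse complements are not merged, and the threshold rule (a fixed multiple `fold` of the median k-mer count) is an assumption made only for illustration, since the caption does not say how the threshold is determined. The assembly step (d) is left to an external assembler and is not shown.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundant_kmers(counts, fold=2.0):
    """Keep k-mers whose count reaches a cutoff.

    Assumption: the cutoff is `fold` times the median k-mer count.
    This rule is a placeholder, not the threshold used by RepARK.
    """
    cutoff = fold * median(counts.values())
    kept = {kmer: n for kmer, n in counts.items() if n >= cutoff}
    return kept, cutoff


if __name__ == "__main__":
    # Toy reads: the first contains a tandem repeat, the second is unique.
    reads = ["ACGTACGTACGT", "GGCCTTAA"]
    counts = count_kmers(reads, k=4)
    kept, cutoff = abundant_kmers(counts, fold=2.0)
    print(f"cutoff={cutoff}, abundant k-mers={sorted(kept)}")
```

In a real run, the retained k-mers would be written out (for example as FASTA) and passed to the assembler; a dedicated k-mer counter would replace the in-memory `Counter` for genome-scale data.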

**Figure 2.**
Cumulative length of repetitive and non-repetitive consensuses within each library. Black: repetitive consensuses (i.e. align more than once to the reference); gray: non-repetitive consensuses (i.e. singly mapping or not at all); Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.
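The classification used in this figure can be sketched as follows, assuming the number of alignments of each consensus to the reference is already known. The function name `cumulative_lengths` and the input dictionaries are hypothetical; producing the hit counts (for example with an aligner) is outside this sketch.

```python
def cumulative_lengths(consensuses, hit_counts):
    """Sum consensus lengths per class.

    consensuses: dict mapping consensus name -> sequence
    hit_counts:  dict mapping consensus name -> number of alignments
                 to the reference (missing names count as 0 hits)

    A consensus is 'repetitive' if it aligns more than once, and
    'non_repetitive' if it aligns once or not at all.
    """
    totals = {"repetitive": 0, "non_repetitive": 0}
    for name, seq in consensuses.items():
        hits = hit_counts.get(name, 0)
        label = "repetitive" if hits > 1 else "non_repetitive"
        totals[label] += len(seq)
    return totals


if __name__ == "__main__":
    # Hypothetical toy library with made-up names and hit counts.
    library = {"c1": "ACGTACGT", "c2": "GGCC", "c3": "TTTAAA"}
    hits = {"c1": 5, "c2": 1}  # c3 did not align at all
    print(cumulative_lengths(library, hits))
```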

**Figure 3.**
Repeat fractions identified in the *D. melanogaster* reference sequence. Black: fraction of the reference masked by RepeatMasker using the respective repeat library; gray: fraction of the reference that was subsequently masked by RepeatMasker using RepBase; Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.

**Figure 4.**
Boxplot of DmRepBase repeat class completeness in the *de novo* repeat libraries. DNA: 33 DNA transposons; LTR: 138 LTR retrotransposons; non-LTR: 41 non-LTR retrotransposons; Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads; box: first and third quartiles; horizontal line: median; whiskers: most extreme value within 1.5× of inter-quartile range; dots: outliers. A full table of repeat family representation in the RepARK libraries can be found in Supplementary Table S3.
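The boxplot convention described in this caption can be made concrete with a short sketch. The function name `box_stats` and the sample values are hypothetical, and the quartile method (`inclusive` interpolation from Python's `statistics` module) is an assumption; plotting software may use a slightly different quartile definition.

```python
from statistics import quantiles


def box_stats(values):
    """Compute quartiles, whisker ends and outliers for a boxplot.

    Whiskers extend to the most extreme values that lie within
    1.5 x IQR of the first and third quartiles; values beyond the
    whiskers are reported as outliers.
    """
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Made-up values, not data from the paper.
    print(box_stats([0, 1, 2, 3, 4, 5, 6, 7, 8, 30]))
```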

**Figure 5.**
Fractions of known *D. melanogaster* segmental duplications identified by the *de novo* repeat libraries. Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.

**Figure 6.**
Fractions of the *D. melanogaster* genome reference classified according to annotated repeat libraries. Black: DNA transposon sequence; dark gray: retrotransposon sequence; light gray: unclear; Sanger: libraries based on Sanger sequencing data; simulated: libraries derived from simulated NGS reads; real: libraries derived from Illumina reads.

**Figure 7.**
High confidence alignments of human RepARK consensuses (right half) to the Epstein-Barr virus genome (left half, HHV-4). Each ribbon represents a consensus alignment with >90% mapping and p < 10⁻⁶⁰, encompassing 90.5% of the Epstein-Barr virus genome. Lower confidence consensuses align to the remaining 9.5% with more relaxed criteria. Three consensuses map multiple times to the virus genome sequence (NODE_48265, NODE_888, NODE_5085; dark red). Created with Circoletto (http://bat.ina.certh.gr/tools/circoletto/).


