BMC Bioinformatics. 2020 Oct 19;21(1):463.

doi: 10.1186/s12859-020-03779-w.

RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

Xingyu Liao¹, Xin Gao², Xiankai Zhang³, Fang-Xiang Wu⁴, Jianxin Wang³

Affiliations

¹ School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, ChangSha, 410083, China. liaoxingyu@csu.edu.cn.
² Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia.
³ School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, ChangSha, 410083, China.
⁴ Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada.

PMID: 33076827
PMCID: PMC7574428
DOI: 10.1186/s12859-020-03779-w

Xingyu Liao et al. BMC Bioinformatics. 2020.

Abstract

Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying them plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut reads into shorter k-mers for assembly, which may destroy the structure of repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of a genome with existing tools.

Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Second, RepAHR selects the high-frequency reads from the full set of NGS reads according to rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats with SPAdes, which is regarded as an outstanding genome assembler for NGS data.
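The first two steps described above can be illustrated with a minimal Python sketch. It is not the RepAHR implementation: the abstract does not state the selection rules, so the count threshold `min_count`, the fraction threshold `min_fraction`, and all function names here are hypothetical, and reverse complements are ignored for brevity. The assembly step with SPAdes is not shown.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Return the set of k-mers seen at least min_count times across all reads.

    min_count is an assumed, user-chosen threshold.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def high_frequency_reads(reads, k, hf_kmers, min_fraction):
    """Keep reads whose share of high-frequency k-mers reaches min_fraction.

    This fraction rule is an assumption standing in for the paper's rules.
    """
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k, so it has no k-mers
        hits = sum(1 for km in kmers(read, k) if km in hf_kmers)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected


# Toy input: three reads share a repeated motif, one read is unrelated.
reads = ["ACGTACGT", "ACGTACGA", "TTGCAAGC", "CGTACGTA"]
hf = high_frequency_kmers(reads, k=4, min_count=3)
kept = high_frequency_reads(reads, k=4, hf_kmers=hf, min_fraction=0.6)
print(sorted(hf))
print(kept)
```

In this toy input the unrelated read contains no high-frequency k-mer and is dropped, while the three reads carrying the repeated motif are kept for assembly.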

Conclusions: We tested RepAHR on five data sets. The experimental results show that RepAHR outperforms RepARK and REPdenovo at detecting repeats in terms of N50, reference alignment ratio, coverage ratio of the reference, mask ratio of Repbase, and several other metrics.
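Of the metrics listed, N50 has a standard definition that is easy to state in code: it is the largest length L such that sequences of length at least L together contain at least half of all assembled bases. The sketch below uses that common definition; the function name `n50` is illustrative, and the paper's exact evaluation script is not shown in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half is the N50.
        if running >= half:
            return length
    return 0


# Total length is 54, so half is 27; 10 + 9 + 8 = 27 reaches it at length 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```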

Keywords: Assembly; De novo repeat identification; NGS reads; The high-frequency k-mers; The high-frequency reads.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
A visual example of the alignment of the high-frequency *k-mers* and the high-frequency reads with the segments in the Repbase library

**Fig. 2**
A special repetitive region on the human-r14 genome covered by the detection results of each tool

**Fig. 3**
A practical example of the alignment between fragments obtained by different tools and the human-r14 reference sequence

**Fig. 4**
Alignment ratios and multiple alignment ratios of repeats generated by four tools on five different datasets. ‘Alignment ratios(%)’ is the proportion of fragments in the detected results that can be aligned to the reference genome, and ‘Multiple alignment ratios(%)’ is the proportion of fragments in the detected results that can be aligned to multiple locations on the reference genome

**Fig. 5**
The frequency distribution of segments in the results detected by RepAHR on the Drosophila melanogaster dataset

**Fig. 6**
Masked ratios on reference genome. 'Masked ratios on reference genome(%)' is the proportion of bases on the reference genome that are masked as repeats using the repeats generated by each of the four tools, as measured by RepeatMasker

**Fig. 7**
Masked ratios on Repbase sequences. 'Masked ratios on Repbase sequences(%)' is the proportion of bases on the fragments in the Repbase library that are covered by the detection results of the four tools

**Fig. 8**
Distribution of BLAST coverage ratios on Repbase sequences. Box plots of BLAST alignment ratios of the repeats identified by the four tools to the repeat segments in the Repbase library. Sub-graph a shows the distribution for single alignment, and sub-graph b shows the distribution for the maximum alignment

**Fig. 9**
The illustration of the pipeline of RepAHR

**Fig. 10**
The *k-mer* frequency distribution histogram. In this figure, the blue line is the number of *k-mers* with a specific frequency, the orange dotted line is a Gaussian fit to the trend near the main peak of the blue line, the green dotted line is the vertical line from the main peak to the x-axis, and p is the position where the green dotted line intersects the x-axis

**Fig. 11**
Schematic diagram of generating the high-frequency reads. In this figure, a green line on the left denotes a high-frequency *k-mer*, and all these *k-mers* together constitute the high-frequency *k-mer* set. The blue line denotes an NGS read, and the green and red line segments under the blue line represent all the *k-mers* generated from that read. A green line denotes a *k-mer* which appears in the high-frequency *k-mer* set, and a red line denotes a *k-mer* which does not appear in the high-frequency *k-mer* set. The diagram contains a matched case and an unmatched case on the right
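The histogram described in Fig. 10 can be sketched in a few lines of Python. The caption mentions a Gaussian fit near the main peak; the sketch below swaps in a simpler stand-in, a plain argmax taken after the first valley of the histogram, to estimate the peak position p. The function names and the valley heuristic are assumptions for illustration only, and how RepAHR turns p into a high-frequency threshold is not stated on this page.

```python
from collections import Counter


def frequency_histogram(kmer_counts):
    """Map each frequency to the number of distinct k-mers with that frequency."""
    return Counter(kmer_counts.values())


def main_peak(hist):
    """Estimate the main-peak position p of a k-mer frequency histogram.

    Heuristic (an assumption, not the paper's Gaussian fit): skip the initial
    falling slope, which is usually dominated by rare, error-derived k-mers,
    then return the frequency with the largest bin from the valley onward.
    """
    if not hist:
        raise ValueError("empty histogram")
    freqs = sorted(hist)
    start = 0
    # Walk down the initial slope until the bins stop decreasing.
    while start + 1 < len(freqs) and hist[freqs[start + 1]] < hist[freqs[start]]:
        start += 1
    return max(freqs[start:], key=lambda f: hist[f])


# Toy histogram: an error-driven spike at frequency 1, a valley at 3, a peak at 6.
hist = {1: 1000, 2: 300, 3: 80, 4: 120, 5: 400, 6: 650, 7: 500, 8: 200, 20: 15}
p = main_peak(hist)
print(p)
```

A `Counter` of k-mer occurrences, such as the one built when scanning the reads, can be passed to `frequency_histogram` to obtain the input for `main_peak`.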

Cited by

msRepDB: a comprehensive repetitive sequence database of over 80 000 species.
Liao X, Hu K, Salhi A, Zou Y, Wang J, Gao X. Nucleic Acids Res. 2022 Jan 7;50(D1):D236-D245. doi: 10.1093/nar/gkab1089. PMID: 34850956. Free PMC article.
Repetitive DNA sequence detection and its role in the human genome.
Liao X, Zhu W, Zhou J, Li H, Xu X, Zhang B, Gao X. Commun Biol. 2023 Sep 19;6(1):954. doi: 10.1038/s42003-023-05322-y. PMID: 37726397. Free PMC article. Review.
Methodologies for the De novo Discovery of Transposable Element Families.
Storer JM, Hubley R, Rosen J, Smit AFA. Genes (Basel). 2022 Apr 17;13(4):709. doi: 10.3390/genes13040709. PMID: 35456515. Free PMC article. Review.
Study of Dispersed Repeats in the Cyanidioschyzon merolae Genome.
Rudenko V, Korotkov E. Int J Mol Sci. 2024 Apr 18;25(8):4441. doi: 10.3390/ijms25084441. PMID: 38674025. Free PMC article.
Genome-Wide Tool for Sensitive de novo Identification and Visualisation of Interspersed and Tandem Repeats.
Kalendar R, Kairov U. Bioinform Biol Insights. 2024 Dec 18;18:11779322241306391. doi: 10.1177/11779322241306391. eCollection 2024. PMID: 39703748. Free PMC article.
