RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
- PMID: 33076827
- PMCID: PMC7574428
- DOI: 10.1186/s12859-020-03779-w
Abstract
Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying them plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of a genome with existing tools.
Results: In this study, we present RepAHR, a new method for de novo repeat identification by assembly of the high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Second, RepAHR selects the high-frequency reads from the full set of NGS reads according to rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats with SPAdes, which is regarded as an outstanding genome assembler for NGS data.
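The first two steps described above can be illustrated with a minimal Python sketch. It is not the RepAHR implementation: the abstract does not state the filtering rules, so the rule used here (keep a read when at least `min_fraction` of its k-mers are high-frequency) and the parameters `k`, `min_count`, and `min_fraction` are assumptions chosen for illustration. The sketch also ignores reverse complements and sequencing errors.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Step 1: count k-mers over all reads, keep those seen >= min_count times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, c in counts.items() if c >= min_count}


def high_frequency_reads(reads, hf_kmers, k, min_fraction):
    """Step 2 (assumed rule, not taken from the paper): keep a read when at
    least min_fraction of its k-mers belong to the high-frequency set."""
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read shorter than k has no k-mers
        hits = sum(1 for km in kmers(read, k) if km in hf_kmers)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected


# Toy input: three reads from a repeated ACGT motif, one unrelated read.
reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TTGACCATGA"]
hf = high_frequency_kmers(reads, k=4, min_count=3)
selected = high_frequency_reads(reads, hf, k=4, min_fraction=0.8)
# Step 3 would hand the selected reads to an assembler such as SPAdes.
```

In this toy input only the three reads drawn from the repeated motif are selected. Whole reads, rather than isolated k-mers, are what would be passed to the assembler.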
Conclusions: We tested RepAHR on five data sets. The experimental results show that RepAHR outperforms RepARK and REPdenovo at detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase, and several other metrics.
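Of the metrics listed, N50 has a standard definition: the length L such that sequences of length at least L together cover at least half of the total assembled length. The short Python sketch below computes it under that definition. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total reaches half is the N50.
        if running >= half:
            return length
    return 0


contig_lengths = [80, 70, 50, 40, 30, 20]  # total 290, half 145
result = n50(contig_lengths)  # 80 + 70 = 150 >= 145, so N50 is 70
```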
Keywords: Assembly; De novo repeat identification; NGS reads; The high-frequency k-mers; The high-frequency reads.
Conflict of interest statement
The authors declare that they have no competing interests.
Similar articles
- An improved approach for reconstructing consensus repeats from short sequence reads. BMC Genomics. 2018 Aug 13;19(Suppl 6):566. doi: 10.1186/s12864-018-4920-6. PMID: 30367582.
- A sensitive repeat identification framework based on short and long reads. Nucleic Acids Res. 2021 Sep 27;49(17):e100. doi: 10.1093/nar/gkab563. PMID: 34214175.
- RepARK--de novo creation of repeat libraries from whole-genome NGS reads. Nucleic Acids Res. 2014 May;42(9):e80. doi: 10.1093/nar/gku210. Epub 2014 Mar 14. PMID: 24634442.
- The present and future of de novo whole-genome assembly. Brief Bioinform. 2018 Jan 1;19(1):23-40. doi: 10.1093/bib/bbw096. PMID: 27742661. Review.
- Next-generation sequencing and large genome assemblies. Pharmacogenomics. 2012 Jun;13(8):901-15. doi: 10.2217/pgs.12.72. PMID: 22676195. Review.
Cited by
- msRepDB: a comprehensive repetitive sequence database of over 80 000 species. Nucleic Acids Res. 2022 Jan 7;50(D1):D236-D245. doi: 10.1093/nar/gkab1089. PMID: 34850956.
- Repetitive DNA sequence detection and its role in the human genome. Commun Biol. 2023 Sep 19;6(1):954. doi: 10.1038/s42003-023-05322-y. PMID: 37726397. Review.
- Methodologies for the De novo Discovery of Transposable Element Families. Genes (Basel). 2022 Apr 17;13(4):709. doi: 10.3390/genes13040709. PMID: 35456515. Review.
- Study of Dispersed Repeats in the Cyanidioschyzon merolae Genome. Int J Mol Sci. 2024 Apr 18;25(8):4441. doi: 10.3390/ijms25084441. PMID: 38674025.
- Genome-Wide Tool for Sensitive de novo Identification and Visualisation of Interspersed and Tandem Repeats. Bioinform Biol Insights. 2024 Dec 18;18:11779322241306391. doi: 10.1177/11779322241306391. eCollection 2024. PMID: 39703748.