BMC Bioinformatics. 2020 Oct 19;21(1):463.
doi: 10.1186/s12859-020-03779-w.

RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

Xingyu Liao et al. BMC Bioinformatics. 2020.

Abstract

Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identifying them plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines and tools obtain repeats by assembling the high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut reads into shorter k-mers for assembly, which may destroy the structure of repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions of the genome with existing tools.

Results: In this study, we present RepAHR, a new method for de novo repeat identification by assembly of the high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Second, RepAHR filters the high-frequency reads out of the whole set of NGS reads according to rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats using SPAdes, which is considered an outstanding genome assembler for NGS sequences.
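
To make the first step concrete, the following is a minimal Python sketch of counting k-mers across a set of reads and keeping those whose count reaches a cutoff. It is an illustration only, not RepAHR's implementation: the function names, the toy reads, and the `min_count` cutoff are assumptions, and the sketch ignores reverse complements and sequencing errors, which a real k-mer counter would have to handle.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def high_frequency_kmers(counts, min_count):
    """Keep k-mers seen at least min_count times (hypothetical cutoff)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy reads: the first two overlap heavily, the third is unrelated.
reads = ["ACGTACGT", "CGTACGTA", "TTTTGGGG"]
counts = count_kmers(reads, k=4)
hf_kmers = high_frequency_kmers(counts, min_count=3)
print(sorted(hf_kmers))  # ['ACGT', 'CGTA']
```

On real data the cutoff would be chosen from the k-mer frequency distribution rather than fixed by hand.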

Conclusions: We tested RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo at detecting repeats in terms of N50, reference alignment ratio, coverage ratio of the reference, mask ratio of Repbase, and some other metrics.
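
For readers unfamiliar with N50, one of the metrics named above, here is a short Python sketch using the standard definition: the length L such that sequences of length at least L together contain at least half of all assembled bases. The function name and the example lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Total is 290 bases; 80 + 70 = 150 already covers half, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 indicates that more of the output sits in long fragments, which is why it is commonly reported alongside alignment and coverage ratios.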

Keywords: Assembly; De novo repeat identification; NGS reads; The high-frequency k-mers; The high-frequency reads.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
A visual example of the alignment of the high-frequency k-mers and the high-frequency reads with segments in the Repbase library
Fig. 2
Coverage of a special repetitive region of the human-r14 genome by the detection results of each tool
Fig. 3
A practical example of the alignment of fragments obtained by different tools to the human-r14 reference sequence
Fig. 4
Alignment ratios and multiple alignment ratios of repeats generated by four tools on five different datasets. ‘Alignment ratios(%)’ is the proportion of fragments in the detected results that can be aligned to the reference genome, and ‘Multiple alignment ratios(%)’ is the proportion of fragments in the detected results that can be aligned to multiple locations on the reference genome
Fig. 5
The frequency distribution of segments in the detection results generated by RepAHR on the Drosophila melanogaster dataset
Fig. 6
Masked ratios on reference genome. 'Masked ratios on reference genome(%)' is the proportion of bases in the reference genome masked as repeats by the output of each of the four tools, as measured by RepeatMasker
Fig. 7
Masked ratios on Repbase sequences. 'Masked ratios on Repbase sequences(%)' is the proportion of bases in the fragments of the Repbase library that are covered by the detection results of the four tools
Fig. 8
Distribution of BLAST coverage ratios on Repbase sequences. Box plots of BLAST alignment ratios of the repeats identified by the four tools to the repeat segments in the Repbase library. Panel a shows the distribution for single alignment, and panel b shows the distribution for maximum alignment
Fig. 9
The illustration of the pipeline of RepAHR
Fig. 10
The k-mer frequency distribution histogram. In this figure, the blue line is the number of k-mers with a specific frequency, the orange dotted line is a Gaussian fit to the trend near the main peak of the blue line, the green dotted line is the vertical line from the main peak to the x-axis, and p is the position where the green dotted line intersects the x-axis
Fig. 11
Schematic diagram of generating the high-frequency reads. In this figure, each green line on the left denotes a high-frequency k-mer, and together these k-mers constitute the high-frequency k-mer set. The blue line denotes an NGS read, and the green and red line segments under the blue line represent all the k-mers generated from that read. A green segment denotes a k-mer that appears in the high-frequency k-mer set, and a red segment denotes a k-mer that does not. The right side of the diagram shows a matched case and an unmatched case
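
The matched and unmatched cases in Fig. 11 can be illustrated with a small Python sketch that splits a read into k-mers and checks how many fall in the high-frequency k-mer set. The abstract does not spell out the rules RepAHR applies, so the `min_fraction` threshold below is purely a hypothetical stand-in, as are the function names and toy sequences.

```python
def kmer_match_fraction(read, hf_kmers, k):
    """Fraction of the read's k-mers found in the high-frequency set."""
    n = len(read) - k + 1
    if n <= 0:
        return 0.0
    hits = sum(1 for i in range(n) if read[i:i + k] in hf_kmers)
    return hits / n


def is_high_frequency_read(read, hf_kmers, k, min_fraction=0.8):
    """Hypothetical rule: keep reads whose k-mers are mostly high-frequency."""
    return kmer_match_fraction(read, hf_kmers, k) >= min_fraction


hf_kmers = {"ACGT", "CGTA", "GTAC", "TACG"}
matched = "ACGTACGT"    # all 5 k-mers are in the set
unmatched = "ACGTTTTT"  # only 1 of 5 k-mers is in the set

print(is_high_frequency_read(matched, hf_kmers, k=4))    # True
print(is_high_frequency_read(unmatched, hf_kmers, k=4))  # False
```

Keeping whole reads rather than isolated k-mers preserves the sequence context around each repeat, which is the motivation the abstract gives for assembling reads instead of k-mers.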
