Nucleic Acids Res. 2019 Feb 20;47(3):e18.
doi: 10.1093/nar/gky1194.

Deep repeat resolution-the assembly of the Drosophila Histone Complex

Philipp Bongartz et al. Nucleic Acids Res.

Abstract

Though the advent of long-read sequencing technologies has led to a leap in contiguity of de novo genome assemblies, current reference genomes of higher organisms still do not provide unbroken sequences of complete chromosomes. Despite reads in excess of 30 000 base pairs, there are still repetitive structures that cannot be resolved by current state-of-the-art assemblers. The most challenging of these structures are tandemly arrayed repeats, which occur in the genomes of all eukaryotes. Untangling tandem repeat clusters is exceptionally difficult, since the rare differences between repeat copies are obscured by the high error rate of long reads. Solving this problem would constitute a major step towards computing fully assembled genomes. Here, we demonstrate by example of the Drosophila Histone Complex that via machine learning algorithms, it is possible to exploit the underlying distinguishing patterns of single nucleotide variants of repeats from very noisy data to resolve a large and highly conserved repeat cluster. The ideas explored in this paper are a first step towards the automated assembly of complex repeat structures and promise to be applicable to a wide range of eukaryotic genomes.

Figures

Figure 1.
In (A), we illustrate the fundamental problem: a hypothetical master consensus of all copy versions is more similar to the signatures than signatures that belong to the same copy are to each other. With that property, it acts like a vanishing point: signatures with a low error rate all seem to be quite similar. The neural network depicted in (B) solves this problem because it is able to pick up on the sub-signatures shared by signatures from the same copy. In (C), we show the unrolled converging corrector: the signatures are repeatedly corrected by the concatenated neural networks, each using the signature itself and both neighbouring signatures, until the bases stop changing. Panel (D) shows one assembly graph greedily traversed starting from one end of the complex. Node size and number give the size of each assembly group after mapping all clusters, even those that do not fit anywhere well. This over-mapping allows us to double-check over-represented groups and to catch the two collapsed parts of the complex marked in red. The other coloured nodes stand for large-scale variations.
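The "correct until the bases stop changing" loop of panel (C) can be sketched as a fixed-point iteration. This is only an illustration under stated assumptions: the per-position majority vote below is a hypothetical stand-in for the paper's neural networks, "neighbouring" is taken to mean adjacent entries in a list, and the names `majority_corrector`, `converge`, and `max_rounds` are invented for this example. A plain majority vote would tend to erase genuine differences between copies, which is exactly the vanishing-point problem of panel (A), so it serves only to show the control flow.

```python
from collections import Counter

def majority_corrector(left, current, right):
    """Hypothetical stand-in for the neural network: per-position majority vote."""
    corrected = []
    for l, c, r in zip(left, current, right):
        base, count = Counter((l, c, r)).most_common(1)[0]
        # Keep the current base unless at least two of the three agree on another.
        corrected.append(base if count >= 2 else c)
    return "".join(corrected)

def converge(signatures, corrector=majority_corrector, max_rounds=100):
    """Apply the corrector to every signature until no base changes."""
    sigs = list(signatures)
    for rounds in range(1, max_rounds + 1):
        updated = []
        for i, sig in enumerate(sigs):
            # Assumption: at either end, the signature acts as its own neighbour.
            left = sigs[i - 1] if i > 0 else sig
            right = sigs[i + 1] if i + 1 < len(sigs) else sig
            updated.append(corrector(left, sig, right))
        if updated == sigs:  # fixed point reached
            return sigs, rounds
        sigs = updated
    return sigs, max_rounds  # round cap guards against oscillation

if __name__ == "__main__":
    result, rounds = converge(["ACGT", "ACGA", "ACGT"])
    print(result, rounds)
```

The round cap is a design choice of this sketch: a generic corrector is not guaranteed to reach a fixed point, so the loop stops after `max_rounds` passes either way.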
Figure 2.
In (A), we examine full signatures with ground truth information. For each signature, we calculate the likelihood that the n-th best overlapping signature belongs to the same ground truth copy group. For the uncorrected signatures, this likelihood starts low and drops fast, whereas the corrected signatures have a significantly higher likelihood of correct overlaps, which stays stable for the 25 best overlaps. Panel (B) shows the error reduction achieved by first-pass correction and the converging corrector. In (C) and (D), we use a t-distributed stochastic neighbor embedding (t-SNE) visualization to show how the correction facilitates the separation of neighbouring groups of signatures.
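The rank-wise statistic of panel (A) can be approximated as follows. This is a minimal sketch, assuming equal-length signatures and using the count of matching positions as a hypothetical overlap score; the scoring actually used in the paper is not given here, and the function names are invented for illustration.

```python
def matches(a, b):
    """Hypothetical overlap score: number of positions where two signatures agree."""
    return sum(x == y for x, y in zip(a, b))

def same_group_rate_by_rank(signatures, labels, max_rank):
    """For each rank n, the fraction of signatures whose n-th best
    overlap carries the same ground-truth label."""
    hits = [0] * max_rank
    for i, sig in enumerate(signatures):
        others = [j for j in range(len(signatures)) if j != i]
        # Best overlap first; ties keep their original order.
        ranked = sorted(others, key=lambda j: matches(sig, signatures[j]), reverse=True)
        for n, j in enumerate(ranked[:max_rank]):
            hits[n] += labels[j] == labels[i]
    return [h / len(signatures) for h in hits]

if __name__ == "__main__":
    sigs = ["AAAA", "AAAT", "CCCC", "CCCG"]
    groups = [0, 0, 1, 1]
    print(same_group_rate_by_rank(sigs, groups, max_rank=2))
```

Computing the same list before and after correction gives the two curves that the caption compares.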
Figure 3.
Panel (A) shows how raw sequencing data is categorized as repeat or unique sequence using the mapping information of subsequences of the repeat template. In (B), the reads are cut and the repeat sequences are arranged into a multiple sequence alignment (MSA). Panel (C) shows the refinement of the MSA. In (D), corrections between rows are detected and statistically significant bases are collected into signatures. Panel (E) illustrates how the signatures are corrected via neural networks. Finally, in (F), the signatures are clustered and the resulting assembly graph is traversed.
