Int J Genomics. 2015:2015:563482.
doi: 10.1155/2015/563482. Epub 2015 Oct 19.

RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes

Krisztian Buza et al. Int J Genomics. 2015.

Abstract

Background. Next-generation sequencing technologies now produce reads totaling multiple times the genome size in a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, most typical protocols disregard this information and use the reference genome. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We have made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.

Figures

Figure 1
RECORD: Reference-Assisted Genome Assembly for Closely Related Genomes. The inputs of the pipeline, that is, the experimental reads and the reference genome, are illustrated in the top left and top right of the figure, respectively. Intermediate results produced in various steps of the analysis process are depicted. The dependency between these intermediate results is shown by arrows. In the illustration of the 3rd step, we underlined those segments of the edited reference which were replaced by one of the assembly contigs.
Figure 2
Generation of pseudoreads from the reference genome.
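As a rough illustration of the idea behind Figure 2, pseudoreads can be thought of as read-sized substrings cut from the reference genome. The sketch below assumes a simple sliding-window scheme. The function name `make_pseudoreads`, the read length, and the step size are hypothetical choices for illustration and are not taken from RECORD.

```python
def make_pseudoreads(reference: str, read_len: int = 50, step: int = 10) -> list[str]:
    """Cut overlapping, fixed-length windows out of a reference sequence.

    Assumption: pseudoreads are plain substrings taken every `step` bases.
    The actual sampling scheme used by RECORD may differ.
    """
    if read_len <= 0 or step <= 0:
        raise ValueError("read_len and step must be positive")
    if len(reference) < read_len:
        return []

    last_start = len(reference) - read_len
    starts = list(range(0, last_start + 1, step))
    # Add a final window so the end of the reference is always covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [reference[s:s + read_len] for s in starts]


if __name__ == "__main__":
    for read in make_pseudoreads("ACGTACGTAC", read_len=4, step=3):
        print(read)
```

A step smaller than the read length gives overlapping pseudoreads, so every reference base is covered by at least one of them.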
Figure 3
Resolution of ambiguity. First, for each contig, its best mapping is determined, and then the remaining ambiguity is resolved in greedy fashion by giving priority to the beginning of the contigs as shown in the figure.
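The two steps named in the caption can be sketched as follows. This is one possible reading, not RECORD's implementation: it assumes each mapping carries a numeric score, and it interprets "priority to the beginning of the contigs" as trimming the tail of the earlier contig wherever two contigs overlap on the reference. The `Mapping` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    contig: str
    ref_start: int  # inclusive position on the reference
    ref_end: int    # exclusive position on the reference
    score: float    # assumed: higher means a better mapping


def best_mappings(mappings: list[Mapping]) -> list[Mapping]:
    """Step 1: keep only the highest-scoring mapping of each contig."""
    best: dict[str, Mapping] = {}
    for m in mappings:
        if m.contig not in best or m.score > best[m.contig].score:
            best[m.contig] = m
    return list(best.values())


def resolve_overlaps(mappings: list[Mapping]) -> list[Mapping]:
    """Step 2 (assumed interpretation): walk along the reference and, where
    two contigs overlap, let the beginning of the later contig win by
    trimming the end of the earlier one."""
    chosen = sorted(best_mappings(mappings), key=lambda m: (m.ref_start, m.ref_end))
    resolved: list[Mapping] = []
    for m in chosen:
        if resolved and resolved[-1].ref_end > m.ref_start:
            prev = resolved[-1]
            resolved[-1] = Mapping(prev.contig, prev.ref_start, m.ref_start, prev.score)
        resolved.append(m)
    # Drop contigs whose interval was trimmed away entirely.
    return [m for m in resolved if m.ref_end > m.ref_start]


if __name__ == "__main__":
    candidates = [
        Mapping("c1", 0, 100, 0.9),
        Mapping("c1", 500, 600, 0.5),  # ambiguous second hit of c1
        Mapping("c2", 80, 200, 0.8),
    ]
    for m in resolve_overlaps(candidates):
        print(m.contig, m.ref_start, m.ref_end)
```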
Figure 4
Comparison of the proposed approach (RECORD) with two state-of-the-art genome assemblers on data simulated with wgsim. In this experiment, we consider the evolved genome produced by Evolver as the target genome; the reference genome is the ancestral genome. The diagrams show the performance of the examined approaches according to various criteria as a function of the number of simulated reads used for the assembly. Diagram (a) shows the number of covered bases of the target genome; diagram (b) shows the accuracy, that is, the overall percent identity between the assembly contigs and the corresponding segments of the target genome; and diagram (c) shows the number of largest contigs that together cover at least 50% of the target genome.
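The criteria in diagrams (b) and (c) can be made concrete with a small sketch. It assumes that contigs do not overlap on the target genome, so their lengths can simply be summed, and that each contig is already aligned to its target segment as a pair of equal-length strings with `-` marking gaps. The function names are hypothetical and the paper's evaluation scripts may compute these values differently.

```python
from typing import Optional


def percent_identity(aligned_pairs: list[tuple[str, str]]) -> float:
    """Overall percent identity: matching columns over all aligned columns."""
    matches = 0
    columns = 0
    for contig_row, target_row in aligned_pairs:
        if len(contig_row) != len(target_row):
            raise ValueError("aligned rows must have equal length")
        columns += len(contig_row)
        matches += sum(
            1 for a, b in zip(contig_row, target_row) if a == b and a != "-"
        )
    if columns == 0:
        raise ValueError("no aligned columns")
    return 100.0 * matches / columns


def contigs_to_cover_half(contig_lengths: list[int], target_length: int) -> Optional[int]:
    """Number of largest contigs that together cover at least 50% of the target.

    Returns None if all contigs together cover less than half of the target.
    """
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if 2 * covered >= target_length:
            return count
    return None


if __name__ == "__main__":
    print(percent_identity([("ACGT", "ACGA"), ("GG", "GG")]))
    print(contigs_to_cover_half([100, 50, 30, 20], target_length=300))
```

For criterion (c), a smaller value indicates a more contiguous assembly, since fewer contigs are needed to reach half of the target genome.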
Figure 5
Proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) for various numbers of simulated reads.
Figure 6
Comparison of the proposed approach (RECORD) with two state-of-the-art genome assemblers on real data. In this experiment, we compared assemblies resulting from various numbers of experimental reads to the assembly produced by Amos using all the experimental reads; that is, the target genome is the assembly produced by Amos using all the reads. In this case, the reference genome exhibits 99.7 percent identity with the Amos result, which is used as the gold standard. The diagrams follow the same structure as those in Figure 4.
Figure 7
Proportion of ambiguously mapped contigs (before the selection of the best mapping for each contig) in the experiments on publicly available data sets.
