Proc Natl Acad Sci U S A. 2013 Jan 29;110(5):1785-90.
doi: 10.1073/pnas.1220349110. Epub 2013 Jan 10.

Reference-assisted chromosome assembly

Jaebum Kim et al. Proc Natl Acad Sci U S A. 2013.

Abstract

One of the most difficult problems in modern genomics is the assembly of full-length chromosomes using next generation sequencing (NGS) data. To address this problem, we developed "reference-assisted chromosome assembly" (RACA), an algorithm to reliably order and orient sequence scaffolds generated by NGS and assemblers into longer chromosomal fragments using comparative genome information and paired-end reads. Evaluation of results using simulated and real genome assemblies indicates that our approach can substantially improve genomes generated by a wide variety of de novo assemblers if a good reference assembly of a closely related species and outgroup genomes are available. We used RACA to reconstruct 60 Tibetan antelope (Pantholops hodgsonii) chromosome fragments from 1,434 SOAPdenovo sequence scaffolds, of which 16 chromosome fragments were homologous to complete cattle chromosomes. Experimental validation by PCR showed that predictions made by RACA are highly accurate. Our results indicate that RACA will significantly facilitate the study of chromosome evolution and genome rearrangements for the large number of genomes being sequenced by NGS that do not have a genetic or physical map.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Overview of the RACA algorithm. (A) RACA takes a reference, a de novo sequenced target (as scaffolds), and one or more outgroup genomes as input data. (B) Syntenic fragments (SFs) delimited by vertical dashed lines are constructed by first aligning reference and target genome sequences and next merging colinear alignments. The outgroup is not always aligned to SFs (e.g., sf2) and may contain rearrangements compared with one SF (e.g., sf10). Pluses and minuses represent the orientations of the target and outgroup on the reference, and three groups of SFs represent three reference chromosomes. (C) For each pair of SFs, the adjacency scores (edge weights) that combine (i) the posterior probability [PostProb(i,j)] of the adjacency and (ii) the coverage of paired-end reads [Link(i,j)] are calculated. Only a portion of the edge weight matrix is shown on the Left, and this matrix can represent all four adjacency cases: (i, j), (−i, j), (i, −j), and (−i, −j), where i and j are the indexes of two SFs sfi and sfj, respectively. (D) The SF graph is built by connecting SFs whose edge weight in C is higher than a certain threshold (0.1 was used in the case of Tibetan antelope). Head (closed circle) and tail (open circle) vertices from the same SF are always connected with a maximum weight (dashed edge). (E) Constructed chains of SFs that are extracted by the RACA algorithm.
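Steps C to E of this caption can be illustrated with a minimal sketch. It is not the authors' implementation. The caption does not give the formula that combines PostProb(i,j) and Link(i,j), so the weighted average in `combine` and its `alpha` parameter are assumptions, as is the greedy, heaviest-edge-first chain extraction. Only the 0.1 threshold and the signed-index notation for the four adjacency cases come from the caption. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Adjacency = Tuple[int, int]  # signed SF indexes, e.g. (1, -2) means sf1 then reversed sf2
End = Tuple[int, str]        # (SF index, "head" or "tail")


def combine(post_prob: float, link: float, alpha: float = 0.5) -> float:
    """Assumed edge weight: a weighted average of PostProb(i,j) and Link(i,j)."""
    return alpha * post_prob + (1 - alpha) * link


def ends(adj: Adjacency) -> Tuple[End, End]:
    """Map a signed adjacency to the two SF ends it joins.

    A forward SF is read tail -> head, so (i, j) joins head(i) to tail(j).
    """
    i, j = adj
    left = (abs(i), "head" if i > 0 else "tail")
    right = (abs(j), "tail" if j > 0 else "head")
    return left, right


def extract_chains(weights: Dict[Adjacency, float],
                   threshold: float = 0.1) -> List[List[int]]:
    """Greedily join SF ends, heaviest edge first, and return signed chains."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    partner: Dict[End, End] = {}
    for adj, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= threshold:
            break  # all remaining edges are at or below the threshold
        a, b = ends(adj)
        if a in partner or b in partner:
            continue  # each SF end can be used only once
        ra, rb = find(a[0]), find(b[0])
        if ra == rb:
            continue  # joining would close a cycle
        parent[ra] = rb
        partner[a], partner[b] = b, a

    chains: List[List[int]] = []
    seen = set()
    for sf in sorted({abs(x) for adj in weights for x in adj}):
        tail_free = (sf, "tail") not in partner
        head_free = (sf, "head") not in partner
        if sf in seen or not (tail_free or head_free):
            continue  # already visited, or an interior SF of some chain
        cur, sign = sf, (1 if tail_free else -1)
        chain: List[int] = []
        while True:
            chain.append(sign * cur)
            seen.add(cur)
            nxt = partner.get((cur, "head" if sign > 0 else "tail"))
            if nxt is None:
                break
            cur, sign = nxt[0], (1 if nxt[1] == "tail" else -1)
        chains.append(chain)
    return chains


example = {(1, 2): 0.9, (2, -3): 0.6, (1, 3): 0.5, (3, 1): 0.05}
chains = extract_chains(example)
```

In this toy input, the edge (1, 3) is skipped because the head of sf1 is already joined to sf2, and (3, 1) falls below the threshold, so the three SFs form a single chain with sf3 reversed.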
Fig. 2.
Estimation of the accuracy of RACA using simulated genome assemblies. (A) Phylogenetic tree used to generate simulated datasets. Average substitution rates of the generated datasets in comparison with R are D0: 0.059, D1: 0.064, D2: 0.0724, D3: 0.081, D4: 0.090, D5: 0.098, D6: 0.106, D7: 0.114, D8: 0.121, D9: 0.129, D10: 0.136. (B) For all sequence fragments that were generated from the datasets D0–D9, RACA predicted the order and orientation of the sequence fragments by using the dataset R as a reference and more divergent datasets as an outgroup, which were then compared with the true order and orientation. Two evaluation measures were used: (i) recall (x axis), which is the fraction of the true order and orientation of sequence fragments that was found in the predicted sequence fragments, and (ii) precision (y axis), which is the fraction of the predicted order and orientation of sequence fragments that agree with the true order and orientation. For each dataset, the average across five different fragments was displayed, and error bars (horizontal for recall and vertical for precision) represent ±1 SD from the average.
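The two measures in panel B can be made concrete with a short sketch. It assumes that "order and orientation" is scored per adjacency between consecutive fragments, and that an adjacency read from the opposite strand, (−b, −a), counts as the same adjacency as (a, b). Both assumptions and all function names are illustrative, not taken from the paper.

```python
from typing import Iterable, List, Set, Tuple

Adjacency = Tuple[int, int]  # signed fragment indexes


def canonical(a: int, b: int) -> Adjacency:
    """Pick one representative for (a, b) and its reverse-strand form (-b, -a)."""
    return min((a, b), (-b, -a))


def adjacencies(chains: Iterable[List[int]]) -> Set[Adjacency]:
    """Collect the canonical adjacencies between consecutive fragments."""
    return {canonical(a, b) for chain in chains for a, b in zip(chain, chain[1:])}


def precision_recall(true_chains: Iterable[List[int]],
                     pred_chains: Iterable[List[int]]) -> Tuple[float, float]:
    """Precision = shared / predicted; recall = shared / true."""
    truth = adjacencies(true_chains)
    pred = adjacencies(pred_chains)
    shared = len(truth & pred)
    precision = shared / len(pred) if pred else 0.0
    recall = shared / len(truth) if truth else 0.0
    return precision, recall


# One true chromosome, predicted as two chains (the second on the reverse strand).
precision, recall = precision_recall([[1, 2, 3, 4, 5]], [[1, 2, 3], [-5, -4]])
```

Here every predicted adjacency is correct, so precision is 1.0, but the 3-to-4 join is missing, so recall is 3 of 4 true adjacencies.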
Fig. 3.
Evaluating RACA improvement of the GAGE assemblies. RACA improved the original assemblies created by seven genome assemblers in the GAGE datasets. The final RACA assemblies were compared with the original assemblies in terms of N50 and the number of adjacency errors. Heat maps show the log ratio of RACA N50 to the N50 of the original assembly (Upper horizontal block), and the log ratio of RACA adjacency errors to the errors of the original assembly (Lower horizontal block), with orangutan genome as a reference (vertical block on the Left) as well as mouse genome as a reference (vertical block on the Right). Four different resolutions of SF size were used: 100, 50, 10, and 1 kbp; gray blocks represent the results where there were no N50 data due to low coverage at certain resolutions. For the complete dataset, see SI Appendix, Tables S3 and S4.
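The heat-map quantities can be sketched as follows. N50 is computed by its standard definition (the length of the shortest sequence in the smallest set of longest sequences that covers at least half of the total assembly length). The caption does not state the base of the logarithm, so the base-2 choice in `log_ratio` is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Return the N50 of a set of scaffold or chain lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the total length
            return length
    raise AssertionError("unreachable for non-empty input")


def log_ratio(after: float, before: float) -> float:
    """Assumed base-2 log ratio of a RACA value to the original assembly's value."""
    if after <= 0 or before <= 0:
        raise ValueError("log ratio is undefined for non-positive values")
    return math.log2(after / before)


original = [50, 40, 30, 20, 10]   # toy scaffold lengths
merged = [90, 50, 10]             # toy lengths after joining scaffolds
gain = log_ratio(n50(merged), n50(original))
```

A positive value in the N50 block means the joined assembly is more contiguous than the original, and a negative value in the adjacency-error block means fewer errors. A ratio with zero errors on either side needs separate handling, which this sketch leaves out by raising an error.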
