2018 Jan 30;19(1):26.
doi: 10.1186/s12859-018-2026-4.

Inferring synteny between genome assemblies: a systematic evaluation

Dang Liu et al. BMC Bioinformatics.

Abstract

Background: Genome assemblies across all domains of life are being produced routinely. Initial analysis of a new genome usually includes annotation and comparative genomics. Synteny provides a framework in which conservation of homologous genes and gene order is identified between genomes of different species. The availability of human and mouse genomes paved the way for algorithm development in large-scale synteny mapping, which eventually became an integral part of comparative genomics. Synteny analysis is regularly performed on assembled sequences that are fragmented, neglecting the fact that most methods were developed using complete genomes. It is unknown to what extent draft assemblies lead to errors in such analysis.

Results: We fragmented genome assemblies of model nematodes to various extents and conducted synteny identification and downstream analysis. We first show that synteny between species can be underestimated by up to 40% and find disagreements between popular tools that infer synteny blocks. This inconsistency, together with a further demonstration of erroneous gene ontology enrichment tests, raises questions about the robustness of previous synteny analyses while gold standard genome sequences remain limited. In addition, assembly scaffolding using a reference-guided approach with a closely related species may result in chimeric scaffolds with inflated assembly metrics if the true evolutionary relationship is overlooked. Annotation quality, however, has minimal effect on synteny if the assembled genome is highly contiguous.

Conclusions: Our results show that a minimum N50 of 1 Mb is required for robust downstream synteny analysis. This emphasizes the importance of gold standard genomes to the scientific community, and this level of contiguity should be achieved given the current progress in sequencing technology.

Keywords: Assembly quality; Comparative genomics; Genome synteny; Nematode genomes.
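
The 1 Mb N50 threshold in the Conclusions, and the N50 and N97.5 values mentioned in the caption of Fig. 5, can be checked for any assembly from its list of sequence lengths. The following is a minimal sketch using the standard definition of Nx (the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly). The function names `nx` and `meets_n50_threshold` and the example lengths are assumptions made for illustration; they do not come from the article.

```python
def nx(lengths, x=50.0):
    """Return the Nx of an assembly given its sequence lengths in bp.

    Sequences are visited from longest to shortest; Nx is the length of the
    sequence at which the running total first reaches x% of the assembly.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return min(lengths)  # not reached for valid input; kept as a safe fallback


def meets_n50_threshold(lengths, min_n50=1_000_000):
    """True if the assembly N50 is at least min_n50 (1 Mb by default)."""
    return nx(lengths, 50.0) >= min_n50


# Hypothetical scaffold lengths (bp), for illustration only.
example = [4_000_000, 2_000_000, 1_000_000, 500_000, 500_000]
print("N50:", nx(example))
print("N97.5:", nx(example, 97.5))
print("N50 >= 1 Mb:", meets_n50_threshold(example))
```

Because Nx depends only on sequence lengths, it says nothing about whether the scaffolds are correctly joined, which is why the Results warn that reference-guided scaffolding can inflate such metrics.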


Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Definition of synteny block and break. Genes located on chromosomes of two species are denoted in letters. Each gene is associated with a number representing the species it belongs to (species 1 or 2). Orthologous genes are connected by dashed lines, and genes without an orthologous relationship are treated as gaps in synteny programs. Under the criterion of at least three orthologous genes (anchors), a synteny block can consist of orthologs in the same order (block a), in reversed order (block b), or with some gaps allowed (block c). In contrast, a synteny break can be caused by a lack of orthologs (break a), co-arranged gene order (break b) or gaps (break c)
Fig. 2
Synteny blocks identified between un-fragmented and fragmented C. elegans chromosome IV. The original sequence is used as the reference and coloured in black. Established synteny regions (outer number stands for synteny coverage) of the 5 different program packages: DAGchainer (red), i-ADHoRe (yellow), MCScanX (green), SynChro (light blue), and Satsuma (blue) are joined to query sequences with different levels of fragmentation (un-fragmented, 1 Mb and 100 kb fragmented). Chromosome positions are labeled in megabases (Mb). For plots of other chromosomes see Additional file 3: Figure S2 and Additional file 4: Figure S3
Fig. 3
A zoomed-in 600 kb region of synteny identified between the reference C. elegans genome and a 100 kb fragmented assembly. Synteny blocks in fragmented assembly defined by the five detection programs DAGchainer (red), i-ADHoRe (yellow), MCScanX (green), SynChro (light blue), and Satsuma (blue) are drawn as rectangles. Fragmented sites are labeled by vertical red dashed lines. Genes are shown as burgundy rectangles, with gene starts marked using dark blue lines. Two scenarios are marked: a) a synteny block was not identified by MCScanX, and b) several synteny blocks only detected by SynChro
Fig. 4
Error rate (%) of synteny identification in fragmented assemblies. The error rate is defined as the difference between the synteny coverage calculated with the original genome (almost 100%) and that in fragmented assemblies, where the original assembly was used as the reference in both cases. 5% and 2% error rate positions are marked by grey solid and dashed lines, respectively. Different pairs in synteny identification are separated in different panels. The upper panels are self-comparisons, while the bottom are comparisons between closely related species. Note that for a clear visualization of pattern changes, the scales of error rate are different between upper and bottom panels. Colors represent different types of synteny detection programs. The letters a, b, c and d denote the comparisons of C. elegans vs. C. elegans, S. ratti vs. S. ratti, C. elegans vs. C. briggsae, and S. ratti vs. S. stercoralis respectively
Fig. 5
Relationship between error rate (%) in synteny identification and distribution of sequence length in assemblies. Different colors denote multiple sources of assembly. Panel a shows error rates (%) in synteny identification when assemblies are compared against the C. elegans reference genome. Panel b shows the distributions of sequence length of assemblies with an N50 of around 1 Mb. Dashed and dotted lines specify the N50 and N97.5 respectively
Fig. 6
Comparison of gene ontology (GO) enriched terms in C. briggsae synteny breaks between C. elegans vs. C. briggsae and 100 replicates of C. elegans vs. 100 kb fragmented C. briggsae. Top-ranked GO terms in the original comparison are shown on the Y axis. Of the original top-ranking GO terms, only those that appeared more than 10 times in the top 10 ranks of the after-fragmentation comparison replicates are displayed (see Additional file 7: Table S2 for more details). The X axis shows the top 10 ranks and the rank “out of top 10” in the comparison when assemblies were fragmented. The darkness of color is proportional to the occurrence of the GO term at that rank within the 100 replicates. Regions in red indicate ranking errors. All GO categories have adjusted p-value < 0.01
Fig. 7
Synteny coverage (%) between C. elegans and S. ratti assemblies against original or ALLMAPS scaffolded assemblies from 100 kb fragmented assemblies of C. briggsae and S. stercoralis
Fig. 8
Synteny linkage of C. elegans vs. original C. briggsae assembly and C. elegans vs. ALLMAPS C. briggsae assembly. ALLMAPS assembly with L90 = 1063 from 100 kb fragmented C. briggsae assembly with L90 = 6 (top), original C. elegans assembly with L90 = 6 (middle) and original C. briggsae assembly with L90 = 6 (bottom) are shown in different horizontal lines. Vertical lines on chromosome lines show the start/end positions of the first/last gene in a synteny block. Each panel shows a separate chromosome. Block linkages in the same orientation are labeled in red, while those in inverted orientation are labeled in blue
Fig. 9
Pseudocode of genome assembly fragmentation
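
The block definition in the caption of Fig. 1 (at least three orthologous anchors, in the same or reversed order, with some gaps allowed) and the error rate in the caption of Fig. 4 (synteny coverage with the original genome minus coverage with the fragmented assembly) can be illustrated with a small sketch. This is not the pseudocode of Fig. 9, which is not reproduced here, and it is not the algorithm of DAGchainer, i-ADHoRe, MCScanX, SynChro or Satsuma. It is a simplified greedy chaining written for illustration. The names `synteny_blocks`, `coverage_percent` and `error_rate`, the `max_gap` parameter, the gene-count definition of coverage, and the toy anchors are all assumptions.

```python
def synteny_blocks(anchors, min_anchors=3, max_gap=2):
    """Group ortholog anchors into collinear runs (greedy, illustrative only).

    anchors: (i, j) pairs giving the gene-order index of an ortholog pair in
             species 1 and species 2 on one pair of sequences.
    max_gap: hypothetical limit on the number of genes skipped between
             neighbouring anchors in either species.
    """
    blocks, run, direction = [], [], 0
    for a in sorted(anchors):
        if run:
            di = a[0] - run[-1][0]
            dj = a[1] - run[-1][1]
            step = (dj > 0) - (dj < 0)  # +1 same order, -1 reversed order
            close = di > 0 and step != 0 and di - 1 <= max_gap and abs(dj) - 1 <= max_gap
            if close and direction in (0, step):
                run.append(a)
                direction = step
                continue
            if len(run) >= min_anchors:  # close the current run as a block
                blocks.append(run)
        run, direction = [a], 0
    if len(run) >= min_anchors:
        blocks.append(run)
    return blocks


def coverage_percent(blocks, n_genes):
    """Percent of species 1 genes spanned by blocks (an assumed definition)."""
    covered = sum(b[-1][0] - b[0][0] + 1 for b in blocks)
    return 100.0 * covered / n_genes


def error_rate(cov_original, cov_fragmented):
    """Error rate as the difference in synteny coverage (percentage points)."""
    return cov_original - cov_fragmented


# Toy anchors: a same-order block, a reversed block, a gapped block, two strays.
anchors = [(0, 0), (1, 1), (2, 2), (5, 12), (6, 11), (7, 10),
           (10, 20), (12, 22), (13, 23), (20, 40), (21, 50)]
full = synteny_blocks(anchors)

# Simulate a break in the species 2 sequence between positions 22 and 23:
# anchors on different fragments can no longer be chained together.
left = [a for a in anchors if a[1] < 23]
right = [a for a in anchors if a[1] >= 23]
fragmented = synteny_blocks(left) + synteny_blocks(right)

n_genes = 22  # hypothetical number of genes on the species 1 sequence
err = error_rate(coverage_percent(full, n_genes), coverage_percent(fragmented, n_genes))
print(len(full), "blocks before,", len(fragmented), "blocks after fragmentation")
print("error rate: %.2f percentage points" % err)
```

In this toy case the break splits the gapped block into pieces with fewer than three anchors each, so the block is lost and coverage drops, which mirrors in miniature how fragmentation can lead to underestimated synteny.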
