BMC Genomics. 2018 Jan 16;19(1):54.
doi: 10.1186/s12864-017-4429-4.

Genomic repeats, misassembly and reannotation: a case study with long-read resequencing of Porphyromonas gingivalis reference strains

Luis Acuña-Amador et al. BMC Genomics.

Abstract

Background: Without knowledge of their genomic sequences, it is impossible to make functional models of the bacteria that make up human and animal microbiota. Unfortunately, the vast majority of publicly available genomes are only working drafts, an incompleteness that causes numerous problems and constitutes a major obstacle to genotypic and phenotypic interpretation. In this work, we began with an example from the class Bacteroidia in the phylum Bacteroidetes, which is preponderant among human orodigestive microbiota. We successfully identified the genetic loci responsible for assembly breaks and misassemblies and demonstrated the importance and usefulness of long-read sequencing and curated reannotation.

Results: We showed that the fragmentation in Bacteroidia draft genomes assembled from massively parallel sequencing linearly correlates with genomic repeats of the same or greater size than the reads. We also demonstrated that some of these repeats, especially the long ones, correspond to misassembled loci in three reference Porphyromonas gingivalis genomes marked as circularized (thus complete or finished). We proved that even at modest coverage (30X), long-read resequencing together with PCR contiguity verification (rrn operons and an integrative and conjugative element or ICE) can be used to identify and correct the wrongly combined or assembled regions. Finally, although time-consuming and labor-intensive, consistent manual biocuration of three P. gingivalis strains allowed us to compare and correct the existing genomic annotations, resulting in a more accurate interpretation of the genomic differences among these strains.

Conclusions: In this study, we demonstrate the usefulness and importance of long-read sequencing in verifying published genomes (even when complete) and generating assemblies for new bacterial strains/species with high genomic plasticity. We also show that when combined with biological validation processes and diligent biocurated annotation, this strategy helps reduce the propagation of errors in shared databases, thus limiting false conclusions based on incomplete or misleading information.

Keywords: Bacteroidetes; Porphyromonas gingivalis; annotation; biocuration; comparative genomics; genomic repeats; long-read sequencing; misassembly.
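
The Results report a linear correlation between draft-genome fragmentation and the number of genomic repeats at least as long as the reads. The following is a minimal sketch of how such a relationship could be quantified with a Pearson correlation coefficient. The function name `pearson_r` and its inputs are illustrative assumptions; the abstract does not state which statistic or software the authors used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Hypothetical helper: xs could hold repeat counts per genome and ys the
    matching contig counts. This is an assumed analysis, not the authors'
    documented method.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 would be consistent with contig counts rising linearly with repeat counts; the actual values must come from the assemblies themselves.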

Conflict of interest statement

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
Relatedness of complete Bacteroidia genomes for species having at least two different strains. a. Dendrogram of the inter-species relatedness calculated with the OrthoANI algorithm, clustered using UPGMA, and shown with the corresponding pairwise identity heatmap. b. Dendrogram of the intra-species relatedness, shown with the corresponding pairwise identity heatmap
Fig. 2
Genomic distribution of repeats (at least 3 copies) in each genome studied. Circos representations of each strain’s chromosome, with oriC positioned at the first nucleotide of the dnaA gene. For each repeat, its first occurrence in the genome is the starting point of each line that links it to all of the other positions. As the copy number increases, the line colours range from light blue to red. The total number of repeats can be visualized as the number of intersections of the circular chromosome. Strains of the same species are grouped together and arranged in ascending order of repeat counts
Fig. 3
Genomic repeats in Porphyromonas gingivalis (P. g.) strains. From left to right, strain relatedness, genomic repeat distribution, and number of copies. The dendrogram shows intra-species relatedness calculated with OrthoANI and clustered with UPGMA. The circular chromosome of each strain is presented using Circos, with oriC positioned at the top. For each repeat (at least 3 copies), its first occurrence in the genome is the starting point of the lines that link it to all other positions. As the copy number increases, the lines go from light blue to red. On the right, the number of repeats by copy number. Since all repeats have at least 2 copies, the total number of repeats corresponds to the light blue bar
Fig. 4
A de novo genome assembly of Porphyromonas gingivalis artificial reads. a. Eleven programs were used for de novo assembly of the seven strains under study. The main cumulative lengths were calculated, and plotted here against the contig index. b. The three assemblers that produced the highest N50 were plotted in the same manner as in a. (upper panel), then the assembly was mapped to the reference and only the mapped contigs were plotted (lower panel). c. The number of contigs (A5-miseq and SPAdes) was plotted against the number of repeats (with at least 3 copies). d. Identification of gaps: after assembly with A5-miseq or SPAdes, genomic regions not covered by contigs were extracted. The gaps were classified into five categories: genomic islands, ribosomal RNA (rrn) operons, coding sequences (CDS) with repeated domains, intergenic sequences, and insertion sequences or miniature inverted-repeat transposable elements (IS/MITEs)
Fig. 5
Functional comparison of the common coding sequences in two Porphyromonas gingivalis annotations. Comparison of a. an annotation available at the NCBI, and b. this study’s manually biocurated annotation. For both, the common CDS were classified into five categories. Both pie charts reflect mean values
Fig. 6
Pangenome overview of ATCC 33277, TDC60, and W83 strains, focusing on accessory and unique genomes. The central triangle represents the core genome, which has at least 1522 genes (see text for details). Each corner is a Porphyromonas gingivalis (P. g.) strain, with a pie chart showing the unique genome’s distribution of functions, with total and absolute counts shown. On each triangle side, stacked histograms show the accessory genome of the strains in the adjacent vertices. Total and absolute counts are shown, and the differences between strain numbers are due to paralogy
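
The Fig. 4 caption refers to cumulative contig lengths plotted against contig index and to ranking assemblers by N50. As a minimal sketch of those two assembly metrics, assuming the usual convention of sorting contigs from longest to shortest, the hypothetical helpers below take a plain list of contig lengths in base pairs. They illustrate the standard definitions and are not the authors' own scripts.

```python
from itertools import accumulate


def cumulative_lengths(contig_lengths):
    """Running total of contig lengths, longest contig first.

    The i-th value is the number of bases covered by the i+1 longest
    contigs, i.e. the curve of cumulative length against contig index.
    """
    ordered = sorted(contig_lengths, reverse=True)
    return list(accumulate(ordered))


def n50(contig_lengths):
    """Length of the contig at which the running total first reaches
    half of the total assembly size (standard N50 definition)."""
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("no contigs given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
```

For a toy input of `[100, 400, 200, 300]`, the cumulative curve is `[400, 700, 900, 1000]` and the N50 is 300, because the two longest contigs together are the first to cover at least half of the 1000 bases. A less fragmented assembly gives a steeper curve and a higher N50.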
