GC bias affects genomic and metagenomic reconstructions, underrepresenting GC-poor organisms

Patrick Denis Browne¹,², Tue Kjærgaard Nielsen¹,², Witold Kot¹,², Anni Aggerholm³, M Thomas P Gilbert⁴, Lara Puetz⁴, Morten Rasmussen⁵, Athanasios Zervas², Lars Hestbjerg Hansen¹,²

Affiliations

¹ Department of Plant and Environmental Sciences, University of Copenhagen, Thorvaldsensvej 40, Frederiksberg C, 1871, Denmark.
² Department of Environmental Science, Aarhus University, Frederiksborgvej 399, Roskilde, 4000, Denmark.
³ Department of Hematology, Aarhus University Hospital, Palle Juul-Jensens Boulevard 99, Aarhus N, 8200, Denmark.
⁴ The GLOBE Institute, Faculty of Health and Biomedical Sciences, University of Copenhagen, Blegdamsvej 3B, Copenhagen N, 2200, Denmark.
⁵ Department of Genetics, School of Medicine, Stanford University, 291 Campus Drive, Stanford, CA 94305-5051, USA.

PMID: 32052832
PMCID: PMC7016772
DOI: 10.1093/gigascience/giaa008

Patrick Denis Browne et al. Gigascience. 2020 Feb 1;9(2):giaa008. doi: 10.1093/gigascience/giaa008.

Abstract

Background: Metagenomic sequencing is a well-established tool in the modern biosciences. While it promises unparalleled insights into the genetic content of the biological samples studied, conclusions drawn are at risk from biases inherent to the DNA sequencing methods, including inaccurate abundance estimates as a function of genomic guanine-cytosine (GC) contents.

Results: We explored such GC biases across many commonly used platforms in experiments sequencing multiple genomes (with mean GC contents ranging from 28.9% to 62.4%) and metagenomes. GC bias profiles varied among library preparation protocols and sequencing platforms. Our workflows using MiSeq and NextSeq were hindered by major GC biases, with problems becoming increasingly severe outside the 45–65% GC range. This led to falsely low coverage in GC-rich and especially GC-poor sequences: genomic windows with 30% GC content had >10-fold less coverage than windows close to 50% GC content. We also showed that GC content correlates tightly with coverage biases. The PacBio and HiSeq platforms showed GC bias profiles similar to each other but distinct from those seen in the MiSeq and NextSeq workflows. The Oxford Nanopore workflow was not afflicted by GC bias.

Conclusions: These findings indicate potential sources of difficulty, arising from GC biases, in genome sequencing that could be pre-emptively addressed with methodological optimizations provided that the GC biases inherent to the relevant workflow are understood. Furthermore, it is recommended that a more critical approach be taken in quantitative abundance estimates in metagenomic studies. In the future, metagenomic studies should take steps to account for the effects of GC bias before drawing conclusions, or they should use a demonstrably unbiased workflow.

Keywords: GC bias; Illumina; Oxford Nanopore; PacBio; high-throughput sequencing; metagenomics.

Figures

**Figure 1:**
Coverage biases in the sequencing of *Fusobacterium* sp. C1. The circle plot shows from the inside: GC content (Ring 1); positions of CDSs, rRNAs, and tRNAs (Ring 2); positions of the PCR targets for ddPCR and the 5.3-kb PCR products (Ring 3); and coverages of Nanopore, MiSeq, NextSeq, HiSeq, and PacBio reads (Rings 4–8, respectively). The circles are numbered from the inside. The GC content plot is centred on the median GC content, with GC contents greater than the median extending outwards. The coverage data are plotted in 50-nt windows, with separate linear scales for each dataset.

**Figure 2:**
Coverage biases in MiSeq datasets from many bacteria with different GC contents. Dot plots show local GC content and normalized relative coverages in 500-nt windows (see Methods for explanation) of MiSeq data from a variety of bacteria with different average GC contents. Error bars indicate ±1 standard deviation of normalized coverage. The intensity of the blue in the dots is a log-transformed heat map of the number of 500-nt windows averaged into that datapoint. The datapoint with the most windows in each plot has maximum blue. The vertical green line marks the average GC content of each assembly. The average normalized coverage value is indicated with a horizontal dashed red line.

**Figure 3:**
GC biases in NextSeq, PacBio, Nanopore, and HiSeq data. The dot plots are as described in Fig. 2.
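
The dot plots in Figs 2 and 3 relate local GC content to normalized coverage in fixed-size windows. The following is a minimal sketch of how such a table could be computed, not the authors' pipeline (their exact normalization is described in the Methods, which are not reproduced here). It assumes a per-base depth list is already available (for example, parsed from a depth-reporting tool), and it assumes that "normalized" means dividing each window's mean depth by the assembly-wide mean depth. The function names `gc_fraction`, `windowed_gc_coverage`, and `summarize_by_gc` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases; None if the window has none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc_coverage(sequence, depth, window=500):
    """Return (GC percent, normalized coverage) for each non-overlapping window.

    Assumption: coverage is normalized by the mean depth over the whole
    sequence, so a value of 1.0 means "average coverage".
    """
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    overall = mean(depth)
    if overall == 0:
        raise ValueError("mean depth is zero; nothing to normalize")
    rows = []
    # Trailing bases that do not fill a whole window are ignored.
    for start in range(0, len(sequence) - window + 1, window):
        gc = gc_fraction(sequence[start:start + window])
        if gc is None:
            continue  # window made up entirely of ambiguous bases
        cov = mean(depth[start:start + window])
        rows.append((round(gc * 100), cov / overall))
    return rows


def summarize_by_gc(rows):
    """Group windows by GC percent: (window count, mean, standard deviation)."""
    bins = defaultdict(list)
    for gc, cov in rows:
        bins[gc].append(cov)
    return {
        gc: (len(v), mean(v), stdev(v) if len(v) > 1 else 0.0)
        for gc, v in sorted(bins.items())
    }


if __name__ == "__main__":
    # Toy input: an AT-only stretch at low depth, then a GC-only stretch at higher depth.
    toy_seq = "AT" * 5 + "GC" * 5
    toy_depth = [2] * 10 + [6] * 10
    print(summarize_by_gc(windowed_gc_coverage(toy_seq, toy_depth, window=10)))
```

In this sketch the window count per GC bin plays the role of the colour intensity in the dot plots, and the standard deviation corresponds to the error bars. A flat profile near 1.0 across GC bins would indicate little GC bias under these assumptions.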
