Review. Nat Rev Genet. 2014 Jan;15(1):56-62.
doi: 10.1038/nrg3655. Epub 2013 Dec 10.

The role of replicates for error mitigation in next-generation sequencing

Kimberly Robasky et al. Nat Rev Genet. 2014 Jan.

Abstract

Advances in next-generation sequencing (NGS) technologies have rapidly improved sequencing fidelity and substantially decreased sequencing error rates. However, given that there are billions of nucleotides in a human genome, even low experimental error rates yield many errors in variant calls. Erroneous variants can mimic true somatic and rare variants, thus requiring costly confirmatory experiments to minimize the number of false positives. Here, we discuss sources of experimental errors in NGS and how replicates can be used to abate such errors.
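
The scale argument in the abstract can be made concrete with a back-of-the-envelope sketch. The genome size (rounded to 3 billion bases) and the per-base error rates below are illustrative assumptions, not figures from the review, and the replicate calculation assumes errors are independent between replicates, which systematic errors would violate.

```python
def expected_false_calls(genome_size: int, error_rate: float) -> float:
    """Expected number of erroneous calls if every position errs independently."""
    return genome_size * error_rate


def expected_shared_false_calls(genome_size: int, error_rate: float,
                                n_replicates: int) -> float:
    """Expected number of positions that are erroneous in every replicate.

    Crude model: assumes errors are independent across replicates.
    """
    return genome_size * error_rate ** n_replicates


GENOME_SIZE = 3_000_000_000  # assumed round figure for a human genome

for rate in (1e-4, 1e-5, 1e-6):  # hypothetical per-base error rates
    single = expected_false_calls(GENOME_SIZE, rate)
    shared = expected_shared_false_calls(GENOME_SIZE, rate, 2)
    print(f"rate={rate:.0e}  one run: {single:,.0f}  shared by two replicates: {shared:,.3f}")
```

Under these assumptions, even a one-in-a-million error rate leaves thousands of erroneous positions in a single run, while requiring agreement between two independent replicates reduces the expected number of shared random errors to well below one.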

Figures

Figure 1. Sources of unexpected and erroneous variation and established post-processing tools used to cope with unexpected variants
Sequencing experiments involve many steps from sample acquisition to final data analysis, and a major challenge in the process stems from the emergence of unexpected variants. a. These can include legitimate somatic mosaicism and rare oncogenic variants. Additionally, many erroneous sequence variants arise during experimental steps (e.g., via sample degradation, PCR amplification, base-calling error). b. Several analytical tools and post-processing mechanisms are often employed for separating true variation from false sequence variants. These include indicators of data quality (e.g., base call and mapping quality scores) and filters that are informed by those indicators. Additional tertiary analyses can also highlight systematic biases through clustering methods, and can flag possible false positive variants by accounting for Mendelian inheritance patterns. Throughout the sequencing and post-processing pipeline, the use of replicated sequencing experiments can help mitigate the impact of erroneous variants from the experimental steps and inform post-processing filters. Thus, greater accuracy of germline variant detection can be attained and improved sensitivity can be achieved for true somatic variation.
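
As an illustration of the Mendelian inheritance check mentioned in the caption above, the following is a minimal sketch for a single autosomal site in a mother-father-child trio. The genotype representation (a pair of allele strings) and the function name are hypothetical, and the check ignores de novo mutations, sex chromosomes and missing calls.

```python
def is_mendelian_consistent(child, mother, father):
    """Return True if the child's genotype can be formed by taking
    one allele from the mother and one allele from the father."""
    first, second = child
    return (first in mother and second in father) or \
           (second in mother and first in father)


# Hypothetical genotypes at one site, written as pairs of alleles.
print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))  # consistent
print(is_mendelian_consistent(("G", "G"), ("A", "A"), ("A", "G")))  # inconsistent
```

A call that fails this check is either a genuine de novo event or, more often, an error in one of the three genotypes, so it is a candidate false positive rather than a confirmed one.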
Figure 2. Platform-independent method for choosing quality score thresholds using replicate sequencing data
Variants are called for all replicates and then classified as concordant if the variant calls agree among the replicates or discordant if they differ. Variants are then rank-ordered by the desired metric (e.g., quality scores) and plotted in a manner similar to receiver operating characteristic (ROC) curves. That is, the cumulative distributions of concordant and discordant variants are plotted left to right as the stringency of the confidence score of interest decreases.
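
The procedure described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input layout (one dictionary per replicate mapping a site to a genotype and score), the choice of the minimum score across replicates as the ranking metric, and the treatment of a site missing from a replicate as discordant are all assumptions.

```python
def classify(replicates):
    """Label each site as concordant or discordant across replicates.

    replicates: list of dicts mapping site -> (genotype, score).
    Returns a list of (score, is_concordant) pairs, where score is the
    lowest score observed for that site (an assumed convention).
    """
    sites = set().union(*replicates)
    calls = []
    for site in sites:
        observed = [rep.get(site) for rep in replicates]
        genotypes = {obs[0] if obs else None for obs in observed}
        concordant = len(genotypes) == 1 and None not in genotypes
        score = min(obs[1] for obs in observed if obs)
        calls.append((score, concordant))
    return calls


def concordance_curve(calls):
    """Cumulative fractions of discordant and concordant calls retained
    as the score threshold is relaxed from most to least stringent.

    Returns a list of (threshold, fraction_discordant, fraction_concordant).
    """
    n_conc = sum(1 for _, is_conc in calls if is_conc)
    n_disc = len(calls) - n_conc
    if n_conc == 0 or n_disc == 0:
        raise ValueError("need at least one concordant and one discordant call")
    ordered = sorted(calls, key=lambda call: call[0], reverse=True)
    curve, conc, disc = [], 0, 0
    for i, (score, is_conc) in enumerate(ordered):
        if is_conc:
            conc += 1
        else:
            disc += 1
        # Emit one point per distinct score, after all calls sharing it.
        if i + 1 == len(ordered) or ordered[i + 1][0] != score:
            curve.append((score, disc / n_disc, conc / n_conc))
    return curve


# Hypothetical calls from two replicates of the same sample.
rep1 = {"chr1:100": ("A/G", 90), "chr1:200": ("C/T", 40), "chr1:300": ("G/G", 70)}
rep2 = {"chr1:100": ("A/G", 80), "chr1:200": ("C/C", 30), "chr1:300": ("G/G", 60)}
curve = concordance_curve(classify([rep1, rep2]))
for threshold, frac_disc, frac_conc in curve:
    print(threshold, frac_disc, frac_conc)
```

Plotting the second column on the x-axis against the third on the y-axis gives the ROC-like curve; because only agreement between replicates is used, no platform-specific truth set is required.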
Figure 3. Plotting replicate scores to assess filter efficiency
The efficiency of different variant call filter metrics can be evaluated by plotting replicate-based SNV concordance and discordance in a manner similar to a ROC curve. As one travels from left to right on the plot, the rank-ordered quality score is reduced in stringency and the fractions of retained concordant and discordant variants increase. Thus, this curve quantifies the proportion of good data (concordant SNVs) retained and bad data (discordant SNVs) discarded as a consequence of variable quality score cut-offs. For the genomes used in our analysis, this graph indicates that filtering variants solely based on locus read depth is inferior to filtering by either genomic or expression quality scores. Furthermore, filtering by expression data quality scores is inferior to filtering by genomic quality scores (genomic quality scores from Complete Genomics Inc.), but both are nevertheless better than filtering loci by read depth. The read depth curve that excludes outliers (read depth higher than the 99.5th percentile) outperforms the all-inclusive read depth curve. As an example of how to understand the value of a threshold, note that choosing a threshold score of 120 as a measure for highest quality for the genomic data will include the same fraction of total predicted errors as choosing a threshold quality score of 23800 for the expression data. Meanwhile, when a similar threshold is chosen for read depth, the efficiency at retaining true variants is worse than random.
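
One way to compare filter metrics numerically, and to pick thresholds that admit the same fraction of discordant calls, is sketched below. The area-under-the-curve summary is a substitute chosen here for illustration (the caption compares curves visually), and the curve points and threshold values are invented, not data from the review. Each curve is a list of (threshold, fraction of discordant calls retained, fraction of concordant calls retained), ordered from most to least stringent.

```python
def area_under_curve(curve):
    """Trapezoidal area under the concordant-vs-discordant curve.

    A value of 0.5 corresponds to a filter no better than random;
    values below 0.5 are worse than random.
    """
    xs = [0.0] + [point[1] for point in curve]
    ys = [0.0] + [point[2] for point in curve]
    return sum((x1 - x0) * (y0 + y1) / 2
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def loosest_threshold_within(curve, max_discordant_fraction):
    """Least stringent threshold whose retained discordant fraction
    stays at or below the given budget, or None if no threshold does."""
    chosen = None
    for threshold, frac_disc, _ in curve:
        if frac_disc <= max_discordant_fraction:
            chosen = threshold
        else:
            break
    return chosen


# Hypothetical curves for two filter metrics.
quality_curve = [(120, 0.1, 0.6), (80, 0.4, 0.9), (20, 1.0, 1.0)]
depth_curve = [(60, 0.5, 0.3), (30, 0.8, 0.6), (5, 1.0, 1.0)]

print(area_under_curve(quality_curve), area_under_curve(depth_curve))
print(loosest_threshold_within(quality_curve, 0.1),
      loosest_threshold_within(depth_curve, 0.1))
```

Applying the same discordant-fraction budget to each metric yields thresholds that are comparable across metrics even when their raw scales differ, which is the sense in which two very different score values can admit the same fraction of predicted errors.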
