Nat Biotechnol. 2021 Sep;39(9):1129-1140.
doi: 10.1038/s41587-021-01049-5. Epub 2021 Sep 9.

Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study

Jonathan Foox et al. Nat Biotechnol. 2021 Sep.

Abstract

Assessing the reproducibility, accuracy and utility of massively parallel DNA sequencing platforms remains an ongoing challenge. Here the Association of Biomolecular Resource Facilities (ABRF) Next-Generation Sequencing Study benchmarks the performance of a set of sequencing instruments (HiSeq/NovaSeq/paired-end 2 × 250-bp chemistry, Ion S5/Proton, PacBio circular consensus sequencing (CCS), Oxford Nanopore Technologies PromethION/MinION, BGISEQ-500/MGISEQ-2000 and GS111) on human and bacterial reference DNA samples. Among short-read instruments, HiSeq 4000 and X10 provided the most consistent, highest genome coverage, while BGI/MGISEQ provided the lowest sequencing error rates. The long-read instrument PacBio CCS had the highest reference-based mapping rate and lowest non-mapping rate. The two long-read platforms PacBio CCS and PromethION/MinION showed the best sequence mapping in repeat-rich areas and across homopolymers. NovaSeq 6000 using 2 × 250-bp read chemistry was the most robust instrument for capturing known insertion/deletion events. This study serves as a benchmark for current genomics technologies, as well as a resource to inform experimental design and next-generation sequencing variant calling.

Conflict of interest statement

Competing interests

G.P.S. is employed by Illumina Inc. X.Z., W.Z., F.T., Y.Z. and H.L. are employees of MGI Inc. All other authors declare no competing interests.

Figures

Extended Data Fig. 1 | Quality Control and Decoy Capture.
(a) The insert size distribution of every replicate, stratified by sequencing instrument. (b) The percentage of total reads that mapped to decoy contigs within the GRCh38 reference genome.
Extended Data Fig. 2 | Normalized Genomic Coverage.
Heatmap showing the distribution of read counts per library (rows) by GC content (columns) across human whole-genome and exome samples. Read count values are normalized by total reads per replicate, such that a value of 1 matches the maximum value for a given replicate. Annotation tracks on the right indicate the sequencing platform and cell line genome for that replicate.
Extended Data Fig. 3 | All-versus-all Genomic Coverage Comparison.
Comparisons for every platform within each UCSC RepeatMasker region. Blue bars indicate >50% of shared sites are better represented in the given platform (column) versus all other platforms (rows). Red bars indicate that the other platform out-covered the given platform.
Extended Data Fig. 4 | Variant Detection by Context.
Precision and sensitivity scores as derived from rtg vcfeval analysis, stratified by regions in (a) the CLINVAR database and (b) the OMIM database. For each of the cell lines, genes from each database were overlapped with high confidence regions for variant calling. (c) Scores stratified by regions in the exome, as defined by the AmpliSeq target capture regions file. For each of the cell lines, exomic regions were overlapped with high confidence regions for variant calling.
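The precision and sensitivity scores mentioned in this caption follow the standard definitions over true-positive, false-positive and false-negative variant counts. The sketch below is a minimal illustration only: the function name and the example counts are hypothetical, and it does not reproduce how rtg vcfeval matches or weights variants.

```python
def precision_and_sensitivity(true_pos, false_pos, false_neg):
    """Return (precision, sensitivity) from variant-call counts.

    precision   = TP / (TP + FP): fraction of called variants that are correct
    sensitivity = TP / (TP + FN): fraction of truth-set variants that were called
    """
    called = true_pos + false_pos
    truth = true_pos + false_neg
    # Guard against empty call sets or empty truth sets.
    precision = true_pos / called if called else 0.0
    sensitivity = true_pos / truth if truth else 0.0
    return precision, sensitivity


# Hypothetical counts, for illustration only (not taken from the study).
precision, sensitivity = precision_and_sensitivity(true_pos=90, false_pos=10, false_neg=30)
print(precision, sensitivity)
```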
Extended Data Fig. 5 | Genomic Variant Heatmap.
Heatmap of genotype (GT) of variant alleles on chromosome 1, across all human replicates, across and within sequencing platforms, as measured against the Genome in a Bottle high confidence variant call sets for each genome. Heterozygous variant alleles are shaded in orange (0.5), homozygous variants in red (1), missing data in blue (0), and inapplicable sites (sites outside of the GIAB high confidence region in one cell line but present in another) in gray. Hierarchical clustering reveals strong grouping by cell line, followed by less clear grouping within platforms and inter- and intra-lab replicates.
Extended Data Fig. 6 | Mendelian Violation Detection Per Context.
UpSet intersections of Mendelian violations. Each plot is stratified by variant type (SNPs on top, followed by INDELs; INS_5 = insertions 0–5 bp in size, INS_6to15 = insertions 6 to 15 bp in size, INS_15 = insertions >15 bp in size; same for deletions, ‘DEL’). Events were recorded within high confidence regions for the Ashkenazi Son (HG002).
Extended Data Fig. 7 | Structural Variants per Instrument.
Comparison between the identified SVs in the six replicates from long-read sequencing instruments, showing agreement of 6,980 SVs between samples (green column).
Extended Data Fig. 8 | Structural Variant Metrics.
Coverage, insert size, and read length mean and standard deviation across total SVs in sequencing runs.
Extended Data Fig. 9 | SV Agreement between Callers and Instruments.
(a) Insights into SV variability by caller. First, the strategy used to examine SV caller variability after stratifying for platform, replicate and center variability; next, the SV call set sizes and overlap with the GIAB SV call set for the SV caller variability set of HG002; finally, the types and sizes of SVs in the SV caller variability set of HG002 (translocations are set to size 50 by default in the SURVIVOR parameters for visualization purposes). (b) Insights into SV variability by platform. Diagrams utilize sequencing runs from HiSeqX10, HiSeq2000 and HiSeq4000, while the final two characterize all platforms available. First, the strategy used to examine platform variability after stratifying for SV caller, center and replicate variability; next, SV call set sizes and overlaps with the GIAB SV call set for the platform variability SV call set of HG002; next, types and sizes of SVs in the platform variability SV call set of HG002. The final two panels include HiSeqX10, HiSeq2000, HiSeq4000, NovaSeq, BGI and MGI for visualization purposes. The NovaSeq, BGI and MGI SV call sets were not integrated into the analysis strategy because sequencing runs with replicates for each sample at different centers on different platforms were not available. On top, SV call set sizes and overlap with the GIAB SV call set for the platform variability SV call set of HG002. Below, types and sizes of SVs in the platform variability SV call set of HG002. (Translocations are set to size 50 by default in the SURVIVOR parameters for visualization purposes.)
Extended Data Fig. 10 | Metagenomic Bacterial Sequencing Distribution.
(a) Heatmap showing the distribution of read counts per library (rows) by GC content (columns) across bacterial genomes and the metagenomic mixture. Read count values are normalized by total reads per replicate, such that a value of 1 matches the maximum value for a given replicate. Annotation tracks on the right indicate the sequencing platform and cell line genome for that replicate. (b) Calculations of entropy per genome/metagenomic mixture. Entropy was measured across all GC windows for all replicates for a given sample, rowSums(-(p * log(p))).
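The entropy expression in panel (b), rowSums(-(p * log(p))), is the Shannon entropy summed across GC windows. A minimal Python sketch follows. It assumes that p is the fraction of a replicate's reads falling in each GC window, which the caption does not state explicitly, and that the logarithm is natural. The function name and input counts are hypothetical.

```python
import math


def gc_window_entropy(read_counts):
    """Shannon entropy (natural log) of read counts across GC windows.

    Assumption: p for each window is that window's share of the total reads.
    """
    total = sum(read_counts)
    if total == 0:
        raise ValueError("no reads in any GC window")
    probs = [count / total for count in read_counts]
    # Windows with zero reads contribute nothing, since p * log(p) -> 0 as p -> 0.
    return sum(-p * math.log(p) for p in probs if p > 0)


# Hypothetical read counts per GC window for two replicates.
print(gc_window_entropy([5, 5, 5, 5]))   # reads spread evenly across windows
print(gc_window_entropy([20, 0, 0, 0]))  # all reads in a single window
```

Under these assumptions, an even spread of reads across windows gives the maximum entropy (the log of the number of windows), while reads concentrated in one window give zero.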
Fig. 1 | Experimental design and mapping results.
a, Three standard human genomic DNA samples from NIST RM 8392 were used to prepare libraries, including TruSeq PCR-Free whole-genome libraries and AmpliSeq exome libraries, for sequencing on an array of platforms. Three bacterial species (E. coli, S. epidermidis and P. fluorescens) and one metagenomic mixture of ten bacterial species (metagenomic pool) were also sequenced. b, Mean depth of coverage of each replicate, colored by platform and stratified by sample type. Depth is calculated by dividing total bases sequenced by the size of the respective genome. c, Mapping rate for every replicate for each instrument, including uniquely mapped reads, reads that mapped to multiple places in the genome, reads marked as duplicates and reads that did not map. Squares indicate father replicates, circles indicate mother replicates, and triangles indicate son replicates. Vertical dotted lines separate instrument groups. d, The same as c, but for bacterial species sequenced, colored by sequencing platform. For clarity, horizontal lines are provided at 0 and 100% where appropriate.
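The depth calculation described in panel b (total bases sequenced divided by genome size) can be sketched as follows. The function name and the example read counts and genome size are hypothetical round numbers chosen for illustration, not values from the study.

```python
def mean_depth(total_bases_sequenced, genome_size_bp):
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_bases_sequenced / genome_size_bp


# Hypothetical run: 300 million read pairs of 2 x 150 bp on a ~3.0-Gbp genome.
read_pairs = 300_000_000
bases_per_pair = 2 * 150
total_bases = read_pairs * bases_per_pair  # 90 billion bases
print(mean_depth(total_bases, 3_000_000_000))
```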
Fig. 2 | Distribution of genomic coverage across sequencing technologies for all replicates.
a, Aligned BAMs were downsampled to 25× mean read depth, and the distribution of coverage of each locus in the UCSC RepeatMasker regions was plotted using an inverse hyperbolic sine (IHS) transformation. Asterisks indicate significantly higher coverage for a given platform compared to the global mean, as measured by a one-tailed Wilcoxon test. *P < 0.01; **P < 0.001; ***P < 1 × 10⁻⁵. b, Comparison of each platform against all other platforms in each UCSC RepeatMasker context. Blue dots indicate >50% of shared sites are better represented in a given platform versus some other platform. Red dots indicate that the other platform out-covered the given platform. c, Coefficient of variation (CV) of coverage per platform per UCSC RepeatMasker type, examining a total of 10,000 sites per repeat type (with the exception of satellites, which had only n = 4,579 sites). Coverage was calculated for all bases within a region and variation was calculated among all replicates per platform, including replicates from Illumina HiSeq 2500 (n = 14), HiSeq 4000 (n = 15), HiSeq X10 (n = 10), BGISEQ-500 (n = 3), MGISEQ-2000 (n = 4), NovaSeq with 2 × 150-bp read chemistry (n = 6), NovaSeq with 2 × 250-bp read chemistry (n = 3), PacBio CCS (n = 3) and ONT PromethION (n = 3).
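Two quantities named in this caption, the inverse hyperbolic sine (IHS) transformation and the coefficient of variation (CV), can be sketched in a few lines. This is an illustration under assumptions: the caption does not say whether the CV uses the sample or population standard deviation (the sample form is assumed here), and the function names and coverage values are hypothetical.

```python
import math
import statistics


def ihs(x):
    """Inverse hyperbolic sine: asinh(x) = ln(x + sqrt(x^2 + 1)).

    Behaves like a log transform for large x but is defined at x = 0.
    """
    return math.asinh(x)


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical per-replicate coverage of one region on one platform.
coverage = [20, 30, 40]
print([round(ihs(c), 3) for c in coverage])
print(coefficient_of_variation(coverage))
```

Unlike a plain logarithm, the IHS transform is defined at zero, so loci with no coverage can be transformed without adding a pseudocount.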
Fig. 3 | Estimating rates of sequencing error per platform.
a, Bar plot showing total average error rate within each UCSC RepeatMasker context. Individual replicates per platform are shown as separate bars. Values are averaged across all bases covering a given context. The y axis is plotted as square root transformed. b,c, Proportional mismatch rates across GC windows (b) and base number (c). Values at each window are averaged across all reads from all replicates. For long-read platforms, read length is capped at 6 kbp. The y axis is plotted as square root transformed. d,e, Error rate in homopolymer (n = 72,687; d) and STR (n = 928,143; e) regions. In d, true homopolymers are shown at increasing copy number. In e, STRs are plotted by entropy, a measure of complexity of the motif. The y axis is plotted as square root transformed.
Fig. 4 | Validating SNPs and INDEL events from short-read datasets against the GIAB high-confidence truth set as determined by RTG vcfeval.
a, Common germline haplotype variant callers were compared for each sequencing platform across the entire genome, showing the sensitivity and specificity achieved by each, for every replicate. b, Overall sensitivity and specificity plotted for variants in each UCSC RepeatMasker region, overlapped with high-confidence regions for each cell line respectively. c, Presence matrix of true-positive SNP variants within each UCSC RepeatMasker region. Each column is one variant. A yellow value indicates that the majority of replicates for that platform captured that variant, whereas blue indicates that variant was missed. d, Same as for c, but for INDELs. e, Distribution of sizes of INDELs captured per sequencing platform. Values below zero on the x axis indicate deletions; values to the right indicate insertions. Number of true-positive INDELs is plotted per mutation size and colored by platform.
Fig. 5 | Assessing variability for the son (HG002) across HiSeq X10, 2000 and 4000, platforms that had more than one replicate per cell line to enable this analysis.
a, Number of SVs across sequencing reactions for HG002 replicates including deletions, duplications, inversions, insertions, translocations, total, SVs overlapping with the HG002 reference set, and SVs overlapping with GIAB high-confidence regions. b–d, Variability is shown that can be attributed to callers (b), platforms (c) and replicates (d). e, The distribution of single support (unique) SVs in 100-kb windows across the different stratification strategies.
Fig. 6 | Reproducibility of sequencing of bacterial genomes in a complex metagenomic mixture.
a, Distribution of taxonomic assignment of strains present in the metagenomic mixture (Bacillus subtilis, Chromobacterium violaceum, Enterococcus faecalis, Escherichia coli, Halobacillus halophilus, Haloferax volcanii, Micrococcus luteus, Pseudoalteromonas haloplanktis, Pseudomonas fluorescens and Staphylococcus epidermidis, sorted by order of GC content), stratified per replicate per sequencing platform. b, Heatmap showing the Spearman correlation of the average coverage within all instruments of each strain in the mixture. c, Distribution of presence of each taxon across all replicates from each sequencing instrument, with the expected 10% representation indicated by a horizontal dotted line. The taxa are ordered by GC content and have their Gram stain status indicated.
