NAR Genom Bioinform. 2022 Jul 13;4(3):lqac051.
doi: 10.1093/nargab/lqac051. eCollection 2022 Sep.

High-throughput method for the hybridisation-based targeted enrichment of long genomic fragments for PacBio third-generation sequencing

Tim Alexander Steiert et al. NAR Genom Bioinform. 2022.

Abstract

Hybridisation-based targeted enrichment is a widely used and well-established technique in high-throughput second-generation short-read sequencing. Although long-read technologies such as PacBio third-generation sequencing have high potential to genetically resolve highly repetitive and variable genomic sequences, targeted enrichment of long fragments has not yet reached the same throughput because of the complexity and technological dependencies of existing workflows. Here we describe a scalable targeted enrichment protocol for fragment sizes of >7 kb. For demonstration purposes, we developed a custom blood group panel of challenging loci. Test results showed a >65% on-target rate, good coverage (142.7×), sufficient coverage evenness for both non-paralogous and paralogous targets, and sufficient non-duplicate read counts (83.5%) per sample for a highly multiplexed enrichment pool of 16 samples. We genotyped the blood groups of nine patients using highly accurate phased assemblies at an allelic resolution that matched reference blood group allele calls determined by SNP array and NGS genotyping. Seven Genome-in-a-Bottle reference samples achieved high recall (96%) and precision (99%) rates. Mendelian error rates were 0.04% and 0.13% for the included Ashkenazim and Han Chinese trios, respectively. In summary, we provide a protocol and a first example of accurate targeted long-read sequencing that can be used in a high-throughput fashion.
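The Mendelian error rate quoted for the trios can be illustrated with a minimal sketch. It assumes unphased diploid genotypes encoded as allele tuples. The function names and the example genotypes are hypothetical and are not taken from the study, whose exact tooling is not described here.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's unphased genotype can be formed by taking
    one allele from the mother and one from the father."""
    return any(
        sorted((m, f)) == sorted(child) for m, f in product(mother, father)
    )


def mendelian_error_rate(trio_genotypes):
    """Fraction of sites at which the child genotype violates inheritance."""
    errors = sum(
        not is_mendelian_consistent(c, m, f) for c, m, f in trio_genotypes
    )
    return errors / len(trio_genotypes)


# Hypothetical sites as (child, mother, father); 0 = reference, 1 = alternate
sites = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # violation: mother cannot transmit allele 1
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (0, 1), (0, 0)),  # consistent
]
rate = mendelian_error_rate(sites)
print(f"Mendelian error rate: {rate:.2%}")
```

In this toy input, one of four sites is inconsistent, which is far higher than the 0.04% and 0.13% reported above.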

Figures

Graphical Abstract
We developed a high-throughput hybridisation-based targeted enrichment protocol for third-generation sequencing applications. Long fragments of >7 kb can be efficiently enriched with 16-sample multiplexing, resulting in >6 kb HiFi reads.
Figure 1.
Schematic workflow of a high-throughput compatible protocol for long fragment targeted enrichment. (A) High-molecular-weight (HMW) genomic DNA is physically fragmented to a desired fragment length of 10 kb by repeatedly passing it through the orifice of a g-tube (B). The centrifugation speed is the decisive parameter for the resulting fragment sizes and can be adjusted accordingly. (C) Unwanted smaller fragments are removed in the following step by clean-up with self-prepared size selection beads. Size selection is performed analogously to standard SPRI bead clean-up steps and can be done in 96-well plates on a commercial lab robot. (D) The size-selected fragments are end-repaired (ER), A-tailed (AT), and ligated (AL) to an adapter sequence, which can be identified by its molecular barcode in subsequent sample pools. (E) After a conventional clean-up, barcoded fragments can be pooled directly in equimolar ratios, dried down for hybridisation, resuspended in hybridisation buffers, and mixed with unspecific blocking sequences (see Materials and Methods). After denaturation of the sample DNA, the targeted enrichment bait panel is added to the reaction and the mix is incubated at 65°C overnight for hybridisation. In the hybridisation step, the baits bind their respective target sequences in the samples. The schematic depicts the capture of the desired target fragments, which become immobilised when DNA-linked biotin binds to the streptavidin surface of magnetic beads. Non-specifically bound fragments are washed away in a series of washing steps at different temperatures and with a variety of wash buffers. (F) The captured and enriched target fragments are amplified by PCR using primers specific to the ligated adapters. Because smaller fragments are amplified preferentially in PCR-based amplification and sequencing steps, fragment sizes decrease from the initial 10 kb to >7 kb (Supplementary Figure S1), and ultimately to >6 kb HiFi reads. (G) To avoid a drop in insert size, a second size selection step analogous to the first one is required to increase the average fragment size. This replaces the PCR clean-up step. PacBio libraries can subsequently be prepared from the size-selected, enriched, and barcoded fragments.
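The equimolar pooling in step (E) comes down to simple arithmetic, sketched below. The sketch assumes the common approximation of 660 g/mol per base pair of double-stranded DNA. The function names, sample names, concentrations, and the 50 fmol target are hypothetical and are not values from the protocol.

```python
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def molarity_nM(conc_ng_per_ul, mean_len_bp):
    """Convert a mass concentration (ng/µL) to molarity (nM = fmol/µL)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_len_bp)


def equimolar_volumes(libraries, fmol_per_sample):
    """Return the volume (µL) of each library that contains the same
    molar amount. `libraries` maps name -> (ng/µL, mean length in bp)."""
    return {
        name: fmol_per_sample / molarity_nM(conc, length)
        for name, (conc, length) in libraries.items()
    }


# Hypothetical barcoded libraries with a mean fragment length of 10 kb
libraries = {
    "sample_01": (66.0, 10_000),
    "sample_02": (33.0, 10_000),
}
volumes = equimolar_volumes(libraries, fmol_per_sample=50.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.1f} µL")
```

A library at half the concentration contributes twice the volume, so every barcode enters the pool with the same number of molecules.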
Figure 2.
General QC metrics. (A) Number of non-duplicate HiFi reads per sample for the 16-sample multiplexed capture pool. The green line indicates the mean of all samples. (B) Selected run and QC metrics: run sequence output in Gb, duplicate ratio for HiFi reads, mean and maximum read length, mean target coverage of the HiFi read alignments, on-target rate, fold enrichment, fold 80 base penalty (F80BP), and mean HiFi read accuracy based on an exemplary 10 kb fragment. Except for the read output and accuracy, all numbers reflect the mean over all 16 samples, and the respective standard deviation (SD) is provided. (C) Longest subread length versus polymerase read length. The count for each subread/polymerase read length is indicated by colour, referring to the scale on the right. (D) Coverage uniformity plot, showing the percentage of the target region covered with x reads. A representative set of loci of equivalent size with paralogous (‘par.’: GYPA/GYPB/GYPE, 81 kb, left panel) and non-paralogous, normal (‘nor.’: ACKR1/ABCB6/KEL/AQP1/SEMA7A/SLC4A1, 79 kb, right panel) sequences is depicted. Almost all target bases in both loci sets, paralogous (97%, 88.14–99.97%, SD = 3.47%) and normal (99.14%, 98.84–100%, SD = 0.25%), are covered with HiFi reads.
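Two of the metrics in this caption can be sketched from a list of per-base depths over the target region. The sketch assumes a commonly used definition of the fold 80 base penalty (mean target depth divided by the 20th-percentile depth) and a simple lower-rank percentile. The caption does not state the exact tool or definition used in the study, and the depth values are hypothetical.

```python
def fraction_covered(depths, min_depth=1):
    """Fraction of target bases covered by at least `min_depth` reads."""
    return sum(d >= min_depth for d in depths) / len(depths)


def fold80_base_penalty(depths):
    """Mean depth divided by the 20th-percentile depth.

    A value of 1.0 means perfectly even coverage; larger values mean that
    more sequencing is needed to lift 80% of bases to the current mean.
    """
    ordered = sorted(depths)
    p20 = ordered[int(0.2 * (len(ordered) - 1))]  # lower-rank percentile
    if p20 == 0:
        raise ValueError("20th-percentile depth is zero; penalty undefined")
    return sum(ordered) / len(ordered) / p20


# Hypothetical per-base depths across a small target region
depths = [50, 50, 100, 100, 100, 100, 100, 100, 150, 150]
print(fraction_covered(depths, min_depth=1))    # share of bases with any read
print(fraction_covered(depths, min_depth=100))  # share of bases at >= 100x
print(fold80_base_penalty(depths))
```

Evaluating `fraction_covered` over a range of `min_depth` values yields the kind of curve shown in a coverage uniformity plot such as panel (D).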
Figure 3.
PrecisionFDA benchmarking results. Venn diagrams depict the number of true positive (TP, dark grey) SNP variants that were detected in both the query and the reference data, as well as the number of false negative (FN, light blue) and false positive (FP, green) variants in the query data. This is provided for the seven GIAB reference samples analysed in the current study. Only high-confidence variants of the GIAB samples were considered in the analysis; all files were intersected with the respective high-confidence region BED files available for each GIAB sample. The resulting variant files were processed and benchmarked according to the PrecisionFDA truth challenge.
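The Venn diagram logic can be sketched with set operations. This is a simplified illustration, not the PrecisionFDA pipeline: it assumes variants are matched exactly as (chromosome, 1-based position, ref, alt) tuples and that high-confidence regions follow the BED convention of 0-based, half-open intervals. All variants and regions below are hypothetical.

```python
def in_regions(variant, regions):
    """True if a variant (1-based position) lies in any BED-style
    region given as (chrom, start, end) with 0-based, half-open bounds."""
    chrom, pos, _ref, _alt = variant
    return any(c == chrom and start <= pos - 1 < end for c, start, end in regions)


def benchmark(truth, query, regions):
    """Restrict both call sets to the regions, then count TP, FN and FP."""
    truth = {v for v in truth if in_regions(v, regions)}
    query = {v for v in query if in_regions(v, regions)}
    tp = len(truth & query)  # called in both sets
    fn = len(truth - query)  # in the reference only
    fp = len(query - truth)  # in the query only
    return {
        "TP": tp, "FN": fn, "FP": fp,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical call sets and one high-confidence region
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr1", 300, "G", "A"), ("chr1", 5000, "T", "C")}
query = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr1", 400, "T", "G"), ("chr1", 5000, "T", "C")}
regions = [("chr1", 0, 1000)]

result = benchmark(truth, query, regions)
print(result)
```

The shared variant at position 5000 falls outside the region and is therefore ignored, which mirrors the restriction to high-confidence regions described in the caption.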
Figure 4.
Comparison between protocols for targeted enrichment of long genomic fragments. The workflow for preparing 96 samples and the major equipment required are illustrated for each protocol. The best-case workflow is illustrated from sample fragmentation to enriched fragments that can be subjected to PacBio library preparation. Individual process steps for 96 samples are depicted with respect to an eight-hour workday, including hands-on and incubation times. Overnight incubation was considered when possible for the respective step. Enrichment process steps encompass fragmentation (fragm.), bead-based clean-up (c.u.), size selection (si.sel.), end repair (end rep.), A-tailing (a-tail.), combined step of the last two (e.r.a.t.), adapter ligation (ada.lig.), pre-capture PCR (pre-capt. PCR), probe hybridisation (hyb.), streptavidin capture (capt.), post-capture PCR (post-capt. PCR), sample concentration (concentr.) and pre-repair (pre-rep.). Throughput for the introduced protocol is significantly higher than for the comparable approaches reported elsewhere.
