Review
Nat Rev Genet. 2021 Sep;22(9):572-587.
doi: 10.1038/s41576-021-00367-3. Epub 2021 May 28.

Towards population-scale long-read sequencing

Wouter De Coster et al. Nat Rev Genet. 2021 Sep.

Abstract

Long-read sequencing technologies have now reached a level of accuracy and yield that allows their application to variant detection at a scale of tens to thousands of samples. Concomitant with the development of new computational tools, the first population-scale studies involving long-read sequencing have emerged over the past 2 years and, given the continuous advancement of the field, many more are likely to follow. In this Review, we survey recent developments in population-scale long-read sequencing, highlight potential challenges of a scaled-up approach and provide guidance regarding experimental design. We provide an overview of current long-read sequencing platforms, variant calling methodologies and approaches for de novo assemblies and reference-based mapping approaches. Furthermore, we summarize strategies for variant validation, genotyping and predicting functional impact and emphasize challenges remaining in achieving long-read sequencing at a population scale.

Conflict of interest statement

W.D.C. and F.J.S. have received sponsored travel from PacBio and/or Oxford Nanopore. M.H.W. declares no competing interests.

Figures

Fig. 1. Overview of population-scale studies using long-read sequencing.
Studies published in 2019–2021 in which five or more samples were sequenced are included. Genome size of study organisms is viewed in three different categories (<500 Mbp, 500–2,000 Mbp and >2,000 Mbp), and the methodological approach taken to investigate genetic variation (comparison of assemblies, read mapping against a reference or both) is illustrated by the different colours. For further details, see Table 1.
Fig. 2. Overview of long-read population study design.
a | The experimental design of three different approaches is outlined. In the first strategy (left), all samples are sequenced at medium to high coverage by long-read sequencing. In the second approach (middle), a proportion of the samples are sequenced with medium to high coverage and the remainder using low coverage by long-read sequencing (similar to the initial 1000 Genomes project). In the third approach (right), a proportion of the samples are sequenced at medium to high coverage by long-read sequencing and the remainder by short-read sequencing. The decision of which approach to take will affect the ability to detect common (red symbols) or rare (grey symbols) events in the population. The decision also depends on the available budget, existing data and the sample DNA availability. b | Overview of current established sequencing technologies based on CHM13 sequencing data: Illumina, Pacific Biosciences (PacBio) High Fidelity (HiFi) reads or ultra-long reads from Oxford Nanopore Technologies (ONT). The N50 read length and average read accuracy are highlighted in orange. Although each technology has advantages and disadvantages, HiFi and ONT are the most promising for future applications. c | Overview of analysis strategies. Although multiple approaches are available, the main decision is whether to use an alignment-based approach or a de novo assembly-based approach, which has implications for sequencing requirements and the approaches, resolution and comprehensiveness of downstream computational analysis.
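The N50 read length mentioned for panel b can be computed from a list of read lengths. The following is a minimal sketch in Python, assuming the common definition of N50 (the length L such that reads of length L or longer contain at least half of all sequenced bases); the function name `n50` and the toy read lengths are illustrative and do not come from the review.

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths.

    Assumed definition: the largest length L such that reads of
    length >= L together contain at least half of all bases.
    """
    if not read_lengths:
        raise ValueError("need at least one read length")
    total = sum(read_lengths)
    running = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example (hypothetical lengths): total = 20 bases,
# and 6 + 5 = 11 bases already cover half of them.
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))  # prints 5
```

Because N50 weights reads by the bases they contribute, a small number of very long reads can raise it substantially, which is why it is usually reported alongside read accuracy rather than instead of it.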
Fig. 3. Potential problems for different genome comparison approaches.
a | Schematic depiction of a potential problem in a de novo assembly-based approach. The presence of a novel segment N1 at different locations in two de novo assemblies and, even more so, a sequence variant within it (red x) pose a challenge to correct reporting by current state-of-the-art methods and variation formats. b | Similar representation of the N1 problem in an alignment-based approach, where the coordinates of N1 are shared, but identification of the single-nucleotide variant (SNV) or of the entire N1 sequence remains challenging. c | A graph-based representation of N1, which enables a clearer comparison of the variant across the samples, illustrating the potential benefits of graph genomes. R1–R3 represent the backbone of the graph genome, and N1 and its SNV represent novel sequence for a given sample set.
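The graph representation in panel c can be illustrated with a toy data structure. This is only a sketch under stated assumptions: the node names R1–R3 and N1 come from the caption, but the sequences, the placement of N1 between R1 and R2, the `N1_snv` node and all function names are hypothetical choices made for illustration.

```python
# Hypothetical toy sequences; only the node names R1-R3 and N1 come from the caption.
NODES = {
    "R1": "ACGT",
    "R2": "GGCA",
    "R3": "TTAG",
    "N1": "CCCGAT",      # novel segment
    "N1_snv": "CCCTAT",  # the same segment carrying one SNV
}

# Assumed topology: N1 sits between R1 and R2 (the caption does not fix its position).
EDGES = {
    "R1": ["R2", "N1", "N1_snv"],
    "N1": ["R2"],
    "N1_snv": ["R2"],
    "R2": ["R3"],
    "R3": [],
}


def paths(start, end):
    """Yield every path through the (acyclic) graph from start to end."""
    if start == end:
        yield [start]
        return
    for nxt in EDGES[start]:
        for rest in paths(nxt, end):
            yield [start] + rest


def spell(path):
    """Concatenate node sequences along a path to get a haplotype sequence."""
    return "".join(NODES[node] for node in path)


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


for path in paths("R1", "R3"):
    print(" -> ".join(path), spell(path))
```

In this representation each sample is a path, so two samples that both carry N1 share a node (or differ only by the `N1_snv` node) regardless of where a linear assembly would have placed the segment.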
Fig. 4. Genotyping of SVs and SNVs across a population set.
a | Graph genome-based genotyping of a region with multiple alleles between two genome segments (green and pink). Insertions of different sizes (yellow) can be genotyped at the same locus using spanning reads (blue and purple) to identify the presence of two different alleles. b | An example of structural variants (SVs) and single-nucleotide variants (SNVs) across different unique and repeat regions being correctly or incorrectly genotyped based on read length. c | A phylogenetically informed filtering approach for SVs. Assuming that after a sufficiently long time (4Ne generations, where Ne is the effective population size) most or all genetic variation should be fully sorted between two clades, variants that do not adhere to this assumption and are polymorphic across clades (for example, variant 3) can be removed. Although this approach is certainly very conservative and ignores the fact that some types of variation exhibit repeated mutations at the same locus, it can be considered a first step towards more reliable genotyping of SVs.
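The filtering idea in panel c can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' implementation: it assumes haploid presence/absence genotypes (0 or 1 per sample) and interprets "polymorphic across clades" as a variant that is polymorphic within both clades. The sample names, genotype values and function names are hypothetical.

```python
def is_polymorphic(genotypes):
    """True if more than one allele state occurs among the genotypes."""
    return len(set(genotypes)) > 1


def filter_variants(variants, clade_a, clade_b):
    """Keep variants that are not polymorphic in both clades.

    variants: dict mapping variant name -> dict of sample -> allele (0 or 1)
    clade_a, clade_b: lists of sample names belonging to each clade
    """
    kept = []
    for name, genotypes in variants.items():
        in_a = [genotypes[sample] for sample in clade_a]
        in_b = [genotypes[sample] for sample in clade_b]
        # Assumed rule: segregating in both clades violates full sorting.
        if is_polymorphic(in_a) and is_polymorphic(in_b):
            continue
        kept.append(name)
    return kept


# Hypothetical genotypes for four samples in two clades.
clade_a = ["s1", "s2"]
clade_b = ["s3", "s4"]
variants = {
    "variant1": {"s1": 1, "s2": 1, "s3": 0, "s4": 0},  # fixed difference
    "variant2": {"s1": 1, "s2": 0, "s3": 0, "s4": 0},  # private to clade A
    "variant3": {"s1": 1, "s2": 0, "s3": 1, "s4": 0},  # polymorphic in both
}
print(filter_variants(variants, clade_a, clade_b))  # prints ['variant1', 'variant2']
```

As the caption notes, such a rule is conservative: recurrent mutation at the same locus would also produce a variant that segregates in both clades, and this sketch would discard it.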
