Review

Nat Rev Genet. 2021 Sep;22(9):572-587.

doi: 10.1038/s41576-021-00367-3. Epub 2021 May 28.

Towards population-scale long-read sequencing

Wouter De Coster^#¹ ², Matthias H Weissensteiner^#³, Fritz J Sedlazeck⁴

Affiliations

¹ Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium.
² Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium.
³ Department of Biology, Penn State University, Pennsylvania, PA, USA.
⁴ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA. fritz.sedlazeck@bcm.edu.

^# Contributed equally.

PMID: 34050336
PMCID: PMC8161719
DOI: 10.1038/s41576-021-00367-3

Wouter De Coster et al. Nat Rev Genet. 2021 Sep.

Abstract

Long-read sequencing technologies have now reached a level of accuracy and yield that allows their application to variant detection at a scale of tens to thousands of samples. Concomitant with the development of new computational tools, the first population-scale studies involving long-read sequencing have emerged over the past 2 years and, given the continuous advancement of the field, many more are likely to follow. In this Review, we survey recent developments in population-scale long-read sequencing, highlight potential challenges of a scaled-up approach and provide guidance regarding experimental design. We provide an overview of current long-read sequencing platforms, variant calling methodologies, and de novo assembly and reference-based mapping approaches. Furthermore, we summarize strategies for variant validation, genotyping and predicting functional impact, and emphasize the challenges that remain in achieving long-read sequencing at a population scale.

Conflict of interest statement

W.D.C. and F.J.S. have received sponsored travel from PacBio and/or Oxford Nanopore. M.H.W. declares no competing interests.

Figures

**Fig. 1. Overview of population-scale studies using long-read sequencing.**
Studies published in 2019–2021 in which five or more samples were sequenced are included. Genome sizes of the study organisms are grouped into three categories (<500 Mbp, 500–2,000 Mbp and >2,000 Mbp), and the methodological approach taken to investigate genetic variation (comparison of assemblies, read mapping against a reference or both) is indicated by colour. For further details, see Table 1.

**Fig. 2. Overview of long-read population study design.**
a | The experimental design of three different approaches is outlined. In the first strategy (left), all samples are sequenced at medium to high coverage by long-read sequencing. In the second approach (middle), a proportion of the samples are sequenced with medium to high coverage and the remainder using low coverage by long-read sequencing (similar to the initial 1000 Genomes project). In the third approach (right), a proportion of the samples are sequenced at medium to high coverage by long-read sequencing and the remainder by short-read sequencing. The decision of which approach to take will affect the ability to detect common (red symbols) or rare (grey symbols) events in the population. The decision also depends on the available budget, existing data and the sample DNA availability. b | Overview of current established sequencing technologies based on CHM13 sequencing data: Illumina, Pacific Biosciences (PacBio) High Fidelity (HiFi) reads or ultra-long reads from Oxford Nanopore Technologies (ONT). The N50 read length and average read accuracy are highlighted in orange. Although each technology has advantages and disadvantages, HiFi and ONT are the most promising for future applications. c | Overview of analysis strategies. Although multiple approaches are available, the main decision is whether to use an alignment-based approach or a de novo assembly-based approach, which has implications for sequencing requirements and the approaches, resolution and comprehensiveness of downstream computational analysis.
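The N50 read length mentioned in panel b can be illustrated with a short calculation. The following is a minimal sketch, assuming the common definition of N50 as the length L such that reads of length L or longer contain at least half of all sequenced bases; the function name `n50` and the example lengths are hypothetical and are not taken from the Review.

```python
def n50(read_lengths):
    """Return the N50 of a collection of read lengths (in bases).

    Assumed definition: the length L such that reads of length >= L
    together contain at least half of the total number of bases.
    """
    lengths = sorted(read_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one read length")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest read downwards until half the bases are covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths, for illustration only.
example_lengths = [2_000, 3_000, 4_000, 5_000, 6_000]
print(n50(example_lengths))  # 5000
```

Because N50 is weighted by bases rather than by read count, a small number of very long reads can raise it well above the median read length.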

**Fig. 3. Potential problems for different genome comparison approaches.**
a | Schematic depiction of a potential problem in a de novo assembly-based approach. A novel segment N1 that is present in two de novo assemblies at different locations, and even more so a sequence variant within it (red x), is difficult to report correctly with current state-of-the-art methods and variation formats. b | The same N1 problem in an alignment-based approach: the coordinates of N1 are shared, but identifying the single-nucleotide variant (SNV) or the entire N1 sequence remains challenging. c | A graph-based representation of N1, which enables a clearer comparison of the variant across the samples and illustrates the potential benefits of graph genomes. R1–R3 represent the backbone of the graph genome; N1 and its SNV represent novel sequence for a given sample set.

**Fig. 4. Genotyping of SVs and SNVs across a population set.**
a | Graph genome-based genotyping of a region with multiple alleles between two genome segments (green and pink). Insertions of different sizes (yellow) can be genotyped at the same locus using spanning reads (blue and purple) to identify the presence of two different alleles. b | An example of structural variants (SVs) and single-nucleotide variants (SNVs) across different unique and repeat regions being correctly or incorrectly genotyped based on read length. c | A phylogenetically informed filtering approach for SVs. Assuming that after a sufficiently long time (4N_e generations, where N_e = effective population size) most or all genetic variation should be fully sorted between two clades, variants that do not adhere to this assumption and are polymorphic across clades (for example, variant 3) can be removed. Although this approach is very conservative and ignores the fact that some types of variation exhibit repeated mutations at the same locus, it can be considered a first step towards more reliable genotyping of SVs.
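The filtering idea in panel c can be sketched in a few lines. This is an illustration under a stated assumption, not the authors' implementation: "polymorphic across clades" is interpreted here as a variant having carriers in more than one clade, and such variants are dropped. The function name, the data layout and the sample and variant labels are all hypothetical.

```python
from typing import Dict, List, Set


def filter_cross_clade_variants(
    carriers: Dict[str, Set[str]],
    clade_of: Dict[str, str],
) -> List[str]:
    """Keep variant IDs whose carriers all belong to a single clade.

    carriers: variant ID -> set of sample IDs carrying the variant.
    clade_of: sample ID -> clade label.
    Assumption: a variant observed in more than one clade violates the
    'fully sorted' expectation and is removed.
    """
    kept = []
    for variant_id, samples in carriers.items():
        clades_with_variant = {clade_of[sample] for sample in samples}
        if len(clades_with_variant) <= 1:
            kept.append(variant_id)
    return kept


# Hypothetical toy data: two clades with two samples each.
clade_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
carriers = {
    "variant1": {"s1", "s2"},  # confined to clade A
    "variant2": {"s3"},        # confined to clade B
    "variant3": {"s2", "s3"},  # seen in both clades, so removed
}
print(filter_cross_clade_variants(carriers, clade_of))  # ['variant1', 'variant2']
```

As the caption notes, a rule like this is conservative: a true variant that arose independently in both clades would also be discarded.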

Grants and funding

UM1 HG008898/HG/NHGRI NIH HHS/United States
