[Preprint]. 2025 Oct 5:2025.10.02.25336942.
doi: 10.1101/2025.10.02.25336942.

Population-scale Long-read Sequencing in the All of Us Research Program

Kiran V Garimella et al. medRxiv.

Abstract

The All of Us Research Program (AoU) is a national biobank seeking to enroll one million individuals in the United States to link genomic and biomedical data, including short- and long-read whole-genome sequencing (srWGS/LRS), with rich electronic health record (EHR) information. Here, we present the first large-scale analyses of long-read sequencing (LRS) in AoU and offer a new framework for deriving genomic insights into complex structural variation (SV) of relevance to human health and disease. We performed joint analyses of 1,027 individuals self-identifying as Black or African American, sequenced to ~8x coverage with Pacific Biosciences HiFi technology and processed using cloud-native pipelines. From these LRS data we constructed a comprehensive variant callset encompassing known (FMR1 and HTT) and novel repeat expansions, clinically relevant haplotypes at loci inaccessible to srWGS, and haplotypes relevant to disease risk (HLA) and pharmacogenomics (CYP2D6), including SNVs, indels, and SVs. We developed methods for cohort-level variant calling and a scalable workflow to impute >750,000 of these SVs into existing srWGS datasets for trait association and human disease studies. Expanding to 10,000 self-identified Black or African American AoU participants with srWGS and matched EHRs, we identified 291 SV-disease associations (p < 1×10⁻⁵) spanning 226 conditions, with 50.9% of associations involving SVs absent from the matched srWGS callset. Across the 226 traits, after fine-mapping using SVs and SNVs we identified 191 SV-disease pairs spanning 160 traits (70.8%) where the SV had the strongest association within the locus. Associations specific to those with computed ancestry similar to the African reference population exhibited larger effect sizes and lower allele frequencies, consistent with high-risk, ancestry-specific variants. These results demonstrate that the integration of LRS into AoU and future biobank initiatives can provide transformative new insights into genomic variation with potentially profound impact on precision medicine.

Conflict of interest statement

Ethics declarations / Competing interests: K.V.G. is a co-inventor on a pending international patent application related to long-read RNA isoform sequencing, licensed to Pacific Biosciences, but not used in this study. E.E.E. is a scientific advisory board (SAB) member of Variant Bio, Inc. F.J.S. receives research support from Illumina and Nanopore. The remaining authors declare no competing interests.

Figures

Fig. 1: LRS of self-identified Black or African American participants from All of Us.
a, 1,027 participants selected for PacBio HiFi sequencing (black vertical lines) from 245,388 AoU participants (top multicolored bar) with existing short-read data in CDRv7 releases and beyond. Computed genetic ancestry from srWGS data shown for: AFR–African, AMR–Admixed American, EAS–East Asian, EUR–European, MID–Middle Eastern, SAS–South Asian, OTH–other (typically admixed). b, PCA of SNV genetic data from participants selected for PacBio LRS (gray diamonds), overlapping participants with Oxford Nanopore Technologies (ONT) data (red, filled diamonds), and participants from the Human Pangenome Reference Consortium (HPRC) for comparison and evaluation (blue triangles), projected onto the AoU srWGS data. c, Geographic sampling locations for 1,027 long-read participants collected across the continental U.S., shown at the US Census Bureau Census Division level (no LRS participants sampled from Alaska, Hawaii, or other territories). d, Coverage distributions of selected LRS participants, stratified by sequencing technology. e, Mean read lengths of selected LRS participants, stratified by sequencing technology. f, Clinical phenotype distribution in AoU-LR participants based on EHR data. Diagnoses were mapped using the Observational Medical Outcomes Partnership (OMOP) concept IDs and interpreted via standard Observational Health Data Sciences and Informatics (OHDSI) concept mappings. For each disease category, all descendant concept IDs were identified using the OHDSI CONCEPT_ANCESTOR table, and counts reflect the number of unique participants with a diagnosis matching any related concept.
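The descendant-based counting described for panel f can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the rows are toy data with made-up concept and person IDs, the function name `count_participants` is hypothetical, and only the column order (ancestor_concept_id, descendant_concept_id) follows the OMOP CONCEPT_ANCESTOR table.

```python
# Toy rows shaped like OMOP tables; every ID below is invented for illustration.
# CONCEPT_ANCESTOR rows: (ancestor_concept_id, descendant_concept_id)
CONCEPT_ANCESTOR = [
    (100, 100),  # OMOP lists each concept as its own ancestor
    (100, 101),
    (100, 102),
    (200, 200),
    (200, 201),
]

# Diagnosis rows: (person_id, condition_concept_id)
CONDITION_OCCURRENCE = [
    (1, 101),
    (1, 102),  # same person, second matching diagnosis: still counted once
    (2, 100),
    (3, 201),
    (4, 999),  # concept outside both categories
]


def count_participants(category_concept_id, ancestor_rows, condition_rows):
    """Count unique participants with a diagnosis under a category concept."""
    # Step 1: collect every descendant concept of the category.
    descendants = {d for a, d in ancestor_rows if a == category_concept_id}
    # Step 2: collect unique participants with any matching diagnosis.
    people = {p for p, c in condition_rows if c in descendants}
    return len(people)


if __name__ == "__main__":
    for category in (100, 200):
        n = count_participants(category, CONCEPT_ANCESTOR, CONDITION_OCCURRENCE)
        print(category, n)
```

Using sets for both steps is what makes the count reflect unique participants rather than diagnosis records.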
Fig. 2: Structural variant calling and quality control.
a, Receiver operating characteristic (ROC) curves for per-participant SV filtering performance, evaluated against 47 high-quality diploid assemblies from the HPRC. True positive rate (TPR) and false positive rate (FPR) are shown under strict and lenient filtering thresholds, stratified by variant type and genomic context (tandem repeat [TR] vs. non-TR regions). One participant is highlighted for clarity. b, De Finetti diagram depicting ratios of homozygous reference, heterozygous, and homozygous alternate genotypes at each SV site in strict (left, n=616,411 SVs) and lenient (right, n=1,157,726 SVs) callsets, as well as the fraction of SV sites in Hardy-Weinberg equilibrium. Every site is assumed to be biallelic. c, Distribution of SV lengths for insertions and deletions, stratified by callset. Peaks corresponding to known mobile element families (Alu, SVA, and L1) are labeled. d, Per-participant SV counts across different groups: 47 HPRC samples used for comparison, 1,027 AoU participants with long reads (subsets with other data types indicated by horizontal bars; PacBio: purple; ONT: blue; Illumina: red). Purple and blue curves represent expected variant counts based on raw Sniffles2 calls on ~30x PacBio data from HPRC samples and ~35x ONT data from AoU, respectively, illustrating expected SV yield at higher coverage. e, Histograms of unfiltered allele frequencies from the AoU+HPRC panel (x-axis) and 2,504 unrelated 1KGP short-read samples imputed against it (y-axis), for SV-length insertion/deletion (left/right, Pearson correlation coefficient 0.84/0.94) bubbles. f, Number of imputed AoU LRS SVs per participant across five continental population groups in 2,504 unrelated 1KGP short-read samples. g, UpSet plots show the number of protein-coding genes (including intronic loci) intersected by an SV across the five continental groups in 1KGP.
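As a rough illustration of the quantities behind panel b, the sketch below computes the three genotype fractions plotted in a De Finetti diagram and a Hardy-Weinberg check for one biallelic site. The caption does not say which test was used; the chi-square test with one degree of freedom here is an assumption, and the function names are hypothetical.

```python
import math


def genotype_fractions(n_ref, n_het, n_alt):
    """Fractions of hom-ref, het, and hom-alt genotypes (De Finetti coordinates)."""
    n = n_ref + n_het + n_alt
    return n_ref / n, n_het / n, n_alt / n


def hwe_chi2_pvalue(n_ref, n_het, n_alt):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium at a biallelic site."""
    n = n_ref + n_het + n_alt
    p = (2 * n_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p                          # alternate allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic site: nothing to test
    observed = (n_ref, n_het, n_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    print(genotype_fractions(25, 50, 25))
    print(hwe_chi2_pvalue(25, 50, 25))  # counts match HWE expectations exactly
    print(hwe_chi2_pvalue(50, 0, 50))   # no heterozygotes: strong deviation
```

For rare variants an exact test is usually preferred over the chi-square approximation; the approximation is used here only to keep the example short.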
Fig. 3: Expanded triplet repeat loci and structure of disease-associated loci.
a, Most common and b, longest FMR1 repeat alleles. Sequences were extracted from 1,753 sex-filtered long-read haplotypes with TRGT and plotted according to repeat and interruption patterns. AGG interruptions are included in repeat counts. Arrows indicate unstable transmission risk alleles with 25 to 33 (light blue) or at least 34 (dark blue) uninterrupted CGG repeats, premutation alleles with at least 55 repeats (red), and an allele with five interruptions (yellow). c, Length distributions indicating the number of repeat units in extracted TRGT sequences for FMR1 and HTT. d, Most common and e, longest HTT repeat alleles. Sequences were extracted from 1,944 long-read haplotypes with TRGT and plotted. Blue arrows indicate intermediate-length unstable transmission risk alleles with 27 to 35 CAG repeats, and red arrows indicate potential low-penetrance Huntington’s disease alleles with at least 36 repeats. f, Boxplot of the distribution of tandem repeat length variation values (measured by median absolute deviation) for the entire catalog (first box), a set of loci specifically expanded in humans relative to nonhuman primates (second box), the CODIS loci (third box), and the known pathogenic loci (fourth box). g, 5' UTR CGG repeats and h, protein-coding CAG repeats with median (50th percentile) repeat length for each locus plotted against the 99th percentile repeat length, both measured in trinucleotide units. At each locus, the longest pure (uninterrupted) repeat was segmented. A y=x trend line is included for reference, indicating where the median and 99th percentile repeat lengths are equal. Out of 4,115 CGG loci and 1,483 CAG loci, 11 and 12 disease-associated loci are highlighted in red/green with annotated genes, while non-disease-associated loci are shown in purple/orange.
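The length-variation measure named in panel f, median absolute deviation (MAD), can be computed per locus as in this minimal sketch. The allele lengths are invented, and the function name is hypothetical; how the authors derived lengths from TRGT output is not shown here.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of absolute deviations from the median."""
    values = list(values)
    center = median(values)
    return median([abs(v - center) for v in values])


if __name__ == "__main__":
    # Hypothetical repeat-unit counts observed at two loci across haplotypes.
    variable_locus = [20, 20, 21, 22, 30]
    stable_locus = [10, 10, 10, 10]
    print(median_absolute_deviation(variable_locus))
    print(median_absolute_deviation(stable_locus))
```

MAD is robust to a few very long alleles, which makes it a reasonable summary when rare expansions would dominate a variance-based measure.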
Fig. 4: CYP2D6–7 haplotype structure and content.
a, Copy number variation and hybrid structures for 1,266 AoU-LR Phase 1 haplotypes. Hifiasm assemblies with a single contig over the full CYP2D6-7 locus were divided into 100 bp k-mers and colored by the best-mapping reference gene. An example plot for each structure is shown along with the number of haplotypes the structure was seen in. b, All (left) and novel (right) missense variants. Predicted haplotype coding sequences were compared with the GRCh38 coding sequence, which was substituted with changes for the assigned (sub)star allele to find novel variants. Variants are colored by AlphaMissense classification, and asterisks indicate presence in the srWGS and LRS AoU Hail matrix tables.
Fig. 5: Functional impact of AoU SVs.
a, Distribution of CADD-SV scores for shared SVs in the AoU strict cohort. Each data point represents an SV. The x-axis indicates the number of participants carrying the SV, and the y-axis shows the PHRED-scaled CADD score. “Known” SVs are those identified in at least one of the 1KGP-ONT, HGSVC, or HPRC datasets, while “AoU detected” SVs are those absent from those datasets. The dashed gray horizontal line denotes the score threshold above which SVs are considered likely pathogenic. b, Relationship between the best-fit slope (β) derived from OLS regression and gene-level q-values. The eGenes shown are medically significant and listed in OMIM, with high-confidence q-values (<1 × 10⁻⁶⁰). c, Distribution of genotypes and gene expression values in 731 participants for the BID-associated deletion, with a q-value of 5.92 × 10⁻¹³. d, Manhattan plot of the 322 bp deletion and nearby SNVs, with log10 P values. The deletion is the top variant, and points are colored by their LD (r²) with this SV. The enhancer overlapping the SV is shown in green. e, Histogram showing the number of SVs in LD (r² ≥ 0.5) with SNVs from the GWAS catalog. SVs located within trait-associated genes are shown in orange, and those outside these genes are shown in blue. Bars with diagonal hatching (///) indicate SVs linked to disease- or disorder-related traits. f, LD heatmap of an insertion (marked in blue) along with nearby GWAS variants and SVs located within ±100 kbp of the insertion. GWAS variants in LD are labeled with their IDs. GWAS-associated genes are shown below the heatmap.
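The LD statistic used in panels d to f, r², can be illustrated for one SV and one SNV as below. This sketch assumes phased haplotypes coded 0 (reference) and 1 (alternate); the caption does not state how the authors computed LD, and the function name `ld_r2` and the haplotype vectors are hypothetical.

```python
import math


def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic sites on phased haplotypes."""
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n                                 # alt frequency at site A
    p_b = sum(hap_b) / n                                 # alt frequency at site B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # both alt on one haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return math.nan  # r^2 is undefined when either site is monomorphic
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


if __name__ == "__main__":
    sv = [0, 1, 0, 1]
    snv_tagging = [0, 1, 0, 1]      # always co-occurs with the SV
    snv_independent = [0, 0, 1, 1]  # unrelated to the SV
    print(ld_r2(sv, snv_tagging))
    print(ld_r2(sv, snv_independent))
    # Threshold from panel e: an SV counts as "in LD" when r^2 >= 0.5.
    print(ld_r2(sv, snv_tagging) >= 0.5)
```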
Fig. 6: Genome-wide linkage and trait associations of SVs.
a, Genotyping and imputation with the long-read panel. Small variants and SVs are identified from 1,074 long-read participants (AoU + HPRC) and then genotyped and imputed in 10,000 short-read self-identified Black or African American participants from the All of Us biobank, with the total number and allele frequency distributions of SNVs and SVs shown. b, Disease phenotypes are extracted from EHRs of the same participants and grouped into 11 categories. Conditions belonging to multiple categories are assigned to all relevant groups. c, Comparison of association significance between each SV and the strongest nearby SNV (within ±100 kbp) for the same phenotype (central panel). Points are colored by LD (r²) between the SV and SNP. Circles indicate genic SVs (overlapping gene bodies), and triangles indicate intergenic SVs. The top and right panels display stacked histograms of −log10(p-values) for SNP- and SV-disease associations, respectively. Genic SVs are shown in purple; intergenic SVs are shown in light blue. d, Distribution of genome-wide SV-disease associations with large effect sizes (odds ratio [OR] ≥ 2.5 or ≤ 0.4) in which the SV is the lead variant. Orange indicates associations with SVs discovered in 1,027 AoU and 47 HPRC long-read samples and subsequently imputed in 10,000 AoU short-read participants, while blue represents associations with SVs identified in the short-read callset derived from the same 1,027 AoU participants. e, Relationship between allele frequency (log scale) and relative risk (OR > 1) for genome-wide disease-associated SVs with stronger signals than nearby SNPs. Circles indicate SVs within gene bodies, and triangles indicate intergenic SVs. Associations specific to African genetic ancestry are shown in yellow and all others in gray. Point size reflects association significance. f, Manhattan plots showing a 50 bp deletion linked to hypertensive heart failure. The SV is the lead variant, and surrounding points are colored based on their LD (r²) with the SV. g, Relationship between allele frequency (log scale) and relative risk (OR > 1) for coding-region disease-associated SVs with stronger signals than nearby SNPs. Associations specific to African genetic ancestry are shown in yellow and all others in gray. Point size reflects association significance. h, Manhattan plots showing a 200 bp insertion associated with atelectasis. The insertion is the lead variant, and surrounding points are colored based on their LD (r²) with the SV.
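The large-effect filter in panel d (OR ≥ 2.5 or ≤ 0.4) can be illustrated with an odds ratio from a 2×2 carrier-by-disease table. This is a simplified sketch: the counts are invented, the function names are hypothetical, and the study's odds ratios likely come from regression models with covariates rather than raw counts.

```python
def odds_ratio(carrier_cases, carrier_controls, noncarrier_cases, noncarrier_controls):
    """Odds ratio of disease for SV carriers versus non-carriers (2x2 table)."""
    if carrier_controls == 0 or noncarrier_cases == 0:
        raise ValueError("odds ratio is undefined with an empty off-diagonal cell")
    return (carrier_cases * noncarrier_controls) / (carrier_controls * noncarrier_cases)


def is_large_effect(or_value, upper=2.5, lower=0.4):
    """Apply the panel d thresholds: strong risk or strong protective effect."""
    return or_value >= upper or or_value <= lower


if __name__ == "__main__":
    # Hypothetical counts: 30 of 100 carriers and 100 of 900 non-carriers are cases.
    value = odds_ratio(30, 70, 100, 800)
    print(round(value, 2), is_large_effect(value))
```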

References

    1. All of Us Research Program Investigators et al. The ‘All of Us’ Research Program. N. Engl. J. Med. 381, 668–676 (2019).
    2. Bianchi D. W. et al. The All of Us Research Program is an opportunity to enhance the diversity of US biomedical research. Nat. Med. 30, 330–333 (2024).
    3. All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature 627, 340–346 (2024).
    4. Chen S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).
    5. Collins R. L. et al. A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020).
