Am J Hum Genet. 2024 Dec 5;111(12):2618-2642.
doi: 10.1016/j.ajhg.2024.10.021.

Prequalification of genome-based newborn screening for severe childhood genetic diseases through federated training based on purifying hyperselection

Stephen F Kingsmore et al. Am J Hum Genet.

Abstract

Genome-sequence-based newborn screening (gNBS) has substantial potential to improve outcomes in hundreds of severe childhood genetic disorders (SCGDs). However, a major impediment to gNBS is imprecision due to variants classified as pathogenic (P) or likely pathogenic (LP) that are not SCGD causal. gNBS with 53,855 P/LP variants, 342 genes, 412 SCGDs, and 1,603 therapies was positive in 74% of UK Biobank (UKB470K) adults, suggesting 97% false positives. We used the phenomenon of purifying hyperselection, which acts to decrease the frequency of SCGD causal diplotypes, to reduce false positives. Training of gene-disease-inheritance mode-diplotype tetrads in 618,290 control and affected subjects identified 293 variants or haplotypes and seven genes with variable inheritance contributing higher positive diplotype counts than consistent with purifying hyperselection and with little or no evidence of SCGD causality. With these changes, 2.0% of UKB470K adults were positive. In contrast, gNBS was positive in 7.2% of 3,118 critically ill children with suspected SCGDs and 7.9% of 705 infant deaths. When compared with rapid diagnostic genome sequencing (RDGS), gNBS had 99.1% recall. In eight true-positive children, gNBS was projected to decrease time to diagnosis by a median of 121 days and avoid life-threatening disease presentations in four children, organ damage in six children, ∼$1.25 million in healthcare cost, and ten (1.4%) infant deaths. Federated training predicated on purifying hyperselection provides a general framework to attain high precision in population screening. Federated training across many biobanks and clinical trials can provide a privacy-preserving mechanism for qualification of gNBS in diverse genetic ancestries.
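The abstract does not spell out how the 97% false-positive estimate was derived. One reading, which is an assumption here and not confirmed by the source, is that positives eliminated by training count as false positives. The short check below shows that the quoted percentages are consistent with that reading, and that ten of 705 infant deaths is indeed 1.4%.

```python
# Figures quoted in the abstract.
positive_before = 0.74   # share of UKB470K adults positive before training
positive_after = 0.020   # share of UKB470K adults positive after training
infant_deaths = 705
avoidable_deaths = 10

# Assumption (not stated in the source): positives removed by training
# are treated as false positives.
removed_share = (positive_before - positive_after) / positive_before
avoidable_share = avoidable_deaths / infant_deaths

print(f"share of initial positives removed: {removed_share:.1%}")  # 97.3%
print(f"avoidable infant deaths: {avoidable_share:.1%}")           # 1.4%
```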

Keywords: artificial intelligence; diplotype; false positive; genetic architecture; genome sequencing; infant mortality; newborn screening; purifying hyperselection; query federation; severe childhood genetic diseases.

Conflict of interest statement

Declaration of interests: K.P.H., C.M.K., and S.S.M. are employees and shareholders of Illumina, Inc. W.R.M., Y.L., and T.D. are employees and shareholders of Alexion, AstraZeneca Rare Disease. E.F. is an employee and shareholder of Fabric Genomics, Inc. M.K. and S. Schwartz are employees and shareholders of Genomenon, Inc. J.L., C.K., and S. Shelnutt are employees and shareholders of TileDB, Inc. M.Y. is a co-founder and consultant of Fabric Genomics, Inc. S.F.K. has filed a patent related to this work.

Figures

Graphical abstract
Figure 1
Technical approach to structured, adaptive development of the BeginNGS.2 SCGD screening, diagnosis, and treatment platform.
(A) Development of a structured SCGD molecular and treatment knowledge base and screening algorithm that is trained in multicentric, large diplotype models. Federated training identifies variants in BeginNGS.2 genes contributing to diplotypes with frequencies (f_diplotype 1 … n) inconsistent with purifying hyperselection, such that f_diplotype 1 … n are greater than P, the population prevalence of the corresponding genetic disease(s) after correction for penetrance (p), expressivity (e), diplotype heterogeneity (d), and locus heterogeneity (l) (Figure 2).
(B) Highly automated platform for scalable population screening, diagnosis, and treatment that is empowered by the knowledge base and trained algorithm. Automated interpretation includes a diplotype query and use of the Transformer tool.
(C) Federated learning by (1) iterative queries of genomic sequences of UKB470K and RCIGM RDGS cohorts, with (2) return of positive diplotypes with zygosity and count of positive subjects and (3) removal of NSDCC variants and disorder MOI contributing excess positive counts.
Abbreviations: Rx, treatment (therapeutic intervention); GS, genome sequence; SME, subject matter expert; MOI, mode of inheritance; GTRx, Genome-to-Treatment; eCDS, electronic clinical decision support; DBS, dried blood spot; Exp., expected; TP, true positive rate; TN, true negative rate; AWS, Amazon web services; aiSNPs, ancestry-informative single-nucleotide polymorphisms; ETL, extract, transform, load; Db, database; VEP, variant effect predictor.
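The query loop in panel (C) can be pictured with a minimal sketch. It assumes each site evaluates the diplotype query locally and returns only per-diplotype counts of positive subjects, which a coordinator then sums. The gene and variant names, the tuple layout, and the functions `site_query` and `federate` are hypothetical illustrations, not the BeginNGS.2 implementation.

```python
from collections import Counter

# A diplotype is keyed as (gene, variant IDs, zygosity). All names are hypothetical.
HOM_A = ("GENE_A", ("var1",), "homozygous")
CH_B = ("GENE_B", ("var2", "var3"), "compound heterozygous")
OTHER = ("GENE_C", ("var9",), "heterozygous")  # not part of the screen
SCREENED = {HOM_A, CH_B}

def site_query(subjects, screened):
    """Run inside one site: count positive subjects per screened diplotype.

    Only aggregate counts leave the site; subject-level rows do not.
    """
    counts = Counter()
    for diplotypes in subjects:
        # A subject contributes at most once to each diplotype's count.
        for d in set(diplotypes) & screened:
            counts[d] += 1
    return counts

def federate(per_site_counts):
    """Coordinator side: sum the counts returned by every site."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)
    return total

# Toy cohorts: each inner list holds one subject's diplotypes.
site_1 = [[HOM_A], [], [CH_B], [HOM_A, OTHER]]
site_2 = [[HOM_A], [CH_B, HOM_A]]

totals = federate([site_query(site_1, SCREENED), site_query(site_2, SCREENED)])
for diplotype, count in totals.most_common():
    print(diplotype[0], diplotype[2], count)
```

In this sketch, step (3) would correspond to removing a diplotype's contributing variants from `SCREENED` after review and rerunning the query.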
Figure 2
Training of the BeginNGS.2 genetic disease screening algorithm in multicentric, large diplotype models.
(A) Federated training in large GS cohorts flags P or LP variants for evaluation as non-severe disease causing in childhood (NSDCC) based on absence of purifying hyperselection, evidenced by contributing diplotype frequencies (f) that are greater than those expected based on the sum of the corresponding disease prevalences (P) following correction for penetrance (p), expressivity (e), diplotype heterogeneity (d), and locus heterogeneity (l).
(B) Manhattan plot of counts of 2,785 diplotypes that were gNBS positive in UKB470K. The x axis shows chromosome number and relative nucleotide position from the lowest value (left) to the highest value (right). The y axis is the diplotype count in UKB470K. 113 diplotypes with counts ≥54 in UKB470K (frequency >1 in 8,703) are indicated in green if disease causal (n = 16) and in red if determined to be NSDCC (n = 97) using the method of (A). The top 109 CFTR diplotypes (with counts >3, 1 in 118,000) are also indicated in green if disease causal (n = 5) and in red if not (n = 104).
(C) Rank ordering of 2,785 diplotype counts in UKB470K. The x axis shows the diplotype rank from most common (left) to least common (right). The y axis is the diplotype count in UKB470K. The top 10 (darker shaded blue) and top 100 (lighter shaded blue) diplotypes accounted for 91% and 97%, respectively, of the total diplotype count. The 113 diplotypes with frequencies >1 in 8,703 (counts ≥54) are indicated in green if disease causal (n = 16) and in red if determined to be NSDCC (n = 97) using the method of (A), indicating the power to reduce false positives.
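The caption for panel (A) names the correction factors but not the formula that combines them. The sketch below is therefore only an illustration under stated assumptions: a nominal cohort size of 470,000 (inferred from the name UKB470K), a hypothetical multiplicative ceiling in `expected_ceiling`, and made-up input values. It also shows how a count converts to a "1 in N" frequency, as in the count ≥54 threshold quoted for panels (B) and (C).

```python
COHORT_SIZE = 470_000  # assumption: nominal size implied by the name UKB470K

def one_in(count, cohort_size=COHORT_SIZE):
    """Express a diplotype count as the N in a '1 in N' frequency."""
    return cohort_size / count

def expected_ceiling(prevalence, penetrance, expressivity,
                     diplotype_share, locus_share):
    """Hypothetical upper bound on a causal diplotype's population frequency.

    Assumed model (not taken from the source): the disease prevalence P is
    apportioned to one locus (locus_share) and one diplotype
    (diplotype_share), then divided by penetrance and expressivity because
    incomplete penetrance or expressivity lets more carriers go unaffected.
    """
    return prevalence * locus_share * diplotype_share / (penetrance * expressivity)

def flag_for_review(count, ceiling, cohort_size=COHORT_SIZE):
    """Flag a diplotype whose observed frequency exceeds the ceiling."""
    return count / cohort_size > ceiling

# Made-up inputs: prevalence 1 in 10,000, full penetrance and expressivity,
# half of this locus's cases attributed to the diplotype in question.
ceiling = expected_ceiling(prevalence=1e-4, penetrance=1.0, expressivity=1.0,
                           diplotype_share=0.5, locus_share=1.0)

print(int(one_in(54)))               # 8703 under the nominal cohort size
print(flag_for_review(54, ceiling))  # True: observed frequency exceeds ceiling
print(flag_for_review(4, ceiling))   # False
```

A flag here means only that the contributing variants go forward for evaluation as NSDCC, in line with the caption's wording.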
