Wellcome Open Res. 2024 Dec 5:9:390.
doi: 10.12688/wellcomeopenres.22697.2. eCollection 2024.

Exome sequencing of UK birth cohorts

Mahmoud Koko et al. Wellcome Open Res. 2024.

Abstract

Birth cohort studies involve repeated surveys of large numbers of individuals from birth and throughout their lives. They collect information useful for a wide range of life course research domains, as well as biological samples that can be used to derive data from a growing range of omic technologies. This rich source of longitudinal data, when combined with genomic data, offers the scientific community valuable insights ranging from population genetics to applications across the social sciences. Here we present quality-controlled whole exome sequencing data from three UK birth cohorts: the Avon Longitudinal Study of Parents and Children (8,436 children and 3,215 parents), the Millennium Cohort Study (7,667 children and 6,925 parents) and Born in Bradford (8,784 children and 2,875 parents). The overall objective of this coordinated effort is to make the resulting high-quality data widely accessible to the global research community in a timely manner. We describe how the datasets were generated and subjected to quality control at the sample, variant and genotype level. We then present some preliminary analyses to illustrate the quality of the datasets and probe potential sources of bias. We add measures of ultra-rare variant burden to the variables available to researchers working on these cohorts, and show that the exome-wide burden of deleterious protein-truncating variants, S het burden, is associated with educational attainment and cognitive test scores. The whole exome sequence data from these birth cohorts (CRAM and VCF files) are available through the European Genome-Phenome Archive, and here we provide guidance for their use.

Keywords: ALSPAC; BiB; EGA; MCS; WES.

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. Overview of quality control steps and the sample size remaining at each step.
The sample sizes shown do not include the five Genome In a Bottle (GIAB) samples sequenced alongside each cohort. Note that the sample QC was performed in several stages. For ALSPAC, the checks for sample swaps against array data were performed on the preliminary dataset after variant calling (Sample QC 2), while the other checks were performed in stage 3. ALSPAC had a disproportionately large number of variants before QC (but not after QC) due to a sequencing artefact that caused excess C>A mutations (see Box 3), which mostly failed random forest filtering (see the weight of the random forest feature ‘Is_CA’ in Table 2) or hard filters (see the cumulative ‘False Positives %’ in ALSPAC SNVs in Figure 3). The raw sequencing data (CRAM files) and the final VCFs were uploaded to the European Genome-Phenome Archive (EGA).
Figure 2. PCA plot showing the continental ancestry groups for samples from the 1000 Genomes Project and UK birth cohorts.
The birth cohort samples are coloured by continental population labels (‘super-populations’) inferred from their similarity to the 1000 Genomes samples. Top-left: The 1000 Genomes samples. Top-right: Avon Longitudinal Study of Parents and Children (ALSPAC). Bottom-left: Millennium Cohort Study (MCS). Bottom-right: Born in Bradford study (BiB). Note that the PCA was performed separately for each cohort (merged with 1000 Genomes, as described in Methods), so the axes and scales are not identical between cohorts. AFR: African, AMR: Admixed American (sometimes referred to as Hispanic-Latin American), EAS: East Asian, EUR: European, SAS: South Asian, OTH: other (i.e. could not be confidently inferred).
Figure 3. True and false positive rates plotted against random forest bins for single nucleotide variants (SNVs) and insertions-deletions (indels).
The plots are cumulative, such that the y-axis depicts the overall indicated metric (percent of true or false positives identified) for SNVs or indels in bins lower than and including the bin indicated on the x-axis.
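The cumulative reading of Figure 3 can be illustrated with a minimal sketch. The per-bin counts below are invented for illustration and are not taken from the cohorts; the function name `cumulative_percent` and the convention that bin 1 holds the highest-confidence variants are assumptions, not the authors' pipeline.

```python
from itertools import accumulate


def cumulative_percent(counts_per_bin):
    """Percent of all items found in bins up to and including each bin."""
    total = sum(counts_per_bin)
    if total == 0:
        return [0.0 for _ in counts_per_bin]
    return [100.0 * running / total for running in accumulate(counts_per_bin)]


# Hypothetical counts per random forest bin (illustrative values only).
true_positives_per_bin = [50, 30, 15, 5]
false_positives_per_bin = [1, 4, 15, 80]

tp_curve = cumulative_percent(true_positives_per_bin)
fp_curve = cumulative_percent(false_positives_per_bin)

for bin_index, (tp, fp) in enumerate(zip(tp_curve, fp_curve), start=1):
    print(f"bins <= {bin_index}: true positives {tp:.1f}%, false positives {fp:.1f}%")
```

Under these made-up counts, keeping only the first three bins would retain most true positives while excluding most false positives, which is the kind of trade-off such a cumulative plot is meant to expose.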
Figure 4. Integrated variant and genotype QC for SNVs in MCS.
The plots show the following metrics obtained after applying different combinations of variant- and genotype-level filters: a) true versus false positive rate, b) recall versus precision for sites from the reference GIAB sample, and c) transmitted/untransmitted ratio for synonymous singletons versus precision (left) or versus recall (right). In each plot, the points show the value of the metric for a particular combination of filters. The colour indicates the random forest (RF) bin filter, the opacity of the point indicates the genotype quality (GQ) filter, the size of the point indicates the filter on a combination of genotype depth (DP) and heterozygous allele balance (Het. AB), and the shape of the point indicates the minimum genotyping rate (min. geno. rate) after applying genotype QC (i.e. missingness filter). For plots of true/false positives and precision/recall in all three cohorts, see Extended Data Figure 4 (SNVs) and Extended Data Figure 5 (indels).
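As a reminder of how the precision and recall axes in Figure 4 are defined, the sketch below computes both from counts of calls compared against a truth set such as a GIAB reference sample. The counts and the function name `precision_recall` are hypothetical, and this is the standard textbook definition rather than the authors' exact evaluation code.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from call counts against a truth set.

    precision = TP / (TP + FP): share of called variants that are real.
    recall    = TP / (TP + FN): share of real variants that were called.
    """
    called = true_positives + false_positives
    real = true_positives + false_negatives
    precision = true_positives / called if called else 0.0
    recall = true_positives / real if real else 0.0
    return precision, recall


# Hypothetical counts for one combination of filters (illustrative values only).
precision, recall = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Stricter filters typically raise precision at the cost of recall, which is why the figure compares many filter combinations rather than a single setting.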
Figure 5. Rare and ultra-rare variant counts in the birth cohorts.
The boxplots show the median and interquartile range (IQR) after applying the recommended filters shown in Table 3. Two gnomAD allele frequency thresholds are shown (rare: 0.1%; ultra-rare: 0.003%). The whiskers show the range excluding outliers (solid lines) or including outliers (dashed lines). The largest sample groups in terms of predicted genetic ancestry (European (EUR) in the three cohorts and South Asian (SAS) in BiB) are plotted separately from the remaining samples.
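The two allele frequency thresholds in Figure 5 can be made concrete with a short sketch that counts one sample's variants under each cut-off. The allele frequencies are invented, the strict "less than" comparison is an assumption (the caption does not say whether the thresholds are inclusive), and treating variants absent from gnomAD as frequency 0 is likewise an assumption.

```python
RARE_MAX_AF = 0.001          # 0.1% expressed as a proportion
ULTRA_RARE_MAX_AF = 0.00003  # 0.003% expressed as a proportion


def count_below(allele_frequencies, max_af):
    """Count variants whose reference-panel allele frequency is below max_af."""
    return sum(1 for af in allele_frequencies if af < max_af)


# Hypothetical gnomAD allele frequencies for the variants carried by one sample.
# A value of 0.0 stands for a variant not observed in gnomAD (assumption).
sample_allele_frequencies = [0.0, 0.00001, 0.0002, 0.0005, 0.02]

rare_count = count_below(sample_allele_frequencies, RARE_MAX_AF)
ultra_rare_count = count_below(sample_allele_frequencies, ULTRA_RARE_MAX_AF)
print(f"rare: {rare_count}, ultra-rare: {ultra_rare_count}")
```

In this sketch every ultra-rare variant is also counted as rare, because the ultra-rare threshold is the stricter of the two.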
Figure 6. Association between S het burden for ultra-rare protein-truncating and synonymous variants and education-related phenotypes.
S het burden is a measure of exome-wide ultra-rare variant burden (see the Methods for more details on S het burden derivation). The distribution of S het burden in individuals of predicted European genetic ancestry is shown in (a) for protein-truncating variants and (b) for synonymous variants. The plot also shows the association of these scores with educational/cognitive test scores. For each set of QC filters on the y axis, the x axis shows the log-odds of having more than 13 education-years (ALSPAC) and the change (in standard deviations) in reading test scores (MCS and BiB) for an S het burden of 1 (versus 0), calculated using protein-truncating variants (a) and synonymous variants (b).
