The GenomeAsia 100K Project enables genetic discoveries across Asia

GenomeAsia100K Consortium

Collaborators

GenomeAsia100K Consortium:
Jeffrey D Wall, Eric W Stawiski, Aakrosh Ratan, Hie Lim Kim, Changhoon Kim, Ravi Gupta, Kushal Suryamohan, Elena S Gusareva, Rikky Wenang Purbojati, Tushar Bhangale, Vadim Stepanov, Vladimir Kharkov, Markus S Schröder, Vedam Ramprasad, Jennifer Tom, Steffen Durinck, Qixin Bei, Jiani Li, Joseph Guillory, Sameer Phalke, Analabha Basu, Jeremy Stinson, Sandhya Nair, Sivasankar Malaichamy, Nidhan K Biswas, John C Chambers, Keith C Cheng, Joyner T George, Seik Soon Khor, Jong-Il Kim, Belong Cho, Ramesh Menon, Thiramsetti Sattibabu, Akshi Bassi, Manjari Deshmukh, Anjali Verma, Vivek Gopalan, Jong-Yeon Shin, Mahesh Pratapneni, Sam Santhosh, Katsushi Tokunaga, Badrul M Md-Zain, Kok Gan Chan, Madasamy Parani, Purushothaman Natarajan, Michael Hauser, R Rand Allingham, Cecilia Santiago-Turla, Arkasubhra Ghosh, Santosh Gopi Krishna Gadde, Christian Fuchsberger, Lukas Forer, Sebastian Schoenherr, Herawati Sudoyo, J Stephen Lansing, Jonathan Friedlaender, George Koki, Murray P Cox, Michael Hammer, Tatiana Karafet, Khai C Ang, Syed Q Mehdi, Venkatesan Radha, Viswanathan Mohan, Partha P Majumder, Somasekar Seshagiri, Jeong-Sun Seo, Stephan C Schuster, Andrew S Peterson

PMID: 31802016
PMCID: PMC7054211
DOI: 10.1038/s41586-019-1793-z

GenomeAsia100K Consortium. Nature. 2019 Dec;576(7785):106-111. doi: 10.1038/s41586-019-1793-z. Epub 2019 Dec 4.

Abstract

The underrepresentation of non-Europeans in human genetic studies so far has limited the diversity of individuals in genomic datasets and led to reduced medical relevance for a large proportion of the world's population. Population-specific reference genome datasets as well as genome-wide association studies in diverse populations are needed to address this issue. Here we describe the pilot phase of the GenomeAsia 100K Project. This includes a whole-genome sequencing reference dataset from 1,739 individuals of 219 population groups and 64 countries across Asia. We catalogue genetic variation, population structure, disease associations and founder effects. We also explore the use of this dataset in imputation, to facilitate genetic studies in populations across Asia and worldwide.

Conflict of interest statement

A.S.P., E.W.S., S. Seshagiri, T.B., J.T.G., J.T., J. Stinson, Q.B., M.S.S., S.D. and K.S. were employees of Genentech at the time this work was carried out. S. Santhosh, A.V., M. Pratapneni, V. Ramprasad, S.P., R.M., R.G., S.N., S.M., T.S., V.G., J.T.G., M.D. and S.P. are employees of and/or have equity in MedGenome. C.K., J.-S.S. and J.-Y.S. are employees of Macrogen.

Figures

**Fig. 1. Sampling distribution of GAsP.**
a, b, Sample sizes. c, Location, language and social hierarchy associated with samples from south Asia. Groups with fewer than three samples are not plotted. See Supplementary Table 1a for definitions and descriptions of samples and population groups included in each geographically defined set.

**Fig. 2. Population structure and admixture.**
a, ADMIXTURE plots for k = 12 and k = 14 illustrating the identification of 12 reference groups. b, Proposed modern human migration route into southeast Asia during the Last Glacial Maximum with potential locations of Denisovan admixture (yellow asterisks). Green indicates the above water landmass at the glacial maximum and white outlines indicate present-day shorelines. c, Estimates of Denisovan ancestry in south Asians, stratified by social/cultural group and language. IE, Indo-European. Adivasi Indo-European, n = 30; Adivasi non-Indo-European, n = 196; caste Indo-European, n = 68; caste non-Indo-European, n = 155; upper caste Indo-European, n = 49; upper caste non-Indo-European, n = 19; Pakistani Indo-European, n = 79. The centre line indicates the median; box limits show the middle 50%; whiskers extend two standard deviations from the mean; points are outliers.

**Fig. 3. Disease-relevant variant discovery.**
a, Filtering using the GAsP dataset improves candidate variant discovery by removing population-specific variants (n = 152). The centre line indicates the median; box limits show the upper and lower quartiles; whiskers extend 1.5× the interquartile range. b, Allele count (AC) and frequency distribution of variants in the GAsP dataset that are designated disease-causing in the Human Gene Mutation Database (HGMD) or pathogenic in ClinVar. Autosomal-dominant (AD) or autosomal-recessive (AR) or other (unknown) classification as per OMIM. A number of variants (n = 37) that had previously been reported to be pathogenic are found in the GAsP study dataset at high frequency and were reclassified (Supplementary Table 4d). c, Frequency of β-thalassaemia variant (chromosome 11:5248155 c.92+5G>C) across Asia shows a geographical enrichment. MAF in South Asia is 1.4%. NA, not available. d, Novel cancer-predisposing variants identified in the GenomeAsia dataset. e, Population-specific probabilities of adverse drug reactions predicted from the aggregate allele frequencies of known variants associated with response to the indicated drugs.

**Fig. 4. Founder effects and homozygous loss of function.**
a, IBD scores across different population groups are shown for 96 ethnicities (1,417 samples) across global regions. The scores given in the figure are relative ratios compared to that of the Finnish group. b, Violin plot showing IBD scores in 29 tribal groups and 25 non-tribal groups consisting of 293 and 336 samples, respectively. The centre line indicates the median; box limits show 1.5× the interquartile range. c, Proportion of genes with at least one high-confidence PTV. d, Proportion of novel, known, heterozygous and homozygous PTVs in the GAsP dataset. e, Pie chart of novel homozygous PTVs plotted by region (inner circle) and population group (outer circle). Groups with fewer than two PTVs were grouped as other. f, Novel homozygous PTV Q2010* (green) found in ABCA7 localizes to the C-terminal ABC domain. Previously reported PTVs are shown in grey.

**Extended Data Fig. 1. Diversity and divergence times of GAsP samples.**
a, PCA plot of study samples. Africa (AFR), n = 102; West Eurasia (WER), n = 111; South Asia (SAS), n = 642; Southeast Asia (SEA), n = 162; Oceania (OCE), n = 68; Northeast Asia (NEA), n = 346; Americas (AMR), n = 26. The samples included in each of these geographically defined groups are described in Supplementary Table 1a. b, MSMC cross-coalescence rates showing divergence time estimates between different groups. The point estimates give the dates at which 25%, 50% and 75% of lineages in the pair of populations have coalesced into a common ancestral population.

**Extended Data Fig. 2. Characteristics of GAsP SNPs and indels.**
a, b, Comparison of all GAsP variants (a) or coding variants (b) with gnomAD, ExAC, 1000 Genomes, ESP and dbSNP data as a function of the MAF within the GAsP dataset. c, d, The number and lengths of small indels in the genome (c) or coding regions (d). e–h, Proportion of non-coding (e, g) or coding (f, h) indels that were singletons (e, f) or rare (allele frequency of <0.1%; g, h).
