Integrating Linguistics, Social Structure, and Geography to Model Genetic Diversity within India

Aritra Bose¹, Daniel E Platt¹, Laxmi Parida¹, Petros Drineas², Peristera Paschou³

Affiliations

¹ Computational Genomics, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA.
² Computer Science Department, Purdue University, West Lafayette, IN, USA.
³ Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.

PMID: 33481022
PMCID: PMC8097304
DOI: 10.1093/molbev/msaa321

Aritra Bose et al. Mol Biol Evol. 2021 May 4;38(5):1809-1819.

Abstract

India represents an intricate tapestry of population substructure shaped by geography, language, culture, and social stratification. Although geography closely correlates with genetic structure in other parts of the world, the strict endogamy imposed by the Indian caste system and the large number of spoken languages add further levels of complexity to understanding Indian population structure. To date, no study has attempted to model and evaluate how these factors have interacted to shape the patterns of genetic diversity within India. We merged all publicly available data from the Indian subcontinent into a data set of 891 individuals from 90 well-defined groups. Bringing together geography, genetics, and demographic factors, we developed Correlation Optimization of Genetics and Geodemographics to build a model that explains the observed population genetic substructure. We show that shared language, along with social structure, has been the most powerful force in creating paths of gene flow in the subcontinent. Furthermore, we identify the ethnic groups that best capture the diverse genetic substructure using a ridge leverage score statistic. Integrating data from India with an additional data set of 1,323 individuals from 50 Eurasian populations, we find that Indo-European and Dravidian speakers of India show shared genetic drift with Europeans, whereas the Tibeto-Burman speaking tribal groups have maximum shared genetic drift with East Asians.

Keywords: India; South Asia; algorithms; data mining; genomics; population structure.
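
The ridge leverage score statistic mentioned in the abstract can be illustrated with a minimal sketch. This record does not describe how the authors built their input matrix or chose the regularization, so the populations-by-features matrix `A`, the parameter `lam`, and the function name `ridge_leverage_scores` below are illustrative assumptions, not the paper's implementation. The sketch uses the standard definition: for row a_i of A, the score is a_i^T (A^T A + lam I)^{-1} a_i.

```python
import numpy as np


def ridge_leverage_scores(A, lam):
    """Ridge leverage score of every row of A (hypothetical helper).

    Uses the thin SVD A = U S V^T, under which the score of row i is
    sum_j U[i, j]**2 * s_j**2 / (s_j**2 + lam).
    """
    A = np.asarray(A, dtype=float)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    weights = s**2 / (s**2 + lam)  # shrinkage applied to each singular direction
    return (U**2) @ weights


if __name__ == "__main__":
    # Toy data only: 6 hypothetical populations described by 4 features.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(6, 4))
    scores = ridge_leverage_scores(A, lam=0.5)
    # Rows with the largest scores contribute most to the structure of A.
    print(np.argsort(scores)[::-1])
```

Rows with high scores are the ones that are hardest to reconstruct from the others, which is why a statistic of this kind can be used to rank groups by how much of the overall structure they capture.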

Figures

**Fig. 1.**
A map of the locations of the 33 populations in the normalized set and the results of principal component analysis (PCA). (A) Map of India showing the locations of the 368 individuals in the normalized subset, spanning 33 well-defined populations and 47,283 SNPs (see supplementary fig. S1A, Supplementary Material online, for the pan-Indian data set of 90 ethnic groups and supplementary fig. S2, Supplementary Material online, for the corresponding PCA plot). The populations are colored by their sociolinguistic group. (B) Top two PCs of the normalized data set show clustering by language groups. (C) PCA plot colored and marked by sociolinguistic groups shows the genetic structure stratified by sociolinguistic groups.

**Fig. 2.**
Network of 90 Indian populations (891 individuals) in the pan-Indian data set based on shared ancestry as defined by meta-analysis of ADMIXTURE results. Only the top 40% of edges (those connecting the most related populations) are shown here (see Materials and Methods for details). The node labels are colored by their corresponding language groups as shown in figure 1.

**Fig. 3.**
Shared genetic drift between 33 Indian populations (denoted by X) and 50 Eurasian/East Asian populations (denoted by Y) as estimated by outgroup f₃ statistics with Yoruba (YRI) as the outgroup, f₃(YRI; X, Y). The darkest colors correspond to the greatest portions of shared genetic drift with Indian populations. Full results can be found in supplementary table S4, Supplementary Material online.
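
The outgroup f₃ statistic in Fig. 3 can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it computes the plain average of (o − x)(o − y) over SNPs from per-population allele frequencies and omits the finite-sample bias correction and standard-error estimation that dedicated population-genetics software applies. The function name `outgroup_f3` and the toy frequencies are assumptions made for the example.

```python
import numpy as np


def outgroup_f3(p_out, p_x, p_y):
    """Naive outgroup f3(O; X, Y): mean over SNPs of (o - x) * (o - y).

    Inputs are allele frequencies per SNP for the outgroup O and the two
    test populations X and Y. Larger values indicate more genetic drift
    shared by X and Y since their split from O. No bias correction is applied.
    """
    p_out, p_x, p_y = (np.asarray(p, dtype=float) for p in (p_out, p_x, p_y))
    if not (p_out.shape == p_x.shape == p_y.shape):
        raise ValueError("all frequency vectors must cover the same SNPs")
    return float(np.mean((p_out - p_x) * (p_out - p_y)))


if __name__ == "__main__":
    # Toy allele frequencies at three SNPs (made-up numbers, not real data).
    yri = [0.1, 0.5, 0.9]  # outgroup
    x = [0.3, 0.5, 0.6]    # hypothetical Indian population X
    y = [0.4, 0.2, 0.6]    # hypothetical Eurasian population Y
    print(outgroup_f3(yri, x, y))
```

Computing this value for one X against every candidate Y and ranking the results gives the kind of comparison the figure summarizes by color intensity.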
