Am J Hum Genet. 2009 Dec;85(6):775-85.
doi: 10.1016/j.ajhg.2009.10.016.

Genetic structure of the Han Chinese population revealed by genome-wide SNP variation

Jieming Chen et al. Am J Hum Genet. 2009 Dec.

Abstract

Population stratification is a potential problem for genome-wide association studies (GWAS), confounding results and causing spurious associations. Hence, understanding how allele frequencies vary across geographic regions or among subpopulations is an important prelude to analyzing GWAS data. Using over 350,000 genome-wide autosomal SNPs in over 6000 Han Chinese samples from ten provinces of China, our study revealed a one-dimensional "north-south" population structure and a close correlation between geography and the genetic structure of the Han Chinese. The north-south population structure is consistent with the historical migration pattern of the Han Chinese population. Metropolitan cities in China were, however, more diffuse "outliers," probably because of the impact of modern migration of peoples. At a very local scale within the Guangdong province, we observed evidence of population structure among dialect groups, probably on account of endogamy within these dialects. Via simulation, we show that empirical levels of population structure observed across modern China can cause spurious associations in GWAS if not properly handled. In the Han Chinese, geographic matching is a good proxy for genetic matching, particularly in validation and candidate-gene studies, in which population stratification cannot be directly assessed and accounted for because genome-wide data are lacking. The exception is the metropolitan cities, where geographical location is no longer a good indicator of ancestral origin. Our findings are important for designing GWAS in the Chinese population, an activity that is expected to intensify greatly in the near future.
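
The one-dimensional north-south structure described in the abstract is the kind of pattern that principal component analysis can recover from a genotype matrix. The following is a minimal sketch on simulated data, not the authors' pipeline. The helper names (`simulate_genotypes`, `first_pc`, `pearson`), the sample sizes, and the linear allele-frequency cline are all illustrative assumptions, and the top component is found by power iteration rather than a full eigendecomposition so that only the standard library is needed.

```python
import math
import random


def simulate_genotypes(n_people=60, n_snps=300, slope_sd=0.4, seed=7):
    """Simulate 0/1/2 genotypes whose allele frequencies drift linearly
    along a hypothetical north-south coordinate in [0, 1]."""
    rng = random.Random(seed)
    latitude = [rng.random() for _ in range(n_people)]
    genotypes = [[0] * n_snps for _ in range(n_people)]
    for j in range(n_snps):
        base = rng.uniform(0.3, 0.7)       # assumed mid-cline frequency
        slope = rng.gauss(0.0, slope_sd)   # assumed per-SNP cline strength
        for i, t in enumerate(latitude):
            p = min(0.95, max(0.05, base + slope * (t - 0.5)))
            # Two independent allele draws give a genotype of 0, 1, or 2.
            genotypes[i][j] = (rng.random() < p) + (rng.random() < p)
    return latitude, genotypes


def first_pc(genotypes, n_iter=200, seed=0):
    """Return the leading eigenvector of the individual-by-individual
    relationship matrix, estimated by power iteration."""
    n, m = len(genotypes), len(genotypes[0])
    cols = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        p = sum(col) / (2 * n)
        if 0.0 < p < 1.0:  # skip monomorphic SNPs
            scale = math.sqrt(2 * p * (1 - p))
            cols.append([(g - 2 * p) / scale for g in col])
    grm = [[sum(c[a] * c[b] for c in cols) / len(cols) for b in range(n)]
           for a in range(n)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        w = [sum(grm[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    latitude, genotypes = simulate_genotypes()
    pc1 = first_pc(genotypes)
    # The sign of a principal component is arbitrary, so report |r|.
    print("abs correlation of PC1 with latitude:", abs(pearson(latitude, pc1)))
```

Under these assumptions, the first component tracks the simulated geographic coordinate closely, which mirrors the idea that geographic matching can stand in for genetic matching when the structure is a single cline.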

Figures

Figure 1
Population Stratification of the Han Chinese. The PCA plots were oriented with PC1 on the y axis and with PC2 on the x axis. These plots were obtained by using all 107,565 SNPs. (A) The cluster and stratification of the samples from the ten provinces of China, showing an evident north-south genetic differentiation. (B) The samples from Beijing (purple) (CHB samples from the International HapMap project), Shanghai (orange), and Singapore (yellow) were compared against the provincial samples. The majority of the Singapore samples fall in the southern (lower) sector, whereas the Beijing and Shanghai samples were largely located within the northern (upper) sector. (C) The three dialect groups from the Guangdong province were shown against all of the Han Chinese samples from the ten provinces. There was stratification among the three dialect groups, along the same north-south trend observed for the overall Han Chinese.
Figure 2
Comparison between the Geographic Map of China and the Genetic Structure of the Han Chinese. The PCA plots of the provincial samples are superimposed on the map of China to show the general north-south trend across China.
Figure 3
Estimated Population Structure by STRUCTURE for K = 2 and K = 3. Each individual is represented by a thin vertical line, and each province is demarcated by a thick vertical black line. The provinces are arranged from north to south, with JPT on the extreme left, representing the northernmost locality, to Liaoning, the northernmost province of China investigated in this study. The Guangdong individuals were grouped into the three dialect groups of Teochew, Hakka and Cantonese. These were then followed by the samples from the two metropolitan cities of Beijing (represented by CHB) and Shanghai, as well as the overseas Chinese community in Singapore. At K = 2, the northern provinces are clearly anchored by the JPT, with a huge membership of northern samples (represented by the yellow segment). The northern membership decreases gradually down to the southern provinces, which show a strong membership of southern samples (represented by the brown segment). At K = 3, JPT is clearly separated from the Han Chinese samples. The analysis revealed a demarcation of north-central-south similar to that shown by Figure 2. The Beijing, Shanghai and Singapore samples showed a clear mixture of southern (long brown lines) and northern (shorter brown lines) individuals, as compared to the provincial samples. The three dialect samples from the Guangdong province were also different from each other, with Teochew being more similar to individuals from the provinces of Hunan and Cantonese being the most southern representative.
Figure 4
Q-Q Plots of the p Values from the Simulated Association Analyses with or without Correction for Population Stratification. The columns correspond to the Q-Q plots of the uncorrected, GC-corrected, and PCA-corrected p values. The rows correspond to 20%, 40%, 80%, and 100% stratification of the simulated case and control samples. (A–C) 20% stratification: 500N cases, 400N and 100S controls. (D–F) 40% stratification: 500N cases, 300N and 200S controls. (G–I) 80% stratification: 500N cases, 100N and 400S controls. (J–L) 100% stratification: 500N cases and 500S controls.
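
The case-control mixes listed for Figure 4 can be imitated with a small null simulation to see how mismatched ancestry inflates association statistics. This is a hedged sketch, not the simulation used in the paper. The function names, the number of SNPs, and the size of the north-south allele-frequency differences (`freq_sd`) are assumptions chosen for illustration, and no SNP has a true effect on disease. Inflation is summarized by the genomic control factor, the median 1-df chi-square statistic divided by its expected median under the null (about 0.455).

```python
import random
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def allele_count(n_people, freq, rng):
    """Draw the number of alternate alleles among 2 * n_people chromosomes."""
    return sum(rng.random() < freq for _ in range(2 * n_people))


def allelic_chi2(case_alt, n_cases, ctrl_alt, n_controls):
    """1-df chi-square statistic on the 2x2 table of allele counts."""
    a, b = case_alt, 2 * n_cases - case_alt
    c, d = ctrl_alt, 2 * n_controls - ctrl_alt
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom


def inflation_factor(n_south_controls, n_snps=400, n_cases=500,
                     n_controls=500, freq_sd=0.03, seed=1):
    """Genomic control lambda when all cases are northern and
    n_south_controls of the controls are southern (null SNPs only)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_snps):
        p_north = rng.uniform(0.1, 0.9)
        # Assumed north-south frequency difference for this SNP.
        p_south = min(0.99, max(0.01, p_north + rng.gauss(0.0, freq_sd)))
        case_alt = allele_count(n_cases, p_north, rng)
        ctrl_alt = (allele_count(n_controls - n_south_controls, p_north, rng)
                    + allele_count(n_south_controls, p_south, rng))
        stats.append(allelic_chi2(case_alt, n_cases, ctrl_alt, n_controls))
    return statistics.median(stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # 0 southern controls is the matched design; the rest follow Figure 4.
    for n_south in (0, 100, 200, 400, 500):
        print(n_south, "southern controls: lambda =",
              round(inflation_factor(n_south), 2))
```

Dividing each chi-square statistic by the estimated lambda is the basic genomic control (GC) correction; the PCA-based correction instead adjusts for ancestry using the leading principal components.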
