PLoS Genet. 2024 Jan 11;20(1):e1011037.

doi: 10.1371/journal.pgen.1011037. eCollection 2024 Jan.

Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression

Qi-Xin Zhang^{1,2}, Tianzi Liu^{3,4}, Xinxin Guo⁵, Jianxin Zhen⁶, Meng-Yuan Yang⁷, Saber Khederzadeh⁷, Fang Zhou⁸, Xiaotong Han⁹, Qiwen Zheng⁴, Peilin Jia⁴, Xiaohu Ding⁹, Mingguang He^{9,10,11}, Xin Zou¹², Jia-Kai Liao^{13,14}, Hongxin Zhang¹², Ji He¹⁵, Xiaofeng Zhu¹⁶, Daru Lu^{17,18}, Hongyan Chen⁸, Changqing Zeng^{4,19}, Fan Liu^{4,20}, Hou-Feng Zheng⁷, Siyang Liu⁵, Hai-Ming Xu¹, Guo-Bo Chen^{2,21}

Affiliations

¹ Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China.
² Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People's Hospital, People's Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China.
³ CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China.
⁴ CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China.
⁵ School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China.
⁶ Central Laboratory, Shenzhen Baoan Women's and Children's Hospital, Shenzhen, Guangdong, China.
⁷ Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China.
⁸ State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China.
⁹ State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China.
¹⁰ Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia.
¹¹ Ophthalmology, Department of Surgery, University of Melbourne, Melbourne, Victoria, Australia.
¹² State Key Laboratory of CAD & GC, Zhejiang University, Hangzhou, Zhejiang, China.
¹³ School of Mathematics and Statistics and Research Institute of Mathematical Sciences (RIMS), Jiangsu Provincial Key Laboratory of Educational Big Data Science and Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China.
¹⁴ Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, Zhejiang, China.
¹⁵ Department of Neurology, Peking University Third Hospital, Beijing, China.
¹⁶ Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, United States of America.
¹⁷ State Key Laboratory of Genetic Engineering and MOE Engineering Research Center of Gene Technology, School of Life Sciences and Zhongshan Hospital, Fudan University, Shanghai, China.
¹⁸ NHC Key Laboratory of Birth Defects and Reproductive Health, Chongqing Population and Family Planning Science and Technology Research Institute, Chongqing, China.
¹⁹ Henan Academy of Sciences, Zhengzhou, Henan, China.
²⁰ Department of Forensic Sciences, College of Criminal Justice, Naif Arab University of Security Sciences, Riyadh, Kingdom of Saudi Arabia.
²¹ Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, Hangzhou, Zhejiang, China.

PMID: 38206971
PMCID: PMC10783776
DOI: 10.1371/journal.pgen.1011037

Qi-Xin Zhang et al. PLoS Genet. 2024.

Erratum in

Correction: Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression.
PLOS Genetics Staff. PLoS Genet. 2024 Feb 21;20(2):e1011149. doi: 10.1371/journal.pgen.1011149. eCollection 2024 Feb. PMID: 38381719.

Abstract

Explicitly sharing individual-level data in genomics studies has many merits compared with sharing summary statistics, including stricter quality control (QC), common statistical analyses, relative identification, and improved statistical power in GWAS, but it is hampered by privacy or ethical constraints. In this study, we developed encG-reg, a regression approach that can detect relatives of various degrees based on encrypted genomic data, which is immune to these ethical constraints. The encryption properties of encG-reg are based on random matrix theory: the original genotype matrix is masked without sacrificing the precision of individual-level genotype data. We established a connection between the dimension of the random matrix that masks the genotype matrices and the precision required of a study using encrypted genotype data. encG-reg has false positive and false negative rates equivalent to those obtained by sharing original individual-level data, and is computationally efficient when searching for relatives. We split the UK Biobank by assessment center and then encrypted the genotype data. The relatedness estimated using encG-reg was as accurate as that estimated by KING, a widely used software package that requires original genotype data. In a more complex application, we launched a carefully designed multi-center collaboration across 5 research institutes in China, covering 9 cohorts of 54,092 GWAS samples. encG-reg again identified true relatives across the cohorts, even with different ethnic backgrounds and genotyping qualities. Our study demonstrates that encrypted genomic data can be used for data sharing without loss of information or data-sharing barriers.
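The masking-then-regression idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian random matrix `s`, the scaling by the square root of the number of markers, the no-intercept slope, and all function names and simulation settings are assumptions made for illustration only.

```python
import numpy as np

def standardized_genotypes(n, freqs, rng):
    """Simulate n individuals at independent SNPs and standardize each SNP."""
    g = rng.binomial(2, freqs, size=(n, freqs.size)).astype(float)
    return (g - 2 * freqs) / np.sqrt(2 * freqs * (1 - freqs))

def encrypt(x, s):
    """Mask an n x m genotype matrix with an m x k random matrix (assumed scheme)."""
    return x @ s / np.sqrt(x.shape[1])

def regression_slopes(e1, e2):
    """No-intercept slope of every encrypted row of e2 on every encrypted row of e1."""
    return (e1 @ e2.T) / np.sum(e1 ** 2, axis=1, keepdims=True)

rng = np.random.default_rng(0)
m, k, n = 500, 2000, 50              # markers, random-matrix columns, samples per cohort
freqs = rng.uniform(0.05, 0.5, size=m)

x1 = standardized_genotypes(n, freqs, rng)   # cohort 1
x2 = standardized_genotypes(n, freqs, rng)   # cohort 2
x2[0] = x1[0]                                # plant one overlapping sample

s = rng.standard_normal((m, k))              # random matrix shared by both cohorts
slopes = regression_slopes(encrypt(x1, s), encrypt(x2, s))

print("planted identical pair:", slopes[0, 0])
print("largest |slope| among other pairs:", np.abs(np.delete(slopes.ravel(), 0)).max())
```

In this toy setting only the encrypted matrices are compared, yet the planted overlapping sample has a slope near 1 while unrelated pairs scatter around 0.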

Copyright: © 2024 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Fig 1. Resolution for varying relatedness using GRM, encGRM and *encG-reg*.**
The figure shows the resolution for detecting relatives or overlapping samples with respect to the number of markers in each row (for better illustration, m_e was twice that given by **Eq 3**) and the degree of relatedness to be detected (r = 0, 1, and 2). The y axis is the relatedness calculated from GRM, and the x axis is the relatedness estimated by ***encG-reg*** (A) and encGRM (B). Each point represents an individual pair between cohort 1 and cohort 2 (200 × 200 = 40,000 pairs in total), given the simulated relatedness. The dotted lines indicate the 95% confidence intervals of the relatedness estimated directly from the original genotypes (blue) and from the encrypted genotypes (red). The table shows how m and k are estimated. The columns “under minimal m_e” provide a benchmark for the parameters; in practice one chooses 2×m_e and then estimates k, as shown under the column “practical m_e”.

**Fig 2. Workflow of *encG-reg* and its practical timeline as exercised in Chinese cohorts.**
The mathematical details of ***encG-reg*** are simply algebraic, but its inter-cohort implementation involves coordination. (A) We illustrate its key steps, the time cost of which was adapted from the present exercise for 9 Chinese datasets (here simplified as three cohorts). **Cohort assembly**: it took us about a week to contact our collaborators and receive positive responses (see **Table 3**) agreeing to our research plan. **Inter-cohort QC**: we received allele frequency reports from each cohort and implemented inter-cohort QC according to the “geo-geno” analysis (see **Fig 6**). This step took about two weeks. **Encrypt genotypes**: depending on the design chosen for the exercise, this can be an exhaustive design (see the UKB example), which may maximize statistical power but increases logistics such as generating pairwise S_ij; in the Chinese cohorts study we used a parsimony design and generated a single S for 500 SNPs chosen from the 7,009 common SNPs. It took about a week to determine the number of SNPs and the dimension k according to **Eq 3** and **Eq 4**, and to evaluate the effective number of markers. **Perform *encG-reg* and validation**: we conducted inter-cohort ***encG-reg*** and validated the results (see **Fig 7** and **Table 4**). This step took one week. (B) Two interactions between data owners and the central analyst, including example data for exchange, possible attacks, and corresponding preventative strategies.

**Fig 3. Sampling variance of GRM, encGRM and *encG-reg* in simulations.**
The observed and theoretical sampling variances of GRM (**A1-A4**), encGRM (**B1-B4**) and ***encG-reg*** (**C1-C4**) are given in bar plots. Individual genotypes are simulated with m = 1,000, 1,250, 1,500, 1,750, and 2,000 independent markers. A total of n₁ = n₂ = 1,000 pairs of relatives are simulated at each level of relatedness (r = 0, 1, 2, and 3). For the encryption, the numbers of columns of the random matrices are k = 4,000, 5,000, 6,000, 7,000, and 8,000, respectively.
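A comparison of observed and theoretical sampling variance of this kind can be sketched for unrelated pairs only. The null variance 1/m + 1/k is taken from the Fig 7 caption, with m standing in for m_e because the simulated markers are independent. The encryption scheme, the reduced sample size (100 rather than 1,000 per cohort), the allele frequency range, and the function name are illustrative assumptions, not the authors' simulation code.

```python
import numpy as np

def encrypted_null_slopes(m, k, n, rng):
    """Slopes between encrypted genotypes of two cohorts of unrelated samples."""
    freqs = rng.uniform(0.05, 0.5, size=m)

    def cohort():
        # Standardized genotypes at m independent SNPs for n individuals.
        g = rng.binomial(2, freqs, size=(n, m)).astype(float)
        return (g - 2 * freqs) / np.sqrt(2 * freqs * (1 - freqs))

    # Assumed masking: one Gaussian random matrix shared by both cohorts.
    s = rng.standard_normal((m, k)) / np.sqrt(m)
    e1, e2 = cohort() @ s, cohort() @ s
    return (e1 @ e2.T) / np.sum(e1 ** 2, axis=1, keepdims=True)

rng = np.random.default_rng(1)
m, k = 1000, 4000                    # first (m, k) setting listed in the caption
slopes = encrypted_null_slopes(m, k, n=100, rng=rng)

observed = slopes.var()
theoretical = 1 / m + 1 / k
ratio = observed / theoretical
print(f"observed {observed:.5f}, theoretical {theoretical:.5f}, ratio {ratio:.2f}")
```

Repeating the call for the other (m, k) settings would give the remaining unrelated-pair comparisons under the same assumptions.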

**Fig 4. Influence of minor allele frequencies in detecting relatives in Manchester and Oxford cohorts.**
The bar plots provide a comparison of relatedness scores for the known 1st-degree and 2nd-degree relatives estimated by KING, GRM, ***encG-reg***, and ***encG-reg+*** at two representative assessment centers (Manchester and Oxford). For each assessment center, 566 and 2,209 SNPs were randomly selected within specific MAF ranges: 0.01 to 0.05, 0.05 to 0.15, 0.15 to 0.25, 0.25 to 0.35, 0.35 to 0.5, and 0.05 to 0.5. Here, ***encG-reg+*** denotes the use of 1.2 times the minimum k, and IBD denotes twice the relatedness score estimated by KING. After resampling SNPs 100 times, the average GRM score, standard deviation, and statistical power were calculated for each detected relative pair. The grey dashed line indicates the expected statistical power of 0.9. The solid colored lines indicate the average relatedness scores for given degrees as estimated by the four methods. Using all SNPs, KING confirmed 17 pairs of so-called 1st-degree relatives and 2 pairs of 2nd-degree relatives.

**Fig 5. Resolution for detecting relatives in UKB cohorts by KING and *encG-reg* under exhaustive design.**
(A) Chord diagrams show the number of inter-cohort identical/twin, 1st-degree, and 2nd-degree relative pairs across 19 UKB assessment centers with over 10,000 samples. Relatedness was detected and compared between KING and ***encG-reg*** under an exhaustive design, encompassing a total of 171 inter-cohort analyses. In each chord plot, the length of the side edge is proportional to the count of detected relatives between the focal cohort and other cohorts. (B) The scatter plot shows the relatedness scores estimated by KING and ***encG-reg*** for inter-cohort relative pairs, including identical, 1st-degree, and 2nd-degree pairs. (C) The histogram shows the distribution of relatedness scores estimated by ***encG-reg***. (D) The histogram shows the distribution of relatedness scores estimated by KING.

**Fig 6. Cohort-level genetic background analyses for Chinese cohorts under parsimony encG-reg analysis.**
(A) Overview of the intersected SNPs across cohorts. Each row represents one cohort and each column represents one combination of cohorts; a black dot indicates that the corresponding cohort is included, and dots linked by lines mark the cohorts in that combination. The heights of the bars represent each cohort’s SNP number (rows) or the SNP intersection number (columns). Inset histograms show the distribution of the 7,009 intersected SNPs and of the 500 SNPs randomly chosen from the 7,009 SNPs for ***encG-reg*** analysis. (B) The 7,009 SNPs in the intersection across the 9 cohorts were used to estimate fPC. Each triangle represents one Chinese cohort and is placed according to its first two principal component scores (fPC1 and fPC2) derived from the received allele frequencies. (C) Five private datasets are pinned onto the base map from GADM (https://gadm.org/data.html) using R. The size of each point indicates the sample size of the dataset. (D) The global fStructure plot indicates the global-level F_st-derived genetic composite projected onto three external reference populations: 1KG-CHN (CHB and CHS), 1KG-EUR (CEU and TSI), and 1KG-AFR (YRI); 4,296 of the 7,009 SNPs intersected with these three reference populations and were used. (E) The within-China fStructure plot indicates the within-China genetic composite. The three external references are 1KG-CHB (North Chinese), 1KG-CHS (South Chinese), and 1KG-CDX (Southwest minority Chinese Dai); 4,809 of the 7,009 SNPs intersected with these three reference populations and were used. Along the x axis are the 9 Chinese cohorts, and the height of each bar represents the proportional genetic composition from the three reference populations. Cohort codes: YRI, Yoruba in Ibadan, representing African samples; CHB, Han Chinese in Beijing; CHS, Southern Han Chinese; CHN, CHB and CHS together; CEU, Utah Residents with Northern and Western European Ancestry; TSI, Toscani in Italia; CDX, Chinese Dai in Xishuangbanna.

**Fig 7. Detected identical pairs and 1st-degree pairs between Chinese cohorts.**
(A) The circle plot illustrates identical pairs and (B) 1st-degree pairs across 9 Chinese cohorts. The solid links indicate anticipated relatedness between the CAS cohorts. The dashed links indicate relatedness identified across cohorts. The length of each cohort bar is proportional to its sample size. (C) The histogram shows all relatedness estimated using ***encG-reg*** between CAS1 and CAS2, most of which are unrelated pairs; the theoretical probability density function is given as the normal distribution $N (0, \frac{1}{m_{e}} + \frac{1}{k_{1}})$ (grey solid curve). The inset histogram on the left shows the estimated relatedness around 0.5, and the theoretical probability density function is given as the normal distribution $N (\theta_{r}, \frac{1 - \theta_{r}^{2}}{m_{e}} + \frac{1 - \theta_{r}^{2}}{k_{1}})$ (blue solid curve). The threshold (grey dotted line) for rejecting H₀ was calculated by $z_{1 - \alpha / N} \sqrt{\frac{1}{m_{e}} + \frac{1}{k_{1}}}$. The inset histogram on the right shows the estimated relatedness around 1. The threshold (grey dotted line) for rejecting H₀ was calculated by $z_{1 - \alpha / N} \sqrt{\frac{1}{m_{e}} + \frac{1}{k_{0}}}$. Here we included 208 controls merged from 1KG-CHN. $m_{e} = 477, k_{0} = 70, k_{1} = 710, N = 930,140,004$. (D) Relationship verification for 19 Guangdong twins split across the CAS cohorts. Dashed lines indicate inference criteria for detecting relatedness of different degrees. The solid line y = x indicates agreement between ***encG-reg*** and IBD. Points are colored by IBD-inferred relatedness in KING (identical in green, 1st-degree in blue, and unrelated in red) and shaped according to ***encG-reg***-inferred relatedness (identical as squares, 1st-degree as diamonds, and unrelated as circles). (E) and (F) Illustration of ***encG-reg*** estimation for sporadic related inter-cohort samples. The grey line is the criterion for identical pairs (slope of 1) or 1st-degree pairs (slope of 0.5). The solid red lines are without adjustment for missing values (***encG-reg*** score), and the purple lines at the bottom are adjusted for missing values (***encG-reg*** score*).
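The two rejection thresholds in panel (C) can be evaluated numerically from the values given in the caption. The sketch below is illustrative only: the significance level α is not stated in the caption, so α = 0.05 is an assumption, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def rejection_threshold(m_e, k, n_tests, alpha=0.05):
    """z_{1 - alpha/N} * sqrt(1/m_e + 1/k); alpha = 0.05 is an assumed default."""
    # Use the lower tail and flip the sign to avoid rounding 1 - alpha/N to 1.
    z = -NormalDist().inv_cdf(alpha / n_tests)
    return z * sqrt(1 / m_e + 1 / k)

# Values quoted in the caption.
m_e, k0, k1, n_tests = 477, 70, 710, 930_140_004

threshold_identical = rejection_threshold(m_e, k0, n_tests)      # uses k_0
threshold_first_degree = rejection_threshold(m_e, k1, n_tests)   # uses k_1

print(f"identical pairs:  {threshold_identical:.3f}")
print(f"1st-degree pairs: {threshold_first_degree:.3f}")
```

Under the assumed α, both thresholds fall below the expected scores they are meant to separate from zero (1 for identical pairs, 0.5 for 1st-degree pairs).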
