Cell Genom. 2023 Aug 24;3(10):100390.
doi: 10.1016/j.xgen.2023.100390. eCollection 2023 Oct 11.

Learning functional conservation between human and pig to decipher evolutionary mechanisms underlying gene expression and complex traits

Jinghui Li et al. Cell Genom. 2023.

Abstract

Assessment of genomic conservation between humans and pigs at the functional level can improve the potential of pigs as a human biomedical model. To address this, we developed a deep learning-based approach to learn the genomic conservation at the functional level (DeepGCF) between species by integrating 386 and 374 functional profiles from humans and pigs, respectively. DeepGCF demonstrated better prediction performance compared with the previous method. In addition, the resulting DeepGCF score captures the functional conservation between humans and pigs by examining chromatin states, sequence ontologies, and regulatory variants. We identified a core set of genomic regions as functionally conserved that plays key roles in gene regulation and is enriched for the heritability of complex traits and diseases in humans. Our results highlight the importance of cross-species functional comparison in illustrating the genetic and evolutionary basis of complex phenotypes.

Keywords: complex trait; deep learning; functional conservation; gene expression; human; pig.

Conflict of interest statement

The authors declare no competing interests.

Figures

Graphical abstract
Figure 1
Overview of the DeepGCF model. (A) The DeepGCF model is trained in two steps. First, DeepSEA models are trained separately in humans and pigs to transform binary functional features (e.g., peaks called from ATAC-seq and ChIP-seq, and chromatin states predicted by a multivariate hidden Markov model, ChromHMM) into continuous values by predicting the functional effects of single nucleotides, with each target nucleotide centered in a 1,000-bp genomic region. Second, a pseudo-Siamese network is trained to predict whether paired human-pig regions are orthologous, taking as input the two corresponding vectors of functional effects predicted by DeepSEA together with normalized gene expression. The output, the DeepGCF score, is a value between 0 and 1 that quantifies the functional conservation of the paired human-pig region. (B) The DeepGCF model can be applied to predict the effect of genomic variants on functional conservation, quantified by changes in DeepGCF scores.
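The second step in panel (A) can be illustrated with a minimal, untrained sketch of a pseudo-Siamese forward pass. Only the overall shape comes from the caption: two species-specific inputs and a single output between 0 and 1. The class name `PseudoSiameseSketch`, the single hidden layer, its width, and the random weights are all assumptions made for illustration; this is not the authors' architecture.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Apply one fully connected layer: activation(W . inputs + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class PseudoSiameseSketch:
    """Hypothetical two-branch network; layer sizes are assumptions."""

    def __init__(self, n_human=386, n_pig=374, n_hidden=16, seed=0):
        rng = random.Random(seed)
        # "Pseudo"-Siamese: the two branches do NOT share weights, so the
        # human and pig inputs may have different lengths.
        self.human_w, self.human_b = init_layer(rng, n_human, n_hidden)
        self.pig_w, self.pig_b = init_layer(rng, n_pig, n_hidden)
        self.out_w, self.out_b = init_layer(rng, 2 * n_hidden, 1)

    def score(self, human_features, pig_features):
        """Return a value in (0, 1) for one paired human-pig region."""
        h = dense(human_features, self.human_w, self.human_b, relu)
        p = dense(pig_features, self.pig_w, self.pig_b, relu)
        # Concatenate both branch outputs and squash to a single score.
        return dense(h + p, self.out_w, self.out_b, sigmoid)[0]


model = PseudoSiameseSketch()
example_score = model.score([0.5] * 386, [0.5] * 374)
```

The default input lengths reuse the 386 human and 374 pig profile counts from the abstract. A real model would be trained on orthologous and non-orthologous region pairs, so the score from this sketch carries no biological meaning.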
Figure 2
The performance of DeepGCF under different scenarios. (A) Receiver operating characteristic (ROC) curves comparing the performance of DeepGCF (this study) and LECIF. The ROC curve of each method is generated by predicting whether 200,000 pairs randomly selected from the testing set, which included equal numbers of orthologous and non-orthologous pairs, were orthologous. (B) Precision-recall (PR) curves generated by the same procedure as the ROC curves. (C) The distribution of DeepGCF scores across all 38,961,848 human-pig ortholog pairs. (D) The areas under the ROC curve (AUROC) and PR curve (AUPRC) of DeepGCF using all (human, 386; pig, 374), ∼50% (human, 192; pig, 187), ∼10% (human, 52; pig, 47), and ∼1% (human, 4; pig, 4) of human and pig functional features. The subsets of the human and pig features were randomly and proportionally selected from each of the ChIP-seq/ATAC-seq, ChromHMM, and RNA-seq profiles. (E) The AUROC and AUPRC of DeepGCF using all functional features (human, 386; pig, 374), features without ChIP-seq/ATAC-seq (human, 129; pig, 84), features without ChromHMM (human, 180; pig, 210), and features without RNA-seq (human, 77; pig, 80).
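As a reminder of what the AUROC in panels (A), (D), and (E) measures, the sketch below computes it from labels and scores using the standard rank-sum identity: the AUROC equals the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair. The function name `auroc` and the toy data are illustrative assumptions, not the evaluation code used in the study.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for orthologous pairs, 0 for non-orthologous pairs.
    scores: predicted scores; tied scores receive their average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example: one positive pair is ranked below one negative pair.
toy_auroc = auroc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1])
```

In the toy example, three of the four positive-negative comparisons are ordered correctly, so the result is 0.75.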
Figure 3
Comparison of functional and sequence conservation. (A) Relationship between DeepGCF scores and PhyloP scores of 20,000 randomly selected human regions. The PhyloP score is based on multiple alignments of 99 vertebrate genomes to the human genome. The blue line is the fitted loess regression. The red crosses represent 50 equally divided percentiles of the PhyloP score and the corresponding mean DeepGCF score. (B) Enrichment fold of 8 sequence class categories for regions with high DeepGCF (>95th percentile) and high PhyloP (>95th percentile; high D & high P, n = 260,281) and regions with low DeepGCF (<5th percentile) and medium PhyloP (between the 47.5th and 52.5th percentiles; low D & med P, n = 77,848). Enrichment is equal to the proportion of a sequence class category for a type of orthologous region divided by that for the whole genome. The dashed line (= 1) represents no enrichment. (C) Distribution of DeepGCF scores for different sequence ontologies. The red and green dashed lines represent the mean and median DeepGCF score of the whole genome, respectively. In each box, the center line represents the median, the dot represents the mean, box limits represent the upper and lower quartiles, whiskers represent 1.5 × interquartile range, and individual points are outliers. (D) ΔDeepGCF (DeepGCF after mutation – original DeepGCF) caused by 1,000,000 randomly selected orthologous variants, which are classified into 8 sequence class categories annotated by Sei. (E) The effect of orthologous variants (n = 35,575,835) on the DeepGCF score of regions in 40 sequence classes annotated by Sei, which are classified into 8 categories. The effect was measured by ΔDeepGCF for variants in each sequence class. The SD of ΔDeepGCF for each sequence class quantifies the sensitivity of the sequence class to variants. The dashed line is the fitted regression line.
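The enrichment fold defined in panel (B) is a ratio of two proportions. A minimal sketch follows, assuming each region carries a single sequence-class category label; the function name `enrichment_fold` and the toy labels are hypothetical.

```python
def enrichment_fold(region_categories, genome_categories, category):
    """Proportion of `category` in the selected regions divided by its
    proportion across the whole genome. A value of 1 means no enrichment."""
    region_prop = region_categories.count(category) / len(region_categories)
    genome_prop = genome_categories.count(category) / len(genome_categories)
    if genome_prop == 0:
        raise ValueError("category is absent from the genome-wide set")
    return region_prop / genome_prop


# Toy data (assumed): promoters make up 75% of the selected regions
# but only 25% of the genome-wide background.
selected = ["promoter"] * 3 + ["enhancer"] * 1
genome = ["promoter"] * 2 + ["enhancer"] * 6
promoter_fold = enrichment_fold(selected, genome, "promoter")
enhancer_fold = enrichment_fold(selected, genome, "enhancer")
```

Here the promoter enrichment is 0.75 / 0.25 = 3, and the enhancer enrichment is 0.25 / 0.75, which is below 1 and so indicates depletion.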
Figure 4
DeepGCF scores of genomic regions overlapping with regulatory elements. (A) Distribution of average DeepGCF scores across human tissues (n = 12) and pig tissues (n = 14) for each chromatin state. The red and green dashed lines represent the mean and median DeepGCF score of the whole genome, respectively. In each box, the center line represents the median, box limits represent the upper and lower quartiles, whiskers represent 1.5 × interquartile range, and individual points are outliers. (B) DeepGCF scores of genomic regions overlapping with tissue-specific strongly active promoters and enhancers for human and pig. “All common” represents promoters/enhancers shared across all tissues. Asterisks denote two-sided Mann-Whitney U test: ∗∗∗∗p < 2.2 × 10^−16. (C) Number of significantly enriched human GO terms for genes related to promoters annotated by Sei. Significance was defined as FDR < 0.05 for the binomial and hypergeometric tests. The genes were binned by DeepGCF into 10 equal-width bins, and a functional enrichment analysis was conducted on each bin. (D) Similar to (C) but showing the results for enhancers annotated by Sei.
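Equal-width binning, as used in panels (C) and (D), assigns each gene to a bin based on where its score falls in a fixed range, so bins may hold very different numbers of genes. The sketch below assumes the bins span the full 0 to 1 range of the DeepGCF score; the caption does not say whether the range was instead taken from the observed minimum and maximum. The function name `equal_width_bins` is illustrative.

```python
def equal_width_bins(scores, n_bins=10, low=0.0, high=1.0):
    """Return the bin index (0 .. n_bins - 1) of each score.

    Bins have equal width (high - low) / n_bins. Scores at or above
    `high` go into the last bin; scores below `low` go into the first.
    """
    width = (high - low) / n_bins
    indices = []
    for s in scores:
        idx = int((s - low) / width)
        indices.append(min(max(idx, 0), n_bins - 1))
    return indices


# Toy scores (assumed values, not data from the study).
bin_indices = equal_width_bins([0.0, 0.05, 0.42, 0.95, 1.0])
```

This contrasts with equal-frequency (quantile) binning, where bin edges are chosen so that each bin contains the same number of genes.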
Figure 5
Relationship of DeepGCF scores to genetic variants. (A) The distribution of DeepGCF scores for eQTLs and sQTLs. The red and green dashed lines represent the mean and median DeepGCF score of the whole genome, respectively. Asterisks denote two-sided Mann-Whitney U test: ∗∗∗∗p < 10^−8. In each box, the center line represents the median, the dot represents the mean, box limits represent the upper and lower quartiles, whiskers represent 1.5 × interquartile range, and individual points are outliers. (B) Relationship between the absolute value of eQTL effect size, measured by log allelic fold change (|log2(aFC)|), and DeepGCF score for eGenes. The genes were binned by DeepGCF into 10 equal-width bins for human and pig separately. Asterisks denote that the group is different from all other groups: ∗∗∗∗p < 10^−8 based on Tukey’s multiple comparisons. (C) DeepGCF scores of tissue-sharing e/sGenes from human at local false sign rate (LFSR) < 5% obtained by MashR. Each solid line represents ± standard deviation. (D) Similar to (C) but showing the results for pigs.
Figure 6
Relationship of conservation score to pathogenic variants. (A) The distribution of DeepGCF scores for pathogenic and likely pathogenic SNPs (n = 104,033) obtained from ClinVar, compared with the distribution of DeepGCF scores across the whole genome. Asterisks denote two-sided Mann-Whitney U test: ∗∗∗∗p < 5 × 10^−8. In each box, the center line represents the median, box limits represent the upper and lower quartiles, whiskers represent 1.5 × interquartile range, and individual points are outliers. (B) SD of ΔDeepGCF (DeepGCF after mutation – original DeepGCF) caused by ClinVar SNPs. The SNPs were binned by their original DeepGCF into 10 equal-width bins. (C) ClinVar SNPs classified by Sei. A polar coordinate system was used, where the radial coordinate indicates the SNP effect on DeepGCF score. The red solid circle represents zero DeepGCF change, and the two dashed circles represent ±0.03 of DeepGCF, encompassing 95% of SNPs. Each dot represents a SNP; SNPs inside the red circle were predicted to have positive effects (increased DeepGCF), while SNPs outside the red circle were predicted to have negative effects (decreased DeepGCF). Dot size indicates the original DeepGCF. Within each sequence class, SNPs were ordered by chromosomal coordinates. Diseases and gene names associated with the 10 SNPs with the largest impact on DeepGCF were annotated.
Figure 7
Application of DeepGCF to complex traits/diseases in human. (A) Heritability enrichment calculated by LDSC for 80 human traits using functionally conserved regions (top 5% DeepGCF). The regions were divided into 5 equal-width bins, and the heritability enrichment of all traits was calculated for each bin. The red dashed line is the fitted regression line between heritability enrichment and DeepGCF percentile, and the gray area is the 95% confidence interval. In each box, the center line represents the median, box limits represent the upper and lower quartiles, whiskers represent 1.5 × interquartile range, and individual points are outliers. (B) Significant heritability enrichment (FDR < 0.05) explained by functionally conserved regions for 8 human traits. The error bar is the estimated standard error of heritability enrichment. (C) The number of putative causal SNPs (PIP > 0.95 and GWAS p < 5 × 10^−8) identified by PolyFun + SuSiE with functionally conserved regions as a prior and by SuSiE without priors for 7 human traits (the results for coxarthrosis are not shown because no causal SNPs were found using either method). (D) The relative prediction accuracy of polygenic scores for 20 human complex traits using functionally conserved regions as a prior in SBayesRC. Relative prediction accuracy is equal to (prediction accuracy using the prior – prediction accuracy without priors) / prediction accuracy without priors. Relative prediction accuracy > 0 (dashed line) indicates a higher accuracy than without priors.
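The relative prediction accuracy in panel (D) is a simple fractional change. A minimal sketch of the stated formula follows; the function name `relative_prediction_accuracy` and the example accuracies are hypothetical, not results from the study.

```python
def relative_prediction_accuracy(acc_with_prior, acc_without_prior):
    """(accuracy with prior - accuracy without priors) / accuracy without priors.

    Positive values mean the prior improved prediction accuracy.
    """
    if acc_without_prior == 0:
        raise ValueError("baseline accuracy must be non-zero")
    return (acc_with_prior - acc_without_prior) / acc_without_prior


# Hypothetical accuracies: 0.33 with the prior versus 0.30 without.
gain = relative_prediction_accuracy(0.33, 0.30)
```

With these made-up numbers, the prior gives a relative gain of about 0.10, that is, a 10% improvement over the baseline.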
