BMC Bioinformatics. 2019 May 16;20(1):254.
doi: 10.1186/s12859-019-2877-3.

GenePy - a score for estimating gene pathogenicity in individuals using next-generation sequencing data

E Mossotto et al. BMC Bioinformatics.

Abstract

Background: Next-generation sequencing (NGS) is revolutionising the diagnosis and treatment of rare diseases; however, its application to understanding common disease aetiology is limited. Rare disease applications attribute genetic change(s) at a single locus to a specific phenotype in a binary fashion. In common diseases, where multiple genetic variants within and across genes contribute to disease, binary modelling cannot capture the burden of pathogenicity harboured by an individual across a given gene or pathway. We present GenePy, a novel gene-level scoring system for the integration and analysis of NGS data on a per-individual basis, which moves NGS data interpretation from the variant level to the gene level. This simple and flexible scoring system is intuitive and amenable to integration with machine learning, network and topological approaches, facilitating the investigation of complex phenotypes.

Results: Whole-exome sequencing data from 508 individuals were used to generate GenePy scores. For each variant, a score is calculated incorporating: i) population allele frequency estimates; ii) individual zygosity, determined through standard variant calling pipelines; and iii) any user-defined deleteriousness metric to inform on functional impact. GenePy then combines the scores generated for all observed variants into a single gene score for each individual. We generated a matrix of ~14,000 GenePy scores for all individuals for each of sixteen popular deleteriousness metrics. All per-gene scores are corrected for gene length. The majority of genes generate GenePy scores < 0.01, although individuals harbouring multiple rare, highly deleterious mutations can accumulate extremely high GenePy scores. In the absence of a comparator metric, we examine GenePy performance in discriminating genes known to be associated with three common, complex diseases. A Mann-Whitney U test conducted on GenePy scores for this positive control gene in cases versus controls demonstrates markedly more significant results (p = 1.37 × 10⁻⁴) compared to the most commonly applied association tool that combines common and rare variation (p = 0.003).
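The abstract names the three inputs to each variant score but does not give the formula. The following is a minimal sketch only, assuming a log-frequency-weighted form in which each variant contributes its deleteriousness multiplied by the negative log10 of the product of the population frequencies of the two alleles the individual carries. The names `Variant`, `variant_score` and `gene_score`, the handling of zygosity, and the division by gene length are illustrative assumptions and are not taken from the GenePy software.

```python
import math
from dataclasses import dataclass


@dataclass
class Variant:
    maf: float              # population frequency of the alternate allele (0 < maf < 1)
    alt_copies: int         # 1 = heterozygous, 2 = homozygous alternate
    deleteriousness: float  # user-chosen metric, assumed scaled to [0, 1]


def variant_score(v: Variant) -> float:
    """Assumed per-variant score: D * -log10(f1 * f2)."""
    if not 0.0 < v.maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    if v.alt_copies == 2:
        f1 = f2 = v.maf                # both alleles are the alternate allele
    elif v.alt_copies == 1:
        f1, f2 = 1.0 - v.maf, v.maf    # one reference allele, one alternate allele
    else:
        raise ValueError("alt_copies must be 1 or 2")
    return -v.deleteriousness * math.log10(f1 * f2)


def gene_score(variants, gene_length=None) -> float:
    """Sum the variant scores for one gene in one individual.

    If gene_length is given, divide by it (an assumed form of the
    gene-length correction mentioned in the abstract).
    """
    total = sum(variant_score(v) for v in variants)
    if gene_length is not None:
        total /= gene_length
    return total
```

Under these assumptions, a rarer or more deleterious variant raises the gene score, and a gene with no observed variants scores zero, which is in line with the behaviour described for invariant genes and for individuals carrying multiple rare, highly deleterious mutations.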

Conclusions: Per-gene per-individual GenePy scores are intuitive when assessing genetic variation in individual patients or comparing scores between groups. GenePy outperforms the currently accepted best practice tools for combining common and rare variation. GenePy scores are suitable for downstream data integration with transcriptomic and proteomic data that also report at the gene level.

Keywords: Gene score; Genome analysis; Mathematical modelling; Next-generation sequencing; Pathogenicity score.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Single variant GenePy score distribution under fixed deleteriousness values, showing the impact of varying zygosity and minor allele frequency (MAF).
Fig. 2
GenePy profiles observed for all genes across the whole cohort for all sixteen deleteriousness metrics. Uncorrected GenePy scores (upper panel) exhibit characteristic spikes reflecting gene scores strongly influenced by either single highly deleterious (D = 1) common homozygous variants (red) or single highly deleterious very rare/novel variants (MAF = 0.00001) (blue). GenePy_cgl score profiles (lower panel) do not display these spikes. Invariant genes conferring a GenePy score < 0.01 are overrepresented and are not shown; the x-axis therefore begins at the 0.01–0.02 bin. All sixteen versions of the GenePy score exhibit long tails in the score distribution, truncated here at a score of six.
Fig. 3
GenePy score profiles for seven independent patients diagnosed with IBD across selected genes from the NOD2 and TLR pathways. The GenePy scores shown were implemented using the M-CAP deleteriousness (D) metric. To facilitate plotting, raw GenePy scores were transformed to Z-scores for each gene. Different colours depict individual patient profiles. Despite being diagnosed with the same disease, all individuals exhibit distinctive profiles across key genes implicated in immune pathways. Where individuals show evidence of gene pathogenicity within the same pathway (e.g. IBD5 and IBD6), it is conferred through accumulated mutation in different genes: IBD6 has elevated gene-level scores for TAB1, CARD6 and MAPK3, while IBD5 may have impaired function in this pathway due to combined mutation in MAPK13, BP1 and NFKB1. Similarly, IBD1, IBD3 and IBD4 exhibit pathogenic profiles in TLR pathway genes only. These individual-level data can be combined with disease phenotype, severity and treatment outcome data in machine learning models to better stratify patient cohorts and realise the promise of personalised medicine.
