Genet Med. 2020 Feb;22(2):362-370.
doi: 10.1038/s41436-019-0643-6. Epub 2019 Aug 30.

AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature

Johannes Birgmeier et al. Genet Med. 2020 Feb.

Abstract

Purpose: Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval followed by expert evidence integration in search of diagnostic variants and genes. Here, we try to accelerate pathogenic variant evidence retrieval by an automatic approach.

Methods: Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates.

Results: AVADA automatically retrieved almost 60% of likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation, on 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar's 21, versus only 2 using the best current automated approach.

Conclusion: AVADA advances automated retrieval of pathogenic monogenic variant evidence from full-text literature. Although far from perfect, AVADA is much faster than PubMed/Google Scholar search, and careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.

Keywords: automatic variant retrieval; full-text extraction; machine learning; natural language processing; variants database.

Conflict of interest statement

P.D.S. and D.N.C. are the creators of HGMD. They receive financial support for it from Qiagen LTD through a License Agreement with Cardiff University. The other authors declare no conflict of interest.

Figures

Figure 1. Construction of the automated variant evidence database AVADA.
Identification of relevant literature: AVADA discovers potentially relevant articles (about the genetic causes of Mendelian diseases) from PubMed, downloads their full text, and again filters potentially relevant articles based on the articles’ full text. Variant mapping: Variant descriptions are detected in articles using 47 manually built regular expressions. Variant descriptions are then linked to mentioned genes to form gene-variant candidate mappings. Gene-variant candidate mappings are filtered using a gene-variant candidate classifier and converted to genomic coordinates. AVADA ultimately retrieves (unvalidated) evidence about 203,536 distinct genetic variants in 5,827 genes from 61,116 articles.
Figure 2. Automatic conversion of variant mentions to genomic coordinates from full-text literature.
(A) AVADA uses a regular expression to detect a variant mention (e.g., p.M34T) in the full text of an article. The position of the variant in the transcript (34), reference (M) and alternative alleles (T) are parsed using the regular expression. (B) AVADA detects mentioned genes in the article using a list of gene names and synonyms, and the help of a classifier that decides if recognized words are indeed a gene mention. The variant description detected in step A forms gene-variant candidate mappings with those genes that have the reference “M” at amino acid number 34. (C) Gene-variant candidate mappings (variant=p.M34T and gene=GJB2 in this example, highlighted in green) are associated with 125 numerical features based on the relative positions of the closest mention of the candidate gene to the variant mention, information about the candidate gene’s importance in the article, and words and characters surrounding the gene and variant mentions and nearby gene mentions (the latter highlighted in red; see Supplementary Methods). (D) A machine learning classifier (implemented as a Gradient Boosting classifier) takes these 125 features as input and returns a score between 0 and 1 indicating the classifier’s assessment of whether the variant actually refers to the given candidate gene. If the classifier returns a score greater than 0.9, the gene-variant candidate mapping is transformed to Variant Call Format (chromosome, position, reference and alternative allele) and entered into the AVADA database. In the present example, AVADA correctly decides that p.M34T only maps to GJB2 and not connexin 30 (encoded by the gene GJB6). Example taken from PubMed ID 23808595.
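Steps A, B and D of Figure 2 can be illustrated with a minimal Python sketch. It is not AVADA's implementation: the single pattern below is a hypothetical stand-in for one of the 47 hand-built regular expressions, the gene names and protein sequences are toy values, and the classifier scores are assumed to have been computed elsewhere. Only the p.M34T example, the reference-residue check and the 0.9 cutoff come from the caption.

```python
import re

# Hypothetical pattern for one-letter protein substitutions such as "p.M34T".
# This is an illustration only, not one of AVADA's actual expressions.
AMINO = "ACDEFGHIKLMNPQRSTVWY"
PROTEIN_SUB = re.compile(rf"\bp\.([{AMINO}])(\d+)([{AMINO}])\b")


def parse_variants(text):
    """Step A: return (reference, position, alternative) for each mention."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in PROTEIN_SUB.finditer(text)]


def candidate_genes(variant, protein_seqs):
    """Step B: keep genes whose protein carries the reference residue
    at the stated 1-based amino acid position."""
    ref, pos, _alt = variant
    return [gene for gene, seq in protein_seqs.items()
            if pos <= len(seq) and seq[pos - 1] == ref]


def accept(scores, threshold=0.9):
    """Step D: keep candidate genes whose classifier score exceeds the cutoff."""
    return [gene for gene, score in scores.items() if score > threshold]


# Toy protein sequences (assumptions, not real gene products).
toy_seqs = {
    "GENE_A": "A" * 33 + "M" + "A" * 10,  # M at position 34
    "GENE_B": "A" * 33 + "L" + "A" * 10,  # L at position 34
    "GENE_C": "M" * 10,                   # shorter than 34 residues
}

variants = parse_variants("We identified the p.M34T variant in the proband.")
candidates = candidate_genes(variants[0], toy_seqs)
```

In this sketch only GENE_A survives the reference-residue check. The classifier in step C would then score each surviving gene-variant pair, and `accept` would keep those scoring above 0.9.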
Figure 3. Automatic variant retrieval results.
(A) Top ten journals in AVADA. AVADA retrieved variants from 3,159 articles in “Human Mutation”, 2,330 articles in “American Journal of Human Genetics”, 2,042 articles in “Human Molecular Genetics” etc. (B) Top ten journals in all of HGMD. Similar to AVADA, the top three journals are “Human Mutation”, the “American Journal of Human Genetics”, and “Human Molecular Genetics”. Reassuringly, the two lists share 9 of the top 10 journals even though HGMD is manually curated whereas AVADA automatically retrieves variant evidence, but does not validate it. (C) (Unvalidated) AVADA variants intersected with all curated disease-causing variants in HGMD (“DM” variants only) and ClinVar (“likely/pathogenic” variants only). AVADA retrieves 85,888 variants also in the HGMD set (subset to disease-causing variants) and 26,033 variants also in the ClinVar set (subset to pathogenic and likely pathogenic variants). (D) AVADA’s potential value in patient diagnosis. We enumerate the number of patient diagnostic variants found in each of four databases, for 245 Deciphering Developmental Disorders (DDD) diagnosed patients. Curated HGMD and ClinVar (predating the DDD publication) are subset to disease-causing (“DM”), and “likely/pathogenic”, respectively. For tmVar and AVADA, we manually validated all diagnostic evidence shown. AVADA completely subsumes and almost triples abstract-based tmVar. And while ClinVar alone implicates 21 diagnostic variants, AVADA offers unvalidated evidence for an additional 27 variants, of which 18 are valid, virtually doubling ClinVar’s reach.
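The panel D figures can be checked with a few lines of arithmetic. This is a small sketch using only the counts stated in the caption; the variable names are illustrative.

```python
# Counts taken from the Figure 3D caption.
clinvar_diagnostic = 21      # diagnostic variants implicated by ClinVar alone
avada_extra_candidates = 27  # additional variants with unvalidated AVADA evidence
avada_extra_valid = 18       # of those, variants that passed manual validation

combined = clinvar_diagnostic + avada_extra_valid
fold_over_clinvar = combined / clinvar_diagnostic
validation_rate = avada_extra_valid / avada_extra_candidates

print(f"ClinVar + validated AVADA: {combined} diagnostic variants")
print(f"Fold increase over ClinVar alone: {fold_over_clinvar:.2f}")
print(f"Share of AVADA candidates that validated: {validation_rate:.0%}")
```

Combining the two sources gives 39 diagnostic variants, about 1.86 times ClinVar's 21, which is the sense in which the caption says AVADA is "virtually doubling" ClinVar's reach. Two thirds of the 27 additional AVADA candidates held up under manual validation.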
