Comparative Study
Nucleic Acids Res. 2004 Jan 2;32(1):135-42.
doi: 10.1093/nar/gkh162. Print 2004.

Automatic extraction of mutations from Medline and cross-validation with OMIM

Dietrich Rebholz-Schuhmann et al. Nucleic Acids Res.

Abstract

Mutations help us to understand the molecular origins of diseases. Researchers therefore both publish and seek disease-relevant mutations in public databases and in the scientific literature, e.g. Medline. Retrieval tends to be time-consuming and incomplete; automated screening of the literature is more efficient. We developed extraction methods (called MEMA) that scan Medline abstracts for mutations. MEMA identified 24,351 singleton mutations in conjunction with a HUGO gene name in 16,728 abstracts. From a sample of 100 abstracts we estimated the recall for the identification of mutation-gene pairs to be 35% at a precision of 93%. Recall for mutation detection alone was >67% at a precision of >96%. This shows that our system produces reliable data. The subset of protein sequence mutations (PSMs) from MEMA was compared to the entries in OMIM (20,503 entries versus 6,699, respectively). We found 1,826 PSM-gene pairs common to both datasets (cross-validated). This is 27% of all PSM-gene pairs in OMIM and 91% of those pairs from OMIM which co-occur in at least one Medline abstract. We conclude that Medline covers a large portion of the mutations known to OMIM. Another large portion of the mutations extracted from Medline could be artificially produced mutations from mutagenesis experiments. Access to the database of extracted mutation-gene pairs is available through the web pages of the EBI (refer to http://www.ebi.ac.uk/rebholz/index.html).
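The cross-validation described above amounts to intersecting two sets of mutation-gene pairs and expressing the overlap as a share of the curated set. The following is a minimal sketch of that idea, not MEMA's implementation: the function name `cross_validate` and the toy pairs (with placeholder gene names) are assumptions made for illustration, and only the counts 1,826 and 6,699 are taken from the abstract.

```python
def cross_validate(extracted, curated):
    """Return the pairs present in both sets and their share of the curated set."""
    common = extracted & curated
    share = len(common) / len(curated) if curated else 0.0
    return common, share


# Toy (gene, mutation) pairs with placeholder gene names, invented for illustration.
mema_pairs = {("GENE_A", "R117H"), ("GENE_A", "G551D"), ("GENE_B", "R175H")}
omim_pairs = {
    ("GENE_A", "G551D"),
    ("GENE_B", "R175H"),
    ("GENE_C", "E6V"),
    ("GENE_C", "E26K"),
}

common, share = cross_validate(mema_pairs, omim_pairs)
print(sorted(common), f"{share:.0%}")

# Counts reported in the abstract: 1,826 shared pairs out of 6,699 OMIM pairs.
reported_share = 1826 / 6699
print(f"{reported_share:.0%}")  # 27%
```

The last two lines only re-derive the 27% figure from the reported counts; the 91% figure cannot be checked the same way because the abstract does not state how many OMIM pairs co-occur in at least one Medline abstract.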

Figures

Figure 1
Workflow overview. 16 142 HUGO gene names were integrated as patterns into a finite state automaton (FSA). Mutation patterns, each encoding a mutation as a regular expression, were integrated in the same way. All Medline abstracts were scanned; the different FSAs extracted the matching phrases and tagged the results.
Figure 2
Number of PSM–gene pairs per gene, sorted according to OMIM. Genes are listed from top to bottom, and the blocks to the right represent the number of PSM–gene pairs: first the pairs unique to OMIM, then those contained in both databases and finally those listed in MEMA only (Medline). The genes are sorted by the number of pairs in OMIM, which is the sum of the first two sections in the block. Markers in the figure distinguish OMIM-owned, MEMA-owned and BOTH-owned genes (see text). OMIM provides more PSM–gene pairs than MEMA for only 12 of the 30 genes, although these are the top 30 genes in OMIM. The correlation coefficient for the distribution of mutations per gene found in OMIM and in Medline is 0.53.
Figure 3
Number of PSMs from 1971 to 2001. The diagram shows the number of PSMs found for genes where at least one PSM–gene pair has been cross-validated. The blocks from bottom to top represent the total number of PSM–gene pairs for OMIM-owned, BOTH-owned and MEMA-owned genes. The inset displays the number of PSMs for the years 1971–1985 at a larger scale (0–16). The number of PSM–gene pairs increases steadily during the years 1990–2000. Only a small portion is integrated into OMIM, mainly represented by the PSM–gene pairs of BOTH-owned genes. This is explained by the fact that Medline reports on experimentally induced mutations. Such mutation–gene pairs are not relevant to OMIM, since evidence for their impact on a human disease might not be known or might be unclear (see Discussion).
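
The workflow in Figure 1, in which gene-name patterns and mutation patterns are applied to every abstract, can be approximated with ordinary regular expressions. The sketch below makes several assumptions: a simplified pattern for protein sequence mutations in one-letter notation (for example `R117H`), a three-entry gene list standing in for the 16 142 HUGO names, an invented example sentence, and a pairing rule that links every gene to every mutation found in the same abstract. MEMA's actual patterns and automata are not given in this record and are likely broader.

```python
import re

# Three HUGO symbols chosen for illustration; the paper used 16 142 gene names.
GENE_NAMES = ["CFTR", "TP53", "HBB"]

# One-letter codes of the 20 standard amino acids.
AMINO = "ACDEFGHIKLMNPQRSTVWY"

# Simplified PSM pattern: wild-type residue, position, mutant residue.
PSM_PATTERN = re.compile(rf"\b([{AMINO}])(\d+)([{AMINO}])\b")
GENE_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, GENE_NAMES)) + r")\b")


def extract_pairs(abstract):
    """Pair every gene name with every mutation found in the same abstract."""
    genes = sorted(set(GENE_PATTERN.findall(abstract)))
    mutations = sorted({"".join(m) for m in PSM_PATTERN.findall(abstract)})
    return [(gene, mutation) for gene in genes for mutation in mutations]


# Invented example sentence, not taken from Medline.
text = "The R117H and G551D substitutions in CFTR were analysed."
print(extract_pairs(text))
# [('CFTR', 'G551D'), ('CFTR', 'R117H')]
```

Pairing by co-occurrence is the simplest possible rule and would over-generate when an abstract mentions several genes, which is one reason pair-level precision and recall can differ from those of mutation detection alone.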
