BMC Bioinformatics. 2009 Aug 27;10 Suppl 8(Suppl 8):S1.
doi: 10.1186/1471-2105-10-S8-S1.

Extraction of human kinase mutations from literature, databases and genotyping studies

Martin Krallinger et al. BMC Bioinformatics.

Abstract

Background: There is considerable interest in characterizing the biological role of specific protein residue substitutions through mutagenesis experiments. Additionally, recent efforts to detect disease-associated SNPs have motivated both the manual annotation and the automatic extraction of naturally occurring sequence variations from the literature, especially for protein families that play a significant role in signaling processes, such as kinases. Systematic integration and comparison of kinase mutation information from multiple sources, covering literature, manually annotated databases and large-scale experiments, can result in a more comprehensive view of the functional, structural and disease-associated aspects of protein sequence variants. Previously published mutation extraction approaches did not sufficiently distinguish between two fundamentally different categories of variation origin, namely naturally occurring variants and induced mutations generated through in vitro experiments.

Results: We present a literature mining pipeline for the automatic extraction and disambiguation of single-point mutation mentions from both abstracts and full-text articles, followed by a sequence validation check that links mutations to their corresponding kinase protein sequences. Each mutation is scored according to whether it corresponds to an induced mutation or a natural sequence variant. We were able to provide direct literature links for a considerable fraction of previously annotated kinase mutations, thus enabling more efficient interpretation of their biological characterization and experimental context. To test the capabilities of the pipeline, we analyzed the mutations in the protein kinase domain of the kinase family. Using our literature extraction system, we recovered a total of 643 mutation-protein associations from PubMed abstracts and 6,970 from a large collection of full-text articles. When compared to state-of-the-art annotation databases and high-throughput genotyping studies, the mutation mentions extracted from the literature overlap to a good extent with the existing knowledgebases, whereas the remaining mentions suggest new mutation records that were not previously annotated in the databases.
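The two steps named above, detecting a single-point mutation mention and checking it against a protein sequence, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular expressions, the function names `find_mutations` and `matches_sequence`, and the toy sequence are all assumptions made for illustration, and only the "K4A" and "Lys4Ala" notations are handled.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Hypothetical patterns (assumption): "K64A" and "Lys64Ala" / "Lys-64-Ala".
ONE_LETTER = re.compile(rf"\b([{ONE}])(\d+)([{ONE}])\b")
THREE_LETTER = re.compile(rf"\b({THREE})-?(\d+)-?({THREE})\b")


def find_mutations(text):
    """Return (wild_type, position, mutant) tuples found in text, sorted by position."""
    hits = set()
    for wt, pos, mut in ONE_LETTER.findall(text):
        if wt != mut:
            hits.add((wt, int(pos), mut))
    for wt, pos, mut in THREE_LETTER.findall(text):
        if wt != mut:
            hits.add((THREE_TO_ONE[wt], int(pos), THREE_TO_ONE[mut]))
    return sorted(hits, key=lambda m: (m[1], m[0], m[2]))


def matches_sequence(mutation, sequence):
    """Sequence check: the wild-type residue must occur at the stated 1-based position."""
    wt, pos, _ = mutation
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == wt


if __name__ == "__main__":
    toy_sequence = "MGSKDDLE"  # made-up sequence, not a real kinase
    for m in find_mutations("The K4A mutant and the Asp6Gly variant were analyzed."):
        print(m, matches_sequence(m, toy_sequence))
```

A pattern this simple will also match strings that merely look like mutations, so the sequence check serves as a filter: a candidate is kept for a given protein only if that protein carries the stated wild-type residue at the stated position.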

Conclusion: Using the proposed residue disambiguation and classification approach, we were able to differentiate between natural variant and mutagenesis types of mutations with an accuracy of 93.88. The resulting system helps human experts construct a Gold Standard set of literature-extracted mutations with minimal manual curation effort, providing direct pointers to relevant evidence sentences. Our system is able to recover mutations from the literature that are not present in state-of-the-art databases. Manual validation by human experts of 100 literature-extracted mutations from PubMed abstracts showed that almost three quarters (72%) were correct, and more than half of these had not been previously annotated in databases.

Figures

Figure 1
Flow chart of the presented literature mining approach for mutation extraction. This flow chart provides an overview of the different processing steps to extract mutations relevant for human kinases. The main steps include the construction of a kinase relevant article collection, the detection of mutation mentions, the scoring of the type of mutation (induced/natural variant), the linking of mutations to the corresponding protein sequence and the comparison to existing databases.
Figure 2
Mutation type frequencies from PubMed and SwissProt. A. Relative frequency of each mutation type derived from PubMed abstracts and from the SwissProt database. B. Most frequent mutation types from PubMed abstracts, and from SwissProt (SP), annotated as natural variant or induced (mutagenesis) substitutions. C. Values of the Spearman rank correlation between the text mining derived mutation types and the database derived mutation types. All p-values are below 10e-6, therefore statistically significant.
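As a worked illustration of the rank correlation mentioned in panel C, the following Python sketch computes a Spearman coefficient between two lists of mutation-type frequencies. It is a generic textbook computation (Pearson correlation of average ranks), not the authors' code; the function names and the example frequencies are hypothetical, and no p-value is computed.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical relative frequencies of four substitution types in two sources.
    text_mining = [0.12, 0.08, 0.05, 0.02]
    database = [0.10, 0.09, 0.03, 0.04]
    print(spearman(text_mining, database))
```

Working on ranks rather than raw frequencies means the comparison depends only on the ordering of mutation types in each source, not on their absolute frequencies.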
Figure 3
Comparative analysis of wild type residues and mutations extracted from SwissProt and using text mining. This chart illustrates differences in terms of wild type and mutant residue frequencies derived from the SwissProt database and obtained through automatic literature processing. A. Relative frequency of each wild type residue derived from mutations extracted from PubMed abstracts and from the SwissProt database. B. Relative frequency of each mutant residue derived from mutations extracted from PubMed abstracts and from the SwissProt database.
Figure 4
Evaluation of classifying induced mutation mentions and natural variants. A. Box plot of the sentence classifier scores for Natural Variant (NV) annotated mutations in SwissProt and a random subset of sentence scores from mutation mentioning sentences. B. Example cases of mutation mentions corresponding to natural variant and induced mutations. C and D. Example features used by the sentence classifier for the positive class (C, natural variant) and the negative class (D, induced mutation). E and F. Manual classification results for 50 randomly selected mutation mentioning sentences for classifier score intervals: (1) score above 4, (2) score range 4 to 3, (3) score range 3 to 2, (4) score range 2 to 1, (5) score range 1 to 0, (6) score range 0 to minus 1, (7) score range minus 1 to minus 2, (8) score range minus 2 to minus 3, (9) score below minus 3. Positive scores correspond to mutations classified as natural variant; negative scores correspond to mutations classified as induced/mutagenesis.
Figure 5
Distribution of literature extracted mutations in the groups defined by Kinbase. A. Number of mutations from the literature falling into the different protein kinase domain groups into which Kinbase classifies the human kinome, for abstracts and full-text articles respectively. B. Normalized distribution of mutations across the same protein kinase domain groups, for abstracts and full-text articles respectively.
Figure 6
Localization of the mutations extracted from PubMed abstracts within the structure of the protein kinase domain. The ATP binding pocket is represented with sticks. The DFG motif (activation segment, essential for kinase function) harbors a large number of mutations. The light brown asparagine (central part of the figure) in the inter-lobe region harbors more than 10 mutations. The residue with the highest mutation density is Lysine 64 (red), with 65 mutations. This residue has been reported as essential for protein function and ATP binding. Most of the mutations are located in or near the ATP binding pocket or the activation segment, and mutations outside the binding pocket generally correspond to low mutation density residues (colored in grey and green in the kinase domain model).
Figure 7
Success estimate of the extraction pipeline by human expert manual validation. These percentages were calculated using a manual sampling and validation protocol conducted on 100 abstracts. Correct – Database confirmed: Mutations already found in at least one of the analyzed databases (UniProt, SAAPdb, COSMIC, KinMutBase or Greenman). Correct – Manual validation: Mutation-protein pairs found to be correct after manual validation on 100 abstracts. Correct – Orthologue: Cases where the mapping is confirmed by manual validation and the mutation is mapped to a non-human orthologue. Incorrect Mutation to Protein Assignment: Cases where both proteins share the same amino acid at the mutated position and the algorithm chooses the incorrect pair. Incorrect Mutation Assignment: Cases where the mutation is not properly identified; a notable special case is confusion with cell lines, which accounts for 66% of this category. Too ambiguous even for human experts: Rare, poorly informative cases where even human experts reading the abstracts cannot identify which protein the mutation corresponds to.
