BMC Med Inform Decis Mak. 2016 Jul 18;16 Suppl 1(Suppl 1):68.
doi: 10.1186/s12911-016-0294-3.

Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts

Karin M Verspoor et al. BMC Med Inform Decis Mak.

Abstract

Background: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems.

Methods: In this paper, we assess performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for the relation extraction needed to mine genetic variant information from the literature. We apply the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for extracting entities and relations from text, to the relation extraction task on this corpus.

Results: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78 to 0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora. However, the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations.
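
As an illustration of the comparison described in the Results, the following is a minimal sketch of a co-occurrence baseline with simple type constraints, scored with a Precision-weighted F-score. It is not the PKDE4J method or the authors' evaluation code. The entity type names, the ALLOWED table, the function names, and the choice of beta = 0.5 are all assumptions made for the example; the abstract does not state the weighting used.

```python
from itertools import combinations

# Hypothetical type constraints: an (arg1 type, arg2 type) pair maps to the
# relation it may express. The type names are illustrative only.
ALLOWED = {
    ("cohort", "mutation"): "cohort-has-mutation",
    ("cohort", "size"): "cohort-has-size",
}


def cooccurrence_relations(entities):
    """Predict a relation for every type-compatible entity pair in a sentence.

    entities: list of (text, entity_type) gold entities from one sentence.
    Returns a set of (arg1 text, relation label, arg2 text) triples.
    """
    predicted = set()
    for first, second in combinations(entities, 2):
        # Try both argument orders, since the constraints are directional.
        for arg1, arg2 in ((first, second), (second, first)):
            label = ALLOWED.get((arg1[1], arg2[1]))
            if label is not None:
                predicted.add((arg1[0], label, arg2[0]))
    return predicted


def f_beta(predicted, gold, beta=0.5):
    """F-beta score; beta < 1 weights Precision more heavily than Recall."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy sentence loosely modelled on the Fig. 4 example (assumed annotations).
entities = [
    ("individuals", "cohort"),
    ("germline mutations", "mutation"),
    ("5 %", "size"),
    ("cancers", "cohort"),
]
gold = {
    ("individuals", "cohort-has-mutation", "germline mutations"),
    ("cancers", "cohort-has-size", "5 %"),
}

predicted = cooccurrence_relations(entities)
score = f_beta(predicted, gold, beta=0.5)
print(len(predicted), round(score, 4))
```

On this toy sentence the baseline recovers both gold relations (Recall 1.0), but two of its four predictions are spurious (Precision 0.5), giving an F0.5 of about 0.556. This is the trade-off the Results describe: type constraints alone already give full Recall, and a classifier has to raise Precision substantially to improve on them.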

Conclusions: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.

Keywords: Biocuration; Genetic variant information; Information extraction; Text mining.

Figures

Fig. 1. Overall architecture of the proposed approach
Fig. 2. Architecture of PKDE4J. The structure of the overall PKDE4J system
Fig. 3. Stages in iterative optimization algorithm
Fig. 4. Example Variome annotation. An example sentence annotation from the Variome corpus [6], including annotations of a number of relations such as cohort-has-mutation (“individuals with germline mutations”) and cohort-has-size (“5 % of all colorectal cancers”, where ‘cancers’ is treated as a metonym for a patient cohort)

References

    1. Stenson PD, Ball EV, Howells K, Phillips AD, Mort M, Cooper DN. The Human Gene Mutation Database: Providing a comprehensive central mutation database for molecular diagnostics and personalised genomics. Hum Genomics. 2009;4(2):69–72. doi: 10.1186/1479-7364-4-2-69.
    2. Stenson P, Mort M, Ball E, Shaw K, Phillips A, Cooper D. The Human Gene Mutation Database: Building a comprehensive mutation repository for clinical and molecular genetics, diagnostic testing and personalized genomic medicine. Hum Genetics. 2014;133(1):1–9. doi: 10.1007/s00439-013-1358-4.
    3. Bamford S, Dawson E, Forbes S, Clements J, Pettett R, Dogan A, Flanagan A, Teague J, Futreal P, Stratton M, et al. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer. 2004;91(2):355–8.
    4. Claustres M, Horaitis O, Vanevski M, Cotton RGH. Time for a unified system of mutation description and reporting: A review of locus-specific mutation databases. Genome Res. 2002;12(5):680–8. doi: 10.1101/gr.217702.
    5. Baumgartner W, Cohen K, Fox L, Acquaah-Mensah G, Hunter L. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007;23(13):41–8. doi: 10.1093/bioinformatics/btm229.