J Biomed Semantics. 2017 Apr 7;8(1):14.
doi: 10.1186/s13326-017-0116-2.

SNPPhenA: a corpus for extracting ranked associations of single-nucleotide polymorphisms and phenotypes from literature

Behrouz Bokharaeian et al. J Biomed Semantics.

Abstract

Background: Single Nucleotide Polymorphisms (SNPs) are among the most important types of genetic variation influencing common diseases and phenotypes. Recently, some corpora and methods have been developed for extracting mutations and diseases from text. However, no available corpus for extracting associations from text is annotated with linguistic-based negation, modality markers, neutral candidates, and the confidence level of associations.

Method: This work presents the steps used to produce the SNPPhenA corpus. They include automatic Named Entity Recognition (NER) followed by manual annotation of SNP and phenotype names, annotation of the SNP-phenotype associations and their level of confidence, and annotation of modality markers. Moreover, the corpus was annotated with negation scopes and cues as well as neutral candidates, which play a crucial role in handling negation and modality in extraction tasks.
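The annotation layers listed above can be pictured as one record per SNP-phenotype candidate. The following is only an illustrative sketch, not the corpus's actual file format: the class name, field names, the `find_span` helper, and the example sentence (including the SNP identifier `rs1234`) are all hypothetical. The label values mirror the positive, negative, and neutral candidates and the weak, moderate, and strong confidence degrees described for the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def find_span(sentence: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in sentence (hypothetical helper)."""
    start = sentence.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in sentence")
    return (start, start + len(phrase))


@dataclass
class AssociationCandidate:
    """One SNP-phenotype candidate with its annotation layers (assumed layout)."""
    sentence: str
    snp: Span
    phenotype: Span
    label: str                            # "positive", "negative", or "neutral"
    confidence: Optional[str] = None      # "weak", "moderate", or "strong"
    negation_cue: Optional[Span] = None
    negation_scope: Optional[Span] = None
    modality_markers: List[Span] = field(default_factory=list)

    def text(self, span: Span) -> str:
        """Return the surface text covered by a span."""
        return self.sentence[span[0]:span[1]]


# Invented example sentence; it does not come from the corpus.
sentence = "rs1234 was not associated with asthma."
candidate = AssociationCandidate(
    sentence=sentence,
    snp=find_span(sentence, "rs1234"),
    phenotype=find_span(sentence, "asthma"),
    label="negative",
    negation_cue=find_span(sentence, "not"),
    negation_scope=find_span(sentence, "not associated with asthma"),
)
```

Storing spans as character offsets rather than copied strings keeps each layer tied to the same sentence text, which is one common choice for stand-off annotation.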

Result: Inter-annotator agreement was measured with Cohen's Kappa coefficient, and the resulting scores indicate that the corpus is reliable. The Kappa score was 0.79 for annotating the associations and 0.80 for the confidence degree of associations. Basic statistics of the annotated features of the corpus are also presented, along with the results of our first experiments on extracting ranked SNP-phenotype associations. The prepared guideline documents make the corpus easier to use. The corpus, guidelines, and inter-annotator agreement analysis are available on the corpus website (http://nil.fdi.ucm.es/?q=node/639).
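For readers unfamiliar with the agreement measure, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance from each annotator's label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal plain-Python illustration of that formula. The function name and the two toy label lists are assumptions made up for this example; they are not data from the corpus and do not reproduce the reported scores.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy annotations (invented for illustration only).
annotator_a = ["positive", "positive", "negative", "neutral", "positive", "negative"]
annotator_b = ["positive", "negative", "negative", "neutral", "positive", "negative"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on five of six items, but the chance-corrected score is lower than the raw agreement because some of that agreement would be expected by chance alone.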

Conclusion: Specifying the confidence degree of SNP-phenotype associations reported in articles helps identify the strength of those associations, which could in turn assist genomics scientists in determining phenotypic plasticity and the importance of environmental factors. Moreover, our first experiments with the corpus show that linguistic-based confidence, alongside other non-linguistic features, can be used to estimate the strength of the observed SNP-phenotype associations.

Trial registration: Not Applicable.

Keywords: Degree of confidence; Modality; Negation; Phenotype; Relation extraction; SNP.

Figures

Fig. 1
A sample sentence in the corpus with a negation cue and scope
Fig. 2
A sample sentence with three modality markers
Fig. 3
Different steps for producing the SNPPhenA corpus
Fig. 4
A sample of SNP and phenotype named entity recognition in the corpus
Fig. 5
A sample of two annotated associations between two SNPs and a phenotype in the SNPPhenA corpus
Fig. 6
Samples of positive association candidates between two highlighted SNPs and a phenotype
Fig. 7
Samples of negative association candidates between six highlighted SNPs and a phenotype
Fig. 8
A sample of a neutral association candidate between the highlighted entities
Fig. 9
A sample of a neutral association candidate with a negation cue
Fig. 10
A sample of a strong association, stated with a strong degree of confidence
Fig. 11
A sample of a weak association, stated with a weak degree of confidence
Fig. 12
A sample of a moderate association, stated with a moderate degree of confidence
Fig. 13
A sample of a negated sentence with a negation cue and scope

