BMC Bioinformatics. 2011;12 Suppl 4(Suppl 4):S4.

doi: 10.1186/1471-2105-12-S4-S4. Epub 2011 Jul 5.

Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers

Philippe E Thomas¹, Roman Klinger, Laura I Furlong, Martin Hofmann-Apitius, Christoph M Friedrich

Affiliation

¹ Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Department of Bioinformatics, Schloss Birlinghoven, 53754 Sankt Augustin, Germany. thomas@informatik.hu-berlin.de

PMID: 21992066
PMCID: PMC3194196
DOI: 10.1186/1471-2105-12-S4-S4

Philippe E Thomas et al. BMC Bioinformatics. 2011.

Abstract

Background: Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations in natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and their associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures that map textual mentions of variations to database identifiers need to be developed.

Results: This article describes a workflow for the normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for the unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html.
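To relate the two reported figures to each other, the following minimal sketch computes precision and recall from match counts and combines them into an F1 score. The function names are illustrative and not taken from the article, and the abstract does not state an F1 score; the value printed here is only the harmonic mean of the reported 98.1% and 67.5%.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract; the F1 value is derived here, not reported.
reported_precision = 0.981
reported_recall = 0.675
derived_f1 = f1_score(reported_precision, reported_recall)
print(f"{derived_f1:.3f}")  # prints 0.800
```

The gap between the two figures shows a procedure tuned for reliable assignments: the identifiers it does assign are almost always correct, while roughly a third of the correct associations are missed.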

Conclusions: Comparable approaches usually focus on variations described at the protein sequence level and neglect the problems posed by other SNP mentions. The results presented here indicate that normalizing SNPs described at the DNA level is more difficult than normalizing SNPs described at the protein level. The challenges associated with normalization are illustrated with ambiguities and errors that occur in this corpus.

Figures

**Figure 1**
Number of articles annotated with MeSH Term: “polymorphism, single nucleotide”.

**Figure 2**
Representative workflow for extracting SNP information from unstructured text.

**Figure 3**
**Illustration of the first recommendation for a common mutation nomenclature.** Example annotation for parts of the gene MECP2 (GenBank entry: NG_007107.1) using the first suggestions for a mutation nomenclature. Exonic sequence is labeled green, intronic regions are labeled brown, and the surrounding untranslated regions are labeled blue. In the first suggestions for a common nomenclature, the most 5’ base of the first exon is the start position, and adjacent bases are numbered consecutively. Variations occurring in intronic regions obtain two numbers: the first gives the location of the closest exon and the second the distance to this exon. As shown in this picture, intronic positions are usually described in relation to the closer exon. The underlined ATG marks the start codon, whose leading adenine was later proposed as the common start position. Using these recommendations, the two SNPs are described as 2C→A and 252+2T→C.

**Figure 4**
**Illustration of the recommendations using the genomic sequence.** The example annotations for parts of the gene MECP2 (NG_007107.1) follow the genomic DNA numbering concept. Numbering starts at the beginning of the reference sequence used, and subsequent bases are numbered consecutively. Using these recommendations, the two SNPs are described as 4943C→A and 10490T→C.

**Figure 5**
**Illustration of the “intervening sequence” concept in human mutation nomenclature.** The example annotations for parts of the gene MECP2 (NG_007107.1) follow the IVS concept. In this nomenclature variant, the adenine of the start codon is used as the start position. Variations located in intronic regions start with the abbreviation “IVS” followed by the number of the intron in which the variation is located. The number that follows gives the distance to the next intron/exon boundary. Using these recommendations, the two SNPs are described as –225C→A and IVS2+2T→C.

**Figure 6**
**Illustration of the latest recommendations for human mutation nomenclature.** This most recent nomenclature discards the IVS concept for intronic variations. Instead, the concept introduced earlier using two numbers is again recommended. Variations occurring in the 3’UTR are labeled with a preceding asterisk and numbering starts at the beginning of the UTR. Using these recommendations the two SNPs are described as c.–225C>A and c.26+2T>C
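The notations quoted in the captions of Figures 3 to 6 can be split into their parts with a small pattern. The sketch below is illustrative only and is not the extraction method used in the article: the names `SnpMention` and `parse_snp` are hypothetical, only single-base substitutions in the spellings shown above are covered, and an en dash or minus sign is normalized to a hyphen and the arrow to `>` before matching.

```python
import re
from typing import NamedTuple, Optional


class SnpMention(NamedTuple):
    intron: Optional[int]    # intron number for IVS-style mentions, else None
    position: Optional[int]  # numeric position, else None
    in_3utr: bool            # True if the position carries a leading asterisk
    offset: int              # distance into the intron, 0 if absent
    ref: str                 # reference allele
    alt: str                 # variant allele


# Map en dash and minus sign to "-", and the arrow to ">".
_NORMALIZE = str.maketrans({"\u2013": "-", "\u2212": "-", "\u2192": ">"})

_PATTERN = re.compile(
    r"""
    ^(?:c\.)?                        # optional coding-DNA prefix
    (?:
        IVS(?P<intron>\d+)           # IVS style: intron number
      | (?P<utr3>\*)?(?P<pos>-?\d+)  # numeric position, asterisk marks the 3' UTR
    )
    (?P<offset>[+-]\d+)?             # optional distance into the intron
    (?P<ref>[ACGT])>(?P<alt>[ACGT])$
    """,
    re.VERBOSE,
)


def parse_snp(mention: str) -> Optional[SnpMention]:
    """Parse one substitution mention; return None if the spelling is not covered."""
    match = _PATTERN.match(mention.strip().translate(_NORMALIZE))
    if match is None:
        return None
    return SnpMention(
        intron=int(match["intron"]) if match["intron"] else None,
        position=int(match["pos"]) if match["pos"] else None,
        in_3utr=match["utr3"] is not None,
        offset=int(match["offset"]) if match["offset"] else 0,
        ref=match["ref"],
        alt=match["alt"],
    )


for text in ["2C\u2192A", "252+2T\u2192C", "IVS2+2T\u2192C", "c.\u2013225C>A", "c.26+2T>C"]:
    print(text, parse_snp(text))
```

Parsing alone does not resolve which numbering scheme a mention uses: 2C→A and c.–225C>A both parse cleanly, yet the captions present them as descriptions of the same SNP under different conventions.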

**Figure 7**
Proposed spellings for mutations over the last years.

**Figure 8**
**Example of a paragraph annotated by a machine learning tool.** Prior to normalization, all sub-entities (alleles and location) have to be combined into tuples of entities. In this example, the location 261 can be wrongly associated with the two closest states G and C. This can be avoided by penalizing punctuation marks between two entities, like the comma in this case.
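The punctuation penalty described in this caption can be sketched as a simple cost function over token positions. This is an assumption-laden illustration rather than the article's implementation: the token sequence is invented to reproduce the situation in the caption, and the penalty weight, `pairing_cost`, and `closest_alleles` are hypothetical.

```python
import string
from typing import List, Sequence

PUNCTUATION_PENALTY = 5  # hypothetical weight, not taken from the article


def pairing_cost(tokens: Sequence[str], i: int, j: int,
                 penalty: int = PUNCTUATION_PENALTY) -> int:
    """Token distance between two entities plus a penalty per punctuation token between them."""
    lo, hi = sorted((i, j))
    between = tokens[lo + 1:hi]
    n_punct = sum(
        1 for tok in between
        if tok and all(ch in string.punctuation for ch in tok)
    )
    return (hi - lo) + penalty * n_punct


def closest_alleles(tokens: Sequence[str], location_idx: int,
                    allele_idxs: Sequence[int], k: int = 2,
                    penalty: int = PUNCTUATION_PENALTY) -> List[int]:
    """Return the k allele indices with the lowest pairing cost, in text order."""
    ranked = sorted(
        allele_idxs,
        key=lambda a: pairing_cost(tokens, location_idx, a, penalty),
    )
    return sorted(ranked[:k])


# Invented example sentence: the location "261" sits at index 5.
tokens = ["A", "to", "C", "at", "position", "261", ",", "G", "allele", "carriers"]
alleles = [0, 2, 7]

naive = closest_alleles(tokens, 5, alleles, penalty=0)
penalized = closest_alleles(tokens, 5, alleles)
print([tokens[i] for i in naive])      # prints ['C', 'G']
print([tokens[i] for i in penalized])  # prints ['A', 'C']
```

Without the penalty, pure distance pairs the location with C and G; with it, the comma makes G more expensive than A, so the location is combined with A and C.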

**Figure 9**
Flow chart of the normalization procedure.
