BMC Bioinformatics. 2011;12 Suppl 4(Suppl 4):S4.
doi: 10.1186/1471-2105-12-S4-S4. Epub 2011 Jul 5.

Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers

Philippe E Thomas et al. BMC Bioinformatics. 2011.

Abstract

Background: Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed.

Results: This article describes a workflow for the normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for the unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html.
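
For readers unfamiliar with the two evaluation measures, the following minimal sketch shows how precision and recall are conventionally computed from counts of correct, spurious, and missed normalizations. The function name and the counts are hypothetical and chosen only for illustration; they are not the counts behind the figures reported above.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for a normalization run.

    true_positives:  mentions mapped to the correct identifier
    false_positives: mentions mapped to a wrong identifier
    false_negatives: mentions that should have been mapped but were not
    """
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected if expected else 0.0
    return precision, recall


# Hypothetical counts, for illustration only (not taken from the article).
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.3f} recall={r:.3f}")
```

A procedure tuned for unambiguous mappings, as described above, would be expected to trade recall for precision: it declines to assign an identifier when the evidence is ambiguous.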

Conclusions: Comparable approaches usually focus on variations described at the protein sequence level and neglect the problems posed by other SNP mentions. The results presented here indicate that normalizing SNPs described at the DNA level is more difficult than normalizing SNPs described at the protein level. The challenges associated with normalization are illustrated with ambiguities and errors that occur in this corpus.

Figures

Figure 1
Number of articles annotated with MeSH Term: “polymorphism, single nucleotide”.
Figure 2
Representative workflow for extracting SNP information from unstructured text.
Figure 3
Illustration of the first recommendation for a common mutation nomenclature. Example annotation for parts of the gene MECP2 (GenBank entry: NG_007107.1) using the first suggestions for a mutation nomenclature. Exonic sequence is labeled green, intronic regions are labeled brown, and the surrounding untranslated regions are labeled blue. In the first suggestions for a common nomenclature, the 5'-most position of the first exon is the start position, and adjacent bases are numbered consecutively from there. Variations occurring in intronic regions receive two numbers: the first gives the location of the closest exon and the second gives the distance to this exon. As shown in this picture, intronic positions are usually described in relation to the closer exon. The underlined ATG marks the start codon, whose leading adenine was later proposed as the common start position. Using these recommendations the two SNPs are described as 2C→A and 252+2T→C
Figure 4
Illustration of the recommendations using the genomic sequence. The example annotations for parts of the gene MECP2 (NG_007107.1) follow the genomic DNA numbering concept. Numbering starts at the beginning of the reference sequence used, and subsequent bases are numbered consecutively. Using these recommendations the two SNPs are described as 4943C→A and 10490T→C
Figure 5
Illustration of the “intervening sequence” (IVS) concept in human mutation nomenclature. The example annotations for parts of the gene MECP2 (NG_007107.1) follow the IVS concept. In this nomenclature variant, the adenine of the start codon is used as the start position. Variations located in intronic regions start with the abbreviation “IVS” followed by the number of the intron in which the variation is located. The number that follows gives the distance to the next intron/exon boundary. Using these recommendations the two SNPs are described as –225C→A and IVS2+2T→C
Figure 6
Illustration of the latest recommendations for human mutation nomenclature. This most recent nomenclature discards the IVS concept for intronic variations. Instead, the concept introduced earlier using two numbers is again recommended. Variations occurring in the 3’UTR are labeled with a preceding asterisk and numbering starts at the beginning of the UTR. Using these recommendations the two SNPs are described as c.–225C>A and c.26+2T>C
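
The example spellings given in the captions of Figures 3 to 6 can be told apart mechanically. The following is a minimal sketch, not the procedure used in the article: the regular expression, the function name `parse_snp`, and the returned field names are all assumptions made for illustration, and the pattern covers only single-base substitutions written in the styles quoted in those captions.

```python
import re

# Hypothetical pattern covering the substitution styles quoted in Figures 3-6.
SNP_PATTERN = re.compile(
    r"""
    ^(?:(?P<prefix>[cg])\.)?        # optional sequence-type prefix, e.g. "c."
    (?:IVS(?P<intron>\d+))?         # optional intron label, e.g. "IVS2"
    (?P<position>[-\u2013*]?\d+)?   # position; may start with minus, en dash, or asterisk
    (?P<offset>[+\-\u2013]\d+)?     # optional intronic offset, e.g. "+2"
    (?P<ref>[ACGT])                 # wild-type allele
    (?:\u2192|>)                    # right arrow or ">"
    (?P<alt>[ACGT])$                # variant allele
    """,
    re.VERBOSE,
)


def _to_int(text):
    """Convert a signed number that may use an en dash as its minus sign."""
    return int(text.replace("\u2013", "-")) if text else None


def parse_snp(mention):
    """Split a substitution mention into its parts, or return None if unrecognized."""
    match = SNP_PATTERN.match(mention.strip())
    if match is None:
        return None
    position = match.group("position")
    return {
        "prefix": match.group("prefix"),
        "intron": _to_int(match.group("intron")),
        "utr3": bool(position) and position.startswith("*"),
        "position": _to_int(position.lstrip("*")) if position else None,
        "offset": _to_int(match.group("offset")),
        "ref": match.group("ref"),
        "alt": match.group("alt"),
    }


for text in ["2C\u2192A", "252+2T\u2192C", "4943C\u2192A",
             "\u2013225C\u2192A", "IVS2+2T\u2192C", "c.\u2013225C>A", "c.26+2T>C"]:
    print(text, parse_snp(text))
```

Parsing alone does not resolve which numbering scheme a bare mention such as 2C→A uses; that ambiguity is one reason the same variation can appear under several spellings.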
Figure 7
Proposed spellings for mutations over the last years.
Figure 8
Example of a paragraph annotated by a machine learning tool. Prior to normalization, all sub-entities (alleles and location) have to be combined into tuples of entities. In this example, the location 261 can be wrongly associated with the two closest states, G and C. This can be avoided by penalizing punctuation marks between two entities, such as the comma in this case.
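
The punctuation penalty described for Figure 8 can be sketched as a distance function over tokens. This is an illustrative sketch under stated assumptions rather than the article's implementation: the token sequence, the penalty value of 5, and the function names are hypothetical.

```python
import string

PUNCTUATION = set(string.punctuation)


def penalized_distance(tokens, i, j, penalty=5):
    """Token distance between positions i and j, plus a penalty for each
    punctuation token lying strictly between them."""
    lo, hi = sorted((i, j))
    crossings = sum(tok in PUNCTUATION for tok in tokens[lo + 1:hi])
    return (hi - lo) + penalty * crossings


def closest_alleles(tokens, location_idx, allele_idxs, penalty=5, k=2):
    """Return the indices of the k alleles closest to a location, in text order."""
    ranked = sorted(
        allele_idxs,
        key=lambda a: penalized_distance(tokens, location_idx, a, penalty),
    )
    return sorted(ranked[:k])


# Hypothetical token sequence; the comma separates two variation mentions.
tokens = ["143", "A", "to", "G", ",", "261", "C", "to", "T"]
alleles = [1, 3, 6, 8]   # indices of A, G, C, T
location = 5             # index of "261"

plain = closest_alleles(tokens, location, alleles, penalty=0)
penalized = closest_alleles(tokens, location, alleles, penalty=5)
print([tokens[i] for i in plain])      # alleles chosen by plain distance
print([tokens[i] for i in penalized])  # alleles chosen with the comma penalized
```

With no penalty, the location 261 is paired with G and C because they are the nearest alleles; once the comma is penalized, it is paired with C and T instead.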
Figure 9
Flow chart of the normalization procedure.
