J Biomed Semantics. 2016 Jun 13;7:39.

doi: 10.1186/s13326-016-0067-z.

FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation

Affiliations

¹ Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, Geneva 4, 1211, Switzerland. jerven.bolleman@sib.swiss.
² Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, 94720, CA, US.
³ CeRSA, Parco Tecnologico Padano, Lodi, 26900, Italy.
⁴ CODAMONO, 5-121 Marion Street, Toronto, M6R 1E6, Ontario, Canada.
⁵ Stanford Center for Biomedical Informatics Research, 1265 Welch Road, Room X223, Stanford, 94305-5479, CA, US.
⁶ Integrative Biology Program, Istituto Nazionale Genetica Molecolare, Milan, Italy.
⁷ University of California, Berkeley, Berkeley, CA, USA.
⁸ Department of Computer Science, Aberystwyth, SY23 3DB, UK.
⁹ Center for Information Biology, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka, 411-08540, Japan.
¹⁰ Database Center for Life Science, Research Organization of Information and Systems, 2-11-16, Yayoi, Bunkyo-ku, Tokyo, 113-0032, Japan.
¹¹ The James Hutton Institute, Dundee, DD2 5DA, UK.

PMID: 27296299
PMCID: PMC4907002
DOI: 10.1186/s13326-016-0067-z

Jerven T Bolleman et al. J Biomed Semantics. 2016.

Abstract

Background: Nucleotide and protein sequence feature annotations are essential to understand biology at the genomic, transcriptomic, and proteomic levels. For querying biological annotations with Semantic Web technologies, there was no standard that described this potentially complex location information as subject-predicate-object triples.

Description: We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in coordinate systems of the aforementioned "omics" areas. Using the same data format to represent sequence positions that are independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations.
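As a minimal sketch of what such a location description can look like, the Python function below emits Turtle text for one feature region. The term names (faldo:Region, faldo:begin, faldo:end, faldo:ExactPosition, faldo:position, faldo:reference and the strand position classes) come from the FALDO vocabulary; the function name `faldo_region_turtle` and the example.org IRIs are hypothetical and exist only for illustration. The coordinates are the ones used in Fig. 2.

```python
# Namespace of the FALDO vocabulary.
FALDO = "http://biohackathon.org/resource/faldo#"


def faldo_region_turtle(feature_iri, reference_iri, begin, end, reverse=False):
    """Return Turtle text describing one feature location with FALDO terms.

    `begin` and `end` are written out exactly as given; the caller decides
    which coordinate counts as the begin on the reverse strand.
    """
    strand = (
        "faldo:ReverseStrandPosition" if reverse else "faldo:ForwardStrandPosition"
    )

    def position(value):
        # Each boundary is a blank node: an exact position on a reference sequence.
        return (
            f"[ a faldo:ExactPosition, {strand} ;\n"
            f"        faldo:position {int(value)} ;\n"
            f"        faldo:reference <{reference_iri}> ]"
        )

    return (
        f"@prefix faldo: <{FALDO}> .\n\n"
        f"<{feature_iri}> faldo:location [\n"
        f"    a faldo:Region ;\n"
        f"    faldo:begin {position(begin)} ;\n"
        f"    faldo:end {position(end)}\n"
        f"] .\n"
    )


# Hypothetical IRIs; coordinates taken from the forward-strand feature in Fig. 2.
example = faldo_region_turtle(
    "http://example.org/feature/1",
    "http://example.org/sequence/chr1",
    begin=1050,
    end=2080,
)
print(example)
```

Because every position carries its own reference, the same pattern should apply to protein or transcript coordinates by pointing faldo:reference at a different sequence.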

Conclusions: Our ontology allows users to uniformly describe - and potentially merge - sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores.

Keywords: Annotation; Data integration; RDF; SPARQL; Semantic Web; Sequence feature; Sequence ontology; Standardisation.

Figures

**Fig. 1**
The classes and object properties used in FALDO

**Fig. 2**
Assorted conventions for regions, start, end, and strands. This figure shows two hypothetical features on a DNA sequence (labeled chr1), on either the forward strand (*orange*) or reverse strand (*blue*). Using the INSDC location string notation, these regions are “1050..2080” and “complement(1050..2080)” respectively if implicitly given in terms of the reference chr1. Using the GTF/GFF3 family of formats, regardless of the strand these two locations are described with start=1050 and end=2080, and in general, start ≤ end. Biologically speaking, in terms of transcription, the start of a genomic feature is strand dependent. For the forward strand feature (*orange*), the start is 1050 while the reverse strand feature (*blue*) starts from 2080

**Fig. 3**
OWL2 property chain axiom to infer that all positions described in an INSDC record are relative to the main sequence of the record (in RDF turtle syntax, prefixes omitted)

**Fig. 4**
A SPARQL query to add all faldo:reference properties to faldo:positions described from an insdc:record

**Fig. 5**
JBrowse showing features, whose location is encoded using FALDO, selected via SPARQL (at e.g. http://togogenome.org/gene/1016998:SPAB_00296)

**Fig. 6**
Excerpt from UniProt entry Q6Q250 showing the position of an active site and a signal peptide in both the UniProt flat-file format and FALDO

**Fig. 7**
DDBJ record associated with UniProt Q6Q250 showing the related CDS sequence, with coding region outside of the known deposited mRNA sequence

**Fig. 8**
Using FALDO in Turtle [29] syntax to describe the location of a gene feature *cheY* at complement(NC_000913.2:1965072..1965461) in the INSDC record U00096.3

**Fig. 9**
Partial example of using FALDO in JSON-LD [30] syntax to describe the CDS “Protein II” at join(6006..6407,1..831) on J02448. Notice that this is given as a single location rather than being artificially split in two as in the INSDC join(...) notation

**Fig. 10**
FALDO representation of the HindIII restriction enzyme cleavage site with sticky ends
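The strand conventions in Fig. 2 can be sketched in Python as follows. This is an illustration only: `Location` and `parse_insdc` are hypothetical names, the parser accepts just the two simple INSDC forms quoted in the caption (not join(...), fuzzy positions, or locations qualified with an accession), and the `begin`/`end` properties assume that the begin of a feature is its biological start as described in that caption.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    """A simple location in GFF3-style coordinates (low <= high) plus strand."""

    low: int
    high: int
    reverse: bool

    @property
    def begin(self) -> int:
        # Biological start: the high coordinate on the reverse strand.
        return self.high if self.reverse else self.low

    @property
    def end(self) -> int:
        # Biological end: the low coordinate on the reverse strand.
        return self.low if self.reverse else self.high


# Matches "1050..2080" and "complement(1050..2080)" only.
_SIMPLE = re.compile(r"^(complement\()?(\d+)\.\.(\d+)(\))?$")


def parse_insdc(text: str) -> Location:
    """Parse a simple INSDC location string into a Location."""
    match = _SIMPLE.match(text.strip())
    # Reject unbalanced forms such as "complement(1..5" or "1..5)".
    if match is None or bool(match.group(1)) != bool(match.group(4)):
        raise ValueError(f"unsupported location string: {text!r}")
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"expected start <= end in {text!r}")
    return Location(low, high, reverse=bool(match.group(1)))


forward = parse_insdc("1050..2080")
backward = parse_insdc("complement(1050..2080)")
print(forward.begin, forward.end)
print(backward.begin, backward.end)
```

Both strings yield the same low and high coordinates, as in the GTF/GFF3 convention, while the strand-aware begin and end differ between the two features.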

References

1. Sanger F. The terminal peptides of insulin. Biochem J. 1949;45(5):563–74. doi: 10.1042/bj0450563.
2. Dayhoff MO, Eck RV, Foundation NBR. Atlas of Protein Sequence and Structure. Silver Spring (Maryland): National Biomedical Research Foundation; 1965.
3. Reese MG, Moore B, Batchelor C, Salas F, Cunningham F, Marth GT, Stein L, Flicek P, Yandell M, Eilbeck K. A standard variation file format for human genome sequences. Genome Biol. 2010;11(8):R88. doi: 10.1186/gb-2010-11-8-r88.
4. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, Lehväslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12(10):1611–8. doi: 10.1101/gr.361602.
5. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJL. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. doi: 10.1093/bioinformatics/btp163.

Grants and funding

R24 OD011883/OD/NIH HHS/United States
