J Biomed Semantics. 2016 Jun 13;7:39.
doi: 10.1186/s13326-016-0067-z.

FALDO: a semantic standard for describing the location of nucleotide and protein feature annotation

Jerven T Bolleman et al. J Biomed Semantics. 2016.

Abstract

Background: Nucleotide and protein sequence feature annotations are essential to understanding biology at the genomic, transcriptomic, and proteomic levels. For those using Semantic Web technologies to query biological annotations, however, there was no standard that described this potentially complex location information as subject-predicate-object triples.

Description: We have developed an ontology, the Feature Annotation Location Description Ontology (FALDO), to describe the positions of annotated features on linear and circular sequences. FALDO can be used to describe nucleotide features in sequence records, protein annotations, and glycan binding sites, among other features in the coordinate systems of the aforementioned "omics" areas. Representing sequence positions in one data format that is independent of file formats allows us to integrate sequence data from multiple sources and data types. The genome browser JBrowse is used to demonstrate accessing multiple SPARQL endpoints to display genomic feature annotations, as well as protein annotations from UniProt mapped to genomic locations.

Conclusions: Our ontology allows users to uniformly describe, and potentially merge, sequence annotations from multiple sources. Data sources using FALDO can prospectively be retrieved using federated SPARQL queries against public SPARQL endpoints and/or local private triple stores.

Keywords: Annotation; Data integration; RDF; SPARQL; Semantic Web; Sequence feature; Sequence ontology; Standardisation.
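To make the triple-based description in the abstract concrete, the following minimal Python sketch builds a Turtle snippet for one feature location. The class and property names (`faldo:Region`, `faldo:begin`, `faldo:end`, `faldo:ExactPosition`, `faldo:position`, `faldo:reference`, and the strand classes) are FALDO terms; the function name `faldo_region_turtle`, the `example.org` IRIs, and the blank-node layout are illustrative assumptions, not taken from the paper.

```python
FALDO_NS = "http://biohackathon.org/resource/faldo#"


def faldo_region_turtle(feature_iri, reference_iri, begin, end, reverse=False):
    """Return a Turtle snippet locating one feature with FALDO terms.

    Hypothetical helper for illustration only. The caller chooses which
    coordinate is 'begin' and which is 'end'; this sketch does not reorder them.
    """
    strand = "faldo:ReverseStrandPosition" if reverse else "faldo:ForwardStrandPosition"

    def position(value):
        # Each position is an exact coordinate on a named reference sequence.
        return (
            f"[ a faldo:ExactPosition, {strand} ;\n"
            f"        faldo:position {int(value)} ;\n"
            f"        faldo:reference <{reference_iri}> ]"
        )

    return (
        f"@prefix faldo: <{FALDO_NS}> .\n\n"
        f"<{feature_iri}> faldo:location [\n"
        f"    a faldo:Region ;\n"
        f"    faldo:begin {position(begin)} ;\n"
        f"    faldo:end {position(end)}\n"
        f"] .\n"
    )


if __name__ == "__main__":
    # Example IRIs are placeholders, not real identifiers.
    print(faldo_region_turtle(
        "http://example.org/feature/1",
        "http://example.org/sequence/chr1",
        1050,
        2080,
    ))
```

Because every position carries its own reference sequence and strand, the resulting triples can be interpreted without knowing which file format the annotation originally came from.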


Figures

Fig. 1
The classes and object properties used in FALDO
Fig. 2
Assorted conventions for regions, start, end, and strands. This figure shows two hypothetical features on a DNA sequence (labeled chr1), on either the forward strand (orange) or reverse strand (blue). Using the INSDC location string notation, these regions are “1050..2080” and “complement(1050..2080)” respectively if implicitly given in terms of the reference chr1. Using the GTF/GFF3 family of formats, regardless of the strand these two locations are described with start = 1050 and end = 2080, and in general, start ≤ end. Biologically speaking, in terms of transcription, the start of a genomic feature is strand dependent. For the forward strand feature (orange), the start is 1050, while the reverse strand feature (blue) starts from 2080.
Fig. 3
OWL2 property chain axiom to infer that all positions described in an INSDC record are relative to the main sequence of the record (in RDF turtle syntax, prefixes omitted)
Fig. 4
A SPARQL query to add all faldo:reference properties to faldo:positions described from an insdc:record
Fig. 5
JBrowse showing features, whose location is encoded using FALDO, selected via SPARQL (at e.g. http://togogenome.org/gene/1016998:SPAB_00296)
Fig. 6
Excerpt from UniProt entry Q6Q250 showing the position of an active site and a signal peptide in both the UniProt flat-file format and FALDO
Fig. 7
DDBJ record associated with UniProt Q6Q250 showing the related CDS sequence, with coding region outside of the known deposited mRNA sequence
Fig. 8
Using FALDO in Turtle [29] syntax to describe the location of a gene feature cheY at complement(NC_000913.2:1965072..1965461) in the INSDC record U00096.3
Fig. 9
Partial example of using FALDO in JSON-LD [30] syntax to describe the CDS “Protein II” at join(6006..6407,1..831) on J02448. Notice that this is given as a single location rather than being artificially split in two as in the INSDC join(...) notation
Fig. 10
FALDO representation of the HindIII restriction enzyme cleavage site with sticky ends
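
The conventions contrasted in the Fig. 2 caption can be sketched in a few lines of Python. This is a minimal illustration, assuming only simple INSDC location strings of the form `a..b` or `complement(a..b)`; it does not handle `join(...)`, accession prefixes, or fuzzy positions. The names `Location` and `parse_insdc_simple` are hypothetical and not part of any library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    low: int        # smaller coordinate on the reference sequence
    high: int       # larger coordinate on the reference sequence
    reverse: bool   # True when the feature lies on the reverse strand

    @property
    def gff_start_end(self):
        # GTF/GFF3 style: start <= end regardless of strand.
        return (self.low, self.high)

    @property
    def biological_start(self):
        # Transcription begins at the high coordinate on the reverse strand.
        return self.high if self.reverse else self.low


def parse_insdc_simple(text):
    """Parse 'a..b' or 'complement(a..b)' into a Location (sketch only)."""
    match = re.fullmatch(r"(complement\()?(\d+)\.\.(\d+)(\))?", text.strip())
    # Reject unparseable input and unbalanced 'complement(' parentheses.
    if match is None or bool(match.group(1)) != bool(match.group(4)):
        raise ValueError(f"unsupported location string: {text!r}")
    low, high = int(match.group(2)), int(match.group(3))
    if low > high:
        raise ValueError(f"expected low <= high in {text!r}")
    return Location(low, high, reverse=bool(match.group(1)))


if __name__ == "__main__":
    for s in ("1050..2080", "complement(1050..2080)"):
        loc = parse_insdc_simple(s)
        print(s, loc.gff_start_end, loc.biological_start)
```

Both location strings yield the same GFF-style pair, while the strand-aware start differs, which is the ambiguity the caption describes.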

