2011 Dec 30;12:494.
doi: 10.1186/1471-2105-12-494.

Identifying elemental genomic track types and representing them uniformly

Sveinung Gundersen et al. BMC Bioinformatics.

Abstract

Background: With the recent advances and availability of various high-throughput sequencing technologies, data on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization of DNA, are rapidly being generated in an increasing number of laboratories. The variation in biological context, and the increasingly dispersed mode of data generation, imply a need for precise, interoperable and flexible representations of genomic features through formats that are easy to parse. A host of alternative formats are currently available and in use, complicating analysis and tool development. The issue of whether and how the multitude of formats reflects varying underlying characteristics of data has to our knowledge not previously been systematically treated.

Results: Here we identify intrinsic distinctions between genomic features, and argue that these distinctions warrant a certain variation in how features are represented as genomic tracks. Four core informational properties of tracks are discussed: gaps, lengths, values and interconnections. From these we delineate fifteen generic track types. Based on the track type distinctions, we characterize major existing representational formats and find that no single format adequately supports all track types. We also find that, unlike the XML formats, none of the existing tabular formats is conveniently extensible to support all track types. We thus propose two unified formats for track data: an improved XML format, BioXSD 1.1, and a new tabular format, GTrack 1.0.

Conclusions: The defined track types are shown to capture relevant distinctions between genomic annotation tracks, resulting in varying representational needs and analysis possibilities. The proposed formats, GTrack 1.0 and BioXSD 1.1, cater to the identified track distinctions and emphasize preciseness, flexibility and parsing convenience.

Figures

Figure 1
Illustration of the geometric properties of the fifteen track types. The base line is a genome, or a sequence, on which the tracks are defined. Vertical lines represent positions, while horizontal lines represent the lengths of the track elements. Gaps are thus illustrated by any empty areas between the elements. Values are represented by the height of the vertical lines. Interconnections are represented by arrows, the thickness of which corresponds to the weight of the edge.
Figure 2
Four-dimensional matrix mapping the relations of the fifteen track types. Each dimension represents the exclusion (0) or inclusion (1) of one of the four core informational properties: gaps, lengths, values and interconnections. The track type abbreviations in the top-left box are: Genome Partition (GP), Points (P) and Segments (S); in the bottom-left box: Function (F), Step Function (SF), Valued Points (VP) and Valued Segments (VS); in the top-right box: Linked Base Pairs (LBP), Linked Genome Partition (LGP), Linked Points (LP) and Linked Segments (LS); and in the bottom-right box: Linked Function (LF), Linked Step Function (LSF), Linked Valued Points (LVP) and Linked Valued Segments (LVS). The track types with white background (with gaps) are the sparse track types, while the ones with grey background (without gaps) are the dense track types. See Figure 1 for a geometric illustration of the track types.
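
The count of fifteen follows from the four binary properties: of the sixteen possible combinations, only the one with no property set is not named as a track type. The Python sketch below enumerates the combinations under that reading. The function name `track_type_name` and the assignment of type names to property combinations are assumptions inferred from the names in the caption, not code or a table taken from the paper.

```python
from itertools import product

# The four core informational properties named in the abstract.
PROPERTIES = ("gaps", "lengths", "values", "interconnections")


def track_type_name(gaps, lengths, values, interconnections):
    """Name the track type for one property combination (hypothetical helper).

    Returns None for the combination with no property set, which the
    caption does not list as a track type.
    """
    if not (gaps or lengths or values or interconnections):
        return None
    if gaps:
        # Sparse types: elements are points or segments, optionally valued.
        base = "Segments" if lengths else "Points"
        if values:
            base = "Valued " + base
    elif lengths:
        # Dense types with lengths: the elements tile the genome.
        base = "Step Function" if values else "Genome Partition"
    else:
        # Dense types without lengths: one element per base pair.
        base = "Function" if values else "Base Pairs"
    return "Linked " + base if interconnections else base


# Map each (gaps, lengths, values, interconnections) tuple to its name.
TRACK_TYPES = {
    combo: track_type_name(*combo)
    for combo in product((False, True), repeat=len(PROPERTIES))
    if track_type_name(*combo) is not None
}

# Sparse types have gaps; dense types do not.
sparse = sorted(name for combo, name in TRACK_TYPES.items() if combo[0])
dense = sorted(name for combo, name in TRACK_TYPES.items() if not combo[0])
print(len(TRACK_TYPES), len(sparse), len(dense))  # prints "15 8 7"
```

Under this reading, the eight sparse types correspond to the white boxes and the seven dense types to the grey boxes of the figure.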
Figure 3
Overview of three common tabular formats. A) Generic Feature Format (GFF). The example file is a reduced version of the main example of the GFF version 3 specification [2]. B) Browser Extensible Data format (BED). The example file is fetched from the specification of the format at UCSC [4]. C) Wiggle Track Format (WIG) [8]. The example files show the two subformats variableStep and fixedStep. The track elements in the variableStep file cover single base pairs (span = 1, the default) and contain sparse data. For the fixedStep file, the step attribute is equal to the span attribute. The fixedStep file thus contains dense data. Figure 4 shows GTrack conversions of these example files.
Figure 4
GTrack example files. A) GTrack version of the GFF file in Figure 3A. GTrack conversions of GFF vary according to the set of attributes present in the GFF file. The column selected as the main value may also be changed. B1 and B2) Two possible GTrack conversions of the BED file in Figure 3B. In the direct variant (B1) only a "track type" header line and a column specification line are added. The exon positioning will in this case not be understood by a general GTrack parser. The linked variant (B2) expands the exons into subsegments that link to their parent gene segment. C1 and C2) GTrack conversions of the WIG files in Figure 3C. The variableStep file has sparse track elements covering single base pairs, with associated values. The track is thus of type valued points. The fixedStep file contains dense data, with the same values for a series of consecutive base pairs. The track is thus of type step function. Note that in the last example, the end values are used for positioning. D) Example GTrack file of type linked genome partition. Here two graphs are defined, one directed and one undirected. To change the active graph, the edges column in the column specification line needs to be changed, in addition to the "undirected edges" header line. The example GTrack files are available at [20]. BioXSD 1.1 versions of the examples are available as follows: A [21], B1 & B2 [22], C1 [23], C2 [24], and D [25].
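
The reasoning in C1 and C2 can be sketched as a small classifier over WIG declaration lines. This is an illustrative Python sketch, not part of GTrack or any existing parser: the function names `parse_wig_declaration` and `wig_track_type` are hypothetical, the `key=value` declaration syntax is assumed from the UCSC WIG format, and only the two cases described in the caption are classified. Every other combination deliberately returns `None` rather than guessing a track type.

```python
def parse_wig_declaration(line):
    """Split a WIG declaration line into its mode and key=value attributes.

    Hypothetical helper; assumes whitespace-separated key=value pairs.
    """
    mode, *pairs = line.split()
    attrs = dict(pair.split("=", 1) for pair in pairs)
    return mode, attrs


def wig_track_type(line):
    """Return the track type for the two cases described in the caption.

    Returns None for any declaration the caption does not cover.
    """
    mode, attrs = parse_wig_declaration(line)
    span = int(attrs.get("span", 1))  # span defaults to 1 in WIG
    if mode == "variableStep" and span == 1:
        # Sparse elements covering single base pairs, each with a value.
        return "valued points"
    if mode == "fixedStep" and "step" in attrs and int(attrs["step"]) == span:
        # Step equal to span: consecutive elements leave no gaps.
        return "step function"
    return None


# Made-up declaration lines for illustration only.
print(wig_track_type("variableStep chrom=chr19"))
print(wig_track_type("fixedStep chrom=chr19 start=1001 step=300 span=300"))
```

The classification is made per declaration line. A file with several declaration blocks could still leave gaps between blocks, which this sketch does not check.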
Figure 5
GTrack subtype example. A) An ad hoc GTrack subtype specification based on the example GTrack file in Figure 4A, which is a conversion from the GFF file in Figure 3A. This and other GTrack subtypes are available from the GTrack website [20]. B) A minimal GTrack header, parsable by fully compliant GTrack parsers. Note that the "Expand GTrack headers" tool, available from the GTrack website [20], can be used to expand the headers of GTrack files that use subtypes, so that such files can be used in simpler parsers that do not support the subtype functionality.

References

    1. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–293. doi: 10.1126/science.1181369.
    2. Generic Feature Format version 3. http://www.sequenceontology.org/gff3.shtml
    3. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Res. 2002;12(6):996–1006.
    4. UCSC genome browser data formats. http://genome.ucsc.edu/FAQ/FAQformat.html
    5. Definition of Gene Transfer Format. http://mblab.wustl.edu/GTF22.html
