PLoS One. 2020 May 26;15(5):e0232942.
doi: 10.1371/journal.pone.0232942. eCollection 2020.

Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review

Kelvin V Kredens et al. PLoS One. 2020.

Abstract

The recent decrease in the cost and time required to sequence and assemble complete genomes has created an increased demand for data storage. As a consequence, several strategies for compressing assembled biological data were created. Vertical compression tools implement strategies that take advantage of the high level of similarity between multiple assembled genomic sequences to achieve better compression results. However, current reviews on vertical compression do not compare the execution flow of each tool, which consists of preprocessing, transformation, and data-encoding phases. We performed a systematic literature review to identify and compare existing tools for the vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: to present a lossless vertical compression tool; to use the information contained in other sequences for compression; to be able to manipulate genomic sequences in FASTA format; and to require no prior knowledge. Although we extracted compression performance results, they were not compared, as the tools did not use a standardized evaluation protocol. Thus, we conclude that there is a lack of a standardized evaluation protocol to be applied by every tool.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. Flowchart of study identification and selection process.
Fig 2
Fig 2. Tools for VGDC.
Considering (A) the number of publications per year and (B) the number of tools per programming language. The total number of tools is greater than 32 because the HiRGC tool has versions in both C++ and Java.
Fig 3
Fig 3. Compression schemes.
(A) Referential pair-based. The target genome sequence is compressed using information extracted from the reference genome sequence. The goal of this process is to find shared segments. The final result is a map (Mapped Target Sequence) in which matches replace the shared segments, while unshared segments are written raw, as mismatches. In this example, a match is composed of two integers, the first being the position where the shared segment begins in the reference and the second representing the size of the match. (B) Dictionary collection-based. In this example, three genomes are compressed at the same time. First, the process searches for shared segments and stores them in a structure called a codebook. Then, to obtain the mapped genomes, it rewrites the sequences, replacing segments with the indexes of the tokens previously created. Token indexes and mismatch symbols compose the final mapping. In this example, we consider that, to be converted into a token, a shared segment must be larger than three bases. (C) Referential collection-based, single/multi-reference. In this example, sequences g1, g2, and g3 are compressed. In the first step (C), g2: Target Sequence 1 is compressed against g1: Reference Sequence, generating a mapping between them. The second step represents either single-reference (c1) or multi-reference (c2) behavior. In the single-reference case (c1), the process is similar to the previous step (C): sequence g3: Target Sequence 2 is compressed based on the same g1: Reference Sequence used in (C). In the multi-reference case (c2), however, the compression process tries to find shared segments using both g1: Reference Sequence and the sequences already compressed, in this case g2: Target Sequence 1. Thus, matches can point to any other sequence of the collection.
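The pair-based scheme in (A) can be sketched as a greedy longest-match search. This is a minimal illustration only, not the algorithm of any specific tool: the `("M", position, length)` / `("L", base)` record layout, the `MIN_MATCH` threshold, and the quadratic scan of the reference are all assumptions made for readability.

```python
# Hedged sketch of referential pair-based mapping (Fig 3A): shared
# segments of the target are replaced by (position, length) matches
# against the reference; unmatched bases are kept as literals.

MIN_MATCH = 4  # assumed threshold below which a segment is kept raw


def map_target(reference: str, target: str):
    """Greedily replace target segments with matches into the reference."""
    mapping = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # naive search: longest common prefix at every reference offset
        for j in range(len(reference)):
            k = 0
            while (i + k < len(target) and j + k < len(reference)
                   and target[i + k] == reference[j + k]):
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len >= MIN_MATCH:
            mapping.append(("M", best_pos, best_len))  # match
            i += best_len
        else:
            mapping.append(("L", target[i]))           # literal / mismatch
            i += 1
    return mapping
```

Decompression is the inverse walk over the map: each match copies `length` bases starting at `position` in the reference, and each literal is emitted as-is.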
Fig 4
Fig 4. Conceptual compression flow.
The conceptual view of the activities that can be executed by a VGDC tool to compress DNA sequences. 1) Preprocessing Phase: the initial phase, which selects the sequence that will be used as the reference and indexes it. Some tools do not have an automatic process to select references; in these cases, tools expect the reference to be informed manually. 2) Transformation Phase: the phase in which tools exploit characteristics inherent to DNA sequences that, because they all come from the same or correlated species, tend to share a significant part of their information. The first activity is to create the mapping, which will contain matches/mismatches or edit operations between a given target sequence and the reference. After that, tools can execute either Post-processing or Second-Order Mapping. 3) Encoding Phase: in this phase the input is the Mapped Sequences, which are encoded, i.e., written in binary form. The goal is to rewrite the instructions so as to reach the lowest possible entropy. In this phase, some tools opt to group data according to the distribution of the symbols to be encoded, because the efficiency of some entropy encoding methods is directly related to the number of distinct symbols as well as the probability distribution of each symbol. For example, mismatches can be encoded separately from matches, whose position elements can in turn be coded separately from their size elements. The final result is a binary file containing one or more compressed sequences.
Fig 5
Fig 5. Reference sequence (artificial vs. biological).
In this figure we illustrate the difference between using reference data based on a real genomic sequence (B) and an artificially generated sequence (A). The great advantage of using an artificial sequence is its high level of similarity to the to-be-compressed sequences, since its creation is based on the most frequent segments found within the collection of to-be-compressed sequences. However, the high computational cost of creating and storing artificial sequences must be considered before applying this approach.
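One simple way to build an artificial reference from "most frequent segments", as described above, is to count k-mers across the collection and concatenate the commonest ones. This is a toy sketch: the choice of `k`, the number of segments kept, and simple concatenation are all assumptions, not what any reviewed tool necessarily does.

```python
# Hedged sketch of artificial-reference construction (Fig 5A):
# concatenate the most frequent k-mers found across the collection
# of to-be-compressed sequences.
from collections import Counter


def artificial_reference(sequences, k=4, n_segments=3):
    """Build a reference from the n_segments most frequent k-mers."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return "".join(kmer for kmer, _ in counts.most_common(n_segments))
```

Even this naive version shows the trade-off the figure mentions: counting k-mers requires a pass over the whole collection, and the artificial sequence itself must be stored alongside the compressed data.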
Fig 6
Fig 6. Local search.
Illustrative examples of how the local search can be implemented. (A) After identifying a match or a mismatch, the pointer advances simultaneously in both the reference sequence and the target sequence, and then the search is initiated. (B) The search is limited to a search window that advances after every match or mismatch. (C) The reference sequence and the target sequence are divided into blocks, and the search for matches is initially restricted to such blocks. (D) The search can be made over the whole reference sequence, but penalties are given to candidate matches found at positions distant from the last identified match. With that, the tool prioritizes matches closest to the most recently identified one. (E) The search is always initiated right after the end of the last identified match.
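Strategy (B) above, the windowed search, can be sketched as follows. The window radius, the function name, and the `(position, length)` return shape are illustrative assumptions; real tools typically use hash- or index-based lookups rather than a linear scan.

```python
# Hedged sketch of a windowed local search (strategy B in Fig 6):
# instead of scanning the whole reference, matching is restricted to
# a window centred on the position of the last match found.

WINDOW = 16  # assumed search-window radius


def find_local_match(reference: str, target: str, t_pos: int, last_r_pos: int):
    """Longest match for target[t_pos:] searched only inside a window
    of the reference around last_r_pos; returns (ref_pos, length)."""
    lo = max(0, last_r_pos - WINDOW)
    hi = min(len(reference), last_r_pos + WINDOW)
    best = (-1, 0)
    for j in range(lo, hi):
        k = 0
        while (t_pos + k < len(target) and j + k < len(reference)
               and target[t_pos + k] == reference[j + k]):
            k += 1
        if k > best[1]:
            best = (j, k)
    return best
```

Restricting the search this way trades possibly missing a longer, distant match for a large reduction in search time, which is exactly the tension between strategies (B) and (D) in the figure.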
Fig 7
Fig 7. Second-order mapping.
In this example, first (A), two target sequences, Target Sequence 1 and Target Sequence 2, are parsed using the same reference sequence, Reference Sequence. As a result, two mappings are obtained, one for each sequence. In a second moment (B), a Second Order Mapping is executed. The inputs are the maps previously generated, and the tool, parsing Mapped Target Sequence 2 in relation to Mapped Target Sequence 1, identifies a succession of three shared elements, two matches and one literal, which generate a Second Order Match, SOM(g1, 1, 3). This SOM indicates that, in Target Sequence 2, starting from position 1 of sequence g1's mapping, the first three elements must be considered, in this case M(1, 8), M(10, 6) and L(T).
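The second-order idea can be sketched as a longest-run search over map elements instead of bases. This is an illustrative reimplementation, not the algorithm of any reviewed tool; note it uses 0-based positions, whereas the figure's SOM(g1, 1, 3) is 1-based, and `min_run` is an assumed threshold.

```python
# Hedged sketch of second-order mapping (Fig 7): runs of identical
# elements shared by two first-order mappings are replaced by a single
# ("SOM", source, start, count) pointer into the other mapping.


def second_order_map(map1, map2, source="g1", min_run=2):
    """Rewrite map2, replacing runs of >= min_run consecutive elements
    that also appear consecutively in map1 with a SOM record."""
    out, i = [], 0
    while i < len(map2):
        best_start, best_len = -1, 0
        for j in range(len(map1)):
            k = 0
            while (i + k < len(map2) and j + k < len(map1)
                   and map2[i + k] == map1[j + k]):
                k += 1
            if k > best_len:
                best_start, best_len = j, k
        if best_len >= min_run:
            out.append(("SOM", source, best_start, best_len))
            i += best_len
        else:
            out.append(map2[i])
            i += 1
    return out
```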
Fig 8
Fig 8. Post-processing.
In this example, we explain the post-processing phase. After the first-order mapping (A), the set of matches M is processed, two by two, trying to merge them. In the first iteration (B), matches m1 and m2 are evaluated. It is verified that (m1.position + m1.length + m2.length + 1) is equal to m2.position; therefore, there is a substitution between the matches. The matches are combined, m2 is removed from set M, and the remaining matches are renumbered accordingly. At the same time, instruction s1 is created and inserted into set S, which will contain editing instructions of the substitution type. In the second iteration (C), matches m1 and m2 (formerly m3) are evaluated. It is verified that (m1.position + l1.position) is equal to m2.position; therefore, there is an insertion between the matches. The matches are combined, m2 is removed from set M, and the remaining matches are renumbered accordingly. At the same time, instruction i1 is created and inserted into set I, which will contain editing instructions of the insertion type. In the third iteration (D), matches m1 and m2 (which was m4 in the first iteration) are evaluated. It is verified that (2 <= m2.position - (m1.position + m1.length + 1) <= Lmax), where Lmax is a predefined value that determines the maximum size a deletion can have to be handled by post-processing. Therefore, there is a 2-base deletion between the evaluated matches. Once again, both matches are combined, and m2 is removed from set M, which will then contain only match m1. At the same time, instruction d1 is created and inserted into set D, which will contain editing instructions of the deletion type. The goal is to reduce the number of integers: in this case, we initially had 8 distinct integers and, at the end, only 6 remained.
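The substitution-merging step can be sketched as below. This is only an illustration of the idea, not the exact arithmetic of the figure: here a substitution is detected by the common convention that the reference gap between consecutive matches is exactly one base, and the assumption that exactly one literal sits between each pair of matches; the record layouts are also assumptions.

```python
# Hedged sketch of post-processing (Fig 8, substitution case): adjacent
# matches separated by a single substituted base are merged into one
# longer match plus a substitution instruction, shrinking the map.


def merge_substitutions(matches, literals):
    """matches: list of (ref_pos, length); literals[i] is the single
    literal between matches[i] and matches[i+1]. Returns the merged
    matches, the substitution instructions, and unmerged literals."""
    merged = [matches[0]]
    subs, kept = [], []
    for (pos, length), lit in zip(matches[1:], literals):
        m_pos, m_len = merged[-1]
        if m_pos + m_len + 1 == pos:            # one-base gap: substitution
            subs.append((m_pos + m_len, lit))   # record the substituted base
            merged[-1] = (m_pos, m_len + 1 + length)
        else:
            kept.append(lit)
            merged.append((pos, length))
    return merged, subs, kept
```

Two matches and a literal (five values) collapse into one match and one substitution (four values), which is the integer-count reduction the caption describes.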
Fig 9
Fig 9. Encoding process.
(A) is an example of how a tool performs encoding when all elements of the map are sent to be encoded together. In this case, all elements are processed in a single data stream. In (B), we illustrate the situation in which a tool splits the elements of the map into three different data streams: the first containing elements referring to the positions of matches; the second referring to sizes; and, finally, the bases that do not compose any match, the mismatches. In this case, each data stream is encoded separately and, at the end, the results are combined into a single file. The advantage of this second approach is that the probability distribution of the elements is more concentrated, which allows a higher compression rate since, according to information theory, shorter binary codes are assigned to symbols that appear more often.
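The stream-splitting in (B) can be sketched directly, together with the Shannon entropy that justifies it; the map record layout is the same illustrative `("M", pos, len)` / `("L", base)` assumption used above, not any tool's actual format.

```python
# Hedged sketch of stream splitting before entropy coding (Fig 9B):
# positions, lengths and mismatch bases go to separate streams, each
# with a more concentrated symbol distribution than the mixed stream.
import math
from collections import Counter


def split_streams(mapping):
    """Separate a map into position, length and mismatch streams."""
    positions, lengths, mismatches = [], [], []
    for element in mapping:
        if element[0] == "M":            # match: ("M", pos, length)
            positions.append(element[1])
            lengths.append(element[2])
        else:                            # literal: ("L", base)
            mismatches.append(element[1])
    return positions, lengths, mismatches


def shannon_entropy(stream):
    """Bits per symbol of the stream's empirical distribution."""
    counts = Counter(stream)
    total = len(stream)
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Comparing `shannon_entropy` of the mixed stream against the per-stream values makes the caption's argument concrete: lower entropy per stream means an entropy coder can assign shorter codes on average.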
