Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2015:2015:905261.
doi: 10.1155/2015/905261. Epub 2015 Apr 19.

A Practical and Scalable Tool to Find Overlaps between Sequences

Affiliations

A Practical and Scalable Tool to Find Overlaps between Sequences

Maan Haj Rachid et al. Biomed Res Int. 2015.

Abstract

The evolution of the next generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easy-to-implement solution for one of these problems, namely, the all-pairs suffix-prefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this time-efficient and space-economical tree data structure. The paper presents techniques for parallel implementations of the proposed solution. Experimental evaluation indicates superior results in terms of space and time over existing solutions. Results also show that the proposed technique is highly scalable in a parallel execution environment.

PubMed Disclaimer

Figures

Figure 1
Figure 1
The prefix tree for strings S 1 = AAGGG, S 2 = ACTTT, S 3 = AGGCT, S 4 = GCCAC, and S 5 = TCCGC.
Figure 2
Figure 2
Construction of prefix compact tree after sorting the strings. Each stage represents the current tree; the strings are S 1 = AAGGG, S 2 = ACTTT, S 3 = AGGCT, S 4 = GCCAC, and S 5 = TCCGC.
Figure 3
Figure 3
Time comparison between six different solutions for APSP, running on machine A, with random data. Logarithmic scale is used. SGA, Readjoiner, Sadakane suffix tree, and SOF are tested with a minimal match length = 15, while other solutions are tested with minimal match length = 1 (which is the only option).
Figure 4
Figure 4
Space comparison between six different solutions, running on machine A, with random data.
Figure 5
Figure 5
Time comparison between 3 different solutions, running on machine A, with real data using 4 threads.
Figure 6
Figure 6
Space comparison between 3 different solutions, running on machine A, with real data.
Figure 7
Figure 7
Time comparison between SOF and Readjoiner, using 16-core AWS machine with real data. We could not run Readjoiner with 3 out of 5 data sets using all 16 threads, so we demonstrate the results using the maximum number of threads which Readjoiner can use.
Algorithm 1
Algorithm 1
Constructing the tree after sorting the sequences.
Algorithm 2
Algorithm 2
Constructing the tree without sorting the sequences.
Algorithm 3
Algorithm 3
Finding all-pairs suffix-prefix.
Algorithm 4
Algorithm 4
Pseudocode for the parallel algorithm.

References

    1. Gusfield D., Landau G. M., Schieber B. An efficient algorithm for the all pairs suffix-prefix problem. Information Processing Letters. 1992;41(4):181–185. doi: 10.1016/0020-0190(92)90176-V. - DOI
    1. Gusfield D. Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. Cambridge University Press; 1997.
    1. Weiner P. Linear pattern matching algorithms. Proceedings of the IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory (SWAT ’08); October 1973; pp. 1–11. - DOI
    1. Kurtz S. Reducing the space requirement of suffix trees. Software—Practice and Experience. 1999;29(13):1149–1171. doi: 10.1002/(sici)1097-024x(199911)29:13x0003C;1149::aid-spe274x003E;3.0.co;2-o. - DOI
    1. Manber U., Myers G. Suffix arrays: a new method for on-line string searches. Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '90); 1990; Philadelphia, Pa, USA. Society for Industrial and Applied Mathematics; pp. 319–327.

Publication types

LinkOut - more resources