Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2008 Apr 15:9:197.
doi: 10.1186/1471-2105-9-197.

Accelerating string set matching in FPGA hardware for bioinformatics research

Affiliations

Accelerating string set matching in FPGA hardware for bioinformatics research

Yoginder S Dandass et al. BMC Bioinformatics. .

Abstract

Background: This paper describes techniques for accelerating the performance of the string set matching problem with particular emphasis on applications in computational proteomics. The process of matching peptide sequences against a genome translated in six reading frames is part of a proteogenomic mapping pipeline that is used as a case-study. The Aho-Corasick algorithm is adapted for execution in field programmable gate array (FPGA) devices in a manner that optimizes space and performance. In this approach, the traditional Aho-Corasick finite state machine (FSM) is split into smaller FSMs, operating in parallel, each of which matches up to 20 peptides in the input translated genome. Each of the smaller FSMs is further divided into five simpler FSMs such that each simple FSM operates on a single bit position in the input (five bits are sufficient for representing all amino acids and special symbols in protein sequences).

Results: This bit-split organization of the Aho-Corasick implementation enables efficient utilization of the limited random access memory (RAM) resources available in typical FPGAs. The use of on-chip RAM as opposed to FPGA logic resources for FSM implementation also enables rapid reconfiguration of the FPGA without the place and routing delays associated with complex digital designs.

Conclusion: Experimental results show storage efficiencies of over 80% for several data sets. Furthermore, the FPGA implementation executing at 100 MHz is nearly 20 times faster than an implementation of the traditional Aho-Corasick algorithm executing on a 2.67 GHz workstation.

PubMed Disclaimer

Figures

Figure 1
Figure 1
An FSM for matching peptide set {ACACD, ACE, CAC}.
Figure 2
Figure 2
Bit-Split Aho-Corasick FSM Architecture.
Figure 3
Figure 3
Aho-Corasick Tile Architecture.
Figure 4
Figure 4
Aho-Corasick Implementation Architecture using 140 tiles (clock and reset signals are not shown for clarity).

Similar articles

Cited by

References

    1. Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004;4:59–77. - PubMed
    1. Jaffe JD, Stange-Thomann N, Smith C, DeCaprio D, Fisher S, Butler J, Calvo S, Elkins T, FitzGerald MG, Hafez N, Kodira CD, Major J, Wang S, Wilkinson J, Nicol R, Nusbaum C, Birren B, Berg HC, Church GM. The complete genome and proteome of Mycoplasma mobile. Genome Res. 2004;14:1447–1461. - PMC - PubMed
    1. Kalume DE, Peri S, Reddy R, Zhong J, Okulate M, Kumar N, Pandey A. Genome annotation of Anopheles gambiae using mass spectrometry-derived data. BMC Genomics. 2005;6:128. - PMC - PubMed
    1. Kuster B, Mortensen P, Andersen JS, Mann M. Mass spectrometry allows direct identification of proteins in large genomes. Proteomics. 2001;1:641–650. - PubMed
    1. McCarthy FM, Cooksey AM, Wang N, Bridges SM, Pharr GT, Burgess SC. Modeling a whole organ using proteomics: the avian bursa of Fabricius. Proteomics. 2006;6:2759–2771. - PubMed

Publication types

LinkOut - more resources