Querying the public databases for sequences using complex keywords contained in the feature lines

Olivier Croce¹, Michaël Lamarre, Richard Christen

Affiliation

¹ Laboratoire de Biologie Virtuelle, UMR 6543, CNRS & University of Nice Sophia-Antipolis, Centre de Biochimie, Parc Valrose, Nice, F06108, France. croce@unice.fr

PMID: 16441875
PMCID: PMC1403806
DOI: 10.1186/1471-2105-7-45

Comparative Study

BMC Bioinformatics. 2006 Jan 27;7:45.

Abstract

Background: High-throughput technologies often require the retrieval of large sets of sequences. Retrieving EMBL or GenBank entries by keyword is easy with tools such as ACNUC, Entrez or SRS, but these tools have limitations, in particular when querying with complex keywords.

Results: We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and has problems with complex queries. ACNUC works well, but does not allow precise queries in the feature qualifiers. We developed specific Perl scripts to precisely retrieve subsequences as defined by complex descriptors in the feature qualifiers of EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use.

Conclusion: Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries with a script is often necessary to retrieve exactly the data searched for. Embedding the scripts in a user-friendly interface allows biologists to use them, and bioinformaticians can easily modify them for unforeseen needs.
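
The scripts described above are written in Perl on top of BioPerl and are not reproduced here. As a rough illustration of the idea only, the following Python sketch (standard library, no BioPerl equivalent) streams an EMBL flat file one entry at a time, looks for a multi-word phrase inside a given feature qualifier, and cuts out the matching subsequence. All function names, the `SAMPLE` records and their accessions are invented for this example. The parser is deliberately simplified: it assumes only plain `start..end` and `complement(start..end)` locations and skips anything else (such as `join(...)` or fuzzy bounds).

```python
import io
import re

# Invented demo records in EMBL flat-file layout (not real database entries).
SAMPLE = """\
ID   DEMO0001; SV 1; linear; genomic DNA; STD; PRO; 30 BP.
AC   DEMO0001;
FT   rRNA            6..15
FT                   /product="16S ribosomal
FT                   RNA"
SQ   Sequence 30 BP;
     aaaaaccccc gggggttttt acgtacgtac                                        30
//
ID   DEMO0002; SV 1; linear; genomic DNA; STD; PRO; 10 BP.
AC   DEMO0002;
FT   CDS             1..6
FT                   /product="16S ribosomal RNA methyltransferase"
FT   rRNA            complement(3..8)
FT                   /product="23S ribosomal RNA"
SQ   Sequence 10 BP;
     acgtttggcc                                                              10
//
"""

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def read_entries(handle):
    """Yield the lines of one entry at a time, so a large file is never held in memory."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):  # '//' terminates an EMBL entry
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # tolerate a missing final terminator
        yield entry


def parse_entry(lines):
    """Return (accession, features, sequence) for one entry."""
    accession, features, seq_parts, in_seq = "", [], [], False
    for line in lines:
        if line.startswith("AC") and not accession:
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("FT"):
            if len(line) > 5 and line[5] != " ":  # a new feature key starts in column 6
                key, _, loc = line[5:].partition(" ")
                features.append({"key": key, "location": loc.strip(), "qualifiers": []})
                continue
            body = line[2:].strip()
            if not features or not body:
                continue
            quals = features[-1]["qualifiers"]
            if body.startswith("/"):  # a new qualifier such as /product="..."
                name, _, value = body[1:].partition("=")
                quals.append([name, value.strip('"')])
            elif quals:  # wrapped qualifier value continues on this line
                quals[-1][1] = (quals[-1][1] + " " + body).strip('"')
            else:  # wrapped location continues on this line
                features[-1]["location"] += body
        elif line.startswith("SQ"):
            in_seq = True
        elif in_seq:
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))  # drop spaces and counters
    return accession, features, "".join(seq_parts)


def extract_subsequence(sequence, location):
    """Cut a simple location out of the sequence; return None if the location is unsupported."""
    plain = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    comp = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", location)
    match = plain or comp
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    sub = sequence[start - 1:end]  # EMBL coordinates are 1-based and inclusive
    return sub.translate(COMPLEMENT)[::-1] if comp else sub


def find_subsequences(handle, feature_key, qualifier, phrase):
    """Yield (accession, location, subsequence) for features whose qualifier contains the phrase."""
    phrase = phrase.lower()
    for lines in read_entries(handle):
        accession, features, sequence = parse_entry(lines)
        for feat in features:
            if feat["key"] != feature_key:
                continue
            if any(n == qualifier and phrase in v.lower() for n, v in feat["qualifiers"]):
                sub = extract_subsequence(sequence, feat["location"])
                if sub is not None:
                    yield accession, feat["location"], sub


for hit in find_subsequences(io.StringIO(SAMPLE), "rRNA", "product", "16S ribosomal RNA"):
    print(hit)  # ('DEMO0001', '6..15', 'cccccggggg')
```

Restricting the match to one feature key and one qualifier is what keeps the `CDS` annotated as "16S ribosomal RNA methyltransferase" in the second demo record out of the results, even though its product text contains the search phrase. A keyword index that does not distinguish feature types could return such an entry.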
