Querying the public databases for sequences using complex keywords contained in the feature lines
- PMID: 16441875
- PMCID: PMC1403806
- DOI: 10.1186/1471-2105-7-45
Abstract
Background: High-throughput technologies often require the retrieval of large sets of sequences. Retrieving EMBL or GenBank entries by keyword is easy with tools such as ACNUC, Entrez or SRS, but these tools have limitations, in particular when querying with complex keywords.
Results: We show that Entrez has severe limitations with respect to retrieving subsequences. SRS works well with simple keywords but not with keywords composed of several terms, and it has problems with complex queries. ACNUC works well but does not allow precise queries on the feature qualifiers. We developed Perl scripts that precisely retrieve subsequences defined by complex descriptors in the feature qualifiers of EMBL entries. We improved parts of the BioPerl library to allow parsing of large data files, and we embedded these scripts in a user-friendly, OS-independent interface for easy use.
Conclusion: Although not as fast as the public tools that use prebuilt indexes, parsing the complete entries with a script is often necessary to retrieve exactly the data searched for. Embedding the scripts in a user-friendly interface lets biologists use them directly, while bioinformaticians can easily modify them for unforeseen needs.
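The retrieval approach described in the Results can be illustrated with a minimal sketch. It is written in Python rather than the authors' Perl and BioPerl, and it is not their implementation: the function names, the sample entry and the restriction to simple `start..end` and `complement(start..end)` locations are all assumptions made for illustration. The sketch streams an EMBL flat file one entry at a time, so large files need not fit in memory. It joins wrapped qualifier lines, which lets a phrase of several terms match across a line break, and it extracts the subsequence of each matching feature.

```python
import io
import re

# Only simple locations are handled here; joins, fuzzy ends and remote
# references are skipped (an assumption of this sketch).
SIMPLE_LOC = re.compile(r"^(?:(\d+)\.\.(\d+)|complement\((\d+)\.\.(\d+)\))$")
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def iter_entries(handle):
    """Yield one entry at a time as a list of lines ('//' ends an entry)."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:
        yield entry


def parse_entry(lines):
    """Return (features, sequence) for the lines of one EMBL entry."""
    features, seq_parts, in_seq = [], [], False
    for line in lines:
        if line.startswith("FT"):
            if len(line) > 5 and not line[5].isspace():
                # A feature key in column 6 starts a new feature.
                parts = line[5:].split(None, 1)
                location = parts[1].strip() if len(parts) > 1 else ""
                features.append(
                    {"key": parts[0], "location": location, "qualifiers": ""}
                )
            elif features:
                body = line[5:].strip()
                if body.startswith("/") or features[-1]["qualifiers"]:
                    # Qualifier line, or continuation of a wrapped value.
                    features[-1]["qualifiers"] += " " + body
                elif body:
                    # Continuation of a wrapped location.
                    features[-1]["location"] += body
        elif line.startswith("SQ"):
            in_seq = True
        elif in_seq and line.startswith("  "):
            # Drop the spacing and the trailing base count.
            seq_parts.append(re.sub(r"[\s\d]", "", line))
    return features, "".join(seq_parts)


def extract_subsequence(sequence, location):
    """Return the subsequence for a simple location, or None if unsupported."""
    match = SIMPLE_LOC.match(location)
    if match is None:
        return None
    if match.group(1):
        return sequence[int(match.group(1)) - 1:int(match.group(2))]
    sub = sequence[int(match.group(3)) - 1:int(match.group(4))]
    return sub.translate(COMPLEMENT)[::-1]  # reverse complement


def find_subsequences(handle, feature_key, phrase):
    """Yield (location, subsequence) for features whose qualifiers hold phrase."""
    phrase = " ".join(phrase.lower().split())
    for entry in iter_entries(handle):
        features, sequence = parse_entry(entry)
        for feature in features:
            if feature["key"] != feature_key:
                continue
            qualifiers = " ".join(feature["qualifiers"].lower().split())
            if phrase in qualifiers:
                sub = extract_subsequence(sequence, feature["location"])
                if sub is not None:
                    yield feature["location"], sub


# Hypothetical entry, invented for this example.
SAMPLE = """ID   X00001; SV 1; linear; genomic DNA; STD; PRO; 20 BP.
FT   source          1..20
FT                   /organism="Escherichia coli"
FT   CDS             3..11
FT                   /product="DNA polymerase III
FT                   alpha subunit"
FT   CDS             complement(12..17)
FT                   /product="hypothetical protein"
SQ   Sequence 20 BP;
     aaatgcccgg gttaacctga                                        20
//
"""

if __name__ == "__main__":
    hits = find_subsequences(io.StringIO(SAMPLE), "CDS", "polymerase III alpha")
    print(list(hits))  # [('3..11', 'atgcccggg')]
```

The phrase "polymerase III alpha" is split across two qualifier lines in the sample entry, so a line-by-line keyword match would miss it. Because every entry is read in full, this approach is slower than an index lookup, which is the trade-off noted in the Conclusion.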