Gigascience. 2022 Dec 28:12:giad067.
doi: 10.1093/gigascience/giad067. Epub 2023 Aug 17.

ARA: a flexible pipeline for automated exploration of NCBI SRA datasets

Anand Maurya et al. Gigascience.

Abstract

Background: One of the most effective and useful methods to explore the content of biological databases is searching with nucleotide or protein sequences as a query. However, especially in the case of nucleic acids, the large volume of data generated by next-generation sequencing (NGS) technologies often makes this approach unavailable. The hierarchical organization of NGS records is primarily designed for browsing or for text-based searches of metadata-related keywords, which limits the efficiency of database exploration.

Findings: We developed an automated pipeline that incorporates well-established NGS data-processing tools and procedures to allow easy and effective sampling of NCBI SRA database records. Given a file with query nucleotide sequences, our tool estimates the matching content of SRA accessions by probing only a user-defined fraction of a record's sequences. Based on the selected parameters, it can then perform a full mapping experiment with the records that meet the required criteria. The pipeline is designed to be easy to operate: it offers a fully automatic setup procedure and relies on a fixed set of tested supporting tools. The modular design and the implemented usage modes allow a user to scale up the analyses to complex computational infrastructure.
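
The sampling idea described in the Findings can be illustrated with a minimal Python sketch. It is not the ARA implementation: the function names, the `is_match` callback, and the threshold parameter are hypothetical, and a simple substring test stands in for the read-mapping tools the pipeline actually uses. It only shows how testing a user-defined fraction of reads yields an estimate that can gate a full analysis.

```python
import random
from typing import Callable, Iterable


def estimate_match_fraction(
    reads: Iterable[str],
    is_match: Callable[[str], bool],
    fraction: float,
    seed: int = 0,
) -> float:
    """Estimate the share of matching reads by testing only a random subset.

    `fraction` is the probability that any given read is included in the
    sample (hypothetical stand-in for the user-defined fraction).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Each read is kept independently with probability `fraction`.
    sampled = [read for read in reads if rng.random() < fraction]
    if not sampled:
        return 0.0
    hits = sum(1 for read in sampled if is_match(read))
    return hits / len(sampled)


def passes_screen(estimate: float, min_fraction: float) -> bool:
    """Decide whether a record qualifies for a full mapping run."""
    return estimate >= min_fraction


# Toy data: half of the reads contain the (hypothetical) query fragment.
reads = ["ACGTACGT"] * 50 + ["TTTTTTTT"] * 50
contains_query = lambda read: "ACG" in read  # placeholder for a real aligner

estimate_all = estimate_match_fraction(reads, contains_query, fraction=1.0)
estimate_part = estimate_match_fraction(reads, contains_query, fraction=0.2)
```

With `fraction=1.0` every read is tested and the estimate equals the true share; smaller fractions trade accuracy for speed, which is the trade-off the screening step exposes to the user.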

Conclusions: We present an easy-to-operate and automated tool that expands the way a user can access and explore the information contained within the records deposited in the NCBI SRA database.

Keywords: NGS data; SRA database; database searching; sequence analysis.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Figure 1:
Graphical representation of the ARA pipeline workflow. Green arrows indicate steps run in the “screen” mode. Blue arrows indicate steps executed in the “full” mode. Red arrows show the analysis path specific to the combined “both” mode, including the automated generation of a filtered list of SRA records (depicted by a red rectangle). Yellow rectangles indicate user input data (query sequences and a list of SRA record accessions). White rectangles represent distinct steps in the analysis process (with tools indicated in brackets) and output reports generated by the pipeline. “Sequence analyses” currently allow mapping and taxonomic classification but can be easily expanded to include more tools (depicted by an ellipsis in square brackets).
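
The relationship between the three modes in the caption can be sketched as a small dispatcher. This is an illustrative outline under stated assumptions, not the pipeline's code: the `screen`, `full`, and `keep` callables and the placeholder accession labels are hypothetical, and the only behaviour taken from the caption is that the “both” mode screens records, builds a filtered record list automatically, and runs the full analysis on that list.

```python
from typing import Callable, Dict, List


def run_modes(
    mode: str,
    accessions: List[str],
    screen: Callable[[str], float],
    full: Callable[[str], str],
    keep: Callable[[float], bool],
) -> Dict[str, dict]:
    """Run the screen and/or full step depending on the selected mode."""
    if mode not in ("screen", "full", "both"):
        raise ValueError(f"unknown mode: {mode}")

    report: Dict[str, dict] = {"screen": {}, "full": {}}
    selected = list(accessions)

    if mode in ("screen", "both"):
        # Screening step: one estimate per record.
        report["screen"] = {acc: screen(acc) for acc in accessions}

    if mode == "both":
        # Automated generation of the filtered record list.
        selected = [acc for acc in accessions if keep(report["screen"][acc])]

    if mode in ("full", "both"):
        # Full analysis on the user's list ("full") or the filtered list ("both").
        report["full"] = {acc: full(acc) for acc in selected}

    return report


# Placeholder labels, not real SRA accessions.
scores = {"REC_A": 0.20, "REC_B": 0.01}
result = run_modes(
    "both",
    ["REC_A", "REC_B"],
    screen=lambda acc: scores[acc],
    full=lambda acc: f"mapped:{acc}",
    keep=lambda estimate: estimate >= 0.10,
)
```

In this toy run both records are screened, but only `REC_A` clears the hypothetical threshold and reaches the full step.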

