Sci Data. 2024 Jul 5;11(1):732.
doi: 10.1038/s41597-024-03571-y.

Exploring and retrieving sequence and metadata for species across the tree of life with NCBI Datasets

Nuala A O'Leary et al. Sci Data.

Abstract

To explore complex biological questions, it is often necessary to access various data types from public data repositories. As the volume and complexity of biological sequence data grow, public repositories face significant challenges in ensuring that the data is easily discoverable and usable by the biological research community. To address these challenges, the National Center for Biotechnology Information (NCBI) has created NCBI Datasets. This resource provides straightforward, comprehensive, and scalable access to biological sequences, annotations, and metadata for a wide range of taxa. Following the FAIR (Findable, Accessible, Interoperable, and Reusable) data management principles, NCBI Datasets offers user-friendly web interfaces, command-line tools, and documented APIs, empowering researchers to access NCBI data seamlessly. The data is delivered as packages of sequences and metadata, thus facilitating improved data retrieval, sharing, and usability in research. Moreover, this data delivery method fosters effective data attribution and promotes its further reuse. This paper outlines the current scope of data accessible through NCBI Datasets and explains various options for exploring and downloading the data.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
NCBI Datasets interfaces for data access: NCBI Datasets offers web, command-line, and API access points that facilitate the search for and download of genomic sequences, annotations, and comprehensive metadata. These tools, including web-based and programmatic interfaces, guarantee consistent data retrieval from NCBI Datasets.
Fig. 2
Types of data packages: Three main categories of data packages are currently available: genome, gene, and virus. Users can customize the contents of any data package. Detailed information about each data package, including a list of available files and descriptions of each file, is available in NCBI Datasets documentation under Data packages (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/data-packages/).
Fig. 3
Organism-focused data access: The NCBI Datasets taxonomy web page provides access to NCBI sequence data and metadata for a given organism, including tabular views of assembled genomes and annotated genes, download options, and links to data in other NCBI databases.
Fig. 4
Organization of the datasets command-line tool: The datasets command-line tool can be used to browse and download NCBI Datasets data packages. There are two main subcommands: “download” for retrieving data packages and “summary” for displaying metadata. Each subcommand has multiple flags to help narrow data packages to the desired genomes or genes of interest. For an overview of datasets, dataformat, and installation instructions, see our Command-line tools documentation (https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/). (*) virus protein restricted to download of SARS-CoV-2 proteins.
Fig. 5
The datasets summary command: Diagram of the datasets command-line tool syntax for getting metadata for one or more genomes by taxon name.
Fig. 6
The datasets download command: Diagram of the datasets command-line tool syntax for downloading a genome data package for one or more genomes by taxon name.
Fig. 7
Organization of the assembly_data_report.jsonl data report: Select metadata can be retrieved from the data report using the dataformat command-line tool. In this example, the organism name (red) and assembly accession (blue) can be extracted from the data report to generate a two-column tabular file.
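
The extraction described for Fig. 7 can be illustrated with a minimal Python sketch. It reads a JSON Lines data report, one JSON record per line, and writes a two-column tab-separated table of organism name and assembly accession. The field names `accession` and `organism.organismName` are assumptions about the report layout, and the two sample records are invented placeholders, not real assemblies.

```python
import io
import json


def report_to_rows(lines):
    """Yield (organism name, assembly accession) pairs from JSON Lines records.

    Assumes each record has a top-level "accession" field and a nested
    "organism" object with an "organismName" field (assumed layout).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        organism = record.get("organism", {}).get("organismName", "")
        accession = record.get("accession", "")
        yield organism, accession


def rows_to_tsv(rows):
    """Format (name, accession) pairs as tab-separated lines."""
    return "".join(f"{name}\t{accession}\n" for name, accession in rows)


# Hypothetical records standing in for an assembly_data_report.jsonl file.
sample = io.StringIO(
    '{"accession": "GCF_000000001.1", "organism": {"organismName": "Example species"}}\n'
    '{"accession": "GCA_000000002.1", "organism": {"organismName": "Another species"}}\n'
)

tsv = rows_to_tsv(report_to_rows(sample))
print(tsv, end="")
```

To use this on a real report, pass an open file handle to `report_to_rows` instead of the in-memory sample. The dataformat tool named in the caption performs this kind of extraction itself; the sketch only shows the shape of the data involved.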
