MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive
- PMID: 28535296
- PMCID: PMC5870770
- DOI: 10.1093/bioinformatics/btx334
Abstract
Motivation: The NCBI's Sequence Read Archive (SRA) promises great biological insight if its data can be analyzed in aggregate; however, the data remain largely underutilized, in part because the metadata associated with each sample are poorly structured. The rules governing submissions to the SRA do not dictate a standardized set of terms for describing the biological samples from which the sequencing data are derived. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable because of the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across the diverse diseases, tissues and cell types present in the SRA.
Results: We present MetaSRA, a database of normalized SRA human sample-specific metadata following a schema inspired by the metadata organization of the ENCODE project. This schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline.
Availability and implementation: The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads. Software implementing our computational pipeline is available at http://github.com/deweylab/metasra-pipeline.
Contact: cdewey@biostat.wisc.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2017. Published by Oxford University Press.
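The three tasks named in the Results (mapping samples to ontology terms, assigning a sample-type category, and extracting real-valued properties) can be illustrated with a minimal sketch. This is not the MetaSRA pipeline or its output format: the synonym table, the `NormalizedSample` record, the `normalize` function, the age pattern and the single "tissue" rule are all hypothetical names and simplifications chosen for illustration, and the ontology identifiers are shown only as examples.

```python
import re
from dataclasses import dataclass, field

# Hypothetical synonym table: lower-cased raw strings -> ontology term IDs.
# A real pipeline would consult full ontologies; this is illustration only.
SYNONYMS = {
    "liver": "UBERON:0002107",
    "hepatic tissue": "UBERON:0002107",
    "brain": "UBERON:0000955",
}

# Assumed pattern for ages given in years, e.g. "45 years" or "3.5 yr".
AGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(years?|yrs?|y)\b", re.IGNORECASE)


@dataclass
class NormalizedSample:
    """Hypothetical normalized record with the three kinds of output."""
    ontology_terms: set = field(default_factory=set)
    sample_type: str = "unknown"
    real_valued: dict = field(default_factory=dict)


def normalize(raw):
    """Normalize a dict of raw key-value metadata for one sample."""
    sample = NormalizedSample()
    for key, value in raw.items():
        text = value.strip().lower()
        # Task 1: map synonyms and spelling variants to one ontology term.
        if text in SYNONYMS:
            sample.ontology_terms.add(SYNONYMS[text])
        # Task 3: extract a real-valued property (here, only age in years).
        match = AGE_PATTERN.fullmatch(text)
        if key.strip().lower() == "age" and match:
            sample.real_valued["age_years"] = float(match.group(1))
    # Task 2: assign a sample-type category with a deliberately naive rule.
    if any(term.startswith("UBERON:") for term in sample.ontology_terms):
        sample.sample_type = "tissue"
    return sample


example = normalize({"tissue": "Hepatic Tissue", "age": "45 years"})
print(example)
```

In this sketch, "Hepatic Tissue" and "liver" resolve to the same term, which is the kind of synonym collapse the Motivation describes as missing from raw SRA metadata.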
Similar articles
- Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive. F1000Res. 2020 May 19;9:376. doi: 10.12688/f1000research.23180.2. eCollection 2020. PMID: 32864105 Free PMC article.
- "METAGENOTE: a simplified web platform for metadata annotation of genomic samples and streamlined submission to NCBI's sequence read archive". BMC Bioinformatics. 2020 Sep 3;21(1):378. doi: 10.1186/s12859-020-03694-0. PMID: 32883210 Free PMC article.
- pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive. F1000Res. 2019 Apr 23;8:532. doi: 10.12688/f1000research.18676.1. eCollection 2019. PMID: 31114675 Free PMC article.
- grabseqs: simple downloading of reads and metadata from multiple next-generation sequencing data repositories. Bioinformatics. 2020 Jun 1;36(11):3607-3609. doi: 10.1093/bioinformatics/btaa167. PMID: 32154830 Free PMC article.
- MetaRNA-Seq: An Interactive Tool to Browse and Annotate Metadata from RNA-Seq Studies. Biomed Res Int. 2015;2015:318064. doi: 10.1155/2015/318064. Epub 2015 Aug 25. PMID: 26380270 Free PMC article.
Cited by
- The role of metadata in reproducible computational research. Patterns (N Y). 2021 Sep 10;2(9):100322. doi: 10.1016/j.patter.2021.100322. eCollection 2021 Sep 10. PMID: 34553169 Free PMC article. Review.
- HumanMetagenomeDB: a public repository of curated and standardized metadata for human metagenomes. Nucleic Acids Res. 2021 Jan 8;49(D1):D743-D750. doi: 10.1093/nar/gkaa1031. PMID: 33221926 Free PMC article.
- Cistrome Data Browser: integrated search, analysis and visualization of chromatin data. Nucleic Acids Res. 2024 Jan 5;52(D1):D61-D66. doi: 10.1093/nar/gkad1069. PMID: 37971305 Free PMC article.
- STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions. Genome Biol. 2021 Sep 20;22(1):270. doi: 10.1186/s13059-021-02490-0. PMID: 34544477 Free PMC article.
- ALE: automated label extraction from GEO metadata. BMC Bioinformatics. 2017 Dec 28;18(Suppl 14):509. doi: 10.1186/s12859-017-1888-1. PMID: 29297276 Free PMC article.