F1000Res. 2020 May 19:9:376.

doi: 10.12688/f1000research.23180.2. eCollection 2020.

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive

Matthew N Bernstein¹, Ariella Gladstein², Khun Zaw Latt³, Emily Clough⁴, Ben Busby⁴, Allissa Dillman⁴

Affiliations

¹ Morgridge Institute for Research, Madison, Wisconsin, 53715, USA.
² Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, 27599, USA.
³ Kidney Disease Branch, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, Maryland, 20892, USA.
⁴ National Center for Biotechnology Information NLM, Bethesda, Maryland, 20894, USA.

PMID: 32864105
PMCID: PMC7445559
DOI: 10.12688/f1000research.23180.2

Matthew N Bernstein et al. F1000Res. 2020.

Abstract

The Sequence Read Archive (SRA) is a large public repository that stores raw next-generation sequencing data from thousands of diverse scientific investigations. Despite its promise, reuse and re-analysis of SRA data have been hindered by the heterogeneity and poor quality of the metadata that describe its biological samples. Recently, the MetaSRA project standardized these metadata by annotating each sample with terms from biomedical ontologies. In this work, we present a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA, with the aim of facilitating secondary analyses of the SRA's human RNA-seq data. The first tool, called the Case-Control Finder, finds suitable case and control samples for a given disease or condition, where the cases and controls are matched by tissue or cell type. The second tool, called the Series Finder, finds ordered sets of samples for addressing biological questions about changes over a numerical property such as time. These tools were the result of a three-day NCBI Codeathon held in March 2019 at the University of North Carolina at Chapel Hill.

Keywords: Hackathon; Jupyter; MetaSRA; Metadata; Ontology; RNA-seq; Sequence Read Archive.

Conflict of interest statement

No competing interests were disclosed.

Figures

**Figure 1. Data flows for hypothesis-driven query tools.**
An overview of the backend processing functions called from the Jupyter notebooks.

**Figure 2. Example results from the Case-Control Finder.**
Results from running the Case-Control Finder for the query “liver cancer.” (A) The Case-Control Finder displays the number of case/control samples matched by each tissue and cell type. (B) The user can select either the case samples or control samples for a given tissue or cell type and display the most common ontology terms associated with those selected samples. Displayed here are the most common terms associated with the case samples labeled as “liver.” (C) The notebook also displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right).
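The matching step summarized in panel (A) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `SAMPLES` table, the `match_case_control` function, and the use of plain term names instead of ontology term identifiers are all assumptions made for illustration, and the sketch ignores the ontology hierarchy that the MetaSRA annotations provide.

```python
from collections import defaultdict

# Hypothetical sample annotations: sample ID -> set of ontology term names.
# Real MetaSRA annotations use ontology term IDs; names are used here for readability.
SAMPLES = {
    "S1": {"liver", "liver cancer"},
    "S2": {"liver"},
    "S3": {"liver", "liver cancer"},
    "S4": {"blood"},
    "S5": {"hepatocyte", "liver cancer"},
}


def match_case_control(samples, disease, tissues):
    """Split samples into cases and controls per tissue term.

    A sample is a case if it carries the disease term, otherwise a control.
    Only tissues that have at least one case and one control are returned.
    """
    groups = defaultdict(lambda: {"case": [], "control": []})
    for sample_id, terms in sorted(samples.items()):
        label = "case" if disease in terms else "control"
        for tissue in tissues:
            if tissue in terms:
                groups[tissue][label].append(sample_id)
    return {t: g for t, g in groups.items() if g["case"] and g["control"]}


matched = match_case_control(SAMPLES, "liver cancer", ["liver", "blood", "hepatocyte"])
for tissue, group in matched.items():
    print(tissue, len(group["case"]), len(group["control"]))
```

In this toy data only "liver" has both cases and controls, so it is the only tissue kept; "blood" has no cases and "hepatocyte" has no controls.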

**Figure 3. Example results from the Series Finder.**
Results from running the Series Finder for the query “brain” sorted by “age,” where unit is specified as “year.” (A) The Series Finder displays the number of samples sorted by age. (B) The user can select samples associated with a given time point for further exploration. Here the samples annotated as “year = 63” are selected. The notebook then displays four pie charts for viewing the fraction of samples belonging to a cell line (top left), each sex (top right), each developmental stage (bottom left), and whether they were given an experimental treatment (bottom right). (C) Given the selected samples from (B), the notebook displays the most frequent terms associated with those selected samples.
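The ordering described in panel (A) can be sketched in Python as follows. The record layout, the `RECORDS` data, and the `find_series` function are hypothetical and stand in for the real sample metadata; the sketch only shows the idea of filtering by term, property, and unit and then ordering samples by the numerical value, and is not the tool's actual code.

```python
from collections import defaultdict

# Hypothetical records: (sample ID, tissue term, property name, value, unit).
RECORDS = [
    ("S1", "brain", "age", 63, "year"),
    ("S2", "brain", "age", 21, "year"),
    ("S3", "brain", "age", 63, "year"),
    ("S4", "brain", "age", 6, "month"),
    ("S5", "liver", "age", 21, "year"),
]


def find_series(records, term, prop, unit):
    """Return (value, [sample IDs]) pairs ordered by value.

    Only samples that match the term, the property name, and the unit are kept,
    so values recorded in a different unit are not mixed into the series.
    """
    by_value = defaultdict(list)
    for sample_id, sample_term, name, value, sample_unit in records:
        if sample_term == term and name == prop and sample_unit == unit:
            by_value[value].append(sample_id)
    return sorted(by_value.items())


series = find_series(RECORDS, "brain", "age", "year")
for value, sample_ids in series:
    print(value, len(sample_ids))
```

Selecting a single time point, as in panel (B), then amounts to picking one entry of `series`, for example the samples listed under the value 63.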

Cited by

Microbial Dark Matter: from Discovery to Applications.
Zha Y, Chong H, Yang P, Ning K. Genomics Proteomics Bioinformatics. 2022 Oct;20(5):867-881. doi: 10.1016/j.gpb.2022.02.007. Epub 2022 Apr 26. PMID: 35477055. Review.
Enhancing Learning About Epidemiological Data Analysis Using R for Graduate Students in Medical Fields With Jupyter Notebook: Classroom Action Research.
Kumwichar P. JMIR Med Educ. 2023 May 29;9:e47394. doi: 10.2196/47394. PMID: 37247206.
Detection and Classification of Melanoma Skin Cancer Using Image Processing Technique.
Viknesh CK, Kumar PN, Seetharaman R, Anitha D. Diagnostics (Basel). 2023 Oct 26;13(21):3313. doi: 10.3390/diagnostics13213313. PMID: 37958209.
Metadata retrieval from sequence databases with ffq.
Gálvez-Merchán Á, Min KHJ, Pachter L, Booeshaghi AS. Bioinformatics. 2023 Jan 1;39(1):btac667. doi: 10.1093/bioinformatics/btac667. PMID: 36610997.
STAT: a fast, scalable, MinHash-based k-mer tool to assess Sequence Read Archive next-generation sequence submissions.
Katz KS, Shutov O, Lapoint R, Kimelman M, Brister JR, O'Sullivan C. Genome Biol. 2021 Sep 20;22(1):270. doi: 10.1186/s13059-021-02490-0. PMID: 34544477.
