Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive
- PMID: 28449062
- PMCID: PMC5459929
- DOI: 10.1093/gigascience/gix029
Abstract
It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of high-throughput sequencing (HTSeq) submissions to public repositories makes it difficult for users to choose a suitable data set from among the large number of search results. Repository users need to be able to set a threshold that reduces the number of results to a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information with the metadata of experiments and samples. We provide quality information for all of the archived sequencing data, which enables users to obtain sequencing data of sufficient quality for reanalysis. The calculated quality data are available to the public in various formats. Our data also provide an example of how a third party can enhance the reuse of public data by adding metadata to published research data.
Keywords: high-throughput sequencing; sequencing quality; public data; database.
© The Authors 2017. Published by Oxford University Press.
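As an illustration of the thresholding idea in the abstract, the following is a minimal sketch, not the authors' pipeline. It assumes a report in the tab-separated layout of FastQC's `fastqc_data.txt` file. The sample report, the function names `parse_fastqc_data` and `passes_threshold`, and the chosen cutoffs are hypothetical and serve only to show how a user might keep or discard an experiment based on its quality values.

```python
from typing import Dict, Iterable

# Made-up excerpt in the layout of a FastQC fastqc_data.txt file.
# All values are illustrative and do not come from the paper.
SAMPLE_REPORT = """##FastQC\t0.11.5
>>Basic Statistics\tpass
#Measure\tValue
Filename\texample.fastq
Total Sequences\t250000
Sequence length\t101
%GC\t47
>>END_MODULE
>>Per base sequence quality\tfail
>>END_MODULE
"""


def parse_fastqc_data(text: str) -> Dict[str, Dict[str, str]]:
    """Collect each module's status and the Basic Statistics measures."""
    statuses: Dict[str, str] = {}
    basic: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">>END_MODULE"):
            current = None
        elif line.startswith(">>"):
            # Module header: ">>Module name<TAB>status"
            name, _, status = line[2:].partition("\t")
            statuses[name] = status
            current = name
        elif current == "Basic Statistics" and not line.startswith("#"):
            # Data row inside Basic Statistics: "Measure<TAB>Value"
            key, _, value = line.partition("\t")
            basic[key] = value
    return {"statuses": statuses, "basic": basic}


def passes_threshold(
    report: Dict[str, Dict[str, str]],
    min_reads: int,
    required_pass: Iterable[str] = ("Per base sequence quality",),
) -> bool:
    """Hypothetical filter: enough reads and all required modules passed."""
    total = int(report["basic"].get("Total Sequences", "0"))
    modules_ok = all(report["statuses"].get(m) == "pass" for m in required_pass)
    return total >= min_reads and modules_ok


report = parse_fastqc_data(SAMPLE_REPORT)
keep = passes_threshold(report, min_reads=100_000)
```

Here `keep` is `False` for the sample report because its per-base quality module is marked as failed. Which measures and cutoffs are appropriate depends on the intended reanalysis, so the defaults above should be treated as placeholders.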