Calculating the quality of public high-throughput sequencing data to obtain a suitable subset for reanalysis from the Sequence Read Archive

Tazro Ohta, Takeru Nakazato, Hidemasa Bono

PMID: 28449062
PMCID: PMC5459929
DOI: 10.1093/gigascience/gix029

Gigascience. 2017 Jun 1;6(6):1-8.

Abstract

It is important for public data repositories to promote the reuse of archived data. In the growing field of omics science, however, the increasing number of submissions of high-throughput sequencing (HTSeq) data to public repositories makes it difficult for users to choose a suitable data set from among the large number of search results. Repository users need to be able to set a threshold that reduces the search results to a suitable subset of high-quality data for reanalysis. We calculated the quality of sequencing data archived in a public data repository, the Sequence Read Archive (SRA), by using the quality control software FastQC. We obtained quality values for 1 171 313 experiments, which can be used to evaluate the suitability of data for reuse. We also visualized the data distribution in SRA by integrating the quality information with the metadata of experiments and samples. We provide quality information for all of the archived sequencing data, which enables users to obtain sequencing data of sufficient quality for reanalysis. The calculated quality data are available to the public in various formats. Our data also provide an example of how a third party can enhance the reuse of public data by adding metadata to published research data.

Keywords: high-throughput sequencing; sequencing quality; public data; database.
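
As an illustration of the threshold-based selection described in the abstract, the following minimal Python sketch reads the "Basic Statistics" module of a FastQC text report (`fastqc_data.txt`) and checks an experiment against a minimum number of sequences. It is not the authors' pipeline, which this page does not describe. The function names `parse_basic_statistics` and `passes_threshold`, the threshold values, and the example report contents are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def parse_basic_statistics(report_text: str) -> Dict[str, str]:
    """Return the 'Basic Statistics' module of a FastQC text report as a dict."""
    stats: Dict[str, str] = {}
    in_module = False
    for line in report_text.splitlines():
        if line.startswith(">>Basic Statistics"):
            in_module = True
            continue
        if in_module and line.startswith(">>END_MODULE"):
            break
        # Skip the "#Measure<TAB>Value" header; keep tab-separated key/value rows.
        if in_module and line and not line.startswith("#"):
            key, _, value = line.partition("\t")
            stats[key] = value
    return stats


def passes_threshold(stats: Dict[str, str], min_total_sequences: int) -> bool:
    """Hypothetical filter: keep an experiment with enough sequenced reads."""
    return int(stats.get("Total Sequences", "0")) >= min_total_sequences


# Made-up report contents for illustration only (not data from the paper).
EXAMPLE_REPORT = "\n".join([
    "##FastQC\t0.11.5",
    ">>Basic Statistics\tpass",
    "#Measure\tValue",
    "Filename\texample.fastq",
    "Total Sequences\t250000",
    "Sequence length\t101",
    "%GC\t48",
    ">>END_MODULE",
])

example_stats = parse_basic_statistics(EXAMPLE_REPORT)
keep = passes_threshold(example_stats, min_total_sequences=100000)
```

The same pattern could be extended to other measures mentioned in the figure captions, such as read length or base call accuracy, provided the corresponding values are available in the quality data being filtered.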

Figures

**Figure 1:**
Performed sequencing experiments and sequenced samples of public data for quality calculation. **(a)** Bar plot of the top 20 library strategies. Values are categorical, retrieved from metadata described by the data submitter. **(b)** Bar plot of the top 20 sequenced sample organisms. Taxonomy information is retrieved from the NCBI taxonomy database and declared by the data submitter. **(c)** Bar plot of sequencing instrument models.

**Figure 2:**
Data distribution in a public data repository by sequencing quality. **(a, b)** Histogram of sequencing throughput (a) and one color-coded by library source (b). **(c, d)** Histogram of base call accuracy (c) and one color-coded by instrument manufacturer (d).

**Figure 3:**
Human data distribution for each library strategy. **(a–d)** Histograms separated by the top six library strategies. Data distribution is by the total number of sequences (a), median read length (b), sequencing throughput (c), and median base call accuracy (d) per experiment.

**Figure 4:**
Change of data distribution by sequencing quality over time. **(a, b)** Box plots separated by the top six library strategies, showing quarterly change. Data distribution is by the sequencing throughput (a) and median base call accuracy (b) per experiment. The numbers in the plots indicate the numbers of samples in a row. The lines connecting the boxes indicate changes of mean value.
