BMC Genomics. 2019 Apr 18;19(Suppl 9):238.

doi: 10.1186/s12864-019-5445-3.

SQUAT: a Sequencing Quality Assessment Tool for data quality assessments of genome assemblies

Li-An Yang¹, Yu-Jung Chang², Shu-Hwa Chen¹, Chung-Yen Lin¹, Jan-Ming Ho¹,³

Affiliations

¹ Institute of Information Science, Academia Sinica, Taipei, Taiwan.
² Institute of Information Science, Academia Sinica, Taipei, Taiwan. yjchang@iis.sinica.edu.tw.
³ Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan.

PMID: 30999844
PMCID: PMC7402383
DOI: 10.1186/s12864-019-5445-3

Li-An Yang et al. BMC Genomics. 2019.

Abstract

Background: With the rapid increase in genome sequencing projects for non-model organisms, numerous genome assemblies are in progress or available only as drafts rather than as satisfactory, usable genomes. Data quality assessment of genome assemblies is gaining importance, not only for those who perform assembly or re-assembly but also for those who use assemblies as maps in downstream analyses. Recent studies on quality control and quality assessment of genome assemblies have focused either on quality control of reads before assembly or on evaluation of assemblies with respect to contiguity and correctness. However, correctness assessment depends on a reference genome and is therefore not applicable to de novo assembly projects. Hence, methods that provide both pre-assembly and post-assembly quality assessment reports, examining the quality of the input reads and the correctness of de novo assemblies, are worth developing.

Results: We present SQUAT, an efficient tool for both pre-assembly and post-assembly quality assessment of de novo genome assemblies. The pre-assembly module of SQUAT computes quality statistics of reads and presents them in a portable HTML report that visualizes the distribution of high- and poor-quality reads. The post-assembly module of SQUAT provides read mapping analytics, also in HTML format. We categorized reads into several groups, including uniquely mapped, multiply mapped, and unmapped reads; uniquely mapped reads were further categorized as perfectly matched, with substitutions, containing clips, and others. We carefully defined several groups of poorly mapped (PM) reads to prevent the underestimation of unmapped reads; a high PM% is a sign of a poor assembly that requires further examination or improvement before use. Finally, we evaluated SQUAT with six datasets, including the genome assemblies for eel, worm, mushroom, and three bacteria. The results show that SQUAT reports provide detailed, useful information for assessing the quality of assemblies and reads.
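The read grouping described above can be illustrated with a minimal Python sketch. It is not SQUAT's implementation. The labels P, S, and C follow the figure captions of this record; the labels "O", "multi", and "unmapped", the `Alignment` record, the use of CIGAR strings and an edit distance as inputs, and the rule for which groups count toward PM% are all assumptions made for illustration. Only the default clip ratio threshold of 0.3 is taken from the record.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# Matches one CIGAR operation, e.g. "60M" or "40S" (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations counted toward the original read length (hard clips included).
READ_OPS = "MIS=XH"


@dataclass
class Alignment:
    """Hypothetical per-read summary; not a SQUAT data structure."""
    cigar: Optional[str]   # None when the read is unmapped
    num_hits: int          # number of reported mapping locations
    edit_distance: int     # mismatches plus inserted/deleted bases


def parse_cigar(cigar: str):
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def clip_ratio(cigar: str) -> float:
    """Fraction of the read that is soft- or hard-clipped."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in READ_OPS)
    clipped = sum(n for n, op in ops if op in "SH")
    return clipped / read_len if read_len else 0.0


def classify(aln: Alignment) -> str:
    """Assign one label per read, following the grouping in the abstract."""
    if aln.cigar is None or aln.num_hits == 0:
        return "unmapped"
    if aln.num_hits > 1:
        return "multi"
    ops = parse_cigar(aln.cigar)
    if any(op in "SH" for _, op in ops):
        return "C"   # uniquely mapped, contains clips
    if aln.edit_distance == 0:
        return "P"   # uniquely mapped, perfectly matched
    if not any(op in "ID" for _, op in ops):
        return "S"   # uniquely mapped, substitutions only
    return "O"       # uniquely mapped, everything else


def pm_percent(alns: Sequence[Alignment], clip_threshold: float = 0.3) -> float:
    """Percentage of poorly mapped reads under this sketch's assumed rule:
    unmapped reads, 'O' reads, and clipped reads at or above the threshold."""
    poor = 0
    for aln in alns:
        label = classify(aln)
        if label in ("unmapped", "O"):
            poor += 1
        elif label == "C" and clip_ratio(aln.cigar) >= clip_threshold:
            poor += 1
    return 100.0 * poor / len(alns) if alns else 0.0
```

Under these assumptions, a read aligned as `60M40S` has a clip ratio of 0.4 and counts as poorly mapped, whereas `90M10S` (ratio 0.1) does not. The tool's actual rules for multiply mapped and substitution-heavy reads may differ from this sketch.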

Availability: The SQUAT software, with links to its Docker image and online manual, is freely available at https://github.com/luke831215/SQUAT.

Keywords: Data quality assessment; Genome assembly; Genome sequencing; Non-model organisms.

Conflict of interest statement

None of the authors have any competing interests.

Figures

**Fig. 2**
Classification of reads by read mapping analysis. The descriptions and icons of these read labels are shown in Table 1.

**Fig. 3**
Label distribution bar chart of the mushroom dataset. The poorly mapped ratio (PM%) of the mushroom dataset is 8.8% with BWA-MEM (left) and 16.3% with BWA-backtrack (right).

**Fig. 4**
Clip ratio distribution of the mushroom dataset. The threshold value is set at 0.3 by default.

**Fig. 5**
Alignment score distribution for the mushroom dataset. (a) Distribution of reads with no errors (type P). (b) Distribution of reads with substitution errors (type S). (c) Distribution of reads containing clips (type C). The alignment scores of most reads decrease in the order P, S, C.

**Fig. 6**
Post-assembly report interface

Grants and funding

105-2221-E-001-031-MY3/Ministry of Science and Technology, Taiwan
