BMC Genomics. 2019 Apr 18;19(Suppl 9):238.
doi: 10.1186/s12864-019-5445-3.

SQUAT: a Sequencing Quality Assessment Tool for data quality assessments of genome assemblies

Li-An Yang et al. BMC Genomics. 2019.

Abstract

Background: With the rapid increase in genome sequencing projects for non-model organisms, numerous genome assemblies are in progress or available as drafts, but have not been released as satisfactory, usable genomes. Data quality assessment of genome assemblies is gaining importance not only for people who perform the assembly/re-assembly processes, but also for those who use assemblies as maps in downstream analyses. Recent studies of the quality control and quality evaluation/assessment of genome assemblies have focused on either quality control of reads before assembly or evaluation of the assemblies with respect to their contiguity and correctness. However, correctness assessment depends on a reference and is not applicable to de novo assembly projects. Hence, methods that provide both pre-assembly and post-assembly quality assessment reports for examining the quality/correctness of de novo assemblies and their input reads are worth developing.

Results: We present SQUAT, an efficient tool for both pre-assembly and post-assembly quality assessment of de novo genome assemblies. The pre-assembly module of SQUAT computes quality statistics of reads and presents the analysis in a portable HTML report whose interface visualizes the distribution of high- and poor-quality reads. The post-assembly module of SQUAT provides read mapping analytics, also in HTML format. We categorized reads into several groups: uniquely mapped, multiply mapped, and unmapped reads. Uniquely mapped reads were further categorized as perfectly matched, with substitutions, containing clips, and others. We divided the poorly mapped (PM) reads into several carefully defined groups to prevent the underestimation of unmapped reads; a high PM% would be a sign of a poor assembly that requires researchers' attention for further examination or improvement before the assembly is used. Finally, we evaluated SQUAT with six datasets, including the genome assemblies for eel, worm, mushroom, and three bacteria. The results show that SQUAT reports provide useful, detailed information for assessing the quality of assemblies and reads.
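The read categories described above can be illustrated with a minimal Python sketch. It is not SQUAT's implementation: the labels P, S, and C follow the figure captions, while the labels "O", "M", and "U", the function names, the use of CIGAR strings and an edit distance as inputs, and the precedence of clips over substitutions are all assumptions made for illustration. The abstract does not state which labels count as poorly mapped, so that set is passed in as a parameter.

```python
import re
from collections import Counter

# CIGAR operations as defined in the SAM format (length followed by an op code).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def classify_read(num_hits, cigar, edit_distance):
    """Assign one illustrative label to a read.

    num_hits      -- number of reported alignments for the read
    cigar         -- CIGAR string of the alignment ("*" if unmapped)
    edit_distance -- mismatches plus indel bases (the NM tag in SAM)

    Labels "U", "M", and "O" are hypothetical; P, S, C follow the captions.
    """
    if num_hits == 0:
        return "U"  # unmapped
    if num_hits > 1:
        return "M"  # multiply mapped
    ops = {op for _, op in CIGAR_OP.findall(cigar)}
    if ops & {"S", "H"}:
        return "C"  # uniquely mapped, containing clips (assumed to take precedence)
    if ops <= {"M", "=", "X"}:
        # No indels or clips: perfect if there are no mismatches at all.
        return "P" if edit_distance == 0 and "X" not in ops else "S"
    return "O"  # uniquely mapped, other (e.g. indels)


def pm_percent(labels, poor_labels):
    """Percentage of reads whose label is in the caller-supplied poor set."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    poor = sum(counts[label] for label in poor_labels)
    return 100.0 * poor / total
```

For example, `pm_percent(labels, {"C", "U"})` would treat clipped and unmapped reads as poorly mapped; which labels belong in that set depends on the definitions in the full paper.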

Availability: The SQUAT software, with links to both its Docker image and the online manual, is freely available at https://github.com/luke831215/SQUAT.

Keywords: Data quality assessment; Genome assembly; Genome sequencing; Non-model organisms.

Conflict of interest statement

None of the authors have any competing interests.

Figures

Fig. 1
SQUAT assessment workflow
Fig. 2
Classification of reads by read mapping analysis. The descriptions and icons of these read labels are shown in Table 1
Fig. 3
Label distribution bar chart of the mushroom dataset. The poorly mapped ratio (PM%) of the mushroom dataset is 8.8% with BWA-MEM (left) and 16.3% with BWA-backtrack (right)
Fig. 4
Clip ratio distribution of the mushroom dataset. The threshold value is set at 0.3 by default
Fig. 5
Alignment score distribution for the mushroom dataset. (a) Distribution of reads with no errors (type P). (b) Distribution of reads with substitution errors (type S). (c) Distribution of reads containing clips (type C). The alignment scores of most reads decrease in the order P, S, C
Fig. 6
Post-assembly report interface
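
The clip ratio and its default threshold of 0.3 (Fig. 4) can be sketched as follows. This is an assumption-laden illustration, not SQUAT's code: the clip ratio is assumed here to be the number of soft- or hard-clipped bases divided by the full read length, both derived from the CIGAR string, and the way the threshold is applied (a strict greater-than comparison) is likewise assumed. The function names are hypothetical.

```python
import re

# CIGAR operations as defined in the SAM format (length followed by an op code).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

CLIP_OPS = {"S", "H"}                       # soft and hard clips
READ_OPS = {"M", "I", "S", "=", "X", "H"}   # ops that account for bases of the original read


def clip_ratio(cigar):
    """Fraction of the read that is clipped (assumed definition)."""
    clipped = 0
    read_len = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in CLIP_OPS:
            clipped += n
        if op in READ_OPS:
            read_len += n
    return clipped / read_len if read_len else 0.0


def exceeds_clip_threshold(cigar, threshold=0.3):
    """True if the clip ratio is strictly above the threshold (assumed rule)."""
    return clip_ratio(cigar) > threshold
```

Under these assumptions, a 100-base read aligned as `40S60M` has a clip ratio of 0.4 and exceeds the default threshold, while `20H80M` (ratio 0.2) does not.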
