Bioinformatics. 2017 Feb 15;33(4):574-576.
doi: 10.1093/bioinformatics/btw663.

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies


Daniel Mapleson et al. Bioinformatics.

Abstract

Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies.

Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies.
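To make "k-mer frequencies and GC composition" concrete, here is a minimal Python sketch of the underlying quantities. It is an illustration only, not KAT's implementation: the function names (`canonical`, `count_kmers`, `spectrum`, `gc_count`) are hypothetical, and the choice to count canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption made for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(seqs, k):
    """Count canonical k-mers across sequences, skipping any k-mer with non-ACGT bases."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def gc_count(kmer):
    """Number of G or C bases in a k-mer."""
    return sum(1 for base in kmer if base in "GC")


# Toy input: two identical reads, k = 3.
reads = ["ACGTA", "ACGTA"]
counts = count_kmers(reads, 3)
freq_histogram = spectrum(counts)
```

Plotting `freq_histogram` for real WGS reads gives the kind of k-mer spectrum the abstract refers to; pairing each k-mer's frequency with `gc_count` gives a simple view of GC composition against coverage.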

Availability and implementation: KAT is available under the GPLv3 license at https://github.com/TGAC/KAT.

Contact: bernardo.clavijo@earlham.ac.uk.

Supplementary information: Supplementary data are available at Bioinformatics online.


Figures

Fig. 1.
(a) and (b), generated using KAT comp, show stacked histograms of read k-mer frequency versus assembly copy number for two different assemblies of a heterozygous Fraxinus excelsior genome (http://ftp-oadb.tsl.ac.uk/fraxinus_excelsior). Read content in black is absent from the assembly, red occurs once, purple twice, etc. Both k-mer spectra show an error distribution under 25×, heterozygous content around 50× and homozygous content around 100×. (a) contains most (but not all) of the heterozygous content, and introduces more duplications in homozygous content. (b) is more collapsed, including mostly a single copy of the homozygous content and less of the heterozygous content. (c) and (d), generated using KAT sect, show k-mer coverage across example assembled loci. The assembly k-mer coverage (black line) of assembly (a) in plot (c) shows that the assembly has two copies of this locus, whereas the read k-mer coverage (red line) implies there should be only a single copy. This incorrect duplication has been corrected in assembly (b), with the read and assembly k-mer coverage agreeing in plot (d). The increased read and assembly k-mer coverage at positions 100 and 400 indicates small regions of repetitive sequence in the genome. The halved read k-mer coverage after position 400 indicates a heterozygous locus, which likely caused the duplication of this locus in assembly (a). See Supplementary Section 5 for a more extensive analysis of all sequences from this locus and their impact on (a) and (b).
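The two views described in the caption can be sketched in a few lines of Python. This is a hedged illustration of the idea, not KAT's code: `count_kmers`, `comp_matrix` and `coverage_profile` are hypothetical names, the sequences are toy data, and for brevity k-mers are counted on the forward strand only (an assumption of this example).

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count forward-strand k-mers across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def comp_matrix(read_counts, asm_counts):
    """Tally distinct k-mers by (read frequency, assembly copy number).

    Entries with assembly copy number 0 correspond to read content that is
    absent from the assembly; copy number 1 is content included once, etc.
    """
    matrix = Counter()
    for kmer in read_counts.keys() | asm_counts.keys():
        matrix[(read_counts.get(kmer, 0), asm_counts.get(kmer, 0))] += 1
    return matrix


def coverage_profile(seq, k, counts):
    """Look up the count of each successive k-mer along a sequence."""
    return [counts.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


# Toy data, k = 3.
reads = ["ACGT", "ACGT", "CGTT"]
assembly = ["ACGTACG"]
read_counts = count_kmers(reads, 3)
asm_counts = count_kmers(assembly, 3)

matrix = comp_matrix(read_counts, asm_counts)
read_cov = coverage_profile(assembly[0], 3, read_counts)
asm_cov = coverage_profile(assembly[0], 3, asm_counts)
```

Summing `matrix` over k-mers at each read frequency, split by assembly copy number, yields the stacked histogram of panels (a) and (b). Comparing `read_cov` with `asm_cov` along a contig is the comparison drawn in panels (c) and (d): assembly coverage of 2 where read coverage suggests a single copy points to a duplicated locus.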

