G3 (Bethesda). 2023 Dec 29;14(1):jkad246.
doi: 10.1093/g3journal/jkad246.

kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS

Adnan Kivanc Corut et al. G3 (Bethesda).

Abstract

Genome-wide association studies (GWAS) have been widely used to identify genetic variation associated with complex traits. Despite its success and popularity, the traditional GWAS approach comes with a variety of limitations. For this reason, newer methods for GWAS have been developed, including the use of pan-genomes instead of a reference genome and the utilization of markers beyond single-nucleotide polymorphisms, such as structural variations and k-mers. The k-mers-based GWAS approach has especially gained attention from researchers in recent years. However, these new methodologies can be complicated and challenging to implement. Here, we present kGWASflow, a modular, user-friendly, and scalable workflow to perform GWAS using k-mers. We adapted an existing kmersGWAS method into an easier and more accessible workflow using management tools like Snakemake and Conda and eliminated the challenges caused by missing dependencies and version conflicts. kGWASflow increases the reproducibility of the kmersGWAS method by automating each step with Snakemake and using containerization tools like Docker. The workflow encompasses supplemental components such as quality control, read-trimming procedures, and generating summary statistics. kGWASflow also offers post-GWAS analysis options to identify the genomic location and context of trait-associated k-mers. kGWASflow can be applied to any organism and requires minimal programming skills. kGWASflow is freely available on GitHub (https://github.com/akcorut/kGWASflow) and Bioconda (https://anaconda.org/bioconda/kgwasflow).

Keywords: GWAS; bioinformatics tool; k-mers; pipeline; snakemake.
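
To make the idea of k-mers as GWAS markers concrete, the following is a minimal Python sketch of how a k-mer presence/absence table could be built from sequences. It is an illustration only and is not taken from kGWASflow or kmersGWAS: the function names (`reverse_complement`, `canonical_kmers`, `presence_absence`) and the toy sequences are assumptions, and real workflows count k-mers from sequencing reads with dedicated tools rather than from short strings.

```python
# Illustrative sketch only; names and toy data are hypothetical,
# not part of kGWASflow or kmersGWAS.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical_kmers(seq: str, k: int) -> set:
    """Collect canonical k-mers: for each k-mer, keep the lexicographically
    smaller of the k-mer and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip windows containing ambiguous bases such as N
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def presence_absence(samples: dict, k: int) -> dict:
    """Map each canonical k-mer to a 0/1 list, one entry per sample
    (in the iteration order of `samples`)."""
    per_sample = {name: canonical_kmers(seq, k) for name, seq in samples.items()}
    all_kmers = sorted(set().union(*per_sample.values()))
    return {
        kmer: [int(kmer in per_sample[name]) for name in samples]
        for kmer in all_kmers
    }


toy_samples = {"s1": "ACGT", "s2": "TTTT"}  # hypothetical toy sequences
table = presence_absence(toy_samples, k=2)
```

Each row of `table` plays the role that a genotype vector plays in a SNP-based GWAS: it records which samples carry a given k-mer, and that pattern can then be tested for association with the phenotype.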

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Fig. 1.
Overview of the kGWASflow workflow. The default kGWASflow workflow consists of three main phases: Preprocessing (1st step), k-mers-based GWAS (2nd step), and post-GWAS analyses (3rd step). The configuration and input files required by the workflow are indicated at the top. The final outputs and report of the workflow are outlined in the bottom right corner. Small boxes with solid outlines and shaded fill denote the publicly available tools employed in the workflow. The workflow steps are customizable, with multiple optional steps, such as read trimming and post-GWAS analysis options (options 1, 2, and 3). These optional steps are in dashed boxes.
Fig. 2.
Example working directory structure generated by kGWASflow initialization command: kgwasflow init. This working directory contains the default kGWASflow configuration files and the test/ directory with all the essential files required for a test workflow run.
Fig. 3.
Example outputs obtained from kGWASflow by processing the E. coli ampicillin resistance dataset (Rahman et al. 2018).
a) Bar plot showing the number of k-mers that appear in exactly N samples, where N ranges from 1 to the total number of samples. Only the k-mers that passed the initial filtering step were used.
b) Histogram showing the distribution of noncanonical k-mer counts. The x-axis shows the unique k-mer counts and the y-axis shows the number of samples. The legend at the top right shows the total number of unique k-mers (noncanonical). The histogram for canonical counts can be found in the Supplementary Data (Supplementary Figure S1a).
c) Joint plot showing the relationship between the noncanonical unique k-mer counts and the number of reads. The x-axis represents the number of unique k-mers (noncanonical), and the y-axis represents the total number of reads. The red line is the linear regression line. The r-value is the Pearson correlation coefficient, and the P-value is the two-tailed P-value. The marginal distributions of the x and y axes are shown on the top and right sides of the plot, respectively. The joint plot for canonical counts can be found in the Supplementary Data (Supplementary Figure S1b).
d) Histogram of the -log10 P-values of each k-mer that passed the first kmersGWAS step. The red dashed line indicates the 5% family-wise error-rate threshold, while the blue dashed line indicates the 10% family-wise error-rate threshold. Only the P-values of the best k-mers from the first kmersGWAS step are used. P-values are obtained from GEMMA during the second step of kmersGWAS (a detailed explanation can be found in the k-mers-based GWAS section).
e) Manhattan plot showing the -log10 P-values of k-mers that are significantly associated with ampicillin resistance, mapped to their genomic locations. k-mers were mapped to the E. coli plasmid pKBN10P04869A reference genome (PRJNA430286) using bowtie2.
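
The quantities behind panels a) and d) can be sketched in a few lines of Python. This is a hedged illustration, not kGWASflow code: the helper names, the toy presence/absence table, the toy P-values, and the threshold value are all assumptions made for the example. In particular, the threshold is simply passed in as a number on the -log10 scale; how kmersGWAS derives its family-wise error-rate thresholds is not described in this section and is not reproduced here.

```python
# Illustrative sketch only; helper names, toy values, and the threshold
# are hypothetical and not taken from kGWASflow or kmersGWAS.
import math
from collections import Counter


def samples_per_kmer_distribution(presence: dict) -> Counter:
    """Panel a): count how many k-mers appear in exactly N samples.
    `presence` maps each k-mer to a 0/1 list with one entry per sample."""
    return Counter(sum(row) for row in presence.values())


def neg_log10_pvalues(pvalues: dict) -> dict:
    """Panel d): convert raw P-values to the -log10 scale."""
    return {kmer: -math.log10(p) for kmer, p in pvalues.items()}


def kmers_passing(pvalues: dict, threshold: float) -> list:
    """Return k-mers whose -log10 P-value meets or exceeds `threshold`
    (the threshold is supplied by the caller, already on the -log10 scale)."""
    scores = neg_log10_pvalues(pvalues)
    return sorted(kmer for kmer, score in scores.items() if score >= threshold)


toy_presence = {"AAC": [1, 0, 1], "CGT": [1, 1, 1], "GGA": [0, 0, 1]}
toy_pvalues = {"AAC": 1e-8, "CGT": 1e-3, "GGA": 0.5}

distribution = samples_per_kmer_distribution(toy_presence)
significant = kmers_passing(toy_pvalues, threshold=6.0)
```

In this toy data, one k-mer is present in one sample, one in two samples, and one in all three, which is the kind of tally panel a) plots as bars; `significant` contains only the k-mers that would fall to the right of a dashed threshold line in panel d).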

References

    1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410. doi:10.1016/S0022-2836(05)80360-2
    2. Andrews S. 2010. FastQC: A Quality Control Tool for High Throughput Sequence Data. Cambridge (UK): Babraham Bioinformatics, Babraham Institute.
    3. Boyle EA, Li YI, Pritchard JK. 2017. An expanded view of complex traits: from polygenic to omnigenic. Cell. 169:1177–1186. doi:10.1016/j.cell.2017.05.038
    4. Cano-Gamez E, Trynka G. 2020. From GWAS to function: using functional genomics to identify the mechanisms underlying complex diseases. Front Genet. 11:424. doi:10.3389/fgene.2020.00424
    5. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. 2015. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 4:7. doi:10.1186/s13742-015-0047-8