G3 (Bethesda). 2023 Dec 29;14(1):jkad246.

doi: 10.1093/g3journal/jkad246.

kGWASflow: a modular, flexible, and reproducible Snakemake workflow for k-mers-based GWAS

Adnan Kivanc Corut¹, Jason G Wallace¹,²,³

Affiliations

¹ Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.
² Institute of Plant Breeding, Genetics, and Genomics, University of Georgia, Athens, GA 30602, USA.
³ Department of Crop and Soil Sciences, University of Georgia, Athens, GA 30602, USA.

PMID: 37976215
PMCID: PMC10755180
DOI: 10.1093/g3journal/jkad246

Adnan Kivanc Corut et al. G3 (Bethesda). 2023.

Abstract

Genome-wide association studies (GWAS) have been widely used to identify genetic variation associated with complex traits. Despite its success and popularity, the traditional GWAS approach comes with a variety of limitations. For this reason, newer methods for GWAS have been developed, including the use of pan-genomes instead of a reference genome and the utilization of markers beyond single-nucleotide polymorphisms, such as structural variations and k-mers. The k-mers-based GWAS approach has especially gained attention from researchers in recent years. However, these new methodologies can be complicated and challenging to implement. Here, we present kGWASflow, a modular, user-friendly, and scalable workflow to perform GWAS using k-mers. We adapted an existing kmersGWAS method into an easier and more accessible workflow using management tools like Snakemake and Conda and eliminated the challenges caused by missing dependencies and version conflicts. kGWASflow increases the reproducibility of the kmersGWAS method by automating each step with Snakemake and using containerization tools like Docker. The workflow encompasses supplemental components such as quality control, read-trimming procedures, and generating summary statistics. kGWASflow also offers post-GWAS analysis options to identify the genomic location and context of trait-associated k-mers. kGWASflow can be applied to any organism and requires minimal programming skills. kGWASflow is freely available on GitHub (https://github.com/akcorut/kGWASflow) and Bioconda (https://anaconda.org/bioconda/kgwasflow).
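To make the idea of k-mers as markers concrete, the following is a minimal Python sketch of a k-mer presence/absence table across samples. It is not the kmersGWAS or kGWASflow implementation; the function names (`kmers`, `presence_table`) and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_table(samples, k):
    """Map each k-mer to the sorted list of sample names that contain it."""
    table = {}
    for name, seq in samples.items():
        for kmer in kmers(seq, k):
            table.setdefault(kmer, []).append(name)
    return {kmer: sorted(names) for kmer, names in table.items()}


# Toy sequences, invented for illustration only (not real data).
toy_samples = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TTGTAC"}
table = presence_table(toy_samples, k=3)

# How many k-mers occur in exactly N samples.
samples_per_kmer = Counter(len(names) for names in table.values())
```

In a k-mers-based GWAS, a presence/absence pattern of this kind, rather than a SNP genotype, is what gets tested for association with the trait. Real workflows rely on dedicated k-mer counters and much larger k than this toy example.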

Keywords: GWAS; bioinformatics tool; k-mers; pipeline; snakemake.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

**Fig. 1.**
Overview of the kGWASflow workflow. The default kGWASflow workflow consists of three main phases: Preprocessing (1st step), k-mers-based GWAS (2nd step), and post-GWAS analyses (3rd step). The configuration and input files required by the workflow are indicated at the top. The final outputs and report of the workflow are outlined in the bottom right corner. Small boxes with solid outlines and shaded fill denote the publicly available tools employed in the workflow. The workflow steps are customizable, with multiple optional steps, such as read trimming and post-GWAS analysis options (options 1, 2, and 3). These optional steps are in dashed boxes.

**Fig. 2.**
Example working directory structure generated by the kGWASflow initialization command: kgwasflow init. This working directory contains the default kGWASflow configuration files and the test/ directory with all the essential files required for a test workflow run.

**Fig. 3.**
Example outputs obtained from kGWASflow by processing the *E. coli* ampicillin resistance dataset (Rahman *et al.* 2018). a) Bar plot showing the number of k-mers that appeared in exactly “N” samples (“N” ranges from 1 to the total number of samples). Only the k-mers that passed the initial filtering step were used. b) Histogram showing the distribution of noncanonical k-mer counts. The x-axis shows the unique k-mer counts and the y-axis shows the number of samples. The legend at the top right shows the total number of unique k-mers (noncanonical). The histogram for canonical counts can be found in the Supplementary Data (Supplementary Figure S1a). c) Joint plot showing the relationship between the noncanonical unique k-mer counts and the number of reads. The x-axis represents the number of unique k-mers (noncanonical), and the y-axis represents the number of total reads. The red line represents the linear regression line. The r-value is the Pearson correlation coefficient, and the P-value is the two-tailed P-value. The marginal distributions of the x and y axes are also shown on the top and right sides of the plot, respectively. The joint plot for canonical counts can be found in the Supplementary Data (Supplementary Figure S1b). d) Histogram of the -log10 P-values of each k-mer that passed the first kmersGWAS step. The red dashed line indicates the 5% family-wise error-rate threshold, while the blue dashed line indicates the 10% family-wise error-rate threshold. Only the P-values of the best k-mers from the first kmersGWAS step are used. P-values are obtained from GEMMA during the second step of kmersGWAS (a detailed explanation can be found in the k-mers-based GWAS section). e) Manhattan plot showing -log10 P-values of k-mers that are significantly associated with ampicillin resistance, mapped to their genomic locations. k-mers were mapped to the *E. coli* plasmid pKBN10P04869A reference genome (PRJNA430286) using bowtie2.
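The quantities named in the Fig. 3 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not code from kGWASflow: the helper names (`canonical`, `pearson_r`, `neg_log10`) are hypothetical, and the canonical form shown here (the lexicographically smaller of a k-mer and its reverse complement) is one common convention that is assumed rather than taken from the paper.

```python
import math

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def neg_log10(p_value):
    """Transform a P-value to the -log10 scale used in panels d and e."""
    return -math.log10(p_value)
```

Under this convention, `canonical("CGT")` and `canonical("ACG")` both give `"ACG"`, so a k-mer and its reverse complement are counted as a single entry. Noncanonical counting keeps them as two separate k-mers.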
