BMC Bioinformatics. 2019 Jun 13;20(1):333.

doi: 10.1186/s12859-019-2869-3.

fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies

Jiaxing Lin¹, Alexander Sibley², Ivo Shterev³, Andrew Nixon², Federico Innocenti⁴, Cliburn Chan¹, Kouros Owzar⁵ ⁶ ⁷

Affiliations

¹ Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA.
² Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA.
³ Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA.
⁴ Division of Pharmacotherapy and Experimental Therapeutics, Chapel Hill, NC, USA.
⁵ Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA. Kouros.Owzar@duke.edu.
⁶ Duke Cancer Institute, Duke University Medical Center, Durham, NC, USA. Kouros.Owzar@duke.edu.
⁷ Division of Pharmacotherapy and Experimental Therapeutics, Chapel Hill, NC, USA. Kouros.Owzar@duke.edu.

PMID: 31195980
PMCID: PMC6567636
DOI: 10.1186/s12859-019-2869-3

Jiaxing Lin et al. BMC Bioinformatics. 2019.

Abstract

Background: Parametric feature selection methods for machine learning and association studies based on genetic data are not robust with respect to outliers or influential observations. While rank-based, distribution-free statistics offer a robust alternative to parametric methods, their practical utility can be limited, as they demand significant computational resources when analyzing high-dimensional data. For genetic studies that seek to identify variants, the hypothesis is constrained, since it is typically assumed that the effect of the genotype on the phenotype is monotone (e.g., an additive genetic effect). Similarly, predictors for machine learning applications may have natural ordering constraints. Cross-validation for feature selection in these high-dimensional contexts necessitates highly efficient computational algorithms for the robust evaluation of many features.

Results: We have developed an R extension package, fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses. The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n²) to O(n log(n)). The computational efficiency is demonstrated through extensive benchmarking, and example applications to real data are presented.
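The idea of replacing pairwise counting with sorting and searching can be illustrated with a minimal sketch. This is not the fastJT implementation: it is written in Python rather than R, the function names `jt_pairwise` and `jt_sorted` are hypothetical, and it assumes the common form of the (unstandardized) Jonckheere-Terpstra statistic in which each cross-group pair contributes 1 when the value from the higher-ordered group is larger and 0.5 when the two values are tied. With a fixed, small number of groups (for example, three genotype classes), the sorted version costs O(n log(n)), whereas the direct version costs O(n²).

```python
from bisect import bisect_left, bisect_right
from itertools import combinations


def jt_pairwise(groups):
    """Reference version: compare every cross-group pair directly, O(n^2)."""
    total = 0.0
    # combinations() keeps list order, so `lo` always precedes `hi`
    # in the hypothesized ordering of the groups.
    for lo, hi in combinations(groups, 2):
        for x in lo:
            for y in hi:
                if x < y:
                    total += 1.0
                elif x == y:
                    total += 0.5  # assumed tie convention: half credit
    return total


def jt_sorted(groups):
    """Same statistic via sorting each group once and binary searching."""
    sorted_groups = [sorted(g) for g in groups]
    total = 0.0
    for lo, hi in combinations(sorted_groups, 2):
        for x in lo:
            left = bisect_left(hi, x)    # number of values in hi below x
            right = bisect_right(hi, x)  # number of values in hi <= x
            # Values strictly above x count 1 each; ties count 0.5 each.
            total += (len(hi) - right) + 0.5 * (right - left)
    return total


# Illustrative data only: trait values grouped by an ordered factor,
# such as 0, 1, or 2 copies of a minor allele.
example_groups = [[1.2, 0.8, 2.0], [1.5, 2.0, 2.7], [2.9, 3.1]]
```

Both functions return the same value on `example_groups`; only the amount of work differs. Each group is sorted once and then reused for every comparison it takes part in.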

Conclusions: fastJT is an open-source R extension package, applying the Jonckheere-Terpstra statistic for robust feature selection for machine learning and association studies. The package implements an efficient algorithm which leverages internal information among the samples to avoid unnecessary computations, and incorporates shared-memory parallel programming to further boost performance on multi-core machines.

Keywords: Constrained inference; Feature selection; Genome-wide association studies; Jonckheere-Terpstra; Linear rank statistic; Logarithmic complexity; Machine learning; Parallel processing; Robust statistic.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Boxplot of selected plasma protein levels for CALGB 80303 data. Box plot of VEGF-A, VEGF-C, and MCP1 plasma protein levels in CALGB 80303. The boxes indicate the 25th (Q1) and 75th (Q3) percentiles, with the heavy line showing the median value. Whiskers indicate max(min(plasma levels), Q1 - 1.5 * IQR) and min(max(plasma levels), Q3 + 1.5 * IQR). The circles represent individual patient plasma levels

**Fig. 2**
Cross-validation CPU times. CPU times for computing standardized JT test statistics using cross-validation with different numbers of folds, k, based on n=1,000 samples with m=1,000,000 features and p=1 quantitative trait. Each reported time is the mean of B=100 simulation replicates

**Fig. 3**
CPU times for varying numbers of SNPs and traits. a: CPU times for computing standardized JT test statistics for different numbers of SNPs, m, with a fixed number of traits (p=50) and samples (n=1,000) using 8 threads. b: CPU times for computing standardized JT test statistics for different numbers of traits, p, with a fixed number of SNPs (m=1,000) and samples (n=1,000). Reported time for panel (a) is the mean of B=10 simulation replicates. Reported time for panel (b) is the mean of B=100 simulation replicates

**Fig. 4**
CPU times for varying numbers of samples. CPU times for fastJT and an implementation of the JT test using (unsorted) pairwise comparisons. Results are shown for different numbers of samples (n), with a fixed number of SNPs (m=1,000) and traits (p=1,000). Each reported time is the mean of B=100 simulation replicates

**Fig. 5**
CPU times for varying numbers of processing cores. CPU times for different numbers of processing cores, with a fixed number of SNPs (m=1,000), traits (p=1,000), and samples (n=1,000). The dashed line gives the single core CPU time divided by the number of cores. Each reported time is the mean of B=100 simulation replicates

**Fig. 6**
Empirical rejection rates in the presence of outliers. Empirical rejection rates for the linear regression model and the JT test, for data simulated with n=500 samples for varying levels of MAFs and varying proportions of outliers, π. The red solid line indicates the nominal rate of 0.05. Each reported rate is based on B=10,000 simulation replicates

**Fig. 7**
Machine learning prediction of plasma levels in CALGB 80303. Comparison of observed and predicted plasma protein levels of VEGF-A, VEGF-C, and MCP1. The machine learning model is built based on the top 100 SNPs selected by fastJT, and trained using the glmnet package

Cited by

Common variation in a long non-coding RNA gene modulates variation of circulating TGF-β2 levels in metastatic colorectal cancer patients (Alliance).
Quintanilha JCF, Sibley AB, Liu Y, Niedzwiecki D, Halabi S, Rogers L, O'Neil B, Kindler H, Kelly W, Venook A, McLeod HL, Ratain MJ, Nixon AB, Innocenti F, Owzar K. BMC Genomics. 2024 May 14;25(1):473. doi: 10.1186/s12864-024-10354-7. PMID: 38745123. Free PMC article.
