BMC Bioinformatics. 2019 Jun 13;20(1):333.
doi: 10.1186/s12859-019-2869-3.

fastJT: An R package for robust and efficient feature selection for machine learning and genome-wide association studies

Jiaxing Lin et al. BMC Bioinformatics. 2019.

Abstract

Background: Parametric feature selection methods for machine learning and association studies based on genetic data are not robust with respect to outliers or influential observations. While rank-based, distribution-free statistics offer a robust alternative to parametric methods, their practical utility can be limited, as they demand significant computational resources when analyzing high-dimensional data. For genetic studies that seek to identify variants, the hypothesis is constrained, since it is typically assumed that the effect of the genotype on the phenotype is monotone (e.g., an additive genetic effect). Similarly, predictors for machine learning applications may have natural ordering constraints. Cross-validation for feature selection in these high-dimensional contexts necessitates highly efficient computational algorithms for the robust evaluation of many features.

Results: We have developed an R extension package, fastJT, for conducting genome-wide association studies and feature selection for machine learning using the Jonckheere-Terpstra statistic for constrained hypotheses. The kernel of the package features an efficient algorithm for calculating the statistics, replacing the pairwise comparison and counting processes with a data sorting and searching procedure, reducing computational complexity from O(n^2) to O(n log(n)). The computational efficiency is demonstrated through extensive benchmarking, and example applications to real data are presented.
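The sorting-and-searching idea can be illustrated with a minimal Python sketch. This is not the fastJT code (the package is an R extension), and the function names `jt_pairwise` and `jt_sorted` are hypothetical. The sketch assumes the common form of the Jonckheere-Terpstra statistic: over all pairs of samples drawn from a lower-ordered and a higher-ordered group, count 1 when the lower group's value is smaller and 0.5 when the two values tie. It also assumes a small, fixed number of ordered groups, such as three genotype classes, so that re-sorting after each group keeps the overall cost near O(n log(n)).

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def jt_pairwise(values, groups):
    """Reference O(n^2) version: compare every pair of samples directly."""
    total = 0.0
    for va, ga in zip(values, groups):
        for vb, gb in zip(values, groups):
            if ga < gb:  # only pairs ordered by group contribute
                if va < vb:
                    total += 1.0
                elif va == vb:
                    total += 0.5  # ties count one half
    return total


def jt_sorted(values, groups):
    """Same statistic using sorting plus binary search instead of pairwise counting."""
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)

    total = 0.0
    lower = []  # sorted values from all groups ordered before the current one
    for g in sorted(by_group):
        for v in by_group[g]:
            lo = bisect_left(lower, v)   # values strictly below v
            hi = bisect_right(lower, v)  # values below or equal to v
            total += lo + 0.5 * (hi - lo)
        # Fold the current group into the sorted pool for the next group.
        lower.extend(by_group[g])
        lower.sort()
    return total
```

For the six samples `values = [1.0, 2.0, 3.0, 2.0, 4.0, 5.0]` with `groups = [0, 0, 1, 1, 2, 2]`, both functions return 11.5. Only the second avoids the quadratic loop over sample pairs.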

Conclusions: fastJT is an open-source R extension package that applies the Jonckheere-Terpstra statistic to robust feature selection for machine learning and association studies. The package implements an efficient algorithm that leverages internal information among the samples to avoid unnecessary computations, and it incorporates shared-memory parallel programming to further boost performance on multi-core machines.

Keywords: Constrained inference; Feature selection; Genome-wide association studies; Jonckheere-Terpstra; Linear rank statistic; Logarithmic complexity; Machine learning; Parallel processing; Robust statistic.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Box plot of selected plasma protein levels for CALGB 80303 data. Box plot of VEGF-A, VEGF-C, and MCP1 plasma protein levels in CALGB 80303. The boxes indicate the 25th (Q1) and 75th (Q3) percentiles, with the heavy line showing the median value. Whiskers indicate max(min(plasma levels), Q1 - 1.5 * IQR) and min(max(plasma levels), Q3 + 1.5 * IQR). The circles represent individual patient plasma levels.
Fig. 2
Cross-validation CPU times. CPU times for computing standardized JT test statistics using cross-validation with different numbers of folds, k, based on n=1,000 samples with m=1,000,000 features and p=1 quantitative trait. Each reported time is the mean of B=100 simulation replicates
Fig. 3
CPU times for varying numbers of SNPs and traits. a: CPU times for computing standardized JT test statistics for different numbers of SNPs, m, with a fixed number of traits (p=50) and samples (n=1,000) using 8 threads. b: CPU times for computing standardized JT test statistics for different numbers of traits, p, with a fixed number of SNPs (m=1,000) and samples (n=1,000). Reported time for panel (a) is the mean of B=10 simulation replicates. Reported time for panel (b) is the mean of B=100 simulation replicates
Fig. 4
CPU times for varying numbers of samples. CPU times for fastJT and an implementation of the JT test using (unsorted) pairwise comparisons. Results are shown for different numbers of samples (n), with a fixed number of SNPs (m=1,000) and traits (p=1,000). Each reported time is the mean of B=100 simulation replicates
Fig. 5
CPU times for varying numbers of processing cores. CPU times for different numbers of processing cores, with a fixed number of SNPs (m=1,000), traits (p=1,000), and samples (n=1,000). The dashed line gives the single core CPU time divided by the number of cores. Each reported time is the mean of B=100 simulation replicates
Fig. 6
Empirical rejection rates in the presence of outliers. Empirical rejection rates for the linear regression model and the JT test, for data simulated with n=500 samples across varying minor allele frequencies (MAFs) and varying proportions of outliers, π. The red solid line indicates the nominal rate of 0.05. Each reported rate is based on B=10,000 simulation replicates.
Fig. 7
Machine learning prediction of plasma levels in CALGB 80303. Comparison of observed and predicted plasma protein levels of VEGF-A, VEGF-C, and MCP1. The machine learning model is built based on the top 100 SNPs selected by fastJT, and trained using the glmnet package
