Relevance, redundancy, and complementarity trade-off (RRCT): A principled, generic, robust feature-selection tool

Athanasios Tsanas

Patterns (N Y). 2022 Mar 31;3(5):100471. doi: 10.1016/j.patter.2022.100471. eCollection 2022 May 13.

Abstract

We present a new heuristic feature-selection (FS) algorithm that integrates in a principled algorithmic framework the three key FS components: relevance, redundancy, and complementarity. Thus, we call it relevance, redundancy, and complementarity trade-off (RRCT). The association strength between each feature and the response and between feature pairs is quantified via an information theoretic transformation of rank correlation coefficients, and the feature complementarity is quantified using partial correlation coefficients. We empirically benchmark the performance of RRCT against 19 FS algorithms across four synthetic and eight real-world datasets in indicative challenging settings evaluating the following: (1) matching the true feature set and (2) out-of-sample performance in binary and multi-class classification problems when presenting selected features into a random forest. RRCT is very competitive in both tasks, and we tentatively make suggestions on the generalizability and application of the best-performing FS algorithms across settings where they may operate effectively.
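To make the trade-off concrete, the sketch below outlines one way an RRCT-style greedy criterion could be assembled from the ingredients named in the abstract: relevance and redundancy quantified through an information-theoretic transform of Spearman correlations, and complementarity quantified through first-order partial correlations. This is an illustrative reading of the abstract, not the author's reference implementation; the function names, the equal weighting of the three terms, and the per-selected-feature partial correlations are assumptions made for illustration.

```python
# Illustrative sketch only: an RRCT-style greedy criterion assembled from the
# ingredients named in the abstract. This is NOT the author's reference
# implementation; the equal weighting of the three terms and the use of
# first-order partial correlations per selected feature are assumptions.
import numpy as np
from scipy.stats import spearmanr


def it_quantity(rho, cap=1000.0):
    """Information-theoretic transform of a rank correlation (cf. Figure 5)."""
    rho = abs(rho)
    if np.isnan(rho):
        return 0.0
    if rho >= 1.0:
        return cap  # -0.5*log(1 - rho^2) diverges as |rho| -> 1; cap it
    return min(-0.5 * np.log(1.0 - rho ** 2), cap)


def partial_spearman(x, y, z):
    """First-order partial Spearman correlation of x and y, controlling for z."""
    rxy, _ = spearmanr(x, y)
    rxz, _ = spearmanr(x, z)
    ryz, _ = spearmanr(y, z)
    denom = np.sqrt((1.0 - rxz ** 2) * (1.0 - ryz ** 2))
    return (rxy - rxz * ryz) / denom if denom > 0 else 0.0


def rrct_style_selection(X, y, n_select):
    """Greedy forward selection trading off relevance, redundancy, complementarity."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        scores = []
        for j in remaining:
            relevance = it_quantity(spearmanr(X[:, j], y)[0])
            if selected:
                redundancy = np.mean(
                    [it_quantity(spearmanr(X[:, j], X[:, s])[0]) for s in selected]
                )
                complementarity = np.mean(
                    [it_quantity(partial_spearman(X[:, j], y, X[:, s])) for s in selected]
                )
            else:
                redundancy, complementarity = 0.0, 0.0
            scores.append(relevance - redundancy + complementarity)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling rrct_style_selection(X, y, 10) on a numeric design matrix X and response vector y would return the indices of ten features in the order selected; for actual use, the implementation released with the paper should be preferred.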

Keywords: curse of dimensionality; dimensionality reduction; feature selection; information theory; principle of parsimony; statistical learning; variable selection.


Conflict of interest statement

The author declares no competing interests.

Figures

Graphical abstract
Figure 1. Comparison of the feature-selection algorithms in terms of true-feature-set recovery. The lower the false discovery rate (FDR), the better the feature-selection algorithm. The horizontal axis denotes the number of features selected during the incremental process, up to the number of true features in each dataset (the first synthetic dataset had 3 true features, the second had 8, and the third and fourth had 10 each). Above each plot, in parentheses, we present the size of the design matrix (samples × features) followed by the number of classes (e.g., the first dataset contains 60 samples and 30 features in a 2-class classification problem).
Figure 2. Comparison of the feature-selection algorithms based on random forest (RF) out-of-sample performance for the binary-classification datasets. The horizontal axis denotes the number of features selected in the greedy feature-selection process. When the assessment was done using 10-fold cross-validation, the results are presented as mean ± SD (with SD shown as error bars).
Figure 3. Comparison of the feature-selection algorithms based on RF out-of-sample performance for the multi-class classification datasets. The horizontal axis denotes the number of features selected in the greedy feature-selection process. When the assessment was done using 10-fold cross-validation, the results are presented as mean ± SD (with SD shown as error bars).
Figure 4. Average timings for the feature-selection (FS) algorithms explored in this study across the eight real-world datasets. The y axis is on a logarithmic scale for convenience. All experiments were run on a desktop Windows 10 machine with an Intel i9-9900K CPU at 3.6 GHz and 64 GB RAM, using MATLAB 2021b.
Figure 5. Information-theoretic (IT) quantity (relevance or redundancy) as a function of the rank (Spearman) correlation coefficient ρ, computed as I(ρ) = −0.5·log(1 − ρ²). Asymptotically, as the correlation coefficient tends to ±1, the IT quantity becomes infinite (in practice, we cap it at a very large value: 1,000).
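For a sense of scale, a few worked values of this mapping (computed here, assuming natural logarithms; not taken from the paper):

```latex
% Worked values of I(rho) = -0.5 * log(1 - rho^2), natural logarithm assumed
\[
  I(0.5) = -\tfrac{1}{2}\ln(1 - 0.25) \approx 0.14, \qquad
  I(0.9) \approx 0.83, \qquad
  I(0.99) \approx 1.96 .
\]
```

The quantity grows slowly for moderate correlations and diverges only as |ρ| approaches 1, which is why a finite cap (here 1,000) suffices in practice.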
Figure 6. Graphical representation of the effect of the partial correlation coefficient. The lowercase letters represent the shared information between the random variables.
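For context, the textbook first-order partial correlation, i.e., the association between X and Y after removing the part explained by a third variable Z, is stated below for reference; it is an assumption that the paper applies exactly this form to Spearman correlations.

```latex
% First-order partial correlation of X and Y, controlling for Z
\[
  \rho_{XY \cdot Z} \;=\;
  \frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}
       {\sqrt{\bigl(1 - \rho_{XZ}^{2}\bigr)\bigl(1 - \rho_{YZ}^{2}\bigr)}} .
\]
```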
