Relevance, redundancy, and complementarity trade-off (RRCT): A principled, generic, robust feature-selection tool

Athanasios Tsanas

Patterns (N Y). 2022 Mar 31;3(5):100471. doi: 10.1016/j.patter.2022.100471. eCollection 2022 May 13.

Abstract

We present a new heuristic feature-selection (FS) algorithm that integrates in a principled algorithmic framework the three key FS components: relevance, redundancy, and complementarity. Thus, we call it relevance, redundancy, and complementarity trade-off (RRCT). The association strength between each feature and the response and between feature pairs is quantified via an information theoretic transformation of rank correlation coefficients, and the feature complementarity is quantified using partial correlation coefficients. We empirically benchmark the performance of RRCT against 19 FS algorithms across four synthetic and eight real-world datasets in indicative challenging settings evaluating the following: (1) matching the true feature set and (2) out-of-sample performance in binary and multi-class classification problems when presenting selected features into a random forest. RRCT is very competitive in both tasks, and we tentatively make suggestions on the generalizability and application of the best-performing FS algorithms across settings where they may operate effectively.
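To make the trade-off concrete, the sketch below outlines one way an RRCT-style greedy criterion could be assembled from the ingredients named in the abstract: relevance and redundancy quantified through an information-theoretic transform of Spearman correlations, and complementarity quantified through first-order partial correlations. This is an illustrative reading of the abstract, not the author's reference implementation; the function names, the equal weighting of the three terms, and the per-selected-feature partial correlations are assumptions made for illustration.

```python
# Illustrative sketch only: an RRCT-style greedy criterion assembled from the
# ingredients named in the abstract. This is NOT the author's reference
# implementation; the equal weighting of the three terms and the use of
# first-order partial correlations per selected feature are assumptions.
import numpy as np
from scipy.stats import spearmanr


def it_quantity(rho, cap=1000.0):
    """Information-theoretic transform of a rank correlation (cf. Figure 5)."""
    rho = abs(rho)
    if np.isnan(rho):
        return 0.0
    if rho >= 1.0:
        return cap  # -0.5*log(1 - rho^2) diverges as |rho| -> 1; cap it
    return min(-0.5 * np.log(1.0 - rho ** 2), cap)


def partial_spearman(x, y, z):
    """First-order partial Spearman correlation of x and y, controlling for z."""
    rxy, _ = spearmanr(x, y)
    rxz, _ = spearmanr(x, z)
    ryz, _ = spearmanr(y, z)
    denom = np.sqrt((1.0 - rxz ** 2) * (1.0 - ryz ** 2))
    return (rxy - rxz * ryz) / denom if denom > 0 else 0.0


def rrct_style_selection(X, y, n_select):
    """Greedy forward selection trading off relevance, redundancy, complementarity."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        scores = []
        for j in remaining:
            relevance = it_quantity(spearmanr(X[:, j], y)[0])
            if selected:
                redundancy = np.mean(
                    [it_quantity(spearmanr(X[:, j], X[:, s])[0]) for s in selected]
                )
                complementarity = np.mean(
                    [it_quantity(partial_spearman(X[:, j], y, X[:, s])) for s in selected]
                )
            else:
                redundancy, complementarity = 0.0, 0.0
            scores.append(relevance - redundancy + complementarity)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling rrct_style_selection(X, y, 10) on a numeric design matrix X and response vector y would return the indices of ten features in the order selected; for actual use, the implementation released with the paper should be preferred.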

Keywords: curse of dimensionality; dimensionality reduction; feature selection; information theory; principle of parsimony; statistical learning; variable selection.


Conflict of interest statement

The author declares no competing interests.

Figures

Graphical abstract
Figure 1. Comparison of the feature-selection algorithms in terms of true-feature-set recovery. The lower the false discovery rate (FDR), the better the feature-selection algorithm. The horizontal axis denotes the number of features selected during the incremental process, up to the number of true features in each dataset (the first synthetic dataset had 3 true features, the second had 8, and the third and fourth had 10 each). Above each plot, in parentheses, we present the size of the design matrix (samples × features) followed by the number of classes (e.g., the first dataset contains 60 samples and 30 features in a 2-class classification problem).
Figure 2. Comparison of the feature-selection algorithms based on random forest (RF) out-of-sample performance for the binary-classification datasets. The horizontal axis denotes the number of features selected in the greedy feature-selection process. When the assessment was done using 10-fold cross-validation, the results are presented as mean ± SD (with SD shown as error bars).
Figure 3. Comparison of the feature-selection algorithms based on RF out-of-sample performance for the multi-class classification datasets. The horizontal axis denotes the number of features selected in the greedy feature-selection process. When the assessment was done using 10-fold cross-validation, the results are presented as mean ± SD (with SD shown as error bars).
Figure 4. Average timings for the feature-selection (FS) algorithms explored in this study across the eight real-world datasets. The y axis is on a logarithmic scale for convenience. All experiments were run on a desktop Windows 10 machine with an Intel i9-9900K CPU at 3.6 GHz and 64 GB RAM, using MATLAB 2021b.
Figure 5. Information-theoretic (IT) quantity (relevance or redundancy) as a function of the rank (Spearman) correlation coefficient ρ, computed as I(ρ) = −0.5·log(1 − ρ²). Asymptotically, as the correlation coefficient tends to ±1, the IT quantity becomes infinite (in practice, we cap it at a very large value: 1,000).
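For a sense of scale, a few worked values of this mapping (computed here, assuming natural logarithms; not taken from the paper):

```latex
% Worked values of I(rho) = -0.5 * log(1 - rho^2), natural logarithm assumed
\[
  I(0.5) = -\tfrac{1}{2}\ln(1 - 0.25) \approx 0.14, \qquad
  I(0.9) \approx 0.83, \qquad
  I(0.99) \approx 1.96 .
\]
```

The quantity grows slowly for moderate correlations and diverges only as |ρ| approaches 1, which is why a finite cap (here 1,000) suffices in practice.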
Figure 6. Graphical representation of the effect of the partial correlation coefficient. The lowercase letters represent the shared information between the random variables.
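For context, the textbook first-order partial correlation, i.e., the association between X and Y after removing the part explained by a third variable Z, is stated below for reference; it is an assumption that the paper applies exactly this form to Spearman correlations.

```latex
% First-order partial correlation of X and Y, controlling for Z
\[
  \rho_{XY \cdot Z} \;=\;
  \frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}
       {\sqrt{\bigl(1 - \rho_{XZ}^{2}\bigr)\bigl(1 - \rho_{YZ}^{2}\bigr)}} .
\]
```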
