Appl Psychol Meas. 2024 Jul;48(4-5):167-186. doi: 10.1177/01466216241238744. Epub 2024 Mar 11.

Using Interpretable Machine Learning for Differential Item Functioning Detection in Psychometric Tests


Elisabeth Barbara Kraus et al. Appl Psychol Meas. 2024 Jul.

Abstract

This study presents a novel method for investigating test fairness and differential item functioning that combines psychometrics and machine learning. Test unfairness manifests itself in systematic, demographically imbalanced influences of confounding constructs on residual variances in psychometric modeling. Our method aims to account for the resulting complex relationships between response patterns and demographic attributes. Specifically, it measures the importance of individual test items and latent ability scores, relative to a random baseline variable, when predicting demographic characteristics. We conducted a simulation study to examine how the method functions under various conditions, including linear and complex impact and unfairness, varying numbers of factors and unfair items, and varying test lengths. We found that our method detects unfair items as reliably as Mantel-Haenszel statistics or logistic regression analyses but generalizes to multidimensional scales in a straightforward manner. To apply the method, we used random forests to predict migration background from ability scores and single items of an elementary school reading comprehension test. One item was found to be unfair according to all proposed decision criteria. Further analysis of the item's content provided plausible explanations for this finding. Analysis code is available at: https://osf.io/s57rw/?view_only=47a3564028d64758982730c6d9c6c547.
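In outline, the method trains a classifier (here, a random forest) to predict a demographic attribute from the item responses, the estimated ability scores, and a purely random baseline variable; an item is flagged when its variable importance clearly exceeds that of the baseline. The sketch below illustrates this logic with scikit-learn on toy data. It is a hypothetical illustration, not the authors' OSF code: the simulated data, feature names, and two-standard-deviation cutoff are assumptions made for the example.

```python
# Minimal sketch of an importance-based DIF screen (illustrative only, not the
# authors' analysis code): predict a demographic attribute from item responses,
# estimated ability scores (theta), and a random baseline variable; features
# whose permutation importance clearly exceeds the baseline's are flagged as
# candidates for DIF.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Toy data standing in for real test data: 500 examinees, 10 dichotomous items,
# one stand-in ability estimate, and a binary demographic attribute.
n, k = 500, 10
items = rng.integers(0, 2, size=(n, k))            # scored item responses
theta = items.mean(axis=1) + rng.normal(0, .3, n)  # stand-in for IRT theta estimates
group = rng.integers(0, 2, size=n)                 # demographic attribute to predict
baseline = rng.normal(size=n)                      # random baseline variable

X = np.column_stack([items, theta, baseline])
feature_names = [f"item_{i + 1}" for i in range(k)] + ["theta", "baseline"]

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, group)
imp = permutation_importance(rf, X, group, n_repeats=50, random_state=0)

# Assumed decision rule for this sketch: flag features whose mean importance
# exceeds the baseline's mean importance by more than two standard deviations.
cutoff = imp.importances_mean[-1] + 2 * imp.importances_std[-1]
for name, m in zip(feature_names[:-1], imp.importances_mean[:-1]):
    if m > cutoff:
        print(f"{name}: importance {m:.4f} exceeds baseline cutoff {cutoff:.4f}")
```

With purely random toy data, no feature should exceed the cutoff; on real data, flagged items would be candidates for the kind of content review described in the abstract.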

Keywords: differential item functioning; interpretable machine learning; machine learning; psychometrics; random forest; test fairness.


Conflict of interest statement

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Figures

Figure 1. Data generation models.
Figure 2. Distributions of variable importance scores. Note. Theta = person parameter of intended factors; RF = random forest; boxes show the 25th, 50th, and 75th percentiles.
Figure 3. Relationships between item difficulty, accuracy, and false-positive rates (FPR).
Figure 4. Discovery rates (DR) and false-positive rates (FPR) for random forests (RF) including and excluding thetas and for the benchmark methods of Mantel–Haenszel (MH) and logistic regression (log) in unfair conditions.
Figure 5. False-positive rates (FPR) for random forests (RF) including and excluding thetas and for the benchmark methods of Mantel–Haenszel (MH) and logistic regression (log) in null and fair conditions.
Figure 6. DIME model (direct and inferential mediation model; Cromley & Azevedo, 2007).
Figure 7. The psychometric model. Note. Roman numerals indicate levels of reading competence. Level I (decoding = reading single words) was not covered explicitly by the test.
Figure 8. Percentages of correct responses conditional on migration background.
Figure 9. Comparison of variable importance. Note. Error bars represent 95% confidence intervals.
