Proc Natl Acad Sci U S A. 2020 Mar 3;117(9):4571-4577.
doi: 10.1073/pnas.1906831117. Epub 2020 Feb 18.

Expert-augmented machine learning

Efstathios D Gennatas et al. Proc Natl Acad Sci U S A.

Abstract

Machine learning is proving invaluable across disciplines. However, its success is often limited by the quality and quantity of available data, while its adoption is limited by the level of trust afforded by given models. Human vs. machine performance is commonly compared empirically to decide whether a certain task should be performed by a computer or an expert. In reality, the optimal learning strategy may involve combining the complementary strengths of humans and machines. Here, we present expert-augmented machine learning (EAML), an automated method that guides the extraction of expert knowledge and its integration into machine-learned models. We used a large dataset of intensive-care patient data to derive 126 decision rules that predict hospital mortality. Using an online platform, we asked 15 clinicians to assess the relative risk of the subpopulation defined by each rule compared to the total sample. We compared the clinician-assessed risk to the empirical risk and found that, while clinicians agreed with the data in most cases, there were notable exceptions where they overestimated or underestimated the true risk. Studying the rules with greatest disagreement, we identified problems with the training data, including one miscoded variable and one hidden confounder. Filtering the rules based on the extent of disagreement between clinician-assessed risk and empirical risk, we improved performance on out-of-sample data and were able to train with less data. EAML provides a platform for automated creation of problem-specific priors, which help build robust and dependable machine-learning models in critical applications.

Keywords: computational medicine; machine learning; medicine.

Conflict of interest statement

Competing interest statement: The editor, P.J.B., and one of the authors, M.J.v.d.L., are at the same institution (University of California, Berkeley).

Figures

Fig. 1.
Overview of the methods. RuleFit involves 1) training a gradient boosting model on the input data, 2) converting boosted trees to rules by concatenating conditions from the root node to each leaf node, and 3) training an L1-regularized (LASSO) logistic regression model. Each rule defines a subpopulation that satisfies all conditions in the rule. Clinician experts assess the mortality risk of the subpopulation defined by each rule compared to the whole sample on a web application. For each rule, delta ranking is calculated as the difference between the subpopulation’s empirical risk as suggested by the data and the clinicians’ estimate. A final model is trained by reducing the influence of those rules with highest delta ranking. This forms an efficient procedure where experts are asked to assess 126 simple rules of 3 to 5 variables each instead of assessing 24,508 cases with 17 variables each.
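The rule and delta-ranking steps described in this caption can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the toy cohort, the rule names, the clinician scores, and the zero-disagreement cutoff are all hypothetical, and delta ranking is assumed here to be the difference between each rule's rank by empirical relative risk and its rank by mean clinician assessment.

```python
import operator

# Hypothetical toy cohort: features plus an in-hospital death flag.
patients = [
    {"age": 80, "gcs": 2, "died": 1},
    {"age": 75, "gcs": 3, "died": 1},
    {"age": 82, "gcs": 5, "died": 0},
    {"age": 40, "gcs": 5, "died": 0},
    {"age": 35, "gcs": 4, "died": 0},
    {"age": 60, "gcs": 1, "died": 0},
    {"age": 55, "gcs": 5, "died": 0},
    {"age": 30, "gcs": 5, "died": 0},
]

OPS = {"<=": operator.le, ">": operator.gt}

# A rule is the list of conditions collected from the root to one leaf.
# These three rules are made up for illustration.
rules = {
    "old_low_gcs": [("age", ">", 70), ("gcs", "<=", 3)],
    "young": [("age", "<=", 50)],
    "low_gcs": [("gcs", "<=", 3)],
}

def matches(patient, rule):
    """A patient belongs to the subpopulation if every condition holds."""
    return all(OPS[op](patient[var], thr) for var, op, thr in rule)

def relative_risk(rule, cohort):
    """Mortality in the rule's subpopulation divided by cohort mortality.

    Assumes the subpopulation is non-empty and the cohort has deaths.
    """
    sub = [p for p in cohort if matches(p, rule)]
    base = sum(p["died"] for p in cohort) / len(cohort)
    return (sum(p["died"] for p in sub) / len(sub)) / base

def rank(scores):
    """Rank 1 = lowest score (ties are not averaged in this sketch)."""
    ordered = sorted(scores, key=scores.get)
    return {name: i + 1 for i, name in enumerate(ordered)}

empirical = {name: relative_risk(r, patients) for name, r in rules.items()}

# Hypothetical mean clinician assessments on a -2 (highly decrease)
# to +2 (highly increase) scale.
clinician = {"old_low_gcs": 1.8, "young": -1.5, "low_gcs": 1.9}

emp_rank, cli_rank = rank(empirical), rank(clinician)
delta = {name: emp_rank[name] - cli_rank[name] for name in rules}

# Keep only rules where data and clinicians agree (cutoff is an assumption).
kept = [name for name in rules if abs(delta[name]) <= 0]
print(delta, kept)
```

In the procedure described above, the retained (or down-weighted) rules would then feed the final L1-regularized logistic regression; that fitting step is omitted here.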
Fig. 2.
Example of a rule presented to clinicians. Age, GCS (1, <6; 2, 6 to 8; 3, 9 to 10; 4, 11 to 13; 5, 14 to 15), ratio of oxygen blood concentration to fractional inspired oxygen concentration (PaO2/FiO2), and BUN concentration are the variables selected for this rule. The decision tree rules derived from gradient boosting, e.g., age ≤ 73.65 and GCS ≤ 4, were converted to the form “median (range)”, e.g., age, 56.17 (16.01 to 73.65), for continuous variables and to the form “mode (included levels)” for categorical variables. Rules were presented in a randomized order, one at a time. The top line (blue box) displays the values for the subpopulation defined by the given rule. The bottom line (gray box) displays the values of the whole population. Participants were asked to assess the risk of belonging to the defined subpopulation compared to the whole sample using a five-point system: highly decrease, moderately decrease, no effect, moderately increase, and highly increase.
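The "median (range)" and "mode (included levels)" display forms described in this caption could be produced along the following lines. This is a sketch only: the function names are hypothetical, and the sample values are invented so that the age summary reproduces the example quoted in the caption.

```python
from statistics import median, mode

def summarize_continuous(values):
    """Format a continuous variable as 'median (min to max)'."""
    return f"{median(values):.2f} ({min(values):.2f} to {max(values):.2f})"

def summarize_categorical(values):
    """Format a categorical variable as 'mode (included levels)'."""
    levels = ", ".join(str(v) for v in sorted(set(values)))
    return f"{mode(values)} ({levels})"

# Invented values for the subpopulation matching age <= 73.65 and GCS <= 4.
ages = [16.01, 42.5, 56.17, 61.3, 73.65]
gcs_levels = [1, 2, 2, 3, 4]

age_text = summarize_continuous(ages)
gcs_text = summarize_categorical(gcs_levels)
print("age,", age_text)   # age, 56.17 (16.01 to 73.65)
print("GCS,", gcs_text)   # GCS, 2 (1, 2, 3, 4)
```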
Fig. 3.
Mortality ratio by average clinicians’ risk ranking. Rules were binned into quintiles based on average clinicians’ assessment. The mean empirical risk for each quintile was plotted. Error bars indicate 1.96 * SE.
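The binning behind this figure can be sketched as follows, assuming equal-count quintiles and the standard error computed as the sample standard deviation divided by the square root of the bin size. The function name and the ten example rules are hypothetical; the actual study used 126 rules.

```python
from math import sqrt
from statistics import mean, stdev

def quintile_summary(clinician_scores, empirical_risks):
    """Return (mean empirical risk, 1.96 * SE) for each clinician-score quintile.

    Rules are sorted by average clinician assessment and split into five
    equal-count bins; each bin needs at least two rules for stdev().
    """
    pairs = sorted(zip(clinician_scores, empirical_risks))
    n = len(pairs)
    summary = []
    for q in range(5):
        chunk = [risk for _, risk in pairs[q * n // 5:(q + 1) * n // 5]]
        se = stdev(chunk) / sqrt(len(chunk))
        summary.append((mean(chunk), 1.96 * se))
    return summary

# Hypothetical average clinician assessments and empirical mortality ratios.
scores = [-2.0, -1.8, -1.0, -0.9, 0.0, 0.1, 1.0, 1.1, 1.9, 2.0]
risks = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 2.0, 2.4]

summary = quintile_summary(scores, risks)
for q, (m, half_width) in enumerate(summary, start=1):
    print(f"quintile {q}: mean risk {m:.2f} +/- {half_width:.3f}")
```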
Fig. 4.
Variable importance estimated using a Random Forest model predicting mortality (A), and clinicians’ assessments (B). While PaO2/FiO2 is the most important variable in both cases, in the former case it is used to learn intubation status, while in the latter clinicians are responding based on its physiological influence on mortality.
Fig. 5.
Example of variable shift: heart rate distribution of the same set of patients from MIMIC-II and MIMIC-III (A). Models were trained on MIMIC-II data using different subsets of rules defined by the extent of clinicians’ agreement with the empirical risk (delta ranking cut in five bins, ΔR) (B). Mean AUC of models trained on MIMIC-II and tested on MIMIC-III (C) and models trained and tested on MIMIC-III (D). Subsamples of different sizes were used for each subset of rules defined by ΔR to test the hypothesis that eliminating bad rules helps the algorithm train with less data. Error bars represent 1 SD across 10 stratified subsamples.
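One way to draw the stratified subsamples mentioned in this caption is sketched below. It assumes stratification by the mortality outcome, which the caption does not state explicitly; the function name, the 25% fraction, and the toy labels are hypothetical. Model training and AUC evaluation on each subsample are left out.

```python
import random

def stratified_subsample(labels, fraction, seed):
    """Return sorted indices of a subsample that keeps each class's share.

    Each outcome class is sampled separately without replacement, so the
    subsample preserves the original mortality rate (up to rounding).
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    chosen = []
    for idx in by_class.values():
        k = max(1, round(fraction * len(idx)))
        chosen.extend(rng.sample(idx, k))
    return sorted(chosen)

# Toy outcome vector: 20 deaths among 100 patients.
labels = [1] * 20 + [0] * 80

# Ten subsamples at one size; a model would be trained on each, and the
# mean and SD of the resulting AUCs would give one point and its error bar.
subsamples = [stratified_subsample(labels, 0.25, seed) for seed in range(10)]
print([sum(labels[i] for i in s) / len(s) for s in subsamples])
```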
