Assessing calibration and bias of a deployed machine learning malnutrition prediction model within a large healthcare … – Nature.com

The primary training cohort used to recalibrate the model included 49,652 patients (median [IQR] age = 66.0 [26.0]), of which 49.9% self-identified as female, 29.6% self-identified as Black or African American, 54.8% were on Medicare and 27.8% on Medicaid. 11,664 (24%) malnutrition cases were identified. Baseline characteristics are summarized in Table 1 and malnutrition event rates are summarized in Supplementary Table 2. The validation cohort used to test the model included 17,278 patients (median [IQR] age = 66.0 [27.0]), of which 49.8% self-identified as female, 27.1% self-identified as Black or African American, 52.9% were on Medicare, and 28.2% on Medicaid. 4,005 (23%) malnutrition cases were identified.

Although the model overall had a c-index of 0.81 (95% CI: 0.80, 0.81), it was miscalibrated according to both weak and moderate calibration metrics, with a Brier score of 0.26 (95% CI: 0.25, 0.26) (Table 2), indicating that the model is relatively inaccurate17. It also overfitted the risk estimate distribution, as evidenced by the calibration curve (Supplementary Fig. 1). Logistic recalibration of the model successfully improved calibration, bringing the calibration intercept to 0.07 (95% CI: 0.11, 0.03), calibration slope to 0.88 (95% CI: 0.86, 0.91), and significantly decreasing Brier score (0.21, 95% CI: 0.20, 0.22), Emax (0.03, 95% CI: 0.01, 0.05), and Eavg (0.01, 95% CI: 0.01, 0.02). Recalibrating the model improved specificity (0.74 to 0.93), PPV (0.47 to 0.60), and accuracy (0.74 to 0.80) while decreasing sensitivity (0.75 to 0.35) and NPV (0.91 to 0.83) (Supplementary Tables 2 and 3).

Weak and moderate calibration metrics between Black and White patients significantly differed prior to recalibration (Table 3, Supplementary Fig. 2A, B), with the model having a more negative calibration intercept for White patients on average compared to Black patients (1.17 vs. 1.07), and Black patients having a higher calibration slope compared to White patients (1.43 vs. 1.29). Black patients had a higher Brier score of 0.30 (95% CI: 0.29, 0.31) compared to White patients with 0.24 (95% CI: 0.23, 0.24). Logistic recalibration significantly improved calibration for both Black and White patients (Table 4, Fig. 1ac). For Black patients within the hold-out set, the recalibrated calibration intercept was 0 (95% CI: -0.07, 0.05), calibration slope was 0.91 (95% CI: 0.87, 0.95), and Brier score improved from 0.30 to 0.23 (95% CI: 0.21, 0.25). For White patients within the hold-out set, the recalibrated calibration intercept was -0.15 (95% CI: -0.20, -0.10), calibration slope was 0.82 (95% CI: 0.78, 0.85), and Brier score improved from 0.24 to 0.19 (95% CI: 0.18, 0.21). Post-recalibration, calibration for Black and White patients still differed significantly according to weak calibration metrics, but not so according to moderate calibration metrics and the strong calibration curves (Table 4, Fig. 1). Calibration curves of the recalibrated model showed good concordance between actual and predicted event probabilities, although the predicted risks for Black and White patients differed between the 30th and 60th risk percentiles. Logistic recalibration also improved the specificity, PPV, and accuracy, but decreased the sensitivity and NPV of the model across both White and Black patients (Supplementary Tables 2and 3). Discriminative ability was not significantly different for White and Black patients before and after recalibration. We also found calibration statistics to be relatively similar in Asian patients (Supplementary Table 4).

Columns from left to right are curves for a, No Recalibration b, Recalibration-in-the-Large and c, Logistic Recalibration for Black vs. White patients d, No Recalibration e, Recalibration-in-the-Large and f, Logistic Recalibration for male vs. female patients.

Calibration metrics between male and female patients also significantly differed prior to recalibration (Table 3, Supplementary Fig. 2C, D). The model had a more negative calibration intercept for female patients on average compared to male patients (1.49 vs. 0.88). Logistic recalibration significantly improved calibration for both male and female patients (Table 4, Fig. 1df). In male patients within the hold-out set, the recalibrated calibration intercept was 0 (95% CI: 0.05, 0.03), calibration slope was 0.88 (95% CI: 0.85, 0.90), and Brier score improved from 0.29 to 0.23 (95% CI: 0.22, 0.24). In female patients within the hold-out set, the recalibrated calibration intercept was 0.11 (95% CI: 0.16, 0.06), calibration slope was 0.91 (95% CI: 0.87, 0.94), but the Brier score did not significantly improve. After logistic recalibration, only calibration intercepts differed between male and female patients. Calibration curves of the recalibrated model showed good concordance, although the predicted risks for males and females differed between the 10th and 30th risk percentiles. Discrimination metrics for male and female patients were significantly different before recalibration. The model had a higher sensitivity and NPV for females than males, but a lower specificity, PPV, and accuracy (Supplementary Table 2). The recalibrated model had the highest sensitivity (0.95, 95% CI: 0.94, 0.96), NPV (0.84, 95% CI: 0.83, 0.85) and accuracy (0.82, 95% CI: 0.81, 0.83) for female patients, at the cost of substantially decreasing sensitivity (0.27, 95% CI: 0.25, 0.30) (Supplementary Table 3).

We also assessed calibration by payor type and hospital type as sensitivity analyses. In the payor type analysis, we found that malnutrition predicted risk was more miscalibrated in patients with commercial insurance with more extreme calibration intercepts, Emax, and Eavg suggesting overestimation of risk (Supplementary Tables 5 and 6, Supplementary Fig. 3A, B). We did not observe substantial differences in weak or moderate calibration across hospital type (community, tertiary, quaternary) except that tertiary acute care centers had a more extreme calibration intercept, suggesting an overestimation of risk (Supplementary Tables 7 and 8, Supplementary Fig. 3C, D). Across both subgroups, logistic recalibration significantly improved calibration across weak, moderate, and strong hierarchy tiers (Supplementary Table 5, Supplementary Table 7, Supplementary Figs. 4 and 5).

Read this article:
Assessing calibration and bias of a deployed machine learning malnutrition prediction model within a large healthcare ... - Nature.com

Related Posts

Comments are closed.