Case study/2025/solo · machine learning · cdc brfss

Diabetes risk prediction, tuned for the right kind of mistake.

A pre-screening tool for diabetes risk built on CDC BRFSS data. The whole design pivots on one clinical decision: false negatives are far more costly than false positives, so the model is tuned for recall, not accuracy, and the threshold makes that explicit.

Role
Solo: data, modelling, calibration, write-up
Module
WM9QG Fundamentals of AI & Data Mining · Warwick, 2025
Stack
python · xgboost · shap · platt calibration · crisp-dm
~80% Recall · diabetic class
0.82 ROC-AUC
0.15 Decision threshold
250k CDC BRFSS observations

01 The problem

Nearly 45% of adults with diabetes worldwide don't know they have it. "Early detection" is the intervention, but only if the model is tuned for the right kind of mistake.

The global healthcare cost of diabetes has grown 338% over 17 years and is projected to hit $1 trillion by 2050. A pre-screening tool that can flag at-risk people for follow-up testing has real value, but only if it's built around the clinical reality of how it will be used.

That reality is asymmetric. A pre-screening tool has two ways to be wrong, and they aren't equal:

  • Actually healthy, predicted healthy: true negative. Correct, no action.
  • Actually healthy, predicted at-risk: false positive. Annoying. Sends a healthy person for a follow-up test.
  • Actually diabetic, predicted healthy: false negative. Tells a diabetic person they're fine. They delay treatment. Complications develop.
  • Actually diabetic, predicted at-risk: true positive. Correct, referred for testing.

False negatives are far more costly. That realisation determined the entire model design: optimise for recall, not accuracy or F1. Everything downstream (the threshold, the calibration, the loss surface I cared about) followed from that single decision.
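To make the asymmetry concrete, here is a minimal sketch of why accuracy can reward the wrong model under class imbalance. The counts are invented for illustration and are not results from this project; `screening_metrics` is a hypothetical helper.

```python
def screening_metrics(tp, fp, fn, tn):
    """Accuracy, recall and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),      # share of diabetic people who get flagged
        "precision": tp / (tp + fp),   # share of referrals that were warranted
    }

# Made-up counts for 1,000 people, 140 of them diabetic.
# "cautious" refers many people; "lazy" almost never raises a flag.
cautious = screening_metrics(tp=112, fp=300, fn=28, tn=560)
lazy = screening_metrics(tp=14, fp=10, fn=126, tn=850)

print(cautious)
print(lazy)
```

In this toy case the lazy model wins on accuracy while missing nine in ten diabetic people, which is the failure mode a recall target is meant to rule out.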

02 Approach

I worked through the full CRISP-DM pipeline rather than jumping straight to a classifier. The framework's value, on a dataset this messy, is that it forces you to look at the data before you model it.

  • Dataset. CDC BRFSS, 250,000 survey observations across the full set of behavioural and health indicators, with strong class imbalance (most respondents non-diabetic).
  • Association Rule Mining first. Apriori (min 2% support, min 30% confidence, lift > 1.5) to surface co-occurring risk patterns before committing to a model. Top pattern: poor general health combined with other risk factors. This phase was about understanding the structure of the data, not about producing rules to ship.
  • Classification, three-class. No diabetes / Prediabetes / Diabetes. Three-class is more clinically meaningful than binary: prediabetes is exactly the intervention window where lifestyle changes still work.
  • Model. XGBoost with Platt (sigmoid) calibration so probability outputs are interpretable, not just ordinal scores. Threshold set at 0.15 rather than the default 0.5. This is where the recall trade-off is made explicit.
  • Clustering, in parallel. Unsupervised segmentation to group the population into risk profiles for targeted public-health interventions. Ran alongside the classification work, not after it.
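The Apriori filters in the list above (support, confidence, lift) can be sketched for a single candidate rule as follows. This is an illustration only: the respondent rows and item names are made up, and `rule_metrics` and `passes` are hypothetical helpers, not the project's code or any library's API.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur at least once.
    """
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n               # how often the full pattern appears
    confidence = n_both / n_ante       # P(consequent | antecedent)
    lift = confidence / (n_cons / n)   # confidence relative to the base rate
    return support, confidence, lift

def passes(support, confidence, lift,
           min_support=0.02, min_confidence=0.30, min_lift=1.5):
    """Apply the three filters quoted in the text."""
    return support >= min_support and confidence >= min_confidence and lift > min_lift

# Ten made-up respondents, each a set of binary flags.
rows = [
    {"PoorGenHlth", "HighBP", "Diabetes"},
    {"PoorGenHlth", "HighBP", "Diabetes"},
    {"PoorGenHlth", "HighBP"},
    {"HighBP"},
    {"HighBP", "Diabetes"},
    {"PoorGenHlth"},
    set(), set(), set(), set(),
]

s, c, l = rule_metrics(rows, {"PoorGenHlth", "HighBP"}, {"Diabetes"})
print(s, c, l, passes(s, c, l))
```

Lift is the filter that matters most on imbalanced data: a rule can have high confidence simply because its consequent is common, and lift divides that base rate out.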
[Figure: pipeline diagram. CDC BRFSS (250k obs, 21 features) → Apriori rule mining (min sup 2%) → Data prep (stratified split + impute) → XGBoost (scale_pos_weight) → Platt calibration (sigmoid) → Threshold 0.15 (recall ≥ 80%) → SHAP (per-row explanation). The three highlighted steps are the core design decisions.]
Fig. 1, CRISP-DM pipeline. Association rules first, classifier second — ARM informed feature intuition rather than competing with the model. The three accent-bordered stages (XGBoost, Platt calibration, threshold 0.15) are where the cost-asymmetric design lives.
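The calibration stage in Fig. 1 can be sketched in a few lines. Platt scaling fits a sigmoid, p = sigmoid(a · score + b), to held-out scores and outcomes by minimising log loss. The version below is a simplified pure-Python illustration using plain gradient descent and made-up scores; it omits the target smoothing in Platt's original method, and `fit_platt` is a hypothetical helper. In practice a library route such as scikit-learn's `CalibratedClassifierCV(method="sigmoid")` would be the usual choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y   # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Made-up raw model scores and outcomes from a held-out calibration set.
raw_scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
outcomes = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, outcomes)
calibrated = [sigmoid(a * s + b) for s in raw_scores]
print(a, b)
print([round(p, 3) for p in calibrated])
```

Because the mapping is monotone, calibration does not change the ranking of patients; it changes what a number like 0.15 means when the threshold is applied.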

03 Key decisions

Four choices mattered more than the model architecture itself:

Threshold 0.15
Not 0.5. The default threshold assumes equal cost for both error types. That's wrong in a clinical context. 0.15 pushes the model toward recall at the expense of precision: more referrals, fewer missed cases. This is a domain decision, not a model-tuning decision, and the case study calls it out as such.
Platt calibration
Make the probabilities mean something. Uncalibrated XGBoost scores are not reliable probabilities; they behave more like relative rankings. Calibration is what makes a 0.15 threshold meaningful: "this person has a 15% estimated probability of diabetes" is a clinical statement. "This person has a score of 0.15 on an arbitrary scale" is not.
Stratified split
Fit all preprocessing on training data only, then apply the fitted transforms to the test set. Scaling, imputation, every fitted transform: train-only. Any resampling touches the training set alone. Easy to skip, and skipping it quietly ruins your evaluation. With strong class imbalance, stratifying the split is the simplest way to keep every class represented in the test set in its original proportion.
SHAP, not feature importance
Per-prediction explanations, not just global ranks. Tree-based importance scores are global; SHAP values are per-row and signed, so I can show a clinician why a specific patient got flagged. For a screening tool, that matters more than knowing which feature mattered overall.
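The split discipline described above can be sketched without any library. This is a minimal illustration on invented labels and a made-up BMI column; `stratified_split` and `fit_standardiser` are hypothetical helpers, and the class codes 0 / 1 / 2 are assumed here to stand for no diabetes / prediabetes / diabetes.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

def fit_standardiser(values):
    """Learn mean and standard deviation from the given values only."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0   # avoid dividing by zero on a constant column
    return mean, std

# Made-up imbalanced labels and a made-up BMI column.
labels = [0] * 80 + [1] * 5 + [2] * 15
bmi = [20 + (i % 25) for i in range(100)]

train_idx, test_idx = stratified_split(labels)

# Statistics come from the training rows only, then get applied to the test rows.
mean, std = fit_standardiser([bmi[i] for i in train_idx])
test_scaled = [(bmi[i] - mean) / std for i in test_idx]
print(len(train_idx), len(test_idx))
```

A plain random split of the same data could easily leave the five prediabetes rows entirely in one partition, which is the problem stratification avoids.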

04 Explainability

The top risk factors from SHAP (general health self-rating, age, high blood pressure, BMI) match what a GP already knows. I take that as a sign of validity rather than a boring result. A screening model that surfaces risk factors a clinician disagrees with isn't necessarily wrong, but it owes you a much bigger explanation. One that matches clinical intuition only owes you an evaluation.

SHAP beeswarm summary plot. Top features: GenHlth, Age, HighBP, BMI. High feature values (pink) push toward at-risk prediction; low values (blue) push toward no-diabetes.
Fig. 2, Global SHAP summary (beeswarm). GenHlth, Age, HighBP, and BMI dominate — consistent with established clinical risk factors. The right kind of boring.
Why this matters for deployment

A screening tool needs to do two things at once: surface high-risk patients, and explain why. SHAP makes the second possible without giving up the first: the model stays an XGBoost classifier, and the explanation layer sits on top.
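To show what "per-row and signed" means, the sketch below computes exact Shapley values for one patient by enumerating feature subsets. It is not the shap library and not the project's model: `toy_score` is an invented scoring function, the patient and baseline rows are made up, and absent features are simply replaced with baseline values. For a real XGBoost model, the shap library's `TreeExplainer` would be the usual route.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values for one row; absent features take baseline values."""
    names = list(x)
    n = len(names)

    def value(present):
        row = {k: (x[k] if k in present else baseline[k]) for k in names}
        return model(row)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                # Marginal contribution of this feature given the subset.
                total += weight * (value(present | {name}) - value(present))
        phi[name] = total
    return phi

def toy_score(row):
    """Invented risk score with one interaction term (illustration only)."""
    score = 0.1 * row["GenHlth"] + 0.3 * row["HighBP"] + 0.01 * row["BMI"]
    if row["HighBP"] and row["BMI"] > 30:
        score += 0.2
    return score

patient = {"GenHlth": 4, "HighBP": 1, "BMI": 34}
baseline = {"GenHlth": 2, "HighBP": 0, "BMI": 27}

phi = shapley_values(toy_score, patient, baseline)
print(phi)
```

The attributions add up to the gap between this patient's score and the baseline score, and the interaction bonus is shared between HighBP and BMI, which is the property that makes a per-patient explanation readable.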

05 Results

Headline numbers on the held-out test set:

Metric | Value | Why this matters
Recall (diabetic class) | ~80% | The cost-asymmetric target. 4 in 5 diabetic patients flagged.
ROC-AUC | 0.82 | Threshold-independent discrimination.
Decision threshold | 0.15 | Where the recall trade-off is made explicit.
Classes | 3-way | No diabetes / Prediabetes / Diabetes.
Two plots: Metrics vs Classification Threshold showing where recall crosses 0.80 (dotted line), and a scatter of balanced accuracy vs recall with valid operating points highlighted in green above the 0.80 recall target.
Fig. 3, Threshold analysis. Left: recall, specificity, precision and balanced accuracy vs threshold — the dotted line marks the 80% recall target that drives the 0.15 decision. Right: all operating points with recall ≥ 80% shown in green; 0.15 sits at the balanced-accuracy peak of the valid set.
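The selection rule that Fig. 3 describes (keep operating points with recall of at least 80%, then take the one with the best balanced accuracy) can be sketched as below. The probabilities and labels are made up so the example runs on its own; `metrics_at` and `pick_threshold` are hypothetical helpers, and the output says nothing about the real model.

```python
def metrics_at(probs, labels, threshold):
    """Recall and balanced accuracy when flagging every p >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return recall, (recall + specificity) / 2

def pick_threshold(probs, labels, candidates, min_recall=0.80):
    """Best balanced accuracy among thresholds that meet the recall target."""
    valid = []
    for t in candidates:
        recall, balanced_accuracy = metrics_at(probs, labels, t)
        if recall >= min_recall:
            valid.append((balanced_accuracy, t))
    return max(valid)[1] if valid else None

# Made-up calibrated probabilities for the diabetic class, with true labels.
probs = [0.05, 0.08, 0.16, 0.12, 0.18, 0.22, 0.30, 0.40, 0.55, 0.70]
labels = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]
candidates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.50]

chosen = pick_threshold(probs, labels, candidates)
print(chosen, metrics_at(probs, labels, chosen))
```

Treating the recall target as a hard constraint and balanced accuracy as the tie-breaker keeps the clinical requirement separate from the optimisation, so the target can be changed without retraining anything.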

06 What I'd do differently

Two limitations I want to be explicit about, because a screening tool that ignores them isn't ready for clinical use:

  • Self-reported survey data introduces reporting bias. BRFSS asks people about their own behaviour and health. That's both its scale advantage and its main weakness. People under-report things they're embarrassed about and over-report things they think the interviewer wants to hear. A model trained on self-report will mirror that.
  • Subgroup fairness was not audited. Performance was reported in aggregate, not per age group, sex, or ethnicity. A deployed screening tool needs that breakdown before any clinical use. If recall drops 15 percentage points on one subgroup, the cost-asymmetry argument breaks down for exactly the people who need the model to be cautious.

The other thing I'd pursue: cost-sensitive learning at the loss level, rather than only at the threshold. Pushing the asymmetry into training rather than only into the decision rule would let the model learn a richer representation of "where the false negatives live", and would make the threshold less load-bearing.
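One simple form of loss-level cost sensitivity is a class-weighted log loss, sketched below. This is an assumption about what such a loss could look like, not something the project implemented; `weighted_log_loss` and the weight of 5 are hypothetical.

```python
import math

def weighted_log_loss(probs, labels, pos_weight=1.0):
    """Mean log loss where errors on positive cases are scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            total += -pos_weight * math.log(p)   # confident miss on a positive
        else:
            total += -math.log(1.0 - p)          # confident false alarm
    return total / len(labels)

# One confident miss versus one equally confident false alarm.
miss = weighted_log_loss([0.1], [1], pos_weight=5.0)
false_alarm = weighted_log_loss([0.9], [0], pos_weight=5.0)
print(miss, false_alarm)
```

With equal weights the two mistakes cost the same; with a positive-class weight of 5 the miss is penalised five times as heavily, so the asymmetry shapes the fitted model itself rather than only the cut-off applied afterwards.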