
Case Study 1 · Module 1 applied workflow · Logistic regression · ROC · AUC · calibration · thresholds

Diabetes Risk Prediction Workflow

This case study applies the Module 1 foundation workflow to a diabetes prediction problem. The aim is not only to fit a logistic regression model, but to interpret the model like a biostatistician: define the prediction question, check predictor timing, evaluate discrimination and calibration, study threshold behaviour and report limitations honestly.

Outcome: Diabetes status
Model: Logistic regression
Task: Binary prediction
Main result: AUC 0.838


Executive summary

What does this case study show?

The model shows useful discrimination for diabetes status, with test-set AUC = 0.838 and Brier score = 0.149. At the default threshold of 0.50, the model achieves accuracy = 0.779, sensitivity = 0.623 and specificity = 0.861.

The model is more specific than sensitive at threshold 0.50. It correctly identifies most diabetes-negative patients (87 of 101), but misses 20 of the 53 diabetes-positive patients in the test set. This matters because a model intended for screening may require higher sensitivity.

The key lesson is that performance is not one number. AUC, calibration, sensitivity, specificity, predictive values, thresholds, leakage risk and clinical consequences all need to be interpreted together.

Case-study conclusion

This is a useful teaching model, not a deployable clinical tool. It demonstrates a responsible prediction workflow, but external validation, fuller calibration assessment and clinical usefulness analysis would be needed before real-world use.

Results first

Model performance summary

Metric	Value	What it measures	Interpretation
Accuracy	0.779	Overall proportion correctly classified.	Useful as a broad summary, but not enough because the outcome is imbalanced.
Sensitivity	0.623	33 of 53 positives detected.	The model misses 20 diabetes-positive patients at threshold 0.50.
Specificity	0.861	87 of 101 negatives correctly identified.	The model is stronger at identifying diabetes-negative patients.
PPV	0.702	Precision among predicted positives.	When the model predicts positive, about 70% are truly positive in this test set.
NPV	0.813	Reassurance among predicted negatives.	When the model predicts negative, about 81% are truly negative in this test set.
AUC	0.838	Useful discrimination.	The model has useful ranking ability, but AUC alone does not prove clinical usefulness.
Brier score	0.149	Prediction error for probabilities.	Lower is better. It summarises how close predicted probabilities are to observed outcomes.
Threshold	0.50	Default classification cut-off.	The default threshold is not automatically the best clinical threshold.

Main interpretation

The model has useful discrimination, but at threshold 0.50 it is more specific than sensitive. It produces few false positives (14), but it misses 20 diabetes-positive patients. Whether this is acceptable depends on the clinical purpose of the model.

Clinical question

Can routinely measured clinical variables predict diabetes status?

The outcome is diabetes status, coded as positive or negative. The model uses routinely measured patient characteristics to estimate the probability that a patient is diabetes-positive.

Prediction question: using available clinical variables, can we estimate a patient’s probability of diabetes?

This is a binary prediction problem. However, the goal is not simply to output “positive” or “negative”. In medical machine learning, predicted probabilities, thresholds, false positives, false negatives and clinical actions all matter.

Dataset and predictors

Shared diabetes prediction dataset

This case study uses the same diabetes prediction setting that appears across the early course modules. Keeping the dataset consistent helps learners focus on how the modelling ideas develop: supervised learning, logistic regression, validation, AUC, calibration, thresholds, leakage and reporting.

Variable	Clinical meaning	Timing check	Interpretation caution
pregnant	Number of pregnancies	Should be available before prediction	Context-dependent; interpret carefully across populations.
glucose	Plasma glucose concentration	Should be available before prediction	Strong predictive marker, but not automatically causal evidence.
pressure	Diastolic blood pressure	Should be available before prediction	May contribute weakly depending on the population.
triceps	Triceps skinfold thickness	Should be available before prediction	Measurement quality can vary.
insulin	Serum insulin	Should be available before prediction	May be missing or clinically unavailable in some settings.
mass	Body mass index	Should be available before prediction	Predictive, but interpretation should avoid causal overclaiming.
pedigree	Diabetes pedigree function	Should be available before prediction	Represents family-history-related risk information.
age	Age in years	Available before prediction	Usually safe from leakage, but may interact with other risk factors.

Figure: Diabetes outcome distribution. The dataset contains more diabetes-negative than diabetes-positive patients. This class imbalance means accuracy alone should not be used as the only performance measure.

Workflow

From prediction question to clinical interpretation

Define the prediction question

State exactly what the model should predict, for whom, and at what point in the clinical pathway.

Output: Can routinely measured clinical features predict diabetes status?

Check the dataset

Inspect the outcome, predictors, class balance, missingness and whether each variable has a plausible clinical meaning.

Output: Outcome: diabetes status. Predictors include glucose, BMI/mass, age, pressure, insulin and pedigree.

Protect against leakage

Ask whether every predictor would be available at the true prediction time.

Output: No predictor should contain future diagnosis, treatment or follow-up information.

Split data

Fit the model on training data and evaluate it on held-out test data.

Output: Training/test separation gives a more honest estimate of generalisation.

Fit the model

Use logistic regression because the outcome is binary and the aim is risk prediction.

Output: The model estimates predicted probability of diabetes.

Evaluate discrimination

Use ROC and AUC to assess whether predicted risks rank positives above negatives.

Output: AUC = 0.838, suggesting useful discrimination.

Evaluate threshold behaviour

Convert predicted probabilities into classes and examine sensitivity, specificity, PPV and NPV.

Output: At threshold 0.50, sensitivity = 0.623 and specificity = 0.861.

Report limitations

Explain what the model can and cannot support. Avoid claiming clinical readiness from one internal case study.

Output: External validation, fuller calibration assessment and clinical usefulness analysis are still needed.
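The "Split data" step can be sketched in a few lines of Python. This is a minimal illustration, not the course's own script: the function name `stratified_split`, the 20% test fraction, the seed and the toy labels are all assumptions, and stratifying by outcome is one reasonable choice that the workflow above does not prescribe.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with a similar class mix in each part."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # randomise within each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])          # held-out rows for this class
        train_idx.extend(idx[n_test:])         # remaining rows go to training
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 1 = diabetes-positive, 0 = diabetes-negative.
labels = [1] * 35 + [0] * 65
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=1)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_positive_share, 2))
```

Splitting within each outcome class keeps the proportion of positives similar in the training and test parts, which matters when the outcome is imbalanced. The model is then fitted only on the training indices and evaluated only on the test indices.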

Confusion matrix

Classification results at threshold 0.50

The model produces predicted probabilities. To create predicted classes, we apply a threshold. At threshold 0.50, patients with predicted risk at or above 0.50 are classified as diabetes-positive.

	Predicted negative	Predicted positive	Clinical meaning
Observed negative	87	14	87 true negatives and 14 false positives.
Observed positive	20	33	33 true positives and 20 false negatives.

Why false negatives matter

False negatives are diabetes-positive patients predicted as negative. In this case study, there are 20 false negatives at threshold 0.50. For screening, this may be concerning because these patients may not receive timely follow-up.

Why false positives matter

False positives are diabetes-negative patients predicted as positive. In this case study, there are 14 false positives at threshold 0.50. These may cause extra testing, cost or anxiety.
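The threshold-based metrics reported in this case study follow directly from the four cells of the confusion matrix. The short Python sketch below recomputes them from the counts in the table above; the variable names are illustrative, and the "all negative" reference is an added comparison point rather than a result reported by the case study.

```python
# Counts from the confusion matrix at threshold 0.50 (test set).
tn, fp = 87, 14   # observed negative: true negatives, false positives
fn, tp = 20, 33   # observed positive: false negatives, true positives
n = tn + fp + fn + tp

metrics = {
    "accuracy": (tp + tn) / n,
    "sensitivity": tp / (tp + fn),   # positives detected
    "specificity": tn / (tn + fp),   # negatives correctly identified
    "ppv": tp / (tp + fp),           # precision among predicted positives
    "npv": tn / (tn + fn),           # correct among predicted negatives
}

# Reference point: accuracy from labelling every patient negative.
metrics["all_negative_accuracy"] = (tn + fp) / n

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Labelling every patient negative would already give an accuracy of about 0.656 in this test set while detecting no diabetes-positive patients at all, which is why accuracy of 0.779 cannot be read on its own.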

Discrimination

ROC curve and AUC

The ROC curve shows the trade-off between sensitivity and specificity across many thresholds. The AUC summarises discrimination: how well the model ranks diabetes-positive patients above diabetes-negative patients.

Figure: ROC curve for diabetes prediction model. The test-set AUC is 0.838, suggesting useful discrimination in this educational example. AUC does not assess calibration and does not choose the clinical threshold.

Interpretation of AUC = 0.838

AUC can be interpreted as a ranking measure. If we randomly choose one diabetes-positive patient and one diabetes-negative patient from the test set, an AUC of 0.838 means the model assigns the higher predicted risk to the diabetes-positive patient about 84% of the time. However, AUC does not tell us whether predicted probabilities are numerically accurate or whether a chosen threshold is clinically appropriate.
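The ranking interpretation can be made concrete by counting pairs. The sketch below is a plain-Python illustration on made-up scores, not the case-study data; the function name `auc_by_pairs` is hypothetical. It compares every positive with every negative and counts how often the positive receives the higher predicted risk, with ties counted as half.

```python
def auc_by_pairs(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # positive ranked above negative
            elif p == q:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical predicted risks for three positives and three negatives.
scores = [0.90, 0.70, 0.40, 0.60, 0.30, 0.20]
labels = [1, 1, 1, 0, 0, 0]

auc = auc_by_pairs(scores, labels)
print(round(auc, 3))
```

In this toy example, eight of the nine positive-negative pairs are ordered correctly. Note that the calculation uses only the ordering of the scores: rescaling all predicted risks would leave the AUC unchanged, which is why AUC says nothing about calibration.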

Calibration

Are predicted risks close to observed risks?

Calibration asks whether predicted probabilities agree with observed outcome frequencies. A model can have good AUC but still give poorly calibrated risk estimates. This matters because clinical decisions often use the predicted probability itself, not only the rank order.

Figure: Calibration plot for diabetes prediction model. The calibration plot broadly follows the diagonal but is unstable in some risk groups, which is expected with a test set of only 154 patients. The Brier score is 0.149.
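A calibration plot is usually built by grouping patients into risk bins and comparing the mean predicted risk with the observed outcome rate in each bin. The sketch below shows that grouping in plain Python. The probabilities and outcomes are invented for illustration, and the choice of five equal-width bins and the name `calibration_table` are assumptions, not the settings used for the figure above.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Return (mean predicted risk, observed rate, n) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[k].append((p, y))

    rows = []
    for group in bins:
        if group:
            mean_pred = sum(p for p, _ in group) / len(group)
            obs_rate = sum(y for _, y in group) / len(group)
            rows.append((round(mean_pred, 3), round(obs_rate, 3), len(group)))
    return rows

# Hypothetical predicted risks and observed outcomes (1 = positive).
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.45, 0.55, 0.65, 0.70, 0.85, 0.90, 0.95]
outcomes = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

rows = calibration_table(probs, outcomes)
for mean_pred, obs_rate, n in rows:
    print(mean_pred, obs_rate, n)
```

With only two or three patients per bin, a single outcome moves the observed rate by a third or a half. The same effect, on a smaller scale, explains why a calibration curve from a modest test set can look jagged even when the model is reasonably calibrated.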

Interpretation of Brier score = 0.149

The Brier score is the mean squared difference between predicted probabilities and observed outcomes (coded 0 or 1). It ranges from 0 (perfect) to 1, so lower is better. Confident correct predictions contribute almost nothing, while confident wrong predictions are penalised heavily. The value is most useful for comparing models on the same data, and it should be reported alongside discrimination and threshold-based metrics.
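To give the Brier score a reference point, the sketch below computes it in plain Python and evaluates an uninformative prediction that assigns every test patient the test-set prevalence (53 positives among 154 patients, taken from the confusion matrix). The reference value is an added comparison, not a figure reported by the case study, and `brier_score` is an illustrative helper name.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Test-set composition from the confusion matrix.
n_pos, n_neg = 53, 101
n = n_pos + n_neg
prevalence = n_pos / n

# Uninformative reference: predict the prevalence for every patient.
reference = brier_score([prevalence] * n, [1] * n_pos + [0] * n_neg)
print(round(prevalence, 3), round(reference, 3))
```

The reference works out to roughly 0.226, so the reported Brier score of 0.149 is an improvement over predicting the same risk for everyone. It still combines discrimination and calibration into one number, so it does not replace the calibration plot.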

Interactive threshold lab

Changing the threshold changes clinical behaviour

A threshold converts predicted probabilities into predicted classes. The default threshold of 0.50 is not automatically best for medical decision-making. The values below show how sensitivity, specificity and predictive values shift as the threshold moves.

Threshold: 0.50

Accuracy 0.779 · Sensitivity 0.623 · Specificity 0.861 · PPV 0.702 · NPV 0.813

True positives 33 · False positives 14 · True negatives 87 · False negatives 20

Threshold interpretation

Conservative; fewer false positives but misses more diabetes-positive patients.

Clinical behaviour

This threshold is relatively specific. It is better at correctly identifying diabetes-negative patients than at detecting every diabetes-positive patient.

Threshold	Accuracy	Sensitivity	Specificity	PPV	NPV	Interpretation
0.20	0.695	0.925	0.574	0.533	0.936	Very sensitive; useful for screening but creates more false positives.
0.30	0.753	0.774	0.743	0.612	0.862	More balanced; detects more positives than threshold 0.50.
0.40	0.766	0.698	0.802	0.649	0.835	Moderately balanced; still more sensitive than threshold 0.50.
0.50	0.779	0.623	0.861	0.702	0.813	Conservative; fewer false positives but misses more diabetes-positive patients.
0.60	0.766	0.509	0.901	0.730	0.778	Highly specific; may be useful when false positives are costly.
0.70	0.747	0.415	0.921	0.733	0.750	Very conservative; many diabetes-positive patients may be missed.
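The mechanics behind a threshold table like this one are simple: classify each patient as positive when the predicted risk is at or above the threshold, tally the four confusion-matrix cells, and repeat for each threshold. The sketch below does this in plain Python on ten invented patients, so its numbers are illustrative and do not reproduce the table; `classify_counts` is a hypothetical helper name.

```python
def classify_counts(probs, outcomes, threshold):
    """Tally (tp, fp, tn, fn) when risk >= threshold is called positive."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, outcomes):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Hypothetical predicted risks and observed outcomes (1 = positive).
probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
outcomes = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

results = {}
for threshold in (0.2, 0.5, 0.7):
    tp, fp, tn, fn = classify_counts(probs, outcomes, threshold)
    results[threshold] = {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
    print(threshold, results[threshold])
```

Even on toy data the pattern matches the table: lowering the threshold raises sensitivity at the cost of specificity, and raising it does the reverse. The model itself is unchanged; only the rule for acting on its predicted risks moves.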

Interpretation checklist

How should a biostatistician interpret this model?

Prediction question

The model estimates diabetes risk using routinely measured variables. It does not prove why diabetes occurs.

Target population

The model should only be applied to patients similar to the population represented in the dataset.

Prediction time

Every predictor must be available before the model is used. Future information would create leakage.

Discrimination

AUC = 0.838 suggests the model separates positives from negatives reasonably well.

Calibration

The calibration plot should be reviewed because good ranking does not guarantee accurate probabilities.

Threshold

Threshold 0.50 gives higher specificity than sensitivity. It may be too conservative for screening.

Clinical usefulness

A model is useful only if its predictions can support a meaningful action.

Limitations

External validation and fuller clinical evaluation are needed before real-world use.

Limitations

What should we be careful about?

Accuracy is not enough

Accuracy of 0.779 hides the balance between false negatives and false positives.

Sensitivity may be too low for screening

At threshold 0.50, the model misses 20 diabetes-positive patients.

AUC is not calibration

AUC tells us about ranking, not whether probabilities are numerically accurate.

Thresholds require clinical judgement

The best threshold depends on the consequences of false positives and false negatives.

External validation is needed

Performance in one teaching dataset does not guarantee performance in another population.

Prediction is not causation

Model coefficients and predictors should not be interpreted as causal effects.

Data quality matters

Clinical variables can be noisy, missing, differently measured or unavailable in real workflows.

Clinical action must be defined

A risk score is useful only if it supports an action such as follow-up, testing or monitoring.

Report-style conclusion

How to write this case study in a report

A logistic regression model was fitted to predict diabetes status using routinely measured clinical characteristics. The model achieved AUC = 0.838 and Brier score = 0.149 in the test data, suggesting useful discrimination and moderate probability accuracy in this educational example. At the default threshold of 0.50, accuracy was 0.779, sensitivity was 0.623 and specificity was 0.861. The model was therefore more specific than sensitive, correctly identifying most diabetes-negative patients but missing 20 diabetes-positive patients. Threshold analysis showed that lower thresholds increased sensitivity but produced more false positives, while higher thresholds increased specificity but missed more positives. This model should be interpreted as a teaching example of a prediction workflow, not as a clinically deployable tool. External validation, fuller calibration assessment and evaluation of clinical usefulness would be needed before real-world use.

What not to write

Do not write: “The model is good because accuracy is 0.779.” Accuracy alone is incomplete. A strong case-study interpretation must discuss sensitivity, specificity, AUC, calibration, threshold choice, false positives, false negatives, leakage risk and limitations.

Case study conclusion

A useful model still needs careful interpretation.

The diabetes risk model shows useful discrimination with AUC = 0.838 and Brier score = 0.149. At threshold 0.50, it is more specific than sensitive. This makes it relatively conservative. For screening, a lower threshold may be more appropriate, but that decision must be justified by the clinical context.
