Biostatistics | Advanced | Resource guide

ROC curves, sensitivity, specificity and AUC

An advanced guide to diagnostic test evaluation and prediction model performance, covering sensitivity, specificity, thresholds, ROC curves, AUC and limitations.

Structure

Problem, intuition, method, working, limitations and discussion.

Best for

Students preparing for coursework, analysis, interpretation or revision.

Use with

Learning Hub lessons, tutoring sessions or dissertation planning.

Resource guide

Problem

Students often evaluate diagnostic tests or prediction models using only accuracy. This can be misleading, especially when the outcome is imbalanced or when false positives and false negatives have different consequences. ROC curves, sensitivity, specificity and AUC help evaluate how well a test or model separates people with and without an outcome across possible thresholds. However, these measures are frequently reported without any explanation of their clinical or practical meaning.

Accuracy can be misleading when one outcome category is much more common.
Students often ignore the trade-off between sensitivity and specificity.
AUC is reported without interpretation.
A high AUC does not automatically mean the model is clinically useful.
Threshold choice is treated as automatic rather than context-dependent.
False positives and false negatives are not discussed.
Discrimination is confused with calibration.
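
The first point above can be made concrete with a minimal Python sketch. The numbers are invented for illustration (1,000 people, 5% prevalence) and do not come from any real study. The "test" here simply labels everyone negative, yet it still achieves high accuracy.

```python
# Hypothetical data: 1,000 people, 50 of whom have the outcome (5% prevalence).
y_true = [1] * 50 + [0] * 950

# A useless "test" that classifies everyone as negative.
y_pred = [0] * 1000

# Count the four cells of the 2x2 table.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)   # proportion of true cases detected
specificity = tn / (tn + fp)   # proportion of non-cases correctly ruled out

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
# prints: accuracy=0.95 sensitivity=0.00 specificity=1.00
```

Accuracy of 95% looks impressive, but sensitivity is zero: the test misses every case. Reporting sensitivity and specificity separately exposes this.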

Intuition

A diagnostic test or prediction model often produces a score, probability or measurement. To classify someone as positive or negative, a threshold is needed. Sensitivity measures how well the test detects true cases. Specificity measures how well the test correctly identifies non-cases. The ROC curve shows how sensitivity and specificity change as the threshold moves. AUC summarises discrimination: how well the model ranks cases above non-cases.

Sensitivity is the proportion of true cases correctly identified.
Specificity is the proportion of true non-cases correctly identified.
Lowering the threshold usually increases sensitivity but reduces specificity.
Raising the threshold usually increases specificity but reduces sensitivity.
The ROC curve plots sensitivity (the true positive rate) against 1 minus specificity (the false positive rate).
AUC measures overall discrimination across thresholds.
AUC does not tell you whether predicted probabilities are well calibrated.
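
The ideas above can be sketched in a few lines of Python. The eight scores and outcomes below are invented for illustration, and the function names `sens_spec` and `auc_by_ranking` are hypothetical helpers, not part of any library. The sketch assumes a score at or above the threshold counts as positive, and it computes AUC directly from its ranking interpretation: the proportion of case and non-case pairs in which the case has the higher score, with ties counted as half.

```python
def sens_spec(y_true, scores, threshold):
    """Classify score >= threshold as positive; return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc_by_ranking(y_true, scores):
    """AUC as the share of (case, non-case) pairs where the case scores higher."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    non_cases = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(cases) * len(non_cases))


# Hypothetical outcomes (1 = case) and model scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(y_true, scores, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f} specificity={spec:.2f}")

print(f"AUC={auc_by_ranking(y_true, scores):.3f}")
```

With these made-up data, raising the threshold from 0.3 to 0.7 moves specificity from 0.40 to 1.00 while sensitivity falls from 1.00 to about 0.67. The AUC is 14/15 (about 0.933) because 14 of the 15 case and non-case pairs are ranked correctly.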

Method

Evaluation should begin by defining the outcome, the prediction score and the decision context. Sensitivity and specificity should be calculated at clinically meaningful thresholds. ROC curves can then show the trade-off across thresholds. AUC can summarise discrimination, but it should not be the only performance measure. For prediction models, calibration and clinical usefulness should also be considered.

Step 1: Define the binary outcome clearly.
Step 2: Define what counts as a positive test or predicted event.
Step 3: Decide whether the test gives a binary result, score or probability.
Step 4: Calculate sensitivity and specificity at relevant thresholds.
Step 5: Plot the ROC curve if the test gives a score or probability, so that the threshold can vary.
Step 6: Report AUC with a confidence interval where possible.
Step 7: Consider positive and negative predictive values if prevalence matters.
Step 8: Assess calibration if predicted probabilities are used.
Step 9: Discuss consequences of false positives and false negatives.
Step 10: Choose thresholds based on context, not only statistical optimisation.
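
Steps 5 and 6 can be sketched as follows. This is one possible approach under stated assumptions, not the only valid method: the ROC points are computed at every distinct score, the AUC uses the trapezoidal rule, and the confidence interval uses a simple percentile bootstrap (other interval methods exist). The data and the function names `roc_points`, `trapezoid_auc` and `bootstrap_auc_ci` are hypothetical.

```python
import random


def roc_points(y_true, scores):
    """Return (false positive rate, sensitivity) pairs from (0, 0) to (1, 1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from strictest to most lenient.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def bootstrap_auc_ci(y_true, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (resampling individuals)."""
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack cases or non-cases
            ss = [scores[i] for i in idx]
            aucs.append(trapezoid_auc(roc_points(ys, ss)))
    aucs.sort()
    lower = aucs[round((alpha / 2) * (n_boot - 1))]
    upper = aucs[round((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Hypothetical outcomes (1 = event) and predicted scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]

points = roc_points(y_true, scores)
auc = trapezoid_auc(points)
lower, upper = bootstrap_auc_ci(y_true, scores)
print(f"AUC={auc:.3f}, 95% bootstrap interval {lower:.3f} to {upper:.3f}")
```

With only eight observations the interval will be very wide, which is itself a useful reminder that an AUC without a measure of uncertainty says little. In real work the same steps would normally be run on validation data with an established statistical package.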

Working

Suppose a model predicts whether patients will be readmitted within 30 days. If the threshold is low, more patients will be classified as high risk. This may catch more true readmissions, increasing sensitivity, but it may also label many patients high risk who would not be readmitted, reducing specificity. If the threshold is high, fewer false positives occur, but more true readmissions may be missed.

True positive: patient is readmitted and model classifies high risk.
False positive: patient is not readmitted but model classifies high risk.
True negative: patient is not readmitted and model classifies low risk.
False negative: patient is readmitted but model classifies low risk.
Sensitivity = true positives divided by all actual positives.
Specificity = true negatives divided by all actual negatives.
AUC near 0.5 suggests discrimination no better than chance.
AUC near 1 suggests strong discrimination, but clinical usefulness still needs evaluation.
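
The readmission example can be worked through with numbers. The counts below are invented for illustration (1,000 discharged patients, 150 of whom are readmitted) and the function name `summarise` is hypothetical. The sketch compares the same model at a low and a high threshold.

```python
def summarise(tp, fp, tn, fn):
    """Summarise a 2x2 table for the readmission example."""
    return {
        "sensitivity": tp / (tp + fn),   # readmissions correctly flagged
        "specificity": tn / (tn + fp),   # non-readmissions correctly left unflagged
        "flagged_high_risk": tp + fp,    # patients who would receive follow-up
        "missed_readmissions": fn,       # readmitted patients labelled low risk
    }


# Hypothetical counts at a low threshold: many patients flagged as high risk.
low = summarise(tp=120, fp=340, tn=510, fn=30)

# Hypothetical counts at a high threshold: few patients flagged as high risk.
high = summarise(tp=75, fp=85, tn=765, fn=75)

for label, result in (("low threshold", low), ("high threshold", high)):
    print(
        f"{label}: sensitivity={result['sensitivity']:.2f} "
        f"specificity={result['specificity']:.2f} "
        f"flagged={result['flagged_high_risk']} "
        f"missed={result['missed_readmissions']}"
    )
```

In this made-up scenario the low threshold gives sensitivity 0.80 and specificity 0.60, flagging 460 patients and missing 30 readmissions. The high threshold gives sensitivity 0.50 and specificity 0.90, flagging 160 patients and missing 75. Which is preferable depends on the cost of follow-up and the harm of a missed readmission, not on the statistics alone.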

Limitations

ROC curves and AUC are useful but incomplete. AUC can look good even when a model is poorly calibrated. AUC also averages performance across all thresholds, including thresholds that may be clinically irrelevant. In imbalanced datasets, precision, predictive values and calibration may be equally or more important. Performance should ideally be evaluated on validation data, not only the development dataset.

AUC does not assess calibration.
AUC can hide poor performance at clinically relevant thresholds.
AUC may be insensitive to improvements that matter in practice.
Sensitivity and specificity do not depend directly on prevalence, although they can vary with case mix; predictive values do depend on prevalence.
High discrimination does not guarantee clinical usefulness.
Performance can be optimistic if assessed only on training data.
Thresholds should reflect consequences and resources.
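
The gap between discrimination and calibration can be illustrated with a small sketch. The predicted probabilities are invented, and `auc_by_ranking` is a hypothetical helper. Squaring every probability keeps the ranking of patients unchanged, so the AUC stays the same, but the predicted risks no longer match the observed event rate. Comparing the mean predicted risk with the observed rate is only a crude calibration check, used here for brevity.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of (case, non-case) pairs where the case scores higher."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    non_cases = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for c in cases:
        for n in non_cases:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


# Hypothetical outcomes and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.1, 0.2, 0.3, 0.45, 0.5, 0.6, 0.8]

# Squaring is strictly increasing on [0, 1]: ranks are kept, risks shrink.
distorted = [p ** 2 for p in probs]

observed_rate = sum(y_true) / len(y_true)
mean_original = sum(probs) / len(probs)
mean_distorted = sum(distorted) / len(distorted)

print(f"observed event rate:          {observed_rate:.3f}")
print(f"mean predicted (original):    {mean_original:.3f}  AUC={auc_by_ranking(y_true, probs):.3f}")
print(f"mean predicted (distorted):   {mean_distorted:.3f}  AUC={auc_by_ranking(y_true, distorted):.3f}")
```

Both sets of predictions have the same AUC (about 0.933), but the distorted version predicts an average risk of roughly 0.20 when 0.375 of patients actually had the event. AUC alone cannot detect this kind of systematic underestimation.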

Discussion

A strong report should explain what the test or model is for. Screening contexts often prioritise sensitivity because missing cases is costly. Confirmatory testing may prioritise specificity because false positives are costly. Prediction models may require discrimination, calibration and decision-curve thinking. Students should avoid reporting AUC alone as if it proves the model is useful.

Explain the decision context.
Report sensitivity and specificity at meaningful thresholds.
Report AUC as a discrimination measure, not a complete evaluation.
Discuss false positives and false negatives.
Mention prevalence when interpreting predictive values.
Assess calibration if probabilities are used.
Avoid claiming clinical usefulness from AUC alone.
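
One way to make "decision-curve thinking" concrete is the net benefit measure, sketched below. This assumes the usual formulation in which net benefit equals the true positive proportion minus the false positive proportion weighted by the odds of the risk threshold. The counts are invented for illustration (1,000 patients, 150 readmitted, with the model flagging patients whose predicted risk is at least 10%) and the function name `net_benefit` is hypothetical.

```python
def net_benefit(tp, fp, n, threshold_probability):
    """Net benefit = TP/n - (FP/n) * pt / (1 - pt), where pt is the risk threshold."""
    odds = threshold_probability / (1 - threshold_probability)
    return tp / n - (fp / n) * odds


n = 1000            # hypothetical cohort size
n_readmitted = 150  # hypothetical number of true readmissions
pt = 0.10           # risk threshold at which intervention is considered worthwhile

# Strategy 1: intervene for patients the model flags (hypothetical counts).
nb_model = net_benefit(tp=120, fp=340, n=n, threshold_probability=pt)

# Strategy 2: intervene for everyone (all cases are TP, all non-cases are FP).
nb_treat_all = net_benefit(
    tp=n_readmitted, fp=n - n_readmitted, n=n, threshold_probability=pt
)

# Strategy 3: intervene for no one (no true or false positives).
nb_treat_none = 0.0

print(f"model: {nb_model:.4f}  treat all: {nb_treat_all:.4f}  treat none: {nb_treat_none:.4f}")
```

With these made-up counts the model's net benefit (about 0.082) exceeds both treating everyone (about 0.056) and treating no one (0), so using the model would be preferable at this threshold. A different threshold, reflecting different consequences of false positives and false negatives, could change that conclusion.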

Practical checklist

Before you apply this topic

Have you defined the binary outcome?
Have you defined the test, score or prediction model?
Have you identified the decision threshold?
Have you calculated sensitivity?
Have you calculated specificity?
Have you considered false positives and false negatives?
Have you plotted or interpreted the ROC curve?
Have you reported AUC carefully?
Have you considered predictive values if prevalence matters?
Have you assessed calibration for predicted probabilities?
Have you used validation data where possible?
Have you explained clinical or practical usefulness?

Common mistakes

What to avoid

Using accuracy alone for imbalanced outcomes.
Reporting AUC without explaining what it means.
Saying high AUC proves the model is clinically useful.
Ignoring threshold choice.
Ignoring false positives and false negatives.
Confusing sensitivity with positive predictive value.
Confusing specificity with negative predictive value.
Confusing discrimination with calibration.
Evaluating performance only on training data.
Ignoring prevalence when interpreting predictive values.
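
The mistakes above about predictive values are easier to avoid after seeing how strongly they depend on prevalence. The sketch below applies Bayes' theorem to a hypothetical test with 90% sensitivity and 90% specificity at three assumed prevalences. The function name `predictive_values` and all figures are illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence               # true positive proportion
    fp = (1 - specificity) * (1 - prevalence)   # false positive proportion
    tn = specificity * (1 - prevalence)         # true negative proportion
    fn = (1 - sensitivity) * prevalence         # false negative proportion
    ppv = tp / (tp + fp)  # probability of disease given a positive result
    npv = tn / (tn + fn)  # probability of no disease given a negative result
    return ppv, npv


# Same hypothetical test, three assumed prevalences.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}: PPV={ppv:.3f} NPV={npv:.3f}")
```

Sensitivity and specificity are held fixed, yet the positive predictive value rises from about 0.08 at 1% prevalence to 0.50 at 10% and 0.90 at 50%. A sensitivity of 90% therefore does not mean that 90% of positive results are true cases.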

How this connects to learning

Use the guide as a bridge between theory and application.

A resource guide should not replace a full course or live teaching session. Instead, it helps you organise your thinking. Use it to identify what you understand, what feels unclear, and what questions you should ask before applying a method to real data.

Before a lesson

Read the intuition and problem sections to prepare.

During analysis

Use the method and checklist to guide decisions.

When writing

Use limitations and discussion to improve interpretation.

Related guides

Continue with related topics.

Logistic regression explained for health and social science students

Sample size, power and precision explained

Understanding p-values, confidence intervals and effect sizes

How to report regression results in a dissertation

Common mistakes in dissertation data analysis
