Regression | Intermediate | Resource guide

Linear regression assumptions and diagnostics

A detailed guide to the assumptions behind linear regression, why they matter, how students should think about diagnostics and how to report limitations clearly.

Structure

Problem, intuition, method, working, limitations and discussion.

Best for

Students preparing for coursework, analysis, interpretation or revision.

Use with

Learning Hub lessons, tutoring sessions or dissertation planning.

Resource guide

Problem

Linear regression is one of the most widely used statistical methods, but students often treat it as a simple button-clicking procedure. They fit a model, copy the coefficient table and report p-values without checking whether the model is appropriate. This is risky because regression results depend on assumptions about the form of the relationship, the spread and independence of the residuals, and on whether a few influential observations dominate the fit. If these assumptions are ignored, the estimates, confidence intervals and p-values may be misleading.

Students often report regression tables without checking assumptions.
Linearity is assumed even when the relationship is curved.
Residuals are rarely inspected.
Unequal variance can make standard errors unreliable.
Outliers and influential observations can dominate the model.
Independence is ignored in clustered or repeated-measures data.
Multicollinearity can make coefficients unstable and difficult to interpret.

Intuition

Linear regression tries to describe the average relationship between an outcome and one or more predictors using a straight-line structure. The model is useful when the average change in the outcome is approximately linear over the range being studied. The assumptions are not just technical rules; they tell us whether the model is a reasonable summary of the data and whether the uncertainty estimates can be trusted.

Linearity asks whether a straight-line pattern is reasonable.
Independence asks whether observations are separate pieces of information.
Constant variance asks whether residual spread is similar across fitted values.
Normality of residuals asks whether the errors are roughly bell-shaped, which matters mainly for small-sample inference.
Outlier checks ask whether a few observations are driving the results.
Multicollinearity asks whether predictors overlap too strongly.
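
For reference, the model behind these questions can be written compactly. The notation below ($y_i$ for the outcome, $x_{ij}$ for predictor $j$, $\varepsilon_i$ for the error, $\sigma^2$ for its variance) is chosen here for illustration and is not taken from a particular textbook.

```latex
\[
  y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
  \qquad i = 1, \dots, n
\]
\begin{align*}
  \text{Linearity:}         &\quad E(\varepsilon_i \mid x_{i1}, \dots, x_{ip}) = 0 \\
  \text{Constant variance:} &\quad \operatorname{Var}(\varepsilon_i \mid x_{i1}, \dots, x_{ip}) = \sigma^2 \\
  \text{Independence:}      &\quad \varepsilon_1, \dots, \varepsilon_n \text{ are independent} \\
  \text{Normality (small-sample inference):} &\quad \varepsilon_i \sim N(0, \sigma^2)
\end{align*}
```

Outlier and multicollinearity checks are not assumptions about the errors; they ask whether the data allow the coefficients to be estimated stably.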

Method

A good regression workflow separates model fitting from model checking. First, define the outcome and predictors. Then fit a model that matches the research question. After that, inspect diagnostics to decide whether the model is adequate. If assumptions are questionable, the solution is not always to abandon regression; sometimes transformation, robust standard errors, additional terms, sensitivity analysis or a different model may be more appropriate.

Step 1: Define the outcome variable and predictors.
Step 2: Check descriptive summaries before fitting the model.
Step 3: Use scatterplots to assess relationships with numerical predictors.
Step 4: Fit the regression model.
Step 5: Inspect residuals versus fitted values for non-linearity and unequal variance.
Step 6: Inspect residual distribution, especially for small samples.
Step 7: Check influential observations using leverage and influence diagnostics.
Step 8: Consider multicollinearity when predictors are strongly related.
Step 9: Decide whether the model is acceptable, needs modification or requires cautious reporting.
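
Steps 4 to 7 can be sketched in a few lines of Python. This is a minimal illustration for a model with a single predictor, using only the standard library; the data, the function names `fit_simple_ols` and `diagnostics`, and the variable names are all hypothetical. In practice a statistics package would normally produce these quantities.

```python
# Illustrative data only: hours studied and exam scores for eight hypothetical students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
scores = [52.0, 55.0, 61.0, 60.0, 68.0, 70.0, 75.0, 79.0]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def diagnostics(x, y):
    """Return fitted values, residuals and leverages for a one-predictor model."""
    intercept, slope = fit_simple_ols(x, y)
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    # Leverage in simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return fitted, residuals, leverage


intercept, slope = fit_simple_ols(hours, scores)
fitted, residuals, leverage = diagnostics(hours, scores)

# Plotting residuals against fitted values is the usual next step;
# here the values are simply printed for inspection.
for f, r, h in zip(fitted, residuals, leverage):
    print(f"fitted={f:6.2f}  residual={r:6.2f}  leverage={h:.3f}")
```

Residuals that change sign systematically along the fitted values hint at curvature, and residuals whose size grows with the fitted values hint at unequal variance. Leverages far above the average (here 2 divided by the number of observations) mark observations with unusual predictor values.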

Working

Suppose a student models exam score using study hours, attendance and previous grade. The coefficient for study hours estimates the expected difference in exam score for one additional hour of study, holding attendance and previous grade constant. Before trusting this interpretation, the student should check whether the relationship is approximately linear, whether residuals show patterns, whether any students have extreme influence and whether predictors are too strongly correlated.

A residual plot with a curved pattern suggests non-linearity.
A funnel-shaped residual plot suggests unequal variance.
A few very large residuals may indicate outliers or unusual observations.
A high-leverage observation may strongly affect the fitted line.
Strong correlation between predictors can make individual coefficients unstable.
If repeated measurements are present, observations are not independent and ordinary linear regression may not be appropriate.
If the outcome is binary, logistic regression is usually more appropriate than linear regression.
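
The point about correlated predictors can be made concrete with the variance inflation factor (VIF). The sketch below covers only the special case of exactly two predictors, where the VIF reduces to 1 / (1 - r^2) with r the correlation between them; with three or more predictors, each VIF needs a regression of one predictor on the others. The data and the function names `pearson_r` and `vif_two_predictors` are hypothetical.

```python
import math

# Hypothetical values for six students (illustration only).
study_hours = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
attendance = [60.0, 70.0, 72.0, 85.0, 88.0, 95.0]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


r = pearson_r(study_hours, attendance)
vif = vif_two_predictors(study_hours, attendance)
print(f"correlation = {r:.3f}, VIF = {vif:.1f}")
```

A VIF of 1 means no overlap between the predictors. Commonly quoted cut-offs such as 5 or 10 are rules of thumb rather than formal tests, so a large value is a prompt to think about interpretation, not an automatic failure.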

Limitations

Regression diagnostics are judgement tools, not automatic pass-or-fail rules. Real datasets rarely satisfy every assumption perfectly. The aim is to decide whether violations are serious enough to affect interpretation. Students should avoid deleting observations only to improve diagnostics, transforming variables without explanation or hiding assumption problems from the report.

Diagnostic plots require judgement.
Small samples make assumption checking harder because diagnostic plots are noisy.
In large samples, formal tests can flag minor deviations that have little practical effect.
Deleting outliers without justification can bias the analysis.
Transformations can make interpretation less straightforward.
Multicollinearity does not always damage prediction but can harm interpretation.
A model can satisfy diagnostics and still answer the wrong research question.
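
Instead of deleting an awkward observation, a simple sensitivity analysis fits the model with and without it and reports both results. The sketch below assumes a one-predictor model and uses made-up data in which the last student has an unusually large number of study hours; the function name `ols_slope` is hypothetical.

```python
def ols_slope(x, y):
    """Least-squares slope for a model with one predictor and an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical data: the last student studied far more hours than the rest
# (high leverage) and scored below the trend of the others.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]
scores = [50.0, 54.0, 57.0, 62.0, 65.0, 69.0, 60.0]

slope_all = ols_slope(hours, scores)
slope_without_last = ols_slope(hours[:-1], scores[:-1])

print(f"slope using all students:      {slope_all:.2f}")
print(f"slope without the last student: {slope_without_last:.2f}")
```

If the two slopes tell different stories, the report should say so and explain which observation is responsible, rather than presenting only the more convenient fit.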

Discussion

A strong regression report should not only present coefficients. It should explain the purpose of the model, the variables included, the interpretation of key coefficients, the uncertainty around estimates and whether assumptions were checked. If diagnostics show possible problems, the report should discuss them honestly and, where possible, include sensitivity analysis.

State the outcome and predictors clearly.
Explain the interpretation of the main coefficient in context.
Report confidence intervals, not only p-values.
Mention which assumptions were checked.
Describe any important diagnostic concerns.
Justify any transformations or exclusions.
Avoid overclaiming causality from observational regression.
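
As a reminder of what a reported confidence interval contains, the usual interval for a single coefficient is shown below. The notation is assumed for illustration: $n$ observations, $p$ predictors plus an intercept, and $\mathrm{SE}(\hat\beta_j)$ for the estimated standard error. The formula relies on the model assumptions holding at least approximately.

```latex
\[
  \hat\beta_j \;\pm\; t_{\,n-p-1,\;1-\alpha/2} \times \mathrm{SE}(\hat\beta_j)
\]
```

With $\alpha = 0.05$ this gives a 95% interval. Reporting it alongside the coefficient shows both the size of the estimated difference and how precisely it has been estimated.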

Practical checklist

Before you apply this method

Is the outcome numerical and suitable for linear regression?
Have you defined the main predictor clearly?
Have you included covariates for a reason?
Have you checked scatterplots for numerical predictors?
Have you checked residuals versus fitted values?
Have you considered constant variance?
Have you checked whether residuals are extremely non-normal?
Have you checked for influential observations?
Have you considered multicollinearity?
Have you avoided deleting outliers without justification?
Have you reported coefficients with confidence intervals?
Have you discussed assumptions and limitations?

Common mistakes

What to avoid

Reporting regression results without diagnostic checks.
Assuming linearity without plotting the data.
Ignoring heteroscedasticity.
Treating residual normality as the only assumption.
Deleting influential observations only because they are inconvenient.
Including many predictors without a clear reason.
Ignoring multicollinearity.
Using linear regression for a binary outcome.
Interpreting adjusted coefficients as automatically causal.
Reporting p-values without explaining coefficient meaning.

How this connects to learning

Use the guide as a bridge between theory and application.

A resource guide should not replace a full course or live teaching session. Instead, it helps you organise your thinking. Use it to identify what you understand, what feels unclear, and what questions you should ask before applying a method to real data.

Before a lesson

Read the intuition and problem sections to prepare.

During analysis

Use the method and checklist to guide decisions.

When writing

Use limitations and discussion to improve interpretation.

Related guides

Continue with related topics.

Choosing between correlation and regression

Understanding p-values, confidence intervals and effect sizes

How to report regression results in a dissertation

Logistic regression explained for health and social science students

Confounding, mediation and effect modification
