Data analysis | Foundation | Resource guide

How to prepare your data before analysis

A detailed guide for students learning how to clean, check, structure and document data before running statistical analysis.

Structure

Problem, intuition, method, working, limitations and discussion.

Best for

Students preparing for coursework, analysis, interpretation or revision.

Use with

Learning Hub lessons, tutoring sessions or dissertation planning.

Problem

Many statistical problems begin before any test or model is fitted. Students often import a dataset and immediately start running t-tests, fitting regression models or drawing charts without checking whether the data are ready for analysis. This can lead to incorrect results, misleading conclusions and avoidable errors. Data preparation is the stage where raw data become an analysis-ready dataset. Typical problems at this stage include the following.

Variables may be coded incorrectly.
Missing values may be hidden as 99, 999, -1 or blank cells.
Numerical variables may be stored as text.
Categorical variables may have inconsistent labels.
Outliers may be data-entry errors or genuine extreme observations.
Repeated measurements may be treated as independent rows.
The analysis may use the wrong version of the dataset.
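
Two of the problems above, hidden missing-value codes and numbers stored as text, can be handled in one parsing step. The sketch below uses only the Python standard library. The set `MISSING_CODES` and the function `parse_number` are illustrative names, and the codes themselves are assumptions: a value such as 99 can be a genuine observation for some variables, so the real codes should come from the study's codebook.

```python
# Hypothetical missing-value codes; confirm the real ones in your codebook.
MISSING_CODES = {"", "99", "999", "-1", "NA"}

def parse_number(raw):
    """Return a float, or None if the cell is blank or holds a missing code."""
    text = raw.strip()
    if text in MISSING_CODES:
        return None
    return float(text)  # raises ValueError on unexpected text such as "abc"

raw_ages = ["23", " 31", "999", "", "19"]  # made-up values read from a file as text
ages = [parse_number(value) for value in raw_ages]
print(ages)  # [23.0, 31.0, None, None, 19.0]
```

Letting unexpected text raise an error, rather than silently converting it to missing, makes unplanned values visible so that a deliberate decision can be made about them.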

Intuition

Data preparation is like checking ingredients before cooking. If the ingredients are spoiled, mislabelled or measured incorrectly, the final dish will not be reliable. In statistics, even a correct method can give a poor answer if the data are messy. Preparing data means understanding what each row and column represents, checking the quality of values and making careful decisions before analysis begins.

Rows usually represent people, samples, visits, observations or measurements.
Columns usually represent variables such as age, group, outcome or exposure.
Each variable has a type: numerical, binary, categorical, ordinal, date or text.
Missing values need to be identified, counted and handled transparently.
Cleaning decisions should be recorded so the analysis is reproducible.
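
Counting missing values per variable can be sketched as follows, assuming each row is held as a Python dictionary and missing values have already been converted to `None`. The data and the function name `count_missing` are made up for illustration. In practice a data-analysis library or statistical package would usually do this, but the logic is the same.

```python
rows = [
    {"id": "S01", "age": 21, "group": "control", "post_test": 64},
    {"id": "S02", "age": None, "group": "treatment", "post_test": 71},
    {"id": "S03", "age": 24, "group": None, "post_test": None},
    {"id": "S04", "age": None, "group": "control", "post_test": 58},
]

def count_missing(records):
    """Count None values per column across a list of row dictionaries."""
    counts = {column: 0 for column in records[0]}
    for record in records:
        for column, value in record.items():
            if value is None:
                counts[column] += 1
    return counts

missing = count_missing(rows)
for column, n in missing.items():
    print(f"{column}: {n} missing of {len(rows)}")
```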

Method

A good data preparation workflow follows a clear order. First, understand the dataset structure. Second, check variable definitions. Third, inspect missing values, impossible values and unusual observations. Fourth, recode variables carefully. Fifth, create a clean analysis dataset while keeping the raw data unchanged. Finally, document every decision so another person could understand what was done.

Step 1: Identify what one row represents.
Step 2: Identify the outcome, exposure, predictors and covariates.
Step 3: Check variable names, labels and definitions.
Step 4: Check whether each variable has the correct type.
Step 5: Count missing values for each variable.
Step 6: Check minimums, maximums and impossible values.
Step 7: Recode categories consistently.
Step 8: Create derived variables only when justified.
Step 9: Save a clean analysis dataset separately from the raw data.
Step 10: Document all exclusions, recoding and cleaning decisions.
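
Step 9 can be illustrated with a minimal sketch that reads a raw CSV file and writes the cleaned rows to a different file, so the raw file is never opened for writing. The file names, column names, the missing code 999 and the two cleaning rules are all assumptions made for the example, not rules from this guide.

```python
import csv
import tempfile
from pathlib import Path

def clean_row(row):
    """Apply the (hypothetical) cleaning rules to one raw row; return a new dict."""
    cleaned = dict(row)  # copy, so the raw row is never modified
    cleaned["group"] = row["group"].strip().lower()
    age = row["age"].strip()
    cleaned["age"] = "" if age in {"", "999"} else age  # 999 assumed to mean missing
    return cleaned

def make_clean_file(raw_path, clean_path):
    """Read the raw CSV and write a cleaned copy to a different file."""
    with open(raw_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        cleaned_rows = [clean_row(row) for row in reader]
    with open(clean_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(cleaned_rows)
    return len(cleaned_rows)

# Demonstration with a throwaway folder and made-up data.
workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw_scores.csv"
clean_file = workdir / "clean_scores.csv"
raw_text = "id,age,group\nS01,21,Control\nS02,999, treatment\n"
raw_file.write_text(raw_text)

n_rows = make_clean_file(raw_file, clean_file)
print(n_rows)
print(clean_file.read_text())
```

Keeping the cleaning rules in a script rather than editing cells by hand also supports Step 10, because the script itself is a record of what was changed.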

Working

Suppose a dissertation dataset contains student ID, age, gender, treatment group, pre-test score, post-test score and final grade. Before analysis, you should check whether each student appears once or multiple times, whether age is numerical, whether treatment groups are labelled consistently, whether pre-test and post-test scores are within possible ranges and whether missing values are coded properly. Only after this should you compare groups, fit regression models or create summary tables.
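
Two of these checks, repeated student IDs and scores outside the possible range, might look like the sketch below. The data are made up, and the 0 to 100 score range is an assumption that should be replaced with the real limits of the test.

```python
from collections import Counter

students = [
    {"id": "S01", "pre_test": 55, "post_test": 62},
    {"id": "S02", "pre_test": 48, "post_test": 170},  # 170 exceeds the maximum
    {"id": "S03", "pre_test": 71, "post_test": 80},
    {"id": "S02", "pre_test": 50, "post_test": 66},   # S02 appears twice
]

# Assumed valid range for both tests; take the real limits from the codebook.
SCORE_MIN, SCORE_MAX = 0, 100

id_counts = Counter(row["id"] for row in students)
duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

out_of_range = [
    (row["id"], column, row[column])
    for row in students
    for column in ("pre_test", "post_test")
    if not SCORE_MIN <= row[column] <= SCORE_MAX
]

print(duplicate_ids)  # ['S02']
print(out_of_range)   # [('S02', 'post_test', 170)]
```

A repeated ID is not automatically an error. It may reflect a genuine repeated measurement, which is why the check reports the IDs instead of deleting rows.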

Check the unit of analysis: one row per student, patient, sample or visit.
Check ID variables for duplicates.
Check whether numerical variables have realistic ranges.
Check whether categorical labels are consistent, for example that Male and M, or Female and F, are not both used for the same category.
Check whether missing values are genuine missing values or hidden codes.
Check whether derived variables such as change scores are calculated correctly.
Create summary tables before formal analysis.
Use graphs to inspect distributions and unusual values.
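
The label and derived-variable checks above can be sketched as two small functions. The mapping `GENDER_LABELS` and the function names are hypothetical, and the mapping should list only the labels actually found in your data. The change score is assumed here to be post-test minus pre-test; state the direction explicitly in your own work.

```python
# Hypothetical mapping from labels seen in the raw data to one agreed label.
GENDER_LABELS = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}

def harmonise_gender(label):
    """Return the agreed label, or None when the raw label is not recognised."""
    return GENDER_LABELS.get(label.strip().lower())

def change_score(pre, post):
    """Post-test minus pre-test; None if either score is missing."""
    if pre is None or post is None:
        return None
    return post - pre

raw_labels = ["Male", "M", "female ", "F", "unknown"]
labels = [harmonise_gender(label) for label in raw_labels]
print(labels)  # ['Male', 'Male', 'Female', 'Female', None]

print(change_score(55, 62))    # 7
print(change_score(48, None))  # None
```

Returning `None` for an unrecognised label, rather than guessing, keeps unexpected categories visible for review.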

Limitations

Data preparation improves analysis quality, but it does not solve every problem. If the study design is weak, if important variables were not collected, or if there is severe missing data, cleaning alone cannot fix the issue. Data preparation should not be used to manipulate results or remove observations simply because they do not support the expected conclusion.

Cleaning cannot fix a poorly designed study.
Removing outliers without justification can bias results.
Deleting rows with missing data may reduce power and introduce bias.
Recoding categories can change interpretation.
Creating too many derived variables can increase confusion.
Data cleaning decisions should not be based on achieving statistical significance.
Uncertainty and limitations should still be reported.
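
One way to investigate unusual values without deleting them is to flag them and keep every row. The sketch below uses the 1.5 × IQR rule, which is one common convention chosen here for illustration and not a rule prescribed by this guide. The scores are made up and the function name `flag_outliers` is hypothetical.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return one True/False flag per value using the k * IQR rule."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [not (lower <= value <= upper) for value in values]

scores = [52, 55, 58, 60, 61, 63, 65, 67, 70, 150]
flags = flag_outliers(scores)
flagged = [score for score, is_flagged in zip(scores, flags) if is_flagged]

print(flagged)      # [150]
print(len(scores))  # 10, no rows were removed
```

A flagged value is a prompt to check the source record. Whether it is a data-entry error or a genuine extreme observation is a judgement the rule cannot make.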

Discussion

A strong analysis report should briefly explain how the data were prepared. This includes the number of observations, how missing values were handled, whether exclusions were made, how variables were coded and whether unusual values were checked. Good data preparation makes the analysis more transparent and easier to defend. It also helps students understand the data before choosing statistical methods.

Always keep the raw dataset unchanged.
Create a separate cleaned dataset for analysis.
Record every cleaning step.
Use clear and consistent variable names.
Report missing data handling clearly.
Explain exclusions and recoding decisions.
Check data before choosing statistical tests.
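
Recording every cleaning step is easier if each exclusion logs its own row counts. The sketch below is one possible approach. The helper `apply_step`, the data and both exclusion rules are hypothetical and serve only to show the logging pattern.

```python
cleaning_log = []

def apply_step(rows, description, keep):
    """Keep rows where keep(row) is true and log the row count before and after."""
    kept = [row for row in rows if keep(row)]
    cleaning_log.append(
        {"step": description, "before": len(rows), "after": len(kept)}
    )
    return kept

records = [
    {"id": "S01", "age": 21, "consent": True},
    {"id": "S02", "age": None, "consent": True},
    {"id": "S03", "age": 17, "consent": False},
    {"id": "S04", "age": 25, "consent": True},
]

analysis_rows = apply_step(
    records, "Exclude rows without consent", lambda r: r["consent"]
)
analysis_rows = apply_step(
    analysis_rows, "Exclude rows with missing age", lambda r: r["age"] is not None
)

for entry in cleaning_log:
    print(f'{entry["step"]}: {entry["before"]} -> {entry["after"]}')
```

The resulting log gives the numbers needed to report how many observations were excluded at each stage, while the original `records` list stays unchanged.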

Practical checklist

Before you apply this topic

Do you know what each row represents?
Do you know what each column represents?
Have you identified the outcome variable?
Have you identified exposures, predictors or grouping variables?
Have you checked variable types?
Have you checked missing values?
Have you checked impossible values?
Have you checked outliers or unusual observations?
Have you checked duplicate IDs or repeated records?
Have you recoded categorical variables consistently?
Have you kept the raw dataset unchanged?
Have you saved a clean analysis dataset?
Have you documented all cleaning decisions?

Common mistakes

What to avoid

Running analysis before understanding the dataset.
Overwriting the raw data file.
Ignoring hidden missing value codes such as 99 or 999.
Treating categorical codes as continuous numbers.
Deleting outliers without investigation.
Ignoring duplicate records.
Creating derived variables without checking formulas.
Changing group labels inconsistently.
Failing to document cleaning decisions.
Using different cleaned datasets for different parts of the same analysis.
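
One possible safeguard against an overwritten raw file, or against mixing up dataset versions, is to record a checksum of the file and compare it later. This is a suggestion added for illustration, not a step from the guide. The file name and contents are made up, and `file_sha256` is a hypothetical helper built on the standard `hashlib` module.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Demonstration with a throwaway file and made-up data.
raw_file = Path(tempfile.mkdtemp()) / "raw_scores.csv"
raw_file.write_text("id,score\nS01,64\n")

recorded = file_sha256(raw_file)  # store this alongside your cleaning notes
unchanged = file_sha256(raw_file) == recorded

raw_file.write_text("id,score\nS01,46\n")  # simulate an accidental overwrite
changed = file_sha256(raw_file) != recorded

print(unchanged, changed)  # True True
```

The same check can be applied to the cleaned dataset, so every part of an analysis can confirm it is reading the same file.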

How this connects to learning

Use the guide as a bridge between theory and application.

A resource guide should not replace a full course or live teaching session. Instead, it helps you organise your thinking. Use it to identify what you understand, what feels unclear, and what questions you should ask before applying a method to real data.

Before a lesson

Read the intuition and problem sections to prepare.

During analysis

Use the method and checklist to guide decisions.

When writing

Use limitations and discussion to improve interpretation.

Related guides

Continue with related topics.

How to choose the correct statistical test

Missing data: deletion, imputation and reporting

Common mistakes in dissertation data analysis

How to write a statistical analysis plan

How to report regression results in a dissertation
