Bioinformatics / Advanced / Resource guide

RNA-seq and differential expression analysis

An advanced guide to RNA-seq differential expression analysis, covering count data, quality control, normalisation, experimental design, multiple testing and biological interpretation.

Structure

Problem, intuition, method, working, limitations and discussion.

Best for

Students preparing for coursework, analysis, interpretation or revision.

Use with

Learning Hub lessons, tutoring sessions or dissertation planning.

Resource guide

Problem

RNA-seq studies measure gene expression across thousands of genes. The aim is often to identify genes that are differentially expressed between conditions, such as treated versus control samples or disease versus healthy samples. The analysis is statistically challenging because the data are high-dimensional, count-based, noisy and affected by library size, batch effects and biological variability. Students often focus on producing a volcano plot without understanding the design and assumptions behind the analysis.

Thousands of genes are tested at once.
Raw read counts are not directly comparable across samples.
Library size and composition effects need normalisation.
Small sample sizes can make estimates unstable.
Batch effects can be confused with biological effects.
Multiple testing correction is essential.
Significant genes need biological interpretation, not just statistical reporting.
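
The point about raw counts and library size can be illustrated with a minimal sketch. The sample names, gene names and counts below are hypothetical, and counts per million (CPM) is used only because it is the simplest depth correction; it does not address composition effects.

```python
# Hypothetical raw counts: both samples have the same relative expression,
# but sample_b was sequenced twice as deeply as sample_a.
counts = {
    "sample_a": {"geneA": 100, "geneB": 300, "geneC": 600},
    "sample_b": {"geneA": 200, "geneB": 600, "geneC": 1200},
}


def counts_per_million(sample_counts):
    """Scale one sample's counts by its library size (total reads)."""
    library_size = sum(sample_counts.values())
    return {gene: n / library_size * 1_000_000 for gene, n in sample_counts.items()}


cpm = {sample: counts_per_million(c) for sample, c in counts.items()}

# Raw counts differ two-fold between samples; CPM values agree.
for gene in counts["sample_a"]:
    print(
        gene,
        "raw:", counts["sample_a"][gene], counts["sample_b"][gene],
        "cpm:", round(cpm["sample_a"][gene]), round(cpm["sample_b"][gene]),
    )
```

Comparing the raw counts alone would suggest every gene is doubled in sample_b, when the only difference is sequencing depth.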

Intuition

RNA-seq differential expression asks whether the expression level of each gene differs systematically between groups after accounting for sequencing depth and biological variability. Each gene is tested separately, but with only a few replicates per group the per-gene variance estimates are unstable, so most methods borrow information across genes to estimate variation reliably. Because thousands of tests are performed, false discovery rate control is usually more appropriate than an unadjusted per-gene p-value threshold.

The basic data are gene-level counts.
Counts depend on both biology and sequencing depth.
Normalisation makes samples more comparable.
Differential expression compares conditions while modelling count variability.
Fold change describes the size and direction of expression difference.
Adjusted p-values control false discoveries across many genes.
Biological interpretation should consider pathways, functions and study design.
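
To make the idea of adjusted p-values concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way to control the false discovery rate. The guide does not name a specific procedure, so this choice is an assumption, and the five p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(p_values)

for p, q in zip(p_values, adjusted):
    print(f"raw p = {p:.3f}  adjusted p = {q:.5f}")
```

In this toy example four genes have a raw p-value below 0.05, but only two remain below 0.05 after adjustment. In a real analysis the adjustment is applied across all tested genes, not five.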

Method

A differential expression workflow begins with experimental design and metadata checking. Samples should be labelled correctly, groups should be balanced where possible and batch variables should be recorded. After read processing and quantification, the count matrix is checked, lowly expressed genes may be filtered, normalisation is performed and a statistical model is fitted for each gene. Results are interpreted using log fold changes, adjusted p-values and biological context.

Step 1: Define the biological question and comparison.
Step 2: Check sample metadata and group labels carefully.
Step 3: Inspect raw sequencing quality and mapping or quantification summaries.
Step 4: Build a gene-by-sample count matrix.
Step 5: Filter genes with very low expression where appropriate.
Step 6: Normalise for library size and composition effects.
Step 7: Explore samples using PCA or clustering.
Step 8: Fit a differential expression model with relevant design variables.
Step 9: Correct for multiple testing by controlling the false discovery rate.
Step 10: Interpret genes, pathways and limitations biologically.
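
Steps 5 and 6 can be sketched in a few lines. This is an illustration under stated assumptions, not a replacement for an established package: the count matrix, the filtering thresholds (`min_count`, `min_samples`) and the function names are hypothetical, and the normalisation shown is the median-of-ratios approach, chosen here as one widely used method that handles both library size and composition.

```python
import math
from statistics import median

# Hypothetical gene-by-sample count matrix: one row per gene,
# one column per sample, in the order given by `samples`.
samples = ["ctrl_1", "ctrl_2", "trt_1", "trt_2"]
counts = {
    "geneA": [100, 200, 110, 220],
    "geneB": [50, 100, 200, 400],
    "geneC": [10, 20, 10, 20],
    "geneD": [0, 1, 0, 2],
    "geneE": [500, 1000, 480, 960],
}


def filter_low_expression(counts, min_count=10, min_samples=2):
    """Step 5: keep genes with >= min_count reads in >= min_samples samples."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(c >= min_count for c in row) >= min_samples
    }


def size_factors(counts):
    """Step 6: median-of-ratios size factor for each sample."""
    # Only genes with non-zero counts everywhere have a defined log.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(len(usable[0])):
        log_ratios = [math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors


filtered = filter_low_expression(counts)
factors = size_factors(filtered)
# Divide each sample's counts by that sample's size factor.
normalised = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in filtered.items()
}

print("genes kept:", sorted(filtered))
print("size factors:", dict(zip(samples, (round(f, 3) for f in factors))))
```

In this toy matrix the second sample in each group has exactly twice the counts of the first, so its size factor is twice as large and the normalised values agree within each group.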

Resource guide

Working

Suppose an RNA-seq study compares infected and uninfected cell samples. For each gene, the analysis estimates whether expression differs between infection conditions. A positive log2 fold change indicates higher expression in infected samples when the contrast is coded as infected versus uninfected; with the reverse coding, the sign flips. The adjusted p-value indicates whether the gene remains statistically significant after accounting for thousands of tests.

Rows of the count matrix represent genes.
Columns represent samples.
Metadata describe condition, batch, donor, time point or treatment.
Normalisation adjusts for differences in sequencing depth and composition.
PCA can reveal whether samples cluster by condition or unwanted batch.
The design formula should match the biological question.
Log2 fold change describes magnitude and direction.
FDR-adjusted p-values help control false discoveries.
Volcano plots combine effect size and statistical evidence visually.
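
The dependence of the sign on contrast coding, and the way a volcano plot combines effect size with statistical evidence, can be sketched as follows. The group means, the pseudocount of 1 and the thresholds (absolute log2 fold change of at least 1, adjusted p-value below 0.05) are illustrative assumptions, not recommendations from this guide.

```python
import math


def log2_fold_change(numerator_mean, denominator_mean, pseudocount=1.0):
    """log2 ratio of two group means; the pseudocount avoids division by zero."""
    return math.log2((numerator_mean + pseudocount) / (denominator_mean + pseudocount))


def classify(lfc, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene the way a volcano plot would usually colour it."""
    if padj < padj_cutoff and lfc >= lfc_cutoff:
        return "up"
    if padj < padj_cutoff and lfc <= -lfc_cutoff:
        return "down"
    return "not significant"


# Hypothetical normalised group means for one gene.
infected_mean, uninfected_mean = 399.0, 99.0

# Contrast coded as infected versus uninfected: positive means higher in infected.
lfc_infected_vs_uninfected = log2_fold_change(infected_mean, uninfected_mean)
# Same data, reverse coding: the sign flips.
lfc_uninfected_vs_infected = log2_fold_change(uninfected_mean, infected_mean)

print(lfc_infected_vs_uninfected, lfc_uninfected_vs_infected)
print(classify(lfc_infected_vs_uninfected, padj=0.01))
print(classify(lfc_infected_vs_uninfected, padj=0.20))
```

The same gene is labelled differently depending on its adjusted p-value, which is why a large fold change alone is not reported as a finding.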

Limitations

Differential expression analysis is sensitive to study design. No statistical method can fully rescue confounded batches, very small sample sizes or mislabelled samples. RNA abundance is also not the same as protein activity or biological mechanism. Significant genes should be interpreted as evidence of expression differences under the study conditions, not automatic proof of causal pathways.

Small sample sizes reduce power and stability.
Confounded batch and condition can make interpretation impossible.
Outlier samples can strongly affect results.
Low-count genes can produce noisy estimates.
Multiple testing correction reduces false positives but does not remove bias.
Differential expression does not prove functional importance.
Validation may be needed using independent data or experiments.
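
Confounding between batch and condition can be checked from the metadata before any modelling. The sketch below is a minimal illustration: the metadata records and function names are hypothetical, and it only detects the extreme case where every batch contains a single condition.

```python
from collections import defaultdict


def conditions_per_batch(metadata):
    """Map each batch to the set of conditions observed in it."""
    table = defaultdict(set)
    for row in metadata:
        table[row["batch"]].add(row["condition"])
    return dict(table)


def is_fully_confounded(metadata):
    """True when no batch contains more than one condition."""
    return all(len(conds) == 1 for conds in conditions_per_batch(metadata).values())


# Hypothetical design 1: all infected samples were processed in batch 1.
confounded_design = [
    {"sample": "s1", "condition": "infected", "batch": "b1"},
    {"sample": "s2", "condition": "infected", "batch": "b1"},
    {"sample": "s3", "condition": "uninfected", "batch": "b2"},
    {"sample": "s4", "condition": "uninfected", "batch": "b2"},
]

# Hypothetical design 2: each batch contains both conditions.
balanced_design = [
    {"sample": "s1", "condition": "infected", "batch": "b1"},
    {"sample": "s2", "condition": "uninfected", "batch": "b1"},
    {"sample": "s3", "condition": "infected", "batch": "b2"},
    {"sample": "s4", "condition": "uninfected", "batch": "b2"},
]

print(conditions_per_batch(confounded_design), is_fully_confounded(confounded_design))
print(conditions_per_batch(balanced_design), is_fully_confounded(balanced_design))
```

In the first design a batch term cannot be separated from the condition effect, so any difference could be due to either; in the second, batch can be included as a design variable.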

Discussion

A strong RNA-seq report should describe the design, samples, quality control, normalisation, modelling approach and multiple-testing correction. It should interpret key genes or pathways in biological context and acknowledge limitations such as sample size, batch effects, tissue specificity and validation. The result is not just a list of genes; it is a statistical summary of expression evidence under a defined design.

State the biological comparison clearly.
Describe the sample groups and important covariates.
Mention quality control and sample exploration.
Report the differential expression method and FDR threshold.
Interpret log fold changes with direction and context.
Use pathway or enrichment analysis carefully.
Discuss batch effects, sample size and validation needs.

Practical checklist

Before you apply this topic

Have you defined the biological comparison?
Have you checked sample metadata?
Have you checked quality-control summaries?
Have you created the correct count matrix?
Have you filtered very low-expression genes where appropriate?
Have you normalised counts correctly?
Have you explored samples with PCA or clustering?
Have you included relevant design variables?
Have you corrected for multiple testing?
Have you interpreted log fold changes correctly?
Have you considered batch effects?
Have you discussed biological validation and limitations?

Common mistakes

What to avoid

Analysing raw counts without normalisation.
Ignoring sample metadata errors.
Confusing batch effects with biological effects.
Using p < 0.05 without FDR correction.
Reporting only a volcano plot.
Ignoring low sample size.
Interpreting fold change without checking direction of contrast.
Treating differential expression as proof of mechanism.
Ignoring outlier samples.
Failing to connect gene lists to biological context.

How this connects to learning

Use the guide as a bridge between theory and application.

A resource guide should not replace a full course or live teaching session. Instead, it helps you organise your thinking. Use it to identify what you understand, what feels unclear, and what questions you should ask before applying a method to real data.

Before a lesson

Read the intuition and problem sections to prepare.

During analysis

Use the method and checklist to guide decisions.

When writing

Use limitations and discussion to improve interpretation.

Related guides

Continue with related topics.

Multiple testing and false discovery rate

How to prepare your data before analysis

Common mistakes in dissertation data analysis

Introduction to causal inference and DAGs

Reproducible analysis with R Markdown or Quarto
