Principal Component Analysis (PCA) plots are the standard quality control and exploratory visualization for high-dimensional biological data — RNA-seq, proteomics, metabolomics, and single-cell datasets. They reduce thousands of dimensions to two axes that capture the most variance, revealing sample clustering, batch effects, and outliers at a glance.
What a PCA plot shows
A PCA scatter plot places each sample as a point in a 2D space defined by:
- PC1 (x-axis): The dimension explaining the most variance across all samples
- PC2 (y-axis): The dimension explaining the second most variance, orthogonal to PC1
Samples that cluster together are more similar to each other. Separated clusters suggest biological differences (treatment, genotype) or technical artifacts (batch effects).
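To make the two axes concrete, here is a minimal sketch of PCA computed by hand with a singular value decomposition. The toy matrix, the group shift, and all variable names are made up for illustration; real analyses would use prcomp or scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 samples x 50 features
X = rng.normal(size=(6, 50))
X[:3, :10] += 3.0  # shift the first three samples on 10 features to mimic two groups

Xc = X - X.mean(axis=0)  # center each feature across samples
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * S                    # sample coordinates; column 0 is PC1, column 1 is PC2
var_ratio = S**2 / np.sum(S**2)   # fraction of total variance carried by each PC

print("PC1, PC2 variance (%):", np.round(var_ratio[:2] * 100, 1))
print("PC1 . PC2 =", float(Vt[0] @ Vt[1]))  # close to 0: the axes are orthogonal
```

The rows of `Vt` are the PC directions, sorted by decreasing variance, which is why PC1 always explains at least as much variance as PC2.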
Step 1 — Prepare your data for PCA
PCA is sensitive to data scale. Preparation depends on data type:
RNA-seq:
- Use variance-stabilizing transformation (VST) or rlog values from DESeq2
- Do NOT use raw counts or untransformed CPM: the high dynamic range lets a handful of highly expressed genes dominate the PCA
Proteomics / metabolomics:
- Log₂ transform and median normalize before PCA
- Impute or exclude features with >50% missing values
Single-cell RNA-seq:
- Use highly variable gene selection (top 2000–5000 HVGs)
- Apply log1p normalization; scale each feature
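General rule: Center and scale all features (mean = 0, SD = 1 across samples) before PCA unless you have a specific reason not to. In R: prcomp(t(mat), center = TRUE, scale. = TRUE), where mat is a feature × sample matrix. Remove zero-variance features first, because prcomp cannot rescale a constant column to unit variance.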
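As a rough illustration of the proteomics / metabolomics preparation, the sketch below chains the missing-value filter, log₂ transform, median normalization, and per-feature scaling. The function name `prepare_intensities` is hypothetical, the input is assumed to be a feature × sample array of strictly positive intensities with NaN for missing values, and filling gaps with the feature mean is a placeholder assumption rather than a recommended imputation method.

```python
import numpy as np

def prepare_intensities(mat, max_missing=0.5):
    """Filter, log2-transform, median-normalize, and z-score a features x samples array."""
    mat = np.asarray(mat, dtype=float)

    # Exclude features with more than `max_missing` fraction of missing values
    keep = np.isnan(mat).mean(axis=1) <= max_missing
    logged = np.log2(mat[keep])

    # Median-normalize: subtract each sample's median so all sample medians sit at 0
    logged = logged - np.nanmedian(logged, axis=0, keepdims=True)

    # Placeholder imputation (assumption): fill remaining gaps with the feature mean
    feature_mean = np.nanmean(logged, axis=1, keepdims=True)
    filled = np.where(np.isnan(logged), feature_mean, logged)

    # Drop constant features, then scale each feature to mean 0, SD 1 across samples
    filled = filled[filled.std(axis=1, ddof=1) > 0]
    centered = filled - filled.mean(axis=1, keepdims=True)
    return centered / filled.std(axis=1, ddof=1, keepdims=True)

# Hypothetical 4 features x 4 samples; the third feature is 75% missing
raw = np.array([
    [100.0, 200.0, 400.0, 800.0],
    [50.0, np.nan, 60.0, 70.0],
    [np.nan, np.nan, np.nan, 10.0],
    [30.0, 35.0, 20.0, 90.0],
])
z = prepare_intensities(raw)
print(z.shape)
```

The result is still feature × sample, so it would be transposed before being passed to a PCA routine that expects samples in rows.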
Step 2 — Run PCA
In R:
pca <- prcomp(t(vst_mat), center = TRUE, scale. = TRUE)
pca_df <- as.data.frame(pca$x[, 1:2])
pca_df$Group <- metadata$group
# Variance explained
var_exp <- round(100 * summary(pca)$importance[2, 1:2], 1)
In Python:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
X = scaler.fit_transform(mat.T)
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
var_exp = pca.explained_variance_ratio_ * 100
Step 3 — Plot with correct axis labels
The most common mistake on PCA plots: axis labels that say "PC1" and "PC2" without the percentage variance explained.
Correct axis labels:
- X: "PC1 (38.4% variance)"
- Y: "PC2 (22.1% variance)"
Always include the variance explained. It tells readers how much of the total variance in the data these two dimensions capture. If PC1 + PC2 together explain <40%, mention this in the legend and consider showing additional PCs.
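In Python the labels can be built the same way as in R. This is a small sketch; `pc_axis_label` is a hypothetical helper, and the ratios are the example percentages above expressed as fractions, as scikit-learn's `explained_variance_ratio_` would return them.

```python
def pc_axis_label(pc_number, variance_ratio):
    """Format an axis label from a 1-based PC number and a variance fraction in [0, 1]."""
    return f"PC{pc_number} ({variance_ratio * 100:.1f}% variance)"

# Hypothetical ratios matching the example labels above
ratios = [0.384, 0.221]
x_label = pc_axis_label(1, ratios[0])
y_label = pc_axis_label(2, ratios[1])
print(x_label)
print(y_label)
```

With matplotlib, the strings would be passed to `ax.set_xlabel(x_label)` and `ax.set_ylabel(y_label)`. If the values have already been multiplied by 100, drop the `* 100` in the helper.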
Step 4 — Color, shape, and ellipses
Color and shape:
- Color points by the primary biological group (treatment, genotype, tissue)
- Use shape to encode a secondary variable (sex, batch, time point)
- Avoid using color alone — shapes allow grayscale reproduction
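Confidence ellipses: Adding 95% confidence ellipses by group is standard and strongly recommended. Note that ggplot2's stat_ellipse draws a data ellipse fitted to each group's points (by default assuming a multivariate t-distribution), so it describes the spread of the samples rather than the uncertainty of the group mean: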
- Use 95% CI ellipses (not standard deviation ellipses)
- Make ellipses semi-transparent (alpha = 0.15–0.25)
- Match ellipse color to point color
R (ggplot2):
library(ggplot2)
ggplot(pca_df, aes(x = PC1, y = PC2, color = Group, shape = Group)) +
  geom_point(size = 2.5) +
  stat_ellipse(aes(fill = Group), geom = "polygon",
               alpha = 0.15, level = 0.95) +
  scale_shape_manual(values = c(16, 17, 15, 18)) +
  labs(x = paste0("PC1 (", var_exp[1], "% variance)"),
       y = paste0("PC2 (", var_exp[2], "% variance)")) +
  theme_classic(base_size = 9) +
  coord_fixed()  # equal aspect ratio
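For Python users, a group ellipse can be computed from the sample covariance of the PC coordinates. The sketch below is a simplified normal-theory ellipse that uses the closed-form chi-square quantile for two dimensions; it is not a reimplementation of stat_ellipse and will not match it exactly, especially for small groups. The function name `normal_ellipse` is hypothetical.

```python
import numpy as np

def normal_ellipse(points, level=0.95, n_segments=100):
    """Return (n_segments, 2) outline of a normal-theory ellipse around 2D points."""
    points = np.asarray(points, dtype=float)
    if points.shape[0] < 3:
        raise ValueError("need at least 3 points to estimate a 2D covariance")

    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)

    # Chi-square quantile with 2 degrees of freedom: q = -2 * ln(1 - level)
    radius = np.sqrt(-2.0 * np.log(1.0 - level))

    # Principal axes of the covariance give the ellipse orientation and widths
    eigval, eigvec = np.linalg.eigh(cov)
    eigval = np.maximum(eigval, 0.0)  # guard against tiny negative round-off

    theta = np.linspace(0.0, 2.0 * np.pi, n_segments)
    circle = np.column_stack([np.cos(theta), np.sin(theta)])
    return center + radius * (circle * np.sqrt(eigval)) @ eigvec.T

# Hypothetical PC1/PC2 coordinates for one group of 20 samples
group_coords = np.random.default_rng(1).normal(size=(20, 2))
outline = normal_ellipse(group_coords)
print(outline.shape)
```

With matplotlib, `ax.fill(outline[:, 0], outline[:, 1], alpha=0.15)` would draw the semi-transparent polygon in the group's color.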
Step 5 — Sample labels
Label outlier samples or specific samples of interest. Rules:
- Do not label all samples if n > 10 — use the group legend instead
- Use ggrepel in R or adjustText in Python to avoid label overlap
- Label outliers to flag them in the text ("Sample X_04 was excluded as an outlier")
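One way to decide which points get a label is to flag samples that sit unusually far from their own group. The sketch below is a simple heuristic, not a formal outlier test: the function name `samples_to_label` and the cutoff of 3 times the group's median distance are assumptions chosen for illustration.

```python
import numpy as np

def samples_to_label(coords, groups, names, k=3.0):
    """Names of samples farther than k x the group's median distance from the group center."""
    coords = np.asarray(coords, dtype=float)
    groups = np.asarray(groups)
    flagged = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        # The coordinate-wise median is less affected by the outlier than the mean
        center = np.median(coords[idx], axis=0)
        dist = np.linalg.norm(coords[idx] - center, axis=1)
        cutoff = k * np.median(dist)
        flagged += [names[i] for i in idx[dist > cutoff]]
    return flagged

# Hypothetical PC1/PC2 coordinates: A5 sits far from the rest of group A
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (20, 20),
          (10, 10), (11, 10), (10, 11), (11, 11)]
groups = ["A"] * 5 + ["B"] * 4
names = ["A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B4"]
to_label = samples_to_label(coords, groups, names)
print(to_label)
```

Only the returned names would then be passed to the labeling step, leaving the remaining samples identified by the group legend.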
Step 6 — What to report in the legend
The figure legend must state:
- What data the PCA was performed on (e.g., "VST-normalized RNA-seq counts for top 500 variable genes")
- Number of samples per group
- What the color/shape encoding represents
- That ellipses are 95% confidence ellipses (if shown)
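The checklist above can be turned into a small helper so that no item is forgotten. This is only a sketch; `pca_legend` and its parameters are hypothetical names, and the wording of the output should be adapted to the journal's style.

```python
from collections import Counter

def pca_legend(data_description, groups, color_by, shape_by=None, ellipse_level=None):
    """Assemble a figure-legend string covering the required reporting items."""
    counts = Counter(groups)
    n_text = ", ".join(f"{g}: n = {n}" for g, n in sorted(counts.items()))
    parts = [
        f"PCA of {data_description}.",
        f"Samples per group: {n_text}.",
        f"Color indicates {color_by}.",
    ]
    if shape_by is not None:
        parts.append(f"Shape indicates {shape_by}.")
    if ellipse_level is not None:
        parts.append(f"Ellipses are {ellipse_level:.0%} confidence ellipses.")
    return " ".join(parts)

legend = pca_legend(
    "VST-normalized RNA-seq counts for top 500 variable genes",
    groups=["ctrl", "ctrl", "ctrl", "treated", "treated", "treated"],
    color_by="treatment",
    ellipse_level=0.95,
)
print(legend)
```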
Interpreting and presenting PCA results
Group separation: If biological groups cluster separately along PC1 or PC2, this suggests that the dominant source of variance is biological rather than technical, provided the groups are not confounded with batch.
Batch effects: If samples cluster by batch (library preparation date, sequencing lane) rather than biology, batch correction is needed before downstream analysis.
Outliers: Samples far from their group cluster should be investigated. Common causes: sample swap, RNA degradation, contamination.
Percent variance: If PC1 + PC2 < 30–40%, two dimensions do not adequately represent the data. Consider showing a PC1 vs PC3 or PC2 vs PC3 panel as a supplementary figure.
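A quick way to check the percent-variance rule is to accumulate the variance ratios and see how many PCs are needed to reach a chosen target. In this sketch the function name `pcs_needed` is hypothetical and the 40% default simply mirrors the rule of thumb above.

```python
import numpy as np

def pcs_needed(variance_ratio, target=0.40):
    """Number of leading PCs whose cumulative variance fraction reaches `target`, or None."""
    cumulative = np.cumsum(variance_ratio)
    reached = np.flatnonzero(cumulative >= target)
    return int(reached[0]) + 1 if reached.size else None

# Hypothetical variance fractions for the first five PCs
ratios = [0.18, 0.12, 0.09, 0.07, 0.05]
n_pcs = pcs_needed(ratios)
print(n_pcs)  # more than 2 means a PC1 vs PC2 panel alone under-represents the data
```

The input would typically be `pca.explained_variance_ratio_` from a scikit-learn PCA fitted with enough components.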
Building PCA plots in FigureGuild
FigureGuild's Graph Builder generates PCA plots from pasted expression matrices. Paste your sample × feature matrix, specify grouping, and the plot is generated with correct variance-explained labels, 95% ellipses, and publication styling. Export at journal DPI.
FAQ
Should I show PC1 vs PC2 or a different pair?
Start with PC1 vs PC2. If your biological groups do not separate on these axes, check PC1 vs PC3 and PC2 vs PC3. If groups only separate on later PCs, investigate potential batch effects or data quality issues.
What does it mean if PC1 explains 80% of variance?
One dimension dominates. This often indicates a strong batch effect or a single dominant biological variable (e.g., tissue type). Investigate what PC1 correlates with.
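One simple way to investigate is to ask how much of a PC's variance each categorical covariate accounts for (the between-group share of the sum of squares, often called eta-squared). The sketch below assumes the PC scores are not all identical; `variance_explained_by` and the example covariates are hypothetical.

```python
import numpy as np

def variance_explained_by(scores, labels):
    """Fraction of the variance in one PC's scores explained by a categorical covariate."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    grand_mean = scores.mean()
    total = np.sum((scores - grand_mean) ** 2)
    between = sum(
        np.sum(labels == g) * (scores[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return float(between / total)

# Hypothetical PC1 scores for six samples with two candidate covariates
pc1 = [-5.0, -4.0, -6.0, 5.0, 4.0, 6.0]
batch = ["a", "a", "a", "b", "b", "b"]
tissue = ["x", "y", "x", "y", "x", "y"]

batch_share = variance_explained_by(pc1, batch)
tissue_share = variance_explained_by(pc1, tissue)
print(round(batch_share, 2), round(tissue_share, 2))
```

A value near 1 for batch and a low value for the biological variable would point to a technical driver of PC1.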
Should I use PCA or UMAP/tSNE?
PCA is linear and preserves global structure, so it is preferred for sample-level QC of bulk RNA-seq and proteomics. UMAP/tSNE are non-linear and better for visualizing clusters in single-cell data. For bulk datasets with <100 samples, PCA is standard.
How many genes should I use for bulk RNA-seq PCA?
Commonly the top 500–5000 most variable genes; DESeq2's plotPCA, for example, defaults to the top 500 (ntop = 500). Using all ~20,000 genes includes many that are simply noise. Running PCA on every row of vst_mat also works, but subsetting to variable genes usually produces cleaner separation.
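Selecting the most variable genes is a short operation on a gene × sample matrix. A minimal sketch, assuming the matrix has already been variance-stabilized; `top_variable_rows` is a hypothetical helper name.

```python
import numpy as np

def top_variable_rows(mat, n_top=500):
    """Return the n_top highest-variance rows of a genes x samples array and their indices."""
    mat = np.asarray(mat, dtype=float)
    variances = mat.var(axis=1, ddof=1)
    n_top = min(n_top, mat.shape[0])
    order = np.argsort(variances)[::-1][:n_top]  # highest variance first
    return mat[order], order

# Hypothetical 4 genes x 4 samples with clearly different variances
toy = np.array([
    [5.0, 5.0, 5.0, 5.0],      # constant gene
    [0.0, 10.0, 0.0, 10.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 100.0, 0.0, 100.0],  # most variable gene
])
subset, picked = top_variable_rows(toy, n_top=2)
print(picked)
```

The subset would then be transposed and passed to the PCA step in place of the full matrix.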
My confidence ellipses don't show — why?
With very few samples per group, an ellipse cannot be estimated reliably. In ggplot2, stat_ellipse needs at least 4 points in a group; with fewer it skips that group and reports "Too few points to calculate an ellipse".