"Finding the Axes of Variation" — Understanding Your Data with Principal Component Analysis
There's a thought experiment that helps explain what Principal Component Analysis does. Imagine you're trying to describe the differences between a large group of people, and you have a thousand measurements for each person — height, weight, age, dozens of blood markers, hundreds more. That's an impossibly high-dimensional space. PCA's job is to find the most important directions of variation in that space and let you look along those directions instead.
In genomics, the same logic applies. You have thousands of gene expression measurements per sample. PCA collapses that complexity into a small number of "principal components": axes, each a weighted combination of genes, that capture the most variance in the dataset. The first principal component captures the most variation of all. The second captures the most of what remains while staying uncorrelated with the first, and so on. When you plot your samples along the first two or three of these axes, you're looking at a compressed summary of the whole dataset, and how faithful that summary is depends on how much of the total variance those few components explain.
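To make the idea concrete, here is a minimal sketch of PCA on a samples-by-genes matrix using NumPy's singular value decomposition. It is not R2's implementation. The data are synthetic, and the function name `pca`, the group sizes and the 50 shifted genes are all assumptions made for illustration.

```python
import numpy as np

def pca(expression, n_components=2):
    """PCA of a samples x genes matrix via SVD.

    Returns (scores, components, explained_ratio), where scores are the
    sample coordinates along each principal component.
    """
    X = np.asarray(expression, dtype=float)
    centered = X - X.mean(axis=0)               # centre each gene at zero
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    variances = S ** 2 / (X.shape[0] - 1)       # variance along each component
    explained_ratio = variances / variances.sum()
    scores = U[:, :n_components] * S[:n_components]
    return scores, Vt[:n_components], explained_ratio[:n_components]

# Synthetic example (hypothetical data): 40 samples, 500 genes,
# with 50 genes shifted upwards in the second half of the samples.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 500
group = np.repeat([0, 1], n_samples // 2)
data = rng.normal(size=(n_samples, n_genes))
data[group == 1, :50] += 3.0

scores, components, explained_ratio = pca(data, n_components=3)
print("Fraction of variance per component:", np.round(explained_ratio, 3))
```

The `explained_ratio` values are the quickest check on how much of the dataset a two- or three-axis plot actually represents.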
R2's PCA module makes this process visual and interactive — and importantly, it connects the mathematical output directly back to biological interpretation.
You start by selecting your dataset and running PCA. What appears is a scatter plot: one dot per sample, arranged in the two-dimensional space defined by the first two principal components. If your dataset contains distinct biological groups, they will often separate in this space without you having told the algorithm anything about them. It's one of those analysis moments that feels almost like magic: the data organising itself by its own internal logic.
Then you start asking questions of the structure. You colour the dots by a clinical track, such as tumour subtype, treatment history or survival outcome, and see whether the PCA separation aligns with biological reality. When the colours cluster cleanly, that is good evidence the component reflects real biology, though it is worth colouring by technical tracks such as batch as well, since those can produce equally clean splits.
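Outside the platform, one rough way to put a number on "the colours cluster cleanly" is to ask what fraction of the variance in a component's scores is explained by an annotation (between-group over total sum of squares). The sketch below is an illustration on synthetic data, not something R2 is known to compute. The helper names and the `subtype` and `batch` labels are hypothetical.

```python
import numpy as np

def pc_scores(expression, n_components=2):
    """Sample coordinates along the first principal components (via SVD)."""
    X = np.asarray(expression, dtype=float)
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def fraction_explained_by_label(values, labels):
    """Between-group sum of squares divided by total sum of squares (0 to 1)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    grand_mean = values.mean()
    total_ss = ((values - grand_mean) ** 2).sum()
    between_ss = sum(
        (labels == g).sum() * (values[labels == g].mean() - grand_mean) ** 2
        for g in np.unique(labels)
    )
    return float(between_ss / total_ss)

# Synthetic example (hypothetical annotations): subtype drives 50 genes,
# while batch is balanced across subtypes and has no effect on expression.
rng = np.random.default_rng(0)
subtype = np.repeat(["A", "B"], 20)
batch = np.tile(["run1", "run2"], 20)
data = rng.normal(size=(40, 500))
data[subtype == "B", :50] += 3.0

pc1 = pc_scores(data)[:, 0]
subtype_fit = fraction_explained_by_label(pc1, subtype)
batch_fit = fraction_explained_by_label(pc1, batch)
print(f"PC1 variance explained by subtype: {subtype_fit:.2f}")
print(f"PC1 variance explained by batch:   {batch_fit:.2f}")
```

A value near 1 for a biological label and near 0 for a technical one is the numerical counterpart of a cleanly coloured plot.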
R2 also lets you view the data in 3D, rotating the first three principal components as a spinning cloud of samples. What looks like an amorphous blob from one angle sometimes reveals clean, unexpected structure when rotated to a different view — subpopulations that are invisible in 2D, sitting in a layer of their own.
Perhaps the most biologically useful step comes last: you can ask R2 which genes are driving each principal component using the Toplister function. If PC1 separates your tumour subtypes, the genes that load most heavily onto PC1 are the ones whose expression is most responsible for that separation. That gene list becomes the starting point for a mechanistic investigation — the PCA has pointed you at the right molecules to care about.
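The idea of "genes that load most heavily" can be sketched in a few lines: each principal component is a vector of per-gene weights, and ranking genes by the absolute size of their weight gives the main drivers of that component. This is an illustration on synthetic data and makes no claim about how R2's Toplister ranks genes internally. The function `top_loading_genes` and the gene names are hypothetical.

```python
import numpy as np

def top_loading_genes(expression, gene_names, component=0, n_top=10):
    """Rank genes by absolute weight on one principal component.

    expression is samples x genes; component is 0-based (0 means PC1).
    Returns a list of (gene_name, weight) pairs, largest |weight| first.
    """
    X = np.asarray(expression, dtype=float)
    centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    weights = Vt[component]                      # one weight per gene
    order = np.argsort(-np.abs(weights))[:n_top]
    return [(gene_names[i], float(weights[i])) for i in order]

# Synthetic example (hypothetical data): 5 genes differ strongly
# between the first and second half of the samples.
rng = np.random.default_rng(1)
n_samples, n_genes = 30, 200
gene_names = [f"gene_{i}" for i in range(n_genes)]
data = rng.normal(size=(n_samples, n_genes))
data[n_samples // 2:, :5] += 6.0

top = top_loading_genes(data, gene_names, component=0, n_top=5)
for name, weight in top:
    print(f"{name}\t{weight:+.3f}")
```

The sign of a weight only has meaning relative to the other weights on the same component, because the overall direction of a principal component is arbitrary.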
This is part of an ongoing series on the R2 Genomics Analysis and Visualization Platform, developed at Amsterdam UMC. All analyses can be freely performed at r2.amc.nl. Full tutorials are available at r2-tutorials.readthedocs.io.