"Carving Nature at Its Joints" — Discovering Tumour Subtypes with K-Means Clustering

The dataset has been sitting on your desk, metaphorically speaking, for months. One hundred and seventy-five medulloblastoma tumours. Clinically, they're all "medulloblastoma." But you've read enough papers to suspect that name is hiding several very different diseases underneath it. The question is whether the data agrees, and if so, how many distinct groups actually exist.

This is the job of K-means clustering — and in R2, it's a surprisingly tactile, exploratory experience.

The idea behind K-means is straightforward: you tell the algorithm how many groups (K) you want to find, and it assigns each sample to the group whose average expression profile it most resembles, recomputes those averages, and repeats until the assignments stop changing. In doing so it minimises the expression differences within each group, which in turn makes the groups as distinct from one another as possible. The result is shown as a coloured heatmap where each row is a gene, each column is a sample, and the colour, from deep blue through white to vivid red, reflects whether that gene is low, average, or high in that sample.
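If you like to see the mechanics, here is a minimal sketch of the textbook K-means loop in plain Python. It is an illustration only, not R2's actual implementation. The function names, the toy expression values, and the choice to start from K randomly picked samples are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    # Sum of squared differences between two expression profiles.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(samples, k, seed=0, max_iter=100):
    # Assumption: start from k randomly chosen samples as initial centroids.
    rng = random.Random(seed)
    centroids = [list(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(s, centroids[c]))
            for s in samples
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Made-up data: six samples, three genes each, two obvious groups.
samples = [
    [1.0, 1.2, 0.9], [0.8, 1.0, 1.1], [1.1, 0.9, 1.0],
    [5.0, 5.2, 4.9], [4.8, 5.1, 5.0], [5.1, 4.9, 5.2],
]
labels, centroids = kmeans(samples, k=2, seed=1)
print(labels)
```

The cluster numbers themselves are arbitrary. What matters is which samples end up sharing a number.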

What makes R2's implementation particularly useful for biologists is the ability to layer your biological knowledge on top of the mathematical result. Once the algorithm has drawn its boundaries, you can ask: do these computationally-derived clusters correspond to known clinical subtypes? You overlay a track — say, histological subtype or molecular marker status — and see whether the colours align. When they do, it's deeply satisfying. When they don't, it raises an even more interesting question.
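Outside the heatmap, the same "do the colours align?" question can be asked with a simple cross-tabulation of cluster against annotation. The sketch below is one way to do that in Python. The cluster labels and the `subtype_A`/`subtype_B`/`subtype_C` annotations are invented placeholders, not values from any real dataset or from R2.

```python
from collections import Counter

def cross_tabulate(cluster_labels, annotations):
    # Count how often each (cluster, annotation) pair occurs.
    counts = Counter(zip(cluster_labels, annotations))
    clusters = sorted(set(cluster_labels))
    categories = sorted(set(annotations))
    # Counter returns 0 for pairs that never occur.
    return {c: {a: counts[(c, a)] for a in categories} for c in clusters}

# Hypothetical example: eight samples, three clusters, three annotated subtypes.
cluster_labels = [0, 0, 0, 1, 1, 1, 2, 2]
annotations = [
    "subtype_A", "subtype_A", "subtype_A",
    "subtype_B", "subtype_B", "subtype_A",
    "subtype_C", "subtype_C",
]

table = cross_tabulate(cluster_labels, annotations)
for cluster, row in table.items():
    print(cluster, row)
```

A table with one dominant annotation per cluster means the tracks line up. Counts spread across a row are the "even more interesting question" case.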

There's an important nuance to know: K-means clustering has a random initialisation step, which means running it twice can give slightly different results. R2 helps you handle this with the option to fix a random seed, ensuring your clustering is reproducible — essential when you're building a figure for a paper.
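To see why a fixed seed makes the difference, consider just the initialisation step. This is a hedged sketch in Python: `pick_initial_centroids` is a hypothetical helper, and how R2 uses its seed internally is not something this example claims to reproduce.

```python
import random

def pick_initial_centroids(samples, k, seed=None):
    # With a seed, the generator produces the same picks every time.
    # With seed=None, Python seeds from the system, so picks can vary per run.
    rng = random.Random(seed)
    return rng.sample(samples, k)

# Made-up two-gene profiles for ten samples.
samples = [[float(i), float(i % 3)] for i in range(10)]

first = pick_initial_centroids(samples, 3, seed=42)
second = pick_initial_centroids(samples, 3, seed=42)
unseeded = pick_initial_centroids(samples, 3)

print(first == second)  # True: same seed, same starting points
```

Since the rest of the algorithm is deterministic, identical starting points lead to identical clusters.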

You can also vary K itself (try 3 groups, then 4, then 5) and watch how the structure changes. Sometimes there is a clear "elbow": a value of K beyond which adding another group barely reduces the variation within groups, which suggests the natural number of subtypes. Other times, biology is messier than the algorithm would like, and that messiness is itself informative.
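One common way to look for such an elbow is to compute the within-cluster sum of squares (WCSS) for each K and see where the improvement levels off. The Python sketch below does this on invented one-dimensional values with three obvious groups. To keep the output repeatable it starts the centroids at evenly spaced positions in the sorted data rather than at random, which is an assumption of this example and not how R2 is documented to work.

```python
def kmeans_wcss(values, k, max_iter=100):
    # Assumption: deterministic start at evenly spaced sorted values.
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: group each value with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        updated = [
            sum(m) / len(m) if m else centroids[i]
            for i, m in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    # WCSS: squared distance from every value to its nearest centroid.
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

# Made-up expression values forming three tight groups.
values = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2, 9.0, 9.1, 8.9, 9.2]

wcss = {k: kmeans_wcss(values, k) for k in range(1, 6)}
for k, score in wcss.items():
    print(k, round(score, 3))
```

In this toy case the score falls steeply up to K = 3 and then barely moves, which is the elbow. Real expression data rarely bends so cleanly.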

The heatmap you produce at the end of this process is often the most visually arresting figure in a manuscript. Genes grouped by behaviour, samples grouped by similarity, clinical annotations running across the top like a legend. It tells the whole story at a glance.

This is part of an ongoing series on the R2 Genomics Analysis and Visualization Platform, developed at Amsterdam UMC. All analyses can be freely performed at r2.amc.nl. Full tutorials at r2-tutorials.readthedocs.io.
