"The Map of Your Data" — Seeing Hidden Structure with t-SNE and UMAP
Sometimes you don't have a hypothesis. Sometimes you just have a dataset and a feeling that there's structure in it that you haven't found yet.
Maybe you've collected 150 tumour samples from three different clinical sites. Maybe you're working with a published dataset covering multiple subtypes of a disease. You know, intellectually, that these samples aren't all the same — but where exactly the boundaries are, and whether they reflect biology or clinical annotation or something else entirely, is unclear. You need to see the data as a whole.
This is what dimensionality reduction is for. And R2 makes it surprisingly approachable.
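t-SNE (t-distributed Stochastic Neighbour Embedding) and UMAP (Uniform Manifold Approximation and Projection) are algorithms that take a dataset with thousands of gene expression measurements per sample and compress that information down to two dimensions: a map where samples with similar expression profiles end up close together. Both methods prioritise preserving local neighbourhoods, so tight groupings are informative, while the exact distances between well-separated groups and the apparent sizes of clusters should be read with caution. The result is a scatter plot that reveals the hidden shape of your data.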
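To get a feel for what this compression does, here is a minimal sketch outside R2, assuming Python with NumPy and scikit-learn installed. It is not how R2 computes its maps. The toy expression matrix, the function name make_toy_expression, and the parameter values are all made up for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE


def make_toy_expression(n_per_group=50, n_genes=200, n_groups=3, seed=0):
    """Simulate a samples x genes matrix with shifted 'subtypes' (toy data only)."""
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for group in range(n_groups):
        centre = rng.normal(0.0, 2.0, size=n_genes)  # group-specific mean profile
        noise = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
        blocks.append(centre + noise)
        labels.extend([group] * n_per_group)
    return np.vstack(blocks), np.array(labels)


X, groups = make_toy_expression()

# perplexity is roughly the effective number of neighbours considered per
# sample; it must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print(X.shape)          # (150, 200)
print(embedding.shape)  # (150, 2): one x/y coordinate pair per sample
```

Each row of embedding is one dot on the map. The umap-learn package exposes a similar fit_transform interface through umap.UMAP, should you prefer UMAP for the same experiment.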
In R2, you navigate to the Sample Maps module, select your dataset, and R2 will either load a pre-computed map or generate one for you. What appears on screen is a cloud of dots — one per sample — arranged by genomic similarity. No labels yet. Just structure.
Now the exploration begins. You colour the dots by a clinical track — say, tumour subtype — and watch whether the colours cluster or scatter. If your subtypes are genuinely distinct at the molecular level, you'll see it immediately: tight islands of colour, cleanly separated in space. If they intermingle, that's also information — it means the subtype labels may not reflect the underlying biology as cleanly as assumed.
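The "cluster or scatter" question can also be put into a rough number. The sketch below is an illustration only, not an R2 feature: it assumes you have 2-D map coordinates and one label per sample, and it uses a hypothetical helper, neighbour_label_agreement, that reports how often a sample's nearest neighbours on the map share its label. The coordinates and labels are invented.

```python
import math
import random


def neighbour_label_agreement(coords, labels, k=5):
    """Mean fraction of each point's k nearest map neighbours that share its label."""
    total = 0.0
    for i, p in enumerate(coords):
        # Rank every other point by distance to p and keep the k closest.
        nearest = sorted(
            (j for j in range(len(coords)) if j != i),
            key=lambda j: math.dist(p, coords[j]),
        )[:k]
        total += sum(labels[j] == labels[i] for j in nearest) / k
    return total / len(coords)


rng = random.Random(1)

# Hypothetical map: three well-separated islands of 30 samples each.
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
coords = [
    (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
    for cx, cy in centres
    for _ in range(30)
]
subtype = [g for g in range(3) for _ in range(30)]  # labels follow the islands

shuffled = subtype[:]
rng.shuffle(shuffled)  # labels unrelated to map position

print(neighbour_label_agreement(coords, subtype))   # 1.0: the islands do not overlap
print(neighbour_label_agreement(coords, shuffled))  # near chance, about one in three
```

A score near 1 corresponds to the tight islands of colour; a score near the chance level corresponds to labels that intermingle on the map.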
You can layer in gene expression: colour the dots by the expression level of your favourite gene, and watch which corner of the map lights up. You can draw freehand circles around clusters using the lasso tool and turn those circles into new sample groups for downstream analysis. Or you can let DBSCAN, a density-based clustering algorithm, define the groups for you.
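For the curious, here is a from-scratch sketch of what DBSCAN does with the dots on a map, written in plain Python. It is meant to show the idea, not R2's implementation. The parameter names eps and min_samples follow scikit-learn's convention and are assumptions here, as are the toy coordinates. In practice a library implementation such as sklearn.cluster.DBSCAN would be the usual choice.

```python
import math
import random


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Each neighbourhood includes the point itself, as in the usual definition.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # i is a core point, so it seeds a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also core: keep expanding
    return labels


rng = random.Random(0)
# Hypothetical map coordinates: two dense islands plus one stray sample.
island_a = [(rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)) for _ in range(40)]
island_b = [(rng.uniform(4.5, 5.5), rng.uniform(4.5, 5.5)) for _ in range(40)]
points = island_a + island_b + [(20.0, -20.0)]

labels = dbscan(points, eps=1.5, min_samples=5)
print(sorted(set(labels)))  # [-1, 0, 1]: two clusters and one noise point
```

Unlike a lasso, DBSCAN needs no hand-drawn outline, but its result depends on eps (how close counts as a neighbour) and min_samples (how many neighbours make a region dense), so it is worth trying a few values.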
The result is a dataset that feels legible. Not a table of numbers, but a landscape — with terrain you can explore, landmarks you can label, and regions you can return to.
It's the kind of analysis that makes you want to reschedule your PCRs for the afternoon and just... stay in the data a little longer.
This is part of an ongoing series on the R2 Genomics Analysis and Visualization Platform, developed at Amsterdam UMC. All analyses can be freely performed at r2.amc.nl. Full tutorials at r2-tutorials.readthedocs.io.