Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. By exploring how it behaves in simple cases, we can learn to use it more effectively.
A popular method for exploring high-dimensional data is something called t-SNE, introduced by van der Maaten and Hinton in 2008 [1]. The technique has become widespread in the field of machine learning, since it has an almost magical ability to create compelling two-dimensional “maps” from data with hundreds or even thousands of dimensions.
1. Those Hyperparameters Really Matter
Let’s start with the “hello world” of t-SNE: a data set of two widely separated clusters. To make things as simple as possible, we’ll consider clusters in a 2D plane, as shown in the lefthand diagram. (For clarity, the two clusters are color coded.) The diagrams at right show t-SNE plots for five different perplexity values.
With perplexity values in the range (5–50) suggested by van der Maaten & Hinton, the diagrams do show these clusters, although with very different shapes. Outside that range, things get a little weird. With perplexity 2, local variations dominate. The image for perplexity 100, with merged clusters, illustrates a pitfall: for the algorithm to operate properly, the perplexity really should be smaller than the number of points; implementations can give unexpected behavior otherwise.
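A sweep like this is easy to reproduce. The sketch below assumes scikit-learn is installed (the article doesn't prescribe any particular implementation) and embeds two well-separated 2-D Gaussian clusters at several perplexities; the exact cluster sizes and centers here are illustrative choices, not values from the article.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two well-separated clusters of 50 points each in the plane
cluster_a = rng.normal(loc=(0, 0), scale=1.0, size=(50, 2))
cluster_b = rng.normal(loc=(20, 0), scale=1.0, size=(50, 2))
X = np.vstack([cluster_a, cluster_b])

# Embed at several perplexities; note scikit-learn rejects
# perplexity >= n_samples outright, so 100 would raise an error here.
embeddings = {}
for perplexity in (2, 5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
```

Plotting each `embeddings[p]` colored by cluster reproduces the qualitative behavior above: clean separation in the suggested range, noisy fragments at perplexity 2.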
2. Cluster Sizes in a t-SNE Plot Mean Nothing
So far, so good. But what if the two clusters have different standard deviations, and so different sizes? (By size we mean bounding box measurements, not number of points.) Below are t-SNE plots for a mixture of Gaussians in the plane, where one cluster is 10 times as dispersed as the other.
Surprisingly, the two clusters look about the same size in the t-SNE plots. What’s going on? The t-SNE algorithm adapts its notion of “distance” to regional density variations in the data set. As a result, it naturally expands dense clusters, and contracts sparse ones, evening out cluster sizes. To be clear, this is a different effect than the run-of-the-mill fact that any dimensionality reduction technique will distort distances. (After all, in this example all data was two-dimensional to begin with.) Rather, density equalization happens by design and is a predictable feature of t-SNE.
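The equalization effect can be checked numerically rather than by eye. This sketch, again assuming scikit-learn, compares each cluster's spread (mean distance to its centroid, a measure I'm choosing for illustration) before and after embedding:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# One tight cluster and one 10x more dispersed cluster, equal point counts
tight = rng.normal(loc=(0, 0), scale=1.0, size=(75, 2))
loose = rng.normal(loc=(50, 0), scale=10.0, size=(75, 2))
X = np.vstack([tight, loose])

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

def spread(points):
    # Mean distance from each point to the cluster centroid
    return np.linalg.norm(points - points.mean(axis=0), axis=1).mean()

ratio_before = spread(loose) / spread(tight)       # roughly 10 in the input
ratio_after = spread(emb[75:]) / spread(emb[:75])  # much closer to 1
```

The drop from `ratio_before` to `ratio_after` is the density equalization described above: the embedded clusters end up far more similar in size than the originals.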
3. Distances between Clusters Might Not Mean Anything
What about distances between clusters? The next diagrams show three Gaussians of 50 points each, one pair being 5 times as far apart as another pair.
At perplexity 50, the diagram gives a good sense of the global geometry. For lower perplexity values the clusters look equidistant. When the perplexity is 100, we see the global geometry fine, but one of the clusters appears, falsely, much smaller than the others. Since perplexity 50 gave us a good picture in this example, can we always set perplexity to 50 if we want to see global geometry? Sadly, no: the right perplexity depends on the data set, and in particular on the number of points, so no single value reliably reveals global structure.
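The three-cluster setup above can be sketched as follows, again assuming scikit-learn; the specific centers are illustrative choices that make one inter-cluster gap 5 times the other:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three Gaussians of 50 points each; the gap from the second to the
# third center is 5x the gap from the first to the second.
centers = np.array([(0, 0), (10, 0), (50, 0)])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centers])

# Sweep perplexity; only some values recover the between-cluster distances.
embeddings = {
    p: TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(X)
    for p in (5, 30, 50)
}
```

Comparing centroid-to-centroid distances in each embedding against the 1:5 ratio in the input is one way to quantify how well a given perplexity preserves global geometry.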
4. Random Noise Doesn’t Always Look Random
A classic pitfall is thinking you see patterns in what is really just random data. Recognizing noise when you see it is a critical skill, but it takes time to build up the right intuitions. A tricky thing about t-SNE is that it throws a lot of existing intuition out the window.
The plot with perplexity 2 seems to show dramatic clusters. If you were tuning perplexity to bring out structure in the data, you might think you’d hit the jackpot. Of course, since we know the cloud of points was generated randomly, it has no statistically interesting clusters: those “clumps” aren’t meaningful. If you look back at previous examples, low perplexity values often lead to this kind of distribution. Recognizing these clumps as random noise is an important part of reading t-SNE plots.
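To see this failure mode for yourself, embed data with no structure at all. The sketch below, assuming scikit-learn, uses a single isotropic Gaussian in 100 dimensions (the dimensionality here is an illustrative choice):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Pure noise: one 100-point isotropic Gaussian in 100 dimensions,
# with no cluster structure whatsoever.
X = rng.normal(size=(100, 100))

# At perplexity 2 the embedding tends to show spurious "clumps";
# at perplexity 30 it looks more like the featureless blob it really is.
emb_low = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)
emb_mid = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

Plotting `emb_low` next to `emb_mid` makes the point vividly: the same structureless data can look dramatically "clustered" at low perplexity.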
5. You Can See Some Shapes, Sometimes
It’s rare for data to be distributed in a perfectly symmetric way. Let’s take a look at an axis-aligned Gaussian distribution in 50 dimensions, where the standard deviation in coordinate i is 1/i. That is, we’re looking at a long-ish ellipsoidal cloud of points.
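Constructing this cloud is a one-liner with broadcasting. The sketch assumes scikit-learn for the embedding step; the point count of 200 is an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Axis-aligned 50-D Gaussian with standard deviation 1/i in coordinate i,
# i.e. a long-ish ellipsoidal cloud of points.
stds = 1.0 / np.arange(1, 51)
X = rng.normal(size=(200, 50)) * stds  # broadcasting scales each column

# A reasonably high perplexity tends to preserve the elongated shape.
emb = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X)
```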
For high enough perplexity values, the elongated shapes are easy to read. On the other hand, at low perplexity, local effects and meaningless “clumping” take center stage. More extreme shapes also come through, but again only at the right perplexity. For example, here are two clusters of 75 points each in 2D, arranged in parallel lines with a bit of noise.
6. For Topology, You May Need More than One Plot
Sometimes you can read topological information off a t-SNE plot, but that typically requires views at multiple perplexities. The simplest example is two groups of points in 50 dimensions, one contained within the other.
The perplexity 30 view shows the basic topology correctly, but again t-SNE greatly exaggerates the size of the smaller group of points. At perplexity 50, there’s a new phenomenon: the outer group becomes a circle, as the plot tries to depict the fact that all its points are about the same distance from the inner group. If you looked at this image alone, it would be easy to misread these outer points as a one-dimensional structure.
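A contained-clusters data set like this one can be sketched as follows, assuming scikit-learn; the point counts and scales are illustrative choices:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# A narrow 50-D Gaussian sitting inside a much wider one, both
# centered at the origin: the inner group is contained in the outer.
inner = rng.normal(scale=1.0, size=(75, 50))
outer = rng.normal(scale=50.0, size=(75, 50))
X = np.vstack([inner, outer])

# Compare the two perplexities discussed above.
emb_30 = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
emb_50 = TSNE(n_components=2, perplexity=50, random_state=0).fit_transform(X)
```

Viewing both embeddings side by side, colored by group, is what guards against the misreadings described above: no single perplexity tells the whole topological story.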
Conclusion
There’s a reason that t-SNE has become so popular: it’s incredibly flexible, and can often find structure where other dimensionality-reduction algorithms cannot. Unfortunately, that very flexibility makes it tricky to interpret. Hidden from the user, the algorithm makes all sorts of adjustments that tidy up its visualizations. Don’t let that hidden “magic” scare you away from the whole technique, though. The good news is that by studying how t-SNE behaves in simple cases, it’s possible to develop an intuition for what’s going on.

