Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often havedata where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as good as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")
Two example images from the MNIST dataset.

By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

# (1000, 784)
# 1000 Samples with 784 features
# (1000,)
# 1000 labels
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images

Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, pass it our data and then fit.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
t-SNE example on MNIST subsample.

Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

t-SNE step by step

The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 728 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 728 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

Definition of euclidean distance for two features. Image credit licensed under CC BY 4.0 by Kmhkmh.

The above definition of euclidean distance for two features extends to n features (p1,p2,p2,…,pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,

distance_matrix_sorted = pairwise_distances(X_sorted,

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[1].set_title("Sorted by Label")

When the samples are sorted by label, squareish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each others as “neighbors”. The two are of course related, because nearby samples should have a higher chance to be neighbors than further apart samples. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.

t-distributions for sigma of 1 and 2

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point and as we transform that row with the t-distribution we get our own distribution P. The perplexity is defined 2H(P) where H is the Shannon entropy. Different values for sigma will results in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity = $' + str(perplexity_01),
           r'$\sigma = 10; Perplexity = $' + str(perplexity_10)))
Distributions of joint probabilities for different values of sigma.

Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points to be chosen as neighbors. Therefor, perplexity is commonly interpreted as a measure for the number of samples neigbors. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearns internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
                        desired_perplexity = perplexity,

As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along one of its diagonals and the diagonal is all zeros. So we only keep the upper triangular of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.

Optimize Embedding with Gradient Descent

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as good as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization so that’s what we will use. Once we initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilites of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
             'degrees_of_freedom': 1,
             'n_samples': 1000,

# Perform gradient descent
embedding_done = _t_sne._gradient_descent(_t_sne._kl_divergence,

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
                               'tsne_2': tsne_result[:,1],
                               'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the and the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for are similarly well defined clusters. We could have gone a bit deeper here and there. For example we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.