Interactive data dashboards in Jupyter notebook with ipywidgets and Bokeh

In this post I will go though the code for a simple data dashboard that visualizes the Iris dataset. It features two dropdown menus and three checkboxes. The dropdown menus choose the features on the x and y axes, while the checkboxes make samples visible or invisible based on their species. Below is a screenshot and a video of the dashboard. Before I start with the code I give a brief rational for using ipywidgets and Bokeh.

Combining ipywidgets with Bokeh

A main advantage of ipywidgets is that it is designed specifically for Jupyter notebooks and the IPython kernel. Bokeh on the other hand can build data dashboard for a variety of more complex web deployment contexts. This makes it more powerful and technically it could be used to build the entire dashboard. In a notebook context however, I prefer the simplicity of ipywidgets over the power of Bokeh. I actually started doing the plotting with Matplotlib or seaborn. This worked well in general but I ran into some problems when directing the plot to specific places in the dashboard. I did not encounter any such issues with Bokeh for reasons I do not yet understand. Using Bokeh also gives some nice interactive features in the figure without any extra effort. I think it just lends itself better to these interactive dashboards than Matplotlib. If you don’t want to learn about Bokeh and already know Matplotlib, ipywidgets plus Matplotlib is definitely a good option and most of the ipywidgets principles I show here apply either way.

The dashboard code

Here is the code that generates the dashboard when executed in a Jupyter notebook.

from sklearn import datasets
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show, output_notebook
import ipywidgets as widgets
from IPython.display import display, clear_output
output_notebook()

"""Load Iris dataset and transform the pandas DataFrame"""
iris = datasets.load_iris()
data = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])

"""Define callback function for the UI"""
def var_dropdown(x):
    """This function is executed when a dropdown value is changed.
    It creates a new figure according to the new dropdown values."""
    p = create_figure(
    x_dropdown.children[0].value,
    y_dropdown.children[0].value,
    data)
    fig[0] = p
    
    for species, checkbox in species_checkboxes.items():
        check = checkbox.children[0].value
        fig[0].select_one({'name': species}).visible = check
    
    with output_figure:
        clear_output(True)
        show(fig[0])
    fig[0]=p
    
    return x

def f_species_checkbox(x, q):
    """This function is executed when a checkbox is clicked.
    It directly changes the visibility of the current figure."""
    fig[0].select_one({'name': q}).visible = x
    with output_figure:
        clear_output(True)
        show(fig[0])
    return x

def create_figure(x_var, y_var, data):
    """This is a helper function that creates a new figure and 
    plots values from all three species. x_var and y_var control
    the features on each axis."""
    species_colors=['coral', 'deepskyblue', 'darkblue']
    p = figure(title="",
               x_axis_label=x_var,
               y_axis_label=y_var)
    species_nr = 0
    for species in iris['target_names']:
        curr_dtps = data['target'] == species_nr
        circle = p.circle(
            data[x_var][curr_dtps],
            data[y_var][curr_dtps],
            line_width=2,
            color=species_colors[species_nr],
            name=species
            )
        species_nr += 1
    return p

# The output widget is where we direct our figures
output_figure = widgets.Output()

# Create the default figure
fig = []  # Storing the figure in a singular list is a bit of a 
          # hack. We need it to properly mutate the current
          # figure in our callbacks.
p = create_figure(
    iris['feature_names'][0],
    iris['feature_names'][1],
    data)
fig.append(p)
with output_figure:
    show(fig[0])

# Checkboxes to select visible species.
species_checkboxes = {}
for species in iris['target_names']:
    curr_cb = widgets.interactive(f_species_checkbox,
                                  x=True,
                                  q=widgets.fixed(species))
    curr_cb.children[0].description = species
    species_checkboxes[species] = curr_cb
    
"""Create the widgets in the menu"""
# Dropdown menu for x-axis feature.
x_dropdown = widgets.interactive(var_dropdown,
                                 x=iris['feature_names']);
x_dropdown.children[0].description = 'x-axis'
x_dropdown.children[0].value = iris['feature_names'][0]

# Dropdown menu for y-axis feature.
y_dropdown = widgets.interactive(var_dropdown,
                                 x=iris['feature_names']);
y_dropdown.children[0].description = 'y-axis'
y_dropdown.children[0].value = iris['feature_names'][1]



# This creates the menu 
menu=widgets.VBox([x_dropdown,
                   y_dropdown,
                   *species_checkboxes.values()])

"""Create the full app with menu and output"""
# The Layout adds some styling to our app.
# You can add Layout to any widget.
app_layout = widgets.Layout(display='flex',
                flex_flow='row nowrap',
                align_items='center',
                border='none',
                width='100%',
                margin='5px 5px 5px 5px')

# The final app is just a box
app=widgets.Box([menu, output_figure], layout=app_layout)

# Display the app
display(app)

Loading the Iris data

The Iris dataset contains 150 samples. Each sample belongs to one of three species and four features are measured for each sample: sepal length, sepal width, petal length and petal width, all in cm. The goal of the dashboard is to show a scatterplot of two features at a time and an option to turn visibility for each species on or off. I load the Iris data from the sklearn package but it is a widely used toy dataset and you can get it from other places. We also convert the dataset to a Pandas DataFrame. That just makes the data easier to handle. The callback functions that make the UI interactive are defined next.

Callback functions

Callback functions are executed when something happens in the UI. For now, the functions are just defined and later they are connected to widgets. We define two callback functions, var_dropdown(x) and f_species_checkbox(x, q). create_figure is not a callback function but a helper to create a new figure. var_dropdown(x) is responsible for the two dropdown menus. The dropdowns determine what is displayed on the figure axes. When a user changes the dropdown value, var_dropdown(x) creates a new figure where the features on x- and y-axis are determined by the new dropdown values. The first parameter x is the new value that the user chose. It is not used here, because the same call function will serve two different dropdown menus. So we don’t know which menu x refers to. Instead, we directly access the dropdown values with x_dropdown.children[0].value. This will be defined later. The same goes for the checkboxes we access in the for loop with checkbox.children[0].value. Each species has a checkbox and we will create them later in the code. Finally we display our figure in the with output_figure: context, which directs our figure to the output_figure widget.

The f_species_checkbox(x, q) callback is similar. It additionally features a parameter q, which is a fixed parameter which identifies the checkbox that triggered the callback. We use it to determine, which parts of the figure we need to make visible/invisible with fig[0].select_one({'name': q}).visible = x. Whenever we make changes to the look of the figure, we must redirect it to our output inside with output_figure: to make the changes visible.

Those are our callbacks. The figure creation in create_figure(x_var, y_var, data): is straightforward thanks to Bokeh. figure() creates the figure and then the for species in iris['target_names']: loop creates the points for each species. Next up is the actual widget creation.

Creating widgets

The first widget we create is output_figure = widgets.Output() which will display the figure. We next create a default figure and direct it to output_figure. The fact that we store the figure in a singular list by fig.append(p) is a bit of a hack. This has to do with scope within the callback functions. If we reassign p = figure(), we only change p inside the function. If we change fig[0] = figure() on the other hand, we change the list outside the function because lists are mutable.

Now that we have our figure we create a checkbox for each species in the for species in iris['target_names']: loop and store it in a dictionary so we can access each with the species name. Most of the magic happens in the widgets.interactive(f_species_checkbox, x=True, q=widgets.fixed(species)). It create the widget and links it to the f_species_checkbox callback all in one line. It decides to create a checkbox based on x=True. Booleans create checkboxes but later we will create the dropdown menus also with widgets.interactive. The additional parameter q=widgets.fixed(species) tells us which checkbox called f_species_checkbox.

The following two dropdown widgets are very similar. Nothing special there.

Now that we have our widgets, we need to actually assemble them in our UI. We do that with menu=widgets.VBox([x_dropdown, y_dropdown, *species_checkboxes.values()]). This creates a box where the widgets inside are oriented in a column. The V in VBox means vertical, hence a column. We are almost done. The specifications in widgets.Layout() are not critical but I want to show them here. This is not a widget in itself but we can pass it to a widget to change the style. widgets.Layout() exposed properties you might know from CSS.

To finish up we create the full app with app=widgets.Box([menu, output_figure], layout=app_layout). The app_layout specifies with the flex_flow that the menu and the figure output are oriented as a row, menu on the left and figure on the right. Finally, we display out app in the output under our cell with display(app).

Possible improvements

To improve this code, I think it would be better if the callbacks depended less on global variables. Therefore, a closer look at widgets.interactive might be useful. Alternatively, the global variables that are known to be used inside functions could be made explicit with the global keyword. Finally, in create_figure(), I am creating a counter variable species_nr = 0. This is unnecessary in Python but I did not have time to think through the Pythonic way to do this. I hope this has been useful for you. Let me know what kind of data dashboards you are building.

Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often havedata where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as good as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")
Two example images from the MNIST dataset.

By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images

Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, pass it our data and then fit.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
t-SNE example on MNIST subsample.

Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

t-SNE step by step

The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 728 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 728 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

Definition of euclidean distance for two features. Image credit licensed under CC BY 4.0 by Kmhkmh.

The above definition of euclidean distance for two features extends to n features (p1,p2,p2,…,pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
                                     metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
                                            metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")

When the samples are sorted by label, squareish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each others as “neighbors”. The two are of course related, because nearby samples should have a higher chance to be neighbors than further apart samples. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.

t-distributions for sigma of 1 and 2

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point and as we transform that row with the t-distribution we get our own distribution P. The perplexity is defined 2H(P) where H is the Shannon entropy. Different values for sigma will results in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity = $' + str(perplexity_01),
           r'$\sigma = 10; Perplexity = $' + str(perplexity_10)))
Distributions of joint probabilities for different values of sigma.

Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points to be chosen as neighbors. Therefor, perplexity is commonly interpreted as a measure for the number of samples neigbors. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearns internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
                        desired_perplexity = perplexity,
                        verbose=False)

As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along one of its diagonals and the diagonal is all zeros. So we only keep the upper triangular of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.

Optimize Embedding with Gradient Descent

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as good as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization so that’s what we will use. Once we initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilites of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
                                    n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
             'degrees_of_freedom': 1,
             'n_samples': 1000,
             'n_components':2}

# Perform gradient descent
embedding_done = _t_sne._gradient_descent(_t_sne._kl_divergence,
                                          embedding_init,
                                          0,
                                          1000,
                                          kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
                               'tsne_2': tsne_result[:,1],
                               'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the and the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for are similarly well defined clusters. We could have gone a bit deeper here and there. For example we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.