# Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often havedata where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as good as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

## How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")


By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images


Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, pass it our data and then fit.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

## The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 728 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 728 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

The above definition of euclidean distance for two features extends to n features (p1,p2,p2,…,pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")


When the samples are sorted by label, squareish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

## Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each others as “neighbors”. The two are of course related, because nearby samples should have a higher chance to be neighbors than further apart samples. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point and as we transform that row with the t-distribution we get our own distribution P. The perplexity is defined 2H(P) where H is the Shannon entropy. Different values for sigma will results in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity =$' + str(perplexity_01),
r'$\sigma = 10; Perplexity =$' + str(perplexity_10)))


Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points to be chosen as neighbors. Therefor, perplexity is commonly interpreted as a measure for the number of samples neigbors. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearns internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
desired_perplexity = perplexity,
verbose=False)


As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along one of its diagonals and the diagonal is all zeros. So we only keep the upper triangular of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.

## Optimize Embedding with Gradient Descent

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as good as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization so that’s what we will use. Once we initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilites of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
'degrees_of_freedom': 1,
'n_samples': 1000,
'n_components':2}

embedding_init,
0,
1000,
kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
'tsne_2': tsne_result[:,1],
'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the and the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for are similarly well defined clusters. We could have gone a bit deeper here and there. For example we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.

# Getting Started with Pandas DataFrame

A DataFrame is a spreadsheet like data structure. We can think of it as a collection of rows and columns. This row-column structure is useful for many different kinds of data. The most widely used DataFrame implementation in Python is from the Pandas package. First we will learn how to create DataFrames. We will also learn how to do some basic data analysis with them. Finally, we will compare the DataFrame to the ndarray data structure and learn why data frames are useful in other packages such as Seaborn.

## How to Create a DataFrame

There two major ways to create a DataFrame. We can directly call DataFrame() and pass it data in a dictionary, list or array. Alternatively we can use several functions to load data from a file directly into a DataFrame. While it is very common in data science to load data from file, there are also many occasions where we need to create DataFrame from other data structures. We will first learn how to create a DataFrame from a dictionary.

import pandas as pd
d = {"Frequency": [20, 50, 8],
"Location": [2, 3, 1],
"Cell Type": ["Interneuron", "Interneuron", "Pyramidal"]}
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, index=row_names)
print(df)

"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


In our dictionary the keys are used as the column names. The data under each key then becomes the column. The row names are defined separately by passing a collection to the index parameter of DataFrame. We can get column and row names with the columns and index attributes.

df.columns
# Index(['Freq (Hz)', 'Loc (cm)', 'Cell Type'], dtype='object')
df.index
# Index(['C1', 'C2', 'C3'], dtype='object')


We can also change column and row names through those same attributes.

df.index = ["Cell_1", "Cell_2", "Cell_3"]
df.columns = ["Freq (Hz)", "Loc (cm)", "Cell Type"]
"""
Freq (Hz)  Loc (cm)    Cell Type
Cell_1         20         2  Interneuron
Cell_2         50         3  Interneuron
Cell_3          8         1    Pyramidal
"""


These names are useful because they give us a descriptive way of indexing into columns and rows. If we use indexing syntax on the DataFrame, we can get individual columns.

df['Freq (Hz)']
"""
Cell_1    20
Cell_2    50
Cell_3     8
Name: Freq (Hz), dtype: int64
"""


Row names are not found this way and using a row key will raise an error. However, we can get rows with the df.loc attribute.

df['Cell_1']
# KeyError: 'Cell_1'
df.loc['Cell_1']
"""
Freq (Hz)             20
Loc (cm)               2
Cell Type    Interneuron
Name: Cell_1, dtype: object
"""


We could also create a DataFrame from other kinds of collections that are not dictionaries. For example we can use a list.

d = [[20, 2, "Interneuron"],
[50, 3, "Interneuron"],
[8, 1, "Pyramidal"]]
column_names = ["Frequency", "Location", "Cells"]
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, columns=column_names, index=row_names)
print(df)
"""
Frequency  Location        Cells
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


In that case there are no dictionary keys that could be use to infer the column names. This means we need to pass the column_names to the columns parameter. Mostly anything that structures our data in a two-dimensional way can be used to create a DataFrame. Next we will learn about functions that allow us to load different file types as a DataFrame.

The list of file types Pandas can read and write is rather long and you can find it here. I only want to cover the most commonly used .csv file here. They have the particular advantage that they can also be read by humans, because they are essentially text files. They are also widely supported by a variety of languages and programs. First, let’s create our file. Because it is a text file, we can write a literal string to file.

text_file = open("example.csv", "w")
text_file.write(""",Frequency,Location,Cell Type
C1,20,2,Interneuron
C2,50,3,Interneuron
C3,8,1,Pyramidal""")
text_file.close()


In this file columns are separated by commas and rows are separated by new lines. This is what .csv means, it stands for comma-separated values. To load this file into a DataFrame we need to pass the file name and which column contains the row names. Pandas assumes by default that the first row contains the column names.

df = pd.read_csv("example.csv", index_col=0)
print(df)
"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


There are many more parameters we can specify for read_csv in case we have a file that is structured differently. In fact we can load files that have a value delimiter other than the comma, by specifying the delimiter parameter.

text_file = open("example.csv", "w")
text_file.write("""-Frequency-Location-Cell Type
C1-20-2-Interneuron
C2-50-3-Interneuron
C3-8-1-Pyramidal""")
text_file.close()
print(df)
"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


We specify '-' as the delimiter and and it also works. Although the function is called read_csv it is not strictly bound to comma separated values. We can also skip rows, columns and specify many more options you can learn about from the documentation. For well structured .csv files however, we need very few arguments as shown above. Next we will learn how to do basic calculations with the DataFrame.

## Basic Math with DataFrame

A variety of functions such as df.mean(), df.median() and df.std() are available to do basic statistics on our DataFrame. By default they all return values per column. That is because columns are assumed to contain our variables (or features) and each row contains a sample.

df.mean()
"""
Freq (Hz)    26.0
Loc (cm)      2.0
dtype: float64
"""

df.median()
"""
Freq (Hz)    20.0
Loc (cm)      2.0
dtype: float64
"""

df.std()
"""
Freq (Hz)    21.633308
Loc (cm)      1.000000
dtype: float64
"""


One big advantage of the column is that within a column the data type is clearly defined. Within a row on the other hand different data types can exist. In our case we have two numeric types and a string. When we call these statistical methods, numeric types are ignored. In our case that is 'Cell Type'. Technically we can also use the axis parameter to calculate these statistics for each sample but this is not always useful and has to again ignore one of the columns.

df.mean(axis=1)
"""
C1    11.0
C2    26.5
C3     4.5
dtype: float64
"""


We can also use other mathematical operators. They are applied element-wise and their effect will depend on the data type of the value.

print(df * 3)
"""
Frequency  Location                      Cell Type
C1         60         6  InterneuronInterneuronInterneuron
C2        150         9  InterneuronInterneuronInterneuron
C3         24         3        PyramidalPyramidalPyramidal
"""


Often times these operations make more sense for individual columns. As explained above we can use indexing to get individual columns and we can even assign new results to an existing or new column.

norm_freq = df['Frequency'] / df.mean()['Frequency']
norm_freq
"""
C1    0.769231
C2    1.923077
C3    0.307692
Name: Frequency, dtype: float64
"""
df['Norm Freq'] = norm_freq
print(df)
"""
Frequency  Location    Cell Type  Norm Freq
C1         20         2  Interneuron   0.769231
C2         50         3  Interneuron   1.923077
C3          8         1    Pyramidal   0.307692
"""


If you are familiar with NumPy, most of these DataFrame operations will seem very familiar because they mostly work like array operations. Because Pandas builds on NumPy, most NumPy functions (for example np.sin) work on numeric columns. I don’t want to go deeper and instead move on to visualizing DataFrames with Seaborn.

## Seaborn for Data Visualization

Seaborn is a high-level data visualization package that builds on Matplotlib. It does not necessarily require a DataFrame. It can work with other data structures such as ndarray but it is particularly convenient with DataFrame. First, let us get a more interesting data set. Luckily Seaborn comes with some nice example data sets and they conveniently load into Pandas DataFrame.

import seaborn as sns
type(df)
# pandas.core.frame.DataFrame
print(df)
"""
sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]
"""

print(df.columns)
"""
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
"""


The Iris data set contains information about different species of iris plants. It contains 150 samples and 5 features. The 'species' feature tells us what species a particular sample belongs to. The names of those columns are very useful when we structure our plots in Seaborn. Let’s first try a basic bar graph.

sns.set(context='paper',
style='whitegrid',
palette='colorblind',
font='Arial',
font_scale=2,
color_codes=True)
fig = sns.barplot(x='species', y='sepal_length', data=df)


We use sns.barplot and we have to pass our DataFrame to the data parameter. Then for x and y we define which column name should appear there. We put 'species' on the x-axis so that is how data is aggregated inside the bars. Setosa, versicolor and virginica are the different species. The sns.set() function defines multiple parameters of Seaborn and forces a certain style on the plots that I personally prefer. Bar graphs have grown out of fashion and for good reason. They are not very informative about the distribution of their underlying values. I prefer the violin plot to get a better idea of the distribution.

fig = sns.violinplot(x='species', y='sepal_length', data=df)


We even get a small box plot within the violin plot for free. Seaborn works its magic through the DataFrame column names. This makes plotting more convenient but also makes our code more descriptive than it would be with pure NumPy. Our code literally tells us, that 'species' will be on the x-axis.

## Summary

We learned that we can create a DataFrame from a dictionary or another kind of collection. The most important features are the column and row names. Columns organize features and rows organize samples by convention. We can also load files into a DataFrame. For example we can use read_csv to load .csv or other text based files. We can also use methods like df.mean() to get basic statistics of our DataFrame. Finally, Seaborn is very useful to visualize a DataFrame.

# A Curve Fitting Guide for the Busy Experimentalist

Curve fitting is an extremely useful analysis tool to describe the relationship between variables or discover a trend within noisy data. Here I’ll focus on a pragmatic introduction curve fitting: how to do it in Python, why can it fail and how do we interpret the results? Finally, I will also give a brief glimpse at the larger themes behind curve fitting, such as mathematical optimization, to the extent that I think is useful for the casual curve fitter.

## Curve Fitting Made Easy with SciPy

We start by creating a noisy exponential decay function. The exponential decay function has two parameters: the time constant tau and the initial value at the beginning of the curve init. We’ll evenly sample from this function and add some white noise. We then use curve_fit to fit parameters to the data.

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

# The exponential decay function
def exp_decay(x, tau, init):
return init*np.e**(-x/tau)

# Parameters for the exp_decay function
real_tau = 30
real_init = 250

# Sample exp_decay function and add noise
np.random.seed(100)
dt=0.1
x = np.arange(0,100,dt)
noise=np.random.normal(scale=50, size=x.shape[0])
y = exp_decay(x, real_tau, real_init)
y_noisy = y + noise

# Use scipy.optimize.curve_fit to fit parameters to noisy data
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy)
fit_tau, fit_init = popt

# Sample exp_decay with optimized parameters
y_fit = exp_decay(x, opt_tau, opt_init)

fig, ax = plt.subplots(1)
ax.scatter(x, y_noisy,
alpha=0.8,
color= "#1b9e77",
label="Exponential Decay + Noise")
ax.plot(x, y,
color="#d95f02",
label="Exponential Decay")
ax.plot(x, y_fit,
color="#7570b3",
label="Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.set_title("Curve Fit Exponential Decay")


Our fit parameters are almost identical to the actual parameters. We get 30.60 for fit_tau and 245.03 for fit_init both very close to the real values of 30 and 250. All we had to do was call scipy.optimize.curve_fit and pass it the function we want to fit, the x data and the y data. The function we are passing should have a certain structure. The first argument must be the input data. All other arguments are the parameters to be fit. From the call signature of def exp_decay(x, tau, init) we can see that x is the input data while tau and init are the parameters to be optimized such that the difference between the function output and y_noisy is minimal. Technically this can work for any number of parameters and any kind of function. It also works when the sampling is much more sparse. Below is a fit on 20 randomly chosen data points.

Of course the accuracy will decrease with the sampling. So why would this every fail? The most common failure mode in my opinion is bad initial parameters.

## Choosing Good Initial Parameters

The initial parameters of a function are the starting parameters before being optimized. The initial parameters are very important because most optimization methods don’t just look for the best fit randomly. That would take too long. Instead, it starts with the initial parameters, changes them slightly and checks if the fit improves. When changing the parameters shows very little improvement, the fit is considered done. That makes it very easy for the method to stop with bad parameters if it stops in a local minimum or a saddle point. Let’s look at an example of a bad fit. We will change our tau to a negative number, which will result in exponential growth.

In this case fitting didn’t work. For a real_tau and real_init of -30 and 20 we get a fit_tau and fit_init of 885223976.9 and 106.4, both way off. So what happened? Although we never specified the initial parameters (p0), curve_fit chooses default parameters of 1 for both fit_tau and fit_init. Starting from 1, curve_fit never finds good parameters. So what happens if we choose better parameters? Looking at our exp_decay definition and the exponential growth in our noisy data, we know for sure that our tau has to be negative. Let’s see what happens when we choose a negative initial value of -5.

p0 = [-5, 1]
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy, p0=p0)
fit_tau, fit_init = popt
y_fit = exp_decay(x, fit_tau, fit_init)
fig, ax = plt.subplots(1)
ax.scatter(x, y_noisy,
alpha=0.8,
color= "#1b9e77",
label="Exponential Decay + Noise")
ax.plot(x, y,
color="#d95f02",
label="Exponential Decay")
ax.plot(x, y_fit,
color="#7570b3",
label="Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.set_title("Curve Fit Exponential Growth Good Initials")


With an initial parameter of -5 for tau we get good parameters of -30.4 for tau and 20.6 for init (real values were -30 and 20). The key point is that initial conditions are extremely important because they can change the result we get. This is an extreme case, where the fit works almost perfectly for some initial parameters or completely fails for others. In more subtle cases different initial conditions might result in slightly better or worse fits that could still be relevant to our research question. But what does it mean for a fit to be better or worse? In our example we can always compare it to the actual function. In more realistic settings we can only compare our fit to the noisy data.

## Interpreting Fitting Results

In most research setting we don’t know our exact parameters. If we did, we would not need to do fitting at all. So to compare the goodness of different parameters we need to compare our fit to the data. How do we calculate the error between our data and the prediction of the fit? There are many different measures but among the most simple ones is the sum of squared residuals (SSR).

def ssr(y, fy):
"""Sum of squared residuals"""
return ((y - fy) ** 2).sum()


We take the difference between our data (y) and the output of our function given a parameter set (fy). We square that difference and sum it up. In fact this is what curve_fit optimizes. Its whole purpose is to find the parameters that give the smallest value of this function, the least square. The parameters that give the smallest SSR are considered the best fit. We saw that this process can fail, depending on the function and the initial parameters, but let’s assume for a moment it worked. If we found the smallest SSR, does that mean we found the perfect fit? Unfortunately not. What we found was a good estimate for the best fitting parameters given our function. There are probably other functions out there that can fit our data better. We can use the SSR to find better fitting functions in a process called cross-validation. Instead of comparing different parameters of the same function we compare different functions. However, if we increase the number of parameters we run into a problem called overfitting. I will not get into the details of overfitting here because it is beyond our scope.

The main point is that we must stay clear of misinterpretations of best fit. We are always fitting the parameters and not the function. If our fitting works, we get a good estimate for the best fitting parameters. But sometimes our fitting doesn’t work. This is because our fitting method did not converge to the minimum SSR and in the final chapter we will find out why that might happen in our example.

## The Error Landscape of Exponential Decay

To understand why fitting can fail depending on the initial conditions we should consider the landscape of our sum of squared residuals (SSR). We will calculate it by assuming that we already know the init parameter, so we keep it constant. Then we calculate the SSR for many values of tau smaller than zero and many values for tau larger than zero. Plotting the SSR against the guessed tau will hopefully show us how the SSR looks around the ideal fit.

real_tau = -30.0
real_init = 20.0

noise=np.random.normal(scale=50, size=x.shape[0])
y = exp_decay(x, real_tau, real_init)
y_noisy = y + noise
dtau = 0.1
guess_tau_n = np.arange(-60, -4.9, dtau)
guess_tau_p = np.arange(1, 60, dtau)

# The SSR function
def ssr(y, fy):
"""Sum of squared residuals"""
return ((y - fy) ** 2).sum()

loss_arr_n = [ssr(y_noisy, exp_decay(x, tau, real_init))
for tau in guess_tau_n]
loss_arr_p = [ssr(y_noisy, exp_decay(x, tau, real_init))
for tau in guess_tau_p]

"""Plotting"""
fig, ax = plt.subplots(1,2)
ax[0].scatter(guess_tau_n, loss_arr_n)
real_tau_loss = ssr(y_noisy, exp_decay(x, real_tau, real_init))
ax[0].scatter(real_tau, real_tau_loss, s=100)
ax[0].scatter(guess_tau_n[-1], loss_arr_n[-1], s=100)
ax[0].set_yscale("log")
ax[0].set_xlabel("Guessed Tau")
ax[0].set_ylabel("SSR Standard Log Scale")
ax[0].legend(("All Points", "Real Minimum", "-5 Initial Guess"))

ax[1].scatter(guess_tau_p, loss_arr_p)
ax[1].scatter(guess_tau_p[0], loss_arr_p[0], s=100)
ax[1].set_xlabel("Guessed Tau")
ax[1].set_ylabel("SSR")
ax[1].legend(("All Points", "1 Initial Guess"))


On the left we see the SSR landscape for tau smaller than 0. Here we see that towards zero, the error becomes extremely large (note the logarithmic y scale). This is because towards zero the exponential growth becomes ever faster. As we move to more negative values we find a minimum near -30 (orange), our real tau. This is the parameter curve_fit would find if it only optimized tau and started initially at -5 (green). The optimization method does not move to more negative values from -30 because there the SSR becomes worse, it increases.

On the right side we get a picture of why optimization failed when we started at 1. There is no local minimum. The SSR just keeps decreasing with larger values of tau. That is why the tau was so larger when fitting failed (885223976.9). If we set our initial parameter anywhere in this part of the SSR landscape, this is where tau will go. Now there are other optimization methods that can overcome bad initial parameters. But few are completely immune to this issue.

## Easy to Learn Hard to Master.

Curve fitting is a very useful technique and it is really easy in Python with Scipy but there are some pitfalls. First of all, be aware of the initial values. They can lead to complete fitting failure or affect results in more subtle systematic ways. We should also remind ourselves that even with decent fitting results, there might be a more suitable function out there that can fit our data even better. In this particular example we always knew what the underlying function was. This is rarely the case in real research settings. Most of the time it is much more productive to think more deeply about possible underlying functions than finding more complicated fitting methods.

Finally, we barely scratched the surface here. Mathematical optimization is an entire field in itself and it is relevant to many areas such as statistics, machine learning, deep learning and many more. I tried to give the most pragmatic introduction to the topic here. If want to go deeper into the topic I recommend this Scipy lecture and of course the official Scipy documentation for optimization and root finding.