# A Curve Fitting Guide for the Busy Experimentalist

Curve fitting is an extremely useful analysis tool for describing the relationship between variables or discovering a trend within noisy data. Here I'll focus on a pragmatic introduction to curve fitting: how to do it in Python, why it can fail, and how to interpret the results. Finally, I will give a brief glimpse of the larger themes behind curve fitting, such as mathematical optimization, to the extent that I think is useful for the casual curve fitter.

## Curve Fitting Made Easy with SciPy

We start by creating a noisy exponential decay function. The exponential decay function has two parameters: the time constant `tau` and the initial value at the beginning of the curve `init`. We’ll evenly sample from this function and add some white noise. We then use `curve_fit` to fit parameters to the data.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

# The exponential decay function
def exp_decay(x, tau, init):
    return init * np.exp(-x / tau)

# Parameters for the exp_decay function
real_tau = 30
real_init = 250

# Sample the exp_decay function and add noise
np.random.seed(100)
dt = 0.1
x = np.arange(0, 100, dt)
noise = np.random.normal(scale=50, size=x.shape)
y = exp_decay(x, real_tau, real_init)
y_noisy = y + noise

# Use scipy.optimize.curve_fit to fit parameters to noisy data
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy)
fit_tau, fit_init = popt

# Sample exp_decay with the fitted parameters
y_fit = exp_decay(x, fit_tau, fit_init)

fig, ax = plt.subplots(1)
ax.scatter(x, y_noisy,
           alpha=0.8,
           color="#1b9e77",
           label="Exponential Decay + Noise")
ax.plot(x, y,
        color="#d95f02",
        label="Exponential Decay")
ax.plot(x, y_fit,
        color="#7570b3",
        label="Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.set_title("Curve Fit Exponential Decay")
```

Our fit parameters are almost identical to the actual parameters. We get `30.60` for `fit_tau` and `245.03` for `fit_init`, both very close to the real values of `30` and `250`. All we had to do was call `scipy.optimize.curve_fit` and pass it the function we want to fit, the `x` data and the `y` data. The function we pass must have a certain structure: the first argument is the input data, and all remaining arguments are the parameters to be fit. From the call signature of `def exp_decay(x, tau, init)` we can see that `x` is the input data while `tau` and `init` are the parameters to be optimized, such that the difference between the function output and `y_noisy` is minimal. Technically this works for any number of parameters and any kind of function. It also works when the sampling is much more sparse. Below is a fit on 20 randomly chosen data points.
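The sparse fit just mentioned can be reproduced with a short sketch. The choice of seed, the way the 20 points are drawn, and the generous `maxfev` evaluation budget are illustrative assumptions on my part, not taken from the original:

```python
import numpy as np
import scipy.optimize

def exp_decay(x, tau, init):
    return init * np.exp(-x / tau)

# Same underlying curve as above, fresh noise draw
np.random.seed(100)
x = np.arange(0, 100, 0.1)
y_noisy = exp_decay(x, 30, 250) + np.random.normal(scale=50, size=x.shape)

# Keep only 20 randomly chosen data points
idx = np.sort(np.random.choice(x.size, size=20, replace=False))
x_sparse, y_sparse = x[idx], y_noisy[idx]

# Fit on the sparse sample; maxfev raises the evaluation budget
popt, pcov = scipy.optimize.curve_fit(exp_decay, x_sparse, y_sparse,
                                      maxfev=5000)
fit_tau, fit_init = popt
```

With so few points the recovered `tau` and `init` scatter more around the true values, which is the accuracy loss discussed next.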

Of course the accuracy will decrease with sparser sampling. So why would this ever fail? The most common failure mode, in my opinion, is bad initial parameters.

## Choosing Good Initial Parameters

The initial parameters of a function are the starting parameters before optimization. They are very important because most optimization methods don't just look for the best fit randomly. That would take too long. Instead, the method starts with the initial parameters, changes them slightly and checks whether the fit improves. When changing the parameters yields very little improvement, the fit is considered done. That makes it very easy for the method to stop with bad parameters if it lands in a local minimum or a saddle point. Let's look at an example of a bad fit. We will change our tau to a negative number, which turns the decay into exponential growth.
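A sketch of that failing fit, reusing the ingredients from the first block. The seed and the raised `maxfev` budget are my own choices, made so the optimizer returns a (bad) answer rather than erroring out:

```python
import numpy as np
import scipy.optimize

def exp_decay(x, tau, init):
    return init * np.exp(-x / tau)

# Negative tau turns the decay into exponential growth
real_tau, real_init = -30, 20

np.random.seed(100)
x = np.arange(0, 100, 0.1)
y_noisy = exp_decay(x, real_tau, real_init) + np.random.normal(scale=50,
                                                               size=x.shape)

# No p0 given, so curve_fit starts from the default [1, 1]
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy, maxfev=10000)
fit_tau, fit_init = popt
# With this setup fit_tau ends up a huge positive number, nowhere near -30
```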

In this case fitting didn't work. For a `real_tau` and `real_init` of `-30` and `20` we get a `fit_tau` and `fit_init` of `885223976.9` and `106.4`, both way off. So what happened? Although we never specified the initial parameters (`p0`), `curve_fit` uses a default of `1` for both `tau` and `init`. Starting from `1`, `curve_fit` never finds good parameters. So what happens if we choose better initial parameters? Looking at our `exp_decay` definition and the exponential growth in our noisy data, we know for sure that our tau has to be negative. Let's see what happens when we choose a negative initial value of `-5`.

```python
p0 = [-5, 1]
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy, p0=p0)
fit_tau, fit_init = popt
y_fit = exp_decay(x, fit_tau, fit_init)

fig, ax = plt.subplots(1)
ax.scatter(x, y_noisy,
           alpha=0.8,
           color="#1b9e77",
           label="Exponential Decay + Noise")
ax.plot(x, y,
        color="#d95f02",
        label="Exponential Decay")
ax.plot(x, y_fit,
        color="#7570b3",
        label="Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.set_title("Curve Fit Exponential Growth Good Initials")
```

With an initial value of `-5` for tau we get good fit parameters of `-30.4` for `tau` and `20.6` for `init` (the real values were `-30` and `20`). The key point is that initial parameters are extremely important because they can change the result we get. This is an extreme case, where the fit works almost perfectly for some initial parameters and fails completely for others. In more subtle cases, different initial parameters might result in slightly better or worse fits that could still be relevant to our research question. But what does it mean for a fit to be better or worse? In our example we can always compare the fit to the actual function. In more realistic settings we can only compare it to the noisy data.

## Interpreting Fitting Results

In most research settings we don't know the exact parameters. If we did, we would not need to fit at all. So to compare the goodness of different parameter sets we need to compare our fit to the data. How do we calculate the error between our data and the prediction of the fit? There are many different measures, but among the simplest is the sum of squared residuals (SSR).

```python
def ssr(y, fy):
    """Sum of squared residuals"""
    return ((y - fy) ** 2).sum()
```

We take the difference between our data (`y`) and the output of our function for a given parameter set (`fy`). We square that difference and sum it up. In fact, this is exactly what `curve_fit` minimizes. Its whole purpose is to find the parameters that give the smallest value of this function, the least squares. The parameters that give the smallest SSR are considered the best fit. We saw that this process can fail, depending on the function and the initial parameters, but let's assume for a moment it worked. If we found the smallest SSR, does that mean we found the perfect fit? Unfortunately not. What we found was a good estimate for the best fitting parameters given our function. There may well be other functions out there that fit our data better. We can use the SSR to compare candidate functions in a process called cross-validation: instead of comparing different parameters of the same function, we compare different functions. However, if we increase the number of parameters we run into a problem called overfitting. I will not get into the details of overfitting here because it is beyond our scope.
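As a minimal sketch of comparing candidate functions by their SSR, here is the exponential model against a straight line on the same noisy decay data. The linear alternative is my illustrative choice, not from the original:

```python
import numpy as np
import scipy.optimize

def exp_decay(x, tau, init):
    return init * np.exp(-x / tau)

def linear(x, a, b):
    return a * x + b

def ssr(y, fy):
    """Sum of squared residuals"""
    return ((y - fy) ** 2).sum()

np.random.seed(100)
x = np.arange(0, 100, 0.1)
y_noisy = exp_decay(x, 30, 250) + np.random.normal(scale=50, size=x.shape)

# Fit both candidate functions to the same data
popt_exp, _ = scipy.optimize.curve_fit(exp_decay, x, y_noisy)
popt_lin, _ = scipy.optimize.curve_fit(linear, x, y_noisy)

ssr_exp = ssr(y_noisy, exp_decay(x, *popt_exp))
ssr_lin = ssr(y_noisy, linear(x, *popt_lin))
# The exponential model reaches a smaller SSR on exponential data
```

The smaller SSR of the exponential model tells us it describes this data better than a line, which is exactly the kind of comparison cross-validation builds on.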

The main point is that we must steer clear of misinterpretations of "best fit". We are always fitting the parameters, not the function. If our fitting works, we get a good estimate for the best fitting parameters. But sometimes our fitting doesn't work, because the fitting method did not converge to the minimum SSR, and in the final chapter we will find out why that can happen in our example.

## The Error Landscape of Exponential Decay

To understand why fitting can fail depending on the initial conditions we should consider the landscape of our sum of squared residuals (SSR). We will calculate it by assuming that we already know the `init` parameter, so we keep it constant. Then we calculate the SSR for many values of `tau` smaller than zero and many values for tau larger than zero. Plotting the SSR against the guessed tau will hopefully show us how the SSR looks around the ideal fit.

```python
real_tau = -30.0
real_init = 20.0

noise = np.random.normal(scale=50, size=x.shape)
y = exp_decay(x, real_tau, real_init)
y_noisy = y + noise
dtau = 0.1
guess_tau_n = np.arange(-60, -4.9, dtau)
guess_tau_p = np.arange(1, 60, dtau)

# The SSR function
def ssr(y, fy):
    """Sum of squared residuals"""
    return ((y - fy) ** 2).sum()

loss_arr_n = [ssr(y_noisy, exp_decay(x, tau, real_init))
              for tau in guess_tau_n]
loss_arr_p = [ssr(y_noisy, exp_decay(x, tau, real_init))
              for tau in guess_tau_p]

# Plotting
fig, ax = plt.subplots(1, 2)
ax[0].scatter(guess_tau_n, loss_arr_n)
real_tau_loss = ssr(y_noisy, exp_decay(x, real_tau, real_init))
ax[0].scatter(real_tau, real_tau_loss, s=100)
ax[0].scatter(guess_tau_n[-1], loss_arr_n[-1], s=100)
ax[0].set_yscale("log")
ax[0].set_xlabel("Guessed Tau")
ax[0].set_ylabel("SSR (Log Scale)")
ax[0].legend(("All Points", "Real Minimum", "-5 Initial Guess"))

ax[1].scatter(guess_tau_p, loss_arr_p)
ax[1].scatter(guess_tau_p[0], loss_arr_p[0], s=100)
ax[1].set_xlabel("Guessed Tau")
ax[1].set_ylabel("SSR")
ax[1].legend(("All Points", "1 Initial Guess"))
```

On the left we see the SSR landscape for `tau` smaller than 0. Towards zero the error becomes extremely large (note the logarithmic `y` scale), because towards zero the exponential growth becomes ever faster. As we move to more negative values we find a minimum near `-30` (orange), our real `tau`. This is the parameter `curve_fit` would find if it only optimized tau and started at `-5` (green). The optimization method does not move to more negative values than `-30` because there the SSR increases again.

On the right side we get a picture of why optimization failed when we started at `1`. There is no local minimum. The SSR just keeps decreasing with larger values of `tau`. That is why tau was so large when fitting failed (`885223976.9`). If we set our initial parameter anywhere in this part of the SSR landscape, this is where tau will go. Now there are other optimization methods that can overcome bad initial parameters, but few are completely immune to this issue.
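One practical safeguard, when the sign of a parameter is known, is to constrain the search with `curve_fit`'s `bounds` argument. This is a sketch under the same exponential growth setup; the specific bound values are illustrative:

```python
import numpy as np
import scipy.optimize

def exp_decay(x, tau, init):
    return init * np.exp(-x / tau)

np.random.seed(100)
x = np.arange(0, 100, 0.1)
y_noisy = exp_decay(x, -30, 20) + np.random.normal(scale=50, size=x.shape)

# Constrain tau to strictly negative values; bounds are given as
# (lower_bounds, upper_bounds), one entry per parameter
popt, _ = scipy.optimize.curve_fit(
    exp_decay, x, y_noisy,
    p0=[-5, 1],
    bounds=([-np.inf, -np.inf], [-1e-6, np.inf]))
fit_tau, fit_init = popt
```

With bounds, `curve_fit` switches to a trust-region method that never leaves the feasible region, so the optimizer cannot wander off into positive `tau` territory.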

## Easy to Learn, Hard to Master

Curve fitting is a very useful technique and it is really easy in Python with SciPy, but there are some pitfalls. First of all, be aware of the initial parameters. They can lead to complete fitting failure or affect results in more subtle, systematic ways. We should also remind ourselves that even with decent fitting results, there might be a more suitable function out there that fits our data even better. In this particular example we always knew what the underlying function was. This is rarely the case in real research settings. Most of the time it is much more productive to think more deeply about possible underlying functions than to search for more complicated fitting methods.

Finally, we barely scratched the surface here. Mathematical optimization is an entire field in itself and it is relevant to many areas such as statistics, machine learning and deep learning. I tried to give the most pragmatic introduction to the topic here. If you want to go deeper, I recommend this SciPy lecture and of course the official SciPy documentation for optimization and root finding.

# Managing Files and Directories with Python

We cannot always avoid the details of file management, especially when analyzing raw data. It can come in the form of multiple files distributed over multiple directories. Accessing those files and the directory system can be a critical aspect of raw data processing. Luckily, Python has very convenient methods to handle files and directories in the `os` module. Here we will look at functions to inspect the contents of directories (`os.listdir`, `os.walk`) and functions that help us manipulate files and directories (`os.rename`, `os.remove`, `os.mkdir`).

## Directory Contents

One of the most important functions to manage files is `os.listdir(directory)`. It returns a list of strings containing the names of all directories and files in the given path.

```python
import os

directory = r"C:\Users\Daniel\tutorial"

dir_content = os.listdir(directory)
# ['images', 'file_1.txt', 'file_2.txt']
```

Now that we know what is in our directory, how do we find out which of these are files and which are subdirectories? You might be tempted to just check for a period (.) inside the string because it separates a file name from its extension. This method is error prone, because directories can contain periods and files don't necessarily have extensions. It is much better to use `os.path.isfile()` and `os.path.isdir()`.

```python
files = [x for x in dir_content if os.path.isfile(directory + os.sep + x)]
# ['file_1.txt', 'file_2.txt']

dirs = [x for x in dir_content if os.path.isdir(directory + os.sep + x)]
# ['images']
```

Now we know the contents of the directory: two files and one sub-directory. In case you were wondering about `os.sep`, that is the directory separator of the operating system. On my Windows 10 machine that is the backslash (`'\\'`). What if we need both the files that are in our directory and those that are in sub-directories? This is the perfect case for `os.walk()`. It gives a convenient way to loop through a directory and all its sub-directories.

```python
for root, dirs, files in os.walk(directory):
    print(root)
    print(dirs)
    print(files)

# C:\Users\Daniel\tutorial
# ['images']
# ['file_1.txt', 'file_2.txt']
# C:\Users\Daniel\tutorial\images
# []
# ['plot_1.png', 'plot_2.png']
```

By default `os.walk()` goes top-down. The first name, `root`, tells us the full path of the directory we are currently in. Printing `root` shows that we start at the top `directory`. While `root` is a string, both `dirs` and `files` are lists. They tell us, for the current `root`, which files and directories are there. For the first directory we already found out on our own what the contents are: two text files and a sub-directory. The loop next goes into the sub-directory `images`. There are no further sub-directories in there, but two image files. If there were more sub-directories at any level (`directory` or `directory\images`), `os.walk` would go through all of them. Next we will find out how to create, move, rename and delete both files and directories.
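The walk above generalizes to collecting the full path of every file in a tree. The sketch below builds a small throwaway directory tree with `tempfile` (a stand-in for the tutorial directory) and joins `root` and file names portably with `os.path.join`:

```python
import os
import tempfile

# Build a throwaway tree mirroring the tutorial directory
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, "images"))
for name in ("file_1.txt", "file_2.txt"):
    open(os.path.join(base, name), "w").close()
for name in ("plot_1.png", "plot_2.png"):
    open(os.path.join(base, "images", name), "w").close()

# Collect the full path of every file, at any depth
all_files = []
for root, dirs, files in os.walk(base):
    for f in files:
        all_files.append(os.path.join(root, f))
```

`os.path.join` picks the right separator for the operating system, so the same code works on Windows and Unix alike.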

## Manipulating Files and Directories

Let’s say I want to rename the .txt files. I don’t like the numbering and would prefer them to have a leading zero in case they go into the double digits. We can use `os.rename` for this job.

```python
directory = r"C:\Users\Daniel\tutorial"

dir_content = os.listdir(directory)

txt_f = [x for x in dir_content
         if os.path.isfile(directory + os.sep + x) and ".txt" in x]
# ['file_1.txt', 'file_2.txt']

for f_name in txt_f:
    base, num_ext = f_name.split("_")
    num = num_ext.split(".")[0]
    new_name = base + "_" + num.zfill(2) + ".txt"
    os.rename(directory + os.sep + f_name,
              directory + os.sep + new_name)

os.listdir(directory)
# ['file_01.txt', 'file_02.txt', 'images']
```

Now that we renamed our files, let’s create another sub-directory for these .txt files. To create new directories we use `os.mkdir()`.

```python
os.listdir(directory)
# ['file_01.txt', 'file_02.txt', 'images']

os.mkdir(directory + os.sep + 'texts')

os.listdir(directory)
# ['file_01.txt', 'file_02.txt', 'images', 'texts']
```

Now we need to move the .txt files. There is no dedicated move function in the `os` module. We use `os.rename` instead: rather than changing the file's name, we change its directory path.

```python
directory = r"C:\Users\Daniel\tutorial"
dir_content = os.listdir(directory)
txt_f = [x for x in dir_content
         if os.path.isfile(directory + os.sep + x) and ".txt" in x]

for f in txt_f:
    old_path = directory + os.sep + f
    new_path = directory + os.sep + "texts" + os.sep + f
    os.rename(old_path, new_path)

os.listdir(directory)
# ['images', 'texts']

os.listdir(directory + os.sep + 'texts')
# ['file_01.txt', 'file_02.txt']
```

Now our .txt files are in the `texts` sub-directory. Unfortunately there is no copy function in `os`. Instead we have to use another module called `shutil`. You can use a signature like this.

```python
from shutil import copyfile
copyfile(source, destination)
```

Finally, to remove a file we simply use `os.remove()`.

```python
os.listdir(directory + os.sep + "texts")
# ['file_01.txt', 'file_02.txt']

os.remove(directory + os.sep + 'texts' + os.sep + "file_01.txt")

os.listdir(directory + os.sep + "texts")
# ['file_02.txt']
```

And that's it. You might have noticed that we did not cover how to create or read files. The `os` module is technically able to create and read files, but in data science we usually depend on higher level interfaces for that. For example, we might want to open a .csv file with pandas' `pd.read_csv`. Using the lower level `os` functions for this will rarely be necessary. Thank you for reading and let me know if you have any questions.

In case you want to learn more about the `os` module, here are the `os` module docs.