How to organize your research data during analysis

As researchers we often have to manage the entire data lifecycle from generation to the final report. Raw data are major parts of any analysis pipeline and it is important to properly store them. Many guidelines on data management focus on raw data but also stop there. However, raw data without analysis are pretty much worthless and there are major data management decisions to be made during analysis. Which intermediate results to save? What format? Where? Cloud storage? Do I build a database? Every decision will have effects in the actual analysis code and can determine how easy errors can be found and fixed. Here I want to give a detailed overview of the different options we have to organize our data during analysis. Python is my main programming language and this will be reflected here in that I prefer open source tools and I am not experienced with some of the commercial solutions that might exist. However, I will also cover the advantages/disadvantages of human-readable data formats and tools such as Microsoft Excel. We get started with raw data.

Raw data

Raw data are like a good hypothesis. Just like a hypothesis is perfect until you ruin it with data, raw data are perfect until you ruin them with a thorough analysis. Raw data are truth. This is how it happened. Therefore, the raw data is never changed and we always keep multiple copies of it. Even after we published the results of our analysis, it is a good idea to keep the raw data for at least 10 years. Depending on your funding and where you publish you might be legally and/or ethically required to keep raw data for a defined time period.

Anything we do with our raw data during the analysis we do in read-only mode. The only thing we might want to change is the file name. However, we should check whether there is a way to automate the file naming in the program that generates them. Automating file names is preferred and avoids human error. If we receive data from collaborators we should also make sure that we are not deleting or changing data that they placed in the file name or the directory hierarchy.

From an analysis point of view, we might not need any information in the file name at all. We could randomly generate it. However, we almost always want to store important metadata in the file name because it has two major advantages: 1. The file name is human-readable. You need neither programming knowledge nor knowledge about the file format to read it. 2. The file name is readable by the operating system. This means, our programs can read it even if they don’t know the file format. For example, you can use file name information to decide whether or not to load a file during analysis and save loading time.

So what do we put in the file name? The first rule is that it is better to save too much information as opposed to too little. But most file systems have limits on the file name size (I hit it many times in Windows 10). When you choose your file name format you can ask yourself: what would I want a human to know about the files before they even open them and what would be good to have easily accessible during analysis? You will read in a lot of guides that you must save the date and time. That is useful because it makes it easier to relate a file to your lab book. Otherwise I only recommend you to give each file a unique identifier. This can be a number or a string of characters. I prefer numbers because they can have the added bonus of giving the recording order at a glance. Other information in the file names is up to you.

Once you have decided which other information to put in your file names, choose one delimiter to separate items and another one for readability. I use underscore to separate items and minus for readability. That allows me to do things like: “001_YYYY-MM-DD_version-2.fmt”. Avoid empty spaces in file names. Whatever decision you make, you must remain consistent.

When it comes to the directories, I recommend having all raw files in the same directory. I used to save some directory hierarchy structure but nowadays I would put all of that into the file name. If you have a lot of files, some directory structure can help humans navigate it but during analysis I almost always prefer one directory for simplicity. The file name is certainly a good way to save metadata but there are other ways to save them. In the next chapter we will discuss metadata more generally.

Metadata

At the raw data stage, the definition of metadata is clear. It is any data that describes the context of the raw data but does not fit any of the raw data formats. Images are a good example. The raw data of an image are the pixel values that make up the digital image. Metadata of the image can be anything that relates to the context where the image was taken. The time it was taken, the place it was taken, the exposure time, the model of the lens used, the temperature that day, the person who took the image, the width, the height and much, much more. As the analysis progresses, metadata may become data. Let’s say your analysis pipeline counts the number of birds in each image and you want to find out whether there are more early birds than late birds. In that case the time of day the picture was taken becomes an independent variable.

So what kind of metadata should you save? That is up to you, your research question and your field but it is usually better to save too much metadata than too little. Metadata you did not save is almost impossible to reconstruct later, so think carefully when you decide that something is not worth saving.

How to store metadata? If you are lucky, all metadata you need might already be inside your raw data files. Most of the time however, you will have to add some metadata manually. In that case I prefer to create a single .csv file where each row describes a single raw data file. One of the advantages of .csv files is that they can be read by humans and script with a variety of programs.

The same metadata.csv file opened in four different programs. OpenOffice Calc (top left), Notepad (top right), Spyder IDE (lower left) and Notepad++ (lower right).

The id uniquely identifies the row and thereby the raw data file it refers to. A drawback of a .csv file is that it only works well for a single spreadsheet. If you require multiple related sheets – for example because you have multiple raw data sources – you might require a more complex format such as JSON or even build a database. More later on the reasons for and against building a database. Next, we will assume that we found a way to process our raw data and need to decide how to store it for analysis.

Analysis and storage of transformed raw data

Analysis of raw data most often means to iterate through all raw data files and do some computation on each. In our fictional example, we were taking images and then extracting birds from those images. Once we have extracted the birds, our unit of analysis changes. This has consequences for the metadata, which describes the raw data files. We need to relate the metadata from the raw data files to the birds and there are two major options to do this. First, we generate a spreadsheet where each row is a bird. In the image_id column we save the unique id of the image the bird was extracted from.

The extracted birds are related to a raw image by the fact that a specific bird was extracted from a single specific image. The tables in the top left and top right contain all information but during analysis we would need to open two files and relate the information. The table at the bottom has the advantage that both tables are merged into one. This can offer convenience but it increases the storage size of the file.

This is a relational data structure. The image_id in our birds table identifies a unique id in our metadata which allows us to figure out where, when and with what exposure a given bird was photographed. This is a great way to structure data. The only downside is that we have to load two files and then merge them during the analysis. Personally I prefer to create a single file where both tables are already merged. The downside of this data structure is that is is larger in storage size (because the metadata columns are repeated multiple times). Whether or not this is an issue for you depends on the total size of your data, the amount of rows extracted per image and your available disk space. If you decide on the relational data structure you might also want to consider building a database, because they are particularly well suited for relational storage. Next, we will discuss some advantages and disadvantages of databases.

You probably don’t want to build a database. Unless you do.

A database is any structured collection of information. This means that a collection of spreadsheets as we saw above could technically be considered a database. However, a database usually implies more rigorous rules of storage and access than a spreadsheet can guarantee. These rules give databases some of their major advantages. For example, multiple people and programs can interface with a database without messing things up. If two people open the same spreadsheet in excel and work on it they are almost guaranteed to end up with two conflicting versions. There are many other advantages of databases. When you decide for or against a database you should anticipate whether you will be able to make use of those advantages. Here is my list.

You should consider building a database if:

  • You will have to integrate multiple raw data sources.
    • Multiple raw data sources usually mean more relationships. Multiple relationships become harder to manage with spreadsheet but are the perfect use-case for a relational database.
  • You expect to store a massive amount of data.
    • A database usually stores data more efficiently, which can save you storage space.
    • A database can also make access more efficient, because you don’t load the entire database but request smaller packages of information at a time. This avoids running into RAM issues.
  • You will be adding data continuously to the project over a long period of time.
    • Having a fixed logic for adding information to the database is a massive advantage over long time periods and can avoid a lot of errors.
    • During very long projects you might want to train other people to interface with the database or even hand the project over to someone else. Having the relational rules that avoid errors is very helpful, as opposed to a spreadsheet where anyone can do whatever.
  • You want to make you data available through a web interface.
    • A web server interfacing with a spreadsheet would be very inefficient.
  • Multiple people need read/write access simultaneously.
    • As mentioned above, multi-user access works much better for databases.
  • You have security concerns.
    • This can be a security concern regarding the physical integrity of the data or malicious/unauthorized access. Databases are not perfect but they provide better safeguards than a spreadsheet

Those advantages sound amazing. So why would you ever not build a database? Well, many research projects are structured in a way that you cannot make use of any of these advantages. If a single person generates all their data in a week, processes the raw data the other week and prepares a final report in the third week, a database is overkill. The rules that database architectures follow provide major advantages but they also make them less flexible. Building and maintaining a database comes with some extra work that you only want to take on when you can make use of some of the advantages. Generally speaking, the larger a project is (regarding everything from number of people involved over data sources to time frame), the more worthwhile a database becomes.

Before we move on to data analysis, I want to briefly mention cloud storage. Cloud storage has some advantages but most of them can also be achieved in other ways. In my opinion, the most important reason to consider cloud storage is if you regularly need access to the data from a large variety of different locations that are well connected to the internet. However, if you have too much money, a commercial cloud storage can also provide services for you that might be annoying to maintain yourself. On the other hand, as a researcher you may also not be allowed (because of funding or institutional policies) to use certain cloud services. Other details of cloud storage are beyond this guide.

Visualization and statistical analysis

So far we extracted samples (birds) from our images and quantified features of those birds (species and size). Some additional features are given by the image metadata. There might be additional intermediate processing steps whose results you want to save. That’s up to you. For example, maybe you had to apply some filters to the images when you extracted the birds. Do you want to save these filtered images? If the filtering takes very long and you want to visually inspect the result, you probably want to save them. Otherwise, having the script that performs the filtering is sufficient.

The plots you draw are something you definitely want to save. You also want to save the results of statistical analysis. At this stage it is important to have a clear relationship between analysis scripts and their outputs. I hate to look through multiple scripts, trying to find the code that generated a specific figure I want to change. You could just try to put everything into one script. I have done this before but depending on the size of the project this becomes unwieldy very quickly and you will find yourselves scrolling through thousands of lines of code. There is no perfect recipe here so I think a practical example is in order.

A practical example for a project structure

Here I will give some examples, how the bird projects could look as a directory hierarchy.

The directory structure of an example project.

In the project folder itself are only Python scripts and a README.txt, where we can give information that anyone looking at our project should be aware of. Having everything else in specific folders makes it easier to locate a Python script. There is one more advantage: if you use git to version control or share your projects, you can exclude data and figures from tracking. Data is often too large to push to remote or we don’t want to make unpublished data public. Therefore we can simply add the data folders we want to keep local to a .gitignore file. However, managing your project with git is not a must and for smaller projects it does not always pay off.

Now for the scripts themselves. Each script should have a very well defined input and output. extract_birds.py for example loops through all raw files in ./raw and creates the file ./data/birds.csv. The other two scripts both use that same birds.csv file but they answer different questions and have different outputs. analyze_location.py creates the figure 02_figure_location.png and the statistical output location_anova.txt. On the other hand analyze_species_size.py creates 01_figure_species_size.png and species_ttest.txt.

In this example the same script visualizes and performs the statistical test. You could split this up further and have a script for visualization and another for stats. I prefer to combine them because both tasks require loading of the same data. On another note, I prefer to keep things like reports, papers and presentation in a separate folder. In my experience they clutter the project space.

Planning a project in detail is important but there are also diminishing returns. At some point the only way forward is to actually start the project. I hope you learned some useful tricks to manage your own data more effectively.

Differential Equations with SciPy – odeint or solve_ivp

SciPy features two different interfaces to solve differential equations: odeint and solve_ivp. The newer one is solve_ivp and it is recommended but odeint is still widespread, probably because of its simplicity. Here I will go through the difference between both with a focus on moving to the more modern solve_ivp interface. The primary advantage is that solve_ivp offers several methods for solving differential equations whereas odeint is restricted to one. We get started by setting up our system of differential equations and some parameters of the simulation.

UPDATE 07.02.2021: Note that I am focusing here on usage of the different interfaces and not on benchmarking accuracy or performance. Faruk Krecinic has contacted me and noted that assessing accuracy would require a system with a known solution as a benchmark. This is beyond the scope of this blog post. He also pointed out to me that the hmax parameter of odeint is important for the extent to which both interfaces give similar results. You can learn more about the parameters of odeint in the docs.

import numpy as np
from scipy.integrate import odeint, solve_ivp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def lorenz(t, state, sigma, beta, rho):
    x, y, z = state
    
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    
    return [dx, dy, dz]

sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0

p = (sigma, beta, rho)  # Parameters of the system

y0 = [1.0, 1.0, 1.0]  # Initial state of the system

We will be using the Lorenz system. We can directly move on the solving the system with both odeint and solve_ivp.

t_span = (0.0, 40.0)
t = np.arange(0.0, 40.0, 0.01)

result_odeint = odeint(lorenz, y0, t, p, tfirst=True)
result_solve_ivp = solve_ivp(lorenz, t_span, y0, args=p)

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(result_odeint[:, 0],
        result_odeint[:, 1],
        result_odeint[:, 2])
ax.set_title("odeint")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(result_solve_ivp.y[0, :],
        result_solve_ivp.y[1, :],
        result_solve_ivp.y[2, :])
ax.set_title("solve_ivp")
Simulation results from odeint and solve_ivp. Note that the solve_ivp looks very different primarily because of the default temporal resolution that is applied. Changing the the temporal resolution and getting very similar results to odeint is easy and shown below.

The first thing that sticks out is that the solve_ivp solution is less smooth. That is because it is calculated at fewer time points, which in turn has to do with the difference between t_span and t. The odeint interface expects t, an array of time points for which we want to calculate the solution. The temporal resolution of the system is given by the interval between time points. The solve_ivp interface on the other hand expects t_span, a tuple that gives the start and end of the simulation interval. solve_ivp determines the temporal resolution by itself, depending on the integration method and the desired accuracy of the solution. We can confirm that the temporal resolution of solve_ivp is lower in this example by inspecting the output of both functions.

t.shape
# (4000,)

result_odeint.shape
# (4000, 3)

result_solve_ivp.t.shape
# (467,)

The t array has 4000 elements and therefore the result of odeint has 4000 rows, each row being a time point defined by t. The result of solve_ivp is different. It has its own time array as an attribute and it has 1989 elements. This tells us that solve_ivp indeed calculated fewer time points affection temporal resolution. So how can we increase the the number of time points in solve_ivp? There are three ways: 1. We can manually define the time points to integrate, similar to odeint. 2. We can decrease the error of the solution we are willing to tolerate. 3. We can change to a more accurate integration method. We will first change the integration method. The default integration method of solve_ivp is RK45 and we will compare it to the default method of odeint, which is LSODA.

solve_ivp_rk45 = solve_ivp(lorenz, t_span, y0, args=p,
                            method='RK45')
solve_ivp_lsoda = solve_ivp(lorenz, t_span, y0, args=p,
                           method='LSODA')

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(solve_ivp_rk45.y[0, :],
        solve_ivp_rk45.y[1, :],
        solve_ivp_rk45.y[2, :])
ax.set_title("RK45")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(solve_ivp_lsoda.y[0, :],
        solve_ivp_lsoda.y[1, :],
        solve_ivp_lsoda.y[2, :])
ax.set_title("LSODA")
Comparison between RK45 and LSODA integration methods of solve_ivp.

The LSODA method is already more accurate but we can make it even more accurate but it is still not as accurate as the solution we got from odeint. That is because we made odeint solve at even higher temporal resolution when we passed it t. To get the exact same result from solve_ivp we got from odeint, we must pass it the exact time points we want to solve with the t_eval parameter.

t = np.arange(0.0, 40.0, 0.01)
result_odeint = odeint(lorenz, y0, t, p, tfirst=True)
result_solve_ivp = solve_ivp(lorenz, t_span, y0, args=p,
                             method='LSODA', t_eval=t)

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(result_odeint[:, 0],
        result_odeint[:, 1],
        result_odeint[:, 2])
ax.set_title("odeint")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(result_solve_ivp.y[0, :],
        result_solve_ivp.y[1, :],
        result_solve_ivp.y[2, :])
ax.set_title("solve_ivp LSODA")
odeint and solve_ivp with identical integration method and temporal resolution.

Now both solutions have identical temporal resolution. But their solution is still not identical. I was unable to confirm why that is the case but I suspect very small floating point errors. The Lorenz attractor is a chaotic system and even small errors can make it diverge. The following plot shows the first variable of the system for odeint and solve_ivp from the above simulation. It confirms my suspicion that floating point accuracy is to blame but was unable to confirm the source.

Solutions from odeint and solve_ivp diverge even for identical temporal resolution and integration method.

There is one more way to make the system more smooth: decrease the tolerated error. We will not go into that here because it does not relate to the difference between odeint and solve_ivp. It can be used in both interfaces. There are some more subtle differences. For example, by default, odeint expects the first parameter of the problem to be the state and the second parameter to be time t. This is the other way around for solve_ivp. To make odeint accept a problem definition where time is the first parameter we set the parameter tfirst=True.

In summary, solve_ivp offers several integration methods while odeint only uses LSODA.

Differential Equations in Python with SciPy

Differential equations are special because they don’t tell us the value of a variable straight up. Instead, they tell us by how much the variable will change with respect to the change of another variable. Usually that other variable is time. To numerically solve a system of differential equations we need to track the systems change over time starting at an initial state. This process is called numerical integration and there is a SciPy function for it called odeint. We will learn how to use this package by simulating the ‘hello world’ of differential equations: the Lorenz system.

Here is the first part of the code where we define the function that describes the dynamics of the system.

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from mpl_toolkits.mplot3d import Axes3D

def lorenz(state, t, sigma, beta, rho):
    x, y, z = state
    
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    
    return [dx, dy, dz]

We start with some imports. Of course we need NumPy and odeint is imported from scipy.integrat. Matplotlib will be used to plot the result of our simulation. After that we define the system of differential equations that defines our Lorenz system. It consists of three differential equations that we fit into one function called lorenz. This function needs a specific call signature (lorenz(state, t, sigma, beta, rho)) because we will later pass it to odeint which expects specific parameters in specific places. Most importantly, the first parameter must be the state of the system.The state of the Lorenz system is defined by three variables: x, y, z. Our state object has to be a sequence with an order that reflects this.

Inside the lorenz function, the first thing we do is to unpack the state into the three state variables. This is followed by the three differential equations that described the dynamic changes of the state variables. The fact that the variable t does not show up in any of these equations is a common point of confusion. The amount of change certainly depends on the amount of time. So why can we ignore t here? The answer is that our numerical integrator will keep track of t for us. For this particular system we could actually build a function that does not take the parameter t but I include it because it can be useful if you want to add discontinuities that depend on t.

While t does not appear in the equations, sigma, beta & rho do. They are the parameters of the system and the system’s properties depend on them. We will set those parameters next.

sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0

p = (sigma, beta, rho)

These are the parameters Lorenz himself used and they are known to produce the type of dynamic that the Lorenz system is most known for: the Lorenz attractor. It is important that we store these parameters in a tuple in this exact order because of our functions structure. It must be a tuple rather than another type of collection because odeint expects it. Now that our parameters are defined, we will move on to define the initial values of the system. This is a critical part of solving differential equations. These equations tell us by how much the system state changes but they cannot tell us where to start.

y0 = [1.0, 1.0, 1.0]

Our system will start with all variables at 1.0. Now we can solve the system and plot the result.

t = np.arange(0.0, 40.0, 0.01)

result = odeint(lorenz, y0, t, p)

fig = plt.figure()
ax = fig.gca(projection="3d")
ax.plot(result[:, 0], result[:, 1], result[:, 2])
The Lorenz attractor.

We solve the system with a simple call to to odeint and we pass it the function that defines out system, the initial state, the time points t for which we want to solve the system and the parameters p. result is a two-dimensional array where the rows are the time points and the columns are the state variables at that those time points. And this is how we can solve differential equations with SciPy.

A Curve Fitting Guide for the Busy Experimentalist

Curve fitting is an extremely useful analysis tool to describe the relationship between variables or discover a trend within noisy data. Here I’ll focus on a pragmatic introduction curve fitting: how to do it in Python, why can it fail and how do we interpret the results? Finally, I will also give a brief glimpse at the larger themes behind curve fitting, such as mathematical optimization, to the extent that I think is useful for the casual curve fitter.

Curve Fitting Made Easy with SciPy

We start by creating a noisy exponential decay function. The exponential decay function has two parameters: the time constant tau and the initial value at the beginning of the curve init. We’ll evenly sample from this function and add some white noise. We then use curve_fit to fit parameters to the data.

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

# The exponential decay function
def exp_decay(x, tau, init):
    return init*np.e**(-x/tau)

# Parameters for the exp_decay function
real_tau = 30
real_init = 250

# Sample exp_decay function and add noise
np.random.seed(100)
dt=0.1
x = np.arange(0,100,dt)
noise=np.random.normal(scale=50, size=x.shape[0])
y = exp_decay(x, real_tau, real_init)
y_noisy = y + noise

# Use scipy.optimize.curve_fit to fit parameters to noisy data
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy)
fit_tau, fit_init = popt

# Sample exp_decay with optimized parameters
y_fit = exp_decay(x, opt_tau, opt_init)

fig, ax = plt.subplots(1)
ax.scatter(x, y_noisy,
           alpha=0.8,
           color= "#1b9e77",
           label="Exponential Decay + Noise")
ax.plot(x, y,
        color="#d95f02",
        label="Exponential Decay")
ax.plot(x, y_fit,
        color="#7570b3",
        label="Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.set_title("Curve Fit Exponential Decay")

Our fit parameters are almost identical to the actual parameters. We get 30.60 for fit_tau and 245.03 for fit_init both very close to the real values of 30 and 250. All we had to do was call scipy.optimize.curve_fit and pass it the function we want to fit, the x data and the y data. The function we are passing should have a certain structure. The first argument must be the input data. All other arguments are the parameters to be fit. From the call signature of def exp_decay(x, tau, init) we can see that x is the input data while tau and init are the parameters to be optimized such that the difference between the function output and y_noisy is minimal. Technically this can work for any number of parameters and any kind of function. It also works when the sampling is much more sparse. Below is a fit on 20 randomly chosen data points.

Of course the accuracy will decrease with the sampling. So why would this every fail? The most common failure mode in my opinion is bad initial parameters.

Choosing Good Initial Parameters

The initial parameters of a function are the starting parameters before being optimized. The initial parameters are very important because most optimization methods don’t just look for the best fit randomly. That would take too long. Instead, it starts with the initial parameters, changes them slightly and checks if the fit improves. When changing the parameters shows very little improvement, the fit is considered done. That makes it very easy for the method to stop with bad parameters if it stops in a local minimum or a saddle point. Let’s look at an example of a bad fit. We will change our tau to a negative number, which will result in exponential growth.

In this case fitting didn’t work. For a real_tau and real_init of -30 and 20 we get a fit_tau and fit_init of 885223976.9 and 106.4, both way off. So what happened? Although we never specified the initial parameters (p0), curve_fit chooses default parameters of 1 for both fit_tau and fit_init. Starting from 1, curve_fit never finds good parameters. So what happens if we choose better parameters? Looking at our exp_decay definition and the exponential growth in our noisy data, we know for sure that our tau has to be negative. Let’s see what happens when we choose a negative initial value of -5.

p0 = [-5, 1]
popt, pcov = scipy.optimize.curve_fit(exp_decay, x, y_noisy, p0=p0)
fit_tau, fit_init = popt
y_fit = exp_decay(x, fit_tau, fit_init)
fig, ax = plt.subplots(1)
ax.scatter(x, y_noisy,
           alpha=0.8,
           color= "#1b9e77",
           label="Exponential Decay + Noise")
ax.plot(x, y,
        color="#d95f02",
        label="Exponential Decay")
ax.plot(x, y_fit,
        color="#7570b3",
        label="Fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.set_title("Curve Fit Exponential Growth Good Initials")

With an initial parameter of -5 for tau we get good parameters of -30.4 for tau and 20.6 for init (real values were -30 and 20). The key point is that initial conditions are extremely important because they can change the result we get. This is an extreme case, where the fit works almost perfectly for some initial parameters or completely fails for others. In more subtle cases different initial conditions might result in slightly better or worse fits that could still be relevant to our research question. But what does it mean for a fit to be better or worse? In our example we can always compare it to the actual function. In more realistic settings we can only compare our fit to the noisy data.

Interpreting Fitting Results

In most research setting we don’t know our exact parameters. If we did, we would not need to do fitting at all. So to compare the goodness of different parameters we need to compare our fit to the data. How do we calculate the error between our data and the prediction of the fit? There are many different measures but among the most simple ones is the sum of squared residuals (SSR).

def ssr(y, fy):
    """Sum of squared residuals"""
    return ((y - fy) ** 2).sum()

We take the difference between our data (y) and the output of our function given a parameter set (fy). We square that difference and sum it up. In fact this is what curve_fit optimizes. Its whole purpose is to find the parameters that give the smallest value of this function, the least square. The parameters that give the smallest SSR are considered the best fit. We saw that this process can fail, depending on the function and the initial parameters, but let’s assume for a moment it worked. If we found the smallest SSR, does that mean we found the perfect fit? Unfortunately not. What we found was a good estimate for the best fitting parameters given our function. There are probably other functions out there that can fit our data better. We can use the SSR to find better fitting functions in a process called cross-validation. Instead of comparing different parameters of the same function we compare different functions. However, if we increase the number of parameters we run into a problem called overfitting. I will not get into the details of overfitting here because it is beyond our scope.

The main point is that we must stay clear of misinterpretations of best fit. We are always fitting the parameters and not the function. If our fitting works, we get a good estimate for the best fitting parameters. But sometimes our fitting doesn’t work. This is because our fitting method did not converge to the minimum SSR and in the final chapter we will find out why that might happen in our example.

The Error Landscape of Exponential Decay

To understand why fitting can fail depending on the initial conditions we should consider the landscape of our sum of squared residuals (SSR). We will calculate it by assuming that we already know the init parameter, so we keep it constant. Then we calculate the SSR for many values of tau smaller than zero and many values for tau larger than zero. Plotting the SSR against the guessed tau will hopefully show us how the SSR looks around the ideal fit.

real_tau = -30.0
real_init = 20.0

noise=np.random.normal(scale=50, size=x.shape[0])
y = exp_decay(x, real_tau, real_init)
y_noisy = y + noise
dtau = 0.1
guess_tau_n = np.arange(-60, -4.9, dtau)
guess_tau_p = np.arange(1, 60, dtau)

# The SSR function
def ssr(y, fy):
    """Sum of squared residuals"""
    return ((y - fy) ** 2).sum()

loss_arr_n = [ssr(y_noisy, exp_decay(x, tau, real_init)) 
              for tau in guess_tau_n]
loss_arr_p = [ssr(y_noisy, exp_decay(x, tau, real_init))
              for tau in guess_tau_p]

"""Plotting"""
fig, ax = plt.subplots(1,2)
ax[0].scatter(guess_tau_n, loss_arr_n)
real_tau_loss = ssr(y_noisy, exp_decay(x, real_tau, real_init))
ax[0].scatter(real_tau, real_tau_loss, s=100)
ax[0].scatter(guess_tau_n[-1], loss_arr_n[-1], s=100)
ax[0].set_yscale("log")
ax[0].set_xlabel("Guessed Tau")
ax[0].set_ylabel("SSR Standard Log Scale")
ax[0].legend(("All Points", "Real Minimum", "-5 Initial Guess"))

ax[1].scatter(guess_tau_p, loss_arr_p)
ax[1].scatter(guess_tau_p[0], loss_arr_p[0], s=100)
ax[1].set_xlabel("Guessed Tau")
ax[1].set_ylabel("SSR")
ax[1].legend(("All Points", "1 Initial Guess"))

On the left we see the SSR landscape for tau smaller than 0. Here we see that towards zero, the error becomes extremely large (note the logarithmic y scale). This is because towards zero the exponential growth becomes ever faster. As we move to more negative values we find a minimum near -30 (orange), our real tau. This is the parameter curve_fit would find if it only optimized tau and started initially at -5 (green). The optimization method does not move to more negative values from -30 because there the SSR becomes worse, it increases.

On the right side we get a picture of why optimization failed when we started at 1. There is no local minimum. The SSR just keeps decreasing with larger values of tau. That is why the tau was so larger when fitting failed (885223976.9). If we set our initial parameter anywhere in this part of the SSR landscape, this is where tau will go. Now there are other optimization methods that can overcome bad initial parameters. But few are completely immune to this issue.

Easy to Learn Hard to Master.

Curve fitting is a very useful technique and it is really easy in Python with Scipy but there are some pitfalls. First of all, be aware of the initial values. They can lead to complete fitting failure or affect results in more subtle systematic ways. We should also remind ourselves that even with decent fitting results, there might be a more suitable function out there that can fit our data even better. In this particular example we always knew what the underlying function was. This is rarely the case in real research settings. Most of the time it is much more productive to think more deeply about possible underlying functions than finding more complicated fitting methods.

Finally, we barely scratched the surface here. Mathematical optimization is an entire field in itself and it is relevant to many areas such as statistics, machine learning, deep learning and many more. I tried to give the most pragmatic introduction to the topic here. If want to go deeper into the topic I recommend this Scipy lecture and of course the official Scipy documentation for optimization and root finding.

Filtering Data with SciPy

Time series data may contain signals at many different frequencies. Sharp increases or decreases have a high frequency. Slow increases or decreases have a low frequency. Filtering allows us to take different frequency components out of the data.

Signal filtering is a science on its own and I’ll focus on the practical aspects here and stick to two filter types: butterworth and Chebyshev type I. Each of those filters can be used for different purposes. We can use them as low pass, high pass, band pass or notch filters. Low pass filters leave low frequencies alone but attack high frequencies. High pass filters leave high frequencies alone but attach low frequencies. The title image shows an example of low and high pass filters used on the same data. Band pass filters leave a specific frequency band alone and attack all other frequencies. Notch filters attack a specific frequency band, leaving the rest alone. Let’s look at an example.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import butter, cheby1, filtfilt

data = np.load("example_data.npy")

order = 3
Wn = 4000  # in Hz
btype = 'lowpass'
fs = 50000  # in Hz

b, a = butter(order, Wn, btype, fs = fs)
data_butter = filtfilt(b, a, data)

This is a butterworth lowpass filter with a cutoff frequency of 4000Hz (Wn). That means, signals below 4000Hz are is the pass band. They are largely left alone. Signals above 4000Hz are in the stop band, they are diminished. fs is the sampling frequency of the data. If the units are Hz, it tells us how many data points are recorded during one second. filtfilt is the function that does the actual filtering on the data, based on the filter (b, a) that was designed previously. Filtering is not a perfect process. Filters have what is called roll-off at the critical 4000Hz frequency.

Ideally, we would like a filter response that falls down straight. Anything in the pass band is untouched, anything in the stop band is shutdown the same way. As you can see, our actual filter does no live up to the ideal. It already slightly attenuates signal that is part of the pass band and it falls much slower in the stop band. If we need a steeper roll off, we can increase the order of our filter.

Some filter types have steeper roll off than others. For example, the Chebyshev type I filter achieves steeper roll off by tolerating some ripple in the pass band.

This can lead to distortions in the data depending on the size of the ripple. The Chebyshev type I filter takes an argument rp that defines the amount of pass band ripple that is tolerated in units of dB. The more ripple we tolerate, the steeper the roll off will be. Here you can see how large ripple causes oscillations in the data.

Generally, the butterworth filter is sufficient for most situations and is safer because it does not ripple. I hope this post helped you filtering your own data. If you want to learn more, check out the SciPy signal docs. Both the butter and cheby1 filter are there with many, many more.

Smoothing Data by Rolling Average with NumPy

Time series data often comes with some amount of noise. One of the easiest ways to get rid of noise is to smooth the data with a simple uniform kernel, also called a rolling average. The title image shows data and their smoothed version. The data is the second discrete derivative from the recording of a neuronal action potential. Derivatives are notoriously noisy. We can get the result shown in the title image with np.convolve

import numpy as np

data = np.load("example_data.npy")
kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved = np.convolve(data, kernel, mode='same')

Convolution is a mathematical operation that combines two arrays. One of those arrays is our data and we convolve it with the kernel array. During convolution we center the kernel at a data point. We multiple each data point in the kernel with each corresponding data point, sum up all the results and that is the new data point at the center. Let’s look at an example.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel)
# array([ 2,  7, 13, 23, 24, 18, 19,  4])

For this result to make sense you must know, that np.convolve flips the kernel around. So step by step the calculations go as follows:

[4, 3, 2, 1]  # The flipped kernel
          x
         [2, 3, 1, 4, 1]  # The data
          2= 2
         [2]

   [4, 3, 2, 1]
          x  x
         [2, 3, 1, 4, 1]
          4+ 3= 7
         [2, 7]

      [4, 3, 2, 1]
          x  x  x
         [2, 3, 1, 4, 1]
          6+ 6+ 1=13
         [2, 7, 13]

         [4, 3, 2, 1]
          x  x  x  x
         [2, 3, 1, 4, 1]
          8+ 9+ 2+ 4= 23
         [2, 7,13,23]
# This continues until the arrays stop touching. You get the idea.

One thing you’ll notice is that the edges are problematic. There is really no good way to avoid that. Data points at the edges only see part of the kernel but the mode parameter defines what should happen at the edges. I prefer the 'same' mode because it means that the new array will have the same shape as the original data, which makes plotting easier. However, if you start to use more complicated kernels, the edges might become virtually useless. In that case, mode should be 'valid'. Then, the values at the edges that did not see the entire kernel are discarded. The output array is smaller in shape than the input array.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel, mode='valid')
array([23, 24])

The default behavior you saw above is called 'full'. It keeps all data points, so the output array is larger in shape than the input array. You might also have noticed that the size of the kernel is very important. Actually, we need to divide the array of ones by its length. Can you guess what would happen if we forgot about dividing it?

If you guessed that the signal would become larger in magnitude you guessed right. We would be summing up all data points in the kernel. By dividing it we ensure that we take the average of the data points. But the kernel size is even more important. If we make the kernel larger the outcome changes dramatically.

kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved_10 = np.convolve(data, kernel, mode='same')

kernel_size = 20
kernel = np.ones(kernel_size) / kernel_size
data_convolved_20 = np.convolve(data, kernel, mode='same')

plt.plot(data_convolved_20)
plt.plot(data_convolved_10)
plt.legend(("Kernel Size 10", "Kernel Size 20"))

The larger we make the kernel, the smaller sharp peaks become. The peaks are also shifted in time. To be specific, a rolling mean is a low-pass filter. This means that is leaves low frequency signals alone, while making high frequency signals smaller. Sharp increases in the data have a high frequency. If we make the kernel larger, the filter attenuates high frequency signals more. This is exactly how the rolling average works. It gets rid of high frequency noise. It also means that we must be careful not to distort the signal too much with the rolling average filter.

Fantastic Programming Languages and Where to Find Them

There are many programming languages out there and committing to one of them can be intimidating. The good news is that most programming languages that are relevant today are solid. You can’t go wrong with any of them and they are all worth your time. At the same time, the programming language you pick will strongly influence your work as a PhD student or postdoc and the opportunities you have afterwards. Here I’ll guide you through the things to consider when choosing your first or second programming language.

Which language are others using?
The conformists way

Most programming languages are amazing and the technical differences between them are small and only relevant under special circumstances. More important than the technical details are the community, research field and laboratory you want to join. If you already know which lab you want to join, find out which language is used there. If the lab uses multiple languages try to find out what kind of task you will have, find the people with similar tasks and find out which language they are using. If you don’t know your lab yet but you know the field, try to find out which language dominates that field. You can do so by reading papers, job advertisements or by directly writing to other PhD students and postdocs.

This conformist approach is not satisfying for everyone. I get it. I actually brought a language to my lab that nobody else there was using. More on that later. Most people will benefit greatly from this conformist route. Let me first tell you the many advantages, before I explain why you might benefit from choosing another language. First, you maximize the people that can help while you are learning and while you are engaging with the technical details of your tasks. Second, you will find many solutions ready to use. You will be able to grab scripts and functions from your colleagues and you will be able to move on from programming details to solving your actual task much faster. Third, your colleagues are successfully contributing to the field so it is likely that other people in the same field are using the same programming language. That means your programming experience will help you find a postdoc job if you want to stay in that field.

So why would you want to miss out on those advantages? In short: you don’t. You probably don’t know better than your future colleagues (yet). You don’t want to reinvent the wheel. But there are some other things to consider and sometimes it can pay off to deviate from lab culture. If you do so, this will affect your work. In a nutshell, you will be less productive in the short-term but more productive in the long-term, if you choose your programming language well.

Which languages are used outside academia

Many of us are not looking to stay in academic research. Even if you are committed to academia, this point is worth considering. Things and people can change. It is considered good practice to have a plan B. While many programming languages are used both inside and outside of academia, some labs use programming languages that are nearly worthless in the non-academic job market. Sometimes very specific research requires a niche programming language. Other times a lab was simply unable or unwilling to transition to a more common language.

I recommend making your plan B as concrete as possible. Maybe it doesn’t involve programming at all. Then you should fall back to the conformists way. Otherwise, check your plan B job market for programming languages that are required or advantageous. I will go through some programming languages later and give my opinion on their usage inside and outside academia. However, I cannot give a definitive answer and these job markets evolve rapidly. If you learned programming during your PhD you will be in a great position to pick up another language. You will probably have to learn more than one language anyway. Companies have so called ‘stacks’. A stack is a collection of software (including some programming languages) and people will be hired for “full-stack” or subsets of that stack. A non-academic stack will likely involve at least passing familiarity with more than one programming language. Either way, keep an eye out for labs that use niche programming languages. It might be worthwhile to defy lab culture and choose a more common language.

Performance is less important than you think

Beginners consistently overestimate the importance of performance or speed. I’ve been there. When I started out I though fast computations would be the deciding factor for or against any programming language. It isn’t, because human time is more valuable than computer time. By orders of magnitude. In research, the bottleneck is rarely computational time, it’s almost always human time. Performance only becomes relevant with very computationally intensive projects.

Imagine you are writing a script that will take one minute to run in the end. A 10x decrease in performance (now it takes 10 minutes) is very tolerable, if you get some perks for it. It is now easier to debug, easier to build on and more other people can use it and give you credit for it. If you are writing a simulation project that takes 10 days, a 10x decrease (now it takes 100 days) does not look so attractive anymore. In the real world performance of both scripts and programs is slightly more complicated but the point stands. If you are not sure whether you are in the 1 minute or 10 days category, you should try to figure it our before deciding. Just ask your colleagues and advisers.

With that, I want to move on to some fantastic programming languages. We will look at their strengths and their weaknesses. Always keep in mind the conformists way. Only choose your own language when there are clear advantages. I will briefly discuss which languages are worthwhile to use even if your lab is not on board and which ones are to be avoided even if your lab is working with them. As a disclaimer: I have hands on experience with Python, R and Matlab. For the other languages I either have second hand experience (people in my surrounding work with them) or I did some research about them.

Python and R

Python and R share a chapter because they are both excellent and should be your first choice if there are no other languages established in the lab. If the lab uses either or both, even better! Both are completely free and open source. Python is my personal favorite but I am slightly biased after years of working with it almost daily.

Both Python and R are also heavily used outside of academia. R is consistently ranked as the top required language for data scientists. Python ranks second. A downside of R is that it is specifically designed for statistics and data analysis. This can be an advantage, because as scientists this is the biggest part of our job when we program. However, Python is a complete all-rounder. It can do data analysis but it can also do web development, game development and everything else you can think of. Web development is also possible with R, but it centrally revolves around data analysis and visualization. I guess I’m trying to say, Python would be better for your private coin collecting website (minor upside).

Some communities slightly favor Python, others favor R. Astronomy for example really likes Python. Single cell sequencing on the other hand prefers R. Check with your field and colleagues. Finally, both languages are well documented and have massive communities behind them. This makes it much more likely that someone already solved an issue you are trying to google. In summary, any second you spend learning Python or R is well worth it.

MATLAB

MATLAB is developed by MathWorks and it is specifically designed for science and engineering. Unlike Python and R, it is neither free nor open source. If your lab pays for a license, the heavy price tag might not bother you. The language itself is more similar to Python than R (many of Python’s numerical computation capabilities were developed with MATLAB users in mind). A nice upside is that the language comes with a very strong graphical user interface and debugging capabilities. This can be very helpful. Unfortunately, MATLAB is much less popular outside of academia than Python and R. Especially smaller companies and early start-ups are unwilling to pay for MATLAB when there are free equivalents. Overall, even academics seem to be slowly transitioning away from MATLAB. However, I would not advise strictly against MATLAB if it is very popular in your field of research or the lab you want to join. Especially since Python and MATLAB are similar enough that transitioning is easy, once you learned MATLAB. I would only advise against it if you have concrete plans to leave academia for data science.

Julia

Julia is being traded as the future Python. For now it has a smaller community but it was specifically designed to keep the advantages of Python while improving performance. I currently don’t recommend Julia, unless you have some computationally intensive projects or you anticipate such projects in your professional future. The more people use a language, the higher the chance that even specialized tasks are already implemented by someone else. Julia is not yet widely adopted. If your lab uses Julia, I recommend rolling with it.

Igor Pro

Igor Pro by WaveMetrics is a commercial software and programming language. Like MATLAB, it comes with a rather rich graphical user interface. It is the first programming language I actively discourage. Even if you feel like spending money, you are probably better off with MATLAB. Igor Pro is even less popular outside of academia than MATLAB.

When I started my master thesis, the established language for my main task (intracellular electrophysiology) was Igor Pro. I decided against using it, because I had never heard about it before, the graphical user interface did not look very appealing and I had some Python experience from small hobby projects. So I decided to do the analysis myself with Python. The consequence of that was that I was extremely slow in the beginning. Had I just done it with Igor Pro, I could have taken the scripts that were already used in the lab and could have used them with minimal learning effort. Instead I had to reinvent the wheel and learn Python at the same time. This made me extremely inefficient in the short term.

In the long term it was the best choice I made during my masters, because in the long term it made me more efficient. More than that, I was able to take on new tasks that would have been nearly impossible with Igor Pro. I started to get into biophysical neuronal network simulations. Python has several packages for that. I’m not aware of any such Packages for Igor Pro. That being said, you or your colleagues might not be willing to lose short term efficiency, especially if you don’t care for programming and just want to get the job done. If you enter a lab where Igor Pro is being used, roll with it to get things done more quickly.

JavaScript

JavaScript is particularly popular for web development but it can also do some data analysis. ImageJ, a popular scientific image processing software, is written in Java and you can write JavaScript code for it. It is free and open source. It is worth checking out if you are going to do a lot of image processing. Otherwise I don’t recommend it for scientific purposes.

Most programming languages are fantastic

Committing to a programming language is difficult. Luckily, all relevant programming languages today are amazing. They all get the job done and are well worth your time (especially the ones on top of this list). And this brings me to my take-home message. Don’t worry about the technical differences between Python, R and MATLAB. Especially don’t worry about performance. Your scientific field and laboratory are much more important factors for your choice. Isolating yourself comes with a price. I also hope I made clear why and when that price is worth it. It might give you long term advantages. Finally, the best thing you can do today is to start programming and to stop worrying.