How to organize your research data during analysis

As researchers we often have to manage the entire data lifecycle from generation to the final report. Raw data are major parts of any analysis pipeline and it is important to properly store them. Many guidelines on data management focus on raw data but also stop there. However, raw data without analysis are pretty much worthless and there are major data management decisions to be made during analysis. Which intermediate results to save? What format? Where? Cloud storage? Do I build a database? Every decision will have effects in the actual analysis code and can determine how easy errors can be found and fixed. Here I want to give a detailed overview of the different options we have to organize our data during analysis. Python is my main programming language and this will be reflected here in that I prefer open source tools and I am not experienced with some of the commercial solutions that might exist. However, I will also cover the advantages/disadvantages of human-readable data formats and tools such as Microsoft Excel. We get started with raw data.

Raw data

Raw data are like a good hypothesis. Just like a hypothesis is perfect until you ruin it with data, raw data are perfect until you ruin them with a thorough analysis. Raw data are truth. This is how it happened. Therefore, the raw data is never changed and we always keep multiple copies of it. Even after we published the results of our analysis, it is a good idea to keep the raw data for at least 10 years. Depending on your funding and where you publish you might be legally and/or ethically required to keep raw data for a defined time period.

Anything we do with our raw data during the analysis we do in read-only mode. The only thing we might want to change is the file name. However, we should check whether there is a way to automate the file naming in the program that generates them. Automating file names is preferred and avoids human error. If we receive data from collaborators we should also make sure that we are not deleting or changing data that they placed in the file name or the directory hierarchy.

From an analysis point of view, we might not need any information in the file name at all. We could randomly generate it. However, we almost always want to store important metadata in the file name because it has two major advantages: 1. The file name is human-readable. You need neither programming knowledge nor knowledge about the file format to read it. 2. The file name is readable by the operating system. This means, our programs can read it even if they don’t know the file format. For example, you can use file name information to decide whether or not to load a file during analysis and save loading time.

So what do we put in the file name? The first rule is that it is better to save too much information as opposed to too little. But most file systems have limits on the file name size (I hit it many times in Windows 10). When you choose your file name format you can ask yourself: what would I want a human to know about the files before they even open them and what would be good to have easily accessible during analysis? You will read in a lot of guides that you must save the date and time. That is useful because it makes it easier to relate a file to your lab book. Otherwise I only recommend you to give each file a unique identifier. This can be a number or a string of characters. I prefer numbers because they can have the added bonus of giving the recording order at a glance. Other information in the file names is up to you.

Once you have decided which other information to put in your file names, choose one delimiter to separate items and another one for readability. I use underscore to separate items and minus for readability. That allows me to do things like: “001_YYYY-MM-DD_version-2.fmt”. Avoid empty spaces in file names. Whatever decision you make, you must remain consistent.

When it comes to the directories, I recommend having all raw files in the same directory. I used to save some directory hierarchy structure but nowadays I would put all of that into the file name. If you have a lot of files, some directory structure can help humans navigate it but during analysis I almost always prefer one directory for simplicity. The file name is certainly a good way to save metadata but there are other ways to save them. In the next chapter we will discuss metadata more generally.

At the raw data stage, the definition of metadata is clear. It is any data that describes the context of the raw data but does not fit any of the raw data formats. Images are a good example. The raw data of an image are the pixel values that make up the digital image. Metadata of the image can be anything that relates to the context where the image was taken. The time it was taken, the place it was taken, the exposure time, the model of the lens used, the temperature that day, the person who took the image, the width, the height and much, much more. As the analysis progresses, metadata may become data. Let’s say your analysis pipeline counts the number of birds in each image and you want to find out whether there are more early birds than late birds. In that case the time of day the picture was taken becomes an independent variable.

So what kind of metadata should you save? That is up to you, your research question and your field but it is usually better to save too much metadata than too little. Metadata you did not save is almost impossible to reconstruct later, so think carefully when you decide that something is not worth saving.

How to store metadata? If you are lucky, all metadata you need might already be inside your raw data files. Most of the time however, you will have to add some metadata manually. In that case I prefer to create a single .csv file where each row describes a single raw data file. One of the advantages of .csv files is that they can be read by humans and script with a variety of programs.

The id uniquely identifies the row and thereby the raw data file it refers to. A drawback of a .csv file is that it only works well for a single spreadsheet. If you require multiple related sheets – for example because you have multiple raw data sources – you might require a more complex format such as JSON or even build a database. More later on the reasons for and against building a database. Next, we will assume that we found a way to process our raw data and need to decide how to store it for analysis.

Analysis and storage of transformed raw data

Analysis of raw data most often means to iterate through all raw data files and do some computation on each. In our fictional example, we were taking images and then extracting birds from those images. Once we have extracted the birds, our unit of analysis changes. This has consequences for the metadata, which describes the raw data files. We need to relate the metadata from the raw data files to the birds and there are two major options to do this. First, we generate a spreadsheet where each row is a bird. In the image_id column we save the unique id of the image the bird was extracted from.

This is a relational data structure. The image_id in our birds table identifies a unique id in our metadata which allows us to figure out where, when and with what exposure a given bird was photographed. This is a great way to structure data. The only downside is that we have to load two files and then merge them during the analysis. Personally I prefer to create a single file where both tables are already merged. The downside of this data structure is that is is larger in storage size (because the metadata columns are repeated multiple times). Whether or not this is an issue for you depends on the total size of your data, the amount of rows extracted per image and your available disk space. If you decide on the relational data structure you might also want to consider building a database, because they are particularly well suited for relational storage. Next, we will discuss some advantages and disadvantages of databases.

You probably don’t want to build a database. Unless you do.

A database is any structured collection of information. This means that a collection of spreadsheets as we saw above could technically be considered a database. However, a database usually implies more rigorous rules of storage and access than a spreadsheet can guarantee. These rules give databases some of their major advantages. For example, multiple people and programs can interface with a database without messing things up. If two people open the same spreadsheet in excel and work on it they are almost guaranteed to end up with two conflicting versions. There are many other advantages of databases. When you decide for or against a database you should anticipate whether you will be able to make use of those advantages. Here is my list.

You should consider building a database if:

• You will have to integrate multiple raw data sources.
• Multiple raw data sources usually mean more relationships. Multiple relationships become harder to manage with spreadsheet but are the perfect use-case for a relational database.
• You expect to store a massive amount of data.
• A database usually stores data more efficiently, which can save you storage space.
• A database can also make access more efficient, because you don’t load the entire database but request smaller packages of information at a time. This avoids running into RAM issues.
• You will be adding data continuously to the project over a long period of time.
• Having a fixed logic for adding information to the database is a massive advantage over long time periods and can avoid a lot of errors.
• During very long projects you might want to train other people to interface with the database or even hand the project over to someone else. Having the relational rules that avoid errors is very helpful, as opposed to a spreadsheet where anyone can do whatever.
• You want to make you data available through a web interface.
• A web server interfacing with a spreadsheet would be very inefficient.
• Multiple people need read/write access simultaneously.
• As mentioned above, multi-user access works much better for databases.
• You have security concerns.
• This can be a security concern regarding the physical integrity of the data or malicious/unauthorized access. Databases are not perfect but they provide better safeguards than a spreadsheet

Those advantages sound amazing. So why would you ever not build a database? Well, many research projects are structured in a way that you cannot make use of any of these advantages. If a single person generates all their data in a week, processes the raw data the other week and prepares a final report in the third week, a database is overkill. The rules that database architectures follow provide major advantages but they also make them less flexible. Building and maintaining a database comes with some extra work that you only want to take on when you can make use of some of the advantages. Generally speaking, the larger a project is (regarding everything from number of people involved over data sources to time frame), the more worthwhile a database becomes.

Before we move on to data analysis, I want to briefly mention cloud storage. Cloud storage has some advantages but most of them can also be achieved in other ways. In my opinion, the most important reason to consider cloud storage is if you regularly need access to the data from a large variety of different locations that are well connected to the internet. However, if you have too much money, a commercial cloud storage can also provide services for you that might be annoying to maintain yourself. On the other hand, as a researcher you may also not be allowed (because of funding or institutional policies) to use certain cloud services. Other details of cloud storage are beyond this guide.

Visualization and statistical analysis

So far we extracted samples (birds) from our images and quantified features of those birds (species and size). Some additional features are given by the image metadata. There might be additional intermediate processing steps whose results you want to save. That’s up to you. For example, maybe you had to apply some filters to the images when you extracted the birds. Do you want to save these filtered images? If the filtering takes very long and you want to visually inspect the result, you probably want to save them. Otherwise, having the script that performs the filtering is sufficient.

The plots you draw are something you definitely want to save. You also want to save the results of statistical analysis. At this stage it is important to have a clear relationship between analysis scripts and their outputs. I hate to look through multiple scripts, trying to find the code that generated a specific figure I want to change. You could just try to put everything into one script. I have done this before but depending on the size of the project this becomes unwieldy very quickly and you will find yourselves scrolling through thousands of lines of code. There is no perfect recipe here so I think a practical example is in order.

A practical example for a project structure

Here I will give some examples, how the bird projects could look as a directory hierarchy.

In the project folder itself are only Python scripts and a README.txt, where we can give information that anyone looking at our project should be aware of. Having everything else in specific folders makes it easier to locate a Python script. There is one more advantage: if you use git to version control or share your projects, you can exclude data and figures from tracking. Data is often too large to push to remote or we don’t want to make unpublished data public. Therefore we can simply add the data folders we want to keep local to a .gitignore file. However, managing your project with git is not a must and for smaller projects it does not always pay off.

Now for the scripts themselves. Each script should have a very well defined input and output. extract_birds.py for example loops through all raw files in ./raw and creates the file ./data/birds.csv. The other two scripts both use that same birds.csv file but they answer different questions and have different outputs. analyze_location.py creates the figure 02_figure_location.png and the statistical output location_anova.txt. On the other hand analyze_species_size.py creates 01_figure_species_size.png and species_ttest.txt.

In this example the same script visualizes and performs the statistical test. You could split this up further and have a script for visualization and another for stats. I prefer to combine them because both tasks require loading of the same data. On another note, I prefer to keep things like reports, papers and presentation in a separate folder. In my experience they clutter the project space.

Planning a project in detail is important but there are also diminishing returns. At some point the only way forward is to actually start the project. I hope you learned some useful tricks to manage your own data more effectively.

Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often havedata where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as good as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")


By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images


Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, pass it our data and then fit.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 728 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 728 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

The above definition of euclidean distance for two features extends to n features (p1,p2,p2,…,pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")


When the samples are sorted by label, squareish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each others as “neighbors”. The two are of course related, because nearby samples should have a higher chance to be neighbors than further apart samples. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point and as we transform that row with the t-distribution we get our own distribution P. The perplexity is defined 2H(P) where H is the Shannon entropy. Different values for sigma will results in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity =$' + str(perplexity_01),
r'$\sigma = 10; Perplexity =$' + str(perplexity_10)))


Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points to be chosen as neighbors. Therefor, perplexity is commonly interpreted as a measure for the number of samples neigbors. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearns internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
desired_perplexity = perplexity,
verbose=False)


As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along one of its diagonals and the diagonal is all zeros. So we only keep the upper triangular of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as good as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization so that’s what we will use. Once we initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilites of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
'degrees_of_freedom': 1,
'n_samples': 1000,
'n_components':2}

embedding_init,
0,
1000,
kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
'tsne_2': tsne_result[:,1],
'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the and the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for are similarly well defined clusters. We could have gone a bit deeper here and there. For example we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.

Getting Started with Pandas DataFrame

A DataFrame is a spreadsheet like data structure. We can think of it as a collection of rows and columns. This row-column structure is useful for many different kinds of data. The most widely used DataFrame implementation in Python is from the Pandas package. First we will learn how to create DataFrames. We will also learn how to do some basic data analysis with them. Finally, we will compare the DataFrame to the ndarray data structure and learn why data frames are useful in other packages such as Seaborn.

How to Create a DataFrame

There two major ways to create a DataFrame. We can directly call DataFrame() and pass it data in a dictionary, list or array. Alternatively we can use several functions to load data from a file directly into a DataFrame. While it is very common in data science to load data from file, there are also many occasions where we need to create DataFrame from other data structures. We will first learn how to create a DataFrame from a dictionary.

import pandas as pd
d = {"Frequency": [20, 50, 8],
"Location": [2, 3, 1],
"Cell Type": ["Interneuron", "Interneuron", "Pyramidal"]}
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, index=row_names)
print(df)

"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


In our dictionary the keys are used as the column names. The data under each key then becomes the column. The row names are defined separately by passing a collection to the index parameter of DataFrame. We can get column and row names with the columns and index attributes.

df.columns
# Index(['Freq (Hz)', 'Loc (cm)', 'Cell Type'], dtype='object')
df.index
# Index(['C1', 'C2', 'C3'], dtype='object')


We can also change column and row names through those same attributes.

df.index = ["Cell_1", "Cell_2", "Cell_3"]
df.columns = ["Freq (Hz)", "Loc (cm)", "Cell Type"]
"""
Freq (Hz)  Loc (cm)    Cell Type
Cell_1         20         2  Interneuron
Cell_2         50         3  Interneuron
Cell_3          8         1    Pyramidal
"""


These names are useful because they give us a descriptive way of indexing into columns and rows. If we use indexing syntax on the DataFrame, we can get individual columns.

df['Freq (Hz)']
"""
Cell_1    20
Cell_2    50
Cell_3     8
Name: Freq (Hz), dtype: int64
"""


Row names are not found this way and using a row key will raise an error. However, we can get rows with the df.loc attribute.

df['Cell_1']
# KeyError: 'Cell_1'
df.loc['Cell_1']
"""
Freq (Hz)             20
Loc (cm)               2
Cell Type    Interneuron
Name: Cell_1, dtype: object
"""


We could also create a DataFrame from other kinds of collections that are not dictionaries. For example we can use a list.

d = [[20, 2, "Interneuron"],
[50, 3, "Interneuron"],
[8, 1, "Pyramidal"]]
column_names = ["Frequency", "Location", "Cells"]
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, columns=column_names, index=row_names)
print(df)
"""
Frequency  Location        Cells
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


In that case there are no dictionary keys that could be use to infer the column names. This means we need to pass the column_names to the columns parameter. Mostly anything that structures our data in a two-dimensional way can be used to create a DataFrame. Next we will learn about functions that allow us to load different file types as a DataFrame.

The list of file types Pandas can read and write is rather long and you can find it here. I only want to cover the most commonly used .csv file here. They have the particular advantage that they can also be read by humans, because they are essentially text files. They are also widely supported by a variety of languages and programs. First, let’s create our file. Because it is a text file, we can write a literal string to file.

text_file = open("example.csv", "w")
text_file.write(""",Frequency,Location,Cell Type
C1,20,2,Interneuron
C2,50,3,Interneuron
C3,8,1,Pyramidal""")
text_file.close()


In this file columns are separated by commas and rows are separated by new lines. This is what .csv means, it stands for comma-separated values. To load this file into a DataFrame we need to pass the file name and which column contains the row names. Pandas assumes by default that the first row contains the column names.

df = pd.read_csv("example.csv", index_col=0)
print(df)
"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


There are many more parameters we can specify for read_csv in case we have a file that is structured differently. In fact we can load files that have a value delimiter other than the comma, by specifying the delimiter parameter.

text_file = open("example.csv", "w")
text_file.write("""-Frequency-Location-Cell Type
C1-20-2-Interneuron
C2-50-3-Interneuron
C3-8-1-Pyramidal""")
text_file.close()
print(df)
"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


We specify '-' as the delimiter and and it also works. Although the function is called read_csv it is not strictly bound to comma separated values. We can also skip rows, columns and specify many more options you can learn about from the documentation. For well structured .csv files however, we need very few arguments as shown above. Next we will learn how to do basic calculations with the DataFrame.

Basic Math with DataFrame

A variety of functions such as df.mean(), df.median() and df.std() are available to do basic statistics on our DataFrame. By default they all return values per column. That is because columns are assumed to contain our variables (or features) and each row contains a sample.

df.mean()
"""
Freq (Hz)    26.0
Loc (cm)      2.0
dtype: float64
"""

df.median()
"""
Freq (Hz)    20.0
Loc (cm)      2.0
dtype: float64
"""

df.std()
"""
Freq (Hz)    21.633308
Loc (cm)      1.000000
dtype: float64
"""


One big advantage of the column is that within a column the data type is clearly defined. Within a row on the other hand different data types can exist. In our case we have two numeric types and a string. When we call these statistical methods, numeric types are ignored. In our case that is 'Cell Type'. Technically we can also use the axis parameter to calculate these statistics for each sample but this is not always useful and has to again ignore one of the columns.

df.mean(axis=1)
"""
C1    11.0
C2    26.5
C3     4.5
dtype: float64
"""


We can also use other mathematical operators. They are applied element-wise and their effect will depend on the data type of the value.

print(df * 3)
"""
Frequency  Location                      Cell Type
C1         60         6  InterneuronInterneuronInterneuron
C2        150         9  InterneuronInterneuronInterneuron
C3         24         3        PyramidalPyramidalPyramidal
"""


Often times these operations make more sense for individual columns. As explained above we can use indexing to get individual columns and we can even assign new results to an existing or new column.

norm_freq = df['Frequency'] / df.mean()['Frequency']
norm_freq
"""
C1    0.769231
C2    1.923077
C3    0.307692
Name: Frequency, dtype: float64
"""
df['Norm Freq'] = norm_freq
print(df)
"""
Frequency  Location    Cell Type  Norm Freq
C1         20         2  Interneuron   0.769231
C2         50         3  Interneuron   1.923077
C3          8         1    Pyramidal   0.307692
"""


If you are familiar with NumPy, most of these DataFrame operations will seem very familiar because they mostly work like array operations. Because Pandas builds on NumPy, most NumPy functions (for example np.sin) work on numeric columns. I don’t want to go deeper and instead move on to visualizing DataFrames with Seaborn.

Seaborn for Data Visualization

Seaborn is a high-level data visualization package that builds on Matplotlib. It does not necessarily require a DataFrame. It can work with other data structures such as ndarray but it is particularly convenient with DataFrame. First, let us get a more interesting data set. Luckily Seaborn comes with some nice example data sets and they conveniently load into Pandas DataFrame.

import seaborn as sns
type(df)
# pandas.core.frame.DataFrame
print(df)
"""
sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]
"""

print(df.columns)
"""
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
"""


The Iris data set contains information about different species of iris plants. It contains 150 samples and 5 features. The 'species' feature tells us what species a particular sample belongs to. The names of those columns are very useful when we structure our plots in Seaborn. Let’s first try a basic bar graph.

sns.set(context='paper',
style='whitegrid',
palette='colorblind',
font='Arial',
font_scale=2,
color_codes=True)
fig = sns.barplot(x='species', y='sepal_length', data=df)


We use sns.barplot and we have to pass our DataFrame to the data parameter. Then for x and y we define which column name should appear there. We put 'species' on the x-axis so that is how data is aggregated inside the bars. Setosa, versicolor and virginica are the different species. The sns.set() function defines multiple parameters of Seaborn and forces a certain style on the plots that I personally prefer. Bar graphs have grown out of fashion and for good reason. They are not very informative about the distribution of their underlying values. I prefer the violin plot to get a better idea of the distribution.

fig = sns.violinplot(x='species', y='sepal_length', data=df)


We even get a small box plot within the violin plot for free. Seaborn works its magic through the DataFrame column names. This makes plotting more convenient but also makes our code more descriptive than it would be with pure NumPy. Our code literally tells us, that 'species' will be on the x-axis.

Summary

We learned that we can create a DataFrame from a dictionary or another kind of collection. The most important features are the column and row names. Columns organize features and rows organize samples by convention. We can also load files into a DataFrame. For example we can use read_csv to load .csv or other text based files. We can also use methods like df.mean() to get basic statistics of our DataFrame. Finally, Seaborn is very useful to visualize a DataFrame.

Animations with Matplotlib

Anything that can be plotted with Matplotlib can also be animated. This is especially useful when data changes over time. Animations allow us to see the dynamics in our data, which is nearly impossible with most static plots. Here we will learn how to animate with Matplotlib by producing this traveling wave animation.

This is the code to make the animation. It creates the traveling wave, defines two functions that handle the animation and creates the animation with the FuncAnimation class. Let’s take it step by step.

import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt

# Create the traveling wave
def wave(x, t, wavelength, speed):
return np.sin((2*np.pi)*(x-speed*t)/wavelength)

x = np.arange(0,4,0.01)[np.newaxis,:]
t = np.arange(0,2,0.01)[:,np.newaxis]
wavelength = 1
speed = 1
yt = wave(x, t, wavelength, speed)  # shape is [t,y]

# Create the figure and axes to animate
fig, ax = plt.subplots(1)
# init_func() is called at the beginning of the animation
def init_func():
ax.clear()

# update_plot() is called between frames
def update_plot(i):
ax.clear()
ax.plot(x[0,:], yt[i,:], color='k')

# Create animation
anim = FuncAnimation(fig,
update_plot,
frames=np.arange(0, len(t[:,0])),
init_func=init_func)

# Save animation
anim.save('traveling_wave.mp4',
dpi=150,
fps=30,
writer='ffmpeg')


On the first three lines we import NumPy, Matplotlib and most importantly the FuncAnimation class. It will take the center stage in our code as it will create the animation later on by combining all the parts we need. On lines 5-13 we create the traveling wave. I don’t want to go into too much detail, as it is just a toy example for the animation. The important part is that we get the array yt, which defines the wave at each time point. So yt[0] contains the wave at t0 , yt[1] at t1 and so on. This is important, since we will be iterating over time during the animation. If you want to learn more about the traveling wave, you can change wavelength, speed and play around with the wave() function.

Now that we have our wave, we can start preparing the animation. We create a the figure and the axes we want to use with plt.subplots(1). Then we create a the init_func(). This one will be called whenever the animation starts or repeats. In this particular example it is pretty useless. I include it here because it is a useful feature for more complex animations.

Now we get to update_plot(), the heart of our animation. This function updates our figure between frames. It determines what we see on each frame. It is the most important function and it is shockingly simple. The parameter i is an integer that defines what frame we are at. We use that integer as an index into the first dimension of yt. We plot the wave as it looks at t=i. Importantly, we must clean up our axes with ax.clear(). If we would forget about clearing, our plot would quickly become all black, filled with waves.

Now FuncAnimation is where it all comes together. We pass it fig, update_plot and init_func. We also pass frames, those are the values that i will take on during the animation. Technically, this gets the animation going in your interactive Python console but most of the time we want to save our animation. We do that by calling anim.save(). We pass it the file name as a string, the resolution in dpi, the frames per second and finally the writer class used for generating the animation. Not all writers work for all file formats. I prefer .mp4 with the ffmpeg writer. If there are issues with saving, the most common problem is that the writer we are trying to use is not installed. If you want to find out if the ffmpeg writer is available on your machine, you can type matplotlib.animation.FFMpegWriter().isAvailable(). It returns True if the writer is available and False otherwise. If you are using Anaconda you can install the codec from here.

This wraps up our tutorial. This particular example is very simple, but anything that can be plotted can also be animated. I hope you are now on your way to create your own animations. I will leave you with a more involved animation I created.

Filtering Data with SciPy

Time series data may contain signals at many different frequencies. Sharp increases or decreases have a high frequency. Slow increases or decreases have a low frequency. Filtering allows us to take different frequency components out of the data.

Signal filtering is a science on its own and I’ll focus on the practical aspects here and stick to two filter types: butterworth and Chebyshev type I. Each of those filters can be used for different purposes. We can use them as low pass, high pass, band pass or notch filters. Low pass filters leave low frequencies alone but attack high frequencies. High pass filters leave high frequencies alone but attach low frequencies. The title image shows an example of low and high pass filters used on the same data. Band pass filters leave a specific frequency band alone and attack all other frequencies. Notch filters attack a specific frequency band, leaving the rest alone. Let’s look at an example.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import butter, cheby1, filtfilt

order = 3
Wn = 4000  # in Hz
btype = 'lowpass'
fs = 50000  # in Hz

b, a = butter(order, Wn, btype, fs = fs)
data_butter = filtfilt(b, a, data)


This is a butterworth lowpass filter with a cutoff frequency of 4000Hz (Wn). That means, signals below 4000Hz are is the pass band. They are largely left alone. Signals above 4000Hz are in the stop band, they are diminished. fs is the sampling frequency of the data. If the units are Hz, it tells us how many data points are recorded during one second. filtfilt is the function that does the actual filtering on the data, based on the filter (b, a) that was designed previously. Filtering is not a perfect process. Filters have what is called roll-off at the critical 4000Hz frequency.

Ideally, we would like a filter response that falls down straight. Anything in the pass band is untouched, anything in the stop band is shutdown the same way. As you can see, our actual filter does no live up to the ideal. It already slightly attenuates signal that is part of the pass band and it falls much slower in the stop band. If we need a steeper roll off, we can increase the order of our filter.

Some filter types have steeper roll off than others. For example, the Chebyshev type I filter achieves steeper roll off by tolerating some ripple in the pass band.

This can lead to distortions in the data depending on the size of the ripple. The Chebyshev type I filter takes an argument rp that defines the amount of pass band ripple that is tolerated in units of dB. The more ripple we tolerate, the steeper the roll off will be. Here you can see how large ripple causes oscillations in the data.

Generally, the butterworth filter is sufficient for most situations and is safer because it does not ripple. I hope this post helped you filtering your own data. If you want to learn more, check out the SciPy signal docs. Both the butter and cheby1 filter are there with many, many more.

Fantastic Programming Languages and Where to Find Them

There are many programming languages out there and committing to one of them can be intimidating. The good news is that most programming languages that are relevant today are solid. You can’t go wrong with any of them and they are all worth your time. At the same time, the programming language you pick will strongly influence your work as a PhD student or postdoc and the opportunities you have afterwards. Here I’ll guide you through the things to consider when choosing your first or second programming language.

Which language are others using?The conformists way

Most programming languages are amazing and the technical differences between them are small and only relevant under special circumstances. More important than the technical details are the community, research field and laboratory you want to join. If you already know which lab you want to join, find out which language is used there. If the lab uses multiple languages try to find out what kind of task you will have, find the people with similar tasks and find out which language they are using. If you don’t know your lab yet but you know the field, try to find out which language dominates that field. You can do so by reading papers, job advertisements or by directly writing to other PhD students and postdocs.

So why would you want to miss out on those advantages? In short: you don’t. You probably don’t know better than your future colleagues (yet). You don’t want to reinvent the wheel. But there are some other things to consider and sometimes it can pay off to deviate from lab culture. If you do so, this will affect your work. In a nutshell, you will be less productive in the short-term but more productive in the long-term, if you choose your programming language well.

Which languages are used outside academia

Many of us are not looking to stay in academic research. Even if you are committed to academia, this point is worth considering. Things and people can change. It is considered good practice to have a plan B. While many programming languages are used both inside and outside of academia, some labs use programming languages that are nearly worthless in the non-academic job market. Sometimes very specific research requires a niche programming language. Other times a lab was simply unable or unwilling to transition to a more common language.

I recommend making your plan B as concrete as possible. Maybe it doesn’t involve programming at all. Then you should fall back to the conformists way. Otherwise, check your plan B job market for programming languages that are required or advantageous. I will go through some programming languages later and give my opinion on their usage inside and outside academia. However, I cannot give a definitive answer and these job markets evolve rapidly. If you learned programming during your PhD you will be in a great position to pick up another language. You will probably have to learn more than one language anyway. Companies have so called ‘stacks’. A stack is a collection of software (including some programming languages) and people will be hired for “full-stack” or subsets of that stack. A non-academic stack will likely involve at least passing familiarity with more than one programming language. Either way, keep an eye out for labs that use niche programming languages. It might be worthwhile to defy lab culture and choose a more common language.

Performance is less important than you think

Beginners consistently overestimate the importance of performance or speed. I’ve been there. When I started out I though fast computations would be the deciding factor for or against any programming language. It isn’t, because human time is more valuable than computer time. By orders of magnitude. In research, the bottleneck is rarely computational time, it’s almost always human time. Performance only becomes relevant with very computationally intensive projects.

Imagine you are writing a script that will take one minute to run in the end. A 10x decrease in performance (now it takes 10 minutes) is very tolerable, if you get some perks for it. It is now easier to debug, easier to build on and more other people can use it and give you credit for it. If you are writing a simulation project that takes 10 days, a 10x decrease (now it takes 100 days) does not look so attractive anymore. In the real world performance of both scripts and programs is slightly more complicated but the point stands. If you are not sure whether you are in the 1 minute or 10 days category, you should try to figure it our before deciding. Just ask your colleagues and advisers.

With that, I want to move on to some fantastic programming languages. We will look at their strengths and their weaknesses. Always keep in mind the conformists way. Only choose your own language when there are clear advantages. I will briefly discuss which languages are worthwhile to use even if your lab is not on board and which ones are to be avoided even if your lab is working with them. As a disclaimer: I have hands on experience with Python, R and Matlab. For the other languages I either have second hand experience (people in my surrounding work with them) or I did some research about them.

Python and R

Python and R share a chapter because they are both excellent and should be your first choice if there are no other languages established in the lab. If the lab uses either or both, even better! Both are completely free and open source. Python is my personal favorite but I am slightly biased after years of working with it almost daily.

Both Python and R are also heavily used outside of academia. R is consistently ranked as the top required language for data scientists. Python ranks second. A downside of R is that it is specifically designed for statistics and data analysis. This can be an advantage, because as scientists this is the biggest part of our job when we program. However, Python is a complete all-rounder. It can do data analysis but it can also do web development, game development and everything else you can think of. Web development is also possible with R, but it centrally revolves around data analysis and visualization. I guess I’m trying to say, Python would be better for your private coin collecting website (minor upside).

Some communities slightly favor Python, others favor R. Astronomy for example really likes Python. Single cell sequencing on the other hand prefers R. Check with your field and colleagues. Finally, both languages are well documented and have massive communities behind them. This makes it much more likely that someone already solved an issue you are trying to google. In summary, any second you spend learning Python or R is well worth it.

MATLAB

MATLAB is developed by MathWorks and it is specifically designed for science and engineering. Unlike Python and R, it is neither free nor open source. If your lab pays for a license, the heavy price tag might not bother you. The language itself is more similar to Python than R (many of Python’s numerical computation capabilities were developed with MATLAB users in mind). A nice upside is that the language comes with a very strong graphical user interface and debugging capabilities. This can be very helpful. Unfortunately, MATLAB is much less popular outside of academia than Python and R. Especially smaller companies and early start-ups are unwilling to pay for MATLAB when there are free equivalents. Overall, even academics seem to be slowly transitioning away from MATLAB. However, I would not advise strictly against MATLAB if it is very popular in your field of research or the lab you want to join. Especially since Python and MATLAB are similar enough that transitioning is easy, once you learned MATLAB. I would only advise against it if you have concrete plans to leave academia for data science.

Julia

Julia is being traded as the future Python. For now it has a smaller community but it was specifically designed to keep the advantages of Python while improving performance. I currently don’t recommend Julia, unless you have some computationally intensive projects or you anticipate such projects in your professional future. The more people use a language, the higher the chance that even specialized tasks are already implemented by someone else. Julia is not yet widely adopted. If your lab uses Julia, I recommend rolling with it.

Igor Pro

Igor Pro by WaveMetrics is a commercial software and programming language. Like MATLAB, it comes with a rather rich graphical user interface. It is the first programming language I actively discourage. Even if you feel like spending money, you are probably better off with MATLAB. Igor Pro is even less popular outside of academia than MATLAB.

When I started my master thesis, the established language for my main task (intracellular electrophysiology) was Igor Pro. I decided against using it, because I had never heard about it before, the graphical user interface did not look very appealing and I had some Python experience from small hobby projects. So I decided to do the analysis myself with Python. The consequence of that was that I was extremely slow in the beginning. Had I just done it with Igor Pro, I could have taken the scripts that were already used in the lab and could have used them with minimal learning effort. Instead I had to reinvent the wheel and learn Python at the same time. This made me extremely inefficient in the short term.

In the long term it was the best choice I made during my masters, because in the long term it made me more efficient. More than that, I was able to take on new tasks that would have been nearly impossible with Igor Pro. I started to get into biophysical neuronal network simulations. Python has several packages for that. I’m not aware of any such Packages for Igor Pro. That being said, you or your colleagues might not be willing to lose short term efficiency, especially if you don’t care for programming and just want to get the job done. If you enter a lab where Igor Pro is being used, roll with it to get things done more quickly.

JavaScript

JavaScript is particularly popular for web development but it can also do some data analysis. ImageJ, a popular scientific image processing software, is written in Java and you can write JavaScript code for it. It is free and open source. It is worth checking out if you are going to do a lot of image processing. Otherwise I don’t recommend it for scientific purposes.

Most programming languages are fantastic

Committing to a programming language is difficult. Luckily, all relevant programming languages today are amazing. They all get the job done and are well worth your time (especially the ones on top of this list). And this brings me to my take-home message. Don’t worry about the technical differences between Python, R and MATLAB. Especially don’t worry about performance. Your scientific field and laboratory are much more important factors for your choice. Isolating yourself comes with a price. I also hope I made clear why and when that price is worth it. It might give you long term advantages. Finally, the best thing you can do today is to start programming and to stop worrying.

Getting Started with NumPy

• The array is the central NumPy object
• Pass any sequence type object to the np.array() constructor to create an array
• Use functions like np.zeros, np.arange and np.linspace to create arrays
• Use np.random to create arrays with randomly generated values

NumPy is a Python package for numerical computing. Python is not specifically designed to deal with large amounts of data but NumPy can make data analysis both more efficient and readable than it is with pure Python. Without NumPy, we would simply store numbers in a list and perform operations on those numbers by looping through the list. NumPy brings us an object called the array, which is essential to anything data related in Python and most other data analysis packages in one way or another build on the NumPy array. Here we will learn several ways to create NumPy arrays but first let’s talk about installing NumPy.

Setting up NumPy

I highly recommend installing Python with a data science platform such as https://www.anaconda.com/ that comes with NumPy and other science critical packages.
To find out if you already have NumPy installed with your distribution try to import it

import numpy as np


If that does not work, try to install NumPy with the package installer for Python (pip) by going to your commdand line. There try:

pip install numpy

Finally, you can take a look at the docs for installation instructions. https://scipy.org/install.html

Three ways to create arrays

Now let’s create our first array. An array is a sequence of numbers so we can convert any Python sequence to an array. One of the most commonly used Python sequence is the list. To convert a Python list to an array we simply pass a list to the numpy array constructor

import numpy as np
my_list = [4, 2, 7, 9]
my_array = np.array(my_list)


This creates a NumPy array with the entries 4, 2, 7, 9. We can do the same with a tuple.

my_tuple = (4, 2, 7, 9)
my_array = np.array(my_list)


Of course we can also convert nested sequences to arrays and it works exactly the same way.

my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]
my_array = np.array(my_nested_list)


This is the first way to create arrays. Pass a sequence to the np.array constructor. The second way is to use numpy functions to create arrays. One such function is np.zeros.

zeros = np.zeros((3, 4))
np.array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])


np. zeros gives us an array where each entry is 0 and it requires one argument: the shape of the array we want to get from it. Here we got an array with three rows and four columns, because we pass it the tuple (3, 4). This function is useful if you know how many values you need (the structure) but you do not know which values should be in there yet. So you can pre-initialize an all zero array and then assign the actual values to the array as you compute them. Another array creation function is called np.arange

arange= np.arange(5, 30, 2)
arange
array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])


np.arange gives us a sequence starting at 5, stopping at 30 (not including 30) and going in steps of 2. This function is very useful to generate sequences that can be used to index into another array. We will learn more about indexing in a future blog post. A function very similar to np.arange is np.linspace.

linspace = np.linspace(5, 29, 13)
linspace
array([ 5.,  7.,  9., 11., 13., 15., 17., 19., 21., 23., 25., 27., 29.])


Instead of taking the step size between values, linspace takes the number of entries in the output array. Also, the final value is inclusive (29 is the final value). Finally the third way to generate numpy arrays is with the np.random module. First, lets look at np.random.randint

randint = np.random.randint(5, 30, (3,4))
array([[26, 17, 26, 24],
[20, 16, 29, 25],
[25, 21, 26, 26]])


This creates an array containing random integers between 5 and 30 (non-inclusive) with 3 rows and 4 columns. If you try this code at home the values of your array will (most probably) look different but the shape should be the same. Finally lets look at np.random.randn

randn = np.random.randn(4,5)  # Random numbers from normal distribution
randn
array([[-2.34229894, -1.43985814, -0.51260701, -2.58213476,  1.61196437],
[-0.69767456, -0.0950676 , -0.22415381, -0.90219875,  0.33513859],
[ 0.56432586, -1.62877834, -0.60056852,  1.37310251, -1.20494281],
[-0.20589457,  1.34870661, -0.89139339, -0.40300812, -0.15703367]])


np.random.randn gives us an array with numbers randomly drawn from the standard normal distribution, a gaussian distribution with mean of 0 and variance 1. Each argument we pass to the function creates another dimension. In this case we get 4 rows and 5 columns.

Summary

We learned how to create arrays, the central NumPy object. Working with NumPy means to work with arrays and now that we know how to create them we are well prepared to get working. In the next blog post we will take a look at some of the basic arithmetic functions we can perform on arrays and show that they are both more efficient and readable than Python builtin functions.