# Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often havedata where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as good as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

## How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")


By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images


Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, pass it our data and then fit.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

## The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 728 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 728 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

The above definition of euclidean distance for two features extends to n features (p1,p2,p2,…,pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")


When the samples are sorted by label, squareish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

## Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each others as “neighbors”. The two are of course related, because nearby samples should have a higher chance to be neighbors than further apart samples. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point and as we transform that row with the t-distribution we get our own distribution P. The perplexity is defined 2H(P) where H is the Shannon entropy. Different values for sigma will results in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity =$' + str(perplexity_01),
r'$\sigma = 10; Perplexity =$' + str(perplexity_10)))


Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points to be chosen as neighbors. Therefor, perplexity is commonly interpreted as a measure for the number of samples neigbors. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearns internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
desired_perplexity = perplexity,
verbose=False)


As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along one of its diagonals and the diagonal is all zeros. So we only keep the upper triangular of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.

## Optimize Embedding with Gradient Descent

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as good as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization so that’s what we will use. Once we initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilites of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
'degrees_of_freedom': 1,
'n_samples': 1000,
'n_components':2}

embedding_init,
0,
1000,
kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
'tsne_2': tsne_result[:,1],
'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')


And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the and the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for are similarly well defined clusters. We could have gone a bit deeper here and there. For example we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.

# Getting Started with Pandas DataFrame

A DataFrame is a spreadsheet like data structure. We can think of it as a collection of rows and columns. This row-column structure is useful for many different kinds of data. The most widely used DataFrame implementation in Python is from the Pandas package. First we will learn how to create DataFrames. We will also learn how to do some basic data analysis with them. Finally, we will compare the DataFrame to the ndarray data structure and learn why data frames are useful in other packages such as Seaborn.

## How to Create a DataFrame

There two major ways to create a DataFrame. We can directly call DataFrame() and pass it data in a dictionary, list or array. Alternatively we can use several functions to load data from a file directly into a DataFrame. While it is very common in data science to load data from file, there are also many occasions where we need to create DataFrame from other data structures. We will first learn how to create a DataFrame from a dictionary.

import pandas as pd
d = {"Frequency": [20, 50, 8],
"Location": [2, 3, 1],
"Cell Type": ["Interneuron", "Interneuron", "Pyramidal"]}
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, index=row_names)
print(df)

"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


In our dictionary the keys are used as the column names. The data under each key then becomes the column. The row names are defined separately by passing a collection to the index parameter of DataFrame. We can get column and row names with the columns and index attributes.

df.columns
# Index(['Freq (Hz)', 'Loc (cm)', 'Cell Type'], dtype='object')
df.index
# Index(['C1', 'C2', 'C3'], dtype='object')


We can also change column and row names through those same attributes.

df.index = ["Cell_1", "Cell_2", "Cell_3"]
df.columns = ["Freq (Hz)", "Loc (cm)", "Cell Type"]
"""
Freq (Hz)  Loc (cm)    Cell Type
Cell_1         20         2  Interneuron
Cell_2         50         3  Interneuron
Cell_3          8         1    Pyramidal
"""


These names are useful because they give us a descriptive way of indexing into columns and rows. If we use indexing syntax on the DataFrame, we can get individual columns.

df['Freq (Hz)']
"""
Cell_1    20
Cell_2    50
Cell_3     8
Name: Freq (Hz), dtype: int64
"""


Row names are not found this way and using a row key will raise an error. However, we can get rows with the df.loc attribute.

df['Cell_1']
# KeyError: 'Cell_1'
df.loc['Cell_1']
"""
Freq (Hz)             20
Loc (cm)               2
Cell Type    Interneuron
Name: Cell_1, dtype: object
"""


We could also create a DataFrame from other kinds of collections that are not dictionaries. For example we can use a list.

d = [[20, 2, "Interneuron"],
[50, 3, "Interneuron"],
[8, 1, "Pyramidal"]]
column_names = ["Frequency", "Location", "Cells"]
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, columns=column_names, index=row_names)
print(df)
"""
Frequency  Location        Cells
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


In that case there are no dictionary keys that could be use to infer the column names. This means we need to pass the column_names to the columns parameter. Mostly anything that structures our data in a two-dimensional way can be used to create a DataFrame. Next we will learn about functions that allow us to load different file types as a DataFrame.

The list of file types Pandas can read and write is rather long and you can find it here. I only want to cover the most commonly used .csv file here. They have the particular advantage that they can also be read by humans, because they are essentially text files. They are also widely supported by a variety of languages and programs. First, let’s create our file. Because it is a text file, we can write a literal string to file.

text_file = open("example.csv", "w")
text_file.write(""",Frequency,Location,Cell Type
C1,20,2,Interneuron
C2,50,3,Interneuron
C3,8,1,Pyramidal""")
text_file.close()


In this file columns are separated by commas and rows are separated by new lines. This is what .csv means, it stands for comma-separated values. To load this file into a DataFrame we need to pass the file name and which column contains the row names. Pandas assumes by default that the first row contains the column names.

df = pd.read_csv("example.csv", index_col=0)
print(df)
"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


There are many more parameters we can specify for read_csv in case we have a file that is structured differently. In fact we can load files that have a value delimiter other than the comma, by specifying the delimiter parameter.

text_file = open("example.csv", "w")
text_file.write("""-Frequency-Location-Cell Type
C1-20-2-Interneuron
C2-50-3-Interneuron
C3-8-1-Pyramidal""")
text_file.close()
print(df)
"""
Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""


We specify '-' as the delimiter and and it also works. Although the function is called read_csv it is not strictly bound to comma separated values. We can also skip rows, columns and specify many more options you can learn about from the documentation. For well structured .csv files however, we need very few arguments as shown above. Next we will learn how to do basic calculations with the DataFrame.

## Basic Math with DataFrame

A variety of functions such as df.mean(), df.median() and df.std() are available to do basic statistics on our DataFrame. By default they all return values per column. That is because columns are assumed to contain our variables (or features) and each row contains a sample.

df.mean()
"""
Freq (Hz)    26.0
Loc (cm)      2.0
dtype: float64
"""

df.median()
"""
Freq (Hz)    20.0
Loc (cm)      2.0
dtype: float64
"""

df.std()
"""
Freq (Hz)    21.633308
Loc (cm)      1.000000
dtype: float64
"""


One big advantage of the column is that within a column the data type is clearly defined. Within a row on the other hand different data types can exist. In our case we have two numeric types and a string. When we call these statistical methods, numeric types are ignored. In our case that is 'Cell Type'. Technically we can also use the axis parameter to calculate these statistics for each sample but this is not always useful and has to again ignore one of the columns.

df.mean(axis=1)
"""
C1    11.0
C2    26.5
C3     4.5
dtype: float64
"""


We can also use other mathematical operators. They are applied element-wise and their effect will depend on the data type of the value.

print(df * 3)
"""
Frequency  Location                      Cell Type
C1         60         6  InterneuronInterneuronInterneuron
C2        150         9  InterneuronInterneuronInterneuron
C3         24         3        PyramidalPyramidalPyramidal
"""


Often times these operations make more sense for individual columns. As explained above we can use indexing to get individual columns and we can even assign new results to an existing or new column.

norm_freq = df['Frequency'] / df.mean()['Frequency']
norm_freq
"""
C1    0.769231
C2    1.923077
C3    0.307692
Name: Frequency, dtype: float64
"""
df['Norm Freq'] = norm_freq
print(df)
"""
Frequency  Location    Cell Type  Norm Freq
C1         20         2  Interneuron   0.769231
C2         50         3  Interneuron   1.923077
C3          8         1    Pyramidal   0.307692
"""


If you are familiar with NumPy, most of these DataFrame operations will seem very familiar because they mostly work like array operations. Because Pandas builds on NumPy, most NumPy functions (for example np.sin) work on numeric columns. I don’t want to go deeper and instead move on to visualizing DataFrames with Seaborn.

## Seaborn for Data Visualization

Seaborn is a high-level data visualization package that builds on Matplotlib. It does not necessarily require a DataFrame. It can work with other data structures such as ndarray but it is particularly convenient with DataFrame. First, let us get a more interesting data set. Luckily Seaborn comes with some nice example data sets and they conveniently load into Pandas DataFrame.

import seaborn as sns
type(df)
# pandas.core.frame.DataFrame
print(df)
"""
sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]
"""

print(df.columns)
"""
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
"""


The Iris data set contains information about different species of iris plants. It contains 150 samples and 5 features. The 'species' feature tells us what species a particular sample belongs to. The names of those columns are very useful when we structure our plots in Seaborn. Let’s first try a basic bar graph.

sns.set(context='paper',
style='whitegrid',
palette='colorblind',
font='Arial',
font_scale=2,
color_codes=True)
fig = sns.barplot(x='species', y='sepal_length', data=df)


We use sns.barplot and we have to pass our DataFrame to the data parameter. Then for x and y we define which column name should appear there. We put 'species' on the x-axis so that is how data is aggregated inside the bars. Setosa, versicolor and virginica are the different species. The sns.set() function defines multiple parameters of Seaborn and forces a certain style on the plots that I personally prefer. Bar graphs have grown out of fashion and for good reason. They are not very informative about the distribution of their underlying values. I prefer the violin plot to get a better idea of the distribution.

fig = sns.violinplot(x='species', y='sepal_length', data=df)


We even get a small box plot within the violin plot for free. Seaborn works its magic through the DataFrame column names. This makes plotting more convenient but also makes our code more descriptive than it would be with pure NumPy. Our code literally tells us, that 'species' will be on the x-axis.

## Summary

We learned that we can create a DataFrame from a dictionary or another kind of collection. The most important features are the column and row names. Columns organize features and rows organize samples by convention. We can also load files into a DataFrame. For example we can use read_csv to load .csv or other text based files. We can also use methods like df.mean() to get basic statistics of our DataFrame. Finally, Seaborn is very useful to visualize a DataFrame.

# Animations with Matplotlib

Anything that can be plotted with Matplotlib can also be animated. This is especially useful when data changes over time. Animations allow us to see the dynamics in our data, which is nearly impossible with most static plots. Here we will learn how to animate with Matplotlib by producing this traveling wave animation.

This is the code to make the animation. It creates the traveling wave, defines two functions that handle the animation and creates the animation with the FuncAnimation class. Let’s take it step by step.

import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt

# Create the traveling wave
def wave(x, t, wavelength, speed):
return np.sin((2*np.pi)*(x-speed*t)/wavelength)

x = np.arange(0,4,0.01)[np.newaxis,:]
t = np.arange(0,2,0.01)[:,np.newaxis]
wavelength = 1
speed = 1
yt = wave(x, t, wavelength, speed)  # shape is [t,y]

# Create the figure and axes to animate
fig, ax = plt.subplots(1)
# init_func() is called at the beginning of the animation
def init_func():
ax.clear()

# update_plot() is called between frames
def update_plot(i):
ax.clear()
ax.plot(x[0,:], yt[i,:], color='k')

# Create animation
anim = FuncAnimation(fig,
update_plot,
frames=np.arange(0, len(t[:,0])),
init_func=init_func)

# Save animation
anim.save('traveling_wave.mp4',
dpi=150,
fps=30,
writer='ffmpeg')


On the first three lines we import NumPy, Matplotlib and most importantly the FuncAnimation class. It will take the center stage in our code as it will create the animation later on by combining all the parts we need. On lines 5-13 we create the traveling wave. I don’t want to go into too much detail, as it is just a toy example for the animation. The important part is that we get the array yt, which defines the wave at each time point. So yt[0] contains the wave at t0 , yt[1] at t1 and so on. This is important, since we will be iterating over time during the animation. If you want to learn more about the traveling wave, you can change wavelength, speed and play around with the wave() function.

Now that we have our wave, we can start preparing the animation. We create a the figure and the axes we want to use with plt.subplots(1). Then we create a the init_func(). This one will be called whenever the animation starts or repeats. In this particular example it is pretty useless. I include it here because it is a useful feature for more complex animations.

Now we get to update_plot(), the heart of our animation. This function updates our figure between frames. It determines what we see on each frame. It is the most important function and it is shockingly simple. The parameter i is an integer that defines what frame we are at. We use that integer as an index into the first dimension of yt. We plot the wave as it looks at t=i. Importantly, we must clean up our axes with ax.clear(). If we would forget about clearing, our plot would quickly become all black, filled with waves.

Now FuncAnimation is where it all comes together. We pass it fig, update_plot and init_func. We also pass frames, those are the values that i will take on during the animation. Technically, this gets the animation going in your interactive Python console but most of the time we want to save our animation. We do that by calling anim.save(). We pass it the file name as a string, the resolution in dpi, the frames per second and finally the writer class used for generating the animation. Not all writers work for all file formats. I prefer .mp4 with the ffmpeg writer. If there are issues with saving, the most common problem is that the writer we are trying to use is not installed. If you want to find out if the ffmpeg writer is available on your machine, you can type matplotlib.animation.FFMpegWriter().isAvailable(). It returns True if the writer is available and False otherwise. If you are using Anaconda you can install the codec from here.

This wraps up our tutorial. This particular example is very simple, but anything that can be plotted can also be animated. I hope you are now on your way to create your own animations. I will leave you with a more involved animation I created.

# Filtering Data with SciPy

Time series data may contain signals at many different frequencies. Sharp increases or decreases have a high frequency. Slow increases or decreases have a low frequency. Filtering allows us to take different frequency components out of the data.

Signal filtering is a science on its own and I’ll focus on the practical aspects here and stick to two filter types: butterworth and Chebyshev type I. Each of those filters can be used for different purposes. We can use them as low pass, high pass, band pass or notch filters. Low pass filters leave low frequencies alone but attack high frequencies. High pass filters leave high frequencies alone but attach low frequencies. The title image shows an example of low and high pass filters used on the same data. Band pass filters leave a specific frequency band alone and attack all other frequencies. Notch filters attack a specific frequency band, leaving the rest alone. Let’s look at an example.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import butter, cheby1, filtfilt

order = 3
Wn = 4000  # in Hz
btype = 'lowpass'
fs = 50000  # in Hz

b, a = butter(order, Wn, btype, fs = fs)
data_butter = filtfilt(b, a, data)


This is a butterworth lowpass filter with a cutoff frequency of 4000Hz (Wn). That means, signals below 4000Hz are is the pass band. They are largely left alone. Signals above 4000Hz are in the stop band, they are diminished. fs is the sampling frequency of the data. If the units are Hz, it tells us how many data points are recorded during one second. filtfilt is the function that does the actual filtering on the data, based on the filter (b, a) that was designed previously. Filtering is not a perfect process. Filters have what is called roll-off at the critical 4000Hz frequency.

Ideally, we would like a filter response that falls down straight. Anything in the pass band is untouched, anything in the stop band is shutdown the same way. As you can see, our actual filter does no live up to the ideal. It already slightly attenuates signal that is part of the pass band and it falls much slower in the stop band. If we need a steeper roll off, we can increase the order of our filter.

Some filter types have steeper roll off than others. For example, the Chebyshev type I filter achieves steeper roll off by tolerating some ripple in the pass band.

This can lead to distortions in the data depending on the size of the ripple. The Chebyshev type I filter takes an argument rp that defines the amount of pass band ripple that is tolerated in units of dB. The more ripple we tolerate, the steeper the roll off will be. Here you can see how large ripple causes oscillations in the data.

Generally, the butterworth filter is sufficient for most situations and is safer because it does not ripple. I hope this post helped you filtering your own data. If you want to learn more, check out the SciPy signal docs. Both the butter and cheby1 filter are there with many, many more.

# Fantastic Programming Languages and Where to Find Them

There are many programming languages out there and committing to one of them can be intimidating. The good news is that most programming languages that are relevant today are solid. You can’t go wrong with any of them and they are all worth your time. At the same time, the programming language you pick will strongly influence your work as a PhD student or postdoc and the opportunities you have afterwards. Here I’ll guide you through the things to consider when choosing your first or second programming language.

## Which language are others using?The conformists way

Most programming languages are amazing and the technical differences between them are small and only relevant under special circumstances. More important than the technical details are the community, research field and laboratory you want to join. If you already know which lab you want to join, find out which language is used there. If the lab uses multiple languages try to find out what kind of task you will have, find the people with similar tasks and find out which language they are using. If you don’t know your lab yet but you know the field, try to find out which language dominates that field. You can do so by reading papers, job advertisements or by directly writing to other PhD students and postdocs.

So why would you want to miss out on those advantages? In short: you don’t. You probably don’t know better than your future colleagues (yet). You don’t want to reinvent the wheel. But there are some other things to consider and sometimes it can pay off to deviate from lab culture. If you do so, this will affect your work. In a nutshell, you will be less productive in the short-term but more productive in the long-term, if you choose your programming language well.

## Which languages are used outside academia

Many of us are not looking to stay in academic research. Even if you are committed to academia, this point is worth considering. Things and people can change. It is considered good practice to have a plan B. While many programming languages are used both inside and outside of academia, some labs use programming languages that are nearly worthless in the non-academic job market. Sometimes very specific research requires a niche programming language. Other times a lab was simply unable or unwilling to transition to a more common language.

I recommend making your plan B as concrete as possible. Maybe it doesn’t involve programming at all. Then you should fall back to the conformists way. Otherwise, check your plan B job market for programming languages that are required or advantageous. I will go through some programming languages later and give my opinion on their usage inside and outside academia. However, I cannot give a definitive answer and these job markets evolve rapidly. If you learned programming during your PhD you will be in a great position to pick up another language. You will probably have to learn more than one language anyway. Companies have so called ‘stacks’. A stack is a collection of software (including some programming languages) and people will be hired for “full-stack” or subsets of that stack. A non-academic stack will likely involve at least passing familiarity with more than one programming language. Either way, keep an eye out for labs that use niche programming languages. It might be worthwhile to defy lab culture and choose a more common language.

## Performance is less important than you think

Beginners consistently overestimate the importance of performance or speed. I’ve been there. When I started out I though fast computations would be the deciding factor for or against any programming language. It isn’t, because human time is more valuable than computer time. By orders of magnitude. In research, the bottleneck is rarely computational time, it’s almost always human time. Performance only becomes relevant with very computationally intensive projects.

Imagine you are writing a script that will take one minute to run in the end. A 10x decrease in performance (now it takes 10 minutes) is very tolerable, if you get some perks for it. It is now easier to debug, easier to build on and more other people can use it and give you credit for it. If you are writing a simulation project that takes 10 days, a 10x decrease (now it takes 100 days) does not look so attractive anymore. In the real world performance of both scripts and programs is slightly more complicated but the point stands. If you are not sure whether you are in the 1 minute or 10 days category, you should try to figure it our before deciding. Just ask your colleagues and advisers.

With that, I want to move on to some fantastic programming languages. We will look at their strengths and their weaknesses. Always keep in mind the conformists way. Only choose your own language when there are clear advantages. I will briefly discuss which languages are worthwhile to use even if your lab is not on board and which ones are to be avoided even if your lab is working with them. As a disclaimer: I have hands on experience with Python, R and Matlab. For the other languages I either have second hand experience (people in my surrounding work with them) or I did some research about them.

## Python and R

Python and R share a chapter because they are both excellent and should be your first choice if there are no other languages established in the lab. If the lab uses either or both, even better! Both are completely free and open source. Python is my personal favorite but I am slightly biased after years of working with it almost daily.

Both Python and R are also heavily used outside of academia. R is consistently ranked as the top required language for data scientists. Python ranks second. A downside of R is that it is specifically designed for statistics and data analysis. This can be an advantage, because as scientists this is the biggest part of our job when we program. However, Python is a complete all-rounder. It can do data analysis but it can also do web development, game development and everything else you can think of. Web development is also possible with R, but it centrally revolves around data analysis and visualization. I guess I’m trying to say, Python would be better for your private coin collecting website (minor upside).

Some communities slightly favor Python, others favor R. Astronomy for example really likes Python. Single cell sequencing on the other hand prefers R. Check with your field and colleagues. Finally, both languages are well documented and have massive communities behind them. This makes it much more likely that someone already solved an issue you are trying to google. In summary, any second you spend learning Python or R is well worth it.

## MATLAB

MATLAB is developed by MathWorks and it is specifically designed for science and engineering. Unlike Python and R, it is neither free nor open source. If your lab pays for a license, the heavy price tag might not bother you. The language itself is more similar to Python than R (many of Python’s numerical computation capabilities were developed with MATLAB users in mind). A nice upside is that the language comes with a very strong graphical user interface and debugging capabilities. This can be very helpful. Unfortunately, MATLAB is much less popular outside of academia than Python and R. Especially smaller companies and early start-ups are unwilling to pay for MATLAB when there are free equivalents. Overall, even academics seem to be slowly transitioning away from MATLAB. However, I would not advise strictly against MATLAB if it is very popular in your field of research or the lab you want to join. Especially since Python and MATLAB are similar enough that transitioning is easy, once you learned MATLAB. I would only advise against it if you have concrete plans to leave academia for data science.

## Julia

Julia is being traded as the future Python. For now it has a smaller community but it was specifically designed to keep the advantages of Python while improving performance. I currently don’t recommend Julia, unless you have some computationally intensive projects or you anticipate such projects in your professional future. The more people use a language, the higher the chance that even specialized tasks are already implemented by someone else. Julia is not yet widely adopted. If your lab uses Julia, I recommend rolling with it.

## Igor Pro

Igor Pro by WaveMetrics is a commercial software and programming language. Like MATLAB, it comes with a rather rich graphical user interface. It is the first programming language I actively discourage. Even if you feel like spending money, you are probably better off with MATLAB. Igor Pro is even less popular outside of academia than MATLAB.

When I started my master thesis, the established language for my main task (intracellular electrophysiology) was Igor Pro. I decided against using it, because I had never heard about it before, the graphical user interface did not look very appealing and I had some Python experience from small hobby projects. So I decided to do the analysis myself with Python. The consequence of that was that I was extremely slow in the beginning. Had I just done it with Igor Pro, I could have taken the scripts that were already used in the lab and could have used them with minimal learning effort. Instead I had to reinvent the wheel and learn Python at the same time. This made me extremely inefficient in the short term.

In the long term it was the best choice I made during my masters, because in the long term it made me more efficient. More than that, I was able to take on new tasks that would have been nearly impossible with Igor Pro. I started to get into biophysical neuronal network simulations. Python has several packages for that. I’m not aware of any such Packages for Igor Pro. That being said, you or your colleagues might not be willing to lose short term efficiency, especially if you don’t care for programming and just want to get the job done. If you enter a lab where Igor Pro is being used, roll with it to get things done more quickly.

## JavaScript

JavaScript is particularly popular for web development but it can also do some data analysis. ImageJ, a popular scientific image processing software, is written in Java and you can write JavaScript code for it. It is free and open source. It is worth checking out if you are going to do a lot of image processing. Otherwise I don’t recommend it for scientific purposes.

## Most programming languages are fantastic

Committing to a programming language is difficult. Luckily, all relevant programming languages today are amazing. They all get the job done and are well worth your time (especially the ones on top of this list). And this brings me to my take-home message. Don’t worry about the technical differences between Python, R and MATLAB. Especially don’t worry about performance. Your scientific field and laboratory are much more important factors for your choice. Isolating yourself comes with a price. I also hope I made clear why and when that price is worth it. It might give you long term advantages. Finally, the best thing you can do today is to start programming and to stop worrying.

# Getting Started with NumPy

• The array is the central NumPy object
• Pass any sequence type object to the np.array() constructor to create an array
• Use functions like np.zeros, np.arange and np.linspace to create arrays
• Use np.random to create arrays with randomly generated values

NumPy is a Python package for numerical computing. Python is not specifically designed to deal with large amounts of data but NumPy can make data analysis both more efficient and readable than it is with pure Python. Without NumPy, we would simply store numbers in a list and perform operations on those numbers by looping through the list. NumPy brings us an object called the array, which is essential to anything data related in Python and most other data analysis packages in one way or another build on the NumPy array. Here we will learn several ways to create NumPy arrays but first let’s talk about installing NumPy.

## Setting up NumPy

I highly recommend installing Python with a data science platform such as https://www.anaconda.com/ that comes with NumPy and other science critical packages.
To find out if you already have NumPy installed with your distribution try to import it

import numpy as np


If that does not work, try to install NumPy with the package installer for Python (pip) by going to your commdand line. There try:

pip install numpy

Finally, you can take a look at the docs for installation instructions. https://scipy.org/install.html

## Three ways to create arrays

Now let’s create our first array. An array is a sequence of numbers so we can convert any Python sequence to an array. One of the most commonly used Python sequence is the list. To convert a Python list to an array we simply pass a list to the numpy array constructor

import numpy as np
my_list = [4, 2, 7, 9]
my_array = np.array(my_list)


This creates a NumPy array with the entries 4, 2, 7, 9. We can do the same with a tuple.

my_tuple = (4, 2, 7, 9)
my_array = np.array(my_list)


Of course we can also convert nested sequences to arrays and it works exactly the same way.

my_nested_list = [[4, 2, 7, 9], [3, 2, 5, 8]]
my_array = np.array(my_nested_list)


This is the first way to create arrays. Pass a sequence to the np.array constructor. The second way is to use numpy functions to create arrays. One such function is np.zeros.

zeros = np.zeros((3, 4))
np.array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])


np. zeros gives us an array where each entry is 0 and it requires one argument: the shape of the array we want to get from it. Here we got an array with three rows and four columns, because we pass it the tuple (3, 4). This function is useful if you know how many values you need (the structure) but you do not know which values should be in there yet. So you can pre-initialize an all zero array and then assign the actual values to the array as you compute them. Another array creation function is called np.arange

arange= np.arange(5, 30, 2)
arange
array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])


np.arange gives us a sequence starting at 5, stopping at 30 (not including 30) and going in steps of 2. This function is very useful to generate sequences that can be used to index into another array. We will learn more about indexing in a future blog post. A function very similar to np.arange is np.linspace.

linspace = np.linspace(5, 29, 13)
linspace
array([ 5.,  7.,  9., 11., 13., 15., 17., 19., 21., 23., 25., 27., 29.])


Instead of taking the step size between values, linspace takes the number of entries in the output array. Also, the final value is inclusive (29 is the final value). Finally the third way to generate numpy arrays is with the np.random module. First, lets look at np.random.randint

randint = np.random.randint(5, 30, (3,4))
array([[26, 17, 26, 24],
[20, 16, 29, 25],
[25, 21, 26, 26]])


This creates an array containing random integers between 5 and 30 (non-inclusive) with 3 rows and 4 columns. If you try this code at home the values of your array will (most probably) look different but the shape should be the same. Finally lets look at np.random.randn

randn = np.random.randn(4,5)  # Random numbers from normal distribution
randn
array([[-2.34229894, -1.43985814, -0.51260701, -2.58213476,  1.61196437],
[-0.69767456, -0.0950676 , -0.22415381, -0.90219875,  0.33513859],
[ 0.56432586, -1.62877834, -0.60056852,  1.37310251, -1.20494281],
[-0.20589457,  1.34870661, -0.89139339, -0.40300812, -0.15703367]])


np.random.randn gives us an array with numbers randomly drawn from the standard normal distribution, a gaussian distribution with mean of 0 and variance 1. Each argument we pass to the function creates another dimension. In this case we get 4 rows and 5 columns.

## Summary

We learned how to create arrays, the central NumPy object. Working with NumPy means to work with arrays and now that we know how to create them we are well prepared to get working. In the next blog post we will take a look at some of the basic arithmetic functions we can perform on arrays and show that they are both more efficient and readable than Python builtin functions.