Matplotlib and the Object-Oriented Interface

Matplotlib is a great data visualization library for Python and there are two ways of using it. The functional interface (also known as pyplot interface) allows us to interactively create simple plots. The object-oriented interface on the other hand gives us more control when we create figures that contain multiple plots. While having two interfaces gives us a lot of freedom, they also cause some confusion. The most common error is to use the functional interface when using the object-oriented would be much easier. For beginners it is now highly recommended to use the object-oriented interface under most circumstances because they have a tendency to overuse the functional one. I made that mistake myself for a long time. I started out with the functional interface and only knew that one for a long time. Here I will explain the difference between both, starting with the object-oriented interface. If you have never used it, now is probably the time to start.

Figures, Axes & Methods

When using the object-oriented interface, we create objects and do the plotting with their methods. Methods are the functions that come with the object. We create both a figure and an axes object with plt.subplots(1). Then we use the ax.plot() method from our axes object to create the plot. We also use two more methods, ax.set_xlabel() and ax.set_ylabel() to label our axes.

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)
y = np.sin(np.pi * x) + x

fig, ax = plt.subplots(1)
ax.plot(x, y)
ax.set_xlabel("x")
ax.set_ylabel("y")

The big advantage is that we can very easily create multiple plots and we can very naturally keep track of where we are plotting what, because the method that does the plotting is associated with a specific axes object. In the next example we will plot on three different axes that we create all with plt.subplots(3).

x = np.arange(0,10,0.1)
ys = [np.sin(np.pi*x) + x,
      np.sin(np.pi*x) * x,
      np.sin(np.pi*x) / x]

fig, ax = plt.subplots(3)
ax[0].plot(x,ys[0])
ax[1].plot(x,ys[1])
ax[2].plot(x,ys[2])

ax[0].set_title("Addition")
ax[1].set_title("Multiplication")
ax[2].set_title("Division")

for a in ax:
    a.set_xlabel("x")
    a.set_ylabel("y")

When we create multiple axes objects, they are available to us through the ax array. We can index into them and we can also loop through all of them. We can take the above example even further and pack even the plotting into the for loop.

x = np.arange(0,10,0.1)
ys = [np.sin(np.pi*x) + x,
      np.sin(np.pi*x) * x,
      np.sin(np.pi*x) / x]

fig, ax = plt.subplots(3)
titles = ["Addition", "Multiplication", "Division"]
for idx, a in enumerate(ax):
    a.plot(x, ys[idx])
    a.set_title(titles[idx])
    a.set_xlabel("x")
    a.set_ylabel("y")

This code produces exactly the same three axes figure as above. There are other ways to use the object-oriented interface. For example, we can create an empty figure without axes using fig = plt.figure(). We can then create subplots in that figure with ax = fig.add_subplot(). This is exactly the same concept as always but instead of creating figure and axes at the same time, we use the figure method to create axes. If personally prefer fig, ax = plt.subplots() but fig.add_subplot() is slightly more flexible in the way it allows us to arrange the axes. For example, plt.subplots(x, y) allows us to create a figure with axes arranged in x rows and y columns. Using fig.add_subplot() we could create a column with 2 axes and another with 3 axes.

fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,3)
ax3 = fig.add_subplot(3,2,2)
ax4 = fig.add_subplot(3,2,4)
ax5 = fig.add_subplot(3,2,6)

Personally I prefer to avoid these arrangements, because things like tight_layout don’t work but it is doable and cannot be done with plt.subplots(). This concludes our overview of the object-oriented interface. Simply remember that you want to do your plotting through the methods of an axes object that you can create either with fig, ax = plt.subplots() or fig.add_subplot(). So what is different about the functional interface? Instead of plotting through axes methods, we do all our plotting through functions in the matplotlib.pyplot module.

One pyplot to Rule Them All

The functional interface works entirely through the pyplot module, which we import as plt by convention. In the example below we use it to create the exact same plot as in the beginning. We use plt to create the figure, do the plotting and label the axes.

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0,10,0.1)
y = np.sin(np.pi*x) + x

plt.figure()
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")

You might be wondering, why we don’t need to tell plt where to plot and which axes to label. It always works with the currently active figure or axes object. If there is no active figure, plt.plot() creates its own, including the axes. If a figure is already active, it creates an axes in that figure or plots into already existing axes. This make the functional interface less explicit and slightly less readable, especially for more complex figures. For the object-oriented interface, there is a specific object for any action, because a method must be called through an object. With the functional interface, it can be a guessing game where the plotting happens and we make ourselves highly dependent on the location in our script. The line we call plt.plot() on becomes crucial. Let’s recreate the three subplots example with the functional interface.

x = np.arange(0,10,0.1)
ys = [np.sin(np.pi*x) + x,
      np.sin(np.pi*x) * x,
      np.sin(np.pi*x) / x]

plt.figure()
plt.subplot(3, 1, 1)
plt.plot(x, ys[0])
plt.xlabel("x")
plt.ylabel("y")
plt.title("Addition")
plt.subplot(3, 1, 2)
plt.plot(x, ys[1])
plt.xlabel("x")
plt.ylabel("y")
plt.title("Multiplication")
plt.subplot(3, 1, 3)
plt.plot(x, ys[2])
plt.xlabel("x")
plt.ylabel("y")
plt.title("Division")

This one is much longer than the object-oriented code because we cannot label the axes in a for loop. To be fair, in this particular example we can put the entirety of our plotting and labeling into a for loop, but we have to put either everything or nothing into the loop. This is what I mean when I say the functional interface is less flexible.

x = np.arange(0,10,0.1)
ys = [np.sin(np.pi*x) + x,
      np.sin(np.pi*x) * x,
      np.sin(np.pi*x) / x]

plt.figure()
titles = ["Addition", "Multiplication", "Division"]
for idx, y in enumerate(ys):
    plt.subplot(3,1,idx+1)
    plt.plot(x, y)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title(titles[idx])

Both interfaces are very similar. You might have noticed that the methods in the object-oriented API have the form set_attribute. This is by design and follows from an object oriented convention, where methods that change attributes have a set prefix. Methods that don’t change attributes but create entirely new objects have an add prefix. For example add_subplot. Now that we have seen both APIs at work, why is the object-oriented API recommended?

Advantages of the Object-Oriented API

First of all, Matplotlib is internally object-oriented. The pyplot interface masks that fact in an effort to make the usage more MATLAB like by putting a functional layer on top. If we avoid plt and instead work with the object-oriented interface, our plotting becomes slightly faster. More importantly, the object-oriented interface is considered more readable and explicit. Both are very important when we write Python. Readability can be somewhat subjective but I hope the code could convince you that going through the plotting methods of an axes object makes it much more clear where we are plotting. We also get more flexibility to structure our code. Because plt depends on the order of plotting, we are constraint. With the object-oriented interface we can structure our code more clearly. We can for example split plotting, labeling and other tasks into their own code blocks.

In summary, I hope you will be able to use the object-oriented interface of Matplotlib now. Simply remember to create axes with fig, ax = plt.subplots() and then most of the work happens through the ax object. Finally, the object-oriented interface is recommended because it is more efficient, readable, explicit and flexible.

Contrast Adjustment with scikit-image

Contrast is one of the most important properties of an image and contrast adjustment is one of the easiest things we can do to make our images look better. There are many ways to adjust the contrast and with most of them we have to be careful because they can artificially change our images. The title image shows three basic methods of contrast adjustment and how they affect the image histogram. We will cover all three methods here but first let us consider what it means for an image to have low contrast.

Each pixel in an image has an intensity value. In an 8-bit image the values can be between 0 and 255. Contrast issues occur, when the largest intensity value in our image is smaller than 255 or the smallest value is larger than 0. The reason is that we are not using the entire range of possible values, making the image overall darker than it could be. So what about the contrast of our example image? We can use .min() and .max() to find maximum and minimum intensity.

import numpy as np
import matplotlib.pyplot as plt
from skimage import io, exposure, data

image = io.imread("example_image.tif")
image.max()
# 52
image.min()
# 1

Our maximum of 52 is very far away from 255, which explains why our image is so dark. On the minimum we are almost perfect. To correct the contrast we can user the exposure module which gives us the function rescale_intensity.

image_minmax_scaled = exposure.rescale_intensity(image)
image_minmax_scaled.max()
# 255
image_minmax_scaled.min()
# 0

Now both the minimum and the maximum are optimized. All pixels that were equal to the original minimum are now 0 and all pixels equal to the maximum are now 255. But what happens to the values in the middle? Let’s look at a smaller example.

arr = np.array([2, 40, 100, 205, 250], dtype=np.uint8)

arr_rescaled = exposure.rescale_intensity(arr)
# array([  0,  39, 100, 208, 255], dtype=uint8)

As expected, the minimum 2 became 0 and the maximum 250 became 255. In the middle, 40 became smaller, nothing happened to 100 and 205 became larger. We will look at each step to find out how we got there.

arr = np.array([2, 40, 100, 205, 250], dtype=np.uint8)
min, max = arr.min(), arr.max()
arr_subtracted = arr - min  # Subtract the minimum
# array([  0,  38,  98, 203, 248], dtype=uint8)
arr_divided = arr_subtracted / (max - min)  # Divide by new max
# array([0.        , 0.15322581, 0.39516129, 0.81854839, 1.        ])
arr_multiplied = arr_divided * 255  # Multiply by dtype max
# array([  0.        ,  39.07258065, 100.76612903, 208.72983871,
#        255.        ])
# Convert dtype to original uint8
arr_rescaled = np.asarray(arr_multiplied, dtype=arr.dtype)
# array([  0,  39, 100, 208, 255], dtype=uint8)

We can get there in four simple steps. Subtract the minimum, divide by the maximum of the new subtracted array, multiply by the maximum value of the data type and finally convert back to the original data type. This works well if we want to rescale by minimum and maximum of the image but sometimes we need to use different values. Especially the maximum can be easily dominated by noise. For all we know, it could be just one pixel that is not necessarily representative for the entire image. As you can see from the histogram in the title image, very few pixels are near the maximum. This means we can use percentile rescaling with little information loss.

percentiles = np.percentile(image, (0.5, 99.5))
# array([ 1., 28.])
scaled = exposure.rescale_intensity(image,
                                    in_range=tuple(percentiles))

This time we scale from 1 to 28, which means that all values above or equal to 28 become the new maximum 255. As we chose the 99.5 percentile, this affects roughly 5% of the upper pixels. You can see the consequence in the image on the right. The image becomes brighter but it also means that we lose information. Pixels that were distinguishable now look exactly the same, because they are all 255. It is up to you if you can afford to lose those features. If you do quantitative image analysis you should perform rescaling with caution and always look out for information loss.

Analyzing Image Histograms with scikit-image

An image says more than a thousand words but histograms are also very important. Digital images are made of pixels and each of them has a value. A histogram tells us how many pixels of the image have a certain value. The title plot shows Chelsea the cat and the histograms for each color channel. Here is the code that generated the figure.

import numpy as np
import skimage
import matplotlib.pyplot as plt

image = skimage.data.chelsea()
image_red, image_green, image_blue = image[:,:,0], image[:,:,1], image[:,:,2]

fig, ax = plt.subplots(2,3)
ax[0,0].imshow(image_red, cmap='gray')
ax[0,1].imshow(image_green, cmap='gray')
ax[0,2].imshow(image_blue, cmap='gray')

bins = np.arange(-0.5, 255+1,1)
ax[1,0].hist(image_red.flatten(), bins = bins, color='r')
ax[1,1].hist(image_green.flatten(), bins=bins, color='g')
ax[1,2].hist(image_blue.flatten(), bins=bins, color='b')

Because Chelsea is part of the scikit-image example data, we can simply load it with skimage.data.chelsea(). With image.shape we can find out that our image has three dimensions. The first two are y and x coordinates whereas the third one represents the colors red, green and blue (RGB). We split the colors into their own variables before visualizing each of them as a grayscale image and below it we plot the histogram. Here is a short version of the above code with some slightly advanced Python features.

fig, ax = plt.subplots(2,3)
bins = np.arange(-0.5, 255+1,1)
for ci, c in enumerate('rgb'):
    ax[0,ci].imshow(image[:,:,ci], cmap='gray')
    ax[1,ci].hist(image[:,:,ci].flatten(), bins = bins, color=c)

We can see from the histogram and the grayscale image that Chelsea is slightly more red than blue or green. But how can we get more quantitative information out of the histogram? We can use np.histogram and the usual numpy functions to learn more about the properties of our histograms.

hist_red = np.histogram(image_red.flatten(), bins=bins)
hist_red[0].argmax()
# 156

The np.histogram function gives us a tuple, where the first entry are the counts and the second entry are the bin edges. This is the reason we have to index into hist_red to call .argmax() on the correct array. .argmax() tells us that the peak of the histogram is at bin 156. This means that most pixels have an intensity value of 156. The peak can be deceiving, especially when the distribution is skewed or multi-modal but for this tutorial we will accept it as a first pass. Let’s see how the other channels look.

hist_red = np.histogram(image_red.flatten(), bins=bins)
green = np.histogram(image_green.flatten(), bins=bins)
hist_blue = np.histogram(image_blue.flatten(), bins=bins)

print(hist_red[0].argmax(),
      hist_green[0].argmax(),
      hist_blue[0].argmax())

# 156 116 97

As our eyes suspected, the green and blue channel have peaks at smaller intensity values than the red channel. This confirms our suspicion that Chelsea probably is a red cat. I hope this tutorial has been helpful to get you started with scikit-image. We learned that RGB images come in an array of shape (y, x, c), where c is the color channel. We can use plt.hist() to calculate and plot the histogram and np.hist() to calculate the histogram without plotting.

Animations with Matplotlib

Anything that can be plotted with Matplotlib can also be animated. This is especially useful when data changes over time. Animations allow us to see the dynamics in our data, which is nearly impossible with most static plots. Here we will learn how to animate with Matplotlib by producing this traveling wave animation.

This is the code to make the animation. It creates the traveling wave, defines two functions that handle the animation and creates the animation with the FuncAnimation class. Let’s take it step by step.

import numpy as np
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt

# Create the traveling wave
def wave(x, t, wavelength, speed):
    return np.sin((2*np.pi)*(x-speed*t)/wavelength)

x = np.arange(0,4,0.01)[np.newaxis,:]
t = np.arange(0,2,0.01)[:,np.newaxis]
wavelength = 1
speed = 1
yt = wave(x, t, wavelength, speed)  # shape is [t,y]

# Create the figure and axes to animate
fig, ax = plt.subplots(1)
# init_func() is called at the beginning of the animation
def init_func():
    ax.clear()

# update_plot() is called between frames
def update_plot(i):
    ax.clear()
    ax.plot(x[0,:], yt[i,:], color='k')

# Create animation
anim = FuncAnimation(fig,
                     update_plot,
                     frames=np.arange(0, len(t[:,0])),
                     init_func=init_func)

# Save animation
anim.save('traveling_wave.mp4',
          dpi=150,
          fps=30,
          writer='ffmpeg')

On the first three lines we import NumPy, Matplotlib and most importantly the FuncAnimation class. It will take the center stage in our code as it will create the animation later on by combining all the parts we need. On lines 5-13 we create the traveling wave. I don’t want to go into too much detail, as it is just a toy example for the animation. The important part is that we get the array yt, which defines the wave at each time point. So yt[0] contains the wave at t0 , yt[1] at t1 and so on. This is important, since we will be iterating over time during the animation. If you want to learn more about the traveling wave, you can change wavelength, speed and play around with the wave() function.

Now that we have our wave, we can start preparing the animation. We create a the figure and the axes we want to use with plt.subplots(1). Then we create a the init_func(). This one will be called whenever the animation starts or repeats. In this particular example it is pretty useless. I include it here because it is a useful feature for more complex animations.

Now we get to update_plot(), the heart of our animation. This function updates our figure between frames. It determines what we see on each frame. It is the most important function and it is shockingly simple. The parameter i is an integer that defines what frame we are at. We use that integer as an index into the first dimension of yt. We plot the wave as it looks at t=i. Importantly, we must clean up our axes with ax.clear(). If we would forget about clearing, our plot would quickly become all black, filled with waves.

Now FuncAnimation is where it all comes together. We pass it fig, update_plot and init_func. We also pass frames, those are the values that i will take on during the animation. Technically, this gets the animation going in your interactive Python console but most of the time we want to save our animation. We do that by calling anim.save(). We pass it the file name as a string, the resolution in dpi, the frames per second and finally the writer class used for generating the animation. Not all writers work for all file formats. I prefer .mp4 with the ffmpeg writer. If there are issues with saving, the most common problem is that the writer we are trying to use is not installed. If you want to find out if the ffmpeg writer is available on your machine, you can type matplotlib.animation.FFMpegWriter().isAvailable(). It returns True if the writer is available and False otherwise.

This wraps up our tutorial. This particular example is very simple, but anything that can be plotted can also be animated. I hope you are now on your way to create your own animations. I will leave you with a more involved animation I created.

Filtering Data with SciPy

Time series data may contain signals at many different frequencies. Sharp increases or decreases have a high frequency. Slow increases or decreases have a low frequency. Filtering allows us to take different frequency components out of the data.

Signal filtering is a science on its own and I’ll focus on the practical aspects here and stick to two filter types: butterworth and Chebyshev type I. Each of those filters can be used for different purposes. We can use them as low pass, high pass, band pass or notch filters. Low pass filters leave low frequencies alone but attack high frequencies. High pass filters leave high frequencies alone but attach low frequencies. The title image shows an example of low and high pass filters used on the same data. Band pass filters leave a specific frequency band alone and attack all other frequencies. Notch filters attack a specific frequency band, leaving the rest alone. Let’s look at an example.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.signal import butter, cheby1, filtfilt

data = np.load("example_data.npy")

order = 3
Wn = 4000  # in Hz
btype = 'lowpass'
fs = 50000  # in Hz

b, a = butter(order, Wn, btype, fs = fs)
data_butter = filtfilt(b, a, data)

This is a butterworth lowpass filter with a cutoff frequency of 4000Hz (Wn). That means, signals below 4000Hz are is the pass band. They are largely left alone. Signals above 4000Hz are in the stop band, they are diminished. fs is the sampling frequency of the data. If the units are Hz, it tells us how many data points are recorded during one second. filtfilt is the function that does the actual filtering on the data, based on the filter (b, a) that was designed previously. Filtering is not a perfect process. Filters have what is called roll-off at the critical 4000Hz frequency.

Ideally, we would like a filter response that falls down straight. Anything in the pass band is untouched, anything in the stop band is shutdown the same way. As you can see, our actual filter does no live up to the ideal. It already slightly attenuates signal that is part of the pass band and it falls much slower in the stop band. If we need a steeper roll off, we can increase the order of our filter.

Some filter types have steeper roll off than others. For example, the Chebyshev type I filter achieves steeper roll off by tolerating some ripple in the pass band.

This can lead to distortions in the data depending on the size of the ripple. The Chebyshev type I filter takes an argument rp that defines the amount of pass band ripple that is tolerated in units of dB. The more ripple we tolerate, the steeper the roll off will be. Here you can see how large ripple causes oscillations in the data.

Generally, the butterworth filter is sufficient for most situations and is safer because it does not ripple. I hope this post helped you filtering your own data. If you want to learn more, check out the SciPy signal docs. Both the butter and cheby1 filter are there with many, many more.

Smoothing Data by Rolling Average with NumPy

Time series data often comes with some amount of noise. One of the easiest ways to get rid of noise is to smooth the data with a simple uniform kernel, also called a rolling average. The title image shows data and their smoothed version. The data is the second discrete derivative from the recording of a neuronal action potential. Derivatives are notoriously noisy. We can get the result shown in the title image with np.convolve

import numpy as np

data = np.load("example_data.npy")
kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved = np.convolve(data, kernel, mode='same')

Convolution is a mathematical operation that combines two arrays. One of those arrays is our data and we convolve it with the kernel array. During convolution we center the kernel at a data point. We multiple each data point in the kernel with each corresponding data point, sum up all the results and that is the new data point at the center. Let’s look at an example.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel)
# array([ 2,  7, 13, 23, 24, 18, 19,  4])

For this result to make sense you must know, that np.convolve flips the kernel around. So step by step the calculations go as follows:

[4, 3, 2, 1]  # The flipped kernel
          x
         [2, 3, 1, 4, 1]  # The data
          2= 2
         [2]

   [4, 3, 2, 1]
          x  x
         [2, 3, 1, 4, 1]
          4+ 3= 7
         [2, 7]

      [4, 3, 2, 1]
          x  x  x
         [2, 3, 1, 4, 1]
          6+ 6+ 1=13
         [2, 7, 13]

         [4, 3, 2, 1]
          x  x  x  x
         [2, 3, 1, 4, 1]
          8+ 9+ 2+ 4= 23
         [2, 7,13,23]
# This continues until the arrays stop touching. You get the idea.

One thing you’ll notice is that the edges are problematic. There is really no good way to avoid that. Data points at the edges only see part of the kernel but the mode parameter defines what should happen at the edges. I prefer the 'same' mode because it means that the new array will have the same shape as the original data, which makes plotting easier. However, if you start to use more complicated kernels, the edges might become virtually useless. In that case, mode should be 'valid'. Then, the values at the edges that did not see the entire kernel are discarded. The output array is smaller in shape than the input array.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel, mode='valid')
array([23, 24])

The default behavior you saw above is called 'full'. It keeps all data points, so the output array is larger in shape than the input array. You might also have noticed that the size of the kernel is very important. Actually, we need to divide the array of ones by its length. Can you guess what would happen if we forgot about dividing it?

If you guessed that the signal would become larger in magnitude you guessed right. We would be summing up all data points in the kernel. By dividing it we ensure that we take the average of the data points. But the kernel size is even more important. If we make the kernel larger the outcome changes dramatically.

kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved_10 = np.convolve(data, kernel, mode='same')

kernel_size = 20
kernel = np.ones(kernel_size) / kernel_size
data_convolved_20 = np.convolve(data, kernel, mode='same')

plt.plot(data_convolved_20)
plt.plot(data_convolved_10)
plt.legend(("Kernel Size 10", "Kernel Size 20"))

The larger we make the kernel, the smaller sharp peaks become. The peaks are also shifted in time. To be specific, a rolling mean is a low-pass filter. This means that is leaves low frequency signals alone, while making high frequency signals smaller. Sharp increases in the data have a high frequency. If we make the kernel larger, the filter attenuates high frequency signals more. This is exactly how the rolling average works. It gets rid of high frequency noise. It also means that we must be careful not to distort the signal too much with the rolling average filter.

Threshold Detection in NumPy

Many signals are easily detected by their size. We will learn how to detect the indices where signals cross a threshold with NumPy. These are our practice signals.

import numpy as np
import matplotlib.pyplot as plt
data = np.array([0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 4, 4, 4, 0, 0, 0])
plt.plot(data, marker='o')

We will perform two simple steps to detect the threshold crossings: 1. Make the data binary, in a way that they are true when larger than the threshold and false when lower or equal. 2. Take the difference of the binary signal. This gives us a boolean array that is true when the threshold was crossed. We can combine those steps into one line.

threshold = 2
threshold_crossings = np.diff(data > threshold, prepend=False)

Plotting shows us that threshold_crossings is true after the threshold was crossed.

plt.plot(data2, marker='o')
plt.plot(thr_crossings, marker='o')
plt.legend(("Data", "Threshold Crossings"))

To get the indices of the threshold crossings we can use np.argwhere(), which returns the true indices from a boolean array.

np.argwhere(threshold_crossings)[:,0]
# array([ 3,  7, 11, 14], dtype=int64)

Threshold crossings occur at 3, 7, 11 and 14. Sometimes we only need the upward or downward crossings. We can simply isolate those by slicing the indiced array.

np.argwhere(threshold_crossings)[::2,0]  # Upward crossings
# array([ 3, 11], dtype=int64)
np.argwhere(thr_crossings)[1::2,0]  # Downward crossings
# array([ 7, 14], dtype=int64)

Sometimes we want to find the point before the threshold is crossed, rather than after. There is one simple trick in np.diff instead of setting prepend=False, we set append=False.

threshold = 2
post_threshold_crossings = np.diff(data > threshold, prepend=False)
pre_threshold_crossings = np.diff(data > threshold, append=False)
plt.plot(data2, marker='o')
plt.plot(post_threshold_crossings, marker ='o')
plt.plot(pre_threshold_crossings, marker='o')
plt.legend(("Data", "Post Crossings", "Pre Crossings"))

Make sure to check out the documentation for np.diff and bonus points if you can figure out why exactly this works. Detecting threshold crossings is an easy but important part in most of my analysis pipelines and now you can do it too.

Fantastic Programming Languages and Where to Find Them

There are many programming languages out there and committing to one of them can be intimidating. The good news is that most programming languages that are relevant today are solid. You can’t go wrong with any of them and they are all worth your time. At the same time, the programming language you pick will strongly influence your work as a PhD student or postdoc and the opportunities you have afterwards. Here I’ll guide you through the things to consider when choosing your first or second programming language.

Which language are others using?
The conformists way

Most programming languages are amazing and the technical differences between them are small and only relevant under special circumstances. More important than the technical details are the community, research field and laboratory you want to join. If you already know which lab you want to join, find out which language is used there. If the lab uses multiple languages try to find out what kind of task you will have, find the people with similar tasks and find out which language they are using. If you don’t know your lab yet but you know the field, try to find out which language dominates that field. You can do so by reading papers, job advertisements or by directly writing to other PhD students and postdocs.

This conformist approach is not satisfying for everyone. I get it. I actually brought a language to my lab that nobody else there was using. More on that later. Most people will benefit greatly from this conformist route. Let me first tell you the many advantages, before I explain why you might benefit from choosing another language. First, you maximize the people that can help while you are learning and while you are engaging with the technical details of your tasks. Second, you will find many solutions ready to use. You will be able to grab scripts and functions from your colleagues and you will be able to move on from programming details to solving your actual task much faster. Third, your colleagues are successfully contributing to the field so it is likely that other people in the same field are using the same programming language. That means your programming experience will help you find a postdoc job if you want to stay in that field.

So why would you want to miss out on those advantages? In short: you don’t. You probably don’t know better than your future colleagues (yet). You don’t want to reinvent the wheel. But there are some other things to consider and sometimes it can pay off to deviate from lab culture. If you do so, this will affect your work. In a nutshell, you will be less productive in the short-term but more productive in the long-term, if you choose your programming language well.

Which languages are used outside academia

Many of us are not looking to stay in academic research. Even if you are committed to academia, this point is worth considering. Things and people can change. It is considered good practice to have a plan B. While many programming languages are used both inside and outside of academia, some labs use programming languages that are nearly worthless in the non-academic job market. Sometimes very specific research requires a niche programming language. Other times a lab was simply unable or unwilling to transition to a more common language.

I recommend making your plan B as concrete as possible. Maybe it doesn’t involve programming at all. Then you should fall back to the conformists way. Otherwise, check your plan B job market for programming languages that are required or advantageous. I will go through some programming languages later and give my opinion on their usage inside and outside academia. However, I cannot give a definitive answer and these job markets evolve rapidly. If you learned programming during your PhD you will be in a great position to pick up another language. You will probably have to learn more than one language anyway. Companies have so called ‘stacks’. A stack is a collection of software (including some programming languages) and people will be hired for “full-stack” or subsets of that stack. A non-academic stack will likely involve at least passing familiarity with more than one programming language. Either way, keep an eye out for labs that use niche programming languages. It might be worthwhile to defy lab culture and choose a more common language.

Performance is less important than you think

Beginners consistently overestimate the importance of performance or speed. I’ve been there. When I started out I though fast computations would be the deciding factor for or against any programming language. It isn’t, because human time is more valuable than computer time. By orders of magnitude. In research, the bottleneck is rarely computational time, it’s almost always human time. Performance only becomes relevant with very computationally intensive projects.

Imagine you are writing a script that will take one minute to run in the end. A 10x decrease in performance (now it takes 10 minutes) is very tolerable, if you get some perks for it. It is now easier to debug, easier to build on and more other people can use it and give you credit for it. If you are writing a simulation project that takes 10 days, a 10x decrease (now it takes 100 days) does not look so attractive anymore. In the real world performance of both scripts and programs is slightly more complicated but the point stands. If you are not sure whether you are in the 1 minute or 10 days category, you should try to figure it our before deciding. Just ask your colleagues and advisers.

With that, I want to move on to some fantastic programming languages. We will look at their strengths and their weaknesses. Always keep in mind the conformists way. Only choose your own language when there are clear advantages. I will briefly discuss which languages are worthwhile to use even if your lab is not on board and which ones are to be avoided even if your lab is working with them. As a disclaimer: I have hands on experience with Python, R and Matlab. For the other languages I either have second hand experience (people in my surrounding work with them) or I did some research about them.

Python and R

Python and R share a chapter because they are both excellent and should be your first choice if there are no other languages established in the lab. If the lab uses either or both, even better! Both are completely free and open source. Python is my personal favorite but I am slightly biased after years of working with it almost daily.

Both Python and R are also heavily used outside of academia. R is consistently ranked as the top required language for data scientists. Python ranks second. A downside of R is that it is specifically designed for statistics and data analysis. This can be an advantage, because as scientists this is the biggest part of our job when we program. However, Python is a complete all-rounder. It can do data analysis but it can also do web development, game development and everything else you can think of. Web development is also possible with R, but it centrally revolves around data analysis and visualization. I guess I’m trying to say, Python would be better for your private coin collecting website (minor upside).

Some communities slightly favor Python, others favor R. Astronomy for example really likes Python. Single cell sequencing on the other hand prefers R. Check with your field and colleagues. Finally, both languages are well documented and have massive communities behind them. This makes it much more likely that someone already solved an issue you are trying to google. In summary, any second you spend learning Python or R is well worth it.

MATLAB

MATLAB is developed by MathWorks and it is specifically designed for science and engineering. Unlike Python and R, it is neither free nor open source. If your lab pays for a license, the heavy price tag might not bother you. The language itself is more similar to Python than R (many of Python’s numerical computation capabilities were developed with MATLAB users in mind). A nice upside is that the language comes with a very strong graphical user interface and debugging capabilities. This can be very helpful. Unfortunately, MATLAB is much less popular outside of academia than Python and R. Especially smaller companies and early start-ups are unwilling to pay for MATLAB when there are free equivalents. Overall, even academics seem to be slowly transitioning away from MATLAB. However, I would not advise strictly against MATLAB if it is very popular in your field of research or the lab you want to join. Especially since Python and MATLAB are similar enough that transitioning is easy, once you learned MATLAB. I would only advise against it if you have concrete plans to leave academia for data science.

Julia

Julia is being traded as the future Python. For now it has a smaller community but it was specifically designed to keep the advantages of Python while improving performance. I currently don’t recommend Julia, unless you have some computationally intensive projects or you anticipate such projects in your professional future. The more people use a language, the higher the chance that even specialized tasks are already implemented by someone else. Julia is not yet widely adopted. If your lab uses Julia, I recommend rolling with it.

Igor Pro

Igor Pro by WaveMetrics is a commercial software and programming language. Like MATLAB, it comes with a rather rich graphical user interface. It is the first programming language I actively discourage. Even if you feel like spending money, you are probably better off with MATLAB. Igor Pro is even less popular outside of academia than MATLAB.

When I started my master thesis, the established language for my main task (intracellular electrophysiology) was Igor Pro. I decided against using it, because I had never heard about it before, the graphical user interface did not look very appealing and I had some Python experience from small hobby projects. So I decided to do the analysis myself with Python. The consequence of that was that I was extremely slow in the beginning. Had I just done it with Igor Pro, I could have taken the scripts that were already used in the lab and could have used them with minimal learning effort. Instead I had to reinvent the wheel and learn Python at the same time. This made me extremely inefficient in the short term.

In the long term it was the best choice I made during my masters, because in the long term it made me more efficient. More than that, I was able to take on new tasks that would have been nearly impossible with Igor Pro. I started to get into biophysical neuronal network simulations. Python has several packages for that. I’m not aware of any such Packages for Igor Pro. That being said, you or your colleagues might not be willing to lose short term efficiency, especially if you don’t care for programming and just want to get the job done. If you enter a lab where Igor Pro is being used, roll with it to get things done more quickly.

JavaScript

JavaScript is particularly popular for web development but it can also do some data analysis. ImageJ, a popular scientific image processing software, is written in Java and you can write JavaScript code for it. It is free and open source. It is worth checking out if you are going to do a lot of image processing. Otherwise I don’t recommend it for scientific purposes.

Most programming languages are fantastic

Committing to a programming language is difficult. Luckily, all relevant programming languages today are amazing. They all get the job done and are well worth your time (especially the ones on top of this list). And this brings me to my take-home message. Don’t worry about the technical differences between Python, R and MATLAB. Especially don’t worry about performance. Your scientific field and laboratory are much more important factors for your choice. Isolating yourself comes with a price. I also hope I made clear why and when that price is worth it. It might give you long term advantages. Finally, the best thing you can do today is to start programming and to stop worrying.

What’s in a Name?

  • To store an object, we assign it a name using the equal sign
  • Names must not start with a number and they cannot contain special characters (except for underscores)
  • When you choose names, be consistent and descriptive

Names

You have probably heard of variables. In mathematics, a variable is a placeholder for something that is not fully defined just yet. In most programming languages, we say: “We define the variable”. In Python we rarely talk about variables. We instead talk about names. We would say: “We assign a name”. Specifically, we assign a name to an object. To do so, we use the equal sign. While the vocabulary is slightly different, the result is very similar in all programming languages. Let’s look at an example.

my_name = 'Daniel'
my_name
# 'Daniel'

Here we assign the name my_name to the string object 'Daniel'. From now on we can use the name to refer to the object. We can perform operations on the name and Python will replace the name with the object it refers to.

my_name = 'Daniel'
my_name * 5
# 'DanielDanielDanielDanielDaniel'

The multiplication operator applited to a string concatenates that array multiple times. Here we get the same string five times. Note, that we use the name, instead of the literal object. We can also reassign the same name to a different object.

my_name = 'Daniel'
my_name
# 'Daniel'
my_name = 5 * my_name
my_name
# 'DanielDanielDanielDanielDaniel'

First, my_name refers to 'Daniel', then it refers to the result of the operation
my_name * 5, which is 'DanielDanielDanielDanielDaniel'. When we choose a name, there are very few rules that we must follow. However, there are many more rules that we definitely should follow. We take a look at the mandatory rules first.

Naming Rules

When we choose a name, there are many things to consider. But there are also some things that we are not allowed to do. For example a name cannot start with a number.

1st_name = 5
# SyntaxError: invalid syntax
first_name = 5
first_name
# 5
name_1 = 5
name_1
# 5

In other places of the name, numbers are allowed. Special characters are not allowed at any place of a name.

name$one = 5
# SyntaxError: invalid syntax

The only special character that is allowed is the underscore. Underscores are conventionally used in Python to structure names visually.

division_and_multiplication = 5 / 10 * 2
division_and_multiplication
# 1.0

Besides the visually structuring names, underscores also have special meaning when leading a name. For example, all objects in Python have special double underscore methods. We will learn all about them later.

Good Naming Practices

If you are just getting started with Python, you probably have a lot of things on your mind. Actually, if you follow the above rules, you can choose any name you want. However, it is very important to learn good naming practices in Python. Rule number one is consistency. Avoid wildly different ways to visually structure names within the same project. Don’t start with my_name and end with MyName. If you stay consistent within your projectsyou are already ahead of the curve. For beginners I recommend making names all lower case and structuring only using underscores. Also avoid numbers. Write name_one instead of name_1. Also, don’t be afraid of making names long. It is more important names are descriptive than short. length_of_segment is much better than los, even if there is more typing involved. Don’t be afraid to type. Be afraid of coming back to your code one year after you wrote it and trying to read completely unreadable, meaningless names.

Those are enough rules for now. If you want to adopt a perfect style right from the beginning you should read the official style guide for python called PEP8. There are generally no laws about Python style but this is the most authoritative document on coding style in Python and many people reading your code will expect you to follow the guidelines in this document. If you follow the few rules I describe above, you are already following the most important practices regarding naming.

Summary

To store objects we assign them a name. We can do that with the equal sign. The name is on the left, the object is on the right. name = object Needless to say, we’ll be spending a lot of time assigning names. Once a name is assigned, we can use the name instead of the object. Python will replace the name with the object in any operation. There are very few mandatory rules when choosing a name. For one, names cannot start with a number. Second, special characters, except for the underscore (_) are forbidden in names. There are some more rules that are good to follow. The most important one is to be consistent. Do not arbitrarily change your style within a project. If you do this you are already doing well.

Doing the Math with Python

  • We can use Python as a basic calculator for addition, subtraction, multiplication, division and exponent
  • We can use the equal sign to save results
  • The two major numeric types are floating point (float) and integer (int)

Python as a Calculator

Computers are good at math and so is Python. We can use Python like a calculator with the standard arithmetic operators for addition (+), subtraction (-), multiplication (*), division (/) and exponent (**).

An IPython console running in Spyder. Shown are the standard arithmetic operators for addition, subtraction, multiplication, division and exponent.

When we combine multiple operators on one line they are executed in the order we know from school. First comes the exponent. Then multiplication and division. Finally addition and subtraction.

2 ** 2 * 5  # Exponent precedes multiplication
# 20
2 ** 2 / 5  # Exponent precedes division
# 0.8
2 * 2 + 5  # Multiplication precedes addition
# 9
2 * 2 - 5  # Multiplication precedes subtraction
# -1
2 / 2 + 5  # Division precedes addition
# 6.0
2 / 2 - 5  # Division precedes subtraction
# 0.2857142857142857

If we want to change the order of operation, we can use parentheses. Let’s use them to see how the results above would look if the order of operation was reversed.

2 ** (2 * 5)
# 1024
2 ** (2 / 5)
# 1.3195079107728942
2 * (2 + 5)
# 14
2 * (2 - 5)
# -6
2 / (2 + 5)
# 0.2857142857142857
2 / (2 - 5)
# -0.6666666666666666

Parentheses can do many different things in Python, which we will talk about later. When used with arithmetic operators they change the order of operation. Whatever appears in parentheses is resolved first. Many calculators have a basic memory function and we can easily save results in Python.

Saving Results

To save results of mathematical operations we can use the equal (=) operator. On the left we put a name that we want to assign the result to and on the right we put the operation that we want to save.

save = 5 + 7
save
# 12

In this example we assign the result 12 to the name save. Once we save a result under a name we can reuse it in an arithmetic expression. If a name stores a number it will just be evaluated as a number when it appears with a mathematical operator.

save = 5 + 7
10 + save
# 22

The equal operator is essential in Python. Here we are using it to save the result of a mathematical operation. Later we will learn that it is used to assign any kind of object to a name. A name is what allows us to refer to an object. Objects are very important in Python. We will learn all about that later but even in the seemingly simple realm of numbers there are different kinds of objects

Floating Point Numbers and Integers

You may have noticed that sometimes numbers have a dot as a decimal delimiter and some numbers don’t. That is because numbers can be floating point numbers (float for short) or integers (int for short). Those are different numerical types. What kind of number we get depends on the arithmetic operation used and the types of numbers involved. To create a specific type of number we can simply add the dot (.) to the number. To check what kind of object we created we can use the type() function.

integer = 3
type(integer)
# int
floating = 3.0
type(floating)
# float

So 3.0 generates a float and 3 generates an integer. In Python3 (the Python you should be using!) the division operator (/) always generates a floating point number, regardless of the numbers involved.

div = 4 / 2
div
# 2.0
type(div)
# float

In this case, both numbers are integers (four and two) and four can be divided by two without remainder. Division returns a float anyway. With multiplication, addition and subtraction, the situation is slightly more complicated. The type of the result depends on the type of the numbers involved. If one of the numbers is a float, the result will be float.

integer = 4 * 2
integer
# 8
type(integer)
# int
floating = 4 * 2.0
floating
# 8.0
type(floating)
# float

For now this seems like a pointless excercise. Mathematically there is no difference between 8.0 and 8. Later we will learn why the types of objects matters. Even the difference between a floating eight and an integer eight can be important. For example, when we index into a list, the number we use as an index must be an integer.

Summary

We learned about the most important mathematical operators, addition, subtraction, multiplication, division and exponent. When those operators appear together, operator precedence decides which operation is performed first. The precedence order is the same we learned in school. We can use parentheses to change the order of operation. Operations in parentheses are executed earlier. Furthermore, we learned that we can save results by assigning a name to them. We can assign a name with the equal operator. Finally, we learned that there are different numeric types. There are integers and floating point numbers. Why the difference between them matters will become more clear later.