Here we will learn how to color individual events in a Matplotlib eventplot. An eventplot most commonly shows the timing of events (x axis) from different sources (y axis). Assigning different colors to the sources can be done easily with the colors parameter of the plt.eventplot function. However, color coding each event requires us to create the eventplot and then manually assign the colors with event_collection.set_colors. Here is the code and the resulting output.
import matplotlib as mpl
import matplotlib.pyplot as plt
import shelve
import os
import numpy as np
dirname = os.path.dirname(__file__)
data_path = os.path.join(dirname, 'evenplot_demo_spikes')
data = shelve.open(data_path)
fig, ax = plt.subplots(1)
event_collection = ax.eventplot(data['spikes'])
phases_flattened = np.hstack(data['phases'])
phase_min = np.quantile(phases_flattened, 0.05)
phase_max = np.quantile(phases_flattened, 0.95)
phases_norm = mpl.colors.Normalize(vmin=phase_min, vmax=phase_max)
normalized_phases = phases_norm(data['phases'])
viridis = mpl.colormaps['viridis']
for idx, col in enumerate(event_collection):
col.set_colors(viridis(normalized_phases[idx]))
plt.colorbar(mpl.cm.ScalarMappable(norm=phases_norm, cmap=viridis),
ax=ax, label="Phase", fraction=0.05)
We start out by loading the data with the shelve module and the os module helps us find the data wherever our script is located (os.path.dirname(__file__)). Next, we create the figure and axis and plot the event times into the axis. The events will be plotted with Matplotlib's default blue. We keep a reference to all event collections by assigning the return value to event_collection. We want to color the events according to the data in data['phases']. That data contains a single floating point number for each event and we need to convert those numbers to an RGB color. For that, we first normalize our data to a value between zero and one with mpl.colors.Normalize(vmin=phase_min, vmax=phase_max). I use a trick here where I take the 5% and 95% quantiles instead of the actual minimum and maximum of the data. This should not be done for a scientific figure, because values outside vmin and vmax all get the same color although they have different values, which can be misleading. Once we have our normalized values we create a colormap object with mpl.colormaps['viridis'] and we can get colors from it by passing it the normalized values. We then loop through our event collection (each entry in event_collection is a row in our plot) and set the colors with col.set_colors(viridis(normalized_phases[idx])). Finally, we create a colorbar with plt.colorbar. And that is it. Let me know if there is an easier way to do this that I missed.
In this post we will learn how to periodically measure the power usage of our GPU and plot it in real time with a single Python program. For this we need concurrency between the measuring and the plotting part of our code. Concurrency means that the measuring process goes to sleep after measuring. While the measuring process is asleep the plotting process can do the plotting and then goes to sleep as well. After a defined amount of time the measuring process wakes up and does the measuring if the CPU allows it, then the plotting process starts, and so on. We achieve concurrency with asyncio and the plotting is done with Matplotlib. To measure the GPU power we use pynvml (Python bindings for the NVIDIA Management Library). Before we get into the code, here is a video showing the interface in action.
This video shows the power in watts drawn by my GPU over a twenty-second time window. The measurements are taken every 100 milliseconds and the plotting is done every 200 milliseconds. Everything happens in one Python script. I ran this by simply passing the below script to python on my command line.
import pynvml
import matplotlib.pyplot as plt
import time
import numpy as np
import asyncio
"""Initialize GPU measurement and parameters"""
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
measurement_interval = 0.1 # in seconds
plotting_interval = 0.2 # in seconds
time_span = 20 # time span on the plot x-axis in seconds
m = np.array([np.nan] * int(time_span / measurement_interval))  # power measurements
t = np.array([np.nan] * int(time_span / measurement_interval))  # elapsed time
mW_to_W = 1e3
"""Initialize the plot"""
plt.ion()
plt.rcParams.update({'font.size': 18})
figure, ax = plt.subplots(figsize=(8,6))
line1, = ax.plot(t, m, linewidth=3)
ax.set_xlabel("Time (s)")
ax.set_ylabel("GPU Power (W)")
async def measure():
while True:
measure = pynvml.nvmlDeviceGetPowerUsage(handle) / mW_to_W
dt = time.time() - ts
m[:-1] = m[1:]
m[-1] = measure
t[:-1] = t[1:]
t[-1] = dt
await asyncio.sleep(measurement_interval)
async def plot():
while True:
line1.set_data(t, m)
tmin, tmax = np.nanmin(t), np.nanmax(t)
mmin, mmax = np.nanmin(m), np.nanmax(m)
margin = (np.abs(mmax - mmin) / 10) + 0.1
ax.set_xlim((tmin, tmax + 1))
ax.set_ylim((mmin - margin, mmax + margin))
figure.canvas.flush_events()
await asyncio.sleep(plotting_interval)
async def main():
t1 = loop.create_task(measure())
t2 = loop.create_task(plot())
    await asyncio.gather(t1, t2)
if __name__ == "__main__":
ts = time.time()
loop = asyncio.new_event_loop()
loop.run_until_complete(main())
We will start with the functions async def measure() and async def plot() since they are central to the program. First, note that neither of them is an ordinary function because of the async keyword. This keyword was added in Python 3.5; in earlier Python versions we could instead have decorated the functions with the @asyncio.coroutine decorator. The async keyword turns our function into a coroutine, which allows us to use the await keyword inside. With the await keyword we can put the coroutine to sleep with await asyncio.sleep(measurement_interval). While it is asleep, the asyncio event loop can run other coroutines that are not asleep. More on the asyncio event loop later. Because we want to keep measuring until someone terminates the program, we wrap everything in measure into an infinite while True: loop.
So what do we do while measuring? Outside of the coroutine we define two arrays m and t, one to hold the measured power and the other to hold the elapsed time. Measuring time is important because energy is power integrated over time and we generally need to be sure that the coroutine isn't stuck asleep much longer than we want it to be. When we measure a value we shift the current elements of the measurement array one position to the left with m[:-1] = m[1:]. We then assign the newly measured value to the last position of the array with m[-1] = measure. That is all there is to our measurements.
Our plot coroutine works just like the measure coroutine except that it plots whatever is in the time and measurement arrays before it goes to sleep. The plotting itself is basic Matplotlib, but note that figure.canvas.flush_events() is critical for updating the plot in real time. Likewise, when we initialize the plot, plt.ion() is important for the plot to show properly.
Coroutines are not called like normal functions. They do their work as tasks within an asyncio event loop. This event loop knows which coroutines are asleep and decides which coroutine starts working next. Scheduling may seem manageable with two coroutines, but it already becomes tedious with three: as one coroutine goes to sleep, two may be awake and waiting to get to work, and the event loop has to decide which one goes next. Luckily asyncio takes care of these details for us and we can focus on the work we want to get done instead. We do, however, need to create an event loop with loop = asyncio.new_event_loop() and then start it with loop.run_until_complete(main()). The coroutines only get to work when the loop starts. Both our coroutines are created as tasks in main(), so both become part of the event loop. Because of the event loop I recommend running the code from the command line. Running it in interactive environments can cause problems because other event loops might already be running there.
With that, we have already covered the most important parts of the code. There are several things we could do differently and some of them might make the code better. For one, we could use a technique called blitting (explained here) to improve the performance of the plotting. We could also do the plotting with FuncAnimation (explained here) instead of writing our own coroutine. I tried that for a while but was not able to make the animation and the measure() coroutine work together in the same event loop. There probably is a way to do it that I did not find. Let me know if you have other points for improvement.
You can find pynvml here. asyncio is part of the Python installation and you can find the docs here. I was inspired to do this project by a package called codecarbon that you can find here. It estimates the carbon footprint of computation and I plan to blog about it soon.
In this post I will go through the code for a simple data dashboard that visualizes the Iris dataset. It features two dropdown menus and three checkboxes. The dropdown menus choose the features on the x and y axes, while the checkboxes make samples visible or invisible based on their species. Below is a screenshot and a video of the dashboard. Before I start with the code I give a brief rationale for using ipywidgets and Bokeh.
Combining ipywidgets with Bokeh
A main advantage of ipywidgets is that it is designed specifically for Jupyter notebooks and the IPython kernel. Bokeh on the other hand can build data dashboards for a variety of more complex web deployment contexts. This makes it more powerful, and technically it could be used to build the entire dashboard. In a notebook context, however, I prefer the simplicity of ipywidgets over the power of Bokeh. I actually started out doing the plotting with Matplotlib and seaborn. This worked well in general but I ran into some problems when directing the plot to specific places in the dashboard. I did not encounter any such issues with Bokeh, for reasons I do not yet understand. Using Bokeh also gives us some nice interactive features in the figure without any extra effort. I think it just lends itself better to these interactive dashboards than Matplotlib. If you don't want to learn about Bokeh and already know Matplotlib, ipywidgets plus Matplotlib is definitely a good option and most of the ipywidgets principles I show here apply either way.
The dashboard code
Here is the code that generates the dashboard when executed in a Jupyter notebook.
from sklearn import datasets
import pandas as pd
import numpy as np
from bokeh.plotting import figure, show, output_notebook
import ipywidgets as widgets
from IPython.display import display, clear_output
output_notebook()
"""Load Iris dataset and transform the pandas DataFrame"""
iris = datasets.load_iris()
data = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
columns= iris['feature_names'] + ['target'])
"""Define callback function for the UI"""
def var_dropdown(x):
"""This function is executed when a dropdown value is changed.
It creates a new figure according to the new dropdown values."""
p = create_figure(
x_dropdown.children[0].value,
y_dropdown.children[0].value,
data)
fig[0] = p
for species, checkbox in species_checkboxes.items():
check = checkbox.children[0].value
fig[0].select_one({'name': species}).visible = check
with output_figure:
clear_output(True)
show(fig[0])
return x
def f_species_checkbox(x, q):
"""This function is executed when a checkbox is clicked.
It directly changes the visibility of the current figure."""
fig[0].select_one({'name': q}).visible = x
with output_figure:
clear_output(True)
show(fig[0])
return x
def create_figure(x_var, y_var, data):
"""This is a helper function that creates a new figure and
plots values from all three species. x_var and y_var control
the features on each axis."""
species_colors=['coral', 'deepskyblue', 'darkblue']
p = figure(title="",
x_axis_label=x_var,
y_axis_label=y_var)
species_nr = 0
for species in iris['target_names']:
curr_dtps = data['target'] == species_nr
circle = p.circle(
data[x_var][curr_dtps],
data[y_var][curr_dtps],
line_width=2,
color=species_colors[species_nr],
name=species
)
species_nr += 1
return p
# The output widget is where we direct our figures
output_figure = widgets.Output()
# Create the default figure
fig = [] # Storing the figure in a singular list is a bit of a
# hack. We need it to properly mutate the current
# figure in our callbacks.
p = create_figure(
iris['feature_names'][0],
iris['feature_names'][1],
data)
fig.append(p)
with output_figure:
show(fig[0])
# Checkboxes to select visible species.
species_checkboxes = {}
for species in iris['target_names']:
curr_cb = widgets.interactive(f_species_checkbox,
x=True,
q=widgets.fixed(species))
curr_cb.children[0].description = species
species_checkboxes[species] = curr_cb
"""Create the widgets in the menu"""
# Dropdown menu for x-axis feature.
x_dropdown = widgets.interactive(var_dropdown,
x=iris['feature_names']);
x_dropdown.children[0].description = 'x-axis'
x_dropdown.children[0].value = iris['feature_names'][0]
# Dropdown menu for y-axis feature.
y_dropdown = widgets.interactive(var_dropdown,
x=iris['feature_names']);
y_dropdown.children[0].description = 'y-axis'
y_dropdown.children[0].value = iris['feature_names'][1]
# This creates the menu
menu=widgets.VBox([x_dropdown,
y_dropdown,
*species_checkboxes.values()])
"""Create the full app with menu and output"""
# The Layout adds some styling to our app.
# You can add Layout to any widget.
app_layout = widgets.Layout(display='flex',
flex_flow='row nowrap',
align_items='center',
border='none',
width='100%',
margin='5px 5px 5px 5px')
# The final app is just a box
app=widgets.Box([menu, output_figure], layout=app_layout)
# Display the app
display(app)
Loading the Iris data
The Iris dataset contains 150 samples. Each sample belongs to one of three species and four features are measured for each sample: sepal length, sepal width, petal length and petal width, all in cm. The goal of the dashboard is to show a scatterplot of two features at a time and an option to turn visibility for each species on or off. I load the Iris data from the sklearn package but it is a widely used toy dataset and you can get it from other places. We also convert the dataset to a Pandas DataFrame. That just makes the data easier to handle. The callback functions that make the UI interactive are defined next.
Callback functions
Callback functions are executed when something happens in the UI. For now, the functions are just defined; later they are connected to widgets. We define two callback functions, var_dropdown(x) and f_species_checkbox(x, q). create_figure is not a callback function but a helper to create a new figure. var_dropdown(x) is responsible for the two dropdown menus. The dropdowns determine what is displayed on the figure axes. When a user changes a dropdown value, var_dropdown(x) creates a new figure where the features on the x- and y-axis are determined by the new dropdown values. The first parameter x is the new value that the user chose. It is not used here, because the same callback function serves two different dropdown menus, so we don't know which menu x refers to. Instead, we directly access the dropdown values with x_dropdown.children[0].value. x_dropdown will be defined later in the code. The same goes for the checkboxes we access in the for loop with checkbox.children[0].value. Each species has a checkbox and we will create them later in the code. Finally we display our figure in the with output_figure: context, which directs our figure to the output_figure widget.
The f_species_checkbox(x, q) callback is similar. It additionally features a parameter q, a fixed parameter that identifies the checkbox that triggered the callback. We use it to determine which parts of the figure we need to make visible/invisible with fig[0].select_one({'name': q}).visible = x. Whenever we make changes to the look of the figure, we must redirect it to our output inside with output_figure: to make the changes visible.
Those are our callbacks. The figure creation in create_figure(x_var, y_var, data): is straightforward thanks to Bokeh. figure() creates the figure and then the for species in iris['target_names']: loop creates the points for each species. Next up is the actual widget creation.
Creating widgets
The first widget we create is output_figure = widgets.Output(), which will display the figure. We next create a default figure and direct it to output_figure. The fact that we store the figure in a single-element list with fig.append(p) is a bit of a hack. This has to do with scope inside the callback functions. If we reassign p = figure(), we only change p inside the function. If we assign fig[0] = figure() on the other hand, we change the list outside the function because lists are mutable.
Now that we have our figure, we create a checkbox for each species in the for species in iris['target_names']: loop and store it in a dictionary so we can access each one by species name. Most of the magic happens in widgets.interactive(f_species_checkbox, x=True, q=widgets.fixed(species)). It creates the widget and links it to the f_species_checkbox callback all in one line. It decides to create a checkbox based on x=True: booleans create checkboxes, and later we will create the dropdown menus with widgets.interactive as well. The additional parameter q=widgets.fixed(species) tells us which checkbox called f_species_checkbox.
The following two dropdown widgets are very similar. Nothing special there.
Now that we have our widgets, we need to assemble them into our UI. We do that with menu=widgets.VBox([x_dropdown, y_dropdown, *species_checkboxes.values()]). This creates a box where the widgets inside are oriented in a column. The V in VBox means vertical, hence a column. We are almost done. The specifications in widgets.Layout() are not critical but I want to show them here. This is not a widget in itself, but we can pass it to a widget to change its style. widgets.Layout() exposes properties you might know from CSS.
To finish up we create the full app with app=widgets.Box([menu, output_figure], layout=app_layout). The app_layout specifies with flex_flow that the menu and the figure output are oriented as a row, menu on the left and figure on the right. Finally, we display our app in the output under our cell with display(app).
Possible improvements
To improve this code, I think it would be better if the callbacks depended less on global variables. A closer look at widgets.interactive might be useful for that. Alternatively, the global variables that are used inside functions could be made explicit with the global keyword. Finally, in create_figure(), I am creating a counter variable species_nr = 0. This is unnecessary in Python (enumerate would be the more Pythonic choice) but I did not get around to changing it. I hope this has been useful for you. Let me know what kind of data dashboards you are building.
As researchers we often have to manage the entire data lifecycle from generation to the final report. Raw data are a major part of any analysis pipeline and it is important to store them properly. Many guidelines on data management focus on raw data but also stop there. However, raw data without analysis are pretty much worthless and there are major data management decisions to be made during analysis. Which intermediate results do we save? In what format? Where? Cloud storage? Do I build a database? Every decision will have effects on the actual analysis code and can determine how easily errors can be found and fixed. Here I want to give a detailed overview of the different options we have to organize our data during analysis. Python is my main programming language and this will be reflected here in that I prefer open source tools and I am not experienced with some of the commercial solutions that might exist. However, I will also cover the advantages and disadvantages of human-readable data formats and tools such as Microsoft Excel. We start with raw data.
Raw data
Raw data are like a good hypothesis. Just like a hypothesis is perfect until you ruin it with data, raw data are perfect until you ruin them with a thorough analysis. Raw data are truth. This is how it happened. Therefore, raw data are never changed and we always keep multiple copies of them. Even after we have published the results of our analysis, it is a good idea to keep the raw data for at least 10 years. Depending on your funding and where you publish, you might be legally and/or ethically required to keep raw data for a defined time period.
Anything we do with our raw data during the analysis we do in read-only mode. The only thing we might want to change is the file name. However, we should check whether there is a way to automate the file naming in the program that generates them. Automating file names is preferred and avoids human error. If we receive data from collaborators we should also make sure that we are not deleting or changing data that they placed in the file name or the directory hierarchy.
From an analysis point of view, we might not need any information in the file name at all. We could randomly generate it. However, we almost always want to store important metadata in the file name because it has two major advantages: 1. The file name is human-readable. You need neither programming knowledge nor knowledge about the file format to read it. 2. The file name is readable by the operating system. This means, our programs can read it even if they don’t know the file format. For example, you can use file name information to decide whether or not to load a file during analysis and save loading time.
So what do we put in the file name? The first rule is that it is better to save too much information than too little. But most file systems have limits on the file name length (I hit that limit many times in Windows 10). When you choose your file name format you can ask yourself: what would I want a human to know about the files before they even open them, and what would be good to have easily accessible during analysis? You will read in a lot of guides that you must save the date and time. That is useful because it makes it easier to relate a file to your lab book. Otherwise I only recommend giving each file a unique identifier. This can be a number or a string of characters. I prefer numbers because they have the added bonus of giving the recording order at a glance. Other information in the file names is up to you.
Once you have decided which other information to put in your file names, choose one delimiter to separate items and another one for readability. I use underscore to separate items and minus for readability. That allows me to do things like: “001_YYYY-MM-DD_version-2.fmt”. Avoid empty spaces in file names. Whatever decision you make, you must remain consistent.
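With such a convention, a script can recover the metadata from the file name alone. Here is a minimal sketch with a hypothetical file name following the pattern above:
from pathlib import Path

fname = Path("001_2021-03-15_version-2.csv")  # hypothetical file name
file_id, date, version = fname.stem.split("_")
print(file_id, date, version)  # 001 2021-03-15 version-2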
When it comes to directories, I recommend having all raw files in the same directory. I used to use some directory hierarchy structure, but nowadays I would put all of that information into the file name. If you have a lot of files, some directory structure can help humans navigate them, but during analysis I almost always prefer one directory for simplicity. The file name is certainly a good way to save metadata, but there are other ways to save them. In the next section we will discuss metadata more generally.
Metadata
At the raw data stage, the definition of metadata is clear. It is any data that describes the context of the raw data but does not fit any of the raw data formats. Images are a good example. The raw data of an image are the pixel values that make up the digital image. Metadata of the image can be anything that relates to the context where the image was taken. The time it was taken, the place it was taken, the exposure time, the model of the lens used, the temperature that day, the person who took the image, the width, the height and much, much more. As the analysis progresses, metadata may become data. Let’s say your analysis pipeline counts the number of birds in each image and you want to find out whether there are more early birds than late birds. In that case the time of day the picture was taken becomes an independent variable.
So what kind of metadata should you save? That is up to you, your research question and your field but it is usually better to save too much metadata than too little. Metadata you did not save is almost impossible to reconstruct later, so think carefully when you decide that something is not worth saving.
How do we store metadata? If you are lucky, all the metadata you need might already be inside your raw data files. Most of the time, however, you will have to add some metadata manually. In that case I prefer to create a single .csv file where each row describes a single raw data file. One of the advantages of .csv files is that they can be read by humans and scripts alike with a variety of programs.
The same metadata.csv file opened in four different programs. OpenOffice Calc (top left), Notepad (top right), Spyder IDE (lower left) and Notepad++ (lower right).
The id uniquely identifies the row and thereby the raw data file it refers to. A drawback of a .csv file is that it only works well for a single spreadsheet. If you require multiple related sheets – for example because you have multiple raw data sources – you might need a more complex format such as JSON or even have to build a database. More on the reasons for and against building a database later. Next, we will assume that we found a way to process our raw data and need to decide how to store it for analysis.
Analysis and storage of transformed raw data
Analysis of raw data most often means iterating through all raw data files and doing some computation on each. In our fictional example, we were taking images and then extracting birds from those images. Once we have extracted the birds, our unit of analysis changes. This has consequences for the metadata, which describe the raw data files. We need to relate the metadata from the raw data files to the birds and there are two major options to do this. First, we generate a spreadsheet where each row is a bird. In the image_id column we save the unique id of the image the bird was extracted from.
The extracted birds are related to a raw image by the fact that a specific bird was extracted from a single specific image. The tables in the top left and top right contain all information but during analysis we would need to open two files and relate the information. The table at the bottom has the advantage that both tables are merged into one. This can offer convenience but it increases the storage size of the file.
This is a relational data structure. The image_id in our birds table identifies a unique id in our metadata, which allows us to figure out where, when and with what exposure a given bird was photographed. This is a great way to structure data. The only downside is that we have to load two files and then merge them during the analysis. Personally, I prefer to create a single file where both tables are already merged. The downside of this data structure is that it is larger in storage size (because the metadata columns are repeated multiple times). Whether or not this is an issue for you depends on the total size of your data, the number of rows extracted per image and your available disk space. If you decide on the relational data structure you might also want to consider building a database, because databases are particularly well suited for relational storage. Next, we will discuss some advantages and disadvantages of databases.
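As a sketch of the two options in Python (the file names and column names are hypothetical but follow the example):
import pandas as pd

metadata = pd.read_csv("metadata.csv")  # one row per raw image
birds = pd.read_csv("birds.csv")        # one row per extracted bird

# Relational structure: keep two tables and merge them during analysis
birds_with_metadata = birds.merge(metadata, left_on="image_id", right_on="id")

# Alternative: store the merged table once and load a single file later
birds_with_metadata.to_csv("birds_merged.csv", index=False)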
You probably don’t want to build a database. Unless you do.
A database is any structured collection of information. This means that a collection of spreadsheets as we saw above could technically be considered a database. However, a database usually implies more rigorous rules of storage and access than a spreadsheet can guarantee. These rules give databases some of their major advantages. For example, multiple people and programs can interface with a database without messing things up. If two people open the same spreadsheet in Excel and work on it, they are almost guaranteed to end up with two conflicting versions. There are many other advantages of databases. When you decide for or against a database you should anticipate whether you will be able to make use of those advantages. Here is my list.
You should consider building a database if:
You will have to integrate multiple raw data sources.
Multiple raw data sources usually mean more relationships. Multiple relationships become harder to manage with spreadsheets but are the perfect use case for a relational database.
You expect to store a massive amount of data.
A database usually stores data more efficiently, which can save you storage space.
A database can also make access more efficient, because you don’t load the entire database but request smaller packages of information at a time. This avoids running into RAM issues.
You will be adding data continuously to the project over a long period of time.
Having a fixed logic for adding information to the database is a massive advantage over long time periods and can avoid a lot of errors.
During very long projects you might want to train other people to interface with the database or even hand the project over to someone else. Having the relational rules that avoid errors is very helpful, as opposed to a spreadsheet where anyone can do whatever.
You want to make your data available through a web interface.
A web server interfacing with a spreadsheet would be very inefficient.
Multiple people need read/write access simultaneously.
As mentioned above, multi-user access works much better for databases.
You have security concerns.
This can concern the physical integrity of the data or malicious/unauthorized access. Databases are not perfect, but they provide better safeguards than a spreadsheet.
Those advantages sound amazing. So why would you ever not build a database? Well, many research projects are structured in a way that does not let you make use of any of these advantages. If a single person generates all their data in one week, processes the raw data the next week and prepares a final report in the third week, a database is overkill. The rules that database architectures follow provide major advantages but they also make them less flexible. Building and maintaining a database comes with some extra work that you only want to take on when you can make use of some of the advantages. Generally speaking, the larger a project is (regarding everything from the number of people involved to the data sources to the time frame), the more worthwhile a database becomes.
Before we move on to data analysis, I want to briefly mention cloud storage. Cloud storage has some advantages but most of them can also be achieved in other ways. In my opinion, the most important reason to consider cloud storage is if you regularly need access to the data from a large variety of different locations that are well connected to the internet. However, if you have too much money, a commercial cloud storage can also provide services for you that might be annoying to maintain yourself. On the other hand, as a researcher you may also not be allowed (because of funding or institutional policies) to use certain cloud services. Other details of cloud storage are beyond this guide.
Visualization and statistical analysis
So far we extracted samples (birds) from our images and quantified features of those birds (species and size). Some additional features are given by the image metadata. There might be additional intermediate processing steps whose results you want to save. That’s up to you. For example, maybe you had to apply some filters to the images when you extracted the birds. Do you want to save these filtered images? If the filtering takes very long and you want to visually inspect the result, you probably want to save them. Otherwise, having the script that performs the filtering is sufficient.
The plots you draw are something you definitely want to save. You also want to save the results of statistical analysis. At this stage it is important to have a clear relationship between analysis scripts and their outputs. I hate looking through multiple scripts, trying to find the code that generated a specific figure I want to change. You could just try to put everything into one script. I have done this before, but depending on the size of the project this becomes unwieldy very quickly and you will find yourself scrolling through thousands of lines of code. There is no perfect recipe here, so I think a practical example is in order.
A practical example for a project structure
Here I will give an example of how the bird project could look as a directory hierarchy.
The directory structure of an example project.
In the project folder itself are only Python scripts and a README.txt, where we can give information that anyone looking at our project should be aware of. Having everything else in specific folders makes it easier to locate a Python script. There is one more advantage: if you use git to version control or share your projects, you can exclude data and figures from tracking. Data is often too large to push to remote or we don’t want to make unpublished data public. Therefore we can simply add the data folders we want to keep local to a .gitignore file. However, managing your project with git is not a must and for smaller projects it does not always pay off.
Now for the scripts themselves. Each script should have a very well defined input and output. extract_birds.py for example loops through all raw files in ./raw and creates the file ./data/birds.csv. The other two scripts both use that same birds.csv file but they answer different questions and have different outputs. analyze_location.py creates the figure 02_figure_location.png and the statistical output location_anova.txt. On the other hand analyze_species_size.py creates 01_figure_species_size.png and species_ttest.txt.
In this example the same script visualizes and performs the statistical test. You could split this up further and have one script for visualization and another for stats. I prefer to combine them because both tasks require loading the same data. On another note, I prefer to keep things like reports, papers and presentations in a separate folder. In my experience they clutter the project space.
Planning a project in detail is important but there are also diminishing returns. At some point the only way forward is to actually start the project. I hope you learned some useful tricks to manage your own data more effectively.
As a scientist I have the privilege to always learn and discover new things. I have recently started a new blog called Microbe Food, about single cell food production, you might be interested in. This leaves my programming blog mostly dormant for the moment but I hope the content is still useful.
SciPy features two different interfaces to solve differential equations: odeint and solve_ivp. The newer one is solve_ivp and it is recommended but odeint is still widespread, probably because of its simplicity. Here I will go through the difference between both with a focus on moving to the more modern solve_ivp interface. The primary advantage is that solve_ivp offers several methods for solving differential equations whereas odeint is restricted to one. We get started by setting up our system of differential equations and some parameters of the simulation.
UPDATE 07.02.2021: Note that I am focusing here on usage of the different interfaces and not on benchmarking accuracy or performance. Faruk Krecinic has contacted me and noted that assessing accuracy would require a system with a known solution as a benchmark. This is beyond the scope of this blog post. He also pointed out to me that the hmax parameter of odeint is important for the extent to which both interfaces give similar results. You can learn more about the parameters of odeint in the docs.
import numpy as np
from scipy.integrate import odeint, solve_ivp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def lorenz(t, state, sigma, beta, rho):
x, y, z = state
dx = sigma * (y - x)
dy = x * (rho - z) - y
dz = x * y - beta * z
return [dx, dy, dz]
sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0
p = (sigma, beta, rho) # Parameters of the system
y0 = [1.0, 1.0, 1.0] # Initial state of the system
We will be using the Lorenz system. We can directly move on to solving the system with both odeint and solve_ivp.
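The results discussed below come from calls like the following sketch. The 40 second simulation window and the 0.01 s step are assumptions on my part; the text below only fixes the number of odeint time points at 4000.
t = np.arange(0.0, 40.0, 0.01)  # 4000 time points for odeint

# odeint expects an array of time points; tfirst=True because our
# lorenz(t, state, ...) signature takes time first
result_odeint = odeint(lorenz, y0, t, args=p, tfirst=True)

# solve_ivp expects a (start, end) time span and chooses its own time points
result_solve_ivp = solve_ivp(lorenz, t_span=(0.0, 40.0), y0=y0, args=p)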
Simulation results from odeint and solve_ivp. Note that the solve_ivp result looks very different, primarily because of the default temporal resolution that is applied. Changing the temporal resolution and getting very similar results to odeint is easy and shown below.
The first thing that sticks out is that the solve_ivp solution is less smooth. That is because it is calculated at fewer time points, which in turn has to do with the difference between t_span and t. The odeint interface expects t, an array of time points for which we want to calculate the solution. The temporal resolution of the system is given by the interval between time points. The solve_ivp interface on the other hand expects t_span, a tuple that gives the start and end of the simulation interval. solve_ivp determines the temporal resolution by itself, depending on the integration method and the desired accuracy of the solution. We can confirm that the temporal resolution of solve_ivp is lower in this example by inspecting the output of both functions.
The t array has 4000 elements and therefore the result of odeint has 4000 rows, each row being a time point defined by t. The result of solve_ivp is different. It has its own time array as an attribute and it has 1989 elements. This tells us that solve_ivp indeed calculated fewer time points, which lowers the temporal resolution. So how can we increase the number of time points in solve_ivp? There are three ways: 1. We can manually define the time points to integrate, similar to odeint. 2. We can decrease the error of the solution we are willing to tolerate. 3. We can change to a more accurate integration method. We will first change the integration method. The default integration method of solve_ivp is RK45 and we will compare it to the default method of odeint, which is LSODA.
Comparison between RK45 and LSODA integration methods of solve_ivp.
The LSODA method is more accurate, but it is still not as accurate as the solution we got from odeint. That is because we made odeint solve at an even higher temporal resolution when we passed it t. To get the exact same result from solve_ivp that we got from odeint, we must pass it the exact time points we want to solve for via the t_eval parameter.
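Building on the sketch above, that call could look like this:
result_solve_ivp = solve_ivp(lorenz, t_span=(0.0, 40.0), y0=y0, args=p,
                             method='LSODA', t_eval=t)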
odeint and solve_ivp with identical integration method and temporal resolution.
Now both solutions have identical temporal resolution. But they are still not identical. I was unable to confirm why that is the case, but I suspect very small floating point errors. The Lorenz attractor is a chaotic system and even small errors can make it diverge. The following plot shows the first variable of the system for odeint and solve_ivp from the above simulation. It is consistent with my suspicion that floating point accuracy is to blame, but I was unable to pin down the source.
Solutions from odeint and solve_ivp diverge even for identical temporal resolution and integration method.
There is one more way to make the solution smoother: decrease the tolerated error. We will not go into that here because it does not relate to the difference between odeint and solve_ivp; it can be used in both interfaces. There are some more subtle differences. For example, by default, odeint expects the first parameter of the problem to be the state and the second parameter to be time t. It is the other way around for solve_ivp. To make odeint accept a problem definition where time is the first parameter we set the parameter tfirst=True.
In summary, solve_ivp offers several integration methods while odeint only uses LSODA.
Differential equations are special because they don't tell us the value of a variable straight up. Instead, they tell us by how much the variable will change with respect to the change of another variable. Usually that other variable is time. To numerically solve a system of differential equations we need to track the system's change over time, starting at an initial state. This process is called numerical integration and there is a SciPy function for it called odeint. We will learn how to use this function by simulating the 'hello world' of differential equations: the Lorenz system.
Here is the first part of the code where we define the function that describes the dynamics of the system.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from mpl_toolkits.mplot3d import Axes3D
def lorenz(state, t, sigma, beta, rho):
x, y, z = state
dx = sigma * (y - x)
dy = x * (rho - z) - y
dz = x * y - beta * z
return [dx, dy, dz]
We start with some imports. Of course we need NumPy, and odeint is imported from scipy.integrate. Matplotlib will be used to plot the result of our simulation. After that we define the system of differential equations that defines our Lorenz system. It consists of three differential equations that we fit into one function called lorenz. This function needs a specific call signature (lorenz(state, t, sigma, beta, rho)) because we will later pass it to odeint, which expects specific parameters in specific places. Most importantly, the first parameter must be the state of the system. The state of the Lorenz system is defined by three variables: x, y, z. Our state object has to be a sequence with an order that reflects this.
Inside the lorenz function, the first thing we do is unpack the state into the three state variables. This is followed by the three differential equations that describe the dynamic changes of the state variables. The fact that the variable t does not show up in any of these equations is a common point of confusion. The amount of change certainly depends on the amount of time, so why can we ignore t here? The answer is that our numerical integrator will keep track of t for us. For this particular system we could actually build a function that does not take the parameter t, but I include it because it can be useful if you want to add discontinuities that depend on t.
While t does not appear in the equations, sigma, beta & rho do. They are the parameters of the system and the system’s properties depend on them. We will set those parameters next.
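These are the same values that appear in the solve_ivp comparison above:
sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0
p = (sigma, beta, rho)  # parameters in the order the lorenz function expects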
These are the parameters Lorenz himself used and they are known to produce the type of dynamic that the Lorenz system is most known for: the Lorenz attractor. It is important that we store these parameters in a tuple in this exact order because of our function's structure. It must be a tuple rather than another type of collection because odeint expects it. Now that our parameters are defined, we will move on to define the initial values of the system. This is a critical part of solving differential equations. These equations tell us by how much the system state changes but they cannot tell us where to start.
y0 = [1.0, 1.0, 1.0]
Our system will start with all variables at 1.0. Now we can solve the system and plot the result.
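A sketch of the call and the plot follows; the 40 second time window and the 0.01 s step are my assumptions, chosen so that result has 4000 rows as in the comparison above.
t = np.arange(0.0, 40.0, 0.01)  # time points to solve the system for
result = odeint(lorenz, y0, t, args=p)

# Plot the Lorenz attractor in 3D
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot(result[:, 0], result[:, 1], result[:, 2])
plt.show()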
We solve the system with a simple call to odeint, passing it the function that defines our system, the initial state, the time points t for which we want to solve the system and the parameters p. result is a two-dimensional array where the rows are the time points and the columns are the state variables at those time points. And this is how we can solve differential equations with SciPy.
The deep feedforward neural network is the simplest network architecture and a great entry point to many deep learning concepts. Feedforward networks can also be pretty effective for many applications, but they have been replaced by more specialized networks in most areas (for example recurrent neural networks or convolutional neural networks). Here we will build one with pyTorch and we will go from feature selection to training. The dataset is the Titanic challenge, where our model has to predict who survives the sinking of the Titanic. It is the introductory data science competition on Kaggle. We will start by looking at our data.
Missing Values and String to Numerical Conversion
First we need to load the data and find out what we are dealing with. It already comes split into training and test data, both being .csv files. We load both files and take a look at their general structure with .info().
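The loading code is identical to the cleaned-up version later in this post:
import pandas as pd

train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')

train.info()
test.info()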
For each passenger we have 11 features. In the test data, the 'Survived' column is missing, because our goal is to predict it. For the training data we know who survived and we will use this to predict who survived in the test data. Before we can do that, we need to take a closer look at our data. The first issue we encounter is missing values. The training data contains 891 passengers. However, the 'Age' column contains only 714 non-null values. The others are missing. We say they are not-a-number (NaN) and we need to take care of them. If we were to feed NaNs into our model it would completely destroy the calculations. One option is to remove the rows that contain NaNs. This would exclude a large number of rows, probably more than we are comfortable with, considering that there will be other columns with NaNs. Additionally, the test data also have NaNs. Should we just give up on those test passengers and guess?
A better idea is to replace missing values with the column mean. Given no other information, the mean is the best guess. To be extra careful, we can take the column median, which is more resistant to outliers. Importantly, we will replace NaNs in the test data with the median from the training data. This is not strictly necessary for the Titanic toy example but it is really good practice if you ever design an important model that will be deployed on completely unknown data. Generally, we should use the test data exclusively to calculate our test accuracy. Nothing else. It should never touch the training process and we should act like we don't even have access to it until we get to the model testing. There are more complicated ways to replace NaNs but they go beyond this blog post. Let us replace NaNs with the median using the .fillna method. Because we need to do the same preprocessing on the training and test dataset we will loop through them. We will have to do the same for the "Fare" column. There is only one value missing and it is in the test data, but we must fill it.
train_test_datasets = [train, test]
median_age = train["Age"].median()
median_fare = train["Fare"].median()
for dataset in train_test_datasets:
dataset["Age"].fillna(median_age, inplace=True)
dataset["Fare"].fillna(median_fare, inplace=True)
That takes care of "Age" and "Fare" but we cannot apply the same strategy to "Cabin" or "Embarked" because they are nominal columns. Note that their dtype is object instead of a numeric type such as int64. Let us take a look at "Cabin" first.
train["Cabin"]
"""
PassengerId
1 NaN
2 C85
3 NaN
4 C123
5 NaN
887 NaN
888 B42
889 NaN
890 C148
891 NaN
Name: Cabin, Length: 891, dtype: object
"""
type(train["Cabin"][1])
# float
type(train["Cabin"][2])
# str
train["Cabin"].isna().sum()
# 687
Some rows contain strings, others contain floats. The number of NaNs is large at 687. I was tempted to simply delete this column because of the large number of NaNs. I decided to keep it because it might be important. I am no ship expert, but the location of the cabin might determine how accessible lifeboats are and thereby influence survival. Now there are several ways to proceed. I will go with a compromise between simplicity and retaining information. We will extract the first letter, creating a new nominal feature called "Cabin Letter".
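The extraction looks like this (the same line appears in the cleaned-up code further below):
for dataset in train_test_datasets:
    dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)

train["Cabin Letter"].unique()
# eight distinct letters plus NaN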
There are eight different cabin letters and the NaN problem persists. Because we are dealing with a nominal feature, we can simply create a ninth category for the passengers whose cabin we don't know. We will do so as we convert the letters to numbers; for our network we will need to convert everything to floats anyway.
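One way to do that conversion; pd.Categorical assigns the code -1 to missing values:
for dataset in train_test_datasets:
    dataset["Cabin Letter"] = pd.Categorical(dataset["Cabin Letter"]).codes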
Now the NaNs are -1, which is fine for now. Later we will convert all these nominal variables to dummy variables anyway, but we will get to that. We could extract more information from "Cabin" but for this demonstration I will leave it here and delete "Cabin".
for dataset in train_test_datasets:
dataset.drop("Cabin", axis=1, inplace=True)
The final variable that suffers from NaNs is "Embarked". The data in it is less messy so we can go straight to getting the categorical codes.
for dataset in train_test_datasets:
dataset["Embarked"] = pd.Categorical(dataset["Embarked"])
dataset["Embarked"] = dataset["Embarked"].cat.codes
Now we successfully removed all NaN values but some minor issues with our columns remain. "Name", "Sex" and "Ticket" do not have a numerical type. They are of type object. "Name" and "Ticket" probably don't tell us a lot about survival, unless we were to do some serious feature engineering on them. So we will simply drop them.
for dataset in train_test_datasets:
dataset.drop(["Name", "Ticket"], axis=1, inplace=True)
That just leaves us with "Sex". We can use .cat.codes as above to convert it to numbers.
for dataset in train_test_datasets:
dataset["Sex"] = pd.Categorical(dataset["Sex"]).codes
Now let us take one more look at our datasets and make sure we took care of NaNs and everything is numeric.
Everything looks in order. However, there is one more necessary preprocessing step before we can move to standardization. We need to convert categorical features to dummy variables. That means each category in a column gets its own new column and a sample of that category will have a 1 there. Let's see how that looks.
Convert Categories to Dummy Variables
We will use the .get_dummies function on all our categorical columns but first let us try it on the "Pclass" column.
Dummy variables make sense if we just stare at them long enough. What we are looking for is the conversion of each category into an individual column. train["Pclass"] has three categories, [1, 2, 3]. Therefore, we get three dummy variables. A row that has a 1 in "Pclass" will get a 1 in the first column and 0 in the other two. A row that has a 2 will get a 1 in the second column and 0 in the others. A row that has a 3 will get a 1 in the third column and 0 in the others. Now we will use the same function on all our categorical columns, as sketched below. Note that we will not use our usual loop through train and test data because there is no way to generate the dummy variables inplace.
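The attempt could have looked like the following sketch; the column list is taken from the cleaned-up version further below:
categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter", "SibSp"]

train_dummies = pd.get_dummies(train, columns=categorical_cols,
                               prefix=categorical_cols)
test_dummies = pd.get_dummies(test, columns=categorical_cols,
                              prefix=categorical_cols)

train_dummies.shape, test_dummies.shape
# train_dummies ends up with more columns than test_dummies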
The code looks solid but something went wrong. We now have more columns in our train_dummies than in test_dummies. What happened? Let us look at the columns and see if we can spot the issue.
test_dummies is missing three columns: "Embarked_-1", "Embarked_0" and "Cabin Letter_7" (remember that "Survived" is purposely not in the test dataset). This happened because in the test data there were no rows with a missing value in "Embarked", a 0 in "Embarked" or a 7 in "Cabin Letter". This raises another issue: when we converted string categories to numbers with cat.codes, we may have converted letters differently in the test and training datasets. Luckily, we can convert strings directly to dummy variables. This is a great opportunity to take a step back and clean up our code. We will start from scratch.
import pandas as pd
# Load data
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')
# Merge train and test for wrangling and preprocessing
train_test_datasets = [train, test]
"""Data wrangling"""
# Fill missing values, extract the cabin letter and drop unused columns
median_age = train["Age"].median()
median_fare = train["Fare"].median()
for idx, dataset in enumerate(train_test_datasets):
dataset["Age"].fillna(median_age, inplace=True)
dataset["Fare"].fillna(median_fare, inplace=True)
dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)
dataset.drop("Cabin", axis=1, inplace=True)
#dataset["Embarked"] = dataset["Embarked"].cat.codes
dataset.drop(["Name", "Ticket"], axis=1, inplace=True)
categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter","SibSp"]
train_dummies = pd.get_dummies(train,
columns=categorical_cols,
prefix=categorical_cols,
dummy_na=True)
test_dummies = pd.get_dummies(test,
columns=categorical_cols,
prefix=categorical_cols,
dummy_na=True)
train_dummies.shape
# (891, 32)
test_dummies.shape
# (418, 30)
Now we have more columns because we added the dummy_na=True parameter. This gives us a "Sex_nan" column, although there were no NaNs in the original data. This is excessive but won't be an issue for our network. Now we just need to add the one column that is missing from test_dummies. It should contain only zeros, because there were no rows with that category in the test dataset.
import numpy as np
test_dummies["Cabin Letter_T"] = np.zeros(test_dummies.shape[0])
Now our dummy variables are in order and we can move on to balancing the training data.
Balancing the Training Data
Balancing the training data will be important during learning. We want to train our model to predict survival. Imagine the extreme case where we train a model only on passengers that survived. This model would be terrible, because it has no idea what a non-surviving passenger looks like on paper. Now don't worry, this training data has rows on both surviving and non-surviving passengers. However, training data where one of the outcomes is more frequent can bias a model. Let us find out what the situation is with this training data.
total_samples = train_dummies.shape[0] # Number of rows in DataFrame
number_surviving = (train_dummies['Survived'] == 1).sum() # Number of survivors
perc_survivors = (number_surviving / total_samples) * 100
# 38.38383838383838
In this training data 38% of passengers survived. There are no hard rules on how balanced data should be, but I am not happy with that number. If it was 45% I think it would be fine but we should do something about 38%. Because the number of survivors is lower than the non-survivors, we will randomly select as many non-survivors as there are survivors.
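A sketch of the balancing; the names survivors, non_survivors and train_balanced are my own:
survivors = train_dummies[train_dummies["Survived"] == 1]
non_survivors = train_dummies[train_dummies["Survived"] == 0]

# Randomly pick as many non-survivors as there are survivors,
# concatenate both groups and shuffle the rows
train_balanced = pd.concat([survivors,
                            non_survivors.sample(n=survivors.shape[0])])
train_balanced = train_balanced.sample(frac=1)

train_balanced["Survived"].value_counts()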
Now we have 342 survivors and 342 non-survivors. Perfectly balanced, as all things should be. When concatenating like this we must also remember to shuffle the rows with .sample(frac=1), otherwise we might run into problems with unbalanced mini batches later. Now that our data is balanced, we can move on to standardization.
Standardization
Both datasets are now free from missing values, categories are converted to dummy variables and our labels are balanced. Now we can move on to standardization, the process of bringing all features to the same scale. We achieve this by subtracting the mean and dividing by the standard deviation. This has some advantages for neural networks: it can speed up learning and result in better fits. There are also disadvantages, as some features become harder to interpret because they lose their physical units. Because we rarely try to interpret neural networks, we almost always standardize our data. This is how it works with scikit-learn.
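A sketch with StandardScaler; train_balanced comes from the balancing step above and test_scaled is a name I introduce here for the standardized test features:
from sklearn.preprocessing import StandardScaler

y = train_balanced["Survived"].to_numpy()
features = train_balanced.drop("Survived", axis=1)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(features)
# Reuse the training mean and standard deviation for the test data
test_scaled = scaler.transform(test_dummies[features.columns])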
Originally the mean age is 29.23 and the standard deviation is 13.24. After standardization, the mean is near zero because we subtracted it and the standard deviation is near 1 because we divided by it. So our standardization worked. Now all features are on the scale of the standard normal distribution. We are now almost ready to create our actual network. But before that, we will split our training data into a training and a validation set. This will help us avoid overfitting.
Validation Data Split
Using train_test_split we get 547 training and 137 validation samples. test_size specifies the fraction of samples that should go into the test split, which in our case will be used for validation, as we already have a designated test dataset. We are now ready to build our network and start training.
from sklearn.model_selection import train_test_split
X = train_scaled
X_train, X_validate, y_train, y_validate = train_test_split(X, y,
test_size=0.1)
X_train.shape
# (547, 31)
X_validate.shape
# (137, 31)
Building the Network with pyTorch
We start by converting our arrays to tensors. This is the data structure pyTorch will expect as input to the network later.
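A sketch that is consistent with the variable names used in the training loop below; the layer setup follows the description in the next paragraph:
import torch
import torch.nn as nn

# Convert the NumPy arrays from the validation split to tensors
train_features = torch.tensor(X_train)
train_labels = torch.tensor(y_train)
validation_features = torch.tensor(X_validate)
validation_labels = torch.tensor(y_validate)

# 31 input features -> 50 hidden units -> 1 output unit
model = nn.Sequential(nn.Linear(31, 50),
                      nn.ReLU(),
                      nn.Linear(50, 1),
                      nn.Sigmoid())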
This network takes our 31 input features and transforms them to 50 hidden units using a fully connected linear layer. This layer also learns a bias term by default. We then apply the rectified linear unit to introduce some non-linearity. We then transform the 50 hidden units to one output unit, to which we apply the Sigmoid function. This means our deep neural network has one hidden layer with 50 units. The Sigmoid function makes sure that our output is between 0 and 1. This is important because we are making a binary classification of surviving (1) and non-surviving (0) passengers. Now we define our loss function and our learning method.
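A sketch of that definition; the learning rate and weight_decay values are assumptions:
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)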
Our loss function is binary cross entropy loss, which works well for the binary classification problem we are facing. Our learning algorithm is Adam, a variant of gradient descent that also offers L2 regularization (weight_decay). Before we start training we will split our training data into 41 mini batches.
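One way to get the 41 mini batches is torch.tensor_split, which cuts a tensor into a given number of nearly equal parts; the original code may have used a different splitting function.
n_batches = 41
train_features_batched = torch.tensor_split(train_features, n_batches)
train_labels_batched = torch.tensor_split(train_labels, n_batches)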
Training our network on the mini batches instead of the whole data makes learning quicker and can result in models that generalize better. Now we are ready to train the network.
n_epochs = 2000
loss_list = []
validate_loss_list = []
for epoch in range(n_epochs):
    for batch_idx in range(n_batches):
        optimizer.zero_grad()
        outputs = model(train_features_batched[batch_idx].float())
        loss = criterion(outputs.flatten().float(),
                         train_labels_batched[batch_idx].float())
        loss.backward()
        optimizer.step()
    outputs = model(train_features.float())
    validation_outputs = model(validation_features.float())
    loss = criterion(outputs.flatten().float(),
                     train_labels.float())
    validate_loss = criterion(validation_outputs.flatten().float(),
                              validation_labels.float())
    loss_list.append(loss.item())
    validate_loss_list.append(validate_loss.item())
print('Finished Training')
import matplotlib.pyplot as plt
plt.plot(loss_list, linewidth=3)
plt.plot(validate_loss_list, linewidth=3)
plt.legend(("Training Loss", "Validation Loss"))
plt.xlabel("Epoch")
plt.ylabel("BCE Loss")
Loss curves of model training. Yours will look different, as we did not seed the randomness during preprocessing or model initialization.
This code starts by creating empty lists to record the loss on both the training and validation data. n_epochs tells us how many times we train the network on the entire data. The actual learning happens in the inner loop that exposes batch_idx. We train the network after each mini batch. The output is whatever the network currently spits out for a given batch. We then compare that output to the known labels, which gives us our loss. loss.backward() calculates the gradient of the loss with respect to the parameters. Finally, the actual learning happens when we call optimizer.step(), which adjusts the parameters of our model according to the gradients. This is why PyTorch is handy, it takes very good care of the gradients. The loss we are plotting is calculated in the outer loop, where the loss across the entire data instead of the mini batches is calculated. The rest is just plotting. So how did it go? Our network learns very quickly. After 250 epochs the validation loss already stops improving. Everything that comes afterwards is probably overfitting the training data. That means the training loss improves while the validation loss stagnates or becomes worse. The last piece of code we have to write calculates our prediction for the test data and saves it in a way that we can upload it to Kaggle to find out how well we did.
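Here is a sketch of that last piece of code. It assumes the standardized test features are in test_scaled and that the raw test DataFrame (called test here) still has its PassengerId column; both names are placeholders. Kaggle expects a .csv with a PassengerId and a Survived column.
import pandas as pd
test_features = torch.tensor(np.asarray(test_scaled))
with torch.no_grad():
    test_outputs = model(test_features.float())
# Threshold the Sigmoid output at 0.5 to get a 0/1 survival prediction
predicted_survival = (test_outputs.flatten() > 0.5).int().numpy()
submission = pd.DataFrame({'PassengerId': test['PassengerId'],
                           'Survived': predicted_survival})
submission.to_csv('submission.csv', index=False)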
I will upload three different predictions to Kaggle. One from a network that is untrained. To do so I will set n_epochs to 0. This amounts to guessing, because the network is randomly initialized. Then I will upload predictions from a model that stops after 200 epochs. This should be a well trained model. Then I will use a model that stopped after 2000 epochs. This should be slightly overfit. Let’s see how we did.
We got a 0.39952 accuracy score for the untrained model, meaning that 39.952% of our predictions were correct. Pretty bad, but expected from a guessing model. Our model trained for 200 epochs has a 0.70095 score. Much better and a good sign that our model actually learned something. Our model trained for 2000 epochs got a score of 0.74162. Not exactly what we would expect. This probably means that our validation data set was not representative of the test data set. I have also read from other people that overfitting the training data slightly improves test accuracy in the Titanic example. How could we improve our model?
Where to go from here
I decided to use a neural network because I wanted to write about it, but it is not the best model here. A random forest classifier seems to do better. The highest score without cheating is around 0.83, which is achieved with this kind of model. But what could we do to improve our code above? One great improvement would be to use another validation method. Our validation method sets a lot of data aside that the model is never trained on, and we don’t have massive amounts of data to begin with. A cross-validation technique could help us use that data for training. A second improvement would be a more formal way to determine the epoch where we want to stop training. We just looked at the loss curve and picked nice looking spots. Early stopping is a technique that could help us find a more objective point to stop the training. Finally, we could learn more about the Titanic data and get more features out of it. For example, we discarded the cabin number, keeping only the letter. However, the number could be very important. A good way to learn more about the Titanic data is to browse the Kaggle discussion board.
t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often have data where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as well as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.
How to use t-SNE with scikit-learn
We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.
from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt
# Load the MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]
# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")
Two example images from the MNIST dataset.
By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.
X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images
Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, create an instance and call fit_transform on our data.
from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns
# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images
# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
t-SNE example on MNIST subsample.
Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.
t-SNE step by step
The Distance Matrix
The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 784 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 784 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.
Definition of euclidean distance for two features. Image credit licensed under CC BY 4.0 by Kmhkmh.
The above definition of euclidean distance for two features extends to n features (p1, p2, p3, …, pn). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.
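Here is a sketch of that calculation with sklearn’s euclidean_distances. I use squared distances because, as far as I can tell, that is what the exact t-SNE implementation feeds into the joint probability calculation below; the plotting details are my own choices.
from sklearn.metrics.pairwise import euclidean_distances
# Squared euclidean distance matrix of our 1000 samples
distance_matrix = euclidean_distances(X, squared=True)
# The same matrix after sorting the samples by their label
sort_idc = np.argsort(y)
distance_matrix_sorted = euclidean_distances(X[sort_idc, :], squared=True)
# Plot both matrices side by side
fig, ax = plt.subplots(1, 2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by label")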
When the samples are sorted by label, squarish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.
Joint Probabilities
The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each other as “neighbors”. The two are of course related, because nearby samples should have a higher chance of being neighbors than samples that are further apart. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (a distance of 0 means samples are at exactly the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.
t-distributions for sigma of 1 and 2
This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and based on the perplexity, t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point, and as we transform that row with the t-distribution we get a probability distribution P. The perplexity is defined as 2^H(P), where H is the Shannon entropy. Different values for sigma result in different distributions, which differ in entropy and therefore differ in perplexity.
Distributions of joint probabilities for different values of sigma.
Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far-away points being chosen as neighbors. Therefore, perplexity is commonly interpreted as a measure for the effective number of neighbors of a sample. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities. We will use sklearn's internal function to do it.
from sklearn.manifold import _t_sne
perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
                                desired_perplexity=perplexity,
                                verbose=False)
As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric along its diagonal and the diagonal is all zeros. So we only keep the upper triangle of the matrix in a flat array p. That’s all we need from the joint probabilities, so we can move on to the next step.
Optimize Embedding with Gradient Descent
Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as well as possible. First we need to initialize our low dimensional embedding. By default, sklearn will use a random initialization, so that’s what we will use. Once we have initialized our embedding, we will optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilities of the data and their embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function with the initial embedding and the joint probabilities of the data.
# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
                                    n_components).astype(np.float32)
embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D
# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
             'degrees_of_freedom': 1,
             'n_samples': 1000,
             'n_components': 2}
# Perform gradient descent
embedding_done = _t_sne._gradient_descent(_t_sne._kl_divergence,
                                          embedding_init,
                                          0,     # iteration to start from
                                          1000,  # maximum number of iterations
                                          kwargs=kl_kwargs)
# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)
# Convert to DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
                               'tsne_2': tsne_result[:,1],
                               'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the embedding or the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for is similarly well defined clusters. We could have gone a bit deeper here and there. For example, we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.
A DataFrame is a spreadsheet like data structure. We can think of it as a collection of rows and columns. This row-column structure is useful for many different kinds of data. The most widely used DataFrame implementation in Python is from the Pandas package. First we will learn how to create DataFrames. We will also learn how to do some basic data analysis with them. Finally, we will compare the DataFrame to the ndarray data structure and learn why data frames are useful in other packages such as Seaborn.
How to Create a DataFrame
There are two major ways to create a DataFrame. We can directly call DataFrame() and pass it data in a dictionary, list or array. Alternatively, we can use several functions to load data from a file directly into a DataFrame. While it is very common in data science to load data from a file, there are also many occasions where we need to create a DataFrame from other data structures. We will first learn how to create a DataFrame from a dictionary.
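Here is a minimal sketch of the dictionary route. The column and row names mirror the small .csv example used later in this post.
import pandas as pd
data = {'Frequency': [20, 50, 8],
        'Location': [2, 3, 1],
        'Cell Type': ['Interneuron', 'Interneuron', 'Pyramidal']}
row_names = ['C1', 'C2', 'C3']
df = pd.DataFrame(data, index=row_names)
df.columns
# Index(['Frequency', 'Location', 'Cell Type'], dtype='object')
df.index
# Index(['C1', 'C2', 'C3'], dtype='object')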
In our dictionary the keys are used as the column names. The data under each key then becomes the column. The row names are defined separately by passing a collection to the index parameter of DataFrame. We can get column and row names with the columns and index attributes.
These names are useful because they give us a descriptive way of indexing into columns and rows. If we use indexing syntax on the DataFrame, we can get individual columns.
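For example, df['Frequency'] gives us the frequency column as a Series. We can also build the same kind of DataFrame from a NumPy array instead of a dictionary; this sketch reuses the values from above.
df['Frequency']
# C1    20
# C2    50
# C3     8
# Name: Frequency, dtype: int64
import numpy as np
# Creating a DataFrame from a two-dimensional array instead of a dictionary
arr = np.array([[20, 2],
                [50, 3],
                [8, 1]])
column_names = ['Frequency', 'Location']
df_from_array = pd.DataFrame(arr, columns=column_names, index=row_names)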
If we create the DataFrame from a NumPy array instead, there are no dictionary keys that could be used to infer the column names. This means we need to pass the column_names to the columns parameter. Almost anything that structures our data in a two-dimensional way can be used to create a DataFrame. Next we will learn about functions that allow us to load different file types as a DataFrame.
Loading Files as a DataFrame
The list of file types Pandas can read and write is rather long and you can find it here. I only want to cover the most commonly used .csv file here. They have the particular advantage that they can also be read by humans, because they are essentially text files. They are also widely supported by a variety of languages and programs. First, let’s create our file. Because it is a text file, we can write a literal string to file.
text_file = open("example.csv", "w")
text_file.write(""",Frequency,Location,Cell Type
C1,20,2,Interneuron
C2,50,3,Interneuron
C3,8,1,Pyramidal""")
text_file.close()
In this file columns are separated by commas and rows are separated by new lines. This is what .csv means, it stands for comma-separated values. To load this file into a DataFrame we need to pass read_csv the file name and tell it which column contains the row names. Pandas assumes by default that the first row contains the column names.
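A sketch of the loading step; index_col=0 tells read_csv that the first column holds the row names.
df = pd.read_csv("example.csv", index_col=0)
df.shape
# (3, 3)
df.columns
# Index(['Frequency', 'Location', 'Cell Type'], dtype='object')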
There are many more parameters we can specify for read_csv in case we have a file that is structured differently. In fact we can load files that have a value delimiter other than the comma, by specifying the delimiter parameter.
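As a sketch, we can write the same table with '-' as the separator to a hypothetical file called example_dash.csv and load it by setting the delimiter parameter.
text_file = open("example_dash.csv", "w")
text_file.write("""-Frequency-Location-Cell Type
C1-20-2-Interneuron
C2-50-3-Interneuron
C3-8-1-Pyramidal""")
text_file.close()
df_dash = pd.read_csv("example_dash.csv", delimiter='-', index_col=0)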
We specify '-' as the delimiter and it also works. Although the function is called read_csv, it is not strictly bound to comma-separated values. We can also skip rows, columns and specify many more options you can learn about from the documentation. For well structured .csv files however, we need very few arguments as shown above. Next we will learn how to do basic calculations with the DataFrame.
Basic Math with DataFrame
A variety of functions such as df.mean(), df.median() and df.std() are available to do basic statistics on our DataFrame. By default they all return values per column. That is because columns are assumed to contain our variables (or features) and each row contains a sample.
One big advantage of columns is that within a column the data type is clearly defined. Within a row, on the other hand, different data types can exist. In our case we have two numeric columns and one containing strings. When we call these statistical methods, non-numeric columns are ignored. In our case that is 'Cell Type'. Technically we can also use the axis parameter to calculate these statistics for each sample, but this is not always useful and the non-numeric column again has to be ignored.
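Here is what that looks like for our small DataFrame. In newer Pandas versions we have to pass numeric_only=True explicitly; older versions silently skipped the non-numeric column.
df.mean(numeric_only=True)
# Frequency    26.0
# Location      2.0
# dtype: float64
df.std(numeric_only=True)
# Frequency    21.633308
# Location      1.000000
# dtype: float64
# With axis=1 we get one value per row (per sample) instead
df.mean(axis=1, numeric_only=True)
# C1    11.0
# C2    26.5
# C3     4.5
# dtype: float64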
Often these operations make more sense for individual columns. As explained above, we can use indexing to get individual columns and we can even assign the results to an existing or new column.
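For example, we could take the mean of a single column or store a normalized version of the frequency in a new column; the new column name is just an illustration.
df['Frequency'].mean()
# 26.0
df['Frequency Normalized'] = df['Frequency'] / df['Frequency'].max()
df['Frequency Normalized']
# C1    0.40
# C2    1.00
# C3    0.16
# Name: Frequency Normalized, dtype: float64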
If you are familiar with NumPy, most of these DataFrame operations will seem very familiar because they mostly work like array operations. Because Pandas builds on NumPy, most NumPy functions (for example np.sin) work on numeric columns. I don’t want to go deeper and instead move on to visualizing DataFrames with Seaborn.
Seaborn for Data Visualization
Seaborn is a high-level data visualization package that builds on Matplotlib. It does not necessarily require a DataFrame. It can work with other data structures such as ndarray, but it is particularly convenient with a DataFrame. First, let us get a more interesting data set. Luckily Seaborn comes with some nice example data sets and they conveniently load into a Pandas DataFrame.
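For example, the iris data set loads directly into a DataFrame like this.
import seaborn as sns
iris = sns.load_dataset('iris')
iris.shape
# (150, 5)
iris.columns
# Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], dtype='object')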
The Iris data set contains information about different species of iris plants. It contains 150 samples and 5 features. The 'species' feature tells us what species a particular sample belongs to. The names of those columns are very useful when we structure our plots in Seaborn. Let’s first try a basic bar graph.
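A sketch of such a bar graph; I put sepal_length on the y-axis here, but any of the numeric columns would work.
import matplotlib.pyplot as plt
sns.set()  # apply the Seaborn style
fig, ax = plt.subplots(1)
sns.barplot(data=iris, x='species', y='sepal_length', ax=ax)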
We use sns.barplot and we have to pass our DataFrame to the data parameter. Then for x and y we define which column name should appear there. We put 'species' on the x-axis so that is how data is aggregated inside the bars. Setosa, versicolor and virginica are the different species. The sns.set() function defines multiple parameters of Seaborn and forces a certain style on the plots that I personally prefer. Bar graphs have grown out of fashion and for good reason. They are not very informative about the distribution of their underlying values. I prefer the violin plot to get a better idea of the distribution.
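The violin plot only requires swapping the plotting function; again, sepal_length on the y-axis is my choice.
fig, ax = plt.subplots(1)
sns.violinplot(data=iris, x='species', y='sepal_length', ax=ax)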
We even get a small box plot within the violin plot for free. Seaborn works its magic through the DataFrame column names. This makes plotting more convenient but also makes our code more descriptive than it would be with pure NumPy. Our code literally tells us, that 'species' will be on the x-axis.
Summary
We learned that we can create a DataFrame from a dictionary or another kind of collection. The most important features are the column and row names. Columns organize features and rows organize samples by convention. We can also load files into a DataFrame. For example we can use read_csv to load .csv or other text based files. We can also use methods like df.mean() to get basic statistics of our DataFrame. Finally, Seaborn is very useful to visualize a DataFrame.
Creating great looking animations in Julia is shockingly easy thanks to the Plots package and some macro magic. Here we will learn how to turn data into high quality animations. We will learn about the @animate macro, frames and the gif function.
Two Steps to Animations
To create animations we simply generate frames with the @animate macro and then generate a file with the gif function. Both are part of the Plots package in Julia, so we have to start with using Plots to make that package available. Next we create the data we will plot, a simple sine wave. Then we generate the frames with the @animate macro and animate them with the gif function. This is the code and the resulting animation.
using Plots
x = collect(1:0.1:30)
y = sin.(x)
df = 2
anim = @animate for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
end
gif(anim, "tutorial_anim_fps30.gif", fps = 30)
Macros and Meta-Programming
The @animate macro deserves some extra attention, because it looks like magic. Macros are related to a concept called meta-programming. In Julia, all code is a data structure that can be manipulated in a similar way to all other data structures. This effectively means that we can write code that manipulates our code. That’s what a macro is, a function that modifies code. In our case, the code being modified is the for loop behind our @animate macro. It is modified in a way that it catches the frame at the end of each loop iteration and saves it into anim. We can create code that does the same job ourselves.
anim = Plots.Animation()
for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
    Plots.frame(anim)
end
gif(anim, "tutorial_anim_fps30.gif", fps = 30)
We use the Plots.Animation() function to create our animation object where we will store our frames. During the for loop we then call Plots.frame(anim) to store the frame after each iteration in our anim object. These are the essential steps that the @animate macro takes care of. If you want to learn what the macro does in detail you can call @macroexpand on it.
@macroexpand @animate for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
end
There is another macro that is even more convenient. The @gif macro. It saves us from having to call gif() on our anim object.
@gif for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
end
This directly displays the animation if an interactive Julia session is available. The downside is that we do not save the animation to disk and we do not have an anim object available to do more animations later. It is most useful to quickly troubleshoot animations interactively.
Beyond plot()
The @animate macro supports animations of anything that can be plotted with Plots. For example, we can animate a heatmap.
anim = @animate for i = 1:100
    mat = rand(0:100, 32, 32)
    heatmap(mat, clim=(0,255))
end
gif(anim, "tutorial_heatmap_anim.gif", fps = 10)
The frame that is being caught is the state of the active figure at the end of each for loop iteration. The for loop itself gives us a great deal of control over how many frames we want to create. For example, in the previous examples I skipped frames with the df variable.