Balanced spiking neural networks with NumPy

Balanced spiking neural networks are a cornerstone of computational neuroscience. They make for a nice introduction to spiking neuronal networks and they can be trained to store information (Nicola & Clopath, 2017). Here I will present a Python port I made of a MATLAB implementation by Nicola & Clopath and I will go through some of its features. We only need NumPy.

import numpy as np
from numpy.random import rand, randn
import matplotlib.pyplot as plt


def balanced_spiking_network(dt=0.00005, T=2.0, tref=0.002, tm=0.01,
                             vreset=-65.0, vpeak=-40.0, n=2000, 
                             td=0.02, tr=0.002, p=0.1, 
                             offset=-40.00, g=0.04, seed=100, 
                             nrec=10):
    """Simulate a balanced spiking neuronal network

    Parameters
    ----------
    dt : float
        Sampling interval of the simulation.
    T : float
        Duration of the simulation.
    tref : float
        Refractory time of the neurons.
    tm : float
        Time constant of the neurons.
    vreset : float
        The voltage neurons are set to after a spike.
    vpeak : float
        The voltage above which a spike is triggered
    n : int
        The number of neurons.
    td : float
        Synaptic decay time constant.
    tr : float
        Synaptic rise time constant.
    p : float
        Connection probability between neurons.
    offset : float
        A constant input into all neurons.
    g : float
        Scaling factor of synaptic strength
    seed : int
        The seed makes NumPy random number generator deterministic.
    nrec : int
        The number of neurons to record.

    Returns
    -------
    ndarray
        A 2D array of recorded voltages. Rows are time points,
        columns are the recorded neurons. Shape: (int(T/dt), nrec).
    """

    np.random.seed(seed)  # Seeding randomness for reproducibility

    """Setup weight matrix"""
    w = g * (randn(n, n)) * (rand(n, n) < p) / (np.sqrt(n) * p)
    # Set the row mean to zero
    row_means = np.mean(w, axis=1, where=np.abs(w) > 0)[:, None]
    row_means = np.repeat(row_means, w.shape[0], axis=1)
    w[np.abs(w) > 0] = w[np.abs(w) > 0] - row_means[np.abs(w) > 0]

    """Preinitialize recording"""
    nt = round(T/dt)  # Number of time steps
    rec = np.zeros((nt, nrec))

    """Initial conditions"""
    ipsc = np.zeros(n)  # Post synaptic current storage variable
    hm = np.zeros(n)  # Storage variable for filtered firing rates
    tlast = np.zeros(n)  # Used to set the refractory times
    v = vreset + rand(n)*(30-vreset)  # Initialize neuron voltage

    """Start integration loop"""
    for i in np.arange(0, nt, 1):
        inp = ipsc + offset  # Total input current

        # Voltage equation with refractory period
        # Only change if voltage outside of refractory time period
        dv = (dt * i > tlast + tref) * (-v + inp) / tm
        v = v + dt*dv

        index = np.argwhere(v >= vpeak)[:, 0]  # Spiked neurons

        # Get the weight matrix column sum of spikers
        if len(index) > 0:
            # Compute the increase in current due to spiking
            jd = w[:, index].sum(axis=1)

        else:
            jd = 0*ipsc

        # Used to set the refractory period of LIF neurons
        tlast = (tlast + (dt * i - tlast) *
                 np.array(v >= vpeak, dtype=int))

        ipsc = ipsc * np.exp(-dt / tr) + hm * dt

        # Integrate the current
        hm = (hm * np.exp(-dt / td) + jd *
              (int(len(index) > 0)) / (tr * td))

        v = v + (30 - v) * (v >= vpeak)

        rec[i, :] = v[0:nrec]  # Record the voltages of the first nrec neurons
        v = v + (vreset - v) * (v >= vpeak)

    return rec


if __name__ == '__main__':
    rec = balanced_spiking_network()
    """PLOTTING"""
    fig, ax = plt.subplots(1)
    ax.plot(rec[:, 0] - 100.0)
    ax.plot(rec[:, 1])
    ax.plot(rec[:, 2] + 100.0)
Three neurons in the balanced spiking neural network.

The weight matrix

At the core of any balanced network is the weight matrix. We define it at the start of the function, in the block commented "Setup weight matrix". Initializing it from a normal distribution and then subtracting the row mean makes sure that excitation and inhibition are in balance. That is what keeps the network spiking irregularly although the input to the network remains constant. The constant input to the network is the offset parameter.
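Here is a quick sanity check of that balance, repeating the construction from the function above with its default parameters: after subtracting the row mean from the nonzero entries, the summed recurrent input each neuron receives averages out to zero.

import numpy as np
from numpy.random import rand, randn

n, p, g = 2000, 0.1, 0.04
w = g * randn(n, n) * (rand(n, n) < p) / (np.sqrt(n) * p)
mask = np.abs(w) > 0
row_means = np.mean(w, axis=1, where=mask)[:, None]
w[mask] -= np.repeat(row_means, n, axis=1)[mask]

# Row means of the nonzero entries are now zero up to floating point noise
np.abs(np.mean(w, axis=1, where=mask)).max()
# on the order of machine precision (~1e-17)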

Refractory period

The refractory period is a time window during which no action potential can be generated. We achieve this by setting the voltage to a low value right after the spike and then not updating the voltage of that neuron for a given time. This time window is given by tref. The voltage update happens in the line that computes dv, where the expression (dt * i > tlast + tref) checks how long ago the last spike occurred. Therefore, we need to track the most recent spike time with tlast. Of course we have some other things to do when a neuron reaches the spiking threshold vpeak. First we set the voltage to a value well above the threshold with v = v + (30 - v) * (v >= vpeak). This is purely visual, to give the recording a spiky appearance. Right after recording, we set the voltage to its reset value vreset.
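Here is the masking trick in isolation, as a toy single-neuron sketch (the input value is made up so the neuron actually crosses threshold): the factor (dt * i > tlast + tref) is 0 during the refractory window, which zeroes dv and freezes the voltage at vreset.

# Toy single-neuron version of the refractory mask
dt, tref, tm = 0.00005, 0.002, 0.01
vreset, vpeak, inp = -65.0, -40.0, -20.0  # inp chosen so v crosses vpeak
v, tlast = vreset, -tref  # start outside the refractory window

for i in range(1000):
    dv = (dt * i > tlast + tref) * (-v + inp) / tm  # 0 while refractory
    v = v + dt * dv
    if v >= vpeak:
        tlast = dt * i  # remember the spike time; v now stays put for tref
        v = vreset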

Play around with some of the parameters. You can find the code here: https://gist.github.com/danielmk/9adc7409f40a076ffec0cdf85dea4519

Spiking neural networks for a low-energy future

Spiking neural networks (SNNs) have some disadvantages compared to artificial neural networks (ANNs) but they have the potential to run for a fraction of the energy. Whether SNNs will be able to replace ANNs and how much energy they will be using depends on many engineering and neuroscience advances. Here I will go through some of the technical background of the SNN energy advantages and some of the current numbers.

Energy efficient SNN features

The energy efficiency of SNNs comes primarily from two features. Firstly, the spike is a discrete event and energy is only used when a spike occurs. This is probably the most fundamental feature that distinguishes SNNs from ANNs. It means that the energy consumption of an SNN depends not only on the number of neurons but also on the number of spikes the model requires to perform its task. The second feature is local memory. At the heart of all models are parameters. On traditional hardware such as CPUs and GPUs, the part of the chip that performs the calculations is not the same part that stores the parameters. Loading the parameters onto the chip is much more energy intensive than the computation itself. Therefore, when the parameters can be stored locally on the part of the chip that computes, large efficiency gains result. This is not something unique to spiking neural networks. Some tensor processing units (TPUs) also feature local memory, and they are specifically designed for ANNs. When most people speak about the energy advantages of SNNs, they assume local memory.
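To make the first point concrete, here is some back-of-envelope arithmetic. All constants are illustrative assumptions, not measurements: the point is only that a dense ANN layer pays for every multiply-accumulate while an SNN pays per synaptic event, so sparse activity translates directly into fewer operations.

# Illustrative only: assumed counts, not measured hardware numbers
n_neurons = 1000       # neurons in a layer
fan_out = 100          # synapses per neuron
spike_fraction = 0.05  # fraction of neurons that spike in a given step

ann_ops = n_neurons * fan_out                   # dense: every weight is used
snn_ops = n_neurons * spike_fraction * fan_out  # sparse: only spikers propagate

ann_ops / snn_ops
# 20.0 -> 20x fewer operations at 5% activity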

SNNs also require specialized hardware to run efficiently. That hardware is called neuromorphic. It makes efficient use of the binary nature of spikes and of local memory. Neuromorphic hardware is so far only available for research purposes, and making it more widely available will be one of the challenges for SNN adoption. Next come some numbers on energy efficiency.

How efficient are we talking?

How much more efficient SNNs are depends on many factors of the comparison: the task, the model architectures and the hardware. Making projections into the future is even harder, since machine learning advances come quickly for both SNNs and ANNs. Projecting the absolute amount of energy that could be saved is harder still, because it requires predictions of AI demand, which can change non-linearly with technical advances. I would be interested in finding formal work on some of these uncertainties, or working on some myself, but for now here are some numbers.

The Loihi processor from Intel Labs is a recent piece of neuromorphic hardware. Depending on the size of their example problem they find that Loihi is 2.58x, 8.08x or 48.74x more energy efficient than a 1.67-GHz Atom CPU (Davies et al. 2018).

Yin et al. (2020) present a method to train SNNs (backpropagation of surrogate gradients). They calculate the theoretical energy consumption for a spiking recurrent network trained with their method and for some ANN architectures. Depending on the task, their SNN was 126.2x, 935x, 1602x, 1776x or 3353.3x more efficient than a Long Short-Term Memory network (LSTM; the exact factor also depends on some details of the LSTM implementation). Their network was 41.3x more efficient than a recurrent ANN. There is a talk by the last author, Sander Bohte, where he summarizes the findings as >100x more efficient than the best recurrent ANN and 1000x more efficient than the LSTM. All their calculations assume local memory.

Panda et al. (2019) tried several methods to generate SNNs for image classification and calculated their theoretical energy consumption. They estimate SNN efficiency advantages of 6.52x, 7.7x, 10.6x, 74.9x, 81.3x or 104.8x, depending on model architecture and parameter space.

Merolla et al. (2014) present the TrueNorth neuromorphic architecture. They compare synaptic operations per second (SOPS) of their architecture to floating-point operations per second (FLOPS) of traditional chips. They state that TrueNorth can deliver 46 billion SOPS per watt, while the most energy-efficient supercomputer at the time of their writing delivered 4.5 billion FLOPS per watt.

These numbers highlight the potential for some massive energy savings but benchmarks are always complicated. Making good comparisons can be hard, especially since the unit of computational efficiency is fundamentally different. Either way, SNNs on neuromorphic hardware are extremely energy efficient but to truly save energy they must become better at the tasks ANNs already solve.

References

Davies et al. 2018. Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro. 10.1109/MM.2018.112130359

Yin, Corradi & Bohte 2020. Effective and Efficient Computation with Multiple-timescale Spiking Recurrent Neural Networks. https://arxiv.org/abs/2005.11633

Panda, Aketi & Roy 2019. Towards Scalable, Efficient and Accurate Deep Spiking Neural Networks with Backward Residual Connections, Stochastic Softmax and Hybridization. https://arxiv.org/abs/1910.13931

Merolla et al. 2014. A million spiking-neuron integrated circuit with a scalable communication network and interface. https://science.sciencemag.org/content/345/6197/668

Spiking Neuronal Networks in Python

Spiking neural networks (SNNs) turn some input into an output much like artificial neural networks (ANNs), which are already widely used today. Both achieve the same goal in different ways. The units of an ANN are single floating-point numbers that represent the activity levels of the units for a given input. Neuroscientists loosely understand this number as the average spike rate of a unit. In ANNs this number is usually the result of multiplying the input with a weight matrix. SNNs work differently in that they simulate units as spiking point processes. How often and when a unit spikes depends on the input and the connections between neurons. A spike causes a discontinuity in other connected units. SNNs are not yet widely used outside of laboratories but they are important to help neuroscientists model brain circuitry. Here we will create spiking point models and connect them. We will be using Brian2, a Python package to simulate SNNs. You can install it with either

conda install -c conda-forge brian2
or
pip install brian2

We will start by defining our spiking unit model. There are many different models of spiking. Here we will define a conductance-based version of the leaky integrate-and-fire model.

from brian2 import *
import numpy as np
import matplotlib.pyplot as plt

start_scope()

# Neuronal Parameters
c = 100*pF
vl = -70*mV
gl = 5*nS

# Synaptic Parameters
ge_tau = 20*ms
ve = 0*mV
gi_tau = 100*ms
vi = -80*mV
w_ge = 1.0*nS
w_gi = 0.0*nS

lif = '''
dv/dt = -(gl * (v - vl) + ge * (v - ve) + gi * (v - vi) - I)/c : volt
dge/dt = -ge / ge_tau : siemens
dgi/dt = -gi / gi_tau : siemens
I : amp
'''

The intended way to import Brian2 is from brian2 import *. This feels a bit dangerous but it is very important to keep the code readable, especially when dealing with physical units. Speaking of units, each of our parameters has a physical unit. We have picofarad (pF), millivolt (mV) and nanosiemens (nS). These are all imported from Brian2 and we assign units with the multiplication operator *. To find out what each of those parameters does, we can look at our actual model. The model is defined by the string we assign to lif. The first line of the string is the model of our spiking unit:

dv/dt = -(gl * (v - vl) + ge * (v - ve) + gi * (v - vi) - I)/c : volt

This is a differential equation that describes the change of voltage with respect to time. In electrical terms, voltage is changed by currents. These physical terms are not strictly necessary for the computational function of SNNs but they are helpful to understand the biological background. Four currents flow in our model.

The first one is gl * (v - vl). This current is determined by the conductance (gl), the current voltage (v) and the equilibrium voltage (vl). It flows whenever the voltage differs from vl; gl is just a scaling constant that determines how strong the drive towards vl is. vl is called the resting potential because v does not change while it is equal to vl. This term makes our model a leaky integrate-and-fire model, as opposed to a plain integrate-and-fire model. Of course the voltage only remains at rest if it is not otherwise changed, and three other currents can change it. Two of them correspond to excitatory and inhibitory synaptic input. ge * (v - ve) drives the voltage towards ve. Under most circumstances this will increase the voltage, as ve equals 0mV. On the other hand, gi * (v - vi) drives the voltage towards vi, which at -80mV is even slightly more negative than vl. This current keeps the voltage low. The last current is I, the input current the model receives; we will define it later for each neuron. Finally, currents don't change the voltage immediately. They are all slowed down by the capacitance c, which is why we divide the sum of all currents by c.
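To build some intuition for the leak term on its own, here is a minimal NumPy sketch outside of Brian2, using the same parameter values in SI units: with no synaptic input and no injected current, v relaxes exponentially back to vl with time constant c/gl = 20ms.

# Leak current in isolation: v decays to vl with time constant c/gl = 20 ms
import numpy as np

c, gl, vl = 100e-12, 5e-9, -70e-3  # farad, siemens, volt (values as above)
dt, steps = 1e-4, 1000
v = -50e-3                         # start depolarized
trace = np.empty(steps)
for i in range(steps):
    v += dt * (-gl * (v - vl)) / c  # dv/dt = -gl*(v - vl)/c
    trace[i] = v
# after 100 ms (= 5 time constants) v is within ~1% of vl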

There are two more differential equations that describe our model:

dge/dt = -ge / ge_tau : siemens
dgi/dt = -gi / gi_tau : siemens

These describe the change of the excitatory and inhibitory synaptic conductances. We did not yet implement the way spiking changes ge or gi. However, these equations tell us that both ge and gi will decay towards zero with the time constants ge_tau and gi_tau. So far so good, but what about spiking? That comes up next, when we turn the string that represents our model into something that Brian2 can actually simulate.

G = NeuronGroup(3, lif, threshold='v > -40*mV',
                reset='v = vl', method='euler')
G.I = [0.7, 0.5, 0]*nA
G.v = [-70, -70, -70]*mV

NeuronGroup creates three units for us that are defined by the equations in lif. The threshold parameter gives the condition to register a spike. In this case, we register a spike when the voltage is larger than -40mV. When a spike is registered, the event defined by reset is triggered: we set the voltage back to the resting voltage vl. The method parameter gives the integration method used to solve the differential equations.

Once our units are defined, we can interface with some parameters of the neurons. For example, G.I = [0.7, 0.5, 0]*nA sets the input current of the zeroth neuron to 0.7nA, the first neuron to 0.5nA and the last neuron to 0nA. Not all parameters are accessible like this. I is available because we defined I : amp in our lif string. This says that I is available for changes. Next, we define the initial state of our neurons. G.v = [-70, -70, -70]*mV sets all of them to the resting voltage. A good place to start.

You might be disappointed by this implementation of spiking. Where is the sodium? Where is the amplification of depolarization? The leaky integrate-and-fire model doesn’t feature a spiking mechanism, except for the discontinuity at the threshold. If you are interested in incorporating spike-like mechanisms you should look for the exponential leaky integrate-and-fire or a Hodgkin-Huxley like model.
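For reference, here is a sketch of how an exponential leaky integrate-and-fire model string might look in Brian2. The parameters delta_t (slope factor) and vt (soft threshold) are new here and their values are my own guesses, not part of the model above:

delta_t = 2*mV   # slope factor of the spike upstroke (assumed value)
vt = -50*mV      # soft threshold where the exponential takes over (assumed)

exp_lif = '''
dv/dt = -(gl * (v - vl) - gl * delta_t * exp((v - vt)/delta_t) + ge * (v - ve) + gi * (v - vi) - I)/c : volt
dge/dt = -ge / ge_tau : siemens
dgi/dt = -gi / gi_tau : siemens
I : amp
'''

The exponential term is negligible far below vt and grows explosively above it, producing a soft spike upstroke instead of a hard threshold.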

We are only missing one more ingredient for an actual network model: the synapses. And we are in a great position, because our model already defines the excitatory and the inhibitory conductances. Now we just need to make use of them.

Se = Synapses(G, G, on_pre='ge_post += w_ge')
Se.connect(i=0, j=2)

Si = Synapses(G, G, on_pre='gi_post += w_gi')
Si.connect(i=1, j=2)

First we create an excitatory connection from our neurons onto themselves. The connection is excitatory because it increases the conductance ge of the postsynaptic neuron by w_ge. We then call Se.connect(i=0, j=2) to define which neurons actually connect. In this case, the zeroth neuron connects to the second neuron. This means, spikes in the zeroth neuron cause ge to increase in the second neuron. We then create the inhibitory connection and make the first neuron inhibit the second neuron. Now we are ready to run the actual simulation and plot the result. Remember that for now w_gi = 0.0*nS, meaning that we will only see the excitatory connection.

M = StateMonitor(G, 'v', record=True)

run(100*ms)

fig, ax = plt.subplots(1)
ax.plot(M.t/ms, M.v[0]/mV)
ax.plot(M.t/ms, M.v[1]/mV)
ax.plot(M.t/ms, M.v[2]/mV)
ax.set_xlabel('time (ms)')
ax.set_ylabel('voltage (mV)')
ax.legend(('N0', 'N1', 'N2'))
Voltage traces from three neurons. Spiking of N0 increases voltage in N2. Spiking of N1 does nothing in this simulation because its weight onto N2 was set to 0nS.

The first two neurons are regularly spiking. N0 is slightly faster because it receives a larger input current than N1. N2 on the other hand rests at -70mV because it does not receive an input current. When N0 spikes, it causes the voltage of N2 to increase as we expected, because N0 increases the excitatory conductance. N1 is not doing anything here, because its weight is set to 0. Continued activity of N0 eventually causes N2 to reach the threshold of -40mV, making it register its own spike and reset the voltage. What happens if we introduce the inhibitory weight?

w_gi = 0.5*nS
run(100*ms)

fig, ax = plt.subplots(1)
ax.plot(M.t/ms, M.v[0]/mV)
ax.plot(M.t/ms, M.v[1]/mV)
ax.plot(M.t/ms, M.v[2]/mV)
ax.set_xlabel('time (ms)')
ax.set_ylabel('voltage (mV)')
ax.legend(('N0', 'N1', 'N2'))
The same as above but now N1 has a weight of 0.5nS. This inhibits N2 and prevents a spike.

With a weight of 0.5nS, N1 inhibits N2 to an extent that prevents the spike. We have now built a network of leaky integrate-and-fire neurons that features both excitatory and inhibitory synapses. This is just the start towards a functional network that does something interesting with its input. We will need to increase the number of neurons, decide on a connectivity rule between them, initialize or even learn the weights, decide on a coding scheme for the output and much more. Many of these topics I will cover in later blog posts, so stay tuned.
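As a first taste of scaling up, Brian2 can connect a whole group probabilistically instead of pair by pair. A sketch (the group size, the 10% connection probability and the random weights are arbitrary choices of mine):

# Sketch: a larger group with random 10% connectivity and random weights
n = 100
G_big = NeuronGroup(n, lif, threshold='v > -40*mV',
                    reset='v = vl', method='euler')
G_big.v = -70*mV

# Per-synapse weight w_syn; no self-connections
S = Synapses(G_big, G_big, 'w_syn : siemens', on_pre='ge_post += w_syn')
S.connect(condition='i != j', p=0.1)
S.w_syn = 'rand() * nS'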

Differential Equations with SciPy – odeint or solve_ivp

SciPy features two different interfaces to solve differential equations: odeint and solve_ivp. The newer one is solve_ivp and it is the recommended interface, but odeint is still widespread, probably because of its simplicity. Here I will go through the differences between the two, with a focus on moving to the more modern solve_ivp interface. Its primary advantage is that it offers several methods for solving differential equations, whereas odeint is restricted to one. We get started by setting up our system of differential equations and some parameters of the simulation.

import numpy as np
from scipy.integrate import odeint, solve_ivp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def lorenz(t, state, sigma, beta, rho):
    x, y, z = state
    
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    
    return [dx, dy, dz]

sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0

p = (sigma, beta, rho)  # Parameters of the system

y0 = [1.0, 1.0, 1.0]  # Initial state of the system

We will be using the Lorenz system. We can directly move on to solving the system with both odeint and solve_ivp.

t_span = (0.0, 40.0)
t = np.arange(0.0, 40.0, 0.01)

result_odeint = odeint(lorenz, y0, t, p, tfirst=True)
result_solve_ivp = solve_ivp(lorenz, t_span, y0, args=p)

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(result_odeint[:, 0],
        result_odeint[:, 1],
        result_odeint[:, 2])
ax.set_title("odeint")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(result_solve_ivp.y[0, :],
        result_solve_ivp.y[1, :],
        result_solve_ivp.y[2, :])
ax.set_title("solve_ivp")
Simulation results from odeint and solve_ivp.

The first thing that sticks out is that the solve_ivp solution is less smooth. That is because it is calculated at fewer time points, which in turn has to do with the difference between t_span and t. The odeint interface expects t, an array of time points for which we want to calculate the solution. The temporal resolution of the system is given by the interval between time points. The solve_ivp interface on the other hand expects t_span, a tuple that gives the start and end of the simulation interval. solve_ivp determines the temporal resolution by itself, depending on the integration method and the desired accuracy of the solution. We can confirm that the temporal resolution of solve_ivp is lower in this example by inspecting the output of both functions.

t.shape
# (4000,)

result_odeint.shape
# (4000, 3)

result_solve_ivp.t.shape
# (467,)

The t array has 4000 elements and therefore the result of odeint has 4000 rows, each row being a time point defined by t. The result of solve_ivp is different. It has its own time array as an attribute, and here it has only 467 elements. This tells us that solve_ivp indeed calculated fewer time points, lowering the temporal resolution. So how can we increase the number of time points in solve_ivp? There are three ways: 1. We can manually define the time points to integrate, similar to odeint. 2. We can decrease the error of the solution we are willing to tolerate. 3. We can change to a more accurate integration method. We will first change the integration method. The default integration method of solve_ivp is RK45 and we will compare it to the default method of odeint, which is LSODA.

solve_ivp_rk45 = solve_ivp(lorenz, t_span, y0, args=p,
                           method='RK45')
solve_ivp_lsoda = solve_ivp(lorenz, t_span, y0, args=p,
                            method='LSODA')

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(solve_ivp_rk45.y[0, :],
        solve_ivp_rk45.y[1, :],
        solve_ivp_rk45.y[2, :])
ax.set_title("RK45")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(solve_ivp_lsoda.y[0, :],
        solve_ivp_lsoda.y[1, :],
        solve_ivp_lsoda.y[2, :])
ax.set_title("LSODA")
Comparison between RK45 and LSODA integration methods of solve_ivp.

The LSODA solution is more accurate than the RK45 one, but it is still not as accurate as the solution we got from odeint. That is because we made odeint solve at a much higher temporal resolution when we passed it t. To get the exact same result from solve_ivp that we got from odeint, we must pass it the exact time points we want solved via the t_eval parameter.

t = np.arange(0.0, 40.0, 0.01)
result_odeint = odeint(lorenz, y0, t, p, tfirst=True)
result_solve_ivp = solve_ivp(lorenz, t_span, y0, args=p,
                             method='LSODA', t_eval=t)

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(result_odeint[:, 0],
        result_odeint[:, 1],
        result_odeint[:, 2])
ax.set_title("odeint")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(result_solve_ivp.y[0, :],
        result_solve_ivp.y[1, :],
        result_solve_ivp.y[2, :])
ax.set_title("solve_ivp LSODA")
odeint and solve_ivp with identical integration method and temporal resolution.

Now both solutions have identical temporal resolution, but they are still not identical. I was unable to confirm why that is the case, but I suspect very small floating point differences between the implementations. The Lorenz attractor is a chaotic system, so even tiny errors make the trajectories diverge. The following plot shows the first variable of the system for odeint and solve_ivp from the above simulation: the solutions track each other initially and then drift apart, which is consistent with this suspicion.

Solutions from odeint and solve_ivp diverge even for identical temporal resolution and integration method.

There are some more subtle differences between the two interfaces. For example, by default, odeint expects the first parameter of the problem to be the state and the second parameter to be the time t. It is the other way around for solve_ivp. To make odeint accept a problem definition where time is the first parameter, we set tfirst=True. Finally, there is one more way to make the solution smoother: decrease the tolerated error. This is not specific to either interface; it works in both.
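As a sketch (the tolerance values below are arbitrary): rtol and atol set the relative and absolute error bounds, and tightening them forces the solver to take more, smaller steps.

# Tighter error tolerances make both solvers take more, smaller steps
result_solve_ivp_tight = solve_ivp(lorenz, t_span, y0, args=p,
                                   rtol=1e-8, atol=1e-8)
result_odeint_tight = odeint(lorenz, y0, t, p, tfirst=True,
                             rtol=1e-8, atol=1e-8)
result_solve_ivp_tight.t.shape  # many more time points than the default run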

In summary, solve_ivp offers several integration methods while odeint only uses LSODA.

Differential Equations in Python with SciPy

Differential equations are special because they don't tell us the value of a variable straight up. Instead, they tell us by how much the variable will change with respect to the change of another variable. Usually that other variable is time. To numerically solve a system of differential equations we need to track the system's change over time, starting at an initial state. This process is called numerical integration and there is a SciPy function for it called odeint. We will learn how to use this package by simulating the 'hello world' of differential equations: the Lorenz system.

Here is the first part of the code where we define the function that describes the dynamics of the system.

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from mpl_toolkits.mplot3d import Axes3D

def lorenz(state, t, sigma, beta, rho):
    x, y, z = state
    
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    
    return [dx, dy, dz]

We start with some imports. Of course we need NumPy, and odeint is imported from scipy.integrate. Matplotlib will be used to plot the result of our simulation. After that we define the system of differential equations that constitutes our Lorenz system. It consists of three differential equations that we fit into one function called lorenz. This function needs a specific call signature (lorenz(state, t, sigma, beta, rho)) because we will later pass it to odeint, which expects specific parameters in specific places. Most importantly, the first parameter must be the state of the system. The state of the Lorenz system is defined by three variables: x, y, z. Our state object has to be a sequence with an order that reflects this.

Inside the lorenz function, the first thing we do is unpack the state into the three state variables. This is followed by the three differential equations that describe the dynamic changes of the state variables. The fact that the variable t does not show up in any of these equations is a common point of confusion. The amount of change certainly depends on the amount of time, so why can we ignore t here? The answer is that our numerical integrator will keep track of t for us. For this particular system we could actually build a function that does not take the parameter t, but I include it because it can be useful if you want to add discontinuities that depend on t.
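For example, a hypothetical variant with a time-dependent drive could look like this (purely illustrative; the drive and the switch at t = 20 are made up and not part of the Lorenz system):

# Hypothetical variant with a time-dependent forcing term on x
def lorenz_forced(state, t, sigma, beta, rho):
    x, y, z = state
    drive = 10.0 if t > 20.0 else 0.0  # discontinuity switching on at t = 20
    dx = sigma * (y - x) + drive
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return [dx, dy, dz]

Note that hard discontinuities can trip up adaptive solvers, so for serious work you would integrate the two segments separately.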

While t does not appear in the equations, sigma, beta & rho do. They are the parameters of the system and the system’s properties depend on them. We will set those parameters next.

sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0

p = (sigma, beta, rho)

These are the parameters Lorenz himself used and they are known to produce the type of dynamic the Lorenz system is most famous for: the Lorenz attractor. It is important that we store these parameters in a tuple in this exact order because of our function's signature. It must be a tuple rather than another type of collection because odeint expects it. Now that our parameters are defined, we will move on to define the initial values of the system. This is a critical part of solving differential equations. The equations tell us by how much the system state changes but they cannot tell us where to start.

y0 = [1.0, 1.0, 1.0]

Our system will start with all variables at 1.0. Now we can solve the system and plot the result.

t = np.arange(0.0, 40.0, 0.01)

result = odeint(lorenz, y0, t, p)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # fig.gca(projection=...) is deprecated
ax.plot(result[:, 0], result[:, 1], result[:, 2])
The Lorenz attractor.

We solve the system with a simple call to odeint, passing it the function that defines our system, the initial state y0, the time points t for which we want to solve the system and the parameters p. result is a two-dimensional array where the rows are time points and the columns are the state variables at those time points. And this is how we can solve differential equations with SciPy.

A Deep Feedforward Network in pyTorch for the Titanic Challenge

The deep feedforward neural network is the simplest network architecture. It is a great entry point to many deep learning concepts. Feedforward networks can also be pretty effective for many applications, but they have been replaced by more specialized networks in most areas (for example recurrent neural networks or convolutional neural networks). Here we will build one with pyTorch and we will go from feature selection to training. The dataset is the Titanic challenge, where our model has to predict who survives the sinking of the Titanic. It is the introductory data science competition on Kaggle. We will start by looking at our data.

Missing Values and String to Numerical Conversion

First we need to load the data and find out what we are dealing with. It already comes split into training and test data, both being .csv files. We load both files and take a look at their general structure with .info().

import pandas as pd
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')
train.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
"""

test.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB
"""

For each passenger we have 11 features. In the test data, the 'Survived' column is missing, because our goal is to predict it. For the training data we know who survived and we will use that knowledge to predict who survived in the test data. Before we can do that, we need to take a closer look at our data. The first issue we encounter is missing values. The training data contains 891 passengers, but the 'Age' column contains only 714 non-null values. The others are missing. We say they are not-a-number (NaN) and we need to take care of them. If we were to feed NaNs into our model it would completely destroy the calculations. One option is to remove the rows that contain NaNs, but this would exclude a large number of rows. Probably more than we would be comfortable with, considering that there will be other columns with NaNs. Additionally, the test data also has NaNs. Should we just give up for those test passengers and guess?

A better idea is to replace missing values with the column mean. Given no other information, the mean is the best guess. To be extra careful, we can take the column median, which is more resistant to outliers. Importantly, we will replace NaNs in the test data with the median from the training data. This is not strictly necessary for the Titanic toy example but it is really good practice if you ever design an important model that will be deployed on completely unknown data. Generally, we should use the test data exclusively to calculate our test accuracy. Nothing else. It should never touch the training data and we should act like we don't even have access to it until we get to model testing. There are more sophisticated ways to replace NaNs but they go beyond this blog post. Let us replace NaNs with the median using the .fillna method. Because we need to apply the same preprocessing to the training and test dataset, we will loop through them. We also have to do the same for the "Fare" column. There is only one value missing and it is in the test data, but we must fill it.

train_test_datasets = [train, test]
median_age = train["Age"].median()
median_fare = train["Fare"].median()
for dataset in train_test_datasets:
    dataset["Age"].fillna(median_age, inplace=True)
    dataset["Fare"].fillna(median_fare, inplace=True)

That takes care of "Age" and "Fare", but we cannot apply the same strategy to "Cabin" or "Embarked" because they are nominal columns. Note that their dtype is object instead of a numeric type such as int64. Let us take a look at "Cabin" first.

train["Cabin"]
"""
PassengerId
1       NaN
2       C85
3       NaN
4      C123
5       NaN

887     NaN
888     B42
889     NaN
890    C148
891     NaN
Name: Cabin, Length: 891, dtype: object
"""
type(train["Cabin"][1])
# float
type(train["Cabin"][2])
# str
train["Cabin"].isna().sum()
# 687

Some rows contain strings, others contain floats. The number of NaNs is large at 687. I was tempted to simply delete this column because of the large number of NaNs, but I decided to keep it because it might be important. I am no ship expert, but the location of the cabin might determine how accessible the lifeboats are and thereby influence survival. Now there are several ways to proceed. I will go with a compromise between simplicity and retaining information: we extract the first letter, creating a new nominal feature called "Cabin Letter".

for dataset in train_test_datasets:
    dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)
train["Cabin Letter"].unique()
# array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

There are eight different cabin letters and the NaN problem persists. Because we are dealing with a nominal feature, we can create a ninth category for the passengers whose cabin we don't know. We will do so as we convert the letters to numbers; for our network we will need to convert everything to floats anyway.

for dataset in train_test_datasets:
    dataset["Cabin Letter"] = pd.Categorical(dataset["Cabin Letter"]).codes
train["Cabin Letter"]
"""
PassengerId
1     -1
2      2
3     -1
4      2
5     -1
      ..
887   -1
888    1
889   -1
890    2
891   -1
Name: Cabin Letter, Length: 891, dtype: int8
"""

train['Cabin Letter'].unique()
# array([-1,  2,  4,  6,  3,  0,  1,  5,  7], dtype=int8)

Now the NaNs are -1, which is fine for now. Later we will convert all these nominal variables to dummy variables anyway, but we will get to that. We could extract more information from "Cabin" but for this demonstration I will leave it here and delete "Cabin".

for dataset in train_test_datasets:
    dataset.drop("Cabin", axis=1, inplace=True)

The final variable that suffers from NaNs is "Embarked". The data in it is less messy so we can go straight to getting the categorical codes.

for dataset in train_test_datasets:
    dataset["Embarked"] = pd.Categorical(dataset["Embarked"])
    dataset["Embarked"] = dataset["Embarked"].cat.codes

Now we have successfully removed all NaN values, but some minor issues with our columns remain. "Name", "Sex" and "Ticket" do not have a numerical type. They are of type object. "Name" and "Ticket" probably don't tell us a lot about survival, unless we do some serious feature engineering on them, so we will simply drop them.

for dataset in train_test_datasets:
    dataset.drop(["Name", "Ticket"], axis=1, inplace=True)

That just leaves us with "Sex". We can use categorical codes, as above, to convert it to numbers.

for dataset in train_test_datasets:
    dataset["Sex"] = pd.Categorical(dataset["Sex"]).codes

Now let us take one more look at our datasets and make sure we took care of NaNs and everything is numeric.

train.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Sex           891 non-null    int8   
 3   Age           891 non-null    float64
 4   SibSp         891 non-null    int64  
 5   Parch         891 non-null    int64  
 6   Fare          891 non-null    float64
 7   Embarked      891 non-null    int8   
 8   Cabin Letter  891 non-null    int8   
dtypes: float64(2), int64(4), int8(3)
memory usage: 51.3 KB
"""

test.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Pclass        418 non-null    int64  
 1   Sex           418 non-null    int8   
 2   Age           418 non-null    float64
 3   SibSp         418 non-null    int64  
 4   Parch         418 non-null    int64  
 5   Fare          418 non-null    float64
 6   Embarked      418 non-null    int8   
 7   Cabin Letter  418 non-null    int8   
dtypes: float64(2), int64(3), int8(3)
memory usage: 20.8 KB
"""

Everything looks in order. However, there is one more necessary preprocessing step before we can move to standardization. We need to convert categorical features to dummy variables. That means, each category in a column gets its own new column and a sample of that category will have a 1 there. Let’s see how that looks.

Convert Categories to Dummy Variables

We will use the .get_dummies function on all our categorical columns but first let us try it on the "Pclass" column.

train["Pclass"]
"""
PassengerId
1      3
2      1
3      3
4      1
5      3
      ..
887    2
888    1
889    3
890    1
891    3
Name: Pclass, Length: 891, dtype: int64
"""

pd.get_dummies(train["Pclass"])
"""
             1  2  3
PassengerId         
1            0  0  1
2            1  0  0
3            0  0  1
4            1  0  0
5            0  0  1
        .. .. ..
887          0  1  0
888          1  0  0
889          0  0  1
890          1  0  0
891          0  0  1

[891 rows x 3 columns]
"""

Dummy variables make sense if we just stare at them long enough. What we are looking for is the conversion of each category into an individual column. train["Pclass"] has three categories, [1, 2, 3]. Therefore, we get three dummy columns. A row that has a 1 in "Pclass" gets a 1 in the first column and 0 in the other two. A row that has a 2 gets a 1 in the second column and 0 in the others, and likewise for a 3. Now we will use the same function on all our categorical columns. Note that we will not use our usual loop through train and test data because there is no way to generate the dummy variables in place.

categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter","SibSp"]
train_dummies = pd.get_dummies(train,
                               columns=categorical_cols,
                               prefix=categorical_cols)

test_dummies = pd.get_dummies(test,
                              columns=categorical_cols,
                              prefix=categorical_cols)
train_dummies.shape
# (891, 29)
test_dummies.shape
# (418, 26)

The code looks solid but something went wrong. We now have more columns in our train_dummies than in test_dummies. What happened? Let us look at the columns and see if we can spot the issue.

train_dummies.columns
"""
Index(['Survived', 'Age', 'Parch', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3',
       'Sex_0', 'Sex_1', 'Embarked_-1', 'Embarked_0', 'Embarked_1',
       'Embarked_2', 'Cabin Letter_-1', 'Cabin Letter_0', 'Cabin Letter_1',
       'Cabin Letter_2', 'Cabin Letter_3', 'Cabin Letter_4', 'Cabin Letter_5',
       'Cabin Letter_6', 'Cabin Letter_7', 'SibSp_0', 'SibSp_1', 'SibSp_2',
       'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8'],
      dtype='object')
"""

test_dummies.columns
"""
Index(['Age', 'Parch', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_0',
       'Sex_1', 'Embarked_0', 'Embarked_1', 'Embarked_2', 'Cabin Letter_-1',
       'Cabin Letter_0', 'Cabin Letter_1', 'Cabin Letter_2', 'Cabin Letter_3',
       'Cabin Letter_4', 'Cabin Letter_5', 'Cabin Letter_6', 'SibSp_0',
       'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8'],
      dtype='object')
"""

test_dummies is missing the following columns: "Embarked_-1" and "Cabin Letter_7". Together with "Survived", which is purposely absent from the test dataset, that accounts for the three-column difference. This happened because the test data contains no rows with a missing value in "Embarked" (code -1) or a 7 in "Cabin Letter". This raises another issue: when we converted string categories to numbers with cat.codes, we may have assigned codes differently in the test and training dataset. Luckily, we can convert strings directly to dummy variables. This is a great opportunity to take a step back and clean up our code. We will start from scratch.

import pandas as pd

# Load data
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')

# Merge train and test for wrangling and preprocessing
train_test_datasets = [train, test]

"""Data wrangling"""
# Split cabin into letter and number
median_age = train["Age"].median()
median_fare = train["Fare"].median()
for idx, dataset in enumerate(train_test_datasets):
    dataset["Age"].fillna(median_age, inplace=True)
    dataset["Fare"].fillna(median_fare, inplace=True)
    dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)
    dataset.drop("Cabin", axis=1, inplace=True)
    #dataset["Embarked"] = dataset["Embarked"].cat.codes
    dataset.drop(["Name", "Ticket"], axis=1, inplace=True)

categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter","SibSp"]
train_dummies = pd.get_dummies(train,
                               columns=categorical_cols,
                               prefix=categorical_cols,
                               dummy_na=True)

test_dummies = pd.get_dummies(test,
                              columns=categorical_cols,
                              prefix=categorical_cols,
                              dummy_na=True)

train_dummies.shape
# (891, 32)
test_dummies.shape
# (418, 30)

Now we have more columns because we added the dummy_na=True parameter. This gives us a "Sex_nan" column although there were no NaNs in the original data. This is excessive but won't be an issue for our network. Now we just need to add the one column that is missing from test_dummies. It should contain only zeros, because no rows in the test dataset fall into that category.

train_dummies.columns
"""
Index(['Survived', 'Age', 'Parch', 'Fare', 'Pclass_1.0', 'Pclass_2.0',
       'Pclass_3.0', 'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan',
       'Cabin Letter_A', 'Cabin Letter_B', 'Cabin Letter_C', 'Cabin Letter_D',
       'Cabin Letter_E', 'Cabin Letter_F', 'Cabin Letter_G', 'Cabin Letter_T',
       'Cabin Letter_nan', 'SibSp_0.0', 'SibSp_1.0', 'SibSp_2.0', 'SibSp_3.0',
       'SibSp_4.0', 'SibSp_5.0', 'SibSp_8.0', 'SibSp_nan'],
      dtype='object')
"""

test_dummies.columns
"""
Index(['Age', 'Parch', 'Fare', 'Pclass_1.0', 'Pclass_2.0', 'Pclass_3.0',
       'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan', 'Embarked_C',
       'Embarked_Q', 'Embarked_S', 'Embarked_nan', 'Cabin Letter_A',
       'Cabin Letter_B', 'Cabin Letter_C', 'Cabin Letter_D', 'Cabin Letter_E',
       'Cabin Letter_F', 'Cabin Letter_G', 'Cabin Letter_nan', 'SibSp_0.0',
       'SibSp_1.0', 'SibSp_2.0', 'SibSp_3.0', 'SibSp_4.0', 'SibSp_5.0',
       'SibSp_8.0', 'SibSp_nan'],
      dtype='object')
"""

"Cabin Letter_T" is the missing one.

import numpy as np
test_dummies["Cabin Letter_T"] = np.zeros(test_dummies.shape[0])
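A more general alternative, instead of adding missing columns by hand, is to reindex the test columns against the training columns; anything absent is filled with zeros. This also guarantees that the column order matches the training data, which matters later when we reuse a scaler fitted on the training columns.

# Align test columns to the training columns (minus the label)
feature_cols = train_dummies.columns.drop("Survived")
test_dummies = test_dummies.reindex(columns=feature_cols, fill_value=0)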

Now our dummy variables are in order and we can move on to balancing the training data.

Balancing the Training Data

Balancing the training data will be important during learning. We want to train our model to predict survival. Imagine the extreme case where we train a model only on passengers that survived. This model would be terrible, because it would have no idea what a surviving passenger looks like on paper. Now don't worry, this training data has rows for both surviving and non-surviving passengers. However, training data where one of the outcomes is more frequent can still bias a model. Let us find out what the situation is with this training data.

total_samples = train_dummies.shape[0]  # Number of rows in DataFrame
number_surviving = (train_dummies['Survived'] == 1).sum()  # Number of survivors
perc_survivors = (number_surviving / total_samples) * 100
# 38.38383838383838

In this training data 38% of passengers survived. There are no hard rules on how balanced data should be, but I am not happy with that number. If it was 45% I think it would be fine but we should do something about 38%. Because the number of survivors is lower than the non-survivors, we will randomly select as many non-survivors as there are survivors.

bool_survivors = train_dummies['Survived'] == 1
bool_nonsurvivors = train_dummies['Survived'] == 0
all_survivors = train_dummies[bool_survivors]
all_nonsurvivors = train_dummies[bool_nonsurvivors]
random_nonsurvivors = all_nonsurvivors.sample(number_surviving)
train_balanced = pd.concat((all_survivors, random_nonsurvivors))
train_balanced = train_balanced.sample(frac=1)
(train_balanced["Survived"] == 0).sum()
# 342
(train_balanced["Survived"] == 1).sum()
# 342

Now we have 342 survivors and 342 non-survivors. Perfectly balanced, as all things should be. Concatenating like this we must also remember to shuffle the rows with .sample(frac=1), otherwise we might run into problems later with unbalanced batches. Now that our data is balanced, we can move on to standardization.

Standardization

Both datasets are now free from missing values, categories are converted to dummy variables and our labels are balanced. Now we can move on to standardization, the process of bringing features to the same scale. We achieve this by subtracting the mean and dividing by the standard deviation. This has some advantages for neural networks: it can speed up learning and result in better fits. There are also disadvantages, as some features become harder to interpret because they lose their physical units. But because we rarely try to interpret neural networks, we almost always standardize our data. This is how it works with scikit-learn.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_dummies.iloc[:,1:])
train_scaled = scaler.transform(train_balanced.iloc[:,1:])
test_scaled = scaler.transform(test_dummies)

train_scaled = pd.DataFrame(train_scaled,
                            index=train_balanced.index,
                            columns=train_balanced.iloc[:,1:].columns)

test_scaled = pd.DataFrame(test_scaled,
                           index=test_dummies.index,
                           columns=test_dummies.columns)

y = train_balanced["Survived"]

train_balanced["Age"].mean(), train_balanced["Age"].std()
# (29.23489766081871, 13.239715945608669)
train_scaled["Age"].mean(), train_scaled["Age"].std()
# (-0.009735709395380402, 1.0174700960384269)

Originally the mean age is 29.23 and the standard deviation is 13.24. After standardization, the mean is near zero because we subtracted it and the standard deviation is near 1 because we divided by it. So our standardization worked, and all features are now on the scale of the standard normal distribution. We are almost ready to create our actual network. But first, we will split our training data into a training and a validation set. This will help us avoid overfitting.

Validation Data Split

Using train_test_split we get 615 training and 69 validation samples. test_size=0.1 specifies that 10% of samples should go into the test data, which in our case will be used for validation, as we already have a designated test dataset. We are now ready to build our network and start training.

from sklearn.model_selection import train_test_split

X = train_scaled
X_train, X_validate, y_train, y_validate = train_test_split(X, y,
                                                            test_size=0.1)

X_train.shape
# (615, 31)
X_validate.shape
# (69, 31)

Building the Network with pyTorch

We start by converting our arrays to tensors. This is the data structure pyTorch will expect as input to the network later.

import torch

train_features = torch.tensor(X_train.to_numpy())
train_labels = torch.tensor(y_train.to_numpy())

validation_features = torch.tensor(X_validate.to_numpy())
validation_labels = torch.tensor(y_validate.to_numpy())

Now we build the neural network by calling torch.nn.Sequential.

n_features = train_features.shape[1]
# 31
model = torch.nn.Sequential(torch.nn.Linear(n_features, 50),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50, 1),
                            torch.nn.Sigmoid())

This network takes our 31 input features and transforms them into 50 hidden units with a fully connected linear layer, which also learns a bias term by default. We then apply the rectified linear unit to introduce some non-linearity and convert the 50 hidden units to one output unit, to which we apply the sigmoid function. This means our deep neural network has one hidden layer with 50 units. The sigmoid function makes sure that our output domain is between 0 and 1. This is important because we are making a binary classification into surviving (1) and non-surviving (0) passengers. Now we define our loss function and our learning method.

criterion = torch.nn.BCELoss()

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.001)

Our loss function is binary cross entropy (BCE) loss, which works well for the binary classification problem we are facing. Our learning algorithm is Adam, a variant of gradient descent; its weight_decay parameter adds L2 regularization.
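To make the loss less of a black box: for a predicted probability p and a true label y, BCE is -(y*log(p) + (1-y)*log(1-p)), averaged over the samples. A quick sanity check against pyTorch, with made-up probabilities and labels:

# Binary cross entropy by hand, compared against torch.nn.BCELoss
p_hat = torch.tensor([0.9, 0.2, 0.6])   # example predicted probabilities
y_true = torch.tensor([1.0, 0.0, 1.0])  # example labels
manual = -(y_true * torch.log(p_hat)
           + (1 - y_true) * torch.log(1 - p_hat)).mean()
manual, criterion(p_hat, y_true)
# both ~0.2798

Before we start training we will split our training data into 41 mini batches.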

n_batches = 41
train_features_batched = train_features.reshape(n_batches,
                                                int(train_features.shape[0]/n_batches),
                                                train_features.shape[1])
train_labels_batched = train_labels.reshape(n_batches,
                                            int(train_labels.shape[0]/n_batches))

Training our network on the mini batches instead of the whole data makes learning quicker and can result in models that generalize better. Now we are ready to train the network.

n_epochs = 2000
loss_list = []
validate_loss_list = []

for epoch in range(n_epochs):
    for batch_idx in range(n_batches):
        optimizer.zero_grad()
        outputs = model(train_features_batched[batch_idx].float())
        loss = criterion(outputs.flatten().float(),
                         train_labels_batched[batch_idx].float())
        loss.backward()
        optimizer.step()

    # Track the loss on the full training and validation data once per epoch
    outputs = model(train_features.float())
    validation_outputs = model(validation_features.float())
    loss = criterion(outputs.flatten().float(),
                     train_labels.float())
    validate_loss = criterion(validation_outputs.flatten().float(),
                              validation_labels.float())
    loss_list.append(loss.item())
    validate_loss_list.append(validate_loss.item())

print('Finished Training')

import matplotlib.pyplot as plt
plt.plot(loss_list, linewidth=3)
plt.plot(validate_loss_list, linewidth=3)
plt.legend(("Training Loss", "Validation Loss"))
plt.xlabel("Epoch")
plt.ylabel("BCE Loss")
Loss curves of model training. Yours will look different, as we did not seed the randomness during preprocessing or model initialization.

This code starts by creating empty lists to record the loss on both the training and validation data. n_epochs tells us how many times we train the network on the entire data. The actual learning happens in the inner loop that exposes batch_idx: we train the network on each mini batch. The output is whatever the network currently spits out for a given batch. We then compare that output to the known labels, which gives us the loss. loss.backward() calculates the gradient of the loss with respect to the parameters. Finally, the actual learning happens when we call optimizer.step(), which adjusts the parameters of our model according to the gradients. This is why pyTorch is handy: it takes very good care of the gradients. The loss we are plotting is calculated in the outer loop, across the entire data instead of the mini batches. The rest is just plotting. So how did it go? Our network learns very quickly. After 250 epochs the validation loss already stops improving. Everything that comes afterwards is probably overfitting the training data: the training loss keeps improving while the validation loss stagnates or becomes worse. The last piece of code we have to write calculates our prediction for the test data and saves it in a way we can upload to Kaggle to find out how well we did.

test_features = torch.tensor(test_scaled.to_numpy())

test_prediction = model(test_features.float()).detach().numpy().flatten()

test_prediction_binary = (test_prediction > 0.5).astype(int)

test_prediction_df = pd.DataFrame(test_prediction_binary,
                                  index=test.index,
                                  columns=["Survived"])

test_prediction_df.to_csv("prediction_submission.csv")

I will upload three different predictions to Kaggle. One from a network that is untrained. To do so I will set n_epochs to 0. This amounts to guessing, because the network is randomly initialized. Then I will upload predictions from a model that stops after 200 epochs. This should be a well trained model. Then I will use a model that stopped after 2000 epochs. This should be slightly overfit. Let’s see how we did.

We got a 0.39952 accuracy score for the untrained model, meaning that 39.952 % of our predictions were correct. Pretty bad but expected from a guessing model. Our model trained on 200 epochs has a 0.70095 score. Much better and a good sign that our model actually learned something. Our model trained on 2000 epochs got a score of 0.74162. Not exactly what we would expect. This probably means that our validation data set was not representative for the test data set. I also read from other people that overfitting the test data improves test accuracy slightly in the Titanic example. How could we improve our model?

Where to go from here

I decided to use a neural network because I wanted to write about it, but it is not the best model here. A random forest classifier seems to do better; the highest score without cheating is around 0.83, achieved with that kind of model. But what could we do to improve our code above? One great improvement would be another validation method. Our validation method sets aside a lot of data that the model is never trained on, and we don't have massive amounts of data to begin with. A cross-validation technique could help us use that data for training. A second improvement would be a more formal way to determine the epoch where we want to stop training. We just looked at the loss curve and picked nice looking spots. Early stopping is a technique that could help us find a more objective point to stop the training. Finally, we could learn more about the Titanic data and engineer more features out of it. For example, we discarded the cabin number, keeping only the letter; however, the number could be very important. A good way to learn more about the Titanic data is to browse the Kaggle discussion board.
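As a sketch of what early stopping could look like here (the patience value is an arbitrary choice of mine), we track the best validation loss and abort once it stops improving:

# Early stopping sketch: abort once the validation loss has not improved
# for `patience` consecutive epochs
best_loss = float('inf')
patience = 50
epochs_without_improvement = 0

for epoch in range(n_epochs):
    # ... run one training epoch over the mini batches as above ...
    validation_outputs = model(validation_features.float())
    validate_loss = criterion(validation_outputs.flatten(),
                              validation_labels.float()).item()
    if validate_loss < best_loss:
        best_loss = validate_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break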

Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often have data where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as well as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand-drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")
Two example images from the MNIST dataset.

By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels, and they come as a flat, one-dimensional array. To display them as an image, we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images

Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class and call fit_transform on our data.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
t-SNE example on MNIST subsample.

Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

t-SNE step by step

The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 784 features (the pixels). Plotting data with that many features is impossible, and that is the whole point of dimensionality reduction. However, even at 784 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense, but probably the most straightforward one is the Euclidean distance.

Euclidean distance for two features: d(p, q) = sqrt((q1 − p1)² + (q2 − p2)²). (Image by Kmhkmh, licensed under CC BY 4.0.)

The above definition of Euclidean distance for two features extends to n features (p1, p2, p3, …, pn). As a quick sanity check, we can compute a single distance by hand (a minimal sketch using the X array from above):
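
# Euclidean distance between the first two images, computed by hand
a, b = X[0].astype(float), X[1].astype(float)
print(np.sqrt(np.sum((a - b) ** 2)))  # should match distance_matrix[0, 1] below

Once again we can use scikit-learn to calculate the Euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.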

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
                                     metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
                                            metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")

When the samples are sorted by label, squarish patterns emerge in the distance matrix. White means smaller Euclidean distances. This is reassuring: after all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples, but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each other as “neighbors”. The two are of course related, because nearby samples should have a higher chance of being neighbors than samples that are further apart. The t-distribution is defined by its mean, its degrees of freedom and its scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one; that is one in our case, since we want two components. The last free parameter is sigma, and it is important because it determines how wide the tails of the distribution are.

t-distributions for sigma of 1 and 2

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point, and as we transform that row with the t-distribution we get a probability distribution P. The perplexity is defined as 2^H(P), where H is the Shannon entropy. Different values for sigma result in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01, base=2)  # perplexity = 2^H(P) with H in bits
perplexity_10 = 2**entropy(P_10, base=2)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity = $' + str(perplexity_01),
           r'$\sigma = 10; Perplexity = $' + str(perplexity_10)))
Distributions of joint probabilities for different values of sigma.

Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases, and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points being chosen as neighbors. Therefore, perplexity is commonly interpreted as a measure for the effective number of neighbors each sample has. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search, we will take a shortcut to calculating the joint probabilities and use sklearn’s internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
                                desired_perplexity=perplexity,
                                verbose=False)
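
Just for illustration, a toy version of such a binary search could look like the sketch below. It follows the t-distribution demonstration above rather than sklearn’s actual internals, and sigma_for_perplexity is a made-up helper name.

# Toy bisection for sigma (illustration only, not sklearn's implementation)
def sigma_for_perplexity(dist_row, target, lo=1e-2, hi=1e3, n_iter=50):
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        P = t(df=1.0, loc=0.0, scale=mid).pdf(dist_row)
        if 2**entropy(P, base=2) > target:  # perplexity grows with sigma
            hi = mid  # sigma too large, search the lower half
        else:
            lo = mid  # sigma too small, search the upper half
    return (lo + hi) / 2.0

sigma_for_perplexity(distance_matrix[0, 1:], 30)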

As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric about its main diagonal and that its diagonal is all zeros, so we only keep the upper triangle of the matrix in a flat array p. We can verify its size with a quick check (using the arrays from above):
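
# p holds one entry per unordered pair of samples
n_samples = distance_matrix.shape[0]
print(p.shape)                           # (499500,)
print(n_samples * (n_samples - 1) // 2)  # 499500 pairs for 1000 samples

That’s all we need to move on to the next step.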

Optimize Embedding with Gradient Descent

Now that we have the joint probabilities of our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as well as possible. First we need to initialize our low dimensional embedding. By default, sklearn uses a random initialization, so that’s what we will use. Once we have initialized our embedding, we optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE optimizes the Kullback-Leibler divergence between the joint probabilities of the data and those of the embedding. It is a measure for the similarity of two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence, and all we have to do is pass it to the _gradient_descent function together with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
                                    n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
             'degrees_of_freedom': 1,
             'n_samples': 1000,
             'n_components':2}

# Perform gradient descent. The positional arguments after the objective
# are the flat initial embedding, the start iteration (0) and the total
# number of iterations (1000).
embedding_done = _t_sne._gradient_descent(_t_sne._kl_divergence,
                                          embedding_init,
                                          0,
                                          1000,
                                          kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert to DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
                               'tsne_2': tsne_result[:,1],
                               'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization or the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for is similarly well-defined clusters. We could have gone a bit deeper here and there. For example, we could have written our own implementations of the Kullback-Leibler divergence or of gradient descent. I’ll leave that for another time.

Getting Started with Pandas DataFrame

A DataFrame is a spreadsheet-like data structure. We can think of it as a collection of rows and columns. This row-column structure is useful for many different kinds of data. The most widely used DataFrame implementation in Python is from the Pandas package. First we will learn how to create DataFrames. We will also learn how to do some basic data analysis with them. Finally, we will compare the DataFrame to the ndarray data structure and learn why DataFrames are useful in packages such as Seaborn.

How to Create a DataFrame

There are two major ways to create a DataFrame. We can directly call DataFrame() and pass it data in a dictionary, list or array. Alternatively, we can use several functions to load data from a file directly into a DataFrame. While it is very common in data science to load data from files, there are also many occasions where we need to create a DataFrame from other data structures. We will first learn how to create a DataFrame from a dictionary.

import pandas as pd
d = {"Frequency": [20, 50, 8],
     "Location": [2, 3, 1],
     "Cell Type": ["Interneuron", "Interneuron", "Pyramidal"]}
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, index=row_names)
print(df)

"""
    Frequency  Location    Cell Type
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""

In our dictionary the keys are used as the column names. The data under each key then becomes the column. The row names are defined separately by passing a collection to the index parameter of DataFrame. We can get column and row names with the columns and index attributes.

df.columns
# Index(['Frequency', 'Location', 'Cell Type'], dtype='object')
df.index
# Index(['C1', 'C2', 'C3'], dtype='object')

We can also change column and row names through those same attributes.

df.index = ["Cell_1", "Cell_2", "Cell_3"]
df.columns = ["Freq (Hz)", "Loc (cm)", "Cell Type"]
"""
        Freq (Hz)  Loc (cm)    Cell Type
Cell_1         20         2  Interneuron
Cell_2         50         3  Interneuron
Cell_3          8         1    Pyramidal
"""

These names are useful because they give us a descriptive way of indexing into columns and rows. If we use indexing syntax on the DataFrame, we can get individual columns.

df['Freq (Hz)']
"""
Cell_1    20
Cell_2    50
Cell_3     8
Name: Freq (Hz), dtype: int64
"""

Row names are not found this way and using a row key will raise an error. However, we can get rows with the df.loc attribute.

df['Cell_1']
# KeyError: 'Cell_1'
df.loc['Cell_1']
"""
Freq (Hz)             20
Loc (cm)               2
Cell Type    Interneuron
Name: Cell_1, dtype: object
"""

We could also create a DataFrame from other kinds of collections that are not dictionaries. For example we can use a list.

d = [[20, 2, "Interneuron"],
     [50, 3, "Interneuron"],
     [8, 1, "Pyramidal"]]
column_names = ["Frequency", "Location", "Cells"]
row_names = ["C1", "C2", "C3"]
df = pd.DataFrame(d, columns=column_names, index=row_names)
print(df)
"""
    Frequency  Location        Cells
C1         20         2  Interneuron
C2         50         3  Interneuron
C3          8         1    Pyramidal
"""

In that case there are no dictionary keys that could be used to infer the column names, so we need to pass them to the columns parameter. Almost anything that structures our data in a two-dimensional way can be used to create a DataFrame. For example, a NumPy array works just as well (a small sketch, assuming NumPy is available):
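
import numpy as np

arr = np.array([[20, 2], [50, 3], [8, 1]])
df_arr = pd.DataFrame(arr, columns=["Frequency", "Location"],
                      index=["C1", "C2", "C3"])
print(df_arr)
"""
    Frequency  Location
C1         20         2
C2         50         3
C3          8         1
"""

Next we will learn about functions that allow us to load different file types as a DataFrame.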

Loading Files as a DataFrame

The list of file types Pandas can read and write is rather long and is covered in the Pandas documentation. I only want to cover the most commonly used .csv files here. They have the particular advantage that they can also be read by humans, because they are essentially text files, and they are widely supported by a variety of languages and programs. First, let’s create our file. Because it is a text file, we can write a literal string to it.

# Write the rows without leading whitespace so the values parse cleanly
text_file = open("example.csv", "w")
text_file.write(",Frequency,Location,Cell Type\n"
                "C1,20,2,Interneuron\n"
                "C2,50,3,Interneuron\n"
                "C3,8,1,Pyramidal")
text_file.close()

In this file columns are separated by commas and rows are separated by new lines. This is what .csv means, it stands for comma-separated values. To load this file into a DataFrame we need to pass the file name and which column contains the row names. Pandas assumes by default that the first row contains the column names.

df = pd.read_csv("example.csv", index_col=0)
print(df)
"""
     Frequency  Location    Cell Type
 C1         20         2  Interneuron
 C2         50         3  Interneuron
 C3          8         1    Pyramidal
"""

There are many more parameters we can specify for read_csv in case we have a file that is structured differently. In fact we can load files that have a value delimiter other than the comma, by specifying the delimiter parameter.

text_file = open("example.csv", "w")
text_file.write("-Frequency-Location-Cell Type\n"
                "C1-20-2-Interneuron\n"
                "C2-50-3-Interneuron\n"
                "C3-8-1-Pyramidal")
text_file.close()
df = pd.read_csv("example.csv", index_col=0, delimiter='-')
print(df)
"""
     Frequency  Location    Cell Type
 C1         20         2  Interneuron
 C2         50         3  Interneuron
 C3          8         1    Pyramidal
"""

We specify '-' as the delimiter and it also works. Although the function is called read_csv, it is not strictly bound to comma-separated values. We can also skip rows, select columns and use many more options you can learn about from the documentation. For instance (a hypothetical call; skiprows and usecols are genuine read_csv parameters):
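
# Load only the index and the first two columns, skipping the first data row
df_partial = pd.read_csv("example.csv", index_col=0, delimiter='-',
                         usecols=[0, 1, 2], skiprows=[1])

For well-structured .csv files, however, we need very few arguments, as shown above. Next we will learn how to do basic calculations with the DataFrame.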

Basic Math with DataFrame

A variety of methods such as df.mean(), df.median() and df.std() are available to do basic statistics on our DataFrame. By default they all return values per column. That is because columns are assumed to contain our variables (or features), while each row contains a sample.

df.mean(numeric_only=True)  # numeric_only excludes the 'Cell Type' strings
"""
Frequency    26.0
Location      2.0
dtype: float64
"""

df.median(numeric_only=True)
"""
Frequency    20.0
Location      2.0
dtype: float64
"""

df.std(numeric_only=True)
"""
Frequency    21.633308
Location      1.000000
dtype: float64
"""

One big advantage of the column structure is that within a column the data type is clearly defined, while within a row different data types can exist. In our case we have two numeric columns and a string column. When we call these statistical methods, non-numeric columns are ignored; recent versions of Pandas require us to request that explicitly with numeric_only=True. In our case the ignored column is 'Cell Type'. Technically we can also use the axis parameter to calculate these statistics for each sample, but this is not always useful and again has to ignore the string column.

df.mean(axis=1, numeric_only=True)
"""
C1    11.0
C2    26.5
C3     4.5
dtype: float64
"""

We can also use other mathematical operators. They are applied element-wise and their effect will depend on the data type of the value.

print(df * 3)
"""
    Frequency  Location                           Cell Type
C1         60         6  InterneuronInterneuronInterneuron
C2        150         9  InterneuronInterneuronInterneuron
C3         24         3        PyramidalPyramidalPyramidal
"""

Oftentimes these operations make more sense for individual columns. As explained above, we can use indexing to get individual columns, and we can even assign the result to an existing or new column.

norm_freq = df['Frequency'] / df['Frequency'].mean()  # normalize to the mean
norm_freq
"""
C1    0.769231
C2    1.923077
C3    0.307692
Name: Frequency, dtype: float64
"""
df['Norm Freq'] = norm_freq
print(df)
"""
    Frequency  Location    Cell Type  Norm Freq
C1         20         2  Interneuron   0.769231
C2         50         3  Interneuron   1.923077
C3          8         1    Pyramidal   0.307692
"""

If you are familiar with NumPy, most of these DataFrame operations will seem very familiar, because they mostly work like array operations. Because Pandas builds on NumPy, most NumPy functions work directly on numeric columns. For example (a quick sketch applying np.sin to the Frequency column from above):
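
import numpy as np

np.sin(df['Frequency'])  # element-wise sine of the column
"""
C1    0.912945
C2   -0.262375
C3    0.989358
Name: Frequency, dtype: float64
"""

I don’t want to go deeper here and will instead move on to visualizing DataFrames with Seaborn.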

Seaborn for Data Visualization

Seaborn is a high-level data visualization package that builds on Matplotlib. It does not strictly require a DataFrame and can work with other data structures such as ndarray, but it is particularly convenient with DataFrames. First, let us get a more interesting data set. Luckily, Seaborn comes with some nice example data sets that conveniently load into a Pandas DataFrame.

import seaborn as sns
df = sns.load_dataset('iris')
type(df)
# pandas.core.frame.DataFrame
print(df)
"""
     sepal_length  sepal_width  petal_length  petal_width    species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

[150 rows x 5 columns]
"""

print(df.columns)
"""
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
"""

The Iris data set contains information about different species of iris plants. It contains 150 samples and five columns, where the 'species' column tells us which species a particular sample belongs to. The column names are very useful when we structure our plots in Seaborn. Let’s first try a basic bar graph.

sns.set(context='paper',
        style='whitegrid',
        palette='colorblind',
        font='Arial',
        font_scale=2,
        color_codes=True)
fig = sns.barplot(x='species', y='sepal_length', data=df)

We use sns.barplot and pass our DataFrame to the data parameter. For x and y we specify the column names that should appear there. We put 'species' on the x-axis, so that is how the data is aggregated inside the bars. Setosa, versicolor and virginica are the different species. The sns.set() function defines multiple parameters of Seaborn and enforces a certain style on the plots that I personally prefer. Bar graphs have grown out of fashion, and for good reason: they are not very informative about the distribution of their underlying values. I prefer the violin plot, which gives a better idea of the distribution.

fig = sns.violinplot(x='species', y='sepal_length', data=df)

We even get a small box plot within the violin plot for free. Seaborn works its magic through the DataFrame column names. This makes plotting more convenient and our code more descriptive than it would be with pure NumPy: the code literally tells us that 'species' will be on the x-axis.

Summary

We learned that we can create a DataFrame from a dictionary or another kind of collection. The most important features are the column and row names. Columns organize features and rows organize samples by convention. We can also load files into a DataFrame. For example we can use read_csv to load .csv or other text based files. We can also use methods like df.mean() to get basic statistics of our DataFrame. Finally, Seaborn is very useful to visualize a DataFrame.

Animations with Julia

Creating great looking animations in Julia is shockingly easy thanks to the Plots package and some macro magic. Here we will learn how to turn data into high quality animations. We will learn about the @animate macro, frames and the gif function.

Two Steps to Animations

To create animations, we simply generate frames with the @animate macro and then generate a file with the gif function. Both are part of the Plots package for Julia, so we start with using Plots to make that package available. Next, we create the data we will plot, a simple sine wave. Then we generate the frames with the @animate macro and animate them with the gif function. This is the code and the resulting animation.

using Plots

x = collect(1:0.1:30)
y = sin.(x)
df = 2

anim = @animate for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
end

gif(anim, "tutorial_anim_fps30.gif", fps = 30)

Macros and Meta-Programming

The @animate macro deserves some extra attention, because it looks like magic. Macros are related to a concept called meta-programming. In Julia, all code is a data structure that can be manipulated much like all other data structures. This effectively means that we can write code that manipulates our code. That is what a macro is: a function that modifies code. In our case, the code being modified is the for loop following our @animate macro. It is modified so that it catches the frame at the end of each loop iteration and saves it into anim. We can write code that does the same job ourselves.

anim = Plots.Animation()
for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
    Plots.frame(anim)
end

gif(anim, "tutorial_anim_fps30.gif", fps = 30)

We use the Plots.Animation() function to create our animation object where we will store our frames. During the for loop we then call Plots.frame(anim) to store the frame after each iteration in our anim object. These are the essential steps that the @animate macro takes care of. If you want to learn what the macro does in detail you can call @macroexpand on it.

@macroexpand @animate for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
end

There is another macro that is even more convenient: the @gif macro. It saves us from having to call gif() on our anim object.

@gif for i = 1:df:length(x)
    plot(x[1:i], y[1:i], legend=false)
end

This directly displays the animation if an interactive Julia session is available. The downside is that we neither save the animation to disk nor keep an anim object around for further animations later. It is most useful for quickly troubleshooting animations interactively.

Beyond plot()

The @animate macro supports animations of anything that can be plotted with Plots. For example, we can animate a heatmap.

anim = @animate for i = 1:100
    mat = rand(0:100, 32, 32)
    heatmap(mat, clim=(0,255))
end

gif(anim, "tutorial_heatmap_anim.gif", fps = 10)

The frame that is caught is the state of the active figure at the end of the for loop iteration. The for loop itself gives us a great deal of control over how many frames we create. For example, in the previous examples I skipped frames with the df variable.

That’s it for animations. To learn more you can take a look at the official Plots documentation.

The Type System of Julia

Every value in Julia has a type. Like other popular languages such as Python, Julia is dynamically typed. That means, we don’t need to explicitly define the type of a value when we create it. However, types are particularly important in Julia because we can use explicit type definitions to speed up calculations. Here we will get an introduction into the type system of Julia.

Dynamic Typing

When we assign a value in Julia we don’t need to specify its type; the type is inferred from the given value. Let’s create an integer, a float and a string.

myint = 3
myfloat = 3.0
mystr = "3.0"
println(typeof(myint), ": ", myint)
println(typeof(myfloat), ": ", myfloat)
println(typeof(mystr), ": ", mystr)
# Int64: 3
# Float64: 3.0
# String: 3.0

We use typeof() to find out what type a value has. To create an integer, we assign a number without a decimal point. To create a float, we simply attach a decimal place. For a string, we need quotation marks around our value. There is another way to define the type more explicitly. Our float value is a Float64 (it uses 64 bits). What if we wanted a Float32?

myfloat = convert(Float32, 3.0)
typeof(myfloat)
# Float32

We use the convert(Type, value) function to explicitly convert 3.0 to Float32. If we want to ensure that a value is of a certain type, we can use the double colon syntax (::). It evaluates normally if the value has the specified type but throws an error if it has a different type.

3::Int
# 3
3::Float64
# TypeError: in typeassert, expected Float64, got Int64
# 
# Stacktrace:
#  [1] top-level scope at In[8]:1

This type assertion syntax is frequently used when defining functions. A function is a piece of code that takes input arguments and processes them to produce output values. We will learn more about functions later. Because Julia is dynamically typed, we could write our functions so that they work the same regardless of input types. However, defining functions in a way that is specific to the input types can be good for performance, and Julia makes heavy use of that fact. For a given function, multiple methods might exist, and each method is responsible for a given set of input parameters. For example, we can inspect all the different methods of the mathematical cosine function by calling the methods function on it.

methods(cos)
"""
# 12 methods for generic function cos:
cos(x::BigFloat) in Base.MPFR at mpfr.jl:744
cos(::Missing) in Base.Math at math.jl:1167
cos(a::Complex{Float16}) in Base.Math at math.jl:1115
cos(a::Float16) in Base.Math at math.jl:1114
cos(z::Complex{T}) where T in Base at complex.jl:823
cos(x::T) where T<:Union{Float32, Float64} in Base.Math at special/trig.jl:100
cos(x::Real) in Base.Math at special/trig.jl:124
cos(A::LinearAlgebra.Hermitian{#s661,S} where S<:(AbstractArray{#s662,2} where #s662<:#s661) where #s661<:Complex) in LinearAlgebra at C:\Users\Daniel\AppData\Local\Programs\Julia\Julia-1.4.2\share\julia\stdlib\v1.4\LinearAlgebra\src\symmetric.jl:907
cos(A::Union{LinearAlgebra.Hermitian{#s662,S}, LinearAlgebra.Symmetric{#s662,S}} where S where #s662<:Real) in LinearAlgebra at C:\Users\Daniel\AppData\Local\Programs\Julia\Julia-1.4.2\share\julia\stdlib\v1.4\LinearAlgebra\src\symmetric.jl:903
cos(D::LinearAlgebra.Diagonal) in LinearAlgebra at C:\Users\Daniel\AppData\Local\Programs\Julia\Julia-1.4.2\share\julia\stdlib\v1.4\LinearAlgebra\src\diagonal.jl:561
cos(A::AbstractArray{#s662,2} where #s662<:Real) in LinearAlgebra at C:\Users\Daniel\AppData\Local\Programs\Julia\Julia-1.4.2\share\julia\stdlib\v1.4\LinearAlgebra\src\dense.jl:793
cos(A::AbstractArray{#s662,2} where #s662<:Complex) in LinearAlgebra at C:\Users\Daniel\AppData\Local\Programs\Julia\Julia-1.4.2\share\julia\stdlib\v1.4\LinearAlgebra\src\dense.jl:800
"""

The cos function has 12 different methods. The double colon type assertion syntax checks the input type, and which method is used can depend on the types of all inputs. This is called multiple dispatch: multiple inputs determine which method is dispatched. We will learn more about multiple dispatch in another post. Since arrays are central to scientific computing, we next look at how the type system treats arrays.

Array Types

There are several different ways to define arrays. One way is to use literal numbers enclosed by square brackets. In that case, the type is inferred in the same way as above, where we used literal numbers.

myintarr = [1, 2, 3]
typeof(myintarr)
# Array{Int64,1}
myfloatarr = [1.0, 2.0, 3.0]
typeof(myfloatarr)
# Array{Float64,1}

For an array, typeof() tells us not only the type of the array elements but also how many dimensions the array has. One important feature of the array is that all elements must have the same type. So what happens when we create an array from literals with different types? The type is determined by the most complex type.

myintarr = [1, 2, 3]
typeof(myintarr)
# Array{Int64,1}
myfloatarr = [1, 2.0, 3]
typeof(myfloatarr)
# Array{Float64,1}
mystrarr = [1, "2.0", 3]
typeof(mystrarr)
# Array{Any,1}

We see that a single float value makes the entire array Float64. However, when one of the elements is a string, the type becomes Any rather than String. That is because Julia does not convert numbers to strings; we get an error if we call convert(String, 3). If the most complex type is Float64, Julia tries to convert all other values to that type. This works when the other values are integers, because convert(Float64, 3) works. If the other values cannot be converted to the most complex type, all values take on the Any type. This means that values in the array can be of different types; they could literally be anything. This defeats the purpose of an array, so we generally try to avoid it. There are other array creation methods besides literals. We can call a variety of functions that create arrays, and they usually allow us to specify the type.

zerosarr = zeros(Int8, (2, 3))
# 2×3 Array{Int8,2}:
#  0  0  0
#  0  0  0
zerosarr = zeros((2, 3))
# 2×3 Array{Float64,2}:
#  0.0  0.0  0.0
#  0.0  0.0  0.0

If we don’t specify the type we want, the zeros() function defaults to Float64. There are other functions such as ones(), giving an array of ones, or rand(), giving an array of random numbers. The type of the output can be specified for each.

That is it for our basic introduction to types. You can find a more complete introduction to Julia types in the official documentation. In summary, we don’t need to explicitly specify the type of values, but sometimes doing so can make our math more efficient. We will learn more about performance and efficiency in Julia in later posts.