Differential Equations with SciPy – odeint or solve_ivp

SciPy features two different interfaces for solving differential equations: odeint and solve_ivp. The newer one is solve_ivp and it is the recommended interface, but odeint is still widespread, probably because of its simplicity. Here I will go through the differences between the two with a focus on moving to the more modern solve_ivp interface. The primary advantage of solve_ivp is that it offers several methods for solving differential equations whereas odeint is restricted to one. We get started by setting up our system of differential equations and some parameters of the simulation.

import numpy as np
from scipy.integrate import odeint, solve_ivp
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def lorenz(t, state, sigma, beta, rho):
    x, y, z = state
    
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    
    return [dx, dy, dz]

sigma = 10.0
beta = 8.0 / 3.0
rho = 28.0

p = (sigma, beta, rho)  # Parameters of the system

y0 = [1.0, 1.0, 1.0]  # Initial state of the system

We will be using the Lorenz system. We can directly move on to solving the system with both odeint and solve_ivp.

t_span = (0.0, 40.0)
t = np.arange(0.0, 40.0, 0.01)

result_odeint = odeint(lorenz, y0, t, p, tfirst=True)
result_solve_ivp = solve_ivp(lorenz, t_span, y0, args=p)

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(result_odeint[:, 0],
        result_odeint[:, 1],
        result_odeint[:, 2])
ax.set_title("odeint")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(result_solve_ivp.y[0, :],
        result_solve_ivp.y[1, :],
        result_solve_ivp.y[2, :])
ax.set_title("solve_ivp")
Simulation results from odeint and solve_ivp.

The first thing that sticks out is that the solve_ivp solution is less smooth. That is because it is calculated at fewer time points, which in turn has to do with the difference between t_span and t. The odeint interface expects t, an array of time points for which we want to calculate the solution. The temporal resolution of the system is given by the interval between time points. The solve_ivp interface on the other hand expects t_span, a tuple that gives the start and end of the simulation interval. solve_ivp determines the temporal resolution by itself, depending on the integration method and the desired accuracy of the solution. We can confirm that the temporal resolution of solve_ivp is lower in this example by inspecting the output of both functions.

t.shape
# (4000,)

result_odeint.shape
# (4000, 3)

result_solve_ivp.t.shape
# (467,)

The t array has 4000 elements and therefore the result of odeint has 4000 rows, each row being a time point defined by t. The result of solve_ivp is different. It has its own time array as an attribute, and that array has only 467 elements. This tells us that solve_ivp indeed calculated fewer time points, affecting the temporal resolution. So how can we increase the number of time points in solve_ivp? There are three ways: 1. We can manually define the time points to integrate, similar to odeint. 2. We can decrease the error we are willing to tolerate. 3. We can change to a more accurate integration method. We will first change the integration method. The default integration method of solve_ivp is RK45 and we will compare it to the default method of odeint, which is LSODA.

solve_ivp_rk45 = solve_ivp(lorenz, t_span, y0, args=p,
                           method='RK45')
solve_ivp_lsoda = solve_ivp(lorenz, t_span, y0, args=p,
                            method='LSODA')

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(solve_ivp_rk45.y[0, :],
        solve_ivp_rk45.y[1, :],
        solve_ivp_rk45.y[2, :])
ax.set_title("RK45")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(solve_ivp_lsoda.y[0, :],
        solve_ivp_lsoda.y[1, :],
        solve_ivp_lsoda.y[2, :])
ax.set_title("LSODA")
Comparison between RK45 and LSODA integration methods of solve_ivp.

The LSODA solution is already more accurate than RK45, but it is still not as accurate as the solution we got from odeint. That is because we made odeint solve at a much higher temporal resolution when we passed it t. To get the exact same temporal resolution from solve_ivp that we got from odeint, we pass it the time points we want solved with the t_eval parameter.

t = np.arange(0.0, 40.0, 0.01)
result_odeint = odeint(lorenz, y0, t, p, tfirst=True)
result_solve_ivp = solve_ivp(lorenz, t_span, y0, args=p,
                             method='LSODA', t_eval=t)

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.plot(result_odeint[:, 0],
        result_odeint[:, 1],
        result_odeint[:, 2])
ax.set_title("odeint")

ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.plot(result_solve_ivp.y[0, :],
        result_solve_ivp.y[1, :],
        result_solve_ivp.y[2, :])
ax.set_title("solve_ivp LSODA")
odeint and solve_ivp with identical integration method and temporal resolution.

Now both solutions have identical temporal resolution, but they are still not identical. I was unable to pin down the exact cause, but I suspect very small floating point differences between the two implementations. The Lorenz attractor is a chaotic system, so even tiny errors make the solutions diverge over time. The following plot shows the first variable of the system for odeint and solve_ivp from the above simulation.
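As a sketch of how this comparison plot can be produced from the results we already have (the exact plotting commands are my own choice, not taken from the original figure):

fig, ax = plt.subplots(1)
ax.plot(t, result_odeint[:, 0], label="odeint")
ax.plot(result_solve_ivp.t, result_solve_ivp.y[0, :], label="solve_ivp LSODA")
ax.set_xlabel("t")
ax.set_ylabel("x")
ax.legend()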

Solutions from odeint and solve_ivp diverge even for identical temporal resolution and integration method.

There is one more way to make the solution smoother: decreasing the tolerated error. We will not go into that here because it does not relate to the difference between odeint and solve_ivp; it can be used with both interfaces. There are some more subtle differences. For example, by default, odeint expects the first argument of the problem function to be the state and the second argument to be time t. This is the other way around for solve_ivp. To make odeint accept a problem definition where time is the first argument we set the parameter tfirst=True.
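For completeness, here is a brief example of tightening the tolerances in both interfaces; the rtol and atol values below are arbitrary choices for illustration:

# Tighter tolerances force smaller internal steps and a smoother solution
result_odeint_tight = odeint(lorenz, y0, t, p, tfirst=True,
                             rtol=1e-8, atol=1e-8)
result_solve_ivp_tight = solve_ivp(lorenz, t_span, y0, args=p,
                                   rtol=1e-8, atol=1e-8)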

In summary, solve_ivp offers several integration methods while odeint only uses LSODA. Furthermore, solve_ivp chooses its own time points unless we pass t_eval, and the two interfaces expect the state and time arguments of the problem function in opposite orders unless we pass tfirst=True to odeint.

A Deep Feedforward Network in pyTorch for the Titanic Challenge

The deep feedforward neural network is the simplest network architecture and a great entry point to many deep learning concepts. Feedforward networks can be pretty effective for many applications, but they have been replaced by more specialized networks in most areas (for example recurrent neural networks or convolutional neural networks). Here we will build one with pyTorch, going from feature selection to training. The dataset is the Titanic challenge, where our model has to predict who survives the sinking of the Titanic. It is the introductory data science competition on Kaggle. We will start by looking at our data.

Missing Values and String to Numerical Conversion

First we need to load the data and find out what we are dealing with. It already comes split into training and test data, both being .csv files. We load both files and take a look at their general structure with .info().

import pandas as pd
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')
train.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
"""

test.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    418 non-null    int64  
 1   Name      418 non-null    object 
 2   Sex       418 non-null    object 
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64  
 5   Parch     418 non-null    int64  
 6   Ticket    418 non-null    object 
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object 
 9   Embarked  418 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB
"""

For each passenger we have 11 features. In the test data, the 'Survived' column is missing, because our goal is to predict it. For the training data we know who survived and we will use this to predict who survived in the test data. Before we can do that, we need to take a closer look at our data. The first issue we encounter is missing values. The training data contains 891 passengers, but the 'Age' column contains only 714 non-null values. The others are missing. We say they are not-a-number (NaN) and we need to take care of them. If we were to feed NaNs into our model they would break the calculations. One option is to remove the rows that contain NaNs. This would exclude a large number of rows, probably more than we are comfortable with, considering that there will be other columns with NaNs. Additionally, the test data also has NaNs. Should we just give up on those test passengers and guess?

A better idea is to replace missing values with the column mean. Given no other information, the mean is the best guess. To be extra careful, we can take the column median, which is more resistant to outliers. Importantly, we will replace NaNs in the test data with the median from the training data. This is not strictly necessary for the Titanic toy example but it is really good practice if you ever design an important model that will be deployed on completely unknown data. Generally, we should use the test data exclusively to calculate our test accuracy, nothing else. It should never influence the training process and we should act like we don’t even have access to it until we get to model testing. There are more sophisticated ways to replace NaNs but they go beyond this blog post. Let us replace NaNs with the median using the .fillna method. Because we need to apply the same preprocessing to the training and test dataset we will loop through them. We will have to do the same for the "Fare" column. There is only one value missing and it is in the test data, but we must fill it.

train_test_datasets = [train, test]
median_age = train["Age"].median()
median_fare = train["Fare"].median()
for dataset in train_test_datasets:
    dataset["Age"].fillna(median_age, inplace=True)
    dataset["Fare"].fillna(median_fare, inplace=True)

That takes care of "Age" and “Fare” but we cannot apply the same strategy to "Cabin" or "Embarked" because they are nominal columns. Note that their dtype is object instead of a numeric type such as int64. Let us take a look at "Cabin" first.

train["Cabin"]
"""
PassengerId
1       NaN
2       C85
3       NaN
4      C123
5       NaN

887     NaN
888     B42
889     NaN
890    C148
891     NaN
Name: Cabin, Length: 891, dtype: object
"""
type(train["Cabin"][1])
# float
type(train["Cabin"][2])
# str
train["Cabin"].isna().sum()
# 687

Some rows contain strings, others contain floats (the NaNs). The number of NaNs is large at 687. I was tempted to simply delete this column because of the large number of NaNs. I decided to keep it because it might be important. I am no ship expert, but the location of the cabin might determine how accessible the lifeboats are and thereby influence survival. Now there are several ways to proceed. I will go with a compromise between simplicity and retaining information. We will extract the first letter, creating a new nominal feature called “Cabin Letter”.

for dataset in train_test_datasets:
    dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)
train["Cabin Letter"].unique()
# array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

There are eight different cabin letters and the NaN problem persists. Because we are dealing with a nominal feature we can create a ninth category. These are the passengers where we don’t know their cabin number. We will do so as we convert the letters to numbers. For our network we will need to convert everything to floats anyway.

for dataset in train_test_datasets:
    dataset["Cabin Letter"] = pd.Categorical(dataset["Cabin Letter"]).codes
train["Cabin Letter"]
"""
PassengerId
1     -1
2      2
3     -1
4      2
5     -1
      ..
887   -1
888    1
889   -1
890    2
891   -1
Name: Cabin Letter, Length: 891, dtype: int8
"""

train['Cabin Letter'].unique()
# array([-1,  2,  4,  6,  3,  0,  1,  5,  7], dtype=int8)

Now the NaNs are -1, which is fine for now. Later we will convert all these nominal variables to dummy variables anyway, but we will get to that. We could extract more information from "Cabin" but for this demonstration I will leave it here and delete "Cabin".

for dataset in train_test_datasets:
    dataset.drop("Cabin", axis=1, inplace=True)

The final variable that suffers from NaNs is "Embarked". The data in it is less messy so we can go straight to getting the categorical codes.

for dataset in train_test_datasets:
    dataset["Embarked"] = pd.Categorical(dataset["Embarked"])
    dataset["Embarked"] = dataset["Embarked"].cat.codes

Now we have successfully removed all NaN values, but some minor issues with our columns remain. "Name", "Sex" and "Ticket" do not have a numerical type. They are of type object. "Name" and "Ticket" probably don’t tell us a lot about survival, unless we do some serious feature engineering on them. So we will simply drop them.

for dataset in train_test_datasets:
    dataset.drop(["Name", "Ticket"], axis=1, inplace=True)

That just leaves us with "Sex". We can use the categorical codes as above to convert it to numbers.

for dataset in train_test_datasets:
    dataset["Sex"] = pd.Categorical(dataset["Sex"]).codes

Now let us take one more look at our datasets and make sure we took care of NaNs and everything is numeric.

train.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Survived      891 non-null    int64  
 1   Pclass        891 non-null    int64  
 2   Sex           891 non-null    int8   
 3   Age           891 non-null    float64
 4   SibSp         891 non-null    int64  
 5   Parch         891 non-null    int64  
 6   Fare          891 non-null    float64
 7   Embarked      891 non-null    int8   
 8   Cabin Letter  891 non-null    int8   
dtypes: float64(2), int64(4), int8(3)
memory usage: 51.3 KB
"""

test.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Pclass        418 non-null    int64  
 1   Sex           418 non-null    int8   
 2   Age           418 non-null    float64
 3   SibSp         418 non-null    int64  
 4   Parch         418 non-null    int64  
 5   Fare          418 non-null    float64
 6   Embarked      418 non-null    int8   
 7   Cabin Letter  418 non-null    int8   
dtypes: float64(2), int64(3), int8(3)
memory usage: 20.8 KB
"""

Everything looks in order. However, there is one more necessary preprocessing step before we can move to standardization. We need to convert categorical features to dummy variables. That means, each category in a column gets its own new column and a sample of that category will have a 1 there. Let’s see how that looks.

Convert Categories to Dummy Variables

We will use the .get_dummies function on all our categorical columns but first let us try it on the "Pclass" column.

train["Pclass"]
"""
PassengerId
1      3
2      1
3      3
4      1
5      3
      ..
887    2
888    1
889    3
890    1
891    3
Name: Pclass, Length: 891, dtype: int64
"""

pd.get_dummies(train["Pclass"])
"""
             1  2  3
PassengerId         
1            0  0  1
2            1  0  0
3            0  0  1
4            1  0  0
5            0  0  1
        .. .. ..
887          0  1  0
888          1  0  0
889          0  0  1
890          1  0  0
891          0  0  1

[891 rows x 3 columns]
"""

Dummy variables make sense if we just stare at them long enough. What we are looking for is the conversion of each category into an individual column. train["Pclass"] has three categories, [1, 2, 3]. Therefore, we get three dummy variables. A row that has a 1 in "Pclass" gets a 1 in the first column and 0 in the other two. A row that has a 2 gets a 1 in the second column and 0 in the others. A row that has a 3 gets a 1 in the third column and 0 in the others. Now we will use the same function on all our categorical columns. Note that we will not use our usual loop through train and test data because there is no way to generate the dummy variables in place.

categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter","SibSp"]
train_dummies = pd.get_dummies(train,
                               columns=categorical_cols,
                               prefix=categorical_cols)

test_dummies = pd.get_dummies(test,
                              columns=categorical_cols,
                              prefix=categorical_cols)
train_dummies.shape
# (891, 29)
test_dummies.shape
# (418, 26)

The code looks solid but something went wrong. We now have more columns in our train_dummies than in test_dummies. What happened? Let us look at the columns and see if we can spot the issue.

train_dummies.columns
"""
Index(['Survived', 'Age', 'Parch', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3',
       'Sex_0', 'Sex_1', 'Embarked_-1', 'Embarked_0', 'Embarked_1',
       'Embarked_2', 'Cabin Letter_-1', 'Cabin Letter_0', 'Cabin Letter_1',
       'Cabin Letter_2', 'Cabin Letter_3', 'Cabin Letter_4', 'Cabin Letter_5',
       'Cabin Letter_6', 'Cabin Letter_7', 'SibSp_0', 'SibSp_1', 'SibSp_2',
       'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8'],
      dtype='object')
"""

test_dummies.columns
"""
Index(['Age', 'Parch', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_0',
       'Sex_1', 'Embarked_0', 'Embarked_1', 'Embarked_2', 'Cabin Letter_-1',
       'Cabin Letter_0', 'Cabin Letter_1', 'Cabin Letter_2', 'Cabin Letter_3',
       'Cabin Letter_4', 'Cabin Letter_5', 'Cabin Letter_6', 'SibSp_0',
       'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8'],
      dtype='object')
"""

test_dummies is missing "Embarked_-1" and "Cabin Letter_7". Together with "Survived", which is purposely not in the test dataset, that accounts for the three missing columns. This happened because the test data contains no rows with a missing value in "Embarked" (code -1) and no rows with a 7 in "Cabin Letter". This raises another issue. When we converted string categories to numbers with cat.codes, we may have converted letters differently in the test and training dataset. Luckily, we can convert strings directly to dummy variables. This is a great opportunity to take a step back and clean up our code. We will start from scratch.

import pandas as pd

# Load data
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')

# Put train and test into a list so we can apply the same preprocessing to both
train_test_datasets = [train, test]

"""Data wrangling"""
# Fill missing values, extract the cabin letter and drop unused columns
median_age = train["Age"].median()
median_fare = train["Fare"].median()
for dataset in train_test_datasets:
    dataset["Age"].fillna(median_age, inplace=True)
    dataset["Fare"].fillna(median_fare, inplace=True)
    dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)
    dataset.drop("Cabin", axis=1, inplace=True)
    dataset.drop(["Name", "Ticket"], axis=1, inplace=True)

categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter","SibSp"]
train_dummies = pd.get_dummies(train,
                               columns=categorical_cols,
                               prefix=categorical_cols,
                               dummy_na=True)

test_dummies = pd.get_dummies(test,
                              columns=categorical_cols,
                              prefix=categorical_cols,
                              dummy_na=True)

train_dummies.shape
# (891, 32)
test_dummies.shape
# (418, 30)

Now we have more columns because we added the dummy_na=True parameter. This gives us a "Sex_nan" column, although there were no NaNs in the original data. This is excessive but won’t be an issue for our network. Now we just need to add the one column that is missing from test_dummies. It should contain only zeros, because there were no rows with that category in the test dataset.

train_dummies.columns
"""
Index(['Survived', 'Age', 'Parch', 'Fare', 'Pclass_1.0', 'Pclass_2.0',
       'Pclass_3.0', 'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan',
       'Cabin Letter_A', 'Cabin Letter_B', 'Cabin Letter_C', 'Cabin Letter_D',
       'Cabin Letter_E', 'Cabin Letter_F', 'Cabin Letter_G', 'Cabin Letter_T',
       'Cabin Letter_nan', 'SibSp_0.0', 'SibSp_1.0', 'SibSp_2.0', 'SibSp_3.0',
       'SibSp_4.0', 'SibSp_5.0', 'SibSp_8.0', 'SibSp_nan'],
      dtype='object')
"""

test_dummies.columns
"""
Index(['Age', 'Parch', 'Fare', 'Pclass_1.0', 'Pclass_2.0', 'Pclass_3.0',
       'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan', 'Embarked_C',
       'Embarked_Q', 'Embarked_S', 'Embarked_nan', 'Cabin Letter_A',
       'Cabin Letter_B', 'Cabin Letter_C', 'Cabin Letter_D', 'Cabin Letter_E',
       'Cabin Letter_F', 'Cabin Letter_G', 'Cabin Letter_nan', 'SibSp_0.0',
       'SibSp_1.0', 'SibSp_2.0', 'SibSp_3.0', 'SibSp_4.0', 'SibSp_5.0',
       'SibSp_8.0', 'SibSp_nan'],
      dtype='object')
"""

"Cabin Letter_T" is the missing one.

import numpy as np
test_dummies["Cabin Letter_T"] = np.zeros(test_dummies.shape[0])

Now our dummy variables are in order and we can move on to balancing the training data.

Balancing the Training Data

Balancing the training data will be important during learning. We want to train our model to predict survival. Imagine the extreme case where we train a model only on passengers that survived. This model would be terrible, because it has no idea what a non-surviving passenger looks like on paper. Now don’t worry, this training data has rows for both surviving and non-surviving passengers. However, training data where one of the outcomes is more frequent can bias a model. Let us find out what the situation is with this training data.

total_samples = train_dummies.shape[0]  # Number of rows in DataFrame
number_surviving = (train_dummies['Survived'] == 1).sum()  # Number of survivors
perc_survivors = (number_surviving / total_samples) * 100
# 38.38383838383838

In this training data 38% of passengers survived. There are no hard rules on how balanced data should be, but I am not happy with that number. If it was 45% I think it would be fine but we should do something about 38%. Because the number of survivors is lower than the non-survivors, we will randomly select as many non-survivors as there are survivors.

bool_survivors = train_dummies['Survived'] == 1
bool_nonsurvivors = train_dummies['Survived'] == 0
all_survivors = train_dummies[bool_survivors]
all_nonsurvivors = train_dummies[bool_nonsurvivors]
random_nonsurvivors = all_nonsurvivors.sample(number_surviving)
train_balanced = pd.concat((all_survivors, random_nonsurvivors))
train_balanced = train_balanced.sample(frac=1)
(train_balanced["Survived"] == 0).sum()
# 342
(train_balanced["Survived"] == 1).sum()
# 342

Now we have 342 survivors and 342 non-survivors. Perfectly balanced, as all things should be. Concatenating like this we must also remember to shuffle the rows with .sample(frac=1), otherwise we might run into problems later with unbalanced batches. Now that our data is balanced, we can move on to standardization.

Standardization

Both datasets are now free from missing values, categories are converted to dummy variables and our labels are balanced. Now we can move on to standardization, the process of bringing features to the same scale. We achieve this by subtracting the mean and dividing by the standard deviation. This has some advantages for neural networks. It can speed up learning and result in better fits. There are also disadvantages, as some features become harder to interpret because they lose their physical units. Because we rarely try to interpret neural networks, we almost always standardize our data. This is how it works with scikit-learn.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_dummies.iloc[:, 1:])
train_scaled = scaler.transform(train_balanced.iloc[:, 1:])
# Align the test columns with the training columns (minus "Survived")
# so the scaler sees the same features in the same order
test_dummies = test_dummies[train_dummies.columns[1:]]
test_scaled = scaler.transform(test_dummies)

train_scaled = pd.DataFrame(train_scaled,
                            index=train_balanced.index,
                            columns=train_balanced.iloc[:,1:].columns)

test_scaled = pd.DataFrame(test_scaled,
                           index=test_dummies.index,
                           columns=test_dummies.columns)

y = train_balanced["Survived"]

train_balanced["Age"].mean(), train_balanced["Age"].std()
# (29.23489766081871, 13.239715945608669)
train_scaled["Age"].mean(), train_scaled["Age"].std()
# (-0.009735709395380402, 1.0174700960384269)

Originally the mean age is 29.23 and the standard deviation is 13.24. After standardization, the mean is near zero because we subtracted it and the standard deviation is near 1 because we divided by it. So our standardization worked, and all features are now on a comparable scale, roughly that of a standard normal distribution. We are now almost ready to create our actual network. But before we do, we will split our training data into a training and a validation set. This will help us spot overfitting.

Validation Data Split

Using train_test_split we get 615 training and 69 validation samples. test_size=0.1 specifies that 10% of samples should go into the test split, which in our case will be used for validation, as we already have another designated test dataset. We are now ready to build our network and start training.

from sklearn.model_selection import train_test_split

X = train_scaled
X_train, X_validate, y_train, y_validate = train_test_split(X, y,
                                                            test_size=0.1)

X_train.shape
# (615, 31)
X_validate.shape
# (69, 31)

Building the Network with pyTorch

We start by converting our arrays to tensors. This is the data structure pyTorch will expect as input to the network later.

import torch

train_features = torch.tensor(X_train.to_numpy())
train_labels = torch.tensor(y_train.to_numpy())

validation_features = torch.tensor(X_validate.to_numpy())
validation_labels = torch.tensor(y_validate.to_numpy())

Now we build the neural network by calling torch.nn.Sequential.

n_features = train_features.shape[1]
# 31
model = torch.nn.Sequential(torch.nn.Linear(n_features, 50),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50, 1),
                            torch.nn.Sigmoid())

This network takes our 31 input features and transforms them to 50 hidden units using a fully connected linear layer. This layer also learns a bias term by default. We then apply the rectified linear unit to introduce some non-linearity and convert the 50 hidden units to one output unit, to which we apply the sigmoid function. This means our deep neural network has one hidden layer with 50 units. The sigmoid function makes sure that our output is between 0 and 1. This is important because we are making a binary classification of surviving (1) and non-surviving (0) passengers. Now we define our loss function and our learning method.

criterion = torch.nn.BCELoss()

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.001)

Our loss function is binary cross entropy loss, which works well for the binary classification problem we are facing. Our learning algorithm is Adam, a variant of gradient descent; the weight_decay parameter adds L2 regularization. Before we start training we will split our training data into 41 mini batches.

n_batches = 41
train_features_batched = train_features.reshape(n_batches,
                                                int(train_features.shape[0] / n_batches),
                                                train_features.shape[1])
train_labels_batched = train_labels.reshape(n_batches,
                                            int(train_labels.shape[0] / n_batches))

Training our network on the mini batches instead of the whole data makes learning quicker and can result in models that generalize better. Now we are ready to train the network.

n_epochs = 2000
loss_list = []
validate_loss_list = []

for epoch in range(n_epochs):
    for batch_idx in range(n_batches):
        optimizer.zero_grad()

        outputs = model(train_features_batched[batch_idx].float())

        loss = criterion(outputs.flatten().float(),
                         train_labels_batched[batch_idx].float())

        loss.backward()

        optimizer.step()

    # Track the loss on the full training and validation data after each epoch
    with torch.no_grad():
        outputs = model(train_features.float())
        validation_outputs = model(validation_features.float())

        loss = criterion(outputs.flatten().float(),
                         train_labels.float())
        validate_loss = criterion(validation_outputs.flatten().float(),
                                  validation_labels.float())

    loss_list.append(loss.item())
    validate_loss_list.append(validate_loss.item())

print('Finished Training')

import matplotlib.pyplot as plt
plt.plot(loss_list, linewidth=3)
plt.plot(validate_loss_list, linewidth=3)
plt.legend(("Training Loss", "Validation Loss"))
plt.xlabel("Epoch")
plt.ylabel("BCE Loss")
Loss curves of model training. Yours will look different, as we did not seed the randomness during preprocessing or model initialization.

This code starts by creating empty lists to record the loss on the training and validation data. n_epochs tells us how many times we train the network on the entire data. The actual learning happens in the inner loop that exposes batch_idx. We train the network after each mini batch. The output is whatever the network currently spits out for a given batch. Then we compare that output to the known labels, which gives us the loss. loss.backward() calculates the gradient of the loss with respect to the parameters. Finally, the actual learning happens when we call optimizer.step(), which adjusts the parameters of our model according to the gradients. This is why pyTorch is handy; it takes very good care of the gradients. The loss we are plotting is calculated in the outer loop, where the loss over the entire data instead of the mini batches is calculated. The rest is just plotting. So how did it go? Our network learns very quickly. After 250 epochs the validation loss already stops improving. Everything that comes afterwards is probably overfitting the training data. That means the training loss improves while the validation loss stagnates or becomes worse. The last piece of code we have to write calculates our prediction for the test data and saves it in a way that we can upload it to Kaggle to find out how well we did.

test_features = torch.tensor(test_scaled.to_numpy())

test_prediction = model(test_features.float()).detach().numpy().flatten()

test_prediction_binary = (test_prediction > 0.5).astype(int)

test_prediction_df = pd.DataFrame(test_prediction_binary,
                                  index=test.index,
                                  columns=["Survived"])

test_prediction_df.to_csv("prediction_submission.csv")

I will upload three different predictions to Kaggle. One from a network that is untrained. To do so I will set n_epochs to 0. This amounts to guessing, because the network is randomly initialized. Then I will upload predictions from a model that stops after 200 epochs. This should be a well trained model. Then I will use a model that stopped after 2000 epochs. This should be slightly overfit. Let’s see how we did.

We got a 0.39952 accuracy score for the untrained model, meaning that 39.952% of our predictions were correct. Pretty bad, but expected from a guessing model. Our model trained for 200 epochs has a 0.70095 score. Much better and a good sign that our model actually learned something. Our model trained for 2000 epochs got a score of 0.74162. Not exactly what we would expect. This probably means that our validation dataset was not representative of the test dataset. I also read from other people that overfitting on the training data improves the test score slightly in the Titanic example. How could we improve our model?

Where to go from here

I decided to use a neural network because I wanted to write about it, but it is not the best model here. A random forest classifier seems to do better. The highest score without cheating is around 0.83, which is achieved with that kind of model. But what could we do to improve our code above? One great improvement would be to use another validation method. Our validation method sets aside a lot of data that the model is never trained on, and we don’t have massive amounts of data to begin with. A cross-validation technique could help us use that data for training. A second improvement would be a more formal way to determine the epoch where we stop training. We just looked at the loss curve and picked nice looking spots. Early stopping is a technique that could give us a more objective point to stop the training. Finally, we could learn more about the Titanic data and get more features out of it. For example, we discarded the cabin number, keeping only the letter. However, the number could be very important. A good way to learn more about the Titanic data is to browse the Kaggle discussion board.
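As a rough sketch of what early stopping could look like with the objects defined above (the patience value is an arbitrary choice of mine, not something from the original post):

best_validate_loss = float("inf")
patience = 50  # Arbitrary example: epochs we tolerate without improvement
epochs_without_improvement = 0

for epoch in range(n_epochs):
    for batch_idx in range(n_batches):
        optimizer.zero_grad()
        outputs = model(train_features_batched[batch_idx].float())
        loss = criterion(outputs.flatten().float(),
                         train_labels_batched[batch_idx].float())
        loss.backward()
        optimizer.step()

    # Check the validation loss after each epoch
    with torch.no_grad():
        validation_outputs = model(validation_features.float())
        validate_loss = criterion(validation_outputs.flatten().float(),
                                  validation_labels.float()).item()

    if validate_loss < best_validate_loss:
        best_validate_loss = validate_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= patience:
        print(f"Stopping early after epoch {epoch}")
        break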


Introduction to t-SNE in Python with scikit-learn

t-SNE (t-distributed stochastic neighbor embedding) is a popular dimensionality reduction technique. We often have data where samples are characterized by n features. To reduce the dimensionality, t-SNE generates a lower number of features (typically two) that preserves the relationship between samples as well as possible. Here we will learn how to use the scikit-learn implementation of t-SNE and how it achieves dimensionality reduction step by step.

How to use t-SNE with scikit-learn

We will start by performing t-SNE on a part of the MNIST dataset. The MNIST dataset consists of images of hand drawn digits from 0 to 9. Accurately classifying each digit is a popular machine learning challenge. We can load the MNIST dataset with sklearn.

from sklearn.datasets import fetch_openml
import numpy as np
import matplotlib.pyplot as plt

# Load the MNIST data
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Randomly select 1000 samples for performance reasons
np.random.seed(100)
subsample_idc = np.random.choice(X.shape[0], 1000, replace=False)
X = X[subsample_idc,:]
y = y[subsample_idc]

# Show two example images
fig, ax = plt.subplots(1,2)
ax[0].imshow(X[11,:].reshape(28,28), 'Greys')
ax[1].imshow(X[15,:].reshape(28,28), 'Greys')
ax[0].set_title("Label 3")
ax[1].set_title("Label 8")
Two example images from the MNIST dataset.

By default, the MNIST data we fetch comes with 70000 images. We randomly select 1000 of those to make this demonstration faster. Each image consists of 784 pixels and they come as a flat one dimensional array. To display them as an image we reshape them into a 28×28 matrix. The images are in X and their labels in y.

X.shape
# (1000, 784)
# 1000 Samples with 784 features
y.shape
# (1000,)
# 1000 labels
np.unique(y)
# array(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=object)
# The 10 classes of the images

Using t-SNE on this data is shockingly easy thanks to scikit-learn. We simply import the TSNE class, instantiate it with the number of components we want and call fit_transform on our data.

from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sns

# We want to get TSNE embedding with 2 dimensions
n_components = 2
tsne = TSNE(n_components)
tsne_result = tsne.fit_transform(X)
tsne_result.shape
# (1000, 2)
# Two dimensions for each of our images

# Plot the result of our TSNE with the label color coded
# A lot of the stuff here is about making the plot look pretty and not TSNE
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0], 'tsne_2': tsne_result[:,1], 'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)
t-SNE example on MNIST subsample.

Considering that we did not specify any parameters except n_components, this looks pretty good. Before we dive into the parameters, we will go through t-SNE step by step and take some looks under the hood of the scikit-learn implementation.

t-SNE step by step

The Distance Matrix

The first step of t-SNE is to calculate the distance matrix. In our t-SNE embedding above, each sample is described by two features. In the actual data, each point is described by 784 features (the pixels). Plotting data with that many features is impossible and that is the whole point of dimensionality reduction. However, even at 784 features, each point is a certain distance apart from every other point. There are many different distance metrics that make sense but probably the most straightforward one is the euclidean distance.

Definition of euclidean distance for two features. Image credit licensed under CC BY 4.0 by Kmhkmh.

The above definition of the euclidean distance for two features extends to n features: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + … + (pn - qn)^2). Once again we can use scikit-learn to calculate the euclidean distance matrix. Because a distance matrix of the unsorted samples doesn’t look like much, we also calculate it after sorting the samples by label.

from sklearn.metrics import pairwise_distances
y_sorted_idc = y.argsort()
X_sorted = X[y_sorted_idc]

distance_matrix = pairwise_distances(X,
                                     metric='euclidean')

distance_matrix_sorted = pairwise_distances(X_sorted,
                                            metric='euclidean')

fig, ax = plt.subplots(1,2)
ax[0].imshow(distance_matrix, 'Greys')
ax[1].imshow(distance_matrix_sorted, 'Greys')
ax[0].set_title("Unsorted")
ax[1].set_title("Sorted by Label")

When the samples are sorted by label, squarish patterns emerge in the distance matrix. White means smaller euclidean distances. This is reassuring. After all, we would expect the drawing of a one to be more similar to other drawings of a one than to a drawing of a zero. That’s what the squares near the white diagonal represent. t-SNE tries to roughly preserve the distances between samples, but it does not work on the raw distances. It works on the joint probabilities, bringing us to our second step.

Joint Probabilities

The distance matrix tells us how far samples are apart. The joint probabilities tell us how likely it is that samples choose each other as “neighbors”. The two are of course related, because nearby samples should have a higher chance of being neighbors than samples that are further apart. The t-distribution is defined by its mean, the degrees of freedom and the scale (σ). For our t-SNE purposes we set the mean to 0 (at 0, samples are exactly at the same place). The degrees of freedom are set to the number of components minus one. That’s one in our case, since we want two components. The last free parameter is sigma and it is important because it determines how wide the tails of the distribution are.
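As a rough sketch of how such a figure could be produced (my own plotting code, using scipy.stats):

from scipy.stats import t

x = np.linspace(-10, 10, 1000)
fig, ax = plt.subplots(1)
ax.plot(x, t.pdf(x, df=1.0, loc=0.0, scale=1.0))
ax.plot(x, t.pdf(x, df=1.0, loc=0.0, scale=2.0))
ax.legend((r'$\sigma = 1$', r'$\sigma = 2$'))
ax.set_xlabel("Distance")
ax.set_ylabel("Probability density")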

t-distributions for sigma of 1 and 2

This is where the perplexity parameter comes in. The user chooses a perplexity value (recommended values are between 5 and 50) and t-SNE then chooses sigmas that satisfy that perplexity. To understand what this means, consider the first row of our distance matrix. It tells us the distance of our first point to each other point, and as we transform that row with the t-distribution we get a distribution P. The perplexity is defined as 2^H(P), where H is the Shannon entropy. Different values for sigma result in different distributions, which differ in entropy and therefore differ in perplexity.

from scipy.stats import t, entropy

x = distance_matrix[0,1:]
t_dist_sigma01 = t(df=1.0, loc=0.0, scale=1.0)
t_dist_sigma10 = t(df=1.0, loc=0.0, scale=10.0)
P_01 = t_dist_sigma01.pdf(x)
P_10 = t_dist_sigma10.pdf(x)

perplexity_01 = 2**entropy(P_01)
perplexity_10 = 2**entropy(P_10)

dist_min = min(P_01.min(), P_10.min())
dist_max = max(P_01.max(), P_10.max())
bin_size = (dist_max - dist_min) / 100
bins = np.arange(dist_min+bin_size/2, dist_max+bin_size/2, bin_size)
fig, ax = plt.subplots(1)
ax.hist(P_01, bins=bins)
ax.hist(P_10, bins=bins)
ax.set_xlim((0, 1e-6))
ax.legend((r'$\sigma = 01; Perplexity = $' + str(perplexity_01),
           r'$\sigma = 10; Perplexity = $' + str(perplexity_10)))
Distributions of joint probabilities for different values of sigma.

Above we can see what happens to the joint probability distribution as we increase sigma. With increasing sigma the entropy increases and so does the perplexity. t-SNE performs a binary search for the sigma that produces the perplexity specified by the user. This means that the perplexity controls the chance of far away points being chosen as neighbors. Therefore, perplexity is commonly interpreted as a rough measure of the number of neighbors each sample has. The default value for perplexity is 30 in the sklearn implementation of t-SNE. Instead of implementing our own binary search we will take a shortcut to calculating the joint probabilities and use sklearn's internal function to do it.

from sklearn.manifold import _t_sne

perplexity = 30  # Same as the default perplexity
p = _t_sne._joint_probabilities(distances=distance_matrix,
                                desired_perplexity=perplexity,
                                verbose=False)

As a small note, our joint probabilities are no longer a matrix. You may have noticed that the distance matrix is symmetric about its diagonal and the diagonal is all zeros. So we only keep the upper triangle of the matrix in a flat array p. That’s all we need to move from joint probabilities to the next step.
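Since the distance matrix is 1000 × 1000, p should contain 1000 * 999 / 2 = 499500 entries. A quick check (assuming the 1000-sample subset from above):

p.shape
# (499500,)  # The flattened upper triangle of the 1000 x 1000 matrix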

Optimize Embedding with Gradient Descent

Now that we have the joint probabilities from our high dimensional data, we want to generate a low dimensional embedding with just two features that preserves the joint probabilities as well as possible. First we need to initialize our low dimensional embedding. By default, sklearn uses a random initialization, so that’s what we will use. Once we have initialized the embedding, we optimize it using gradient descent. This optimization is at the core of t-SNE and we will be done afterwards. To achieve a good embedding, t-SNE minimizes the Kullback-Leibler divergence between the joint probabilities of the data and their embedding. It is a measure of the similarity between two distributions. The sklearn TSNE class comes with its own implementation of the Kullback-Leibler divergence and all we have to do is pass it to the _gradient_descent function together with the initial embedding and the joint probabilities of the data.

# Create the initial embedding
n_samples = X.shape[0]
n_components = 2
X_embedded = 1e-4 * np.random.randn(n_samples,
                                    n_components).astype(np.float32)

embedding_init = X_embedded.ravel()  # Flatten the two dimensional array to 1D

# kl_kwargs defines the arguments that are passed down to _kl_divergence
kl_kwargs = {'P': p,
             'degrees_of_freedom': 1,
             'n_samples': 1000,
             'n_components':2}

# Perform gradient descent
embedding_done = _t_sne._gradient_descent(_t_sne._kl_divergence,
                                          embedding_init,
                                          0,     # Iteration to start counting from
                                          1000,  # Maximum number of iterations
                                          kwargs=kl_kwargs)

# Get first and second TSNE components into a 2D array
tsne_result = embedding_done[0].reshape(1000,2)

# Convert do DataFrame and plot
tsne_result_df = pd.DataFrame({'tsne_1': tsne_result[:,0],
                               'tsne_2': tsne_result[:,1],
                               'label': y})
fig, ax = plt.subplots(1)
sns.scatterplot(x='tsne_1', y='tsne_2', hue='label', data=tsne_result_df, ax=ax,s=120)
lim = (tsne_result.min()-5, tsne_result.max()+5)
ax.set_xlim(lim)
ax.set_ylim(lim)
ax.set_aspect('equal')
ax.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.0)

And that’s it. It doesn’t look identical to the t-SNE we did above because we did not seed the initialization of the embedding or the gradient descent. In fact, your results will look slightly different if you follow this guide. The sign of success you are looking for is similarly well defined clusters. We could have gone a bit deeper here and there. For example, we could have written our own implementations of the Kullback-Leibler divergence or gradient descent. I’ll leave that for another time. Here are some useful links if you want to dig deeper into t-SNE or its sklearn implementation.

Smoothing Data by Rolling Average with NumPy

Time series data often comes with some amount of noise. One of the easiest ways to get rid of noise is to smooth the data with a simple uniform kernel, also called a rolling average. The title image shows data and their smoothed version. The data is the second discrete derivative of the recording of a neuronal action potential. Derivatives are notoriously noisy. We can get the result shown in the title image with np.convolve.

import numpy as np

data = np.load("example_data.npy")
kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved = np.convolve(data, kernel, mode='same')

Convolution is a mathematical operation that combines two arrays. One of those arrays is our data and we convolve it with the kernel array. During convolution we center the kernel on a data point. We multiply each kernel value with its corresponding data point, sum up the results, and that sum becomes the new data point at the center. Let’s look at an example.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel)
# array([ 2,  7, 13, 23, 24, 18, 19,  4])

For this result to make sense you must know, that np.convolve flips the kernel around. So step by step the calculations go as follows:

[4, 3, 2, 1]  # The flipped kernel
          x
         [2, 3, 1, 4, 1]  # The data
          2= 2
         [2]

   [4, 3, 2, 1]
          x  x
         [2, 3, 1, 4, 1]
          4+ 3= 7
         [2, 7]

      [4, 3, 2, 1]
          x  x  x
         [2, 3, 1, 4, 1]
          6+ 6+ 1=13
         [2, 7, 13]

         [4, 3, 2, 1]
          x  x  x  x
         [2, 3, 1, 4, 1]
          8+ 9+ 2+ 4= 23
         [2, 7,13,23]
# This continues until the arrays stop touching. You get the idea.

One thing you’ll notice is that the edges are problematic. There is really no good way to avoid that. Data points at the edges only see part of the kernel but the mode parameter defines what should happen at the edges. I prefer the 'same' mode because it means that the new array will have the same shape as the original data, which makes plotting easier. However, if you start to use more complicated kernels, the edges might become virtually useless. In that case, mode should be 'valid'. Then, the values at the edges that did not see the entire kernel are discarded. The output array is smaller in shape than the input array.

data = [2, 3, 1, 4, 1]
kernel = [1, 2, 3, 4]
np.convolve(data, kernel, mode='valid')
# array([23, 24])

The default behavior you saw above is called 'full'. It keeps all data points, so the output array is larger in shape than the input array. Another important detail of our smoothing kernel is the normalization: we need to divide the array of ones by its length. Can you guess what would happen if we forgot about dividing it?

If you guessed that the signal would become larger in magnitude you guessed right. We would be summing up the data points in the kernel window instead of averaging them. By dividing the kernel by its length we ensure that we take the average. But the kernel size is even more important. If we make the kernel larger the outcome changes dramatically.
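To see this on a small made-up array (not the action potential data from above), compare a kernel of ones with its normalized version:

toy_data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sum_kernel = np.ones(3)        # Sums three neighboring points
mean_kernel = np.ones(3) / 3   # Averages three neighboring points

np.convolve(toy_data, sum_kernel, mode='same')
# array([ 3.,  6.,  9., 12.,  9.])
np.convolve(toy_data, mean_kernel, mode='same')
# array([1., 2., 3., 4., 3.])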

kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
data_convolved_10 = np.convolve(data, kernel, mode='same')

kernel_size = 20
kernel = np.ones(kernel_size) / kernel_size
data_convolved_20 = np.convolve(data, kernel, mode='same')

plt.plot(data_convolved_10)
plt.plot(data_convolved_20)
plt.legend(("Kernel Size 10", "Kernel Size 20"))

The larger we make the kernel, the smaller sharp peaks become, and the peaks are also shifted in time. To be specific, a rolling mean is a low-pass filter. This means that it leaves low frequency signals alone while making high frequency signals smaller. Sharp changes in the data have high frequency content. If we make the kernel larger, the filter attenuates high frequency signals more strongly. This is exactly how the rolling average works: it gets rid of high frequency noise. It also means that we must be careful not to distort the signal too much with the rolling average filter.
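To make the low-pass behavior visible, we can look at the magnitude of the Fourier transform of the averaging kernel. This is a small sketch of my own, not one of the original figures:

kernel_size = 10
kernel = np.ones(kernel_size) / kernel_size
gain = np.abs(np.fft.rfft(kernel, n=1000))  # Frequency response of the kernel
freqs = np.fft.rfftfreq(1000)               # Frequencies in cycles per sample
plt.plot(freqs, gain)
plt.xlabel("Frequency (cycles per sample)")
plt.ylabel("Gain")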

Threshold Detection in NumPy

Many signals are easily detected by their size. We will learn how to detect the indices where signals cross a threshold with NumPy. These are our practice signals.

import numpy as np
import matplotlib.pyplot as plt
data = np.array([0, 0, 0, 5, 5, 5, 5, 0, 0, 0, 0, 4, 4, 4, 0, 0, 0])
plt.plot(data, marker='o')

We will perform two simple steps to detect the threshold crossings: 1. Make the data binary, in a way that they are true when larger than the threshold and false when lower or equal. 2. Take the difference of the binary signal. This gives us a boolean array that is true when the threshold was crossed. We can combine those steps into one line.

threshold = 2
threshold_crossings = np.diff(data > threshold, prepend=False)

Plotting shows us that threshold_crossings is true after the threshold was crossed.

plt.plot(data, marker='o')
plt.plot(threshold_crossings, marker='o')
plt.legend(("Data", "Threshold Crossings"))

To get the indices of the threshold crossings we can use np.argwhere(), which returns the true indices from a boolean array.

np.argwhere(threshold_crossings)[:,0]
# array([ 3,  7, 11, 14], dtype=int64)

Threshold crossings occur at 3, 7, 11 and 14. Sometimes we only need the upward or downward crossings. We can simply isolate those by slicing the array of indices.

np.argwhere(threshold_crossings)[::2, 0]  # Upward crossings
# array([ 3, 11], dtype=int64)
np.argwhere(threshold_crossings)[1::2, 0]  # Downward crossings
# array([ 7, 14], dtype=int64)

Sometimes we want to find the point before the threshold is crossed, rather than after. There is a simple trick: in np.diff, instead of setting prepend=False, we set append=False.

threshold = 2
post_threshold_crossings = np.diff(data > threshold, prepend=False)
pre_threshold_crossings = np.diff(data > threshold, append=False)
plt.plot(data, marker='o')
plt.plot(post_threshold_crossings, marker='o')
plt.plot(pre_threshold_crossings, marker='o')
plt.legend(("Data", "Post Crossings", "Pre Crossings"))

Make sure to check out the documentation for np.diff and bonus points if you can figure out why exactly this works. Detecting threshold crossings is an easy but important part in most of my analysis pipelines and now you can do it too.

Comparisons and Logic Functions in NumPy

  • We can compare arrays with scalars and other arrays using Python's standard comparison operators
  • There are two different boolean summaries of an array: arr.all() is True if all elements are True, arr.any() is True if any element is True
  • To perform logical functions we need the special NumPy functions np.logical_and, np.logical_or, np.logical_not, np.logical_xor

Introduction

Logic functions allow us to check if logical statements about our arrays are true or false. Luckily, logic functions are very consistent with other array functions. They are performed element-wise by default and they can be performed on specific axes.

Comparing Arrays and Scalars

The same way we use the arithmetic operators, we can use all the comparison operators: >, >=, <, <=, ==, !=

import numpy as np
arr = np.array([-1, 0, 1])
arr > 0
# array([False, False,  True])
arr >= 0
# array([False,  True,  True])
arr < 0
# array([ True, False, False])
arr <= 0
# array([ True,  True, False])
arr == 0
# array([False,  True, False])
arr != 0
# array([ True, False,  True])

Comparing Arrays with Arrays

The result of a comparison is a boolean array containing the values True or False. This particular case shows us comparisons of arrays with scalars. All comparisons are performed element-wise, so each element is compared against the scalar. We can also compare arrays to arrays if their shapes allow it.

arr_one = np.array([-1, 0, 1])
arr_two = np.array([1, 0, -1])
arr_one > arr_two
# array([False, False,  True])
arr_one >= arr_two
# array([False,  True,  True])
arr_one < arr_two
# array([ True, False, False])
arr_one <= arr_two
# array([ True,  True, False])
arr_one == arr_two
# array([False,  True, False])
arr_one != arr_two
# array([ True, False,  True])

Truth Value of Arrays and Elements

Sometimes we need to check if an array contains any elements that are considered True in a boolean context. While the boolean value of array elements is well defined, the truth value of an entire array is not defined.

arr = np.array([0, 1])
bool(arr[0])
# False
bool(arr[1])
# True
bool(arr)
# ValueError: The truth value of an array with more 
# than one element is ambiguous. Use a.any() or a.all()

NumPy error messages are great. This one is so great that it even tells us which method we need to use to get at the truth value of an array. We can use arr.any() to find out if any of the elements evaluate to True or arr.all() to find out if all elements are True.

arr_one = np.array([0, 0])
arr_one.any()
# False
arr_one.all()
# False
arr_two = np.array([0, 1])
arr_two.any()
# True
arr_two.all()
# False
arr_three = np.array([1, 1])
arr_three.any()
# True
arr_three.all()
# True

This can be useful to find out whether an array is empty

arr = np.array([])
arr.any()
# False

Logical Operations

Finally, we need to look at four more logical operations: and, or, not and xor. Unfortunately we can’t just use the Python keywords. The reason is in the error message above: “ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()”. It is ambiguous because NumPy does not know if we want to perform the operation element-wise or on the truth value of the entire array. NumPy does not try to guess which one we mean, so it throws the error. To get these logical functions we need to call the more explicit NumPy functions.

arr_one = np.array([0,1,1])
arr_two = np.array([0,0,1])
np.logical_and(arr_one, arr_two)
# array([False, False,  True])
np.logical_or(arr_one, arr_two)
# array([False,  True,  True])
np.logical_not(arr_one)
# array([ True, False, False])
np.logical_xor(arr_one, arr_two)
# array([False,  True, False])

Summary

We are now well equipped to deal with arrays. We can compare arrays with scalar values and other arrays using the standard comparison operators. We can also perform logical operations on arrays with the special NumPy functions (logical_and, logical_or, logical_not and logical_xor). Finally, we can get two different boolean summaries of an array using arr.all() and arr.any().

NumPy Array Data Type

  • Any array has a data type (dtype)
  • The dtype determines what kind of data is stored in the array
  • Not all operations work for all dtypes

Introduction to Data Types

Having a data type (dtype) is one of the key features that distinguishes NumPy arrays from lists. In lists, the types of elements can be mixed: one index of a list can contain an integer, another can contain a string. This is not the case for arrays. In an array, each element must be of the same type. This gives the array some of its efficiency, because operations know in advance what kind of data they will find in each element simply by looking up the data type. At the same time it makes arrays slightly less flexible, because some operations are undefined for some data types and we cannot assign any kind of data to an array. But how does NumPy decide what data type an array should have in the first place?

Guessing or Defining the dtype

So far we were able to create arrays effortlessly without knowing what dtype even means. That is because NumPy just takes a guess at what the dtype should be, based on the input it gets for the array.

arr = np.array([4, 3, 2])
arr.dtype
# dtype('int32')
arr = np.array([4, 3.0, 2])
arr.dtype
# dtype('float64')
arr = np.array([4, '3', 2])
arr.dtype
# dtype('<U11')

In the first case, each element of the list we pass to the array constructor is an integer. Therefore, NumPy decides that the dtype should be integer (a 32-bit integer here, 64-bit on some platforms). In the second case, one of the elements (3.0) is a floating-point number. Floats are a more complex data type than integers, which means that all other elements have to follow the more complex one. Therefore, all elements of the array are converted to floats and are stored with the dtype float64. Strings are an even more complex dtype. Because ‘3’ is a string in the final example, the dtype becomes ‘<U11’. U stands for unicode, a type of string encoding, and the number indicates the maximum string length the array can store. In all three cases NumPy guesses the dtype according to the content of the list. This works well most of the time, but we can also explicitly define the dtype.

arr = np.array([4, 3, 2], dtype=float)  # builtin float; the old np.float alias was removed from NumPy
arr.dtype
# dtype('float64')
arr = np.array([4, 3, 2], dtype=str)
arr.dtype
# dtype('<U1')
arr = np.array([4, 3, 2], dtype=bool)
arr.dtype
# dtype('bool')
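
We can also convert an existing array to another dtype after the fact with the .astype method. A minimal sketch:

arr = np.array([4, 3, 2])
arr.dtype
# dtype('int32')  # int64 on some platforms
arr_float = arr.astype(float)
arr_float.dtype
# dtype('float64')
arr_float
# array([4., 3., 2.])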

Converting arrays to another dtype can be necessary because some operations only work on certain dtypes. A dtype that is particularly problematic is the object dtype. It is the most flexible dtype, but it can cause a lot of problems for both experts and beginners.

The object dtype and the Curse of Flexibility

Most dtypes are very specific. They tell you whether the array contains numbers (integer and float dtypes) or strings (the unicode ‘U’ dtypes). Not so the object dtype. It only tells you that whatever is inside the array is a Python object, and since everything in Python is an object, that is not saying much. This makes an array as flexible as a list: anything can be stored. That is also where the problems come in.

arr = np.array([[3, 2, 1], [2, 5]], dtype=object)  # recent NumPy requires dtype=object for ragged lists
arr.dtype
# dtype('O')  # 'O' means object
arr + 5
# TypeError: can only concatenate list (not "int") to list

Suddenly, the plus operation between an array and a scalar fails. What went wrong? Starting from the top: the nested lists have different lengths, so the array can be neither a (2, 3) nor a (2, 2) array of dtype integer. Older NumPy versions therefore guessed the object dtype on their own; recent versions refuse to guess for such ragged input, which is why we pass dtype=object explicitly above. Either way, the result is a (2,) array of dtype object that contains two lists, the first of length 3 and the second of length 2. NumPy generally stores anything that is more complex than a string, such as a list, as an object. The error then occurs because the plus operation is not defined between a list and an integer. But that also means the operation will work if the objects contained in the array happen to support it.

arr = np.array([3, 2, 1], dtype=object)
arr.dtype
# dtype('O')
arr + 5
# array([8, 7, 6], dtype=object)

This is one of the main problems of the object dtype. Operations work only sometimes, and to know whether an operation will work, every element has to be checked. With other dtypes we know which operations will work just by looking at the dtype.

Summary

The dtype is one of the concepts closely related to the internal workings of NumPy. There is a lot that could be said about the details, but effective beginners only need to remember a few points. First, the dtype determines what is stored in the array. All elements of an array have to conform to a specific type and the dtype tells us which one. Second, NumPy guesses the dtype based on the literal data unless we specify which dtype we want. Guessing works most of the time, but sometimes explicit type conversion is necessary. Third, operations that we know and love from numeric dtypes (integers, floats) may not work on other dtypes (strings, objects). This is particularly annoying for beginners. If you have hard-to-debug errors, find out what dtype your arrays actually have.

Array Indexing with NumPy

  • Indexing is used to retrieve or change elements of an array
  • Slice syntax (start:stop:step) gets a range of elements
  • Integer and boolean arrays can get an arbitrary set of elements

Introduction to Array Indexing

Indexing is an important feature that allows us to retrieve and reassign specific parts of an array. You probably already know the basics of indexing from Python lists and tuples. You can index into NumPy arrays the same way you index into those sequences, but NumPy indexing comes with many extra features we will learn about here. First, let's look at single value indexing.

Single Value Indexing

We can use indexing to get single (scalar) values from an array. Indexing is always done with square brackets and we always start counting at 0.

import numpy as np
arr = np.arange(10,15)
arr
array([10, 11, 12, 13, 14])
arr[0]
10
arr[4]
14

Note that single value indexing does not return an array with a single entry but rather a NumPy integer. To get a single value from a multi-dimensional array we need to use multiple indices that are separated by commas.

arr = np.arange(20)
arr = arr.reshape((2,2,5))
arr
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
arr[0,0,1]
1
arr[1,0,4]
14

I recommend the comma notation, because each pair of square brackets creates an intermediate array, but you can also chain multiple square brackets like you would for Python sequences.

arr
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],

       [[10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19]]])
arr[0][0][0]
0
arr[1][0][4]
14

We can also use indexing to reassign elements of an array.

arr = np.arange(10,15)
arr
# array([10, 11, 12, 13, 14])
arr[1] = 20
arr
# array([10, 20, 12, 13, 14])

Slice Indexing

To retrieve a single value, our indices need to resolve all dimensions of the array and arrive at a single element. Whenever at least one dimension remains unspecified, we get back an array (technically an array view).

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [ 5,  6,  7,  8,  9]],
                [[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19]]])
arr[0, 1]
array([5, 6, 7, 8, 9])
arr[1]
array([[10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])
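
A word of caution, which is my addition here: these sub-arrays are views into the original data, so modifying a view modifies the original array as well.

sub = arr[1]  # a view into arr, not a copy
sub[0, 0] = 99
arr[1, 0, 0]
99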

To take an entire dimension we can use the colon.

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [ 5,  6,  7,  8,  9]],
                [[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19]]])
arr[0, :, 0]
array([0, 5])
arr[:, 0, 0]
array([ 0, 10])

The colon is very useful for indexing in general, because it allows us to take a slice of values instead of a single value. The syntax of the slice follows start:stop:step. If we leave out start, the slice starts at 0. If we leave out stop, it goes to the end of the dimension. If we leave out step, the step defaults to 1.

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [ 5,  6,  7,  8,  9]],
                [[10, 11, 12, 13, 14],
                 [15, 16, 17, 18, 19]]])
arr[0, 0, 1:5:2]
# array([1, 3])
arr[0, 0, 1:4]
# array([1, 2, 3])
arr[0, 0, 1:]
# array([1, 2, 3, 4])
arr[0, 0, :3]
# array([0, 1, 2])
arr[0, 0, :]
# array([0, 1, 2, 3, 4])
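
Slices can also appear on the left-hand side of an assignment, which reassigns a whole range of elements at once; a quick sketch:

arr = np.arange(10, 15)
arr[1:3] = 0
arr
# array([10,  0,  0, 13, 14])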

Index Array

So far we learned that we can use integers and slices for indexing. Now we learn that we can also use arrays to index into an array. When we use an array to index, that array has to contain either integers or boolean values. Let's take a look at integer array indexing first.

arr = np.arange(10,50,3)
idc = np.arange(5)
idc.dtype
dtype('int32')
arr[idc]
array([10, 13, 16, 19, 22])
idc = np.arange(5,8)
arr[idc]
array([25, 28, 31])
idc = np.array([1,2,4])
arr[idc]
array([13, 16, 22])

Note that in the examples where we generate index arrays with np.arange, we could achieve the same result with a slice as shown above and save one line of code. Integer arrays are most useful when they are generated by a process that is more complicated than np.arange. One example is np.argwhere, which we will learn more about in a later post.
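
To give a hedged preview (the post only mentions np.argwhere; here I use np.flatnonzero because it directly returns a flat index array), this is what a computed index array can look like:

arr = np.arange(10, 50, 3)
idc = np.flatnonzero(arr > 35)  # indices of all elements greater than 35
idc
array([ 9, 10, 11, 12, 13])
arr[idc]
array([37, 40, 43, 46, 49])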

Boolean Array

Boolean arrays also deserve at least one post of their own, but here I will give you a teaser. We only want to retrieve those values that satisfy a greater-than condition.

arr = np.array([[[ 0,  1,  2,  3,  4],
                 [10, 11, 12, 13, 14]],
                [[ 5,  6,  7,  8,  9],
                 [15, 16, 17, 18, 19]]])
boolean_idc = arr > 10
boolean_idc
array([[[False, False, False, False, False],
        [False,  True,  True,  True,  True]],

       [[False, False, False, False, False],
        [ True,  True,  True,  True,  True]]])
arr[boolean_idc]
array([11, 12, 13, 14, 15, 16, 17, 18, 19])
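
Boolean indexing also works on the left-hand side of an assignment. Continuing the example above, we can set every element greater than 10 to zero:

arr[boolean_idc] = 0
arr
array([[[ 0,  1,  2,  3,  4],
        [10,  0,  0,  0,  0]],

       [[ 5,  6,  7,  8,  9],
        [ 0,  0,  0,  0,  0]]])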

Summary

We learned that indexing is useful to retrieve values and reassign parts of an array. There are several ways to index. First, we can use single integers to get to an element of a certain dimension. We can also use slices with the colon syntax start:stop:step to get at a sequence of elements. Furthermore, there are two advanced indexing techniques, where we can use arrays containing integers or booleans to get an arbitrary collection of elements.

Broadcasting in NumPy

  • Broadcasting is triggered when an arithmetic operation is done on two arrays of different shape
  • The goal of broadcasting is to make both arrays the same shape by performing transformations on the shape of the smaller array
  • Once arrays have the same shape, the operation is applied element-wise
  • If the arrays cannot be broadcast an error is raised

Broadcasting Introduction

When we try to add two arrays together with the plus operator, addition is performed element-wise. That means each element is added to the corresponding element in the other array. However, this only works when both arrays have the same shape. If two arrays have different shapes, a process called broadcasting tries to resolve the difference between the arrays by performing a series of transformations on the shape of the array with lower dimensionality. To understand broadcasting we need to understand the steps broadcasting performs. Let's look at a quick example.

import numpy as np
arr_one = np.array([[4, 3, 2, 5, 6, 2],
                    [30, 34, 1, 50, 60, 56],
                    [22, 34, 32, 21, 12, 6]])
arr_two = np.array([1, 10, 20, 30, 40, 50])
arr_one.shape
(3, 6)
arr_two.shape
(6,)
arrs_plus = arr_one + arr_two
arrs_plus
array([[  5,  13,  22,  35,  46,  52],
       [ 31,  44,  21,  80, 100, 106],
       [ 23,  44,  52,  51,  52,  56]])

This one works despite both arrays having different shapes and even a different number of elements. This next one does not work under seemingly similar circumstances.

arr_one = np.array([[4, 3, 2, 5, 6, 2],
                    [30, 34, 1, 50, 60, 56],
                    [22, 34, 32, 21, 12, 6]])
arr_two = np.array([4, 40, 20])
arr_one.shape
(3, 6)
arr_two.shape
(3,)
arrs_plus = arr_one + arr_two
ValueError: operands could not be broadcast together with shapes (3,6) (3,)

What happened here? In the first example we add an array of shape (6,) to an array of shape (3, 6) and it works. In the second example we add an array of shape (3,) to a (3, 6) array and get an error. In both examples the arrays have different shapes. Therefore, broadcasting is triggered, and to understand what happens we need to understand the broadcasting sequence. Let's first work through the working example.

# Broadcasting rules in order
"""
Rule #1: Only the shape of the array with fewer dimensions is transformed.
Rule #2: Array shapes are aligned starting from the right.
         (3, 6)
            (6,)
Rule #3: Aligned dimensions must be equal, or one of them must be 1.
         Otherwise broadcasting fails with a ValueError.
         Here 6 equals 6, so we don't get an error and continue.
Rule #4: Missing dimensions are filled in on the left with size 1.
         (3, 6)
         (1, 6)
Rule #5: Dimensions of size 1 are repeated to match the other array.
         (3, 6)
         (3, 6)
Done. The operation can now be executed element-wise.
"""

These are the five broadcasting rules that are followed in order. They explain why operations between a (3,) and a (3, 6) array fail. After aligning both array shapes we encounter a problem. 3 does not equal 6, so rule #3 is violated and gives us the ValueError. There are several ways to make this operation work. However, it is best to first understand array indexing before delving into those. We will learn about array indexing in the next post.
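
Just as a teaser (np.newaxis has not been introduced yet in this series), one common fix is to give arr_two a second dimension of length 1, so that the shapes align as (3, 6) and (3, 1) and rule #5 can stretch the 1 to 6:

arr_two_col = arr_two[:, np.newaxis]  # same as arr_two.reshape((3, 1))
arr_two_col.shape
(3, 1)
(arr_one + arr_two_col).shape
(3, 6)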

Summary

If it weren’t for broadcasting we would have to manually convert the shape of arrays so that they are equal before we can perform arithmetic operations on them. Luckily, we learned that broadcasting always happens when we want to perform an operation on two arrays of different shapes. It tries to resolve the difference in shape by performing a series of steps on the smaller array. When broadcasting finishes successfully the operation can be performed element-wise. If broadcasting fails an error is raised.

NumPy Arrays and Shape

  • The same values can be stored in arrays with different shapes
  • Array methods can perform different operations depending on the array shape
  • The methods .reshape and .flatten change the shape of an array

Introducing Array Shape

Any array has a shape, and the shape of an array is important for what kind of operations we can perform on it. Array shape is sometimes hard to imagine, even for experienced programmers, so let's just look at some code.

import numpy as np
my_array = np.array([3, 2, 5, 6, 3, 4])
my_array.shape
(6,)
my_array_reshaped = my_array.reshape((2,3))
my_array_reshaped.shape
(2, 3)
my_array_reshaped
array([[3, 2, 5],
       [6, 3, 4]])

Here we create an array with 6 elements and my_array.shape tells us that these 6 elements are arranged in a single dimension that has a length of 6. We then reshape the array with its .reshape method into an array with two rows and three columns. This doesn’t look immediately useful, but imagine we did an experiment under a control and an experimental condition with three replicates each. You’d clearly want a structure that represents this. Also, we went from a vector to a matrix with just one line of code. The most important part of array shape is that we can make array methods operate only on specific dimensions. To do so we just need to pass the axis argument.

my_array = np.array([[3, 2, 5],
                     [6, 3, 4]])
dim0_sum = my_array.sum(axis=0)
dim0_sum
array([9, 5, 9])
dim1_sum = my_array.sum(axis=1)
dim1_sum
array([10, 13])

Remember that we start out with a (2, 3) array, 2 rows and 3 columns. When we call sum(axis=0) on that array the 0th dimension is eliminated: the array goes from a (2, 3) shape to a (3,) shape. It does so by calculating the sum across the 0th dimension. Likewise, when we pass sum(axis=1) the 1st dimension gets eliminated in the same way and the array becomes a (2,) array. The same concept works of course for arrays of any dimension. But let's get back to array shapes. An array cannot be converted to just any shape; the number of elements it contains limits the shapes it can take.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5,6))
my_array_reshaped.shape
(5, 6)
my_array_reshaped = my_array.reshape((5,7))
ValueError: cannot reshape array of size 30 into shape (5,7)

Converting from (30,) to (5, 7) didn’t work for one simple reason: 5 times 7 is 35, not 30. In other words, the new array would have more elements than the original array, and NumPy will not just invent new elements to make reshaping work. If the number of elements checks out, we can reshape not only to two-dimensional arrays but to any dimension.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_reshaped.shape
(5, 2, 3)
my_array_reshaped
array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23]],

       [[24, 25, 26],
        [27, 28, 29]]])
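
As a small convenience that the post does not cover, one of the dimensions can be given as -1 and NumPy will infer it from the total number of elements:

my_array.reshape((5, -1)).shape
(5, 6)
my_array.reshape((2, 3, -1)).shape
(2, 3, 5)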

Of course we can also reshape from higher to lower dimensions.

my_array = np.array([[3, 2, 5],
                     [6, 3, 4]])
my_array_reshaped = my_array.reshape((6,))
my_array_reshaped.shape
(6,)
my_array.shape
(2, 3)

If you want to combine all dimensions into one single dimension, you can use the .flatten method.

my_array = np.arange(30)  # A (30,) array
my_array_reshaped = my_array.reshape((5, 2, 3))
my_array_flattened = my_array_reshaped.flatten()
my_array_flattened.shape
(30,)
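
A closely related method, mentioned here as an aside, is .ravel. It also collapses the array into one dimension, but it returns a view of the original data where possible, whereas .flatten always returns a copy.

my_array_raveled = my_array_reshaped.ravel()
my_array_raveled.shape
(30,)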

Why we need array shapes

We saw how to manipulate array shape and how array methods can use the shape of an array. Let's think a bit about the real-world usage of array shape. Let's say you are working on an image processing project. You are lucky and the images are already pre-processed so that each image has 64 pixels in both dimensions. Each image is therefore an array of shape (64, 64), but your dataset consists of 1000 images, so you want your dataset to be stored as a (1000, 64, 64) array. But then your image processing project becomes a volume processing project and each volume has 100 slices, so you need a (1000, 100, 64, 64) array. But wait, you are actually working on video files with 20000 frames per volume, so you need a (1000, 20000, 100, 64, 64) array. It is rare that you will have to go beyond five dimensions, but you can. In several fields it is very easy to end up with five-dimensional arrays (think fMRI).

Summary

Here we learned that the shape of an array is useful to store high-dimensional data meaningfully and to have array methods operate only on specific dimensions. The .reshape method is important to change the shape of an existing array and the .flatten method can collapse an array into a single dimension. In the next blog post we will learn about broadcasting. Broadcasting is a mechanism that is triggered whenever we perform an arithmetic operation on two arrays of different shapes. If two arrays have identical shape, the operation is performed element-wise. If they have different shapes, broadcasting performs a series of transformations on the lower-dimensional array to make both arrays identical in shape and finally performs the operation element-wise.