The deep feedforward neural network is the simplest network architecture and a great entry point to many deep learning concepts. Feedforward networks can also be quite effective for many applications, although they have been replaced by more specialized architectures (for example recurrent or convolutional neural networks) in most areas. Here we will build one with PyTorch, going all the way from feature selection to training. The dataset comes from the Titanic challenge, the introductory data science competition on Kaggle, where our model has to predict who survives the sinking of the Titanic. We will start by looking at our data.
Missing Values and String to Numerical Conversion
First we need to load the data and find out what we are dealing with. It already comes split into training and test data, both being .csv files. We load both files and take a look at their general structure with the .info() method.
import pandas as pd

train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')

train.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Survived  891 non-null    int64
 1   Pclass    891 non-null    int64
 2   Name      891 non-null    object
 3   Sex       891 non-null    object
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64
 6   Parch     891 non-null    int64
 7   Ticket    891 non-null    object
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object
 10  Embarked  889 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
"""

test.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Pclass    418 non-null    int64
 1   Name      418 non-null    object
 2   Sex       418 non-null    object
 3   Age       332 non-null    float64
 4   SibSp     418 non-null    int64
 5   Parch     418 non-null    int64
 6   Ticket    418 non-null    object
 7   Fare      417 non-null    float64
 8   Cabin     91 non-null     object
 9   Embarked  418 non-null    object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB
"""
For each passenger we have 11 columns: 10 features plus the 'Survived' label. In the test data, the 'Survived' column is missing, because our goal is to predict it. For the training data we know who survived, and we will use that knowledge to predict who survived in the test data. Before we can do that, we need to take a closer look at our data. The first issue we encounter is missing values. The training data contains 891 passengers, but the 'Age' column contains only 714 non-null values. The rest are missing. We say they are not-a-number (NaN), and we need to take care of them, because feeding NaNs into our model would completely destroy the calculations. One option is to remove the rows that contain NaNs. This would exclude a large number of rows, probably more than we are comfortable with, considering that there are other columns with NaNs as well. Additionally, the test data also has NaNs. Should we just give up on those test passengers and guess?
A better idea is to replace missing values with the column mean: given no other information, the mean is the best guess. To be extra careful, we can take the column median instead, which is more resistant to outliers. Importantly, we will replace NaNs in the test data with the median from the training data. This is not strictly necessary for the Titanic toy example, but it is good practice if you ever design an important model that will be deployed on completely unknown data. Generally, we should use the test data exclusively to calculate our test accuracy, nothing else. It should never touch the training data, and we should act as if we don't even have access to it until we get to model testing. There are more sophisticated ways to replace NaNs, but they go beyond the scope of this blog post. Let us replace NaNs with the median using the .fillna method. Because we need to apply the same preprocessing to the training and test dataset, we will loop through both. We will do the same for the "Fare" column; there is only one value missing and it is in the test data, but we must fill it.
train_test_datasets = [train, test]

# Medians are computed on the training data only
median_age = train["Age"].median()
median_fare = train["Fare"].median()

for dataset in train_test_datasets:
    # note: newer pandas versions may warn here; dataset["Age"] = dataset["Age"].fillna(median_age) avoids that
    dataset["Age"].fillna(median_age, inplace=True)
    dataset["Fare"].fillna(median_fare, inplace=True)
That takes care of "Age" and "Fare", but we cannot apply the same strategy to "Cabin" and "Embarked" because they are nominal columns. Note that their dtype is object instead of a numeric type such as int64. Let us take a look at "Cabin".
train["Cabin"] """ PassengerId 1 NaN 2 C85 3 NaN 4 C123 5 NaN 887 NaN 888 B42 889 NaN 890 C148 891 NaN Name: Cabin, Length: 891, dtype: object """ type(train["Cabin"]) # float type(train["Cabin"]) # str train["Cabin"].isna().sum() # 687
Some rows contain strings, others contain floats (the NaNs). The number of NaNs is large at 687. I was tempted to simply delete this column because of that, but I decided to keep it because it might be important. I am no ship expert, but the location of the cabin might determine how accessible the lifeboats are and thereby influence survival. There are several ways to proceed from here; I will go with a compromise between simplicity and retaining information: we extract the first letter of the cabin, creating a new nominal feature called "Cabin Letter".
for dataset in train_test_datasets: dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1) train["Cabin Letter"].unique() # array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)
There are eight different cabin letters and the NaN problem persists. Because we are dealing with a nominal feature, we can simply create a ninth category: the passengers whose cabin we do not know. We will do so as we convert the letters to numbers; for our network we will need to convert everything to floats anyway.
for dataset in train_test_datasets: dataset["Cabin Letter"] = pd.Categorical(dataset["Cabin Letter"]).codes train["Cabin Letter"] """ PassengerId 1 -1 2 2 3 -1 4 2 5 -1 .. 887 -1 888 1 889 -1 890 2 891 -1 Name: Cabin Letter, Length: 891, dtype: int8 """ train['Cabin Letter'].unique() # array([-1, 2, 4, 6, 3, 0, 1, 5, 7], dtype=int8)
Now the NaNs are -1, which is fine for now. Later we will convert all of these nominal variables to dummy variables anyway, but we will get to that. We could extract more information from "Cabin", but for this demonstration I will leave it here and delete the original column.
for dataset in train_test_datasets: dataset.drop("Cabin", axis=1, inplace=True)
The final column that suffers from NaNs is "Embarked". Its data is less messy, so we can go straight to getting the categorical codes.
for dataset in train_test_datasets: dataset["Embarked"] = pd.Categorical(dataset["Embarked"]) dataset["Embarked"] = dataset["Embarked"].cat.codes
Now we have successfully removed all NaN values, but some minor issues with our columns remain. "Name" and "Ticket" do not have a numerical type; they are of type object. They probably don't tell us a lot about survival either, unless we were to do some serious feature engineering on them, so we will simply drop both.
for dataset in train_test_datasets: dataset.drop(["Name", "Ticket"], axis=1, inplace=True)
That just leaves us with "Sex". We can use categorical codes, as above, to convert it to numbers.
for dataset in train_test_datasets: dataset["Sex"] = pd.Categorical(dataset["Sex"]).codes
Now let us take one more look at our datasets and make sure we took care of NaNs and everything is numeric.
train.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Survived      891 non-null    int64
 1   Pclass        891 non-null    int64
 2   Sex           891 non-null    int8
 3   Age           891 non-null    float64
 4   SibSp         891 non-null    int64
 5   Parch         891 non-null    int64
 6   Fare          891 non-null    float64
 7   Embarked      891 non-null    int8
 8   Cabin Letter  891 non-null    int8
dtypes: float64(2), int64(4), int8(3)
memory usage: 51.3 KB
"""

test.info()
"""
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Pclass        418 non-null    int64
 1   Sex           418 non-null    int8
 2   Age           418 non-null    float64
 3   SibSp         418 non-null    int64
 4   Parch         418 non-null    int64
 5   Fare          418 non-null    float64
 6   Embarked      418 non-null    int8
 7   Cabin Letter  418 non-null    int8
dtypes: float64(2), int64(3), int8(3)
memory usage: 20.8 KB
"""
Everything looks in order. However, there is one more necessary preprocessing step before we can move on to standardization: we need to convert the categorical features to dummy variables. That means each category in a column gets its own new column, and a sample of that category gets a 1 there while all other samples get a 0. Let's see how that looks.
Convert Categories to Dummy Variables
We will use the .get_dummies function on all our categorical columns, but first let us try it on the "Pclass" column.
train["Pclass"] """ PassengerId 1 3 2 1 3 3 4 1 5 3 .. 887 2 888 1 889 3 890 1 891 3 Name: Pclass, Length: 891, dtype: int64 """ pd.get_dummies(train["Pclass"]) """ 1 2 3 PassengerId 1 0 0 1 2 1 0 0 3 0 0 1 4 1 0 0 5 0 0 1 .. .. .. 887 0 1 0 888 1 0 0 889 0 0 1 890 1 0 0 891 0 0 1 [891 rows x 3 columns] """
Dummy variables make sense if we just stare at them long enough. What we are looking at is the conversion of each category into its own column. train["Pclass"] has three categories, [1, 2, 3], therefore we get three dummy columns. A row that has a 1 in "Pclass" gets a 1 in the first column and 0 in the other two; a row with a 2 gets a 1 in the second column and 0 in the others; a row with a 3 gets a 1 in the third column and 0 in the others. Now we will use the same function on all our categorical columns. Note that we will not use our usual loop through train and test data, because there is no way to generate the dummy variables in place.
categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter","SibSp"] train_dummies = pd.get_dummies(train, columns=categorical_cols, prefix=categorical_cols) test_dummies = pd.get_dummies(test, columns=categorical_cols, prefix=categorical_cols) train_dummies.shape # (891, 29) test_dummies.shape # (418, 26)
The code looks solid, but something went wrong: we now have more columns in train_dummies than in test_dummies. What happened? Let us look at the columns and see if we can spot the issue.
train_dummies.columns
"""
Index(['Survived', 'Age', 'Parch', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3',
       'Sex_0', 'Sex_1', 'Embarked_-1', 'Embarked_0', 'Embarked_1',
       'Embarked_2', 'Cabin Letter_-1', 'Cabin Letter_0', 'Cabin Letter_1',
       'Cabin Letter_2', 'Cabin Letter_3', 'Cabin Letter_4', 'Cabin Letter_5',
       'Cabin Letter_6', 'Cabin Letter_7', 'SibSp_0', 'SibSp_1', 'SibSp_2',
       'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8'],
      dtype='object')
"""

test_dummies.columns
"""
Index(['Age', 'Parch', 'Fare', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_0',
       'Sex_1', 'Embarked_0', 'Embarked_1', 'Embarked_2', 'Cabin Letter_-1',
       'Cabin Letter_0', 'Cabin Letter_1', 'Cabin Letter_2', 'Cabin Letter_3',
       'Cabin Letter_4', 'Cabin Letter_5', 'Cabin Letter_6', 'SibSp_0',
       'SibSp_1', 'SibSp_2', 'SibSp_3', 'SibSp_4', 'SibSp_5', 'SibSp_8'],
      dtype='object')
"""
test_dummies is missing the following columns: "Survived", "Embarked_-1" and "Cabin Letter_7". Those are the three missing columns. Remember that "Survived" is purposely not in the test dataset. The other two are missing because in the test data there were no rows with a missing value (coded -1) in "Embarked" or a 7 in "Cabin Letter". This raises another issue: when we converted string categories to numbers with cat.codes, we may have assigned the codes differently in the training and test dataset. Luckily, we can generate dummy variables directly from the strings. This is a great opportunity to take a step back, clean up our code and start from scratch.
import pandas as pd

# Load data
train = pd.read_csv("train.csv", index_col='PassengerId')
test = pd.read_csv("test.csv", index_col='PassengerId')

# Keep train and test together for wrangling and preprocessing
train_test_datasets = [train, test]

"""Data wrangling"""
# Fill missing ages and fares with the training medians,
# keep only the first letter of the cabin and drop unused columns
median_age = train["Age"].median()
median_fare = train["Fare"].median()

for dataset in train_test_datasets:
    dataset["Age"].fillna(median_age, inplace=True)
    dataset["Fare"].fillna(median_fare, inplace=True)
    dataset["Cabin Letter"] = dataset["Cabin"].str.slice(0, 1)
    dataset.drop("Cabin", axis=1, inplace=True)
    dataset.drop(["Name", "Ticket"], axis=1, inplace=True)

# Convert the categorical columns directly to dummy variables,
# without converting strings to numeric codes first
categorical_cols = ["Pclass", "Sex", "Embarked", "Cabin Letter", "SibSp"]

train_dummies = pd.get_dummies(train, columns=categorical_cols,
                               prefix=categorical_cols, dummy_na=True)
test_dummies = pd.get_dummies(test, columns=categorical_cols,
                              prefix=categorical_cols, dummy_na=True)

train_dummies.shape  # (891, 32)
test_dummies.shape  # (418, 30)
Now we have more columns because we added the dummy_na=True parameter. This gives us, for example, a "Sex_nan" column although there were no NaNs in that column to begin with. This is excessive, but it won't be an issue for our network. Now we just need to add the one column that is missing from test_dummies. It should contain only zeros, because there were no rows with that category in the test dataset.
train_dummies.columns
"""
Index(['Survived', 'Age', 'Parch', 'Fare', 'Pclass_1.0', 'Pclass_2.0',
       'Pclass_3.0', 'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Embarked_nan',
       'Cabin Letter_A', 'Cabin Letter_B', 'Cabin Letter_C', 'Cabin Letter_D',
       'Cabin Letter_E', 'Cabin Letter_F', 'Cabin Letter_G', 'Cabin Letter_T',
       'Cabin Letter_nan', 'SibSp_0.0', 'SibSp_1.0', 'SibSp_2.0', 'SibSp_3.0',
       'SibSp_4.0', 'SibSp_5.0', 'SibSp_8.0', 'SibSp_nan'],
      dtype='object')
"""

test_dummies.columns
"""
Index(['Age', 'Parch', 'Fare', 'Pclass_1.0', 'Pclass_2.0', 'Pclass_3.0',
       'Pclass_nan', 'Sex_female', 'Sex_male', 'Sex_nan', 'Embarked_C',
       'Embarked_Q', 'Embarked_S', 'Embarked_nan', 'Cabin Letter_A',
       'Cabin Letter_B', 'Cabin Letter_C', 'Cabin Letter_D', 'Cabin Letter_E',
       'Cabin Letter_F', 'Cabin Letter_G', 'Cabin Letter_nan', 'SibSp_0.0',
       'SibSp_1.0', 'SibSp_2.0', 'SibSp_3.0', 'SibSp_4.0', 'SibSp_5.0',
       'SibSp_8.0', 'SibSp_nan'],
      dtype='object')
"""
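Instead of comparing the two listings by eye, we can also let Python find the difference for us (a quick sketch; "Survived" shows up as well because it only exists in the training data).

set(train_dummies.columns) - set(test_dummies.columns)
# {'Survived', 'Cabin Letter_T'}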
"Cabin Letter_T" is the missing one.
import numpy as np

# The missing column is all zeros, because no test passenger falls into that category
test_dummies["Cabin Letter_T"] = np.zeros(test_dummies.shape[0])
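One more detail worth guarding against: after appending the column, the test columns are no longer in the same order as the training columns. Once we fit a scaler on the training features and apply it to the test features, as we do below, the columns have to line up. A small sketch, assuming we simply reindex the test columns against the training columns minus the label:

# Bring the test columns into the same order as the training feature columns,
# so that later transformations (like the scaler below) line up column by column.
test_dummies = test_dummies[train_dummies.columns.drop("Survived")]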
Now our dummy variables are in order and we can move on to balancing the training data.
Balancing the Training Data
Balancing the training data will be important during learning. We want to train our model to predict survival. Imagine the extreme case where we train a model only on passengers that survived: this model would be terrible, because it has no idea what a non-surviving passenger looks like on paper. Now don't worry, our training data has rows for both surviving and non-surviving passengers. However, training data where one of the outcomes is more frequent than the other can still bias a model. Let us find out how balanced this training data is.
total_samples = train_dummies.shape[0]  # Number of rows in the DataFrame
number_surviving = (train_dummies['Survived'] == 1).sum()  # Number of survivors
perc_survivors = (number_surviving / total_samples) * 100  # 38.38383838383838
In this training data 38% of passengers survived. There are no hard rules on how balanced data should be, but I am not happy with that number. If it were 45% I think it would be fine, but we should do something about 38%. Because there are fewer survivors than non-survivors, we will randomly select as many non-survivors as there are survivors.
bool_survivors = train_dummies['Survived'] == 1
bool_nonsurvivors = train_dummies['Survived'] == 0

all_survivors = train_dummies[bool_survivors]
all_nonsurvivors = train_dummies[bool_nonsurvivors]

# Randomly pick as many non-survivors as there are survivors
random_nonsurvivors = all_nonsurvivors.sample(number_surviving)

# Concatenate and shuffle the rows
train_balanced = pd.concat((all_survivors, random_nonsurvivors))
train_balanced = train_balanced.sample(frac=1)

(train_balanced["Survived"] == 0).sum()  # 342
(train_balanced["Survived"] == 1).sum()  # 342
Now we have 342 survivors and 342 non-survivors. Perfectly balanced, as all things should be. When concatenating like this, we must also remember to shuffle the rows with .sample(frac=1), otherwise we might run into problems with unbalanced batches later. Now that our data is balanced, we can move on to standardization.
Both datasets are now free of missing values, the categories are converted to dummy variables and our labels are balanced. Now we can move on to standardization, the process of bringing all features to the same scale. We achieve this by subtracting the mean and dividing by the standard deviation of each column. This has some advantages for neural networks: it can speed up learning and result in better fits. There are also disadvantages, as some features become harder to interpret because they lose their physical units. But because we rarely try to interpret neural networks, we almost always standardize our data. This is how it works with scikit-learn.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaler on the training features (everything except "Survived")
scaler.fit(train_dummies.iloc[:, 1:])

# Apply the same scaling to the balanced training data and the test data
train_scaled = scaler.transform(train_balanced.iloc[:, 1:])
test_scaled = scaler.transform(test_dummies)

# Back to DataFrames, keeping indices and column names
train_scaled = pd.DataFrame(train_scaled, index=train_balanced.index,
                            columns=train_balanced.iloc[:, 1:].columns)
test_scaled = pd.DataFrame(test_scaled, index=test_dummies.index,
                           columns=test_dummies.columns)

# The labels
y = train_balanced["Survived"]

train_balanced["Age"].mean(), train_balanced["Age"].std()
# (29.23489766081871, 13.239715945608669)
train_scaled["Age"].mean(), train_scaled["Age"].std()
# (-0.009735709395380402, 1.0174700960384269)
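For intuition, the scaler is doing nothing more than this per-column computation. A hand-rolled sketch for the "Age" column (note that StandardScaler uses the population standard deviation, hence ddof=0):

# Standardize "Age" by hand, using the statistics of the full training data,
# just like the scaler that was fit on train_dummies above.
age_mean = train_dummies["Age"].mean()
age_std = train_dummies["Age"].std(ddof=0)
age_scaled_by_hand = (train_balanced["Age"] - age_mean) / age_std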
Originally the mean age is 29.23 and the standard deviation is 13.24. After standardization, the mean is near zero because we subtracted it, and the standard deviation is near 1 because we divided by it. So our standardization worked, and all features are now on the scale of a standard normal distribution. We are almost ready to create our actual network, but before that we will split our training data into a training and a validation set. This will help us spot overfitting.
Validation Data Split
Using train_test_split we get 615 training and 69 validation samples. The test_size=0.1 parameter specifies that 10% of samples should be set aside; in our case this set is used for validation, because we already have a designated test dataset. We are now ready to build our network and start training.
from sklearn.model_selection import train_test_split

X = train_scaled

X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size=0.1)

X_train.shape  # (615, 31)
X_validate.shape  # (69, 31)
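Since we went to the trouble of balancing the labels, it is worth knowing that train_test_split can preserve that balance in both splits through its stratify parameter. An optional refinement, not used in the rest of this post:

# Keep the survivor ratio identical in the training and validation split
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.1, stratify=y)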
Building the Network with PyTorch
We start by converting our arrays to tensors. This is the data structure PyTorch expects as input to the network later.
import torch

train_features = torch.tensor(X_train.to_numpy())
train_labels = torch.tensor(y_train.to_numpy())
validation_features = torch.tensor(X_validate.to_numpy())
validation_labels = torch.tensor(y_validate.to_numpy())
Now we build the neural network by calling torch.nn.Sequential.
n_features = train_features.shape[1]  # 31

model = torch.nn.Sequential(torch.nn.Linear(n_features, 50),
                            torch.nn.ReLU(),
                            torch.nn.Linear(50, 1),
                            torch.nn.Sigmoid())
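As an aside, the same architecture can also be written as an explicit torch.nn.Module subclass. This is only an equivalent sketch (the class name is my own); the rest of the post keeps the Sequential version above.

class TitanicNet(torch.nn.Module):
    # The same one-hidden-layer network as the Sequential model above
    def __init__(self, n_features, n_hidden=50):
        super().__init__()
        self.hidden = torch.nn.Linear(n_features, n_hidden)
        self.out = torch.nn.Linear(n_hidden, 1)

    def forward(self, x):
        x = torch.relu(self.hidden(x))
        return torch.sigmoid(self.out(x))

# model = TitanicNet(n_features)  # would behave like the Sequential model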
The network takes our 31 input features and transforms them into 50 hidden units with a fully connected linear layer, which also learns a bias term by default. We then apply the rectified linear unit (ReLU) to introduce some non-linearity, and convert the 50 hidden units to a single output unit, to which we apply the sigmoid function. In other words, our deep neural network has one hidden layer with 50 units. The sigmoid function makes sure that our output lies between 0 and 1, which is important because we are making a binary classification into surviving (1) and non-surviving (0) passengers. Now we define our loss function and our learning method.
criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.001)
Our loss function is binary cross-entropy loss, which works well for the binary classification problem we are facing. Our learning algorithm is Adam, a variant of gradient descent; the weight_decay parameter adds L2 regularization on top of it. Before we start training, we will split our training data into 41 mini batches of 15 samples each (615 = 41 × 15).
n_batches = 41

train_features_batched = train_features.reshape(n_batches,
                                                int(train_features.shape[0] / n_batches),
                                                train_features.shape[1])
train_labels_batched = train_labels.reshape(n_batches,
                                            int(train_labels.shape[0] / n_batches))
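As an aside, PyTorch's DataLoader can take care of batching and of reshuffling the batches every epoch. A brief sketch with TensorDataset; the batch size of 15 matches what the reshape above produces, but the rest of this post sticks with the reshaped tensors:

from torch.utils.data import TensorDataset, DataLoader

# Wrap features and labels, then let the DataLoader serve shuffled mini batches
train_dataset = TensorDataset(train_features.float(), train_labels.float())
train_loader = DataLoader(train_dataset, batch_size=15, shuffle=True)

# for batch_features, batch_labels in train_loader:
#     ...  # one training step per mini batch, as in the loop below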
Training our network on the mini batches instead of the whole data makes learning quicker and can result in models that generalize better. Now we are ready to train the network.
n_epochs = 2000
loss_list = []
validate_loss_list = []

for epoch in range(n_epochs):
    # Train on each mini batch
    for batch_idx in range(n_batches):
        optimizer.zero_grad()
        outputs = model(train_features_batched[batch_idx].float())
        loss = criterion(outputs.flatten().float(),
                         train_labels_batched[batch_idx].float())
        loss.backward()
        optimizer.step()

    # Record the loss on the full training and validation data once per epoch
    outputs = model(train_features.float())
    validation_outputs = model(validation_features.float())
    loss = criterion(outputs.flatten().float(), train_labels.float())
    validate_loss = criterion(validation_outputs.flatten().float(),
                              validation_labels.float())
    loss_list.append(loss.item())
    validate_loss_list.append(validate_loss.item())

print('Finished Training')

import matplotlib.pyplot as plt

plt.plot(loss_list, linewidth=3)
plt.plot(validate_loss_list, linewidth=3)
plt.legend(("Training Loss", "Validation Loss"))
plt.xlabel("Epoch")
plt.ylabel("BCE Loss")
This code starts by creating empty lists to record the loss on both the training and validation data. n_epochs tells us how many times we train the network on the entire data. The actual learning happens in the inner loop over batch_idx: we train the network on each mini batch. The outputs variable is whatever the network currently spits out for a given batch; comparing that output to the known labels gives us the loss. loss.backward() calculates the gradient of the loss with respect to the model parameters, and the actual learning happens when we call optimizer.step(), which adjusts the parameters according to those gradients. This is where PyTorch is handy: it takes very good care of the gradients for us. The loss we plot is calculated in the outer loop, once per epoch, on the entire data instead of the mini batches. The rest is just plotting. So how did it go? Our network learns very quickly. After around 250 epochs the validation loss already stops improving; everything that comes afterwards is probably overfitting the training data, meaning the training loss keeps improving while the validation loss stagnates or becomes worse. The last piece of code we have to write calculates our prediction for the test data and saves it in a format that we can upload to Kaggle to find out how well we did.
test_features = torch.tensor(test_scaled.to_numpy())

test_prediction = model(test_features.float()).detach().numpy().flatten()
test_prediction_binary = (test_prediction > 0.5).astype(int)

test_prediction_df = pd.DataFrame(test_prediction_binary,
                                  index=test.index,
                                  columns=["Survived"])
test_prediction_df.to_csv("prediction_submission.csv")
I will upload three different predictions to Kaggle: one from an untrained network (setting n_epochs to 0), which amounts to guessing because the network is randomly initialized; one from a model that stops after 200 epochs, which should be well trained; and one from a model trained for 2000 epochs, which should be slightly overfit. Let's see how we did.
We got a 0.39952 accuracy score for the untrained model, meaning that 39.952% of our predictions were correct. Pretty bad, but expected from a guessing model. Our model trained for 200 epochs got a score of 0.70095. Much better, and a good sign that our model actually learned something. Our model trained for 2000 epochs got a score of 0.74162. That is not exactly what we would expect, since we suspected it was overfitting. It probably means that our validation data was not representative of the test data. I have also read from other people that overfitting improves the test score slightly in the Titanic example. So how could we improve our model?
Where to go from here
I decided to use a neural network because I wanted to write about one, but it is not the best model for this task. A random forest classifier seems to do better here; the highest score without cheating is around 0.83, and it is achieved with that kind of model. But what could we do to improve the code above? One great improvement would be another validation method. Our current method sets aside a lot of data that the model is never trained on, and we don't have massive amounts of data to begin with. A cross-validation technique could help us use that data for training as well. A second improvement would be a more formal way to determine the epoch at which to stop training; we just looked at the loss curve and picked nice-looking spots. Early stopping is a technique that could give us a more objective stopping point, and a small sketch of it follows below. Finally, we could learn more about the Titanic data and engineer more features. For example, we discarded the cabin number and kept only the letter, but the number could be important. A good way to learn more about the Titanic data is to browse the Kaggle discussion board.
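To make the early stopping idea concrete, here is a minimal sketch of a patience-based check that could be called at the end of each epoch, right after appending to validate_loss_list. It is only an illustration: the helper name and the patience value of 50 epochs are my own choices, not something from the code above.

def should_stop(validation_losses, patience=50):
    # Stop if the best validation loss of the last `patience` epochs
    # is no better than the best loss seen before that window.
    if len(validation_losses) <= patience:
        return False
    best_before = min(validation_losses[:-patience])
    best_recent = min(validation_losses[-patience:])
    return best_recent >= best_before

# Inside the epoch loop, after validate_loss_list.append(...):
# if should_stop(validate_loss_list):
#     print(f"Stopping early after epoch {epoch}")
#     break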