Yet Another Kaggle Titanic Competition Tutorial

Posted Nov 23, 2020 Updated Jul 22, 2024

By Libo Su

15 min read

This post is a tutorial on solving the Kaggle Titanic Competition using Deep Neural Network with the TensorFlow API Keras. Kaggle is a competition site which provides problems to solve or questions to ask while providing the datasets for training your data science model and testing the model results against a test dataset. The Titanic competition is probably the first competition you will come across on Kaggle. The goal of the competition is to predict the possibility of survival of the passengers onboard, a typical classification problem.

This tutorial follows a typical workflow of machine learning project.

Define the problem
Acquire the data
Prepare the data
Define the model
Train and fine-tune the model
Test and deploy the model

The accompanying notebook of this tutorial is available here.

Define the problem

The question or problem definition for Titanic Survival competition is described here at Kaggle.

Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.

Kaggle already provides some early understanding about the domain of this problem, see the Kaggle competition description page here. Here are the highlights to note.

On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival rate.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Before going forward, we need to import some necessary modules.

  
# data analysis
import numpy as np 
import pandas as pd 

# visualization 
import matplotlib.pyplot as plt 
import seaborn as sns 

# machine learning 
import sklearn
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.python.framework import ops

# utils
import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

Acquire the data

The Python Pandas packages are used to acquire the data. The data consists of two groups: a training set (train.csv) and test set (test.csv).

The training set should be used to build the machine learning models. For the training set, the outcome (also known as the “ground truth”) for each passenger is provided.

The test set should be used to see how well your model performs on unseen data. For the test set, the ground truth for each passenger is not provided. The machine learning models will be used to predict the surival of each passenger.

  
train_data = pd.read_csv('./Dataset/train.csv')
test_data = pd.read_csv('./Dataset/test.csv')

  
train_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

  
test_data.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

Prepare the data

This step aims to prepare the data for use in machine learning training. It involves cleaning up the data, understanding its structure and attributes, studying possible correlations between attributes, and visualizing the data to extract human-level insight on its properties, distribution, range, and overall usefulness [3].

Data Overview

  
train_data.info()
print("-" * 40)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Observations

The data contains 11 features. The details of these features are described on the Kaggle data pape here.
- Categorical: Sex and Embarked,
- Orinial: Pclass,
- Numerical: Age and Fare (continous), SibSp and Parch (discret),
- Mixed: Cabin (alphanumeric).
There are missing values in Age, Cabin, Embarked and Fare.

Primary data analysis and visualization

  
# Overall Surival Rate
train_data['Survived'].value_counts().plot.pie(autopct = '%1.2f%%');

Sex

  
# Sex  
train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar();

Observation: Female passengers had higher survival chance.

Class

  
# Class  
train_data[['Pclass', 'Survived']].groupby('Pclass').mean().plot.bar();

Observation: Surival rate Class 1 > Class 2 > Class 3

Family size

  
# Family Size 
train_data['FamilySize'] = train_data['Parch'] + train_data['SibSp'] + 1 
test_data['FamilySize'] = test_data['Parch'] + test_data['SibSp'] + 1 

sns.barplot(data=train_data, x='FamilySize', y='Survived');

Observation: Passengers with small family size (2-4) were more likely to survive.

Therefore, the passengers are grouped into 3 groups according the size of family: singleton, SmallFamily (family size 2-4) and LargeFamily (family size larger than 4).

  
# Feature engineering
train_data['Singleton'] = train_data['FamilySize'].map(lambda s: 1 if s == 1 else 0)
train_data['SmallFamily'] = train_data['FamilySize'].map(lambda s: 1 if 2<= s <= 4 else 0)
train_data['LargeFamily'] = train_data['FamilySize'].map(lambda s: 1 if s>=5 else 0)

test_data['Singleton'] = test_data['FamilySize'].map(lambda s: 1 if s == 1 else 0)
test_data['SmallFamily'] = test_data['FamilySize'].map(lambda s: 1 if 2<= s <= 4 else 0)
test_data['LargeFamily'] = test_data['FamilySize'].map(lambda s: 1 if s>=5 else 0)

Age

  
# Age & Survival Rate 
age_data = train_data[train_data['Age'].notna()][['Age', 'Survived']]
sns.histplot(age_data['Age']);

  
ageFacet=sns.FacetGrid(train_data,hue='Survived',aspect=3);
ageFacet.map(sns.kdeplot,'Age',shade=True);
ageFacet.set(xlim=(0,train_data['Age'].max()));
ageFacet.add_legend();

Observation: Passengers aged under 10 were more likely to survive.

Fare

  
# Fare & Survival Rate
fareFacet=sns.FacetGrid(train_data,hue='Survived',aspect=3);
fareFacet.map(sns.kdeplot,'Fare',shade=True);
fareFacet.set(xlim=(0,150));
fareFacet.add_legend();

Observation: Low survival rate for fare < 20.

Embarked port

  
# Embarked & Survival Rate
sns.barplot(data=train_data,x='Embarked',y='Survived');

Observations: Passengers embarked from port C were more likely to survive.

Name (Title)

The titles of the passengers can be extracted from their names. Passengers from different social classes may have different survival chance.

  
# Extract titles
train_data['Title'] = train_data.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
test_data['Title'] = test_data.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
train_data['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Ms            1
Don           1
Lady          1
Mme           1
Sir           1
Countess      1
Capt          1
Jonkheer      1
Name: Title, dtype: int64

  
# Group the passengers into 6 groups by their titles.
Title_Dictionary = {
                        "Capt":       "Officer",
                        "Col":        "Officer",
                        "Major":      "Officer",
                        "Jonkheer":   "Royalty",
                        "Don":        "Royalty",
                        "Sir" :       "Royalty",
                        "Dr":         "Officer",
                        "Rev":        "Officer",
                        "Countess":   "Royalty",
                        "Dona":       "Royalty",
                        "Mme":        "Mrs",
                        "Mlle":       "Miss",
                        "Ms":         "Mrs",
                        "Mr" :        "Mr",
                        "Mrs" :       "Mrs",
                        "Miss" :      "Miss",
                        "Master" :    "Master",
                        "Lady" :      "Royalty"
                        }
train_data['Title']=train_data['Title'].map(Title_Dictionary)
test_data['Title']=test_data['Title'].map(Title_Dictionary)
train_data['Title'].value_counts()

Mr         517
Miss       184
Mrs        127
Master      40
Officer     18
Royalty      5
Name: Title, dtype: int64

  
sns.barplot(data=train_data, x='Title', y='Survived');

Observation: Passengers with “Mr” and “Officer” titles had low survival chance.

Cabin

The Cabin information can tell which decks the passengers were resided, which may influence the survival rate. The Cabin variables have many missing values. THe missing values will be replaced by “Unknown”.

  
train_data['Cabin'] = train_data['Cabin'].fillna('Unknown')
test_data['Cabin'] = test_data['Cabin'].fillna('Unknown')

  
# Cabin 
train_data['Deck']=train_data['Cabin'].map(lambda x:x[0])
test_data['Deck']=test_data['Cabin'].map(lambda x:x[0])
sns.barplot(data=train_data, x='Deck', y='Survived');

Observation: Passengers in Deck B, D, E were more likely to survive.

Data pre-processing

Combine all data

  
full_data = pd.concat([train_data, test_data], axis=0)

Fill the missing values

Embarked

  
# Embarked
full_data[full_data['Embarked'].isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	FamilySize	Singleton	SmallFamily	LargeFamily	Title	Deck
61	62	1.0	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN	1	1	0	0	Miss	B
829	830	1.0	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN	1	1	0	0	Mrs	B

  
full_data['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

Fill the missing value of the Embarked feature with the most frequent value.

  
full_data['Embarked'] = full_data['Embarked'].fillna('S')

Fare

  
# Fare 
full_data[full_data['Fare'].isnull()]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	FamilySize	Singleton	SmallFamily	LargeFamily	Title	Deck
152	1044	NaN	3	Storey, Mr. Thomas	male	60.5	0	0	3701	NaN	Unknown	S	1	1	0	0	Mr	U

Fill the missing value of the Fare feature with the mean value.

  
full_data['Fare']=full_data['Fare'].fillna(full_data[(full_data['Pclass']==3)&(full_data['Embarked']=='S')&(full_data['Cabin']=='Unknown')]['Fare'].mean())

Age

The Random Forest algorithm is used to fill the missing of the Age feature.

  
AgePre = full_data[['Age','Pclass','Title','FamilySize','Sex']]
AgePre = pd.get_dummies(AgePre)
AgePre.head()

	Age	Pclass	FamilySize	Title_Miss	Title_Mr	Title_Mrs	Sex_female	Sex_male
0	22.0	3	2	0	1	0	0	1
1	38.0	1	2	0	0	1	1	0
2	26.0	3	1	1	0	0	1	0
3	35.0	1	2	0	0	1	1	0
4	35.0	3	1	0	1	0	0	1

  
# Split the dateset
AgeKnown=AgePre[AgePre['Age'].notnull()]
AgeUnKnown=AgePre[AgePre['Age'].isnull()]

AgeKnown_X=AgeKnown.drop(['Age'],axis=1)
AgeKnown_y=AgeKnown['Age']

AgeUnKnown_X=AgeUnKnown.drop(['Age'],axis=1)

# Setup the random forest model
from sklearn.ensemble import RandomForestRegressor
rfr=RandomForestRegressor(random_state=None,n_estimators=500,n_jobs=-1)
rfr.fit(AgeKnown_X,AgeKnown_y)

# Model evaluation
rfr.score(AgeKnown_X,AgeKnown_y)

0.4911278076281459

  
# Predict age
AgeUnKnown_y=rfr.predict(AgeUnKnown_X)
# Fill the mssing value
full_data.loc[full_data['Age'].isnull(),['Age']]=AgeUnKnown_y
# Create featuer IsChild (less than 10 years old)
full_data['IsChild'] = full_data['Age'].map(lambda s: 1 if s <= 10 else 0)
full_data.info() # no missing values now (except Survived)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 19 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 PassengerId  1309 non-null   int64  
 Survived     891 non-null    float64
 Pclass       1309 non-null   int64  
 Name         1309 non-null   object 
 Sex          1309 non-null   object 
 Age          1309 non-null   float64
 SibSp        1309 non-null   int64  
 Parch        1309 non-null   int64  
 Ticket       1309 non-null   object 
 Fare         1309 non-null   float64
Cabin        1309 non-null   object 
Embarked     1309 non-null   object 
FamilySize   1309 non-null   int64  
Singleton    1309 non-null   int64  
SmallFamily  1309 non-null   int64  
LargeFamily  1309 non-null   int64  
Title        1309 non-null   object 
Deck         1309 non-null   object 
IsChild      1309 non-null   int64  
dtypes: float64(3), int64(9), object(7)
memory usage: 204.5+ KB

  
full_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	FamilySize	Singleton	SmallFamily	Title	Deck
0	1	0.0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	Unknown	S	2	0	1	Mr	U
1	2	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C	2	0	1	Mrs	C
2	3	1.0	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	Unknown	S	1	1	0	Miss	U
3	4	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S	2	0	1	Mrs	C
4	5	0.0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	Unknown	S	1	1	0	Mr	U

Select the features

  
# Remove the less revelant features
fullSel=full_data.drop(['Cabin','Name','Ticket','PassengerId'],axis=1)
# Check the correlations of the features
corrDf=pd.DataFrame()
corrDf=fullSel.corr()
corrDf['Survived'].sort_values(ascending=True)

Pclass        -0.338481
Singleton     -0.203367
LargeFamily   -0.125147
Age           -0.065783
SibSp         -0.035322
FamilySize     0.016639
Parch          0.081629
IsChild        0.125678
Fare           0.257307
SmallFamily    0.279855
Survived       1.000000
Name: Survived, dtype: float64

  
# Heatmap
plt.figure(figsize=(8,8));
sns.heatmap(fullSel[['Survived','Age','Embarked','Fare','Parch','Pclass',
                    'Sex','SibSp','Title','Singleton','SmallFamily','LargeFamily','FamilySize','IsChild','Deck']].corr(),cmap='BrBG',annot=True,
           linewidths=.5);
plt.xticks(rotation=45);

  
fullSel=fullSel.drop(['SibSp','Parch','FamilySize','Age'],axis=1)

One-hot encoding

  
# One-hot encoding
fullSel=pd.get_dummies(fullSel)
PclassDf=pd.get_dummies(full_data['Pclass'],prefix='Pclass')

  
PclassDf.head()

	Pclass_1	Pclass_3
0	0	1
1	1	0
2	0	1
3	1	0
4	0	1

  
fullSel = pd.concat([fullSel, PclassDf], axis=1)

  
fullSel.drop('Pclass', axis=1, inplace=True)

  
fullSel.head()

	Survived	Fare	Singleton	SmallFamily	Sex_female	Sex_male	Embarked_C	...	Deck_C	Deck_U	Pclass_1	Pclass_3
0	0.0	7.2500	0	1	0	1	0	...	0	1	0	1
1	1.0	71.2833	0	1	1	0	1	...	1	0	1	0
2	1.0	7.9250	1	0	1	0	0	...	0	1	0	1
3	1.0	53.1000	0	1	1	0	0	...	1	0	1	0
4	0.0	8.0500	1	0	0	1	0	...	0	1	0	1

5 rows × 29 columns

Normalization

  
fullSel[fullSel.columns] = fullSel[fullSel.columns].apply(lambda x: x/x.max(), axis=0)

  
fullSel.head()

	Survived	Fare	Singleton	SmallFamily	Sex_female	Sex_male	Embarked_C	...	Deck_C	Deck_U	Pclass_1	Pclass_3
0	0.0	0.014151	0.0	1.0	0.0	1.0	0.0	...	0.0	1.0	0.0	1.0
1	1.0	0.139136	0.0	1.0	1.0	0.0	1.0	...	1.0	0.0	1.0	0.0
2	1.0	0.015469	1.0	0.0	1.0	0.0	0.0	...	0.0	1.0	0.0	1.0
3	1.0	0.103644	0.0	1.0	1.0	0.0	0.0	...	1.0	0.0	1.0	0.0
4	0.0	0.015713	1.0	0.0	0.0	1.0	0.0	...	0.0	1.0	0.0	1.0

5 rows × 29 columns

  
fullSel.to_csv('combined_data.csv', index=False)

Split the dataset

  
train = fullSel[fullSel['Survived'].notnull()]
test = fullSel[fullSel['Survived'].isnull()]
targets = train['Survived']
train.drop('Survived', axis=1, inplace=True)
test.drop('Survived', axis=1, inplace=True)

X_train,X_val,Y_train,Y_val = train_test_split(train,targets,test_size = 0.2,random_state = 42)

Define the model

A neural network model with 3 layers are chosen. The model is implemented using Keras.

  
x = X_train
y = Y_train
L1=20
L2=20
L3=5
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(L1,input_shape=(X_train.shape[1],),kernel_regularizer='l2', activation='relu'))
model.add(tf.keras.layers.Dense(L2,kernel_regularizer='l2', activation='relu'))
model.add(tf.keras.layers.Dense(L3,kernel_regularizer='l2', activation='relu'))
model.add(tf.keras.layers.Dense(1,kernel_regularizer='l2', activation='sigmoid'))
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

Train and fine-tune model

  
epoch = 50
history = model.fit(x,y,epochs=epoch,batch_size=32,verbose=0)
preds = model.evaluate(x=X_val, y=Y_val)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
print ("Training Accuracy = " + str(history.history['accuracy'][-1]))

Output:

Loss = 0.5216637253761292
Test Accuracy = 0.826815664768219
Training Accuracy = 0.8300561904907227

Observation: Test accuracy and training accuracy are close, indicating that the model is not over-fitting.

Retrain the model with all training data.

  
history = model.fit(train,targets,epochs=epoch,batch_size=32,verbose=0)
preds = model.evaluate(x=X_val, y=Y_val)
print ("Loss = " + str(preds[0]))
print ("Accuracy = " + str(history.history['accuracy'][-1]))

Output:

Loss = 0.49221670627593994
Accuracy = 0.8316498398780823

Test and deploy model

  
l = model.predict(test)

  
survived = [1 if x > 0.5 else 0 for x in l]
#original_test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
original_test_data = pd.read_csv('./Dataset/test.csv')
submission = pd.DataFrame({
    'PassengerId': original_test_data['PassengerId'],
    'Survived':  survived
})

submission.to_csv('titanic-nn.csv', index=False)

Submission of this result on Kaggle results in a public score of 0.77751, ranking 4428th out of 16823 at the time of writing.

References

This tutorial has been created based on great work done solving the Titanic competition and machine learning.

Tech

This post is licensed under CC BY 4.0 by the author.

Define the problem

Acquire the data

Prepare the data

Data Overview

Primary data analysis and visualization

Sex

Class

Family size

Age

Fare

Embarked port

Name (Title)

Cabin

Data pre-processing

Combine all data

Fill the missing values

Embarked

Fare

Age

Select the features

One-hot encoding

Normalization

Split the dataset

Define the model

Train and fine-tune model

Test and deploy model

References

Trending Tags