Titanic Kaggle Competition

February 15, 2018

The main idea is to go through the whole process of data analysis and machine learning model building, from data cleaning to model evaluation. The final result is at this link: My Titanic Kaggle. Still, I'm going to go step by step through the code.

Main idea

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. I will get the dataset, create a couple of DataFrames to analyze the data, clean it, create a model and evaluate it, and finally predict the results.

Loading the data

First of all, I load the Kaggle data from the URLs below, using pandas to read the CSVs, and then make train and test batches to work with.

import pandas as pd
import sys
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)


test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)
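Before cleaning, it helps to see which columns actually have gaps. The sketch below counts missing values per column on a tiny stand-in frame; the `toy` DataFrame and its values are made up for illustration and are not the Kaggle data. On the real data the same call would be `train.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; these values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column to see which ones need imputing.
missing_counts = toy.isna().sum()
print(missing_counts)
```

Columns with a non-zero count are the ones that need a `fillna` step before modelling.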

Then I encode the dataset, assigning integers to the different characteristics and classes, and also create new features to improve the model.

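# Prepare the dataset

# Convert the male and female groups to integer form.
# .map avoids chained assignment, which newer pandas versions warn about or ignore.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})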

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})


train["Age"] = train["Age"].fillna(train["Age"].median())
train["Fare"] = train["Fare"].fillna(train["Fare"].median())

# Create new features

# Child is 1 for passengers under 18 and 0 otherwise
train["Child"] = (train["Age"] < 18).astype(int)


train["Family_Size"] = train["SibSp"].values + train["Parch"].values + 1

Train and test with my custom data

training:

X = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked", "Child", "Family_Size"]].values

y = train["Survived"].values


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=35)
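The split above keeps 70% of the rows for training and holds out 30% for scoring. One optional variation, not used in the original code, is to pass `stratify` so both parts keep roughly the same share of survivors. A minimal sketch on made-up labels (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels with a 30/70 class imbalance, only to illustrate the split.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

# Same test_size and random_state as in the post, plus stratify (an addition).
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=35, stratify=y_demo
)

print(len(X_tr), len(X_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```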

test:

test["Age"] = test["Age"].fillna(test["Age"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())


# Convert the male and female groups to integer form
# (.map avoids chained assignment, which newer pandas versions warn about or ignore)
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})
# Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
test["Embarked"] = test["Embarked"].map({"S": 0, "C": 1, "Q": 2})

test["Child"] = 0
test["Child"][test["Age"] < 18] = 1
test["Child"][test["Age"] > 18] = 0
test["Child"][test["Age"] == 18] = 0


test["Family_Size"] = test["SibSp"].values + test["Parch"].values + 1

Making the classifier

Now the fun part: I use MLPClassifier from sklearn.neural_network to build the classifier.

--> MLPClassifier official docs

from sklearn.neural_network import MLPClassifier

X = X_train
y = y_train

clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(9, 27, 22, 20, 9),
                    max_iter=2000, random_state=1)
clf.fit(X, y)
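The classifier above is trained on the raw feature values. The scikit-learn documentation notes that multi-layer perceptrons are sensitive to feature scaling, so one variation worth trying (an addition, not what the original code does) is to standardize the features in a pipeline. The sketch below uses made-up data; `X_demo`, `y_demo` and `scaled_clf` are hypothetical names.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales, loosely like Pclass, Age and Fare.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3)) * [1.0, 50.0, 500.0]
y_demo = (X_demo[:, 0] > 0).astype(int)

# StandardScaler learns mean and variance at fit time and reuses them at predict time.
scaled_clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(9, 27, 22, 20, 9),
                  max_iter=2000, random_state=1),
)
scaled_clf.fit(X_demo, y_demo)
print(scaled_clf.predict(X_demo[:5]))
```

Because the scaler lives inside the pipeline, the same object can be used for `score` and `predict` without scaling the held-out data by hand.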

Then I score the model on the held-out split and predict on the Kaggle test set to build the submission file.

test_features_clf=X_test
test_target_clf=y_test


my_clf = clf.predict(test_features_clf)
print(clf.score(test_features_clf, test_target_clf))
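For a classifier, `clf.score` returns the mean accuracy on the data it is given. A confusion matrix breaks that single number into correct and wrong predictions per class. A small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not results from the model above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions, only to show how the metrics are read.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

On the real split the equivalent calls would be `confusion_matrix(y_test, my_clf)` and `accuracy_score(y_test, my_clf)`.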

test_clf_to_submit = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked","Child","Family_Size"]].values


pred_clf_to_submit = clf.predict(test_clf_to_submit)

PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution_clf = pd.DataFrame(pred_clf_to_submit, PassengerId, columns = ["Survived"])
my_solution_clf.to_csv("ClfSolution2.csv", index_label = ["PassengerId"])
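Before uploading, it can be worth reading the file back to check that it has exactly the `PassengerId` and `Survived` columns and one row per test passenger. The sketch below does this in memory with made-up ids and predictions; `submission` and `check` are hypothetical names.

```python
import io
import pandas as pd

# Made-up ids and predictions, only to illustrate the file layout.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

check = pd.read_csv(io.StringIO(csv_text))
print(check.columns.tolist())
print(len(check))
```

For the real file, the same check would be `pd.read_csv("ClfSolution2.csv")` followed by a comparison of `len(check)` with `len(test)`.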

Luciano Lupo Notes.
