- Setting a learning goal
- Writing a preprocessor (binary and linear)
- Checking what an upper limit of accuracy should be
- Checking what a simple model is in terms of accuracy
- Setting up TensorFlow on M1
- GitHub files: https://github.com/TheoChristiaanse/Titanic/blob/main/new_version.ipynb (it’s a mess!)
- Will Cukierski. (2012). Titanic - Machine Learning from Disaster. Kaggle. https://kaggle.com/competitions/titanic
The problem at hand is to predict which passengers survived the maiden voyage of the Titanic. We are provided with a dataset of various known facts about the passengers, plus whether or not they survived.
My learning goal is to build a neural network with TensorFlow's sequential model API. My second goal is to get started on Kaggle and see how the site works.
In the Titanic competition, you get an example gender_submission file that predicts survival based solely on the passenger's sex. Submitting it unchanged scored 0.76 on the test set.
When I looked at the leaderboard, it seemed that many people had been able to create submissions that reach a 1.00 score.
So my first attempt was to build a simple sequential neural network, feed it some one-hot encoded categorical features, and see what came out. I used Pclass, Embarked, and Sex.
I had to write a bit of boilerplate code to do the one-hot conversion.
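That conversion boilerplate can largely be replaced by pandas. A minimal sketch (the rows below are toy stand-ins, but Pclass, Sex, and Embarked are the actual column names in the Titanic CSV):

```python
import pandas as pd

# Toy rows standing in for the Titanic training data (same column names).
df = pd.DataFrame({
    "Pclass":   [1, 3, 2, 3],
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

# get_dummies produces one binary column per category value,
# replacing hand-written one-hot conversion code.
features = pd.get_dummies(df, columns=["Pclass", "Sex", "Embarked"])
# Columns: Pclass_1..3, Sex_female/male, Embarked_C/Q/S
```

The resulting frame can be passed straight to the network as the input matrix.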
Initially I was quite happy to see a training accuracy of 0.81; however, after submitting, my test score was only 0.77. Not much better than the baseline of using sex alone.
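The network from this first attempt could be sketched roughly as follows. The layer sizes and optimizer here are illustrative guesses, not the exact configuration I used:

```python
import numpy as np
import tensorflow as tf

def build_model(n_features: int) -> tf.keras.Model:
    # A small sequential binary classifier: one hidden layer,
    # sigmoid output giving P(survived).
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With the 8 one-hot columns from Pclass, Sex, and Embarked, `build_model(8)` gives a model ready for `fit` on the training frame.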
Second try: adding linear features
I added new numeric features such as Fare and Age; training accuracy was around 0.8, but there was little improvement on the test side, with the score remaining at 0.77. To make this work I added a mask to deal with unknown values and wrote a custom linear encoder.
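A masked linear encoder in this spirit can be sketched in a few lines. The bounds and the sentinel value below are illustrative choices, not necessarily the ones in my notebook:

```python
import numpy as np

def encode_linear(values, lo, hi, missing=-1.0):
    """Scale known values into [0, 1]; replace NaNs with a sentinel.

    The sentinel acts as the mask for unknown Age/Fare entries, so the
    network always sees a consistent numeric input.
    """
    arr = np.asarray(values, dtype=float)
    scaled = (arr - lo) / (hi - lo)
    return np.where(np.isnan(arr), missing, scaled)

# Example: Age column with one missing entry, scaled against [0, 80].
ages = encode_linear([22.0, np.nan, 80.0], lo=0.0, hi=80.0)
```

Here the missing age maps to -1.0 while known ages land in [0, 1], so the model can in principle learn to treat the sentinel differently.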
Moving on with tuning and a dev set
What I wanted to do next was hyperparameter tuning. Because we only get 10 submissions, I carved a dev set out of the labelled training data so I could compare models faster. After tuning and playing around, I got stuck at 0.77 again.
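Holding out a dev set like this is a one-liner in spirit; a minimal sketch (the 20% fraction is an assumed choice):

```python
import numpy as np

def train_dev_split(X, y, dev_frac=0.2, seed=0):
    """Hold out a fraction of the labelled data as a local dev set,
    so hyperparameters can be compared without spending submissions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle row indices
    n_dev = int(len(X) * dev_frac)
    dev, train = idx[:n_dev], idx[n_dev:]  # disjoint index sets
    return X[train], y[train], X[dev], y[dev]
```

Each tuning run then trains on the train split and scores on the dev split locally, saving the real submissions for promising configurations.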
Moving on to the next competition
The total time I spent on this project was approximately 4 hours. I think it was a great intro to getting started with Kaggle. I'm not happy with my position on the leaderboard and would have loved to rank higher, but I think I've learned what I can here, so it's time to move on to the next thing.
Exploring other work
Exploring how people scored higher: some are cheating to get a perfect 1.00, and some are using large language models to push past 0.77 into the 0.8 range. I'm tempted to ask ChatGPT which passengers survived, but I'm moving on for now.