You're reading from Developing Kaggle Notebooks

Product type: Book
Published in: Dec 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781805128519
Edition: 1st Edition

Author: Gabriel Preda

Dr. Gabriel Preda is a Principal Data Scientist for Endava, a major software services company. He has worked on projects in various industries, including financial services, banking, portfolio management, telecom, and healthcare, developing machine learning solutions for various business problems, including risk prediction, churn analysis, anomaly detection, task recommendations, and document information extraction. In addition, he is very active in competitive machine learning, currently holding the title of a three-time Kaggle Grandmaster and is well-known for his Kaggle Notebooks.

More images are even better

We saw how plotting the distribution of each feature can give us very interesting insights into the data. To make our observations easier, we grouped each feature on the train and test data and, for the train data only, on Survived / Not Survived. We then experimented with feature engineering to obtain more useful, more relevant features. While observing variables separately helps us form an initial picture of the data distribution, grouping values and looking at more than one feature at a time can reveal correlations and further insights into how different features interact. In the following, we will use various graphics to explore more such feature correlations, while also exploring the visualization options. For now, we keep our initial choice of combining the matplotlib and seaborn graphical libraries.

Figure 3.15 shows the number of passengers per Age interval, grouped by Passenger Class. We can see from this image that in 3rd class...
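A grouped view of this kind can be sketched as follows. This is not the book's exact plotting code; the bin edges, column names, and the tiny stand-in DataFrame below are illustrative assumptions, using pandas and matplotlib only:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# tiny stand-in for the Titanic train_df used in the chapter
train_df = pd.DataFrame({
    "Age": [4, 22, 35, 58, 71, 15, 30, 47],
    "Pclass": [3, 3, 2, 1, 1, 3, 2, 1],
})

# bin Age into intervals, then count passengers per (interval, class)
train_df["Age Interval"] = pd.cut(train_df["Age"], bins=[0, 16, 32, 48, 64, 100])
counts = pd.crosstab(train_df["Age Interval"], train_df["Pclass"])

# grouped bar chart: one group per Age interval, one bar per class
ax = counts.plot.bar(rot=0)
ax.set_ylabel("Passengers")
ax.set_title("Passengers per Age interval, grouped by Pclass")
plt.tight_layout()
plt.savefig("age_by_pclass.png")
```

The same `counts` table can also feed a seaborn `countplot` with `hue="Pclass"`, which is closer to the style used in the chapter's figures.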

What is in a name?

We now continue the analysis by including Name in the data we process, to extract meaningful information. From our initial visual inspection, we understood that all names have a similar structure: a family name, followed by a comma, then a title (in short form, ending with a period), a given name and, for passengers who acquired a new name by marriage, the former or maiden name. Let's process the data to extract this information. The code is given in the lines below.

def parse_names(row):
    try:
        text = row["Name"]
        # the family name comes before the comma
        split_text = text.split(",")
        family_name = split_text[0].strip()
        next_text = split_text[1]
        # the title is the short form ending with a period
        split_text = next_text.split(".")
        title = (split_text[0] + ".").strip()
        next_text = split_text[1]
        if "(" in next_text:
            # the maiden name appears between parentheses
            split_text = next_text.split("(")
            given_name = split_text[0].strip()
            maiden_name = split_text[1].rstrip(") ").strip()
        else:
            given_name = next_text.strip()
            maiden_name = None
        return pd.Series([family_name, title, given_name, maiden_name])
    except Exception as ex:
        print(f"Exception: {ex}")
        return pd.Series([None, None, None, None])

Aggregated view of various features

We have explored the categorical and numerical data as well as the text data. We learned how to extract various features from text data, and we built aggregated features from some of the numerical ones. Let's now build two more features by grouping Title and Family Size. We will create two new features:

  • Titles – by clustering together titles that are similar (like Miss. with Mlle., or Mrs. with Mme.) or rare (like Dona., Don., Capt., Jonkheer., Rev., the Countess.) and keeping the most frequent ones: Mr., Mrs., Master., and Miss.
  • Family Type – created from three clusters of the Family Size values: Single for a Family Size of 1, Small for families of up to 4 members, and Large for families with more than 4 members
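The two groupings above can be sketched as follows. This is not the book's exact code; the mapping dictionary, the "Rare" bucket name, and the toy DataFrame are assumptions based on the description:

```python
import pandas as pd

# toy rows standing in for the Titanic training data
df = pd.DataFrame({
    "Title": ["Mr.", "Mlle.", "Mme.", "Capt.", "Miss.", "Mrs."],
    "Family Size": [1, 3, 5, 1, 2, 6],
})

# cluster similar titles together, then bucket the rare ones
title_map = {"Mlle.": "Miss.", "Ms.": "Miss.", "Mme.": "Mrs."}
rare_titles = ["Dona.", "Don.", "Capt.", "Jonkheer.", "Rev.", "the Countess."]
df["Titles"] = df["Title"].replace(title_map)
df.loc[df["Titles"].isin(rare_titles), "Titles"] = "Rare"

# cluster Family Size into Single / Small / Large
def family_type(size):
    if size == 1:
        return "Single"
    elif size <= 4:
        return "Small"
    return "Large"

df["Family Type"] = df["Family Size"].apply(family_type)
```

The same two helpers apply unchanged to both the train and test DataFrames.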

Then, we represent on a single graph several simple or derived features that we have learned carry important predictive value (see Figure 3.25).

Figure 3.25. Passenger survival rates for different features (original or derived): Sex, Passenger Class (Pclass), Age Interval, Fare Interval, Family Type, Titles (clustered). The graphs also show the percentage of all passengers represented by each subset (defined by both category and survival status).
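The quantities shown in such a figure come down to a groupby. As a minimal sketch on toy data (the real notebook computes this on the full train_df, for each of the listed features):

```python
import pandas as pd

# toy stand-in for the training data
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# survival rate within each category of the feature
rates = df.groupby("Sex")["Survived"].mean()

# percent of ALL passengers in each (category, survived) subset,
# i.e. the percentages annotated on the figure
pct = df.groupby(["Sex", "Survived"]).size() / len(df) * 100
```

Repeating this for Pclass, Age Interval, Fare Interval, Family Type, and Titles gives one panel per feature.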

A short intro into model building

As a result of our data analysis, we were able to identify some of the features with predictive value. We can now build a model that uses this knowledge. We start with a model that uses just two of the many features we investigated. This is called a baseline model, and it is used as a starting point for the incremental refinement of the solution.

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# convert categorical data to numerical
for dataset in [train_df, test_df]:
    dataset["Sex"] = dataset["Sex"].map({"female": 1, "male": 0}).astype(int)

# train-validation split (20% validation)
VALID_SIZE = 0.2
train, valid = train_test_split(train_df, test_size=VALID_SIZE, random_state=42, shuffle=True)

# define predictors and target feature (labels)
predictors = ["Sex", "Pclass"]
target = "Survived"

# fit a Random Forest baseline and check accuracy on the validation set
clf = RandomForestClassifier(n_jobs=-1, random_state=42)
clf.fit(train[predictors], train[target])
preds = clf.predict(valid[predictors])
print("Validation accuracy:", metrics.accuracy_score(valid[target], preds))

Summary

In this chapter, we started our travels around the data world on board the Titanic. We began with a preliminary statistical analysis of each feature, continued with univariate analysis, and used feature engineering to create derived or aggregated features. We extracted multiple features from text, and we created complex graphs to visualize multiple features at the same time and reveal their predictive value. We also learned how to give our analysis a uniform visual identity by using a custom colormap across the Notebook. For some of the features, most notably those derived from names, we performed a deep-dive exploration to learn about the fate of large families on the Titanic, or the name distribution according to the embarkation port. Some of the analysis and visualization tools are easily reusable, and in one of the next chapters we will see how to extract them as utility scripts to use in other Notebooks as well.
