You're reading from Developing Kaggle Notebooks

Product type: Book
Published in: Dec 2023
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781805128519
Edition: 1st Edition

Author: Gabriel Preda

Dr. Gabriel Preda is a Principal Data Scientist for Endava, a major software services company. He has worked on projects in various industries, including financial services, banking, portfolio management, telecom, and healthcare, developing machine learning solutions for various business problems, including risk prediction, churn analysis, anomaly detection, task recommendations, and document information extraction. In addition, he is very active in competitive machine learning, currently holding the title of a three-time Kaggle Grandmaster and is well-known for his Kaggle Notebooks.

More images are even better

We saw how plotting the distribution of each feature can give us very interesting insights into the data. To make our observations easier, we grouped each feature on the train and test data and, for the train data only, on Survived / Not Survived. We then experimented with feature engineering to obtain more useful, more relevant features. While observing variables separately helps us form an initial picture of the data distribution, grouping values and looking at more than one feature at a time can reveal correlations and further insights into how different features interact. In the following, we will use various graphics to explore more such feature correlations, while also exploring the visualization options. For now, we keep our initial choice of combining the matplotlib and seaborn graphical libraries.

Figure 3.15 shows the number of passengers per Age interval, grouped by Passenger Class. We can see from this image that in 3rd class...
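A grouped view of this kind can be sketched as follows. This is not the book's exact plotting code; the bin edges, column names, and the tiny stand-in DataFrame below are illustrative assumptions, using pandas and matplotlib only:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# tiny stand-in for the Titanic train_df used in the chapter
train_df = pd.DataFrame({
    "Age": [4, 22, 35, 58, 71, 15, 30, 47],
    "Pclass": [3, 3, 2, 1, 1, 3, 2, 1],
})

# bin Age into intervals, then count passengers per (interval, class)
train_df["Age Interval"] = pd.cut(train_df["Age"], bins=[0, 16, 32, 48, 64, 100])
counts = pd.crosstab(train_df["Age Interval"], train_df["Pclass"])

# grouped bar chart: one group per Age interval, one bar per class
ax = counts.plot.bar(rot=0)
ax.set_ylabel("Passengers")
ax.set_title("Passengers per Age interval, grouped by Pclass")
plt.tight_layout()
plt.savefig("age_by_pclass.png")
```

The same `counts` table can also feed a seaborn `countplot` with `hue="Pclass"`, which is closer to the style used in the chapter's figures.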

What is in a name?

We now continue the analysis by including Name in the data we process, to extract meaningful information. From our initial visual inspection, we understood that all names have a similar structure: a family name, followed by a comma, then a title (in short form, ending with a period), a given name and, for passengers who acquired a new name by marriage, the former or maiden name. Let's process the data to extract this information. The code is given in the lines below.

def parse_names(row):
    try:
        text = row["Name"]
        # the family name comes before the comma
        split_text = text.split(",")
        family_name = split_text[0].strip()
        next_text = split_text[1]
        # the title is the short form ending with a period
        split_text = next_text.split(".")
        title = (split_text[0] + ".").strip()
        next_text = split_text[1]
        if "(" in next_text:
            # the maiden name appears between parentheses
            split_text = next_text.split("(")
            given_name = split_text[0].strip()
            maiden_name = split_text[1].rstrip(") ").strip()
        else:
            given_name = next_text.strip()
            maiden_name = None
        return pd.Series([family_name, title, given_name, maiden_name])
    except Exception as ex:
        print(f"Exception: {ex}")
        return pd.Series([None, None, None, None])

Aggregated view of various features

We have explored the categorical and numerical data as well as the text data. We learned how to extract various features from text data, and we built aggregated features from some of the numerical ones. Let's now build two more features by grouping Title and Family Size. We will create two new features:

  • Titles – by clustering together titles that are similar (like Miss. with Mlle., or Mrs. with Mme.) or rare (like Dona., Don., Capt., Jonkheer., Rev., the Countess.) and keeping the most frequent ones: Mr., Mrs., Master., and Miss.
  • Family Type – created from three clusters of the Family Size values: Single for a Family Size of 1, Small for families of up to 4 members, and Large for families with more than 4 members
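The two groupings above can be sketched as follows. This is not the book's exact code; the mapping dictionary, the "Rare" bucket name, and the toy DataFrame are assumptions based on the description:

```python
import pandas as pd

# toy rows standing in for the Titanic training data
df = pd.DataFrame({
    "Title": ["Mr.", "Mlle.", "Mme.", "Capt.", "Miss.", "Mrs."],
    "Family Size": [1, 3, 5, 1, 2, 6],
})

# cluster similar titles together, then bucket the rare ones
title_map = {"Mlle.": "Miss.", "Ms.": "Miss.", "Mme.": "Mrs."}
rare_titles = ["Dona.", "Don.", "Capt.", "Jonkheer.", "Rev.", "the Countess."]
df["Titles"] = df["Title"].replace(title_map)
df.loc[df["Titles"].isin(rare_titles), "Titles"] = "Rare"

# cluster Family Size into Single / Small / Large
def family_type(size):
    if size == 1:
        return "Single"
    elif size <= 4:
        return "Small"
    return "Large"

df["Family Type"] = df["Family Size"].apply(family_type)
```

The same two helpers apply unchanged to both the train and test DataFrames.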

Then, we represent on a single graph several simple or derived features that we have learned carry important predictive value (see Figure 3.25).

Figure 3.25. Passenger survival rates for different features (original or derived): Sex, Passenger Class (Pclass), Age Interval, Fare Interval, Family Type, Titles (clustered). The graphs also show the percentage of all passengers represented by each subset (defined by both category and survival status).
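The quantities shown in such a figure come down to a groupby. As a minimal sketch on toy data (the real notebook computes this on the full train_df, for each of the listed features):

```python
import pandas as pd

# toy stand-in for the training data
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# survival rate within each category of the feature
rates = df.groupby("Sex")["Survived"].mean()

# percent of ALL passengers in each (category, survived) subset,
# i.e. the percentages annotated on the figure
pct = df.groupby(["Sex", "Survived"]).size() / len(df) * 100
```

Repeating this for Pclass, Age Interval, Fare Interval, Family Type, and Titles gives one panel per feature.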

A short intro into model building

As a result of our data analysis, we were able to identify some of the features with predictive value. We can now build a model that uses this knowledge. We start with a model that uses just two of the many features we investigated. This is called a baseline model, and it is used as a starting point for the incremental refinement of the solution.

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# convert categorical data to numerical
for dataset in [train_df, test_df]:
    dataset["Sex"] = dataset["Sex"].map({"female": 1, "male": 0}).astype(int)

# train-validation split (20% validation)
VALID_SIZE = 0.2
train, valid = train_test_split(train_df, test_size=VALID_SIZE, random_state=42, shuffle=True)

# define predictors and target feature (labels)
predictors = ["Sex", "Pclass"]
target = "Survived"

# fit a Random Forest baseline and check accuracy on the validation set
clf = RandomForestClassifier(n_jobs=-1, random_state=42)
clf.fit(train[predictors], train[target])
preds = clf.predict(valid[predictors])
print("Validation accuracy:", metrics.accuracy_score(valid[target], preds))

Summary

In this chapter, we started our travels around the data world on board the Titanic. We began with a preliminary statistical analysis of each feature, continued with univariate analysis, and used feature engineering to create derived or aggregated features. We extracted multiple features from text, and we created complex graphs to visualize multiple features at the same time and reveal their predictive value. We also learned how to give our analysis a uniform visual identity by using a custom colormap across the Notebook. For some of the features, most notably those derived from names, we performed a deep-dive exploration to learn about the fate of large families on the Titanic, or the name distribution according to the embarkation port. Some of the analysis and visualization tools are easily reusable, and in one of the next chapters we will see how to extract them as utility scripts to use in other Notebooks as well.
