How to create a Notebook?

There are several ways to start a Notebook. You can start one from the “Code” entry in the main menu (see Figure 2.1), from the context of a Dataset (see Figure 2.2) or a Competition (see Figure 2.3), or by forking an existing Notebook, either one of your own or one published by someone else.

Figure 2.1. Create a new Notebook from the Code menu

When you create a new Notebook from the Code menu, it will appear in your list of Notebooks but will not be attached to any Dataset or Competition context. If you choose to start it from a Dataset instead, that dataset will already be added to the notebook's list of data sources, and you will see it in the right-side panel when you edit the notebook.

Figure 2.2. Create a new Notebook in the context of a Dataset

The same applies to a Competition: the dataset associated with it will already be present in the list of data sources when you initialize the notebook.

Figure 2.3. Create a new Notebook in the context of a Competition

To fork a notebook, press...

Frequently used features

On the right-side panel, we have quick menu actions for frequently used notebook features. The first actions are grouped under the Data section. Here you have buttons for adding datasets to the notebook or removing them. By pressing the Add Data button, you can add an existing dataset: there is a search text box, plus quick buttons to select from your own datasets, competition datasets, or your notebooks. When you select your notebooks, you can include notebook outputs as data sources for the current notebook. There is also an upload button next to the Add Data button, which you can use to upload a new dataset before adding it to the notebook. In the same Data section of the panel, you have an input and output folder browser, with buttons next to each item so that you can copy the path to any folder or file (a short listing sketch follows below).

Right under the Data section, we have the Models section. Here we can add models to the Notebook. Models are a new feature on the platform, and this allows...
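As an illustration of the input folder layout browsed in the Data section, here is a minimal sketch that lists every file attached to the notebook; it relies only on the standard /kaggle/input path that Kaggle Notebooks mount data sources under.

import os

# Walk the standard Kaggle input directory and print each file's path;
# these are the same paths the Data panel lets you copy.
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))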

Special features

Set as Utility Script or add Utility Scripts

In most cases, you will write all the code for your Notebook in successive cells, in the same file. For more complex code, and especially when you would like to reuse some of the code without copying and pasting between notebooks, you can choose to develop utility modules. Kaggle Notebooks offers a useful feature for this purpose, namely utility scripts. Utility scripts are defined in the same way Notebooks are. To create a utility script, you start a Notebook and then choose the “Save as Utility Script” item from the File menu. If you want to use a utility script in your current notebook, select the “Add utility scripts” item from the File menu. This will open a selector window for utility scripts on the right-side panel, where you can choose from your utility scripts and add one or more to the notebook. The utility scripts added to the notebook will be listed under a separate...
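Once added, a utility script can be imported like a regular Python module under the script's name. A hypothetical sketch (both the script name and the function it defines are invented for illustration):

# "titanic_utils" is a hypothetical utility script added via
# File > Add utility scripts; Kaggle places it on the Python path.
import titanic_utils

# Call a function assumed to be defined in that script (hypothetical)
titanic_utils.set_color_map()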

Use Google Cloud Services in Kaggle Notebooks

From the Add-ons menu, select Google Cloud Services. In the dialog window that opens, attach to the notebook one of the Google accounts available in the list. You can then select which Google Cloud services you want to integrate with your Kaggle environment. Currently, Kaggle offers integration with Google Cloud Storage, BigQuery, and AutoML. When using these services through Kaggle Notebooks, you need to know that this will incur charges, according to the plan you have. If you choose to use only public data with BigQuery, you will not incur any charges.
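As an illustration, here is a minimal sketch of querying a BigQuery public dataset from a notebook with the BigQuery service attached; the public table used is just an example, and the attached Google account supplies the credentials.

from google.cloud import bigquery

# The attached account and project are provided by the add-on
client = bigquery.Client()

# Query a public table (per the note above, public data queries are free)
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row["name"], row["total"])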

Upgrade your Kaggle Notebook to Google Cloud AI Notebook

If you reach the limit of the resources available for Kaggle Notebooks (RAM, number of cores, execution time), you can choose to promote your work to Google Cloud AI Notebooks by exporting your Notebook to Google Cloud. For this action, select File | Upgrade to Google AI Notebook.

Figure 2.7. Upgrade to Google Cloud AI Platform Notebooks

You will...

Summary

In this chapter, we learned what Kaggle Notebooks are, what types we can use, and which programming languages they support; we also learned how to create, run, and update Notebooks. We then visited the most common features of notebooks: how to connect them to data sources and models, and how to set accelerators, the environment, and internet access. Next, we reviewed less frequently used features: utility scripts, secrets, connecting to Google Cloud to use Google Cloud Storage, BigQuery, or AutoML services, and how to use the Kaggle Notebooks interface to automate dataset updates. Finally, we introduced the Kaggle API to further extend your usage of Notebooks, allowing you to build external data and ML pipelines that integrate with your Kaggle environment.

Extracting meaningful information from passenger names

We continue now with our analysis by examining the passengers’ names to extract meaningful information. As you will remember from the beginning of this chapter, the Name column also contains some additional information. After our preliminary visual analysis, it became apparent that all names follow a similar structure. They begin with a Family Name, followed by a comma, then a Title (in short form, followed by a period), then a Given Name and, in cases where a new name was acquired through marriage, the previous or Maiden Name. Let’s process the data to extract this information. The code to extract it is:

def parse_names(row):
    # Completed here as a sketch from the structure above; assumes pandas as pd
    try:
        text = row["Name"]
        split_text = text.split(",")
        family_name = split_text[0].strip()  # family name precedes the comma
        next_text = split_text[1]
        split_text = next_text.split(".")
        title = (split_text[0] + ".").strip()  # the title ends with a period
        next_text = split_text[1]
        if "(" in next_text:  # a maiden name, if present, is in parentheses
            given_name, maiden_name = next_text.split("(", 1)
            return pd.Series([family_name, title, given_name.strip(), maiden_name.strip(" )")])
        return pd.Series([family_name, title, next_text.strip(), None])
    except Exception as ex:
        print(f"Exception: {ex}")
        return pd.Series([None, None, None, None])
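A possible usage, applying the parser row-wise to create new columns (the DataFrame and column names are assumptions for illustration):

train_df[["Family Name", "Title", "Given Name", "Maiden Name"]] = train_df.apply(parse_names, axis=1)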

Creating a dashboard showing multiple plots

We have explored categorical and numerical data, as well as text data. We have learned how to extract various features from text data, and we have built aggregated features from some of the numerical ones. Let’s now build two more features by grouping the Title and Family Size values (a code sketch follows the list):

  • Titles: By clustering together similar titles (like Miss with Mlle., or Mrs. with Mme.), grouping rare ones (like Dona., Don., Capt., Jonkheer., Rev., and Countess.), and keeping the most frequent ones – Mr., Mrs., Master, and Miss
  • Family Type: By creating three clusters from the Family Size values – Single for a family size of 1, Small for families of up to 4 members, and Large for families with more than 4 members
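A minimal sketch of these two groupings, assuming a train_df DataFrame that already holds the Title and Family Size columns built earlier (the names are assumptions for illustration):

# Map similar titles onto the frequent ones, then group the rare ones
title_map = {"Mlle.": "Miss.", "Mme.": "Mrs."}
rare_titles = ["Dona.", "Don.", "Capt.", "Jonkheer.", "Rev.", "Countess."]
train_df["Titles"] = train_df["Title"].replace(title_map)
train_df.loc[train_df["Titles"].isin(rare_titles), "Titles"] = "Rare"

# Bucket Family Size into the three clusters described above
def family_type(size):
    if size == 1:
        return "Single"
    return "Small" if size <= 4 else "Large"

train_df["Family Type"] = train_df["Family Size"].apply(family_type)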

Then, we will represent, on a single graph, several simple or derived features that we learned have important predictive value. We show the passengers’ survival rates for Sex, Passenger...
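A minimal sketch of such a multi-plot dashboard, assuming train_df; the exact feature list is an assumption for illustration:

import matplotlib.pyplot as plt
import seaborn as sns

# One bar plot of mean survival rate per feature value, shown side by side
features = ["Sex", "Pclass", "Embarked", "Titles", "Family Type"]
fig, axes = plt.subplots(1, len(features), figsize=(20, 4))
for ax, feature in zip(axes, features):
    sns.barplot(data=train_df, x=feature, y="Survived", ax=ax)
    ax.set_title(f"Survival rate by {feature}")
plt.tight_layout()
plt.show()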

Building a baseline model

As a result of our data analysis, we were able to identify some of the features with predictive value. We can now build a model by using this knowledge to select relevant features. We will start with a model that will use just two out of the many features we investigated. This is called a baseline model and it is used as a starting point for the incremental refinement of the solution.

For the baseline model, we chose a RandomForestClassifier model. The model is simple to use, gives good results with the default parameters, and can be interpreted easily, using feature importance.

Let’s begin with the following code block to implement the model. First, we import a few libraries that are needed to prepare the model. Then, we convert the categorical data to numerical. We need to do this since the model we chose deals with numbers only. The operation of converting the categorical feature values to numbers is called label encoding. Then, we split...
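A minimal sketch along these lines, assuming train_df and using Sex and Pclass as the two illustrative features (the actual pair selected may differ):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Label-encode the categorical feature so the model receives numbers only
features = ["Sex", "Pclass"]  # assumed feature pair
X = train_df[features].copy()
X["Sex"] = LabelEncoder().fit_transform(X["Sex"])

# Split the data, then fit the baseline RandomForestClassifier
X_train, X_valid, y_train, y_valid = train_test_split(
    X, train_df["Survived"], test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

print("Validation accuracy:", model.score(X_valid, y_valid))
print("Feature importance:", dict(zip(features, model.feature_importances_)))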

Summary

In this chapter, we started our journey around the data world on board the Titanic. We began with a preliminary statistical analysis of each feature and then continued with univariate analysis and feature engineering to create derived or aggregated features. We extracted multiple features from text, and we also created complex graphs to visualize multiple features at the same time and reveal their predictive value. We then learned how to give our analysis a uniform visual identity by using a custom color map across the notebook.

For some of the features – most notably, those derived from names – we performed a deep-dive exploration to learn about the fate of large families on the Titanic and about name distribution according to the embarking port. Some of the analysis and visualization tools are easily reusable and, in the next chapter, we will see how to extract them to be used as utility scripts in other notebooks as well.

In the next chapter...

References

  1. Titanic - Machine Learning from Disaster, Kaggle competition: https://www.kaggle.com/competitions/titanic
  2. Gabriel Preda, Titanic – start of a journey around data world, Kaggle notebook: https://www.kaggle.com/code/gpreda/titanic-start-of-a-journey-around-data-world
  3. Developing-Kaggle-Notebooks, Packt Publishing GitHub repository: https://github.com/PacktPublishing/Developing-Kaggle-Notebooks/
  4. Developing-Kaggle-Notebooks, Packt Publishing GitHub repository, Chapter 3: https://github.com/PacktPublishing/Developing-Kaggle-Notebooks/tree/main/Chapter-03

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 5000 members at:

https://packt.link/kaggle
