How to create a Notebook?

There are several ways to start a Notebook. You can start one from the “Code” entry in the main menu (see Figure 2.1), from the context of a Dataset (see Figure 2.2) or a Competition (see Figure 2.3), or by forking an existing Notebook, either one of your own or one published by someone else.

Figure 2.1. Create a new Notebook from the Code menu

When you create a new Notebook from the Code menu, it will appear in your list of Notebooks but will not be attached to any Dataset or Competition context. If you choose to start it from a Dataset instead, that dataset will already be added to the notebook's list of data sources, and you will see it in the right-side panel when you edit the notebook.

Figure 2.2. Create a new Notebook in the context of a Dataset

The same applies to a Competition: the dataset associated with it will already be present in the list of data sources when you initialize the notebook.

Figure 2.3. Create a new Notebook in the context of a Competition

To fork a notebook, press...

Frequently used features

On the right-side panel, we have quick menu actions for frequently used notebook features. The first actions are grouped under the Data section. Here you have buttons for adding datasets to the notebook or removing them. By pressing the Add Data button, you can add an existing dataset: there is a search text box, plus quick buttons to select from your own datasets, competition datasets, or your notebooks. When you select your notebooks, you can include notebook outputs as data sources for the current notebook. There is also an upload button next to the Add Data button, which you can use to upload a new dataset before adding it to the notebook. In the same Data section of the panel, you have an input and output folder browser, with buttons next to each item so that you can copy the path to any folder or file (a short listing sketch follows below).

Right under the Data section, we have the Models section. Here we can add models to the Notebook. Models are a new feature on the platform, and this allows...
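As an illustration of the input folder layout browsed in the Data section, here is a minimal sketch that lists every file attached to the notebook; it relies only on the standard /kaggle/input path that Kaggle Notebooks mount data sources under.

import os

# Walk the standard Kaggle input directory and print each file's path;
# these are the same paths the Data panel lets you copy.
for dirname, _, filenames in os.walk("/kaggle/input"):
    for filename in filenames:
        print(os.path.join(dirname, filename))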

Special features

Set as Utility Script or add Utility Scripts

In most cases, you will write all the code for your Notebook in successive cells, in the same file. For more complex code, and especially when you would like to reuse some of the code without copying and pasting between notebooks, you can choose to develop utility modules. Kaggle Notebooks offers a useful feature for this purpose, namely utility scripts. Utility scripts are defined in the same way Notebooks are. To create a utility script, you start a Notebook and then choose the “Save as Utility Script” item from the File menu. If you want to use a utility script in your current notebook, select the “Add utility scripts” item from the File menu. This will open a selector window for utility scripts on the right-side panel, where you can choose from your utility scripts and add one or more to the notebook. The utility scripts added to the notebook will be listed under a separate...
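Once added, a utility script can be imported like a regular Python module under the script's name. A hypothetical sketch (both the script name and the function it defines are invented for illustration):

# "titanic_utils" is a hypothetical utility script added via
# File > Add utility scripts; Kaggle places it on the Python path.
import titanic_utils

# Call a function assumed to be defined in that script (hypothetical)
titanic_utils.set_color_map()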

Use Google Cloud Services in Kaggle Notebooks

From the Add-ons menu, select Google Cloud Services. In the dialog window that opens, attach to the notebook one of the Google accounts available in the list. You can then select which Google Cloud services you want to integrate with your Kaggle environment. Currently, Kaggle offers integration with Google Cloud Storage, BigQuery, and AutoML. When using these services through Kaggle Notebooks, you need to know that this will incur charges, according to the plan you have. If you choose to use only public data with BigQuery, you will not incur any charges.
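As an illustration, here is a minimal sketch of querying a BigQuery public dataset from a notebook with the BigQuery service attached; the public table used is just an example, and the attached Google account supplies the credentials.

from google.cloud import bigquery

# The attached account and project are provided by the add-on
client = bigquery.Client()

# Query a public table (per the note above, public data queries are free)
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row["name"], row["total"])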

Upgrade your Kaggle Notebook to Google Cloud AI Notebook

If you reach the limit of the resources available for Kaggle Notebooks (RAM, number of cores, execution time), you can choose to promote your work to Google Cloud AI Notebooks by exporting your Notebook to Google Cloud. For this action, select File | Upgrade to Google AI Notebook.

Figure 2.7. Upgrade to Google Cloud AI Platform Notebooks

You will...

Summary

In this chapter, we learned what Kaggle Notebooks are, what types we can use, and which programming languages they support; we also learned how to create, run, and update Notebooks. We then visited the most common features of notebooks: how to connect them to data sources and models, and how to set accelerators, the environment, and internet access. Next, we reviewed less frequently used features: utility scripts, secrets, connecting to Google Cloud to use Google Cloud Storage, BigQuery, or AutoML services, and how to use the Kaggle Notebooks interface to automate dataset updates. Finally, we introduced the Kaggle API to further extend your usage of Notebooks, allowing you to build external data and ML pipelines that integrate with your Kaggle environment.

Extracting meaningful information from passenger names

We continue now with our analysis by examining the passengers’ names to extract meaningful information. As you will remember from the beginning of this chapter, the Name column also contains some additional information. After our preliminary visual analysis, it became apparent that all names follow a similar structure. They begin with a Family Name, followed by a comma, then a Title (in short form, followed by a period), then a Given Name and, in cases where a new name was acquired through marriage, the previous or Maiden Name. Let’s process the data to extract this information. The code to extract it is:

def parse_names(row):
    # Completed here as a sketch from the structure above; assumes pandas as pd
    try:
        text = row["Name"]
        split_text = text.split(",")
        family_name = split_text[0].strip()  # family name precedes the comma
        next_text = split_text[1]
        split_text = next_text.split(".")
        title = (split_text[0] + ".").strip()  # the title ends with a period
        next_text = split_text[1]
        if "(" in next_text:  # a maiden name, if present, is in parentheses
            given_name, maiden_name = next_text.split("(", 1)
            return pd.Series([family_name, title, given_name.strip(), maiden_name.strip(" )")])
        return pd.Series([family_name, title, next_text.strip(), None])
    except Exception as ex:
        print(f"Exception: {ex}")
        return pd.Series([None, None, None, None])
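A possible usage, applying the parser row-wise to create new columns (the DataFrame and column names are assumptions for illustration):

train_df[["Family Name", "Title", "Given Name", "Maiden Name"]] = train_df.apply(parse_names, axis=1)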

Creating a dashboard showing multiple plots

We have explored categorical and numerical data, as well as text data. We have learned how to extract various features from text data, and we have built aggregated features from some of the numerical ones. Let’s now build two more features by grouping the Title and Family Size values (a code sketch follows the list):

  • Titles: By clustering together similar titles (like Miss with Mlle., or Mrs. with Mme.), grouping rare ones (like Dona., Don., Capt., Jonkheer., Rev., and Countess.), and keeping the most frequent ones – Mr., Mrs., Master, and Miss
  • Family Type: By creating three clusters from the Family Size values – Single for a family size of 1, Small for families of up to 4 members, and Large for families with more than 4 members
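A minimal sketch of these two groupings, assuming a train_df DataFrame that already holds the Title and Family Size columns built earlier (the names are assumptions for illustration):

# Map similar titles onto the frequent ones, then group the rare ones
title_map = {"Mlle.": "Miss.", "Mme.": "Mrs."}
rare_titles = ["Dona.", "Don.", "Capt.", "Jonkheer.", "Rev.", "Countess."]
train_df["Titles"] = train_df["Title"].replace(title_map)
train_df.loc[train_df["Titles"].isin(rare_titles), "Titles"] = "Rare"

# Bucket Family Size into the three clusters described above
def family_type(size):
    if size == 1:
        return "Single"
    return "Small" if size <= 4 else "Large"

train_df["Family Type"] = train_df["Family Size"].apply(family_type)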

Then, we will represent, on a single graph, several simple or derived features that we learned have important predictive value. We show the passengers’ survival rates for Sex, Passenger...
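A minimal sketch of such a multi-plot dashboard, assuming train_df; the exact feature list is an assumption for illustration:

import matplotlib.pyplot as plt
import seaborn as sns

# One bar plot of mean survival rate per feature value, shown side by side
features = ["Sex", "Pclass", "Embarked", "Titles", "Family Type"]
fig, axes = plt.subplots(1, len(features), figsize=(20, 4))
for ax, feature in zip(axes, features):
    sns.barplot(data=train_df, x=feature, y="Survived", ax=ax)
    ax.set_title(f"Survival rate by {feature}")
plt.tight_layout()
plt.show()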

Building a baseline model

As a result of our data analysis, we were able to identify some of the features with predictive value. We can now build a model by using this knowledge to select relevant features. We will start with a model that will use just two out of the many features we investigated. This is called a baseline model and it is used as a starting point for the incremental refinement of the solution.

For the baseline model, we chose a RandomForestClassifier model. The model is simple to use, gives good results with the default parameters, and can be interpreted easily, using feature importance.

Let’s begin with the following code block to implement the model. First, we import a few libraries that are needed to prepare the model. Then, we convert the categorical data to numerical. We need to do this since the model we chose deals with numbers only. The operation of converting the categorical feature values to numbers is called label encoding. Then, we split...
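A minimal sketch along these lines, assuming train_df and using Sex and Pclass as the two illustrative features (the actual pair selected may differ):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Label-encode the categorical feature so the model receives numbers only
features = ["Sex", "Pclass"]  # assumed feature pair
X = train_df[features].copy()
X["Sex"] = LabelEncoder().fit_transform(X["Sex"])

# Split the data, then fit the baseline RandomForestClassifier
X_train, X_valid, y_train, y_valid = train_test_split(
    X, train_df["Survived"], test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

print("Validation accuracy:", model.score(X_valid, y_valid))
print("Feature importance:", dict(zip(features, model.feature_importances_)))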

Summary

In this chapter, we started our journey around the data world on board the Titanic. We began with a preliminary statistical analysis of each feature and then continued with univariate analysis and feature engineering to create derived or aggregated features. We extracted multiple features from text, and we also created complex graphs to visualize multiple features at the same time and reveal their predictive value. We then learned how to give our analysis a uniform visual identity by using a custom color map across the notebook.

For some of the features – most notably, those derived from names – we performed a deep-dive exploration to learn about the fate of large families on the Titanic and about name distribution according to the embarking port. Some of the analysis and visualization tools are easily reusable and, in the next chapter, we will see how to extract them to be used as utility scripts in other notebooks as well.

In the next chapter...

References

  1. Titanic - Machine Learning from Disaster, Kaggle competition: https://www.kaggle.com/competitions/titanic
  2. Gabriel Preda, Titanic – start of a journey around data world, Kaggle notebook: https://www.kaggle.com/code/gpreda/titanic-start-of-a-journey-around-data-world
  3. Developing-Kaggle-Notebooks, Packt Publishing GitHub repository: https://github.com/PacktPublishing/Developing-Kaggle-Notebooks/
  4. Developing-Kaggle-Notebooks, Packt Publishing GitHub repository, Chapter 3: https://github.com/PacktPublishing/Developing-Kaggle-Notebooks/tree/main/Chapter-03

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 5000 members at:

https://packt.link/kaggle
