Developing Kaggle Notebooks

Introducing Kaggle and Its Basic Functions

Kaggle is currently the main platform for competitive predictive modeling. Here, those who are passionate about machine learning, both experts and beginners, have a collaborative and competitive environment to learn, win recognition, share knowledge, and give back to the community. The company was launched in 2010, offering only machine learning competitions. Currently, it is a data platform that includes sections titled Competitions, Datasets, Code, Discussions, Learn, and, most recently, Models.

In 2011, Kaggle went through an investment round, valuing the company above $25 million. In 2017, it was acquired by Google (now Alphabet Inc.), becoming associated with Google Cloud. The most notable key persons from Kaggle are co-founders Anthony Goldbloom (long-time CEO until 2022) and Ben Hammer (CTO). Recently, D. Sculley, the legendary Google engineer, became Kaggle’s new CEO, after Anthony Goldbloom stepped down to become involved in the development of a new start-up.

In this first chapter, we’ll explore the main sections that the Kaggle platform offers its members. We will also learn how to create an account, how the platform is organized, and what its main sections are. In short, this chapter will cover the following topics:

The Kaggle platform
Kaggle Competitions
Kaggle Datasets
Kaggle Code
Kaggle Discussions
Kaggle Learn
Kaggle Models

If you are familiar with the Kaggle platform, you probably know about these features already. You can choose to continue reading the following sections to refresh your knowledge about the platform or you can skip them and go directly to the next chapter.

The Kaggle platform

To start using Kaggle, you will have to create an account. You can register with your email and password or authenticate using your Google account directly. Once registered, you can start by creating a profile with your name, picture, role, and current organization. You then can add your location, which is optional, and a short personal presentation as well. After you perform an SMS verification and add some minimal content on the platform (run one notebook or script, make one competition submission, make one comment, or give one upvote), you will also be promoted from Novice to Contributor. The following figure shows a checklist for how to become a contributor. As you can see, all items are checked, which means that the user has already been promoted to the Contributor tier.

A close-up of a white background Description automatically generated

Figure 1.1: Checklist to become a contributor

With the entire Contributor checklist completed, you are ready to start your Kaggle journey.

The current platform contains multiple features. The most important are:

Competitions: This is where Kagglers can take part in competitions and submit their solutions to be scored.
Datasets: In this section, users can upload datasets.
Code: This is one of the most complex features of Kaggle. Also known as Kernels or Notebooks, it allows users to add code (independently or connected to datasets and competitions), modify it, run it to perform analysis, prepare models, and generate submission files for competitions.
Discussions: In this section, contributors on the platform can add topics and comments to competitions, Notebooks, or datasets. Topics can also be added independently and linked to themes such as Getting Started.

Each of these sections allows you to gain medals, according to Kaggle’s progression system. Once you start to contribute to one of these sections, you can also be ranked in the overall Kaggle ranking system for the respective section. There are two main methods to gain medals: by winning top positions in competitions and by getting upvotes for your work in the Datasets, Code, and Discussions sections.

Besides Competitions, Datasets, Code, and Discussions, there are two more sections with content on Kaggle:

Learn: This is one of the coolest features of Kaggle. It contains a series of lectures and tutorials on various topics, from a basic introduction to programming languages to advanced topics like computer vision, model interpretability, and AI ethics. You can use all the other Kaggle resources as support materials for the lectures (Datasets, Competitions, Code, and Discussions).
Models: This is the newest feature introduced on Kaggle. It allows you to load a model into your code, in the same way that you currently add datasets.

Now that we’ve had a quick overview of the various features of the Kaggle platform, the following sections will give you an in-depth view of Competitions, Datasets, Code, Discussions, Learn, and Models. Let’s get started!

Kaggle Competitions

It all started with Competitions more than 12 years ago. The first competition had just a few participants. With the growing interest in machine learning and the increased community around Kaggle, the complexity of the competitions, the number of participants, and the interest around competitions increased significantly.

To start a competition, the competition host prepares a dataset, typically split between train and test. In the most common form, the train set has labeled data available, while the test set only contains the feature data. The host also adds information about the data and a presentation of the competition objective. This includes a description of the problem to set the background for the competitors. The host also adds information about the metrics used to evaluate the solutions to the competition. The terms and conditions of the competitions are also specified.

Competitors are allowed to submit a limited number of solutions per day and, at the end, the best two solutions (evaluated based on a portion of the test set used to calculate the public score) will be selected. Competitors also have the option to select two solutions themselves based on their own judgment. Then, these two selected solutions will be evaluated on the reserved subset of test data to generate the private score. This will be the final score used to rank the competitors.

There are several types of competitions:

Featured competitions: The most important are the featured competitions. Currently, featured competitions might reunite several thousand teams, with tens or even hundreds of thousands of solutions submitted. Featured competitions are typically hosted by companies but also sometimes by research organizations or universities, and are usually aimed at solving a difficult problem related to a company or a research topic. The organizer turns to the large Kaggle community to bring their knowledge and skills, and the competitive aspect of the setup accelerates the development of a solution. Usually, a featured competition will also have a significant prize, which will be distributed according to the competition rules to the top competitors. Sometimes, the host will not include a prize but will offer a different incentive, such as recruiting the top competitors to work for them (with high-profile companies, this might be more interesting than a prize), vouchers for using cloud resources, or acceptance of the top solutions to be presented at high-profile conferences. Besides the Featured competitions, there are also Getting Started, Research, Community, Playground, Simulations, and Analytics competitions.
Getting Started competitions: These are aimed at mostly beginners and tackle easily approachable machine learning problems to help build basic skills. These competitions are restarted periodically and the leaderboard is reset. The most notable ones are Titanic – Machine Learning for Disaster, Digit Recognizer, House Prices – Advanced Regression Techniques, and Natural Language Processing with Disaster Tweets.
Research competitions: In Research competitions, the themes are related to finding the solution to a difficult scientific problem in various domains such as medicine, genetics, cell biology, and astronomy by applying a machine learning approach. Some of the most popular competitions in recent years were from this category and with the rising use of machine learning in many fields of fundamental and applied research, we can expect that this type of competition will be more and more frequent and popular.
Community competitions: These are created by Kagglers and are either open to the public or private competitions, where only those invited can take part. For example, you can host a Community competition as a school or university project, where students are invited to join and compete to get the best grades.
Kaggle offers the infrastructure, which makes it very simple for you to define and start a new competition. You have to provide the training and test data, but this can be as simple as two files in CSV format. Additionally, you need to add a submission sample file, which gives the expected format for submissions. Participants in the competition have to replace the prediction in this file with their own prediction, save the file, and then submit it. Then, you have to choose a metric to assess the performance of a machine learning model (no need to define one, as you have a large collection of predefined metrics). At the same time, as the host, you will be required to upload a file with the correct, expected solution to the competition challenge, which will serve as reference against which all competitors’ submissions will be checked. Once this is done, you just need to edit the terms and conditions, choose a start and end date for the competition, write the data description and objectives, and you are good to go. Other options that you can choose from are whether participants can team up or not, and whether joining the competition is open to everybody or just to people who receive the competition link.

Playground competitions: Around three years ago, a new section of competitions was launched: Playground competitions. These are generally simple competitions, like the Getting Started ones, but will have a shorter lifespan (it was initially one month, but currently it is from one to four weeks). These competitions will be of low or medium difficulty and will help participants gain new skills. Such competitions are highly recommended to beginners but also to competitors with more experience who want to refine their skills in a certain domain.
Simulation competitions: If the previous types are all supervised machine learning competitions, Simulations competitions are, in general, optimization competitions. The most well known are those around Christmas and New Year (Santa competitions) and also the Lux AI Challenge, which is currently in the third season. Some of the Simulation competitions are also recurrent and will qualify for an additional category: Annual competitions. Examples of such competitions that are of both the Simulations type and Annual are the Santa competitions.
Analytics competitions: These are different in both the objective and the modality of scoring the solutions. The objective is to perform a detailed analysis of the competition dataset to get insights from the data. The score is based, in general, on the judgment of the organizers and, in some cases, on the popularity of the solutions that compete; in this case, the organizers will grant parts of the prizes to the most popular notebooks, based on the upvotes of Kagglers. In Chapter 5, we will analyze the data from one of the first Analytics competitions and also provide some insights into how to approach this type of competition.

For a long time, competitions required participants to prepare a submission file with the predictions for the test set. No other constraints were imposed on the method to prepare the submissions; the competitors were supposed to use their own computing resources to train models, validate them, and prepare the submission. Initially, there were no available resources on the platform to prepare a submission. After Kaggle started to provide computational resources, where you could prepare your model using Kaggle Kernels (later named Notebooks and now Code), you could submit directly from the platform, but there was no limitation imposed on this. Typically, the submission file will be evaluated on the fly and the result will be displayed almost instantly. The result (i.e., the score according to the competition metric) will be calculated only for a percentage of the test set. This percentage is announced at the start of the competition and is fixed. Also, the subset of test data used during the competition to calculate the displayed score (the public score) is fixed. After the end of the competition, the final score is calculated with the rest of the test data, and this final score (also known as the private score) is the final score for each competitor. The percentage of the test data used during the competition to evaluate the solution and provide the public score could be anything from a few percent to more than 50%. In most competitions, it tends to be less than 50%.

The reason Kaggle uses this approach is to prevent one unwanted phenomenon. Rather than improving their models for enhanced generalization, competitors might be inclined to optimize their solution to predict the test set as perfectly as possible, without considering the cross-validation score on their train data. In other words, the competitors might be inclined to overfit their solution on the test set. By splitting this data and only providing the score for a part of the test set – the public score – the organizers intend to prevent this.

With more and more complex competitions (sometimes with very large train and test sets), some participants with greater computational resources might gain an advantage, while others with limited resources may struggle to develop advanced models. Especially in featured competitions, the goal is often to create robust, production-compatible solutions. However, without setting restrictions on how solutions are obtained, achieving this goal may be difficult, especially if solutions with unrealistic resource use become prevalent. To limit the negative unwanted consequences of the “arms race” for better and better solutions, a few years ago, Kaggle introduced Code competitions. This kind of competition requires that all solutions be submitted from a running notebook on the Kaggle platform. In this way, the infrastructure to run the solution became fully controllable by Kaggle.

Also, not only are the computing resources limited in such competitions but there are also additional constraints: the duration of the run and internet access (to prevent the use of additional computing power through the use of external APIs or other remote computing resources).

Kagglers discovered quite fast that this was a limitation just for the inference part of the solution and an adaptation appeared: competitors started to train offline, large models that would not fit within the limits of computing power and time of run imposed by the Code competitions. Then, they uploaded the offline trained models (sometimes using very large computational resources) as datasets and loaded these models in the inference code that observed the limits for memory and computation time for the Code competitions.

In some cases, multiple models trained offline were loaded as datasets and inference combined these multiple models to create more precise solutions. Over time, Code competitions have become more refined. Some of them will only expose a few rows from the test set and not reveal the size of the real test set used for the public or future private test set. Therefore, Kagglers have to resort to clever probing techniques to estimate the limitations that might be incurred while running the final, private test set, to avoid a case where their code will fail due to surpassing memory or runtime limits.

Currently, there are also Code competitions that, after the active part of the competition (i.e., when competitors are allowed to continue to refine their solutions) ends, will not publish the private score, but will rerun the code with several new sets of test data, and reevaluate the setwo selected solutions against these new datasets, which have never been seen before. Some of these competitions are about the stock market, cryptocurrency valuation, or credit performance predictions and they use real data. The evolution of Code competitions ran in parallel with the evolution of available computational resources on the platform, to provide users with the required computational power.

Some of the competitions (most notably the Featured competitions and the Research competitions) grant ranking points and medals to the participants. Ranking points are used to calculate the relative position of Kagglers in the general leaderboard of the platform. The formula to calculate the ranking points awarded for a competition hasn’t changed since May 2015:

Figure 1.2: Formula for calculating ranking points

The number of points decreases with the square root of the number of teammates in the current competition team. More points are awarded for competitions with a larger number of teams. The number of points will also decrease over time, to keep the ranking up to date and competitive.

Medals are counted to get a promotion in the Kaggle progression system for competitions. Medals for competitions are obtained based on the position at the top of the competition leaderboard. The actual system is a bit more complicated but, generally, the top 10% will get a bronze medal, the top 5% will get a silver medal, and the top 1% will get a gold medal. The actual number of medals granted will be larger with an increased number of participants, but this is the basic principle.

With two bronze medals, you reach the Competition Expert tier. With two silver medals and one gold medal, you reach the Competition Master tier. And with one Solo gold medal (i.e., you obtained this medal without teaming up with others) and a total of five gold medals, you reach the most valuable Kaggle tier: the Competition Grandmaster. Currently, at the time of preparing this book, among the over 12 million users on Kaggle, there are 280 Kaggle Competition Grandmasters and 1,936 Masters.

The ranking system adds points depending on the position of users in the leaderboard, which grants ranking points. The points are not permanent, and, as we can see from Figure 1.2, there is a quite complex formula for points decreasing. If you do not continue to compete and get new points, your points will decrease quite fast and the only thing that will remind you of your past glory is the maximum rank you reached in the past. However, once you achieve a medal, you will always have that medal in your profile, even if your ranking position changes or your points decrease over time.

Kaggle Datasets

Kaggle Datasets were added only a few years back. Currently, there are more than 200,000 datasets available on the platform, contributed by the users. There were, of course, datasets in the past, associated with the competitions. With the new Datasets section, Kagglers can get medals and ranking based on the recognition of other users on the platform, in the form of upvotes for datasets contributed.

Everybody can contribute datasets and the process to add a dataset is quite simple. You first need to identify an interesting subject and a data source. This can be an external dataset that you are mirroring on Kaggle, provided that the right license is in place, or the data is collected by yourself. Datasets can also be authored collectively. There will be a main author, the one that initiates the dataset, but they can add other contributors with view or edit roles. There are a few compulsory steps to define a dataset on Kaggle.

First, you will have to upload one or multiple files and give a name to the dataset. Alternatively, you can set the dataset to be provided from a public link, which should point to a file or a public repository on GitHub. Another way to provision a dataset is from a Kaggle Notebook; in this case, the output of the notebook will be the content of the dataset. The dataset can also be created from a Google Cloud Storage resource. Before creating a dataset, you have the option to set it as public, and you can also check your current private quota. Each Kaggler has a limited private quota (which has been increasing slightly over time; currently, it is over 100 GB). If you decide to keep the dataset private, you will have to fit all your private datasets in this quota. If a dataset is kept private, you can decide at any time to delete it if you do not need it anymore. After the dataset is initialized, you can start improving it by adding additional information.

When creating a dataset, you have the option to add a subtitle, a description (with a minimum number of characters required), and information about each file in the dataset. For tabular datasets, you can also add titles and explanations for each column. Then, you can add tags to make the dataset easier to find through searching and clearly specify the topic, data type, and possible business or research domains, for those interested. You can also change the image associated with the dataset. It is advisable to use a public domain or personal picture. Adding metadata about authors, generating DOI (Digital Object Identifier) citations, and specifying provenance and expected update frequency are all helpful in boosting the visibility of your dataset. It will also improve the likelihood that your contribution will be correctly cited and used in other works. License information is also important, and you can select from a large list of frequently used licenses. With each element added in the description and metadata about the contributed dataset, you also increase the usability score, calculated automatically by Kaggle. It is not always possible to reach a 10/10 usability score (especially when you have a dataset with tens of thousands of files) but it is always preferable to try to improve the information associated with the dataset.

Once you publish your dataset, this will become visible in the Datasets section of the platform, and, depending on the usability and the quality perceived by the content moderators from Kaggle, you might get a special status of Featured dataset. Featured datasets get more visibility in searches and are included in the top section of recommended datasets when you select the Datasets section. Besides the Featured datasets, presented under a Trending datasets lane, you will see lanes with themes like Sport, Health, Software, Food, and Travel, as well as Recently Viewed Datasets.

The datasets can include all kinds of file formats. The most frequently used format is CSV. It is a very popular format outside Kaggle too and it is the best format choice for tabular data. When a file is in CSV format, Kaggle will display it, and you can choose to see the content in detail, by columns, or in a compact form. Other possible data formats used are JSON, SQLite, and archives. Although a ZIP archive is not a data format per se, it has full support on Kaggle and you can directly read the content of the archive, without unpacking it. Datasets also include modality-specific formats, various image formats (JPEG, PNG, and so on), audio signals formats (WAV, OGG, and MP3), and video formats. Domain-specific formats, like DICOM for medical imaging, are widely used. BigQuery, a dataset format specific to Google Cloud, is also used for datasets on Kaggle, and there is full support for accessing the content.

If you contribute to datasets, you can get ranking points and medals as well. The system is based on upvotes by other users, upvotes from yourself or from Novice Kagglers, or old upvotes not being included in the calculation for granting ranking points or medals. You can get to the Datasets Expert tier if you acquire three bronze medals, to Master if you get one gold medal and four silver medals, and to Datasets Grandmaster with five gold medals and five silver medals. Acquiring medals in Datasets is not easy, since upvotes in Datasets are not easily granted by users, and you will need 5 upvotes to get a bronze medal, 20 upvotes for a silver medal, and 50 upvotes for a gold medal. Once you get the medals, as these are based on votes, you can lose your medals over time, and even your status as Expert, Master, or Grandmaster can be lost if the users that upvoted you remove their upvote or if they are banned from the platform. This happens sometimes, and not so infrequently as you might think. So, if you want to secure your position, the best approach is to always create high-quality content; this will bring you more upvotes and medals than the minimum required.

Kaggle Code

Kaggle Code is one of the most active sections on the platform. Older names for Code are Kernels and Notebooks and you will frequently hear them used interchangeably. The number of current contributors, at the time of writing this book, exceeds 260,000 and is surpassed by only the Discussions section.

Code is used for the analysis of datasets or competition datasets, for preparing models for competition submissions, and for generating models and datasets. In the past, Code could use either R, Python, or Julia as programming languages; currently, you can only choose between Python (the default option) and R. You can set your editor as Script or Notebook. You can choose the computing resource to run your code, with CPU being the default.

Alternatively, you can choose between four options of accelerators if using Python as a programming language or two if using R. Accelerators are provided free of charge, but there is a quota, reset weekly. For high-demand accelerator resources, there might also be a waiting list.

Code is under source control and, when editing, you can choose to just save (and create a version) or save and run (and you create a code version and a run version). You can attach to Code datasets, Competitions datasets, and external utility scripts and models. As long as you are not rerunning the notebook, changes made in the resources used will not affect its visibility. If you try to rerun the code and refresh the datasets or utility script versions, you might need to account for changes in those data and code versions. The output of code can be used as input to other code, in the same way as you include datasets and models. By default, your code is private, and you do not need to make it public to submit the output to a competition.

If you make your code public, you can get upvotes, and these count for both the ranking in the Notebooks category as well as for getting medals. You need 5 bronze medals for the Expert tier in Notebooks, 10 silver medals for the Master tier, and 15 gold medals for the Grandmaster tier. One bronze medal needs 5 upvotes, a silver medal needs 20 upvotes, and a gold medal requires 50 upvotes. Upvotes in Notebooks can be revoked, and you can also make your public notebooks private again (or delete them). In such a case, all upvotes and medals associated with that Notebook are no longer counted for your ranking or performance tier. There are Code sections associated with Competitions, Datasets, and Models. At the time of writing this book, there were 125 Notebook Grandmasters and 472 Masters.

Kaggle grows and changes continuously, both as a data and competitive machine learning platform and as a community. At the time of writing this book, starting with the new 2023 Kaggle AI Report, Kaggle introduced a review system for Notebook competitions where all participants submitting an essay are also asked to provide a review for another three participants’ essays. The final decision about which submission will win the competition is taken by a panel of experts from veteran Kaggle Grandmasters.

Kaggle Code’s many features and options will be described in the next chapter in a more detailed manner.

Kaggle Learn

Kaggle Learn is one of the lesser-known gems on Kaggle. It contains compact learning modules, each centered on a certain subject related to data science or machine learning. Each learning module has several lessons, each one with a Tutorial section followed by an Exercise section. The Tutorial and Exercise sections are available in the form of interactive Kaggle Notebooks. To complete a learning module, you need to go through all the lessons. In each lesson, you will need to review the training material and successfully run the Exercise Notebook. Some of the cells in the Exercise Notebook have a verification associated with them. If you need help, there are also special cells in the notebook that reveal hints about how to solve the current exercise. Upon completing the entire learning module, you receive a certificate of completion from Kaggle.

Currently, Kaggle Learn is organized into three main sections:

Your Courses, where you have the courses that you have completed and those that are now in progress (active).
Open courses that you can explore further. The courses in this main section are from absolute beginner courses (such as Intro to Programming, Python, Pandas, Intro to SQL, and Intro to Machine Learning) to intermediate courses (such as Data Cleaning, Intermediate Machine Learning, Feature Engineering, and Advanced SQL). Also, it contains topic-specific courses like Visualization, Geospatial Analysis, Computer Vision, Time Series, and Intro to Game AI and Reinforcement Learning. Some courses touch on extremely interesting topics such as AI ethics and machine learning interpretability.
Guides, which is dedicated to various learning guides for programs, frameworks, or domains of interest. This includes the JAX Guide, TensorFlow Guide, Transfer Learning for Computer Vision Guide, Kaggle Competitions Guide, Natural Language Processing Guide, and R Guide.

Kaggle is also committed to supporting continuous learning and helping anyone benefit from the knowledge accumulated on the Kaggle platform and the Kaggle community. In the last two years, Kaggle has started to reach out and help professionals from underrepresented communities acquire skills and experience in data science and machine learning in the form of the KaggleX BIPOC (Black, Indigenous, and People of Color) Grant program, by pairing Kagglers, as mentors, with professionals from BIPOC communities, as mentees.

In the next section, we will familiarize ourselves with a rapidly evolving capability of the Kaggle platform: Models.

Kaggle Models

Models is the newest section introduced on the platform; at the time of writing this book, it is less than one month old. Models started to be contributed quite often by users in several ways and for a few purposes. Most frequently, models were saved as output of Notebooks (Code) after being trained using custom code, often in the context of a competition. Subsequently, these models can be optionally included in a dataset or used directly in code. Also, sometimes, models built outside the platform were uploaded as datasets and then included in the pipeline of users to prepare a solution for a competition. Meantime, model repositories were available either through a public cloud, like Google Cloud, AWS, or Azure, or from a company specialized in such a service, like Hugging Face.

With the concept of downloadable models ready to be used or easy to fine-tune for a custom task, Kaggle chose to include Models in this platform. Currently, you can search in several categories: Text Classification, Image Feature Vector, Object Detection, and Image Segmentation. Alternatively, you can use the Model Finder feature to explore models specialized in a certain modality: Image, Text, Audio, Multimodal, or Video. When searching the Models library, you can apply filters on Task, Data Type, Framework, Language, License, and Size, as well as functional criteria, like Fine Tuneable.

There are no ranking points or performance tiers related to models yet. Models can be upvoted and there is a Code and Discussions section associated with each model. In the future, it is possible that we will see evolution here as well and have models with ranking points as well as performance tiers if they make it possible to contribute models and get recognition for this. Currently, models are contributed by Google only.

We might see the Models feature evolving immensely in the near future, providing the community with a flexible and powerful tool for the creation of modular and scalable solutions to train and add inference to machine learning pipelines on the Kaggle platform.