Text Analysis Is All You Need

In this chapter, we will learn how to analyze text data and build machine learning models from it. We will use the Jigsaw Unintended Bias in Toxicity Classification dataset (see Reference 1). The objective of this competition was to build models that detect toxicity while reducing unintended bias toward minorities that might be wrongly associated with toxic comments. With this competition, we introduce the field of Natural Language Processing (NLP).

The data used in the competition originates from the Civil Comments platform, which was founded by Aja Bogdanoff and Christa Mrgan in 2015 (see Reference 2) with the aim of solving the problem of civility in online discussions. When the platform was shut down in 2017, its founders chose to preserve around 2 million comments for researchers who want to understand and improve civility in online conversations. Jigsaw was the organization that sponsored this effort and then launched a competition for language toxicity classification...

What is in the data?

The Jigsaw Unintended Bias in Toxicity Classification dataset contains 1.8 million rows in the training set and 97,300 rows in the test set. The test data contains only a comment column and does not contain a target (the value to predict) column. The training data contains, besides the comment column, another 43 columns, including the target feature. The target is a number between 0 and 1 that represents the degree of toxicity of a comment (0 means no toxicity and 1 means maximum toxicity) and is the value we are asked to predict in this competition. The other 42 columns are flags related to the presence of certain sensitive topics in the comments. The topics fall into five categories: race and ethnicity, gender, sexual orientation, religion, and disability. In more detail, these are the flags for each of the five categories (a short sketch showing how to load and inspect this data follows the list):

  • Race and ethnicity: asian, black, jewish, latino...
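To get a first feel for this structure, we can load the competition files and inspect the target. The following is a minimal sketch; the directory path follows the usual Kaggle input layout for this competition and the column names are assumptions based on the description above, so adjust them to match the files in your notebook environment.

    import pandas as pd

    # Assumed Kaggle input path for this competition; adjust if needed
    DATA_DIR = "../input/jigsaw-unintended-bias-in-toxicity-classification"

    train_df = pd.read_csv(f"{DATA_DIR}/train.csv")
    test_df = pd.read_csv(f"{DATA_DIR}/test.csv")

    print(train_df.shape, test_df.shape)   # ~1.8 million train rows, ~97,300 test rows
    print(train_df["target"].describe())   # continuous toxicity score in [0, 1]

    # Share of comments considered toxic at the usual 0.5 threshold
    print((train_df["target"] >= 0.5).mean())

    # A few of the identity flag columns from the race and ethnicity category
    print(train_df[["asian", "black", "jewish", "latino"]].notna().mean())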

Analyzing the comments text

NLP is a field of AI that uses computational techniques to enable computers to understand, interpret, transform, and even generate human language. NLP relies on a range of techniques, algorithms, and models to process and analyze large collections of text. Among these techniques (a few of them are illustrated in the short sketch after this list), we can mention:

  • Tokenization: Breaks down text into smaller units, like words, parts of words, or characters
  • Lemmatization or stemming: Reduces words to their dictionary form (lemma) or strips the last few characters to reach a common root form (stem)
  • Part-of-Speech (POS) tagging: Assigns a grammatical category (for example, nouns, verbs, proper nouns, and adjectives) to each word in a sequence
  • Named Entity Recognition (NER): Identifies and classifies entities (for example, names of people, organizations, and places)
  • Word embeddings: Use a high-dimensional space to represent the words, a space in which the position of each word is determined by its relationship...
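As a quick illustration of a few of these techniques, the sketch below runs tokenization, lemmatization, POS tagging, and NER with spaCy; it assumes spaCy and its small English model (en_core_web_sm) are available in the environment.

    import spacy

    # Assumes the small English model is installed in the environment
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Jigsaw released the Civil Comments data for toxicity research in 2017.")

    # Tokenization, lemmatization, and POS tagging
    for token in doc:
        print(token.text, token.lemma_, token.pos_)

    # Named Entity Recognition
    for ent in doc.ents:
        print(ent.text, ent.label_)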

Preparing the model

Model preparation, depending on the method we implement, can be more or less complex. In our case, we opt to start the first baseline model with a simple deep learning architecture: a word embeddings layer (using pretrained word embeddings) followed by one or more bidirectional LSTM layers. This architecture was the standard approach at the time the competition took place, and it is still a good option as a baseline for a text classification problem. LSTM stands for Long Short-Term Memory; it is a type of recurrent neural network designed to capture and remember long-term dependencies in sequential data, which makes it particularly effective for text classification.
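A minimal sketch of such an architecture, written with Keras, is shown below; the vocabulary size, sequence length, embedding dimension, and layer sizes are illustrative assumptions rather than the exact values we will use later.

    import numpy as np
    from tensorflow.keras import layers, models

    # Illustrative hyperparameters (assumptions, not the chapter's exact values)
    MAX_WORDS = 100_000   # vocabulary size
    MAX_LEN = 220         # padded comment length
    EMBED_DIM = 300       # size of the pretrained word vectors

    # In practice, this matrix is filled from pretrained vectors (e.g., GloVe or FastText)
    embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM))

    inputs = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(MAX_WORDS, EMBED_DIM,
                         weights=[embedding_matrix],
                         trainable=False)(inputs)        # frozen pretrained embeddings
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # toxicity score in [0, 1]

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()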

For this, we will need to perform some comment data preprocessing (we also performed preprocessing when...

Building a baseline model

These days, most practitioners would build a baseline model by at least fine-tuning a Transformer architecture. Since the 2017 paper Attention Is All You Need (Reference 14), the performance of these solutions has continuously improved, and for competitions like Jigsaw Unintended Bias in Toxicity Classification, a recent Transformer-based solution will probably take you easily into the gold zone.

In this exercise, we will start with a more classical baseline. The core of this solution is based on contributions from Christof Henkel (Kaggle nickname: Dieter), Ane Berasategi (Kaggle nickname: Ane), Andrew Lukyanenko (Kaggle nickname: Artgor), and the Kaggle users Thousandvoices and Tanrei; see References 12, 13, 15, 16, 17, and 18.

The solution includes four steps. In the first step, we load the train and test data as pandas DataFrames and then perform preprocessing on the two datasets. The preprocessing is largely based on the preprocessing steps...
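As a hedged sketch of what this first step might look like, the snippet below loads both files and applies a simple cleaning pass; the cleaning rules here (stripping a set of special characters and normalizing whitespace) are illustrative placeholders, not the exact steps used in the referenced kernels.

    import re
    import pandas as pd

    DATA_DIR = "../input/jigsaw-unintended-bias-in-toxicity-classification"
    train_df = pd.read_csv(f"{DATA_DIR}/train.csv")
    test_df = pd.read_csv(f"{DATA_DIR}/test.csv")

    # Illustrative set of special characters; real solutions map many more symbols
    PUNCT_PATTERN = re.compile(r'["#$%&()*+\-/:;<=>@\[\]^_`{|}~]')

    def preprocess(text: str) -> str:
        text = PUNCT_PATTERN.sub(" ", text)   # drop special characters
        text = re.sub(r"\s+", " ", text)      # normalize whitespace
        return text.strip()

    for df in (train_df, test_df):
        df["comment_text"] = df["comment_text"].astype(str).apply(preprocess)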

Transformer-based solution

At the time of the competition, BERT and some other Transformer models were already available, and a few high-scoring solutions based on them were shared. Here, we will not attempt to replicate them; instead, we will point out the most accessible implementations.

In Reference 20, Qishen Ha combines a few solutions, including BERT-Small V2, BERT-Large V2, XLNet, and GPT-2 (models fine-tuned on the competition data and included as Kaggle datasets), to obtain a 0.94656 private leaderboard score (late submission), which would put you in the top 10 (both the gold medal and prize area for this competition).
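The combination itself is typically a simple blend of the individual models' predictions. The sketch below shows a hypothetical weighted average of per-model submission files; the file names and weights are placeholders, not the actual values used in Reference 20.

    import pandas as pd

    # Hypothetical per-model submission files and blend weights (placeholders)
    files_and_weights = [
        ("submission_bert_large.csv", 0.4),
        ("submission_bert_small.csv", 0.2),
        ("submission_xlnet.csv", 0.2),
        ("submission_gpt2.csv", 0.2),
    ]

    # The competition submission format has an id column and a prediction column
    blend = pd.read_csv(files_and_weights[0][0])[["id"]].copy()
    blend["prediction"] = 0.0
    for path, weight in files_and_weights:
        blend["prediction"] += weight * pd.read_csv(path)["prediction"]

    blend.to_csv("submission.csv", index=False)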

A solution with only the BERT-Small model (see Reference 21) yields a private leaderboard score of 0.94295, while using the BERT-Large model (see Reference 22) results in a private leaderboard score of 0.94388. Both these solutions fall in the silver medal zone (around places 130 and 80, respectively, on the private leaderboard, as late submissions).
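For readers who want to experiment with this route, the sketch below shows a minimal scoring pass using the Hugging Face transformers library; the checkpoint name is a generic placeholder, while the referenced solutions load their own weights fine-tuned on the competition data.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Generic placeholder checkpoint; the referenced notebooks use fine-tuned weights
    MODEL_NAME = "bert-base-uncased"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
    model.eval()

    comments = [
        "You are a wonderful person.",
        "This is a terrible, hateful comment.",
    ]
    batch = tokenizer(comments, padding=True, truncation=True,
                      max_length=220, return_tensors="pt")

    with torch.no_grad():
        logits = model(**batch).logits              # shape: (batch_size, 1)
        scores = torch.sigmoid(logits).squeeze(-1)  # toxicity scores in [0, 1]

    print(scores.tolist())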

Summary

In this chapter, we learned how to work with text data, using various approaches to explore this type of data. We started by analyzing our target and text data and preprocessing the text so it could be included in a machine learning model. We also explored various NLP tools and techniques, including topic modeling, NER, and POS tagging, and then prepared the text to build a baseline model, going through an iterative process to gradually improve the data quality for the objective set (in this case, improving the coverage of the word embeddings over the vocabulary of the competition's text corpus).

We introduced and discussed a baseline model (based on the work of several Kaggle contributors). This baseline model architecture includes a word embedding layer and bidirectional LSTM layers. Finally, we looked at some of the most advanced solutions available, based on Transformer architectures, either as single models or combined, to get a late submission...
