Text Analysis Is All You Need

In this chapter, we will learn how to analyze text data and build machine learning models from it. We will use the Jigsaw Unintended Bias in Toxicity Classification dataset (see Reference 1). The objective of this competition was to build models that detect toxicity while reducing unintended bias toward minorities that might be wrongly associated with toxic comments. With this competition, we introduce the field of Natural Language Processing (NLP).

The data used in the competition originates from the Civil Comments platform, which was founded by Aja Bogdanoff and Christa Mrgan in 2015 (see Reference 2) with the aim of solving the problem of civility in online discussions. When the platform was shut down in 2017, its founders chose to preserve around 2 million comments for researchers who want to understand and improve civility in online conversations. Jigsaw was the organization that sponsored this effort and then launched a competition for language toxicity classification...

What is in the data?

The Jigsaw Unintended Bias in Toxicity Classification dataset contains 1.8 million rows in the training set and 97,300 rows in the test set. The test data contains only a comment column and does not contain a target (the value to predict) column. The training data contains, besides the comment column, another 43 columns, including the target feature. The target is a number between 0 and 1 that represents the degree of toxicity of a comment (0 means no toxicity and 1 means maximum toxicity) and is the value we are asked to predict in this competition. The other 42 columns are flags related to the presence of certain sensitive topics in the comments. The topics fall into five categories: race and ethnicity, gender, sexual orientation, religion, and disability. In more detail, these are the flags for each of the five categories (a short sketch showing how to load and inspect this data follows the list):

  • Race and ethnicity: asian, black, jewish, latino...
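To get a first feel for this structure, we can load the competition files and inspect the target. The following is a minimal sketch; the directory path follows the usual Kaggle input layout for this competition and the column names are assumptions based on the description above, so adjust them to match the files in your notebook environment.

    import pandas as pd

    # Assumed Kaggle input path for this competition; adjust if needed
    DATA_DIR = "../input/jigsaw-unintended-bias-in-toxicity-classification"

    train_df = pd.read_csv(f"{DATA_DIR}/train.csv")
    test_df = pd.read_csv(f"{DATA_DIR}/test.csv")

    print(train_df.shape, test_df.shape)   # ~1.8 million train rows, ~97,300 test rows
    print(train_df["target"].describe())   # continuous toxicity score in [0, 1]

    # Share of comments considered toxic at the usual 0.5 threshold
    print((train_df["target"] >= 0.5).mean())

    # A few of the identity flag columns from the race and ethnicity category
    print(train_df[["asian", "black", "jewish", "latino"]].notna().mean())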

Analyzing the comments text

NLP is a field of AI that uses computational techniques to enable computers to understand, interpret, transform, and even generate human language. NLP relies on a range of techniques, algorithms, and models to process and analyze large collections of text. Among these techniques (a few of them are illustrated in the short sketch after this list), we can mention:

  • Tokenization: Breaks down text into smaller units, like words, parts of words, or characters
  • Lemmatization or stemming: Reduces words to their dictionary form (lemma) or strips the last few characters to reach a common root form (stem)
  • Part-of-Speech (POS) tagging: Assigns a grammatical category (for example, nouns, verbs, proper nouns, and adjectives) to each word in a sequence
  • Named Entity Recognition (NER): Identifies and classifies entities (for example, names of people, organizations, and places)
  • Word embeddings: Use a high-dimensional space to represent the words, a space in which the position of each word is determined by its relationship...
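As a quick illustration of a few of these techniques, the sketch below runs tokenization, lemmatization, POS tagging, and NER with spaCy; it assumes spaCy and its small English model (en_core_web_sm) are available in the environment.

    import spacy

    # Assumes the small English model is installed in the environment
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Jigsaw released the Civil Comments data for toxicity research in 2017.")

    # Tokenization, lemmatization, and POS tagging
    for token in doc:
        print(token.text, token.lemma_, token.pos_)

    # Named Entity Recognition
    for ent in doc.ents:
        print(ent.text, ent.label_)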

Preparing the model

Model preparation, depending on the method we implement, can be more or less complex. In our case, we opt to start the first baseline model with a simple deep learning architecture: a word embeddings layer (using pretrained word embeddings) followed by one or more bidirectional LSTM layers. This architecture was the standard approach at the time the competition took place, and it is still a good option as a baseline for a text classification problem. LSTM stands for Long Short-Term Memory; it is a type of recurrent neural network designed to capture and remember long-term dependencies in sequential data, which makes it particularly effective for text classification.
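A minimal sketch of such an architecture, written with Keras, is shown below; the vocabulary size, sequence length, embedding dimension, and layer sizes are illustrative assumptions rather than the exact values we will use later.

    import numpy as np
    from tensorflow.keras import layers, models

    # Illustrative hyperparameters (assumptions, not the chapter's exact values)
    MAX_WORDS = 100_000   # vocabulary size
    MAX_LEN = 220         # padded comment length
    EMBED_DIM = 300       # size of the pretrained word vectors

    # In practice, this matrix is filled from pretrained vectors (e.g., GloVe or FastText)
    embedding_matrix = np.zeros((MAX_WORDS, EMBED_DIM))

    inputs = layers.Input(shape=(MAX_LEN,))
    x = layers.Embedding(MAX_WORDS, EMBED_DIM,
                         weights=[embedding_matrix],
                         trainable=False)(inputs)        # frozen pretrained embeddings
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128))(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # toxicity score in [0, 1]

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.summary()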

For this, we will need to perform some comment data preprocessing (we also performed preprocessing when...

Building a baseline model

These days, most practitioners would build a baseline model by at least fine-tuning a Transformer architecture. Since the 2017 paper Attention Is All You Need (Reference 14), the performance of these solutions has continuously improved, and for competitions like Jigsaw Unintended Bias in Toxicity Classification, a recent Transformer-based solution will probably take you easily into the gold zone.

In this exercise, we will start with a more classical baseline. The core of this solution is based on contributions from Christof Henkel (Kaggle nickname: Dieter), Ane Berasategi (Kaggle nickname: Ane), Andrew Lukyanenko (Kaggle nickname: Artgor), and the Kaggle users Thousandvoices and Tanrei; see References 12, 13, 15, 16, 17, and 18.

The solution includes four steps. In the first step, we load the train and test data as pandas DataFrames and then perform preprocessing on the two datasets. The preprocessing is largely based on the preprocessing steps...
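As a hedged sketch of what this first step might look like, the snippet below loads both files and applies a simple cleaning pass; the cleaning rules here (stripping a set of special characters and normalizing whitespace) are illustrative placeholders, not the exact steps used in the referenced kernels.

    import re
    import pandas as pd

    DATA_DIR = "../input/jigsaw-unintended-bias-in-toxicity-classification"
    train_df = pd.read_csv(f"{DATA_DIR}/train.csv")
    test_df = pd.read_csv(f"{DATA_DIR}/test.csv")

    # Illustrative set of special characters; real solutions map many more symbols
    PUNCT_PATTERN = re.compile(r'["#$%&()*+\-/:;<=>@\[\]^_`{|}~]')

    def preprocess(text: str) -> str:
        text = PUNCT_PATTERN.sub(" ", text)   # drop special characters
        text = re.sub(r"\s+", " ", text)      # normalize whitespace
        return text.strip()

    for df in (train_df, test_df):
        df["comment_text"] = df["comment_text"].astype(str).apply(preprocess)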

Transformer-based solution

At the time of the competition, BERT and some other Transformer models were already available, and a few high-scoring solutions based on them were shared. Here, we will not attempt to replicate them; instead, we will point out the most accessible implementations.

In Reference 20, Qishen Ha combines a few solutions, including BERT-Small V2, BERT-Large V2, XLNet, and GPT-2 (models fine-tuned on the competition data and included as Kaggle datasets), to obtain a 0.94656 private leaderboard score (late submission), which would put you in the top 10 (both the gold medal and prize area for this competition).
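The combination itself is typically a simple blend of the individual models' predictions. The sketch below shows a hypothetical weighted average of per-model submission files; the file names and weights are placeholders, not the actual values used in Reference 20.

    import pandas as pd

    # Hypothetical per-model submission files and blend weights (placeholders)
    files_and_weights = [
        ("submission_bert_large.csv", 0.4),
        ("submission_bert_small.csv", 0.2),
        ("submission_xlnet.csv", 0.2),
        ("submission_gpt2.csv", 0.2),
    ]

    # The competition submission format has an id column and a prediction column
    blend = pd.read_csv(files_and_weights[0][0])[["id"]].copy()
    blend["prediction"] = 0.0
    for path, weight in files_and_weights:
        blend["prediction"] += weight * pd.read_csv(path)["prediction"]

    blend.to_csv("submission.csv", index=False)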

A solution with only the BERT-Small model (see Reference 21) yields a private leaderboard score of 0.94295, while using the BERT-Large model (see Reference 22) results in a private leaderboard score of 0.94388. Both these solutions fall in the silver medal zone (around places 130 and 80, respectively, on the private leaderboard, as late submissions).
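For readers who want to experiment with this route, the sketch below shows a minimal scoring pass using the Hugging Face transformers library; the checkpoint name is a generic placeholder, while the referenced solutions load their own weights fine-tuned on the competition data.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Generic placeholder checkpoint; the referenced notebooks use fine-tuned weights
    MODEL_NAME = "bert-base-uncased"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
    model.eval()

    comments = [
        "You are a wonderful person.",
        "This is a terrible, hateful comment.",
    ]
    batch = tokenizer(comments, padding=True, truncation=True,
                      max_length=220, return_tensors="pt")

    with torch.no_grad():
        logits = model(**batch).logits              # shape: (batch_size, 1)
        scores = torch.sigmoid(logits).squeeze(-1)  # toxicity scores in [0, 1]

    print(scores.tolist())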

Summary

In this chapter, we learned how to work with text data, using various approaches to explore this type of data. We started by analyzing our target and text data and preprocessing the text so it could be included in a machine learning model. We also explored various NLP tools and techniques, including topic modeling, NER, and POS tagging, and then prepared the text to build a baseline model, going through an iterative process to gradually improve the data quality for the objective set (in this case, improving the coverage of the word embeddings over the vocabulary of the competition's text corpus).

We introduced and discussed a baseline model (based on the work of several Kaggle contributors). This baseline model architecture includes a word embedding layer and bidirectional LSTM layers. Finally, we looked at some of the most advanced solutions available, based on Transformer architectures, either as single models or combined, to get a late submission...
