Data Augmentation with Python

By Duc Haba
  1. Free Chapter
    Chapter 1: Data Augmentation Made Easy
About this book
Data is paramount in AI projects, especially for deep learning and generative AI, as forecasting accuracy relies on input datasets being robust. Acquiring additional data through traditional methods can be challenging, expensive, and impractical, and data augmentation offers an economical option to extend the dataset. The book teaches you over 20 geometric, photometric, and random erasing augmentation methods using seven real-world datasets for image classification and segmentation. You’ll also review eight image augmentation open source libraries, write object-oriented programming (OOP) wrapper functions in Python Notebooks, view color image augmentation effects, analyze safe levels and biases, as well as explore fun facts and take on fun challenges. As you advance, you’ll discover over 20 character and word techniques for text augmentation using two real-world datasets and excerpts from four classic books. The chapter on advanced text augmentation uses machine learning models, such as Transformer, Word2vec, BERT, and GPT-2, to extend the text dataset. The chapters on audio and tabular data include real-world data, open source libraries, custom plots, and Python Notebooks, along with fun facts and challenges. By the end of this book, you will be proficient in image, text, audio, and tabular data augmentation techniques.
Publication date:
April 2023
Publisher
Packt
Pages
394
ISBN
9781803246451

 

Data Augmentation Made Easy

Data augmentation is essential for developing a successful deep learning (DL) project. However, data scientists and developers often overlook this crucial step. It is no secret that you will spend the majority of your project time gathering, cleaning, and augmenting the dataset in a real-world DL project. Thus, learning how to expand the dataset without purchasing new data is essential. This book covers standard and advanced techniques for extending image, text, audio, and tabular datasets. Furthermore, you will learn about data biases and learn how to code on Jupyter Python Notebooks.

Chapter 1 will introduce various data augmentation concepts, set up the coding environment, and create the foundation class. Later chapters will explain various techniques in detail, including Python coding. The effective use of data augmentation has proven to be the deciding factor between success and failure in machine learning (ML). Many real-world ML projects stay in the conceptual phase because of insufficient data for training the ML model. Data augmentation is a cost-effective technique that can increase the size of the dataset, lower the training error rate, and produce a more accurate prediction and forecast.

Fun fact

The car gasoline analogy is helpful for students who first learn about data augmentation and artificial intelligence (AI). You can think of data for the AI engine as the gasoline and data augmentation as the additive, such as the Chevron Techron fuel cleaner, that makes your car engine run faster, smoother, and further without extra petrol.

In this chapter, we’ll define the data augmentation role and the limitations of extending data without changing its integrity. We’ll briefly discuss the different types of input data, such as image, text, audio, and tabular data, and the challenges in supplementing it. Finally, we’ll set up the system requirements and the programming style in the accompanying Python notebook.

I designed this book to be a hands-on journey. It will be most effective to read a chapter, run the code, re-read the part of the chapter that confused you, and jump back to hacking the code until you firmly understand the concept or technique that was presented.

You are encouraged to change or add new code to the Python notebook. The primary purpose of this book is interactive learning. So, if something goes wrong, download a fresh copy from the book's GitHub. The surest method to learn is to make mistakes and create something new.

Data augmentation is an iterative process. There is no fixed recipe. In other words, depending on the dataset, you select augmentation functions and adjust their parameters. A subject domain expert may provide insight into how much distortion is acceptable. By the end of this chapter, you will know the general rules for data augmentation, what type of input data can be augmented, the programming style, and how to set up a Python Notebook online or offline.

In particular, this chapter covers the following primary topics:

  • Data augmentation role
  • Data input types
  • Python Notebook
  • Programming styles

Let’s start with the data augmentation role.

 

Data augmentation role

Data is paramount in any AI project. This is especially true when using the artificial neural network (ANN) algorithm, also known as DL. The success or failure of a DL project is primarily due to the input data quality.

One primary reason for the significance of data augmentation is that it is now relatively easy to develop an AI for prediction and forecasting, and those models require robust data input. With the remarkable advancement in developing, training, and deploying DL projects, such as with the FastAI framework, you can create a world-class DL model in a handful of lines of Python code. Thus, expanding the dataset is an effective way to make your DL model more accurate than your competitors’.

The traditional method of acquiring additional data is difficult, expensive, and impractical. Sometimes, the only available option is to use data augmentation techniques to extend the dataset.

Fun fact

Data augmentation methods can increase the data’s size tenfold. For example, it is relatively challenging to acquire additional skin cancer images. Thus, using a random combination of image transformations, such as vertical flip, horizontal flip, rotating, and skewing, is a practical technique that can expand the skin cancer photo data.

Without data augmentation, sourcing new skin cancer photos and labeling them is expensive and time-consuming. The International Skin Imaging Collaboration (ISIC) is the authoritative data source for skin diseases, where a team of dermatologists verified and classified the images. ISIC made the datasets available to the public to download for free. If you can’t find a particular dataset from ISIC, it is difficult to find other sources, as accessing hospital or university labs to acquire skin disease images is fraught with legal and logistical obstacles. After obtaining the photos, hiring a team of dermatologists to classify the pictures by disease would be costly.

Another example of the impracticality of acquiring additional images, rather than augmenting existing ones, is downloading photos from social media or online search engines. Social media is a rich source of image, text, audio, and video data. Search engines, such as Google or Bing, make it relatively easy to download additional data for a project, but copyrights and legal usage are a quagmire. Most images, texts, audio, and videos on social media, such as YouTube, Facebook, TikTok, and Twitter, are not clearly labeled as copyrighted or public domain material.

Furthermore, social media promotes popular content, not unpopular or obscure material. For example, let’s say you want to add more images of parrots to your parrot classification AI system. Online searches will return a lot of blue-and-yellow macaws, red-and-green macaws, or sulfur-crested cockatoos, but not as many Galah, Kea, or the mythical Norwegian-blue parrot – a fake parrot from the Monty Python comedy skit.

Insufficient data for AI training is exacerbated for text, audio, and tabular data types. Generally, obtaining additional text, audio, and tabular data is expensive and time-consuming. There are strong copyright laws protecting text data. Audio files are less common online, and tabular data is primarily from private company databases.

The following section will define the four commonly used data types.

 

Data input types

The four data input types are self-explanatory, but it is worth clearly defining the data input types and what is out of scope:

  • Image definition
  • Text definition
  • Audio definition
  • Tabular data definition

Figure 1.1 – Image, text, tabular, and audio augmentation

Figure 1.1 provides a sneak peek at image, text, tabular and audio augmentation. Later in this book, you will learn how to implement augmentation methods.

Let’s get started with images.

Image definition

Image is a large category because you can represent almost anything as an image, such as people, landscapes, animals, plants, and various objects around us. Pictures can also represent action, such as sports, sign language, yoga poses, and many more. One particularly creative use of images is capturing a computer mouse’s movement over time to predict whether a user is a computer hacker or not.

The techniques for increasing the number of pictures are horizontal flip, vertical flip, enlarge, zoom in, zoom out, skew, warp, and lighting. Humans are experts at processing images. Thus, if a picture is slightly distorted or darkened, you can still tell that it is the same image. However, this is not the same for a computer. AI represents a color picture as a three-dimensional array of float numbers – the width, height, and RGB as depth. Any image distortion will yield an array with different values.
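The array representation can be made concrete with a minimal sketch, assuming NumPy is installed. The tiny 2x3 image and its pixel values are invented for illustration, and NumPy conventionally orders the axes as height, width, and channels. It shows that a flip or a darkening that a person would barely notice produces an array with different values:

```python
import numpy as np

# A hypothetical 2x3 RGB image: shape is (height, width, channels).
image = np.array(
    [
        [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]],
    ],
    dtype=np.float32,
)

# Horizontal flip: reverse the order of the columns (the width axis).
flipped = np.flip(image, axis=1)

# Darken: scale every channel value by one half.
darkened = image * 0.5

print(image.shape)                      # three dimensions
print(np.array_equal(image, flipped))   # same picture to a person, different array
print(np.array_equal(image, darkened))
```

To the model, each of these arrays is a new training example, which is why simple transformations can extend an image dataset.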

Graphs, such as time series data charts, and mathematical equation plots, such as 3D topology plots, are outside the scope of image augmentation.

Fun fact

You can reduce the overfitting problem in DL image classification training by creatively using data augmentation methods.

Text augmentation has different concerns than image augmentation. Let’s take a look.

Text definition

The primary text input data is in English, but the same techniques for text augmentation can be applied to other West Germanic languages. Python lessons use English as the text input data.

The techniques for supplementing the text input are back translation, easy data augmentation, and albumentation. A few methods might be counterintuitive at first glance, such as deleting or swapping words in a sentence. However, it is an acceptable practice because, in the real world, not everyone writes perfect English.
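Deleting and swapping words can be sketched in a few lines of standard-library Python. This is an illustration only, not the implementation used later in the book: the function names, the deletion probability, and the sample review are all assumptions made for the example.

```python
import random


def delete_random_words(words, probability, rng):
    """Drop each word with the given probability, keeping at least one word."""
    if not words:
        return []
    kept = [word for word in words if rng.random() >= probability]
    return kept if kept else [rng.choice(words)]


def swap_random_words(words, rng):
    """Return a copy of the list with two randomly chosen positions swapped."""
    swapped = list(words)
    if len(swapped) < 2:
        return swapped
    first, second = rng.sample(range(len(swapped)), 2)
    swapped[first], swapped[second] = swapped[second], swapped[first]
    return swapped


rng = random.Random(42)  # fixed seed so repeated runs give the same result
review = "the movie was surprisingly good and the acting was great".split()
print(" ".join(delete_random_words(review, 0.2, rng)))
print(" ".join(swap_random_words(review, rng)))
```

Both functions leave the original list untouched, so one source sentence can yield many augmented variants.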

For example, movie reviewers on the American Multi-Cinema (AMC) website write incomplete or grammatically incorrect sentences. They omit verbs or use inappropriate words. As a rule of thumb, you should not expect perfect English for text input data in many NLP projects.

If an NLP model is trained in perfect English as text input data, it could cause bias against typical online reviewers. In other words, the NLP model will predict inaccurately when deployed to a real-world audience. For example, in sentiment analysis, the AI system will predict whether a movie review has a positive or negative sentiment. Suppose you trained the system using a perfect English dataset. In that case, the AI system might forecast a false positive or false negative when people write a short line with misspelled words and grammatical errors.

Language translation, ideograms, and hieroglyphs are outside the scope of this book. Now, let’s look at audio augmentation.

Audio definition

Audio input data can be any sound wave recording such as music, speech, and natural sounds. Sound wave attributes such as amplitude and frequency are represented as graphs, which are technically images, but you can’t use any image augmentation methods for audio input data.

The techniques for expanding audio input are split into two types: waveform and spectrogram. For raw audio, the transformation methods range from time-shifting and pitch scaling to random gain, while for spectrograms, the functions are time masking, time stretching, pitch scaling, and many others.
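Two of the raw-audio transformations, time-shifting and random gain, can be sketched directly on a NumPy array. This is a hedged illustration, assuming NumPy is installed: the function names, the wrap-around shift, the gain range, and the synthetic 440 Hz tone are assumptions for the example, not the libraries covered later in the book.

```python
import numpy as np


def time_shift(waveform, shift_samples):
    """Shift the waveform in time, wrapping the tail around to the start."""
    return np.roll(waveform, shift_samples)


def random_gain(waveform, rng, low=0.5, high=1.5):
    """Scale the amplitude by a factor drawn uniformly from [low, high)."""
    return waveform * rng.uniform(low, high)


sample_rate = 8000  # samples per second (illustrative value)
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz sine wave

rng = np.random.default_rng(0)  # seeded generator for repeatable runs
shifted = time_shift(tone, 100)
rescaled = random_gain(tone, rng)
print(shifted.shape, rescaled.shape)
```

Both outputs have the same length as the input, so they can be used as additional training examples with the original label.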

Speech in a language other than English is outside the scope of this book. This is not due to technical difficulties but rather because this book is written in English. Writing about the effects of switching to a different language would be problematic.

Audio augmentation is demanding, but tabular data is even more challenging to expand.

Tabular data definition

Tabular data is information in a relational database, spreadsheet, or text file in comma-separated values (CSV) format. Tabular data augmentation is a fast-growing field in ML and DL. The tabular data augmentation techniques are transforming, interacting, mapping, and extraction.

Fun challenge

Here is a thought experiment. Can you think of data types other than image, text, audio, and tabular? A hint is Casablanca and Blade Runner.

There are two parts to this chapter. The first half discussed the various concepts and techniques; what follows is hands-on Python coding on a Python Notebook. The book will use this learn-then-code pattern in all the chapters. It is time to get your hands dirty and write Python code.

 

Python Notebook

Jupyter Notebook is an open source web application that is the de facto choice for AI, ML, and data scientists. Jupyter Notebook supports multiple computer languages, and the most popular is Python.

Throughout this book, the term Python Notebook will be used synonymously for Jupyter Notebook, JupyterLab, and Google Colab Jupyter Notebook.

For Python developers, there are many choices of integrated development environment (IDE) platforms, such as Integrated Development and Learning Environment (IDLE), PyCharm, Microsoft Visual Studio, Atom, Sublime, and many more. Still, a Python Notebook is the preferred choice for AI, ML, and data scientists. It is an interactive IDE fit for exploring, coding, and deploying AI projects.

Fun fact

The easiest learning method is reading this book, running the code, and hacking it. This book cannot cover all scenarios; therefore, you must be comfortable with hacking the code so that it matches your real-world dataset. The Python Notebook is designed for interactivity. It gives us the freedom to play, explore, and make mistakes.

Python Notebook is the development tool of choice, and in particular, we will review the following:

  • Google Colab
  • Python Notebook options
  • Installing Python Notebook

Let’s begin with Google Colab.

 

Google Colab

Google Colab Jupyter Notebook with Python is one of the popular options for developing AI and ML projects. All you need is a Gmail account.

Colab can be found at https://colab.research.google.com/. The free Colab version is sufficient for the code in this book; the Pro+ version enables more CPU and GPU RAM.

After logging in to Colab, you can retrieve this book’s Python Notebooks from the following GitHub URL: https://github.com/PacktPublishing/data-augmentation-with-python.

You can start using Colab by using one of the following options:

  • The first method of opening a Python Notebook is copying it from GitHub. From Colab, go to the File menu, choose Open Notebook, and then click on the GitHub tab. In the Repository field, enter the GitHub URL specified previously; refer to Figure 1.2. Lastly, select the chapter and Python Notebook (.ipynb) file:

Figure 1.2 – Loading a Python Notebook from GitHub

  • The second method of opening a Python Notebook is auto-loading it from GitHub. Go to the GitHub link mentioned previously and click on the Python Notebook (.ipynb) file. Click the blue-colored Open in Colab button, as shown in Figure 1.3; it should be on the first line of the Python Notebook. It will launch Colab and load in the Python Notebook automatically:

Figure 1.3 – Loading a Python Notebook from Colab

  • Ensure you save a copy of the Python Notebook to your local Google Drive by clicking on the File menu and selecting the Save a copy in Drive option. Afterward, close the original and use the copy version.
  • The third method of opening a Python Notebook is by downloading a copy from GitHub. Upload the Python Notebook to Colab by clicking on the File menu, choosing Open Notebook, then clicking on the Upload tab, as shown in Figure 1.4:

Figure 1.4 – Loading a Python Notebook by uploading it to Colab

Fun fact

For a quick overview of Colab’s features, go to https://colab.research.google.com/notebooks/basic_features_overview.ipynb. For a tutorial on how to use a Python Notebook, go to https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/jupyter-notebook-tutorial.ipynb.

Choosing Colab follows the same rationale as selecting an IDE: it is based mainly on your preferences. The following section describes additional Python Notebook options.

Additional Python Notebook options

Python notebooks are available in free and paid versions from many online companies, such as Microsoft, Amazon, Kaggle, Paperspace, and others. Using more than one vendor is typical because a Python Notebook behaves the same way across multiple vendors. However, it is similar to choosing an IDE – once selected, we tend to stay in the same environment.

You can use the following feature criteria to select a Python Notebook:

  • Easy to set up. Can you load and run a Python Notebook in 15 minutes?
  • A free version where you can run the Python Notebooks in this book.
  • Free CPU and GPU.
  • Free permanent storage for the Python Notebooks and versioning.
  • Easy access to GitHub.
  • Easy to upload and download the Python Notebooks to and from the local disk drive.
  • Option to upgrade to a paid version for faster CPUs and GPUs and additional RAM.

The choice of Python Notebook is based on your needs, preferences, or familiarity. You don’t have to use Google Colab for the lessons in this book. This book’s Python Notebooks will run on, but are not limited to, the following vendors:

  • Google Colab
  • Kaggle Notebooks
  • Deepnote
  • Amazon SageMaker Studio Lab
  • Paperspace Gradient
  • DataCrunch
  • Microsoft Notebooks in Visual Studio Code

The cloud-based options depend on having fast internet access at all times, so if internet access is a problem, you might want to install the Python Notebook locally on your laptop/computer. The installation process is straightforward.

Installing Python Notebook

Python Notebook can be installed on a local desktop or laptop for Windows, Mac, and Linux. The advantages of the local version are as follows:

  • Fully customizable
  • No limit on runtime – that is, no timeout on the Python Notebook during long training sessions
  • No rules or arbitrary limitations

The disadvantage is that you have to set up and maintain the environment. For example, you must do the following:

  • Install Python and Jupyter Notebook
  • Install and configure the NVIDIA graphic card (optional for data augmentation)
  • Maintain and update dozens of dependency Python libraries
  • Upgrade the disk drive, CPU, and GPU RAM

Installing Python Notebook is easy, requiring just one console or terminal command, but first, check the Python version. Type the following command in the terminal or console application:

>python3 --version

You should have version 3.7.0 or later. If you don’t have Python 3 or have an older version, install Python from https://www.python.org/downloads/.
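If a Python Notebook is already running, the same check can be done from a code cell. The following is a small sketch using only the standard library; the function name and the MIN_VERSION constant are illustrative choices, not part of the book's code:

```python
import sys

MIN_VERSION = (3, 7)  # minimum major and minor version stated above


def is_python_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the major.minor version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


# Show the running interpreter's version and whether it is recent enough.
print(sys.version.split()[0], is_python_supported())
```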

Install JupyterLab, which contains Python Notebook, using pip. The same command works on Windows, Mac, and Linux:

>pip install jupyterlab

If you don’t like pip, use conda:

>conda install -c conda-forge jupyterlab

Other than pip and conda, you can use mamba:

>mamba install -c conda-forge jupyterlab

Start JupyterLab or Python Notebook with the following command:

>jupyter lab

The result of installing Python Notebook on a Mac is as follows:

Figure 1.5 – Jupyter Notebook on a local MacBook

The next step is cloning this book’s Python Notebooks from the respective GitHub link. You can use the GitHub desktop app, the git command on the terminal command line, or a Python Notebook code cell using the magic character exclamation point (!) followed by the standard git command, as follows:

url = 'https://github.com/PacktPublishing/Data-Augmentation-with-Python'
!git clone {url}

Regardless of whether you choose the cloud-based options, such as Google Colab or Kaggle, or work offline, the Python Notebook code will work the same. The following section will dive into the Python Notebook programming style and introduce you to Pluto.

 

Programming styles

The coding style combines the standard, tried-and-true method of object-oriented programming with a naming convention for functions and variables.

Fun fact

The majority of Python code you find on blogs and websites is snippets. Therefore, it is not very helpful for studying fundamental topics such as data augmentation. In addition, Python on a Notebook induces lazy practices because programmers think of each Notebook code cell as a snippet separate from the whole. In reality, the entire Python Notebook is one program. Chief among the benefits of using best programming practices is that it’s easier to learn and retain knowledge. A programming style may include many standard best practices, but it is also personal to you. Use it to your advantage to learn new concepts and techniques faster, such as how to write data augmentation code.

There are quite a few topics in this section. In particular, we will cover the following concepts:

  • Source control
  • The PacktDataAug class
  • Naming convention
  • Extend base class
  • Referencing library
  • Exporting Python code
  • Pluto

Let’s begin with source control.

Source control

The first rule of programming is to manage the source code version. It will help you answer questions such as, What did you code last week?, What was fixed yesterday?, What new feature was added today?, and How do I share my code with my team?

The Git process manages the source code for one person or a team. Among many of Git's virtues is the freedom to make mistakes. In other words, Git allows you to try something new or break the code because you can always roll back to a previous version.

For source control, GitHub is a popular website, and Bitbucket comes in second place. You can use the Git process from a command-line terminal or Git applications, such as GitHub Desktop.

Google Colab has a built-in Git feature. You have seen how easy it is to load a Python Notebook on Google Colab, and saving it is just as easy. In Git, you must commit and push. The steps are as follows:

  1. From the Colab menu, click on File.
  2. Select Save a copy in GitHub.
  3. Enter your GitHub URL in the Repository field and select the code branch.
  4. Enter the commit message.
  5. Click OK:

Figure 1.6 – Google Colab – saving to GitHub

Figure 1.6 shows the interface between Google Colab Python Notebook and GitHub. Next, we’ll look at the base class, PacktDataAug.

The PacktDataAug class

The code for the base class is neither original nor unique to this book. It is standard Python code for constructing an object-oriented class. The name of the object is different for every project. For this book, the name of the class is PacktDataAug.

Every chapter begins with this base class, and we will add new methods to the object using a Python decorator as we learn new concepts and techniques for augmenting data.

This exercise's Python code is in the Python Notebooks and on this book’s GitHub repository. Thus, I will not copy or display the complete code in this book. I will show relevant code lines, explain their significance, and rely on you to study the entire code in the Python Notebooks.

The definition of the base class is as follows:

# class definition
class PacktDataAug(object):
  def __init__(self,
    name="Pluto",
    is_verbose=True,
    *args, **kwargs):
    # constructor body omitted here; see the Python Notebook
    ...

PacktDataAug inherits from the base object class, and the definition has two optional parameters:

  • The name parameter is a string, and it is the name of your object. It has no essential function other than labeling your object.
  • is_verbose is a Boolean that tells the object to print the object information during instantiation.

The next topic we will cover is the code naming convention.

Naming convention

The code naming convention is as follows:

  • The function’s name will begin with an action verb, such as print_, fetch_, or say_.
  • A function that returns a Boolean value begins with is_ or has_.
  • Variable names begin with a noun, not an action verb.
  • There is a heated discussion in the Python community on whether to use camelCase – for example, fetchKaggleData() – or use lowercase with underscores – for example, fetch_kaggle_data(). This book uses lowercase with underscores.
  • Functions or variables that begin with underscores are temporary variables or helper functions – for example, _image_auto_id, _drop_images(), and _append_full_path().
  • Variable or function abbreviations are sparingly used because the descriptive name is easier to understand. In addition, Colab has auto-complete functionality. Thus, it makes using long, descriptive names easier to type with fewer typos.

The code for instantiating a base class is standard Python code. I used pluto as the object name, but you can choose any name:

# Instantiate Pluto
pluto = PacktDataAug("Pluto")

The output is as follows:

--------------------------- : ---------------------------
            Hello from class : <class '__main__.PacktDataAug'> Class: PacktDataAug
                   Code name : Pluto
                   Author is : Duc Haba
---------------------------- : ---------------------------

The base class comes with two simple helper methods. Both are for pretty printing – that is, printing status or output messages so that they are neatly centered.

The self._ph() method prints the header line with an equal number of dashes on both sides of the colon character, while the self._pp() function takes two parameters, one for the left-hand side and the other for the right-hand side.
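
As a rough sketch of how such helpers could work, the following code right-justifies the left-hand label so that the colons line up. The column width of 28, the format_header() and format_pair() functions, and the PrettyPrinterSketch class are assumptions for illustration; the book's actual helpers are in the Python Notebook.

```python
WIDTH = 28  # assumed column width, chosen to resemble the output shown earlier


def format_header(width=WIDTH):
    """Build a line with equal runs of dashes on both sides of the colon."""
    return f"{'-' * width} : {'-' * width}"


def format_pair(left, right, width=WIDTH):
    """Right-justify the label so that every colon lands in the same column."""
    return f"{left:>{width}} : {right}"


class PrettyPrinterSketch:
    """Hypothetical class with helpers modeled on _ph() and _pp()."""

    def _ph(self):
        print(format_header())

    def _pp(self, left, right):
        print(format_pair(left, right))


printer = PrettyPrinterSketch()
printer._ph()
printer._pp("Code name", "Pluto")
printer._pp("Author is", "Duc Haba")
printer._ph()
```

Separating the formatting from the printing is a design choice of this sketch: it keeps the string-building logic easy to check on its own.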

You have already seen the result of instantiating pluto with the default parameter of is_verbose=True. As standard practice, I will not print the complete code in this book. I am relying on you to view and run the code in the Python Notebook, but I will make an exception for this chapter and show you the snippet of code for the is_verbose option. This demonstrates how easy it is to read Python code in the Python Notebook. The snippet is as follows:

# code snippet for verbose option
if is_verbose:
  self._ph()
  self._pp("Hello from class", f"{self.__class__} Class: {self.__class__.__name__}")
  self._pp("Code name", self.name)
  self._pp("Author is", self.author)
  self._ph()

Fun fact

This book’s primary goal is to help you write clean and easy-to-understand code and not write compact code that may lead to obfuscation.

Another powerful programming technique is using a Python decorator to extend the base class.

Extend base class

This book has been designed as an interactive journey where you learn and discover new data augmentation concepts and techniques sequentially, from image, text, and audio data to tabular data. The object, pluto, will acquire new methods as the journey progresses. Thus, having a technique to extend the class with new functions is essential. In contrast, providing the fully built class at the beginning of this book would not allow you to embark on the learning journey. Learning by exploration helps you retain knowledge longer compared to learning by memorization.

The @add_method() decorator function extends any class with a new function.
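
The book's definition of add_method() is in the Python Notebook. A common way to write such a decorator is sketched below; treat it as an assumption about the general pattern rather than the book's exact code. The Dog class and say_name() function are hypothetical and serve only as a demonstration.

```python
import functools


def add_method(cls):
    """Return a decorator that attaches the decorated function to cls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        setattr(cls, func.__name__, wrapper)  # the class gains a new method
        return func  # the plain function stays usable under its own name
    return decorator


class Dog:  # hypothetical class for the demonstration
    def __init__(self, name):
        self.name = name


rex = Dog("Rex")  # created before the new method exists


@add_method(Dog)
def say_name(self):
    return f"My name is {self.name}"


print(rex.say_name())  # My name is Rex
```

Because Python looks up methods on the class at call time, objects created before the decorator runs, such as rex here, also gain the new method.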

Here is an excellent example of extending the base class. One of the most common and frustrating sources of errors in Python is running a different library version from the one used in the class homework or in a code snippet copied from the Python community. Python data scientists seldom write code from scratch and rely heavily on existing libraries. Thus, printing the Python library versions on a local or cloud-based server can save hours of aggravating debugging.

To resolve this issue, we can extend the PacktDataAug class or, to use the journey metaphor, teach Pluto a new trick. The new method, say_sys_info(), prints this book’s expected library version on the left-hand side and the actual library version on your local or remote server on the right-hand side. The decorator is applied to extend the Pluto class as follows:

# using decorator to add new method
@add_method(PacktDataAug)
def say_sys_info(self):

After running the aforementioned code cell, you can ask Pluto to print the library version using the following command:

# check Python and libraries version
pluto.say_sys_info()

The results are as follows:

---------------------------- : ---------------------------
                 System time : 2022/07/23 06:36
                    Platform : linux
     Pluto Version (Chapter) : 1.0
            Python (3.7.10)  : actual: 3.7.12 (default, Apr 24 2022, 17:11:25) [GCC 7.5.0]
            PyTorch (1.11.0) : actual: 1.12.1+cu113
              Pandas (1.3.5) : actual: 1.3.5
                 PIL (9.0.0) : actual: 7.1.2
          Matplotlib (3.2.2) : actual: 3.2.2
                   CPU count : 2
                  CPU speed : NOT available
---------------------------- : ---------------------------

If your result shows library versions that are older than this book’s expected values, you might run into bugs while working through the lessons. For example, in the preceding output, the Pillow (PIL) library version is 7.1.2, which is lower than the book’s expected version of 9.0.0.

To correct this issue, run the following code line in the Notebook to install the 9.0.0 version:

# upgrade to Pillow library version 9.0.0
!pip install Pillow==9.0.0

Rerunning pluto.say_sys_info() should now show the PIL version as 9.0.0.
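
The body of say_sys_info() is in the Python Notebook. As a rough sketch of the idea, an expected-versus-actual report can be built with the standard library's importlib.metadata module, which requires Python 3.8 or later. The function names fetch_actual_version() and say_versions() are hypothetical and are not the book's method; the expected values are copied from the preceding sample output.

```python
import platform
import sys
from importlib import metadata


def fetch_actual_version(package_name):
    """Return the installed version string, or a marker if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return "NOT installed"


def say_versions(expected_versions):
    """Print expected versions on the left and actual versions on the right."""
    print(f"{'Platform':>28} : {sys.platform}")
    print(f"{'Python':>28} : actual: {platform.python_version()}")
    for package_name, expected in expected_versions.items():
        label = f"{package_name} ({expected})"
        print(f"{label:>28} : actual: {fetch_actual_version(package_name)}")


# Expected values copied from the sample output in this section.
say_versions({
    "torch": "1.11.0",
    "pandas": "1.3.5",
    "Pillow": "9.0.0",
    "matplotlib": "3.2.2",
})
```

A package that is not installed is reported as such instead of raising an error, so the report always runs to completion.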

Fun challenge

Extend Pluto with a new function to display the system’s GPU total RAM and available free RAM. The function name can be fetch_system_gpu_ram(). A hint is to use the torch library and the torch.cuda.memory_allocated() and torch.cuda.memory_reserved() functions. You can use this technique to extend any Python library class. For example, to add a new function to the numpy library, you can use the @add_method(numpy) decorator.

There are a few more programming-style topics. Next, you’ll discover how best to reference a library.

Referencing a library

Python is a flexible language when it comes to importing libraries. There are aliases and direct imports. Here are a few examples of importing the same function – that is, plot():

# display many options to import a function
from matplotlib.pyplot import plot
import matplotlib.pyplot
import matplotlib.pyplot as plt # most popular
# more exotics importing examples
from matplotlib.pyplot import plot as paint
import matplotlib.pyplot as canvas
from matplotlib import pyplot as plotter

The salient point is that all these examples are valid, and that is both good and bad. It enables flexibility, but at the same time, sharing code snippets online or maintaining code can lead to frustration when they break. Python often gives an unintelligible error message when the system cannot locate the function. To fix this bug, you need to know which library to upgrade. The problem is compounded when many libraries use the same function name, such as the imread() method, which appears in at least four libraries.

By adhering to this book’s programming style, when the imread() method fails, you know which library needs to be upgraded or, in rare conditions, downgraded. The code is as follows:

# example of using the full method name
import matplotlib.pyplot
matplotlib.pyplot.imread(fname)  # fname is the path to your image file

If this call fails, matplotlib might need to be upgraded, or you might be using the wrong imread() method. The code may have been written for OpenCV (version 4.7.0.72, for example), in which case the call should be cv2.imread().
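
The same kind of name collision can be reproduced with the standard library alone, which makes it easy to try without installing anything. This sketch swaps imread() for sqrt(), a function name shared by the math and cmath modules; it is an analogy, not an example from the book.

```python
import cmath
import math

# Both libraries expose sqrt(), but they behave differently on negatives.
print(cmath.sqrt(-1))  # 1j

try:
    math.sqrt(-1)
except ValueError as error:
    # The full name in the call shows which library raised the error.
    print("math.sqrt failed:", error)
```

With the full module name in every call, it is immediately clear which library to inspect when a call misbehaves.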

The next concept is exporting. It may not strictly belong to the programming style, but it is necessary if you wish to reuse and add extra functions to this chapter’s code.

Exporting Python code

This book ensures that every chapter has its own Python Notebook. The advanced image, text, and audio chapters need the previous chapter's code. Thus, it is necessary to export the selected Python code cells from the Python Notebook.

The Python Notebook has both markup and code cells, and not all code cells must be exported. You only need to export code cells that define new functions. For the code cells that you want to export to a Python file, use the Python Notebook %%writefile file_name.py magic command at the beginning of the code cells and %%writefile -a file_name.py to append additional code to the file. file_name is the name of the Python file – for example, pluto_chapter_1.py.
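
The %%writefile magic only works inside a notebook cell. To show what it does, the following sketch mimics both forms with plain Python file writes and then loads the exported file the way a later chapter might. The temporary folder and the say_hello() and say_goodbye() functions are hypothetical; only the pluto_chapter_1.py file name comes from the text.

```python
import importlib.util
import tempfile
from pathlib import Path

# Write into a temporary folder so that the sketch leaves no files behind.
export_path = Path(tempfile.mkdtemp()) / "pluto_chapter_1.py"

# Equivalent of a cell that starts with: %%writefile pluto_chapter_1.py
export_path.write_text("def say_hello():\n    return 'Hello from chapter 1'\n")

# Equivalent of a later cell that starts with: %%writefile -a pluto_chapter_1.py
with export_path.open("a") as export_file:
    export_file.write("\ndef say_goodbye():\n    return 'Goodbye from chapter 1'\n")

# A later chapter can then load the exported file as a module.
spec = importlib.util.spec_from_file_location("pluto_chapter_1", export_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

print(module.say_hello())
print(module.say_goodbye())
```

The first write replaces any existing file, while the append adds to it, which is why the first exported cell uses %%writefile and every later one uses %%writefile -a.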

The last and best part of the programming style is introducing Pluto as your coding companion.

Pluto

Pluto uses a whimsical idea of teaching by including dialogs with an imaginary digital character. We can give Pluto tasks to complete. It has a friendly tone, and sometimes the author addresses you directly. It moves away from the direct lecturing format. There are scholarly papers that explain how lecturing in monologue is not the optimal method for learning new concepts, such as the article Why Students Learn More From Dialogue- Than Monologue-Videos: Analyses of Peer Interactions by Michelene T. H. Chi, Seokmin Kang, and David L. Yaghmourian that was published by the Journal of the Learning Sciences in 2016.

You are most likely reading this book alone rather than learning how to write augmentation code together with a group. Thus, creating an imaginary companion as the instantiated object might spark your imagination. It makes writing and reading code more accessible – for example, the pluto.fetch_kaggle_data() function is self-explanatory, and little additional documentation is needed. It reduces Python code to a familiar sentence format of a subject followed by an action verb.

Fun challenge

Change the object name from Pluto to your favorite canine name, such as Biggy, Sunny, or Hanna. It will make the learning process more personal. For example, change pluto = PacktDataAug("Pluto") to hanna = PacktDataAug("Hanna").

Fair warning: Do not choose your beloved cat as the object’s name because felines will not listen to any commands. Imagine asking your cat to play fetch.

 

Summary

In this chapter, you learned that data augmentation is essential for achieving higher accuracy prediction in DL and generative AI. Data augmentation is an economical option for extending a dataset without the difficulty of purchasing and labeling new data.

The four input data types are image, text, audio, and tabular. Each data type faces different challenges, techniques, and limitations. Furthermore, the dataset dictates which functions and parameters are suitable. For example, people’s faces and aerial photographs are image datasets, but you can’t expand the data by vertically flipping people’s images; however, you can vertically flip aerial photos.

In the second part of this chapter, you used Python Notebooks to reinforce your learning of these augmentation concepts. This involved selecting a Python Notebook as the default IDE, either by accessing a cloud-based platform, such as Google Colab or Kaggle, or by installing the Python Notebook locally on your laptop.

The Programming styles section laid the foundation for the Python Notebook’s structure. It touched on GitHub as a form of source control, using base classes, extending base classes, long library function names, exporting to Python, and introducing Pluto.

This chapter laid the foundation with Pluto as the main object. Pluto does not start with complete data augmentation functions – he begins with a minimum structure, and as he learns new data augmentation concepts and techniques from chapter to chapter, he will add new methods to his arsenal.

By the end of this book, you and Pluto will have learned techniques for augmenting image, text, audio, and tabular data. In other words, you will learn how to write a powerful image, text, audio, and tabular augmentation class from scratch using real-world data, which you can reuse in future data augmentation projects.

Throughout this chapter, there were fun facts and fun challenges. Pluto hopes you will take advantage of what’s been provided and expand your experience beyond the scope of this chapter.

In Chapter 2, Biases in Data Augmentation, Pluto and you will explore how data augmentation can increase biases. Using data biases as a guiding principle to data augmentation is an often-overlooked technique.

About the Author
  • Duc Haba

    Mr. Duc Haba is a lifelong technologist and researcher specializing in Deep Learning and Generative AI. He has been a programmer, Enterprise Mobility Solution Architect, AI Solution Architect, Principal, VP, CTO, and CEO. The companies range from startups and IPOs to enterprise companies. Duc's career started with Xerox Palo Alto Research Center (PARC), researching expert systems (rule-based) for Xerox copier diagnostics. After PARC, he joined Oracle, followed by Viant Consulting as a founding member. He jumped headfirst into the entrepreneurial culture in Silicon Valley. There were slightly more failures than successes, but the highlights are working with Oracle, Viant, and RRKidz. Currently, he is happy working at YML as the AI Solution Architect.
