Deep Learning with fastai Cookbook

By Mark Ryan
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Chapter 2: Exploring and Cleaning Up Data with fastai

About this book

fastai is an easy-to-use deep learning framework built on top of PyTorch that lets you rapidly create complete deep learning solutions with as few as 10 lines of code. Both predominant low-level deep learning frameworks, TensorFlow and PyTorch, require a lot of code, even for straightforward applications. In contrast, fastai handles the messy details for you and lets you focus on applying deep learning to actually solve problems.

The book begins by summarizing the value of fastai and showing you how to create a simple 'hello world' deep learning application with fastai. You'll then learn how to use fastai for all four application areas that the framework explicitly supports: tabular data, text data (NLP), recommender systems, and vision data. As you advance, you'll work through a series of practical examples that illustrate how to create real-world applications of each type. Next, you'll learn how to deploy fastai models, including creating a simple web application that predicts what object is depicted in an image. The book wraps up with an overview of the advanced features of fastai.

By the end of this fastai book, you'll be able to create your own deep learning applications using fastai. You'll also have learned how to use fastai to prepare raw datasets, explore datasets, train deep learning models, and deploy trained models.

Publication date:
September 2021
Publisher
Packt
Pages
340
ISBN
9781800208100

 

Chapter 2: Exploring and Cleaning Up Data with fastai

In the previous chapter, we got started with the fastai framework by setting up its coding environment, working through a concrete application example (MNIST), and investigating two frameworks with different relationships to fastai: PyTorch and Keras. In this chapter, we are going to dive deeper into an important aspect of fastai: ingesting, exploring, and cleaning up data. In particular, we are going to explore a selection of the datasets that are curated by fastai.

By the end of this chapter, you will be able to describe the complete set of curated datasets that fastai supports, use the facilities of fastai to examine these datasets, and clean up a dataset to eliminate missing and non-numeric values.

Here are the recipes that will be covered in this chapter:

  • Getting the complete set of oven-ready fastai datasets
  • Examining tabular datasets with fastai
  • Examining text datasets with fastai
  • Examining image datasets with fastai
  • Cleaning up raw datasets with fastai
 

Technical requirements

Ensure that you have completed the setup sections in Chapter 1, Getting Started with fastai, and that you have a working Gradient instance or Colab setup. Ensure that you have cloned the repository for this book (https://github.com/PacktPublishing/Deep-Learning-with-fastai-Cookbook) and have access to the ch2 folder. This folder contains the code samples that will be described in this chapter.

 

Getting the complete set of oven-ready fastai datasets

In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.

In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.

How to do it…

In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first three cells of the notebook to load the required libraries, set up the notebook for fastai, and define the MNIST dataset:
    Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset

    Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset

  2. Consider the argument to untar_data: URLs.MINST. What is this? Let's try the ??  shortcut to examine the source code for a URLs object:
    Figure 2.2 – Source for URLs

    Figure 2.2 – Source for URLs

  3. By looking at the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:
    MNIST           = f'{S3_IMAGE}mnist_png.tgz'
  4. Working backward through the source code for the URLs class, we can get the whole URL for MNIST:
    S3_IMAGE     = f'{S3}imageclas/'
    S3  = 'https://s3.amazonaws.com/fast-ai-'
  5. Putting it all together, we get the URL for URLs.MNIST:
    https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
  6. You can download this file for yourself and untar it. You will see that the directory structure of the untarred package looks like this:
    mnist_png
    ├── testing
    │   ├── 0
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   ├── 4
    │   ├── 5
    │   ├── 6
    │   ├── 7
    │   ├── 8
    │   └── 9
    └── training
         ├── 0
         ├── 1
         ├── 2
         ├── 3
         ├── 4
         ├── 5
         ├── 6
         ├── 7
         ├── 8
         └── 9
  7. In the untarred directory structure, each of the testing and training directories contain subdirectories for each digit. These digit directories contain image files for that digit. This means that the label of the dataset – the value that we want the model to predict – is encoded in the directory that the image file resides in.
  8. Is there a way to get the directory structure of one of the curated datasets without having to determine its URL from the definition of URLs, download the dataset, and unpack it? There is – using path.ls():
    Figure 2.3 – Using path.ls() to get the dataset's directory structure

    Figure 2.3 – Using path.ls() to get the dataset's directory structure

  9. This tells us that there are two subdirectories in the dataset: training and testing. You can call ls() to get the structure of the training subdirectory:
    Figure 2.4 – The structure of the training subdirectory

    Figure 2.4 – The structure of the training subdirectory

  10. Now that we have learned how to get the directory structure of the MNIST dataset using the ls() function, what else can we learn from the output of ??URLs?
  11. First, let's look at the other datasets listed in the output of ??URLs by group. First, let's look at the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):
         ADULT_SAMPLE           = f'{URL}adult_sample.tgz'
         BIWI_SAMPLE            = f'{URL}biwi_sample.tgz'
         CIFAR                     = f'{URL}cifar10.tgz'
         COCO_SAMPLE            = f'{S3_COCO}coco_sample.tgz'
         COCO_TINY               = f'{S3_COCO}coco_tiny.tgz'
         HUMAN_NUMBERS         = f'{URL}human_numbers.tgz'
         IMDB                       = f'{S3_NLP}imdb.tgz'
         IMDB_SAMPLE            = f'{URL}imdb_sample.tgz'
         ML_SAMPLE               = f'{URL}movie_lens_sample.tgz'
         ML_100k                  = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
         MNIST_SAMPLE           = f'{URL}mnist_sample.tgz'
         MNIST_TINY              = f'{URL}mnist_tiny.tgz'
         MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
         PLANET_SAMPLE         = f'{URL}planet_sample.tgz'
         PLANET_TINY            = f'{URL}planet_tiny.tgz'
         IMAGENETTE              = f'{S3_IMAGE}imagenette2.tgz'
         IMAGENETTE_160        = f'{S3_IMAGE}imagenette2-160.tgz'
         IMAGENETTE_320        = f'{S3_IMAGE}imagenette2-320.tgz'
         IMAGEWOOF               = f'{S3_IMAGE}imagewoof2.tgz'
         IMAGEWOOF_160         = f'{S3_IMAGE}imagewoof2-160.tgz'
         IMAGEWOOF_320         = f'{S3_IMAGE}imagewoof2-320.tgz'
         IMAGEWANG               = f'{S3_IMAGE}imagewang.tgz'
         IMAGEWANG_160         = f'{S3_IMAGE}imagewang-160.tgz'
         IMAGEWANG_320         = f'{S3_IMAGE}imagewang-320.tgz'
  12. Next, let's look at the datasets in the other categories: image classification datasets, NLP datasets, image localization datasets, audio classification datasets, and medical image classification datasets. Note that the list of curated datasets includes datasets that aren't directly associated with any of the four main application areas supported by fastai. The audio datasets, for example, apply to a use case outside the four main application areas:
         # image classification datasets
         CALTECH_101  = f'{S3_IMAGE}caltech_101.tgz'
         CARS            = f'{S3_IMAGE}stanford-cars.tgz'
         CIFAR_100     = f'{S3_IMAGE}cifar100.tgz'
         CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
         FLOWERS        = f'{S3_IMAGE}oxford-102-flowers.tgz'
         FOOD            = f'{S3_IMAGE}food-101.tgz'
         MNIST           = f'{S3_IMAGE}mnist_png.tgz'
         PETS            = f'{S3_IMAGE}oxford-iiit-pet.tgz'
         # NLP datasets
         AG_NEWS                        = f'{S3_NLP}ag_news_csv.tgz'
         AMAZON_REVIEWS              = f'{S3_NLP}amazon_review_full_csv.tgz'
         AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
         DBPEDIA                        = f'{S3_NLP}dbpedia_csv.tgz'
         MT_ENG_FRA                    = f'{S3_NLP}giga-fren.tgz'
         SOGOU_NEWS                    = f'{S3_NLP}sogou_news_csv.tgz'
         WIKITEXT                       = f'{S3_NLP}wikitext-103.tgz'
         WIKITEXT_TINY               = f'{S3_NLP}wikitext-2.tgz'
         YAHOO_ANSWERS               = f'{S3_NLP}yahoo_answers_csv.tgz'
         YELP_REVIEWS                 = f'{S3_NLP}yelp_review_full_csv.tgz'
         YELP_REVIEWS_POLARITY   = f'{S3_NLP}yelp_review_polarity_csv.tgz'
         # Image localization datasets
         BIWI_HEAD_POSE      = f"{S3_IMAGELOC}biwi_head_pose.tgz"
         CAMVID                  = f'{S3_IMAGELOC}camvid.tgz'
         CAMVID_TINY           = f'{URL}camvid_tiny.tgz'
         LSUN_BEDROOMS        = f'{S3_IMAGE}bedroom.tgz'
         PASCAL_2007           = f'{S3_IMAGELOC}pascal_2007.tgz'
         PASCAL_2012           = f'{S3_IMAGELOC}pascal_2012.tgz'
         # Audio classification datasets
         MACAQUES               = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
         ZEBRA_FINCH           = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'
         # Medical Imaging datasets
         SIIM_SMALL            = f'{S3_IMAGELOC}siim_small.tgz'
  13. Now that we have listed all the datasets defined in URLs, how can we find out more information about them?

    a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs. For example, the naming of the datasets is not consistent and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about fastai curated datasets.

    b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:

    Figure 2.5 – Structure of the training subdirectory

    Figure 2.5 – Structure of the training subdirectory

    c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory in Gradient to inspect the directories for the curated dataset you're interested in.

    d) For example, let's say untar_data is run with URLs.PETS as the argument:

    path = untar_data(URLs.PETS)

    e) Here, you can find the dataset in storage/data/oxford-iiit-pet, and you can see the directory's structure:

    oxford-iiit-pet
    ├── annotations
    │   ├── trimaps
    │   └── xmls
    └── images
  14. If you want to see the definition of a function in a notebook, you can run a cell with ??, followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:
    Figure 2.6 – Source for Path.ls()

    Figure 2.6 – Source for Path.ls()

  15. To see the documentation for any function, you can use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:
Figure 2.7 – Output of doc(Path.ls)

Figure 2.7 – Output of doc(Path.ls)

You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.

How it works…

As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, if the files for the dataset have not already been copied, these files get downloaded to your filesystem (storage/data in a Gradient instance). The object you get back from untar_data allows you to examine the directory structure of the dataset, and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and also lets you focus your efforts on creating and improving the deep learning model rather than fiddling with the details of ingesting the datasets.

There's more…

You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented in https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai – inconsistent documentation. Sometimes, the documentation is complete, but sometimes, it is lacking details, so you will need to look at the source code directly to figure out what's going on, like we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for some details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and keep to the documentation for the current version of fastai: https://docs.fast.ai/.

 

Examining tabular datasets with fastai

In the previous section, we looked at the whole set of datasets curated by fastai. In this section, we are going to dig into a tabular dataset from the curated list. We will ingest the dataset, look at some example records, and then explore characteristics of the dataset, including the number of records and the number of unique values in each column.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_tabular_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to include the ADULT_SAMPLE dataset featured in this section.

Dataset citation

Ron Kohavi. (1996) Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (http://robotics.stanford.edu/~ronnyk/nbtree.pdf).

How to do it…

In this section, you will be running through the examining_tabular_datasets.ipynb notebook to examine the ADULT_SAMPLE dataset.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.ADULT_SAMPLE)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.8 – Output of path.ls()

    Figure 2.8 – Output of path.ls()

  4. The dataset is in the adult.csv file. Run the following cell to ingest this CSV file into a pandas DataFrame:
    df = pd.read_csv(path/'adult.csv')
  5. Run the head() command to get a sample of records from the beginning of the dataset:
    Figure 2.9 – Sample of records from the beginning of the dataset

    Figure 2.9 – Sample of records from the beginning of the dataset

  6. Run the following command to get the number of records (rows) and fields (columns) in the dataset:
    df.shape
  7. Run the following command to get the number of unique values in each column of the dataset. Can you tell from the output which columns are categorical?
    df.nunique()
  8. Run the following command to get the count of missing values in each column of the dataset. Which columns have missing values?
    df.isnull().sum()
  9. Run the following command to display some sample records from the subset of the dataset for people whose age is less than or equal to 40:
    df_young = df[df.age <= 40]
    df_young.head()

Congratulations! You have ingested a tabular dataset curated by fastai and done a basic examination of the dataset.

How it works…

The dataset that you explored in this section, ADULT_SAMPLE, is one of the datasets you would have seen in the source for URLs in the previous section. Note that while the source for URLs identifies which datasets are related to image or NLP (text) applications, it does not explicitly identify the tabular or recommender system datasets. ADULT_SAMPLE is one of the datasets listed under main datasets:

Figure 2.10 – Main datasets from the source for URLs

Figure 2.10 – Main datasets from the source for URLs

How did I determine that ADULT_SAMPLE was a tabular dataset? First, the paper by Howard and Gugger (https://arxiv.org/pdf/2002.04688.pdf) identifies ADULT_SAMPLE as a tabular dataset. Second, I just had to ingest it and try it out to confirm it could be ingested into a pandas DataFrame.

There's more…

What about the other curated datasets that aren't explicitly categorized in the source for URLs? Here's a summary of the datasets listed in the source for URLs under main datasets:

  • Tabular:

    a) ADULT_SAMPLE

  • NLP (text):

    a) HUMAN_NUMBERS  

    b) IMDB

    c) IMDB_SAMPLE      

  • Collaborative filtering:

    a) ML_SAMPLE               

    b) ML_100k                              

  • Image data:  

    a) All of the other datasets listed in URLs under main datasets.

 

Examining text datasets with fastai

In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.

Dataset citation

Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).

How to do it…

In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that's been gleaned from good and featured Wikipedia articles.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.WIKITEXT_TINY)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.11 – Output of path.ls()

    Figure 2.11 – Output of path.ls()

  4. There are two CSV files that make up this dataset. Let's ingest each of them into a pandas DataFrame, starting with train.csv:
    df_train = pd.read_csv(path/'train.csv')
  5. When you use head() to check the DataFrame, you'll notice that something's wrong – the CSV file has no header with column names, but by default, read_csv assumes the first row is the header, so the first row gets misinterpreted as a header. As shown in the following screenshot, the first row of output is in bold, which indicates that the first row is being interpreted as a header, even though it contains a regular data row:
    Figure 2.12 – First record in df_train

    Figure 2.12 – First record in df_train

  6. To fix this problem, rerun the read_csv function, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
    df_train = pd.read_csv(path/'train.csv',header=None)
  7. Check head() again to confirm that the problem has been resolved:
    Figure 2.13 – Revising the first record in df_train

    Figure 2.13 – Revising the first record in df_train

  8. Ingest test.csv into a DataFrame using the header=None parameter:
    df_test = pd.read_csv(path/'test.csv',header=None)
  9. We want to tokenize the dataset and transform it into a list of words. Since we want a common set of tokens for the entire dataset, we will begin by combining the test and train DataFrames:
    df_combined = pd.concat([df_train,df_test])
  10. Confirm the shape of the train, test, and combined dataframes – the number of rows in the combined DataFrame should be the sum of the number of rows in the train and test DataFrames:
    print("df_train: ",df_train.shape)
    print("df_test: ",df_test.shape)
    print("df_combined: ",df_combined.shape)
  11. Now, we're ready to tokenize the DataFrame. The tokenize_df() function takes the list of columns containing the text we want to tokenize as a parameter. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize using its position rather than its name:
    df_tok, count = tokenize_df(df_combined,[df_combined.columns[0]])
  12. Check the contents of the first few records of df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
    Figure 2.14 – The first few records of df_tok

    Figure 2.14 – The first few records of df_tok

  13. Check the count for a few sample words to ensure they are roughly what you expected. Pick a very common word, a moderately common word, and a rare word:
    print("very common word (count['the']):", count['the'])
    print("moderately common word (count['prepared']):", count['prepared'])
    print("rare word (count['gaga']):", count['gaga'])

Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.

How it works…

The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs

 

Examining image datasets with fastai

In the past two sections, we examined tabular and text datasets and got a taste of the facilities that fastai provides for accessing and exploring these datasets. In this section, we are going to look at image data. We are going to look at two datasets: the FLOWERS image classification dataset and the BIWI_HEAD_POSE image localization dataset.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_image_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to use the FLOWERS dataset featured in this section.

Dataset citation

Maria-Elena Nilsback, Andrew Zisserman. (2008). Automated flower classification over a large number of classes (https://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf).

I am grateful for the opportunity to use the BIWI_HEAD_POSE dataset featured in this section.

Dataset citation

Gabriele Fanelli, Thibaut Weise, Juergen Gall, Luc Van Gool. (2011). Real Time Head Pose Estimation from Consumer Depth Cameras (https://link.springer.com/chapter/10.1007/978-3-642-23123-0_11). Lecture Notes in Computer Science, vol 6835. Springer, Berlin, Heidelberg https://doi.org/10.1007/978-3-642-23123-0_11.

How to do it…

In this section, you will be running through the examining_image_datasets.ipynb notebook to examine the FLOWERS and BIWI_HEAD_POSE datasets.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the FLOWERS dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.FLOWERS)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.16 – Output of path.ls()

    Figure 2.16 – Output of path.ls()

  4. Look at the contents of the valid.txt file. This indicates that train.txt, valid.txt, and test.txt contain lists of the image files that belong to each of these datasets:
    Figure 2.17 – The first few records of valid.txt

    Figure 2.17 – The first few records of valid.txt

  5. Examine the jgp subdirectory:
    (path/'jpg').ls()
  6. Take a look at one of the image files. Note that the get_image_files() function doesn't need to be pointed to a particular subdirectory – it recursively collects all the image files in a directory and its subdirectories:
    img_files = get_image_files(path)
    img = PILImage.create(img_files[100])
    img
  7. You should have noticed that the image displayed in the previous step was the native size of the image, which makes it rather big for the notebook. To get the image at a more appropriate size, apply the to_thumb function with the image dimension specified as an argument. Note that you might see a different image when you run this cell:
    Figure 2.18 – Applying to_thumb to an image

    Figure 2.18 – Applying to_thumb to an image

  8. Now, ingest the BIWI_HEAD_POSE dataset:
    path = untar_data(URLs.BIWI_HEAD_POSE)
  9. Examine the path for this dataset:
    path.ls()
  10. Examine the 05 subdirectory:
    (path/"05").ls()
  11. Examine one of the images. Note that you may see a different image:
    Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset

    Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset

  12. In addition to the image files, this dataset also includes text files that encode the pose depicted in the image. Ingest one of these text files into a pandas DataFrame and display it:
Figure 2.20 – The first few records of one of the position text files

Figure 2.20 – The first few records of one of the position text files

In this section, you learned how to ingest two different kinds of image datasets, explore their directory structure, and examine images from the datasets.

How it works…

You used the same untar_data() function to ingest the curated tabular, text, and image datasets, and the same ls() function to examine the directory structures for all the different kinds of datasets. On top of these common facilities, fastai provides additional convenience functions for examining image data: get_image_files() to collect all the image files in a directory tree starting at a given directory, and to_thumb() to render the image at a size that is suitable for a notebook.

There's more…

In addition to image classification datasets (where the goal of the trained model is to predict the category of what's displayed in the image) and image localization datasets (where the goal is to predict the location in the image of a given feature), the fastai curated datasets also include image segmentation datasets where the goal is to identify the subsets of an image that contain a particular object, including the CAMVID and CAMVID_TINY datasets.

 

Cleaning up raw datasets with fastai

Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.

How to do it…

In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and replace categorical values with numeric identifiers.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Recall the Examining tabular datasets with fastai section of this chapter. When you checked to see which columns in the ADULT_SAMPLE dataset had missing values, you found that some columns did indeed have missing values. We are going to identify the columns in ADULT_SAMPLE that have missing values, and use the facilities of fastai to apply transformations to the dataset that deal with the missing values in those columns, and then replace those categorical values with numeric identifiers.
  3. First, let's ingest the ADULT_SAMPLE curated dataset again:
    path = untar_data(URLs.ADULT_SAMPLE)
  4. Now, create a pandas DataFrame for the dataset and check for the number of missing values in each column. Note which columns have missing values:
    df = pd.read_csv(path/'adult.csv')
    df.isnull().sum()
  5. To deal with these missing values (and prepare categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:

    a) procs is the list of transformations that will be applied to TabularPandas. Here, we will specify that we want missing values to be filled (FillMissing) and that we will replace values in categorical columns with numeric identifiers (Categorify).

    b) dep_var specifies which column is the dependent variable; that is, the target that we want to ultimately predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.

    c) cont and cat are lists of the columns in the dataset. They are continuous and categorical, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical values contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() (https://docs.fast.ai/tabular.core.html#cont_cat_split) function to automatically identify the continuous and categorical columns:

    procs = [FillMissing,Categorify]
    dep_var = 'salary'
    cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
  6. Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
    df_no_missing = TabularPandas(df, procs, cat, cont, y_names = dep_var)
  7. Apply the show API to df_no_missing to display samples of its contents. Note that the values in the categorical columns are maintained when the object is displayed using show(). What about replacing the categorical values with numeric identifiers? Don't worry – we'll see that result in the next step:
    Figure 2.21 – The first few records of df_no_missing

    Figure 2.21 – The first few records of df_no_missing

  8. Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: the switch between the original categorical values and the numeric identifiers is handled elegantly. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns:
    Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns

    Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns

  9. Finally, let's confirm that the missing values were handled correctly. As you can see, the two columns that originally had missing values no longer have missing values in df_no_missing:
Figure 2.23 – Missing values in df_no_missing

Figure 2.23 – Missing values in df_no_missing

By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.

How it works…

In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class provides a lot of value by making it easy to execute common steps to prepare a tabular dataset (including replacing missing values and dealing with categorical columns). The cont_cat_split() function automatically identifies continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easy and less error prone than it would be if you had to hand code all the functions required to accomplish these dataset cleanup steps.

About the Author

  • Mark Ryan

    Mark Ryan is a machine learning practitioner and technology manager who is passionate about delivering end-to-end deep learning applications that solve real-world problems. Mark has worked on deep learning projects that incorporate a variety of related technologies, including Rasa chatbots, web applications, and messenger platforms. As a strong believer in democratizing technology, Mark advocates for Keras and fastai as accessible frameworks that open up deep learning to non-specialists. Mark has a degree in computer science from the University of Waterloo and a Master of Science degree in computer science from the University of Toronto.

    Browse publications by this author
Deep Learning with fastai Cookbook
Unlock this book and the full library for FREE
Start free trial