Chapter 2: Exploring and Cleaning Up Data with fastai
In the previous chapter, we got started with the fastai framework by setting up its coding environment, working through a concrete application example (MNIST), and investigating two frameworks with different relationships to fastai: PyTorch and Keras. In this chapter, we are going to dive deeper into an important aspect of fastai: ingesting, exploring, and cleaning up data. In particular, we are going to explore a selection of the datasets that are curated by fastai.
By the end of this chapter, you will be able to describe the complete set of curated datasets that fastai supports, use the facilities of fastai to examine these datasets, and clean up a dataset to eliminate missing and non-numeric values.
Here are the recipes that will be covered in this chapter:
- Getting the complete set of oven-ready fastai datasets
- Examining tabular datasets with fastai
- Examining text datasets with fastai
- Examining image datasets with fastai
- Cleaning up raw datasets with fastai
Technical requirements
Ensure that you have completed the setup sections in Chapter 1, Getting Started with fastai, and that you have a working Gradient instance or Colab setup. Ensure that you have cloned the repository for this book (https://github.com/PacktPublishing/Deep-Learning-with-fastai-Cookbook) and have access to the ch2 folder. This folder contains the code samples that will be described in this chapter.
Getting the complete set of oven-ready fastai datasets
In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.
In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.
How to do it…
In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first three cells of the notebook to load the required libraries, set up the notebook for fastai, and define the MNIST dataset:
Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset
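The figure itself isn't reproduced here. A minimal sketch of what those first cells typically contain is shown below; the exact cell contents in the notebook may differ slightly:

# import the fastai vision libraries
from fastai.vision.all import *

# download and unpack the MNIST dataset the first time this is run,
# then return the path it was unpacked to
path = untar_data(URLs.MNIST)
path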
- Consider the argument to untar_data: URLs.MNIST. What is this? Let's use the ?? shortcut to examine the source code for the URLs object:
Figure 2.2 – Source for URLs
- By looking at the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:
MNIST = f'{S3_IMAGE}mnist_png.tgz'
- Working backward through the source code for the URLs class, we can get the whole URL for MNIST:
S3_IMAGE = f'{S3}imageclas/'
S3 = 'https://s3.amazonaws.com/fast-ai-'
- Putting it all together, we get the URL for URLs.MNIST:
https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
- You can download this file for yourself and untar it. You will see that the directory structure of the untarred package looks like this:
mnist_png
├── testing
│   ├── 0
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   └── 9
└── training
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    └── 9
- In the untarred directory structure, each of the testing and training directories contains a subdirectory for each digit (0 through 9). These digit directories contain the image files for that digit. This means that the label of the dataset – the value that we want the model to predict – is encoded in the directory that the image file resides in.
- Is there a way to get the directory structure of one of the curated datasets without having to determine its URL from the definition of URLs, download the dataset, and unpack it? There is – using path.ls():
Figure 2.3 – Using path.ls() to get the dataset's directory structure
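Figure 2.3 isn't reproduced here; the call is simply the following (the exact paths and ordering in the output will depend on your environment):

# list the top-level contents of the unpacked MNIST dataset;
# expect one entry for training and one for testing
path.ls()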
- This tells us that there are two subdirectories in the dataset: training and testing. You can call ls() to get the structure of the training subdirectory:
Figure 2.4 – The structure of the training subdirectory
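Again, the figure is omitted; a minimal sketch of the call behind it:

# list the training subdirectory; expect one folder per digit, 0 through 9
(path/'training').ls()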
- Now that we have learned how to get the directory structure of the MNIST dataset using the ls() function, what else can we learn from the output of ??URLs?
- First, let's look at the other datasets listed in the output of ??URLs by group, starting with the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):
ADULT_SAMPLE = f'{URL}adult_sample.tgz'
BIWI_SAMPLE = f'{URL}biwi_sample.tgz'
CIFAR = f'{URL}cifar10.tgz'
COCO_SAMPLE = f'{S3_COCO}coco_sample.tgz'
COCO_TINY = f'{S3_COCO}coco_tiny.tgz'
HUMAN_NUMBERS = f'{URL}human_numbers.tgz'
IMDB = f'{S3_NLP}imdb.tgz'
IMDB_SAMPLE = f'{URL}imdb_sample.tgz'
ML_SAMPLE = f'{URL}movie_lens_sample.tgz'
ML_100k = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
MNIST_SAMPLE = f'{URL}mnist_sample.tgz'
MNIST_TINY = f'{URL}mnist_tiny.tgz'
MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
PLANET_SAMPLE = f'{URL}planet_sample.tgz'
PLANET_TINY = f'{URL}planet_tiny.tgz'
IMAGENETTE = f'{S3_IMAGE}imagenette2.tgz'
IMAGENETTE_160 = f'{S3_IMAGE}imagenette2-160.tgz'
IMAGENETTE_320 = f'{S3_IMAGE}imagenette2-320.tgz'
IMAGEWOOF = f'{S3_IMAGE}imagewoof2.tgz'
IMAGEWOOF_160 = f'{S3_IMAGE}imagewoof2-160.tgz'
IMAGEWOOF_320 = f'{S3_IMAGE}imagewoof2-320.tgz'
IMAGEWANG = f'{S3_IMAGE}imagewang.tgz'
IMAGEWANG_160 = f'{S3_IMAGE}imagewang-160.tgz'
IMAGEWANG_320 = f'{S3_IMAGE}imagewang-320.tgz'
- Next, let's look at the datasets in the other categories: image classification datasets, NLP datasets, image localization datasets, audio classification datasets, and medical image classification datasets. Note that the list of curated datasets includes datasets that aren't directly associated with any of the four main application areas supported by fastai. The audio datasets, for example, apply to a use case outside the four main application areas:
# image classification datasets
CALTECH_101 = f'{S3_IMAGE}caltech_101.tgz'
CARS = f'{S3_IMAGE}stanford-cars.tgz'
CIFAR_100 = f'{S3_IMAGE}cifar100.tgz'
CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
FLOWERS = f'{S3_IMAGE}oxford-102-flowers.tgz'
FOOD = f'{S3_IMAGE}food-101.tgz'
MNIST = f'{S3_IMAGE}mnist_png.tgz'
PETS = f'{S3_IMAGE}oxford-iiit-pet.tgz'

# NLP datasets
AG_NEWS = f'{S3_NLP}ag_news_csv.tgz'
AMAZON_REVIEWS = f'{S3_NLP}amazon_review_full_csv.tgz'
AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
DBPEDIA = f'{S3_NLP}dbpedia_csv.tgz'
MT_ENG_FRA = f'{S3_NLP}giga-fren.tgz'
SOGOU_NEWS = f'{S3_NLP}sogou_news_csv.tgz'
WIKITEXT = f'{S3_NLP}wikitext-103.tgz'
WIKITEXT_TINY = f'{S3_NLP}wikitext-2.tgz'
YAHOO_ANSWERS = f'{S3_NLP}yahoo_answers_csv.tgz'
YELP_REVIEWS = f'{S3_NLP}yelp_review_full_csv.tgz'
YELP_REVIEWS_POLARITY = f'{S3_NLP}yelp_review_polarity_csv.tgz'

# Image localization datasets
BIWI_HEAD_POSE = f"{S3_IMAGELOC}biwi_head_pose.tgz"
CAMVID = f'{S3_IMAGELOC}camvid.tgz'
CAMVID_TINY = f'{URL}camvid_tiny.tgz'
LSUN_BEDROOMS = f'{S3_IMAGE}bedroom.tgz'
PASCAL_2007 = f'{S3_IMAGELOC}pascal_2007.tgz'
PASCAL_2012 = f'{S3_IMAGELOC}pascal_2012.tgz'

# Audio classification datasets
MACAQUES = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'

# Medical Imaging datasets
SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'
- Now that we have listed all the datasets defined in URLs, how can we find out more information about them?

a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs. For example, the naming of the datasets is not consistent and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about fastai curated datasets.

b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:
Figure 2.5 – Structure of the training subdirectory
c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory to inspect the directories for the curated dataset you're interested in.

d) For example, let's say untar_data is run with URLs.PETS as the argument:
path = untar_data(URLs.PETS)

e) Here, you can find the dataset in storage/data/oxford-iiit-pet and see its directory structure:
oxford-iiit-pet
├── annotations
│   ├── trimaps
│   └── xmls
└── images
- If you want to see the definition of a function in a notebook, you can run a cell with ??, followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:
Figure 2.6 – Source for Path.ls()
- To see the documentation for any function, you can use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:

Figure 2.7 – Output of doc(Path.ls)
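As a quick reference, here are both calls as you would run them, each in its own notebook cell (output omitted):

# show the source code for Path.ls directly in the notebook
??Path.ls

# show the signature of Path.ls plus links to its source code and documentation
doc(Path.ls)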
You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.
How it works…
As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, if the files for the dataset have not already been copied, these files get downloaded to your filesystem (storage/data in a Gradient instance). The object you get back from untar_data allows you to examine the directory structure of the dataset, and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and also lets you focus your efforts on creating and improving the deep learning model rather than fiddling with the details of ingesting the datasets.
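A minimal sketch of this round trip (the printed path is illustrative – on Gradient it resolves under storage/data, elsewhere under your fastai data directory):

# untar_data returns a pathlib-style Path pointing at the unpacked dataset
path = untar_data(URLs.MNIST)
print(type(path))   # a Path subclass
print(path)         # the directory the dataset was unpacked to
path.ls()           # ready to hand off to the next stage of the pipeline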
There's more…
You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented in https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai – inconsistent documentation. Sometimes, the documentation is complete, but sometimes, it is lacking details, so you will need to look at the source code directly to figure out what's going on, as we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for some details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and keep to the documentation for the current version of fastai: https://docs.fast.ai/.
Examining tabular datasets with fastai
In the previous section, we looked at the whole set of datasets curated by fastai. In this section, we are going to dig into a tabular dataset from the curated list. We will ingest the dataset, look at some example records, and then explore characteristics of the dataset, including the number of records and the number of unique values in each column.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_tabular_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to include the ADULT_SAMPLE dataset featured in this section.
Dataset citation
Ron Kohavi. (1996) Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (http://robotics.stanford.edu/~ronnyk/nbtree.pdf).
How to do it…
In this section, you will be running through the examining_tabular_datasets.ipynb notebook to examine the ADULT_SAMPLE dataset.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
path = untar_data(URLs.ADULT_SAMPLE)
- Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
Figure 2.8 – Output of path.ls()
- The dataset is in the adult.csv file. Run the following cell to ingest this CSV file into a pandas DataFrame:
df = pd.read_csv(path/'adult.csv')
- Run the head() command to get a sample of records from the beginning of the dataset:
Figure 2.9 – Sample of records from the beginning of the dataset
- Run the following command to get the number of records (rows) and fields (columns) in the dataset:
df.shape
- Run the following command to get the number of unique values in each column of the dataset. Can you tell from the output which columns are categorical?
df.nunique()
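One way to answer that question, sketched below: columns with only a handful of distinct values are usually the categorical ones, and cross-checking the dtypes shows which columns hold strings rather than numbers (column names such as workclass and education come from the standard adult.csv layout):

# sort the columns by cardinality; low-cardinality columns such as sex,
# race, workclass, and education are the likely categorical columns
df.nunique().sort_values()

# object-dtype columns hold strings, so they must be encoded before training
df.dtypes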
- Run the following command to get the count of missing values in each column of the dataset. Which columns have missing values?
df.isnull().sum()
- Run the following command to display some sample records from the subset of the dataset for people whose age is less than or equal to 40:
df_young = df[df.age <= 40]
df_young.head()
Congratulations! You have ingested a tabular dataset curated by fastai and done a basic examination of the dataset.
How it works…
The dataset that you explored in this section, ADULT_SAMPLE, is one of the datasets you would have seen in the source for URLs in the previous section. Note that while the source for URLs identifies which datasets are related to image or NLP (text) applications, it does not explicitly identify the tabular or recommender system datasets. ADULT_SAMPLE is one of the datasets listed under main datasets:

Figure 2.10 – Main datasets from the source for URLs
How did I determine that ADULT_SAMPLE was a tabular dataset? First, the paper by Howard and Gugger (https://arxiv.org/pdf/2002.04688.pdf) identifies ADULT_SAMPLE as a tabular dataset. Second, I simply ingested it into a pandas DataFrame to confirm that it is indeed tabular data.
There's more…
What about the other curated datasets that aren't explicitly categorized in the source for URLs? Here's a summary of the datasets listed in the source for URLs under main datasets:
- Tabular:
a) ADULT_SAMPLE
- NLP (text):
a) HUMAN_NUMBERS
b) IMDB
c) IMDB_SAMPLE
- Collaborative filtering:
a) ML_SAMPLE
b) ML_100k
- Image data:
a) All of the other datasets listed in URLs under main datasets.
Examining text datasets with fastai
In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.
Dataset citation
Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).
How to do it…
In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that's been gleaned from good and featured Wikipedia articles.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
path = untar_data(URLs.WIKITEXT_TINY)
- Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
Figure 2.11 – Output of path.ls()
- There are two CSV files that make up this dataset. Let's ingest each of them into a pandas DataFrame, starting with train.csv:
df_train = pd.read_csv(path/'train.csv')
- When you use head() to check the DataFrame, you'll notice that something's wrong – the CSV file has no header with column names, but by default, read_csv assumes the first row is the header. As shown in the following screenshot, the first row of output appears in bold, which indicates that it is being interpreted as a header even though it contains a regular data row:
Figure 2.12 – First record in df_train
- To fix this problem, rerun the read_csv function, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
df_train = pd.read_csv(path/'train.csv', header=None)
- Check head() again to confirm that the problem has been resolved:
Figure 2.13 – Revising the first record in df_train
- Ingest test.csv into a DataFrame using the header=None parameter:
df_test = pd.read_csv(path/'test.csv', header=None)
- We want to tokenize the dataset and transform it into a list of words. Since we want a common set of tokens for the entire dataset, we will begin by combining the test and train DataFrames:
df_combined = pd.concat([df_train,df_test])
- Confirm the shape of the train, test, and combined dataframes – the number of rows in the combined DataFrame should be the sum of the number of rows in the train and test DataFrames:
print("df_train: ",df_train.shape) print("df_test: ",df_test.shape) print("df_combined: ",df_combined.shape)
- Now, we're ready to tokenize the DataFrame. The tokenize_df() function takes the list of columns containing the text we want to tokenize as a parameter. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize using its position rather than its name:
df_tok, count = tokenize_df(df_combined, [df_combined.columns[0]])
- Check the contents of the first few records of df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
Figure 2.14 – The first few records of df_tok
- Check the count for a few sample words to ensure they are roughly what you expected. Pick a very common word, a moderately common word, and a rare word:
print("very common word (count['the']):", count['the']) print("moderately common word (count['prepared']):", count['prepared']) print("rare word (count['gaga']):", count['gaga'])
Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.
How it works…
The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs
Examining image datasets with fastai
In the past two sections, we examined tabular and text datasets and got a taste of the facilities that fastai provides for accessing and exploring these datasets. In this section, we are going to look at image data, using two datasets: the FLOWERS image classification dataset and the BIWI_HEAD_POSE image localization dataset.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_image_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to use the FLOWERS dataset featured in this section.
Dataset citation
Maria-Elena Nilsback, Andrew Zisserman. (2008). Automated flower classification over a large number of classes (https://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf).
I am grateful for the opportunity to use the BIWI_HEAD_POSE dataset featured in this section.
Dataset citation
Gabriele Fanelli, Thibaut Weise, Juergen Gall, Luc Van Gool. (2011). Real Time Head Pose Estimation from Consumer Depth Cameras (https://link.springer.com/chapter/10.1007/978-3-642-23123-0_11). Lecture Notes in Computer Science, vol 6835. Springer, Berlin, Heidelberg https://doi.org/10.1007/978-3-642-23123-0_11.
How to do it…
In this section, you will be running through the examining_image_datasets.ipynb notebook to examine the FLOWERS and BIWI_HEAD_POSE datasets.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Run the following cell to copy the FLOWERS dataset into your filesystem (if it's not already there) and to define the path for the dataset:
path = untar_data(URLs.FLOWERS)
- Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
Figure 2.16 – Output of path.ls()
- Look at the contents of the valid.txt file. This indicates that train.txt, valid.txt, and test.txt contain lists of the image files that belong to each of these subsets of the dataset:
Figure 2.17 – The first few records of valid.txt
- Examine the jpg subdirectory:
(path/'jpg').ls()
- Take a look at one of the image files. Note that the get_image_files() function doesn't need to be pointed to a particular subdirectory – it recursively collects all the image files in a directory and its subdirectories:
img_files = get_image_files(path)
img = PILImage.create(img_files[100])
img
- You should have noticed that the image displayed in the previous step was shown at its native size, which makes it rather big for the notebook. To get the image at a more appropriate size, apply the to_thumb function with the image dimension specified as an argument. Note that you might see a different image when you run this cell:
Figure 2.18 – Applying to_thumb to an image
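The cell behind Figure 2.18 isn't shown in the text; a minimal sketch, assuming a 224-pixel thumbnail size:

# render a downsized copy of the image that fits comfortably in the notebook
img.to_thumb(224)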
- Now, ingest the BIWI_HEAD_POSE dataset:
path = untar_data(URLs.BIWI_HEAD_POSE)
- Examine the path for this dataset:
path.ls()
- Examine the 05 subdirectory:
(path/"05").ls()
- Examine one of the images. Note that you may see a different image:
Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset
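The cell itself isn't reproduced; one way to do this, sketched below (the index is arbitrary – pick any file returned by get_image_files):

# collect the image files under the 05 subdirectory and display one of them
img_files = get_image_files(path/'05')
img = PILImage.create(img_files[0])
img.to_thumb(256)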
- In addition to the image files, this dataset also includes text files that encode the pose depicted in the image. Ingest one of these text files into a pandas DataFrame and display it:

Figure 2.20 – The first few records of one of the position text files
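The exact cell isn't reproduced in the text. A hedged sketch of one way to do this, assuming the pose files are whitespace-separated text files stored alongside the images (the file extension and separator are assumptions – adjust them to what the directory listing actually shows):

# grab the text files under the 05 subdirectory and load the first one as
# whitespace-separated values with no header row
pose_files = get_files(path/'05', extensions='.txt')
df_pose = pd.read_csv(pose_files[0], header=None, sep=r'\s+')
df_pose.head()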
In this section, you learned how to ingest two different kinds of image datasets, explore their directory structure, and examine images from the datasets.
How it works…
You used the same untar_data() function to ingest the curated tabular, text, and image datasets, and the same ls() function to examine the directory structures for all the different kinds of datasets. On top of these common facilities, fastai provides additional convenience functions for examining image data: get_image_files() to collect all the image files in a directory tree starting at a given directory, and to_thumb() to render an image at a size that is suitable for a notebook.
There's more…
In addition to image classification datasets (where the goal of the trained model is to predict the category of what's displayed in the image) and image localization datasets (where the goal is to predict the location in the image of a given feature), the fastai curated datasets also include image segmentation datasets where the goal is to identify the subsets of an image that contain a particular object, including the CAMVID
and CAMVID_TINY
datasets.
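If you want a quick look at one of these, a minimal sketch (assuming CAMVID_TINY follows the same untar_data pattern as the other curated datasets):

# download the small CamVid segmentation sample and inspect its layout;
# expect an images folder, a labels folder with the segmentation masks,
# and a codes.txt file listing the class names
path = untar_data(URLs.CAMVID_TINY)
path.ls()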
Cleaning up raw datasets with fastai
Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.
How to do it…
In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and to replace categorical values with numeric identifiers.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Recall the Examining tabular datasets with fastai section of this chapter. When you checked which columns in the ADULT_SAMPLE dataset had missing values, you found that some columns did indeed have missing values. We are going to identify the columns in ADULT_SAMPLE that have missing values, use the facilities of fastai to apply transformations that deal with the missing values in those columns, and then replace the categorical values with numeric identifiers.
- First, let's ingest the ADULT_SAMPLE curated dataset again:
path = untar_data(URLs.ADULT_SAMPLE)
- Now, create a pandas DataFrame for the dataset and check for the number of missing values in each column. Note which columns have missing values:
df = pd.read_csv(path/'adult.csv')
df.isnull().sum()
- To deal with these missing values (and prepare the categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:

a) procs is the list of transformations that will be applied to the TabularPandas object. Here, we will specify that we want missing values to be filled (FillMissing) and values in categorical columns to be replaced with numeric identifiers (Categorify).

b) dep_var specifies which column is the dependent variable; that is, the target that we ultimately want to predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.

c) cont and cat are lists of the continuous and categorical columns in the dataset, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical columns contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() function (https://docs.fast.ai/tabular.core.html#cont_cat_split) to automatically identify the continuous and categorical columns:

procs = [FillMissing, Categorify]
dep_var = 'salary'
cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
- Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
df_no_missing = TabularPandas(df, procs, cat, cont, y_names=dep_var)
- Apply the show API to df_no_missing to display samples of its contents. Note that the values in the categorical columns are maintained when the object is displayed using show(). What about replacing the categorical values with numeric identifiers? Don't worry – we'll see that result in the next step:
Figure 2.21 – The first few records of df_no_missing
- Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: the switch between the original categorical values and the numeric identifiers is handled elegantly. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns:
Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns
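Neither cell is reproduced in the text; the two calls are along these lines (method names from the TabularPandas API, output omitted):

# readable view: categorical columns are shown with their original string values
df_no_missing.show()

# raw view: items exposes the processed DataFrame, with numeric identifiers
# in the categorical columns
df_no_missing.items.head()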
- Finally, let's confirm that the missing values were handled correctly. As you can see, the two columns that originally had missing values no longer have missing values in df_no_missing:

Figure 2.23 – Missing values in df_no_missing
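A sketch of the check behind Figure 2.23, assuming items exposes the processed pandas DataFrame:

# after FillMissing has run, every column should report zero missing values;
# fastai also adds boolean _na columns flagging rows that were originally missing
df_no_missing.items.isnull().sum()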
By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.
How it works…
In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class provides a lot of value by handling common steps for preparing a tabular dataset, including replacing missing values and dealing with categorical columns. The cont_cat_split() function automatically identifies the continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easier and less error prone than it would be if you had to hand-code all the functions required to accomplish these dataset cleanup steps.