Chapter 2: Exploring and Cleaning Up Data with fastai
In the previous chapter, we got started with the fastai framework by setting up its coding environment, working through a concrete application example (MNIST), and investigating two frameworks with different relationships to fastai: PyTorch and Keras. In this chapter, we are going to dive deeper into an important aspect of fastai: ingesting, exploring, and cleaning up data. In particular, we are going to explore a selection of the datasets that are curated by fastai.
By the end of this chapter, you will be able to describe the complete set of curated datasets that fastai supports, use the facilities of fastai to examine these datasets, and clean up a dataset to eliminate missing and non-numeric values.
Here are the recipes that will be covered in this chapter:
- Getting the complete set of oven-ready fastai datasets
- Examining tabular datasets with fastai
- Examining text datasets with fastai
- Examining image datasets with fastai
- Cleaning up raw datasets with fastai
Technical requirements
Ensure that you have completed the setup sections in Chapter 1, Getting Started with fastai, and that you have a working Gradient instance or Colab setup. Ensure that you have cloned the repository for this book (https://github.com/PacktPublishing/Deep-Learning-with-fastai-Cookbook) and have access to the ch2 folder. This folder contains the code samples that will be described in this chapter.
Getting the complete set of oven-ready fastai datasets
In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.
In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.
How to do it…
In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first three cells of the notebook to load the required libraries, set up the notebook for fastai, and define the MNIST dataset:
Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset
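The figure itself isn't reproduced here. A minimal sketch of what those first cells typically contain is shown below; the exact cell contents in the notebook may differ slightly:

# import the fastai vision libraries
from fastai.vision.all import *

# download and unpack the MNIST dataset the first time this is run,
# then return the path it was unpacked to
path = untar_data(URLs.MNIST)
path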
- Consider the argument to untar_data: URLs.MNIST. What is this? Let's use the ?? shortcut to examine the source code for the URLs object:
Figure 2.2 – Source for URLs
- By looking at the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:
MNIST = f'{S3_IMAGE}mnist_png.tgz'
- Working backward through the source code for the URLs class, we can get the whole URL for MNIST:
S3_IMAGE = f'{S3}imageclas/'
S3 = 'https://s3.amazonaws.com/fast-ai-'
- Putting it all together, we get the URL for URLs.MNIST:
https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
- You can download this file for yourself and untar it. You will see that the directory structure of the untarred package looks like this:
mnist_png
├── testing
│   ├── 0
│   ├── 1
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   └── 9
└── training
    ├── 0
    ├── 1
    ├── 2
    ├── 3
    ├── 4
    ├── 5
    ├── 6
    ├── 7
    ├── 8
    └── 9
- In the untarred directory structure, each of the testing and training directories contains a subdirectory for each digit (0 through 9). These digit directories contain the image files for that digit. This means that the label of the dataset – the value that we want the model to predict – is encoded in the directory that the image file resides in.
- Is there a way to get the directory structure of one of the curated datasets without having to determine its URL from the definition of URLs, download the dataset, and unpack it? There is – using path.ls():
Figure 2.3 – Using path.ls() to get the dataset's directory structure
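Figure 2.3 isn't reproduced here; the call is simply the following (the exact paths and ordering in the output will depend on your environment):

# list the top-level contents of the unpacked MNIST dataset;
# expect one entry for training and one for testing
path.ls()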
- This tells us that there are two subdirectories in the dataset: training and testing. You can call ls() to get the structure of the training subdirectory:
Figure 2.4 – The structure of the training subdirectory
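Again, the figure is omitted; a minimal sketch of the call behind it:

# list the training subdirectory; expect one folder per digit, 0 through 9
(path/'training').ls()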
- Now that we have learned how to get the directory structure of the MNIST dataset using the ls() function, what else can we learn from the output of ??URLs?
- First, let's look at the other datasets listed in the output of ??URLs by group, starting with the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):
ADULT_SAMPLE = f'{URL}adult_sample.tgz'
BIWI_SAMPLE = f'{URL}biwi_sample.tgz'
CIFAR = f'{URL}cifar10.tgz'
COCO_SAMPLE = f'{S3_COCO}coco_sample.tgz'
COCO_TINY = f'{S3_COCO}coco_tiny.tgz'
HUMAN_NUMBERS = f'{URL}human_numbers.tgz'
IMDB = f'{S3_NLP}imdb.tgz'
IMDB_SAMPLE = f'{URL}imdb_sample.tgz'
ML_SAMPLE = f'{URL}movie_lens_sample.tgz'
ML_100k = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
MNIST_SAMPLE = f'{URL}mnist_sample.tgz'
MNIST_TINY = f'{URL}mnist_tiny.tgz'
MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
PLANET_SAMPLE = f'{URL}planet_sample.tgz'
PLANET_TINY = f'{URL}planet_tiny.tgz'
IMAGENETTE = f'{S3_IMAGE}imagenette2.tgz'
IMAGENETTE_160 = f'{S3_IMAGE}imagenette2-160.tgz'
IMAGENETTE_320 = f'{S3_IMAGE}imagenette2-320.tgz'
IMAGEWOOF = f'{S3_IMAGE}imagewoof2.tgz'
IMAGEWOOF_160 = f'{S3_IMAGE}imagewoof2-160.tgz'
IMAGEWOOF_320 = f'{S3_IMAGE}imagewoof2-320.tgz'
IMAGEWANG = f'{S3_IMAGE}imagewang.tgz'
IMAGEWANG_160 = f'{S3_IMAGE}imagewang-160.tgz'
IMAGEWANG_320 = f'{S3_IMAGE}imagewang-320.tgz'
- Next, let's look at the datasets in the other categories: image classification datasets, NLP datasets, image localization datasets, audio classification datasets, and medical image classification datasets. Note that the list of curated datasets includes datasets that aren't directly associated with any of the four main application areas supported by fastai. The audio datasets, for example, apply to a use case outside the four main application areas:
# image classification datasets
CALTECH_101 = f'{S3_IMAGE}caltech_101.tgz'
CARS = f'{S3_IMAGE}stanford-cars.tgz'
CIFAR_100 = f'{S3_IMAGE}cifar100.tgz'
CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
FLOWERS = f'{S3_IMAGE}oxford-102-flowers.tgz'
FOOD = f'{S3_IMAGE}food-101.tgz'
MNIST = f'{S3_IMAGE}mnist_png.tgz'
PETS = f'{S3_IMAGE}oxford-iiit-pet.tgz'

# NLP datasets
AG_NEWS = f'{S3_NLP}ag_news_csv.tgz'
AMAZON_REVIEWS = f'{S3_NLP}amazon_review_full_csv.tgz'
AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
DBPEDIA = f'{S3_NLP}dbpedia_csv.tgz'
MT_ENG_FRA = f'{S3_NLP}giga-fren.tgz'
SOGOU_NEWS = f'{S3_NLP}sogou_news_csv.tgz'
WIKITEXT = f'{S3_NLP}wikitext-103.tgz'
WIKITEXT_TINY = f'{S3_NLP}wikitext-2.tgz'
YAHOO_ANSWERS = f'{S3_NLP}yahoo_answers_csv.tgz'
YELP_REVIEWS = f'{S3_NLP}yelp_review_full_csv.tgz'
YELP_REVIEWS_POLARITY = f'{S3_NLP}yelp_review_polarity_csv.tgz'

# Image localization datasets
BIWI_HEAD_POSE = f"{S3_IMAGELOC}biwi_head_pose.tgz"
CAMVID = f'{S3_IMAGELOC}camvid.tgz'
CAMVID_TINY = f'{URL}camvid_tiny.tgz'
LSUN_BEDROOMS = f'{S3_IMAGE}bedroom.tgz'
PASCAL_2007 = f'{S3_IMAGELOC}pascal_2007.tgz'
PASCAL_2012 = f'{S3_IMAGELOC}pascal_2012.tgz'

# Audio classification datasets
MACAQUES = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'

# Medical Imaging datasets
SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'
- Now that we have listed all the datasets defined in URLs, how can we find out more information about them?

a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs. For example, the naming of the datasets is not consistent and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about fastai curated datasets.

b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:
Figure 2.5 – Structure of the training subdirectory
c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory to inspect the directories for the curated dataset you're interested in.

d) For example, let's say untar_data is run with URLs.PETS as the argument:
path = untar_data(URLs.PETS)

e) Here, you can find the dataset in storage/data/oxford-iiit-pet and see its directory structure:
oxford-iiit-pet
├── annotations
│   ├── trimaps
│   └── xmls
└── images
- If you want to see the definition of a function in a notebook, you can run a cell with ??, followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:
Figure 2.6 – Source for Path.ls()
- To see the documentation for any function, you can use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:

Figure 2.7 – Output of doc(Path.ls)
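As a quick reference, here are both calls as you would run them, each in its own notebook cell (output omitted):

# show the source code for Path.ls directly in the notebook
??Path.ls

# show the signature of Path.ls plus links to its source code and documentation
doc(Path.ls)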
You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.
How it works…
As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, if the files for the dataset have not already been copied, these files get downloaded to your filesystem (storage/data in a Gradient instance). The object you get back from untar_data allows you to examine the directory structure of the dataset, and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and also lets you focus your efforts on creating and improving the deep learning model rather than fiddling with the details of ingesting the datasets.
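A minimal sketch of this round trip (the printed path is illustrative – on Gradient it resolves under storage/data, elsewhere under your fastai data directory):

# untar_data returns a pathlib-style Path pointing at the unpacked dataset
path = untar_data(URLs.MNIST)
print(type(path))   # a Path subclass
print(path)         # the directory the dataset was unpacked to
path.ls()           # ready to hand off to the next stage of the pipeline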
There's more…
You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented in https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of all the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data calls for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai – inconsistent documentation. Sometimes, the documentation is complete, but sometimes, it is lacking details, so you will need to look at the source code directly to figure out what's going on, as we had to do in this section for the curated datasets. This problem is compounded by Google search returning hits for documentation for earlier versions of fastai. If you are searching for some details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and keep to the documentation for the current version of fastai: https://docs.fast.ai/.
Examining tabular datasets with fastai
In the previous section, we looked at the whole set of datasets curated by fastai. In this section, we are going to dig into a tabular dataset from the curated list. We will ingest the dataset, look at some example records, and then explore characteristics of the dataset, including the number of records and the number of unique values in each column.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_tabular_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to include the ADULT_SAMPLE dataset featured in this section.
Dataset citation
Ron Kohavi. (1996) Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (http://robotics.stanford.edu/~ronnyk/nbtree.pdf).
How to do it…
In this section, you will be running through the examining_tabular_datasets.ipynb notebook to examine the ADULT_SAMPLE dataset.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
path = untar_data(URLs.ADULT_SAMPLE)
- Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
Figure 2.8 – Output of path.ls()
- The dataset is in the adult.csv file. Run the following cell to ingest this CSV file into a pandas DataFrame:
df = pd.read_csv(path/'adult.csv')
- Run the head() command to get a sample of records from the beginning of the dataset:
Figure 2.9 – Sample of records from the beginning of the dataset
- Run the following command to get the number of records (rows) and fields (columns) in the dataset:
df.shape
- Run the following command to get the number of unique values in each column of the dataset. Can you tell from the output which columns are categorical?
df.nunique()
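One way to answer that question, sketched below: columns with only a handful of distinct values are usually the categorical ones, and cross-checking the dtypes shows which columns hold strings rather than numbers (column names such as workclass and education come from the standard adult.csv layout):

# sort the columns by cardinality; low-cardinality columns such as sex,
# race, workclass, and education are the likely categorical columns
df.nunique().sort_values()

# object-dtype columns hold strings, so they must be encoded before training
df.dtypes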
- Run the following command to get the count of missing values in each column of the dataset. Which columns have missing values?
df.isnull().sum()
- Run the following command to display some sample records from the subset of the dataset for people whose age is less than or equal to 40:
df_young = df[df.age <= 40]
df_young.head()
Congratulations! You have ingested a tabular dataset curated by fastai and done a basic examination of the dataset.
How it works…
The dataset that you explored in this section, ADULT_SAMPLE, is one of the datasets you would have seen in the source for URLs in the previous section. Note that while the source for URLs identifies which datasets are related to image or NLP (text) applications, it does not explicitly identify the tabular or recommender system datasets. ADULT_SAMPLE is one of the datasets listed under main datasets:

Figure 2.10 – Main datasets from the source for URLs
How did I determine that ADULT_SAMPLE was a tabular dataset? First, the paper by Howard and Gugger (https://arxiv.org/pdf/2002.04688.pdf) identifies ADULT_SAMPLE as a tabular dataset. Second, I simply ingested it into a pandas DataFrame to confirm that it is indeed tabular data.
There's more…
What about the other curated datasets that aren't explicitly categorized in the source for URLs? Here's a summary of the datasets listed in the source for URLs under main datasets:
- Tabular:
a) ADULT_SAMPLE
- NLP (text):
a) HUMAN_NUMBERS
b) IMDB
c) IMDB_SAMPLE
- Collaborative filtering:
a) ML_SAMPLE
b) ML_100k
- Image data:
a) All of the other datasets listed in URLs under main datasets.
Examining text datasets with fastai
In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.
Dataset citation
Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).
How to do it…
In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that's been gleaned from good and featured Wikipedia articles.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
path = untar_data(URLs.WIKITEXT_TINY)
- Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
Figure 2.11 – Output of path.ls()
- There are two CSV files that make up this dataset. Let's ingest each of them into a pandas DataFrame, starting with train.csv:
df_train = pd.read_csv(path/'train.csv')
- When you use head() to check the DataFrame, you'll notice that something's wrong – the CSV file has no header with column names, but by default, read_csv assumes the first row is the header. As shown in the following screenshot, the first row of output appears in bold, which indicates that it is being interpreted as a header even though it contains a regular data row:
Figure 2.12 – First record in df_train
- To fix this problem, rerun the read_csv function, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
df_train = pd.read_csv(path/'train.csv', header=None)
- Check head() again to confirm that the problem has been resolved:
Figure 2.13 – Revising the first record in df_train
- Ingest test.csv into a DataFrame using the header=None parameter:
df_test = pd.read_csv(path/'test.csv', header=None)
- We want to tokenize the dataset and transform it into a list of words. Since we want a common set of tokens for the entire dataset, we will begin by combining the test and train DataFrames:
df_combined = pd.concat([df_train,df_test])
- Confirm the shape of the train, test, and combined dataframes – the number of rows in the combined DataFrame should be the sum of the number of rows in the train and test DataFrames:
print("df_train: ",df_train.shape) print("df_test: ",df_test.shape) print("df_combined: ",df_combined.shape)
- Now, we're ready to tokenize the DataFrame. The tokenize_df() function takes the list of columns containing the text we want to tokenize as a parameter. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize using its position rather than its name:
df_tok, count = tokenize_df(df_combined, [df_combined.columns[0]])
- Check the contents of the first few records of df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
Figure 2.14 – The first few records of df_tok
- Check the count for a few sample words to ensure they are roughly what you expected. Pick a very common word, a moderately common word, and a rare word:
print("very common word (count['the']):", count['the']) print("moderately common word (count['prepared']):", count['prepared']) print("rare word (count['gaga']):", count['gaga'])
Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.
How it works…
The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs
Examining image datasets with fastai
In the past two sections, we examined tabular and text datasets and got a taste of the facilities that fastai provides for accessing and exploring these datasets. In this section, we are going to look at image data, using two datasets: the FLOWERS image classification dataset and the BIWI_HEAD_POSE image localization dataset.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_image_datasets.ipynb notebook in the ch2 directory of your repository.
I am grateful for the opportunity to use the FLOWERS dataset featured in this section.
Dataset citation
Maria-Elena Nilsback, Andrew Zisserman. (2008). Automated flower classification over a large number of classes (https://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf).
I am grateful for the opportunity to use the BIWI_HEAD_POSE dataset featured in this section.
Dataset citation
Gabriele Fanelli, Thibaut Weise, Juergen Gall, Luc Van Gool. (2011). Real Time Head Pose Estimation from Consumer Depth Cameras (https://link.springer.com/chapter/10.1007/978-3-642-23123-0_11). Lecture Notes in Computer Science, vol 6835. Springer, Berlin, Heidelberg https://doi.org/10.1007/978-3-642-23123-0_11.
How to do it…
In this section, you will be running through the examining_image_datasets.ipynb notebook to examine the FLOWERS and BIWI_HEAD_POSE datasets.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Run the following cell to copy the FLOWERS dataset into your filesystem (if it's not already there) and to define the path for the dataset:
path = untar_data(URLs.FLOWERS)
- Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
Figure 2.16 – Output of path.ls()
- Look at the contents of the valid.txt file. This indicates that train.txt, valid.txt, and test.txt contain lists of the image files that belong to each of these subsets of the dataset:
Figure 2.17 – The first few records of valid.txt
- Examine the jpg subdirectory:
(path/'jpg').ls()
- Take a look at one of the image files. Note that the get_image_files() function doesn't need to be pointed to a particular subdirectory – it recursively collects all the image files in a directory and its subdirectories:
img_files = get_image_files(path)
img = PILImage.create(img_files[100])
img
- You should have noticed that the image displayed in the previous step was shown at its native size, which makes it rather big for the notebook. To get the image at a more appropriate size, apply the to_thumb function with the image dimension specified as an argument. Note that you might see a different image when you run this cell:
Figure 2.18 – Applying to_thumb to an image
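The cell behind Figure 2.18 isn't shown in the text; a minimal sketch, assuming a 224-pixel thumbnail size:

# render a downsized copy of the image that fits comfortably in the notebook
img.to_thumb(224)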
- Now, ingest the BIWI_HEAD_POSE dataset:
path = untar_data(URLs.BIWI_HEAD_POSE)
- Examine the path for this dataset:
path.ls()
- Examine the 05 subdirectory:
(path/"05").ls()
- Examine one of the images. Note that you may see a different image:
Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset
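The cell itself isn't reproduced; one way to do this, sketched below (the index is arbitrary – pick any file returned by get_image_files):

# collect the image files under the 05 subdirectory and display one of them
img_files = get_image_files(path/'05')
img = PILImage.create(img_files[0])
img.to_thumb(256)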
- In addition to the image files, this dataset also includes text files that encode the pose depicted in the image. Ingest one of these text files into a pandas DataFrame and display it:

Figure 2.20 – The first few records of one of the position text files
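The exact cell isn't reproduced in the text. A hedged sketch of one way to do this, assuming the pose files are whitespace-separated text files stored alongside the images (the file extension and separator are assumptions – adjust them to what the directory listing actually shows):

# grab the text files under the 05 subdirectory and load the first one as
# whitespace-separated values with no header row
pose_files = get_files(path/'05', extensions='.txt')
df_pose = pd.read_csv(pose_files[0], header=None, sep=r'\s+')
df_pose.head()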
In this section, you learned how to ingest two different kinds of image datasets, explore their directory structure, and examine images from the datasets.
How it works…
You used the same untar_data() function to ingest the curated tabular, text, and image datasets, and the same ls() function to examine the directory structures for all the different kinds of datasets. On top of these common facilities, fastai provides additional convenience functions for examining image data: get_image_files() to collect all the image files in a directory tree starting at a given directory, and to_thumb() to render an image at a size that is suitable for a notebook.
There's more…
In addition to image classification datasets (where the goal of the trained model is to predict the category of what's displayed in the image) and image localization datasets (where the goal is to predict the location in the image of a given feature), the fastai curated datasets also include image segmentation datasets where the goal is to identify the subsets of an image that contain a particular object, including the CAMVID
and CAMVID_TINY
datasets.
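If you want a quick look at one of these, a minimal sketch (assuming CAMVID_TINY follows the same untar_data pattern as the other curated datasets):

# download the small CamVid segmentation sample and inspect its layout;
# expect an images folder, a labels folder with the segmentation masks,
# and a codes.txt file listing the class names
path = untar_data(URLs.CAMVID_TINY)
path.ls()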
Cleaning up raw datasets with fastai
Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.
Getting ready
Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.
How to do it…
In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and to replace categorical values with numeric identifiers.
Once you have the notebook open in your fastai environment, complete the following steps:
- Run the first two cells to import the necessary libraries and set up the notebook for fastai.
- Recall the Examining tabular datasets with fastai section of this chapter. When you checked which columns in the ADULT_SAMPLE dataset had missing values, you found that some columns did indeed have missing values. We are going to identify the columns in ADULT_SAMPLE that have missing values, use the facilities of fastai to apply transformations that deal with the missing values in those columns, and then replace the categorical values with numeric identifiers.
- First, let's ingest the ADULT_SAMPLE curated dataset again:
path = untar_data(URLs.ADULT_SAMPLE)
- Now, create a pandas DataFrame for the dataset and check for the number of missing values in each column. Note which columns have missing values:
df = pd.read_csv(path/'adult.csv')
df.isnull().sum()
- To deal with these missing values (and prepare the categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:

a) procs is the list of transformations that will be applied to the TabularPandas object. Here, we will specify that we want missing values to be filled (FillMissing) and values in categorical columns to be replaced with numeric identifiers (Categorify).

b) dep_var specifies which column is the dependent variable; that is, the target that we ultimately want to predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.

c) cont and cat are lists of the continuous and categorical columns in the dataset, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical columns contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() function (https://docs.fast.ai/tabular.core.html#cont_cat_split) to automatically identify the continuous and categorical columns:

procs = [FillMissing, Categorify]
dep_var = 'salary'
cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
- Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
df_no_missing = TabularPandas(df, procs, cat, cont, y_names=dep_var)
- Apply the show API to df_no_missing to display samples of its contents. Note that the values in the categorical columns are maintained when the object is displayed using show(). What about replacing the categorical values with numeric identifiers? Don't worry – we'll see that result in the next step:
Figure 2.21 – The first few records of df_no_missing
- Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This is an example of a benefit provided by fastai: the switch between the original categorical values and the numeric identifiers is handled elegantly. If you need to see the original values, you can use the show() API, which transforms the numeric values in categorical columns back into their original values, while the items.head() API shows the actual numeric identifiers in the categorical columns:
Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns
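Neither cell is reproduced in the text; the two calls are along these lines (method names from the TabularPandas API, output omitted):

# readable view: categorical columns are shown with their original string values
df_no_missing.show()

# raw view: items exposes the processed DataFrame, with numeric identifiers
# in the categorical columns
df_no_missing.items.head()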
- Finally, let's confirm that the missing values were handled correctly. As you can see, the two columns that originally had missing values no longer have missing values in df_no_missing:

Figure 2.23 – Missing values in df_no_missing
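A sketch of the check behind Figure 2.23, assuming items exposes the processed pandas DataFrame:

# after FillMissing has run, every column should report zero missing values;
# fastai also adds boolean _na columns flagging rows that were originally missing
df_no_missing.items.isnull().sum()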
By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.
How it works…
In this section, you saw several ways that fastai makes it easy to perform common data preparation steps. The TabularPandas class provides a lot of value by handling common steps for preparing a tabular dataset, including replacing missing values and dealing with categorical columns. The cont_cat_split() function automatically identifies the continuous and categorical columns in your dataset. In conclusion, fastai makes the cleanup process easier and less error prone than it would be if you had to hand-code all the functions required to accomplish these dataset cleanup steps.