You're reading from Neural Network Projects with Python

Product type Book

Published in Feb 2019

Publisher Packt

ISBN-13 9781789138900

Pages 308 pages

Edition 1st Edition

Languages

Python

Concepts

Neural Networks

Author (1):

James Loy

Predicting Taxi Fares with Deep Feedforward Networks

In this chapter, we will use a deep feedforward neural network to predict taxi fares in New York City (NYC), given inputs such as the pickup and drop off locations.

In the previous chapter, Chapter 2, Predicting Diabetes with Multilayer Perceptrons, we saw how we can use an MLP with two hidden layers to perform a classification task (whether the patient is at risk of diabetes or not). In this chapter, we will build a deep neural network to perform a regression task: estimating taxi fares. As we shall see, we will need a deeper (that is, more complex) neural network to achieve this goal.

In this chapter, we will cover the following topics:

The motivation for the problem that we're trying to tackle—making accurate predictions of taxi fares
Classification versus regression problems in machine learning
In-depth analysis...

Technical requirements

The key Python libraries required for this chapter are as follows:

matplotlib 3.0.2
pandas 0.23.4
Keras 2.2.4
NumPy 1.15.2
scikit-learn 0.20.2

To download the dataset required for this project, please refer to the instructions at https://raw.githubusercontent.com/PacktPublishing/Neural-Network-Projects-with-Python/master/Chapter03/how_to_download_the_dataset.txt.

The code for this chapter can be found in the GitHub repository for the book at https://github.com/PacktPublishing/Neural-Network-Projects-with-Python.

To download the code onto your computer, run the following git clone command:

$ git clone https://github.com/PacktPublishing/Neural-Network-Projects-with-Python.git

After the process is complete, there will be a folder titled Neural-Network-Projects-with-Python. Enter the folder by running the following command:

$ cd Neural-Network-Projects-with...

Predicting taxi fares in New York City

Yellow cabs in NYC are perhaps one of the most recognizable icons in the city. Tens of thousands of commuters in NYC rely on taxis as a mode of transportation around the bustling metropolis. In recent years, the taxi industry in NYC has been put under increasing pressure from ride-hailing apps such as Uber.

In order to rise to the challenge from ride-hailing apps, yellow cabs in NYC are looking to modernize their operations, and to provide a user experience on par with Uber. In August 2018, the Taxi and Limousine Commission of NYC launched a new app that allows commuters to book a yellow cab from their phones. The app provides fare pricing upfront before they hail a cab. Creating an algorithm to provide fare pricing upfront is no simple feat. The algorithm needs to consider various environmental variables such as traffic conditions, time...

The NYC taxi fares dataset

The dataset that we will be using for this project is the NYC taxi fares dataset, as provided by Kaggle. The original dataset contains a massive 55 million trip records from 2009 to 2015, including data such as the pickup and drop off locations, number of passengers, and pickup datetime. This dataset provides an interesting opportunity to use big datasets in machine learning projects, as well as to visualize geolocation data.

Exploratory data analysis

Let's dive right into the dataset. The instructions to download the NYC taxi fares dataset can be found in the accompanying GitHub repository for the book (refer to the Technical requirements section). Unlike in the previous chapter, Chapter 2, Predicting Diabetes with Multilayer Perceptrons, we're not going to import the entire dataset of 55 million rows. In fact, most computers would not be able to store the entire dataset in memory! Instead, let's just import the first 500,000 rows. Doing this does have its drawbacks, but it is a necessary tradeoff in order to use the dataset in an efficient manner.

To do this, run the read_csv() function with pandas:

import pandas as pd

df = pd.read_csv('NYC_taxi.csv', parse_dates=['pickup_datetime'], nrows=500000)

The parse_dates parameter in read_csv allows pandas to easily...
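
As a small illustration of the two read_csv arguments used above, the following sketch reads from an in-memory CSV instead of NYC_taxi.csv. The three sample rows and the name sample_df are made up for this example; only the nrows and parse_dates arguments come from the chapter's code.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for NYC_taxi.csv.
csv_text = (
    "fare_amount,pickup_datetime,passenger_count\n"
    "4.5,2009-06-15 17:26:21,1\n"
    "16.9,2010-01-05 16:52:16,1\n"
    "5.7,2011-08-18 00:35:00,2\n"
)

# nrows caps how many data rows are read; parse_dates asks pandas to convert
# the named column from plain strings to a datetime dtype.
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["pickup_datetime"],
    nrows=2,
)

print(len(sample_df))
print(sample_df["pickup_datetime"].dtype)
```

Only the first two data rows are loaded, and the pickup_datetime column can then be used with datetime operations rather than string operations.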

Data preprocessing

Recall from the previous project that we had to preprocess the data by removing missing values and other data anomalies. In this project, we'll perform the same process. We'll also perform feature engineering to improve both the quality and quantity of the features before training our neural network on it.

Handling missing values and data anomalies

Let's do a check to see whether there are any missing values in our dataset:

print(df.isnull().sum())

We'll see the following output showing the number of missing values in each column:

We can see that there are only five rows (out of 500,000 rows) with missing data. With a missing data percentage of just 0.001%, it seems that we don't...
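
When only a tiny fraction of rows has gaps, one common option is simply to drop those rows. The sketch below shows this on a made-up four-row DataFrame; the toy data and the variable names are assumptions for illustration, not the chapter's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows.
toy = pd.DataFrame({
    "fare_amount": [4.5, 16.9, np.nan, 7.7],
    "passenger_count": [1, 2, 1, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Fraction of rows that contain at least one missing value.
missing_share = toy.isnull().any(axis=1).mean()

# Drop every row that has a missing value in any column.
cleaned = toy.dropna()

print(missing_per_column)
print(f"{missing_share:.1%} of rows have missing data")
print(len(cleaned))
```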

Feature engineering

As briefly discussed in the previous chapter, Chapter 2, Predicting Diabetes with Multilayer Perceptrons, feature engineering is the process of using one's domain knowledge of the problem to create new features for the machine learning algorithm. In this section, we shall create features based on the date and time of pickup, as well as location-related features.

Temporal features

As we've seen earlier in the section on data visualization, ridership volume depends heavily on the day of the week, as well as the time of day.

Let's look at the format of the pickup_datetime column by running the following code:

print(df.head()['pickup_datetime'])

We get the following output:

Recall that neural...
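
A minimal sketch of how numeric temporal features can be pulled out of a datetime column with the pandas .dt accessor is shown here. The two sample timestamps are made up, and the column names year, month, and day are assumptions; day_of_week and hour match the names used later in the chapter's code.

```python
import pandas as pd

# Hypothetical sample of pickup times.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2009-06-15 17:26:21",
        "2010-01-05 16:52:16",
    ])
})

# Each .dt attribute returns one integer per row.
toy["year"] = toy["pickup_datetime"].dt.year
toy["month"] = toy["pickup_datetime"].dt.month
toy["day"] = toy["pickup_datetime"].dt.day
toy["day_of_week"] = toy["pickup_datetime"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["hour"] = toy["pickup_datetime"].dt.hour

print(toy)
```

Because dayofweek counts from Monday as 0, the result can index directly into a list of day names that starts with Monday.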

Feature scaling

As a final preprocessing step, we should also scale our features before passing them to the neural network. Recall from the previous chapter, Chapter 2, Predicting Diabetes with Multilayer Perceptrons, that scaling ensures that all features have a uniform range of scale. This ensures that features with a greater scale (for example, year has a scale of > 2000) do not dominate features with a smaller scale (for example, passenger count has a scale between 1 and 6).

Before we scale the features in the DataFrame, it's a good idea to keep a copy of the prescaled DataFrame. The values of the features will be transformed after scaling (for example, year 2010 may be transformed to a value such as -0.134 after scaling), which can make it difficult for us to interpret the values. By keeping a copy of the prescaled DataFrame, we can easily reference the original...
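
The idea of keeping a prescaled copy can be sketched as follows. This is a plain pandas standardization (zero mean, unit variance) on made-up data, and it assumes the target column fare_amount is left unscaled; scikit-learn's StandardScaler is a common alternative. The names toy, feature_cols, and df_scaled are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "year": [2009, 2010, 2011, 2012],
    "passenger_count": [1, 2, 1, 6],
    "fare_amount": [4.5, 16.9, 5.7, 7.7],
})

# Keep an untouched copy so original values can still be looked up by index.
df_prescaled = toy.copy()

# Standardize every feature column; the target is assumed to stay as-is.
feature_cols = [c for c in toy.columns if c != "fare_amount"]
features = toy[feature_cols]
df_scaled = toy.copy()
df_scaled[feature_cols] = (features - features.mean()) / features.std(ddof=0)

print(df_scaled)
print(df_prescaled.loc[0, "year"])  # still the original, readable value
```

Since both DataFrames share the same index, a row picked from the scaled data can be traced back to its original values in df_prescaled.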

Deep feedforward networks

So far in this chapter, we have done an in-depth visualization of the dataset, cleaned up the dataset by handling outliers, and also performed feature engineering to create useful features for our model. For the rest of the chapter, we'll talk about the architecture of deep feedforward neural networks, and we'll train one in Keras for a regression task.

Model architecture

In the previous chapter, Chapter 2, Predicting Diabetes with Multilayer Perceptrons, we used a relatively simple MLP as our neural network. For this project, since there are more features, we shall use a deeper model to account for the additional complexity. The deep feedforward network will have four hidden layers. The...
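
To make the shape of such a network concrete, here is a rough NumPy sketch of a forward pass through four ReLU hidden layers followed by a single linear output, which is a typical choice for regression. It is not the chapter's Keras model: the input size, the random weights, and every layer width other than the 128 units of the first layer in the chapter's code are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: only the 128-unit first hidden layer is taken from the
# chapter's code; the input size and remaining widths are placeholders.
n_features = 10
layer_sizes = [n_features, 128, 64, 32, 8, 1]

# One weight matrix and one bias vector per layer transition.
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(x):
    """Propagate a batch through the hidden layers and the output layer."""
    activation = x
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = np.maximum(0.0, activation @ w + b)  # ReLU hidden layer
    return activation @ weights[-1] + biases[-1]  # linear output, one value per row


batch = rng.normal(size=(5, n_features))
predictions = forward(batch)
print(predictions.shape)
```

The output layer has no activation function, so the network can produce any real number, which is what a fare estimate requires.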

Model building in Python using Keras

Now, let's implement our model architecture in Keras. Just like in the previous project, we're going to build our model layer by layer in Keras using the Sequential class.

First, split the DataFrame into the training features (X) and the target variable that we're trying to predict (y):

X = df.loc[:, df.columns != 'fare_amount'] 
y = df.loc[:, 'fare_amount']

Then, split the data into a training set (80%) and a testing set (20%):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Next, let's build our Sequential model in Keras according to the neural network architecture we outlined earlier:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(128, activation= 'relu', input_dim...

Results analysis

Now that we have our neural network trained, let's use it to make some predictions to understand its accuracy.

We can create a function to make a prediction using a random sample from the testing set:

import numpy as np

def predict_random(df_prescaled, X_test, model):
    sample = X_test.sample(n=1, random_state=np.random.randint(low=0, 
                                                              high=10000))
    idx = sample.index[0]
  
    actual_fare = df_prescaled.loc[idx,'fare_amount']
    day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 
                 'Saturday', 'Sunday']
    day_of_week = day_names[df_prescaled.loc[idx,'day_of_week']]
    hour = df_prescaled.loc[idx,'hour']
    predicted_fare = model.predict(sample)[0][0]
    rmse = np.sqrt(np.square(predicted_fare...
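
As a standalone illustration of the error metric, the following sketch computes the root mean squared error (RMSE) with NumPy. The helper name rmse and the sample numbers are assumptions for this example. Note that for a single prediction, the RMSE reduces to the absolute difference between the predicted and actual values.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two equally sized sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(np.square(predicted - actual))))


# One sample: identical to the absolute error |12.5 - 10.0| = 2.5.
single = rmse([10.0], [12.5])

# Two samples: sqrt((3^2 + 4^2) / 2) = sqrt(12.5).
batch = rmse([10.0, 20.0], [13.0, 16.0])

print(single, batch)
```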

Putting it all together

We have accomplished a lot in this chapter. Let's do a quick recap of the code that we have written so far.

We started off by defining a function for preprocessing. This preprocess function takes a DataFrame as an input and performs the following actions:

Removing missing values
Removing outliers in the fare amount
Replacing outliers in passenger count with the mode
Removing outliers in latitude and longitude (that is, only considering points within NYC)

This function is saved under utils.py in our project folder.
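
The four steps listed above could be sketched roughly as follows. This is not the book's utils.py: the function name preprocess_sketch, the numeric bounds, the treatment of a passenger count of 0 as the outlier, and the coordinate column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical bounds: illustrative guesses, not the chapter's thresholds.
FARE_RANGE = (0.0, 100.0)
LON_RANGE = (-75.0, -72.0)
LAT_RANGE = (40.0, 42.0)


def preprocess_sketch(df):
    # 1. Remove rows with missing values.
    df = df.dropna()
    # 2. Remove outliers in the fare amount.
    df = df[df["fare_amount"].between(*FARE_RANGE)].copy()
    # 3. Replace outlying passenger counts (assumed here: 0) with the mode.
    mode = df.loc[df["passenger_count"] > 0, "passenger_count"].mode().iloc[0]
    df.loc[df["passenger_count"] == 0, "passenger_count"] = mode
    # 4. Keep only points inside a rough bounding box around the city.
    for prefix in ("pickup", "dropoff"):
        in_box = (
            df[f"{prefix}_longitude"].between(*LON_RANGE)
            & df[f"{prefix}_latitude"].between(*LAT_RANGE)
        )
        df = df[in_box]
    return df


toy = pd.DataFrame({
    "fare_amount": [4.5, -3.0, 250.0, 8.0, 12.0, np.nan],
    "passenger_count": [1, 1, 2, 0, 1, 1],
    "pickup_longitude": [-73.98, -73.97, -73.99, -73.95, 0.0, -73.96],
    "pickup_latitude": [40.75, 40.76, 40.74, 40.78, 0.0, 40.77],
    "dropoff_longitude": [-73.96, -73.95, -73.97, -73.99, -73.98, -73.94],
    "dropoff_latitude": [40.77, 40.78, 40.73, 40.74, 40.75, 40.76],
})
clean = preprocess_sketch(toy)
print(clean)
```

Of the six toy rows, only two survive: the others are dropped for a missing fare, an out-of-range fare, or a pickup point at (0, 0).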

Next, we also defined a feature_engineer function for feature engineering. This function takes a DataFrame as an input and performs the following actions:

Creating new columns for year, month, day, day of the week, and hour
Creating a new column for the Euclidean distance between the pickup and drop off points
Creating new columns for the...
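
The Euclidean distance feature mentioned in the list above can be sketched as a vectorized NumPy computation. The function name, the coordinate column names, and the sample coordinates are assumptions for illustration. The result is expressed in degrees of latitude and longitude rather than kilometers, so it is only a rough proxy for the distance traveled.

```python
import numpy as np
import pandas as pd


def euclidean_distance(lat1, lon1, lat2, lon2):
    """Straight-line distance between two points, in coordinate degrees."""
    return np.sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)


# Hypothetical pickup and drop off coordinates.
toy = pd.DataFrame({
    "pickup_latitude": [40.0, 40.75],
    "pickup_longitude": [-74.0, -73.98],
    "dropoff_latitude": [40.03, 40.75],
    "dropoff_longitude": [-73.96, -73.98],
})

# Works on whole columns at once, producing one distance per row.
toy["distance"] = euclidean_distance(
    toy["pickup_latitude"],
    toy["pickup_longitude"],
    toy["dropoff_latitude"],
    toy["dropoff_longitude"],
)
print(toy["distance"])
```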

Summary

In this chapter, we designed and implemented a deep feedforward neural network capable of predicting taxi fares in NYC within an error of ~$3.50. We first performed exploratory data analysis, where we gained important insights into the factors that affect taxi fares. With these insights, we then performed feature engineering, which is the process of using your domain knowledge of the problem to create new features. We also introduced the concept of modularizing our functions in machine learning projects, which allowed us to keep our main code relatively short and neat.

We created our deep feedforward neural network in Keras, and trained it using the preprocessed data. Our results show that the neural network is able to make highly accurate predictions for both short and long distance trips. Even for fixed-rate trips, our neural network was able to produce highly accurate...

Questions

When reading a CSV file using pandas, how does pandas recognize that certain columns are datetime?

We can use the parse_dates argument when reading the CSV file using the read_csv function in pandas.

How can we filter a DataFrame to select only the rows within a certain range of values, assuming that we have a DataFrame, df, and we want to select rows with height values between 160 and 180?

We can filter a DataFrame like so:

df = df[(df['height'] >= 160) & (df['height'] <= 180)]

This returns a new DataFrame containing only the rows whose height values lie between 160 and 180, inclusive.
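
For comparison, here is a small sketch on made-up data showing the boolean-mask filter next to the equivalent Series.between call, which includes both endpoints by default. The DataFrame name people and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data.
people = pd.DataFrame({"height": [150, 160, 172, 180, 195]})

# Each comparison must be wrapped in parentheses, because & binds more
# tightly than >= and <= in Python.
mask_filtered = people[(people["height"] >= 160) & (people["height"] <= 180)]

# Equivalent filter using between().
between_filtered = people[people["height"].between(160, 180)]

print(mask_filtered["height"].tolist())
```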

How can we use code modularization to organize our neural network projects?

We can compartmentalize our functions using modular pieces of code. For example, in this project, we defined the preprocess and feature_engineer functions in utils.py, which allows us to focus on...