You're reading from  Neural Network Projects with Python

Product type: Book
Published in: Feb 2019
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789138900
Edition: 1st Edition
Author (1)
James Loy

James Loy has more than five years of expert experience in data science in the finance and healthcare industries. He has worked with the largest bank in Singapore to drive innovation and improve customer loyalty through predictive analytics. He also has experience in the healthcare sector, where he applied data analytics to improve decision-making in hospitals. He has a master's degree in computer science from Georgia Tech, with a specialization in machine learning. His research interests include deep learning and applied machine learning, as well as developing computer-vision-based AI agents for automation in industry. He writes on Towards Data Science, a popular machine learning website with more than 3 million views per month.

Predicting Diabetes with Multilayer Perceptrons

In the first chapter, we went through the inner workings of a neural network, how to build our own neural network using Python libraries such as Keras, as well as the end-to-end machine learning workflow. In this chapter, we will apply what we have learned to build a multilayer perceptron (MLP) that can predict whether a patient is at risk of diabetes. This marks the first neural network project that we will build from scratch.

In this chapter, we will cover the following topics:

  • Understanding the problem that we're trying to tackle—diabetes mellitus
  • How AI is being used in healthcare today, and how AI will continue to transform healthcare
  • An in-depth analysis of the diabetes mellitus dataset, including data visualization using Python
  • Understanding MLPs, and the model architecture that we will use
  • A step-by-step guide...

Technical requirements

The key Python libraries required for this chapter are as follows:

  • matplotlib 3.0.2
  • pandas 0.23.4
  • Keras 2.2.4
  • NumPy 1.15.2
  • seaborn 0.9.0
  • scikit-learn 0.20.2

The code for this chapter can be found in the GitHub repository for the book at https://github.com/PacktPublishing/Neural-Network-Projects-with-Python.

To download the code onto your computer, you can run the following git clone command:

$ git clone https://github.com/PacktPublishing/Neural-Network-Projects-with-Python.git

After the process is complete, there will be a folder titled Neural-Network-Projects-with-Python. Enter the folder by running this command:

$ cd Neural-Network...

Diabetes – understanding the problem

Diabetes is a chronic medical condition that is associated with elevated blood sugar levels in the body. Diabetes often leads to cardiovascular disease, stroke, kidney damage, and long-term damage to the limbs and eyes.

It is estimated that there are 415 million people in the world suffering from diabetes, with up to 5 million deaths every year attributed to diabetes-related complications. In the United States, diabetes is estimated to be the seventh leading cause of death. Clearly, diabetes is a serious concern for the well-being of modern society.

Diabetes can be divided into two subtypes: type 1 and type 2. Type 1 diabetes results from the body's inability to produce sufficient insulin. Type 1 diabetes is relatively rare compared to type 2 diabetes, and it only accounts for approximately 5% of diabetes....

AI in healthcare

Beyond predicting diabetes using machine learning, the field of healthcare in general is ripe for disruption by AI. According to a study by Accenture, the market for AI in healthcare is set for explosive growth, with an estimated compound annual growth rate of 40% through 2021. This significant growth is driven by a proliferation of AI and tech companies in healthcare.

Apple's chief executive officer, Tim Cook, believes that Apple can make significant contributions in healthcare. Apple's vision for disrupting healthcare can be exemplified by its developments in wearable technology. In 2018, Apple announced a new generation of smartwatches with active monitoring of cardiovascular health. Apple's smartwatches can now conduct electrocardiography in real time, and even warn you when your heart rate becomes abnormal, which is an early sign of cardiovascular...

The diabetes mellitus dataset

The dataset that we will be using for this project comes from the Pima Indians Diabetes dataset, as provided by the National Institute of Diabetes and Digestive and Kidney Diseases (and hosted by Kaggle).

The Pima Indians are a group of Native Americans living in Arizona, and they are a highly studied group of people due to their genetic predisposition to diabetes. It is believed that the Pima Indians carry a gene that allows them to survive long periods of starvation. This thrifty gene allowed the Pima Indians to store in their bodies whatever glucose and carbohydrates they ate, which was genetically advantageous in an environment where famines were common.

However, as society modernized and the Pima Indians began to change their diet to one of processed food, the rate of type 2 diabetes among them began to increase as well. Today, the incidence...

Exploratory data analysis

Let's dive into the dataset to understand the kind of data we are working with. We import the dataset into pandas:

import pandas as pd

df = pd.read_csv('diabetes.csv')

Let's take a quick look at the first five rows of the dataset by calling the df.head() command:

print(df.head())

We get the following output:

It looks like there are nine columns in the dataset, which are as follows:

  • Pregnancies: Number of previous pregnancies
  • Glucose: Plasma glucose concentration
  • BloodPressure: Diastolic blood pressure
  • SkinThickness: Skin fold thickness measured from the triceps
  • Insulin: Blood serum insulin concentration
  • BMI: Body mass index
  • DiabetesPedigreeFunction: A summarized score that indicates the genetic predisposition of the patient for diabetes, as extrapolated from the patient's family record for diabetes
  • Age: Age in years
  • Outcome...

Data preprocessing

In the previous section, Exploratory data analysis, we discovered that certain columns contain 0 values where a 0 is not a plausible measurement, which suggests that these entries are actually missing values. We also saw that the variables have different scales, which can negatively impact model performance. In this section, we will perform data preprocessing to handle these issues.
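Both issues can be illustrated on a small made-up table. This is only a minimal sketch: the values below are hypothetical and are not rows from diabetes.csv, and the choice of columns in which a 0 is treated as missing, as well as filling gaps with the column mean, are assumptions made for illustration rather than the chapter's exact steps.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT rows from diabetes.csv); 0 stands in for "not recorded".
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 89.0],
    "BMI": [33.6, 0.0, 23.3, 28.1],
    "Age": [50.0, 31.0, 32.0, 21.0],
})

# Assumption: a 0 is implausible in these columns, so treat it as a missing value.
zero_as_missing = ["Glucose", "BMI"]
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)
print(toy.isnull().sum())  # the gaps are now visible to pandas

# One simple option: fill each gap with the mean of the observed values in that column.
toy = toy.fillna(toy.mean())

# Standardize so that every column has a mean of 0 and unit (population) standard deviation.
scaled = (toy - toy.mean()) / toy.std(ddof=0)
print(scaled.round(2))
```

Replacing the zeros with NaN first matters because a literal 0 would otherwise pull the column mean down and distort both the imputed value and the scaling.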

Handling missing values

First, let's call the isnull() function to check whether there are any missing values in the dataset:

print(df.isnull().any())

We'll see the following output:

It seems like there are no missing values in the dataset, but are we sure? Let's get a statistical summary of the dataset to investigate further:

print(df.describe())

The output is as...

MLPs

Now that we have completed exploratory data analysis and data preprocessing, let's turn our attention towards designing the neural network architecture. In this project, we will be using MLPs.

An MLP is a class of feedforward neural network, and it distinguishes itself from the single-layer perceptron that we discussed in Chapter 1, Machine Learning and Neural Networks 101, by having at least one hidden layer, with each layer activated by a non-linear activation function. This multilayer architecture and non-linear activation allow MLPs to produce non-linear decision boundaries, which is crucial in multi-dimensional real-world datasets such as the Pima Indians Diabetes dataset.
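To make the idea of stacked, non-linearly activated layers concrete, here is a minimal NumPy sketch of a forward pass through an MLP with 8 inputs, two hidden layers of 32 and 16 nodes, and a single output. The ReLU and sigmoid activations, the random untrained weights, and the helper names are assumptions chosen for illustration; this is not the Keras model built later in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: 8 input features -> 32 -> 16 hidden nodes -> 1 output node.
sizes = [8, 32, 16, 1]
# Random, untrained weights and zero biases, purely for illustration.
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Feed a batch of shape (n_samples, 8) through the network."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)  # non-linear activation on each hidden layer
    return sigmoid(a @ weights[-1] + biases[-1])  # output squashed into (0, 1)

batch = rng.normal(size=(5, 8))  # 5 hypothetical, already standardized patients
probs = forward(batch)
print(probs.shape)  # (5, 1)
```

Without the non-linear activations, the three matrix multiplications would collapse into a single linear map, and the network could only draw a linear decision boundary.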

Model architecture

...

Model building in Python using Keras

We're finally ready to build and train our MLP in Keras.

Model building

As we mentioned in Chapter 1, Machine Learning and Neural Networks 101, the Sequential() class in Keras allows us to construct a neural network like Lego, stacking layers on top of one another.

Let's create a new instance of the Sequential class:

from keras.models import Sequential

model = Sequential()

Next, let's stack our first hidden layer. The first hidden layer will have 32 nodes, and the input dimensions will be 8 (because there are 8 columns in X_train). Notice that for the very first hidden layer, we need to indicate the input dimensions. Subsequently, Keras will take care of the size compatibility of other hidden...
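The size bookkeeping that Keras handles can also be worked out by hand. The following is a minimal sketch, assuming fully connected layers with one bias per node, a second hidden layer of 16 nodes, and a single output node; dense_param_count is a hypothetical helper written for this example, not a Keras function.

```python
def dense_param_count(n_inputs, n_nodes):
    # One weight per (input, node) pair, plus one bias per node.
    return n_inputs * n_nodes + n_nodes

# 8 input columns -> 32 nodes -> 16 nodes -> 1 output node (output size is an assumption).
layer_sizes = [8, 32, 16, 1]
counts = [dense_param_count(i, o) for i, o in zip(layer_sizes[:-1], layer_sizes[1:])]
print(counts)       # [288, 528, 17]
print(sum(counts))  # 833
```

Each layer's input size is simply the previous layer's node count, which is why only the first layer needs to be told the input dimensions explicitly.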

Results analysis

Having successfully trained our MLP, let's evaluate our model based on the testing accuracy, confusion matrix, and receiver operating characteristic (ROC) curve.
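The three measures named above can be sketched with scikit-learn on a handful of made-up predictions. The labels and probabilities below are hypothetical and are not the chapter's results, and the 0.5 cut-off used to turn probabilities into class predictions is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

# Hypothetical labels and predicted probabilities (NOT the chapter's results).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.70])

# Assumption: predict the positive class when the probability is at least 0.5.
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points for plotting the ROC curve
auc = roc_auc_score(y_true, y_prob)

print(acc)  # 0.75
print(cm)   # [[3 1]
            #  [1 3]]
print(auc)  # 0.875
```

Accuracy and the confusion matrix depend on the chosen cut-off, whereas the ROC curve and its area summarize performance across all possible cut-offs.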

Testing accuracy

We can evaluate our model on the training set and testing set using the evaluate() function:

scores = model.evaluate(X_train, y_train)
print("Training Accuracy: %.2f%%\n" % (scores[1]*100))

scores = model.evaluate(X_test, y_test)
print("Testing Accuracy: %.2f%%\n" % (scores[1]*100))

We get the following result:

The accuracy is 91.85% and 78.57% on the training set and testing set respectively. The difference in accuracy between the training and testing set isn't surprising since the model was trained on...

Summary

In this chapter, we have designed and implemented an MLP that is capable of predicting the onset of diabetes with ~80% accuracy.

We first performed exploratory data analysis where we looked at the distribution of each variable, as well as the relationship between each variable and the target variable. We then performed data preprocessing to remove missing data and we also standardized our data such that each variable has a mean of 0 with unit standard deviation. Finally, we split our original data randomly into a training set, a validation set, and a testing set.

We then looked at the architecture of the MLP that we used, which consists of 2 hidden layers, with 32 nodes in the first hidden layer and 16 nodes in the second hidden layer. We then implemented this MLP in Keras using the sequential model, which allows us to stack layers on one another. We then trained our MLP...

Questions

  1. How do we plot a histogram of each variable in a pandas DataFrame, and why are histograms useful?

We can plot a histogram by calling the df.hist() function built into the pandas DataFrame class. A histogram provides an accurate representation of the distribution of our numerical data.

  2. How do we check for missing values (NaN values) in a pandas DataFrame?

We can call the df.isnull().any() function to easily check whether there are any null values in each column of the dataset.

  3. Besides NaN values, what other kinds of missing values could appear in a dataset?

Missing values can also appear in the form of 0 values. Missing values are often recorded as 0 in a dataset due to certain issues during data collection: perhaps the equipment was faulty, or there were other issues hindering data collection.

  4. Why is it crucial to remove missing values in a dataset before...