Diabetes Onset Detection

The far-ranging developments in healthcare over the past few years have produced a huge collection of data that can be used for analysis. Using neural networks, we can now often predict the onset of an illness before it fully develops. In this chapter, we are going to use a deep neural network and a grid search to predict the onset of diabetes for a set of patients. We will learn about deep neural networks, the hyperparameters that are used to optimize them, and how to choose the right value for each.

We will cover the following topics in this chapter:

  • Detecting diabetes using a deep learning grid search
  • Introduction to the dataset
  • Building a Keras model
  • Performing a grid search using scikit-learn
  • Reducing overfitting using dropout regularization
  • Finding the optimal hyperparameters
  • Generating predictions using optimal hyperparameters

Detecting diabetes using a grid search

We will be predicting diabetes for a set of patients by using a deep learning algorithm, which we will optimize with a grid search to find the optimal hyperparameters. We are going to carry out this project in a Jupyter Notebook, as follows:

  1. Start by opening up the Command Prompt in Windows, or a terminal in Linux systems, and navigate to the project directory using the cd command.
  2. Our next step is to open Jupyter Notebook by typing the following command:
jupyter notebook
Alternatively, you can use the jupyter lab command to open an instance of JupyterLab, an improved interface to the same notebooks.
  3. Once the Notebook is open, rename the unnamed file to Deep Learning Grid Search.
  4. We will then import our packages using general import statements and print the version numbers; a sketch of such an import cell follows this list.
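As a minimal sketch, an import cell of this kind might look like the following; the exact package list is an assumption based on the libraries used later in the chapter, and the book's screenshot may differ:

# Sketch of an import cell; the exact packages are assumed from
# the libraries used later in this chapter.
import sys
import numpy as np
import pandas as pd
import sklearn
import keras

# Print version numbers for reproducibility.
print('Python: {}'.format(sys.version))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('scikit-learn: {}'.format(sklearn.__version__))
print('Keras: {}'.format(keras.__version__))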

Keras has two options...

Introduction to the dataset

Our next step is to import the Pima Indians diabetes dataset, which contains the details of 768 patients:

  1. The dataset that we need can be found at https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv. We can import it by using the following line:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
  2. If we navigate to the preceding URL, we can see a lot of raw information. Once we have imported the dataset, we have to define the column names. We will do this using the following line of code; a sketch of loading the file with these names appears after this list:
names = ['n_pregnant', 'glucose_concentration', 'blood_pressure (mm Hg)', 'skin_thickness (mm)', 'serum_insulin (mu U/ml)', 'BMI', 'pedigree_function', 'age', 'class']
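With the URL and column names in place, the file can be loaded into a DataFrame. As a minimal sketch, assuming pandas is used as the loader (the book's exact code may differ):

import pandas as pd

# The raw CSV has no header row, so we pass the column names explicitly.
# url and names are the variables defined in the preceding steps.
df = pd.read_csv(url, names=names)
print(df.shape)  # expected (768, 9): 768 patients, 8 features plus the class label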

As can...

Building our Keras model

We'll now start building our Keras model, a deep neural network:

  1. The first thing that we're going to do is import the necessary packages and layers. We will do that by running the following lines of code:
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Adam

In the preceding code snippet, GridSearchCV is the function we will use to perform the grid search, and KFold will be used for the k-fold cross-validation. KerasClassifier is the wrapper that lets scikit-learn drive a Keras model, and Adam is the optimizer that we will be using for this model; the rest of the imports are general building blocks used to define the model.

  2. Let's start by defining the model. We will use a create_model() function, because...
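As a minimal sketch, a create_model() function for this dataset might look like the following; the layer sizes, activations, and learning rate shown are illustrative assumptions, not the book's exact choices:

# Sketch of a model-building function; layer sizes, activations, and
# the learning rate are illustrative assumptions.
def create_model():
    model = Sequential()
    # 8 input features: the dataset columns minus the class label
    model.add(Dense(8, input_dim=8, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # binary output: diabetic or not
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001),
                  metrics=['accuracy'])
    return model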

Performing a grid search using scikit-learn

It's now time to prepare our grid search algorithm. We will follow a step-by-step process to make it easier to understand and execute:

  1. The first thing that we will do is copy the create_model() function, which we created in the Building a Keras model section, and paste it into a new cell.
  2. Now, we will define a random seed through NumPy, which makes our results reproducible: training involves random initialization of the weights and random divisions of the data into groups, and fixing the seed gives us the same initialization and the same divisions on every run. This can be done by adding a few lines of code above the create_model() function; a sketch of these lines appears after this list.
  3. Our next step is to initialize the KerasClassifier that we imported...
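A minimal sketch of how these steps might fit together, assuming X_standardized and y hold the preprocessed features and labels from the earlier cells (the grid values shown are illustrative assumptions):

# Fix the random seed so weight initialization and data splits repeat exactly.
seed = 6
np.random.seed(seed)

# Wrap the Keras model so scikit-learn's grid search can drive it.
model = KerasClassifier(build_fn=create_model, verbose=0)

# Illustrative search grid; the book's exact values may differ.
param_grid = {'batch_size': [10, 20, 40], 'epochs': [50, 100]}
grid = GridSearchCV(estimator=model, param_grid=param_grid,
                    cv=KFold(n_splits=3, shuffle=True, random_state=seed))
grid_results = grid.fit(X_standardized, y)
print('Best: {} using {}'.format(grid_results.best_score_, grid_results.best_params_))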

Reducing overfitting using dropout regularization

We will now use the information we gained in the Performing a grid search using scikit-learn section to optimize other aspects of our model. It looks like we might be overfitting the data slightly, as we are getting better results on our training data than on our testing data. We're now going to look at adding dropout regularization:

  1. Our first step is to copy the code from the grid search cell that we ran in the previous section and paste it into a fresh cell. We will keep the general structure of the code and experiment with some of its parameters.
  2. We will then import the Dropout function from keras.layers using the following line:
from keras.layers import Dropout
  3. We will now convert the learning rate into a variable by defining it in the Adam optimizer code block. We will use learn_rate as...
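A minimal sketch of the modified create_model(), with learn_rate and dropout_rate exposed as tunable parameters (the dropout placement, default values, and grid candidates are illustrative assumptions):

# Sketch: expose the learning rate and dropout rate as tunable parameters.
# Defaults and dropout placement are illustrative assumptions.
def create_model(learn_rate=0.001, dropout_rate=0.0):
    model = Sequential()
    model.add(Dense(8, input_dim=8, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(dropout_rate))  # randomly drop units to reduce overfitting
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=learn_rate),
                  metrics=['accuracy'])
    return model

param_grid = {'learn_rate': [0.001, 0.01, 0.1], 'dropout_rate': [0.0, 0.1, 0.2]}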

Finding the optimal hyperparameters

We're now going to optimize the weight initialization and activation function applied to each of these neurons:

  1. To do this, we will first copy the code from the cell that we ran in the previous Reducing overfitting using dropout regularization section, and paste it into a new one. In this section, we won't be changing the general structure of the code; instead, we will be modifying some parameters and optimizing the search.
  2. We now know the best learn_rate and dropout_rate, so we are going to hard-code these values and remove them from the search grid. We are also going to remove the Dropout layers that we added in the previous section. We will set the learning rate of the Adam optimizer to 0.001, as this is the best value that we found.
  3. Since we are trying to optimize the activation and init variables, we will define them in the create_model() function...
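As a minimal sketch, assuming the grid now searches over the activation and init parameters (the candidate values listed are illustrative assumptions):

# Sketch: expose the activation function and weight initializer as
# tunable parameters; candidate values are illustrative assumptions.
def create_model(activation='relu', init='normal'):
    model = Sequential()
    model.add(Dense(8, input_dim=8, kernel_initializer=init, activation=activation))
    model.add(Dense(4, kernel_initializer=init, activation=activation))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001),
                  metrics=['accuracy'])
    return model

param_grid = {'activation': ['relu', 'tanh', 'linear'],
              'init': ['normal', 'uniform', 'zero']}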

Optimizing the number of neurons

Let's now move on to tuning the number of neurons in each of the hidden layers. Since we are following the same steps as in the preceding sections, we will move through them quickly and recap with a code snippet at the end. So, let's get started with the following steps:

  1. We will start by copying the code from the cell used in the Finding the optimal hyperparameters section, and paste it in a new cell. In this new cell, we will play around with the number of neurons by modifying some of the variables.
  2. We will convert the total number of neurons present in each hidden layer into variables, such as neuron1 and neuron2. We will also define these variables in the create_model() function, so that they are called every time we execute it.
  3. We will also change the kernel_initializer and activation values to tanh and normal, since those were the...
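A minimal sketch of this cell, with neuron1 and neuron2 as the per-layer neuron counts being tuned and the tanh/normal settings from the text hard-coded (the candidate counts and defaults are illustrative assumptions):

# Sketch: tune the number of neurons in each hidden layer, with the
# best activation (tanh) and initializer (normal) hard-coded.
# The candidate neuron counts are illustrative assumptions.
def create_model(neuron1=8, neuron2=4):
    model = Sequential()
    model.add(Dense(neuron1, input_dim=8, kernel_initializer='normal',
                    activation='tanh'))
    model.add(Dense(neuron2, kernel_initializer='normal', activation='tanh'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=Adam(lr=0.001),
                  metrics=['accuracy'])
    return model

param_grid = {'neuron1': [4, 8, 16], 'neuron2': [2, 4, 8]}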

Generating predictions using optimal hyperparameters

We now know the optimal hyperparameters from our grid search. We will use these to predict the onset of diabetes for the patients in our dataset. To do this, we will carry out the following steps:

  1. We will predict whether diabetes will occur for every example in the dataset by using the predict() function, as shown in the following code snippet:
# generate predictions with optimal hyperparameters
y_pred = grid.predict(X_standardized)
  2. We will then check the .shape attribute to see what the predictions look like.

We can see that there are 392 predictions, with a numerical value for each.

  3. Let's print the first five and see what they look like.
  4. We are now going to produce a classification report and get an...
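A minimal sketch of these inspection steps, assuming grid, X_standardized, and y are the objects defined in the earlier cells:

from sklearn.metrics import accuracy_score, classification_report

# generate predictions with optimal hyperparameters
y_pred = grid.predict(X_standardized)
print(y_pred.shape)  # e.g. (392,) once rows with missing values have been dropped
print(y_pred[:5])    # inspect the first five predictions

# Summarize accuracy, precision, recall, and f1-score against the true labels.
print(accuracy_score(y, y_pred))
print(classification_report(y, y_pred))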

Summary

In this chapter, we built a deep neural network in Keras and found the optimal hyperparameters using scikit-learn's grid search, learning how to optimize a network by tuning its hyperparameters along the way. Note that the results you get might not be identical to ours, but as long as you get similar predictions, you can consider the model a success. When you start training on new data, or if you're trying to address a different problem with a different dataset, you will have to go through this process again. We also learned about deep learning and hyperparameter optimization, and explored how to apply them to predict the onset of diabetes for a dataset of patients.

In the next chapter, we will look at how to classify DNA using machine learning algorithms.
