Machine learning, a subset of artificial intelligence (AI), has taken the world by storm. Within the healthcare domain, it is possible to see how machine learning can make manual processes easier, providing benefits for patients, providers, and pharmaceutical companies alike. Google, for example, has developed a machine learning algorithm that can identify cancerous tumors on mammograms. Similarly, Stanford University has developed a deep learning algorithm to identify skin cancer.
In this chapter, we will discuss how one can use machine learning to detect breast cancer. We will look at the following topics:
- Objective of this project
- Detecting breast cancer with SVM and KNN
- Data visualization with machine learning
- Relationships between variables
- Understanding machine learning algorithms
- Training models
- Predictions in machine learning
The main objective of this chapter is to see how machine learning helps detect cancer through the SVM and KNN models. The following screenshot is an example of the final output that we are trying to achieve in this project:
We will receive the information shown in the preceding screenshot for approximately 700 cells in our dataset. This will include factors such as clump thickness, uniform cell size, and mitoses, all of which are properties that would be valuable for a pathologist. In the screenshot, you can see that the class of this cell is 4, which means that it is malignant; so, this particular cell is cancerous. A class of 2, on the other hand, would be benign, or healthy.
Now, let's take a look at the models that we will be training as the chapter progresses in the following screenshot:
Based on the cell's information, both models have predicted that the cell is cancerous, or malignant. In this project, we will go through the steps required to achieve this goal. We will start by downloading and installing packages with Anaconda, we will move on to starting a Jupyter Notebook, and then you will learn how to program these machine learning models in Python.
In this section, we will take a look at how to detect breast cancer with a support vector machine (SVM). We're also going to throw in a k-nearest neighbors (KNN) classification algorithm, and compare the results. We will be using the Anaconda distribution, which is a great way to download and install Python, since it ships with conda, a package manager that makes downloading and installing the necessary packages easy and straightforward. With conda, we're also going to install the Jupyter Notebook, which we will use to program in Python. This will make sharing code and collaborating across different platforms much easier.
Now, let's go through the steps required to use Anaconda, as follows:
- Start by downloading Anaconda, and make sure that conda is in your Path variables.
- Open up a Command Prompt, which is the best way to use conda, and navigate to the folder that you want to work in. If conda is in your Path variables, you can simply type conda install, followed by whichever package you need. We're going to be using numpy, so we will type that, as you can see in the following screenshot:
If you get an error saying that the command conda was not found, it means that conda isn't in the Path variables. Edit the environment variables and add the folder that contains conda to the Path.
- To start the Jupyter Notebook, simply type jupyter notebook and press Enter. If conda is in the path, Jupyter will be found as well, because it's located in the same folder. It will start to load up, as shown in the following screenshot:
The folder that we're in when we type jupyter notebook is where it will open up in the web browser.
- After that, click on New, and then select Python [default]. Using Python 2.7 would be preferable, as it seems to be more of a standard in the industry.
- To check that we all have the same versions, we will conduct an import step.
- Rename the notebook to Breast Cancer Detection with Machine Learning.
- We will start by importing sys, so that we can check whether we're using Python 2.7.
We will need to import numpy for computational operations and arrays, matplotlib for plotting, pandas to handle the datasets, and sklearn, to get the machine learning packages.
- We will import pandas and the other packages, and print their versions. We can view the changes in the following screenshot:
To run the cell in Jupyter Notebook, simply press Shift + Enter. A number will pop up when it completes, and it'll print out the statements. Once again, if we encounter errors in this step and we are unable to import any of the preceding packages, we have to exit the Jupyter Notebook, type
conda install, and mention whichever package we are missing in the Terminal. These will then be installed. The necessary packages and versions are shown as follows:
- Python 2.7
- 1.14 for NumPy
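A minimal sketch of the version-check cell (assuming the listed packages are already installed in your environment; the exact version numbers you see will depend on your setup):

```python
# Print the versions of Python and each library, so that everyone
# following along can confirm they have a comparable setup.
import sys
import numpy
import pandas
import matplotlib
import sklearn

print('Python: {}'.format(sys.version))
print('numpy: {}'.format(numpy.__version__))
print('pandas: {}'.format(pandas.__version__))
print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
```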
The following screenshot illustrates how to import these libraries in the specific way that we're going to use them in this project:
In the following steps, we will look at how to import the different arguments in these libraries:
- First, we will import NumPy, using the command import numpy as np.
- Next, we will import the various classes and functions in sklearn, namely, neighbors, svm, model_selection, and metrics.
- From neighbors, we will import KNeighborsClassifier, which will be used as our KNN model.
- From sklearn.svm, we will import the support vector classifier (SVC).
- We're also going to import model_selection, so that we can use both KNN and SVC in one step.
- We will then get some metrics, in which we will import the classification_report, as well as the accuracy_score.
- From pandas, we need to import plotting, which is the scatter_matrix. This will be useful when we're exploring some data visualizations, before diving into the actual machine learning.
- Finally, from matplotlib, we will import pyplot, which we will refer to as plt.
- Now, press Shift + Enter, and make sure that all of the arguments import.
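Put together, the import cell described above looks roughly like this (a sketch; note that in recent versions of pandas, scatter_matrix lives in pandas.plotting rather than pandas.tools.plotting):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import model_selection                       # train/test split, k-fold CV
from sklearn.neighbors import KNeighborsClassifier        # our KNN model
from sklearn.svm import SVC                               # support vector classifier
from sklearn.metrics import classification_report, accuracy_score
from pandas.plotting import scatter_matrix                # scatterplot matrix for exploration
```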
You may get a deprecation warning, as shown in the preceding screenshot. That is because some of these packages are getting old.
- Now that we have all of our packages set up, we can move on to loading the dataset. This is where we're going to be getting our information from. We're going to be using the UCI repository, since they have a large collection of datasets for machine learning, and they're free and available for everybody to use.
- The URL that we're going to use can be imported directly, if we type the whole URL. This is going to import our dataset with 11 different columns. We can see the URL and the various columns in the following screenshot:
We will then import the cell data. This will include the following aspects:
- The first column will simply be the ID of the cell
- In the second column, we will have clump_thickness
- In the third column, we will have uniform_cell_size
- In the fourth column, we will have uniform_cell_shape
- In the fifth column, we will have marginal_adhesion
- In the sixth column, we will have single_epithelial_size
- In the seventh column, we will have bare_nuclei
- In the eighth column, we will have bland_chromatin
- In the ninth column, we will have normal_nucleoli
- In the tenth column, we will have mitoses
- And finally, in the eleventh column, we will have the class
These are factors that a pathologist would consider to determine whether or not a cell had cancer. When we discuss machine learning in healthcare, it has to be a collaborative project between doctors and computer scientists. While a doctor can help by indicating which factors are important to include, a computer scientist can help by carrying out machine learning. Now, let's move on to the next steps:
- Since we've got the names of our columns, we will now start a DataFrame.
- The next step will be to add pd, which stands for pandas. We're going to use the read_csv function, passing in the url, with names equal to those listed previously.
- Press Shift + Enter, and make sure that all of the imports are right.
- We will then have to preprocess our data and carry out some visualizations, as we want to explore the dataset before we begin.
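The loading step above can be sketched as follows. In the chapter, the url variable points at the UCI breast-cancer-wisconsin.data file; to keep this sketch self-contained and offline-friendly, a few sample rows in the same 11-column layout are parsed inline instead (the rows here are illustrative, not real patient records):

```python
import io
import pandas as pd

# Column names matching the 11 columns described above.
names = ['id', 'clump_thickness', 'uniform_cell_size', 'uniform_cell_shape',
         'marginal_adhesion', 'single_epithelial_size', 'bare_nuclei',
         'bland_chromatin', 'normal_nucleoli', 'mitoses', 'class']

# Stand-in for the UCI file; the real call is pd.read_csv(url, names=names).
sample = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,5,4,4,5,7,10,3,2,1,2\n"
    "1000003,3,1,1,1,2,2,3,1,1,2\n"
    "1000004,8,10,10,8,7,10,9,7,1,4\n"
)
df = pd.read_csv(sample, names=names)
print(df.shape)  # 4 rows here; (699, 11) for the full dataset
```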
In machine learning, it's very important to understand the data that you're going to be using. This will help you pick which algorithm to use, and understand which results you're actually looking for. It is important to understand, for example, what is considered a good result, because accuracy is not always the most important classification metric. Take a look at the following steps:
- First, our dataset contains some missing data. To deal with this, we will add a df.replace operation. If the dataset gives us a question mark, it means that there's no data there. We're simply going to input the value -99999 and tell Python to ignore that data.
- We will then perform the print(df.axes) operation, so that we can see the columns. We can see that we have 699 different data points, and each of those cases has 11 different columns.
- Next, we will print the shape of the dataset using the print(df.shape) operation.
We will drop the id column, as we don't want to carry out machine learning on the ID column. That is because it won't tell us anything interesting.
Let's view the output of the preceding steps in the following screenshot:
As we now have all of the columns, we can detect whether the tumor is benign (which means it is non-cancerous) or malignant (which means it is cancerous). We now have 10 columns, as we have dropped the ID column.
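The preprocessing steps above can be sketched like this (a tiny stand-in DataFrame is used so the sketch runs on its own; in the chapter, df is the full UCI dataset):

```python
import pandas as pd

# Stand-in for the loaded dataset, including a '?' missing value,
# which is how the UCI file marks missing data.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'clump_thickness': [5, 5, 3],
    'bare_nuclei': ['1', '?', '2'],
    'class': [2, 4, 2],
})

df.replace('?', -99999, inplace=True)  # flag missing data so it can be ignored
print(df.axes)                         # the row index plus the column labels
print(df.shape)

df.drop(['id'], axis=1, inplace=True)  # the ID column tells us nothing interesting
print(df.shape)                        # one fewer column after the drop
```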
In the following screenshot, we can see the first cell in our dataset, as well as its different features:
Now let's visualize the parameters of the dataset, in the following steps:
- We will print the first point, so that we can see what it entails.
- We have a value of between 0 and 10 in all of the different columns. In the class column, the number 2 represents a benign tumor, and the number 4 represents a malignant tumor. There are 699 cells in the dataset.
- The next step will be to do a print(df.describe()) operation, which gives us the mean, standard deviation, and other aspects for each of our different parameters or features. This is shown in the following screenshot:
Here, we have a max value of 10 for all of the different columns, apart from the class column, which will either be 2 or 4. The mean is a little closer to 2, so we have a few more benign cases than we do malignant cases. Because the min and the max values are between 1 and 10 for all columns, it means that we've successfully ignored the missing data, so we're not factoring that in. Each column has a relatively low mean, but most of them have a max of 10, which means that we have a case where we hit 10 in all but one of the columns.
Let's get started with data visualization. We will plot histograms for each variable. The steps in the preceding section are important, because we need to understand these datasets if we want to accurately and effectively use machine learning. Otherwise, we're shooting in the dark, and we might spend time on a method that doesn't need to be investigated. We will use the plt method and make a plot, in which we will add the histograms of our dataset and edit the figure sizes, to make them easier to see.
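The plotting call can be sketched as follows (with a synthetic DataFrame and the non-interactive Agg backend, so the sketch can run anywhere; in the chapter, df.hist is called on the real data and plt.show() displays it inline in Jupyter):

```python
import matplotlib
matplotlib.use('Agg')           # headless backend; Jupyter would render inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data: integer scores from 1 to 10, like the cell features.
rng = np.random.RandomState(8)
df = pd.DataFrame(rng.randint(1, 11, size=(100, 4)),
                  columns=['clump_thickness', 'uniform_cell_size',
                           'chromatin', 'mitoses'])

df.hist(figsize=(10, 10))       # one histogram per column
plt.savefig('histograms.png')   # in the notebook: plt.show()
```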
We can see the output in the following screenshot:
As you can see, most of the preceding histograms have the majority of their data at around 1, with some data at a slightly higher value. Each histogram, apart from class, has at least one case where the value is 10. The histogram for clump thickness is pretty evenly distributed, while the histogram for chromatin is skewed to the left.
We will now look at a scatterplot matrix, to see the relationships between some of these variables. A scatterplot matrix is a very useful function to use, because it can tell us whether a linear classifier will be a good classifier for our data, or whether we have to investigate more complicated methods.
We will add a scatter_matrix method and adjust the size to figsize=(18, 18), to make it easier to see.
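The scatterplot matrix call, sketched under the same assumptions (synthetic data, Agg backend):

```python
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.RandomState(8)
df = pd.DataFrame(rng.randint(1, 11, size=(100, 4)),
                  columns=['clump_thickness', 'uniform_cell_size',
                           'uniform_cell_shape', 'mitoses'])

# Pairwise scatterplots for every variable pair, histograms on the diagonal.
scatter_matrix(df, figsize=(18, 18))
plt.savefig('scatter_matrix.png')  # in the notebook: plt.show()
```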
The output, as shown in the following screenshot, indicates the relationship between each variable and every other variable:
All of the variables are listed on both the x and the y axes. Where they intersect, we can see the histograms that we saw previously.
In the block indicated by the mouse cursor in the preceding screenshot, we can see that there is a pretty strong linear relationship between uniform_cell_shape and uniform_cell_size. This is expected. When we go through the preceding screenshot, we can see that some other cells have a good linear relationship. If we look at our classifications, however, there's no easy way to classify these relationships. Looking at class in the preceding screenshot, we can see that 4 is a malignant classification. We can also see that there are cells that scored low on clump_thickness, and were still classified as malignant.
Thus, we come to the conclusion that there aren't any relationships between the variables of our dataset strong enough to make the classification straightforward.
The following steps will help you to better understand the machine learning algorithm:
- The first step that we need to perform is to split our dataset into X and Y datasets for training. We won't train on all of the available data, as we need to save some for our validation step. This will help us to determine how well these algorithms can generalize to new data, and not just how well they know the training data.
- Our X data will contain all of the variables, except for the class column, and our Y data is going to be the class column, which is the classification of whether a tumor is malignant or benign.
- Next, we will use the train_test_split function, and we will then split our data into training and testing sets.
- In the same line, we will add X, y, and test_size. Holding out about 20% of our data is fairly standard, so we will make the test size 0.2, as shown in the following screenshot:
- Next, we will add a seed, which makes the data reproducible. We will start with a random seed, which will change the results a little bit every time.
- In scoring, we will add accuracy. This is shown in the following screenshot:
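The split and the testing parameters above can be sketched as follows (synthetic X and y stand in for the real feature columns and class labels):

```python
import numpy as np
from sklearn import model_selection

# Stand-in for the real data: X holds the nine feature columns,
# y holds the class column (2 = benign, 4 = malignant).
rng = np.random.RandomState(8)
X = rng.randint(1, 11, size=(100, 9))
y = rng.choice([2, 4], size=100)

# Hold out 20% of the data for validation.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2)

seed = 8               # fixed seed keeps later runs reproducible
scoring = 'accuracy'   # the classification metric we will report
print(X_train.shape, X_test.shape)
```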
In the preceding section, you learned about how machine learning algorithms can be used for healthcare purposes. We also looked at the testing parameters that are used for this application.
- First, make an empty list, in which we will append the KNN model.
- Enter the KNeighborsClassifier function and explore the number of neighbors.
- Start with n_neighbors = 5, and play around with the variable a little, to see how it changes our results.
- Next, we will add our models: the SVM and the SVC. We will evaluate each model, in turn.
- The next step will be to get a results list and a names list, so that we can print out some of the information at the end.
- We will then perform a for loop for each of the models defined previously, such as the KNN and the SVC.
- We will also do a k-fold comparison, which will run each of these several times, and then average the results. The number of splits, or n_splits, defines how many times it runs.
- Since we don't want a fresh random state each run, we will go from the seed. Now, we will get our results. We will use the model_selection function that we imported previously, and the cross_val_score function.
- For each model, we'll provide training data to X_train and y_train.
- We will also add the specification scoring, which was the accuracy that we added previously.
- We will also append the results and the name, and we will print out a msg. We will then substitute some variables.
- Finally, we will look at the mean results and the standard deviation.
The k-fold training will take place, which means that this will be run 10 times. We will receive the average result and the average accuracy for each of them. We will use a random seed of 8, so that it is consistent across different trials and runs. Now, press Shift + Enter. We can see the output in the following screenshot:
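The training loop described above can be sketched as follows (synthetic data again; note that recent versions of scikit-learn require shuffle=True when KFold is given a random_state):

```python
import numpy as np
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in training data: nine feature scores per cell, labels 2 or 4.
rng = np.random.RandomState(8)
X_train = rng.randint(1, 11, size=(80, 9)).astype(float)
y_train = rng.choice([2, 4], size=80)

seed = 8
scoring = 'accuracy'
models = [('KNN', KNeighborsClassifier(n_neighbors=5)),
          ('SVM', SVC())]

results, names = [], []
for name, model in models:
    # 10-fold cross-validation: train/score 10 times, then average.
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train,
                                                 cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
```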
In this case, our KNN narrowly beats the SVC. We will now go back and make predictions on our validation set, because the numbers shown in the preceding screenshot just represent the accuracy of our training data. If we split up the datasets differently, we'll get the following results:
However, once again, it looks like we have pretty similar results, at least with regard to accuracy, on the training data between our KNN and our support vector classifier. The KNN tries to cluster the different data points into two groups: malignant and benign. The SVM, on the other hand, is looking for the optimal separating hyperplane that can separate these data points into malignant cells and benign cells.
In this section, we will make predictions on the validation dataset. So far, machine learning hasn't been very helpful, because it has told us information about the training data that we already know. Let's have a look at the following steps:
- First, we will make predictions on the validation sets with the X_test that we split out earlier.
- We'll do another for loop over each name and model.
- Then, we will do the model.fit, and it will train it once again on the X and y training data. Since we want to make predictions, we're going to use the model to actually make a prediction about the X_test data.
- Once the model has been trained, we're going to use it to make a prediction. It will print out the name, the accuracy score (based on a comparison of the y_test data with the predictions we made), and a classification_report, which will tell us information about the false positives and negatives that we found.
- Now, press Shift + Enter. The following screenshot shows the preceding steps, and the output:
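The prediction steps above, sketched end to end on synthetic data (a toy labeling rule stands in for the real cell classes, so the exact scores will differ from the chapter's):

```python
import numpy as np
from sklearn import model_selection
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic cells: nine feature scores; toy rule assigns class 4 (malignant)
# when the first feature is high, class 2 (benign) otherwise.
rng = np.random.RandomState(8)
X = rng.randint(1, 11, size=(100, 9)).astype(float)
y = np.where(X[:, 0] > 5, 4, 2)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=8)

models = [('KNN', KNeighborsClassifier(n_neighbors=5)), ('SVM', SVC())]
for name, model in models:
    model.fit(X_train, y_train)           # train on the training split
    predictions = model.predict(X_test)   # predict the held-out cells
    print(name)
    print(accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))
```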
In the preceding screenshot, we can see that the KNN got a 98% accuracy rating on the validation set. The SVM achieved a result that was a little lower, at 95%.
The preceding screenshot also shows some other measures, such as precision, recall, and the f1-score. Precision is a measure of false positives. It is actually the ratio of correctly predicted positive observations to the total predicted positive observations. A high value for precision means that we don't have too many false positives. The SVM has a lower precision score than the KNN, meaning that it classified a few cases as malignant when they were actually benign. It is vital to minimize the chance of getting false positives in this case, especially because we don't want to mistakenly diagnose a patient with cancer.
Recall, on the other hand, is a measure of false negatives. In our KNN, we actually have a few malignant cells that are getting through our KNN without being labeled. The f1-score column is a combination of the precision and recall scores.
We will now go back, to do another split and randomly sort our data again. In the following screenshot, we can see that our results have changed:
This time, we did much better on both the KNN and the SVM. We also got much higher precision scores from both, at 97%. This means that we probably only got one or two false positives for our KNN. We had no false negatives for our SVM, in this case.
We will now look into another example of predicting, once again based on the cell features:
- First, we will make an SVC and get an accuracy score for it, based on our testing data.
- Next, we will add an example. Type in np.array and pick whichever data points you want. We're going to need nine of them, one for each feature column. We also need to remember to see whether we get a malignant prediction.
- We will then take the example and apply reshape to it. We will flip it around, so that it has the shape that the classifier expects.
- We will then print our prediction and press Shift + Enter.
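The single-cell prediction can be sketched like this (synthetic training data with a toy labeling rule; the nine example values are arbitrary scores from 1 to 10, not taken from the real dataset):

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: nine feature scores per cell;
# class 4 (malignant) when the first feature is high, else 2 (benign).
rng = np.random.RandomState(8)
X_train = rng.randint(1, 11, size=(80, 9)).astype(float)
y_train = np.where(X_train[:, 0] > 5, 4, 2)

clf = SVC()
clf.fit(X_train, y_train)

# One new cell: nine feature scores, reshaped into a single-row 2D array.
example = np.array([[10, 10, 10, 8, 6, 1, 8, 9, 1]])
example = example.reshape(len(example), -1)
prediction = clf.predict(example)
print(prediction)
```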
The following screenshot shows that we actually did get a malignant prediction:
In the preceding screenshot, we can see that we are 96% accurate, which is exactly what we were previously. By using the same model, we are actually able to predict whether a cell is malignant, based on its data.
When we run it again, we get the following results:
By changing the values in the example away from 10, the cells go from a malignant classification to a benign classification. When we change the values in the example to 5, the prediction comes back as 4, which means that it is malignant. Thus, the difference between feature values of 4 and 5 is enough to switch our SVM from thinking it's a malignant cell to a benign cell.
In this chapter, we imported data from the UCI repository. We named the columns (or features), and then put them into a pandas DataFrame. We preprocessed our data and removed the ID column. We also explored the data, so that we would know more about it. We used the
describe function, which gave us features such as the mean, the maximum, the minimum, and the different quartiles. We also created some histograms (so that we could understand the distributions of the different features) and a scatterplot matrix (so that we could look for linear relationships between the variables).
We then split our dataset up into a training set and a testing validation set. We implemented some testing parameters, built a KNN classifier and an SVC, and compared their results using a classification report. This consisted of features such as accuracy, overall accuracy, precision, recall, F1 score, and support. Finally, we built our own cell and explored what it would take to actually get a malignant or benign classification.
In the next chapter, you will learn about the detection of diabetes. Stay tuned for more!