You're reading from  Artificial Intelligence for IoT Cookbook

Product type: Book
Published in: Mar 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781838981983
Edition: 1st Edition
Author (1)
Michael Roshak

Michael Roshak is a cloud architect and strategist with extensive subject matter expertise in enterprise cloud transformation programs and infrastructure modernization through designing and deploying cloud-oriented solutions and architectures. He is responsible for providing strategic advisory for cloud adoption, consultative technical sales, and driving broad cloud services consumption with highly strategic accounts across multiple industries.

Machine Learning for IoT

Machine learning has dramatically altered what manufacturers are able to do with IoT. Today, there are numerous industries that have specific IoT needs. For example, the internet of medical things (IoMT) has devices such as outpatient heart monitors that can be worn at home. These devices often require large amounts of data to be sent over the network or large compute capacity on the edge to process heart-related events. Another example is agricultural IoT (AIoT) devices that are often placed in locations where there is no Wi-Fi or cellular network. Prescriptions or models are pushed down to these semi-connected devices. Many of these devices require that decisions be made on the edge. When connectivity is finally established using technology such as LoRaWAN or TV white space, models are downloaded to the devices.

In this chapter, we are going to...

Analyzing chemical sensors with anomaly detection

Accurate predictive models require a large number of devices in the field to have failed so that there is enough failure data to use for predictions. For some well-crafted industrial devices, failures on this scale can take years. Anomaly detection can identify devices that are not behaving like the other devices in the fleet. It can also be used to wade through thousands of similar messages and pinpoint the messages that are not like the others.

Anomaly detection in machine learning can be unsupervised, supervised, or semi-supervised. Usually, it starts by using an unsupervised machine learning algorithm to cluster data into patterns of behavior or groups. This presents a series of data in buckets. When the machines are examined, some of the buckets identify behavior while some identify an issue with the device. The device may have exhibited different patterns of behavior in a resting state, an in-use state, a cold state, or something that...
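The idea of clustering first and then looking for readings that do not fit their group can be sketched as follows. This is a minimal illustration on synthetic data, not the recipe's dataset: the two groups, the stray point, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for sensor readings: two tight groups plus one stray point.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(50, 2))
stray = np.array([[2.5, 9.0]])
readings = np.vstack([group_a, group_b, stray])

# Cluster the readings, then measure how far each one sits from its own centroid.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=2).fit(readings)
own_centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(readings - own_centroids, axis=1)

# Flag readings that are unusually far from their centroid (illustrative threshold).
threshold = distances.mean() + 3 * distances.std()
anomaly_idx = np.where(distances > threshold)[0]
print(anomaly_idx)
```

In practice the threshold would be tuned against devices that are known to be healthy or faulty.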

Getting ready

Anomaly detection is one of the easiest machine learning models to implement. In this recipe, we are going to use a dataset drawn from chemical sensors that are detecting either neutral, banana, or wine. To get ready, you will need to import the numpy, sklearn and matplotlib libraries.

How to do it...

The following steps need to be observed to complete this recipe:

  1. Import the required libraries:
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
  2. Upload the data file to a DataFrame:
# Read the tab-separated sensor metadata file into a Spark DataFrame
df = spark.read.format("csv") \
    .option("inferSchema", True) \
    .option("header", True) \
    .option("sep", "\t") \
    .load("/FileStore/tables/HT_Sensor_metadata.dat")
  3. View the dataset to see if the grouping of data correlates to the number of clusters:
pdf = df.toPandas()

# Cluster on the dt and t0 columns using three clusters
y_pred = KMeans(n_clusters=3,
                random_state=2).fit_predict(pdf[['dt','t0']])

# Color each point by the cluster it was assigned to
plt.scatter(pdf['t0'], pdf['dt'], c=y_pred)
display(plt.show())

The output is as follows:

The preceding chart shows three different groups of data. Tight clusters represent data with well-defined boundaries. If we adjust the number of clusters to 10, we may be able to get better...

How it works...

In this recipe, we are using numpy for data manipulation, sklearn for the machine learning algorithm, and matplotlib for viewing the results. In step 2, we pull the tab-separated file into a Spark DataFrame. In step 3, we convert the data into a pandas DataFrame, run the k-means algorithm with three clusters, and plot the result, which gives the chart as the output.

K-means is a popular clustering algorithm for grouping data that has no labels. K-means first randomly initializes the cluster centroids. In our example, there were three cluster centroids. It then assigns each data point to its nearest centroid. Next, it moves each centroid to the spot that is in the middle of its respective cluster. It repeats these two steps until the centroids stop moving, at which point the data points have settled into a stable division.
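The loop described above can be sketched in a few lines of numpy. This is an illustrative implementation on synthetic points, not the code sklearn runs internally; the function name, the iteration cap, and the test data are assumptions made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids as k distinct points drawn at random from the data.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no points stays where it is).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Three synthetic groups of 30 points each.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal([0, 0], 0.2, size=(30, 2)),
    rng.normal([4, 4], 0.2, size=(30, 2)),
    rng.normal([0, 4], 0.2, size=(30, 2)),
])
labels, centroids = kmeans(points, k=3)
print(centroids)
```

Because the starting centroids are random, different seeds can settle on different divisions, which is one reason library implementations typically try several initializations.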

There's more...

In the chart, you may have noticed outliers. These are very important to note when looking at prototypes. Outliers can represent power fluctuations within a machine, bad sensor placement, or a number of other issues. The following example shows a simple standard deviation calculation on our data. From here, we are able to see two values that fall outside three standard deviations from the mean:

from numpy import mean
from numpy import std

data_mean, data_std = mean(pdf['dt']), std(pdf['dt'])

cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

outliers = [x for x in pdf['dt'] if x < lower or x > upper]
print('Identified outliers: %d' % len(outliers))
print(outliers)

Logistic regression with the IoMT 

In this recipe, we're going to talk about using logistic regression to classify data from mammography machines. Recently, the IoMT has expanded greatly. Many devices are being worn by patients when they go home from their doctor, providing an in-home medical monitoring solution, while others are in hospitals, giving the doctors additional feedback on medical tests being run. In many cases, machine learning algorithms are able to spot diseases and issues that doctors may miss, or give them additional recommendations. In this recipe, we are going to work with a breast cancer dataset and determine whether a mammogram record is malignant or benign.

Getting ready

The dataset, along with the Databricks notebooks, is available in the GitHub repository. The dataset is unwieldy. It has bad columns with a high degree of correlation, which is another way of saying some sensors are duplicates, and there are unused columns and extraneous data. For the sake of readability, there will be two notebooks in the GitHub repository. The first does all of the data manipulation and puts the data into a data table. The second notebook does the machine learning. We will focus this recipe on the data manipulation notebook. At the end of the recipe, we will talk about two other notebooks to show an example of MLflow.

One other thing you will need in this recipe is an MLflow workspace. To set up an MLflow workspace, you will need to go into Databricks and create the workspace for this experiment. We will write the results of our experiment there.

How to do it...

Follow these steps to complete this recipe:

  1. Import the required libraries:
import pandas as pd

from sklearn import neighbors, metrics
from sklearn.metrics import roc_auc_score, classification_report,\
    precision_recall_fscore_support,confusion_matrix,precision_score, \
    roc_curve,precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

import statsmodels.api as sm
import statsmodels.formula.api as smf
  2. Import the data:
df = spark.sql("select * from BreastCancer")
pdf = df.toPandas()
  3. Split the data:
# X keeps the diagnosis column because the formula API reads it from the DataFrame
X = pdf
y = pdf['diagnosis']

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=40)
  4. Create the formula:
# Build 'diagnosis ~ col1 + col2 + ...' from every column except the target
cols = pdf.columns.drop('diagnosis')
formula = 'diagnosis ~ ' + ' + '.join(cols)
  5. Train the model:
model = smf.glm(formula=formula, data=X_train,
                family=sm.families.Binomial())
logistic_fit = model.fit()
  6. Test our model:
predictions...

How it works...

In this recipe, we used logistic regression. Logistic regression is a technique that can be used for traditional statistics as well as machine learning. Due to its simplicity and power, many data scientists use logistic regression as their first model and use it as a benchmark to beat. Logistic regression is a binary classifier, meaning it can classify something as true or false. In our case, the classifications are benign or malignant.

First, we import pandas for data manipulation, statsmodels for our model, and sklearn for splitting the data and analyzing the results. Next, we import data from our data table and put it into a pandas DataFrame. Then we split the data into testing and training datasets. Next, we create a formula that tells the model which data columns to use. We then give the model the formula, the training dataset, and the distribution family it will use (binomial), and fit it. This outputs a model that we can use to evaluate new data. We now create a DataFrame called predictions_nominal, which we can use to compare...
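The excerpt does not show how the predictions are scored, so the following is only a sketch of one common approach: turn predicted probabilities into nominal 0/1 predictions with a 0.5 cut-off and compute precision, recall, and F-score. To keep the example self-contained it swaps in sklearn's LogisticRegression and synthetic data in place of the statsmodels GLM and the BreastCancer table; the feature values and the cut-off are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the mammography table: two numeric features and a
# 0/1 diagnosis (1 = malignant). These values are illustrative only.
rng = np.random.default_rng(40)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=40)

clf = LogisticRegression().fit(X_train, y_train)

# The model outputs a probability of the positive class for each record.
probabilities = clf.predict_proba(X_test)[:, 1]

# Turn the probabilities into nominal 0/1 predictions with a 0.5 cut-off.
predictions_nominal = (probabilities > 0.5).astype(int)

precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, predictions_nominal, average="binary")
print(f"precision={precision:.2f} recall={recall:.2f} fscore={fscore:.2f}")
```

For a medical screening task, the cut-off might reasonably be lowered to favor recall over precision, since a missed malignant case is usually costlier than a false alarm.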

There's more...

We will record the outcome in MLflow to be compared against other algorithms. We will also save other parameters, such as the main formula used and the family of predictions:

import pickle
import mlflow

# Select the experiment before starting the run so the run is recorded there
mlflow.set_experiment("/Shared/experiments/BreastCancer")

with mlflow.start_run():
    mlflow.log_param("formula", formula)
    mlflow.log_param("family", "binomial")

    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("fscore", fscore)

    # Save the model to disk, then attach the file to the run as an artifact
    filename = 'finalized_model.sav'
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

    mlflow.log_artifact(filename)

Classifying chemical sensors with decision trees

In this recipe, we are going to use chemical sensor data from Metal-Oxide (MOx) sensors to determine whether there is wine in the air. This type of sensor is commonly used to determine whether food or chemical particulates are in the air. Chemical sensors can detect gasses that would be poisonous to people or food spillage at a warehouse.

How to do it...

Follow these steps to complete this recipe:

  1. Import the libraries:
import pandas as pd
import numpy as np

from sklearn import neighbors, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
  2. Import the data:
df = spark.sql("select * from ChemicalSensor")
pdf = df.toPandas()
  3. Encode the values:
# Convert the text labels to integers, then one-hot encode the integers
label_encoder = LabelEncoder()
integer_encoded = \
    label_encoder.fit_transform(pdf['classification'])
# Note: scikit-learn 1.2 and later name this parameter sparse_output
onehot_encoder = OneHotEncoder(sparse=False)

integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
  4. Split the data into test and training sets:
# feature_cols is the list of feature column names (not shown in this excerpt)
X = pdf[feature_cols]
y = onehot_encoded

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=5)
  5. Train and predict:
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred...

How it works...

As always, we import the libraries we need for this project. Next, we import data from our Spark data table into a pandas DataFrame. One-hot encoding changes categorical values, such as our example of Wine and No Wine, into encoded values that machine learning algorithms can use more easily. In step 4, we take our feature columns and our one-hot encoded column and split them into a testing set and a training set. In step 5, we create a decision tree classifier, use the X_train and y_train data to train the model, and then use the X_test data to create a set of predictions called y_pred. In other words, y_pred holds the predictions the model made on the X_test set. In step 6, we evaluate the accuracy of the model and the area under the curve (AUC).
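The whole flow, from encoding through to accuracy and AUC, can be sketched end to end on synthetic data. This is an illustration under stated assumptions: the column names r1 and r2, the generated readings, and the use of plain label encoding (rather than one-hot encoding) for a two-class target are choices made for this example and are not taken from the recipe's notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor table; column names are hypothetical.
rng = np.random.default_rng(5)
n = 300
is_wine = rng.integers(0, 2, size=n)
pdf = pd.DataFrame({
    "r1": rng.normal(10.0, 1.0, size=n) + 3.0 * is_wine,
    "r2": rng.normal(5.0, 1.0, size=n) - 2.0 * is_wine,
    "classification": np.where(is_wine == 1, "Wine", "No Wine"),
})
feature_cols = ["r1", "r2"]

# Encode the text labels as integers (alphabetical: "No Wine" -> 0, "Wine" -> 1).
y = LabelEncoder().fit_transform(pdf["classification"])
X = pdf[feature_cols]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)

clf = DecisionTreeClassifier(random_state=5).fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"accuracy={accuracy:.2f} auc={auc:.2f}")
```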

Decision tree classifiers are used when the data is complex. In the same way, you can use a decision tree to follow a set of logical rules using yes/no questions...

There's more...

The sklearn decision tree classifier has several hyperparameters; two that we can tune here are criterion and max_depth. Hyperparameters are often changed to see if accuracy can be increased. The criterion is either gini or entropy. Both of these criteria measure the impurity of the child nodes. The next one is max_depth. The maximum depth of the decision tree can affect over- and underfitting.

Underfitting versus overfitting
Models that underfit are inaccurate and poorly represent the data they were trained on.

Models that overfit are unable to generalize beyond the data they were trained on. An overfit model misses data that is merely similar to the training set because it only works well on exactly the data it was trained on.
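One way to see both effects is to sweep the two hyperparameters and compare accuracy on the training set with accuracy on held-out data. The sketch below uses synthetic, deliberately noisy data; the depths tried and the data-generating rule are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary target so that a perfect fit is only possible by memorizing.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
signal = X[:, 0] * X[:, 1] + 0.5 * X[:, 2]
y = (signal + rng.normal(scale=0.7, size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = []
for criterion in ("gini", "entropy"):
    for max_depth in (1, 3, 5, None):  # None lets the tree grow until leaves are pure
        clf = DecisionTreeClassifier(
            criterion=criterion, max_depth=max_depth, random_state=0)
        clf.fit(X_train, y_train)
        results.append((criterion, max_depth,
                        clf.score(X_train, y_train),
                        clf.score(X_test, y_test)))

for criterion, max_depth, train_acc, test_acc in results:
    print(f"{criterion:8s} depth={max_depth!s:5s} "
          f"train={train_acc:.2f} test={test_acc:.2f}")
```

A very shallow tree that scores poorly on both sets suggests underfitting, while an unlimited tree that scores far higher on the training set than on the test set suggests overfitting.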

Simple predictive maintenance with XGBoost

Every device has an end of life or will require maintenance from time to time. Predictive maintenance is one of the most common applications of machine learning in IoT. The next chapter will cover predictive maintenance in depth, looking at sequential data and how that data changes with seasonality. This recipe will look at predictive maintenance from the simpler perspective of classification.

In this recipe, we are going to use the NASA Turbofan engine degradation simulation dataset. We will use three classifications: green means the engine does not need maintenance; yellow means the engine needs maintenance within the next 14 maintenance cycles; and red means the engine needs maintenance within the next cycle. For the algorithm, we are going to use extreme gradient boosting (XGBoost). XGBoost has become popular in recent years, in part because it has been used in many winning Kaggle competition entries.

Getting ready

To get ready you will need the NASA Turbofan engine degradation simulation dataset. The data, along with a Spark notebook, can be found in the companion GitHub repository for this book or on the NASA website. Next, you will need to make sure you install XGBoost as a library in Databricks.

How to do it...

The steps for this recipe are as follows:

  1. Import the libraries:
import pandas as pd
import numpy as np
from pyspark.sql.types import *
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
import pickle
import mlflow
  2. Import the data:
file_location = "/FileStore/tables/train_FD001.txt"
file_type = "csv"

# Declare the column types explicitly instead of relying on schema inference
schema = StructType([
    StructField("engine_id", IntegerType()),
    StructField("cycle", IntegerType()),
    StructField("setting1", DoubleType()),
    StructField("setting2", DoubleType()),
    StructField("setting3", DoubleType()),
    StructField("s1", DoubleType()),
    StructField("s2", DoubleType()),
    StructField("s3", DoubleType()),
    StructField("...

How it works...

First, we import pandas, pyspark, and numpy for data wrangling, xgboost for our algorithm, sklearn for scoring our results, and finally mlflow and pickle for saving those results. In step 2, we specify a schema in Spark. Spark's schema inference can often get the schema wrong, so we frequently need to specify the data types ourselves. In the next step, we create a temp view of the data so that we can use the SQL tools in Databricks. In step 4, we use the magic %sql tag at the top of the cell to change the language to SQL. We then create a table called engine that has the engine data plus a new column that gives 0 if the engine has more than 14 cycles left, 1 if it has only one cycle left, and 2 if it has 14 or fewer cycles left. We then switch back to the default Python language and split the data into test and training datasets. In step 6, we specify the columns in the model as well as the hyperparameters. From here we train the model. We then test our model and...
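The labeling step is done in SQL in the notebook, which is not shown in this excerpt. As a rough pandas equivalent, the sketch below derives the remaining cycles for each engine and maps them to the three classes. The tiny synthetic log, the column name remaining, and the reading of "only one cycle left" as one or zero remaining cycles are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny synthetic run-to-failure log: engine 1 runs 20 cycles, engine 2 runs 16.
pdf = pd.DataFrame({
    "engine_id": [1] * 20 + [2] * 16,
    "cycle": list(range(1, 21)) + list(range(1, 17)),
})

# Remaining cycles = last cycle observed for that engine minus the current cycle.
last_cycle = pdf.groupby("engine_id")["cycle"].transform("max")
pdf["remaining"] = last_cycle - pdf["cycle"]

# 0 = green (more than 14 cycles left), 2 = yellow (14 or fewer), 1 = red (at most 1).
# np.select checks the conditions in order, so the red rule takes priority.
pdf["label"] = np.select(
    [pdf["remaining"] <= 1, pdf["remaining"] <= 14],
    [1, 2],
    default=0,
)
print(pdf["label"].value_counts().sort_index())
```

The resulting label column can then serve as the target for a multi-class classifier such as XGBoost.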

Detecting unsafe drivers

Computer vision in machine learning has allowed us to tell if there are accidents on roads or unsafe work environments, and it can be used in conjunction with complex systems such as smart sales assistants. Computer vision has opened up many possibilities in IoT. It is also one of the most challenging areas from a cost perspective. In the next two recipes, we are going to discuss two different ways of using computer vision. The first one takes in large numbers of images generated from IoT devices and performs predictions and analysis on them using the high-performance distributed Databricks platform. In the next recipe, we are going to use a technique for performing machine learning on edge devices that have little compute, using a low-compute algorithm.

Getting ready

To get ready, you will need Databricks. In the example of this recipe, we are going to pull images from Azure Blob Storage.

How to do it...

The steps for this recipe are as follows:

  1. Import the libraries and configurations:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer
from pyspark.ml.evaluation import \
    MulticlassClassificationEvaluator
from pyspark.sql.functions import lit
import pickle
import mlflow

storage_account_name = "Your Storage Account Name"
storage_account_access_key = "Your Key"
  2. Read the data:
# Safe images are labeled 0
safe_images = "wasbs://unsafedrivers@"+storage_account_name+\
    ".blob.core.windows.net/safe/"
safe_df = spark.read.format('image').load(safe_images)\
    .withColumn("label", lit(0))

# Unsafe images are labeled 1
unsafe_images = "wasbs://unsafedrivers@"+storage_account_name+\
    ".blob.core.windows.net/unsafe/"
unsafe_df = spark.read.format('image').load(unsafe_images)\
    .withColumn("label", lit(1))
  3. Query the data...

How it works...

First, we define where the files are located. For this recipe, we are using Azure Blob Storage, but any storage system, such as S3 or HDFS, would work as well. Replace the storage_account_name and storage_account_access_key values with those of your Blob Storage account. We then read both the safe and unsafe images from our storage account into Spark image DataFrames. In our example, we have placed safe images in one folder and unsafe images in another. We query the image DataFrame to check that the images were loaded. We then split the safe and unsafe images into test and training sets and union them into one training set and one testing set. Next, we create a machine learning pipeline. We use the ResNet-50 model as a featurizer and logistic regression as our classifier. We then run our training DataFrame through the pipeline to come out with a trained model and evaluate the accuracy of that model. Finally, we...

There's more...

To change our pipeline to use Inception instead of ResNet50, we simply need to change the modelName argument of the featurizer. The pipeline currently uses ResNet50:

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="ResNet50")

Using Inception v3, we are able to test the accuracy of different models on the image set:

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
modelName="InceptionV3")

We could use an array of models and record the results in MLflow:

for m in ['InceptionV3', 'Xception','ResNet50', 'VGG19']:
    featurizer = DeepImageFeaturizer(inputCol="image",
                                     outputCol="features",
                                     modelName=m)

Face detection on constrained devices

Deep neural networks tend to outperform other classification techniques. However, IoT devices do not have a large amount of RAM, compute, or storage. On constrained devices, RAM and storage are often measured in MB rather than GB, which makes conventional deep learning classifiers impractical. Some video classification services in the cloud charge over $10,000 per device for live streaming video. OpenCV's Haar classifiers have the same underlying principles as a convolutional neural network but at a fraction of the compute and storage. OpenCV is available in multiple languages and runs on some of the most constrained devices.

In this recipe, we are going to set up a Haar Cascade to detect if a person is close to the camera. This is often used in kiosks and other interactive smart devices. The Haar Cascade can be run at a high rate of speed, and when it finds a face that is close to the machine, it can send that image via a cloud service or a different onboard machine...

Getting ready

The first thing we need to do is install the OpenCV framework:

pip install opencv-python

Next, we download the model. The model can be downloaded from the OpenCV GitHub page or the book's GitHub page. The file is haarcascade_frontalface_default.xml.

Next, we create a new folder, copy the haarcascade_frontalface_default.xml file into it, and create a Python file for the code. Finally, if the device does not have a camera attached to it, attach one. In the following recipe, we are going to implement a Haar Cascade using OpenCV.

How to do it...

The steps for this recipe are as follows:

  1. Import the libraries and settings:
import cv2
from time import sleep

debugging = True
classifier = \
    cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
video = cv2.VideoCapture(0)
  2. Initialize the camera:
while True:
    if not video.isOpened():
        print('Waiting for Camera.')
        sleep(5)
        continue  # skip the rest of this pass until the camera is ready
  3. Capture and transform the image:
    ret, frame = video.read()
    # The classifier works on grayscale images
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
  4. Classify the image:
    faces = classifier.detectMultiScale(gray,
                                        minNeighbors=5,
                                        minSize=(100, 100)
                                        )
  5. Debug the images:
    if debugging:
        # Draw a rectangle around the faces
        for (x, y, w, h) in faces:
            cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)

        cv2.imshow('Video', frame)
        # Press q to leave the loop
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
  6. Detect the face:
    if len(faces...

How it works...

In step 1, we import the OpenCV library, along with sleep from Python's time module so that we can wait if the camera is not ready. Next, we set a debugging flag so we can test the output visually if we are debugging. Then we load the Haar Cascade XML file into our classifier. Finally, we open the first video camera attached to the machine. In step 2, we wait for the camera to become ready. This is often not a problem when developing the software, as the system has already recognized the camera. However, when we set this program to run automatically, the camera may not be available for up to a minute after the system is restarted. We are also starting an infinite loop of processing the camera images. In the next step, we capture the image and convert it to grayscale. Next, we run the classifier. The detectMultiScale classifier allows faces of different sizes to be detected. The minNeighbors parameter specifies...
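The final step of the recipe is truncated in this excerpt, so the following is only a sketch of how "a face that is close to the camera" might be decided from the bounding boxes that detectMultiScale returns. The function name closest_face and the min_fraction threshold are hypothetical and are not taken from the book; the sketch assumes each detection is an (x, y, w, h) box.

```python
def closest_face(faces, frame_width, min_fraction=0.25):
    """Return the largest (x, y, w, h) box if it spans at least
    min_fraction of the frame width, otherwise None.

    Hypothetical helper: a wider box is treated as a closer face.
    """
    if len(faces) == 0:
        return None
    # The face with the largest area is assumed to be the nearest one.
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    if w / frame_width >= min_fraction:
        return (int(x), int(y), int(w), int(h))
    return None


# Example with hand-written boxes standing in for classifier output.
detections = [(10, 10, 50, 50), (100, 100, 200, 200)]
print(closest_face(detections, frame_width=640))
```

A width-based rule like this is cheap enough for a constrained device; the threshold would need tuning for the camera's field of view and mounting position.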
