You're reading from  F# for Machine Learning Essentials

Product type: Book
Published in: Feb 2016
Reading level: Expert
ISBN-13: 9781783989348
Edition: 1st Edition
Author: Sudipta Mukherjee

Sudipta Mukherjee was born in Kolkata and migrated to Bangalore. He is an electronics engineer by education and a computer engineer/scientist by profession and passion. He graduated in 2004 with a degree in electronics and communication engineering. He has a keen interest in data structures, algorithms, text processing, natural language processing tools development, programming languages, and machine learning at large. His first book on Data Structure using C has been received quite well. Parts of the book can be read on Google Books. The book was also translated into simplified Chinese, available from Amazon.cn. This is Sudipta's second book with Packt Publishing. His first book, .NET 4.0 Generics, was also received very well. During the last few years, he has been hooked on the functional programming style. His book on functional programming, Thinking in LINQ, was released in 2014. He lives in Bangalore with his wife and son. Sudipta can be reached via e-mail at sudipto80@yahoo.com and via Twitter at @samthecoder.

Chapter 3. Classification Techniques

"Telling chalk and cheese apart"

In the previous chapter, you learned how to predict real values using linear regression. In this chapter, you will learn about classification. Classification is the process of tagging a given object with a class value. For example, given a set of cancer patient records labeled as benign or malignant, a program can be written to automatically categorize a new patient record as either benign or malignant.

What's fascinating about this approach is that the algorithm doesn't change and you can use the same algorithm to predict other things that are important in other settings. For example, the same algorithm can tell apart dogs and cats from their photographs, as you will find out later in the chapter.

In this chapter, you will learn how to implement several algorithms in F# that are used for classification. For some simple algorithms, such as k-NN, you will develop the algorithm from scratch...

Objective


After reading this chapter, you will be able to frame a real problem as a classification problem whenever applicable and then choose a suitable algorithm to solve it.

Different classification algorithms you will learn


You will look into the following algorithms:

  • k-NN (k-nearest neighbors)

  • Logistic regression

  • Multinomial logistic regression

  • Decision trees (J48)

Some interesting things you can do


You can use machine learning to differentiate between dogs and cats in photographs. Surprisingly enough, you can then tweak the same algorithm to tell cancerous cells from normal cells in breast cancer patients. You may use decision trees to predict whether there will be a traffic jam on a given date and time, and several other parameters can also help with that prediction. The intention is that after reading this chapter you will be able to use these classification techniques to address some of the problems you are facing yourself.

Binary classification using k-NN

In this example, you will solve the Kaggle cat and dog identification challenge (https://www.kaggle.com/c/dogs-vs-cats). The challenge is to identify dogs and cats from photographs. The following are a couple of example photographs:

  • Image 1

  • Image 2

How might we decide if an image contains a cat or a dog? What are the visual differences between cats and dogs? Some of...
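As a rough illustration of the k-NN idea named in this section's title, the following is a minimal sketch and not the chapter's own implementation. It assumes each image has already been reduced to a small numeric feature vector; the names `distance`, `knn`, and `training`, and all the numbers, are made up for illustration and are not derived from the Kaggle photographs.

```fsharp
// Euclidean distance between two feature vectors of equal length.
let distance (a: float[]) (b: float[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) ** 2.0) 0.0 a b |> sqrt

// k-NN: take the k training points closest to the query
// and return the label that occurs most often among them.
let knn (k: int) (training: (float[] * string) list) (query: float[]) =
    training
    |> List.sortBy (fun (features, _) -> distance features query)
    |> List.truncate k
    |> List.countBy snd
    |> List.maxBy snd
    |> fst

// Hypothetical two-feature points; real features would be extracted from images.
let training =
    [ [| 1.0; 1.0 |], "cat"
      [| 1.2; 0.8 |], "cat"
      [| 0.9; 1.1 |], "cat"
      [| 5.0; 5.0 |], "dog"
      [| 5.2; 4.8 |], "dog"
      [| 4.9; 5.1 |], "dog" ]

printfn "%s" (knn 3 training [| 1.1; 1.0 |])
```

Choosing an odd k for a two-class problem avoids ties in the majority vote.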

Understanding logistic regression


Unlike linear regression, which is used to predict real values, logistic regression is used to predict the class or tag of an unseen entry. Its output is either 0 or 1, denoting the predicted class of the unseen entry. To arrive at that class, logistic regression uses a smooth curve whose values range from 0 to 1 for all values of the independent variable.

The sigmoid function (also called the logistic function) is one option for this curve. It is defined by the following formula:

$$f(X) = \frac{1}{1 + e^{-X}}$$

The sigmoid function chart

The following chart is generated by the code snippet using FsPlot:

You need to install Chrome to get the chart rendered.

So you see that the function value approaches 1 as the value of X approaches infinity, and it approaches 0 as the value of X approaches negative infinity. So for any given value of X, you can determine the class if you set your threshold at 0.5. In other words, you can say that if for a given value of X the value of the sigmoid...
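The behavior described above can be checked with a few lines of F#. This is a minimal sketch; the helper name `classify` and the sample inputs are assumptions made for illustration, and no plotting library is used.

```fsharp
// Logistic (sigmoid) function: maps any real number into the open interval (0, 1).
let sigmoid (x: float) = 1.0 / (1.0 + exp (-x))

// Hypothetical helper: predict class 1 when the sigmoid value
// reaches the threshold, and class 0 otherwise.
let classify (threshold: float) (x: float) =
    if sigmoid x >= threshold then 1 else 0

// Print the sigmoid value and the class at threshold 0.5 for a few inputs.
[ -6.0; -1.0; 0.0; 1.0; 6.0 ]
|> List.iter (fun x ->
    printfn "x = %5.1f  sigmoid = %.4f  class = %d" x (sigmoid x) (classify 0.5 x))
```

Large positive inputs give values close to 1, large negative inputs give values close to 0, and an input of 0 gives exactly 0.5.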

Multiclass classification using logistic regression


You have seen in the previous section how logistic regression can be used to perform binary classification. In this section, you will see how to use logistic regression (which is inherently a binary classifier) for multiclass classification. The algorithm used is known as the "one-vs-all" method.

The algorithm is very intuitive. It learns as many models as there are classes of items in the training dataset. Later, when a new entry is given for identification, every model computes a confidence score that reflects how confident that model is that the new entry belongs to its class. The class of the model with the highest confidence is selected.
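The selection step of one-vs-all can be sketched in plain F# as follows. This is an illustration under stated assumptions, not the Accord.NET API: the `BinaryModel` record, the function names, and the weights are all hypothetical, and the weights are hand-picked rather than learned from data.

```fsharp
// Logistic function used to turn a linear score into a confidence in (0, 1).
let sigmoid (z: float) = 1.0 / (1.0 + exp (-z))

// Hypothetical representation of one binary model: a bias plus one weight per feature.
type BinaryModel = { Label: string; Bias: float; Weights: float[] }

// Confidence that the feature vector x belongs to the model's class.
let confidence (m: BinaryModel) (x: float[]) =
    Array.fold2 (fun acc w xi -> acc + w * xi) m.Bias m.Weights x |> sigmoid

// One-vs-all prediction: score x with every model and keep the most confident label.
let predict (models: BinaryModel list) (x: float[]) =
    let best = models |> List.maxBy (fun m -> confidence m x)
    best.Label

// Made-up weights for illustration only; real weights would come from training.
let models =
    [ { Label = "A"; Bias = 0.0; Weights = [| 1.0; -1.0 |] }
      { Label = "B"; Bias = 0.0; Weights = [| -1.0; 1.0 |] }
      { Label = "C"; Bias = -0.5; Weights = [| 0.2; 0.2 |] } ]

printfn "%s" (predict models [| 3.0; 0.5 |])
```

Each model only answers "is it my class or not", so adding a new class means training one more binary model rather than changing the others.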

In this example, you will see how Accord.NET can be used to implement multiclass classification to identify iris flowers. There are three types of iris flowers, namely, Iris versicolor, Iris setosa, and Iris virginica. The task is to identify a given flower from the measurements...

Multiclass classification using decision trees


In Chapter 1, Introduction to Machine Learning, you saw how decision trees work to assign classes to unseen entries. In the following section, you will see how to use WekaSharp, which is a wrapper on top of Weka that lets you use it in an F#-friendly way. Weka is an open source project for data mining and machine learning, written in Java (http://www.cs.waikato.ac.nz/ml/weka/).

Obtaining and using WekaSharp

You can download WekaSharp from https://wekasharp.codeplex.com/. Then you have to add the following DLLs in your F# application, as shown next:

In this example, you will see how to use WekaSharp to classify the iris flowers.

module DecisionTreesByWeka.Main

open System
open WekaSharp.Common
open WekaSharp.Classify
open WekaSharp.Dataset
open WekaSharp.Eval

[<EntryPoint>]
let main args =
    // Load the iris dataset and mark its last attribute as the class label.
    let iris =
        @"C:\iris.csv"
        |> WekaSharp.Dataset.readArff
        |> WekaSharp.Dataset.setClassIndexWithLastAttribute...

Predicting a traffic jam using a decision tree: a case study


When I go home from the office, I face a traffic jam, as shown in the following image, as many other commuters in Bangalore do almost every day.

I thought that if I could only predict a traffic jam, I could reach home early to play with my son. I observed that if it rains, it is a weekday, and it is past 4:30 PM, a traffic jam is highly likely.
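The observation above can be written down as a chain of nested tests, which is the shape a decision tree takes once it has been learned. The following is a hand-written sketch and not a tree produced by Weka or J48; the `Observation` record and the function names are assumptions made for illustration.

```fsharp
open System

// One observation of the three conditions mentioned above (hypothetical record).
type Observation = { IsRaining: bool; Day: DayOfWeek; Time: TimeSpan }

// Saturday and Sunday are treated as the weekend in this sketch.
let isWeekday (day: DayOfWeek) =
    day <> DayOfWeek.Saturday && day <> DayOfWeek.Sunday

// Hand-written rule, not a learned tree: each test plays the role of a tree node.
let jamLikely (o: Observation) =
    if not o.IsRaining then false
    elif not (isWeekday o.Day) then false
    else o.Time > TimeSpan(16, 30, 0)   // later than 4:30 PM

let rainyMondayEvening =
    { IsRaining = true; Day = DayOfWeek.Monday; Time = TimeSpan(17, 0, 0) }

printfn "Jam likely: %b" (jamLikely rainyMondayEvening)
```

A learned decision tree would pick the order of these tests and the time cut-off from the data instead of having them fixed by hand.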

So I thought I could use a decision tree to predict whether there will be a traffic jam today or not. In the following example, I used a toy dataset that I fabricated. In a real setting, the data would have to be collected over a month or a few months of observations, because anomalies do exist. Sometimes, even if it rains and even if it is a weekday, there is no traffic jam. The reason may be that there is a cricket match that people want to watch and they have taken a day off work or are working from home. However, these sorts of situations are not common and shouldn't be considered...

Challenge yourself!


Now that you know how to use k-NN, logistic regression, and the J48 decision tree to predict classes, can you use what you learnt in this chapter to create an e-mail spam identification system? Solve it using each of these algorithms and then compare your results.

Get the spam data from http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/.

Summary


Congratulations on finishing yet another chapter. What you learned in this chapter is crucial for several complex machine learning activities. All the algorithms discussed in this chapter infer the underlying principles from several given examples (also called training sets). These types of algorithms are broadly classified under supervised learning.

You learnt how to use k-NN, a logistic regression model, and a J48 decision tree to predict the class/tag of an unknown entry.

In the next chapter, you will learn how to find similar products or items using several information retrieval metrics.
