You're reading from  Practical Machine Learning Cookbook

Product type: Book
Published in: Apr 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785280511
Edition: 1st Edition
Author: Atul Tripathi

Atul Tripathi has spent more than 11 years in the fields of machine learning and quantitative finance. He has a total of 14 years of experience in software development and research. He has worked on advanced machine learning techniques, such as neural networks and Markov models. While working on these techniques, he has solved problems related to image processing, telecommunications, human speech recognition, and natural language processing. He has also developed tools for text mining using neural networks. In the field of quantitative finance, he has developed models for Value at Risk, Extreme Value Theorem, Option Pricing, and Energy Derivatives using Monte Carlo simulation techniques.

Chapter 2. Classification

In this chapter, we will cover the following recipes:

  • Discriminant function analysis - geological measurements on brines from wells
  • Multinomial logistic regression - understanding program choices made by students
  • Tobit regression - measuring students' academic aptitude
  • Poisson regression - understanding species present in Galapagos Islands

Introduction


Discriminant analysis is used to distinguish between distinct sets of observations and to allocate new observations to previously defined groups. For example, suppose a study is carried out to investigate the variables that discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels. The researcher could collect data on numerous characteristics of the fruit species eaten by each animal group. Most fruits will naturally fall into one of the three categories. Discriminant analysis could then be used to determine which variables best predict whether a fruit will be eaten by birds, primates, or squirrels. Discriminant analysis is commonly used in biological species classification, in the medical classification of tumors, in facial recognition technologies, and in the credit card and insurance industries for determining risk. Its main goals are discrimination and classification. The assumptions regarding discriminant...

Discriminant function analysis - geological measurements on brines from wells


Let us assume that a study of ancient artifacts needs to be carried out. Rock samples have been collected from several mines, and geochemical measurements have been made on them. Similar measurements have been made on the artifacts. Discriminant function analysis (DFA) can be used to derive a function that separates the rock samples according to the mine from which they were excavated. The function can then be applied to the artifacts to predict which mine was the source of each artifact.
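
The idea of deriving a function from the rock samples and then applying it to the artifacts can be sketched in a few lines. The following is a minimal illustration, not the recipe's own code: it uses Python's standard library, assumes only two mines and two measurements per sample, and all the numbers and names (mine_a, mine_b, fit_discriminant, classify) are invented for the example. It computes Fisher's linear discriminant from the pooled within-group covariance.

```python
# Hypothetical geochemical measurements: two made-up features per rock sample.
mine_a = [(2.0, 3.0), (3.0, 3.5), (2.5, 2.5), (3.5, 4.0)]
mine_b = [(6.0, 7.0), (7.0, 6.5), (6.5, 8.0), (7.5, 7.5)]


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, mu):
    """Within-group sums of squares and cross-products (sxx, sxy, syy)."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fit_discriminant(group_a, group_b):
    """Return the discriminant weights w and the cutoff score between groups."""
    mu_a, mu_b = mean(group_a), mean(group_b)
    sa, sb = scatter(group_a, mu_a), scatter(group_b, mu_b)
    dof = len(group_a) + len(group_b) - 2
    # Pooled within-group covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx, sxy, syy = ((sa[i] + sb[i]) / dof for i in range(3))
    det = sxx * syy - sxy * sxy
    dx, dy = mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]
    # w = inverse(pooled covariance) * (mu_a - mu_b), written out for 2x2.
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    # The cutoff is the score of the midpoint between the two group means.
    mid = ((mu_a[0] + mu_b[0]) / 2, (mu_a[1] + mu_b[1]) / 2)
    cutoff = w[0] * mid[0] + w[1] * mid[1]
    return w, cutoff


def classify(point, w, cutoff):
    """Assign a new observation (for example, an artifact) to mine A or B."""
    score = w[0] * point[0] + w[1] * point[1]
    return "A" if score > cutoff else "B"


w, cutoff = fit_discriminant(mine_a, mine_b)
artifact = (3.0, 3.0)  # hypothetical artifact measurement
print(classify(artifact, w, cutoff))
```

The equal-prior midpoint cutoff is a simplifying assumption; with more than two mines or unequal group sizes, a library implementation would be the more practical choice.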

Getting ready

In order to perform discriminant function analysis we shall be using a dataset collected from mines.

Step 1 - collecting and describing data

The dataset on data analysis in geology titled BRINE shall be used. This can be obtained from http://www.kgs.ku.edu/Mathgeo/Books/Stat/ASCII/BRINE.TXT . The dataset is in a standard form, with rows corresponding to samples and columns corresponding to variables...

Multinomial logistic regression - understanding program choices made by students


Let's assume that high school students are to be enrolled in a program. Each student chooses one of three options: a general program, a vocational program, or an academic program. Each student's choice is modeled on the basis of their writing score and socioeconomic status.
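
To make the model concrete, here is a minimal sketch of how a multinomial logistic model turns a writing score and a socioeconomic status into three choice probabilities. The coefficients are invented round numbers, not fitted values, and the choice of "academic" as the baseline category and "low" as the reference status are assumptions made for the illustration.

```python
import math

# Hypothetical coefficients: (intercept, write, ses == "middle", ses == "high").
# Each set is relative to the assumed baseline category "academic".
COEFS = {
    "general": (2.5, -0.05, -0.5, -1.0),
    "vocational": (5.0, -0.10, 0.3, -1.0),
}


def program_probabilities(write, ses):
    """Return the probability of each program for one student."""
    x = (
        1.0,
        float(write),
        1.0 if ses == "middle" else 0.0,
        1.0 if ses == "high" else 0.0,
    )
    # The baseline category has a linear score of zero by construction.
    scores = {"academic": 0.0}
    for program, beta in COEFS.items():
        scores[program] = sum(b * v for b, v in zip(beta, x))
    # Softmax, shifted by the largest score for numerical stability.
    top = max(scores.values())
    exp_scores = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}


print(program_probabilities(write=40, ses="low"))
print(program_probabilities(write=60, ses="low"))
```

With these made-up coefficients, a higher writing score shifts probability toward the academic program, because both non-baseline writing coefficients are negative.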

Getting ready

In order to complete this recipe we shall be using a student dataset. The first step is collecting the data.

Step 1 - collecting data

The student dataset titled hsbdemo is used. It is available at http://voia.yolasite.com/resources/hsbdemo.csv as a CSV file, which can be opened in MS Excel. There are 201 data rows and 13 variables in the dataset. The eight numeric measurements are as follows:

  • id
  • read
  • write
  • math
  • science
  • socst
  • awards
  • cid

The non-numeric measurements are as follows:

  • gender
  • ses
  • schtyp
  • prog
  • honors
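
The split between numeric and non-numeric variables matters when the file is loaded, because the numeric columns must be converted from text. The sketch below shows one way to do that with Python's standard csv module. It reads a two-row in-memory stand-in rather than the real file: the column order and every value in SAMPLE are invented for illustration, and only the column names come from the lists above.

```python
import csv
import io

NUMERIC = ["id", "read", "write", "math", "science", "socst", "awards", "cid"]
CATEGORICAL = ["gender", "ses", "schtyp", "prog", "honors"]

# Made-up rows standing in for hsbdemo.csv; column order is an assumption.
SAMPLE = """id,gender,ses,schtyp,prog,read,write,math,science,socst,honors,awards,cid
1,female,low,public,general,40,42,41,39,45,not enrolled,0,1
2,male,high,private,academic,60,62,58,61,59,enrolled,3,2
"""


def load_students(handle):
    """Read student rows, keeping categorical text and parsing numeric columns."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {name: raw[name] for name in CATEGORICAL}
        row.update({name: float(raw[name]) for name in NUMERIC})
        rows.append(row)
    return rows


students = load_students(io.StringIO(SAMPLE))
print(len(students), "rows,", len(students[0]), "variables")
```

To read the downloaded file instead, the same function could be called on open("hsbdemo.csv", newline=""), assuming the file has a header row with these column names.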

How to do it...

Let's get into the...

Tobit regression - measuring the students' academic aptitude


Let us measure the academic aptitude of a student on a scale of 200-800. The measurement is modeled using reading and math scores. The type of program in which the student is enrolled is also taken into consideration. There are three types of programs: academic, general, and vocational. The problem is that students who answer all the questions on the academic aptitude test correctly all score 800, even though it is unlikely that these students are truly equal in aptitude. The same may be true of students who answer all the questions incorrectly and score 200. In other words, the observed score is censored at both ends of the scale.
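
The effect of the 200 and 800 limits on the model can be sketched as a likelihood calculation. The following is an illustrative two-limit Tobit log-likelihood for a single student, assuming a normally distributed latent aptitude with mean mu (which in the recipe would come from the reading score, math score, and program type) and standard deviation sigma. The function names and the example values are assumptions for the illustration.

```python
import math

LOWER, UPPER = 200.0, 800.0  # limits of the aptitude scale


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(y, mu, sigma):
    """Log-likelihood of one observed score y under a two-limit Tobit model."""
    if y <= LOWER:
        # Censored from below: all we know is that latent aptitude <= 200.
        return math.log(normal_cdf((LOWER - mu) / sigma))
    if y >= UPPER:
        # Censored from above: all we know is that latent aptitude >= 800.
        return math.log(normal_cdf((mu - UPPER) / sigma))
    # Uncensored: ordinary normal density of the observed score.
    return math.log(normal_pdf((y - mu) / sigma) / sigma)


# Hypothetical students: one at the ceiling, one in the interior.
print(tobit_loglik(800.0, mu=780.0, sigma=50.0))
print(tobit_loglik(500.0, mu=520.0, sigma=50.0))
```

A censored score contributes a tail probability rather than a density, which is what distinguishes this model from ordinary least squares on the observed scores.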

Getting ready

In order to complete this recipe we shall be using a student dataset. The first step is collecting the data.

Step 1 - collecting data

To develop the Tobit regression model we shall use the student dataset titled tobit, which is available at http://www.ats.ucla.edu/stat/data/tobit.csv as a CSV file that can be opened in MS Excel. There are...

Poisson regression - understanding species present in Galapagos Islands


The Galapagos Islands are situated in the Pacific Ocean about 1000 km from the Ecuadorian coast. The archipelago consists of 13 islands, five of which are inhabited. The islands are rich in flora and fauna. Scientists are still perplexed by the fact that such a diverse set of species can flourish in such a small and remote group of islands.

Getting ready

In order to complete this recipe we shall be using a species dataset. The first step is collecting the data.

Step 1 - collecting and describing the data

We will utilize the number of species dataset titled gala that is available at https://github.com/burakbayramli/kod/blob/master/books/Practical_Regression_Anove_Using_R_Faraway/gala.txt .

The dataset includes 30 cases and seven variables. The seven numeric measurements are as follows:

  • Species
  • Endemics
  • Area
  • Elevation
  • Nearest
  • Scruz
  • Adjacent
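
A count such as the number of species is a natural response for Poisson regression, in which the expected count is the exponential of a linear predictor. The sketch below fits such a model with one predictor by Newton's method, using only the standard library. It assumes a count response like Species, and the data points are invented for illustration; they are not values from the gala dataset.

```python
import math

# Made-up (predictor, count) pairs; NOT values from the gala dataset.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1, 2, 4, 7, 12]


def fit_poisson(xs, ys, iterations=50):
    """Fit log(mean) = b0 + b1 * x by Newton's method on the log-likelihood."""
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the Poisson log-likelihood (the score equations).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum(x * (y - m) for x, y, m in zip(xs, ys, mus))
        # Negative Hessian (Fisher information) for the log link.
        h00 = sum(mus)
        h01 = sum(x * m for x, m in zip(xs, mus))
        h11 = sum(x * x * m for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1


b0, b1 = fit_poisson(xs, ys)
fitted = [math.exp(b0 + b1 * x) for x in xs]
print(b0, b1)
```

With several predictors, as in the seven-variable dataset, the same update generalizes to a matrix solve, which is what standard generalized linear model routines do internally.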

How to do it...

Let's get into the details.

Step 2 - exploring the data...
