You're reading from Julia Cookbook (1st Edition, Packt Publishing, September 2016, ISBN-13: 9781785882012).

Author: Jalem Raj Rohit

Jalem Raj Rohit is an IIT Jodhpur graduate with a keen interest in recommender systems, machine learning, and serverless and distributed systems. Raj currently works as a senior consultant in data science and NLP at Episource, before which he worked at Zomato and Kayako. He contributes to open source projects in Python, Go, and Julia. He also speaks at tech conferences about serverless engineering and machine learning.

Chapter 4. Building Data Science Models

In this chapter, we will cover the following recipes:

  • Dimensionality reduction

  • Linear discriminant analysis

  • Data preprocessing

  • Linear regression

  • Score-based classification

  • Clustering

  • Performance evaluation and model selection

  • Cross validation

  • Distances

  • Distributions

  • Bayesian basics

  • Time series analysis

Introduction


In this chapter, you will learn about various data science and statistical models. You will learn to design, customize, and apply them to various data science problems. This chapter will also teach you about model selection and the ways to build and understand robust statistical models.

Dimensionality reduction


In this recipe, you will learn about dimensionality reduction: a family of techniques that statisticians and data scientists apply when a dataset has a large number of dimensions. Reducing the dimensionality makes computation cheaper and model design easier. We will use the Principal Component Analysis (PCA) algorithm for this recipe.

Getting ready

To get started with this recipe, you need the MultivariateStats Julia package installed. This can be done by entering Pkg.add("MultivariateStats") in the Julia REPL. When using it for the first time, it might show a long list of warnings; however, you can safely ignore them for the time being, as they in no way affect the algorithms and techniques that we will use in this chapter.

How to do it...

  1. Firstly, let's simulate a hundred random observations to serve as a training set for the PCA algorithm. This can be done using the randn() function:

    X = randn(100,3) * [0.8 0.7; 0.9 0.5; 0.2 0.6]
    

  2. Now, to fit...
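The fitting step above is truncated. As a minimal sketch of how it might continue, assuming Julia 1.x and the fit/transform API of MultivariateStats (which expects variables in rows and observations in columns; transform has been renamed predict in newer releases):

```julia
using MultivariateStats

# Simulate 100 observations of a 2-dimensional signal, as in step 1
X = randn(100, 3) * [0.8 0.7; 0.9 0.5; 0.2 0.6]

# fit(PCA, ...) expects a d-by-n matrix (variables in rows),
# so transpose the 100x2 observation matrix first
Xt = permutedims(X)
M = fit(PCA, Xt; maxoutdim = 1)

# Project the observations onto the principal axis
Y = transform(M, Xt)    # a 1-by-100 matrix of scores
```

With maxoutdim = 1, the two correlated columns are compressed to a single principal component.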

Linear discriminant analysis


Linear discriminant analysis (LDA) is an algorithm used for classification tasks. It finds the linear combination of the input features that best separates the observations into classes. In this recipe we deal with two classes; however, multi-class classification can also be done through discriminant analysis, using what is called the multi-class linear discriminant analysis algorithm.

Getting ready

To get started with this recipe, you have to clone the DiscriminantAnalysis.jl library from GitHub. This can be done with the following command:

Pkg.clone("https://github.com/trthatcher/DiscriminantAnalysis.jl.git")

Then, we can import the library by referencing its name, DiscriminantAnalysis. This can be done as follows:

using DiscriminantAnalysis

We also have to use the DataFrames library from Julia. If this library isn't installed on your system, it can be added with the following command:

Pkg.add("DataFrames...

Data preprocessing


Data preprocessing is one of the most important parts of an analytics or data science pipeline. It involves methods and techniques to sanitize the data being used, quick hacks that make the dataset easier to handle, and the elimination of unnecessary data so that the analytics process stays lightweight and efficient. For this recipe, we will use the MLBase package of Julia, often described as the Swiss Army knife for writing machine learning code. Installation and setup instructions for the library are explained in the Getting ready section.

Getting ready

  1. To get started with this recipe, you have to add the MLBase Julia package, which can be done by running the Pkg.add() function in the REPL. It can be done as follows:

    Pkg.add("MLBase")
    
  2. After installing the package, it can be imported using the using ... command in the REPL. It can be done as follows:

    using MLBase
    

After importing the package following the preceding steps, you are ready to dive into the How to...
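As a small taste of the preprocessing utilities, MLBase provides label-encoding helpers (labelmap, labelencode, and labeldecode) that convert categorical labels to integers and back; a minimal sketch with made-up labels:

```julia
using MLBase

labels = ["male", "female", "female", "male", "female"]

lm = labelmap(labels)               # map each distinct label to an integer
codes = labelencode(lm, labels)     # encode the labels as integers
decoded = labeldecode(lm, codes)    # recover the original strings
```

Integer-coded labels are what most downstream MLBase routines, such as the classifiers and evaluation functions later in this chapter, expect.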

Linear regression


Linear regression is a linear model used to predict numerical values. It is one of the most basic and important starting points for understanding linear models and predictive analytics. For this recipe, we will use Julia's GLM.jl package.

Getting ready

To get started with this recipe, you have to add the GLM.jl Julia package. It can be added in the REPL using the Pkg.add() command, just as we added other packages before. This can be done as follows:

Pkg.add("GLM")

Now, import the package using the using command. The DataFrames package also needs to be imported. This can be done as follows:

using GLM
using DataFrames

How to do it...

  1. Here, we will attempt to perform a simple linear regression on two basic arrays generated on-the-fly. Let's call the two arrays A and B, and then create a DataFrame containing them. This can be done as follows:

    df = DataFrame(A = [3, 6, 9], B = [34, 56, 67])
    

  2. Now the...
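The regression step above is truncated; a minimal sketch of fitting and inspecting the model with GLM (recent releases require the @formula macro, while the 2016-era API accepted lm(B ~ A, df) directly):

```julia
using GLM, DataFrames

df = DataFrame(A = [3, 6, 9], B = [34, 56, 67])

model = lm(@formula(B ~ A), df)   # ordinary least squares fit of B on A

coef(model)      # [intercept, slope]; for this data the slope is 5.5
predict(model)   # fitted values of B
```

The slope follows from the OLS formula: the covariance term sums to 99 and the variance term to 18, giving 99 / 18 = 5.5.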

Classification


Classification is one of the core concepts of data science; it attempts to sort data into different classes or groups. A simple example is classifying a particular population of people as male or female, based on the data provided. In this recipe, we will learn to perform score-based classification, where each class is assigned a score, and the class with the lowest or the highest score is selected, depending on the problem and the analyst's choice.

Getting ready

To get ready, the MLBase library has to be installed and imported. So, as we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it using the using MLBase command:

using MLBase

How to do it...

  1. We will explore score-based classification algorithms and techniques by creating simple arrays and matrices that can fulfill our purpose. The first and the most important function is the classify() function, which takes in the...
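The step above is truncated; a minimal sketch of MLBase's classify(), which returns the index of the highest score (the score values here are made up):

```julia
using MLBase

scores = [0.2, 0.5, 0.3]
classify(scores)        # index of the highest score, i.e. 2

# For a matrix, each column holds one sample's score vector,
# and classify returns one label per column
S = [0.7 0.1;
     0.2 0.8;
     0.1 0.1]
classify(S)             # [1, 2]
```

The integer labels it returns line up with the label encodings produced in the Data preprocessing recipe.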

Performance evaluation and model selection


Analysis of performance is very important for any analytics or machine learning process, and it also helps in model selection. There are several evaluation metrics that can be applied to ML models. The choice of metric depends on the type of data problem being handled, the algorithms used in the process, and the way the analyst wants to gauge the success of the predictions or the results of the analytics process.

Getting ready

To get ready, the MLBase library has to be installed and imported. So, as we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it using the using MLBase command.

How to do it...

  1. Firstly, the predictions and the ground truths need to be defined in order to evaluate the accuracy and performance of a machine learning model or an algorithm. They can take a simple form of a Julia array. This is how they can be defined:

    truths = [1, 2, 2, 4, 4, 3, 3, 3, 1]
    pred   = ...
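The prediction vector above is elided. As a sketch with a made-up pred of our own, MLBase's correctrate and errorrate compare the two vectors element-wise:

```julia
using MLBase

truths = [1, 2, 2, 4, 4, 3, 3, 3, 1]
pred   = [1, 2, 2, 4, 3, 3, 3, 1, 1]   # hypothetical predictions

correctrate(truths, pred)   # fraction of matching entries: 7/9 here
errorrate(truths, pred)     # the complement: 2/9
```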

Cross validation


Cross validation is one of the most underrated processes in the domain of data science and analytics, though it is very popular among practitioners of competitive data science. It is a model evaluation method that gives the analyst an idea of how well the model would perform on new data that the model has not yet seen. It is also extensively used to gauge and avoid overfitting, which occurs when an excessively precise fit on the training set leads to inaccurate or high-error predictions on the testing set.

Getting ready

To get ready, the MLBase library has to be installed and imported. So, as we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it using the using MLBase command. This can be done as follows:

using MLBase

How to do it...

  1. Firstly, we will look at the k-fold cross-validation method, which is one of the most popular cross validation methods used. The input data...
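The description above is truncated; a minimal sketch of MLBase's Kfold iterator, which yields k training-index subsets of 1:n (the held-out fold is the complement of each subset):

```julia
using MLBase

for train_inds in Kfold(10, 3)            # 3 folds over indices 1:10
    test_inds = setdiff(1:10, train_inds)
    println(train_inds, " | ", test_inds)
end
```

In practice, you would fit the model on the rows indexed by train_inds and evaluate it on test_inds in each iteration.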

Distances


In statistics, the distance between vectors or datasets is computed in various ways, depending on the problem statement and the properties of the data. These distances are often used in algorithms and techniques such as recommender systems, which help e-commerce companies such as Amazon and eBay recommend relevant products to customers.

Getting ready

To get ready, the Distances library has to be installed and imported. We install it using the Pkg.add() function. It can be done as follows:

Pkg.add("Distances")

Then, the package has to be imported for use in the session. It can be imported through the using ... command. This can be done as follows:

using Distances

How to do it...

  1. Firstly, we will look at the Euclidean distance. It is the ordinary distance between two points in Euclidean space. It can be calculated through the Pythagorean method: the square root of the sum of the squared element-wise differences. This can be done using the...
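The step above is truncated; a minimal sketch using the Distances package, which exposes both a convenience function and a distance-type interface:

```julia
using Distances

a = [0.0, 0.0]
b = [3.0, 4.0]

euclidean(a, b)               # sqrt(3^2 + 4^2) = 5.0
evaluate(Euclidean(), a, b)   # the same, via the distance-type API
```

The distance-type form is useful when the metric itself is a parameter, for example when passing Euclidean() or Cityblock() into a nearest-neighbor routine.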

Distributions


A probability distribution assigns a probability to each point or subset of outcomes in a random experiment. Every random experiment (and, in fact, the data of every data science experiment) follows some probability distribution, and the type of distribution the data follows is very important both for initiating the analytics process and for selecting the machine learning algorithms to implement. Note that, in a multivariate dataset, each variable might follow a separate distribution; it is not necessary that all variables in a dataset follow similar distributions.

Getting ready

To get ready, the Distributions library has to be installed and imported. We install it using the Pkg.add() function, as follows:

Pkg.add("Distributions")

Then the package has to be imported for use in the session. It can be imported through the using ... command, as follows:

using Distributions

How to do it...

  1. Firstly, let's start by...
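The step above is truncated; a minimal sketch of working with a distribution object from the Distributions package:

```julia
using Distributions

d = Normal(0.0, 1.0)   # the standard normal distribution

mean(d), std(d)        # (0.0, 1.0)
pdf(d, 0.0)            # density at 0: 1/sqrt(2pi), about 0.3989
cdf(d, 0.0)            # 0.5, by symmetry
rand(d, 5)             # five random draws
```

The same mean/pdf/cdf/rand interface works uniformly across the package's other distributions, such as Poisson or Gamma.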

Time series analysis


Time series is another very important form of data, widely used in stock markets, market analysis, and signal processing. The data has a time dimension, which makes it look like a signal, so in most cases signal analysis techniques and formulae apply to time series data, such as autocorrelation, cross-correlation, and so on, which we have already dealt with in the previous chapters. In this recipe, we will deal with methods for working with datasets in the time series format.

Getting ready

To get ready for the recipe, the TimeSeries and MarketData libraries have to be installed and imported. We install them using the Pkg.add() function, as follows:

Pkg.add("TimeSeries")
Pkg.add("MarketData")

Then the packages have to be imported for use in the session. They can be imported through the using ... command, as follows:

using TimeSeries
using MarketData

How to do it...

  1. The TimeArray format from the TimeSeries package makes it easy to interpret...
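The step above is truncated; as a sketch, here is how a TimeArray can be built by hand. The prices below are made up, and the timestamp/values accessors are from recent TimeSeries releases (older versions exposed struct fields instead); MarketData also ships ready-made sample series, such as cl with Apple closing prices.

```julia
using TimeSeries, Dates

ts = collect(Date(2016, 9, 1):Day(1):Date(2016, 9, 5))
ta = TimeArray(ts, [10.5, 10.8, 10.2, 11.0, 11.3])

ta[1:3]           # the first three rows of the TimeArray
timestamp(ta)     # the Date index
values(ta)        # the value column
```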

The rest of the chapter is locked.
