You're reading from Julia Cookbook
In this recipe, you will learn about the concept of dimensionality reduction: the set of techniques statisticians and data scientists use when data has a large number of dimensions, in order to make computation and model design easier. We will use the Principal Component Analysis (PCA) algorithm for this recipe.
To get started with this recipe, you have to have the MultivariateStats Julia package installed and running. This can be done by entering Pkg.add("MultivariateStats") in the Julia REPL. When using it for the first time, it might show a long list of warnings; however, you can safely ignore them for the time being. They in no way affect the algorithms and techniques that we will use in this chapter.
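Once the package is installed, a PCA fit-and-transform round trip might look like the following sketch. The data matrix here is synthetic, made up purely for illustration; MultivariateStats expects observations as columns:

```julia
using MultivariateStats

# Synthetic data for illustration: 5 features, 100 observations (one per column).
X = randn(5, 100)

# Fit a PCA model, keeping at most two principal components.
M = fit(PCA, X; maxoutdim=2)

# Project the data onto the principal subspace
# (newer MultivariateStats releases call this predict instead of transform).
Y = transform(M, X)

size(Y)   # at most 2 rows, still 100 columns
```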
Linear discriminant analysis is an algorithm used for classification tasks. It is often used to find the linear combination of the input features that separates the observations into classes. In this case, there will be two classes; however, multi-class classification can also be done through discriminant analysis, in which case the method is called multi-class linear discriminant analysis.
To get started with this recipe, you have to clone the DiscriminantAnalysis.jl library from GitHub. This can be done with the following command:
Pkg.clone("https://github.com/trthatcher/DiscriminantAnalysis.jl.git")
Then, we can import the library by calling it by its name, DiscriminantAnalysis. This can be done as follows:
using DiscriminantAnalysis
We also have to use the DataFrames library from Julia. If this library doesn't exist on your local system, it can be added with the following command:
Pkg.add("DataFrames")
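To make the idea behind two-class LDA concrete before using the library, here is a from-scratch sketch of Fisher's linear discriminant in plain Julia, on made-up data (this is the underlying technique, not the DiscriminantAnalysis.jl API itself):

```julia
using Statistics, LinearAlgebra, Random

Random.seed!(42)

# Synthetic two-class data: rows are observations, two features each.
X1 = randn(50, 2) .+ [2.0 2.0]    # class 1, mean near (2, 2)
X2 = randn(50, 2) .- [2.0 2.0]    # class 2, mean near (-2, -2)

mu1 = vec(mean(X1, dims=1))
mu2 = vec(mean(X2, dims=1))

# Pooled within-class scatter matrix.
Sw = (X1 .- mu1')' * (X1 .- mu1') + (X2 .- mu2')' * (X2 .- mu2')

# Fisher's discriminant direction: w ∝ Sw⁻¹(μ₁ - μ₂).
w = Sw \ (mu1 - mu2)

# Classify by projecting onto w and thresholding at the midpoint of the class means.
threshold = dot(w, (mu1 + mu2) / 2)
class1_hits = count(X1 * w .> threshold)   # how many class-1 points land on the right side
```

With well-separated classes like these, almost all points project onto the correct side of the threshold.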
Data preprocessing is one of the most important parts of an analytics or data science pipeline. It involves methods and techniques to sanitize the data being used, quick hacks to make the dataset easy to handle, and the elimination of unnecessary data to make it lightweight and efficient in the analytics process. For this recipe, we will use the MLBase package of Julia, which is known as the Swiss Army knife for writing machine learning code. Installation and setup instructions for the library are explained in the Getting ready section.
To get started with this recipe, you have to add the MLBase Julia package, which can be done by running the Pkg.add() function in the REPL, as follows:
Pkg.add("MLBase")
After installing the package, it can be imported using the using command in the REPL, as follows:
using MLBase
After importing the package following the preceding steps, you are ready to dive into the How to do it... section.
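As a small taste of MLBase's preprocessing utilities, its labelmap/labelencode pair turns categorical labels into integer codes and back. The labels below are invented for illustration:

```julia
using MLBase

# Categorical labels that we want to turn into integer codes.
labels = ["male", "female", "female", "male", "male"]

# Build a label map: each distinct label gets an integer index,
# in order of first appearance ("male" => 1, "female" => 2).
lm = labelmap(labels)

# Encode each label as its integer index.
codes = labelencode(lm, labels)

# Decode the integers back to the original labels.
labeldecode(lm, codes)
```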
Linear regression is a linear model used to determine and predict numerical values. It is one of the most basic and important starting points for understanding linear models and predictive analytics. For this recipe, we will use Julia's GLM.jl package.
To get started with this recipe, you have to add the GLM.jl Julia package. It can be added in the REPL using the Pkg.add() command, just like the other packages we added before. This can be done as follows:
Pkg.add("GLM")
Now, import the package using the using command. The DataFrames package is also required to be imported. This can be done as follows:
using GLM
using DataFrames
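With both packages loaded, fitting an ordinary least squares model might look like the sketch below. The dataset is made up for illustration (Y is roughly 2X plus noise):

```julia
using GLM, DataFrames

# A small, made-up dataset.
df = DataFrame(X = [1.0, 2.0, 3.0, 4.0, 5.0],
               Y = [2.1, 4.2, 5.9, 8.1, 9.8])

# Fit an ordinary least squares model Y ~ X.
model = lm(@formula(Y ~ X), df)

# Inspect the fitted intercept and slope.
coef(model)

# Predictions on the training data.
predict(model)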
Classification is one of the core concepts of data science and attempts to classify data into different classes or groups. A simple example of classification is trying to classify a particular population of people as male or female, depending on the data provided. In this recipe, we will learn to perform score-based classification, where each class is assigned a score, and the class with the lowest or the highest score is selected, depending on the problem and the analyst's choice.
To get ready, the MLBase library has to be installed and imported. As we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it with the using MLBase command:
using MLBase
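A quick sketch of score-based classification with MLBase, using invented scores: the classify function picks the class whose score is highest (to select a lowest-score class, negate the scores first).

```julia
using MLBase

# Scores for three classes for a single sample; higher is better here.
scores = [0.2, 0.5, 0.3]

# classify picks the index of the maximum score.
classify(scores)

# For a matrix, each column holds the score vector of one sample,
# and classify returns one class index per column.
S = [0.2 0.7;
     0.5 0.1;
     0.3 0.2]
classify(S)
```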
Analysis of performance is very important for any analytics or machine learning process. It also helps in model selection. There are several evaluation metrics that can be applied to ML models. The choice of technique depends on the type of data problem being handled, the algorithms used in the process, and the way the analyst wants to gauge the success of the predictions or the results of the analytics process.
To get ready, the MLBase library has to be installed and imported. As we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it with the using MLBase command.
Firstly, the predictions and the ground truths need to be defined in order to evaluate the accuracy and performance of a machine learning model or algorithm. They can take the simple form of a Julia array. This is how they can be defined:
truths = [1, 2, 2, 4, 4, 3, 3, 3, 1]
pred = ...
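The original prediction vector is elided above, so the one below is hypothetical, chosen only to illustrate MLBase's accuracy metrics:

```julia
using MLBase

truths = [1, 2, 2, 4, 4, 3, 3, 3, 1]
pred   = [1, 2, 2, 4, 3, 3, 3, 2, 1]   # hypothetical predictions, for illustration

# Fraction of predictions that match the ground truth (7 of 9 here).
correctrate(truths, pred)

# The complementary misclassification rate.
errorrate(truths, pred)
```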
Cross validation is one of the most underrated processes in the domain of data science and analytics. However, it is very popular among practitioners of competitive data science. It is a model evaluation method that can give the analyst an idea of how well the model would perform on new data that the model has not yet seen. It is also extensively used to gauge and avoid the problem of overfitting, which occurs due to an excessively precise fit on the training set, leading to inaccurate or high-error predictions on the testing set.
To get ready, the MLBase library has to be installed and imported. As we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it with the using MLBase command. This can be done as follows:
using MLBase
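As a sketch of what comes next, MLBase's Kfold generator splits sample indices into folds; each iteration yields the training indices, and the held-out test indices are their complement (the sample count of 10 here is made up):

```julia
using MLBase

# Split 10 samples into 3 folds; each iteration yields the training indices.
for train_inds in Kfold(10, 3)
    test_inds = setdiff(1:10, train_inds)
    println("train: ", train_inds, "  test: ", test_inds)
end
```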
In statistics, the distance between vectors or datasets is computed in various ways, depending on the problem statement and the properties of the data. These distances are often used in algorithms and techniques such as recommender systems, which help e-commerce companies such as Amazon and eBay recommend relevant products to customers.
To get ready, the Distances library has to be installed and imported. We install it using the Pkg.add() function. It can be done as follows:
Pkg.add("Distances")
Then, the package has to be imported for use in the session. It can be imported through the using command. This can be done as follows:
using Distances
Firstly, we will look at the Euclidean distance. It is the ordinary distance between two points in Euclidean space, calculated with the Pythagorean formula: the square root of the sum of the squared element-wise differences. This can be done using the...
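A minimal sketch of this with the Distances package, on a classic 3-4-5 right triangle so the answer is easy to check:

```julia
using Distances

x = [0.0, 0.0]
y = [3.0, 4.0]

# Euclidean distance via the generic evaluate interface...
evaluate(Euclidean(), x, y)

# ...or the convenience function; both give sqrt(3^2 + 4^2) = 5.0.
euclidean(x, y)
```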
A probability distribution assigns a probability to each point or subset of outcomes in a randomized experiment. Every random experiment (and, in fact, the data of every data science experiment) follows a certain probability distribution, and the type of distribution followed by the data is very important both for initiating the analytics process and for selecting the machine learning algorithms to be implemented. It should also be noted that, in a multivariate dataset, each variable might follow a separate distribution, so it is not necessary that all variables in a dataset follow similar distributions.
To get ready, the Distributions library has to be installed and imported. We install it using the Pkg.add() function, as follows:
Pkg.add("Distributions")
Then, the package has to be imported for use in the session. It can be imported through the using command, as follows:
using Distributions
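To give a feel for the package, here is a sketch with a standard normal distribution: query its parameters, evaluate its density, draw samples, and fit a normal back to those samples by maximum likelihood.

```julia
using Distributions

# A standard normal distribution.
d = Normal(0.0, 1.0)

mean(d), std(d)       # the distribution's parameters

# Density at the mean: 1/sqrt(2π) ≈ 0.3989.
pdf(d, 0.0)

# Draw samples, then recover an estimate of the distribution from them.
samples = rand(d, 10_000)
fit_mle(Normal, samples)
```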
Time series is another very important form of data, widely used in stock markets, market analysis, and signal processing. The data has a time dimension, which makes it look like a signal, so, in most cases, signal analysis techniques and formulae, such as autocorrelation, cross-correlation, and so on, which we have already dealt with in the previous chapters, are applicable to time series data. In this recipe, we will deal with methods to work with datasets in the time series format.
To get ready for the recipe, the TimeSeries and MarketData libraries have to be installed and imported. We install them using the Pkg.add() function, as follows:
Pkg.add("TimeSeries")
Pkg.add("MarketData")
Then, the packages have to be imported for use in the session. They can be imported through the using command, as follows:
using TimeSeries
using MarketData
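As a quick sketch of what these two packages provide: MarketData ships sample financial series (cl, for example, holds Apple closing prices as a TimeArray), and a TimeArray can also be built by hand from timestamps and values. The dates and prices below are invented for illustration:

```julia
using TimeSeries, MarketData
using Dates

# MarketData's bundled sample data: cl is a TimeArray of AAPL closing prices.
typeof(cl)

# A TimeArray built by hand from timestamps and values.
dates = Date(2020, 1, 1):Day(1):Date(2020, 1, 5)
ta = TimeArray(dates, [10.1, 10.4, 10.2, 10.8, 11.0])

# Basic time series operations, e.g. shifting the series by one step.
lag(ta)
```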