You're reading from Julia Cookbook
In this recipe, you will learn about the concept of dimensionality reduction: the set of techniques statisticians and data scientists use when data has a large number of dimensions, in order to make computation and model design easier. We will use the Principal Component Analysis (PCA) algorithm for this recipe.
To get started with this recipe, you have to have the MultivariateStats Julia package installed and running. This can be done by entering Pkg.add("MultivariateStats") in the Julia REPL. When using it for the first time, it might show a long list of warnings; however, you can safely ignore them for the time being. They in no way affect the algorithms and techniques that we will use in this chapter.
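Once the package is installed, a PCA fit-and-transform round trip might look like the following sketch. The data matrix here is synthetic, made up purely for illustration; MultivariateStats expects observations as columns:

```julia
using MultivariateStats

# Synthetic data for illustration: 5 features, 100 observations (one per column).
X = randn(5, 100)

# Fit a PCA model, keeping at most two principal components.
M = fit(PCA, X; maxoutdim=2)

# Project the data onto the principal subspace
# (newer MultivariateStats releases call this predict instead of transform).
Y = transform(M, X)

size(Y)   # at most 2 rows, still 100 columns
```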
Linear discriminant analysis is an algorithm used for classification tasks. It is often used to find the linear combination of the input features that separates the observations into classes. In this case, there will be two classes; however, multi-class classification can also be done through discriminant analysis, in which case the method is called multi-class linear discriminant analysis.
To get started with this recipe, you have to clone the DiscriminantAnalysis.jl library from GitHub. This can be done with the following command:
Pkg.clone("https://github.com/trthatcher/DiscriminantAnalysis.jl.git")
Then, we can import the library by calling it by its name, DiscriminantAnalysis. This can be done as follows:
using DiscriminantAnalysis
We also have to use the DataFrames library from Julia. If this library doesn't exist on your local system, it can be added with the following command:
Pkg.add("DataFrames")
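To make the idea behind two-class LDA concrete before using the library, here is a from-scratch sketch of Fisher's linear discriminant in plain Julia, on made-up data (this is the underlying technique, not the DiscriminantAnalysis.jl API itself):

```julia
using Statistics, LinearAlgebra, Random

Random.seed!(42)

# Synthetic two-class data: rows are observations, two features each.
X1 = randn(50, 2) .+ [2.0 2.0]    # class 1, mean near (2, 2)
X2 = randn(50, 2) .- [2.0 2.0]    # class 2, mean near (-2, -2)

mu1 = vec(mean(X1, dims=1))
mu2 = vec(mean(X2, dims=1))

# Pooled within-class scatter matrix.
Sw = (X1 .- mu1')' * (X1 .- mu1') + (X2 .- mu2')' * (X2 .- mu2')

# Fisher's discriminant direction: w ∝ Sw⁻¹(μ₁ - μ₂).
w = Sw \ (mu1 - mu2)

# Classify by projecting onto w and thresholding at the midpoint of the class means.
threshold = dot(w, (mu1 + mu2) / 2)
class1_hits = count(X1 * w .> threshold)   # how many class-1 points land on the right side
```

With well-separated classes like these, almost all points project onto the correct side of the threshold.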
Data preprocessing is one of the most important parts of an analytics or data science pipeline. It involves methods and techniques to sanitize the data being used, quick hacks to make the dataset easy to handle, and the elimination of unnecessary data to make it lightweight and efficient in the analytics process. For this recipe, we will use the MLBase package of Julia, which is known as the Swiss Army knife for writing machine learning code. Installation and setup instructions for the library are explained in the Getting ready section.
To get started with this recipe, you have to add the MLBase Julia package, which can be done by running the Pkg.add() function in the REPL, as follows:
Pkg.add("MLBase")
After installing the package, it can be imported using the using command in the REPL, as follows:
using MLBase
After importing the package following the preceding steps, you are ready to dive into the How to do it... section.
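As a small taste of MLBase's preprocessing utilities, its labelmap/labelencode pair turns categorical labels into integer codes and back. The labels below are invented for illustration:

```julia
using MLBase

# Categorical labels that we want to turn into integer codes.
labels = ["male", "female", "female", "male", "male"]

# Build a label map: each distinct label gets an integer index,
# in order of first appearance ("male" => 1, "female" => 2).
lm = labelmap(labels)

# Encode each label as its integer index.
codes = labelencode(lm, labels)

# Decode the integers back to the original labels.
labeldecode(lm, codes)
```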
Linear regression is a linear model used to determine and predict numerical values. It is one of the most basic and important starting points for understanding linear models and predictive analytics. For this recipe, we will use Julia's GLM.jl package.
To get started with this recipe, you have to add the GLM.jl Julia package. It can be added in the REPL using the Pkg.add() command, just like the other packages we added before. This can be done as follows:
Pkg.add("GLM")
Now, import the package using the using command. The DataFrames package is also required to be imported. This can be done as follows:
using GLM
using DataFrames
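With both packages loaded, fitting an ordinary least squares model might look like the sketch below. The dataset is made up for illustration (Y is roughly 2X plus noise):

```julia
using GLM, DataFrames

# A small, made-up dataset.
df = DataFrame(X = [1.0, 2.0, 3.0, 4.0, 5.0],
               Y = [2.1, 4.2, 5.9, 8.1, 9.8])

# Fit an ordinary least squares model Y ~ X.
model = lm(@formula(Y ~ X), df)

# Inspect the fitted intercept and slope.
coef(model)

# Predictions on the training data.
predict(model)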
Classification is one of the core concepts of data science and attempts to classify data into different classes or groups. A simple example of classification is trying to classify a particular population of people as male or female, depending on the data provided. In this recipe, we will learn to perform score-based classification, where each class is assigned a score, and the class with the lowest or the highest score is selected, depending on the problem and the analyst's choice.
To get ready, the MLBase library has to be installed and imported. As we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it with the using MLBase command:
using MLBase
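A quick sketch of score-based classification with MLBase, using invented scores: the classify function picks the class whose score is highest (to select a lowest-score class, negate the scores first).

```julia
using MLBase

# Scores for three classes for a single sample; higher is better here.
scores = [0.2, 0.5, 0.3]

# classify picks the index of the maximum score.
classify(scores)

# For a matrix, each column holds the score vector of one sample,
# and classify returns one class index per column.
S = [0.2 0.7;
     0.5 0.1;
     0.3 0.2]
classify(S)
```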
Analysis of performance is very important for any analytics or machine learning process. It also helps in model selection. There are several evaluation metrics that can be applied to ML models. The choice of technique depends on the type of data problem being handled, the algorithms used in the process, and the way the analyst wants to gauge the success of the predictions or the results of the analytics process.
To get ready, the MLBase library has to be installed and imported. As we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it with the using MLBase command.
Firstly, the predictions and the ground truths need to be defined in order to evaluate the accuracy and performance of a machine learning model or algorithm. They can take the simple form of a Julia array. This is how they can be defined:
truths = [1, 2, 2, 4, 4, 3, 3, 3, 1]
pred = ...
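The original prediction vector is elided above, so the one below is hypothetical, chosen only to illustrate MLBase's accuracy metrics:

```julia
using MLBase

truths = [1, 2, 2, 4, 4, 3, 3, 3, 1]
pred   = [1, 2, 2, 4, 3, 3, 3, 2, 1]   # hypothetical predictions, for illustration

# Fraction of predictions that match the ground truth (7 of 9 here).
correctrate(truths, pred)

# The complementary misclassification rate.
errorrate(truths, pred)
```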
Cross validation is one of the most underrated processes in the domain of data science and analytics. However, it is very popular among practitioners of competitive data science. It is a model evaluation method that can give the analyst an idea of how well the model would perform on new data that the model has not yet seen. It is also extensively used to gauge and avoid the problem of overfitting, which occurs due to an excessively precise fit on the training set, leading to inaccurate or high-error predictions on the testing set.
To get ready, the MLBase library has to be installed and imported. As we already installed it for the Preprocessing recipe, we don't need to install it again. Instead, we can directly import it with the using MLBase command. This can be done as follows:
using MLBase
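As a sketch of what comes next, MLBase's Kfold generator splits sample indices into folds; each iteration yields the training indices, and the held-out test indices are their complement (the sample count of 10 here is made up):

```julia
using MLBase

# Split 10 samples into 3 folds; each iteration yields the training indices.
for train_inds in Kfold(10, 3)
    test_inds = setdiff(1:10, train_inds)
    println("train: ", train_inds, "  test: ", test_inds)
end
```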
In statistics, the distance between vectors or datasets is computed in various ways, depending on the problem statement and the properties of the data. These distances are often used in algorithms and techniques such as recommender systems, which help e-commerce companies such as Amazon and eBay recommend relevant products to customers.
To get ready, the Distances library has to be installed and imported. We install it using the Pkg.add() function. It can be done as follows:
Pkg.add("Distances")
Then, the package has to be imported for use in the session. It can be imported through the using command. This can be done as follows:
using Distances
Firstly, we will look at the Euclidean distance. It is the ordinary distance between two points in Euclidean space, calculated with the Pythagorean formula: the square root of the sum of the squared element-wise differences. This can be done using the...
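A minimal sketch of this with the Distances package, on a classic 3-4-5 right triangle so the answer is easy to check:

```julia
using Distances

x = [0.0, 0.0]
y = [3.0, 4.0]

# Euclidean distance via the generic evaluate interface...
evaluate(Euclidean(), x, y)

# ...or the convenience function; both give sqrt(3^2 + 4^2) = 5.0.
euclidean(x, y)
```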
A probability distribution assigns a probability to each point or subset of outcomes in a randomized experiment. Every random experiment (and, in fact, the data of every data science experiment) follows a certain probability distribution, and the type of distribution followed by the data is very important both for initiating the analytics process and for selecting the machine learning algorithms to be implemented. It should also be noted that, in a multivariate dataset, each variable might follow a separate distribution, so it is not necessary that all variables in a dataset follow similar distributions.
To get ready, the Distributions library has to be installed and imported. We install it using the Pkg.add() function, as follows:
Pkg.add("Distributions")
Then, the package has to be imported for use in the session. It can be imported through the using command, as follows:
using Distributions
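To give a feel for the package, here is a sketch with a standard normal distribution: query its parameters, evaluate its density, draw samples, and fit a normal back to those samples by maximum likelihood.

```julia
using Distributions

# A standard normal distribution.
d = Normal(0.0, 1.0)

mean(d), std(d)       # the distribution's parameters

# Density at the mean: 1/sqrt(2π) ≈ 0.3989.
pdf(d, 0.0)

# Draw samples, then recover an estimate of the distribution from them.
samples = rand(d, 10_000)
fit_mle(Normal, samples)
```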
Time series is another very important form of data, widely used in stock markets, market analysis, and signal processing. The data has a time dimension, which makes it look like a signal, so, in most cases, signal analysis techniques and formulae, such as autocorrelation, cross-correlation, and so on, which we have already dealt with in the previous chapters, are applicable to time series data. In this recipe, we will deal with methods to work with datasets in the time series format.
To get ready for the recipe, the TimeSeries and MarketData libraries have to be installed and imported. We install them using the Pkg.add() function, as follows:
Pkg.add("TimeSeries")
Pkg.add("MarketData")
Then, the packages have to be imported for use in the session. They can be imported through the using command, as follows:
using TimeSeries
using MarketData
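As a quick sketch of what these two packages provide: MarketData ships sample financial series (cl, for example, holds Apple closing prices as a TimeArray), and a TimeArray can also be built by hand from timestamps and values. The dates and prices below are invented for illustration:

```julia
using TimeSeries, MarketData
using Dates

# MarketData's bundled sample data: cl is a TimeArray of AAPL closing prices.
typeof(cl)

# A TimeArray built by hand from timestamps and values.
dates = Date(2020, 1, 1):Day(1):Date(2020, 1, 5)
ta = TimeArray(dates, [10.1, 10.4, 10.2, 10.8, 11.0])

# Basic time series operations, e.g. shifting the series by one step.
lag(ta)
```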