Hands-On Data Science with Anaconda


Product type Book
Published in May 2018
Publisher Packt
ISBN-13 9781788831192
Pages 364 pages
Edition 1st Edition
Authors (2): Yuxing Yan, James Yan

Table of Contents (15) Chapters

Preface
Ecosystem of Anaconda
Anaconda Installation
Data Basics
Data Visualization
Statistical Modeling in Anaconda
Managing Packages
Optimization in Anaconda
Unsupervised Learning in Anaconda
Supervised Learning in Anaconda
Predictive Data Analytics – Modeling and Validation
Anaconda Cloud
Distributed Computing, Parallel Computing, and HPCC
References
Other Books You May Enjoy

Predictive Data Analytics – Modeling and Validation

Our ultimate objective in conducting data analysis is to find patterns that help us predict what might happen in the future. For the stock market, researchers and professionals conduct various tests to understand market mechanisms. In this case, many questions could be asked. What will the market index level be in the next five years? What will IBM's price range be next year? Will market volatility increase or decrease in the future? What might be the impact if governments change their tax policies? What are the potential gains and losses if one country launches a trade war with another? How do we predict a consumer's behavior by analyzing some related variables? Could we predict the probability that an undergraduate student will successfully graduate? Could we find an association between...

Understanding predictive data analytics

People have many questions about future events. An investor who could predict the future movement of a stock price could make more profit. A company that could forecast the trend of its products could increase its stock price and its products' market share. A government that could predict the impact of an aging population on society and the economy would have more incentive to design better policies for its budget and other related strategic decisions.

A university with a good grasp of the market demand for its graduates, in terms of quality and skill sets, could design better programs or launch new ones to satisfy the future needs of the labor force.

For a better prediction or forecast, researchers have to consider...

Useful datasets

One of the best data sources is the UCI Machine Learning Repository. When we go to the web page at https://archive.ics.uci.edu/ml/datasets.html, we see the following list:

For example, if we click the first dataset (Abalone), we see the following. To save space, only the top part is shown:

From the web page, users can download the dataset and find definitions of variables and even citations. The code that follows can be used to download a related R dataset:

dataSet<-"UCIdatasets" 
path<-"http://canisius.edu/~yany/RData/" 
con<-paste(path,dataSet,".RData",sep='') 
load(url(con)) 
dim(.UCIdatasets) 
head(.UCIdatasets) 

The related output is shown here:

From the preceding output, we know that the data has 427 observations, each of which is one dataset. For each dataset, we have 7 related features, such as Name, Data_Types, Default_Task...

Predicting future events

There are many techniques we could employ when trying to predict the future, such as moving average (MA), regression, auto-regression, and the like. Let's start with the simplest one, a moving average:

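movingAverageFunction <- function(data, n = 10) {
  # Start from a copy, so the first n-1 values keep their original prices
  out <- data
  # From position n onward, average the current value and the n-1 before it
  for (i in n:length(data)) {
    out[i] <- mean(data[(i - n + 1):i])
  }
  return(out)
}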

In the preceding program, the default value for the number of periods is 10. We could use the dataset called MSFT included in the R package called timeSeries (see the code that follows):

> library(timeSeries)
> data(MSFT)
> p<-MSFT$Close
> #
> ma<-movingAverageFunction(p,3)
> head(p)
[1] 60.6250 61.3125 60.3125 59.1250 56.5625 55.4375
> head(ma)
[1] 60.62500 61.31250 60.75000 60.25000 58.66667 57.04167
> mean(p[1:3])
[1] 60.75
> mean(p[2:4])
[1] 60.25

Manually, we find that the average...
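For readers working in Python, the same trailing moving average can be sketched as follows. This is a minimal sketch that mirrors the R function above, not code from the book: the name moving_average is hypothetical, and the six prices are the head(p) values printed in the R session. As in the R version, the first n-1 entries are left as the original prices.

```python
def moving_average(data, n=10):
    """Trailing n-period moving average (hypothetical helper).

    The first n-1 values are copied unchanged, as in the R version.
    """
    out = list(data)
    for i in range(n - 1, len(data)):
        window = data[i - n + 1 : i + 1]  # current value and the n-1 before it
        out[i] = sum(window) / n
    return out


# First six MSFT closing prices, copied from the R output above
prices = [60.6250, 61.3125, 60.3125, 59.1250, 56.5625, 55.4375]
ma = moving_average(prices, 3)
print([round(value, 5) for value in ma])

# A naive one-step-ahead forecast is simply the latest moving average
forecast = ma[-1]
print(round(forecast, 5))
```

Using the last moving-average value as the forecast for the next period is the simplest possible choice; it assumes the recent average level persists.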

Model selection

When looking for a good model, we sometimes face underfitting and overfitting. The first example is borrowed; you can download the program at http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py. It demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions. The true function is given here:

In the following program, we try to use linear and polynomial models to approximate the equation. The slightly modified code is shown here. The program tries to show the impact of different models in terms of underfitting and overfitting:

import sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import...
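Because the listing above is cut off, the idea can be illustrated with a compact sketch that needs only NumPy. It is not the book's program. It assumes the true function is the one used in the referenced scikit-learn example, cos(1.5πx), and it replaces the scikit-learn pipeline with numpy.polynomial.Polynomial.fit. The sample size, noise level, seed, and the names true_fun, mse, and results are choices made for this illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial


def true_fun(x):
    # Assumed true function, taken from the referenced scikit-learn example
    return np.cos(1.5 * np.pi * x)


rng = np.random.default_rng(0)
x_train = np.sort(rng.random(30))                      # 30 points in [0, 1)
y_train = true_fun(x_train) + rng.normal(scale=0.1, size=x_train.size)
x_grid = np.linspace(0.0, 1.0, 200)                    # noise-free evaluation grid


def mse(a, b):
    return float(np.mean((a - b) ** 2))


results = {}
for degree in (1, 4, 15):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    train_error = mse(model(x_train), y_train)
    grid_error = mse(model(x_grid), true_fun(x_grid))
    results[degree] = (train_error, grid_error)
    print(f"degree {degree:2d}: train MSE {train_error:.4f}, grid MSE {grid_error:.4f}")
```

The training error can only fall as the degree rises, since each model contains the lower-degree ones. The error against the true function is the number to watch: a straight line is too rigid to follow the curve (underfitting), while a very flexible polynomial may start to chase the noise (overfitting).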

Granger causality test

Saying that A causes B means that A is the reason that B happens. This is the common definition of causality. Granger causality is narrower: the Granger causality test determines whether the past values of one time series offer useful information for forecasting a second one. In the following code, a dataset called ChickEgg is used as an illustration. The dataset has two columns, the number of chickens and the number of eggs, with a timestamp:

> library(lmtest)
> data(ChickEgg)
> dim(ChickEgg)
[1] 54 2
> ChickEgg[1:5,]
chicken egg
[1,] 468491 3581
[2,] 449743 3532
[3,] 436815 3327
[4,] 444523 3255
[5,] 433937 3156

The question is: could we use this year's egg numbers to predict next year's chicken numbers? If this is true, our statement will be that the number of eggs Granger-causes the number of chickens. If this is not true, we...
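The logic behind the test can be sketched in Python with NumPy alone. This is an illustration of the general idea, not the routine in the lmtest package: we regress a series on its own lags (the restricted model), then add lags of the other series (the unrestricted model), and compare the residual sums of squares with an F statistic. The function name granger_f_stat and the synthetic series x and y are assumptions made for this example; the ChickEgg data is not used here, and no p-value is computed.

```python
import numpy as np


def granger_f_stat(y, x, lags=1):
    """F statistic for H0: lagged x adds nothing to an AR(lags) model of y."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    target = y[lags:]
    # Column k holds the series shifted back by k periods
    y_lags = np.column_stack([y[lags - k : n - k] for k in range(1, lags + 1)])
    x_lags = np.column_stack([x[lags - k : n - k] for k in range(1, lags + 1)])
    const = np.ones((n - lags, 1))
    restricted = np.hstack([const, y_lags])
    unrestricted = np.hstack([const, y_lags, x_lags])

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    df_resid = (n - lags) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / df_resid)


# Synthetic data in which x leads y by one period (assumed for illustration)
rng = np.random.default_rng(42)
n_obs = 200
x = rng.normal(size=n_obs)
noise = rng.normal(scale=0.1, size=n_obs)
y = np.empty(n_obs)
y[0] = noise[0]
y[1:] = 0.8 * x[:-1] + noise[1:]

f_x_to_y = granger_f_stat(y, x, lags=1)  # does past x help forecast y?
f_y_to_x = granger_f_stat(x, y, lags=1)  # does past y help forecast x?
print(f"x -> y: F = {f_x_to_y:.2f}")
print(f"y -> x: F = {f_y_to_x:.2f}")
```

A large F statistic in one direction and a small one in the other is the pattern we would read as "x Granger-causes y, but not the reverse". In practice, the statistic is compared against an F distribution to obtain a p-value.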

Summary

In this chapter, we have discussed predictive data analytics, modeling and validation, some useful datasets, time series analytics, how to predict future events, seasonality, and how to visualize our data. For Python packages, we have mentioned prsklearn and catwalk. For R packages, we have discussed datarobot, LiblineaR, and eclust. For Julia packages, we explained QuantEcon. For Octave, we have explained ltfat.

In the next chapter, we will discuss Anaconda Cloud. Some topics include the Jupyter Notebook in depth, different formats of the Jupyter Notebooks, how to share notebooks with your partners, how to share different projects over different platforms, how to share your working environments, and how to replicate others' environments locally.

Review questions and exercises

  1. Why do we care about predicting the future?
  2. What does seasonality mean? How could it impact our predictions?
  3. How does one measure the impact of seasonality?
  4. Write an R program to use the moving average of the last five years to predict the next year's expected return. The source of the data is http://finance.yahoo.com. You can test a few stocks such as IBM, C, and WMT. In addition, apply the same method to the S&P500 index. What is your conclusion?
  5. Assume that we have the following true model:

Write a Python program to use linear and polynomial models to approximate the previous function and show the related graphs.

  6. Download monthly data for a market index and estimate its next year's annual return. The S&P500 could be used as the index and Yahoo! Finance at finance.yahoo.com could be used as the source of data. Source of data: https...