You're reading from  Hands-On Data Science with Anaconda

Product type: Book
Published in: May 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788831192
Edition: 1st Edition
Authors (2):
Yuxing Yan

Yuxing Yan graduated from McGill University with a PhD in finance. Over the years, he has been teaching various finance courses at eight universities: McGill University and Wilfrid Laurier University (in Canada), Nanyang Technological University (in Singapore), Loyola University of Maryland, UMUC, Hofstra University, University at Buffalo, and Canisius College (in the US). His research and teaching areas include: market microstructure, open-source finance and financial data analytics. He has 22 publications including papers published in the Journal of Accounting and Finance, Journal of Banking and Finance, Journal of Empirical Finance, Real Estate Review, Pacific Basin Finance Journal, Applied Financial Economics, and Annals of Operations Research. He is good at several computer languages, such as SAS, R, Python, Matlab, and C. His four books are related to applying two pieces of open-source software to finance: Python for Finance (2014), Python for Finance (2nd ed., expected 2017), Python for Finance (Chinese version, expected 2017), and Financial Modeling Using R (2016). In addition, he is an expert on data, especially on financial databases. From 2003 to 2010, he worked at Wharton School as a consultant, helping researchers with their programs and data issues. In 2007, he published a book titled Financial Databases (with S.W. Zhu). This book is written in Chinese. Currently, he is writing a new book called Financial Modeling Using Excel — in an R-Assisted Learning Environment. The phrase "R-Assisted" distinguishes it from other similar books related to Excel and financial modeling. 
New features include using a huge amount of public data related to economics, finance, and accounting; an efficient way to retrieve data: 3 seconds for each time series; a free financial calculator, showing 50 financial formulas instantly, 300 websites, 100 YouTube videos, 80 references, paperless for homework, midterms, and final exams; easy to extend for instructors; and especially, no need to learn R.

James Yan

James Yan is an undergraduate student at the University of Toronto (UofT), currently double-majoring in computer science and statistics. He has hands-on knowledge of Python, R, Java, MATLAB, and SQL. During his study at UofT, he has taken many related courses, such as Methods of Data Analysis I and II, Methods of Applied Statistics, Introduction to Databases, Introduction to Artificial Intelligence, and Numerical Methods, including a capstone course on AI in clinical medicine.

Data Basics

In this chapter, we'll first discuss sources of open data, which include the University of California at Irvine (UCI) Machine Learning Repository, the Bureau of Labor Statistics, the Census Bureau, Professor French's Data Library, and the Federal Reserve's Data Library. Then, we will show you several ways of inputting data, how to deal with missing values, and how to sort data, choose a subset, merge different datasets, and output data. For different languages, such as Python, R, and Julia, several relevant packages for data manipulation will be introduced as well. In particular, the Python pandas package will be discussed.

In this chapter, the following topics will be covered:

  • Sources of data
  • Introduction to the Python pandas package
  • Several ways to input data
  • Introduction to the Quandl data delivery platform
  • Dealing with missing data
  • Sorting data...

Sources of data

For users in the area of data science and business analytics, one important issue is the source of data, or simply where to get data. When working at a company, the obvious source is the company's own data, such as sales, the cost of raw materials, the salaries of managers and other employees, information about suppliers and clients, and estimates of future sales. It is also a good idea to find some data for learning purposes, and this is especially true for full-time students.

Generally speaking, there are two types of data: public and private. Private or proprietary databases are quite expensive. A typical example is the Center for Research in Security Prices (CRSP) database, a financial database generated and maintained by the University of Chicago. This database has daily, weekly, monthly, and annual trading data for...

UCI machine learning

The UCI maintains 413 datasets, as of 1/10/2018, for machine learning: http://archive.ics.uci.edu/ml/index.php. The following screenshot shows the top three downloaded datasets:

For the number one downloaded dataset called Iris, we have the following information:

The beauty of these datasets is that they give quite detailed information, such as the source, the creator or donor, a description, and even citations.

The following table shows several potential public data sources for users in the area of data science and business analytics:

Name                                  Web page                                        Data types
UCI                                   http://archive.ics.uci.edu/ml/index.php         Data for machine learning
World Health Organization             http://www.who.int/en/                          Healthcare data
Amazon Web Services                   https://aws.amazon.com/cn/datasets/?nc1=h_ls    Web usage
Data.gov (US Government Open Data)    https://www.data...

Introduction to the Python pandas package

The Python pandas package is a very useful tool for data preprocessing, which is essential for data analysis. There is a humorous way of describing the importance of data cleaning: "A data scientist spends 80% of their time cleaning the data and the other 20% complaining about cleaning the data". To test whether the package is preinstalled, we can type import pandas as pd after we launch Python. If we don't see any error messages, the package is installed. If we do, we can use conda install pandas to install it. To find all available functions, we could use the following three lines of Python code:
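A minimal sketch of such a listing, assuming only that pandas is installed; the variable name public_names is illustrative and not taken from the original text:

```python
import pandas as pd

# dir() returns every attribute of the module; drop the private ones
# (leading underscore) to focus on the public API.
public_names = [name for name in dir(pd) if not name.startswith("_")]
print(len(public_names))
print(public_names[:10])
```

Filtering out the underscore-prefixed names is a design choice made here to keep the list readable; calling dir(pd) on its own shows everything.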

To find out about the usage or examples of individual functions, the help() function can be used. For example, for the to_pickle functionality...

Several ways to input data

Inputting data using R

The R code is shown here:

> path<-"http://archive.ics.uci.edu/ml/machine-learning-databases/" 
> dataSet<-"iris/bezdekIris.data" 
> a<-paste(path,dataSet,sep='') 
> x<-read.csv(a,header=F) 
> colnames(x)<-c("sepalLength","sepalWidth","petalLength","petalWidth","Class"...
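For comparison, the same import can be sketched in Python with pandas. To keep the sketch runnable without a network connection, it reads the first three rows of the Iris data from an in-memory string; passing the full URL to pd.read_csv() instead should behave the same way, assuming the file is still hosted at that address. The column names follow the R code above, and the variable names are illustrative.

```python
import io
import pandas as pd

# Three sample rows in the same comma-separated layout as bezdekIris.data.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)
columns = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "Class"]

# header=None: the file has no header row, so the names are supplied explicitly.
iris = pd.read_csv(sample, header=None, names=columns)
print(iris.head())
```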

Introduction to the Quandl data delivery platform

Quandl is a data delivery platform that includes many free datasets. Its website is https://www.quandl.com. Here are a few programs written in R or Python to download data from the platform. The first program retrieves the latest 50 trading days' data for International Business Machines (IBM):

> library(Quandl) 
> x<- Quandl.dataset.get("WIKI/ibm", list(rows=50)) 
> head(x,2) 
        Date   Open    High     Low  Close  Volume Ex-Dividend 
1 2018-01-09 163.90 164.530 163.060 163.83 4333418           0 
2 2018-01-08 162.66 163.905 161.701 163.47 5101023           0 
  Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume 
1           1    163.90   164.530  163.060     163.83     4333418 
2           1    162.66   163.905  161.701     163.47     5101023 

Note that we just need to issue...

Dealing with missing data

First, let's look at the missing codes for different languages:

Language    Missing code    Explanation or examples
R           NA              NA stands for Not Available
Python      nan             import numpy as np; missingCode = np.nan
Julia       missing         julia> missing + 5 evaluates to missing
Octave      NaN             Same for MATLAB as well

Table 3.7: Missing codes for R, Python, Julia, and Octave

For R, the missing code is NA. Here are several functions we could use to remove those missing observations, shown in an example:

> head(na_example,20) 
[1]  2  1  3  2  1  3  1  4  3  2  2 NA  2  2  1  4 NA  1  1  2 
> length(na_example) 
[1] 1000 
> x<-na.exclude(na_example) 
> length(x) 
[1] 855 
> head(x,20) 
[1] 2 1 3 2 1 3 1 4 3 2 2 2 2 1 4 1 1 2 1 2 

In the previous example, we removed 145 missing values by using the R function called na.exclude(). We could...
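A rough Python counterpart can be sketched with pandas, under the assumption that missing values are stored as NaN. The values are the first 20 entries of na_example shown above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First 20 values of na_example from the R output; NA becomes np.nan.
values = [2, 1, 3, 2, 1, 3, 1, 4, 3, 2, 2, np.nan, 2, 2, 1, 4, np.nan, 1, 1, 2]
x = pd.Series(values)

n_missing = int(x.isna().sum())   # count the missing observations
clean = x.dropna()                # drop them, similar in spirit to na.exclude()

print(n_missing, len(x), len(clean))
```

Note that dropna() returns a new Series and leaves x unchanged, so the original data remains available for comparison.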

Data sorting

In R, we have several ways to sort data. The easiest way is to use the sort() function (see the code for the simplest one-dimensional data):

> set.seed(123) 
> x<-rnorm(100) 
> head(x) 
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499 
> y<-sort(x) 
> head(y) 
[1] -2.309169 -1.966617 -1.686693 -1.548753 -1.265396 -1.265061 

Let's look at another way to sort data. The dataset used is called nyseListing, which is included in the R package called fImport, shown here:

library(fImport) 
data(nyseListing) 
dim(nyseListing) 
head(nyseListing) 

The output is shown here:

In total, we have 3,387 observations, each with 4 variables. The dataset is sorted by Symbol, that is, by the tickers of individual stocks. Assume that we want to sort it by Name instead, as shown here:

> x<-nyseListing[order(nyseListing$Name),] 
> head(x...
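The same order()-style sort can be sketched in Python with pandas. Since nyseListing ships with an R package, the small table below is made up purely for illustration (the symbols and names are hypothetical); sort_values() plays the role of order() here.

```python
import pandas as pd

# Hypothetical listing table, initially sorted by Symbol.
listing = pd.DataFrame({
    "Symbol": ["AAA", "BBB", "CCC"],
    "Name": ["Zeta Corp", "Alpha Inc", "Midway Ltd"],
})

# Sort the rows by Name and renumber the index.
by_name = listing.sort_values("Name").reset_index(drop=True)
print(by_name)
```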

Introduction to the cbsodata Python package

To install and use the cbsodata Python package, perform the following steps:

  1. Install the package with one of the following commands:
conda install cbsodata
pip install cbsodata

For more detailed instructions about how to install a Python package, please see Chapter 6, Managing Packages.

  2. The next program shows one example of using the package:
import pandas as pd
import cbsodata as cb
name='82070ENG'                          # identifier of the table to download
data = pd.DataFrame(cb.get_data(name))   # convert the downloaded records to a DataFrame
print(data.head())
info=cb.get_info(name)                   # metadata about the table
print(info['Title'])
  3. The corresponding output is shown in the following screenshot:

The last line in the screenshot gives the title of the dataset. In the previous example, we used the dataset with the name 82070ENG.

  4. To list all available tables, we use the get_table_list() function; see the following code:
import cbsodata as cb 
list=cb...

Introduction to the datadotworld Python package

To install and use the datadotworld Python package, follow these steps:

  1. First, we have to install the package. To do so, we could try one of the following lines:
conda install datadotworld
pip install datadotworld
  2. After the package is successfully installed, we can use the dir() function to list all its functions, as shown in this screenshot:
  3. A user has to get an API token at https://data.world/settings/advanced in order to run a Python program. Without such a token, we might get the following error message if we run a datadotworld function:
  4. According to the error message, we must run the following configure command:
  5. Now we can use the Python package as shown in the following Python program:
import datadotworld as dw 
name='jonloyens/an-intro-to-dataworld-dataset' 
results = dw.query(name, 
    'SELECT * FROM...

Introduction to the haven and foreign R packages

The R package called haven imports and exports SPSS, Stata, and SAS files. It offers a collection of small utility functions for labelled data, covering reading and writing data between R and other statistical software packages such as SAS, SPSS, or Stata.

This includes easy ways to get, set, and change value and variable label attributes, convert labelled vectors into factors or numeric values (and vice versa), and deal with multiple declared missing values. The following example is about writing several specific outputs:

library(haven)
x<-1:100
y<-matrix(x,50,2)
z<-data.frame(y)
colnames(z)<-c("a","b")
write_sas(z,"c:/temp/tt.sas7bdat")
write_sav(z,"c:/temp/tt.sav")
write_dta(z,"...
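pandas offers a comparable export on the Python side. The following is a minimal sketch, assuming pandas is installed, that writes the same 50-by-2 table to a Stata file and reads it back; the temporary directory and the file name tt.dta are placeholders rather than paths from the original.

```python
import os
import tempfile
import pandas as pd

# Same layout as the R example: column a holds 1..50, column b holds 51..100.
z = pd.DataFrame({"a": range(1, 51), "b": range(51, 101)})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "tt.dta")
    z.to_stata(out_file, write_index=False)   # do not store the row index
    back = pd.read_stata(out_file)            # read the file back to check it

print(back.head())
```

The integer width may change on the round trip because Stata has its own numeric types, so the check below the fold should compare values rather than dtypes.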

Introduction to the dslabs R package

The dslabs R package is short for Data Science Labs. The package includes several datasets, such as the dataset called murders for US gun murders by state for 2010:

> library(dslabs) 
> data(murders) 
> head(murders) 
       state abb region population total 
1    Alabama  AL  South    4779736   135 
2     Alaska  AK   West     710231    19 
3    Arizona  AZ   West    6392017   232 
4   Arkansas  AR  South    2915918    93 
5 California  CA   West   37253956  1257 
6   Colorado  CO   West    5029196    65 

The following table shows the datasets included in the package:

Name of dataset      Description
admissions           Gender bias among graduate school admissions to UC Berkeley
divorce_margarine    Divorce rate and margarine consumption data
ds_theme_set         dslabs theme set
gapminder            Gapminder data
heights              Self-Reported...

Generating Python datasets

To generate a Python dataset, we use the pandas to_pickle functionality. The dataset we plan to generate is called adult.pkl, as shown in the following screenshot:

The related Python code is given here:

import pandas as pd 
path="http://archive.ics.uci.edu/ml/machine-learning-databases/" 
dataSet="adult/adult.data" 
inFile=path+dataSet 
x=pd.read_csv(inFile,header=None) 
adult=pd.DataFrame(x,index=None) 
adult= adult.rename(columns={0:'age',1: 'workclass', 
2:'fnlwgt',3:'education',4:'education-num', 
5:'marital-status',6:'occupation',7:'relationship', 
8:'race',9:'sex',10:'capital-gain',11:'capital-loss', 
12:'hours-per-week',13:'native-country',14:'class'}) 
adult.to_pickle("c:/temp...
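To check that a pickle file can be read back, pandas provides read_pickle() as the counterpart of to_pickle(). The following is a minimal round-trip sketch, assuming a two-row frame with made-up values in place of the full adult data and a temporary directory in place of c:/temp:

```python
import os
import tempfile
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from adult.data.
adult = pd.DataFrame({"age": [25, 40], "workclass": ["Private", "State-gov"]})

# Write the pickle to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = os.path.join(tmp_dir, "adult.pkl")
    adult.to_pickle(out_file)
    restored = pd.read_pickle(out_file)

print(restored.equals(adult))
```

Pickle files preserve column names and dtypes, which is the main reason to prefer them over CSV for intermediate results; they should only be loaded from sources you trust.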

Generating R datasets

Here, we show you how to generate an R dataset called iris.RData by using the R save() function:

path<-"http://archive.ics.uci.edu/ml/machine-learning-databases/" 
dataSet<-"iris/bezdekIris.data" 
a<-paste(path,dataSet,sep='') 
.iris<-read.csv(a,header=F) 
colnames(.iris)<-c("sepalLength","sepalWidth","petalLength","petalWidth","Class") 
save(.iris,file="c:/temp/iris.RData") 

To load the dataset, we use the load() function:

>load("c:/temp/iris.RData") 
> head(.iris) 
  sepalLength sepalWidth petalLength petalWidth       Class 
1         5.1        3.5         1.4        0.2 Iris-setosa 
2         4.9        3.0         1.4        0.2 Iris-setosa 
3         4.7        3.2         1.3        0.2 Iris-setosa 
4         4.6        3.1   ...

Summary

In this chapter, we first discussed sources of open data, which included the Bureau of Labor Statistics, the Census Bureau, Professor French's data library, the Federal Reserve's data library, and the UCI Machine Learning Repository. After that, we showed you how to input data; how to deal with missing data; how to sort, slice, and dice the datasets; and how to merge different datasets. Data output was discussed in detail. For different languages, such as Python, R, and Julia, several relevant packages for data manipulation were introduced and discussed.

In Chapter 4, Data Visualization, we will discuss data visualization in R, Python, and Julia separately. To make our visual presentation more eye-catching, we will show you how to generate simple graphs and bar charts, as well as how to add trend lines and legends. Other explanations will include how to save...

Review questions and exercises

  1. What is the difference between open data and proprietary databases?
  2. Is it enough for learners in the area of data science to use open data?
  3. Where can we access open public data?
  4. From the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/index.php, download a dataset called Wine. Write a program in R to import it.
  5. From the UCI Machine Learning Repository, download a dataset called Forest Fire. Write a program in Python to import it.
  6. From the UCI Machine Learning Repository, download a dataset called Bank Marketing. Write a program in Octave to import it. Answer the following questions: 1) How many banks? and 2) What is the cost?
  7. How can we find all R functions with read. as their leading letters? (Note that there is a dot after read.)
  8. How can we find more information on an R function called read.xls()?
  9. Explain the differences between two R functions: save() and saveRDS...