You're reading from Hands-On Data Science with Anaconda

Product type: Book

Published in: May 2018

Reading Level: Intermediate

Publisher: Packt

ISBN-13: 9781788831192

Edition: 1st Edition

Languages

Python

Tools

Jupyter Anaconda

Concepts

Data Science

Authors (2):

Yuxing Yan

James Yan


Data Basics

In this chapter, we'll first discuss sources of open data, which include the University of California at Irvine (UCI) Machine Learning Repository, the Bureau of Labor Statistics, the Census Bureau, Professor French's Data Library, and the Federal Reserve's Data Library. Then, we will show you several ways of inputting data, how to deal with missing values, how to sort data, how to choose a subset, how to merge different datasets, and how to output data. For different languages, such as Python, R, and Julia, several relevant packages for data manipulation will be introduced as well. In particular, the Python pandas package will be discussed.

In this chapter, the following topics will be covered:

Sources of data
Introduction to the Python pandas package
Several ways to input data
Introduction to the Quandl data delivery platform
Dealing with missing data
Sorting data...

Sources of data

For users in the area of data science and business analytics, one important issue is the source of data, or simply where to get data. When working at a company, the obvious source is the company's own data, such as sales, the cost of raw materials, the salaries of managers and other employees, information about suppliers and clients, estimates of future sales, and so on. It is also a good idea to find some data for learning purposes, and this is especially true for full-time students.

Generally speaking, there are two types of data: public and private. Private or proprietary databases are quite expensive. A typical example is the Center for Research in Security Prices (CRSP) database, a financial database generated and maintained by the University of Chicago. This database has daily, weekly, monthly, and annual trading data for...

UCI machine learning

The UCI maintains 413 datasets, as of 1/10/2018, for machine learning: http://archive.ics.uci.edu/ml/index.php. The following screenshot shows the top three downloaded datasets:

For the number one downloaded dataset called Iris, we have the following information:

The beauty of these datasets is that they give quite detailed information, such as the source, the creator or donor, a description, and even citations.

The following table shows several potential public data sources for users in the area of data science and business analytics:

Name	Web page	Data types
UCI	http://archive.ics.uci.edu/ml/index.php	Data for machine learning
World Health Organization	http://www.who.int/en/	Healthcare data
Amazon Web Services	https://aws.amazon.com/cn/datasets/?nc1=h_ls	Web usage
Data.gov (US Government Open Data)	https://www.data...

Introduction to the Python pandas package

The Python pandas package is a widely used tool for data preprocessing, which is essential for data analysis. There is a humorous way of describing the importance of data cleaning: "A data scientist spends 80% of their time cleaning the data and the other 20% complaining about cleaning the data". To test whether the package is preinstalled, we can type import pandas as pd after we launch Python. If we don't see any error messages, the package is installed. If we do, we can use conda install pandas to install it. To find all available functions, we could use the following three lines of Python code:
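The original listing is not reproduced in this excerpt. As a minimal sketch of the check described above, the following code tests whether a package can be imported and lists the public names of a module. The helper names is_installed and public_names are hypothetical and are introduced only for illustration:

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


def public_names(module):
    """Return the names in a module that do not start with an underscore."""
    return [name for name in dir(module) if not name.startswith("_")]


if is_installed("pandas"):
    import pandas as pd
    # Count the public functions, classes, and submodules exposed by pandas
    print(len(public_names(pd)))
else:
    # Same remedy as in the text above
    print("pandas not found; try: conda install pandas")
```

The same two helpers work for any module, so the listing can be reused to explore other packages mentioned in this chapter.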

To find out about the usage or examples of individual functions, the help() function can be used. For example, for the to_pickle functionality...

Several ways to input data

First, let's look at how to input a comma-separated values (CSV) file. The input dataset is the most popular one from the UCI Machine Learning Repository. The location is http://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data; you can refer to the following screenshot as well:

Inputting data using R

The R code is shown here:

> path<-"http://archive.ics.uci.edu/ml/machine-learning-databases/" 
> dataSet<-"iris/bezdekIris.data" 
> a<-paste(path,dataSet,sep='') 
> x<-read.csv(a,header=F) 
> colnames(x)<-c("sepalLength","sepalWidth","petalLength","petalWidth","Class"...
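For comparison, here is a minimal Python sketch of the same idea using pandas.read_csv(). To keep it runnable without network access, it reads three rows (taken from the head() output shown later in this chapter) from an in-memory string; passing the full URL in place of the sample object is assumed to work the same way when the server is reachable:

```python
import io

import pandas as pd

# Same column names as in the R example
COLUMNS = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "Class"]

# A small in-memory stand-in for the remote file (first three rows only)
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# header=None: the file has no header row; names= supplies the column names
iris = pd.read_csv(sample, header=None, names=COLUMNS)
print(iris.shape)
print(iris.head())
```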

Introduction to the Quandl data delivery platform

Quandl is a data delivery platform that includes many free datasets. Its website is https://www.quandl.com. The following are a few programs written in R or Python to download data from the platform. The following program retrieves the latest 50 trading days' data for International Business Machines (IBM):

> library(Quandl) 
> x<- Quandl.dataset.get("WIKI/ibm", list(rows=50)) 
> head(x,2) 
        Date   Open    High     Low  Close  Volume Ex-Dividend 
1 2018-01-09 163.90 164.530 163.060 163.83 4333418           0 
2 2018-01-08 162.66 163.905 161.701 163.47 5101023           0 
  Split Ratio Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume 
1           1    163.90   164.530  163.060     163.83     4333418 
2           1    162.66   163.905  161.701     163.47     5101023

Note that we just need to issue...

Dealing with missing data

First, let's look at the missing codes for different languages:

Language	Missing code	Explanation or examples
R	`NA`	`NA` stands for Not Available
Python	`nan`	`import scipy as sp` `missingCode=sp.nan`
Julia	`missing`	`julia> missing + 5` `missing`
Octave	`NaN`	Same for MATLAB as well

Table 3.7: Missing codes for R, Python, Julia, and Octave
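For the Python row of the table, the following is a minimal sketch of detecting and dropping missing values. It uses math.nan from the standard library rather than the scipy alias shown in the table, and the input list is a hypothetical example:

```python
import math

# Hypothetical observations; math.nan marks the missing ones
values = [2, 1, 3, math.nan, 2, math.nan, 4]

# nan never compares equal to itself, so test with math.isnan(), not ==
clean = [v for v in values if not math.isnan(v)]
n_missing = len(values) - len(clean)

print(len(values), len(clean), n_missing)
```

Because nan is not equal to anything, including itself, a filter such as v != math.nan would keep every element; math.isnan() is the reliable test.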

For R, the missing code is NA. Here are several functions we could use to remove those missing observations, shown in an example:

> head(na_example,20) 
[1]  2  1  3  2  1  3  1  4  3  2  2 NA  2  2  1  4 NA  1  1  2 
> length(na_example) 
[1] 1000 
> x<-na.exclude(na_example) 
> length(x) 
[1] 855 
> head(x,20) 
[1] 2 1 3 2 1 3 1 4 3 2 2 2 2 1 4 1 1 2 1 2

In the previous example, we removed 145 missing values by using the R function called na.exclude(). We could...

Data sorting

In R, we have several ways to sort data. The easiest way is to use the sort() function (see the code for the simplest one-dimensional data):

> set.seed(123) 
> x<-rnorm(100) 
> head(x) 
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499 
> y<-sort(x) 
> head(y) 
[1] -2.309169 -1.966617 -1.686693 -1.548753 -1.265396 -1.265061

Let's look at another way to sort data. The dataset used is called nyseListing, which is included in the R package called fImport, shown here:

library(fImport) 
data(nyseListing) 
dim(nyseListing) 
head(nyseListing)

The output is shown here:

In total, we have 3,387 observations, each with 4 variables. The dataset is sorted by Symbol, as in the tickers of individual stocks. Assume that we want to sort them by Name, as shown here:

> x<-nyseListing[order(nyseListing$Name),] 
> head(x...
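The two R approaches above have close Python counterparts. The following sketch uses the built-in sorted() function, first on a list of random numbers and then on a list of records ordered by one field. The records are hypothetical placeholders, not rows from nyseListing, and Python's random generator does not reproduce R's numbers for the same seed:

```python
import random

# One-dimensional data: 100 draws from a standard normal distribution
random.seed(123)
x = [random.gauss(0, 1) for _ in range(100)]
y = sorted(x)  # returns a new ascending list; x itself is unchanged

# Records sorted by one field, analogous to order(nyseListing$Name) in R
listings = [
    {"Symbol": "AAA", "Name": "Zeta Holdings"},
    {"Symbol": "BBB", "Name": "Alpha Corp"},
    {"Symbol": "CCC", "Name": "Midway Inc"},
]
by_name = sorted(listings, key=lambda row: row["Name"])

print(y[:3])
print([row["Symbol"] for row in by_name])
```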

Introduction to the cbsodata Python package

To install the cbsodata Python package, perform the following steps:

We can use one of the following commands:

conda install cbsodata 
pip install cbsodata

For more detailed instructions on installing Python packages, see Chapter 6, Managing Packages.

The next program shows one example of using the package:

import pandas as pd 
import cbsodata as cb 
name='82070ENG' 
data = pd.DataFrame(cb.get_data(name)) 
print(data.head()) 
info=cb.get_info(name) 
print(info['Title'])

The corresponding output is shown in the following screenshot:

The last line in the screenshot gives the name of the dataset. In the previous example, we used the dataset with the name 82070ENG.

To find the names of all available tables, we use the get_table_list() function; see the following code:

import cbsodata as cb 
list=cb...

Introduction to the datadotworld Python package

To install the datadotworld Python package, follow these steps:

First, we have to install the package. To do so, we could try one of the following lines:

conda install datadotworld 
pip install datadotworld

After the package is successfully installed, we can use the dir() function to list all its functions, as shown in this screenshot:

A user has to get an API token at https://data.world/settings/advanced in order to run a Python program. Without such a token, we might get the following error message if we run a datadotworld function:

According to the error message, we must run the following configure command:

Now we can use the Python package as shown in the following Python program:

import datadotworld as dw 
name='jonloyens/an-intro-to-dataworld-dataset' 
results = dw.query(name, 
    'SELECT * FROM...

Introduction to the haven and foreign R packages

The R package called haven imports and exports SPSS, Stata, and SAS files. It reads and writes data between R and these statistical software packages and supports working with labelled data.

This includes easy ways to get, set, and change value and variable label attributes, convert labelled vectors into factors or numeric values (and vice versa), and deal with multiple declared missing values. The following example is about writing several specific outputs:

library(haven)
x<-1:100
y<-matrix(x,50,2)
z<-data.frame(y)
colnames(z)<-c("a","b")
write_sas(z,"c:/temp/tt.sas7bdat")
write_sav(z,"c:/temp/tt.sav")
write_dta(z,"...

Introduction to the dslabs R package

The dslabs R package is short for Data Science Labs. The package includes several datasets, such as the dataset called murders for US gun murders by state for 2010:

> library(dslabs) 
> data(murders) 
> head(murders) 
       state abb region population total 
1    Alabama  AL  South    4779736   135 
2     Alaska  AK   West     710231    19 
3    Arizona  AZ   West    6392017   232 
4   Arkansas  AR  South    2915918    93 
5 California  CA   West   37253956  1257 
6   Colorado  CO   West    5029196    65

The following table shows the datasets included in the package:

Name of dataset	Description
admissions	Gender bias among graduate school admissions to UC Berkeley
divorce_margarine	Divorce rate and margarine consumption data
ds_theme_set	dslabs theme set
gapminder	Gapminder data
heights	Self-Reported...

Generating Python datasets

To generate a Python dataset, we use the pandas to_pickle functionality. The dataset we plan to generate is called adult.pkl, as shown in the following screenshot:

The related Python code is given here:

import pandas as pd 
path="http://archive.ics.uci.edu/ml/machine-learning-databases/" 
dataSet="adult/adult.data" 
inFile=path+dataSet 
x=pd.read_csv(inFile,header=None) 
adult=pd.DataFrame(x,index=None) 
adult= adult.rename(columns={0:'age',1: 'workclass', 
2:'fnlwgt',3:'education',4:'education-num', 
5:'marital-status',6:'occupation',7:'relationship', 
8:'race',9:'sex',10:'capital-gain',11:'capital-loss', 
12:'hours-per-week',13:'native-country',14:'class'}) 
adult.to_pickle("c:/temp...
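To confirm that a pickle file round-trips correctly, it can be read back with pandas.read_pickle(). The following is a minimal sketch under stated assumptions: the two rows are hypothetical sample values rather than the full adult dataset, and a temporary directory stands in for the output path used above:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample with two of the column names used above
adult = pd.DataFrame(
    {"age": [39, 50], "workclass": ["State-gov", "Self-emp-not-inc"]}
)

# Write to a temporary directory so the example runs on any operating system
out_path = os.path.join(tempfile.mkdtemp(), "adult_sample.pkl")
adult.to_pickle(out_path)

# Read the file back and compare it with the original DataFrame
restored = pd.read_pickle(out_path)
print(restored.equals(adult))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.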

Generating R datasets

Here, we show you how to generate an R dataset called iris.RData by using the R save() function:

path<-"http://archive.ics.uci.edu/ml/machine-learning-databases/" 
dataSet<-"iris/bezdekIris.data" 
a<-paste(path,dataSet,sep='') 
.iris<-read.csv(a,header=F) 
colnames(.iris)<-c("sepalLength","sepalWidth","petalLength","petalWidth","Class") 
save(.iris,file="c:/temp/iris.RData")

To load the dataset, we use the load() function:

> load("c:/temp/iris.RData") 
> head(.iris) 
  sepalLength sepalWidth petalLength petalWidth       Class 
1         5.1        3.5         1.4        0.2 Iris-setosa 
2         4.9        3.0         1.4        0.2 Iris-setosa 
3         4.7        3.2         1.3        0.2 Iris-setosa 
4         4.6        3.1   ...

Summary

In this chapter, we first discussed sources of open data, which included the Bureau of Labor Statistics, the Census Bureau, Professor French's data library, the Federal Reserve's data library, and the UCI Machine Learning Repository. After that, we showed you how to input data; how to deal with missing data; how to sort, slice, and dice the datasets; and how to merge different datasets. Data output was discussed in detail. For different languages, such as Python, R, and Julia, several relevant packages for data manipulation were introduced and discussed.

In Chapter 4, Data Visualization, we will discuss data visualization in R, Python, and Julia separately. To make our visual presentation more eye-catching, we will show you how to generate simple graphs and bar charts, as well as how to add trend lines and legends. Other explanations will include how to save...

Review questions and exercises

What is the difference between open data and proprietary databases?
Is it enough for learners in the area of data science to use open data?
Where can we access open public data?
From the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/index.php, download a dataset called Wine. Write a program in R to import it.
From the UCI Machine Learning Repository, download a dataset called Forest Fire. Write a program in Python to import it.
From the UCI Machine Learning Repository, download a dataset called Bank Marketing. Write a program in Octave to import it. Answer the following questions: 1) How many banks? and 2) What is the cost?
How can we find all R functions with read. as their leading letters? (Note that there is a dot after read.)
How can we find more information on an R function called read.xls()?
Explain the differences between two R functions: save() and saveRDS...


Authors (2)

Yuxing Yan

Yuxing Yan graduated from McGill University with a PhD in finance. Over the years, he has been teaching various finance courses at eight universities: McGill University and Wilfrid Laurier University (in Canada), Nanyang Technological University (in Singapore), Loyola University of Maryland, UMUC, Hofstra University, University at Buffalo, and Canisius College (in the US). His research and teaching areas include: market microstructure, open-source finance and financial data analytics. He has 22 publications including papers published in the Journal of Accounting and Finance, Journal of Banking and Finance, Journal of Empirical Finance, Real Estate Review, Pacific Basin Finance Journal, Applied Financial Economics, and Annals of Operations Research. He is good at several computer languages, such as SAS, R, Python, Matlab, and C. His four books are related to applying two pieces of open-source software to finance: Python for Finance (2014), Python for Finance (2nd ed., expected 2017), Python for Finance (Chinese version, expected 2017), and Financial Modeling Using R (2016). In addition, he is an expert on data, especially on financial databases. From 2003 to 2010, he worked at Wharton School as a consultant, helping researchers with their programs and data issues. In 2007, he published a book titled Financial Databases (with S.W. Zhu). This book is written in Chinese. Currently, he is writing a new book called Financial Modeling Using Excel — in an R-Assisted Learning Environment. The phrase "R-Assisted" distinguishes it from other similar books related to Excel and financial modeling. 
New features include using a huge amount of public data related to economics, finance, and accounting; an efficient way to retrieve data: 3 seconds for each time series; a free financial calculator, showing 50 financial formulas instantly, 300 websites, 100 YouTube videos, 80 references, paperless for homework, midterms, and final exams; easy to extend for instructors; and especially, no need to learn R.

James Yan

James Yan is an undergraduate student at the University of Toronto (UofT), currently double-majoring in computer science and statistics. He has hands-on knowledge of Python, R, Java, MATLAB, and SQL. During his study at UofT, he has taken many related courses, such as Methods of Data Analysis I and II, Methods of Applied Statistics, Introduction to Databases, Introduction to Artificial Intelligence, and Numerical Methods, including a capstone course on AI in clinical medicine.