You're reading from R for Data Science Cookbook

Product type: Book
Published in: Jul 2016
Reading level: Intermediate
ISBN-13: 9781784390815
Edition: 1st Edition
Author: Yu-Wei, Chiu (David Chiu)

Yu-Wei, Chiu (David Chiu) is the founder of LargitData (www.LargitData.com), a startup that focuses on big data and machine learning products. He previously worked at Trend Micro as a software engineer, where he was responsible for building big data platforms for business intelligence and customer relationship management systems. In addition to being a startup entrepreneur and data scientist, he specializes in using Spark and Hadoop to process big data and in applying data mining techniques for data analysis. Yu-Wei is also a professional lecturer who has delivered lectures on big data and machine learning in R and Python, and has given tech talks at a variety of conferences. In 2015, Yu-Wei wrote Machine Learning with R Cookbook (Packt Publishing), and in 2013 he reviewed Bioinformatics with R Cookbook (Packt Publishing). For more information, please visit his personal website at www.ywchiu.com.

Acknowledgement

I am immensely grateful to my family and friends for supporting and encouraging me to complete this book. I would like to sincerely thank my mother, Ming-Yang Huang (Miranda Huang); my mentor, Man-Kwan Shan; the proofreader of this book, Brendan Fisher; the members of LargitData; the Data Science Program (DSP); and other friends who have offered their support.

Chapter 2. Data Extracting, Transforming, and Loading

This chapter covers the following topics:

  • Downloading open data

  • Reading and writing CSV files

  • Scanning text files

  • Working with Excel files

  • Reading data from databases

  • Scraping web data

  • Accessing Facebook data

  • Working with twitteR

Introduction


Before using data to answer critical business questions, the most important step is to prepare it. Data is often archived in files that can easily be opened with Excel or a text editor. However, data may also live in a range of other sources, such as databases, websites, and various file formats, so being able to import data from all of these sources is crucial.

There are four main types of data. Data recorded in plain text is the simplest. For users who need a structured format, files with a .tab or .csv extension arrange data in a fixed number of columns. Excel, which has long had a leading role in data processing, uses the .xls and .xlsx formats. Knowing how to read and manipulate data from databases is another crucial skill. Moreover, since much data is not stored in any database at all, one must also know how to use web scraping techniques to obtain data from the Internet. As part of this chapter, we introduce...

Downloading open data


Before conducting any data analysis, an essential step is to collect high-quality, meaningful data. One important source is open data, which is selected, organized, and made freely available to the public. Most open data is published online, either as text files or through APIs. Here, we introduce how to download an open data file in text format with the download.file function.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Please perform the following steps to download open data from the Internet:

  1. First, visit the http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices link to view the historical price of the S&P 500 in Yahoo Finance:

    Figure 1: Historical price of S&P 500

  2. Scroll down to the bottom of the page, right-click and copy the link in Download to Spreadsheet (the link should appear similar to http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC&d=6&e...
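Once the link is copied, the download step itself is a one-liner. The sketch below is hedged: the URL is a shortened placeholder for the link copied in the previous step, since the full query string varies.

```r
# Hedged sketch: download a CSV with download.file(). Replace the
# placeholder URL with the "Download to Spreadsheet" link copied above.
url <- "http://real-chart.finance.yahoo.com/table.csv?s=%5EGSPC"  # placeholder
download.file(url, destfile = "snp500.csv")
```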

Reading and writing CSV files


In the previous recipe, we downloaded the historical S&P 500 market index from Yahoo Finance. We can now read the data into an R session for further examination and manipulation. In this recipe, we demonstrate how to read a file with an R function.

Getting ready

In this recipe, you need to have followed the previous recipe by downloading the S&P 500 market index text file to the current directory.

How to do it…

Please perform the following steps to read text data from the CSV file.

  1. First, determine the current directory with getwd, and use list.files to check where the file is, as follows:

    > getwd()
    > list.files('./')
    
  2. You can then use the read.table function to read data by specifying the comma as the separator:

    > stock_data <- read.table('snp500.csv', sep=',' , header=TRUE)
    
  3. Next, filter data by selecting the first six rows with column Date, Open, High, Low, and Close:

    > subset_data <- stock_data[1:6, c("Date", "Open", "High", "Low", "Close...
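As a self-contained illustration of the read.table and subsetting pattern above (independent of the downloaded file), the following sketch writes a one-row CSV with the same column names and reads it back; the sample values are made up:

```r
# Write a tiny CSV mirroring the S&P 500 file's layout (values are
# made up), then read it back and keep only the price columns.
writeLines(c("Date,Open,High,Low,Close,Volume,Adj.Close",
             "2015-07-02,2073.95,2080.56,2073.95,2076.78,2996540000,2076.78"),
           "mini_snp.csv")
stock_data  <- read.table("mini_snp.csv", sep = ",", header = TRUE)
subset_data <- stock_data[, c("Date", "Open", "High", "Low", "Close")]
dim(subset_data)  # 1 row, 5 columns
```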

Scanning text files


In previous recipes, we introduced how to use read.table and read.csv to load data into an R session. However, these functions work best when the number of columns is fixed and the data size is small. For more flexibility in data processing, we will demonstrate how to use the scan function to read data from a file.

Getting ready

In this recipe, you need to have completed the previous recipes and have snp500.csv downloaded in the current directory.

How to do it…

Please perform the following steps to scan data from the CSV file:

  1. First, you can use the scan function to read data from snp500.csv:

    > stock_data3 <- scan('snp500.csv', sep=',', what=list(Date='', Open=0, High=0, Low=0, Close=0, Volume=0, Adj_Close=0), skip=1, fill=T)
    Read 16481 records
    
  2. You can then examine loaded data with mode and str:

    > mode(stock_data3)
    [1] "list"
    > str(stock_data3)
    List of 7
     $ Date     : chr [1:16481] "2015-07-02" "2015-07-01" "2015-06-30" "2015-06-29" ...
     $ Open...
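To see how the what = list(...) template drives parsing without needing the downloaded file, here is a small self-contained sketch with two made-up rows; each element's type ('' for character, 0 for numeric) tells scan how to read that column:

```r
# Self-contained scan() sketch: the what = list(...) template maps each
# column to a type ("" = character, 0 = numeric); skip = 1 skips the header.
writeLines(c("Date,Close", "2015-07-02,2076.78", "2015-07-01,2077.42"),
           "mini_scan.csv")
parsed <- scan("mini_scan.csv", sep = ",",
               what = list(Date = "", Close = 0), skip = 1)
parsed$Close  # c(2076.78, 2077.42)
```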

Working with Excel files


Excel is another popular tool used to store and analyze data. Of course, one can convert Excel files to CSV or other text formats by using Excel itself. Alternatively, to simplify the process, you can install and load the xlsx package to read and process Excel data directly in R.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Please perform the following steps to read Excel documents:

  1. First, install and load the xlsx package:

    > install.packages("xlsx")
    > library(xlsx)
    
  2. Access www.data.worldbank.org/topic/economy-and-growth to find world economy indicator data in Excel:

    Figure 6: World economy indicator

  3. Download world economy indicator data from the following URL using download.file:

    > download.file("http://api.worldbank.org/v2/en/topic/3?downloadformat=excel", "worldbank.xls", mode="wb")
    
  4. Examine the downloaded file with Excel (or Open Office):

    Figure 7: Using Excel to examine...
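After the download, reading the workbook into a data frame can be sketched as follows. Note that read.xlsx requires Java, and the startRow value is an assumption (World Bank sheets typically begin with a few metadata rows), so adjust it after inspecting the file:

```r
# Hedged sketch: read the first sheet of the downloaded workbook.
# sheetIndex and startRow are assumptions -- inspect the file first.
library(xlsx)
indicators <- read.xlsx("worldbank.xls", sheetIndex = 1, startRow = 4)
head(indicators)
```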

Reading data from databases


As R reads data into memory, it is well suited to processing and analyzing small datasets. However, as enterprises accumulate far more data than individuals do in daily life, databases have become the common choice for storing and analyzing larger datasets. To access databases from R, one can use RJDBC, RODBC, or RMySQL as the communication bridge. In this section, we will demonstrate how to use RJDBC to connect to data stored in a database.

Getting ready

In this section, we need to prepare a MySQL environment first. If you have a MySQL environment installed on your machine (Windows), you can inspect server status from MySQL Notifier. If the local server is running, the server status should prompt localhost (Online), as shown in the following screenshot:

Figure 8: MySQL Notifier

Once we have our database server online, we need to validate whether we are authorized to access the database with a given username and password by using any database connection...
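With the server online and credentials verified, a typical RJDBC session looks like the sketch below. Everything in it is a placeholder: the connector JAR path, the database name (test), and the credentials must all be adapted to your environment.

```r
# Hedged RJDBC sketch -- driver JAR path, database name, and
# credentials are placeholders to adapt.
library(RJDBC)
drv  <- JDBC("com.mysql.jdbc.Driver",
             "/path/to/mysql-connector-java.jar")
conn <- dbConnect(drv, "jdbc:mysql://localhost:3306/test",
                  "user", "password")
dbGetQuery(conn, "SELECT 1")   # sanity-check the connection
dbDisconnect(conn)
```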

Scraping web data


In most cases, the majority of data will not exist in your database, but will instead be published in different forms on the Internet. To dig up more valuable information from these data sources, we need to know how to access and scrape data from the Web. Here, we will illustrate how to use the rvest package to harvest finance data from http://www.bloomberg.com/.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Perform the following steps to scrape data from http://www.bloomberg.com/:

  1. First, access the following link to browse the S&P 500 index on the Bloomberg Business website: http://www.bloomberg.com/quote/SPX:IND:

    Figure 9: S&P 500 index

  2. Once the page appears, as shown in the preceding screenshot, we can begin installing and loading the rvest package:

    > install.packages("rvest")
    > library(rvest)
    
  3. Next, you can use the HTML function from the rvest package to scrape and...
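The scraping step can be sketched as below. The CSS selector is a placeholder: Bloomberg's markup changes over time, so inspect the page (for example with your browser's developer tools) to find the element that holds the quote.

```r
# Hedged rvest sketch -- ".price" is a placeholder selector; inspect
# the page to find the real one before running.
library(rvest)
page  <- read_html("http://www.bloomberg.com/quote/SPX:IND")
price <- page %>% html_nodes(".price") %>% html_text()
price
```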

Accessing Facebook data


Social network data is another great source for users interested in exploring and analyzing social interactions. The main difference between social network data and web data is that social network platforms often provide a semi-structured data format (mostly JSON). Thus, one can easily access the data without needing to inspect how it is structured. In this recipe, we will illustrate how to use rvest and rjson to read and parse data from Facebook.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Perform the following steps to access data from Facebook:

  1. First, we need to log in to Facebook and access the developer page (https://developers.facebook.com/):

    Figure 18: Accessing the Facebook developer page

  2. Click on Tools & Support, and select Graph API Explorer:

    Figure 19: Selecting the Graph API Explorer

  3. Next, click on Get Token, and choose Get Access Token:

    Figure...
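Once a token is obtained from the Graph API Explorer, a request can be sketched as below; <token> is a placeholder, and the exact shape of the Graph API response depends on the API version and the permissions granted:

```r
# Hedged Graph API sketch -- <token> is a placeholder for the access
# token copied from the Graph API Explorer.
library(rjson)
url <- "https://graph.facebook.com/me?access_token=<token>"
me  <- fromJSON(paste(readLines(url, warn = FALSE), collapse = ""))
me$name
```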

Working with twitteR


In addition to obtaining social network interaction data, one can collect millions of tweets from Twitter for further text mining tasks. The method for retrieving data from Twitter is very similar to that for Facebook: for both platforms, all we need is an access token. Once we have retrieved the access token, we can use twitteR to access millions of tweets.

Getting ready

In this recipe, you need to prepare your environment with R installed and a computer that can access the Internet.

How to do it…

Perform the following steps to read data from Twitter:

  1. First, you need to log in to Twitter and access the page of Twitter Apps at https://apps.twitter.com/. Click on Create New App:

    Figure 26: Creating a new Twitter app

  2. Fill in all required application details to create a new application:

    Figure 27: Filling in the required details

  3. Next, you can select Keys and Access Tokens and then access Application Settings:

    Figure 28: Copying API key and secret

  4. Click on...
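After the credentials are created, authentication and a first search can be sketched as below. The four credential strings are placeholders copied from your app's Keys and Access Tokens page, and #rstats is an arbitrary example query:

```r
# Hedged twitteR sketch -- all four credential strings are placeholders.
library(twitteR)
setup_twitter_oauth("API_KEY", "API_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")
tweets <- searchTwitter("#rstats", n = 100)   # example query
head(twListToDF(tweets))                      # tweets as a data frame
```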
