Chapter 4. Data Mining and SQL Queries

PySpark exposes the Spark programming model to Python. Spark is a fast, general-purpose engine for large-scale data processing. Since we can run Python under Jupyter, we can also use Spark from within Jupyter.

Installing Spark requires the following components to be installed on your machine:

Then set environment variables that point to where each of these components is installed (one way to do this from inside a notebook is sketched after the following list):

  • JAVA_HOME: The directory where you installed the JDK
  • PYTHONPATH: Directory where Python was installed
  • HADOOP_HOME: Directory...
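
One hedged way to set such variables from inside a notebook, before Spark is initialized, is a sketch like the following (the paths are placeholders, not values from the book):

import os

# Placeholder paths (assumptions): point these at wherever you installed each component.
# They must be set before the SparkContext is created, since Spark launches the JVM
# and its helper processes using these values.
os.environ["JAVA_HOME"] = "C:\\Java\\jdk1.8.0"
os.environ["HADOOP_HOME"] = "C:\\hadoop"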

Special note for Windows installation


Spark (really Hadoop) needs a temporary storage location for its working set of data. Under Windows, this defaults to the \tmp\hive directory. If the directory does not exist when Spark/Hadoop starts, it will be created. Unfortunately, under Windows, the installation does not have the correct tools built in to set the access privileges on the directory.

You should be able to run chmod via winutils to set the access privileges on the hive directory. However, I have found that this chmod does not work correctly.

A better approach is to create the \tmp\hive directory yourself in admin mode, and then grant full privileges on the hive directory to all users, again in admin mode.

Without this change, Hadoop fails right away. When you start pyspark, the output (including any errors) is displayed in the command-line window; one of the errors will be insufficient access to this directory.

Using Spark to analyze data


The first thing to do in order to access Spark is to create a SparkContext. The SparkContext initializes all of Spark and sets up any access that may be needed to Hadoop, if you are using that as well.

The initial object used to be a SQLContext, but that has since been deprecated in favor of SparkSession, which is built on top of a SparkContext and is more open-ended.
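
For reference, a minimal sketch of the builder pattern that is commonly used to obtain both objects (the application name here is only an illustration):

from pyspark.sql import SparkSession

# Build (or reuse) a session; the underlying SparkContext is available from it
spark = SparkSession.builder.appName("jupyter-example").getOrCreate()
sc = spark.sparkContext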

We can use a simple example that just reads through a text file, as follows:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.textFile("B05238_04 Spark Total Line Lengths.ipynb")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
print(totalLength)

In this example:

  • We obtain a SparkContext
  • With the context, we read in a file (the Jupyter notebook file for this example)
  • We use a map function to compute the length of each line of the text file
  • We use a reduce function to add up the lengths of all the lines
  • We display our results

Under Jupyter this looks like...

Another MapReduce example


In another example, we use MapReduce to get the word counts from a file. This is a standard problem, but MapReduce does most of the heavy lifting. We can use a script similar to the following to count the word occurrences in a file:

import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("Spark File Words.ipynb")
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
for x in counts.collect():
    print(x)

Note

We have the same preamble to the code as before.

Then we load the text file into memory.

Note

text_file is a Spark RDD (Resilient Distributed Dataset), not a data frame.

It is assumed to be massive, with its contents distributed across many nodes.

Once the file is loaded, we split each line into words and then use a lambda function to tick off each occurrence of a word. The code is truly creating a new record...
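
As a small follow-up sketch (assuming the counts RDD built above), we could pull back just the most frequent words instead of collecting everything:

# Show the ten most frequent words; takeOrdered returns a small sorted result to the driver
top10 = counts.takeOrdered(10, key=lambda pair: -pair[1])
for word, count in top10:
    print(word, count)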

Using SparkSession and SQL


Spark exposes many SQL-like actions that can be taken upon a data frame. For example, we could load a data frame with product sales information in a CSV file:

from pyspark.sql import SparkSession

spark = SparkSession(sc)
df = spark.read.format("csv") \
       .option("header", "true") \
       .load("productsales.csv")
df.show()

The example:

  • Starts a SparkSession (needed for most data access)
  • Uses the session to read a CSV-formatted file that contains a header record
  • Displays initial rows

We have a few interesting columns in the sales data:

  • Actual sales for the products by division
  • Predicted sales for the products by division

If this were a bigger file, we could use SQL to determine the extent of the product list. The following Spark SQL-style call determines the product list:

df.groupBy("PRODUCT").count().show()

The data frame groupBy function works very much like the SQL GROUP BY clause. GROUP BY collects the items in the dataset according to the values in the given column...
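
If you prefer writing actual SQL, a roughly equivalent hedged sketch registers the data frame as a temporary view first (the view name sales is an assumption, not from the original listing):

df.createOrReplaceTempView("sales")
spark.sql("SELECT PRODUCT, COUNT(*) AS product_count FROM sales GROUP BY PRODUCT").show()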

Combining datasets


So far, we have seen how to move a data frame into Spark for analysis. Data frames appear very close to SQL tables. Under SQL, it is standard practice not to reproduce items in different tables. For example, a product table might have the price, and an order table would just reference the product table by product identifier, so as not to duplicate data. Another SQL practice, then, is to join or combine the tables to come up with the full set of information needed. Keeping with the order analogy, we combine all of the tables involved, as each table has pieces of data that are needed for the order to be complete.

How difficult would it be to create a set of tables and join them using Spark? We will use example tables of Product, Order, and ProductOrder:

Table          Columns
Product        Product ID, Description, Price
Order          Order ID, Order Date
ProductOrder   Order ID, Product ID, Quantity

So, an Order has a list of Product/Quantity values associated.

We can populate the data frames and move them into Spark:

from...
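
As a minimal sketch of how these data frames could be populated and joined (the rows and values below are illustrative assumptions, not the book's data):

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# Illustrative rows only; column names follow the table above
product = spark.createDataFrame(
    [(1, "Widget", 9.99), (2, "Gadget", 19.99)],
    ["ProductID", "Description", "Price"])
order = spark.createDataFrame(
    [(100, "2017-01-01"), (101, "2017-01-02")],
    ["OrderID", "OrderDate"])
product_order = spark.createDataFrame(
    [(100, 1, 2), (100, 2, 1), (101, 1, 5)],
    ["OrderID", "ProductID", "Quantity"])

# Join the association table to both sides to get complete order information
full = product_order.join(order, "OrderID").join(product, "ProductID")
full.show()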

Loading JSON into Spark


Spark can also access JSON data for manipulation. Here we have an example that:

  • Loads a JSON file into a Spark data frame
  • Examines the contents of the data frame and displays the apparent schema
  • Like the other preceding data frames, moves the data frame into the context for direct access by the Spark session
  • Shows an example of accessing the data frame in the Spark context

The listing is as follows:

Our standard includes for Spark:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

Read in the JSON and display what we found:

# using some data from file from https://gist.github.com/marktyers/678711152b8dd33f6346
df = spark.read.json("people.json")
df.show()

I had a difficult time getting a standard JSON file to load into Spark. Spark appears to expect one record of data per line of the JSON file, whereas most JSON I have seen formats the record layouts with indentation and the like.
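
As an aside, a hedged workaround (assuming Spark 2.2 or later) is the multiLine read option, which lets Spark parse conventionally indented JSON rather than one record per line:

# Assumes Spark 2.2+: parse an indented, multi-line JSON document
df = spark.read.option("multiLine", "true").json("people.json")
df.show()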

Note

Notice the use...

Using Spark pivot


The pivot() function allows you to translate rows into columns while performing aggregation on some of the columns. If you think about it you are physically adjusting the axes of a table about a pivot point.

I thought of an easy example to show how this all works. It is one of those features that, once you see it in action, makes you realize how many areas you could apply it to.

In our example, we have some raw price points for stocks, and we want to pivot that table to produce average prices per year per stock.

The code in our example is:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# load product set
pivotDF = spark.read.format("csv") \
        .option("header", "true") \
        .load("pivot.csv")
pivotDF.show()
pivotDF.createOrReplaceTempView("pivot")

# pivot data per the year to get average prices per stock per year
pivotDF \
    .groupBy...
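
A hedged sketch of how a pivot like this could be completed, assuming the CSV has columns named stock, year, and price (these column names are assumptions, not taken from pivot.csv):

# Assumed column names: stock, year, price; CSV columns load as strings, so cast price
pivotDF \
    .groupBy("stock") \
    .pivot("year") \
    .agg(func.avg(func.col("price").cast("double"))) \
    .show()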

Summary


In this chapter, we got familiar with obtaining a SparkContext. We saw examples of using Hadoop MapReduce. We used SQL with Spark data. We combined data frames and operated on the resulting set. We imported JSON data and manipulated it with Spark. Lastly, we looked at using a pivot to gather information about a data frame.

In the next chapter, we will look at using R programming under Jupyter.

 
