Chapter 10. Jupyter and Big Data

Big data is the topic on everyone's mind. I thought it would be good to see what can be done with big data in Jupyter. An up-and-coming framework for dealing with large datasets is Spark, an open source big data processing engine. Spark can run over Hadoop, in the cloud, or standalone, and we can write Spark code in Jupyter much as we have for the other languages we have seen.

In this chapter, we will cover the following topics:

  • Installing Spark for use in Jupyter

  • Using Spark's features

Apache Spark


One of the tools we will be using is Apache Spark, an open source toolset for cluster computing. While we will not be using a cluster here, typical Spark usage involves a set of machines, or cluster, operating in parallel to analyze a big data set. An installation guide is available at https://www.dataquest.io/blog/pyspark-installation-guide. In particular, you will need to add two settings to your bash profile: SPARK_HOME and PYSPARK_SUBMIT_ARGS. SPARK_HOME is the directory where the software is installed. PYSPARK_SUBMIT_ARGS sets the number of cores to use in the local cluster.
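
The same two settings can also be applied from inside a notebook before pyspark is imported. This is only a minimal sketch: the installation path below is a placeholder for wherever you unpacked Spark, and local[2] requests two local cores.

import os

# Placeholder path; point this at your own Spark installation directory
os.environ['SPARK_HOME'] = '/Applications/spark-2.0.0-bin-hadoop2.7'
# Run against a local "cluster" using two cores; the trailing 'pyspark-shell'
# is required by PySpark when it launches its JVM gateway
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[2] pyspark-shell'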

Mac installation

To install, we download the latest TGZ file from the Spark download page at https://spark.apache.org/downloads.html, unpack it, and move the unpacked directory to our Applications folder.

Spark relies on Scala being available. We installed Scala in Chapter 7, Sharing and Converting Jupyter Notebooks.

Open a command-line window to the Spark directory and run this...

Our first Spark script


Our first script reads in a text file and computes the total of all the line lengths:

import pyspark
# initialize Spark only if a context does not already exist
if 'sc' not in globals():
    sc = pyspark.SparkContext()
# read the notebook's own source, measure each line, and total the lengths
lines = sc.textFile("Spark File Words.ipynb")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
print(totalLength)

In the script, we first initialize Spark, but only if we have not already done so. Spark will complain if you try to initialize it more than once, so all of our Spark scripts begin with this if guard.

The script reads in a text file (the source of this script), takes every line, and computes its length; then it adds all the lengths together.

A lambda function is an anonymous (not named) function that takes arguments and returns a value. In the first case, given a string s, it returns its length.

The reduce function takes the first two elements of the list, applies the given function to them, replaces them with the result, and then continues in the same way through the rest of the list. In...
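
As a quick illustration of the same two ideas outside Spark, here is a sketch using Python's built-in functools.reduce on a small, made-up list of strings:

from functools import reduce

# map: length of each string; reduce: running sum of those lengths
lengths = list(map(lambda s: len(s), ["spark", "map", "reduce"]))
total = reduce(lambda a, b: a + b, lengths)
print(lengths, total)   # [5, 3, 6] 14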

Spark word count


Now that we have seen some of the functionality, let's explore further. We can use a similar script to count the word occurrences in a file, as follows:

import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()
text_file = sc.textFile("Spark File Words.ipynb")
# split each line into words, emit a (word, 1) pair per occurrence,
# and then add up the counts for each distinct word
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
for x in counts.collect():
    print(x)

We have the same preamble as before. Then we load the text file into memory.

Once the file is loaded, we split each line into words and use a lambda function to tick off each occurrence of a word. The code is truly creating a new record for each word occurrence: the first time a word appears in the stream, a record with the count of 1 is added for that word, and every further appearance of the word adds another record with the same count of 1. The idea is that this process could be split over multiple processors, where each...
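
To see those intermediate (word, 1) records concretely, here is a small sketch run against a made-up in-memory list instead of a file, reusing the existing sc context:

# a tiny, made-up dataset in place of the notebook file
sample = sc.parallelize(["to be or not to be", "to see or not to see"])
pairs = sample.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1))
print(pairs.take(5))   # [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1)]
print(pairs.reduceByKey(lambda a, b: a + b).collect())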

Sorted word count


Using the same script with a slight modification, we can make one more call and have sorted results. The script now looks like this:

import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()
text_file = sc.textFile("Spark File Words.ipynb")
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .sortByKey()
for x in sorted_counts.collect():
    print(x)

Here, we have added another function call to the RDD creation, sortByKey(). So, after we have mapped and reduced our way to the list of words and occurrence counts, we can easily sort the results.

The resultant output is the same set of (word, count) pairs, now ordered alphabetically by word.
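
Note that sortByKey() orders the pairs by their key, that is, by word. If you would rather see the results ordered by count, one option (my own variation, not from the original script) is to sort on the second element of each pair:

# same pipeline, but ordered by descending count rather than by word
by_frequency = text_file.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b) \
            .sortBy(lambda pair: pair[1], ascending=False)
print(by_frequency.take(10))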

Estimate Pi


We can use map/reduce to estimate Pi. Suppose we have code like this:

import pyspark
import random
if 'sc' not in globals():
    sc = pyspark.SparkContext()
NUM_SAMPLES = 1000
def sample(p):
    # p is just the sample index; draw a random point in the unit square
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0
count = sc.parallelize(range(0, NUM_SAMPLES)) \
            .map(sample) \
            .reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))

This code has the same preamble. We are using the random Python package. There is a constant for the number of samples to attempt.

We call upon the parallelize function to split this process over the nodes available, building an RDD of sample indices. The code then maps the sample function over each element. Finally, we reduce the generated map set by adding all the samples together, leaving the total in count.

The sample function gets two random numbers and returns a 1 or a 0 depending on whether the point they describe lands inside the unit circle. We are looking for random numbers in a small...
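
The underlying idea is that points drawn uniformly from the unit square fall inside the quarter circle of radius 1 with probability Pi/4, which is why the final line multiplies the observed fraction by 4. A quick non-Spark sketch of the same calculation:

import random

NUM_SAMPLES = 1000
# count the random points that land inside the unit circle
inside = sum(1 for _ in range(NUM_SAMPLES)
             if random.random()**2 + random.random()**2 < 1)
print("Pi is roughly %f" % (4.0 * inside / NUM_SAMPLES))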

Log file examination


I downloaded one of the access_log files from http://www.monitorware.com/. Like any other web access log, we have one line per entry, like this:

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
The first part is the IP address of the caller, followed by the timestamp, the type of HTTP access, the URL referenced, the HTTP protocol version, the resultant HTTP response code, and finally, the number of bytes in the response.

We can use Spark to load in and parse out some statistics of the log entries, as in this script:

import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()
textFile = sc.textFile("access_log")
print(textFile.count(),"access records")
gets = textFile.filter(lambda line: "GET" in line)
print(gets.count(),"GETs")
posts = textFile.filter(lambda line: "POST" in line)
print(posts.count(),"POSTs")
other = textFile.subtract(gets).subtract(posts)
print(other.count(),"Other")...
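
The filters above only look for the method name anywhere in the line. If you want to break each entry into its individual fields, one possible approach (a sketch of my own, not part of the original script; the field names are illustrative) is a regular expression over the common log format:

import re

# common log format: IP, identity, user, [timestamp], "method url protocol", status, bytes
LOG_PATTERN = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)')

def parse_line(line):
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, _, _, timestamp, method, url, protocol, status, size = match.groups()
    return {'ip': ip, 'timestamp': timestamp, 'method': method,
            'url': url, 'status': int(status), 'size': size}

parsed = textFile.map(parse_line).filter(lambda record: record is not None)
print(parsed.filter(lambda record: record['status'] == 401).count(), "unauthorized requests")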

Spark primes


We can run a series of numbers through a filter to determine whether each number is prime or not. We can use this script:

import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()
def is_it_prime(number):
    # make sure number is a positive integer
    number = abs(int(number))
    # simple tests
    if number < 2:
        return False
    # 2 is prime
    if number == 2:
        return True
    # other even numbers aren't
    if not number & 1:
        return False
    # check odd divisors up to the square root of the number
    for x in range(3, int(number**0.5)+1, 2):
        if number % x == 0:
            return False
    # if we get this far we are good
    return True
# create the set of numbers up to 100,000
numbers = sc.parallelize(range(100000))
# count the number of primes we found
print(numbers.filter(is_it_prime).count())

The script generates numbers up to 100,000.

We then loop over each of the numbers and pass it to our filter. If the filter returns...

Spark text file analysis


In this example, we will look through a news article to determine some basic information from it.

We will be using the following script against the 2600raid news article (from http://newsitem.com/):

import pyspark
if 'sc' not in globals():
    sc = pyspark.SparkContext()
# glom() gathers each partition's lines into a list, which we join into one
# string and then split on periods to approximate sentences
sentences = sc.textFile('2600raid.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(),"sentences")
# emit a ((word, next word), 1) record for every adjacent pair of words
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])
print(bigrams.count(),"bigrams")
# total each bigram's count, swap to (count, bigram), and sort by count, descending
frequent_bigrams = bigrams.reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey(False)
frequent_bigrams.take(10)

The code reads in the article and splits it into sentences, as marked by a terminating period. From there, the code maps out the bigrams present. A bigram is a pair of words that appear next to each other. We then sort the list by frequency and print out...
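
To make the bigram step concrete, here is the same pair-building expression applied to a single made-up sentence, outside Spark:

words = "the quick brown fox jumps".split()
bigrams = [((words[i], words[i+1]), 1) for i in range(0, len(words)-1)]
print(bigrams)
# [(('the', 'quick'), 1), (('quick', 'brown'), 1), (('brown', 'fox'), 1), (('fox', 'jumps'), 1)]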

Spark - evaluating history data


In this example, we combine the previous sections to look at some historical data and determine some useful attributes.

The historical data we are using is the guest list for The Daily Show with Jon Stewart. A typical record from the data looks like this:

1999,actor,1/11/99,Acting,Michael J. Fox

It contains the year, occupation of the guest, date of appearance, logical grouping of the occupation, and the name of the guest.

For our analysis, we will be looking at the number of appearances per year, the most frequently appearing occupation, and the most frequently appearing guest.

We will be using this script:

import pyspark
import csv
import operator
import itertools
import collections
if 'sc' not in globals():
    sc = pyspark.SparkContext()
years = {}
occupations = {}
guests = {}
# The file header contains these column descriptors:
# YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List
with open('daily_show_guests.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row...
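
As a rough, stand-alone sketch of the kind of tally such a loop could build (my own illustration based on the column names in the header comment above, not the book's code):

import csv
from collections import Counter

years, occupations, guests = Counter(), Counter(), Counter()
with open('daily_show_guests.csv', 'r') as csvfile:
    for row in csv.DictReader(csvfile):
        years[row['YEAR']] += 1
        occupations[row['GoogleKnowlege_Occupation']] += 1
        guests[row['Raw_Guest_List']] += 1

print(years.most_common(5))        # appearances per year
print(occupations.most_common(1))  # most frequent occupation
print(guests.most_common(1))       # most frequent guest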

Summary


In this chapter, we used Spark functionality via Python coding for Jupyter. First, we installed the Spark additions to Jupyter on a Windows machine and a Mac machine. We wrote an initial script that just read lines from a text file. We went further and determined the word counts in that file. We added sorting to the results. There was a script to estimate Pi. We evaluated web log files for anomalies. We determined a set of prime numbers. And we evaluated a text stream for some characteristics.
