Chapter 9. Interacting with Big Data

In this chapter, we will cover the following recipes:

  • Obtaining a word count from a big-text data source
  • Obtaining a sorted word count from a big-text source
  • Examining big-text log file access
  • Computing prime numbers using parallel operations
  • Analyzing big-text data
  • Analyzing big data history files

Introduction


In this chapter, we cover methods for accessing big data from Jupyter. By big data we mean large data files, often running to many millions of rows. Big data is a topic of discussion in many firms; most firms have it in one form or another and are trying hard to draw value from all of the data they have stored.

An up-and-coming tool for dealing with large datasets is Spark, an open source engine built specifically for processing data at this scale. We can use Spark from Jupyter much like the other languages we have seen.

In Chapter 2, Adding an Engine, we dealt with installing Spark for use in Jupyter. For this chapter, we will be using the Python 3 engine. As a reminder, we start a Notebook using the Python 3 engine and then import the Python-Spark library to invoke Spark functionality.

Most importantly, we will be using Spark to access big data.
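As a minimal sketch of that preamble (the version check at the end is only a sanity test and not part of the recipes that follow), a Python 3 Notebook cell can import pyspark, create a context if one does not already exist, and print the Spark version it is bound to:

import pyspark

# reuse an existing context if the kernel already has one running
if not 'sc' in globals():
    sc = pyspark.SparkContext()

# simple sanity check that Spark is reachable from this Notebook
print(sc.version)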

Obtaining a word count from a big-text data source


While this is not a big data source, we will show how to get a word count from a text file first. Then we'll find a larger data file to work with.

How to do it...

We can use this script to see the word counts for a file:

import pyspark

if not 'sc' in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("B09656_09_word_count.ipynb")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

for x in counts.collect():
    print(x)

When we run this in Jupyter, the output shows each word paired with the number of times it occurs. The display continues for every individual word that was detected in the source file.

How it works...

We have a standard preamble to the coding. All Spark programs need a context to work with. The context is used to define the number of threads and the like. We are only using the defaults. It's important to note that Spark will automatically utilize underlying multiple...
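If you want to be explicit about the threading rather than relying on the defaults, the context can be created with a local master string. This is only a sketch; the four threads and the application name below are illustrative assumptions, not settings from the recipe:

import pyspark

# only create a context if this kernel does not already have one
if not 'sc' in globals():
    # "local[4]" asks Spark to run with four local worker threads
    sc = pyspark.SparkContext(master="local[4]", appName="word_count_example")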

Obtaining a sorted word count from a big-text source


Now that we have a word count, the more interesting use is to sort the words by occurrence to determine the highest usage.

How to do it...

We can slightly modify the previous script to produce a sorted list, as follows:

import pyspark

if not 'sc' in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("B09656_09_word_count.ipynb")
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)

for x in sorted_counts.collect():
    print(x)

This produces output like the following, with each count paired with its word. The list continues for every word found. Notice the descending order of occurrences and the grouping of words with the same occurrence count. The simple split on spaces that determines word breaks is not very good, so punctuation and markup fragments appear among the words.

How it works...

The coding is exactly the same as in the previous example, except for the last two lines: we swap each (word, count) pair so that the count becomes the key, and then call .sortByKey(). Our key, by default, is the word count column (as that is what we...
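An equivalent way to get the same ordering, shown here only as a sketch, is to leave the (word, count) pairs intact and use sortBy with the count as the sort key:

# alternative sketch: sort the (word, count) pairs by count, highest first
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda pair: pair[1], ascending=False)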

Examining big-text log file access


MonitorWare is a network monitoring solution for Windows machines. It has sample log files that show access to different systems. I downloaded the HTTP log file sample set from http://www.monitorware.com/en/logsamples/apache.php. The log file has entries for different HTTP requests made to a server. 

The URL downloads the apache-samples.rar file. A .rar file is a compressed archive format often used for large files; this sample is only about 20 KB. You need to extract the log file from the .rar archive before it can be accessed by the following code.

How to do it...

We can use a similar script to load the file, and then apply additional filters to pull out the records of interest. The coding is:

import pyspark

if not 'sc' in globals():
    sc = pyspark.SparkContext()

textFile = sc.textFile("access_log")
print(textFile.count(),"access records")

gets = textFile.filter(lambda line: "GET" in line)
print(gets.count(),"GETs")

posts...
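The listing is cut off at this point; assuming the remainder mirrors the GET filter, a sketch of the continuation would count POST requests in the same way:

# sketch of how the filtering might continue, for POST requests
posts = textFile.filter(lambda line: "POST" in line)
print(posts.count(),"POSTs")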

Computing prime numbers using parallel operations


A good method for determining whether a number is prime is the sieve of Eratosthenes: for each number, we check whether it meets the criteria for a prime (if it does, it filters through the sieve).

The series of tests is run on every number we check, which makes this a great use case for parallel operations. Spark has the built-in ability to split a task among the available threads or machines. The threads are configured through the SparkContext (which we create in every example).

In our case, we split up the workload among the available threads, each taking a set of numbers to check, and collect the results later on.

How to do it...

We can use a script like this:

import pyspark
if not 'sc' in globals():
    sc = pyspark.SparkContext()

#check if a number is prime
def isprime(n):
    # must be positive
    n = abs(int(n))

    # 2 or more
    if n < 2:
        return False

    # 2 is the only even prime number
    if...
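The listing above is truncated; as a sketch of how it might continue (the divisor loop and the range of candidate numbers here are assumptions, not the original listing), we can finish the primality test and let Spark split the candidates across the available threads:

# sketch: complete the primality test and distribute the work via Spark
def isprime(n):
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    # test odd divisors up to the square root of n
    for i in range(3, int(n ** 0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

# distribute the candidate numbers across the threads and keep only the primes
numbers = sc.parallelize(range(1, 1000))
primes = numbers.filter(isprime)
print(primes.count(), "primes found")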

Analyzing big-text data


We can run an analysis on large text streams, such as news articles, to attempt to glean important themes. Here we are pulling out bigrams (combinations of two adjacent words) that appear in sequence throughout the article.
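As a quick illustration of what we mean by a bigram (plain Python, independent of Spark, with a made-up sentence):

# plain-Python illustration: the bigrams of a short sentence
words = "the quick brown fox jumps".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]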

How to do it...

For this example, I am using text from an online article from Atlantic Monthly called The World Might Be Better Off Without College for Everyone at https://www.theatlantic.com/magazine/archive/2018/01/whats-college-good-for/546590/.

I am using this script:

import pyspark
if not 'sc' in globals():
    sc = pyspark.SparkContext()

sentences = sc.textFile('B09656_09_article.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(),"sentences")

bigrams = sentences.map(lambda x:x.split()) \
    .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])
print(bigrams.count(),"bigrams")

frequent_bigrams = bigrams.reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey...
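The end of the pipeline is truncated above; assuming it follows the same swap-and-sort pattern as the sorted word count recipe, a sketch of the finish looks like this:

# sketch: sort the (count, bigram) pairs in descending order and show the top few
frequent_bigrams = bigrams.reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey(False)
for count, pair in frequent_bigrams.take(10):
    print(count, pair)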

Analyzing big data history files


In this example, we will be using a larger CSV file for analysis: specifically, the file of The Daily Show guests from https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv.

How to do it...

We can use the following script:

import pyspark
import csv
import operator
import itertools
import collections
import io

if not 'sc' in globals():
    sc = pyspark.SparkContext()

years = {}
occupations = {}
guests = {}

#The file header contains these column descriptors
#YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List

with open('daily_show_guests.csv', newline='') as csvfile: 
    reader = csv.DictReader(csvfile, delimiter=',', quotechar='|')
    try:
        for row in reader:

            #track how many shows occurred in the year
            year = row['YEAR']
            if year in years:
                years[year] = years[year] + 1
            else:
                years[year] = 1

            # what...
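The rest of the loop is truncated; as a sketch (the exact handling of the remaining columns is an assumption), the occupations and guests can be tallied with the same dictionary pattern used for the years, and the try block closed with a csv error handler:

            # sketch: tally occupations with the same pattern used for years
            occupation = row['GoogleKnowlege_Occupation']
            occupations[occupation] = occupations.get(occupation, 0) + 1

            # and tally each raw guest entry
            guest = row['Raw_Guest_List']
            guests[guest] = guests.get(guest, 0) + 1
    except csv.Error as e:
        print('line %d: %s' % (reader.line_num, e))

# simple summary: number of shows recorded per year
print(years)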