Chapter 9. Interacting with Big Data

In this chapter, we will cover the following recipes:

  • Obtaining a word count from a big-text data source
  • Obtaining a sorted word count from a big-text source
  • Examining big-text log file access
  • Computing prime numbers using parallel operations
  • Analyzing big-text data
  • Analyzing big data history files

Introduction


In this chapter, we cover methods for accessing big data from Jupyter. By big data we mean large data files, often running to many millions of rows. Big data is a topic of discussion in many firms; most firms have it in one form or another and are trying hard to draw value from all of the data they have stored.

An up-and-coming tool for dealing with large datasets is Spark, an open source engine built specifically for processing data at this scale. We can use Spark from Jupyter much like the other languages we have seen.

In Chapter 2, Adding an Engine, we dealt with installing Spark for use in Jupyter. For this chapter, we will be using the Python 3 engine. As a reminder, we start a Notebook using the Python 3 engine and then import the Python-Spark library to invoke Spark functionality.

Most importantly, we will be using Spark to access big data.
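As a minimal sketch of that preamble (the version check at the end is only a sanity test and not part of the recipes that follow), a Python 3 Notebook cell can import pyspark, create a context if one does not already exist, and print the Spark version it is bound to:

import pyspark

# reuse an existing context if the kernel already has one running
if not 'sc' in globals():
    sc = pyspark.SparkContext()

# simple sanity check that Spark is reachable from this Notebook
print(sc.version)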

Obtaining a word count from a big-text data source


While this is not a big data source, we will show how to get a word count from a text file first. Then we'll find a larger data file to work with.

How to do it...

We can use this script to see the word counts for a file:

import pyspark

if not 'sc' in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("B09656_09_word_count.ipynb")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

for x in counts.collect():
    print(x)

When we run this in Jupyter, the output shows each word paired with the number of times it occurs. The display continues for every individual word that was detected in the source file.

How it works...

We have a standard preamble to the coding. All Spark programs need a context to work with. The context is used to define the number of threads and the like. We are only using the defaults. It's important to note that Spark will automatically utilize underlying multiple...
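If you want to be explicit about the threading rather than relying on the defaults, the context can be created with a local master string. This is only a sketch; the four threads and the application name below are illustrative assumptions, not settings from the recipe:

import pyspark

# only create a context if this kernel does not already have one
if not 'sc' in globals():
    # "local[4]" asks Spark to run with four local worker threads
    sc = pyspark.SparkContext(master="local[4]", appName="word_count_example")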

Obtaining a sorted word count from a big-text source


Now that we have a word count, the more interesting use is to sort the words by occurrence to determine the highest usage.

How to do it...

We can slightly modify the previous script to produce a sorted list, as follows:

import pyspark

if not 'sc' in globals():
    sc = pyspark.SparkContext()

text_file = sc.textFile("B09656_09_word_count.ipynb")
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)

for x in sorted_counts.collect():
    print(x)

This produces output like the following, with each count paired with its word. The list continues for every word found. Notice the descending order of occurrences and the grouping of words with the same occurrence count. The simple split on spaces that determines word breaks is not very good, so punctuation and markup fragments appear among the words.

How it works...

The coding is exactly the same as in the previous example, except for the last two lines: we swap each (word, count) pair so that the count becomes the key, and then call .sortByKey(). Our key, by default, is the word count column (as that is what we...
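An equivalent way to get the same ordering, shown here only as a sketch, is to leave the (word, count) pairs intact and use sortBy with the count as the sort key:

# alternative sketch: sort the (word, count) pairs by count, highest first
sorted_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda pair: pair[1], ascending=False)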

Examining big-text log file access


MonitorWare is a network monitoring solution for Windows machines. It has sample log files that show access to different systems. I downloaded the HTTP log file sample set from http://www.monitorware.com/en/logsamples/apache.php. The log file has entries for different HTTP requests made to a server. 

The URL downloads the apache-samples.rar file. A .rar file is a compressed archive format often used for large files; this sample is only about 20 KB. You need to extract the log file from the .rar archive before it can be accessed by the following code.

How to do it...

We can use a similar script to load the file, and then apply additional filters to pull out the records of interest. The coding is:

import pyspark

if not 'sc' in globals():
    sc = pyspark.SparkContext()

textFile = sc.textFile("access_log")
print(textFile.count(),"access records")

gets = textFile.filter(lambda line: "GET" in line)
print(gets.count(),"GETs")

posts...
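The listing is cut off at this point; assuming the remainder mirrors the GET filter, a sketch of the continuation would count POST requests in the same way:

# sketch of how the filtering might continue, for POST requests
posts = textFile.filter(lambda line: "POST" in line)
print(posts.count(),"POSTs")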

Computing prime numbers using parallel operations


A good method for determining whether a number is prime is the sieve of Eratosthenes: for each number, we check whether it meets the criteria for a prime (if it does, it filters through the sieve).

The series of tests is run on every number we check, which makes this a great use case for parallel operations. Spark has the built-in ability to split a task among the available threads or machines. The threads are configured through the SparkContext (which we create in every example).

In our case, we split up the workload among the available threads, each taking a set of numbers to check, and collect the results later on.

How to do it...

We can use a script like this:

import pyspark
if not 'sc' in globals():
    sc = pyspark.SparkContext()

#check if a number is prime
def isprime(n):
    # must be positive
    n = abs(int(n))

    # 2 or more
    if n < 2:
        return False

    # 2 is the only even prime number
    if...
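The listing above is truncated; as a sketch of how it might continue (the divisor loop and the range of candidate numbers here are assumptions, not the original listing), we can finish the primality test and let Spark split the candidates across the available threads:

# sketch: complete the primality test and distribute the work via Spark
def isprime(n):
    n = abs(int(n))
    if n < 2:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    # test odd divisors up to the square root of n
    for i in range(3, int(n ** 0.5) + 1, 2):
        if n % i == 0:
            return False
    return True

# distribute the candidate numbers across the threads and keep only the primes
numbers = sc.parallelize(range(1, 1000))
primes = numbers.filter(isprime)
print(primes.count(), "primes found")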

Analyzing big-text data


We can run an analysis on large text streams, such as news articles, to attempt to glean important themes. Here we are pulling out bigrams (combinations of two adjacent words) that appear in sequence throughout the article.
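As a quick illustration of what we mean by a bigram (plain Python, independent of Spark, with a made-up sentence):

# plain-Python illustration: the bigrams of a short sentence
words = "the quick brown fox jumps".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]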

How to do it...

For this example, I am using text from an online article from Atlantic Monthly called The World Might Be Better Off Without College for Everyone at https://www.theatlantic.com/magazine/archive/2018/01/whats-college-good-for/546590/.

I am using this script:

import pyspark
if not 'sc' in globals():
    sc = pyspark.SparkContext()

sentences = sc.textFile('B09656_09_article.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(),"sentences")

bigrams = sentences.map(lambda x:x.split()) \
    .flatMap(lambda x: [((x[i],x[i+1]),1) for i in range(0,len(x)-1)])
print(bigrams.count(),"bigrams")

frequent_bigrams = bigrams.reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey...
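The end of the pipeline is truncated above; assuming it follows the same swap-and-sort pattern as the sorted word count recipe, a sketch of the finish looks like this:

# sketch: sort the (count, bigram) pairs in descending order and show the top few
frequent_bigrams = bigrams.reduceByKey(lambda x,y:x+y) \
    .map(lambda x:(x[1],x[0])) \
    .sortByKey(False)
for count, pair in frequent_bigrams.take(10):
    print(count, pair)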

Analyzing big data history files


In this example, we will be using a larger CSV file for analysis: specifically, the file of The Daily Show guests from https://raw.githubusercontent.com/fivethirtyeight/data/master/daily-show-guests/daily_show_guests.csv.

How to do it...

We can use the following script:

import pyspark
import csv
import operator
import itertools
import collections
import io

if not 'sc' in globals():
    sc = pyspark.SparkContext()

years = {}
occupations = {}
guests = {}

#The file header contains these column descriptors
#YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List

with open('daily_show_guests.csv', newline='') as csvfile: 
    reader = csv.DictReader(csvfile, delimiter=',', quotechar='|')
    try:
        for row in reader:

            #track how many shows occurred in the year
            year = row['YEAR']
            if year in years:
                years[year] = years[year] + 1
            else:
                years[year] = 1

            # what...
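The rest of the loop is truncated; as a sketch (the exact handling of the remaining columns is an assumption), the occupations and guests can be tallied with the same dictionary pattern used for the years, and the try block closed with a csv error handler:

            # sketch: tally occupations with the same pattern used for years
            occupation = row['GoogleKnowlege_Occupation']
            occupations[occupation] = occupations.get(occupation, 0) + 1

            # and tally each raw guest entry
            guest = row['Raw_Guest_List']
            guests[guest] = guests.get(guest, 0) + 1
    except csv.Error as e:
        print('line %d: %s' % (reader.line_num, e))

# simple summary: number of shows recorded per year
print(years)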