Spark - evaluating historical data
In this example, we combine techniques from the previous sections to examine some historical data and derive a few useful attributes.
The historical data we are using is the guest list for The Daily Show with Jon Stewart. A typical record from the data looks like this:
1999,actor,1/11/99,Acting,Michael J. Fox
It contains the year, occupation of the guest, date of appearance, logical grouping of the occupation, and the name of the guest.
For our analysis, we will look at the number of appearances per year, the most frequently appearing occupation, and the most frequently appearing guest.
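To see how one such record maps to named fields, here is a minimal sketch using csv.DictReader on an in-memory sample; the header line matches the column descriptors of the data file, and the single data row is the example record shown above:

```python
import csv
import io

# In-memory sample mirroring the CSV layout: a header line
# followed by the example record from above.
sample = (
    "YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List\n"
    "1999,actor,1/11/99,Acting,Michael J. Fox\n"
)

# DictReader uses the header line to key each field by column name.
reader = csv.DictReader(io.StringIO(sample))
row = next(reader)
print(row['YEAR'], row['Group'], row['Raw_Guest_List'])
```

This is the same access pattern the script below relies on: each row becomes a dictionary keyed by the column descriptors.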
We will be using this script:
import pyspark
import csv
import operator
import itertools
import collections
if 'sc' not in globals():
    sc = pyspark.SparkContext()

years = {}
occupations = {}
guests = {}

#The file header contains these column descriptors
#YEAR,GoogleKnowlege_Occupation,Show,Group,Raw_Guest_List
with open('daily_show_guests.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row...
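The script is cut off above, but the three empty dictionaries suggest the shape of the tallying step. A plain-Python sketch of those three counts, assuming a few invented sample rows and using collections.Counter in place of the hand-rolled dictionaries, might look like this:

```python
import collections

# Hypothetical sample rows for illustration, in the same field order
# as the CSV: (year, occupation, date, group, guest).
rows = [
    ('1999', 'actor', '1/11/99', 'Acting', 'Michael J. Fox'),
    ('1999', 'comedian', '1/12/99', 'Comedy', 'Sandra Bernhard'),
    ('2000', 'actor', '1/14/00', 'Acting', 'Michael J. Fox'),
]

# Tally appearances per year, per occupation, and per guest.
years = collections.Counter(r[0] for r in rows)
occupations = collections.Counter(r[1] for r in rows)
guests = collections.Counter(r[4] for r in rows)

print(years['1999'])                     # appearances in 1999 -> 2
print(occupations.most_common(1)[0][0])  # most frequent occupation -> 'actor'
print(guests.most_common(1)[0][0])       # most frequent guest -> 'Michael J. Fox'
```

Counter's most_common(n) returns the n highest-count entries, which answers the "most appearing" questions directly once the real file has been read.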