We can use MapReduce in another example where we get the word counts from a file. This is a standard problem, but we let MapReduce do most of the heavy lifting. We can use a script similar to the following to count the word occurrences in a file:
import pyspark

# Create a SparkContext only if one does not already exist
if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Load the text file as an RDD of lines
text_file = sc.textFile("Spark File Words.ipynb")

# Split each line into words, pair each word with a count of 1,
# then sum the counts for each distinct word
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# Collect the results back to the driver and print them
for x in counts.collect():
    print(x)
We have the same preamble as in the earlier coding examples: we import pyspark and create a SparkContext if one does not already exist.
Then we load the text file.
Note that text_file is a Spark RDD (Resilient Distributed Dataset), not a data frame.
It is assumed to be massive, with its contents partitioned and distributed over many worker nodes.
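To make the individual MapReduce steps easier to see, here is a minimal sketch that applies the same flatMap, map, and reduceByKey calls to a small in-memory list (the sample sentence is just an illustration, and the sketch assumes the sc and text_file objects created above):

# Build a tiny RDD from a Python list so the intermediate results are small
sample = sc.parallelize(["to be or not to be"])

# flatMap splits each line into words; map pairs each word with a 1
pairs = sample.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1))    # e.g. ('to', 1), ('be', 1), ...

# reduceByKey sums the 1s for each distinct word
word_counts = pairs.reduceByKey(lambda a, b: a + b)
print(word_counts.collect())        # e.g. [('to', 2), ('be', 2), ('or', 1), ('not', 1)] (order may vary)

# The data behind an RDD is split into partitions spread across the workers
print(text_file.getNumPartitions())

The last line simply reports how many partitions Spark created for the file; on a real cluster those partitions would be processed in parallel on different nodes.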
Once the...