Chapter 03: Working with Big Data Frameworks
Activity 8: Parsing Text
Read the text files into the Spark object using the text method:
rdd_df = spark.read.text("/localdata/myfile.txt").rdd
To parse the file that we are reading, we will use lambda functions and Spark operations such as map, flatMap, and reduceByKey. map applies a function to each element of an RDD and returns one result per input. flatMap also applies a function to all elements, but flattens the results before returning the transformed RDD. reduceByKey merges the values that share a key, combining them with the given function. With these operations, we can count the number of lines and words in the text.
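As a minimal sketch of how these three operations behave, the following uses a local SparkSession and a small in-memory RDD in place of the text file (the session setup and sample strings are assumptions for illustration, not part of the activity):

from pyspark.sql import SparkSession

# Local session and in-memory data stand in for the file on disk.
spark = SparkSession.builder.master("local[1]").appName("ops-sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])

# map: one output element per input element
lengths = rdd.map(lambda line: len(line.split())).collect()

# flatMap: each line yields several words, flattened into a single RDD
words = rdd.flatMap(lambda line: line.split()).collect()

# reduceByKey: merge the values that share a key
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .collect())

spark.stop()

Here lengths is [2, 2], words is ['hello', 'world', 'hello', 'spark'], and counts holds ('hello', 2), ('spark', 1), and ('world', 1) in partition-dependent order.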
Extract the lines from the text using the following command:
lines = rdd_df.map(lambda line: line[0])
This extracts each line of the file as an entry in the RDD, pulling the string value out of each Row object. To check the result, you can use the collect method, which gathers all the data back to the driver process:
lines.collect()
Now, let's count the number of lines using the count method:
lines.count()
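The opening paragraph also promised a word count; it can be sketched with the same pattern. The session setup and the sample lines below are assumptions standing in for the file read earlier:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("count-sketch").getOrCreate()

# In-memory sample standing in for spark.read.text(...).rdd
lines = spark.sparkContext.parallelize(["first line here", "second line"])

n_lines = lines.count()                                     # number of lines
n_words = lines.flatMap(lambda line: line.split()).count()  # number of words

spark.stop()

For this sample, n_lines is 2 and n_words is 5; on the real file, the same two calls give the line and word totals.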
Note
Be careful when using the collect method! If the DataFrame or RDD being collected is too large to fit in the memory of the driver process, the job will fail with an out-of-memory error. Reserve collect for small datasets.
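When you only need to inspect a few rows, take(n) returns just the first n elements to the driver, which avoids the risk collect carries. A small sketch (local session and generated data are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("safe-peek").getOrCreate()

# A large RDD: collect() here would pull a million elements to the driver.
rdd = spark.sparkContext.parallelize(range(1_000_000))

# take(3) ships only three elements back, however large the RDD is.
preview = rdd.take(3)

spark.stop()

Here preview is [0, 1, 2]; the rest of the RDD never leaves the executors.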