You're reading from Big Data Analysis with Python

Product type Book

Published in Apr 2019

Publisher Packt

ISBN-13 9781789955286

Pages 276 pages

Edition 1st Edition

Languages

Python

Concepts

Data Visualization

Authors (3):

Ivan Marin

Ankit Shukla

Sarang VK

Table of Contents (11) Chapters

Big Data Analysis with Python

Preface

1. The Python Data Science Stack

2. Statistical Visualizations

3. Working with Big Data Frameworks

4. Diving Deeper with Spark

5. Handling Missing Values and Correlation Analysis

6. Exploratory Data Analysis

7. Reproducibility in Big Data Analysis

8. Creating a Full Analysis Report

Appendix

Chapter 03: Working with Big Data Frameworks

Activity 8: Parsing Text

Read the text file into a Spark DataFrame using the text method, then access its underlying RDD through the rdd attribute:
```
rdd_df = spark.read.text("/localdata/myfile.txt").rdd
```
To parse the file that we are reading, we will use lambda functions and Spark operations such as map, flatMap, and reduceByKey. flatMap applies a function to every element of an RDD, flattens the results, and returns the transformed RDD. reduceByKey merges the values that share a key, using the function it is given. With these operations, we can count the number of lines and words in the text.
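To make the difference between these operations concrete, here is a plain-Python sketch of what map, flatMap, and reduceByKey compute. It does not use Spark at all: the sample lines and the variable names are assumptions made for illustration, and the standard-library functions only mimic the semantics on a small in-memory list.

```python
from functools import reduce
from itertools import chain, groupby
from operator import add, itemgetter

# Hypothetical sample data standing in for the lines of a text file
lines = ["to be or not", "to be"]

# map: one output element per input element (here, one list of words per line)
mapped = list(map(lambda line: line.split(" "), lines))

# flatMap: apply the function, then flatten the per-line results into one sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

# reduceByKey: group (key, value) pairs by key and fold each group's values
pairs = sorted((word, 1) for word in flat_mapped)
reduced = {
    key: reduce(add, (value for _, value in group))
    for key, group in groupby(pairs, key=itemgetter(0))
}

print(mapped)       # nested: one list per line
print(flat_mapped)  # flat: one entry per word
print(reduced)      # one summed count per distinct word
```

Note that map keeps the nesting (a list of lists), while flatMap produces a single flat sequence of words, which is the shape needed for counting.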
Extract the lines from the text using the following command:
```
lines = rdd_df.map(lambda line: line[0])
```
Each row of the DataFrame holds one line of the file in its first field, so this produces an RDD with one string per line. To check the result, you can use the collect method, which gathers all data back to the driver process:
```
lines.collect()
```
Now, let's count the number of lines, using the count method:
```
lines.count()
```
Note
Be careful when using the collect method! If the DataFrame or RDD being collected is larger than the memory of the local driver, Spark will throw an error.
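One way to inspect a large RDD without pulling everything to the driver is to fetch only a few elements with the RDD take method. The helper below is a minimal sketch: the name preview and its default of five elements are assumptions for illustration, not part of the activity.

```python
def preview(rdd, n=5):
    """Return at most n elements from an RDD instead of collecting all of it."""
    # take(n) brings only the first n elements back to the driver,
    # whereas collect() brings back the entire dataset
    return rdd.take(n)
```

For example, preview(lines) would return the first five lines of the file, which is usually enough to confirm that the parsing step behaved as expected.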

Now, let's split each line into words, breaking on the space character, flatten the results into a single RDD of words, and convert every word to lowercase:
```
splits = lines.flatMap(lambda x: x.split(' '))
lower_splits = splits.map(lambda x: x.lower())
```
Let's also remove the stop words. We could use a more comprehensive stop word list from NLTK, but for now, we will roll our own:
```
stop_words = ['of', 'a', 'and', 'to']
```
Use the following command to remove the stop words from our token list:
```
tokens = lower_splits.filter(lambda x: x and x not in stop_words)
```
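The filter above tests x before checking the stop word list because splitting on a single space can produce empty strings. The following plain-Python sketch shows the same three steps on one hypothetical line (the sample sentence is an assumption). It also stores the stop words in a set rather than a list, a small variation that makes membership tests faster as the list grows.

```python
# A set gives constant-time membership tests; the words are the same as above
stop_words = {"of", "a", "and", "to"}

# Hypothetical sample line; note the double space between "tale" and "of"
line = "A tale  of two Cities"

splits = line.split(" ")                    # consecutive spaces yield an empty string
lower_splits = [w.lower() for w in splits]  # normalize case before comparing
tokens = [w for w in lower_splits if w and w not in stop_words]

print(splits)
print(tokens)
```

Without the leading x check, the empty string would survive the filter and be counted as a word.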
We can now process our token list and count the unique words. The idea is to generate a list of tuples, where the first element is the token and the second element is the count of that particular token.

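First, let's map each token to a (token, 1) pair:
```
token_list = tokens.map(lambda x: (x, 1))
```
Use the reduceByKey operation with add, which sums the values for each token, and then sort the result by count in descending order:
```
from operator import add  # add(a, b) returns a + b

count = token_list.reduceByKey(add).sortBy(lambda x: x[1], ascending=False)
count.collect()
```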

Remember, collect brings all data back to the driver node! Always check whether there is enough memory by using tools such as top and htop.
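For a small sample that comfortably fits in memory, the word counts can be cross-checked in plain Python with collections.Counter. This is only a sanity-check sketch: the token list below is made up for illustration and is not taken from the activity's file.

```python
from collections import Counter

# Hypothetical tokens, as they might look after splitting and stop word removal
sample_tokens = ["spark", "python", "spark", "data", "python", "spark"]

counts = Counter(sample_tokens)

# most_common() returns (token, count) pairs sorted by count, highest first
ranked = counts.most_common()
print(ranked)
```

The result has the same shape as the output of the Spark pipeline: a list of (token, count) tuples ordered from the most to the least frequent token.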
