Chapter 03: Working with Big Data Frameworks
Activity 8: Parsing Text
Read the text files into the Spark object using the text method:
rdd_df = spark.read.text("/localdata/myfile.txt").rdd
To parse the file that we are reading, we will use lambda functions and Spark operations such as map, flatMap, and reduceByKey. map applies a function to each element of an RDD and returns one result per input. flatMap also applies a function to all elements, but flattens the results before returning the transformed RDD. reduceByKey merges the values that share a key, combining them with the given function. With these operations, we can count the number of lines and words in the text.
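As a minimal sketch of how these three operations behave, the following uses a local SparkSession and a small in-memory RDD in place of the text file (the session setup and sample strings are assumptions for illustration, not part of the activity):

from pyspark.sql import SparkSession

# Local session and in-memory data stand in for the file on disk.
spark = SparkSession.builder.master("local[1]").appName("ops-sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(["hello world", "hello spark"])

# map: one output element per input element
lengths = rdd.map(lambda line: len(line.split())).collect()

# flatMap: each line yields several words, flattened into a single RDD
words = rdd.flatMap(lambda line: line.split()).collect()

# reduceByKey: merge the values that share a key
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)
             .collect())

spark.stop()

Here lengths is [2, 2], words is ['hello', 'world', 'hello', 'spark'], and counts holds ('hello', 2), ('spark', 1), and ('world', 1) in partition-dependent order.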
Extract the lines from the text using the following command:
lines = rdd_df.map(lambda line: line[0])
This extracts each line of the file as an entry in the RDD, pulling the string value out of each Row object. To check the result, you can use the collect method, which gathers all the data back to the driver process:
lines.collect()
Now, let's count the number of lines using the count method:
lines.count()
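The opening paragraph also promised a word count; it can be sketched with the same pattern. The session setup and the sample lines below are assumptions standing in for the file read earlier:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("count-sketch").getOrCreate()

# In-memory sample standing in for spark.read.text(...).rdd
lines = spark.sparkContext.parallelize(["first line here", "second line"])

n_lines = lines.count()                                     # number of lines
n_words = lines.flatMap(lambda line: line.split()).count()  # number of words

spark.stop()

For this sample, n_lines is 2 and n_words is 5; on the real file, the same two calls give the line and word totals.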
Note
Be careful when using the collect method! If the DataFrame or RDD being collected is too large to fit in the memory of the driver process, the job will fail with an out-of-memory error. Reserve collect for small datasets.
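When you only need to inspect a few rows, take(n) returns just the first n elements to the driver, which avoids the risk collect carries. A small sketch (local session and generated data are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("safe-peek").getOrCreate()

# A large RDD: collect() here would pull a million elements to the driver.
rdd = spark.sparkContext.parallelize(range(1_000_000))

# take(3) ships only three elements back, however large the RDD is.
preview = rdd.take(3)

spark.stop()

Here preview is [0, 1, 2]; the rest of the RDD never leaves the executors.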