![Apache Spark 3 for Data Engineering and Analytics with Python [Video]](https://content.packt.com/V18004/cover_image_small.jpeg)
Apache Spark 3 for Data Engineering and Analytics with Python [Video]
Subscription: FREE
Video + Subscription: $15.99
Video: $54.99
-
Introduction to Spark and Installation (Free Chapter)
- Introduction
- The Spark Architecture
- The Spark Unified Stack
- Java Installation
- Hadoop Installation
- Python Installation
- PySpark Installation
- Install Microsoft Build Tools
- MacOS - Java Installation
- MacOS - Python Installation
- MacOS - PySpark Installation
- MacOS - Testing the Spark Installation
- Install Jupyter Notebooks
- The Spark Web UI
- Section Summary
-
Spark Execution Concepts
-
RDD Crash Course
-
Structured API - Spark DataFrame
- Structured APIs Introduction
- Preparing the Project Folder
- PySpark DataFrame, Schema, and DataTypes
- DataFrame Reader and Writer
- Challenge Part 1 – Brief
- Challenge Part 1 - Data Preparation
- Working with Structured Operations
- Managing Performance Errors
- Reading a JSON File
- Columns and Expressions
- Filter and Where Conditions
- Distinct Drop Duplicates Order By
- Rows and Union
- Adding, Renaming, and Dropping Columns
- Working with Missing or Bad Data
- Working with User-Defined Functions
- Challenge Part 2 – Brief
- Challenge Part 2 - Remove Null Row and Bad Records
- Challenge Part 2 - Get the City and State
- Challenge Part 2 - Rearrange the Schema
- Challenge Part 2 - Write Partitioned DataFrame to Parquet
- Aggregations
- Aggregations - Setting Up Flight Summary Data
- Aggregations - Count and Count Distinct
- Aggregations - Min Max Sum SumDistinct AVG
- Aggregations with Grouping
- Challenge Part 3 – Brief
- Challenge Part 3 - Prepare 2019 Data
- Challenge Part 3 - Q1 Get the Best Sales Month
- Challenge Part 3 - Q2 Get the City that Sold the Most Products
- Challenge Part 3 - Q3 When to Advertise
- Challenge Part 3 - Q4 Products Bought Together
-
Introduction to Spark SQL and Databricks
- Introduction to Databricks
- Spark SQL Introduction
- Register Account on Databricks
- Create a Databricks Cluster
- Creating our First 2 Databricks Notebooks
- Reading CSV Files into DataFrame
- Creating a Database and Table
- Inserting Records into a Table
- Exposing Bad Records
- Figuring out How to Remove Bad Records
- Extract the City and State
- Inserting Records to Final Sales Table
- What was the Best Month in Sales?
- Get the City that Sold the Most Products
- Get the Right Time to Advertise
- Get the Most Products Sold Together
- Create a Dashboard
- Summary
About this video
Apache Spark 3 is an open-source distributed engine for querying and processing data. This course gives you a detailed understanding of PySpark and its stack, and guides you through data analytics with Spark in Python. The author takes an interactive approach to explaining key PySpark concepts such as the Spark architecture, Spark execution, and transformations and actions using the structured API. You will learn to put Python, Java, and SQL to work in the Spark ecosystem.
You will start by getting a firm understanding of the Apache Spark architecture and learning how to set up a Python environment for Spark. You will then learn techniques for collecting, cleaning, and visualizing data, including creating dashboards in Databricks, and how to use SQL to interact with DataFrames. The author also provides an in-depth review of RDDs and contrasts them with DataFrames.
Challenges are provided at intervals throughout the course so that you get a firm grasp of the concepts taught.
The code bundle for this course is available here: https://github.com/PacktPublishing/Apache-Spark-3-for-Data-Engineering-and-Analytics-with-Python-
- Publication date: August 2021
- Publisher: Packt
- Duration: 8 hours 30 minutes
- ISBN: 9781803244303