You're reading from Data Ingestion with Python Cookbook

Product type: Book

Published in: May 2023

Publisher: Packt

ISBN-13: 9781837632602

Edition: 1st Edition

Concepts: Data Engineering

Author: Gláucia Esppenchutz

Using PySpark with Defined and Non-Defined Schemas

Generally, schemas are forms used to create or apply structures to data. As someone who works or will work with large volumes of data, it is essential to understand how to manipulate DataFrames and apply structure when it is necessary to bring more context to the information involved.

However, as seen in the previous chapters, data can come from different sources or be present without a well-defined structure, and applying a schema can be challenging. Here, we will see how to create schemas and standard formats using PySpark with structured and unstructured data.

In this chapter, we will cover the following recipes:

Applying schemas to data ingestion
Importing structured data using a well-defined schema
Importing unstructured data without a schema
Ingesting unstructured data with a well-defined schema and format
Inserting formatted SparkSession logs to facilitate your work...

Technical requirements

You can also find the code for this chapter in the GitHub repository here: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Using Jupyter Notebook is not mandatory but can help you see how the code works interactively. Since we will execute Python and PySpark code, it can help us understand the scripts better. Once you have it installed, you can execute Jupyter using the following line:

$ jupyter notebook

We recommend creating a separate folder to store the Python files or notebooks covered in this chapter; however, feel free to organize the files in whatever way suits you best.

In this chapter, all recipes will need a SparkSession instance initialized, and you can use the same session for all of them. You can use the following code to create your session:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
      .master("local[1]") \
    ...

Applying schemas to data ingestion

The application of schemas is common practice when ingesting data, and PySpark natively supports applying them to DataFrames. To define and apply schemas to our DataFrames, we need to understand some concepts of Spark.

This recipe introduces the basic concept of working with schemas using PySpark and its best practices so that we can later apply them to structured and unstructured data.

Getting ready

Make sure PySpark is installed and working on your machine for this recipe. You can run the following code on your command line to check this requirement:

$ pyspark --version

You should see the following output:

Figure 6.1 – PySpark version console output

If you don’t have PySpark installed on your local machine, please refer to the Installing PySpark recipe in Chapter 1.

I will use Jupyter Notebook to execute the code to make it more interactive. You can use this link and follow the instructions...

Importing structured data using a well-defined schema

As seen in the previous chapter, Ingesting Data from Structured and Unstructured Databases, structured data has a standard format presented in rows and columns and is often stored inside a database.

Due to its format, applying a DataFrame schema to structured data tends to be less complex and has several benefits, such as ensuring that the ingested data matches the data source or conforms to a given rule.

In this recipe, we will ingest data from a structured file such as a CSV file and apply a DataFrame schema to understand better how it is used in a real-world scenario.

Getting ready

This exercise requires the listings.csv file found inside the GitHub repository for this book. Also, make sure your SparkSession is initialized.

All the code in this recipe can be executed in Jupyter Notebook cells or a PySpark shell.

How to do it…

Here are the steps to perform this recipe:

Importing Spark data types...

Importing unstructured data without a schema

As seen before, unstructured data, often stored in NoSQL systems, is information that does not follow a fixed format such as the relational or tabular model. It can be presented as an image, video, metadata, transcripts, and so on. The data ingestion process usually involves a JSON file or a document collection, as we previously saw when ingesting data from MongoDB.

In this recipe, we will read a JSON file and transform it into a DataFrame without a schema. Although unstructured data is supposed to have a more flexible design, we will see some implications of not having any schema or structure in our DataFrame.

Getting ready

Here, we will use the holiday_brazil.json file to create the DataFrame. You can find it in the GitHub repository here: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

We will use SparkSession to read the JSON file and create a DataFrame, so make sure the session is up and running.

All code can be...

Ingesting unstructured data with a well-defined schema and format

In the previous recipe, Importing unstructured data without a schema, we read a JSON file without applying any schema or formatting. This led to an odd output, which could cause confusion and require additional work later in the data pipeline. While this example pertains specifically to a JSON file, it also applies to any other NoSQL or unstructured data that needs to be converted into analytical data.

The objective is to continue the last recipe and apply a schema and a standard format to our data, making it more legible and easier to process in the subsequent phases of ETL.

Getting ready

This recipe has the exact same requirements as the Importing unstructured data without a schema recipe.

How to do it…

Here are the steps to perform this recipe:

Importing data types: As usual, let’s start by importing our data types from the PySpark library:
```
from pyspark.sql.types...
```

Inserting formatted SparkSession logs to facilitate your work

A commonly underestimated best practice is how to create valuable logs. Applications that log information and small code files can save a significant amount of debugging time. This is also true when ingesting or processing data.

This recipe covers the best practice of logging events in our PySpark scripts. The examples here give a generic overview that can be applied to any other piece of code and will be used again later in this book.
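As a general illustration of formatted logging, the sketch below uses Python's standard `logging` module rather than the log4j logger that the recipe steps rely on. The logger name, the format string, and the messages are all assumptions made for the example.

```python
import io
import logging

# Capture output in memory so the formatted lines can be inspected;
# in a real script, a console or file handler would be used instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

logger = logging.getLogger("ingestion_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)                   # DEBUG messages are dropped
logger.addHandler(handler)
logger.propagate = False                        # avoid duplicate root output

logger.info("Starting to read the CSV file")
logger.debug("This message is filtered out at INFO level")

print(stream.getvalue())
```

Including the timestamp, level, and logger name in every line makes it easier to trace which step of an ingestion script produced a given message.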

Getting ready

We will use the listings.csv file to execute the read method from Spark. You can find this dataset inside the GitHub repository for this book. Make sure your SparkSession is up and running.

How to do it…

Here are the steps to perform this recipe:

Setting the log level: Now, using sparkContext, we will assign the log level:
```
spark.sparkContext.setLogLevel("INFO")
```
Instantiating the log4j logger: The next step is to create...

Gláucia Esppenchutz is a data engineer with expertise in managing data pipelines and vast amounts of data using cloud and on-premises technologies. She has worked at companies such as Globo, BMW Group, and Cloudera. Currently, she works at AiFi, specializing in the field of data operations for autonomous systems. She comes from the biomedical field and shifted her career ten years ago to chase the dream of working closely with technology and data. She is in constant contact with the open source community, mentoring people and helping to manage projects, and has collaborated with the Apache, PyLadies, FreeCodeCamp, Udacity, and MentorColor communities.