Reading CSV and JSON Files and Solving Problems

When working with data, we come across several different types, such as structured, semi-structured, and unstructured data, as well as formats specific to other systems’ outputs. Yet two file types are especially widespread in ingestion: comma-separated values (CSV) and JavaScript Object Notation (JSON). Both have many applications and are widely used for data ingestion due to their versatility.

In this chapter, you will learn more about these file formats and how to ingest them using Python and PySpark, applying best practices and solving common ingestion and transformation problems.

In this chapter, we will cover the following recipes:

  • Reading a CSV file
  • Reading a JSON file
  • Creating a SparkSession for PySpark
  • Using PySpark to read CSV files
  • Using PySpark to read JSON files

Technical requirements

You can find the code for this chapter in this GitHub repository: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Using Jupyter Notebook is not mandatory, but it lets you run the code interactively, which helps when exploring the Python and PySpark scripts in this chapter. Once you have installed it, you can start Jupyter with the following command:

$ jupyter notebook

It is recommended to create a separate folder to store the Python files or notebooks we will create in this chapter; however, feel free to organize them in whatever way suits you best.

Reading a CSV file

A CSV file is a plain text file where commas separate each data point, and each line represents a new record. It is widely used in many areas, such as finance, marketing, and sales, to store data. Software such as Microsoft Excel and LibreOffice, and even online solutions such as Google Spreadsheets, provide reading and writing operations for this file. Visually it resembles a structured table, which greatly enhances the file’s usability.
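To make the format concrete, here is a minimal sketch of parsing such a file with Python’s built-in csv module (the filename example.csv is illustrative only, not part of the recipe):

    import csv

    # Each row comes back as a list of strings, split on commas
    with open("example.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # the first line usually holds column names
        for row in reader:
            print(dict(zip(header, row)))  # pair each value with its column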

Getting ready

You can download the CSV dataset for this recipe from Kaggle at https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data. We are going to use the same Spotify dataset as in Chapter 2.

Note

Since Kaggle is a dynamic platform, the filename might change occasionally. After downloading it, I named the file spotify_data.csv.

For this recipe, we will use only Python and Jupyter Notebook to execute the code and create a friendlier visualization.

How to do it...

Reading a JSON file

JavaScript Object Notation (JSON) is a semi-structured data format. Some articles define JSON as an unstructured format instead, but in practice it is flexible enough to be used for multiple purposes.

The JSON structure uses nested objects and arrays, and due to this flexibility, many applications and APIs use it to export or share data. That is why describing this file format in this chapter is essential.

This recipe will explore how to read a JSON file using a built-in Python library and explain how the process works.

Note

JSON is an alternative to XML, which is very verbose and requires more code to manipulate its data.

Getting ready

This recipe is going to use the GitHub Events JSON data, which can be found, along with other free JSON datasets, at https://github.com/jdorfman/awesome-json-datasets.

To retrieve the data, click on GitHub API | Events, copy the content from the page, and save it as a .json file...
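As a preview of how the built-in library handles this file, here is a minimal sketch using Python’s json module (the filename github_events.json is an assumption; use whatever name you saved the file under):

    import json

    # json.load parses the whole file into Python lists and dicts
    with open("github_events.json", encoding="utf-8") as f:
        events = json.load(f)

    # The GitHub Events payload is a JSON array, so we get a Python list
    print(type(events), len(events))
    print(events[0])  # each event is a nested dictionary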

Creating a SparkSession for PySpark

Previously introduced in Chapter 1, PySpark is the Spark library designed to work with Python. Through its Python API, PySpark exposes Spark functionality such as data manipulation, batch and real-time processing, and machine learning.

However, before ingesting or processing data using PySpark, we must initialize a SparkSession. This recipe will teach us how to create a SparkSession using PySpark and explain its importance.
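As an illustration of what this initialization looks like, here is a minimal sketch (the application name is arbitrary, and local[*] simply runs Spark on all local cores):

    from pyspark.sql import SparkSession

    # Build a session, or reuse one if it already exists in this process
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("data_ingestion_cookbook") \
        .getOrCreate()

    print(spark.version)  # confirm the session is up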

Getting ready

We first need to ensure we have the correct PySpark version. We installed PySpark in Chapter 1; however, it is always good to check that we are running the correct version. Run the following command:

$ pyspark --version

You should see the following output:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _...

Using PySpark to read CSV files

As expected, PySpark provides native support for reading and writing CSV files. It also allows data engineers to pass various configuration options for cases where the CSV file uses a different delimiter, a special encoding, and so on.

In this recipe, we are going to cover how to read CSV files with PySpark using the most common configurations, and we will explain why they are needed.

Getting ready

You can download the CSV dataset for this recipe from Kaggle: https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data. We are going to use the same Spotify dataset as in Chapter 2.

As in the Creating a SparkSession for PySpark recipe, make sure PySpark is installed and running with the latest stable version. Also, using Jupyter Notebook is optional.

How to do it…

Let’s get started:

  1. We first import SparkSession and create a session (a fuller sketch of the complete read follows this step):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
          .master("local...

Using PySpark to read JSON files

In the Reading a JSON file recipe, we saw that JSON files are widely used to transport and share data between applications, and we read one using simple Python code.

However, as data volumes and data sharing grow, processing high volumes of data with Python alone can lead to performance or resilience issues. That’s why, in this type of scenario, it is highly recommended to use PySpark to read and process JSON files. As you might expect, PySpark comes with a straightforward reading solution.

In this recipe, we will cover how to read a JSON file with PySpark, the common issues associated with it, and how to solve them.

Getting ready

As in the previous recipe, Reading a JSON file, we are going to use the GitHub Events JSON file. Also, the use of Jupyter Notebook is optional.

How to do it…

Here are the steps for this recipe:

  1. We first create the SparkSession (a fuller sketch of the complete read follows this step):
    spark = SparkSession.builder \
        ...