Reading CSV and JSON Files and Solving Problems

When working with data, we come across several different types, such as structured, semi-structured, and unstructured data, as well as formats specific to other systems’ outputs. Yet two file types are especially widespread in ingestion: comma-separated values (CSV) and JavaScript Object Notation (JSON). Both have many applications and are widely used for data ingestion due to their versatility.

In this chapter, you will learn more about these file formats and how to ingest them using Python and PySpark, applying best practices and solving common ingestion and transformation problems.

In this chapter, we will cover the following recipes:

  • Reading a CSV file
  • Reading a JSON file
  • Creating a SparkSession for PySpark
  • Using PySpark to read CSV files
  • Using PySpark to read JSON files

Technical requirements

You can find the code for this chapter in this GitHub repository: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Using Jupyter Notebook is not mandatory, but it lets you run the code interactively, which helps when exploring the Python and PySpark scripts in this chapter. Once you have installed it, you can start Jupyter with the following command:

$ jupyter notebook

It is recommended to create a separate folder to store the Python files or notebooks we will create in this chapter; however, feel free to organize them in whatever way suits you best.

Reading a CSV file

A CSV file is a plain text file where commas separate each data point, and each line represents a new record. It is widely used in many areas, such as finance, marketing, and sales, to store data. Software such as Microsoft Excel and LibreOffice, and even online solutions such as Google Spreadsheets, provide reading and writing operations for this file. Visually it resembles a structured table, which greatly enhances the file’s usability.
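To make the format concrete, here is a minimal sketch of parsing such a file with Python’s built-in csv module (the filename example.csv is illustrative only, not part of the recipe):

    import csv

    # Each row comes back as a list of strings, split on commas
    with open("example.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # the first line usually holds column names
        for row in reader:
            print(dict(zip(header, row)))  # pair each value with its column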

Getting ready

You can download the CSV dataset for this recipe from Kaggle at https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data. We are going to use the same Spotify dataset as in Chapter 2.

Note

Since Kaggle is a dynamic platform, the filename might change occasionally. After downloading it, I named the file spotify_data.csv.

For this recipe, we will use only Python and Jupyter Notebook to execute the code and create a friendlier visualization.

How to do it...

Reading a JSON file

JavaScript Object Notation (JSON) is a semi-structured data format. Some articles define JSON as an unstructured format instead, but in practice it is flexible enough to be used for multiple purposes.

The JSON structure uses nested objects and arrays, and due to this flexibility, many applications and APIs use it to export or share data. That is why describing this file format in this chapter is essential.

This recipe will explore how to read a JSON file using a built-in Python library and explain how the process works.

Note

JSON is an alternative to XML, which is very verbose and requires more code to manipulate its data.

Getting ready

This recipe is going to use the GitHub Events JSON data, which can be found, along with other free JSON datasets, at https://github.com/jdorfman/awesome-json-datasets.

To retrieve the data, click on GitHub API | Events, copy the content from the page, and save it as a .json file...
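As a preview of how the built-in library handles this file, here is a minimal sketch using Python’s json module (the filename github_events.json is an assumption; use whatever name you saved the file under):

    import json

    # json.load parses the whole file into Python lists and dicts
    with open("github_events.json", encoding="utf-8") as f:
        events = json.load(f)

    # The GitHub Events payload is a JSON array, so we get a Python list
    print(type(events), len(events))
    print(events[0])  # each event is a nested dictionary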

Creating a SparkSession for PySpark

Previously introduced in Chapter 1, PySpark is the Spark library designed to work with Python. Through its Python API, PySpark exposes Spark functionality such as data manipulation, batch and real-time processing, and machine learning.

However, before ingesting or processing data using PySpark, we must initialize a SparkSession. This recipe will teach us how to create a SparkSession using PySpark and explain its importance.
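As an illustration of what this initialization looks like, here is a minimal sketch (the application name is arbitrary, and local[*] simply runs Spark on all local cores):

    from pyspark.sql import SparkSession

    # Build a session, or reuse one if it already exists in this process
    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("data_ingestion_cookbook") \
        .getOrCreate()

    print(spark.version)  # confirm the session is up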

Getting ready

We first need to ensure we have the correct PySpark version. We installed PySpark in Chapter 1; however, it is always good to check that we are running the correct version. Run the following command:

$ pyspark --version

You should see the following output:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _...

Using PySpark to read CSV files

As expected, PySpark provides native support for reading and writing CSV files. It also allows data engineers to pass various configuration options for cases where the CSV file uses a different delimiter, a special encoding, and so on.

In this recipe, we are going to cover how to read CSV files with PySpark using the most common configurations, and we will explain why they are needed.

Getting ready

You can download the CSV dataset for this recipe from Kaggle: https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data. We are going to use the same Spotify dataset as in Chapter 2.

As in the Creating a SparkSession for PySpark recipe, make sure PySpark is installed and running with the latest stable version. Also, using Jupyter Notebook is optional.

How to do it…

Let’s get started:

  1. We first import SparkSession and create a session (a fuller sketch of the complete read follows this step):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
          .master("local...

Using PySpark to read JSON files

In the Reading a JSON file recipe, we saw that JSON files are widely used to transport and share data between applications, and we read one using simple Python code.

However, as data volumes and data sharing grow, processing high volumes of data with Python alone can lead to performance or resilience issues. That’s why, in this type of scenario, it is highly recommended to use PySpark to read and process JSON files. As you might expect, PySpark comes with a straightforward reading solution.

In this recipe, we will cover how to read a JSON file with PySpark, the common issues associated with it, and how to solve them.

Getting ready

As in the previous recipe, Reading a JSON file, we are going to use the GitHub Events JSON file. Also, the use of Jupyter Notebook is optional.

How to do it…

Here are the steps for this recipe:

  1. We first create the SparkSession (a fuller sketch of the complete read follows this step):
    spark = SparkSession.builder \
        ...