Data Ingestion with Python Cookbook

Product type: Book
Published in: May 2023
Publisher: Packt
ISBN-13: 9781837632602
Pages: 414
Edition: 1st Edition
Author: Gláucia Esppenchutz

Table of Contents (17 chapters)

  • Preface
  • Part 1: Fundamentals of Data Ingestion
  • Chapter 1: Introduction to Data Ingestion
  • Chapter 2: Principals of Data Access – Accessing Your Data
  • Chapter 3: Data Discovery – Understanding Our Data before Ingesting It
  • Chapter 4: Reading CSV and JSON Files and Solving Problems
  • Chapter 5: Ingesting Data from Structured and Unstructured Databases
  • Chapter 6: Using PySpark with Defined and Non-Defined Schemas
  • Chapter 7: Ingesting Analytical Data
  • Part 2: Structuring the Ingestion Pipeline
  • Chapter 8: Designing Monitored Data Workflows
  • Chapter 9: Putting Everything Together with Airflow
  • Chapter 10: Logging and Monitoring Your Data Ingest in Airflow
  • Chapter 11: Automating Your Data Ingestion Pipelines
  • Chapter 12: Using Data Observability for Debugging, Error Handling, and Preventing Downtime
  • Index
  • Other Books You May Enjoy

Principals of Data Access – Accessing Your Data

Data access is a term that refers to the ability to store, retrieve, transfer, and copy data from one system or application to another. It is closely tied to security, legal, and, in some cases, national matters. This chapter touches on the last two and also covers some security topics.

As data engineers or scientists, we need to know how to retrieve data correctly. Some of it may require encrypted authentication, so we need to understand how some decryption libraries work and how to use them without compromising or leaking sensitive data. Data access also refers to the levels of authorization a system or database has, from administration to read-only roles.

In this chapter, we will cover how the levels of data access are defined and the most used libraries and authentication methods in the data ingestion process.

In this chapter, you will work through the following recipes:

  • Implementing...

Technical requirements

A Google Cloud account is easy to create if you already have a Gmail account, and most of the resources can be accessed with a free tier. It also provides $300 of credit for resources that are not free, which is a good incentive if you want to run the other recipes in this book inside GCP.

To access and enable a Google Cloud account, go to the https://cloud.google.com/ page and follow the steps provided on the screen.

Note

All the recipes covered in this chapter can be completed within the free tier.

You can also find the code from this chapter in this GitHub repository: https://github.com/PacktPublishing/Data-Ingestion-with-Python-Cookbook.

Implementing governance in a data access workflow

As we saw previously, data access or accessibility is a governance pillar and is closely related to security. Data safety is a concern not only for administrators or managers but for everyone who is involved with data. It is therefore essential to know how to design a base workflow that implements security layers for our data, allowing only authorized people to read or manipulate it.

In this recipe, we will create a workflow that covers the essential topics for implementing data access management.

Getting ready

Before designing our workflow, we need to identify the vectors interfering with our data access.

So, what are data vectors?

Vectors are paths someone can use to gain unauthorized access to a server, network, or database. In this case, we will identify the ones related to data leaks.

Let’s explore them in a visual form, as shown in the following diagram:

Figure 2.1 – Data governance vectors


Accessing databases and data warehouses

Databases are the foundation of any system or application, no matter your architecture. Even a simple application often needs a database to store logs, user activity or information, and internal system data.

From a broader perspective, data warehouses serve the same purpose but hold analytical data. After ingesting and transforming data, we need to load it somewhere that makes it easy to retrieve analytical information for dashboards, reports, and so on.

Currently, it is possible to find several types of databases (both SQL and NoSQL) and data warehouse architectures. However, this recipe aims to cover how access control is usually done for relational databases and data warehouses. The goal is to understand how the access levels are defined, even using a generic scenario.

Getting ready

For this recipe, we will use MySQL. You can install it by following the instructions on the official MySQL page: https://dev.mysql.com/downloads/installer/.
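
To make the idea of access levels more concrete, here is a minimal sketch that maps a few roles to MySQL privileges and generates the matching GRANT statements. The role names, the `ingestion_db` database, and the user names are hypothetical and chosen only for illustration; the recipe itself may define its roles differently.

```python
# Hypothetical role names mapped to MySQL privileges (illustrative only).
ROLE_PRIVILEGES = {
    "admin": ["ALL PRIVILEGES"],
    "writer": ["SELECT", "INSERT", "UPDATE", "DELETE"],
    "read_only": ["SELECT"],
}


def build_grant(role: str, database: str, user: str, host: str = "localhost") -> str:
    """Return a MySQL GRANT statement for `role` on every table in `database`.

    The identifiers are interpolated directly, so this sketch assumes they
    come from trusted configuration and never from user input.
    """
    if role not in ROLE_PRIVILEGES:
        raise ValueError(f"Unknown role: {role!r}")
    privileges = ", ".join(ROLE_PRIVILEGES[role])
    return f"GRANT {privileges} ON `{database}`.* TO '{user}'@'{host}';"


if __name__ == "__main__":
    # Print the statements instead of executing them, so they can be reviewed first.
    for role, user in [("admin", "dba_user"), ("read_only", "analyst_user")]:
        print(build_grant(role, "ingestion_db", user))
```

Keeping the role-to-privilege mapping in one place makes it easier to review who can do what before any statement is run against the server.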

You...

Accessing SSH File Transfer Protocol (SFTP) files

The File Transfer Protocol (FTP) was introduced in the 1970s at the Massachusetts Institute of Technology (MIT) and is an application-layer protocol of the Transmission Control Protocol/Internet Protocol (TCP/IP) suite. Since the 1980s, it has been widely used to transfer files between computers.

Over the years, and with the increase in computer and internet usage, it became necessary to introduce a more secure way to transfer files. The SSH File Transfer Protocol (SFTP), which runs file transfers over an SSH connection, was created to meet that need.

Nowadays, it is common to ingest data from SFTP servers, and in this recipe, we will work to retrieve data from a public SFTP server.

Getting ready

In this recipe, we will write Python code that uses the pysftp library to connect to a public SFTP server and retrieve sample data from it.
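
As a rough sketch of what such code might look like, the following example reads the connection settings from environment variables and downloads a single file with pysftp. The environment variable names and the helper functions are hypothetical, and the recipe's own code may be organized differently.

```python
import os


def load_sftp_settings(env=os.environ):
    """Read connection settings from environment variables (names are hypothetical)."""
    required = ("SFTP_HOST", "SFTP_USER", "SFTP_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing SFTP settings: {', '.join(missing)}")
    return {
        "host": env["SFTP_HOST"],
        "username": env["SFTP_USER"],
        "password": env["SFTP_PASSWORD"],
        "port": int(env.get("SFTP_PORT", 22)),
    }


def download_file(settings, remote_path, local_path):
    """Download one file from the SFTP server described by `settings`."""
    import pysftp  # third-party; imported here so the helper above works without it

    # CnOpts() loads ~/.ssh/known_hosts, so the server's host key is verified.
    cnopts = pysftp.CnOpts()
    with pysftp.Connection(cnopts=cnopts, **settings) as sftp:
        sftp.get(remote_path, local_path)
```

Reading the credentials from the environment keeps them out of the source code, which fits the chapter's goal of not leaking sensitive data.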

If you own an SFTP server, feel free to test the Python code here to exercise a little...

Retrieving data using API authentication

An Application Programming Interface (API) is a set of definitions and rules that allows two systems or applications to communicate or transmit data with each other. The concept has matured in recent years, allowing faster transmission and stronger security through OAuth methods, protections against Denial of Service (DoS) or Distributed Denial of Service (DDoS) attacks, and so on.

APIs are widely used in data ingestion, whether to retrieve the latest logs from an application for analysis or to pull data from BigQuery using a cloud provider such as Google. Most applications nowadays make their data available through an API service, which greatly benefits the data world. The critical aspect here is to know how to retrieve data from an API service using the most widely accepted forms of authentication.

In this recipe, we will retrieve data from a public API using API key authentication, a standard method to gather data.
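
The following minimal sketch shows one common shape of API key authentication, using only the Python standard library. The `X-API-Key` header name and the endpoint in the usage note are assumptions for illustration; each API documents where it expects the key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, api_key, header_name="X-API-Key"):
    """Build a GET request that carries the API key in a header.

    The key travels in a header rather than in the URL, so it is less
    likely to end up in logs that record request URLs.
    """
    query = urlencode(params)
    url = f"{base_url}?{query}" if query else base_url
    return Request(url, headers={header_name: api_key, "Accept": "application/json"})


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call could look like `fetch_json(build_request("https://api.example.com/v1/records", {"year": 2022}, api_key))`, where the URL is a placeholder and `api_key` is loaded from an environment variable or a secrets manager. Some public APIs expect the key as a query parameter instead, in which case it would be added to `params`.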

Getting ready

...

Managing encrypted files

When handling sensitive data, it is common for some fields or even the entire file to be encrypted. This security measure is understandable, since sensitive data can expose the lives of users. After all, encryption is the process of converting information into code that hides the original content.

Nonetheless, we must still ingest and process these encrypted files in our data pipelines. To be able to do so, we need to understand a bit more about how encryption works and how it is done.

In this recipe, we will decrypt a GnuPG-encrypted (where GnuPG stands for GNU Privacy Guard) file using Python libraries and best practices.
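
As a minimal sketch of the decryption step, the function below uses the `python-gnupg` package, assuming that both the package and the GnuPG binary are installed and that the matching private key has already been imported into the keyring. The function name and its parameters are hypothetical.

```python
from pathlib import Path


def decrypt_file(encrypted_path, output_path, passphrase, gnupg_home=None):
    """Decrypt a GnuPG-encrypted file and write the plaintext to `output_path`."""
    encrypted_path = Path(encrypted_path)
    if not encrypted_path.is_file():
        raise FileNotFoundError(f"Encrypted file not found: {encrypted_path}")

    import gnupg  # third-party `python-gnupg`; also needs the gpg binary on the OS

    # `gnupg_home` points at the keyring directory; None uses GnuPG's default.
    gpg = gnupg.GPG(gnupghome=gnupg_home)
    with encrypted_path.open("rb") as encrypted:
        result = gpg.decrypt_file(encrypted, passphrase=passphrase, output=str(output_path))
    if not result.ok:
        # `status` carries GnuPG's own description of what went wrong.
        raise RuntimeError(f"Decryption failed: {result.status}")
    return Path(output_path)
```

The passphrase is passed in as an argument so that it can come from an environment variable or a secrets manager rather than being hardcoded in the pipeline.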

Getting ready

Before jumping into the fun part, we must install the GnuPG library on our local machine and download the encrypted dataset.

You will need two installations for GnuPG: one for the operating system (OS) and another for the Python package. This is because the Python package requires...

Accessing data from AWS using S3

AWS is one of the most popular cloud providers, mixing different service architectures and allowing easy and fast implementations.

While it has various solutions for relational and non-relational databases, in this recipe, we will cover how to manage data access for S3, an object storage service that accepts not only text files but also media and several other types of files used in the IoT and big data fields.

There are two commonly used types of data access management for S3 buckets, both used in ingestion pipelines: user control and bucket policies. In this recipe, we will learn how to manage access by user control, given that it is the most used method among data ingestion pipelines.
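
User control of this kind is typically expressed as an IAM policy document attached to a user or role. The sketch below builds a read-only policy for a single bucket; the function name and the bucket name in the usage note are hypothetical, and the policy the recipe builds may grant different actions.

```python
import json


def read_only_bucket_policy(bucket_name: str, prefix: str = "") -> dict:
    """Return an IAM policy document that allows listing and reading one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    object_arn = f"{bucket_arn}/{prefix}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [object_arn],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_bucket_policy("my-ingestion-bucket", "raw/"), indent=2))
```

Granting only `s3:ListBucket` and `s3:GetObject` follows the least-privilege idea: an ingestion job that only reads data has no need for write or delete permissions.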

Getting ready

To follow this recipe, having or creating an AWS account is not mandatory. The objective is to build a step-by-step Identity and Access Management (IAM) policy to retrieve data from an S3 bucket using good data access...

Accessing data from GCP using Cloud Storage

Google Cloud Platform (GCP) is a cloud provider that offers a wide range of services, from cloud computing to Artificial Intelligence (AI), which can be implemented in only a few steps. It also provides a general-purpose object storage service called Cloud Storage.

In this recipe, we will build step-by-step policies to control access to data inside our Cloud Storage buckets.
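
The recipe performs these steps in the GCP console. For comparison, here is a minimal programmatic sketch that grants users read access to a bucket with the `google-cloud-storage` client library, assuming uniform bucket-level access and application default credentials. The `roles/storage.objectViewer` role and the helper names are assumptions for illustration, not the roles used in the recipe.

```python
def viewer_binding(emails, role="roles/storage.objectViewer"):
    """Build an IAM binding that gives `role` to the given user accounts."""
    return {"role": role, "members": {f"user:{email}" for email in emails}}


def grant_bucket_access(bucket_name, emails):
    """Append a read-only binding to the bucket's IAM policy and save it."""
    from google.cloud import storage  # third-party `google-cloud-storage`

    client = storage.Client()  # uses application default credentials
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(viewer_binding(emails))
    bucket.set_iam_policy(policy)
    return policy
```

Because the binding is attached to the bucket's IAM policy rather than to individual objects, the same permission applies uniformly to every object in the bucket.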

Getting ready

This recipe will use the uniform (bucket-level) access method, as defined by the Google Cloud team:

  1. First, we will create a testing user. Go to the IAM page (https://console.cloud.google.com/iam-admin/iam) and select Grant Access. Add a valid Gmail address in the New principals field. For now, this user will only have the Browser role:
Figure 2.22 – The GCP IAM page to attach policies to a user


  2. Then, we will create a Cloud Storage bucket. Go to the Cloud Storage page and select Create a bucket: https://console.cloud.google.com/storage/create...