You're reading from Python Real-World Projects

Product type Book

Published in Sep 2023

Publisher Packt

ISBN-13 9781803246765

Pages 478 pages

Edition 1st Edition

Author (1):

Steven F. Lott

Table of Contents (20) Chapters

Preface

1. Chapter 1: Project Zero: A Template for Other Projects

2. Chapter 2: Overview of the Projects

3. Chapter 3: Project 1.1: Data Acquisition Base Application

4. Chapter 4: Data Acquisition Features: Web APIs and Scraping

5. Chapter 5: Data Acquisition Features: SQL Database

6. Chapter 6: Project 2.1: Data Inspection Notebook

7. Chapter 7: Data Inspection Features

8. Chapter 8: Project 2.5: Schema and Metadata

9. Chapter 9: Project 3.1: Data Cleaning Base Application

10. Chapter 10: Data Cleaning Features

11. Chapter 11: Project 3.7: Interim Data Persistence

12. Chapter 12: Project 3.8: Integrated Data Acquisition Web Service

13. Chapter 13: Project 4.1: Visual Analysis Techniques

14. Chapter 14: Project 4.2: Creating Reports

15. Chapter 15: Project 5.1: Modeling Base Application

16. Chapter 16: Project 5.2: Simple Multivariate Statistics

17. Chapter 17: Next Steps

18. Other Books You Might Enjoy

19. Index

2.1 General data acquisition

All data analysis processing starts with the essential step of acquiring the data from a source.

That statement seems almost too obvious to make, but failures at this step often lead to complicated rework later. It’s important to recognize that data exists in two essential forms:

- Python objects, usable in analytic programs. While the obvious candidates are numbers and strings, this includes using packages like Pillow to operate on images as Python objects. A package like librosa can create objects representing audio data.
- A serialization of a Python object. There are many choices here:
  - Text. Some kind of string. There are numerous syntax variants, including CSV, JSON, TOML, YAML, HTML, XML, etc.
  - Pickled Python Objects. These are created by the pickle module.
  - Binary Formats. Tools like Protobuf can serialize native Python objects into a stream of bytes. Some YAML extensions, similarly, can serialize an object in a binary format that isn’t text. Images and audio samples are often stored in compressed binary formats.
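
As a minimal sketch of these two forms, the following uses only the standard library to turn one Python object into a text serialization (JSON) and a pickled serialization, then back into objects. The `sample` record and its field names are invented for illustration; they are not taken from the book's projects.

```python
import json
import pickle

# A hypothetical record; the field names are illustrative only.
sample = {"series": "I", "x": 10.0, "y": 8.04}

# Text serialization: a str in JSON syntax.
as_text = json.dumps(sample)

# Pickled serialization: a bytes object created by the pickle module.
as_pickle = pickle.dumps(sample)

# Each serialization can be turned back into an equal Python object.
from_text = json.loads(as_text)
from_pickle = pickle.loads(as_pickle)

print(type(as_text).__name__, type(as_pickle).__name__)
print(from_text == sample, from_pickle == sample)
```

One caution on the pickle form: loading a pickle can execute arbitrary code, so it is only appropriate for data from a trusted source.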

The format of the source data is, almost universally, not fixed by any rules or conventions. An application written on the assumption that source data is always a CSV-format file can run into problems when a new format is required.

It’s best to treat all input formats as subject to change. Once acquired, the data can be saved in a common format that the analysis pipeline uses, independent of the source format (we’ll get to persistence in Clean, validate, standardize, and persist).
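
One way to keep the source format from leaking into the rest of the pipeline is sketched below. It is not the book's design: the names `read_csv`, `read_json`, `READERS`, and `acquire` are hypothetical, and newline-delimited JSON is assumed as the common format purely for the example. Each reader turns one source format into plain dictionaries, and a single function writes those dictionaries out the same way regardless of where they came from.

```python
import csv
import io
import json
from collections.abc import Callable, Iterator
from typing import TextIO

Row = dict[str, str]


def read_csv(source: TextIO) -> Iterator[Row]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(source)


def read_json(source: TextIO) -> Iterator[Row]:
    """Yield one dict per object in a JSON array, with values as strings."""
    for item in json.load(source):
        yield {key: str(value) for key, value in item.items()}


# Supporting a new source format means adding one entry here.
READERS: dict[str, Callable[[TextIO], Iterator[Row]]] = {
    "csv": read_csv,
    "json": read_json,
}


def acquire(source: TextIO, source_format: str, target: TextIO) -> int:
    """Copy rows to target as newline-delimited JSON; return the row count."""
    count = 0
    for row in READERS[source_format](source):
        target.write(json.dumps(row) + "\n")
        count += 1
    return count


# The same data in two source formats produces identical common-format output.
csv_out, json_out = io.StringIO(), io.StringIO()
acquire(io.StringIO("x,y\n10.0,8.04\n"), "csv", csv_out)
acquire(io.StringIO('[{"x": 10.0, "y": 8.04}]'), "json", json_out)
print(csv_out.getvalue() == json_out.getvalue())
```

Everything downstream of `acquire` reads only the common format, so a change in the source format touches one reader function and nothing else.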

We’ll start with Project 1.1: “Acquire Data”. This will build the Data Acquisition Base Application. It will acquire CSV-format data and serve as the basis for adding formats in later projects.

Data can be acquired in several different ways. In the next few chapters, we’ll look at some alternative data extraction approaches.