Python Real-World Projects

Product type Book
Published in Sep 2023
Publisher Packt
ISBN-13 9781803246765
Pages 478 pages
Edition 1st Edition
Author: Steven F. Lott

Table of Contents

Preface
Chapter 1: Project Zero: A Template for Other Projects
Chapter 2: Overview of the Projects
Chapter 3: Project 1.1: Data Acquisition Base Application
Chapter 4: Data Acquisition Features: Web APIs and Scraping
Chapter 5: Data Acquisition Features: SQL Database
Chapter 6: Project 2.1: Data Inspection Notebook
Chapter 7: Data Inspection Features
Chapter 8: Project 2.5: Schema and Metadata
Chapter 9: Project 3.1: Data Cleaning Base Application
Chapter 10: Data Cleaning Features
Chapter 11: Project 3.7: Interim Data Persistence
Chapter 12: Project 3.8: Integrated Data Acquisition Web Service
Chapter 13: Project 4.1: Visual Analysis Techniques
Chapter 14: Project 4.2: Creating Reports
Chapter 15: Project 5.1: Modeling Base Application
Chapter 16: Project 5.2: Simple Multivariate Statistics
Chapter 17: Next Steps
Other Books You Might Enjoy
Index

2.1 General data acquisition

All data analysis processing starts with the essential step of acquiring the data from a source.

That statement seems almost too obvious to make, but failures at this step often lead to complicated rework later. It helps to recognize that data exists in two essential forms:

  • Python objects, usable in analytic programs. While the obvious candidates are numbers and strings, this includes using packages like Pillow to operate on images as Python objects. A package like librosa can create objects representing audio data.

  • A serialization of a Python object. There are many choices here:

    • Text. The serialized form is a string. There are numerous syntax variants, including CSV, JSON, TOML, YAML, HTML, and XML.

    • Pickled Python Objects. These are created by the pickle module.

    • Binary Formats. Tools like Protobuf can serialize native Python objects into a stream of bytes. Some YAML extensions, similarly, can serialize an object in a binary format that isn’t text. Images and audio samples are often stored in compressed binary formats.
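The distinction between an object and its serializations can be seen in a minimal sketch using only the standard library. The `sample` dictionary is hypothetical data made up for illustration; `json` stands in for the text formats and `pickle` for pickled objects.

```python
import json
import pickle

# Hypothetical sample: a small Python object an analytic program might use.
sample = {"series": "I", "x": [10.0, 8.0, 13.0], "y": [8.04, 6.95, 7.58]}

# Text serialization: JSON produces a str.
as_text = json.dumps(sample)

# Pickled serialization: pickle produces bytes meant to be read back by Python.
as_pickle = pickle.dumps(sample)

# Either serialized form can be turned back into an equal Python object.
from_text = json.loads(as_text)
from_pickle = pickle.loads(as_pickle)

print(type(as_text).__name__, type(as_pickle).__name__)
print(from_text == sample, from_pickle == sample)
```

The analysis code only ever sees the Python object; which serialization carried it is a detail of acquisition. Note that unpickling data from an untrusted source is unsafe, which is one reason text formats are common for source data.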

The format of the source data is almost never fixed by any rules or conventions. Writing an application on the assumption that source data is always a CSV-format file can lead to problems when a new format is required.

It’s best to treat all input formats as subject to change. Once acquired, the data can be saved in a common format that is used by the analysis pipeline and is independent of the source format (we’ll get to persistence in Clean, validate, standardize, and persist).
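One way to keep the source format replaceable is to hide each format behind a reader function that yields the same in-memory form. The following is a rough sketch under stated assumptions, not the book's design: the names `Row`, `read_csv`, `read_ndjson`, `READERS`, and `acquire` are all hypothetical, and the common form is assumed to be one dictionary of strings per row.

```python
import csv
import io
import json
from collections.abc import Callable, Iterator
from typing import TextIO

# Assumed common form for this sketch: one dict of strings per source row.
Row = dict[str, str]


def read_csv(source: TextIO) -> Iterator[Row]:
    """Yield rows from CSV text that starts with a header line."""
    yield from csv.DictReader(source)


def read_ndjson(source: TextIO) -> Iterator[Row]:
    """Yield rows from newline-delimited JSON, one object per line."""
    for line in source:
        if line.strip():
            # Convert values to str so both readers produce the same form.
            yield {key: str(value) for key, value in json.loads(line).items()}


# Hypothetical registry: supporting a new format means adding one entry.
READERS: dict[str, Callable[[TextIO], Iterator[Row]]] = {
    "csv": read_csv,
    "ndjson": read_ndjson,
}


def acquire(source: TextIO, format_name: str) -> list[Row]:
    """Read any registered format into the common list-of-dicts form."""
    return list(READERS[format_name](source))


csv_rows = acquire(io.StringIO("x,y\n10.0,8.04\n8.0,6.95\n"), "csv")
json_rows = acquire(
    io.StringIO('{"x": 10.0, "y": 8.04}\n{"x": 8.0, "y": 6.95}\n'), "ndjson"
)
print(csv_rows == json_rows)
```

With this shape, downstream code depends only on `Row`, so a change of source format touches the registry and one reader rather than the whole pipeline.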

We’ll start with Project 1.1: “Acquire Data”. This will build the Data Acquisition Base Application. It will acquire CSV-format data and serve as the basis for adding formats in later projects.

There are a number of variants on how data is acquired. In the next few chapters, we’ll look at some alternative data extraction approaches.
