What do you get with Print?

Instant access to your digital copy whilst your Print order is Shipped

Paperback book shipped to your preferred address

Redeem a companion digital copy on all Print orders

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

AI Assistant (beta) to help accelerate your learning

Python Real-World Projects

Chapter 2
Overview of the Projects

Our general plan is to craft analytic, decision support modules and applications. These applications support decision-making by providing summaries of available data to the stakeholders. Decision-making spans a spectrum from uncovering new relationships among variables to confirming that data variation is random noise within narrow limits. The processing will start with acquiring data and moving it through several stages until statistical summaries can be presented.

The processing will be decomposed into several stages. Each stage will be built as a core concept application. There will be subsequent projects to add features to the core application. In some cases, a number of features will be added to several projects all combined into a single chapter.

The stages are inspired by the Extract-Transform-Load (ETL) architectural pattern. The design in this book expands on the ETL design with a number of additional steps. The words have been changed because the legacy terminology can be misleading. These features – often required for real-world pragmatic applications — will be inserted as additional stages in a pipeline.

Once the data is cleaned and standardized, then the book will describe some simple statistical models. The analysis will stop there. You are urged to move to more advanced books that cover AI and machine learning.

There are 22 distinct projects, many of which build on previous results. It’s not required to do all of the projects in order. When skipping a project, however, it’s important to read the description and deliverables for the project being skipped. This can help to more fully understand the context for the later projects.

This chapter will cover our overall architectural approach to creating a complete sequence of data analysis programs. We’ll use the following multi-stage approach:

Data acquisition
Inspection of data
Cleaning data; this includes validating, converting, standardizing, and saving intermediate results
Summarizing, and modeling data
Creating more sophisticated statistical models

The stages fit together as shown in Figure 2.1.

A central idea behind this is separation of concerns. Each stage is a distinct operation, and each stage may evolve independently of the others. For example, there may be multiple sources of data, leading to several distinct data acquisition implementations, each of which creates a common internal representation to permit a single, uniform inspection tool.

Similarly, data cleansing problems seem to arise almost randomly in organizations, leading to a need to add distinct validation and standardization operations. The idea is to allocate responsibility for semantic special cases and exceptions in this stage of the pipeline.

One of the architectural ideas is to mix automated applications and a few manual JupyterLab notebooks into an integrated whole. The notebooks are essential for troubleshooting questions or problems. For elegant reports and presentations, notebooks are also very useful. While Python applications can produce tidy PDF files with polished reporting, it seems a bit easier to publish a notebook with analysis and findings.

We’ll start with the acquisition stage of processing.

2.2 Acquisition via Extract

Since data formats are in a constant state of flux, it’s helpful to understand how to add and modify data formats. These projects will all build on Project 1.1 by adding features to the base application. The following projects are designed around alternative sources for data:

Project 1.2: ”Acquire Web Data from an API”. This project will acquire data from web services using JSON format.
Project 1.3: ”Acquire Web Data from HTML”. This project will acquire data from a web page by scraping the HTML.
Two separate projects are part of gathering data from a SQL database:
- Project 1.4: ”Build a Local Database”. This is a necessary sidebar project to build a local SQL database. This is necessary because SQL databases accessible by the public are a rarity. It’s more secure to build our own demonstration database.
- Project 1.5: ”Acquire Data from a Local Database”. Once a database is available, we can acquire data from a SQL extract.

These projects will focus on data represented as text. For CSV files, the data is text; an application must convert it to a more useful Python type. HTML pages, also, are pure text. Sometimes, additional attributes are provided to suggest the text should be treated as a number. A SQL database is often populated with non-text data. To be consistent, the SQL data should be serialized as text. The acquisition applications all share a common approach of working with text.

These applications will also minimize the transformations applied to the source data. To process the data consistently, it’s helpful to make a shift to a common format. As we’ll see in Chapter 3, Project 1.1: Data Acquisition Base Application the NDJSON format provides a useful structure that can often be mapped back to source files.

After acquiring new data, it’s prudent to do a manual inspection. This is often done a few times at the start of application development. After that, inspection is only done to diagnose problems with the source data. The next few chapters will cover projects to inspect data.

2.3 Inspection

Data inspection needs to be done when starting development. It’s essential to survey new data to be sure it really is what’s needed to solve the user’s problems. A common frustration is incomplete or inconsistent data, and these problems need to be exposed as soon as possible to avoid wasting time and effort creating software to process data that doesn’t really exist.

Additionally, data is inspected manually to uncover problems. It’s important to recognize that data sources are in a constant state of flux. As applications evolve and mature, the data provided for analysis will change. In many cases, data analytics applications discover other enterprise changes after the fact via invalid data. It’s important to understand the evolution via good data inspection tools.

Inspection is an inherently manual process. Therefore, we’re going to use JupyterLab to create notebooks to look at the data and determine some basic features.

In rare cases where privacy is important, developers may not be allowed to do data inspection. More privileged people — with permission to see payment card or healthcare details — may be part of data inspection. This means an inspection notebook may be something created by a developer for use by stakeholders.

In many cases, a data inspection notebook can be the start of a fully-automated data cleansing application. A developer can extract notebook cells as functions, building a module that’s usable from both notebook and application. The cell results can be used to create unit test cases.

The stage in the pipeline requires a number of inspection projects:

Project 2.1: ”Inspect Data”. This will build a core data inspection notebook with enough features to confirm that some of the acquired data is likely to be valid.
Project 2.2: ”Inspect Data: Cardinal Domains”. This project will add analysis features for measurements, dates, and times. These are cardinal domains that reflect measures and counts.
Project 2.3: ”Inspect Data: Nominal and Ordinary Domains”. This project will add analysis features for text or coded numeric data. This includes nominal data and ordinal numeric domains. It’s important to recognize that US Zip Codes are digit strings, not numbers.
Project 2.4: ”Inspect Data: Reference Data”. This notebook will include features to find reference domains when working with data that has been normalized and decomposed into subsets with references via coded ”key” values.
Project 2.5: ”Define a Reusable Schema”. As a final step, it can help define a formal schema, and related metadata, using the JSON Schema standard.

While some of these projects seem to be one-time efforts, they often need to be written with some care. In many cases, a notebook will need to be reused when there’s a problem. It helps to provide adequate explanations and test cases to help refresh someone’s memory on details of the data and what are known problem areas. Additionally, notebooks may serve as examples for test cases and the design of Python classes or functions to automate cleaning, validating, or standardizing data.

After a detailed inspection, we can then build applications to automate cleaning, validating, and normalizing the values. The next batch of projects will address this stage of the pipeline.

2.4 Clean, validate, standardize, and persist

Once the data is understood in a general sense, it makes sense to write applications to clean up any serialization problems, and perform more formal tests to be sure the data really is valid. One frustratingly common problem is receiving duplicate files of data; this can happen when scheduled processing was disrupted somewhere else in the enterprise, and a previous period’s files were reused for analysis.

The validation testing is sometimes part of cleaning. If the data contains any unexpected invalid values, it may be necessary to reject it. In other cases, known problems can be resolved as part of analytics by replacing invalid data with valid data. An example of this is US Postal Codes, which are (sometimes) translated into numbers, and the leading zeros are lost.

These stages in the data analysis pipeline are described by a number of projects:

Project 3.1: ”Clean Data”. This builds the data cleaning base application. The design details can come from the data inspection notebooks.
Project 3.2: ”Clean and Validate”. These features will validate and convert numeric fields.
Project 3.3: ”Clean and Validate Text and Codes”. The validation of text fields and numeric coded fields requires somewhat more complex designs.
Project 3.4: ”Clean and Validate References”. When data arrives from separate sources, it is essential to validate references among those sources.
Project 3.5: ”Standardize Data”. Some data sources require standardizing to create common codes and ranges.
Project 3.6: ”Acquire and Clean Pipeline”. It’s often helpful to integrate the acquisition, cleaning, validating, and standardizing into a single pipeline.
Project 3.7: ”Acquire, Clean, and Save”. One key architectural feature of this pipeline is saving intermediate files in a common format, distinct from the data sources.
Project 3.8: ”Data Provider Web Service”. In many enterprises, an internal web service and API are expected as sources for analytic data. This project will wrap the data acquisition pipeline into a RESTful web service.

In these projects, we’ll transform the text values from the acquisition applications into more useful Python objects like integers, floating-point values, decimal values, and date-time values.

Once the data is cleaned and validated, the exploration can continue. The first step is to summarize the data, again, using a Jupyter notebook to create readable, publishable reports and presentations. The next chapters will explore the work of summarizing data.

2.8 Summary

This data analysis pipeline moves data from sources through a series of stages to create clean, valid, standardized data. The general flow supports a variety of needs and permits a great deal of customization and extension.

For developers with an interest in data science or machine learning, these projects cover what is sometimes called the ”data wrangling” part of data science or machine learning. It can be a significant complication as data is understood and differences among data sources are resolved and explored. These are the — sometimes difficult — preparatory steps prior to building a model that can be used for AI decision-making.

For readers with an interest in the web, this kind of data processing and extraction is part of presenting data via a web application API or website. Project 3.7 creates a web server, and will be of particular interest. Because the web service requires clean data, the preceding projects are helpful for creating data that can be published.

For folks with an automation or IoT interest, Part 2 explains how to use Jupyter Notebooks to gather and inspect data. This is a common need, and the various steps to clean, validate, and standardize data become all the more important when dealing with real-world devices subject to the vagaries of temperature and voltage.

We’ve looked at the following multi-stage approach to doing data analysis:

Data Acquisition
Inspection of Data
Clean, Validate, Standardize, and Persist
Summarize and Analyze
Create a Statistical Model

This pipeline follows the Extract-Transform-Load (ETL) concept. The terms have been changed because the legacy words are sometimes misleading. Our acquisition stage overlaps with what is understood as the ”Extract” operation. For some developers, Extract is limited to database extracts; we’d like to go beyond that to include other data source transformations. Our cleaning, validating, and standardizing stages are usually combined into the ”Transform” operation. Saving the clean data is generally the objective of ”Load”; we’re not emphasizing a database load, but instead, we’ll use files.

Throughout the book, we’ll describe each project’s objective and provide the foundation of a sound technical approach. The details of the implementation are up to you. We’ll enumerate the deliverables; this may repeat some of the information from Chapter 1, Project Zero: A Template for Other Projects. The book provides a great deal of information on acceptance test cases and unit test cases — the definition of done. By covering the approach, we’ve left room for you to design and implement the needed application software.

In the next chapter, we’ll build the first data acquisition project. This will work with CSV-format files. Later projects will work with database extracts and web services.

Key benefits

Master Python and related technologies by working on 12 hands-on projects

Accelerate your career by building a personal project portfolio

Explore data acquisition, preparation, and analysis applications

Purchase of the print or Kindle book includes a free PDF eBook

Description

In today's competitive job market, a project portfolio often outshines a traditional resume. Python Real-World Projects empowers you to get to grips with crucial Python concepts while building complete modules and applications. With two dozen meticulously designed projects to explore, this book will help you showcase your Python mastery and refine your skills. Tailored for beginners with a foundational understanding of class definitions, module creation, and Python's inherent data structures, this book is your gateway to programming excellence. You’ll learn how to harness the potential of the standard library and key external projects like JupyterLab, Pydantic, pytest, and requests. You’ll also gain experience with enterprise-oriented methodologies, including unit and acceptance testing, and an agile development approach. Additionally, you’ll dive into the software development lifecycle, starting with a minimum viable product and seamlessly expanding it to add innovative features. By the end of this book, you’ll be armed with a myriad of practical Python projects and all set to accelerate your career as a Python programmer.

Who is this book for?

This book is for beginner-to-intermediate level Python programmers looking to enhance their resume by adding a portfolio of 12 practical projects. A basic understanding of the Python language and its aligned technologies is a must. The book helps you polish your Python skills and project-building prowess without delving into basic Python fundamentals.

What you will learn

Explore core deliverables for an application including documentation and test cases

Discover approaches to data acquisition such as file processing, RESTful APIs, and SQL queries

Create a data inspection notebook to establish properties of source data

Write applications to validate, clean, convert, and normalize source data

Use foundational graphical analysis techniques to visualize data

Build basic univariate and multivariate statistical analysis tools

Create reports from raw data using JupyterLab publication tools

What do you get with Print?

Instant access to your digital copy whilst your Print order is Shipped

Paperback book shipped to your preferred address

Redeem a companion digital copy on all Print orders

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

AI Assistant (beta) to help accelerate your learning

Frequently bought together

50 Algorithms Every Programmer Should Know

€37.99

€38.99

€34.99

Total € 111.97

POE Oct 09, 2023

The projects presented in the book are presented with a focus on how to approach them. The projects are not completed, they are meant to help the reader get started and finish on their own based on the approaches detailed in the book. This is not the normal programming book that provides all source code and then explains it – this requires more from the reader. Personally, I think this is a great approach. We learn to program by programming and this book will push you to the limit.

Amazon Verified review

K Lavanya Nov 10, 2023

"Python Real World Objects" appears to provide a practical, project-based approach to mastering data handling in Python. It walks us through the full data analytics pipeline, from data acquisition to statistical modeling. With its focus on real-world applications and step-by-step projects, the book offers valuable insights to enhance our skills.The gradual project methodology used in the book, together with the incorporation of concepts such as online APIs and JSON Schema, helped me to become familiar with the complexities of processing data at the business level.

Tiny Oct 10, 2023

This book was a truly different learning experience. Rather than focusing on some principles with small projects, it is massive project after massive project. The approach to building nd learning items is awesome, and could be an excellent reference for those developing exclusively in Python. Even without that emphasis, the access to the GitHub pieces to explore and template. It always helps me to be able to look through how others solve problems, and this did exactly that for me. If the hardest part of coding for you is getting started with that template, you could definitely use this book.

Konstantin Mazanik Nov 14, 2023

Overall, it is a great book! Not for beginners, this book is for experienced programmers who want to improve their skills by completing professional-level Python projects. It’s also for developers who need to display their skills by demonstrating a portfolio of work. The included projects require a wide variety of skills, including software and data architecture, design, Python programming and test design. You can download the code bundle for the book on GitHub.This book is all about data handling. The great and unique feature of the book is that it provides big project after another big project. Starting with Acquiring Data From Sources, the first six projects will cover projects to acquire data for analytic processing from a variety of sources. Next topic Inspect and Survey includes five projects to inspect data to make sure it’s usable, and diagnose odd problems, outliers, and exceptions. The general analytics pipeline moves on to Cleaning, Converting, and Normalizing. There are eight projects that tackle these closely-related problems. The useful results begin with Presenting Summaries. There’s a lot of variability here, so only two project ideas included. This book winds up with two small projects covering some basics of Statistical Modeling. Author encourages you to continue your study of Python applications in the data science realm.Completion of the projects will place you on another level in Python. Enjoy the challenge in every project!

AlbertoEG Oct 19, 2024

Unprofessional content formatting/layout. Repetitive content. Incomplete code provided.It seems like the text has been automatically generated and then adapted or revised.I don't understand the other reviews, with 5 stars. They encouraged me to buy this book, but now I feel deceived.

Python Real-World Projects: Craft your Python portfolio with deployable applications

What do you get with Print?

Python Real-World Projects

Chapter 2
Overview of the Projects

2.1 General data acquisition

2.2 Acquisition via Extract

2.3 Inspection

2.4 Clean, validate, standardize, and persist

2.5 Summarize and analyze

2.6 Statistical modeling

2.7 Data contracts

2.8 Summary

Page 1 of 9

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Product Details

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Python Real-World Projects: Craft your Python portfolio with deployable applications

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Key benefits

Description

Who is this book for?

What you will learn

Product Details

What do you get with Print?

Contact Details

Shipping Address

Billing Address

Product Details

Packt Subscriptions

Frequently bought together

Table of Contents

Recommendations for you

Customer reviews

People who bought this also bought

About the author

FAQs

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access