Python Real-World Projects: Craft your Python portfolio with deployable applications

By Steven F. Lott

Product Details

Publication date: Sep 15, 2023
Length: 478 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781803246765

Chapter 2
Overview of the Projects

Our general plan is to craft analytic, decision-support modules and applications. These applications support decision-making by providing summaries of the available data to the stakeholders. Decision-making spans a spectrum from uncovering new relationships among variables to confirming that data variation is random noise within narrow limits. The processing starts with acquiring data and moves it through several stages until statistical summaries can be presented.

The processing will be decomposed into several stages. Each stage will be built as a core concept application, and subsequent projects will add features to that core application. In some cases, several feature-adding projects are combined into a single chapter.

The stages are inspired by the Extract-Transform-Load (ETL) architectural pattern. The design in this book expands on ETL with a number of additional steps; the terminology has been changed because the legacy terms can be misleading. These features, often required for pragmatic real-world applications, will be inserted as additional stages in the pipeline.

Once the data is cleaned and standardized, the book will describe some simple statistical models. The analysis will stop there; you are urged to move on to more advanced books that cover AI and machine learning.

There are 22 distinct projects, many of which build on previous results. It’s not required to do all of the projects in order. When skipping a project, however, it’s important to read the description and deliverables for the project being skipped; this helps to build a fuller understanding of the context for the later projects.

This chapter will cover our overall architectural approach to creating a complete sequence of data analysis programs. We’ll use the following multi-stage approach:

  • Data acquisition

  • Inspection of data

  • Cleaning data; this includes validating, converting, standardizing, and saving intermediate results

  • Summarizing and modeling data

  • Creating more sophisticated statistical models

The stages fit together as shown in Figure 2.1.

Figure 2.1: Data Analysis Pipeline

A central idea behind this is separation of concerns. Each stage is a distinct operation, and each stage may evolve independently of the others. For example, there may be multiple sources of data, leading to several distinct data acquisition implementations, each of which creates a common internal representation to permit a single, uniform inspection tool.
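
To make this concrete, the pipeline can be pictured as a chain of small functions, each consuming the previous stage’s output. The sketch below is illustrative only; the function names and signatures are placeholders, not the book’s actual project modules.

    from pathlib import Path
    from typing import Any

    # Hypothetical stage functions -- stand-ins for the projects in this book.

    def acquire(source: Path) -> list[dict[str, str]]:
        """Read raw records from a source; every field is kept as text."""
        ...

    def clean(raw: list[dict[str, str]]) -> list[dict[str, Any]]:
        """Validate, convert, and standardize the acquired records."""
        ...

    def summarize(samples: list[dict[str, Any]]) -> dict[str, float]:
        """Compute statistical summaries of the cleaned data."""
        ...

    def analysis_pipeline(source: Path) -> dict[str, float]:
        # Each stage can evolve independently; only the interfaces are shared.
        return summarize(clean(acquire(source)))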

Similarly, data cleansing problems seem to arise almost randomly in organizations, leading to a need to add distinct validation and standardization operations. The idea is to allocate responsibility for semantic special cases and exceptions in this stage of the pipeline.

One of the architectural ideas is to mix automated applications and a few manual JupyterLab notebooks into an integrated whole. The notebooks are essential for troubleshooting questions or problems. For elegant reports and presentations, notebooks are also very useful. While Python applications can produce tidy PDF files with polished reporting, it seems a bit easier to publish a notebook with analysis and findings.

We’ll start with the acquisition stage of processing.

2.1 General data acquisition

All data analysis processing starts with the essential step of acquiring the data from a source.

The above statement seems almost silly, but failures in this effort often lead to complicated rework later. It’s important to recognize that data exists in two essential forms:

  • Python objects, usable in analytic programs. While the obvious candidates are numbers and strings, this includes using packages like Pillow to operate on images as Python objects. A package like librosa can create objects representing audio data.

  • A serialization of a Python object. There are many choices here:

    • Text. Some kind of string. There are numerous syntax variants, including CSV, JSON, TOML, YAML, HTML, XML, etc.

    • Pickled Python Objects. These are created by the pickle module.

    • Binary Formats. Tools like Protobuf can serialize native Python objects into a stream of bytes. Some YAML extensions, similarly, can serialize an object in a binary format that isn’t text. Images and audio samples are often stored in compressed binary formats.

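To make the distinction concrete, here is a small illustration (not taken from the book) of one Python object and two serializations of it: a JSON text and a pickled byte stream.

    import json
    import pickle

    # A Python object usable directly in analytic programs.
    sample = {"station": "KSEA", "temperature_c": 11.7, "readings": [10.9, 11.7, 12.2]}

    # Serialization 1: text (JSON); the result is a str.
    as_text = json.dumps(sample)

    # Serialization 2: a pickled Python object; the result is bytes.
    as_bytes = pickle.dumps(sample)

    # Either form can be deserialized back into an equal Python object.
    assert json.loads(as_text) == pickle.loads(as_bytes) == sample
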
The format for the source data is — almost universally — not fixed by any rules or conventions. Writing an application based on the assumption that source data is always a CSV-format file can lead to problems when a new format is required.

It’s best to treat all input formats as subject to change. The data — once acquired — can be saved in a common format used by the analysis pipeline, and independent of the source format (we’ll get to the persistence in Clean, validate, standardize, and persist).

We’ll start with Project 1.1: ”Acquire Data”. This will build the Data Acquisition Base Application. It will acquire CSV-format data and serve as the basis for adding formats in later projects.
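
The heart of such an application might resemble the following sketch, which keeps every field as text and leaves conversions for a later stage. The file name and columns here are invented for illustration.

    import csv
    from pathlib import Path

    def acquire_csv(source: Path) -> list[dict[str, str]]:
        """Read a CSV file, keeping every field as a string."""
        with source.open(newline="") as csv_file:
            return list(csv.DictReader(csv_file))

    # Hypothetical usage:
    # rows = acquire_csv(Path("data/series_1.csv"))
    # rows[0] might be {'x': '10.0', 'y': '8.04'}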

There are a number of variants on how data is acquired. In the next few chapters, we’ll look at some alternative data extraction approaches.

2.2 Acquisition via Extract

Since data formats are in a constant state of flux, it’s helpful to understand how to add and modify them. These projects will all build on Project 1.1 by adding features to the base application. The following projects are designed around alternative sources for data:

  • Project 1.2: ”Acquire Web Data from an API”. This project will acquire data from web services using JSON format.

  • Project 1.3: ”Acquire Web Data from HTML”. This project will acquire data from a web page by scraping the HTML.

  • Two separate projects are part of gathering data from a SQL database:

    • Project 1.4: ”Build a Local Database”. This sidebar project builds a local SQL database. It’s necessary because SQL databases accessible by the public are a rarity; it’s more secure to build our own demonstration database.

    • Project 1.5: ”Acquire Data from a Local Database”. Once a database is available, we can acquire data from a SQL extract.

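As a taste of Project 1.2’s approach, acquiring JSON data from a RESTful web service can be quite compact. The sketch below assumes the third-party requests package (used elsewhere in this book); the URL is a placeholder, not a real service.

    import requests  # third-party package: pip install requests

    def acquire_json(url: str) -> list[dict]:
        """Fetch a JSON document from a web service and return the parsed payload."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        return response.json()

    # Hypothetical endpoint:
    # records = acquire_json("https://example.com/api/v1/series/1/samples")
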
These projects will focus on data represented as text. For CSV files, the data is text; an application must convert it to a more useful Python type. HTML pages, too, are pure text; sometimes, additional attributes are provided to suggest the text should be treated as a number. A SQL database is often populated with non-text data, but to keep things consistent, the SQL data will be serialized as text. The acquisition applications all share a common approach of working with text.

These applications will also minimize the transformations applied to the source data. To process the data consistently, it’s helpful to make a shift to a common format. As we’ll see in Chapter 3, Project 1.1: Data Acquisition Base Application, the NDJSON format provides a useful structure that can often be mapped back to source files.
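
NDJSON (newline-delimited JSON) simply places one complete JSON document on each line of a text file. A minimal writer and reader, assuming the records are already Python dictionaries, might look like this:

    import json
    from pathlib import Path
    from typing import Iterable, Iterator

    def write_ndjson(records: Iterable[dict], target: Path) -> None:
        """Write one JSON document per line of the target file."""
        with target.open("w") as ndjson_file:
            for record in records:
                print(json.dumps(record), file=ndjson_file)

    def read_ndjson(source: Path) -> Iterator[dict]:
        """Yield one dictionary for each line of an NDJSON file."""
        with source.open() as ndjson_file:
            for line in ndjson_file:
                yield json.loads(line)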

After acquiring new data, it’s prudent to do a manual inspection. This is often done a few times at the start of application development. After that, inspection is only done to diagnose problems with the source data. The next few chapters will cover projects to inspect data.

2.3 Inspection

Data inspection needs to be done when starting development. It’s essential to survey new data to be sure it really is what’s needed to solve the user’s problems. A common frustration is incomplete or inconsistent data, and these problems need to be exposed as soon as possible to avoid wasting time and effort creating software to process data that doesn’t really exist.

Additionally, data is inspected manually to uncover problems. It’s important to recognize that data sources are in a constant state of flux. As applications evolve and mature, the data provided for analysis will change. In many cases, data analytics applications discover other enterprise changes after the fact, via invalid data. Good data inspection tools make it easier to understand this evolution.

Inspection is an inherently manual process. Therefore, we’re going to use JupyterLab to create notebooks to look at the data and determine some basic features.

In rare cases where privacy is important, developers may not be allowed to do data inspection. More privileged people — with permission to see payment card or healthcare details — may be part of data inspection. This means an inspection notebook may be something created by a developer for use by stakeholders.

In many cases, a data inspection notebook can be the start of a fully-automated data cleansing application. A developer can extract notebook cells as functions, building a module that’s usable from both notebook and application. The cell results can be used to create unit test cases.
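
For example, a cell that surveys the distinct values in one column can be refactored into a function that both the notebook and an automated application import, with the values observed in the notebook captured as a pytest-style test. The names below are illustrative, not the book’s.

    from collections import Counter

    def column_domain(rows: list[dict[str, str]], column: str) -> Counter:
        """Count the distinct values appearing in one column of the raw data."""
        return Counter(row[column] for row in rows)

    def test_column_domain() -> None:
        # A unit test built from values observed during inspection.
        rows = [{"state": "WA"}, {"state": "OR"}, {"state": "WA"}]
        assert column_domain(rows, "state") == Counter({"WA": 2, "OR": 1})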

This stage in the pipeline requires a number of inspection projects:

  • Project 2.1: ”Inspect Data”. This will build a core data inspection notebook with enough features to confirm that some of the acquired data is likely to be valid.

  • Project 2.2: ”Inspect Data: Cardinal Domains”. This project will add analysis features for measurements, dates, and times. These are cardinal domains that reflect measures and counts.

  • Project 2.3: ”Inspect Data: Nominal and Ordinal Domains”. This project will add analysis features for text or coded numeric data. This includes nominal data and ordinal numeric domains. It’s important to recognize that US ZIP Codes are digit strings, not numbers.

  • Project 2.4: ”Inspect Data: Reference Data”. This notebook will include features to find reference domains when working with data that has been normalized and decomposed into subsets with references via coded ”key” values.

  • Project 2.5: ”Define a Reusable Schema”. As a final step, it can help to define a formal schema, and related metadata, using the JSON Schema standard; a small example follows this list.

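As a preview of Project 2.5, a schema can be written as a Python dictionary that follows the JSON Schema standard and then used to validate cleaned records. The field names below are invented, and the third-party jsonschema package is assumed.

    from jsonschema import validate  # third-party package: pip install jsonschema

    # A hypothetical schema for one cleaned record.
    SAMPLE_SCHEMA = {
        "title": "Sample",
        "type": "object",
        "properties": {
            "date": {"type": "string", "format": "date"},
            "reading": {"type": "number"},
            "zip_code": {"type": "string", "pattern": r"^\d{5}$"},
        },
        "required": ["date", "reading"],
    }

    # Raises jsonschema.ValidationError if the document does not conform.
    validate({"date": "2023-09-15", "reading": 11.7, "zip_code": "04330"}, SAMPLE_SCHEMA)
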
While some of these projects seem to be one-time efforts, they often need to be written with some care. In many cases, a notebook will need to be reused when there’s a problem. It helps to provide adequate explanations and test cases that refresh someone’s memory about the details of the data and its known problem areas. Additionally, notebooks may serve as examples for test cases and for the design of Python classes or functions to automate cleaning, validating, or standardizing data.

After a detailed inspection, we can then build applications to automate cleaning, validating, and normalizing the values. The next batch of projects will address this stage of the pipeline.

2.4 Clean, validate, standardize, and persist

Once the data is understood in a general sense, we can write applications to clean up any serialization problems and perform more formal tests to be sure the data really is valid. One frustratingly common problem is receiving duplicate files of data; this can happen when scheduled processing was disrupted somewhere else in the enterprise, and a previous period’s files were reused for analysis.
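
One inexpensive defense against duplicated source files is to checksum each file before processing it; any two files with the same digest have identical contents. This is a sketch, not the book’s design.

    import hashlib
    from pathlib import Path

    def file_digest(source: Path) -> str:
        """Return a SHA-256 hex digest identifying the file's exact contents."""
        return hashlib.sha256(source.read_bytes()).hexdigest()

    def find_duplicates(sources: list[Path]) -> dict[str, list[Path]]:
        """Group files by digest; any group with more than one path is a repeat."""
        groups: dict[str, list[Path]] = {}
        for source in sources:
            groups.setdefault(file_digest(source), []).append(source)
        return {digest: paths for digest, paths in groups.items() if len(paths) > 1}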

Validation testing is sometimes part of cleaning. If the data contains any unexpected invalid values, it may be necessary to reject it. In other cases, known problems can be resolved as part of analytics by replacing invalid data with valid data. An example of this is US postal codes, which are sometimes translated into numbers, losing their leading zeros.
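
For example, a ZIP code that arrived as the integer 4330 can be restored to the five-character string it should have been. This assumes simple five-digit US ZIP codes; ZIP+4 values would need different handling.

    def standardize_zip(raw: str | int) -> str:
        """Restore a US ZIP code to a five-digit string, repairing lost leading zeros."""
        return f"{int(raw):05d}"

    assert standardize_zip(4330) == "04330"     # leading zero lost by a numeric conversion
    assert standardize_zip("98101") == "98101"  # already correct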

These stages in the data analysis pipeline are described by a number of projects:

  • Project 3.1: ”Clean Data”. This builds the data cleaning base application. The design details can come from the data inspection notebooks.

  • Project 3.2: ”Clean and Validate”. These features will validate and convert numeric fields.

  • Project 3.3: ”Clean and Validate Text and Codes”. The validation of text fields and numeric coded fields requires somewhat more complex designs.

  • Project 3.4: ”Clean and Validate References”. When data arrives from separate sources, it is essential to validate references among those sources.

  • Project 3.5: ”Standardize Data”. Some data sources require standardizing to create common codes and ranges.

  • Project 3.6: ”Acquire and Clean Pipeline”. It’s often helpful to integrate the acquisition, cleaning, validating, and standardizing into a single pipeline.

  • Project 3.7: ”Acquire, Clean, and Save”. One key architectural feature of this pipeline is saving intermediate files in a common format, distinct from the data sources.

  • Project 3.8: ”Data Provider Web Service”. In many enterprises, an internal web service and API are expected as sources for analytic data. This project will wrap the data acquisition pipeline into a RESTful web service.

In these projects, we’ll transform the text values from the acquisition applications into more useful Python objects like integers, floating-point values, decimal values, and date-time values.
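
A cleaning function for one record might perform conversions like the following; the field names are hypothetical.

    import datetime
    from decimal import Decimal

    def clean_sample(raw: dict[str, str]) -> dict:
        """Convert one acquired record's text fields into useful Python objects."""
        return {
            "date": datetime.date.fromisoformat(raw["date"]),
            "reading": float(raw["reading"]),
            "amount": Decimal(raw["amount"]),
            "count": int(raw["count"]),
        }

    row = clean_sample({"date": "2023-09-15", "reading": "11.7", "amount": "19.99", "count": "3"})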

Once the data is cleaned and validated, the exploration can continue. The first step is to summarize the data, again using a Jupyter notebook to create readable, publishable reports and presentations. The next chapters will explore the work of summarizing data.

2.5 Summarize and analyze

Summarizing data in a useful form is more art than technology. It can be difficult to know how best to present information in a way that helps people make more valuable or helpful decisions.

There are a few projects to capture the essence of summaries and initial analysis:

  • Project 4.1: ”A Data Dashboard”. This notebook will show a number of visual analysis techniques.

  • Project 4.2: ”A Published Report”. A notebook can be saved as a PDF file, creating a report that’s easily shared.

The initial work of summarizing and creating shared, published reports sets the stage for more formal, automated reporting. The next set of projects builds modules that provide deeper and more sophisticated statistical models.

2.6 Statistical modeling

The point of data analysis is to digest raw data and present information to people to support their decision-making. The previous stages of the pipeline have prepared two important things:

  • Raw data has been cleaned and standardized to provide data that are relatively easy to analyze.

  • The process of inspecting and summarizing the data has helped analysts, developers, and, ultimately, users understand what the information means.

The confluence of data and deeper meaning creates significant value for an enterprise. The analysis process can continue as more formalized statistical modeling. This, in turn, may lead to artificial intelligence (AI) and machine learning (ML) applications.

The processing pipeline includes these projects to gather summaries of individual variables as well as combinations of variables:

  • Project 5.1: ”Statistical Model: Core Processing”. This project builds the base application for applying statistical models and saving parameters about the data. This will focus on summaries like mean, median, mode, and variance.

  • Project 5.2: ”Statistical Model: Relationships”. It’s common to want to know the relationships among variables. This includes measures like correlation among variables.

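These summaries are available in the standard library’s statistics module. The sketch below uses made-up sample data; statistics.correlation requires Python 3.10 or later.

    import statistics

    x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
    y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

    # Project 5.1-style univariate summaries.
    print(statistics.mean(x), statistics.median(x), statistics.variance(x))

    # Project 5.2-style relationship between two variables.
    print(statistics.correlation(x, y))
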
This sequence of stages produces high-quality data and provides ways to diagnose and debug problems with data sources. The sequence of projects will illustrate how automated solutions and interactive inspection can be used to create useful, timely, insightful reporting and analysis.

2.7 Data contracts

We will touch on data contracts at various stages in this pipeline. This application’s data acquisition, for example, may have a formalized contract with a data provider. It’s also possible that an informal data contract, in the form of a schema definition or an API, is all that’s available.

In Chapter 8, Project 2.5: Schema and Metadata, we’ll consider some schema publication concerns. In Chapter 11, Project 3.7: Interim Data Persistence, we’ll consider the schema provided to downstream applications. These two topics are related to a formal data contract, but this book won’t delve deeply into data contracts, how they’re created, or how they might be used.

2.8 Summary

This data analysis pipeline moves data from sources through a series of stages to create clean, valid, standardized data. The general flow supports a variety of needs and permits a great deal of customization and extension.

For developers with an interest in data science or machine learning, these projects cover what is sometimes called the ”data wrangling” part of data science or machine learning. It can be a significant complication as the data is understood and the differences among data sources are resolved and explored. These are the (sometimes difficult) preparatory steps prior to building a model that can be used for AI decision-making.

For readers with an interest in the web, this kind of data processing and extraction is part of presenting data via a web application API or website. Project 3.8 creates a web service, and will be of particular interest. Because the web service requires clean data, the preceding projects are helpful for creating data that can be published.

For folks with an automation or IoT interest, Part 2 explains how to use Jupyter Notebooks to gather and inspect data. This is a common need, and the various steps to clean, validate, and standardize data become all the more important when dealing with real-world devices subject to the vagaries of temperature and voltage.

We’ve looked at the following multi-stage approach to doing data analysis:

  • Data Acquisition

  • Inspection of Data

  • Clean, Validate, Standardize, and Persist

  • Summarize and Analyze

  • Create a Statistical Model

This pipeline follows the Extract-Transform-Load (ETL) concept. The terms have been changed because the legacy words are sometimes misleading. Our acquisition stage overlaps with what is understood as the ”Extract” operation. For some developers, Extract is limited to database extracts; we’d like to go beyond that to include other data source transformations. Our cleaning, validating, and standardizing stages are usually combined into the ”Transform” operation. Saving the clean data is generally the objective of ”Load”; we’re not emphasizing a database load, but instead, we’ll use files.

Throughout the book, we’ll describe each project’s objective and provide the foundation of a sound technical approach. The details of the implementation are up to you. We’ll enumerate the deliverables; this may repeat some of the information from Chapter 1, Project Zero: A Template for Other Projects. The book provides a great deal of information on acceptance test cases and unit test cases — the definition of done. By covering the approach, we’ve left room for you to design and implement the needed application software.

In the next chapter, we’ll build the first data acquisition project. This will work with CSV-format files. Later projects will work with database extracts and web services.


Key benefits

  • Master Python and related technologies by working on 12 hands-on projects
  • Accelerate your career by building a personal project portfolio
  • Explore data acquisition, preparation, and analysis applications
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

In today's competitive job market, a project portfolio often outshines a traditional resume. Python Real-World Projects empowers you to get to grips with crucial Python concepts while building complete modules and applications. With two dozen meticulously designed projects to explore, this book will help you showcase your Python mastery and refine your skills. Tailored for beginners with a foundational understanding of class definitions, module creation, and Python's inherent data structures, this book is your gateway to programming excellence. You’ll learn how to harness the potential of the standard library and key external projects like JupyterLab, Pydantic, pytest, and requests. You’ll also gain experience with enterprise-oriented methodologies, including unit and acceptance testing, and an agile development approach. Additionally, you’ll dive into the software development lifecycle, starting with a minimum viable product and seamlessly expanding it to add innovative features. By the end of this book, you’ll be armed with a myriad of practical Python projects and all set to accelerate your career as a Python programmer.

What you will learn

  • Explore core deliverables for an application, including documentation and test cases
  • Discover approaches to data acquisition such as file processing, RESTful APIs, and SQL queries
  • Create a data inspection notebook to establish properties of source data
  • Write applications to validate, clean, convert, and normalize source data
  • Use foundational graphical analysis techniques to visualize data
  • Build basic univariate and multivariate statistical analysis tools
  • Create reports from raw data using JupyterLab publication tools


Table of Contents

Preface
Chapter 1: Project Zero: A Template for Other Projects
Chapter 2: Overview of the Projects
Chapter 3: Project 1.1: Data Acquisition Base Application
Chapter 4: Data Acquisition Features: Web APIs and Scraping
Chapter 5: Data Acquisition Features: SQL Database
Chapter 6: Project 2.1: Data Inspection Notebook
Chapter 7: Data Inspection Features
Chapter 8: Project 2.5: Schema and Metadata
Chapter 9: Project 3.1: Data Cleaning Base Application
Chapter 10: Data Cleaning Features
Chapter 11: Project 3.7: Interim Data Persistence
Chapter 12: Project 3.8: Integrated Data Acquisition Web Service
Chapter 13: Project 4.1: Visual Analysis Techniques
Chapter 14: Project 4.2: Creating Reports
Chapter 15: Project 5.1: Modeling Base Application
Chapter 16: Project 5.2: Simple Multivariate Statistics
Chapter 17: Next Steps
Other Books You Might Enjoy
Index

