You're reading from Learn Python by Building Data Science Applications

Product type: Book

Published in: Aug 2019

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781789535365

Edition: 1st Edition

Languages: Python

Tools: Pygame

Concepts: Application Development

Authors (2):

Philipp Kats

David Katz

Data Exploration and Visualization

In the previous chapter, we went deep into data cleaning and preparation. But what is inside this dataset? What story does it tell about the war, and how can we make that story clear? Knowing how to dissect data, understand it, and extract insights is a crucial skill for data analysis and a mandatory step before building anything driven by that data. In this chapter, we'll learn how to explore a dataset, compute aggregate statistics, and understand outliers and general trends through data visualization. These skills are essential to any data analysis and are used throughout industry and academia.

In particular, the following topics will be covered in this chapter:

Descriptive statistics
Aggregation and resampling
The ecosystem of modern visualizations using matplotlib with pandas, altair, and datashader...

Technical requirements

In this chapter, we'll make use of three additional visualization libraries: geopandas, altair, and datashader. All of them can be installed via Anaconda or pip and are included in our environment.yaml file. As always, if you followed the instructions in Chapter 1, Preparing the Workspace, you're all set. If not, you can install them using conda.

Exploring the dataset

For this chapter, we'll use the dataset on WWII battles we collected earlier in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. As you may remember, the dataset includes dates, results, sides, leaders, and the number of troops and casualties of those battles. But what questions can we answer with this information? Let's start with something simple: which battles took the most casualties on both sides? Where were most of the tanks destroyed? How was the number of casualties distributed over time and geography?

In the previous chapter, we cleaned and processed most of the data. However, given the sensitivity of the subject, we also cross-checked the main values manually, row by row, and did indeed have to correct a few of them. This work cannot be completely automated. In this and further chapters, we'll work with the corrected...
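The questions raised in this section map onto a handful of pandas calls. The following is only a minimal sketch on a toy table: the column names (`name`, `start`, `allies_killed`, `axis_killed`) and all values are hypothetical stand-ins, not the book's actual dataset or its schema.

```python
import pandas as pd

# Toy stand-in for the battles dataset; column names and values are
# hypothetical, not the book's actual data.
battles = pd.DataFrame(
    {
        "name": ["Battle A", "Battle B", "Battle C", "Battle D"],
        "start": pd.to_datetime(
            ["1941-06-22", "1941-12-05", "1942-08-23", "1943-07-05"]
        ),
        "allies_killed": [1000, 2500, 4000, 1500],
        "axis_killed": [800, 3000, 3500, 2000],
    }
)

# Total casualties on both sides for each battle.
battles["total_killed"] = battles["allies_killed"] + battles["axis_killed"]

# Descriptive statistics: count, mean, std, min, quartiles, and max.
summary = battles[["allies_killed", "axis_killed", "total_killed"]].describe()

# The two battles with the most casualties on both sides combined.
deadliest = battles.nlargest(2, "total_killed")[["name", "total_killed"]]

# Casualties over time: resample by calendar year of the start date.
per_year = battles.set_index("start")["total_killed"].resample("YS").sum()

print(summary)
print(deadliest)
print(per_year)
```

Here `describe` gives the overall properties of the numeric columns, `nlargest` answers the "which battles" question, and `resample` regroups the rows along the time axis, which requires a datetime index.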

Declarative visualization with vega and altair

Until now, we have used the matplotlib library via the built-in pandas interface. matplotlib is powerful and essential to Python's data visualization ecosystem. It is not, however, the only visualization library we can use. In fact, there is a plethora of visualization tools that differ in their format, focus, or even philosophy. In this section, we'll introduce a different tool, and a different concept of data visualization: altair, a Python library built around the Vega engine. What makes it so different? A couple of things, in fact.

First of all, its core philosophy is the declarative approach: each chart is written in code as a declaration, basically a recipe. This declaration would define what to...
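To make the "chart as a recipe" idea concrete, here is a rough sketch of a declarative specification written as a plain Python dictionary in the Vega-Lite JSON format. The records and field names are hypothetical, and the snippet deliberately uses only the standard library, so it describes a chart without rendering one.

```python
import json

# Hypothetical records standing in for the battles data.
records = [
    {"name": "Battle A", "year": 1941, "total_killed": 1800},
    {"name": "Battle B", "year": 1941, "total_killed": 5500},
    {"name": "Battle C", "year": 1942, "total_killed": 7500},
]

# A declarative chart states WHAT to draw (a mark plus encodings that map
# data fields to visual channels), not HOW to draw it step by step.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": records},
    "mark": "bar",
    "encoding": {
        "x": {"field": "name", "type": "nominal"},
        "y": {"field": "total_killed", "type": "quantitative"},
        "color": {"field": "year", "type": "ordinal"},
    },
}

# The whole chart is just data, so it can be serialized, stored, or shared.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

With altair installed, a roughly equivalent declaration would be `alt.Chart(df).mark_bar().encode(x="name:N", y="total_killed:Q", color="year:O")`, where `df` is a DataFrame holding the same records. In both forms the rendering is left to the engine.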

Big data visualization with datashader

Big data also needs to be visualized! Big data visualizations are somewhat rare, in part because they are hard to produce, but also because they are hard to interpret and their insights are hard to communicate. A big data visualization is usually a network, a map, or a mapping (a similarity-based, computed two- or three-dimensional distribution). They are usually astonishing and complex! In fact, a few early creators of big data visualizations, such as Eric Fischer, became famous for this work.

As we mentioned, big data visualizations are generally hard because of the sheer size of the dataset. Standard tools won't work: matplotlib, even with a raster engine, will take hours to plot millions of points, and Altair won't do it at all. For a long time, there wasn't an easy solution to this problem. This changed with the announcement...
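The excerpt stops before any datashader code, so the sketch below is not datashader's API. It only illustrates, with NumPy, one general idea that makes millions of points tractable: aggregating them into a fixed grid of bins first, so that drawing cost depends on the grid size rather than on the number of points. The synthetic data and the grid size are arbitrary assumptions.

```python
import numpy as np

# Synthetic data: one million random points (an arbitrary stand-in).
rng = np.random.default_rng(seed=0)
n_points = 1_000_000
x = rng.normal(size=n_points)
y = rng.normal(size=n_points)

# Aggregate the points into a fixed 300 x 200 grid of counts. However many
# points there are, the result has only 60,000 cells to draw.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(300, 200))

print(counts.shape)       # grid dimensions, independent of n_points
print(int(counts.sum()))  # every point falls into exactly one cell
```

The resulting `counts` array can be shown as an image with any plotting library, for example with matplotlib's `imshow`.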

Summary

In this chapter, we discussed how to derive insights from raw data: how to compute descriptive statistics and aggregates, draw basic plots of relationships, and use special tools for big data visualization. As a result, we've learned how to start working with a dataset, investigate its overall properties, and drill down to specific details. We also learned how to visualize data, a vital skill both for personal data exploration and for sharing insights with a broad audience. These skills are fundamental for data analysis: knowing what to ask and how to answer your question with the data, noticing patterns and anomalies, and being able to interpret them and speculate on their origins.

In our next chapter, we'll go a step further in that direction, leveraging statistical and machine learning models to guide our interpretation.

...

Questions

How can we understand some general properties of a dataset in pandas?
What does the resample function do in pandas? How is it different from aggregation?
How does visualization work in pandas?
What are the benefits of declarative data visualization (for example, with Altair)?
In which cases can big data visualization be useful?

Further reading

Data Visualization with Python, by Mario Döbler, Tim Großmann, et al., published by Packt (https://www.packtpub.com/in/big-data-and-business-intelligence/data-visualisation-python)
Learning Python Data Visualization, by Benjamin Walter Keller, published by Packt (https://www.packtpub.com/big-data-and-business-intelligence/learning-python-data-visualization-video-0)

Authors (2)

Philipp Kats

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having a bachelor's degree in architectural design and having followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

David Katz

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.