
You're reading from  Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789535365
Edition: 1st Edition
Authors (2):
Philipp Kats

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having a bachelor's degree in architectural design and having followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

David Katz

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.

Data Exploration and Visualization

In the previous chapter, we went deep into data cleaning and preparation. But what is inside this dataset? What story does it tell about the war, and how can we make those stories clear? Knowing how to dissect data, understand it, and extract insights is one of the crucial skills for data analysis and is a mandatory step before building anything driven by this data. In this chapter, we'll learn how to explore a dataset, compute aggregate statistics, and understand outliers and general trends through data visualization. The skills we'll learn are essential to any data analysis and are used throughout the industry and academia.

In particular, the following topics will be covered in this chapter:

  • Descriptive statistics
  • Aggregation and resampling
  • The ecosystem of modern visualizations using matplotlib with pandas, altair, and datashader...

Technical requirements

In this chapter, we'll make use of three additional visualization libraries: geopandas, altair, and datashader. All of them can be installed via Anaconda or pip and are included in our environment.yaml file. As always, if you followed the instructions in Chapter 1, Preparing the Workspace, you're all set. If not, you can install them using conda.

Exploring the dataset

For this chapter, we'll use the dataset on WWII battles we collected earlier in Chapter 7, Scraping Data from the Web with Beautiful Soup 4. As you may remember, the dataset includes dates, results, sides, leaders, and the number of troops and casualties of those battles. But what questions can we answer with this information? Let's start with something simple: which battles took the most casualties on both sides? Where were most of the tanks destroyed? How was the number of casualties distributed over time and geography?
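Questions like these map onto a handful of pandas operations. The following is a minimal sketch, not the chapter's actual code: the column names (`name`, `start`, `allies_killed`, `axis_killed`) are hypothetical, and the rows are invented placeholders rather than values from the real dataset.

```python
import pandas as pd

# Placeholder rows: names, dates, and counts are invented for illustration only.
battles = pd.DataFrame(
    {
        "name": ["Battle A", "Battle B", "Battle C", "Battle D"],
        "start": pd.to_datetime(
            ["1940-05-10", "1941-06-22", "1941-12-07", "1943-07-05"]
        ),
        "allies_killed": [100, 400, 50, 300],
        "axis_killed": [150, 350, 20, 500],
    }
)

# Casualties on both sides combined.
battles["total_killed"] = battles["allies_killed"] + battles["axis_killed"]

# Which battles took the most casualties? Keep the two largest rows.
top = battles.nlargest(2, "total_killed")[["name", "total_killed"]]

# How are casualties distributed over time? Resample into calendar years;
# years without any battle still appear, with a sum of 0.
per_year = battles.set_index("start")["total_killed"].resample("YS").sum()

# Count, mean, quartiles, and extremes for the numeric columns.
summary = battles[["allies_killed", "axis_killed", "total_killed"]].describe()

print(top)
print(per_year)
print(summary)
```

Note that `resample` produces a row for every period in the date range, including empty ones, whereas a plain `groupby` on the year would list only the years that actually occur in the data.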

In the previous chapter, we cleaned and processed most of the data; however, given the sensitivity of the subject, we went ahead and manually cross-checked the main values row by row and, indeed, had to correct a few of them. This work cannot be completely automated. In this and further chapters, we'll work with the corrected...

Declarative visualization with vega and altair

Until now, we have used the matplotlib library via the built-in pandas interface. matplotlib is powerful and essential to Python's data visualization ecosystem. It is not, however, the only visualization library we can use. In fact, there is a plethora of visualization tools that differ in format, focus, or even philosophy. In this section, we'll introduce you to a different tool, and a different concept of data visualization: altair, a Python library based around the Vega engine. What makes it so different? A couple of things, in fact.

First of all, its core philosophy is based on the declarative approach, which boils down to the following principle: write each chart in code as a declaration (basically, a recipe). This declaration would define what to...
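To make the idea of a chart as a declaration concrete, here is a small sketch that writes one by hand as a plain Python dictionary in the Vega-Lite format (Vega-Lite is the grammar built on top of Vega). It deliberately does not call altair itself; the field names `front` and `battles` and the records are placeholders invented for illustration.

```python
import json

# Placeholder records: invented values, not taken from the battles dataset.
records = [
    {"front": "Front A", "battles": 3},
    {"front": "Front B", "battles": 5},
]

# The declaration says WHAT to draw (data, mark type, field-to-channel
# mapping), not HOW to draw it; a renderer decides axes, scales, and layout.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": records},
    "mark": "bar",
    "encoding": {
        "x": {"field": "front", "type": "nominal"},
        "y": {"field": "battles", "type": "quantitative"},
    },
}

# The whole chart is just serializable data.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

In altair, a chained expression such as `alt.Chart(df).mark_bar().encode(x=..., y=...)` builds a specification of this general shape, which can be inspected with the chart's `to_dict()` method.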

Big data visualization with datashader

Big data also needs to be visualized! Big data visualizations are somewhat rare, in part because they are hard to produce, but also because they are hard to interpret and to use for communicating insights. A big data visualization is usually either a network, a map, or a mapping (a similarity-based, computed two- or three-dimensional distribution). They are usually astonishing and complex! In fact, a few early inventors of big data visualizations, such as Eric Fischer, became famous for their work with big data.

As we mentioned, big data visualizations are generally hard due to the sheer size of the dataset. Standard tools won't work: matplotlib, even with a raster engine, will take hours to plot millions of points, and Altair won't do it at all. For a long time, there wasn't an easy solution to this problem. This changed with the announcement...
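One way to see why point-by-point plotting does not scale, and what the alternative looks like, is to aggregate the points into a fixed pixel grid first and only then draw the grid. The sketch below illustrates that general rasterization idea with NumPy's `histogram2d` as a stand-in; it is not datashader's API, and the random points and the 300 by 200 canvas size are assumptions chosen for the example.

```python
import numpy as np

# One million synthetic points (random placeholders, not real data).
rng = np.random.default_rng(0)
n_points = 1_000_000
x = rng.normal(size=n_points)
y = rng.normal(size=n_points)

# Aggregate: count how many points fall into each cell of a fixed grid.
# The cost of drawing now depends on the grid size, not on n_points.
width, height = 300, 200
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(width, height))

# Shade: compress the dynamic range with a log transform, scale to [0, 1],
# and transpose so that rows correspond to y and columns to x.
image = np.log1p(counts).T
image = image / image.max()

print(image.shape)
```

The resulting array can be handed to any image display function, for example matplotlib's `imshow`, which draws 60,000 cells regardless of how many points were aggregated.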

Summary

In this chapter, we discussed how to derive insights from the raw data (computing descriptive statistics and aggregates and drawing basic plots of relationships) and how to use special tools for big data visualization. As a result, we've learned how to start working with a dataset, investigate its overall properties, and drill down to specific details. We also learned how to visualize data, a vital skill for both personal data exploration and sharing insights with a broad audience. These skills are fundamental for data analysis: knowing what to ask and how to answer your question with the data, noticing patterns and anomalies in the data, and being able to interpret them and speculate on their origins.

In our next chapter, we'll go a step further in that direction, leveraging statistical and machine learning models to guide our interpretation.

...

Questions

  1. How can we understand some general properties of a dataset in pandas?
  2. What does the resample function do in pandas? How is it different from aggregation?
  3. How does visualization work in pandas?
  4. What are the benefits of declarative data visualization (for example, with Altair)?
  5. In which cases can big data visualization be useful?

Further reading

