You're reading from Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789535365
Edition: 1st
Authors (2):
Philipp Kats

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having a bachelor's degree in architectural design and having followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

David Katz

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.


Python for Data Applications

We have worked with data already in some of the previous chapters in this book, including data collection and some statistical computations. The samples in all of those cases were quite small, though. To run data analysis and train machine learning models smoothly on datasets of millions of records, researchers built a distinctive ecosystem of Python packages.

In this introductory chapter, we won't code much. Instead, we'll overview the foundational packages and tools of the data science ecosystem, which we will be using throughout this part of the book, including the following:

  • Introducing Python for data science
  • Exploring NumPy
  • Understanding pandas
  • Trying SciPy and scikit-learn
  • Understanding Jupyter

Technical requirements

Introducing Python for data science

The fundamental task of data analysis is to generalize trends and shared properties over a dataset of multiple, probably many, data points. Imagine how that would look in a standard Python distribution: you'd have a list of, say, Person objects, each with its own values. To run some aggregate statistics, we would have to loop over each object, pull its properties, and calculate the statistics. If we need a few different measurements, the code quickly grows large and unmaintainable.

Instead, many computations in data analysis can be vectorized. Here, vectorization is a fancy term meaning that the same exact loops are run in C rather than Python, which speeds things up by a few orders of magnitude. At the same time, it means that we won't need to write those loops explicitly, making the code cleaner and more readable...

Exploring NumPy

NumPy is a library built around the notion of numeric arrays: multidimensional, index-based (like a list) collections of data which (unlike a list) guarantee that the type of the stored values stays consistent and predefined, say, a 2-dimensional array of integers or a 1-dimensional array of floats. It is implemented in C and allows us to speed up computation by a few orders of magnitude compared to base Python. The gap in performance is staggering even on relatively small datasets and only grows with dataset size and algorithmic complexity. NumPy is capable of handling a few million rows of data and is primarily bounded by available memory, not the CPU.
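To make that type guarantee concrete, here is a minimal sketch (the values are illustrative, not from the book):

```python
import numpy as np

# A NumPy array fixes one dtype for every element, unlike a Python list.
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.dtype)     # int64

# Assigning a float into an integer array silently truncates it to the
# array's dtype; the type never changes.
arr[0] = 7.9
print(arr[0])        # 7

# A 2-dimensional array of floats, as described above.
matrix = np.zeros((2, 3), dtype=np.float64)
print(matrix.shape)  # (2, 3)
```

This strictness is exactly what lets NumPy store values in a compact C buffer and loop over them at C speed.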

Let's illustrate this staggering difference in performance with an example. Imagine that we need to sum three lists of values, pairwise. In pure Python, the code will be similar...
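The comparison the text sets up might look like the following minimal sketch, with a pure Python loop next to its vectorized NumPy equivalent (the lists here are toy data, not the book's actual snippet):

```python
import numpy as np

# Three lists of values to sum pairwise.
a = [1, 2, 3]
b = [10, 20, 30]
c = [100, 200, 300]

# Pure Python: an explicit loop (here via a comprehension) over the elements.
python_sum = [x + y + z for x, y, z in zip(a, b, c)]
print(python_sum)  # [111, 222, 333]

# NumPy: the same operation, vectorized; the loop runs in C.
numpy_sum = np.array(a) + np.array(b) + np.array(c)
print(numpy_sum)   # [111 222 333]
```

On three-element lists the difference is invisible, but time both versions on arrays of a few million elements (for example, with `%timeit` in Jupyter) and the vectorized form wins by orders of magnitude.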

Beginning with pandas

Of course, not all data (and not all data analysis) is numeric. To address that gap, and inspired by the R language's dataframe objects, another package, pandas, was created by Wes McKinney in 2008. While it heavily relies on NumPy for numeric computations, its core interface objects are dataframes (2-dimensional, multitype tables) and series (1-dimensional arrays). Dataframes, in comparison to NumPy matrices, don't require all data to be of the same type. On the contrary, they allow you to mix numeric values with Booleans, strings, datetimes, and any other arbitrary Python objects. pandas does require (and enforce), however, that the data type be uniform vertically, within each column. Compared to NumPy, it also allows dataframe columns and rows to have arbitrary numeric or string names, or even hierarchical, multilevel...
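A small sketch of that column-wise typing (the column names and values are made up for illustration):

```python
import pandas as pd

# Each column has its own uniform type: string, integer, Boolean, datetime.
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [34, 29],
    "member": [True, False],
    "joined": pd.to_datetime(["2019-01-15", "2019-03-02"]),
})

print(df.dtypes)           # one dtype per column
print(df["age"].mean())    # numeric columns still get fast NumPy math: 31.5
```

Row-wise, the types mix freely; column-wise, pandas keeps each column as a typed NumPy-backed array, which is how it preserves NumPy's speed for the numeric parts.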

Trying SciPy and scikit-learn

The SciPy package essentially kicked off the entire era of scientific Python. Created in 2001 by researchers Travis Oliphant, Pearu Peterson, and Eric Jones, it was formed as a collection of basic and universal scientific techniques. Over time, the package grew and now offers generic tooling and popular techniques for scientific analysis. Its submodules cover linear algebra, integration, optimization, interpolation, statistics, and many more.
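As a hedged sketch of two of those submodules in action (the function and the data below are arbitrary placeholders, not examples from the book):

```python
from scipy import optimize, stats

# scipy.optimize: find the minimum of a simple quadratic, (x - 3)^2.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)  # close to 3.0

# scipy.stats: Pearson correlation between two perfectly linear sequences.
r, p = stats.pearsonr([1, 2, 3, 4], [2, 4, 6, 8])
print(r)         # close to 1.0
```

Every submodule follows this pattern: import the relevant tool, hand it plain numbers or NumPy arrays, and get a result object or tuple back.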

With the rise of machine learning, the corresponding submodule of SciPy grew more and more complex. At some point, it became so big that the decision was made to reintroduce it as a separate, independent package: scikit-learn. As a mark of its origins, the package kept the name, scikit (short for SciPy toolkit) plus learn. Due to its simple and unified interface and a large variety of models, scikit-learn quickly...
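That unified interface means every estimator exposes the same fit and predict methods. A minimal sketch on tiny synthetic data (y = 2x; the model and numbers are illustrative):

```python
from sklearn.linear_model import LinearRegression

# Tiny synthetic training data: y = 2 * x.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# The same two calls work for nearly any scikit-learn model.
model = LinearRegression()
model.fit(X, y)

prediction = model.predict([[5]])
print(prediction)  # close to [10.]
```

Swapping in a different model, say a decision tree or a support vector machine, changes only the import and constructor line; the fit/predict workflow stays identical, which is a large part of why the library caught on.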

Understanding Jupyter

Finally, there is Jupyter. We're familiar with this tool already, as it proved invaluable for teaching and learning Python on simple examples, but it especially shines for data science: given its rich media and visualization capabilities, Jupyter is an excellent environment for data analysis. It allows quick iteration and experimentation, and supports Markdown documentation and rich media (images, plots, interactive widgets, video, and so on). Of course, Jupyter is 100% open source and free.

Jupyter is also language agnostic. At the moment, there are many language kernels for Jupyter, including Ruby, C, Rust, R, and more. It also supports third-party plugins, for example, leaflet and Mapbox viewers for GeoJSON files or the Vega data visualization viewer. Another advantage is that Jupyter Notebooks are properly rendered on GitHub...

Summary

In this chapter, we covered the foundation of Python's data science stack: the NumPy, pandas, SciPy, scikit-learn, and Jupyter libraries. By doing so, we gained an understanding of this ecosystem, why and when we need each of these packages, and how they relate to each other. Understanding their relationships helps to navigate and search for a specific functionality or tool to use.

We also touched upon the reasons why NumPy-based computations are so fast, and why this leads to a somewhat different philosophy of data-driven development. We further showcased how pandas complements NumPy arrays by supporting plenty of data formats and types, and SciPy and scikit-learn build upon those data structures, allowing us to quickly train and use machine learning models. Finally, we discussed why Jupyter plays such an important role in this process and what...

Questions

  1. Why should we use a special stack of packages for data analysis?
  2. Why are NumPy computations so fast compared to normal Python?
  3. What is the use case and benefit of using pandas over NumPy?
  4. What does sklearn stand for?