You're reading from Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789535365
Edition: 1st
Authors (2):
Philipp Kats

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having a bachelor's degree in architectural design and having followed the (at first) rocky path of a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

David Katz

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.


Python for Data Applications

We have worked with data already in some of the previous chapters in this book, including data collection and some statistical computations. The samples in all of those cases were quite small, though. To run data analysis and train machine learning models smoothly on datasets of millions of records, researchers built a distinctive ecosystem of Python packages.

In this introductory chapter, we won't code much. Instead, we'll overview the foundational packages and tools of the data science ecosystem, which we will be using throughout this part of the book, including the following:

  • Introducing Python for data science
  • Exploring NumPy
  • Understanding pandas
  • Trying SciPy and scikit-learn
  • Understanding Jupyter

Technical requirements

Introducing Python for data science

The fundamental task of data analysis is to generalize trends and shared properties over a dataset of multiple, probably many, data points. Imagine how that would look in a standard Python distribution: you'd have a list of, say, Person objects, each with its own values. To run some aggregate statistics, we would have to loop over each object, pull its properties, and calculate the statistics. If we need a few different measurements, the code quickly grows large and unmaintainable.

Instead, many computations in data analysis can be vectorized. Here, vectorization is a fancy term meaning that the same exact loops are run in C rather than Python, which speeds things up by a few orders of magnitude. At the same time, it means that we won't need to write those loops explicitly, making the code cleaner and more readable...

Exploring NumPy

NumPy is a library built around the notion of numeric arrays: multidimensional, index-based (like a list) collections of data which (unlike a list) guarantee that the type of the stored values stays consistent and predefined, say, a 2-dimensional array of integers or a 1-dimensional array of floats. It is implemented in C and allows us to speed up computation by a few orders of magnitude compared to base Python. The gap in performance is staggering even on relatively small datasets and only grows with dataset size and algorithmic complexity. NumPy is capable of handling a few million rows of data and is primarily bounded by available memory, not the CPU.
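To make that type guarantee concrete, here is a minimal sketch (the values are illustrative, not from the book):

```python
import numpy as np

# A NumPy array fixes one dtype for every element, unlike a Python list.
arr = np.array([1, 2, 3], dtype=np.int64)
print(arr.dtype)     # int64

# Assigning a float into an integer array silently truncates it to the
# array's dtype; the type never changes.
arr[0] = 7.9
print(arr[0])        # 7

# A 2-dimensional array of floats, as described above.
matrix = np.zeros((2, 3), dtype=np.float64)
print(matrix.shape)  # (2, 3)
```

This strictness is exactly what lets NumPy store values in a compact C buffer and loop over them at C speed.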

Let's illustrate this staggering difference in performance with an example. Imagine that we need to sum three lists of values, pairwise. In pure Python, the code will be similar...
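The comparison the text sets up might look like the following minimal sketch, with a pure Python loop next to its vectorized NumPy equivalent (the lists here are toy data, not the book's actual snippet):

```python
import numpy as np

# Three lists of values to sum pairwise.
a = [1, 2, 3]
b = [10, 20, 30]
c = [100, 200, 300]

# Pure Python: an explicit loop (here via a comprehension) over the elements.
python_sum = [x + y + z for x, y, z in zip(a, b, c)]
print(python_sum)  # [111, 222, 333]

# NumPy: the same operation, vectorized; the loop runs in C.
numpy_sum = np.array(a) + np.array(b) + np.array(c)
print(numpy_sum)   # [111 222 333]
```

On three-element lists the difference is invisible, but time both versions on arrays of a few million elements (for example, with `%timeit` in Jupyter) and the vectorized form wins by orders of magnitude.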

Beginning with pandas

Of course, not all data (and not all data analysis) is numeric. To address that gap, and inspired by the R language's dataframe objects, another package, pandas, was created by Wes McKinney in 2008. While it heavily relies on NumPy for numeric computations, its core interface objects are dataframes (2-dimensional, multitype tables) and series (1-dimensional arrays). Dataframes, in comparison to NumPy matrices, don't require all data to be of the same type. On the contrary, they allow you to mix numeric values with Booleans, strings, datetimes, and any other arbitrary Python objects. pandas does require (and enforce), however, that the data type be uniform vertically, within each column. Compared to NumPy, it also allows dataframe columns and rows to have arbitrary numeric or string names, or even hierarchical, multilevel...
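A small sketch of that column-wise typing (the column names and values are made up for illustration):

```python
import pandas as pd

# Each column has its own uniform type: string, integer, Boolean, datetime.
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [34, 29],
    "member": [True, False],
    "joined": pd.to_datetime(["2019-01-15", "2019-03-02"]),
})

print(df.dtypes)           # one dtype per column
print(df["age"].mean())    # numeric columns still get fast NumPy math: 31.5
```

Row-wise, the types mix freely; column-wise, pandas keeps each column as a typed NumPy-backed array, which is how it preserves NumPy's speed for the numeric parts.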

Trying SciPy and scikit-learn

The SciPy package essentially kicked off the entire era of scientific Python. Created in 2001 by researchers Travis Oliphant, Pearu Peterson, and Eric Jones, it was formed as a collection of basic and universal scientific techniques. Over time, the package grew and now offers generic tooling and popular techniques for scientific analysis. Its submodules cover linear algebra, integration, optimization, interpolation, statistics, and many more.
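As a hedged sketch of two of those submodules in action (the function and the data below are arbitrary placeholders, not examples from the book):

```python
from scipy import optimize, stats

# scipy.optimize: find the minimum of a simple quadratic, (x - 3)^2.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2)
print(result.x)  # close to 3.0

# scipy.stats: Pearson correlation between two perfectly linear sequences.
r, p = stats.pearsonr([1, 2, 3, 4], [2, 4, 6, 8])
print(r)         # close to 1.0
```

Every submodule follows this pattern: import the relevant tool, hand it plain numbers or NumPy arrays, and get a result object or tuple back.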

With the rise of machine learning, the corresponding submodule of SciPy grew more and more complex. At some point, it became so big that the decision was made to reintroduce it as a separate, independent package: scikit-learn. As a mark of its origins, the package kept the name, scikit (short for SciPy toolkit) plus learn. Due to its simple and unified interface and a large variety of models, scikit-learn quickly...
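That unified interface means every estimator exposes the same fit and predict methods. A minimal sketch on tiny synthetic data (y = 2x; the model and numbers are illustrative):

```python
from sklearn.linear_model import LinearRegression

# Tiny synthetic training data: y = 2 * x.
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# The same two calls work for nearly any scikit-learn model.
model = LinearRegression()
model.fit(X, y)

prediction = model.predict([[5]])
print(prediction)  # close to [10.]
```

Swapping in a different model, say a decision tree or a support vector machine, changes only the import and constructor line; the fit/predict workflow stays identical, which is a large part of why the library caught on.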

Understanding Jupyter

Finally, there is Jupyter. We're familiar with this tool already, as it proved invaluable for teaching and learning Python on simple examples, but it especially shines for data science: given its rich media and visualization capabilities, Jupyter is an excellent environment for data analysis. It allows quick iteration and experimentation, and supports Markdown documentation and rich media (images, plots, interactive widgets, video, and so on). Of course, Jupyter is 100% open source and free.

Jupyter is also language agnostic. At the moment, there are many language kernels for Jupyter, including Ruby, C, Rust, R, and more. It also supports third-party plugins, for example, leaflet and Mapbox viewers for GeoJSON files or the Vega data visualization viewer. Another advantage is that Jupyter Notebooks are properly rendered on GitHub...

Summary

In this chapter, we covered the foundation of Python's data science stack: the NumPy, pandas, SciPy, scikit-learn, and Jupyter libraries. By doing so, we gained an understanding of this ecosystem, why and when we need each of these packages, and how they relate to each other. Understanding their relationships helps to navigate and search for a specific functionality or tool to use.

We also touched upon the reasons why NumPy-based computations are so fast, and why this leads to a somewhat different philosophy of data-driven development. We further showcased how pandas complements NumPy arrays by supporting plenty of data formats and types, and SciPy and scikit-learn build upon those data structures, allowing us to quickly train and use machine learning models. Finally, we discussed why Jupyter plays such an important role in this process and what...

Questions

  1. Why should we use a special stack of packages for data analysis?
  2. Why are NumPy computations so fast compared to normal Python?
  3. What is the use case and benefit of using pandas over NumPy?
  4. What does sklearn stand for?