You're reading from Learn Python by Building Data Science Applications

Product type: Book
Published in: Aug 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789535365
Edition: 1st Edition
Authors (2):
Philipp Kats

Philipp Kats is a researcher at the Urban Complexity Lab, NYU CUSP, a research fellow at Kazan Federal University, and a data scientist at StreetEasy, with many years of experience in software development. His interests include data analysis, urban studies, data journalism, and visualization. Having a bachelor's degree in architectural design and having followed the (at first) rocky path of being a self-taught developer, Philipp knows the pain points of learning programming and is eager to share his experience.

David Katz

David Katz is a researcher and holds a Ph.D. in mathematics. As a mathematician at heart, he sees code as a tool to express his questions. David believes that code literacy is essential as it applies to most disciplines and professions. David is passionate about sharing his knowledge and has 6 years of experience teaching college and high school students.

Assessments

Chapter 1

What version of Python do we use?

Throughout this book, we are using the Anaconda distribution, along with Python version 3.7.3.

Will it work on a Windows PC?

Absolutely! Python is a cross-platform language and will run on any Windows, Mac, or Linux device. In fact, it can even run on Raspberry Pi, Lego Mindstorms, and Arduino boards!

Do I need to install any additional packages?

Not if you've installed them in bulk using the environment.yaml file from the repository, or using a Docker image. Otherwise, you need to install packages using pip or conda.

What is a Jupyter Notebook?

Jupyter Notebook is a special file format, based on JSON, and used by Project Jupyter; in a nutshell, it represents code in an interactive and descriptive manner and can mix code with text, rich media, and interactive widgets.

When and why should we use Jupyter Notebooks?

Jupyter Notebooks...

Chapter 2

Why do we need to use variables in code?

Variables work like aliases or symbols in mathematical equations. With variables, we can write business logic (the how) without knowing the specific values (the what) beforehand, so we don't have to rewrite that logic over and over again.

What is the recommended way of naming variables? Why does it matter?

There are a few simple, mandatory requirements when it comes to naming variables: names can't start with a digit, and they can't contain whitespace or special characters (other than the underscore). Finally, none of the keywords that are reserved by Python can be used.
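The mandatory rules above can be checked from code. The following is a minimal sketch using only the standard library; the candidate names are made up for illustration.

```python
import keyword

# Hypothetical candidate names, chosen to hit each rule
candidates = ["total_price", "2nd_item", "my var", "class"]

results = {}
for name in candidates:
    # isidentifier() rejects leading digits, whitespace, and special characters;
    # iskeyword() catches words reserved by Python
    results[name] = name.isidentifier() and not keyword.iskeyword(name)

print(results)
```

Only total_price passes both checks here; class is a well-formed identifier but is reserved by the language.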

That being said, there is some guidance on better naming, first of all PEP 8. According to PEP 8, it is recommended to name variables meaningfully and consistently so that they are easy to understand. It is also suggested to use "snake_case" (lowercase, whitespace represented...

Chapter 3

What are functions, and when should we use them?

In programming, a function is a named section of code that encapsulates a specific task and can be used relatively independently from the surrounding code.

How can data be provided to functions?

Conceptually, code in a function can access data from outside. The best way to pass the data, however, is via arguments—special temporary variables used exactly for that.

What does indentation mean? Is it required?

Yes; in Python, indentation is required and defines the grouping of code.

What should be covered in a function's docstring? How can I read a function's docstring?

Ideally, every module, function, and class should have a docstring. In all those cases, a docstring can be shown using the help function, or accessed programmatically via the __doc__ attribute.
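As a small illustration of the two access paths mentioned above, here is a sketch with a hypothetical area function (the function and its docstring are made up for this example).

```python
def area(width, height):
    """Return the area of a rectangle with the given width and height."""
    return width * height


# The docstring is stored on the function object itself
print(area.__doc__)

# In an interactive session, help(area) shows the same text plus the signature
```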

When could it be useful to use type annotations...

Chapter 4

How do we retrieve one element from a list? How do we retrieve the last element of the list without computing its length explicitly?

To retrieve any element from a list, we can pass its index (order, starting with zero) in square brackets: mylist[0] will get the first element. Similarly, negative indices will return elements in reverse order—mylist[-1] will get the last element, no matter how many of them are stored.
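A short sketch of both kinds of index, using a made-up list:

```python
mylist = ["a", "b", "c", "d"]

first = mylist[0]   # index 0 is the first element
last = mylist[-1]   # index -1 is the last element, whatever the length

print(first, last)
```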

How do we get all the elements of a list – except the first one and the last one – in reverse order?

For that, we can use slicing. In a slice, the first number represents the start, the second number represents the end, and the third one represents the step. A negative number will lead to the reverse order. Since we're using all three values and the step is negative, we need to swap the start and end values. Since the start is already...
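One possible slice that satisfies the question is sketched below. The list is made up, and this is not necessarily the exact expression the book arrives at.

```python
mylist = [0, 1, 2, 3, 4, 5]

# Start at the second-to-last element (-2), stop before index 0, step backwards
middle_reversed = mylist[-2:0:-1]

# An equivalent two-step form: cut off both ends, then reverse
same_result = mylist[1:-1][::-1]

print(middle_reversed)
```

The single-slice form avoids building an intermediate list, while the two-step form is arguably easier to read.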

Chapter 5

Can the if clause work with multiple (more than two) logical branches?

Yes! For that, you can use an additional keyword—elif. This way, you can have an unlimited number of logical branches, though it's recommended to use no more than four to five at a time.
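A minimal sketch of an if statement with several branches; the function name and the thresholds are arbitrary and only serve as an illustration.

```python
def describe_temperature(celsius):
    """Map a temperature to a rough label using several branches."""
    if celsius < 0:
        return "freezing"
    elif celsius < 15:
        return "cold"
    elif celsius < 25:
        return "mild"
    else:
        return "hot"


print(describe_temperature(18))
```

Branches are checked from top to bottom, and only the first one whose condition holds is executed.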

What is the difference between for and while loops?

for loops are explicitly finite: they run once for every element in a given iterable (although you can pass an infinite iterable if you need to). The loop body is usually meant to work with each element of that iterable.

while loops run for an indefinite number of iterations, until a stopping criterion is met. That makes them a good fit if you don't know how many iterations it will take to meet that criterion (or if you want an explicitly infinite loop, which would be stopped from within the loop itself).
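The contrast can be sketched with two toy loops (the numbers are arbitrary):

```python
# for: runs once per element of a finite iterable
squares = []
for number in [1, 2, 3]:
    squares.append(number ** 2)

# while: runs until the condition stops holding;
# the number of iterations is not stated anywhere upfront
value = 100
halvings = 0
while value > 1:
    value = value / 2
    halvings += 1

print(squares, halvings)
```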

How can I loop through multiple (two or more) arrays of the same length? Or of different lengths?

The best...

Chapter 6

What is an API? Why would we use it?

An API is a programmatic interface, that is, a way to interact with a given tool or service using code. Generally speaking, any tool can (and many do) have an API; for example, every Python package has one. Usually, however, the term is used in the context of a Web API: an interface for a certain service that's accessible programmatically via the internet. You use Web APIs all of the time; most applications on your phone communicate with the corresponding servers via their APIs. For us, a Web API is a way to leverage the power and information of web services from within our Python code.

What do the various HTTP(S) response status codes mean?

HTTP response statuses are integers that describe the outcome of a request and are set by the server. For example, if routing servers can't find a URL you...
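For a quick look at what individual codes stand for, the standard library's http.HTTPStatus enumeration can be queried without making any request. This is only a lookup sketch; the three codes are arbitrary picks.

```python
from http import HTTPStatus

# Print the standard reason phrase for a few common status codes
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)
```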

Chapter 7

What does the term web scraping mean?

Web scraping is the process of collecting information directly from HTML web pages. Just like in mining, we first have to collect the ore (the raw HTML), from which we can then refine the valuable data points.

What are the main differences between scraping and using a web API? What are the challenges?

The main difference is the lack of any guarantees: there is no promise that the web page won't change in terms of its structure, or that it will be shown at all. In fact, many services actively attempt to prevent web scraping. Another challenge is processing raw HTML into valuable information, as it often requires some custom code.

What exactly does Beautiful Soup do? Can we scrape without it?

In our stack (requests and BeautifulSoup), the latter allows us to navigate the document and query it, pulling specific values. We can definitely...

Chapter 8

What are classes? When should we use them?

Classes represent a way in which we can create complex objects, with the corresponding data (attributes) and functions (methods). Classes are a useful concept to represent any entity, such as a database connection, file object, algorithm, and so on. There's also a set of special methods and variables that's used by Python to change the behavior of certain instances.

Can we compare two instances of a class or use arithmetic operations with them?

Yes, this is one of the use cases for special methods. For example, in order to check instances for equality, we need to define the __eq__ method of the class. The same goes for checking whether an instance is greater, smaller, and so on: there is a corresponding special method for each operation.
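A minimal sketch of such special methods is shown below. The Money class is hypothetical and exists only to show which method backs which operator.

```python
class Money:
    """A made-up class that supports ==, < and +."""

    def __init__(self, amount):
        self.amount = amount

    def __eq__(self, other):
        # Called for: a == b
        return self.amount == other.amount

    def __lt__(self, other):
        # Called for: a < b
        return self.amount < other.amount

    def __add__(self, other):
        # Called for: a + b
        return Money(self.amount + other.amount)


print(Money(5) == Money(5))
print(Money(3) < Money(5))
print((Money(3) + Money(5)).amount)
```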

When should we use inheritance?

Inheritance is an important property...

Chapter 9

What is a shell? Why and when are command-line interfaces advantageous compared to graphical interfaces?

A shell is a user interface that you use to interact with the operating system of a computer. Usually, people use this term to refer to a command-line shell that allows you to control the OS with a set of textual commands. There are three main advantages of command-line interfaces over GUIs. First, textual commands can be combined and stored and thus form scripts. Second, they require a minimal amount of memory and thus are far more suitable for interacting with remote machines via the internet. Third, command-line interfaces are quite unified across different operating systems: commands on Linux and macOS are largely identical, and even Windows has either similar or aliased commands.

What exactly does version control mean? Is it suitable for research projects?

Version...

Chapter 10

Why should we use a special stack of packages for data analysis?

Data analysis requires a fast and easy way to operate on multiple elements at once, a so-called vectorized approach. Python's scientific stack allows this by using NumPy, a package for fast array operations.

Why are NumPy computations so fast compared to normal Python?

NumPy is drastically faster than vanilla Python on numerical operations. This is all thanks to a different data representation—NumPy arrays, in contrast to standard Python collections, require all the elements to be of the same data type. Because of that, an array can be passed to a CPU as one entity and computed more effectively.
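The difference in representation can be sketched as follows, assuming NumPy is installed; the values are arbitrary, and no timing is claimed here.

```python
import numpy as np

values = list(range(5))

# Plain Python: the multiplication happens element by element in a loop
doubled_list = [v * 2 for v in values]

# NumPy: one vectorized operation over the whole array
array = np.array(values)
doubled_array = array * 2

print(array.dtype)      # a single dtype is shared by every element
print(doubled_array)
```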

What is the use case and benefit of using Pandas over NumPy?

NumPy is primarily geared toward numeric arrays. Pandas, on the other hand, adds rich support for datetime, string, and categorical data. In addition...

Chapter 11

Why, if there is an empty cell in the Pandas column, are integer values in this column converted into floats?

This happens since NumPy (and based on it, Pandas) does not support null integers—every null is a special case of a float. Thus, to keep the datatype consistent across the column, NumPy has to convert all integers into floats.
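The conversion is easy to observe on two tiny series, assuming pandas is installed. This sketch relies on pandas' default type inference and does not use the optional nullable integer types.

```python
import pandas as pd

complete = pd.Series([1, 2, 3])
with_gap = pd.Series([1, 2, None])

print(complete.dtype)  # an integer dtype
print(with_gap.dtype)  # a float dtype: the missing value is stored as NaN
```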

What is the benefit of plotting missing values?

Often, missing values in a dataset can have a certain pattern—for example, records with a missing value in one column also miss values in others. Having a bird's-eye view allows you to find those patterns and define an appropriate imputation strategy.

What is RegEx? Is it a separate language?

Indeed, Regular Expressions, or regex, is a distinct mini-language for text extraction and search. RegEx is implemented in most programming languages—including Python.
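In Python, regular expressions are available through the built-in re module. Here is a small sketch on a made-up sentence:

```python
import re

text = "Order 66 shipped on 2019-08-30, order 7 on 2019-09-02."

# \d{4}-\d{2}-\d{2} matches dates written as YYYY-MM-DD
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(dates)
```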

How can...

Chapter 12

How can we understand some general properties of a dataset with pandas?

You can compute specific statistics, such as the mean, median, or standard deviation, on specific columns. Alternatively, you can use the describe method: it computes descriptive statistics (the ones above, plus the minimum/maximum, quartiles, and a few more) for all the numeric columns in a dataframe.

What does the resample function do in pandas? How is it different from aggregation?

This method is meant to be used on a dataframe of time-based records. resample works similarly to aggregation, except that it groups by a time period and returns rows (with empty values) for missing periods as well.
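The difference shows up as soon as a period has no records. Below is a sketch on a made-up series with a gap on January 2, assuming pandas is installed.

```python
import pandas as pd

visits = pd.Series(
    [10, 20, 5],
    index=pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-03"]),
)

# Plain aggregation: only the dates that actually occur in the data
by_day_groupby = visits.groupby(visits.index).mean()

# resample: one row per calendar day, including the empty one
by_day_resample = visits.resample("D").mean()

print(by_day_groupby)
print(by_day_resample)
```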

How does visualization work in pandas?

Pandas has an extensive and simple interface for visualization, but it doesn't create charts on its own; all the actual visual stuff is done by matplotlib. Starting...

Chapter 13

What is machine learning?

Machine learning is a discipline (a branch of artificial intelligence) that focuses on automatic model building. Machine learning algorithms allow us to automatically find patterns or a hierarchy in data (unsupervised learning), or even predict the property of a given sample after training on the prepared "training" dataset (supervised learning).

What is the difference between supervised and unsupervised learning?

Unsupervised learning algorithms operate on any given dataset with no special preparation required and aim to find patterns or structures without any prior knowledge. Supervised learning models are trained on a properly labeled "training set"; from it, they build a generalized model, which is then able to infer values for new data samples it hasn't seen before.

What are the drawbacks of k-means...

Chapter 14

What is overfitting?

Many ML models (for example, decision trees) are fit to perform as well as possible on the training set at hand, but at some point this process goes beyond the generalizable knowledge that's valuable for the task and starts capturing details that are specific to the training set and irrelevant to the test set. This is not only meaningless but will also hurt the model's performance on other data. This phenomenon is known as overfitting, and there are ways to overcome it.

Why should we use cross-validation?

Cross-validation is a technique that's aimed at overcoming the issue of overfitting. In its basic form, it splits a training set into multiple folds, trains multiple models with the same settings on different combinations of those folds, and measures their performance on other folds—and then averages the performance across all models. As a result, this sampling and prediction on the...
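The basic procedure described above can be sketched in plain Python. Everything here is illustrative: the k_fold_indices helper is hypothetical, the data is made up, and the "model" is deliberately trivial (it always predicts the mean of its training targets).

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_folds
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder
        stop = n_samples if fold == n_folds - 1 else start + fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]

fold_errors = []
for train, test in k_fold_indices(len(targets), n_folds=3):
    # "Train": the toy model just memorizes the mean of the training targets
    prediction = mean(targets[i] for i in train)
    # "Evaluate": mean absolute error on the held-out fold
    error = mean(abs(targets[i] - prediction) for i in test)
    fold_errors.append(error)

# Average the per-fold performance across all models
cv_score = mean(fold_errors)
print(fold_errors, cv_score)
```

In practice, a library routine would usually replace the hand-written split; the sketch only makes the fold logic visible.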

Chapter 15

What are the benefits of packaging code?

Packaging code is a great way to do the following:

  • Make certain code available to use from multiple other packages
  • Share code with colleagues or make it easy to install for yourself
  • Set a project to collaborate on with others
  • Add reliability to your code by constantly running tests
  • Structure code better and isolate it from your day-to-day work

What is the main difference between Conda and pip as package managers?

At this moment, the difference is not as great as it was before. Historically, pip didn't support adding non-Python code as a binary for various reasons. This is a problem for data analysis projects since many data-related packages, such as NumPy, SciPy, and scikit-learn, use C and even Fortran under the hood.

This is where Conda comes into play—it allows you to install any tool in any language, even one that...

Chapter 16

What are the benefits of writing tasks rather than using simple scripts?

Scripts are great for simple and one-off jobs. If you have a repetitive task to do, or, even more so, if there is a set of tasks that depend on each other and you need to ensure that they don't run with a dependency missing, or that they won't overwrite (or append to) existing data, then ETL pipelines and tasks are for you. As a free bonus, frameworks such as Luigi have a lot of utility code that helps to build pipelines: you won't need to write your own solution for writing to S3 or a database, or for parsing command-line arguments.

What is the base element of Luigi jobs?

The base element of Luigi jobs (pipelines) is the Task class. All the business logic of a task needs to be wrapped in the run method. Its output and dependencies are defined within the output and requires...

Chapter 17

What are the main differences between visualizing data in the notebook and on a dashboard?

The main differences are as follows:

  1. The audience for the dashboard is meant to be wide—so the dashboard should be easily accessible, for example, via an internet browser, and well-explained. One-off visualizations, on the other hand, are often made for self-consumption, and thus don't need to be self-explanatory.
  2. Dashboards are meant to be frequently updated and exploratory. Visualizations are often static and show a specific aspect of data.

Why do we call some dashboards "static"? What are the pros and cons of a static dashboard?

In common terms, static web pages are ones that are provided "as-is," as flat files, and there is no active server behind them. Static dashboards are easier to maintain and provide for a wide audience but have some...

Chapter 18

What is a REST API?

REST, or REpresentational State Transfer, is a general architectural style for API interaction that uses the HTTP protocol. The main features of REST-compliant systems are statelessness and the separation of concerns between the client and the server.

What Python packages can be used to build a REST API?

At this point, there are quite a lot of frameworks that can be used to build a REST API in Python. Popular ones include Flask, Django REST framework, Hug, Falcon, CherryPy, and Quart, among many others. In this book, we're using the FastAPI framework.

What are the key features of the FastAPI framework?

FastAPI has a few unique characteristics. First, it is designed specifically with APIs in mind, unlike many other frameworks. Second, it fully supports asynchronous execution and can work with a Uvicorn-Gunicorn inspired asynchronous server. Third, it...

Chapter 19

What does a serverless application mean?

Serverless applications still run on normal servers, but control over the server's behavior and the stack are completely handled by the cloud provider—all that's required from the developer is to write a function that describes the business logic. This function can be set to trigger on a request to a certain API endpoint, on a certain event (for example, a file addition to the S3 bucket), or on a scheduler so that it runs every day.
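To give a feel for how little code such a function needs, here is a sketch in the style of an AWS Lambda handler behind an HTTP endpoint. The function name, the "name" field of the event, and the greeting are assumptions made for this example; the real event contents depend on the trigger.

```python
import json


def handler(event, context):
    """Hypothetical entry point; the cloud provider calls it on each trigger."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local call with a fake event, just to show the shape of the data
print(handler({"name": "Ada"}, context=None))
```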

What are the limitations of the serverless approach?

Serverless applications are mainly bound by the resources they can use, including the size of the deployment package and, therefore, the packages that can be installed. For AWS Lambda, the limit for a zipped deployment package uploaded directly is 50 MB.

What are the benefits of serverless APIs?

Serverless APIs have quite a few benefits. First and foremost, you don't need to spend time on the development and maintenance...

Chapter 20

How can we measure which line in the code took the most time to complete?

The simplest way to do that is via a utility called line_profiler. This utility lists each line of the given code and shows how much time was spent on it. Knowing how the time is distributed helps us focus on the right parts of the code.

Does NumPy run faster than Pandas?

In most cases with numeric computations, Pandas uses NumPy under the hood, so the difference is minimal. It does, however, spend certain additional time on building series and dataframes, when needed. So, for a well-scoped and purely numeric task, it makes sense to switch to pure NumPy.

When should we use Numba? What are the challenges and benefits of using Numba?

Numba uses a modern just-in-time compiler based on LLVM to significantly improve performance. It can also be run on a GPU. Its "...
