
Jupyter for Data Science

Product type: Book
Published in: Oct 2017
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781785880070
Edition: 1st Edition
Author: Dan Toomey

Dan Toomey has been developing application software for over 20 years. He has worked in a variety of industries and companies, in roles from sole contributor to VP/CTO-level. For the last few years, he has been contracting for companies in the eastern Massachusetts area. Dan has been contracting under Dan Toomey Software Corp. Dan has also written R for Data Science, Jupyter for Data Science, and the Jupyter Cookbook, all with Packt.

Chapter 2. Working with Analytical Data on Jupyter

Jupyter does none of the heavy lifting for analyzing data: all the work is done by programs written in a selected language. Jupyter provides the framework to run modules in a variety of programming languages. So, we have a choice of how we analyze data in Jupyter.

A popular choice for data analysis programming is Python, and Jupyter fully supports Python programming. We will look at a variety of programming solutions that might tax such a support system and see how Jupyter fares.

Data scraping with a Python notebook


A common task in data analysis is gathering data from a public source such as a website. Python is adept at scraping websites for data. Here, we look at an example that loads stock price information from Google Finance.

In particular, given a stock symbol, we want to retrieve the last year of price ranges for that symbol.

One of the pages on the Google Finance site gives the last year's worth of price data for a security. For example, if we were interested in the price points for Advanced Micro Devices (AMD), we would enter the following URL:

https://www.google.com/finance/historical?q=NASDAQ:AMD

Here, NASDAQ is the stock exchange that carries the AMD security. On the resultant Google page, there is a table of data points of interest, as seen in the following partial screenshot.

Like many sites that you will attempt to access, the page contains a lot of other information as well, such as headers, footers, and ads, as you can see...
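
To illustrate pulling only the table of interest out of a page full of other markup, here is a minimal sketch using Python's standard html.parser module. The page fragment, its markup, and the numbers in it are hypothetical placeholders made up for illustration; they are not real Google Finance output, and the TableRowParser class is an assumed helper name, not code from the book.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: markup and numbers are placeholders, not real price data.
SAMPLE_HTML = """
<html><body>
<div id="header">Site header and ads</div>
<table class="prices">
<tr><th>Date</th><th>Open</th><th>High</th><th>Low</th><th>Close</th></tr>
<tr><td>Jan 2</td><td>10.0</td><td>11.0</td><td>9.5</td><td>10.5</td></tr>
<tr><td>Jan 3</td><td>10.5</td><td>12.0</td><td>10.0</td><td>11.5</td></tr>
</table>
<div id="footer">Site footer</div>
</body></html>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # completed rows, each a list of cell strings
        self._row = None        # row currently being built, or None
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text outside table cells (headers, footers, ads) is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableRowParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
for record in records:
    print(record)
```

In a real notebook the HTML string would come from an HTTP request rather than a literal, and the tag structure would have to be checked against the actual page.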

Using heavy-duty data processing functions in Jupyter


Python has several groups of processing functions that can tax computer system power. Let us use some of these in Jupyter and determine if the functionality performs as expected.

Using NumPy functions in Jupyter

NumPy is a package in Python providing multidimensional arrays and routines for array processing. We bring in the NumPy package using the from numpy import * statement. In particular, the NumPy package defines the array function, which creates a NumPy array object with extensive functionality.

The NumPy array processing functions run from the mundane, such as the min() and max() functions (which provide the minimum and maximum values over the array dimensions provided), to more interesting utility functions for producing histograms and calculating correlations using the elements of an array.

With NumPy, you can manipulate arrays in many ways. For example, we will go over some of these functions with the following scripts, where we will use NumPy...
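
As a minimal sketch of the kinds of functions mentioned above (minimum, maximum, histogram, and correlation), the following uses a small made-up array. The values are chosen only for illustration, and the explicit import numpy as np form is used here instead of a wildcard import so that each NumPy name is easy to spot.

```python
import numpy as np

# Small illustrative array (values are made up)
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

overall_min = values.min()        # smallest element in the whole array
column_max = values.max(axis=0)   # largest element in each column

# Histogram of all elements, split into 3 equal-width bins
counts, edges = np.histogram(values, bins=3)

# Pearson correlation between the first and second rows
correlation = np.corrcoef(values[0], values[1])[0, 1]

print(overall_min, column_max)
print(counts, edges)
print(correlation)
```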

Using SciPy in Jupyter


SciPy is an open source library for mathematics, science, and engineering. With such a wide scope, there are many areas we can explore using SciPy:

  • Integration
  • Optimization
  • Interpolation
  • Fourier transforms
  • Linear algebra
  • Several other intensive sets of functionality, such as signal processing
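
To give a flavor of two of the areas listed above, here is a small sketch that minimizes a one-variable function with scipy.optimize and solves a linear system with scipy.linalg. The function and the matrix are made-up examples chosen so that the answers are easy to check by hand.

```python
import numpy as np
from scipy import linalg
from scipy.optimize import minimize_scalar

# Optimization: the parabola (x - 2)^2 has its minimum at x = 2
result = minimize_scalar(lambda x: (x - 2) ** 2)
print(result.x)

# Linear algebra: solve A @ x = b for x
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
solution = linalg.solve(A, b)
print(solution)
```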

Using SciPy integration in Jupyter

A standard mathematical operation is integrating a function. SciPy accomplishes this by taking your function as a callback and evaluating it repeatedly to compute the integral numerically. For example, suppose that we wanted to determine the following integral, written here to match the script that follows, with a = 2 and b = 1:

$$\int_0^1 (a\pi + b)\,dx$$

We would use a script like the following. We are using the definition of pi from the standard math package.

from scipy.integrate import quad
import math
def integrand(x, a, b):
    return a*math.pi + b
a = 2
b = 1
quad(integrand, 0, 1, args=(a,b))

Again, this coding is very clean and simple, yet it would take considerably more work in many other languages. Running this script in Jupyter we see the results...
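
As a sanity check on what quad returns, the sketch below repeats the same integrand and compares the numerical result with the value worked out by hand. Because the integrand as coded does not depend on x, the integral over [0, 1] is simply a*pi + b. The names value, abs_error, and expected are introduced here for illustration.

```python
import math
from scipy.integrate import quad


def integrand(x, a, b):
    # Same integrand as the script above; note that it is constant in x
    return a * math.pi + b


a, b = 2, 1

# quad returns a tuple: (estimate of the integral, estimate of the absolute error)
value, abs_error = quad(integrand, 0, 1, args=(a, b))

# Closed form: a constant integrand times an interval of length 1
expected = a * math.pi + b

print(value, abs_error, expected)
```

The second element of the tuple is an error estimate, so it is worth unpacking both values rather than treating the return value as a single number.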

Expanding on pandas data frames in Jupyter


There are more built-in functions for working with data frames than we have used so far. If we were to take one of the data frames from a prior example in this chapter, the Titanic dataset from an Excel file, we could use additional functions to help portray and work with the dataset.

As before, we load the dataset using the following script:

import pandas as pd
df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')

We can then inspect the data frame using the info function, which displays the characteristics of the data frame:

df.info()

Some of the interesting points are as follows:

  • 1309 entries
  • 14 columns
  • Not many fields with valid data in the body column—most were lost
  • Does give a good overview of the types of data involved

We can also use the describe function, which gives us a statistical breakdown of the numeric columns in the data frame.

df.describe()

This produces the following tabular display:

For each numerical column we have...
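
To show what info and describe report without downloading the Excel file, here is a minimal sketch on a tiny in-memory data frame. The rows are made up and the column names are chosen only for illustration; this is a stand-in, not the Titanic data itself.

```python
import pandas as pd

# Hypothetical stand-in data frame: made-up rows, illustrative column names
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "age": [20.0, 30.0, None, 40.0],   # one missing value on purpose
    "fare": [10.0, 20.0, 30.0, 40.0],
})

# Prints the number of entries, each column's dtype, and its non-null count
df.info()

# Statistics (count, mean, std, min, quartiles, max) for the numeric columns only
summary = df.describe()
print(summary)
```

Note that the count for a column excludes missing values, which is how a sparsely populated column such as body shows up in the summary.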

Summary


In this chapter, we looked at some of the more compute-intensive tasks that might be performed in Jupyter. We used Python to scrape a website to gather data for analysis. We used Python NumPy, pandas, and SciPy functions for in-depth computation of results. We went further into pandas and explored manipulating data frames. Lastly, we saw examples of sorting and filtering data frames.

In the next chapter, we will make some predictions and use visualization to validate our predictions.

 

 

 
