You're reading from  Hands-On Data Science with Anaconda

Product type: Book
Published in: May 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781788831192
Edition: 1st Edition
Authors (2):
Yuxing Yan

Yuxing Yan graduated from McGill University with a PhD in finance. Over the years, he has been teaching various finance courses at eight universities: McGill University and Wilfrid Laurier University (in Canada), Nanyang Technological University (in Singapore), Loyola University of Maryland, UMUC, Hofstra University, University at Buffalo, and Canisius College (in the US). His research and teaching areas include: market microstructure, open-source finance and financial data analytics. He has 22 publications including papers published in the Journal of Accounting and Finance, Journal of Banking and Finance, Journal of Empirical Finance, Real Estate Review, Pacific Basin Finance Journal, Applied Financial Economics, and Annals of Operations Research. He is good at several computer languages, such as SAS, R, Python, Matlab, and C. His four books are related to applying two pieces of open-source software to finance: Python for Finance (2014), Python for Finance (2nd ed., expected 2017), Python for Finance (Chinese version, expected 2017), and Financial Modeling Using R (2016). In addition, he is an expert on data, especially on financial databases. From 2003 to 2010, he worked at Wharton School as a consultant, helping researchers with their programs and data issues. In 2007, he published a book titled Financial Databases (with S.W. Zhu). This book is written in Chinese. Currently, he is writing a new book called Financial Modeling Using Excel — in an R-Assisted Learning Environment. The phrase "R-Assisted" distinguishes it from other similar books related to Excel and financial modeling. 
New features include using a huge amount of public data related to economics, finance, and accounting; an efficient way to retrieve data: 3 seconds for each time series; a free financial calculator, showing 50 financial formulas instantly, 300 websites, 100 YouTube videos, 80 references, paperless for homework, midterms, and final exams; easy to extend for instructors; and especially, no need to learn R.

James Yan

James Yan is an undergraduate student at the University of Toronto (UofT), currently double-majoring in computer science and statistics. He has hands-on knowledge of Python, R, Java, MATLAB, and SQL. During his study at UofT, he has taken many related courses, such as Methods of Data Analysis I and II, Methods of Applied Statistics, Introduction to Databases, Introduction to Artificial Intelligence, and Numerical Methods, including a capstone course on AI in clinical medicine.

Distributed Computing, Parallel Computing, and HPCC

Since our society has entered a data-intensive era (that is, a big data era), we face larger and larger datasets. For this reason, companies and users are considering what kinds of tools they could use to speed up their data processing. One obvious solution is to increase their data storage capacity. Unfortunately, there is a huge cost associated with this. Other solutions include distributed computing and other ways of accelerating our computations, such as parallel processing.

In this chapter, we'll cover the following topics:

  • Introduction to distributed versus parallel computing
  • Understanding MPI
  • Parallel processing in Python
  • Compute nodes
  • Anaconda add-ons
  • Introduction to HPCC

Introduction to distributed versus parallel computing

Distributed computing is a subfield of computer science that studies distributed systems and models in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.

It is worthwhile to discuss a related term: parallel computing. Parallel computing is more tightly coupled to multi-threading, that is, how to make full use of the processing power of a single machine, while distributed computing follows the notion of divide and conquer: executing subtasks on different machines and then merging the results.

Since we have entered the so-called big data era, the distinction seems to be blurring. In fact, many systems nowadays use a combination of parallel and distributed computing.
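The divide-and-conquer pattern described above (split the work into subtasks, run them concurrently, then merge the results) can be sketched on a single machine with Python's standard concurrent.futures module. This is only an illustrative sketch: the function names sum_of_squares, split_range, and parallel_sum_of_squares are hypothetical and do not come from the book.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(bounds):
    """Subtask: sum i*i over the half-open range [start, stop)."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n, parts):
    """Divide range(n) into `parts` contiguous (start, stop) chunks."""
    step, extra = divmod(n, parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks get one additional element.
        stop = start + step + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_of_squares(n, workers=4):
    """Run the subtasks in separate processes and merge the partial sums."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, split_range(n, workers)))


if __name__ == "__main__":
    print(parallel_sum_of_squares(100_000))
```

In a distributed setting, the same three steps apply, except that the chunks would be sent to different machines rather than to local worker processes.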

...

Understanding MPI

Usually, a parallel algorithm needs to move data between different engines. One way to do so is by doing a pull and then a push using the direct view. However, this method is quite slow since all the data has to go through the controller to the client and then back through the controller, to its final destination. A much better way of moving data between engines is to use a message passing library, such as the Message Passing Interface (MPI). IPython's parallel computing architecture has been designed to integrate with MPI. To download and install Windows MPI, readers can refer to https://msdn.microsoft.com/en-us/library/bb524831%28v=vs.85%29.aspx.

In addition, Python users could install the mpi4py package, which provides Python bindings for MPI.
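To make the idea of message passing more concrete, here is a minimal sketch that assumes mpi4py and an MPI runtime are installed. Each process (rank) computes a partial sum, and reduce() combines the partial results directly between processes. The helper names local_slice, partial_sum, and main are hypothetical. So that the sketch still runs without mpi4py, it falls back to a single process.

```python
try:
    from mpi4py import MPI  # optional dependency (assumed to be installed)
except ImportError:
    MPI = None  # fall back to a plain single-process run


def local_slice(n, rank, size):
    """Indices of range(n) owned by this rank (round-robin split)."""
    return range(rank, n, size)


def partial_sum(n, rank, size):
    """Each rank sums the squares of its own indices only."""
    return sum(i * i for i in local_slice(n, rank, size))


def main(n=100_000):
    if MPI is None:
        comm, rank, size = None, 0, 1
    else:
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()

    local = partial_sum(n, rank, size)
    # reduce() merges the partial sums; only rank 0 receives the total,
    # the other ranks get None.
    total = local if comm is None else comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        print(f"sum of squares below {n}: {total} (from {size} process(es))")
    return total


if __name__ == "__main__":
    main()
```

Assuming the file is saved as mpi_sum.py (a hypothetical name), it could be launched on four processes with a command such as mpiexec -n 4 python mpi_sum.py.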

R package Rmpi

...

Parallel processing in Python

The following example is about computing the digits of π and is borrowed from http://ipyparallel.readthedocs.io/en/latest/demos.html#parallel-examples. Because the first part relies on a function called one_digit_freqs(), we could run a Python program called pidigits.py, located at .../ipython-ipython-in-depth-4d98937\examples\Parallel Computing\pi. The exact path depends on where the reader downloaded and saved the files.

To complete our part, we simply include it in the first part of the program, as shown here:

import matplotlib.pyplot as plt
import sympy
import numpy as np
#
def plot_one_digit_freqs(f1):
    """
    Plot one digit frequency counts using matplotlib.
    """
    ax = plt.plot(f1,'bo-')
    plt.title('Single digit counts in pi')
    plt.xlabel('Digit')
    plt.ylabel...
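The listing above is cut off, and pidigits.py itself is not reproduced here. As a rough, self-contained illustration of the counting step, the following sketch splits a string of digits into chunks, counts the digits in each chunk, and merges the partial counts. The names count_digits, chunked, and digit_freqs are hypothetical stand-ins, not the functions from pidigits.py, and the first 50 decimal places of π are hardcoded so that no extra package is needed.

```python
from collections import Counter

# First 50 decimal places of pi, hardcoded to keep the sketch dependency-free.
PI_DECIMALS = "14159265358979323846264338327950288419716939937510"


def count_digits(chunk):
    """Subtask: frequency of each digit character in one chunk."""
    return Counter(chunk)


def chunked(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def digit_freqs(digits, chunk_size=10, map_fn=map):
    """Map count_digits over the chunks, then merge the partial counts.

    map_fn can be any map-like callable; the built-in map runs serially.
    """
    total = Counter()
    for partial in map_fn(count_digits, chunked(digits, chunk_size)):
        total += partial
    # Return counts ordered by digit 0..9, ready to be plotted.
    return [total[str(d)] for d in range(10)]


if __name__ == "__main__":
    print(digit_freqs(PI_DECIMALS))
```

Because each chunk is counted independently, the map step is the part that a parallel map (for example, one provided by a process pool) could take over, while the merge step stays the same.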

Compute nodes

A compute node provides the ephemeral storage, networking, memory, and processing resources that can be consumed by virtual machine instances. The cloud system supports two types of compute nodes: ESX clusters, where clusters are created in VMware vCenter Server, and KVM compute nodes, where KVM compute nodes are created manually. In the previous chapter, we mentioned the concept of the cloud.

Within a cloud environment, which is quite useful for a more complex project, compute nodes form the core of resources. Typically, these nodes supply the processing, memory, network, and storage that virtual machine instances need. When an instance is created, it is matched to a compute node with the available resources. A compute node can host multiple instances until all of its resources are consumed.
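The matching of instances to compute nodes can be pictured with a toy first-fit placement. This is a simplified sketch, not the scheduler of any real cloud system: the ComputeNode class, the place_instance function, and the resource figures are all hypothetical, and only vCPUs and memory are tracked.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    name: str
    free_vcpus: int
    free_ram_gb: int
    instances: list = field(default_factory=list)

    def can_host(self, vcpus, ram_gb):
        """True if the node still has enough free resources."""
        return self.free_vcpus >= vcpus and self.free_ram_gb >= ram_gb

    def host(self, instance, vcpus, ram_gb):
        """Reserve the resources and record the instance on this node."""
        self.free_vcpus -= vcpus
        self.free_ram_gb -= ram_gb
        self.instances.append(instance)


def place_instance(nodes, instance, vcpus, ram_gb):
    """Return the name of the first node with room, or None if none fits."""
    for node in nodes:
        if node.can_host(vcpus, ram_gb):
            node.host(instance, vcpus, ram_gb)
            return node.name
    return None


if __name__ == "__main__":
    nodes = [ComputeNode("node-a", 4, 16), ComputeNode("node-b", 8, 32)]
    for vm in ("vm1", "vm2", "vm3", "vm4"):
        print(vm, "->", place_instance(nodes, vm, vcpus=4, ram_gb=16))
```

In this run, node-b ends up hosting two instances, and the fourth request is rejected because the resources of both nodes are consumed.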

Anaconda add-ons

The following information is from the Anaconda Addon Development Guide.

An Anaconda add-on is a Python package containing a directory with an __init__.py file and other source directories (sub packages) inside. Because Python allows importing each package name only once, the package top-level directory name must be unique. At the same time, the name can be arbitrary, because add-ons are loaded regardless of their name; the only requirement is that they must be placed in a specific directory.

The suggested naming convention for add-ons is therefore similar to that of Java packages or D-Bus service names: prefix the add-on name with the reversed domain name of your organization, using underscores (_) instead of dots so that the directory name is a valid identifier for a Python package. An example add-on name following these suggestions would therefore be, for example...
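The naming convention can be expressed as a small helper. This is only a sketch of the rule as described above: the function addon_package_name and the domain example.org are hypothetical, and the helper simply reverses the domain, joins the parts with underscores, and checks that the result is a valid Python identifier.

```python
def addon_package_name(domain, addon):
    """Build '<reversed domain>_<addon>' with underscores instead of dots."""
    reversed_domain = "_".join(reversed(domain.lower().split(".")))
    name = f"{reversed_domain}_{addon.lower().replace('-', '_')}"
    if not name.isidentifier():
        raise ValueError(f"{name!r} is not a valid Python package name")
    return name


if __name__ == "__main__":
    # Hypothetical organization domain and add-on name.
    print(addon_package_name("example.org", "hello-world"))
```

Here the call produces org_example_hello_world, which can be used as a directory name and imported as a Python package.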

Introduction to HPCC

HPCC stands for High-Performance Computing Cluster. It is also known as Data Analytics Supercomputer (DAS), an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on computing clusters to provide high-performance, data-parallel processing design for various applications using big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

You can see a simple example of using Wharton's HPCC system at https://research-it.wharton.upenn.edu/documentation/. Wharton's HPC Cluster (HPCC) provides access...

Summary

In this chapter, we have discussed several R packages, such as plyr, snow, Rmpi, and parallel, and the Python package ipyparallel. In addition, we mentioned compute nodes, Anaconda add-ons, parallel processing, and HPCC.

Now we've arrived at the end of our journey. We wish you good luck with the amazing endeavors you'll take up with the knowledge you've gained from this book.

Review questions and exercises

  1. What is distributed computing? Why is it useful?
  2. Where could we get a task view for parallel computing?
  3. From the task view related to parallel computing, we can find many R packages. Identify a few of them. Install two and find a few examples of using these two packages.
  4. Conduct a word frequency analysis using: The Count of Monte Cristo by Alexandre Dumas (input file is at http://www.gutenberg.org/files/1184/1184-0.txt).
  5. Where could we find more information about Anaconda add-ons?
  6. What is HPCC and how does it work?
  7. How do we find the path of an installed R package?
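  8. In the sample Jupyter notebook about parallel Monte-Carlo options pricing, the related Asian options are defined here, where call(Asian) is the Asian call option, put(Asian) is the Asian put option, K is the exercise price, and $\bar{P}$ is the average price over the path:

     $\text{call(Asian)} = \max(\bar{P} - K, 0)$
     $\text{put(Asian)} = \max(K - \bar{P}, 0)$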

Write a Jupyter notebook to use...
