You're reading from Hands-On Data Science with Anaconda

Product type: Book

Published in: May 2018

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781788831192

Edition: 1st Edition

Languages

Python

Tools

Jupyter Anaconda

Concepts

Data Science

Authors (2):

Yuxing Yan

James Yan


Distributed Computing, Parallel Computing, and HPCC

Our society has entered a data-intensive era (that is, the big data era), and the datasets we face keep growing. For this reason, companies and users are looking for tools that speed up data processing. One obvious solution is to increase data storage capacity. Unfortunately, this comes at a huge cost. Other solutions include distributed computing and other ways of accelerating our processing.

In this chapter, we'll cover the following topics:

Introduction to distributed versus parallel computing
Understanding MPI
Parallel processing in Python
Compute nodes
Anaconda add-ons
Introduction to HPCC

Introduction to distributed versus parallel computing

Distributed computing is a subfield of computer science that studies distributed systems and models in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal.

It is worthwhile to discuss another phrase: parallel computing. Parallel computing is more tightly coupled to multi-threading and multiprocessing, that is, how to make full use of the processing cores of a single machine. Distributed computing, by contrast, refers to the notion of divide and conquer: executing subtasks on different machines and then merging the results.

Now that we have entered the so-called big data era, the distinction is blurring. In fact, many systems today use a combination of parallel and distributed computing.
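The divide-and-conquer idea described above can be sketched on a single machine with Python's standard library. This is a minimal illustration, not code from the book: the function names (chunked, sum_of_squares, parallel_sum_of_squares) are hypothetical, and the workers here are threads rather than separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into n_chunks contiguous slices of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Subtask: runs independently on one worker."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Divide the data, run the subtasks concurrently, then merge the results."""
    chunks = chunked(data, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)
```

In CPython, threads do not speed up CPU-bound work like this. Passing concurrent.futures.ProcessPoolExecutor as executor_cls (from a script guarded by `if __name__ == "__main__":`) is one way to use several cores, while a truly distributed version would send each chunk to a different machine.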

...

Understanding MPI

Usually, a parallel algorithm needs to move data between different engines. One way to do so is to pull the data and then push it using the direct view. However, this method is quite slow, since all the data has to go through the controller to the client and then back through the controller to its final destination. A much better way of moving data between engines is to use a message-passing library, such as the Message Passing Interface (MPI). IPython's parallel computing architecture has been designed to integrate with MPI. To download and install MPI on Windows, readers can refer to https://msdn.microsoft.com/en-us/library/bb524831%28v=vs.85%29.aspx.

In addition, you could install the mpi4py package.
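As a rough sketch of what message passing looks like with mpi4py, the following splits a sum across MPI processes and combines the partial results on one process. It assumes that mpi4py and an MPI runtime are installed; the helper local_range() and the function main() are hypothetical names chosen for this illustration.

```python
def local_range(n, size, rank):
    """Return the [start, stop) part of range(n) owned by this rank."""
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def main(n=1_000_000):
    # Imported here so that the helper above works without MPI installed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # id of this process
    size = comm.Get_size()  # total number of processes

    # Each process sums only its own part of the range.
    start, stop = local_range(n, size, rank)
    local_sum = sum(range(start, stop))

    # Partial sums are combined on rank 0 by message passing.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print(f"sum of 0..{n - 1} = {total}")
```

To try it, save the code to a file (say, mpi_sum.py, a name assumed here), add a call to main() at the bottom, and launch it with something like mpiexec -n 4 python mpi_sum.py.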

R package Rmpi

...

Parallel processing in Python

The following example computes the digits of π and is borrowed from http://ipyparallel.readthedocs.io/en/latest/demos.html#parallel-examples. Since the first part needs a function called one_digit_freqs(), we could run a Python program called pidigits.py, located at .../ipython-ipython-in-depth-4d98937\examples\Parallel Computing\pi; this path depends on where readers downloaded and saved their files.

To complete our part, we simply include it in the first part of the program, as shown here:

import matplotlib.pyplot as plt
import sympy
import numpy as np 
#
def plot_one_digit_freqs(f1):
    """
    Plot one digit frequency counts using matplotlib.
    """
    ax = plt.plot(f1,'bo-')
    plt.title('Single digit counts in pi')
    plt.xlabel('Digit')
    plt.ylabel...
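For readers who do not have pidigits.py at hand, the following is a minimal stand-in for counting single-digit frequencies. It is not the one_digit_freqs() function from that file: the names digit_counts() and digit_frequencies() are hypothetical, and the first 50 decimal digits of π are hard-coded so that the sketch has no dependencies.

```python
from collections import Counter

# First 50 decimal digits of pi (after the decimal point).
PI_DECIMALS = "14159265358979323846264338327950288419716939937510"


def digit_counts(digits):
    """Return a list of 10 counts: how often each of 0..9 occurs in digits."""
    counts = Counter(digits)
    return [counts[str(d)] for d in range(10)]


def digit_frequencies(digits):
    """Return the counts normalized so that they sum to 1."""
    counts = digit_counts(digits)
    total = sum(counts)
    return [c / total for c in counts]
```

A list such as digit_counts(PI_DECIMALS) has one entry per digit, which is the kind of input a plotting function like plot_one_digit_freqs() expects.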

Compute nodes

A compute node provides the ephemeral storage, networking, memory, and processing resources that can be consumed by virtual machine instances. The cloud system supports two types of compute nodes: ESX clusters, which are created in VMware vCenter Server, and KVM compute nodes, which are created manually. In the previous chapter, we mentioned the concept of the cloud.

Within a cloud environment, which is quite useful for a more complex project, compute nodes form the core of resources. Typically, these nodes supply the processing, memory, network, and storage that virtual machine instances need. When an instance is created, it is matched to a compute node with the available resources. A compute node can host multiple instances until all of its resources are consumed.
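The matching of instances to compute nodes can be pictured with a toy first-fit model. This is only an illustration under simplifying assumptions (two resource types and a first-fit rule); real cloud schedulers are considerably more elaborate, and the ComputeNode class and schedule() function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    name: str
    free_vcpus: int
    free_ram_gb: int
    instances: list = field(default_factory=list)

    def can_host(self, vcpus, ram_gb):
        """True if this node still has enough free CPU and memory."""
        return self.free_vcpus >= vcpus and self.free_ram_gb >= ram_gb

    def place(self, instance, vcpus, ram_gb):
        """Reserve the resources and record the instance on this node."""
        self.free_vcpus -= vcpus
        self.free_ram_gb -= ram_gb
        self.instances.append(instance)


def schedule(nodes, instance, vcpus, ram_gb):
    """Place the instance on the first node with enough free resources."""
    for node in nodes:
        if node.can_host(vcpus, ram_gb):
            node.place(instance, vcpus, ram_gb)
            return node.name
    return None  # no node has enough free resources
```

With two nodes of 4 and 8 free vCPUs, for example, repeated calls to schedule() fill the first node until its resources are consumed and only then move on to the second.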

Anaconda add-ons

The following information is from the Anaconda Addon Development Guide.

An Anaconda add-on is a Python package containing a directory with an __init__.py file and other source directories (subpackages) inside. Because Python allows importing each package name only once, the package top-level directory name must be unique. At the same time, the name can be arbitrary, because add-ons are loaded regardless of their name; the only requirement is that they must be placed in a specific directory.

The suggested naming convention for add-ons is therefore similar to that of Java packages or D-Bus service names: prefix the add-on name with the reversed domain name of your organization, using underscores (_) instead of dots so that the directory name is a valid identifier for a Python package. An example add-on name following these suggestions would therefore be, for example...
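The naming convention can be expressed as a small helper. This is a sketch for illustration only: addon_package_name() is a hypothetical function, not part of any Anaconda tooling, and example.com is a placeholder domain.

```python
def addon_package_name(domain, addon):
    """Build a directory name: reversed domain plus add-on name, joined by underscores."""
    parts = list(reversed(domain.lower().split("."))) + [addon.lower()]
    name = "_".join(parts).replace("-", "_").replace(" ", "_")
    # The directory name must be a valid identifier to be importable as a package.
    if not name.isidentifier():
        raise ValueError(f"{name!r} is not a valid Python package name")
    return name
```

For the placeholder domain example.com and an add-on called "hello world", the helper returns com_example_hello_world.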

Introduction to HPCC

HPCC stands for High-Performance Computing Cluster. It is also known as Data Analytics Supercomputer (DAS), an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on computing clusters to provide high-performance, data-parallel processing for various big data applications. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

You can see a simple example of using Wharton's HPCC system at https://research-it.wharton.upenn.edu/documentation/. Wharton's HPC Cluster (HPCC) provides access...

Summary

In this chapter, we have discussed several R packages such as plyr, snow, Rmpi, and parallel, and the Python package ipyparallel. In addition, we mentioned compute nodes, project add-ons, parallel processing, and HPCC.

Now we've arrived at the end of our journey. We wish you good luck with the amazing endeavors you'll take up using the knowledge you've gained from this book.

Review questions and exercises

What is distributed computing? Why is it useful?
From where could we get a task view for parallel computing?
From the task view related to parallel computing, we can find many R packages. Identify a few of them. Install two and find a few examples of using these two packages.
Conduct a word frequency analysis using: The Count of Monte Cristo by Alexandre Dumas (input file is at http://www.gutenberg.org/files/1184/1184-0.txt).
From where could we find more information about Anaconda add-ons?
What is HPCC and how does it work?
How do we find the path of an installed R package?
In the sample Jupyter notebook about parallel Monte-Carlo options pricing, the related Asian options are defined here, where Call(Asian) is the Asian call option, Put(Asian) is the Asian put option, K is the exercise price, and P_avg is the average price over the path:

Call(Asian) = max(P_avg - K, 0)
Put(Asian) = max(K - P_avg, 0)

Write a Jupyter notebook to use...


Authors (2)

Yuxing Yan

Yuxing Yan graduated from McGill University with a PhD in finance. Over the years, he has been teaching various finance courses at eight universities: McGill University and Wilfrid Laurier University (in Canada), Nanyang Technological University (in Singapore), Loyola University of Maryland, UMUC, Hofstra University, University at Buffalo, and Canisius College (in the US). His research and teaching areas include: market microstructure, open-source finance and financial data analytics. He has 22 publications including papers published in the Journal of Accounting and Finance, Journal of Banking and Finance, Journal of Empirical Finance, Real Estate Review, Pacific Basin Finance Journal, Applied Financial Economics, and Annals of Operations Research. He is good at several computer languages, such as SAS, R, Python, Matlab, and C. His four books are related to applying two pieces of open-source software to finance: Python for Finance (2014), Python for Finance (2nd ed., expected 2017), Python for Finance (Chinese version, expected 2017), and Financial Modeling Using R (2016). In addition, he is an expert on data, especially on financial databases. From 2003 to 2010, he worked at Wharton School as a consultant, helping researchers with their programs and data issues. In 2007, he published a book titled Financial Databases (with S.W. Zhu). This book is written in Chinese. Currently, he is writing a new book called Financial Modeling Using Excel — in an R-Assisted Learning Environment. The phrase "R-Assisted" distinguishes it from other similar books related to Excel and financial modeling. 
New features include using a huge amount of public data related to economics, finance, and accounting; an efficient way to retrieve data: 3 seconds for each time series; a free financial calculator, showing 50 financial formulas instantly, 300 websites, 100 YouTube videos, 80 references, paperless for homework, midterms, and final exams; easy to extend for instructors; and especially, no need to learn R.

James Yan

James Yan is an undergraduate student at the University of Toronto (UofT), currently double-majoring in computer science and statistics. He has hands-on knowledge of Python, R, Java, MATLAB, and SQL. During his study at UofT, he has taken many related courses, such as Methods of Data Analysis I and II, Methods of Applied Statistics, Introduction to Databases, Introduction to Artificial Intelligence, and Numerical Methods, including a capstone course on AI in clinical medicine.
