Python for Parallel Computing

This chapter covers parallel computing and the module mpi4py. Complex and time-consuming computational tasks can often be divided into subtasks, which can be carried out simultaneously if there is capacity for it. When these subtasks are independent of each other, executing them in parallel can be especially efficient. Situations where subtasks have to wait until another subtask is completed are less suited for parallel computing.

Consider the task of computing an integral of a function by a quadrature rule:

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{N} w_i f(x_i)$$

with $x_i \in [a,b]$. If the evaluation of $f$ is time-consuming and $N$ is large, it would be advantageous to split the problem into two or several subtasks of smaller size:

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{N_1} w_i f(x_i) + \sum_{i=N_1+1}^{N_2} w_i f(x_i) + \dots + \sum_{i=N_{m-1}+1}^{N} w_i f(x_i)$$

18.1 Multicore computers and computer clusters

Most modern computers are multicore computers. For example, the laptop used when writing this book has an Intel® i7-8565U processor with four cores and two threads per core.

What does this mean? Four cores on a processor allow four computational tasks to be performed in parallel. Four cores with two threads each are often counted as eight CPUs by system monitors. For the purposes of this chapter, only the number of cores matters.

These cores share a common memory—the RAM of your laptop—and have individual memory in the form of cache memory:

Figure 18.1: A multicore architecture with shared and local cache memory

The cache memory is used optimally by its core and is accessed at high speed, while the shared memory can be accessed by all cores of one CPU. On top of that, there is the computer's RAM and, finally, the hard disk, which is also shared memory.

In the next section, we will see how a computational...

18.2 Message passing interface (MPI)

Programming for several cores or on a computer cluster with distributed memory requires special techniques. Here, we describe message passing and the related tools defined in the MPI standard. These tools are similar across programming languages, such as C, C++, and FORTRAN, and are realized in Python by the module mpi4py.

18.2.1 Prerequisites

You need to install this module first by executing the following in a terminal window:

conda install mpi4py

The module is imported by adding the following line to your Python script:

import mpi4py as mpi

The execution of a parallelized code is done from a terminal with the command mpiexec. Assuming that your code is stored in the file script.py, executing this code on a computer with a four-core CPU is done in the terminal window by running the following command:

mpiexec -n 4 python script.py

Alternatively, to execute the same script on a cluster with two computers, run the following in a terminal window:

mpiexec --hostfile=hosts.txt python script.py

You have to provide a file hosts.txt containing the names or IP addresses of the computers, together with the number of their cores, that you want to bind into a cluster:

# Content of hosts.txt
192.168.1.25:4  # master computer with 4 cores
192.168.1.101:2 # worker computer with 2 cores

The Python script, here script.py, has to be copied...

18.3 Distributing tasks to different cores

When executed on a multicore computer, we can think of mpiexec as copying the given Python script to each of the cores and running a copy on each of them. As an example, consider the one-line script print_me.py containing the command print("Hello it's me"): when executed with mpiexec -n 4 python print_me.py, it generates the same message on the screen four times, each sent from a different core.

In order to be able to execute different tasks on different cores, we have to be able to distinguish these cores in the script.

To this end, we create a so-called communicator instance, which organizes the communication between the world (that is, input and output units such as the screen, the keyboard, or a file) and the individual cores. Furthermore, each individual core is given an identifying number, called its rank:

from mpi4py import MPI
comm=MPI.COMM_WORLD # making a communicator instance
rank=comm.Get_rank() # querying for the numeric identifier of the core
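
As a minimal sketch of how the rank can then be used to let different processes perform different tasks (the printed messages are purely illustrative and not taken from the book's listing):

from mpi4py import MPI
comm=MPI.COMM_WORLD # making a communicator instance
rank=comm.Get_rank() # the numeric identifier of this process

# branch on the rank so that each process performs its own task
if rank==0:
    print("Process 0: I could, for instance, prepare the input data")
else:
    print(f"Process {rank}: I could work on subtask {rank}")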

18.3.1 Information exchange between processes

There are different ways to send and receive information between processes:

  • Point-to-point communication
  • One-to-all and all-to-one
  • All-to-all

In this section, we will introduce point-to-point, one-to-all, and all-to-one communication.

Speaking to a neighbor and letting information pass along a street this way is an example from daily life of the first communication type from the preceding list, while the second can be illustrated by the daily news, spoken by one person and broadcast to a big group of listeners.

Figure 18.2: Point-to-point communication and one-to-all communication

In the next subsections, we will study these different communication types in a computational context.

18.3.2 Point-to-point communication

Point-to-point communication directs information flow from one process to a designated receiving process. We first describe the methods and features by considering a ping-pong situation and a telephone-chain situation, and we explain the notion of blocking.

Point-to-point communication is applied in scientific computing, for instance in random-walk or particle-tracing applications on domains that are divided into a number of subdomains corresponding to the number of processes that can be carried out in parallel.

The ping-pong example assumes that we have two processors sending an integer back and forth to each other and increasing its value by one.

We start by creating a communicator object and checking that we have two processes available:

from mpi4py import MPI
comm=MPI.COMM_WORLD # making a communicator instance
rank=comm.Get_rank() # querying for the numeric identifier of the core
size=comm.Get_size() # the total number of cores assigned
if not (size...
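
The listing is cut off here. As a self-contained sketch of how such a ping-pong exchange could be completed (the fixed number of exchanges is an illustrative assumption, not taken from the book's listing):

from mpi4py import MPI
comm=MPI.COMM_WORLD # making a communicator instance
rank=comm.Get_rank() # the numeric identifier of the core
size=comm.Get_size() # the total number of cores assigned
if not (size==2):
    raise Exception("This sketch requires exactly two processes")

count=0
num_exchanges=5 # illustrative choice
for _ in range(num_exchanges):
    if rank==0:
        comm.send(count+1, dest=1) # increase the value and send it over
        count=comm.recv(source=1)  # wait until it comes back, increased again
    else:
        count=comm.recv(source=0)  # wait for the value
        count+=1                   # increase it
        comm.send(count, dest=0)   # and send it back
print(f"Process {rank} ends with count={count}")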

18.3.3 Sending NumPy arrays

The commands send and recv are high-level commands. That means they do under-the-hood work that saves the programmer time and avoids possible errors. They allocate memory after having internally deduced the datatype and the amount of buffer data needed for communication. This is done internally on a lower level based on C constructions.

NumPy arrays are objects that themselves build on such C-buffer-like objects, so when sending and receiving NumPy arrays, you can gain efficiency by using the lower-level communication counterparts Send and Recv (mind the capitalization!).

In the following example, we send an array from one processor to another:

 

from mpi4py import MPI
comm=MPI.COMM_WORLD # making a communicator instance
rank=comm.Get_rank() # querying for the numeric identifier of the core
size=comm.Get_size() # the total number of cores assigned
import numpy as np

if rank==0:
    A = np.arange(700)
    comm.Send(A, dest=1...
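
The listing is cut off here. As a self-contained sketch of both sides of such a transfer (the receiving side follows the usual Send/Recv pattern and is an assumption, not the book's exact listing):

from mpi4py import MPI
import numpy as np
comm=MPI.COMM_WORLD # making a communicator instance
rank=comm.Get_rank() # the numeric identifier of the core

if rank==0:
    A = np.arange(700)           # the data to be sent
    comm.Send(A, dest=1)         # buffer-based send to process 1
elif rank==1:
    A = np.empty(700, dtype=int) # the receiver allocates a buffer of matching size and dtype
    comm.Recv(A, source=0)       # the received data is written into A in place
    print(f"Process 1 received an array; A[-1]={A[-1]}")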

18.3.4 Blocking and non-blocking communication

The commands send and recv and their buffer counterparts Send and Recv are so-called blocking commands. That means that a send is completed only when the corresponding send buffer is freed. When this happens depends on several factors, such as the particular communication architecture of your system and the amount of data to be communicated. Finally, a send is considered complete when the corresponding recv has received all the information. Without such a recv, it will wait forever. This is called a deadlock situation.

The following script demonstrates a situation with the potential for deadlock. Both processes send simultaneously. If the amount of data to be communicated is too big to be stored, the command send waits for a corresponding recv to empty the pipe, but recv is never invoked because both processes are still waiting in send. That's a deadlock.

from mpi4py...
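
The listing is cut off here. A self-contained sketch of such a potential deadlock (the two-process assumption and the array size are illustrative; with small messages the sends may still be buffered internally and complete):

from mpi4py import MPI
import numpy as np
comm=MPI.COMM_WORLD
rank=comm.Get_rank()

other = (rank + 1) % 2           # assumes exactly two processes
data = np.random.rand(1_000_000) # large enough that Send may not be buffered internally
comm.Send(data, dest=other)      # both processes send first ...
buffer = np.empty_like(data)
comm.Recv(buffer, source=other)  # ... and only then receive: potential deadlock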

18.3.5 One-to-all and all-to-one communication

When a complex task depending on a larger amount of data is divided into subtasks, the data also has to be divided into portions relevant to the related subtasks, and the partial results have to be assembled and processed into a final result.

Let's consider as an example the scalar product of two vectors $u, v \in \mathbb{R}^N$, divided into subtasks:

$$u \cdot v = \sum_{i=1}^{N} u_i v_i = \sum_{i=1}^{N_1} u_i v_i + \sum_{i=N_1+1}^{N_2} u_i v_i + \dots + \sum_{i=N_{m-1}+1}^{N} u_i v_i$$

with $1 \le N_1 < N_2 < \dots < N_{m-1} < N$. All subtasks perform the same operations on portions of the initial data, the results have to be summed up, and possibly any remaining operations have to be carried out.

We have to perform the following steps:

  1. Creating the vectors u and v
  2. Dividing them into m subvectors with a balanced number of elements, that is, N/m elements if N is divisible by m, otherwise some subvectors have one element more
  3. Communicating each subvector to "its" processor
  4. Performing the scalar...

Preparing the data for communication

First, we will look into Step 2. It is a nice exercise to write a script that splits a vector into m pieces with a balanced number of elements. Here is one suggestion for such a script, among many others:

import numpy

def split_array(vector, n_processors):
    # splits an array into a number of subarrays
    # vector: one-dimensional ndarray or a list
    # n_processors: integer, the number of subarrays to be formed
    n=len(vector)
    n_portions, rest = divmod(n,n_processors) # division with remainder
    # get the amount of data per processor and distribute the rest on
    # the first processors so that the load is more or less equally
    # distributed
    # construction of the indexes needed for the splitting
    counts = [0] + [n_portions + 1 if p < rest
                    else n_portions for p in range(n_processors)]
    counts=numpy.cumsum(counts)
    start_end=zip(counts[:-1],counts[1:]) # a generator
    slice_list=(slice(*sl) for sl in start_end) # a generator comprehension
    return [vector[sl] for sl in slice_list] # a list of the subarrays
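
As a quick check of the splitting, using the function above (the vector and the number of portions are arbitrary illustration values, not taken from the book):

import numpy
vector = numpy.arange(10)         # ten elements to be distributed
portions = split_array(vector, 3) # three processors assumed
print([p.tolist() for p in portions])
# prints [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]:
# the remainder of 10/3 goes to the first subvector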

The commands – scatter and gather

Now we are ready to look at the entire script for our demo problem, the scalar product:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocessors = comm.Get_size()
import splitarray as spa

if rank == 0:
    # Here we generate data for the example
    n = 150
    u = 0.1*np.arange(n)
    v = - u
    u_split = spa.split_array(u, nprocessors)
    v_split = spa.split_array(v, nprocessors)
else:
    # On all processors we need variables with these names,
    # otherwise we would get an Exception "Variable not defined" in
    # the scatter command below
    u_split = None
    v_split = None
# These commands run now on all processors
u_split = comm.scatter(u_split, root=0) # the data is distributed
                                        # portion-wise from root
v_split = comm.scatter(v_split, root=0)
# Each processor computes its part of the scalar product
partial_dot = u_split@v_split
# Each processor reports its result...
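
The listing is cut off here. Continuing the script above, the remaining step is to collect the partial results on the root and sum them; a sketch using gather (the wording of the print statement is illustrative):

# Each processor reports its result back to the root process
partial_dots = comm.gather(partial_dot, root=0) # a list on the root, None elsewhere
if rank == 0:
    total_dot = sum(partial_dots) # assemble the final result on the root
    print(f'The parallel scalar product of u and v'
          f' on {nprocessors} processors is {total_dot}')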

A final data reduction operation – the command reduce

The parallel scalar product example is typical of many other tasks in the way results are handled: the data coming from all processors is reduced to a single number in the last step. Here, the root processor sums up all partial results from the other processors. The command reduce can be used efficiently for this task. We modify the preceding code by letting reduce do the gathering and summation in one step. The last lines of the preceding code are modified in this way:

......... modification of the script above .....
# Each processor reports its result back to the root
# and these results are summed up
total_dot = comm.reduce(partial_dot, op=MPI.SUM, root=0)

if rank==0:
    print(f'The parallel scalar product of u and v'
          f' on {nprocessors} processors is {total_dot}.\n'
          f'The difference to the serial computation'
          f' is {abs(total_dot-u@v)}')

Other frequently applied...

Sending the same message to all

Another collective command is the broadcasting command bcast. In contrast to scatter it is used to send the same data to all processors. Its call is similar to that of scatter:

data = comm.bcast(data, root=0)

but it is the total data, and not a list of portioned data, that is sent. Again, the root processor can be any processor; it is the processor that prepares the data to be broadcast.
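
A minimal self-contained sketch of its use (the dictionary content is an arbitrary illustration):

from mpi4py import MPI
comm=MPI.COMM_WORLD
rank=comm.Get_rank()

if rank==0:
    data = {'step_size': 0.1, 'n_steps': 100} # parameters known only to the root
else:
    data = None
data = comm.bcast(data, root=0) # afterwards every process holds the same dictionary
print(f"Process {rank} received {data}")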

Buffered data

In an analogous manner, mpi4py provides the corresponding collective commands for buffer-like data such as NumPy arrays by capitalizing the command: scatter/Scatter, gather/Gather, reduce/Reduce, bcast/Bcast.
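
As a sketch of one of these buffered variants, Bcast, every process has to provide a preallocated buffer of matching size and dtype (the array size below is an illustrative choice):

from mpi4py import MPI
import numpy as np
comm=MPI.COMM_WORLD
rank=comm.Get_rank()

u = np.empty(100, dtype=np.float64) # every process allocates the buffer
if rank==0:
    u[:] = 0.1*np.arange(100)       # only the root fills it with data
comm.Bcast(u, root=0)               # the array is filled in place on all other processes
print(f"Process {rank}: u[0]={u[0]}, u[-1]={u[-1]}")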

18.4 Summary

In this chapter, we saw how to execute copies of the same script on different processors in parallel. Message passing allows communication between these different processes. We saw point-to-point communication and the two collective communication types, one-to-all and all-to-one. The commands presented in this chapter are provided by the Python module mpi4py, which is a Python wrapper around realizations of the MPI standard in C.

Having worked through this chapter, you are now able to write your own scripts for parallel programming, keeping in mind that we described only the most essential commands and concepts here. Grouping processes and tagging information are just two of the concepts we left out. Many of them are important for special and challenging applications, which are far too particular for this introduction.
