Parallelize It

In this article by Elliot Forbes, the author of the book Learning Concurrency in Python, will explain concurrency and parallelism thoroughly, and bring necessary CPU knowledge related to it.

Concurrency and parallelism are two concepts that are commonly confused. The reality though is that they are quite different and if you designed software to be concurrent when instead you needed parallel execution then you could be seriously impacting your software’s true performance potential.

Due to this, it's vital to know exactly what the two concepts mean so that you can understand the differences. Through knowing these differences you’ll be putting yourself at a distinct advantage when it comes to designing your own high performance software in Python.

In this article we’ll be covering the following topics:

  • What is concurrency and what are the major bottlenecks that impact our applications?
  • What is parallelism and how does this differ from concurrency?

(For more resources related to this topic, see here.)

Understanding concurrency

Concurrency is essentially the practice of doing multiple things at the same time, but not specifically in parallel. It can help us to improve the perceived performance of our applications and it can also improve the speed at which our applications run.

The best way to think of how concurrency works is to imagine one person working on multiple tasks and quickly switching between these tasks. Imagine this one person was working concurrently on a program and at the same time dealing with support requests. This person would focus primarily on the writing of their program and quickly context switch to fixing a bug or dealing with a support issue should there be one. Once they complete the support task, they could context switch again back to writing their program really quickly.

However, in computing there are typically two performance bottlenecks that we have to watch out for and guard against when writing our programs. It’s important to know the differences between the two bottlenecks as if we tried to apply concurrency to a CPU based bottleneck then you could find that the program actually starts to see performance decreases as opposed to increases. And if you tried to apply parallelism to a task that really require a concurrent solution then again you could see the same performance hits.

Properties of concurrent systems

All concurrent systems share a similar set of properties, these can be defined as:

  • Multiple actors: This represent the different processes and threads all trying to actively make progress on their own tasks. We could have multiple processes that contain multiple threads all trying to run at the same time.
  • Shared Resources: This represents the memory, the disk and other resources that the actors in the above group must utilize in order to perform what they need to do.
  • Rules: All concurrent systems must follow a strict set of rules that define when actors can and can’t acquire locks, access memory, modify state and so on. These rules are vital in order for these concurrent systems to work otherwise our programs would tear themselves apart.

Input/Output bottlenecks

Input/Output bottlenecks, or I/O bottlenecks for short, are bottlenecks where your computer spends more time waiting on various inputs and outputs than it does on processing the information.

You’ll typically find this type of bottleneck when you are working with an I/O heavy application. We could take your standard web browser as an example of a heavy I/O application. In a browser we typically spend a significantly longer amount of time waiting for network requests to finish for things like style sheets, scripts or HTML pages to load as opposed to rendering this on the screen.

If the rate at which data is requested is slower than the rate than which it is consumed at then you have yourself an I/O bottleneck.

One of the main ways to improve the speed of these applications typically is to either improve the speed of the underlying I/O by buying more expensive and faster hardware or to improve the way in which we handle these I/O requests.

A great example of a program bound by I/O bottlenecks would be a web crawler. Now the main purpose of a web crawler is to traverse the web and essentially index web pages so that they can be taken into consideration when Google runs its search ranking algorithm to decide the top 10 results for a given keyword.

We’ll start by creating a very simple script that just requests a page and times how long it takes to request said web page:

import urllib.request
import time
t0 = time.time()
req = urllib.request.urlopen('http://www.example.com')
pageHtml = req.read()
t1 = time.time()
print("Total Time To Fetch Page: {} Seconds".format(t1-t0))

If we break down this code, first we import the two necessary modules, urllib.request and the time module. We then record the starting time and request the web page: example.com and then record the ending time and printing out the time difference.

Now say we wanted to add a bit of complexity and follow any links to other pages so that we could index them in the future. We could use a library such as BeautifulSoup in order to make our lives a little easier:

import urllib.request

import time

from bs4 
import
 BeautifulSoup



t0 =
 time.time()

req =
 urllib.request.urlopen(
'http://www.example.com'
)

t1 =
 time.time()
print("Total Time To Fetch Page: {} Seconds".format(t1-t0))

soup =
 BeautifulSoup(req.read(), 
"html.parser"
)



for link 
in
 soup.find_all(
'a'
):

  print
(link.get(
'href'
))



t2 =
 time.time()

print(
"Total Execeution Time: 
{}
 Seconds"
.format)

When I execute the above program I see the results like so in my terminal:

You’ll notice from this output that the time to fetch the page is over a quarter of a second. Now imagine we wanted to run our web crawler for a million different web pages, our total execution time would be roughly a million times longer.

The main real cause for this enormous execution time would be purely down to the I/O bottleneck we face in our program. We spend a massive amount of time waiting on our network requests and a fraction of that time parsing our retrieved page for further links to crawl.

Understanding parallelism

Parallelism is the art of executing two or more actions simultaneously as opposed to concurrency in which you make progress on two or more things at the same time. This is an important distinction, and in order to achieve true parallelism, we’ll need multiple processors on which to run our code on at the same time.

A good analogy to think of parallel processing is to think of a queue for coffee. If you had say two queues of 20 people all waiting to use this coffee machine so that they can get through the rest of the day. Well this would be an example of concurrency. Now say you were to introduce a second coffee machine into the mix, this would then be an example of something happening in parallel. This is exactly how parallel processing works, each of the coffee machines in that room would represent one processing core and are able to make progress on tasks simultaneously.

A real life example which highlights the true power of parallel processing is your computer’s graphics card. These graphics cards tend to have hundreds if not thousands of individual processing cores that live independently and can compute things at the same time. The reason we are able to run high-end PC games at such smooth frame rates is due to the fact we’ve been able to put so many parallel cores onto these cards.

CPU bound bottleneck

A CPU bound bottleneck is typically the inverse of an I/O bound bottleneck. This bottleneck is typically found in applications that do a lot of heavy number crunching or any other task that is computationally expensive. These are programs for which the rate at which they execute is bound by the speed of the CPU, if you throw a faster CPU in your machine you should see a direct increase in the speed of these programs.

If the rate at which you are processing data far outweighs the rate at which you are requesting data then you have a CPU Bound Bottleneck.

How do they work on a CPU?

Understanding the differences outlined in the previous section between both concurrency and parallelism is essential but it’s also very important to understand more about the systems that your software will be running on. Having an appreciation of the different architecture styles as well as the low level mechanics helps you make the most informed decisions in your software design.

Single core CPUs

Single core processors will only ever execute one thread at any given time as that is all they are capable of. However, in order to ensure that we don’t see our applications hanging and being unresponsive, these processors rapidly switch between multiple threads of execution many thousands of times per second. This switching between threads is what is called a "context switch" and involves storing all the necessary information for a thread at a specific point of time and then restoring it at a different point further down the line.

Using this mechanism of constantly saving and restoring threads allows us to make progress on quite a number of threads within a given second and it appears like the computer is doing multiple things at once. It is in fact doing only one thing at any given time but doing it at such speed that it’s imperceptible to users of that machine.

When writing multi-threaded applications in Python it is important to note that these context switches are computationally quite expensive. There is no way to get around this unfortunately and much of the design of operating systems these days is about optimizing for these context switches so that we don’t feel the pain quite as much.

Advantages of single core CPUs:

  • They do not require any complex communication protocols between multiple cores
  • Single core CPUs require less power which typically makes them better suited for IoT devices

Disadvantages:

  • They are limited in speed and larger applications will cause them to struggle and potentially freeze
  • Heat dissipation issues place a hard limit on how fast a single core CPU can go

Clock rate

One of the key limitations to a single-core application running on a machine is the Clock Speed of the CPU. When we talk about Clock rate, we are essentially talking about how many clock cycles a CPU can execute every second.

For the past 10 years we have watched as manufacturers have been able to surpass Moore’s law which was essentially an observation that the number of transistors one was able to place on a piece of silicon was able to double roughly every 2 years.

This doubling of transistors every 2 years paved the way for exponential gains in single-cpu clock rates and CPUs went from the low MHz to the 4-5GHz clock speeds we are seeing on Intel’s i7 6700k processor.

But with transistors getting as small as a few nanometers across, this is inevitably coming to an end. We’ve started to hit the boundaries of Physics and unfortunately if we go any smaller we’ll start to be hit by the effects of quantum tunneling. Due to these physical limitations we need to start looking at other methods in order to improve the speeds at which we are able to compute things.

This is where Materlli’s Model of Scalability comes into play.

Martelli model of scalability

The author of Python Cookbook, Alex Martelli came up with a model on scalability which Raymond Hettinger discussed in his brilliant hour-long talk "Thinking about Concurrency", which he gave at PyCon Russia 2016. This model represents three different types of problem and programs:

  • 1 core: single threaded and single process programs
  • 2-8 cores: multithreaded and multiprocess programs
  • 9+ cores: distributed computing

The first category, the single core, single threaded category is able to handle a growing number of problems due to the constant improvements of the speed of single core CPUs and as a result the second category is being rendered more and more obsolete. We will eventually hit a limit with the speed at which a 2-8 core system can run at and as a result we’ll have to start looking at other methods such as multiple CPU systems or even distributed computing.

If your problem is worth solving quickly and it requires a lot of power then the sensible approach is to go with the distributed computing category and spin up multiple machines and multiple instances of your program in order to tackle your problems in a truly parallel manner. Large enterprise systems that handle hundreds of millions of requests are the main inhabitants of this category. You’ll typically find that these enterprise systems are deployed on tens, if not hundreds of high performance, incredibly powerful servers in various locations across the world.

Time-Sharing - the task scheduler

One of the most important parts of the Operating System is the task scheduler. This acts as the maestro of the orchestra and directs everything with impeccable precision and incredible timing and discipline. This maestro has only one real goal and that is to ensure that every task has a chance to run through till completion, the when and where of a task’s execution however is non-deterministic. That is to say, if we gave a task scheduler two identical competing processes one after the other, there is no guarantee that the first process will complete first. This non-deterministic nature is what makes concurrent programming so challenging.

An excellent example that highlights this non-deterministic behavior is say we take the following code:

import threading
import time
import random
counter = 1
def workerA():
  global counter
  while counter < 1000:
    counter += 1
    print("Worker A is incrementing counter to {}".format(counter))
    sleepTime = random.randint(0,1)
    time.sleep(sleepTime)
def workerB():
  global counter
  while counter > -1000:
    counter -= 1
    print("Worker B is decrementing counter to {}".format(counter))
    sleepTime = random.randint(0,1)
    time.sleep(sleepTime)
def main():
  t0 = time.time()
  thread1 = threading.Thread(target=workerA)
  thread2 = threading.Thread(target=workerB)
  thread1.start()
  thread2.start()
  thread1.join()
  thread2.join()
  t1 = time.time()
  print("Execution Time {}".format(t1-t0))
if __name__ == '__main__':
  main()

Here we have two competing threads in Python that are each trying to accomplish their own goal of either decrementing the counter to 1,000 or conversely incrementing it to 1,000. In a single core processor there is the possibility that worker A managers to complete its task before worker B has a chance to execute and the same can be said for worker B. However there is a third potential possibility and that is that the task scheduler continues to switch between worker A and worker B for an infinite number of times and never complete.

The above code incidentally also shows one of the dangers of multiple threads accessing shared resources without any form of synchronization. There is no accurate way to determine what will happen to our counter and as such our program could be considered unreliable.

Multi-core processors

We’ve now got some idea as to how single-core processors work, but now it’s time to take a look at multicore processors. Multicore processors contain multiple independent processing units or “cores”. Each core contains everything it needs in order to execute a sequence of stored instructions. These cores each follow their own cycle:

  • Fetch - This step involves fetching instructions from program memory. This is dictated by a program counter (PC) which identifies the location of the next step to execute.
  • Decode - The core converts the instruction that it has just fetched and converts it into a series of signals that will trigger various other parts of the CPU.
  • Execute - Finally we perform the execute step. This is where we run the instruction that we have just fetched and decoded and typically the results of this execution are then stored in a CPU register.

Having multiple cores offers us the advantage of being able to work independently on multiple Fetch -> Decode -> Execute cycles. This style of architecture essentially enables us to create higher performance programs that leverage this parallel execution.

Advantages of multicore processors:

  • We are no longer bound by the same performance limitations that a single core processor is bound
  • Applications that are able to take advantage of multiple cores will tend to run faster if well designed

Disadvantages of multicore processors:

  • They require more power than your typical single core processor.
  • Cross-core communication is no simple feat, we have multiple different ways of doing this.

Summary

In this article we covered a multitude of topics including the differences between Concurrency and Parallelism. We also looked at how they both leverage the CPU in different ways.

Resources for Article:


Further resources on this subject:


You've been reading an excerpt of:

Learning Concurrency in Python

Explore Title