C# Multithreaded and Parallel Programming

By Rodney Ringler

About this book

Most modern machines have at least dual-core processors, which means that the present-day computer has the ability to multitask. Using multiple cores means your applications can process data faster and be more responsive to users. However, to fully exploit this in your applications, you need to write multithreaded code.

We will begin by covering some techniques that have been around since the beginning of .NET, including the BackgroundWorker component, timers, and the Thread class. We will use tasks, task factories, and parallel loops to develop multithreaded applications at a higher level than directly creating and managing individual threads. Finally, we will look at the tools Visual Studio provides for debugging parallel applications, common concurrent design patterns, and the latest updates in PLINQ and async.

Publication date: December 2014
Publisher: Packt
Pages: 344
ISBN: 9781849688321

 

Chapter 1. Understanding Multiprocessing and Multiple Cores

Assuming that we already know how to develop C# applications and can make the software do what we need it to, this book focuses on how we can make our C# applications perform their tasks more efficiently and faster by taking advantage of today's powerful hardware.

In the old days, computers had a single CPU that could run only one software thread at a time. A computer has a clock that controls the execution of the CPU; each time the clock counts one unit, the CPU executes an instruction. With sophisticated scheduling logic and fast clock and bus speeds, these machines made it appear that multiple software threads were running at the same time, but this was just an illusion: a single-CPU system with one core can execute only one thread's instruction per clock cycle. In this model, there is limited need to develop applications that use multiple software threads. Multithreading can still be useful for UI responsiveness, so that a long-running task does not freeze the user interface; we will discuss this more in Chapter 2, Looking at Multithreaded Classes – BackgroundWorker. So, on such hardware, multithreaded applications make software more responsive to the user but do not process tasks faster.
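As a small taste of the Thread class mentioned later in this book, here is a minimal console sketch (our own illustration with hypothetical names, not code from the book) that starts a long-running task on a second software thread so that the main thread remains free to respond, which is exactly the responsiveness benefit described above:

    // A minimal sketch: the long-running work runs on its own thread, so the
    // main thread stays responsive. Names such as ResponsivenessSketch are
    // illustrative assumptions, not from the book.
    using System;
    using System.Threading;

    class ResponsivenessSketch
    {
        static void Main()
        {
            var worker = new Thread(() =>
            {
                Thread.Sleep(3000);            // Simulate a long-running task.
                Console.WriteLine("Long-running task finished.");
            });
            worker.Start();

            // Meanwhile, the main thread can keep serving the user immediately.
            Console.WriteLine("Still responsive while the task runs in the background.");
            worker.Join();                     // Wait for the worker before exiting.
        }
    }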

We must understand some fundamentals related to the multiprocessing capabilities offered by modern computers. We will have to consider them in order to develop applications that take full advantage of parallel processing features. In this chapter, we will cover many topics to help us understand the new challenges involved in parallel programming with modern hardware. Upon reading it and following the exercises, we shall benefit in the following ways:

  • Begin a paradigm shift in software design

  • Understand the techniques needed to develop a new generation of applications

  • Get an idea of the performance improvements we can achieve with parallel programming in C#, using Gustafson's and Amdahl's laws

  • Perform accurate response-time estimation for critical processes

 

Mono-processor systems – the old gladiators


The mono-processor systems use old-fashioned, classic computer architecture. The microprocessor receives an input stream, executes the necessary processes, and sends the results in an output stream that is distributed to the indicated destinations. The following diagram represents a mono-processor system (one processor with just one core) with one user and one task running:

This working scheme is known as input-processing-output (IPO) or single instruction, single data (SISD). This basic design represents a von Neumann machine, developed by the outstanding mathematician, John von Neumann, in 1952.

Single core – only one warrior to fight against everybody

These days, systems with a single processing core, with just one logical processor, are known as single core.

When there is only one user running an application in a mono-processor machine and the processor is fast enough to deliver an adequate response time in critical operations, the model will work without any major problems.

For example, consider a robotic servant in the kitchen having just two hands to work with. If you ask him to do one task that requires both his hands, such as washing up, he will be efficient. He has a single processing core.

However, suppose that you ask him to do various tasks—wash up, clean the oven, prepare your lunch, mop the floor, cook dinner for your friends, and so on. You give him the list of tasks, and he works down the tasks. But since there is so much washing up, it's 2 p.m. before he even starts preparing your lunch—by which time you get very hungry and prepare it yourself. You need more robots when you have multiple tasks. You need multiple execution cores and many logical processors.

Each task performed by the robot is a critical operation, because you and your friends are very hungry!

Let's consider another case. We have a mono-processor computer and it has many users connected, requesting services that the computer must process. In this case, we have many input streams and many output streams, one for each connected user. As there is just one microprocessor, there is only one input channel and only one output channel. Therefore, the input streams are enqueued (multiplexing) for processing, and then the same happens with the output streams, but the order is inverted.

Doing a tiny bit of each task

Why does the robot take so long to cook dinner for you and your friends? The robot does a tiny bit of each task and then goes back to the list to see what else he should be doing. He has to keep returning to the list, reading it, and then starting a new task. It takes much longer to complete the list because he is not fast enough to finish multiple tasks in the required time. That's multiplexing, and the delay it introduces is von Neumann's bottleneck. Multiplexing takes additional time because you have just one robot to do everything you need in the kitchen.

Systems that provide concurrent access to multiple users are known as multiuser systems.

If the processor is not fast enough to deliver an adequate response time in every critical operation requested by each connected user, a bottleneck will be generated in the processor's input queue. This is well known in computer architecture as von Neumann's bottleneck.

There are three possible solutions to this problem, each consisting of upgrading or increasing one of the following:

  • The processor's speed, by using a faster robot. He will need less time to finish each task.

  • The processor's capacity to process instructions concurrently (in parallel), that is, adding more hands to the robot and the capability to use his hands to do different jobs.

  • The number of installed processors or the number of processing cores, that is, adding more robots. They can all focus on one task, but everything gets done in parallel. All tasks are completed faster and you get your lunch on time. That is multitasking.

No matter which option we pick, we must consider other factors that depend particularly on the kind of operations performed by the computer and which could generate additional bottlenecks. In some cases, the main memory access speed could be too slow (the robot takes too much time to read each task). In other cases, the disks' subsystem could have bad response times (the robot takes too much time to memorize the tasks to be done), and so on. It is important to make a detailed analysis of these topics before making a decision to troubleshoot bottlenecks.

Moreover, sometimes the amount of data that needs to be processed is too large and the problem is the transfer time between the memory and the processor, that is, the robot is too slow to move each hand. Poor robot! Why don't you buy a new model?

In the last few years, every new microarchitecture developed by microprocessor manufacturers has focused on improving the processor's capacity to run instructions in parallel (a robot with more hands). Some examples of these are the continuous duplication of processing structures such as the Arithmetic and Logic Unit (ALU) and Floating Point Unit (FPU), and the growing number of processing cores that are included in one single physical processor. Hence, you can build a super robot with many independent robots and many hands. Each sub-robot can be made to specialize in a specific task, thus parallelizing the work.

Computers used as servers, with many connected users and running applications, take greater advantage of modern processors' capacity to run instructions in parallel as compared to those computers used by only one user. We will learn how to take full advantage of those features in the applications that are developed using the C# programming language. You want the robot to get your lunch on time!

 

Multiprocessor systems – many warriors to win a battle


Systems with multiple processors are a solution to von Neumann's bottleneck, but it is first necessary to know their detailed features in order to set aside some myths about them. They do not offer an immediate performance improvement for all applications! The dilemma is that systems with multiple processors are not always the most appropriate solution to a performance problem.

There are two basic procedures to distribute tasks in systems with multiple processors:

  • Symmetrical multiprocessing (SMP): Any available processor or core can execute tasks. The most widely used and most efficient variant is n-way symmetrical multiprocessing, where n is the number of installed processors. With this procedure, each processor can execute a task in isolation from the rest, even when a particular piece of software is not optimized for multiprocessing systems. You have eight robots in the kitchen. When a robot is free, he goes back to the list to see what else he should be doing and starts working on the next task (8-way symmetrical multiprocessing).

  • Asymmetrical multiprocessing (AMP or ASMP): Usually, one processor acts as the main processor. It works as a manager and is in charge of distributing the tasks to the other available processors, using different kinds of algorithms for this purpose. You have nine robots in the kitchen. One of them is in charge of task distribution (the manager robot). He is always reading the list and watching the other robots work (the worker robots are the processors dedicated to run tasks). When a robot is free, the manager robot tells him what to do next.

The robots are expensive! You do not want to waste a robot to distribute the tasks. You would rather have robots that are independent. You want robots arranged similar to a symmetrical multiprocessing scheme.

The n-way symmetric multiprocessing procedure achieves the best performance and the best resource usage, where n can be two or more processors. With it, every available processor can execute tasks in a fully dynamic way. This is the reason why most multiprocessing systems use this approach.

A symmetric multiprocessing system with many users connected or numerous tasks running provides a good solution to von Neumann's bottleneck. The multiple input streams are distributed to the different available processors for their execution, and they generate multiple concurrent output streams, as shown in the following diagram:

But what if there is so much washing up that it takes a single robot several days to complete? A bottleneck will be generated again. Are his hands as fast as necessary? Are his legs too slow? Is he a lazy robot?

We have to take into account that if the response time of a processor to a user's request is not quick enough, a bottleneck will be generated again. However, it can also be generated by other problems along the performance waterfall. We must delve deeper into the process in order to understand these potential performance issues.

Therefore, while the number of users or the number of tasks being executed in a multiprocessing system increases, it is more likely to run out of processing capacity, among other things. If this happens, each user's tasks being executed will take longer to run, and for that reason, the response time will worsen.

Under these circumstances, there are two possible approaches to keep the response time untouched:

  • Replacing the existing processors with new ones (buying super robots): In order to apply this solution, there should be processors with better performance ratios than the ones that are currently used, or with more execution cores (to achieve a greater degree of parallelism). They also have to be compatible with the motherboard and with the sockets used by them. The great disadvantage of this approach is that the old processors are thrown out. It is also expensive.

  • Adding new processors to work with the existing ones (buying new robots to help the existing ones): In order to apply this solution, there should be free sockets on the motherboard.

 

Multiple core processors and hyperthreading


We have discussed multiple- and single-CPU computer systems. Now, let's take a look at multicore CPUs and hyperthreading. A multicore CPU has more than one physical processing unit. In essence, it acts like more than one CPU. The main difference is that the cores of a single CPU share memory cache (at some levels) instead of each having a completely separate cache. From the multithreaded parallel developer's standpoint, there is very little difference between multiple CPUs and multiple cores in a CPU. The total number of cores across all of the CPUs of a system is the number of physical processing units that can be scheduled and run in parallel, that is, the number of different software threads that can truly execute in parallel.

There is a slight performance bottleneck with having multiple cores in a CPU versus having multiple CPUs with single cores due to the sharing of the memory bus. For most applications, this is negligible.

For the parallel developer trying to estimate performance gains by using a parallel design approach, the number of physical cores is the key factor to use for estimations.

This diagram shows three physical CPUs each having two logical cores:

The following diagram shows a CPU with four logical cores, each having its own memory and then shared memory between them:

Next, let's discuss hyperthreading. This is a proprietary simultaneous multithreading (SMT) technology, which Intel has developed, that allows a single physical core in a CPU to have multiple logical cores. Each of these logical cores is called a hardware thread and can be scheduled separately by the operating system (OS) scheduler.

The OS has to implement SMT to be able to take advantage of hyperthreading technology, but today, most operating systems do. Even though each hardware thread (logical core) appears as a separate core for the OS to schedule, only one logical core per physical core can execute a software instruction at a time. Hyperthreading is explained in the following diagram:

This is important to realize when you examine your computer hardware and estimate performance gains of a parallel application. For our examples of performance estimations using Amdahl's and Gustafson's laws, we will only be counting physical cores because technically logical cores, in a single physical core, cannot execute instructions during the same clock cycle.

Taking advantage of multiple execution cores

One of the techniques to improve processing capacity consists of increasing the microprocessors' working frequency, which raises the number of instructions that can be processed in the same period. This technique has been used for many years and has evolved from the legendary 8086/8088 with its poor 4.77 MHz (megahertz) to the many GHz (gigahertz) of modern microprocessors.

Nevertheless, microprocessor manufacturers are increasingly facing difficulties in raising the frequencies because the manufacturing process becomes more complex and the generated heat is difficult to dissipate in an efficient and inexpensive way.

Note

Consider our robot instance again. You want to buy a single robot, but want him to clean the oven in 5 seconds. That is possible, but he needs plutonium as an energy source because he must move his arms and legs at a very high speed. Besides, he needs an ambient temperature of 5 degrees Fahrenheit (-15 degrees Celsius). Why? Well, because metals moving at very high speeds generate heat. You do not want a burnt robot. Moreover, plutonium is very expensive. Something similar happens with modern microprocessors.

Therefore, the other alternative is to develop new microarchitectures, incorporating first duplicated, and then quadruplicated, processing structures, and so on. In this way, there are many subprocessors in one single microprocessor's package. These subprocessors are known as execution cores or processing cores.

Microprocessors with multiple execution cores, also known as multicore, offer many complete execution cores that are interconnected in a single package. Their physical look is very similar to a conventional single core microprocessor. Nevertheless, they are equivalent to something like two or more microprocessors inside a single piece of silicon, as well as many pieces of silicon interconnected under the same physical package. Of course, we are avoiding a full discussion of the deep technical issues.

Note

At present, most available modern computers have microprocessors with at least two execution cores (dual core). Therefore, they are computers with multiprocessing capabilities.

Ever since the rise of multiple execution cores, the possibilities for combining communication architectures, the resources owned by each core, and the resources shared between cores have continued to multiply. As with everything in life, each possibility involves a trade-off between manufacturing costs and performance. For this reason, a whole new landscape appeared in the microprocessor world.

In some cases, each execution core includes L1 and L2 cache memories. Caches can be divided up into levels with smaller, faster caches accessed first, and slower, larger ones accessed next. An architecture can employ as many caches as it wants, but typically you will see two levels. This allows the architecture to house often-used data in a cache with lower latency for faster retrieval.

In other cases, L2 cache memories are shared between two or more cores. Therefore, each core will have access to the whole L2 cache.

The greater the number of resources included in each core and the fewer the resources shared with the others, the greater the processing speed achieved by each core. On the other hand, sharing resources between cores benefits applications not optimized for multiprocessing because they use a single execution core.

The robots' communications interface must be as efficient as possible, since you want the robots to do many different tasks.

Achieving efficient external memory accesses is one of the most important matters with these microarchitectures. The communication with external memory has a great overhead with respect to time, compared to the internal core speed. When we design the most critical algorithms for our applications, we must minimize the external memory access in order to achieve the best performance. It is one of the main subjects to consider in designing applications that will be developed with parallelism in mind.

There are two options available to speed up the tasks done by the robots, taking into account the washing-up example:

  • Divide the washing-up into parts corresponding to the number of robots available, and have the robots do their own portion of the global task

  • Have a big pile of washing-up, and have each robot pick items up from that pile when they have room in their sinks to do that washing-up

The internal bus is very important, because it transports data between the different execution cores. Many microprocessors use one or more dedicated buses for that task with very high working speeds, while others establish those communications through the Front Side Bus (FSB), which is a less efficient way. When the microprocessor has more than two cores, the architecture could be any possible merger between the known architectures for single-and dual-core microprocessors. There are some microprocessors built by two pairs of cores, each one using a dedicated bus for the data interchange between both the cores, but with both pairs talking through the FSB.

When we are optimizing applications to take full advantage of these microarchitectures, one of the things that we should minimize is the information going through the FSB. Besides, we must consider this when evaluating the optimized application's efficiency. If we don't, we will probably draw wrong conclusions and try to optimize performance that is already at its maximum for the underlying hardware architecture.

A multiprocessing system based on many independent physical processors has many FSBs to access external memory, one for each physical processor. However, a system with a microprocessor having multiple cores has to share the FSB, which acts as a single great door to the outside world and to the external memory. Therefore, the tasks that coordinate the activities of the execution cores require additional time to avoid conflicts on the shared FSB. This is an important difference between multiprocessing using independent physical microprocessors and multiple cores in one physical microprocessor.

The probability that an FSB will become a bottleneck is very high when applications are not optimized to take full advantage of cache memories included in each core. Therefore, when the software is running over these microarchitectures, it should avoid frequent accesses to the main memory.

Besides, many asymmetric multiprocessing systems use duplicated communication channels with the main memory. This feature is not available in many multicore microprocessors. It makes it nearly impossible to predict the performance of applications in completely different system architectures. However, designing them with parallelism in mind will take full advantage of any feature present in the system.

Nowadays, microprocessors with multiple execution cores are widespread. We often find them arranged in an n-way symmetric multiprocessing system, such as an 8-core system with two physical quad-core microprocessors. It is a very attractive setup in high-end workstations and servers.

In the coming years, microprocessors are going to include more and more processing cores. Modern operating systems are already optimized to take advantage of their parallel processing capabilities. We must optimize our applications to take full advantage of them.

Analyzing the microarchitectures used in modern microprocessors is a topic for an entire book. However, we needed some knowledge about them in order to understand the parallel processing capabilities that are useful for our goals.

Do not expect plutonium robots! They are still too expensive to maintain.

 

Examining our hardware


As we will see with some of the tools used to analyze a system's hardware, most count logical cores and not just physical cores. This is important to remember because of the limitation mentioned in the previous section where a CPU only executes an instruction each clock cycle for a physical core.

Let's take a second to look at some examples. First, if you are on a Windows machine (which we will assume for the examples in this book), you can right-click on the Taskbar and run the Task Manager. The following is a sample from my computer:

When looking at the Performance tab, you can see it shows four CPUs running. This would lead us to believe that my system has four CPUs in it. In fact, it has one CPU with two physical cores, each having two hardware threads (logical cores). So, the Task Manager in Windows shows us how many logical cores there are in our system. This is the number of schedulable hardware threads the system scheduler can utilize.

Tip

CPU-Z is a handy free utility for analyzing a computer's hardware. You can download it from http://www.cpuid.com/.

Here is the output from CPU-Z for my computer:

As you can see from the bottom of the CPU-Z output, I have 1 CPU with 2 cores and 4 hardware threads. It is important to understand how many CPUs, physical cores, and hardware threads a system has so you can properly estimate performance gains from parallel development. In the next sections, we will examine several methods to predict this performance gain based on the number of physical cores. This will also help you understand the gains possible by "throwing more hardware" at a software application designed with parallelism in mind.
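If you would rather query the core count from code, a minimal sketch is shown below (our own illustration, not an example from the book). Note that .NET's Environment.ProcessorCount reports logical cores (hardware threads), the same number the Task Manager shows, not the physical core count used by the estimation formulas later in this chapter:

    using System;

    class CoreInfo
    {
        static void Main()
        {
            // Logical cores (hardware threads) visible to the OS scheduler.
            Console.WriteLine("Logical cores: " + Environment.ProcessorCount);
        }
    }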

 

OS scheduler operations


So far we have been discussing hardware: the number of CPUs, physical cores, and logical cores. Now, let's transition to software threads and the OS. A software application or service can run in one or many processes and threads. Typically, a software application has a user interface and is run by a user of the computer, while a service is started by the OS and runs in the background. Both of these are types of software that are executed by the computer they are running on.

Each application or service, in turn, executes inside one or several processes. Processes are the running instances of an application or service. Also, each process has one or many threads of execution (software threads). The threads are the items that the scheduler assigns to the cores. This might seem confusing, but it is important to understand the hardware from the bottom up, down to how many physical cores can execute a software instruction, and the software from the top down, to each software thread that executes on a core.

The scheduler of an operating system is the subsystem of the OS that manages all of the software threads currently running and allocates execution time on the cores (both physical and logical) of the computer. Execution time is divided up into machine cycles, and a machine cycle is a tick of the computer's clock.

The scheduler determines which software threads run on which core (physical) each clock cycle. So, during each clock cycle in a computer, each core can execute an instruction of a software thread. Remember that using hyperthreading technology, the scheduler treats each logical core as a physical core. But in actuality, in each clock cycle, each physical core executes a single software thread's instruction.

Also important in our parallel development and estimation of performance gains is that our estimates assume all hardware cores are available to our software application during each clock cycle. In reality, most computers have many processes running at any given time, each consuming some of the cores' execution time.

In Windows, the Task Manager provides some useful information to see what is running and consuming hardware resources. We have already looked at the Performance tab. Now, let's look at the Processes tab:

This tab in the Task Manager shows various information about the processes running on a Windows computer. As you can see from the Threads column in the preceding screenshot, some processes have many threads of execution at any given time.

You can go to the View menu and select Select Columns to change which columns of information are displayed by the Task Manager for each process.
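For completeness, the following short sketch (our own illustration, not taken from the book) lists the same kind of information as the Processes tab, using the System.Diagnostics API to print each process name and its current number of software threads:

    using System;
    using System.Diagnostics;
    using System.Linq;

    class ProcessList
    {
        static void Main()
        {
            foreach (var process in Process.GetProcesses().OrderBy(p => p.ProcessName))
            {
                try
                {
                    Console.WriteLine($"{process.ProcessName,-30} threads: {process.Threads.Count}");
                }
                catch (Exception)
                {
                    // Some protected system processes deny access to their details; skip them.
                }
            }
        }
    }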

 

Designing for concurrency


Now that we have an understanding of today's hardware capabilities (multiple processors, multiple physical cores, and multiple logical cores (hardware threads)) and OS schedulers, let's discuss how to take advantage of this in our software development.

We know that our hardware has the ability to execute multiple instructions at the same time. As we will see in later chapters, .NET provides several classes and libraries that allow us to develop software that runs in multiple threads instead of a single software thread. The question is then, when does it make sense to develop our software to run in multiple threads concurrently and what kind of performance gains can we expect?

When designing for concurrency, we should look at a high-level abstraction of the application's requirements. Look at what functions the application performs and which functions can operate in parallel without affecting other functions. This will help us decide how to determine the amount of parallel operations we can design into the application. We will discuss in detail the .NET classes to implement both heavyweight concurrency (the Thread class) and lightweight concurrency (Task Parallel Library) in later chapters; but for now, we need to determine what and how much of our application can happen in parallel.

For example, if the application takes a list of items and encodes each item into an encrypted string, can the encoding of each item be run in parallel independent of the encoding of another item? If so, this "function" of the application is a good candidate for concurrency. Once you have defined all of the high-level functions an application must perform, this analysis will determine the amount of parallelism that the application can benefit from.
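To make the encoding example concrete, here is a hedged sketch of how such a function could be parallelized with .NET's Parallel.ForEach. The Encode method is a stand-in (it simply Base64-encodes each string); the point is that each item is independent, so the iterations can safely run concurrently:

    using System;
    using System.Collections.Concurrent;
    using System.Text;
    using System.Threading.Tasks;

    class ParallelEncoding
    {
        // Hypothetical encoding step; any per-item, independent work fits here.
        static string Encode(string item) =>
            Convert.ToBase64String(Encoding.UTF8.GetBytes(item));

        static void Main()
        {
            string[] items = { "alpha", "beta", "gamma", "delta" };
            var results = new ConcurrentBag<string>();

            // Each iteration is independent, so the runtime may run them in parallel.
            Parallel.ForEach(items, item => results.Add(Encode(item)));

            foreach (var encoded in results)
                Console.WriteLine(encoded);
        }
    }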

The following is a simple example of a parallel design where some of the functions operate sequentially and others in parallel. As we will see in the next section, once we can define how much of the application functions concurrently versus sequentially, then we can understand the performance gains we can expect.

A lot of parallel designs employ some sort of pipelining design where some sequential work is performed, then some parallel work, then some sequential work, and so on. The preceding diagram shows a simple model for a pipeline design. Another popular concurrent design pattern is the producer-consumer model, which is really just a variation on the pipeline model. In this design, one function of the application produces an output that is consumed by another function of the application. The following is a diagram of this design pattern. In this example, each function can operate in parallel. The Load Image function produces image files to be consumed by the Scale Image function. The Scale Image function also produces thumbnail images to be consumed by the Filter Image function, and so on. Each of the function blocks can run in multiple concurrent threads because they are independent of each other:

The following diagram illustrates the sequential operation:

The following diagram illustrates the parallel pipeline design:
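The following is a minimal producer-consumer sketch of the pipeline just described, reduced to two stages with hypothetical names (loading and scaling). A BlockingCollection hands work from one stage to the next, and each stage runs in its own task so they overlap in time; this is a simplified illustration rather than the book's implementation:

    using System;
    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    class PipelineSketch
    {
        static void Main()
        {
            var loaded = new BlockingCollection<string>(boundedCapacity: 10);

            // Producer stage: "loads" images and publishes them to the queue.
            var loader = Task.Run(() =>
            {
                foreach (var name in new[] { "img1", "img2", "img3" })
                    loaded.Add(name + " (loaded)");
                loaded.CompleteAdding();        // Signal that no more items will come.
            });

            // Consumer stage: "scales" whatever the loader has produced so far.
            var scaler = Task.Run(() =>
            {
                foreach (var image in loaded.GetConsumingEnumerable())
                    Console.WriteLine(image + " -> scaled");
            });

            Task.WaitAll(loader, scaler);
        }
    }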

Estimating performance improvements

One of the most common mistakes in designing or upgrading systems with multiple processors is making linear projections in their processing speed. It is very common to consider that each additional processor in the system will increase the performance in a way that is directly proportional to its processing capacity.

For instance, if we have a system with just one processor and we add three more, we will not get four times the performance. This is because each time we add a processor, the time the processors dedicate to coordinating their work and to task assignment increases. Therefore, because of the increased processing power spent on managing tasks, performance will not increase linearly.

The additional robots added to the kitchen must talk among themselves to coordinate their work.

The coordination costs and the performance increment depend upon a number of factors including the following:

  • The operating system and its management procedures to coordinate and distribute processes and threads among multiple processors: This is the robots' accuracy in assigning the appropriate task to the most capable robot model for that particular task.

  • The level of optimization to run multiple processors offered by applications: This is one of the most relevant points, even when we are using an n-way symmetric multiprocessing scheme. In this book, we will learn to reach high levels of optimizations for concurrency in our software. This can be correlated to the robots' abilities to work with other robots on the same tasks.

  • The microprocessors' microarchitecture: This corresponds to how fast the robots move their hands and legs, and do similar tasks.

  • The speed of the memory subsystem shared by the microprocessors: This is the robots' communications interface.

  • The speed of the I/O buses shared by the microprocessors: This is the robots' efficiency and precision in managing their hands and legs to do each task (mopping the floor and cooking, for example).

All these items represent a problem when we design or upgrade a machine, because we need answers to the following questions:

  • How many microprocessors do we need when the number of users increases? How many robots do you need according to the number of friends/tasks?

  • How many microprocessors do we need to increase an application's performance? How many robots do you need to accelerate the wash-up time?

  • How many microprocessors do we need to run a critical process within a specific time period? How many robots do you need to clean the oven in 5 minutes?

We need a reference, similar to the one offered in the following table, in which we can see the coordination cost and the relative performance for an increasing number of processors:

Number of processors   Coordination cost              Relative performance
                       (in processors)    (in %)      (in processors)    (in %)
         1                  0.00            0%             1.00           100%
         2                  0.09            5%             1.91            95%
         3                  0.29           10%             2.71            90%
         4                  0.54           14%             3.46            86%
         5                  0.84           17%             4.16            83%
         6                  1.17           19%             4.83            81%
         7                  1.52           22%             5.48            78%
         8                  1.90           24%             6.10            76%
         9                  2.29           25%             6.71            75%
        10                  2.70           27%             7.30            73%
        11                  3.12           28%             7.88            72%
        12                  3.56           30%             8.44            70%
        13                  4.01           31%             8.99            69%
        14                  4.47           32%             9.53            68%
        15                  4.94           33%            10.06            67%
        16                  5.42           34%            10.58            66%
        17                  5.91           35%            11.09            65%
        18                  6.40           36%            11.60            64%
        19                  6.91           36%            12.09            64%
        20                  7.42           37%            12.58            63%
        21                  7.93           38%            13.07            62%
        22                  8.46           38%            13.54            62%
        23                  8.99           39%            14.01            61%
        24                  9.52           40%            14.48            60%
        25                 10.07           40%            14.93            60%
        26                 10.61           41%            15.39            59%
        27                 11.16           41%            15.84            59%
        28                 11.72           42%            16.28            58%
        29                 12.28           42%            16.72            58%
        30                 12.85           43%            17.15            57%
        31                 13.42           43%            17.58            57%
        32                 14.00           44%            18.00            56%

This table was prepared taking into account an overall average performance test with many typical applications well optimized for multiprocessing, and the most modern processors with multiple execution cores used in workstations and servers. These processors were all compatible with the AMD64 or EM64T instruction sets, also known as x86-64. We can take these values as a reference in order to have an idea of the performance improvement that we will see in optimized applications.

As shown in the previous table, the coordination cost grows faster than linearly as the number of processors or cores increases. The following graph shows the relative performance versus the number of processors:

As we can see in the preceding graph, the relative performance grows more and more slowly as the number of processors or cores increases.

Note

The following are the formulas used to calculate the values presented in the table and the graphs:

Coordination cost = 0.3 x logarithm (number of processors) x (number of processors - 1)

Relative performance = number of processors - coordination cost

The percentages are obtained by dividing the coordination cost or the relative performance by the total number of microprocessors installed.
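The table can be reproduced from these formulas; the short sketch below does so, assuming the logarithm is base 10 (that assumption matches the published values, for example 0.09 for two processors):

    using System;

    class CoordinationCost
    {
        static void Main()
        {
            for (int n = 1; n <= 32; n++)
            {
                double cost = 0.3 * Math.Log10(n) * (n - 1);   // Coordination cost.
                double performance = n - cost;                 // Relative performance.
                Console.WriteLine($"{n,2}  cost={cost,6:F2} ({cost / n:P0})  " +
                                  $"performance={performance,6:F2} ({performance / n:P0})");
            }
        }
    }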

Nowadays, the problem is that without many concurrent users, multiple processor systems have not proved to be as useful as expected. The use of machines equipped with more than one processor in workstations used by just one user is meaningful only when the applications executed are designed to work with multiple processors.

Most applications designed for a single user are not optimized to take full advantage of multiple processors. Therefore, if the code is not prepared to use these additional processors, their performance will not improve, as was explained earlier.

But, why does this happen? The answer is simple. The process to develop applications that take full advantage of multiple processors is much more complex than traditional software development (this book will show how to make this task much easier). With the exception of specialized applications requiring a lot of processing capacity and those dedicated to resolving complex calculations, most applications have been developed using a traditional, linear programming scheme.

Nevertheless, the release of physical microprocessors with multiple logical execution cores led to the widespread availability of multiprocessing systems and an urgent need to take full advantage of these microarchitectures.

A system with multiple processors can be analyzed and measured by the following items:

  • Total number of processors and their features: This is the total number of robots and their features.

  • Processing capacity (discounting the coordination overload): This is the robots' speed at working on each task (without communicating).

  • Microarchitecture and architecture: This is the number of execution cores in each physical microprocessor and the number of physical microprocessors with each microarchitecture. These are the subrobots in each robot, the number of hands, legs, and their speed.

  • Shared memory's bus bandwidth: This is the maximum number of concurrent communications that the robots can establish.

  • I/O bus bandwidth: This is the robots' efficiency, precision, and speed in managing their hands and legs concurrently to do each task.

Note

Bandwidth between processors

This bus allows the processors to establish a fluid communication between them. It is also known as the inter-processor bus. In some microarchitectures, this bus is the same as the FSB. It competes with the outputs to the microprocessors' outside world and therefore steals available bandwidth. The great diversity in microarchitectures makes it difficult to foretell the performance of the applications optimized for multiprocessing in every running context that is possible in the modern computing world.

We are considering neither the storage space nor the amount of memory. We are focused on the parameters that define the operation and the performance of multiple processors.

Amdahl's law

Amdahl's law is one of the two laws of parallel optimization that is used to help determine the expected performance gains with parallel computer designs, for both hardware and software.

Amdahl's law is a formula to help estimate the performance gains to be expected in an application given the amount of the application that is executed concurrently and the number of physical cores in the machine.

Gene Amdahl is a computer architect who, in 1967, presented his algorithm to compute the maximum performance improvement that can be expected when part of a system is rewritten for parallelism. The algorithm calculates the expected speedup as a factor over the sequential time.

This law takes into account the number of physical execution cores, N, the fraction, P, of the application that is concurrent, and the fraction, B, that is serial (B = 1 - P). The time it takes to process when N cores are being used is as follows:

T(N) = T(1) x (B + (1 - B)/N)

So, the maximum speedup (N) would be calculated as:

Speedup(N) = T(1)/T(N) = T(1)/(T(1) x (B + (1 - B)/N)) = 1/((1 - P) + P/N)

So, Amdahl's law states that if, for example, 50 percent of an application is run sequentially, 50 percent is concurrent, and the computer has two cores, then the maximum speedup is:

Speedup = 1/((1-.50)+.5/2) = 1.333

So, if the task took 100 execution cycles sequentially, then it will take 75 cycles with 50 percent concurrency on two cores, because 75 (work units) x 1.3333 (speedup) = 100 (work units).
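A quick way to experiment with these numbers is a helper like the one below (a sketch of the formula above, not code from the book), where P is the concurrent fraction and N is the number of physical cores:

    using System;

    class AmdahlSketch
    {
        // Amdahl's law: speedup = 1 / ((1 - P) + P/N).
        static double Speedup(double p, int n) => 1.0 / ((1.0 - p) + p / n);

        static void Main()
        {
            Console.WriteLine(Speedup(0.5, 2));   // ~1.33, the worked example above.
            Console.WriteLine(Speedup(0.5, 8));   // ~1.78: diminishing returns with more cores.
            Console.WriteLine(Speedup(0.9, 8));   // ~4.71: a more concurrent design helps far more.
        }
    }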

The following graph shows the predicted speed increase of an application based on the number of additional processors and the percentage of the code that can run in parallel:

This law allows you to estimate the performance gain from concurrency and to determine whether the benefits are worth the extra complexity it adds to development, debugging, and support. These extra costs have to be considered when developing a parallel software application.

Gustafson's law

Gustafson's law tries to address a shortfall of Amdahl's law by factoring in the scale of a problem. Gustafson assumed that the problem size is not fixed but grows (scales). Amdahl's law calculates the speedup for a given problem size as the number of execution cores changes; Gustafson's law calculates a scaled speedup.

Gustafson's law is: S(P) = P - a x (P - 1), where S is the scaled speedup, P is the number of processing cores, and a is the fraction of the application that is serial (not parallelizable):

You will notice that the curves for a given percentage of concurrency do not level off in Gustafson's calculation versus Amdahl's.
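For comparison, here is the same kind of sketch for Gustafson's scaled speedup, again our own illustration with a taken as the serial fraction. Unlike the Amdahl curve, the result keeps growing almost linearly with the number of cores:

    using System;

    class GustafsonSketch
    {
        // Gustafson's law: S(P) = P - a x (P - 1), with a = serial fraction.
        static double ScaledSpeedup(int p, double a) => p - a * (p - 1);

        static void Main()
        {
            foreach (int cores in new[] { 2, 4, 8, 16, 32 })
                Console.WriteLine($"{cores,2} cores: {ScaledSpeedup(cores, 0.1):F2}");
        }
    }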

 

Summary


We learned a lot in this chapter about multiprocessor and multicore hardware architectures and how the operating system scheduler manages them. We also learned how to design a software application for parallel operation and the performance gains we can expect from this. This chapter prepared us for the rest of the book and showed us the possibilities today's hardware can give the software developer if they are aware of the potential that multiple CPUs and multiple cores bring.

Now that we've learned about the principles of multiprocessing and multiple cores, we're ready to learn about the main components of a parallel program, the processes and the threads, which are the topic of the next chapter.

About the Author

  • Rodney Ringler

    Rodney Ringler has 25 years' experience developing multitasking and parallel applications, with the last 10 focused on C# and .NET. He graduated cum laude from Clemson University with a BS degree in Computer Engineering. He then worked for 12 years in the fiber optic manufacturing industry on C-based real-time multitasking process control systems, where he went from being a developer to a project manager to an IT architect. After this, he spent 8 years running his own application development and hosting company focused on both .NET and open source technologies. He then spent several years as a consultant, working with companies in the retail, software, and manufacturing industries.

    Currently, Rodney works as a senior .NET developer at a manufacturing company based in Charlotte, NC, and takes .NET and object-oriented programming classes at Central Piedmont Community College.

    In his spare time, Rodney enjoys life in Lake Wylie, SC, with his wife and four children.

