C# 2008 and 2005 Threaded Programming: Beginner's Guide

By Gaston C. Hillar

About this book

Most modern machines have dual core processors. This means that multitasking is built right into your computer's hardware. Using both cores means your applications can process data faster and be more responsive to users. But to fully exploit this in your applications, you need to write multithreading code, which means learning some challenging new concepts.

This book will guide you through everything you need to start writing multithreaded C# applications. You will see how to use processes and threads in C#, .NET Framework features for concurrent programming, sharing memory space between threads, and much more. The book is full of practical, interesting examples and working code.

This book begins with fundamental concepts such as processes, threads, mono-processor systems, and multi-processor systems. As the book progresses, readers gain a clear understanding of starting, joining, pausing, and restarting threads, along with the simple techniques associated with parallelism. There are short exercises at the end of every chapter.

The book also includes several practical parallelism algorithms and data structures used for illustration, and best practices and practical topics like debugging and performance.

Publication date: January 2009
Publisher: Packt
Pages: 416
ISBN: 9781847197108

 

Chapter 1. Taking Advantage of Multiprocessing and Multiple Cores

We already know how to develop applications using the C# programming language. However, modern computers are prepared to run many operations in parallel. C# is an advanced programming language; thus, users and our bosses expect a C# application to offer great performance and a responsive user interface.

So, let's take our C# development skills to the next level. We want to take full advantage of modern hardware. For that reason, the first thing we have to do is try to understand how modern computers differ from older computers. Let's understand the parallelization revolution. The only requirements for developing parallelized C# applications are an understanding of the basics of the C# programming language and of the Visual Studio IDE. We will cover the rest of the requirements in our journey through the parallel programming world!

We must understand some fundamentals related to the multiprocessing capabilities offered by modern computers. We will have to consider them in order to develop applications that take full advantage of parallel processing features. In this chapter, we will cover many topics to help us understand the new challenges involved in parallel programming with modern hardware. Upon reading it and following the exercises we shall:

  • Begin a paradigm shift in software design

  • Understand the techniques for developing a new generation of applications

  • Have an idea of the performance increases we can achieve using parallel programming with C#

  • Perform accurate response time estimation for critical processes

 

Mono-processor systems: The old gladiators


The mono-processor systems use an old-fashioned, classic computer architecture. The microprocessor receives an input stream, executes the necessary processes and sends the results in an output stream that is distributed to the indicated destinations. The following diagram represents a mono-processor system (one processor with just one core) with one user, and one task running:

This working scheme is known as IPO (Input, Processing, Output) or SISD (Single Instruction, Single Data). This basic design represents the von Neumann machines, first described by this outstanding mathematician in 1945.

Single core: Only one warrior to fight against everybody

These days, systems with a single processing core, with just one logical processor, are known as single core.

When there is only one user running an application in a mono-processor machine and the processor is fast enough to deliver an adequate response time in critical operations, the model will work without any major problems.

For example, consider a robotic servant in the kitchen having just two hands to work with. If you ask him to do one task that requires both his hands, such as washing up, he will be efficient. He has a single processing core.

However, suppose that you ask him to do various tasks—wash up, clean the oven, make your lunch, mop the floor, cook dinner for your friends, and so on. You give him the list of tasks, and he works down the list. But since there is so much washing up, it's 2 pm before he even starts making your lunch—by which time you get very hungry and make it yourself. You need more robots when you have multiple tasks. You need multiple execution cores, many logical processors.

Each task performed by the robot is a critical operation, because you and your friends are very hungry!

Let's consider another case. We have a mono-processor computer with many users connected, requesting services that the computer must process. In this case, we have many input streams and many output streams, one for each connected user. As there is just one microprocessor, there is only one input channel and only one output channel. Therefore, the input streams are queued (multiplexed) for processing, and then the same happens, in reverse, with the output streams, as shown in the following diagram:

Doing a tiny bit of each task

Why does the robot take so long to cook dinner for you and your friends? The robot does a tiny bit of each task, and then goes back to the list to see what else he should be doing. He has to keep moving to the list, read it, and then start a new task. The time it takes to complete the list is much longer because he is not fast enough to finish so many tasks in the required time. That's multiplexing, and the delay is called von Neumann's bottleneck. Multiplexing takes additional time because you have just one robot to do everything you need in the kitchen.

The systems with concurrent access by multiple users are known as multi-user systems.

If the processor is not fast enough to deliver an adequate response time in every critical operation requested by each connected user, a bottleneck will be generated in the processor's input queue. This is well known in computer architecture as von Neumann's bottleneck.

There are three possible solutions to this problem, each consisting of upgrading or increasing one of the following:

  • The processor's speed, by using a faster robot. He will need less time to finish each task.

  • The processor's capacity to process instructions concurrently (in parallel), that is, adding more hands to the robot and the capability to use his hands to do different jobs.

  • The number of installed processors or the number of processing cores, that is, adding more robots. They can all focus on one task, but everything gets done in parallel. All tasks are completed faster and you get your lunch on time. That is multitasking.

No matter which option we pick, we must consider other factors that depend on the kind of operations performed by the computer and that could generate additional bottlenecks. In some cases, the main memory's access speed could be too slow (the robot takes too much time to read each task). In some other cases, the disk subsystem could have bad response times (the robot takes too much time to memorize the tasks to be done), and so on. It is important to make a detailed analysis of these topics before making a decision about how to troubleshoot bottlenecks.

Moreover, sometimes the amount of data that needs to be processed is too large and the problem is the transfer time between the memory and the processor. The robot is too slow to move each hand. Poor robot! Why don't you buy a new model?

In the last few years, every new micro-architecture developed by microprocessor manufacturers has focused on improving the processors' capacity to run instructions in parallel (a robot with more hands). Some examples of these are the continuous duplication of processing structures like the ALU (Arithmetic and Logic Unit) and the FPU (Floating Point Unit), and the growing number of processing cores that are included in one single physical processor. Hence, you can build a super robot with many independent robots and many hands. Each sub-robot can be made to specialize in a specific task, thus parallelizing the work.

Computers used as servers, with many connected users and running applications, take greater advantage of modern processors' capacity to run instructions in parallel as compared to those computers used by only one user. We will learn how to take full advantage of those features in the applications developed using the C# programming language. You want the robot to get your lunch on time!

 

The performance waterfall


Considering all the analysis we have done so far to develop new algorithms for the applications of critical processes, we can conceive the performance waterfall shown in the following image:

Note

FSB (Front Side Bus)

The FSB is a bus that transports the data between the CPU and the outside world. When the CPU needs data from memory or from the I/O subsystem, the FSB is the highway used for that information interchange.

This performance waterfall will help us understand how we can take full advantage of modern multiprocessing. The topmost part of the waterfall represents the best performance; hence, we lose speed as we go down each step. The relationship is not linear, and the hardware infrastructure on which the application runs will determine the exact performance loss at each step represented in the above figure. However, the shape of the cascade is the same in every case, regardless of the kind of application being developed or the hardware being used.

We must design our algorithms to keep trips down the steps of the performance waterfall to a minimum. We should go downstairs as an exception, and not as a rule. For example, a good decision consists of recovering all the necessary information from the disk subsystem or the network in one pass. Then, we can take everything to memory and begin processing without having to search for the data in every iteration.
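As a toy illustration of this idea, the following C# sketch (the file name and the one-value-per-line format are invented for the example) brings all the data from disk to memory in a single pass and then processes it entirely in memory:

```csharp
using System;
using System.IO;

class WaterfallFriendlyProcessing
{
    // Sums the numeric values, one per line, that were already loaded
    // into memory, so the loop never touches the disk.
    static long SumValues(string[] lines)
    {
        long total = 0;
        foreach (string line in lines)
        {
            total += long.Parse(line);
        }
        return total;
    }

    static void Main()
    {
        // One trip down the waterfall: bring the whole (hypothetical)
        // data file from disk to memory in a single pass.
        string[] lines = File.ReadAllLines("input.txt");

        // Process everything from memory, in the upper waterfall steps.
        Console.WriteLine(SumValues(lines));
    }
}
```

Reading the whole file once keeps the critical loop in the upper steps of the waterfall; only one trip is made to the slow disk subsystem.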

A small performance problem in a mono-processing application multiplies its defects when the application is translated to a concurrent model. Therefore, we must consider these details.

As a rule, or as a design pattern, the best approach when optimizing a critical process is to run its tasks in the higher steps of the performance waterfall most of the time. It may visit the main memory on some iterations, but as rarely as possible. Each step down from the top means losing a small portion of performance.

Let's draw an example of this. We have a state-of-the-art laser printer capable of printing 32 pages per minute. It is in an office on the sixth floor, but the paper ream stays on the first floor. When the printer finishes with a page, a person must walk down the six floors to take another sheet of paper and put it in the printer's paper feed tray. It takes about five to ten minutes for this person to bring each sheet of paper to the printer: he goes downstairs, comes back upstairs, spends some time talking to a neighbor on the way, and then arrives back at the office with the sheet of paper. In addition, he could feel thirsty and go for a drink. As we can see, he wastes the state-of-the-art printer's performance (the execution core) because the paper tray is not fed quickly enough, and he brings only a small quantity each time he arrives at the office (the hard disk and the I/O subsystem).

The printer's work would be more efficient if the person could feed it with a paper ream containing 500 sheets. The person could bring another paper ream with 500 sheets from the first floor when the printer's paper feed tray has only 50 sheets left (bringing it to the cache memories L1, L2, or L3).

What happens if we have eight printers working in parallel instead of only one? In order to take full advantage of their performance and their efficient printing process, all of them must have a good number of sheets in their respective paper feed trays. This is the goal we must accomplish when we plan an algorithm for parallelism.

Note

In the rest of the book, we will consider the performance waterfall for many examples and will try to achieve optimal results. We will not leave behind the pragmatism necessary to improve performance within a reasonable development time.

Have a go hero - Researching micro-architectures and applications

A group of researchers needs the consulting services of an IT professional specialized in parallel computing. They are not very clear in explaining the kind of research they are doing. However, you decide to help them.

They want to find the best computer micro-architecture needed to parallelize an application.

Research the new micro-architectures that are being prepared by leading PC microprocessor manufacturers, and the schedules for their release, particularly on these topics:

  • Are they increasing the processors' speed?

  • Do they mention upgrades about the processor's capacity to process instructions concurrently (in parallel)?

  • Are they talking about increasing the number of processing cores?

 

Multi-processor systems: Many warriors to win a battle


On one hand, systems with multiple processors are a solution to von Neumann's bottleneck; on the other hand, it is necessary to know their detailed features in order to break some myths about them. They do not offer an immediate performance improvement for all applications! The dilemma is that systems with multiple processors are not always the most appropriate solution to a performance problem.

There are two basic procedures for distributing tasks in systems with multiple processors:

  • Symmetrical multiprocessing. This is also known as SMP. Any available processor or core can execute tasks. The most used and most efficient scheme is 'n' way symmetrical multiprocessing, where 'n' is the number of installed processors. With this procedure, each processor can execute a task in isolation from the rest, even when particular software is not optimized for multiprocessing systems. In the following image we can see how this scheme of assigning a task to a processor works. You have eight robots in the kitchen. When a robot is free, he goes back to the list to see what else he should be doing and starts working on the new task ('8' way symmetrical multiprocessing):

  • Asymmetrical multiprocessing. This is also known as AMP or ASMP. Usually, one processor acts as the main processor. It works as a manager and is in charge of distributing the tasks to the other available processors, using different kinds of algorithms for this purpose. In the following image, we can see how this scheme assigns a task to a processor. You have nine robots in the kitchen. One of them is in charge of tasks' distribution (the manager robot). He is always reading the list and watching the other robots' work (the worker robots, processors dedicated to run tasks). When a robot is free, the manager robot tells him what to do next:

The robots are expensive! You do not want to waste a robot to distribute the tasks. You would rather have robots that are independent. You want robots arranged similar to a symmetrical multiprocessing scheme.

The 'n' way symmetric multiprocessing procedure achieves the best performance and the best resources usage, where 'n' can be two or more. With it, every available processor can execute tasks in an absolutely dynamic way. This is the reason why most multiprocessing systems use this approach.
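The 'n' way symmetrical scheme can be sketched in C# with a shared task queue: any free worker thread takes the next pending task, just as any free robot takes the next item on the list. The task names below are invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// A minimal sketch of 'n' way symmetrical multiprocessing:
// every worker thread pulls the next pending task from a shared queue,
// so any available "robot" can take any task.
class SymmetricKitchen
{
    static readonly Queue<string> tasks = new Queue<string>(
        new[] { "wash up", "clean the oven", "make lunch", "mop the floor" });
    static readonly object queueLock = new object();

    static void Worker()
    {
        while (true)
        {
            string task;
            lock (queueLock)
            {
                if (tasks.Count == 0) return; // nothing left to do
                task = tasks.Dequeue();       // take the next pending task
            }
            Console.WriteLine("{0} done by thread {1}",
                task, Thread.CurrentThread.ManagedThreadId);
        }
    }

    static void Main()
    {
        int n = Environment.ProcessorCount;   // one worker per core
        var workers = new Thread[n];
        for (int i = 0; i < n; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start();
        }
        foreach (Thread t in workers) t.Join();
    }
}
```

No thread is dedicated to distributing the work (there is no manager robot); the queue itself, protected by a lock, is the list that every worker consults.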

Note

We will learn programming techniques that combine the advantages of the SMP and AMP approaches.

A symmetric multiprocessing system with many users connected or numerous tasks running provides a good solution to von Neumann's bottleneck. The multiple input streams are distributed to the different available processors for their execution, and they generate multiple concurrent output streams, as shown in the following image:

But what if there is so much washing up that it takes a single robot several days to complete? A bottleneck will be generated again. Are his hands as fast as necessary? Are his legs too slow? Is he a lazy robot?

We have to take into account that if the response time of a processor to a user's request is not quick enough, a bottleneck will be generated again. However, it can also be generated by other problems along the performance waterfall. We must delve deeper into the process in order to understand these potential performance issues.

Therefore, while the number of users or the number of tasks being executed in a multiprocessing system increases, it is more likely to run out of processing capacity, among other things. If this happens, each user's tasks being executed will take longer to run, and for that reason, the response time will worsen.

Under these circumstances, there are two possible approaches to keep the response time untouched:

  • Replacing the existing processors with new ones (buying super-robots): In order to apply this solution, there should be processors with better performance ratios than the ones currently used, or with more execution cores (to achieve more parallelism). Besides, they have to be compatible with the motherboard and the sockets it provides. The great disadvantage of this approach is that the old processors are discarded. Besides, it is also expensive.

  • Adding new processors to work with the existing ones (buying new robots to help the existing ones): In order to apply this solution, there should be free sockets in the motherboard.

Note

These solutions apply only when the information systems are optimized to take full advantage of a dynamically growing parallelism capability.

Have a go hero - Multi-processing systems

It is a good idea to look inside the micro-architectures available in modern personal computers with multiprocessing. What changes would you consider making in the applications in order to take full advantage of them?

Estimating performance improvements

One of the most common mistakes in designing or upgrading systems with multiple processors is making linear projections in their processing speed. It is very common to consider that each additional processor in the system will increase the performance in a way that is directly proportional to its processing capacity.

For instance, when we have a system with just one processor, and if we add three more, we will not have four times the performance. This is because each time we add a processor, the time they dedicate to coordinate their work and the task assignment process increases. Therefore, because of the increased processing power spent on managing tasks, their performance will not increase linearly.

The additional robots added to the kitchen must talk among themselves to coordinate their work.

The coordination costs and the performance increment depend upon a number of factors, including:

  • The operating system and its management procedures for coordinating and distributing processes and threads among multiple processors: This corresponds to the accuracy with which tasks are assigned to the robot model most capable of handling each particular task.

  • The level of optimization for running multiple processors offered by applications: This is one of the most relevant points, even when we are using an 'n' way symmetric multiprocessing scheme. In this book, we will learn to reach high levels of optimizations for concurrency in our software. This can be correlated to the robots' abilities to work with other robots in the same tasks.

  • The microprocessors' micro-architecture: This corresponds to the robots' speed to move hands and legs, and so on.

  • The speed of the memory subsystem shared by the microprocessors: It is the robots' communications interface.

  • The speed of the I/O buses shared by the microprocessors: It is the robots' efficiency and precision in managing their hands and legs to do each task (mopping the floor and cooking, among others).

All these items represent a problem when we design or upgrade a machine, because we need answers to these questions:

  • How many microprocessors do we need when the number of users increases? How many robots do you need according to the number of friends/tasks?

  • How many microprocessors do we need to increase an application's performance? How many robots do you need to accelerate the wash up time?

  • How many microprocessors do we need to run a critical process within a specific time period? How many robots do you need to clean the oven in five minutes?

We need a reference, similar to the one offered in the following table in which we can see the coordination cost and the relative performance for an increasing number of processors:

Number of    Coordination cost       Relative performance
processors   (processors)   (%)     (processors)    (%)
     1           0.00        0%         1.00        100%
     2           0.09        5%         1.91         95%
     3           0.29       10%         2.71         90%
     4           0.54       14%         3.46         86%
     5           0.84       17%         4.16         83%
     6           1.17       19%         4.83         81%
     7           1.52       22%         5.48         78%
     8           1.90       24%         6.10         76%
     9           2.29       25%         6.71         75%
    10           2.70       27%         7.30         73%
    11           3.12       28%         7.88         72%
    12           3.56       30%         8.44         70%
    13           4.01       31%         8.99         69%
    14           4.47       32%         9.53         68%
    15           4.94       33%        10.06         67%
    16           5.42       34%        10.58         66%
    17           5.91       35%        11.09         65%
    18           6.40       36%        11.60         64%
    19           6.91       36%        12.09         64%
    20           7.42       37%        12.58         63%
    21           7.93       38%        13.07         62%
    22           8.46       38%        13.54         62%
    23           8.99       39%        14.01         61%
    24           9.52       40%        14.48         60%
    25          10.07       40%        14.93         60%
    26          10.61       41%        15.39         59%
    27          11.16       41%        15.84         59%
    28          11.72       42%        16.28         58%
    29          12.28       42%        16.72         58%
    30          12.85       43%        17.15         57%
    31          13.42       43%        17.58         57%
    32          14.00       44%        18.00         56%

This table was prepared taking into account an overall average performance test with many typical applications well optimized for multiprocessing, and the most modern processors with multiple execution cores used in workstations and servers. These processors were all compatible with the AMD64 or EM64T instruction sets, also known as x86-64. We can take these values as a reference in order to have an idea of the performance improvement that we will see in optimized applications.

As we can see in the following image, the coordination cost grows more than linearly as the number of processors or cores increases:

As we can see in the following image, the relative performance grows more slowly than linearly as the number of processors or cores increases:

Tip

The formulas used to calculate the values presented in the table and the graphs:

Coordination cost = 0.3 x log10 (number of processors) x (number of processors - 1)

Relative performance = number of processors - coordination cost

The percentages are the result of dividing the coordination cost, or the relative performance, by the total number of microprocessors installed.
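These formulas are easy to check in code. The following C# sketch reproduces the table's values; note that the logarithm matching the published numbers is the base-10 logarithm:

```csharp
using System;

// Reproduces the coordination cost and relative performance table
// from the formulas above. The 0.3 factor and the base-10 logarithm
// are the values that match the published table; they are rough
// empirical estimates, not physical laws.
class CoordinationCost
{
    static void Main()
    {
        for (int n = 1; n <= 32; n++)
        {
            double cost = 0.3 * Math.Log10(n) * (n - 1);
            double relative = n - cost;
            Console.WriteLine(
                "{0,2}  cost {1,5:F2} ({2,2:F0}%)  performance {3,5:F2} ({4,3:F0}%)",
                n, cost, 100 * cost / n, relative, 100 * relative / n);
        }
    }
}
```

For example, for four processors the sketch prints a coordination cost of 0.54 processors (14%) and a relative performance of 3.46 processors (86%), matching the table.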

Nowadays the problem is that, without many concurrent users, multiple processor systems have not proved to be as useful as expected. The use of machines equipped with more than one processor, in workstations used by just one user, is meaningful when the applications executed are designed to work with multiple processors.

Most applications designed for a single user are not optimized to take full advantage of multiple processors. Therefore, if the code is not prepared to use these additional processors, their performance will not improve, as was explained earlier.

But, why does this happen? The answer is simple. The process for developing applications that take full advantage of multiple processors is much more complex than traditional software development (this book will show how to make this task much easier). With the exception of specialized applications requiring a lot of processing capacity and those dedicated to resolving complex calculations, most applications have been developed using a traditional, linear programming scheme.

Nevertheless, the release of physical microprocessors with multiple logical execution cores leads to the widespread availability of multiprocessing systems and an urgent need to take full advantage of these micro-architectures.

A system with multiple processors can be analyzed and measured by the following items:

  • Total number of processors and their features—the total number of robots and their features

  • Processing capacity (discounting the coordination overload)—the robots' speed to work on each task (without communicating)

  • Micro-architecture and architecture (the number of execution cores in each physical microprocessor and the number of physical microprocessors with each micro-architecture)—the sub-robots in each robot, the number of hands, legs, and their speed

  • Shared memory's bus bandwidth—the maximum number of concurrent communications that the robots can establish

  • I/O bus bandwidth—the robots' efficiency, precision, and speed to manage their hands and legs concurrently to do each task

Note

Bandwidth between processors

This bus allows the processors to establish a fluid communication between them. It is also known as the inter-processor bus. In some micro-architectures, this bus is the same as the FSB. It competes with the outputs to the microprocessor's outside world and therefore steals available bandwidth. The great diversity in micro-architectures makes it difficult to foretell the performance of the applications optimized for multiprocessing in every running context that is possible in modern computing world.

We are considering neither the storage space nor the amount of memory. We are focused on the parameters that define the operation and the performance of multiple processors.

Have a go hero - Calculating an estimated performance improvement

The group of researchers will ask you many questions regarding the performance improvements that you can achieve in your new applications. You must be prepared to answer those questions.

Take some applications developed in any programming language and measure their performance. Using the procedure and the formulas explained, calculate an estimated performance improvement that can be achieved by optimizing them.

The numbers are a temptation to begin recoding old-fashioned linear programming applications, aren't they? Let's go on to study the technical concepts we will use in the rest of the book. We need them in order to begin coding lots of practical samples.

Avoiding bottlenecks

Many bottlenecks can arise in systems with multiple processors, besides von Neumann's bottleneck (which is reduced, but still alive). The problems appear when the system runs out of processing capacity.

The following list enumerates the components in which the bottlenecks may appear with some possible solutions:

  • Shared memory's bus

    Some possible solutions for a bottleneck here can be:

    • Increase each processor's cache memory

    • Increase each processor's cache memory levels (levels 1, 2, 3, and 4 in some cases)

    • Replace the processors with others having more cache memory and/or more levels

    • Replace the processors with others having a memory controller integrated in the physical microprocessor

    • Use a motherboard providing dedicated buses to shared memory

  • Inter-processor bus

    To avoid a bottleneck in the inter-processor bus, we can replace the motherboard and the processors with another set offering a bus with a wider bandwidth between them. It is expensive to troubleshoot this because you are replacing two of the most expensive components of the system.

  • I/O shared bus (that is, hard disks)

    Some possible solutions for a bottleneck here can be:

    • Increase the amount of the system's physical memory dedicated to work as a cache memory for the I/O subsystem, for example, the disks' cache

    • Use a faster I/O bus; it could involve changing cards or adding new ones for the chosen bus; in some cases, the only way is to replace the motherboard

    • Use a motherboard providing dedicated buses to shared I/O

  • Network (wireless or cabled)

    Some possible solutions for a bottleneck here can be:

    • Increase the buffers dedicated to the network, or increase the amount of the system's physical memory committed to work as a cache memory for the network subsystem

    • Reduce network accesses

    • Increase the number of network cards

    • Increase the network speed

  • Storage subsystem performance

    Some possible solutions to improve the performance here can be:

    • Increase the amount of the system's physical memory dedicated to work as a cache memory for the storage subsystem

    • Use a faster I/O bus; it could involve changing cards or adding new ones for the chosen bus; in some cases, the only way is to replace the motherboard

    • Improve the storage configurations (for example, using disk arrays)

Tip

It is very important to take all these bottlenecks into account in order to develop applications that take full advantage of the available parallelism in modern computers. We will design the applications' architecture using algorithms that avoid these bottlenecks.

Have a go hero - Detecting bottlenecks

It is a good idea to detect bottlenecks in many applications using the tools provided by the operating system to monitor different subsystems. What changes would you consider making in your system to avoid them? Do you think you have enough memory to avoid continuous slow accesses to the hard disks?

There is a lot of software in the market that does not use the power offered by the available computers. We can improve the performance of this software.

Taking advantage of multiple execution cores

One of the techniques for improving processing capacity consists of increasing the microprocessors' working frequency, which raises the number of instructions that can be processed in the same period. This technique has been used for many years and evolved from the legendary 8086/8088, with its poor 4.77 MHz (Megahertz), to the many GHz (Gigahertz) of modern microprocessors.

Nevertheless, microprocessor manufacturers are increasingly facing difficulties in raising the frequencies because the manufacturing process becomes more complex and the generated heat is difficult to dissipate in an efficient and inexpensive way.

Consider our robot example again. You want to buy a single robot, but want him to clean the oven in 5 seconds. That is possible, but he needs plutonium as an energy source, because he must move his arms and legs at a very high speed. Besides, he needs an ambient temperature of 5°F (5 degrees Fahrenheit), or -15°C (-15 degrees Celsius). Why? Because metal moving at very high speed generates heat. You do not want a burnt robot. Moreover, plutonium is very expensive. Something similar happens with modern microprocessors.

Therefore, the other alternative is to develop new micro-architectures; incorporating first duplicated, and then quadruplicated processing structures, and so on. In this way, there are many sub-processors in one single microprocessor's package. These sub-processors are known as execution cores or processing cores.

Microprocessors with multiple execution cores, also known as multi-core, offer many complete execution cores interconnected in a single package. Their physical look is very similar to that of a conventional single-core microprocessor. Nevertheless, they are equivalent to something like two or more microprocessors inside a single piece of silicon, or to many pieces of silicon interconnected under the same physical package. Of course, we are avoiding deep technical issues.

Note

At present, most modern computers have microprocessors with at least two execution cores (dual core). Therefore, they are computers with multiprocessing capabilities.

Ever since the rise of multiple execution cores, the possible combinations of communication architectures, per-core resources, and shared resources have been multiplying. As with everything in life, each possibility involves a trade-off between manufacturing cost and performance. For this reason, a new landscape appeared in the microprocessor world.

In some cases, each execution core includes L1 and L2 cache memories, as shown in the following image:

In other cases, L2 cache memories are shared between two or more cores. Therefore, each core will have access to the whole L2 cache, as shown in this image:

The more resources each core owns exclusively, and the fewer it shares with the others, the greater the processing speed each core achieves. On the other hand, sharing resources between cores benefits applications that are not optimized for multiprocessing, because they use a single execution core.

The robots' communications interface must be as efficient as possible, since you want the robots to do many different tasks.

Achieving efficient external memory access is one of the most important matters in these micro-architectures. Communication with external memory carries a large time overhead compared to the internal cores' speed. When we design the most critical algorithms for our applications, we must minimize external memory accesses in order to achieve the best performance. This is one of the main subjects to consider when designing applications with parallelism in mind.

There are two options available to speed up the tasks done by the robots, taking into account the washing up example:

  • Divide the washing up into as many parts as the number of robots available and have the robots do their own portion of the global task

  • Have a big pile of washing up, and have each robot pick items up from that pile when they have room in their sinks to do that washing up

The internal bus is very important, because it transports data between the different execution cores. Many microprocessors use one or more dedicated buses with very high working speeds for that task, while others establish those communications through the FSB (Front Side Bus), a less efficient way. When the microprocessor has more than two cores, the architecture can be any possible merger of the known architectures for single-core and dual-core microprocessors. Some microprocessors are built from two pairs of cores, each pair using a dedicated bus for data interchange between its two cores, but with the two pairs talking to each other through the FSB. The following diagram illustrates this micro-architecture:

When we optimize applications to take full advantage of these micro-architectures, one of the things we should minimize is the information going through the FSB. Besides, we must take this into account when evaluating the efficiency of the optimized applications. If we don't, we will probably draw wrong conclusions about them, and try to optimize performance that is already at its maximum for the underlying hardware architecture.

A system with asymmetric multiprocessing based on many independent physical processors has many FSBs to access external memory, one for each physical processor. However, a system with a multi-core microprocessor has to share one FSB, which acts as a single great door to the outside world and to external memory. Therefore, coordinating the activities of the execution cores requires additional time to avoid conflicts on the shared FSB. This is an important difference between multiprocessing with independent physical microprocessors and multiple cores in one physical microprocessor.

The probability that the FSB will become a bottleneck is very high when applications are not optimized to take full advantage of the cache memories included in each core. Therefore, software running on these micro-architectures should avoid frequent accesses to main memory.

Besides, many asymmetric multiprocessing systems use duplicated communication channels with main memory, a feature not available in many multi-core microprocessors. These differences make it nearly impossible to predict the performance of applications across completely different system architectures. However, designing applications with parallelism in mind lets them take full advantage of whatever features are present in the system.

Nowadays, microprocessors with multiple execution cores are widespread. We can also find many of them arranged in an 'n' way asymmetric multiprocessing system, such as an eight-core system with two physical quad-core microprocessors. This is very attractive in high-end workstations and servers.

In the coming years, microprocessors are going to include more and more processing cores. Modern operating systems are already optimized to take advantage of their parallel processing capabilities. We must optimize our applications to take full advantage of them.

Analyzing the micro-architectures used in modern microprocessors is a topic for an entire book. However, we needed some knowledge about them in order to understand the parallel processing capabilities that are useful for our goals.

Do not expect plutonium robots! They will still be very expensive to maintain.

Have a go hero - Counting cores

Look at the computers that you will be able to use with the researchers. How many microprocessors do they have? How many cores does each microprocessor have? You can use CPU-Z in order to find this information. You can download this free software from http://www.cpuid.com/.

In the following image, you can see the information provided by CPU-Z for a computer with a microprocessor providing four cores:

At the bottom of the window, the Cores item displays the number 4.
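Besides CPU-Z, we can query the core count from C# itself. The following minimal sketch uses Environment.ProcessorCount, which reports the number of logical processors the operating system exposes; on microprocessors with Hyper-Threading, this value can be higher than the number of physical cores shown by CPU-Z.

```csharp
using System;

class CoreCounter
{
    static void Main()
    {
        // Environment.ProcessorCount reports the number of logical
        // processors that the operating system exposes to applications.
        Console.WriteLine("Logical processors: {0}", Environment.ProcessorCount);
    }
}
```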

Scalability

Talking about information systems, scalability is an application's ability to upgrade its performance, its response time, or the number of users to which it can provide services. We can measure scalability from many points of view:

  • Processing capacity—the robots can do the tasks in less time

  • Features and services (new functions, new cards, and new hardware)—the robots can do new tasks (prepare squeezed orange juice)

  • Storage capacity—the robots can save tasks for each day of the month

  • Memory—the robots remember more tasks without having to go back to the list every minute

We will analyze scalability in terms of processing capacity, setting aside the other hardware resources mentioned above, which could also limit scalability.

We have two possible ways to scale applications designed for mono-processing. One is to use a microprocessor with a more efficient micro-architecture, capable of executing more instructions per clock cycle. The other is to use a microprocessor with the same micro-architecture but a higher frequency, and hence more clock cycles per second. So, if we cannot find a microprocessor, with either of these improvements, that outperforms the one installed in the computer dedicated to running the application, we will not be able to improve its performance, enhance its response time, or increase the number of concurrent users.

Note

Applications designed for mono-processing are completely limited by the microprocessor manufacturers' improvements. Moreover, nowadays, microprocessors' working frequencies have stabilized. Microprocessor manufacturers tend to offer more processing cores, and hence more parallelism capabilities.

The following table shows how much of the total processing power an application designed for mono-processing, with linear execution code and no parallelism, can exploit as the number of available processing cores increases:

Number of cores    Maximum processing power usage
 1                 100.00%
 2                  50.00%
 3                  33.33%
 4                  25.00%
 6                  16.67%
 8                  12.50%
12                   8.33%
16                   6.25%
24                   4.17%
32                   3.13%
48                   2.08%
64                   1.56%

The numbers are scary and demonstrate the need to tame multiprocessing quickly.
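The figures in the table follow from a simple relationship: a single-threaded application keeps only one core busy, so its maximum share of the total processing power is 100% divided by the number of cores. A minimal sketch that reproduces the table:

```csharp
using System;

class MonoProcessingLimit
{
    // Maximum share of total processing power that a single-threaded
    // application can use on a machine with the given number of cores.
    static double MaxUsagePercent(int cores)
    {
        return 100.0 / cores;
    }

    static void Main()
    {
        int[] coreCounts = { 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64 };
        foreach (int cores in coreCounts)
        {
            Console.WriteLine("{0,2} cores -> {1,6:F2}%", cores, MaxUsagePercent(cores));
        }
    }
}
```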

On the other hand, with applications designed to take full advantage of parallelism, multiprocessing, and multiple execution cores, scalability depends on the total number of processors or cores we can incorporate. Because applications are optimized for parallelism, each added processor will offer an immediate enhancement in performance despite the added management costs explained earlier. Hence, we will be able to achieve better performance using applications optimized for parallelism.

Note

If an application optimized for parallelism requires more processing capacity than the most powerful microprocessor available in the market can offer, and the application is scalable, then we can add more microprocessors or cores. We can also achieve more scalability by distributing work across many computers. Our focus in this book will be on the first case.

Mounting a computer with eight processing cores (two microprocessors with four cores each) is far more feasible than asking a microprocessor manufacturer to develop an exclusive processor working at 20 GHz.

We will be able to build applications that scale dynamically, according to the number of available or configured processing cores.

Have a go hero - Detecting scalability problems

If you have a computer with a multi-core microprocessor, you will be able to get some hands-on experience with the problems related to scalability. It is a good idea to monitor many of the applications you usually run and check each execution core's load using Windows Task Manager. Are the applications taking full advantage of the processing power installed in your computing system? Are all the tasks using the full processing power?
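To see the effect for yourself, run a deliberately single-threaded busy loop and watch Task Manager while it executes. The following is only a sketch: on a quad-core machine, it should drive total CPU usage to only about 25%, because one thread can keep only one core busy.

```csharp
using System;

class OneCoreBurner
{
    // Spins on a single thread for the given number of milliseconds and
    // returns how many iterations it completed. Because the loop runs on
    // one thread, it can keep only one execution core busy.
    static long BurnFor(int milliseconds)
    {
        long iterations = 0;
        DateTime end = DateTime.Now.AddMilliseconds(milliseconds);
        while (DateTime.Now < end)
        {
            iterations++;
        }
        return iterations;
    }

    static void Main()
    {
        Console.WriteLine("Burning one core for 10 seconds; watch Task Manager...");
        Console.WriteLine("Iterations completed: {0}", BurnFor(10000));
    }
}
```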

Load balancing: Keeping everybody happy

The mechanism used for distributing the workload between many processors or cores that are part of a multiprocessing system is known as load balancing. Its main goal is to achieve the best possible performance by maintaining an equal workload between all the available processors or cores. It is also responsible for detecting with high accuracy the need to scale up (to increase the number of processors or cores).

If an inefficient workload distribution is applied, one core becomes overloaded with requests while the others remain underutilized. For that reason, the load balancing algorithm plays a critical role in the response time offered by the running application.

You do not want a robot to rebel! You want them to be happy! Therefore, you apply a carefully chosen load balancing procedure to distribute the tasks to the robots.

There are four traditional procedures for load balancing:

  • Random selection: Each request is assigned to a core randomly. One robot may kill you.

  • Task distribution: The tasks are distributed into each available core keeping a balance. This mechanism is known as PLB (Processing Load Balancing). The robot that mops the floor will kill you.

  • Round robin: Each request is assigned once to each core in the order of arrival, irrespective of the time it takes to accomplish the task. The round begins when a request arrives and is assigned to the first core, the next to the second, the following one to the third, and so on. When there are no more cores available, and a new request arrives, it is assigned to the first core and the round begins again. For example, if we apply this algorithm with a dual core CPU, request #1 goes to core #1, request #2 goes to core #2, request #3 goes to core #1, request #4 goes to core #2, and so on. One day, robot "A" will kill you. The other day, robot "B" will kill you. But, perhaps, some day, all of them will be happy. And, some other day, they will all be queued to kill you.

  • User defined (programmed by the user): Using any of the three algorithms mentioned above, the system tends to a uniform workload only if the distributed tasks consume the same processing power and take the same time to complete. Therefore, those algorithms can lack the necessary balance. So, another option is to introduce user-defined algorithms. We can develop small components or applications that measure each core's workload and assign tasks according to the results. Although this approach could bring the best response time in some applications, it transforms an 'n' way symmetric multiprocessing system into an asymmetric one, using one core to manage the task distribution process instead of leaving it to the algorithms applied by the operating system. You will be there an entire day watching the robots. In a week's time, there will be no more robots in your kitchen!
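Among these procedures, round robin is the easiest to express in code. The class below is an illustrative sketch (the class and method names are our own invention, not part of the .NET Framework): it cycles through the core indexes in order of arrival, exactly as described above.

```csharp
// A minimal sketch of a round-robin dispatcher for illustration only;
// in practice, the operating system's scheduler performs this assignment.
class RoundRobinBalancer
{
    private readonly int coreCount;
    private int lastCore = -1;

    public RoundRobinBalancer(int coreCount)
    {
        this.coreCount = coreCount;
    }

    // Returns the index of the core that should handle the next request,
    // cycling 0, 1, ..., coreCount - 1, then starting over at 0.
    public int NextCore()
    {
        lastCore = (lastCore + 1) % coreCount;
        return lastCore;
    }
}
```

For a dual-core CPU, `new RoundRobinBalancer(2)` assigns request #1 to core 0, request #2 to core 1, request #3 to core 0 again, and so on, which is the sequence described in the bullet above.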

Tip

Every algorithm has its advantages and trade-offs.

Have a go hero - Thinking about load balancing

Take some processes and routines with poor performance results and redesign them to achieve better results with 2, 4, 8, and 16 execution cores. Which load balancing algorithm do you consider the best option for each application?

Operating systems and virtual machines

Why do we need to consider so many aspects of the hardware architecture in order to develop applications that take full advantage of parallelism? The answer is simple! What is the most efficient way to reach Rome from our position? We need a map showing all the possible routes in order to work out an accurate answer. The same happens when developing applications for multiprocessing. We need to know certain details about the way multiprocessing works in order to develop efficient methods that take full advantage of it. If we do not, any effort made to take advantage of parallelism will not increase the application's performance.

There are great differences between parallel programming and linear mono-processing programming. Hence, before starting the paradigm shift necessary for achieving the best results, it is very important to understand the underlying hardware architectures on which our optimized applications will be running.

We are talking about hardware, which is supposed to be a responsibility of modern operating systems. In some cases, as with the Java programming language, it is the responsibility of a virtual machine, which provides an even more isolated environment.

C# is a high-level, object-oriented language that provides a very good level of isolation from hardware. Nevertheless, when we are optimizing the applications for parallelism, we must obtain certain information about the hardware in order to adapt some aspects of the application to take full advantage of available resources.

This is the point where the first complications appear. The operating system isolates applications from the underlying hardware, acting as an intermediate manager. Its kernel is the lowest-level layer above the hardware, and manages memory, processors, and many other devices in a low-level way. Therefore, it is in charge of administering the different available processors or cores and distributing the workload in the most efficient way. This reduces our ability to control certain variables, and makes them dependent on the operating system with which we are working.

The results achieved by the same application on different Windows versions, such as Vista, 2008, 2003, and XP, can be quite different. This happens because each new release tends to be more optimized for the most recent hardware, and hence provides more efficient multiprocessing with multi-core microprocessors. For this reason, the operating system is also responsible for the performance results.

Parallelism is here to stay

Multiprocessing is not new. It has been available for many years, though always limited to powerful servers and sophisticated, specialized workstations because of their expensive and difficult-to-find components.

Nowadays, microprocessor manufacturers tend to add more execution cores to every new microprocessor instead of increasing their working frequency. They also improve the instructions per clock cycle ratio. But the focus is on the parallelism capabilities. They are continuously improving the execution of instructions in parallel. Therefore, the multi-core revolution is here and hence, we will see more and more cores in one physical microprocessor. For this reason, the time has arrived to take full advantage of parallel architectures available in modern computers. Linear code has reached its end.

We will go through lots of samples covering the most useful situations in which we will need to take full advantage of the available processing cores with C# 2.0 and 3.0.

Tip

We are at an inflection point wherein, as developers, we must use the parallel processing power available in current and future computers. We must not lose the opportunity to take full advantage of parallelism.

Have a go hero - Preparing minds for parallelism

Exercise your mind to prepare for parallelism. This group of researchers seems to have many complex algorithms in mind:

  • Take some applications you have developed using C# and check which processes should be recoded in order to improve these applications' performance. Think, and write using some kind of pseudo-code, how parts of those processes could be executed in parallel. Moreover, redesign the applied algorithms in order to reduce the result dependencies between instructions.

  • Think about new solutions in which you believe it is convenient to use parallelism to improve performance. Divide the processes into many independent ones from a conceptual point of view, without focusing on design and technical details.

  • Take some processes and routines with poor performance results and redesign them thinking of some way to achieve better results with 2, 4, 8, and 16 execution cores. Write them using some kind of pseudo-code.

  • Take some actions you do during the day and try to describe the parallel tasks you execute under certain circumstances. Doing it will help you understand how human beings apply parallelism every second. Then you will be able to transfer this thinking model to help you achieve better results in designing applications using this new paradigm.

 

Summary


We learned a lot in this chapter about multiprocessing and multiple cores. Specifically, we understood the performance waterfall in a computing system. We acknowledged the advantages of parallel programming with C# for the coming years. We learned the challenges associated with parallel processing and programming. Now, we are able to detect bottlenecks and prepare for scalability and efficient load balancing.

Now that we've learned about the principles of multiprocessing and multiple cores, we're ready to learn the main components of a parallel program, the processes and the threads, which is the topic of the next chapter.

About the Author

  • Gaston C. Hillar

    Gaston C. Hillar is Italian and has been working with computers since he was 8 years old. Gaston has a Bachelor's degree in computer science (graduated with honors) and an MBA.

    Currently, Gaston is an independent IT consultant and a freelance author who is always looking for new adventures anywhere in the world.

    He was a senior contributing editor at Dr. Dobb's, and has written more than a hundred articles on software development topics. He has received the prestigious Intel® Black Belt Software Developer award eight times. He has written many articles about Java for Oracle Java Magazine.

    Gaston was also a former Microsoft MVP in technical computing. He lives with his wife, Vanesa, and his two sons, Kevin and Brandon.
