Homework 4 - CIS534 Spring 2010

Instructor: Prof. Milo Martin

Initial Data Due: Wed, March 3rd (emailed to me by 10am)

Final Data and Writeup Due: Friday, March 5th (under my door by 5pm).

Overview

In this assignment you will compare the performance of various lock implementations: Test-and-Test-and-Set (TTS), ticket lock, array lock, and a list-based queue lock. As with the previous assignment, you'll use the Penn Parallelism Primitives library.

The basic experimental setup is a computation that updates a counter selected at random from an array of counters. The array of N counters is large, and the counters are protected by an array of L locks (one lock per every "N divided by L" counters). By varying the number of locks, the lock contention is increased or decreased. The provided driver does some work both inside and outside the critical section, making the test a bit more realistic. Thus, even for a single lock, there will be some speedup on multiple cores.
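For concreteness, the mapping from a counter to the lock that protects it might look something like the sketch below (the function and variable names are illustrative, not taken from the provided driver):

    // Sketch only: each lock guards a contiguous block of N/L counters,
    // so a counter's lock is its index divided by the block size.
    int lock_index(int counter_index, int num_counters, int num_locks) {
        int counters_per_lock = num_counters / num_locks;
        return counter_index / counters_per_lock;
    }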

Note: As we only have a few machines with more than four cores, most of the experiments performed below will be on the four-core machines in the core2-quad.q queue.

Part 1

The first step is to run the driver code without any locking code (just comment out the calls to acquire and release in the main work loop in driver-hw4.C). This will get the "wrong" answer when run with more than one thread, but it will capture the best-case runtime if the lock overhead were zero, while still capturing most of the locality and sharing misses caused by the counters themselves. We'll be using an array of 16,384 32-bit counters (which is the default). Record the runtime on one processor and on four processors (use the core2-quad.q queue). On subsequent graphs, normalize all data to the single-processor case, but also draw a horizontal line representing the four-core speedup (the best case achievable).

To make all the runtimes tractable, the driver runs each thread for the specified number of iterations and displays the runtime in seconds and the runtime per iteration. This runtime-per-iteration number can be used as the base measurement (instead of seconds), allowing you to run a smaller number of iterations and avoid the high runtimes that result from significant lock contention.

Answer the following questions:

  1. What is the maximum speedup possible even without locking?
  2. What effects cause this speedup to be less than 4x?

Part 2

The next step is to use the simple test-and-test-and-set (TTS) lock provided for you in TTSLock.C and TTSLock.h. Vary the number of locks starting at 1 and increasing it by one until the performance levels off (likely at something around a dozen or more locks for four cores).
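For reference, the core of a TTS lock looks roughly like the sketch below (illustrative only; the actual TTSLock.C may differ in its details, and the GCC __sync builtins stand in for whatever atomic operations the library uses):

    // Illustrative test-and-test-and-set spin lock (a sketch, not TTSLock.C).
    class TTSLockSketch {
        volatile int held;   // 0 = free, 1 = held
    public:
        TTSLockSketch() : held(0) {}
        void acquire() {
            while (true) {
                while (held) { }                               // "test": spin on ordinary reads
                if (__sync_lock_test_and_set(&held, 1) == 0)   // "test-and-set": atomic exchange
                    return;                                    // we got the lock
            }
        }
        void release() {
            __sync_lock_release(&held);                        // store 0, releasing the lock
        }
    };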

Graph the data. The y-axis is throughput normalized to the single-thread case (higher is better). The no-lock case data is displayed as a horizontal line across the top of the graph. The x-axis is the number of locks.

Answer the following questions:

  1. What is the maximum speedup obtainable with this lock?
  2. What is the minimum number of locks needed to achieve near this maximum throughput?
  3. With a single lock, is the throughput greater than or less than the single-processor case? Why?

For each of the lock variants below, answer the same three questions above and plot the data on the same graph.

Part 3

Repeat Part 2 above, but using the ticket lock described in class.
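As a reminder, a ticket lock can be sketched roughly as follows (illustrative only; implement the version described in class, and note that the fetch-and-add here is a GCC builtin standing in for the library's atomic operation):

    // Illustrative ticket lock (a sketch, not a prescribed implementation).
    class TicketLockSketch {
        volatile unsigned next_ticket;   // next ticket to hand out
        volatile unsigned now_serving;   // ticket currently allowed into the critical section
    public:
        TicketLockSketch() : next_ticket(0), now_serving(0) {}
        void acquire() {
            unsigned my_ticket = __sync_fetch_and_add(&next_ticket, 1);  // take a ticket
            while (now_serving != my_ticket) { }                         // spin until called
        }
        void release() {
            now_serving = now_serving + 1;   // only the lock holder writes this
        }
    };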

Part 4

Repeat Part 2 above, but with the array lock described in class (Anderson's lock). You may also find the following pseudo code helpful:

http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#anderson
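Following that pseudocode, an array lock looks roughly like the sketch below (illustrative only; the slot count, padding, and interface are assumptions, and each slot must sit in its own cache block to avoid false sharing):

    // Illustrative Anderson-style array lock (a sketch following the
    // Rochester pseudocode; constants and names are assumptions).
    class ArrayLockSketch {
        enum { MAX_THREADS = 64 };                         // power of two, >= thread count
        struct Slot { volatile int has_lock; char pad[64 - sizeof(int)]; };
        Slot slots[MAX_THREADS];
        volatile unsigned next_slot;
    public:
        ArrayLockSketch() : next_slot(0) {
            slots[0].has_lock = 1;                         // the first arrival gets the lock
            for (int i = 1; i < MAX_THREADS; i++) slots[i].has_lock = 0;
        }
        unsigned acquire() {
            unsigned my_slot = __sync_fetch_and_add(&next_slot, 1) % MAX_THREADS;
            while (!slots[my_slot].has_lock) { }           // spin only on my own slot
            return my_slot;                                // caller passes this to release()
        }
        void release(unsigned my_slot) {
            slots[my_slot].has_lock = 0;                      // reset my slot for later reuse
            slots[(my_slot + 1) % MAX_THREADS].has_lock = 1;  // hand the lock to the successor
        }
    };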

Part 5

Repeat Part 2 above, but with the list-based queue lock described in class (CLH lock). You may also find the following pseudo code helpful:

http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#clh

You'll likely want to use a global __thread variable to hold the local pointer. Initialize it to NULL. To allocate a node, you have a few options: (1) you can initialize it in ppp.C right after each thread is created, or (2) you can check for NULL each time you use the lock and allocate a node if the pointer is NULL. Option #1 is more invasive but has lower runtime overhead than option #2. Either way, you'll want to reuse the nodes, thus avoiding frequent calls to malloc()/free().
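Putting those pieces together, a CLH lock with a lazily allocated __thread node (option #2) might look roughly like the sketch below (illustrative only; the names, the use of GCC's atomic-exchange builtin, and the assumption that a thread holds at most one of these locks at a time are all mine):

    // Illustrative CLH list-based queue lock (a sketch, not a prescribed
    // implementation). Each thread spins on its predecessor's node and
    // recycles that node for its next acquire, so nodes are never freed.
    #include <cstddef>   // for NULL

    struct CLHNode { volatile int must_wait; };

    static __thread CLHNode* my_node = NULL;   // this thread's node, allocated on first use
    static __thread CLHNode* my_pred = NULL;   // predecessor's node, reused after release

    class CLHLockSketch {
        CLHNode* volatile tail;
    public:
        CLHLockSketch() {
            tail = new CLHNode();              // dummy node: the lock starts out free
            tail->must_wait = 0;
        }
        void acquire() {
            if (my_node == NULL) my_node = new CLHNode();        // option #2: lazy allocation
            my_node->must_wait = 1;
            my_pred = __sync_lock_test_and_set(&tail, my_node);  // atomically swap into the tail
            while (my_pred->must_wait) { }                       // spin on the predecessor's node
        }
        void release() {
            CLHNode* node = my_node;
            my_node = my_pred;                 // recycle the predecessor's node next time
            node->must_wait = 0;               // signal the successor (if any)
        }
    };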

Part 6

Note: choose only one of Part 6, Part 7, or Part 8 to complete

The atomic template class automatically adds cache-block padding to each atomic variable. Repeat the above experiments but with padding disabled (just comment out the line in atomic.h). This will smash all the objects together, putting multiple TTS locks in the same cache block, both parts of the ticket lock in the same block, multiple slots of the array lock in the same block, and so on.
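The padding in question is along the lines of the sketch below (illustrative only; the real atomic.h differs in its details):

    // Sketch of the idea behind the padding in atomic.h (not the actual code):
    // padding each variable out to a full cache block keeps two different
    // atomics (say, two TTS locks) from ever sharing a block.
    template <typename T>
    struct PaddedAtomicSketch {
        volatile T value;
        char pad[64 - sizeof(T)];   // assumes 64-byte cache blocks and sizeof(T) < 64
    };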

Plot the new data, analyze the impact versus the results with padding, and discuss the results. As this is the simplest of the last three parts in terms of code and data collection, I expect a better analysis and prose explanation of the results than for the other options.

Part 7

Note: choose only one of Part 6, Part 7, or Part 8 to complete

Explore the effectiveness of the "pause" instruction. Disable the pause instruction (comment out the line in atomic.h). Re-run the experiments above (all the lock types) with and without the "pause" instruction on the eight-core Core 2 systems (the core2-oct.q queue) using 8 software threads. Is there a difference between the pause and pause-free versions? Explain why or why not.
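For reference, a pause-based spin loop looks roughly like this (illustrative; atomic.h may use inline assembly rather than the SSE2 intrinsic shown here):

    #include <emmintrin.h>   // for _mm_pause()

    // Illustrative spin-wait loop: "pause" hints to the core that this is a
    // spin loop, reducing the pipeline flush on exit and yielding execution
    // resources to the other hardware thread on a multithreaded core.
    static inline void spin_until_free(volatile int* lock_held) {
        while (*lock_held)
            _mm_pause();     // compiles to the x86 "pause" instruction
    }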

Also repeat the experiments on the Core i7 box using 8 software threads (on the corei7.q queue), both with and without the "pause" instruction. Is the relative impact larger on the Core i7? Explain why or why not. (Recall that the Core i7 is a multithreaded chip.)

Part 8

Note: choose only one of Part 6, Part 7, or Part 8 to complete

Rerun the above experiments (the first five parts) on the 128-thread Niagara T2 machine. As the TTS lock with so many threads is likely to perform poorly, you will also need to implement a TTS variant with exponential backoff. The pseudo code can be found at:

http://www.cs.rochester.edu/research/synchronization/pseudocode/ss.html#tas
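Based on that pseudocode, a TTS lock with exponential backoff can be sketched as follows (illustrative only; the backoff constants are guesses that will need tuning for the T2, and the GCC builtins stand in for the library's atomics):

    // Illustrative TTS lock with bounded exponential backoff (a sketch
    // following the Rochester pseudocode; constants need tuning).
    class BackoffTTSLockSketch {
        volatile int held;   // 0 = free, 1 = held
    public:
        BackoffTTSLockSketch() : held(0) {}
        void acquire() {
            int delay = 1;
            const int max_delay = 1024;
            while (__sync_lock_test_and_set(&held, 1) != 0) {    // failed to grab the lock
                for (volatile int i = 0; i < delay; i++) { }     // back off for "delay" iterations
                if (delay < max_delay) delay *= 2;               // exponential backoff, capped
                while (held) { }                                 // spin on reads until it looks free
            }
        }
        void release() {
            __sync_lock_release(&held);
        }
    };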

As there is only one of these machines and it will take more runs to explore the space, I'm not asking everyone to do this part. As others will be using the machine, please keep your runs short (as short as possible while still getting reasonable data, say ~30 seconds or less), and you don't need to run every data point between one lock and many locks; sample some points in between to avoid performing so many runs. I also suggest running with 127 software threads, leaving one hardware thread for the OS and background tasks. This will likely reduce the noise and thus allow for shorter runtimes.

Logistics

What to Turn In

For the initial due date:

  1. By 10am on the day of class, send an email to cis534@cis.upenn.edu with the quantitative results for Part 1 and Part 2 above. I'm just looking for the numbers.
  2. Be ready to describe your results during the in-class discussion.

Print out the following to bring to class on the final due date:

  1. Insightful writeup. The goal of this assignment is to get you to think more deeply about the various lock primitive choices and implementations. The role of the writeup is to convince me that you learned something by saying insightful things in it. I'm not looking for lots of prose, but lots of insight described tersely.
  2. Graphs. Include the graphs described above. Label the x-axis and y-axis. Give each graph a descriptive title. The graphs will generally have a different line for each lock. Make sure to use different symbols or dot/dash patterns to clearly identify which lock is which.
  3. Your code. Print out your code for each lock.

Disclaimer

It is quite likely that the primitives library I give you will have bugs in it. It is a small amount of code, but it is tricky code, and parallel code is difficult to test. I've done what I can to test it, but if something odd is happening, it could be a bug in the code I wrote.

I haven't collected data for these locks on these systems. Thus, I don't know if the data is actually going to be interesting or not. It could be that four cores isn't enough to show much of anything. In that case, I might end up asking you all to collect the data for the first parts on the few larger machines we have.

Addendum

Retrospective

After assigning and grading this assignment, here are a few things to consider correcting in future years: