Penn Parallel Primitives

Author: Prof. Milo Martin
Version: PPPv3, March 24, 2010
Class: CIS534
Download: PPPv3.tar.gz

The Penn Parallel Primitives is a collection of low-level shared-memory primitives for writing multicore software, written specifically for this course. It includes functions for spawning threads, atomic operations, simple barriers, etc.

Altogether the library is only about 500 lines of code, so you should be able to read through it and understand what is happening at the lowest level. The atomic.h header has some C++ template magic in it, but that is just for convenience; one could use the raw compare-and-swap functions directly.

Update for v3. This version of the code includes support for task-based parallelism.

Compiling

To compile with these primitives:

g++ -ggdb3 -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C ...

To cross-compile for SPARC:

~cis534/public/cross/bin/sparc-sun-solaris2.10-g++ -R /home1/c/cis534/public/cross/sparc-sun-solaris2.10/lib/sparcv9/ -ggdb3 -m64 -std=c++0x -pthread -lrt -Wall -O3 ppp.C barrier.C random.C ...

ppp.h

This header file is the main entry point; it abstracts the low-level interaction with pthread creation and bookkeeping.

machine.h

This header contains the low-level architecture-specific code (for example, inline assembly). Generally the atomic<T> class below should be used instead of these directly.

atomic.h

This header file defines the atomic<T> template class. It is a wrapper around a single variable of type T that allows atomic operations to be performed on the variable. It should work for any T that is 4 or 8 bytes in size.
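The exact interface of atomic<T> is not reproduced here, but the following is a minimal sketch of the kind of wrapper atomic.h provides, built directly on GCC's __sync builtins (the method names below are illustrative, not the actual PPP API):

```cpp
#include <stdint.h>
#include <cassert>

// Illustrative sketch only: a stripped-down atomic wrapper in the
// spirit of atomic<T>, built on GCC's __sync builtins. The real
// atomic.h interface may differ.
template <typename T>
class atomic_sketch {
public:
  explicit atomic_sketch(T initial = T()) : m_value(initial) { }

  T load() const { return m_value; }

  // Atomically add delta and return the *old* value.
  T fetch_and_add(T delta) {
    return __sync_fetch_and_add(&m_value, delta);
  }

  // Atomic compare-and-swap: returns true if the swap happened.
  bool compare_and_swap(T expected, T desired) {
    return __sync_bool_compare_and_swap(&m_value, expected, desired);
  }

private:
  volatile T m_value;
};
```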

barrier.h

Contains the code for a single global barrier. It uses a single counter with sense reversal. To simplify initialization, this code sets up a single global barrier that threads can use directly.
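To see how a counter with sense reversal works, here is an illustrative sketch (not the actual barrier.h code): a shared counter and a shared "sense" flag, plus a per-thread local sense that flips on each barrier episode so the barrier can be reused immediately.

```cpp
#include <cassert>

// Illustrative sketch of a sense-reversing barrier (not the actual
// barrier.h code). The last thread to arrive resets the counter and
// flips the shared sense, releasing all spinning threads.
class SenseBarrier {
public:
  explicit SenseBarrier(int num_threads)
    : m_num_threads(num_threads), m_count(0), m_sense(false) { }

  // local_sense is thread-private state, flipped by each call.
  void wait(bool& local_sense) {
    local_sense = !local_sense;
    if (__sync_add_and_fetch(&m_count, 1) == m_num_threads) {
      m_count = 0;           // last arriver resets the counter...
      m_sense = local_sense; // ...and releases everyone by flipping sense
    } else {
      while (m_sense != local_sense) { /* spin */ }
    }
  }

private:
  const int m_num_threads;
  volatile int m_count;
  volatile bool m_sense;
};
```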

TTSLock.h

Contains the code for a simple test-and-test-and-set lock using compare and swap.
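As an illustration of the idea (not the actual TTSLock.h code), a test-and-test-and-set lock spins on plain loads (the cheap "test") and only attempts the atomic compare-and-swap once the lock looks free, which keeps the cache line mostly shared while waiting:

```cpp
#include <cassert>

// Illustrative sketch of a test-and-test-and-set lock built on
// compare-and-swap (not the actual TTSLock.h code).
class TTSLockSketch {
public:
  TTSLockSketch() : m_held(0) { }

  void acquire() {
    while (true) {
      while (m_held != 0) { /* spin on plain loads, no atomics */ }
      if (__sync_bool_compare_and_swap(&m_held, 0, 1)) {
        return; // we observed 0 and swapped in 1: lock acquired
      }
      // CAS failed: someone beat us to it, go back to spinning.
    }
  }

  void release() {
    __sync_lock_release(&m_held); // store 0 with release semantics
  }

  bool is_held() const { return m_held != 0; }

private:
  volatile int m_held;
};
```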

random.h

As there doesn't seem to be a nice portable way to generate random numbers in a thread-safe way, this code does that. It isn't a great random number generator (not good enough for security or scientific simulation), but it is simple, fast, and good enough for our purposes.
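The usual way to make such a generator thread-safe is to keep all state in a value the caller owns (one per thread), so nothing global is shared and no locking is needed. A minimal sketch of that approach, using Marsaglia's xorshift64 as the cheap core (this is not the actual random.h code):

```cpp
#include <stdint.h>
#include <cassert>

// Illustrative sketch of a thread-safe RNG (not the actual random.h
// code): the entire generator state is the caller-owned 64-bit value,
// so each thread can keep its own seed with no shared state.
// The seed must be nonzero.
static inline uint64_t xorshift64_next(uint64_t& state) {
  state ^= state << 13;
  state ^= state >> 7;
  state ^= state << 17;
  return state;
}
```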

Task.h

All tasks are descendants of the Task base class. Each task has an execute() method that is overridden by each sub-class of Task. The Task objects also record other information about the task, such as which counter to atomically decrement after the task has completed executing.
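The shape of this class can be sketched as follows (illustrative only; the method names other than execute() are assumptions, not the actual Task.h API):

```cpp
#include <stdint.h>
#include <cstddef>
#include <cassert>

// Illustrative sketch of the Task idea (not the actual Task.h code):
// an abstract base with a virtual execute(), plus a pointer to a
// completion counter the runtime decrements atomically after the
// task has run.
class TaskSketch {
public:
  TaskSketch() : m_completion_counter(NULL) { }
  virtual ~TaskSketch() { }

  virtual void execute() = 0; // each sub-class overrides this

  void setCompletionCounter(volatile int64_t* counter) {
    m_completion_counter = counter;
  }

  // Called by the runtime once execute() has returned.
  void markComplete() {
    if (m_completion_counter != NULL) {
      __sync_fetch_and_sub(m_completion_counter, 1);
    }
  }

private:
  volatile int64_t* m_completion_counter;
};

// Example sub-class: doubles the value it points at.
class DoubleTask : public TaskSketch {
public:
  explicit DoubleTask(int64_t* value) : m_value(value) { }
  virtual void execute() { *m_value *= 2; }
private:
  int64_t* m_value;
};
```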

TaskGroup.h

To spawn/wait on tasks, the programmer creates a TaskGroup object. This object has a spawn(Task& t) method for invoking tasks and a wait() method. A spawn operation immediately enqueues the task in the task queue. The wait() method completes only once all tasks spawned as part of that task group have completed. While waiting, a thread will look for other work to perform:

void spawn(Task *t);  // Will call delete on the pointer after execution
void spawn(Task &t);
void wait();
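The semantics can be illustrated with a deliberately *serial* mock (the real TaskGroup enqueues into a shared task queue, and a waiting thread executes other queued tasks rather than idling):

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// Serial sketch of the TaskGroup spawn/wait semantics only; not the
// actual implementation. TaskT is any class with an execute() method.
template <typename TaskT>
class SerialTaskGroupSketch {
public:
  // spawn: record the task; the real version makes it runnable at once.
  void spawn(TaskT& t) { m_pending.push_back(&t); }

  // wait: returns only after every spawned task has executed.
  void wait() {
    for (size_t i = 0; i < m_pending.size(); i++) {
      m_pending[i]->execute();
    }
    m_pending.clear();
  }

private:
  std::vector<TaskT*> m_pending;
};

// Tiny task type for demonstration purposes.
struct IncTask {
  IncTask() : count(0) { }
  void execute() { count++; }
  int count;
};
```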

parallel_sort.h

The parallel_sort() function has the prototype:

template <typename T>
void parallel_sort(T* array, int64_t left, int64_t right, int64_t grainsize=0)

A call to parallel_sort() will sort an array of any type, given its starting and ending index. It uses the built-in (or programmer-specified) "less than" and "greater than" operators on type T. The grainsize parameter is optional.

See driver-sort.C for code that uses parallel_sort().
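The divide-and-conquer structure behind such a sort looks roughly like the serial sketch below (in the real parallel_sort() the two recursive calls would be spawned as tasks; this sketch is illustrative, not the actual implementation):

```cpp
#include <algorithm>
#include <stdint.h>
#include <cassert>

// Serial sketch of the recursive structure behind parallel_sort().
// left and right are inclusive indices; below grainsize elements the
// range is sorted serially rather than divided further.
template <typename T>
void sort_sketch(T* array, int64_t left, int64_t right,
                 int64_t grainsize = 32) {
  if (right - left + 1 <= grainsize) {
    std::sort(array + left, array + right + 1); // small: sort serially
    return;
  }
  // Partition around a pivot using "less than" and "greater than" on T.
  T pivot = array[(left + right) / 2];
  int64_t i = left;
  int64_t j = right;
  while (i <= j) {
    while (array[i] < pivot) i++;
    while (array[j] > pivot) j--;
    if (i <= j) { std::swap(array[i], array[j]); i++; j--; }
  }
  // In parallel_sort(), these two calls would be spawned as tasks.
  if (left < j)  sort_sketch(array, left, j, grainsize);
  if (i < right) sort_sketch(array, i, right, grainsize);
}
```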

parallel_for.h

The parallel_for() function has the prototype:

template <typename T>
void parallel_for(int64_t start, int64_t end, T* functor, int64_t grainsize=0)

This applies the specified functor over the range from start to end using recursive divide-and-conquer parallelism. The functor is any object of a class with a calculate(int64_t start, int64_t end) method. The grainsize parameter is optional.

See driver-compute.C for code that uses parallel_for().

Example. For example, consider the following loop:

void foo()
{
  float* arr = new float[size];
  ...
  for (int64_t i=0; i<size; i++) {
    arr[i] = arr[i] * 2.0;
  }
  ...
}

The code to parallelize the above loop would be:

#include "parallel_for.h"
class MyFunc {
public:
  MyFunc(float* array, float scalar)
    : m_array(array), m_scalar(scalar) { }

  void calculate(int64_t start, int64_t end) {
    for (int64_t i=start; i<end; i++) {
      m_array[i] = m_array[i] * m_scalar;
    }
  }
private:
  float* m_array;
  float m_scalar;
};

void foo()
{
  float* arr = new float[size];
  ...
  MyFunc f(arr, 2.0);
  parallel_for(0, size, &f);
  ...
}

Although this looks nasty (or at least verbose), lambda functions in C++0x will clean this up substantially. GCC doesn't support them yet (and thus we won't be using them in this assignment), but the code might look something like this:

void foo()
{
  float* arr = new float[size];
  ...
  parallel_for(0, size, [=](int64_t start, int64_t end) {
    for (int64_t i=start; i<end; i++) {
      arr[i] = arr[i] * 2.0;
    }
  });
  ...
}

Or, an even simpler construct could be supported (but isn't yet):

void foo()
{
  float* arr = new float[size];
  ...
  parallel_for(0, size, [=](int64_t i) {
    arr[i] = arr[i] * 2.0;
  });
  ...
}

parallel_reduce.h

Analogous to parallel_for() and parallel_sort().

See driver-reduce.C for code that uses parallel_reduce().
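As a serial sketch of the structure (illustrative only, not the actual parallel_reduce.h code): each half of the range is reduced recursively, and the two partial results are merged. The combine() method on the functor is an assumption made for this sketch; the real interface may merge partial results differently.

```cpp
#include <stdint.h>
#include <cassert>

// Serial sketch of the divide-and-conquer structure behind
// parallel_reduce(). In the parallel version the two halves would be
// spawned as tasks. combine() is an assumed merge step, not
// necessarily the actual PPP interface.
template <typename T, typename Functor>
T reduce_sketch(int64_t start, int64_t end, Functor& f,
                int64_t grainsize = 32) {
  if (end - start <= grainsize) {
    return f.calculate(start, end); // small range: reduce serially
  }
  int64_t mid = start + (end - start) / 2;
  T left  = reduce_sketch<T>(start, mid, f, grainsize);
  T right = reduce_sketch<T>(mid, end, f, grainsize);
  return f.combine(left, right);
}

// Example functor: sums array elements over [start, end).
struct SumFunc {
  explicit SumFunc(const int64_t* a) : array(a) { }
  int64_t calculate(int64_t start, int64_t end) {
    int64_t sum = 0;
    for (int64_t i = start; i < end; i++) sum += array[i];
    return sum;
  }
  int64_t combine(int64_t a, int64_t b) { return a + b; }
  const int64_t* array;
};
```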