**Instructor**: Prof. Milo Martin

**Initial Data Due**: Monday, February 8th (at the **start** of class)

**Final Data and Writeup Due**: Wed, February 10th (at the **start** of
class). I may extend this to the following Monday if the class seems to
need it.

**Summary**: Create efficient implementations of the computations below
(similar to the ones from the first homework). **As with the earlier
homework, we're focusing on sequential implementations**. The primary
goal is to minimize runtime; the secondary goal is to minimize memory
footprint. Use the ideas you generated for homework 1 and/or any of the
ideas discussed in class.

This assignment looks at a few different simple computational loops. The assignment is divided into several parts:

- Part 1: Non-algorithmic optimization of computation 1.
- Part 2: Unrestricted optimization of computation 1.
- Part 3: Unrestricted optimization of computation 2.

Plus **one** of:

- Part 4: Unrestricted optimization of computation 3.
- Part 5: Compare the performance on Niagara T2, Core 2, and Core i7 machines.
- Part 6: Implement one of the earlier parts in Java or another alternative language.

In each part you're asked to come up with a few ideas to make the code faster, describe them, try them out, and experimentally measure the impact.

To start, benchmark the performance of computation 1 (below) for numbers of particles ranging from 5,000 to 250,000 (with many data points in between). Create a graph with time on the y-axis and number of particles on the x-axis. Plot the runtime of the unmodified C++ code given below.

Next, re-organize the code and/or data layout to make the computation faster, but (for now) the inner-most "check distance and conditionally accumulate" must remain and must be called for each pair of particles. Issues to consider include micro-architectural impacts, instruction count effects, how the compiler is compiling the code, and memory system effects.

In your writeup, document the series of transformations you make (at least two) and record the improvement or non-improvement of each on runtime. That is, briefly describe the things that didn't work as well as those that did. If it did improve the runtime, did the change impact the runtime as you anticipated? If not, why? The goal is for you to explore different transformations or approaches and gain intuition as to their impact.

For your fastest version of the code, plot its performance on the same graph as the baseline.

Reorganize or rewrite the code for computation 1 (below) with no restrictions. That is, you can and must change the algorithm as well as consider low-level optimizations to create the fastest version of the code you can.

As above, document the series of transformations or approaches you employed.

For your fastest version of the code, plot its performance on the same graph as the data from Part 1. You should extend the x-axis until the runs start taking a minute or reach 10 million particles.
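For the unrestricted rewrite, one algorithmic direction is sketched below; it is only an illustration under the assumption that sorting the points is acceptable, not a claimed optimum, and `compute_one_sorted` is a hypothetical name. Sort the points by location, build a prefix sum of the weights in sorted order, then answer each point's "sum of weights within my radius" query with two binary searches, turning the O(n²) pair loop into O(n log n):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

typedef int64_t int64;

// Sketch of an O(n log n) rewrite of computation 1: sort by location,
// take a prefix sum of weights in sorted order, and answer each query
// with two binary searches over the sorted locations.
void compute_one_sorted(int64 num_particles, int64 *location,
                        int64 *weight, int64 *radius, int64 *answer) {
    // Sort point indices by location.
    std::vector<int64> order(num_particles);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int64 a, int64 b) { return location[a] < location[b]; });

    // Locations and prefix sums of weights, in sorted order.
    std::vector<int64> sloc(num_particles), prefix(num_particles + 1, 0);
    for (int64 k = 0; k < num_particles; k++) {
        sloc[k] = location[order[k]];
        prefix[k + 1] = prefix[k] + weight[order[k]];
    }

    // answer[i] = sum of weights with location strictly inside
    // (location[i] - radius[i], location[i] + radius[i]).
    for (int64 i = 0; i < num_particles; i++) {
        int64 lo = std::lower_bound(sloc.begin(), sloc.end(),
                                    location[i] - radius[i] + 1) - sloc.begin();
        int64 hi = std::lower_bound(sloc.begin(), sloc.end(),
                                    location[i] + radius[i]) - sloc.begin();
        answer[i] = prefix[hi] - prefix[lo];
    }
}
```

Note that the answers are still written in the original index order; only the internal scratch arrays are permuted.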

Reorganize or rewrite the code for computation 2 (below) with no restrictions. That is, you can and must change the algorithm as well as consider low-level optimizations to create the fastest version of the code you can.

As above, document the series of transformations or approaches you employed.

For your fastest version of the code, plot its performance on a new graph (but with the same axes as the above graph). Extend the x-axis until the runs start taking a minute or reach 10 million particles.

**Note: choose only one of Part 4, Part 5, or Part 6 to complete**

As with Part 2 and Part 3 above, perform a similar exploration of optimizations for computation 3 (below). As above, try a few different versions of the code and describe the impact of each modification.

**Note: choose only one of Part 4, Part 5, or Part 6 to complete**

For one of the earlier parts of the assignment, compare and contrast the performance of the computation on all three types of machines we have available: Core 2, Core i7, and Niagara T2 processors. Take into account the relative performance differences, but also how the performance changes as the number of particles changes and the differing impact of some of your code changes above.

**The goal here is to get you thinking deeply about the different
machines**. Is there a difference in how performance changes as the
number of particles increases? If so, why? Or, for example, if some
change provided a speed boost on one machine but little or no gain on
another machine, report it and explain why that might be (that is, what
differences in the processor cores or memory hierarchies might explain
the relative differences).

**Note: choose only one of Part 4, Part 5, or Part 6 to complete**

Re-implement one of the above computations in Java or another language of your choice and re-optimize the code for that language. Plot the differences in performance (using the same runtime versus number of particles plots as above).

The goal of this is to get you thinking about how one might optimize for different languages and about the relative performance difference. What optimizations were especially good/bad for the language? How does the relative performance change as the problem size (number of particles) increases? In your writing, describe your findings, anything interesting you discovered, and your overall impressions.

The C++ code for the baseline computations is given below; it can also be found in the file compute.C. The prose below includes a brief description of the input values, and the exact distribution of values can be determined by looking at driver.C.

A set of N "points" is defined by three input arrays and one output array:

| Array | Description |
|---|---|
| location | An integer value for the one-dimensional location of the point. Each location is unique. |
| weight | An integer weight of the point. |
| radius | An integer distance threshold for the point. |
| answer | An integer output of the computation; starts initialized to all zeros. |

The above information for N points is specified as four size N arrays.

The distance between two points is calculated as:

```cpp
int64 distance(int64 a, int64 b) {
  if (a > b) {
    return a - b;
  } else {
    return b - a;
  }
}
```
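One micro-level idea worth experimenting with (an assumption to verify by measurement, not a guaranteed win, and `distance_branchless` is a hypothetical name): the conditional in `distance` can be replaced with a branch-free absolute difference, which may help if the branch predicts poorly, though at -O3 the compiler often emits a conditional move for the original anyway:

```cpp
#include <cstdint>

typedef int64_t int64;

// Branch-free absolute difference. Valid whenever a - b fits in int64,
// which holds here since locations span only ~16x the particle count.
int64 distance_branchless(int64 a, int64 b) {
    int64 d = a - b;
    int64 mask = d >> 63;      // all ones if d < 0, else all zeros
    return (d ^ mask) - mask;  // negates d exactly when d < 0
}
```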

In the computations below, the locations of the particles are unique and distributed (somewhat non-uniformly) over a range approximately 16 times the number of particles (so the space of locations is neither extremely sparse nor extremely dense).

The radius values have a maximum that is twice the minimum radius, and
the radius values are **independent of the number of particles**. That
is, as the number of particles is increased, the minimum radius and
maximum radius do **not** increase. The radius is set so that it
captures on average a few hundred particles or more.

The code for computation 1:

```cpp
void compute_one(int64 num_particles, int64 *location, int64 *weight,
                 int64 *radius, int64 *answer) {
  for (int64 i = 0; i < num_particles; i++) {
    for (int64 j = 0; j < num_particles; j++) {
      if (distance(location[i], location[j]) < radius[i]) {
        answer[i] += weight[j];
      }
    }
  }
}
```
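As an illustration of the kind of transformation Part 1 permits (a sketch only; the compiler may already do some of this at -O3, so measure rather than assume, and `compute_one_hoisted` is a hypothetical name): hoist the loop-invariant loads of `location[i]` and `radius[i]` out of the inner loop and accumulate into a local, writing `answer[i]` once per outer iteration. The per-pair check remains, as Part 1 requires.

```cpp
#include <cstdint>

typedef int64_t int64;

static inline int64 distance(int64 a, int64 b) {
    return (a > b) ? a - b : b - a;
}

// Same algorithm as compute_one, with loop-invariant values hoisted
// into locals and the accumulation kept in a register.
void compute_one_hoisted(int64 num_particles, int64 *location,
                         int64 *weight, int64 *radius, int64 *answer) {
    for (int64 i = 0; i < num_particles; i++) {
        const int64 loc_i = location[i];
        const int64 rad_i = radius[i];
        int64 acc = answer[i];
        for (int64 j = 0; j < num_particles; j++) {
            if (distance(loc_i, location[j]) < rad_i) {
                acc += weight[j];
            }
        }
        answer[i] = acc;
    }
}
```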

In computation 2, the bit-wise "or" operator is used in an attempt to break any obvious shortcut way of performing the calculation:

```cpp
void compute_two(int64 num_particles, int64 *location, int64 *weight,
                 int64 *radius, int64 *answer) {
  for (int64 i = 0; i < num_particles; i++) {
    for (int64 j = 0; j < num_particles; j++) {
      if (distance(location[i], location[j]) < radius[i]) {
        answer[i] += (weight[i] + weight[j]) | distance(location[i], location[j]);
      }
    }
  }
}
```
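Sorting can still pay off here even though the bit-wise OR blocks a prefix-sum shortcut: after sorting by location, the particles within `radius[i]` of particle i occupy one contiguous run, found with two binary searches, so the inner loop shrinks from all n particles to only the few hundred actually in range. The sketch below is one possible direction, not a claimed optimum, and `compute_two_sorted` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

typedef int64_t int64;

static inline int64 distance(int64 a, int64 b) {
    return (a > b) ? a - b : b - a;
}

// Computation 2 with the inner loop restricted, via sorting and binary
// search, to the contiguous run of locations inside the radius. The
// exact per-pair OR term is still computed for each in-range pair.
void compute_two_sorted(int64 num_particles, int64 *location,
                        int64 *weight, int64 *radius, int64 *answer) {
    std::vector<int64> order(num_particles);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int64 a, int64 b) { return location[a] < location[b]; });

    std::vector<int64> sloc(num_particles), sweight(num_particles);
    for (int64 k = 0; k < num_particles; k++) {
        sloc[k] = location[order[k]];
        sweight[k] = weight[order[k]];
    }

    for (int64 i = 0; i < num_particles; i++) {
        // Run of sorted locations strictly inside
        // (location[i] - radius[i], location[i] + radius[i]).
        int64 lo = std::lower_bound(sloc.begin(), sloc.end(),
                                    location[i] - radius[i] + 1) - sloc.begin();
        int64 hi = std::lower_bound(sloc.begin(), sloc.end(),
                                    location[i] + radius[i]) - sloc.begin();
        int64 acc = 0;
        for (int64 k = lo; k < hi; k++) {
            acc += (weight[i] + sweight[k]) | distance(location[i], sloc[k]);
        }
        answer[i] = acc;
    }
}
```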

Computation 3 is a two-dimensional variant with the same fixed positive radius for all particles. Distance is calculated using "Manhattan" distance (sum of the distances in each dimension):

```cpp
void compute_three(int64 num_particles, int64 *xlocation, int64 *ylocation,
                   int64 *weight, int64 radius, int64 *answer) {
  for (int64 i = 0; i < num_particles; i++) {
    for (int64 j = 0; j < num_particles; j++) {
      int64 dist = distance(xlocation[i], xlocation[j]) +
                   distance(ylocation[i], ylocation[j]);
      if (dist < radius) {
        answer[i] += (weight[i] + weight[j]) * dist;
      }
    }
  }
}
```
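For the two-dimensional case, spatial binning is one direction to explore (a sketch under stated assumptions, not a claimed optimum; `compute_three_binned` is a hypothetical name): hash each point into a square cell of side `radius`. Any point within Manhattan distance less than `radius` of point i differs by less than `radius` in each coordinate, so it must lie in one of the 3x3 cells around i's cell, and only those candidates need the exact distance test.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

typedef int64_t int64;

static inline int64 distance(int64 a, int64 b) {
    return (a > b) ? a - b : b - a;
}

// Grid-binning sketch for computation 3. Assumes radius > 0 and that
// answer[] starts zeroed, as the problem statement specifies.
void compute_three_binned(int64 num_particles, int64 *xlocation,
                          int64 *ylocation, int64 *weight, int64 radius,
                          int64 *answer) {
    typedef std::pair<int64, int64> Cell;
    std::map<Cell, std::vector<int64> > grid;
    auto cell_of = [&](int64 i) {
        // Floor division so negative coordinates bin correctly.
        int64 cx = xlocation[i] >= 0 ? xlocation[i] / radius
                                     : (xlocation[i] - radius + 1) / radius;
        int64 cy = ylocation[i] >= 0 ? ylocation[i] / radius
                                     : (ylocation[i] - radius + 1) / radius;
        return Cell(cx, cy);
    };
    for (int64 i = 0; i < num_particles; i++)
        grid[cell_of(i)].push_back(i);

    for (int64 i = 0; i < num_particles; i++) {
        Cell c = cell_of(i);
        int64 acc = 0;
        // Only the 3x3 block of cells around point i can hold matches.
        for (int64 dx = -1; dx <= 1; dx++) {
            for (int64 dy = -1; dy <= 1; dy++) {
                std::map<Cell, std::vector<int64> >::iterator it =
                    grid.find(Cell(c.first + dx, c.second + dy));
                if (it == grid.end()) continue;
                for (int64 j : it->second) {
                    int64 dist = distance(xlocation[i], xlocation[j]) +
                                 distance(ylocation[i], ylocation[j]);
                    if (dist < radius) acc += (weight[i] + weight[j]) * dist;
                }
            }
        }
        answer[i] = acc;
    }
}
```

A hash map or a flat array of buckets would likely beat `std::map` here; the tree map just keeps the sketch short.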

- When measuring runtime when tuning your application, pick a large enough number of particles such that the runtime is at least 10 or 20 seconds (the longer the better). Anything shorter likely won't give stable performance results or stress the memory hierarchy.
- You may use whatever standard libraries you want.

**Eniac**: For now, log into eniac.seas.upenn.edu and run your code there. You should run your timing results on the dual-socket Core 2 duo machines. Instructions for running your code on the dedicated machines can be found at: https://www.cis.upenn.edu/~cis534/sge.html

**Source files**: You'll want to start with compute.C, compute.h, and driver.C. Download the files. To build the program, use the following command:

```
g++ -Wall -O3 driver.C compute.C -lrt -o compute
```

**Running the driver**: To run the driver code above, you need to specify the computation number and the number of particles, and optionally the number of times to run the test:

```
./compute -computation 1 -particles 10000 -trials 5
```

**Individual work**: Each student must submit their own writeup, but I encourage you to discuss this general problem with each other outside of class. Sharing of code is not allowed.

For the **initial due date**:

- By 10am on the day of class, send an email to cis534@cis.upenn.edu with your results thus far. For each of the three problems, indicate the largest problem size (number of particles) that your most-efficient code can compute in one minute (or 10 million particles if that takes less than 1 minute).
- Be ready to describe your approaches for the in-class discussion.

Print out the following to bring to class on the **final due date**:

**Insightful writeup**. There are few "right" answers for this assignment. And I don't actually know the "right" answers, anyway. The point of this assignment is the process of tuning code and showing whatever insight you gained from the process. I'm not looking for lots of prose, but lots of insight described tersely. We'll discuss these insights in class after the assignments are due.

**Graphs**. Two or three graphs: Parts 1 and 2, Part 3, and Part 4 (if you selected that option).

**Your code**. Print out your final (fastest) versions of each of the computations.

**Your fastest results**. Send me your fastest results via email. Just copy and paste the "average" line from the output, which looks like:

```
average: user: milom, host: acggrid31, comp: 1, particles: 100000, radius: 500, sparsity: 16, seconds: 20.6832
```

For Parts 1 through 3, you only need to send me the timings for the following runs:

- Part 1: 200K particles (200000)
- Part 2: 20M particles (20000000)
- Part 3: 10M particles (10000000)

Thus, you should email something like the following three lines:

```
average: user: milom, host: acggrid31, comp: 1, particles: 200000, radius: 500, sparsity: 16, seconds: 00.00
average: user: milom, host: acggrid31, comp: 1, particles: 20000000, radius: 500, sparsity: 16, seconds: 00.00
average: user: milom, host: acggrid31, comp: 2, particles: 10000000, radius: 500, sparsity: 16, seconds: 00.00
```

There probably isn't any "right" solution. If there is, I don't know what it is, as I haven't tried this out myself. It could be the obvious way is the fastest way of doing it; or it could be more subtle.

- [Feb 2] Changed compute.C to use "bit-wise or" instead of multiply (so that the C code matches this document).

After assigning and grading this assignment, a few things to consider correcting in future years:

- I now have some pretty efficient sequential code for some of these simple computations. As such, I could give students a baseline to shoot for, which would really help (some of the students stopped well short of obtaining efficient solutions in some cases).
- Having more "real world" n-body calculations (with visualization) would certainly be better than these simple made-up computations. The trick is finding one that is simple enough for students to understand.