Course Project - CIS501 Fall 2010

Instructor: Prof. Milo Martin
Proposal Due: Thursday, Nov 18 at noon
Report Due: Monday, Dec 13 at noon

Overview

You are a team of computer architects working at a company, and your supervisor asks you to look at some academic research on a topic. Your supervisor wants you to identify a promising paper (or set of papers), describe the work in your own words, re-evaluate the ideas (using your own code and simulations), and present your findings and conclusions. You can assume the set of traces I've provided are representative of the workloads your company cares about.

The final report should describe the idea in your own words. You should assume that your supervisor is knowledgeable in computer architecture as a whole, but not familiar with the specific idea or paper(s). You should be sure to conclude with your findings regarding the value of the idea explored (do your results confirm the results in the paper, or do they put them in doubt?). Would you recommend that your company consider implementing the ideas in a future system?

You are welcome (and encouraged!) to explore your own idea, but it should be described and evaluated in the context of other work related to the idea (for example, by citing and describing relevant related work and comparing to reasonable baselines).

Teams

This project is to be done in teams of four students.

Deliverables

1. Proposal

The project proposal should be around a page (or maybe 500 words). It should include your project title, the names of the group members, a brief description of your proposed project, your experimental plan (what you plan to measure and how), and references to any key papers. In general, the more information you give, the more feedback I can provide!

2. Final Report

The final project report is a document describing your project. I would estimate something on the order of half the length of the conference papers we've read (say, approximately 4000 words), but it can be longer or shorter as appropriate. You should assume the reader is an expert in computer architecture but is not familiar with the specific paper(s) upon which you based your project. It should describe the basic idea, your implementation of the idea, and your experimental apparatus and assumptions. It should present results in graphical form, include prose that analyzes the results, draw conclusions, and also include a brief overview of related/prior work.

3. Source Code

I'd also like you to turn in all of your source code via Blackboard.

4. Group Contribution Report

After turning in the final report, each member of the group must submit an individual evaluation of each group member's contribution. This has three parts:

  1. In a sentence or two, describe your role in the group. What tasks did you perform? What was your main contribution?
  2. Assign each of the members in the group (including yourself) a percentage based on your view of their contribution and effort spent on the project. The numbers should add up to 100%. In the unlikely event that every group member made exactly the same contribution, you would give each group member (including yourself) a value of 25%. If someone contributed, but other students contributed more, you might give 20%, 25%, 25%, 30%, etc. Again, it is highly unlikely that everyone contributes evenly, and that is okay. I'm just looking to understand the group dynamics and identify problem cases in which a team member greatly underperformed.
  3. Describe any problems the group encountered and how they were overcome (or not). Also explain the reasons for your assessment of each group member's contribution (especially if the assessment was unbalanced).

Overall, I'm not asking you to spend a long time on this, but I want some window into the group's operation.

Overall Comments Based on Project Proposals

Write your own code. For any of the teams working on branch prediction and prefetching, you should write your own code. You should not be downloading reference implementations and using them in your simulations. Instead, you should re-implement the algorithms within our simulation environment. The papers generally have sufficient information to reproduce the predictors. For other projects, if you use existing tools or infrastructure, you must cite any tools you use and describe how you used and/or modified them.

Use our traces, if applicable. Again, for the branch prediction and prefetching projects, I strongly suggest (require?) that you use our traces (and not the ones from the various competitions). The set of traces we've made available should generally have enough interesting branch and memory behavior to be sufficient. I'm familiar with these traces, you already have tools that can simulate them, and using them should allow you to focus on the ideas behind the project. This is also not unrelated to my comments above about writing your own code for the various predictors and/or prefetchers.

How are you going to get your results? For the other projects (those not on branch prediction or prefetching), a few of the project proposals were woefully inadequate in terms of how the results were going to be generated. Is it simulation? With our traces? Or with other traces? If so, how do you plan to generate those traces? What benchmarks are you going to use? What tools are you going to use? Etc. Be sure to really think through what results you are going to need and how you are going to get them.

Project writeup. Be sure to give your project a title and list the team members (as if it were a paper). Avoid copying figures or prose exactly from other papers. Instead, re-describe the ideas in your own words and draw your own figures. Cite papers in the document body using numbers, such as [1]. List the references at the end, and include: title, authors, year, and the conference/journal in which the paper appeared.

Topic-Specific Comments Based on Project Proposals

Branch Prediction (8 groups)

Not surprisingly, branch prediction was the most popular topic. Branch prediction also has the least risk involved, as you already have the basic infrastructure for analyzing prediction accuracy (HW3) and impact on performance (HW6). As a result, I have high expectations for these projects in terms of polish, analysis, and insight. Let me say it another way: as there are several groups looking at this general topic, you should find some way to make your project report notable or memorable in some way.

After completing HW6, you'll have a reasonable timing model, so I expect at least some timing results (IPC) for the various predictors. Much of your work can be done by measuring mis-prediction rate, but I'd like to see some IPC numbers, too. Also, when comparing predictors, be sure to compare equal-area predictors (as we did in our assignment) to make fair comparisons.
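For example, equalizing storage budgets is just a matter of counting bits (a quick sketch; the predictor types and sizes below are illustrative, not required):

    # A minimal sketch of sizing predictors to an equal storage budget.

    def bimodal_bits(num_counters, counter_bits=2):
        # A bimodal predictor is a PC-indexed table of saturating counters.
        return num_counters * counter_bits

    def gshare_bits(history_bits, counter_bits=2):
        # gshare indexes a table of 2^h counters with (PC XOR history).
        return (2 ** history_bits) * counter_bits

    # For a 32 Kbit budget, a 16K-entry bimodal predictor is "equal area"
    # to a gshare predictor with 14 bits of history:
    assert bimodal_bits(16 * 1024) == 32 * 1024
    assert gshare_bits(14) == 32 * 1024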

At least a few groups proposed looking at some of the complicated predictors from the predictor championship competition, which is great! However, if you find these predictors to be too complicated (understandable), you might consider some of the earlier predictors that form the basis of these predictors. That is, these award-winning predictors are mostly refined (and more complicated) versions of slightly older predictors (see below for more ideas).

Conversely, if you're looking at one of the relatively simpler predictors such as the basic perceptron predictor, you might consider looking at some of the more recent hybrids or at the piecewise linear predictors that extend the perceptron predictor. Or perhaps compare it to a state-of-the-art non-neural predictor such as the gskew predictor discussed below.
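To make the starting point concrete, here is a minimal sketch of the basic perceptron predictor (the table size and history length are illustrative choices; the training threshold formula follows the Jimenez and Lin paper; a real design would also saturate the weights):

    NUM_PERCEPTRONS = 1024
    HISTORY_LEN = 16
    THRESHOLD = int(1.93 * HISTORY_LEN + 14)  # training threshold from the paper

    weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]
    history = [1] * HISTORY_LEN  # global history: +1 = taken, -1 = not taken

    def predict(pc):
        w = weights[pc % NUM_PERCEPTRONS]
        # Bias weight plus the dot product of weights and global history.
        y = w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))
        return y, y >= 0

    def train(pc, y, taken):
        w = weights[pc % NUM_PERCEPTRONS]
        t = 1 if taken else -1
        # Train on a misprediction or whenever |y| is below the threshold.
        if (y >= 0) != taken or abs(y) <= THRESHOLD:
            w[0] += t
            for i in range(HISTORY_LEN):
                w[i + 1] += t * history[i]
        history.pop()         # shift the global history register:
        history.insert(0, t)  # drop the oldest outcome, insert the newest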

There has been work on various sorts of tagged predictors and skewed indexing that are interesting without being as unwieldy as some of the predictors used in the championship competition, so you might consider those as well.

As of right now, the top "conventional" predictor is considered to be an optimized version of the gskew predictor. It was going to be used in the Alpha 21464 and has since been further refined. A good starting reference is:

http://www.irisa.fr/caps/people/seznec/Optim2bcgskew.pdf
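To give a flavor of the structure, the core of gskew is three counter banks combined by a majority vote (a sketch only; the real skewing functions in the paper above are chosen much more carefully than the ad-hoc hashes here, and the bank size is illustrative):

    BANK_SIZE = 4096
    banks = [[2] * BANK_SIZE for _ in range(3)]  # 2-bit counters, weakly taken

    def skew_index(bank, pc, history):
        # Each bank hashes (pc, history) differently, so branches that
        # conflict in one bank are unlikely to conflict in the others.
        return ((pc ^ (history << bank)) * (2 * bank + 3)) % BANK_SIZE

    def predict(pc, history):
        taken_votes = sum(banks[b][skew_index(b, pc, history)] >= 2
                          for b in range(3))
        return taken_votes >= 2  # majority of the three banks

    def train(pc, history, taken):
        for b in range(3):
            i = skew_index(b, pc, history)
            c = banks[b][i]
            banks[b][i] = min(3, c + 1) if taken else max(0, c - 1)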

Another predictor that isn't described in the literature but is used by Intel's processors is a loop-count predictor. There is a brief description of one implementation of a loop counter predictor as well as a good overview of the state of the art of branch prediction in the following lecture notes:

http://homepages.cae.wisc.edu/~mikko/752_Spring08/lect7-adv-branch-prediction.pdf
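As a concrete starting point, the core of a loop-count predictor might look like the following (a sketch under my own assumptions; the table organization and confidence scheme are illustrative, not Intel's actual design):

    # Learn a loop branch's trip count; predict not-taken on the exit iteration.

    class LoopEntry:
        def __init__(self):
            self.trip_count = 0   # taken iterations seen in the last run
            self.current = 0      # taken iterations seen so far in this run
            self.confident = False

    table = {}  # indexed by branch PC; a real table would be small and tagged

    def predict(pc):
        e = table.get(pc)
        if e is None or not e.confident:
            return None  # defer to the default predictor
        return e.current < e.trip_count  # taken until the learned trip count

    def train(pc, taken):
        e = table.setdefault(pc, LoopEntry())
        if taken:
            e.current += 1
        else:
            # Loop exit: gain confidence only if the trip count repeats.
            e.confident = (e.current == e.trip_count)
            e.trip_count = e.current
            e.current = 0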

To give you some more ideas to explore that aren't quite as hard to tune as the most recent tournament predictors, below are links to some relevant papers on branch prediction, in rough chronological order:

Prefetching (4 groups)

I have a few comments on the prefetching projects:

First, I would like to see some timing results, as fetching a block one cycle "early" and calling that a successful prefetch is misleading (at best). To do timing simulation, you should connect a cache model to the HW6 simulator. To simplify the modeling of cache misses, I would actually check the cache tags during the issue ("wakeup") stage and only allow a load to proceed to execute if the block is in the cache. If the block is not in the cache, I would (1) allocate an entry in the cache for that block (evicting whatever else was in the cache) and then (2) record the cycle in the future when the miss will complete. Similarly, if something is prefetched into the data cache, I would put it in the cache immediately, but set the "available" cycle based on the latency of the miss. Then, never let a load "hit" to a block that isn't yet "available" in the cache. This approach should capture the most important timing effects without being too complicated to implement.
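In code, this approach might look something like the following (a sketch; the miss latency, LRU policy, and the way it hooks into the HW6 issue stage are assumptions you would adapt):

    MISS_LATENCY = 100  # cycles to fill a block from the next level (assumed)

    class Line:
        def __init__(self, tag, available):
            self.tag = tag
            self.available = available  # cycle at which the fill completes

    class Cache:
        def __init__(self, num_sets, ways):
            self.num_sets = num_sets
            self.ways = ways
            self.sets = [[] for _ in range(num_sets)]  # front = most recent

        def access(self, block_addr, now):
            """Return True iff a load may proceed to execute this cycle."""
            s = self.sets[block_addr % self.num_sets]
            tag = block_addr // self.num_sets
            for line in s:
                if line.tag == tag:
                    s.remove(line)
                    s.insert(0, line)  # LRU update
                    # Never "hit" on a block that isn't yet available.
                    return line.available <= now
            # Miss: allocate immediately, record when the data will arrive.
            if len(s) >= self.ways:
                s.pop()  # evict the LRU line
            s.insert(0, Line(tag, now + MISS_LATENCY))
            return False

    # A prefetch calls access() the same way: the block is installed right
    # away but only becomes a hit once its "available" cycle has passed.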

Second, you should implement a cache with higher associativity than just two-way set associative. If you don't, you'll just be prefetching the conflict misses. Instead, remove the conflict misses by making the cache 8-way set-associative or more, and then focus on the harder compulsory and capacity misses that prefetching targets.

Third, use some sort of simple stride-based prefetcher as a baseline. As stride prefetching is simple, effective, and widely implemented, it really should be the baseline to be "fair".
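For reference, a basic PC-indexed stride prefetcher is only a few lines (a sketch; the confidence threshold and prefetch degree are illustrative tuning knobs):

    class StrideEntry:
        def __init__(self, addr):
            self.last_addr = addr
            self.stride = 0
            self.confidence = 0

    table = {}  # indexed by load PC; a real table is small and direct-mapped

    def on_load(pc, addr):
        """Called for each load; returns a list of addresses to prefetch."""
        e = table.get(pc)
        if e is None:
            table[pc] = StrideEntry(addr)
            return []
        stride = addr - e.last_addr
        if stride == e.stride and stride != 0:
            e.confidence = min(e.confidence + 1, 3)
        else:
            e.confidence = 0
            e.stride = stride
        e.last_addr = addr
        # Prefetch one block ahead once the stride has repeated a few times.
        return [addr + e.stride] if e.confidence >= 2 else []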

Finally, the traces we have just aren't long enough for evaluating the performance of large caches. So, I suggest that you use caches of the modest sizes of today's L2 caches (128KB to 512KB) and focus on prefetching from the L3 (modeled as infinite with some constant latency). You wouldn't need to model the first two levels explicitly; just model a single L1/L2 cache of sufficient size and associativity (unless you really want to model a two-level cache hierarchy).

Memory Dependence Prediction (2 groups)

First, I'd also like to see "timing" data for the store sets modeling. To do this, you'll need to extend the HW6 simulator to actually model flushing the pipeline after a memory dependence mis-prediction. Why? A memory dependence mis-prediction only hurts performance if the load actually issues too early (a load predicted independent might still get the correct data, for example, if the store supplying its data happened to have already executed). Fortunately, flushing the pipeline in our simple model can be approximated without even re-fetching the instructions (just reset the mis-speculated load and all younger instructions so they must re-execute, and mark their output registers as non-ready). In fact, with this approach it would be possible to model selective squashing (squashing only those instructions that were dependent on the load) and explore the performance implications of that. If you wanted to model re-fetching the instructions (which would be more accurate), you could model that by "fetching" the instructions back into the ROB by simply re-enabling them at the rate they would have been fetched.
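Here is one way the approximate squash might look (a sketch, assuming the HW6 simulator keeps an in-order list of in-flight instruction records; the field names are hypothetical):

    def squash_from(rob, load_index, ready_regs, selective=False):
        """Reset the mis-speculated load and younger instructions to re-execute.

        With selective=True, only the load and instructions transitively
        dependent on it are replayed, approximating selective squashing."""
        poisoned = set(rob[load_index].dest_regs)
        for insn in rob[load_index:]:
            depends = bool(set(insn.src_regs) & poisoned)
            if insn is rob[load_index] or not selective or depends:
                insn.executed = False          # must issue again
                for r in insn.dest_regs:
                    ready_regs.discard(r)      # output no longer ready
                    poisoned.add(r)            # taint transitive consumers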

Second, you do not need to reproduce all the results and perform all the experiments in the original store sets paper. Pick and choose the most important experiments, and be sure to include at least some end results (realistic predictor sizes and an implementable mechanism).

Third, I'm not sure if any of our traces are going to act like "applu" did. If they do (or even if they don't) you're welcome and encouraged to play with some history or other means of trying to improve the prediction accuracy.

Fourth, the more accurate the branch predictor, the more difficult memory dependence prediction becomes (and thus the more opportunity for improvement). The Store Sets paper does not explore what happens as the window grows or as branch prediction is improved. Looking at how store sets behaves with large windows and perfect branch prediction might be interesting.

Other Cache Papers (3 groups)

For these projects (cache banking, dynamic insertion policies, and pseudo set-associativity), you should be able to collect statistics on our traces using just cache simulation. However, as with the other projects discussed above, I'd also really like to see timing results based on the HW6 simulator. You'll need to hook the cache into it (see the comments above in the prefetching section).

Others (3 groups)

In addition to the general comments above, please see specific comments sent to you via email.

Retrospective Comments from 2010

Comments for next year based on what I learned from the writeups this year.