Homework 2 - CIS501 Fall 2008

Instructor:	Prof. Milo Martin
Due:	Tuesday, Oct 21st (at the start of class)

Overview

In this assignment you will explore the effectiveness of caches and the impact of different cache configurations. We've used Pin, a binary instrumentation tool, to generate a trace of all data accesses (loads and stores) of a program. Your task will be use this representative trace to evaluate the impact of cache size, associativity, block size, and next-block prefetching.

To do this, you'll write a program that reads in the trace and simulates different cache configurations and prefetching schemes. You're welcome to write the program in whatever language you like; although we are going to ask for a printout of your source code, to end result of this assignment are the experimental results (and not the program source code).

Trace and Trace Format

The trace we're giving you is a trace of load and store operations from an execution of the program GCC (Gnu C Compiler) from the SPECint2000 benchmark suite. Instruction cache accesses are not recorded (we're focusing on the data cache in this assignment). Each line of the trace file has two fields. Below are the first few lines of a trace file:

bfede22c    L
bfede228    S

The first field is the memory address (in hex) being read or written. The second field is the character "L" or "S" to distinguish between load and store operations. You may assume that no memory operations cross cache blocks.

The trace is available as an 81MB gzip compressed file (memory-trace-gcc.trace.gz). If you wish to verify that the files downloaded correctly, the MD5 sum is 5036ef77c947267e5b1512080eeced68.

You can either (1) download the compressed file, uncompress it, and process the uncompressed file, or (2) process the compressed file directly. Under Linux this can be done by using zcat to pipe the uncompressed trace into a program reading from standard input. For example: zcat | your-program. Another option is to use zlib (for C/C++) or java.util.zip (for Java) to directly open and process the compressed file. Finally, you can always just use the uncompressed version of the trace.

To help you debug your cache simulation code, we've provided some annotated results for the first thousand or so memory references for various cache configurations (see below). Comparing your results to these annotated outputs should help ensure your code is working properly.

Question 1 - Direct-mapped Tag/Index/Offset Calculations

As talked about in class, a cache has three primary configuration parameters:

S: Cache size (how much data the cache holds)
B: Block size (the granularity of the data)
A: Associativity (number of blocks with the same index that can be in the cache at the same time).

These three parameters the number of sets and ways in the cache. These parameters also specify how a memory reference address is split into its three components: block offset bits, index bits, and tag bits.

Assuming 32-bit address, calculate the following values for a direct-mapped 32KB cache with 64B blocks (that is: S=32KB, B=64B, and A=1):

The number of bits in the block offset

The number of "sets" in the cache

The number of "ways" in the cache

The number of index bits

The number of tag bits

Question 2 - Set-Associative Tag/Index/Offset Calculations

For a two-way set associative version of the cache in question 1 (that is: S=32KB, B=64B, and A=2), calculate the following values:

The number of bits in the block offset

The number of "sets" in the cache

The number of "ways" in the cache

The number of index bits

The number of tag bits

Question 3 - Generalizing the Calculations

Assuming 32-bit address, write formulas in terms of S, B, and A for calculating the following quantities. You may find using log(S), log(B), and log(A) in the formulas easier than S, B, and A directly. For this assignment, you may assume S, B, and A are all powers of 2. Write the formulas for:

The number of bits in the block offset

The number of "sets" in the cache

The number of "ways" in the cache

The number of index bits

The number of tag bits

Hint: you may find it helpful to first work out the numbers for a direct-mapped cache (that is, consider just S and B), and then extend it to set-associative caches.

Question 4 - Miss Rate vs Cache Size

The first experiment is to gauge the impact of cache size on cache miss rate. Write a program to simulate a direct-mapped cache parametrized by S and B. You may actually find it more convenient to configure your cache in terms of log(S) and log(B) instead of S and B.

For this first direct-mapped cache simulations, the contents of the cache can be represented by a single array of values: the tag array. To avoid needing to track a separate valid bit, just initialize all the tag values to zero and assume that all blocks are valid and remain valid throughout the simulation. As we're not actually manipulating the actual data values (as a real processor would) but just calculating hit/miss for memory reference, there is no need to track the contents of the data array (in fact, the trace doesn't even have the information you would need to do this).

For now, handle loads and stores in the same fashion.

Use your cache simulator to produce cache miss rates for varying cache sizes. Generate the data for caches from 256 bytes (2⁸) to 4MB (2²²). Configure the block size to 64 bytes. Generate a line plot of this data. On the y-axis, plot the "cache miss rate (percent of all memory references)". The smaller the miss rate, the better. On the x-axis, plot the log of the cache size (basically, the log of the cache sizes), in essence giving us a log scale on the x-axis.

How large must the cache be for the miss rate to be less than 10%? How large to be less than 5%?

Today's processors generally have 32KB or 64KB first-level data caches. The best improvement you would expect in general from doubling the size of the cache would be a 2x reduction in misses (for example, reducing the miss rate from 20% to 10%). By what ratio does increasing the cache size from 32KB to 64KB reduce the miss rate? (2.0 would be halving the miss rate; 1.0 would be no change in miss rate; less than 1.0 would be an increase in misses).

Question 5 - Performance Impact of Cache Size

In that last question, you plotted the miss rate, but the miss rate is only indirectly related to performance. In this question, plot the CPI (lower is better) for a processor with varying cache size based on the following assumptions: a base CPI of 1, 33.33% of instructions are memory operations, and the cache miss latency is either 10, 25, or 100 cycles. The x-axis is cache size (same as the previous question). The y-axis is CPI. Plot three lines on a graph (once for each miss latency). Creating this graph shouldn't require additional simulations, but can instead be calculated from the data collected in the previous question.

For each of the cache miss latency, how large must the cache be to achieve a CPI of 2? What does this say about the impact on cache miss latency?

In the previous question you calculated the ratio of miss rates for increasing the cache size from 32KB to 64KB. What is the speedup that results from increasing the cache size from 32KB to 64KB (this is basically a "speedup" calculation in which 1.0 is no speedup, less than 1.0 is a slowdown, and greater than 1.0 is a speedup). Calculate the speedup for each of the cache miss latencies.

How do these speedup numbers compare to the cache miss ratio you calculated in part b of the previous question?

Question 6 - Traffic of Write-Back vs Write-Through Caches

Modify your simulator to calculate the total number of bytes transfered. The first source of traffic is the data transfers to fill cache blocks. This quantity is easy to calculate: it is just the number of misses multiplied by the block size. The other source of traffic is write traffic (either writeback or writethrough). In this question, you'll calculate the traffic for both write-back and write-through caches.

Write-through: For the write-through cache, the traffic is the average number of bytes written by each store multiplied by the number of stores. For this trace, the average number of bytes written by each store is just a little under 4, so just assume each write through is four bytes 4. You'll need to modify your simulator to count the number of stores.
Write-back: For the write-back cache system, the traffic is the number of dirty evictions multiplied by the block size. Because at most one eviction happens per miss, this writeback traffic can at most double the overall traffic (in the case in which all evicted blocks were dirty). To calculate the number of dirty evictions, modify your simulator by adding a "dirty" bit for each block in the cache. Set the dirty bit whenever a store operations writes to the block. This dirty bit is not per word or per byte; it is one dirty bit for the entire cache block.

In either case, use a write-allocate policy. As such, the number of misses between write-through and write-back is the same under our assumptions (no write buffers and such), only the traffic is different.

For a direct-mapped cache with 64-byte blocks, generate a graph plotting the same cache sizes as the previous question (on the x-axis) versus total data traffic per memory operation (total number of bytes transferred divide by the total number of memory references). Plot two lines: (1) the write-through cache (miss fill traffic + write-through traffic) and (2) the write-back cache (miss fill traffic + write-back of dirty blocks).

At what cache size do the two write policies generate approximately the same amount of traffic?

Why does the difference between the two schemes diverge at large cache sizes?

Why does the difference between the two schemes diverge at small cache sizes?

As the write-back cache uses less traffic for reasonable cache sizes, all subsequent experiments use write-back caches.

Question 7 - Set-Associative Caches

Modify your simulator to support two-way set associative caches. Each set in the cache will now need two "tag" entries (one for each way), two dirty bits (one for each way), and a "LRU" (least recently used) bit used for cache replacement. Whenever you make a memory reference to a block, update the LRU bit to point to the other entry in the way. When replacing a block, replace the LRU entry (and don't forget to update the LRU bit to point to the other entry). When performing a store, be sure to set the correct dirty bit.

Generate miss rate data for the same block size and caches sizes as in the previous question, but simulate two-way set associative caches. Plot this result with the original data collected for a direct-mapped cache (two lines on the graph: one for direct-mapped caches and one for the two-way set associative cache).

How large must the set-associative cache be for the miss rate to be less than 10%? How large to be less than 5%?

How large must the direct-mapped cache be before it equals or exceeds the performance of the 16KB two-way set-associative cache (ignore any noise in which the direct-mapped cache might slightly outperform the set-associative cache for some configurations)?

This graph shows that the performance gap between set-associative and direct-mapped caches narrows as the cache size grows very large. Explain what likely causes this narrowing.

As the set-associative cache performs better than the direct-mapped cache, all subsequent experiments use the two-way set associative cache.

Question 8 - Cache Block Size

Now let's explore the impact of different cache block sizes. Modify your simulator to support various block sizes. Use the simulator to calculate the miss rate for several different block sizes: 4B, 8B, 16B, 32B, 64B, 128B, 256B, 512B, and 1024B blocks. Do this for two cache sizes: 8KB and 32KB. Create two graphs from this data:

Miss Rate Graph. Plot the data on graph with miss rate on the y-axis, log of the cache block size on the x-axis, and the two caches sizes plotted as two different lines.
Traffic Graph. Plot the data on graph with bytes per memory reference on the y-axis, log of the cache block size on the x-axis, and the two caches sizes plotted as two different lines.

Answer the following questions:

What is the block size with the lowest miss rate for the 8KB cache? And the 32KB cache?

What is the block size with the lowest traffic (bytes per memory reference) for the 8KB cache? And the 32KB cache?

What are the two (or three) sources of additional traffic as the block size grows? Explain why each component grows.

Given that current processors typically use, say, 64B blocks, which metric (miss rate or traffic) are today's caches designed to minimize?

Question 9 - Prefetching

One way to reduce cache miss rate is to prefetch data. The simplest prefetching technique is sequential next-N-block prefetching that is triggered by a cache miss. No prefetches are triggered on cache hits. Modify your cache simulator to simulate next-N-block prefetching. Everytime a cache miss occurs on address A, simulate a prefetch on address A+(block size x 1), A+(block size x 2), ... ,*A+(block size x N)*.

If any of the prefetch address are already present in the cache, they shouldn't generate any traffic (they are a "prefetch hit"), but such "prefetch hits" should not impact the LRU/MRU status of the block. If the prefetch address isn't in the cache (a "prefetch miss"), the cache will bring the block into the cache, evicting the LRU entry to make space in the cache (and creating write-back traffic if the evicted blocks was dirty). When bringing the prefetch into the cache, set the LRU/MRU bit so that the prefetching data is the LRU way (which is different from a normal "demand" non-prefetch miss, which sets the new block as the MRU entry). In addition, prefetches never set the dirty bit (the dirty bit will be set when a store actually occurs to the block).

Simulate a 32KB cache with the same range of block sizes as the previous question. For each block size, simulate next-block prefetching for N=0 (no prefetching), N=1, N=2, and N=3. Plot the data in a similar graphs as the last question (block size on the x-axis and miss rate and bytes per memory operation on the y-axis of each graph). Each graph will have four lines (one for N=0, N=1, N=2, N=3).

For the 32KB cache with a 64B cache block size, which of these four values of N has the lowest miss rate?

The main cost of prefetching is the additional traffic it generates. For the N above with the lowest miss rate, how much extra traffic does the prefetching introduce over the N=0 baseline? Give your answer as a percent increase (100% increase would be double the traffic).

How does prefetching impact the choice of block size? For each N, what is the block size with the lowest miss rate?

What combination of block size and prefetch N gives the lowest overall miss rate? Is this block size large, smaller, or the same as the best block size without prefetching (N=0).

In the above description, we specified that prefetched blocks are put into the cache marked as least recently used (LRU) and not most recently used (MRU). Why did we make this choice? Would the prefetching be more or less effective if blocks were prefetched into MRU? (Feel free to change this policy and see how the results change, but it isn't required to answer the question). Explain.

Test Cases

We have also provided you with a sample trace output run on the first thousand memory references from the trace. The output files have the following format:

[Set 0: {Way 0:000000, C} {Way 1:000000, C} LRU: 0] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S miss clean
[Set 0: {Way 0:ffb788, D} {Way 1:000000, C} LRU: 1] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S  hit    --
[Set 0: {Way 0:ffb788, D} {Way 1:000000, D} LRU: 0] [Set 1: {Way 0:000000, C} {Way 1:000000, C} LRU: 0]  | bfede200 S  hit    --

The contents of each set are specified, these are the contents of each way (way number: tag, state, where C=clean, D=dirty) and the LRU way for that set. Each line says what the contents of the cache were before a particular cache access and then lists the address being accessed, the access type (S for store, L for load and P for prefetch), the outcome (hit or miss) and if the state of the LRU way being evicted (clean or dirty on a miss, none on a hit).

The three output traces are:

cache-128-assoc-1-block-32-prefetch-0.trace.output: A direct-mapped cache with 32B blocks and a total cache size of 128 bytes. No prefetching.
cache-256-assoc-2-block-64-prefetch-0.trace.output: A two-way set-associative cache 64B blocks and a total cache size of 256 bytes. No prefetching.
cache-128-assoc-1-block-32-prefetch-2.trace.output: A direct-mapped cache with 32B blocks and a total cache size of 128 bytes. This trace uses a two-block-ahead prefetch on every cache miss. These extra prefetch instructions are listed as separate lines in the trace (with a "P" as the operation type).

What to Turn In

Short answers. Turn in your answers to the above questions. I strongly urge you to type up your answers.
Graphs. Turn in printouts of the following graphs. One per sheet of paper (to make them big enough to read) and label axises and gives the lines descriptive names in a legend.
- Graph A [Questions #4 and #7]: Miss rate for various cache sizes. Two lines: direct-mapped and two-way set associative.
- Graph B [Question #5]: The impact of cache size on CPI. One line: the CPI of the processor over various cache sizes
- Graph C [Question #6]: Traffic vs cache size. Two lines: write-through and write-back caches.
- Graph D [Question #8]: Impact of cache block size. Two lines:one for 8KB and one for 32KB cache.
- Graph E [Question #9]: Impact of prefetching (varying block size). four lines: prefetch distances of 0, 1, 2, 3.
Source Code. Print out your final source code for your cache simulator.

Hints & Comments

Runtime. Running through the trace should take just a few seconds in most cases. Even if you run through the entire trace to generate each data point, it shouldn't take more than a few minutes to generate all the data you need for this assignment. If your code is taking much longer, you're probably not doing something right (as me or the TAs about it).
Computer resources. If you wish to write the code by ssh'ing into a Linux box, you can log into halfdome.cis.upenn.edu. It is faster than eniac.seas.upenn.edu, and it won't kill off your jobs if they run too long (as will happen on eniac). Of course, you're also welcome to write and run the code on your personal computers or the lab machines.
Code reuse. In the end, the total amount of code needed to perform these simulations isn't actually that much. Don't go crazy in making the code super modular, but conversely there is significant potential for code reuse that shouldn't be ignored.
Automation of data collection. Put a little bit of thought into how you're going to collect the various data for each question. If you're running your program multiple times by hand and then copying and pasting the individual results into Excel, you're doing something wrong (or at least super inefficiently). Your code should be able to perform a range of simulations in an automated fashion (either by building it into the main program of using scripts to call it with different parameters). You program should be able to split out data files that can be imported into Excel (or whatever) to make graphs with minimal tedium.

Addendum

None, yet.