Slide 3:
In this example, we use far more retiming resources than are found
in common designs. Still, this is what it takes to get an efficient design.
Slide 4:
This is the common retiming structure for image processing computations,
where a local operation is performed in a small region of the image.
By serializing operations on the image, we can reuse the same compute element
for all operations. There are two types of retiming elements: registers (red)
that hold values we may need to look at on every cycle, usually the results of
previous operations, and FIFOs (grey-yellow) that hold results we do not need
to examine as often and are used to retime whole scan-lines.
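A minimal behavioral sketch of this structure (in Python; the 3x3 window and
the row-width parameter are assumptions for illustration, not taken from the
slide):

    from collections import deque

    def stream_3x3_windows(pixels, width):
        # Two scan-line FIFOs (the grey-yellow elements): each delays a
        # pixel by one full image row.
        line1 = deque([0] * width)
        line2 = deque([0] * width)
        # 3x3 window registers (the red elements): values the shared
        # compute element must see on every cycle.
        win = [[0] * 3 for _ in range(3)]
        for p in pixels:                           # one pixel per "cycle"
            d1 = line1.popleft()                   # delayed by one row
            d2 = line2.popleft()                   # delayed by two rows
            line1.append(p)
            line2.append(d1)
            for r in range(3):                     # shift window one column left
                win[r][0], win[r][1] = win[r][1], win[r][2]
            win[0][2], win[1][2], win[2][2] = d2, d1, p
            yield [row[:] for row in win]          # feed the compute element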
Slide 5:
For systolic computations, we need to stagger inputs in order to achieve
correctness. Therefore, we need retiming resources. One thing to notice is
that if we perform multiple computations in a row (in this case multiple
matrix multiplications), the retiming registers just set up the initial
input skew, which does not have to change at any point in the future.
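A small sketch of setting up that input skew (Python; the row-wise injection
and zero padding are illustrative assumptions):

    def skew_inputs(matrix):
        # Row i of the operand is delayed by i cycles before entering the
        # array; the registers implementing this skew never need to change.
        n_rows = len(matrix)
        length = n_rows - 1 + max(len(r) for r in matrix)
        streams = []
        for i, row in enumerate(matrix):
            s = [0] * i + list(row)
            s += [0] * (length - len(s))      # pad to a common length
            streams.append(s)                 # streams[i][t]: value entering row i at cycle t
        return streams

Back-to-back multiplications can simply concatenate their skewed streams,
reusing the same registers: streams_a[i] + streams_b[i].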
Slide 6:
For echo cancellation we need to hold the computed value for a time equal to
the round-trip delay before performing the subtraction.
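A short sketch of the idea in Python (the attenuation factor is an assumption;
only the delay line matters here):

    from collections import deque

    def cancel_echo(samples, round_trip_delay, attenuation=0.5):
        delay_line = deque([0.0] * round_trip_delay)   # the retiming resource
        for s in samples:
            echoed = delay_line.popleft()              # value from one round trip ago
            delay_line.append(s)
            yield s - attenuation * echoed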
Slide 7:
In the initial layout, the logic depth is one (e.g. all pixels are calculated
at once). Each following layout comes from the one before by halving the
number of computing elements we devote to the computation. On every such step,
we must add a retiming register (the red one) that holds inputs to the first
"half-computation" so that it can be used with the second one. By serializing
operations we move to deeper retiming structures. An interesting observation
is that with every cut we halve the number of computing elements required,
but the amount of retiming resources needed stays the same.
Slide 8:
Retiming provides temporal interconnect, since it makes a result available at
the same point some time later. A wire provides spatial interconnect.
Slide 9:
By bandwidth we mean how much data the retiming resource can move in and out
per unit of time. When we know exactly when the data must be available at
some point (i.e. how many clock cycles later), latency is not an issue. In
this case, bandwidth is the main concern.
Slide 10:
A single-LUT implementation of a flip-flop is also possible if sufficient
delay between input and output is available.
It is clear that such an implementation of retiming resources is not
area efficient. Still, you may be able to fold some more functions into
these LUTs through the unused inputs.
Slide 11:
In this case, the area penalty for adding the optional flip-flop at the
outputs is almost 2%, which is very small (can be ignored).
Slide 12:
The problem in this case is that we also have to provide the interconnect
resources for the flip-flop. These significantly increase the area
penalty (~1/4 of LUT area).
Slide 13:
Xilinx allows you to use the CLB 4-LUT memories like a register file for
retiming purposes, performing 1 read and 1 write per clock cycle. So the
bandwidth is 1 bit-in and 1 bit-out/cycle. The area penalty is kind of
misleading because it does not account for the counters that need to be
implemented to use the array this way. One or two 4-bit counters are
necessary.
In older generations of Xilinx CLBs this was not possible,
since the write was asynchronous, which made things much more difficult.
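A behavioral sketch of the idea (Python; the fixed 16-cycle delay and the
single shared counter are assumptions, not a model of the actual Xilinx
primitive):

    class LutRamDelay:
        # A 4-LUT memory (16 x 1) used as a delay line: one read and one
        # write per cycle, addressed by a single 4-bit counter.
        def __init__(self):
            self.mem = [0] * 16       # the LUT configuration memory
            self.addr = 0             # the 4-bit counter
        def cycle(self, bit_in):
            bit_out = self.mem[self.addr]    # read the oldest bit
            self.mem[self.addr] = bit_in     # overwrite it with the new bit
            self.addr = (self.addr + 1) % 16
            return bit_out                   # bit delayed by 16 cycles

Using separate read and write counters (the second 4-bit counter) would allow
delays shorter than 16 cycles.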
Slide 14:
In the first scheme, we can select the retiming delay, but only one of the
delayed versions of the output becomes available.
With the second one, all versions become available. This may
increase routing requirements by increasing the number of distinct output
signals to be interconnected.
In the third case, retiming has been transferred to the inputs. All delayed
versions of inputs are available without any change to the routing
requirements. Still, the total number of retiming registers may be much
larger, since we may have to delay the same value at every single LUT to
which it is delivered as an input.
Slide 15:
As mentioned before, this scheme minimizes the LUT I/O (same as in the
case of no retiming) at the cost of extra registers. We can even chain
together the inputs for deeper retiming, perhaps at the expense of reducing
the number of useful LUT inputs.
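A sketch of such an input chain (Python; the depth and the tap-select
interface are assumptions for illustration):

    class InputRetimingChain:
        # Shift chain attached to one LUT input: every cycle the new value
        # is shifted in, and any of the delayed versions can be selected
        # locally, without adding new routed signals.
        def __init__(self, depth=4):
            self.regs = [0] * depth            # regs[k]: input from k shifts ago
        def shift(self, value):
            self.regs = [value] + self.regs[:-1]
        def tap(self, delay):
            return self.regs[delay]            # 0 = most recent, up to depth-1

The price is the one noted above: the same signal may be registered again in
every LUT that consumes it.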
Slide 16:
We can organize retiming resources as a register file. The area penalty
depends on the number of bits per word (w), the number of words, and the
number of I/O ports.
With the register file we pay for being able to read words from any place at
any time. If some values can always be read from a single place, we may be
able to achieve some area savings.
Slide 17:
Slide 18:
A disk drive could be used for long-term retiming, in between runs of a
program.
Slide 19:
Here are the area/bandwidth characteristics of the various retiming resources
that can be used. Moving from left to right, the resources listed become
more efficient for smaller retiming distances.
Slide 20:
Still, there is a unique minimum retiming requirement.
Slide 21:
This is an example of a case where, by transforming the algorithm, we reduce
the retiming requirements. By combining loops and rescheduling a few
instructions, we go from 3N registers to only 2. In the initial case we have
to retime whole vectors.
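A sketch of the kind of transformation meant here (Python; the concrete loops
are assumptions, the slide's actual example may differ):

    N = 1024
    a = list(range(N))

    # Unfused: each loop produces a whole vector that must be held
    # (retimed) until a later loop consumes it, on the order of 3N values.
    b = [x * 2 for x in a]
    c = [x + 1 for x in b]
    d = [b[i] * c[i] for i in range(N)]

    # Fused and rescheduled: only two values are live between
    # instructions, i.e. two retiming registers.
    d2 = []
    for x in a:
        r1 = x * 2            # register 1
        r2 = r1 + 1           # register 2
        d2.append(r1 * r2)

    assert d == d2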