Slide 3:
In this example, we use far more retiming resources than are found
in common designs. Still, this is what it takes to get an efficient design.
Slide 4:
This is the common retiming structure for image processing computations,
where a local operation is performed in a small region of the image.
By serializing operations on the image, we can reuse the same compute element
for all operations. There are two types of retiming elements: registers (red)
that hold values we may need to look at on every cycle, usually the results of
previous operations, and FIFOs (grey-yellow) that hold results we do not need
to examine as often and are used to retime whole scan-lines.
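A minimal behavioral sketch of this structure (in Python; the 3x3 window and
the row-width parameter are assumptions for illustration, not taken from the
slide):

    from collections import deque

    def stream_3x3_windows(pixels, width):
        # Two scan-line FIFOs (the grey-yellow elements): each delays a
        # pixel by one full image row.
        line1 = deque([0] * width)
        line2 = deque([0] * width)
        # 3x3 window registers (the red elements): values the shared
        # compute element must see on every cycle.
        win = [[0] * 3 for _ in range(3)]
        for p in pixels:                           # one pixel per "cycle"
            d1 = line1.popleft()                   # delayed by one row
            d2 = line2.popleft()                   # delayed by two rows
            line1.append(p)
            line2.append(d1)
            for r in range(3):                     # shift window one column left
                win[r][0], win[r][1] = win[r][1], win[r][2]
            win[0][2], win[1][2], win[2][2] = d2, d1, p
            yield [row[:] for row in win]          # feed the compute element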
Slide 5:
For systolic computations, we need to stagger inputs in order to achieve
correctness. Therefore, we need retiming resources. One thing to notice is
that if we perform multiple computations in a row (in this case multiple
matrix multiplications), the retiming registers just set up the initial
input skew, which does not have to change at any point in the future.
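A small sketch of setting up that input skew (Python; the row-wise injection
and zero padding are illustrative assumptions):

    def skew_inputs(matrix):
        # Row i of the operand is delayed by i cycles before entering the
        # array; the registers implementing this skew never need to change.
        n_rows = len(matrix)
        length = n_rows - 1 + max(len(r) for r in matrix)
        streams = []
        for i, row in enumerate(matrix):
            s = [0] * i + list(row)
            s += [0] * (length - len(s))      # pad to a common length
            streams.append(s)                 # streams[i][t]: value entering row i at cycle t
        return streams

Back-to-back multiplications can simply concatenate their skewed streams,
reusing the same registers: streams_a[i] + streams_b[i].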
Slide 6:
For echo cancellation we need to hold the computed value for a time equal to
the round-trip delay before performing the subtraction.
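A short sketch of the idea in Python (the attenuation factor is an assumption;
only the delay line matters here):

    from collections import deque

    def cancel_echo(samples, round_trip_delay, attenuation=0.5):
        delay_line = deque([0.0] * round_trip_delay)   # the retiming resource
        for s in samples:
            echoed = delay_line.popleft()              # value from one round trip ago
            delay_line.append(s)
            yield s - attenuation * echoed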
Slide 7:
In the initial layout, the logic depth is one (e.g. all pixels are calculated
at once). Each following layout comes from the one before by halving the
number of computing elements we devote to the computation. On every such step,
we must add a retiming register (the red one) that holds inputs to the first
"half-computation" so that it can be used with the second one. By serializing
operations we move to deeper retiming structures. An interesting observation
is that with every cut we halve the number of computing elements required,
but the amount of retiming resources needed stays the same.
Slide 8:
Retiming provides temporal interconnect, since it makes a result available at
the same point some time later. A wire provides spatial interconnect.
Slide 9:
By bandwidth we mean how much data the retiming resource can move in and out
per unit of time. When we know exactly when the data must be available at
some point (i.e. how many clock cycles later), latency is not an issue. In
this case, bandwidth is the main concern.
Slide 10:
A single-LUT implementation of a flip-flop is also possible if sufficient
delay between input and output is available.
It is clear that such an implementation of retiming resources is not
area efficient. Still, you may be able to fold some more functions into
these LUTs through the unused inputs.
Slide 11:
In this case, the area penalty for adding the optional flip-flop at the
outputs is almost 2%, which is very small (can be ignored).
Slide 12:
The problem in this case is that we also have to provide the interconnect
resources for the flip-flop. These significantly increase the area
penalty (~1/4 of LUT area).
Slide 13:
Xilinx allows you to use the CLB 4-LUT memories like a register file for
retiming purposes, performing 1 read and 1 write per clock cycle. So the
bandwidth is 1 bit-in and 1 bit-out/cycle. The area penalty is kind of
misleading because it does not account for the counters that need to be
implemented to use the array this way. One or two 4-bit counters are
necessary.
In older generations of Xilinx CLBs this was not possible,
since the write was asynchronous, which made things much more difficult.
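A behavioral sketch of the idea (Python; the fixed 16-cycle delay and the
single shared counter are assumptions, not a model of the actual Xilinx
primitive):

    class LutRamDelay:
        # A 4-LUT memory (16 x 1) used as a delay line: one read and one
        # write per cycle, addressed by a single 4-bit counter.
        def __init__(self):
            self.mem = [0] * 16       # the LUT configuration memory
            self.addr = 0             # the 4-bit counter
        def cycle(self, bit_in):
            bit_out = self.mem[self.addr]    # read the oldest bit
            self.mem[self.addr] = bit_in     # overwrite it with the new bit
            self.addr = (self.addr + 1) % 16
            return bit_out                   # bit delayed by 16 cycles

Using separate read and write counters (the second 4-bit counter) would allow
delays shorter than 16 cycles.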
Slide 14:
In the first scheme, we can select the retiming delay, but only one of the
delayed versions of the output becomes available.
With the second one, all versions become available. This may
increase routing requirements by increasing the number of distinct output
signals to be interconnected.
In the third case, retiming has been transferred to the inputs. All delayed
versions of inputs are available without any change to the routing
requirements. Still, the total number of retiming registers may be much
larger, since we may have to delay the same value at every single LUT to
which it is delivered as an input.
Slide 15:
As mentioned before, this scheme minimizes the LUT I/O (same as in the
case of no retiming) at the cost of extra registers. We can even chain
together the inputs for deeper retiming, perhaps at the expense of reducing
the number of useful LUT inputs.
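A sketch of such an input chain (Python; the depth and the tap-select
interface are assumptions for illustration):

    class InputRetimingChain:
        # Shift chain attached to one LUT input: every cycle the new value
        # is shifted in, and any of the delayed versions can be selected
        # locally, without adding new routed signals.
        def __init__(self, depth=4):
            self.regs = [0] * depth            # regs[k]: input from k shifts ago
        def shift(self, value):
            self.regs = [value] + self.regs[:-1]
        def tap(self, delay):
            return self.regs[delay]            # 0 = most recent, up to depth-1

The price is the one noted above: the same signal may be registered again in
every LUT that consumes it.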
Slide 16:
We can organize retiming resources as a register file. The area penalty
depends on the number of bits per word (w), the number of words, and the
number of I/O ports.
With the register file we pay for being able to read words from any place at
any time. If some values can always be read from a single place, we may be
able to achieve some area savings.
Slide 17:
Slide 18:
A disk drive could be used for long-term retiming, in between runs of a
program.
Slide 19:
Here are the area/bandwidth characteristics of the various retiming resources
that can be used. Moving from left to right, the resources listed become
more efficient for smaller retiming distances.
Slide 20:
Still, there is a unique minimum retiming requirement.
Slide 21:
This is an example of a case where, by transforming the algorithm, we reduce
the retiming requirements. By combining loops and rescheduling a few
instructions, we go from 3N registers to only 2. In the initial case we have
to retime whole vectors.
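A sketch of the kind of transformation meant here (Python; the concrete loops
are assumptions, the slide's actual example may differ):

    N = 1024
    a = list(range(N))

    # Unfused: each loop produces a whole vector that must be held
    # (retimed) until a later loop consumes it, on the order of 3N values.
    b = [x * 2 for x in a]
    c = [x + 1 for x in b]
    d = [b[i] * c[i] for i in range(N)]

    # Fused and rescheduled: only two values are live between
    # instructions, i.e. two retiming registers.
    d2 = []
    for x in a:
        r1 = x * 2            # register 1
        r2 = r1 + 1           # register 2
        d2.append(r1 * r2)

    assert d == d2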