Computer Science 294-7 lecture #2
Computing Requirements for Reconfigurable Computing
&
VLSI Scaling

Notes by Nick Weaver

Administrivia: The Course Calendar is online, and contains items including handouts, slides, lecture notes (like these), reading lists and supplemental reading, and other important information. It will be continually updated during the course of the class.

Today's topics: Computing requirements for reconfigurable devices and VLSI scaling. These notes generally follow the slides, and the slide numbers below refer to their order in the handed-out slides.


Slide 2: This first slide was a discussion point about the different elements of a programmable device and what needs to be configured in order to program it.


Slide 3: These are two sample programmable devices: a large block of memory, or a block of combinational logic. The problems are potential inefficiencies: the exponential size of the memory (in the number of inputs), feedback and routing issues, and, most importantly, NO STATE! This slide was designed as a discussion point. Without state, there are limits on reuse of components, and, if an infinite input stream is required, the lack of state implies that an infinite number of these devices would be required.
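As a rough worked example of the exponential blow-up (the specific numbers here are illustrative, not from the slide): a memory used as a lookup table with n inputs and m outputs needs

$$\text{bits} = m \cdot 2^{n},$$

so a 16-input, 1-output function already needs 2^16 = 65,536 bits, and a 32-input one needs 2^32 (roughly 4 billion) bits, even when the equivalent combinational logic would be only a handful of gates.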


Slide 4: We need registers for pipelining (for performance) and to communicate data between cycles (e.g., in an FSM or some feedback loop). The general buzzword for both of these uses is Retiming. The two pictures are two alternate data flows for an ASCII character (0-9, a-f) to hex converter. The circles are the inputs, while the boxes are 4-input lookup tables (4-LUTs). The first can be pipelined by adding registers on the input, but can't be pipelined more aggressively, because some of the inputs go to the second row of the computation. The second diagram shows the additional blocks that are used to retime data, allowing this aggressive pipelining. Such pipelining doesn't help latency, but will help bandwidth.
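For concreteness, here is a minimal Python sketch of the function being mapped (my own rendering of the computation, not the lecture's exact 4-LUT decomposition). The point to notice is that the 7-bit ASCII input is wider than one 4-LUT, so the mapping needs at least two levels of lookup tables, and pipelining the first level forces some input bits to be retimed past it:

    # A minimal sketch of the ASCII hex -> binary conversion (my rendering,
    # not the lecture's exact 4-LUT decomposition).
    def ascii_hex_to_nibble(c):
        """Convert one ASCII hex digit (0-9, a-f) to its 4-bit value."""
        code = ord(c)
        if 0x30 <= code <= 0x39:      # '0'..'9'
            return code - 0x30
        if 0x61 <= code <= 0x66:      # 'a'..'f'
            return code - 0x57        # 0x61 ('a') maps to 10
        raise ValueError("not an ASCII hex digit")

    # 7 input bits > 4 inputs per LUT, so at least two LUT levels are needed,
    # and aggressive pipelining must carry some input bits forward (retiming).
    assert ascii_hex_to_nibble('b') == 11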

Shown below is a simple model for our configurable computing device:

Mapping the ASCII hex->binary circuit onto such a device, we program up the instruction to evaluate each of the constituent compute blocks and configure the interconnect accordingly. Shown below is the device programmed up without the additional retiming blocks.


Slide 5: So why don't we always just fully pipeline and run at the maximum speed? Sometimes there is feedback, where the final output is needed as an input to the next calculation (forcing serialization), or perhaps the application doesn't require as much throughput.

If either of these is the case for our ASCII hex->binary example, evaluation would proceed as follows (active instructions highlighted in yellow):

Most of the circuitry is left idle at any point in time.

In such cases, we may use less hardware, possibly without affecting the latency (but at a definite cost in throughput). For example, the 3 separate rows of operations in the ASCII hex->binary example could be implemented with a single block of 12 computational units by first doing the actions of the first row, then the second, then the third. In order to do this, we need to be able to quickly change the array's behavior, essentially having multiple instructions.

We also need to be careful that all data is in the right place. A few extra blocks are allocated here --- just as in the fully pipelined version --- to retime data so that the correct version of each value is available to each compute block when it is evaluated.

An even smaller implementation is possible with more storage and instructions but a single lookup table. Since data and instruction storage is usually much smaller, this takes less area but impacts performance.

Another way of putting it is that the data flow graph may allow full pipelining, but the application either (a) might not require the full throughput or (b) might contain a feedback loop which prevents pipelining. Nonetheless, we can still use the techniques involved in pipelining to instead make the problem more area efficient. Instead of having hardware to compute the entire calculation, we can have sufficient hardware to perform only one of the pipeline stages, store the result, reconfigure the hardware to perform the next stage, store the result, etc. This requires that we store multiple configurations, essentially a few VVLIW (Very Very Long Instruction Word :) instructions, which we can quickly switch between; due to the extreme size of these instructions, we can't expect to load them from external memory.
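As a toy illustration of this multi-context idea (entirely hypothetical code, not any real device's programming model), one physical block of compute units can be stepped through a list of locally stored configurations, evaluating one stage of the dataflow graph per step:

    # Toy sketch of multi-context evaluation (hypothetical, not a real device):
    # one physical block of units is reconfigured each step from locally stored
    # "instructions," evaluating one pipeline stage of the dataflow per step.
    def make_stage(f, g):
        """Build one 2-unit configuration from two 2-input functions."""
        return [lambda v: f(v[0], v[1]), lambda v: g(v[0], v[1])]

    def run_time_multiplexed(contexts, inputs):
        values = list(inputs)
        for config in contexts:          # switch to the next stored instruction
            values = [unit(values) for unit in config]
        return values

    # Three "rows" of two units each, all evaluated on the same hardware block.
    contexts = [make_stage(lambda a, b: a & b, lambda a, b: a | b),
                make_stage(lambda a, b: a ^ b, lambda a, b: a & b),
                make_stage(lambda a, b: a | b, lambda a, b: a ^ b)]
    print(run_time_multiplexed(contexts, [1, 0]))    # -> [1, 1]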


Slide 6: Sometimes there are mutually exclusive cases (e.g., if/else constructs). We can share hardware by doing one or the other, which we can do either by changing/selecting the instruction based on the data (like a conditional branch in a conventional computer), or by building a device that does one or the other (e.g., an ALU) with the mode selected by the data. One example might be an FSM, where the states CHANGE the operation of the logic, allowing a much more complex FSM to be built in the same amount of hardware.
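A tiny sketch of the "instruction selected by the data" idea (a made-up parity-tracking FSM, not an example from the slides): the current state picks which configuration the shared logic evaluates, so the mutually exclusive cases never need hardware at the same time.

    # Made-up two-state FSM (not from the slides): the current state selects the
    # "instruction" (configuration) that the shared logic evaluates this cycle.
    configs = {
        "EVEN": lambda bit: (("ODD" if bit else "EVEN"), 0),   # output 0 in EVEN
        "ODD":  lambda bit: (("EVEN" if bit else "ODD"), 1),   # output 1 in ODD
    }

    def step(state, bit):
        """One cycle: only the configuration for the current state is active."""
        return configs[state](bit)           # returns (next_state, output)

    state, outputs = "EVEN", []
    for bit in [1, 1, 0, 1]:
        state, out = step(state, bit)
        outputs.append(out)
    print(state, outputs)                    # -> ODD [0, 1, 0, 0]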


Slide 7: This is the extreme example of just a single computing element, which is very area efficient but has very low throughput. This is essentially a conventional computer with a very small reconfigurable functional unit and a register file. As such, the calculation needs to be performed sequentially, so the throughput decreases linearly with the number of gates required.
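A back-of-the-envelope way to see the linear loss (my formula, not from the slide): if the mapped design requires G compute-block evaluations per result and there is only one physical element clocked at f_clk, then roughly

$$\text{throughput} \approx \frac{f_{\text{clk}}}{G} \text{ results/second},$$

compared with roughly f_clk results/second for the fully pipelined spatial version (ignoring the overhead of instruction fetch and of shuffling intermediates through the register file).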


Slide 8: Wrapup slides for this discussion: We need both computing elements and interconnect. The application will give us the required throughput, which we can use to make things more efficient either by reusing elements on different data (pipelining for throughput) or by reusing the same elements for different parts of the computation (sacrificing throughput for area).


Slide 9: Reuse means we must have some way of retiming data, and often an ability to change instructions. Memory itself can be used for data retiming or for holding the inactive (but soon to be used) instructions. This memory may be scattered throughout a reconfigurable device or may appear as one or two monolithic blocks.


Slide 10: (Note: the lecture occurred in a different order, with this material shifted to the end; these notes follow the slide order, however.) FPGA marketers speak in terms of gate counts, but gates aren't the only resource (there are also interconnect and state elements), and the notion of counting gates is tricky.



Slides 11-12: How many 2-input NAND gates are in a 2-input lookup table (2-LUT)? It depends on what kind of gate you implement, so the answer is really a range (1-4). Xilinx generally takes something of a midpoint when they give their numbers (e.g., an XC4000 4-LUT would be worth 1-9 or so, so Xilinx says 4 or 5); this is how Xilinx counts gates.
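A concrete way to see the 1-4 range (my example, not the slide's): a 2-LUT programmed as a NAND replaces exactly one NAND gate, while a 2-LUT programmed as XOR replaces the standard four-NAND construction of XOR.

    # Illustrating the 1-4 range: a 2-LUT programmed as NAND is worth 1 gate,
    # while one programmed as XOR is worth the usual 4-NAND XOR construction.
    def nand(a, b):
        return 1 - (a & b)

    def xor_from_nands(a, b):
        """XOR built out of 4 two-input NAND gates."""
        t = nand(a, b)
        return nand(nand(a, t), nand(b, t))

    # A 2-LUT is just a 4-entry truth table indexed by its two inputs.
    xor_lut = [0, 1, 1, 0]
    for a in (0, 1):
        for b in (0, 1):
            assert xor_lut[(a << 1) | b] == xor_from_nands(a, b) == (a ^ b)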


Slide 13: And then what about a flip-flop? It's worth 6 gates or so IF you use it, but if you don't, what is it worth then?


Slide 14: Interconnect makes counting more difficult. E.g., the left design is more area efficient (the right one has a full crossbar, which grows with the square of the number of gates), but the fraction of gates that can actually be used depends on the application: the routing on the left one is VERY limited and may really hurt that yield, while the one on the right will be able to use all of its gates. But for some applications, the limited routing available on the left is sufficient.
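To pin down the "grows with the square" remark (standard crossbar counting, not numbers taken from the slide): a full crossbar connecting N sources to N destinations needs

$$N_{\text{switches}} = N \times N = N^{2},$$

so doubling the number of gates quadruples the crossbar, while the compute area merely doubles.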


Slide 15: Needed retiming also affects things. How many flip-flops does it take to retime the first example? Perhaps 22 gates' worth? But on something like the XC6200 (a FF/LUT structure), you need 4 blocks, so maybe 40 gate equivalents? And what happens if the LUT can also be used as memory?


Slide 16: Conclusion: Why should we count just one element when all three (compute, retiming, interconnect) are important? Any architecture has a fixed ratio between them (though some items may serve 2 or 3 functions), so measuring a single item is misleading unless it is the limiting factor for a specific application.


Slide 18: Scaling VLSI. We need to predict what the technology of the future will be in order to model/reason about current trends and devices to be built later.


Slide 19: The idea is to scale everything uniformly, either by a scale factor or by measuring features in units of something that is itself scaled (lambda). Dennard postulated that this would hold down to 1 micron (it did), and Bohr has shown that we've managed to continue it. The scaling concept appears to be valid for at least another 10 years.


Slide 20: Shows how we scale the different features; the scaling has held for all of them except voltage. Over the long term voltage follows the historical trend, but it tends to stay at one level for a considerable time before moving to a lower value, mostly for compatibility reasons (e.g., 5 volts remained the standard supply for a considerable time).


Slide 21: Shows an area comparison: the first column is die size for various devices, the second is their area in lambda^2, and the third is their relative area in terms of lambda^2.
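The conversion behind those columns (standard lambda accounting; the slide's own numbers are not reproduced here): for a die of area A_die built in a process characterized by lambda,

$$A_{\lambda} = \frac{A_{\text{die}}}{\lambda^{2}},$$

which lets dice from different process generations be compared on an equal footing; the relative-area column then normalizes each device's A_lambda to a reference entry.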


Slides 22-29: The math for scaling all the other quantities. Note that wire delays DO NOT SCALE, and actually get worse as wires get longer (in terms of lambda) in order to get across the chip. There are also two ways to calculate energy dissipation. The resistive model is appropriate for NMOS (or non-complementary structures in CMOS such as grounded-P gates); it suggests that power per gate goes down by the square of the scaling factor, which implies that if the scaled process has more transistors per unit area by that same factor, the power per unit area will remain constant. The capacitive dissipation model is more appropriate for CMOS, where most of the dissipation is the charging and discharging of capacitors.
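A condensed version of the math (standard constant-field scaling written with a scale factor S > 1, so every linear dimension shrinks by 1/S; this is a summary in my notation, not a transcription of slides 22-29):

$$L,\, W,\, t_{ox},\, V,\, I,\, C \;\propto\; 1/S$$
$$\text{gate delay } \tau \sim CV/I \;\propto\; 1/S, \qquad \text{device area} \;\propto\; 1/S^{2}$$
$$\text{resistive power } P = V^{2}/R \;\propto\; 1/S^{2} \quad (R \sim V/I \text{ is constant})$$
$$\text{capacitive power } P = C V^{2} f \;\propto\; (1/S)(1/S^{2})(S) = 1/S^{2}$$
$$\text{power density} = P \cdot (\text{devices per unit area}) \;\propto\; (1/S^{2}) \cdot S^{2} = \text{constant}$$

For a wire whose physical length stays fixed (one that still has to cross the chip), resistance grows like S^2 (its cross-section shrinks in both dimensions) while its capacitance stays roughly constant, so its RC delay does not improve and, measured in gate delays, actually gets worse.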


Slide 30: Summarizes all the scaling.


Slide 31: Notes: Circuit delay does not scale as fast as gate delay, because of wire delays. Adding wire layers and using thicker wires helps some, but not enough.