CS294-7 Scribe Notes

Day 21: Tuesday, April 8

Compute Blocks

William Tsu


ALU Bits

Traditional microprocessors and DSPs use ALUs; the following slides explain why.

ALU Bit

The carry connection in an ALU is fast: carry signals simply ripple through and complete in little time. The carry path has no extra switches, stubs, etc. that would add to the RC delay. Note that we are talking about the "latency" advantage of the ALU here, not its "computation density".
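The ripple behavior can be sketched in code. A minimal model, where the per-stage delay values are illustrative assumptions (not measured figures); the point is only that worst-case latency is one fast carry hop per bit plus the final sum logic:

```python
def ripple_carry_add(a, b, width=32, carry_delay=0.2, sum_delay=1.0):
    """Toy model of a ripple-carry ALU path.

    carry_delay: assumed per-bit delay of the dedicated carry path (ns)
    sum_delay:   assumed delay of the final sum logic (ns)
    Returns (sum modulo 2**width, worst-case latency estimate in ns).
    """
    carry, result = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i          # sum bit for position i
        carry = (ai & bi) | (carry & (ai ^ bi))   # carry ripples to next bit
    latency = width * carry_delay + sum_delay     # one carry hop per bit
    return result, latency
```

Because the carry path is a bare chain with no programmable switches on it, carry_delay stays small and the total remains low even for wide adders.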

LUT Add Delay

Here are some sample data points. The 32-bit ALU, by avoiding the programmable interconnect, can have delays as low as 16ns and 6ns. For the LUT-based design, the delay is high enough to motivate extra dedicated arithmetic hardware support. People have therefore tried various ways to improve on a pure LUT-based computing structure, as the next few slides explain.

Carry support (Altera)

The first example here is the carry and cascade support from Altera. Notice the extra carry-in and cascade-in inputs. The carry chain provides a very fast (less than 0.5ns) carry-forward function between LEs (logic elements). (An Altera FPGA is organized into LABs (logic array blocks); each LAB has eight vertically stacked LEs.) The carry-in signal from a lower-order bit moves forward into the higher-order bit via the carry chain and feeds into both the LUT and the next portion of the carry chain. This feature allows Altera to implement very high-speed counters, adders, and comparators of arbitrary width.

Carry support (Altera)

Here are some more data points on Altera. Note that for the 43ns version, 16 LEs are used for the sum computation and 16 LEs for the carry. The 8ns version uses the built-in carry-chain hardware instead, saving half of the LEs.

Carry support (Xilinx)

Another example of carry support in an FPGA: Xilinx. Note the separate F and G carry-generation blocks.

Carry support (Xilinx)

Some data points for Xilinx. The actual performance and size of a CLB without the carry hardware are not known, so for the 16-CLB version, even though it does not use the carry logic, the carry hardware is still present, which most likely makes the CLB a little bigger and slower.

Carry support (Xilinx)

Xilinx 5K series. Unlike the 4K, it uses three 2:1 muxes rather than a 3-LUT in the middle of the block. Also, the LUTs in the 5K cannot be configured as memory.

Carry support (Xilinx)

Some more data points. The first 3.8ns (LUT combinational delay) starts things up, and the last 3.8ns is the final stage. Together they are the fixed overhead of an adder.
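One way to read these numbers is as a fixed-plus-linear delay model. A minimal sketch, where the two 3.8ns figures are the fixed LUT stages from the slide, and the per-bit carry delay is an assumed placeholder rather than a number from the slides:

```python
def adder_delay(width, carry_per_bit=0.5, start=3.8, finish=3.8):
    """Illustrative delay model for a carry-chain adder: a fixed LUT stage
    to start, one dedicated-carry hop per bit, and a fixed final stage.
    carry_per_bit is an assumption, not a figure from the slides."""
    return start + width * carry_per_bit + finish
```

In this model the 7.6ns of fixed overhead dominates for narrow adders, while wide adders are dominated by the linear carry term.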

LUT Carry Summary

In summary, LUT carry support gives both area and speed benefits.

Cascades

There are some other opportunities here: cascades. Example logic includes 9-input AND, OR, etc. Note that carry is actually a particular instance of cascade.

Xilinx 5K Cascade

The +0.5ns cascade stage is the mux delay. Referring to the diagram, two adjacent blocks can be composed into a 5 input function.
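This composition is just a Shannon decomposition: the fifth input drives the cascade mux and selects between two 4-LUT cofactors. A small sketch (the truth-table encoding below is an assumption for illustration):

```python
def lut4(truth_table, inputs):
    """Evaluate a 4-LUT: truth_table is a 16-bit integer whose bit at
    position (i3 i2 i1 i0) holds the output for that input pattern."""
    idx = sum(bit << i for i, bit in enumerate(inputs))
    return (truth_table >> idx) & 1

def five_input(tt_sel0, tt_sel1, inputs):
    """Two adjacent 4-LUTs plus the cascade 2:1 mux: the fifth input
    selects which cofactor's output passes through."""
    x0, x1, x2, x3, sel = inputs
    cofactor = tt_sel1 if sel else tt_sel0
    return lut4(cofactor, [x0, x1, x2, x3])
```

For example, a 5-input AND uses tt_sel0 = 0x0000 and tt_sel1 = 0x8000: the output is 1 only when the select is 1 and all four low inputs are 1.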

Altera 8K Cascade

Another cascade example, this one from Altera. In general, cascade is most likely a win when only a few stages are used. Beyond about four, it may not have any advantage, since we can always implement a very wide-input function by having a few LUTs operate in parallel and merging the intermediate results at the end.
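The tradeoff can be sketched by comparing a linear cascade chain against a logarithmic tree of parallel LUTs. The 3.8ns LUT and 0.5ns cascade figures echo the Xilinx 5K slides; the general-interconnect delay between tree levels is an assumed placeholder:

```python
def chain_delay(n_inputs, k=4, lut=3.8, cascade=0.5):
    """Wide function as a cascade chain: one LUT plus one cascade stage
    per additional group of k inputs. Delay grows linearly with width."""
    stages = (n_inputs + k - 1) // k
    return lut + (stages - 1) * cascade

def tree_delay(n_inputs, k=4, lut=3.8, wire=2.0):
    """Same function as parallel LUTs merged by more LUTs at the end.
    Depth grows logarithmically; `wire` (general interconnect between
    levels) is an assumed value."""
    levels, span = 1, k
    while span < n_inputs:      # each level multiplies fan-in by k
        span *= k
        levels += 1
    return levels * lut + (levels - 1) * wire
```

Under these numbers the chain wins for narrow functions, but because its delay is linear in width it eventually loses to the logarithmic tree for very wide ones.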

Hardwired Logic Blocks

Besides carry and cascade, another way to reduce latency is to use hardwired logic blocks.

Hardwired Logic Blocks

The structure of two 4-LUTs is actually a subgraph of the more general one on the right. Note that there is a lot of regularity within the graph. The point is that if a certain structure appears with enough regularity within the general graph, we can just build that structure directly and do not necessarily need the flexibility offered by individual LUTs.

Hardwired Logic Blocks

The two-4-LUT structure shown should be smaller than two individual LUTs because of the reduced I/O. The critical path of the graph is 3 levels.

Hardwired Logic Blocks

By covering the graph differently, the critical path actually goes down to 2 levels. So having the right hardwired logic block can reduce latency.
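The effect of covering choice on depth can be sketched with a toy balanced-tree example (not the lecture's exact graph): a tree of 2-input gates that is 3 gate levels deep needs 3 LUT levels if each gate gets its own LUT, but only 2 if each 4-LUT absorbs a two-level, 4-input subtree.

```python
import math

def lut_levels(gate_depth, levels_per_lut):
    """LUT depth when covering a balanced tree of 2-input gates, assuming
    each LUT on the critical path absorbs `levels_per_lut` consecutive
    gate levels (a 4-LUT can absorb up to 2 levels of 2-input gates)."""
    return math.ceil(gate_depth / levels_per_lut)
```

The same idea carries over to hardwired blocks: a block that matches a recurring multi-level subgraph absorbs more levels per mapped unit, shortening the critical path.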

Hardwired Logic Blocks

The lecture reviewed some of the results from the literature. We saw that some of the cascades, while perhaps requiring more 4-LUTs (packing inefficiency), actually allow smaller implementations due to their reduced need for general-purpose interconnect. We also saw speed improvements of up to 50% made possible by these hardwired connections. As in the examples, the speed and area results come from different mapping criteria. (See Day 21 reading for full citations.)

Summary