Computer Science 294-7 Lecture #7

Notes by Bruce McGaughy


Slide 5: Adding to the difficulty is the problem of comparing designs manufactured in different process technologies. Care must be taken to normalize collected data for different technology generations.
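
A rough way to think about this normalization (not from the slides; all numbers below are illustrative) is to express area in units of lambda^2, where lambda is taken as half the minimum feature size, so equal physical areas in different processes can be compared directly:

    # Minimal sketch, assuming area is normalized in units of lambda^2,
    # where lambda is half the minimum drawn feature size. All numbers
    # below are illustrative, not taken from the lecture.

    def area_in_lambda_sq(area_mm2, feature_size_um):
        lam_mm = (feature_size_um / 2.0) * 1e-3   # lambda expressed in mm
        return area_mm2 / (lam_mm ** 2)

    # The same 100 mm^2 die holds far more normalized area in a finer process:
    print(area_in_lambda_sq(100.0, 1.0))    # 1.0 um process  -> 4.0e8 lambda^2
    print(area_in_lambda_sq(100.0, 0.35))   # 0.35 um process -> ~3.3e9 lambda^2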


Slide 6: The objective of the lecture is to compare computing devices across a spectrum ranging from processors to full-custom designs, with ASICs, sea of gates, gate arrays, and FPGAs in between. Across this spectrum, performance is traded off against flexibility.

<-------------------- Higher Performance --------------------

Custom --- ASIC --- Sea of Gates --- Gate Array --- FPGA --- Processor

-------------------- Higher Flexibility -------------------->


Slide 7: With sea of gates, there is no fixed overhead for routing channels. Wires are treated more as first-class citizens, and hence the interconnect density is not limited by the width of the routing channels.


Slide 8: Gate arrays are estimated to have 2-3x worse density than custom designs. Gate arrays will be least efficient in implementing memory storage elements, which can be highly optimized in memory arrays.


Slide 9: In a sea of gates array with two layers of metal, approximately 35% of the gates are usable. This figure jumps to about 70% for 3-4 layers of metal. FPGAs are about 5x less dense than MPGAs, which, combined with the 2-3x MPGA-to-custom gap above, makes them roughly 10-15x less dense than custom designs.


Slide 11: Processors perform so poorly on this metric because their density is diluted by control and cache memory overhead. Only the ALU really performs useful work. It can be argued that the instruction counter performs useful work on branch statements. However, the branch statements might never have been necessary if the computation had been spread out in area rather than time. Co-processors are also not included in these comparisons.
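
As a back-of-the-envelope illustration of the dilution argument (the area fractions below are hypothetical, not taken from the slides):

    # Minimal sketch: only the fraction of die area doing datapath work
    # contributes to peak computational density. Fractions are hypothetical.

    die_fraction = {
        "ALU/datapath":  0.05,
        "caches":        0.55,
        "control/issue": 0.25,
        "other":         0.15,
    }

    useful = die_fraction["ALU/datapath"]
    print("fraction of area computing:", useful)              # 0.05
    print("density dilution factor: %.0fx" % (1.0 / useful))  # 20x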


Slide 12: This slide shows the order of magnitude advantage of FPGA implementations over processors in terms of peak computational density. The advantage holds over a wide range of process technologies, from 3.0um down to 0.35um.


Slide 13: Stalls can include cache misses, unfilled branch slots, and data dependency stalls. With wider superscalar and VLIW processors, the degradation from peak computational density can be substantial, since their issue slots and functional units are often under-utilized.
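
A simple way to model this (with hypothetical utilization figures, not measurements from the slide) is to scale peak density by the fraction of issue slots actually used:

    # Minimal sketch: yielded density = peak density * issue-slot utilization.
    # Issue width and sustained ops/cycle below are hypothetical.

    def yielded_density(peak_density, issue_width, sustained_ops_per_cycle):
        utilization = sustained_ops_per_cycle / issue_width
        return peak_density * utilization

    # A 4-wide superscalar sustaining 1.2 ops/cycle after cache misses,
    # branch bubbles, and dependency stalls yields 30% of its peak density:
    print(yielded_density(peak_density=1.0, issue_width=4, sustained_ops_per_cycle=1.2))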


Slide 14: Longer path lengths make longer cycles necessary, degrading yielded density from peak. If the throughput is limited, bit elements can sit idle. Finally, mismatches in resource balance (e.g. insufficient interconnect or retiming registers/memory) can cause some units to be unused.


Slide 15: After fabrication, it is impossible to tailor the hardware exactly to a task (e.g. select the optimal word width, or specialize around constants or modes of operation). FPGA implementations (and in some cases processor and DSP implementations) allow these things to be defined on a per-application basis. Thus, MPGAs may have to accommodate more general problems (e.g. larger word sizes) than required at a given time, resulting in lower yielded density.


Slide 16: Yielded throughput density can give us a good metric for comparing various implementations.
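
Concretely, the metric can be thought of as operations actually delivered per second, per unit of technology-normalized area; the two implementations below are hypothetical:

    # Minimal sketch of the comparison metric. Both data points are hypothetical.

    def yielded_throughput_density(ops_per_second, area_lambda_sq):
        return ops_per_second / area_lambda_sq

    impl_a = yielded_throughput_density(ops_per_second=1e9, area_lambda_sq=4e8)
    impl_b = yielded_throughput_density(ops_per_second=5e9, area_lambda_sq=4e9)
    print(impl_a / impl_b)   # 2.0 -> implementation A delivers 2x the yielded density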


Slide 17: Here, "High Throughput" means that the most desirable throughput for the task is unbounded, or bounded at a fixed rate higher than a single compute element can achieve. For example, in database applications such as searching and sorting, any additional throughput can be put to productive use.


Slide 18: The full-custom implementations achieve the highest performance density. The FPGA beats the raw throughput of the DSP implementation, even though the DSP has a custom multiplier, because the computational density of the DSP multiplier is diluted by the remaining 90% of the chip. The RISC processor must synthesize the multiply operation temporally out of smaller add operations, and thus achieves the lowest density.
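
To illustrate the temporal synthesis (a generic shift-and-add sketch, not the code of any particular processor or library):

    # Minimal sketch of building a multiply out of shifts and conditional adds,
    # as a core without a hardware multiplier must do, roughly one step per cycle.

    def soft_multiply(a, b, width=16):
        product = 0
        for i in range(width):              # one iteration (or more) per cycle
            if (b >> i) & 1:
                product += a << i           # add the shifted partial product
        return product

    assert soft_multiply(1234, 567) == 1234 * 567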


Slide 19: The FPGAs have throughput density comparable to the custom implementations for programmable coefficients, but are 10-30x less dense than the custom implementations with fixed coefficients. The FPGAs are 30x denser than the DSPs!
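
The reason a fixed coefficient is so much cheaper is that the constant can be decomposed into a few shifts and adds at design time; a small sketch with an illustrative coefficient:

    # Minimal sketch: specializing a multiplier around a known constant.
    # The coefficient is illustrative, not taken from the slide.

    COEFF = 10                  # coefficient fixed before the design is built

    def times_coeff(x):
        # 10*x = 8*x + 2*x: two shifts and one add replace a full multiplier.
        return (x << 3) + (x << 1)

    assert times_coeff(37) == COEFF * 37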


Slide 20: The multipliers cannot be optimized around the coefficients, so the FPGA is 50x less dense than a custom solution. The FPGA still holds a 4x advantage over the DSP, and the advantage increases for lower precision solutions.


Slide 21: The custom IC should have a 2-4x speed normalization applied to account for its older process technology. The FPGA offers a speedup at the cost of 80x the silicon area. This particular application is ideally suited to a throughput comparison, as the parallel nature of the algorithm can be exploited at will.


Slide 22: The areas for the SPLASH implementations were approximated by the FPGA and memory areas. For SPLASH2, the area of the 9 crossbars is omitted. The processor areas may also be optimistic, since the first SPARC had no data cache, and even the SuperSPARC's data cache would not be large enough to hold the problem. Most of the SPLASH advantage comes from an increase in silicon. However, SPLASH was limited to 1/10 of its potential due to throughput problems in the I/O. This was corrected in SPLASH2, which shows a 6x improvement.

In this application, the SPLASH implementations used neither their crossbars nor their memories. At the end of the previous lecture, we saw a throughput density comparison that only included FPGA area.


Slide 23: Not surprisingly, the custom macro designs perform significantly better than the FPGA and than processors without floating-point support.


Slide 24: The custom macro-cells also perform better than FPGAs for floating-point multiply.


Slide 25: Area-time curves are non-ideal below certain time points. For example, doubling the compute time will not necessarily allow a 50% reduction in area. Residual control and irregular structures cannot be scaled down with the decreasing throughput requirements, which causes the non-ideality.


Slide 26: The DSP will continue to scale linearly below 800ns, assuming other tasks can be scheduled between filter evaluations. The custom IC can run at 33ns, but uses the same amount of silicon regardless of how slowly it is run. Assuming that the FIR filter throughput can be chosen before fabrication, a custom IC could be designed at any of the throughput rates with a continuous speed/area tradeoff. The DSP has the advantage for speeds of less than 0.5 MHz, with the FPGA having the advantage in the 0.5-8.0 MHz range.
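
The shape of these advantage regions can be sketched with a toy area model (all constants are hypothetical, chosen only to reproduce the qualitative crossovers, not values from the slide):

    # Minimal sketch: a DSP is a small fixed area that must be replicated past
    # its throughput cap, while FPGA area scales with parallelism but pays a
    # minimum fabric cost. All numbers are hypothetical.

    import math

    def dsp_area(rate_mhz):
        cap_mhz, unit_area = 0.5, 1.0
        return unit_area * math.ceil(rate_mhz / cap_mhz)

    def fpga_area(rate_mhz):
        return max(1.5, 0.5 * rate_mhz)

    for rate in (0.1, 2.0, 8.0):
        print(rate, dsp_area(rate), fpga_area(rate))
    # 0.1 MHz: the DSP is smaller; 2-8 MHz: the FPGA is smaller, matching the
    # advantage regions described above.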


Slide 27: Because DES encryption is less regular, there are fewer design options open to the FPGA. Scaling down the area is far less attractive for this application.


Slide 28: There is often a fixed area which does not scale down with throughput. To illustrate this, we look at an image scoring example. Here, a fixed area is required for the image memory and the retiming window. As we scale to higher-throughput designs, the per-operation cost of this fixed area is amortized across a large number of computational units.
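
A small sketch of the amortization effect (both area figures are hypothetical):

    # Minimal sketch: per-unit area = (fixed area + N * unit area) / N.
    # Both area constants are hypothetical.

    A_FIXED = 100.0     # image memory + retiming window
    A_UNIT = 10.0       # one computational unit

    def area_per_unit(n_units):
        return (A_FIXED + A_UNIT * n_units) / n_units

    for n in (1, 4, 16, 64):
        print(n, area_per_unit(n))
    # 110.0, 35.0, 16.25, 11.56...: the per-unit cost approaches A_UNIT as the
    # fixed area is amortized over more parallel units.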


Slide 29: Here, we see the area-time curve resulting from this example, along with an alternative based on custom image processing ASICs. In this case, the ASICs suffer in density because they solve a more general problem than the one at hand --- e.g. the custom BFIR is designed to handle a 32x32 image window while this task only requires a 16x16 window. The distance between the ASIC and the FPGA implementations closes as we go to larger designs, because the aforementioned fixed area is amortized in the more parallel implementations.


Slide 30: As we have seen, fixed overhead Af, such as timing and control, can impact the ideality of the space-time tradeoff. The path length can also limit the achievable speed of a design. Finally, the interconnect requirements may grow faster than linearly with parallelism.


Slide 31: Summary of the lecture's conclusions.