Slide 5:
Adding to the difficulty is the problem of comparing designs manufactured
in different process technologies. Care must be taken to normalize collected
data for different technology generations.
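A common first-order normalization is to scale areas by the square of the
feature size and delays by the feature size, so designs from different
generations can be compared in technology-independent units. The sketch
below illustrates that idea; the scaling exponents and example numbers are
assumptions for illustration, not values from the slides.

    # First-order technology normalization (assumed scaling:
    # area ~ lambda^2, gate delay ~ lambda).  Numbers are illustrative only.

    def normalized_area(area_mm2, feature_um):
        """Express area in lambda^2 so different processes can be compared."""
        lam = feature_um / 2.0                 # lambda is roughly half the feature size
        return (area_mm2 * 1e6) / (lam ** 2)   # mm^2 -> um^2, then divide by lambda^2

    def normalized_delay(delay_ns, feature_um):
        """First-order assumption: delay scales roughly with feature size."""
        return delay_ns / feature_um

    # Hypothetical example: the same 100 mm^2 die in a 0.35um and a 1.2um process
    print(normalized_area(100.0, 0.35))   # many more lambda^2 in the finer process
    print(normalized_area(100.0, 1.2))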
Slide 6:
The objective of the lecture is to compare computing devices across a spectrum
ranging from processors to full-custom designs, with ASICs, sea-of-gates
arrays, gate arrays, and FPGAs in between. Across this spectrum, performance
is traded off for flexibility.
| Custom---ASIC---Sea of Gates---PGA---FPGA---Processor|
----------> Flexibility ---------->
Slide 7:
With sea of gates, there is no fixed overhead for routing channels. Wires
are treated more as first-class citizens, and hence the interconnect density
is not limited by the width of the routing channels.
Slide 8:
Gate arrays are estimated to have 2-3x worse density than custom designs.
Gate arrays are least efficient at implementing memory storage elements,
which can be highly optimized in dedicated memory arrays.
Slide 9:
In a sea-of-gates array with two layers of metal, approximately 35% of the
gates are usable. This figure jumps to about 70% for 3-4 layers of metal.
FPGAs are about 5x less dense than MPGAs, and more like 10-15x less dense
than custom designs.
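Chaining the stated factors gives a rough feel for how the densities line
up; the arithmetic below simply multiplies the slide's estimates and is only
as good as those estimates.

    # Rough density chain from the figures above, relative to full custom = 1.0.
    custom = 1.0
    mpga   = custom / 2.5      # gate arrays: ~2-3x worse density than custom
    fpga   = mpga / 5.0        # FPGAs: ~5x less dense than MPGAs

    print(f"MPGA: {mpga:.2f}x custom density")
    print(f"FPGA: {fpga:.2f}x custom density (~{custom / fpga:.0f}x less dense)")
    # ~12-13x less dense, consistent with the 10-15x figure quoted above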
Slide 11:
The processors perform so badly in this metric because their density
is diluted by control and cache memory overhead. Only the ALU really
performs useful work. It can be argued that the instruction counter performs
useful work on branch statements. However, the branches might never have
been necessary if the computation had been spread out in area rather than
in time. Co-processors are also excluded from these comparisons.
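The dilution argument can be made concrete with a back-of-the-envelope
calculation: if only the ALU's area counts as delivering raw computation,
the effective density is the ALU's density scaled by the ALU's share of the
die. The 5% share below is a hypothetical placeholder, not a figure from the
slides.

    # Back-of-the-envelope dilution of processor density.
    alu_fraction = 0.05    # assumed share of die area doing "useful" datapath work
    peak_density = 1.0     # normalized peak density of the ALU itself
    effective    = peak_density * alu_fraction

    print(f"Effective density: {effective:.2f} of peak "
          f"({1 / alu_fraction:.0f}x dilution)")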
Slide 12:
This slide shows the order of magnitude advantage of FPGA implementations
over processors in terms of peak computational density. The advantage holds over
a wide range of process technologies, from 3.0um down to 0.35um.
Slide 13:
Stalls can include cache misses, unfilled branch slots, and data-dependency
stalls. With wider superscalar and VLIW processors, the degradation from
peak computational density can be even more severe, since these machines
often leave their issue slots and functional units under-utilized.
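One simple way to see how stalls eat into peak density is to fold them into
an achieved-CPI estimate and scale the peak by ideal/actual CPI. The stall
categories below follow the slide, but the rates and penalties are made-up
numbers for illustration.

    # Yielded density as peak density scaled by pipeline utilization.
    ideal_cpi       = 1.0
    cache_misses    = 0.02 * 20    # assumed miss rate * miss penalty (cycles)
    branch_bubbles  = 0.15 * 1     # assumed unfilled branch slots per instruction
    data_dep_stalls = 0.10 * 1     # assumed data-dependency interlocks

    actual_cpi  = ideal_cpi + cache_misses + branch_bubbles + data_dep_stalls
    utilization = ideal_cpi / actual_cpi

    print(f"Actual CPI: {actual_cpi:.2f}, "
          f"yielded density: {utilization:.0%} of peak")
    # For a wide superscalar or VLIW machine, empty issue slots would further
    # scale this utilization down.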
Slide 14:
Longer path lengths make longer cycles necessary, degrading yielded density
from peak. If the required throughput is limited, bit-processing elements
can sit idle. Finally, mismatches in resource balance (e.g. insufficient
interconnect or retiming registers/memory) can leave some units unused.
Slide 15:
After fabrication, it is impossible to tailor the hardware exactly to a
task (e.g. select the optimal word width, or specialize around constants or
modes of operation). FPGA implementations (and in some cases processor and
DSP implementations) allow these things to be defined on a per-application
basis. Thus, MPGAs may have to accommodate more general problems
(e.g. larger word sizes) than a given task requires, resulting in
lower yielded density.
Slide 16:
Yielded throughput density can give us a good metric for comparing
various implementations.
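As a working definition, yielded throughput density can be computed as
delivered operations per second divided by (normalized) silicon area. The
helper below is a minimal sketch of that metric; the units and example
values are chosen only for illustration.

    # Minimal sketch of a yielded throughput density metric:
    # delivered operations per second per unit of (normalized) area.

    def yielded_throughput_density(ops_per_cycle, clock_hz, utilization, area):
        """Delivered ops/s per unit area, discounting idle or stalled cycles."""
        return ops_per_cycle * clock_hz * utilization / area

    # Hypothetical comparison of two implementations of the same task:
    fpga = yielded_throughput_density(64, 25e6, 0.9, 4e9)
    cpu  = yielded_throughput_density(2, 200e6, 0.6, 8e9)
    print(f"FPGA/CPU density ratio: {fpga / cpu:.1f}x")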
Slide 17:
Here, "High Throughput" means that the most desirable throughput for the
task is unbounded, or bounded at a fixed rate higher than a single compute
element can achieve. For example, in data base applications such as searching and
sorting,
Slide 18:
The full-custom implementations achieve the highest performance density.
The FPGA beats the raw throughput of the DSP implementation, even though
the DSP has a custom multiplier. This is because the computational density
of the DSP's multiplier is diluted by the remaining 90% of the chip. The
RISC processor must synthesize the multiply operation temporally out of
smaller add operations, and thus achieves the lowest density.
Slide 19:
The FPGAs have throughput density comparable to the custom implementations
for programmable coefficients, but are 10-30x less dense for fixed
coefficients. The FPGAs are 30x more dense than the DSPs!
Slide 20:
The multipliers cannot be optimized around the coefficients,
so the FPGA is 50x less dense than a custom solution. The FPGA
still holds a 4x advantage over the DSP, and the advantage increases for
lower precision solutions.
Slide 21:
The custom IC should be given a 2-4x speed normalization to account for its
older process technology. The FPGA offers a speedup at the cost of 80x the
silicon area. This particular application is ideally suited to a throughput
comparison, as the parallel nature of the algorithm can be exploited at will.
Slide 22:
The areas for the SPLASH implementations were approximated by the FPGA and
memory areas. For SPLASH2, the area of the 9 crossbars is omitted. The
area of the processors may also be optimistic, since the first SPARC had no
data cache, and the SuperSPARC would still not have a large enough data
cache to hold the problem. Most of the SPLASH advantage comes from an
increase in silicon. However, SPLASH was limited to 1/10 of its potential
due to throughput problems in the I/O. This was corrected in SPLASH2, which
shows a 6x improvement. In this application, the SPLASH implementations
used neither their crossbars nor their memories. At the end of the previous
lecture, we saw a throughput density comparison that included only FPGA
area.
Slide 23:
Not surprisingly, the custom macro designs perform significantly better
than the FPGA and the processors lacking floating-point support.
Slide 24:
The custom macro-cells also perform better than FPGA's for floating
point multiply.
Slide 25:
Area-time curves are non-ideal below certain time points. For example,
doubling the compute time will not necessarily allow a 50% reduction in
area. Residual control and irregular structures cannot be scaled down with
the decreasing throughput requirements, which causes the non-ideality.
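The non-ideal shape can be reproduced with a simple area model in which a
fixed portion (control, irregular structures) does not shrink as the design
is serialized and only the parallel datapath scales with the time budget.
The constants below are hypothetical.

    import math

    A_FIXED  = 200.0    # area that cannot be reduced (control, I/O, retiming)
    A_PER_PE = 100.0    # area of one processing element
    W        = 64       # total work per result, in PE-cycles

    def area(t_cycles):
        """Area needed to finish W PE-cycles of work within t_cycles."""
        pes = math.ceil(W / t_cycles)
        return A_FIXED + pes * A_PER_PE

    for t in (1, 2, 4, 8, 16, 32, 64):
        print(f"T={t:3d}  area={area(t):7.0f}  area*T={area(t) * t:8.0f}")
    # area*T grows as T grows: doubling the time does not halve the area.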
Slide 26:
The DSP will continue to scale linearly below 800ns, assuming other
tasks can be scheduled between filter evaluations. The custom IC can
run at 33ns, but uses the same amount of silicon regardless of how
slowly it is run. Assuming that the FIR filter throughput can be chosen
before fabrication, a custom IC could be designed at any of the
throughput rates with a continuous speed/area tradeoff. The DSP has
the advantage at rates below 0.5 MHz, with the FPGA having the
advantage in the 0.5-8.0 MHz range.
Slide 27:
Because DES encryption is less regular, there are fewer design options
open to the FPGA. Scaling down the area is far less attractive for this
application.
Slide 28:
There is often a fixed area which does not scale down with throughput. To
illustrate this, we look at an image scoring example. Here, a fixed area is
required for image memory and the retiming window. As we scale to
higher-throughput designs, the per-operation cost of this fixed area is
amortized across a large number of computational units.
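The amortization effect shows up directly in a per-operation area
calculation: the fixed image-memory/retiming area is divided over however
many computational units are instantiated. The areas below are hypothetical
and only illustrate the trend.

    A_FIXED = 1000.0    # image memory + retiming window (does not scale)
    A_UNIT  = 50.0      # area of one computational unit

    for n_units in (1, 4, 16, 64, 256):
        per_op = (A_FIXED + n_units * A_UNIT) / n_units
        print(f"{n_units:4d} units: {per_op:7.1f} area per unit of work")
    # As n_units grows, the per-op cost approaches A_UNIT: the fixed area is
    # amortized away.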
Slide 29:
Here, we see the area-time curve resulting from this example, along with an
alternative based on custom image-processing ASICs. In this case, the
ASICs suffer in density because they solve a more general problem than the
one at hand --- e.g. the custom BFIR is designed to handle a 32x32
image window while this task only requires a 16x16 window. The gap
between the ASIC and the FPGA implementations closes as we go to larger
designs, since the aforementioned fixed area is amortized in the more
parallel implementations.
Slide 30:
As we have seen, fixed overhead (Af), such as timing and control, can
impact the ideality of the space-time tradeoff. The path length can also
limit the achievable speed of a design. Finally, the interconnect
requirements may grow more quickly than the number of compute elements as
parallelism increases.
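A small model can combine the three effects just listed: a fixed overhead
Af, a floor on cycle time set by path length, and interconnect area that
grows faster than linearly with parallelism. All constants and the N^1.5
interconnect growth are assumptions for illustration.

    A_F    = 500.0    # fixed timing/control overhead
    A_OP   = 100.0    # area per operator
    A_WIRE = 10.0     # interconnect area coefficient
    T_MIN  = 5.0      # cycle-time floor set by path length (ns)
    T_SEQ  = 100.0    # time per result with a single operator (ns)

    def yielded_density(n_ops):
        area = A_F + n_ops * A_OP + A_WIRE * n_ops ** 1.5   # superlinear wiring
        time = max(T_MIN, T_SEQ / n_ops)                    # speed saturates
        return 1.0 / (area * time)                          # results per (area*ns)

    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"N={n:3d}  density={yielded_density(n):.2e}")
    # Density improves with parallelism only until the cycle-time floor and
    # superlinear interconnect overhead dominate.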
Slide 31:
Summary of the lecture's conclusions.