Administrivia:
Class time will remain at TuTh 9:30-11am.
There is a new room for the class: 505 Soda Hall.
Because of FPGA '97, the Feb. 11 lecture has been rescheduled to 5pm,
and it will be held in 405 Soda Hall.
Slide 3:
Two fundamental components are required to do a computation:
(1) the ability to compute results from input values and
(2) the ability to communicate data in space and in time.
Slide 4:
This slide defines a key term, Primitive Instruction (pinst), which is used
throughout the rest of the discussion. Here are some examples of primitive
instructions for various architectures: (1) an instruction of a RISC processor,
(2) each functional-unit field in the long instruction of a VLIW processor,
(3) the bits controlling one k-LUT and its associated interconnect resources
in an FPGA.
Slide 5:
The rest of this discussion is based on the processing model shown in this
slide.
The diagram here shows how the configuration context is made up of pinsts,
with each pinst controlling its slice of the device (i.e., its compute
element, its local state manipulation, and its slice of the spatial
interconnect).
Slide 6:
In the most general or ideal form of our processing model, each and every
computing element is given a new instruction on every cycle.
This gives us the highest degree of flexibility and expressiveness in
deploying the available computing resources.
Slide 7:
The problem with the ideal model is that the instruction bandwidth and
storage required to provide a new instruction to each and every computing
element on every cycle can quickly exhaust the available hardware resources.
To illustrate this point, let us consider an array of N processing elements
(each side of the array is SQRT(N) elements wide). Let us assume each array
element (AE) is a 1K-Lambda by 1K-Lambda block and requires a 64-bit
instruction. Wire pitch is 8 Lambda.
Slide 8:
Let us assume that we can bring instructions into the array from all four
sides using two dedicated metal layers (this is extremely generous). Each
side of the array is SQRT(N) x 1K-Lambda wide and can be used to bring
2 x SQRT(N) 64-bit instructions into the array. There are 4 sides, so a total
of 8 x SQRT(N) instructions can be brought into the array every cycle.
For N = 64, this aggregate capacity is completely exhausted by the array
elements (i.e., 8 x SQRT(N) = N).
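Here is a small back-of-the-envelope sketch of this accounting. The constants
are the ones quoted above; the function name and the "roughly one metal layer
usable per side" simplification are just illustrative assumptions:

  # Back-of-the-envelope check of the Slide 7/8 numbers (illustrative only).
  # Assumes 1K-Lambda x 1K-Lambda array elements, 8-Lambda wire pitch,
  # 64-bit pinsts, and roughly one metal layer usable per array side.
  import math

  AE_SIDE = 1000      # array-element edge length, in Lambda
  WIRE_PITCH = 8      # Lambda per wire track
  PINST_BITS = 64     # bits per primitive instruction

  def edge_instruction_supply(n_elements):
      """Pinsts/cycle the array edge can deliver vs. pinsts the array needs."""
      side = math.sqrt(n_elements)                   # elements per array side
      tracks_per_side = side * AE_SIDE / WIRE_PITCH  # wire tracks on one edge
      insts_per_side = tracks_per_side / PINST_BITS  # ~2*sqrt(N) per side
      return 4 * insts_per_side, n_elements          # ~8*sqrt(N) vs. N needed

  for n in (16, 64, 256):
      supplied, needed = edge_instruction_supply(n)
      print(f"N={n:4d}: edge supplies ~{supplied:.0f} pinsts/cycle,"
            f" array needs {needed}")
  # N = 64 is the break-even point: 8 * sqrt(64) = 64.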
Slide 9:
To accommodate more than 64 elements, we'll have to increase the size of
each AE to provide additional space to bring in the instructions needed
by the extra AEs. Assuming that the aggregate channel capacity available
for instructions is fully utilized, the area of each AE must grow to
16K-Lambda^2 x N, so the total area of the array (AE area x N) is
proportional to N^2. The implication of this quadratic
dependence is that the required instruction distribution bandwidth has a
strong influence on how dense arrays of processing elements can be made.
Density, defined as the number of processing elements per unit area, is
therefore proportional to 1/N.
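Continuing the sketch for this slide (same assumed numbers as before; the
solver simply inverts the edge-capacity formula used above):

  # If the edge must deliver one pinst per AE per cycle, each AE's side
  # length has to grow with N (illustrative numbers only).
  import math

  WIRE_PITCH = 8    # Lambda
  PINST_BITS = 64   # bits per pinst

  def ae_area_for_full_issue(n_elements):
      """AE area (Lambda^2) needed so the perimeter delivers N pinsts/cycle."""
      # 4 * sqrt(N) * side / (pitch * bits) = N  =>  side ~ 128*sqrt(N) Lambda
      side = n_elements * WIRE_PITCH * PINST_BITS / (4 * math.sqrt(n_elements))
      return side ** 2                              # ~16K * N Lambda^2 per AE

  for n in (64, 256, 1024):
      ae = ae_area_for_full_issue(n)
      total = n * ae                                # ~16K * N^2: quadratic in N
      print(f"N={n:5d}: AE ~{ae/1e3:.0f}K Lambda^2,"
            f" array ~{total/1e6:.0f}M Lambda^2,"
            f" density ~{n/total:.1e} PEs/Lambda^2")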
Slide 10:
What if, instead of fetching instructions from memories external to the
array, we were to store all instructions inside the array? This would seem
to solve the bandwidth problem. However, the problem now is that the area
required for instruction storage can quickly dominate.
To illustrate this point, let's assume that each bit of storage takes up
1.2K-Lambda^2 of area (i.e., ~80K-Lambda^2 of area per 64-bit instruction).
Each AE will now require an additional n x 80K-Lambda^2 of area to store
n instructions (here n is the number of instructions stored per AE, not the
array size). For n = 13, half of the total area of each AE is dedicated
to instruction storage!
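A quick check of this break-even point, using the storage and block sizes
assumed above (illustrative only):

  # Number of locally stored instructions at which the instruction store
  # is as large as the 1K x 1K (~1M Lambda^2) array element itself.
  BIT_AREA = 1200            # Lambda^2 per bit of instruction storage
  PINST_BITS = 64
  AE_AREA = 1000 * 1000      # Lambda^2 of array-element logic

  inst_area = PINST_BITS * BIT_AREA     # ~77K (~80K) Lambda^2 per instruction
  breakeven = AE_AREA / inst_area       # ~13 instructions
  print(f"~{inst_area/1e3:.0f}K Lambda^2 per stored instruction;"
        f" storage equals logic at ~{breakeven:.0f} instructions")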
Slide 11:
The bottom line is that instruction distribution requirements can easily
dominate available hardware resources and we cannot afford to provide a
new instruction to every bit-processing element on every cycle.
Is there anything we can do about this problem? Sure! Try reducing the
instruction distribution requirements. See if there are any properties or
patterns that can be exploited to achieve that end.
Slide 12:
Two examples that represent extreme means of reducing instruction
distribution requirements are:
(1) SIMD Array: all processing elements execute the same instruction
every cycle. A single channel carries the same instruction to all elements.
This technique amortizes the cost of a pinst across many PEs.
The wide datapaths of conventional processors exploit this technique
by having each processing element perform the same operation on
different bits of an n-bit data word.
(2) FPGA: each processing element has its own locally stored instruction
and executes it every cycle. Temporal locality is exploited to the limit,
where the same operation is performed from cycle to cycle. The array can
be reprogrammed, albeit extremely slowly.
Slide 13:
And then there are hybrid solutions:
(1) VLIW/Superscalar Processors: only a few pinsts are issued every
cycle, and each pinst is shared across w bit-processing elements that
process w-bit words of data.
(2) DPGA: each processing element has a few locally stored contexts that
define its behavior at different times. This allows a DPGA to tolerate
temporal variations in an application somewhat better than an FPGA, but
a DPGA is still limited in how much temporal variation it can tolerate.
Slide 14:
This slide presents a taxonomy of computing architectures based on
the instruction distribution mechanisms that are used. The key parameters
are the datapath width (w) controlled by each pinst and the depth (c) of
the instruction memory local to each processing element.
Slide 15:
The question that arises now is what if the parameters of a given
architecture do not match the corresponding parameters of a given
application. What if the datapath is too narrow or too wide? What if
the instruction memory is too deep or too shallow? What inefficiencies
do these mismatches cause, and what price do we pay for them?
Slide 16:
To answer that question, we need to build a model, and this model
has to capture two things: the area of a given device used to implement
an application, and a measure of how well that device is utilized.
Slide 17:
To develop this model, we will assume that our computing devices
are based on the composition shown in this slide. There are three
components to account for: bit-processing elements, spatial and
temporal interconnect, and instruction memory.
Slide 18:
Here's the model for the area of each bit-processing element.
The areas associated with each of the components shown in the
previous slide are added up. The parameters of this model along
with some of the assumed values are fully defined in Table 9.1
of Andre' DeHon's thesis, but the key ones that we will use for the rest
of this discussion are the datapath width (w) and the context depth (c).
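As a rough stand-in for that model, here is a minimal sketch built only from
the numbers quoted earlier in these notes (1.2K-Lambda^2 per memory bit,
64-bit pinsts) plus an assumed ~1M-Lambda^2 fixed cost per bit-processing
element. The constants are NOT the Table 9.1 values, and the function names
are just for illustration:

  # Sketch of a per-bit-processing-element area model in the spirit of this
  # slide.  Constants are assumptions reused from earlier in these notes,
  # not the values from Table 9.1 of the thesis.
  A_FIXED = 1_000_000    # Lambda^2: compute element + interconnect (assumed)
  A_MEM_BIT = 1_200      # Lambda^2 per instruction-memory bit
  PINST_BITS = 64        # bits per pinst

  def bpe_area(w, c):
      """Area per bit-processing element for datapath width w, context depth c.
      The c locally stored pinsts are amortized over the w bit elements."""
      return A_FIXED + c * PINST_BITS * A_MEM_BIT / w

  def peak_density(w, c):
      """Bit-processing elements per Lambda^2 (the quantity on Slide 20)."""
      return 1.0 / bpe_area(w, c)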
Slide 19:
Just to illustrate that this model is reasonably accurate, here is a
comparison of estimated and actual values for a few computing devices.
Slide 20:
This slide shows a plot of peak computational density against w and c.
Increasing w results in smaller bit-processing elements and higher density
because there is more sharing of instruction memories and fewer switches
in the interconnect. Increasing c results in lower computational
density because there are more instructions per processing element,
making each element larger. This effect is less severe for larger w because the cost
of additional instruction memory locations is amortized over a larger
number of processing elements.
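Evaluating the sketch model from the Slide 18 note at a few (w, c) points
reproduces these trends qualitatively (only the instruction-memory sharing is
modeled here; the interconnect savings mentioned above are not):

  # Same assumed constants as the Slide 18 sketch, repeated so this runs alone.
  def peak_density(w, c, a_fixed=1_000_000, a_bit=1_200, bits=64):
      return 1.0 / (a_fixed + c * bits * a_bit / w)

  for w, c in [(1, 1), (64, 1), (1, 16), (64, 16), (64, 1024)]:
      print(f"w={w:3d} c={c:5d}:"
            f" density ~{peak_density(w, c):.2e} bit-elems/Lambda^2")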
Slide 21:
This is another spot check of the validity of our model against
actual devices.
Slide 22:
The next important question is the efficiency of a given device for
implementing a given application. Efficiency is a measure of how fully the
available hardware resources are utilized. An important observation here is
that datapath width (w) and context depth (c) have duals in any given
application: the native bit-width of the application's data values and the
application's path length, respectively.
For example, if we wanted to execute 16-bit operations on a 32-bit device,
then half of the datapath resources would be wasted. If we were to execute
16-bit operations using 8-bit devices (throughput held fixed), then we would
have to use two devices with two sets of instruction memories, so now half
of the instruction memory is wasted. Executing 16-bit operations on a 16-bit
device would be ideal, as no hardware resources would be left unused. For an
architecture to be efficient, the corresponding parameters should match well.
We define efficiency as the ratio of the area of a perfectly matched
architecture (one with just the right values of w and c) to the area of the
architecture being evaluated.
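Here is a rough sketch of the width-mismatch cases above, reusing the assumed
constants from the earlier sketches (this is a simplified accounting, not the
efficiency formula from the thesis): the "allocated" area counts every bit
element, together with its share of instruction memory, that the mapping ties
up at fixed throughput.

  # Efficiency = matched area / allocated area (illustrative sketch only).
  A_FIXED, A_MEM_BIT, PINST_BITS = 1_000_000, 1_200, 64

  def bpe_area(w, c):
      return A_FIXED + c * PINST_BITS * A_MEM_BIT / w

  def width_mismatch_efficiency(w_app, w_arch, c=16):
      matched = w_app * bpe_area(w_app, c)
      n_devices = -(-w_app // w_arch)        # ceil(w_app / w_arch)
      allocated_bits = n_devices * w_arch    # whole w_arch-bit words are tied up
      return matched / (allocated_bits * bpe_area(w_arch, c))

  for w_arch in (8, 16, 32):
      eff = width_mismatch_efficiency(16, w_arch)
      print(f"16-bit ops on {w_arch:2d}-bit devices: ~{100 * eff:.0f}% efficient")

With only 16 contexts, the duplicated instruction memory in the 8-bit case is
a small fraction of the total area, so that penalty looks mild; for deeper,
processor-like instruction memories the duplicated storage would cost much
more.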
Slide 23:
In this slide efficiency is plotted against path length and context depth.
Note that path length is an application characteristic and is a measure
of the number of operations which must occur in series before
we can repeat the programmed operation (or, alternatively, the number of
operation time slots we can afford to take in order to perform each
round of a computation), whereas context depth is a hardware parameter. The
two are duals of each other.
Highest efficiency is achieved when the path length of a given application
is equal to the context depth of the architecture used to implement that
application. In other words we have a perfect match and no resources are
wasted. When context depth is much larger than path length, a great deal
of area is wasted in the form of unused instruction memory locations, and
efficiency drops sharply. When context depth is too shallow, large numbers
of processing elements are unused for long periods while they are waiting
for their turn to "kick in" and contribute to the computation. In this case,
it is the processing elements that are being wasted.
Slide 24:
In the previous slide we can see that efficiency drops sharply to near-zero
values for a single-context FPGA (c = 1) as path length is increased. This
is an indicator of the sensitivity of this kind of device to parameter
mismatches. The next logical question is whether a different value of context
depth can provide less sensitivity to architectural mismatches, and the
answer, as shown in this slide, is affirmative. For c = 16 (e.g., a DPGA
with 16 contexts), efficiency does not drop much below 50% for large
variations in path length. The reason is that in this scenario, the area
of the instruction memory is roughly equal to that of the active elements;
therefore, in either extreme of parameter mismatches, at most half of
the device area is unused.
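To see why, with the sketch numbers used earlier (not the thesis values): at
w = 1 and c = 16, the local instruction store occupies about
16 x 64 x 1.2K-Lambda^2 ~ 1.2M-Lambda^2, comparable to the ~1M-Lambda^2
assumed for the fixed logic and interconnect of a bit-processing element.
Whichever half a mismatch leaves idle, the other half is still doing useful
work, so efficiency stays in the vicinity of 50% or better.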
Slide 25:
In the previous slides the assumption was that w = 1. For other values
of w, the robust point where instruction memory and active elements each
take up roughly half of the available area varies. For larger w, the
instruction memory is amortized over a larger group of bit-processing
elements; therefore, maximum robustness will occur for larger context
depths.
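With the same assumed constants as before, the balance point
c x 64 x 1.2K-Lambda^2 / w ~ 1M-Lambda^2 gives a robust context depth of
roughly c ~ 13w, i.e., the balanced depth grows about linearly with the
datapath width (illustrative numbers only, not the thesis values).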
Slide 26:
In this slide, the efficiency of two extreme architectures is
plotted against data width and path length. The efficiency of
single-context, single-bit datapath FPGAs drops rapidly to less
than 1% for wide data and long path lengths.
In the case of a "Processor" with a wide datapath
and a deep instruction memory, efficiency is high for long programs and
wide data values, but poor for small programs and narrow data values
(efficiency drops to less than 0.5% for path length and data size of one).
These two architectures accommodate complementary points in the
application space.
Slide 27:
In the previous slide we saw that both architectures were very
inefficient at their cross points, and neither was very efficient in the
middle of the space marked out. Inspired by the robust point we found for
c in the fixed w case, we might try to find a more robust design point for
this space. Here, we have picked a fairly good intermediate point.
Note that it has the robustness property when one of the parameters is
matched (i.e., at the w=8, c=64 cross-section). However, it drops off
toward the corners -- it is less than 10% efficient at the c = w = 1 and
c = 1024, w = 128 corner points. This further illustrates that the space of
architectures is large. With fixed instruction assignment, it's not
possible to pick a single set of parameters which has bounded efficiency
across the whole space; here we see it's even hard to have a reasonable
bound in a small subspace.
Slide 28:
Keep in mind that in this analysis some second-order effects
were necessarily abstracted away. Real applications have a heterogeneous
mix of characteristics, and the application space is much larger than what
can be captured by the parameters that were considered in this analysis.
Having said that, the models that we have developed provide valuable insight
and capture the essence of what is going on.
Slide 29:
And here's a summary of what was discussed in this lecture.