Computer Science 294-7 Lecture #3
Instructions

Notes by Arthur Abnous


Administrivia: Class time will remain at TuTh 9:30-11am. There is a new room for the class: 505 Soda Hall. Because of FPGA '97, the Feb. 11 lecture has been rescheduled to 5pm, and it will be held in 405 Soda Hall.


Slide 3: Two fundamental components are required to do a computation: (1) the ability to compute results from input values and (2) the ability to communicate data in space and in time.


Slide 4: This slide defines a key term, Primitive Instruction (pinst), which is used throughout the rest of the discussion. Here are some examples of primitive instructions for various architectures: (1) an instruction of a RISC processor, (2) each functional-unit field in the long instruction of a VLIW processor, (3) the bits controlling one k-LUT and the associated interconnect resources in an FPGA.


Slide 5: The rest of this discussion is based on the processing model shown in this slide.

The diagram here shows how the configuration context is made up of pinsts, with each pinst controlling its slice of the device (i.e., its compute element, its local state manipulation, and its slice of the spatial interconnect).


Slide 6: In the most general or ideal form of our processing model, each and every computing element is given a new instruction on every cycle. This gives us the highest degree of flexibility and expressiveness in deploying the available computing resources.


Slide 7: The problem with the ideal model is that the instruction bandwidth and storage required to provide a new instruction to each and every computing element on every cycle can quickly overwhelm the available hardware resources. To illustrate this point, let us consider an array of N processing elements (each side of the array is SQRT(N) elements wide). Let us assume each array element (AE) is a 1K-Lambda by 1K-Lambda block and requires a 64-bit instruction, and that the wire pitch is 8 Lambda.


Slide 8: Let us assume that we can bring instructions into the array from all four sides using two dedicated metal layers (this is extremely generous). Each side of the array is SQRT(N) x 1K-Lambda wide and can be used to bring 2 x SQRT(N) 64-bit instructions into the array. There are 4 sides, so a total of 8 x SQRT(N) instructions can be brought into the array every cycle. For N = 64, this aggregate capacity is completely exhausted by the array elements (i.e., 8 x SQRT(N) = N).
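
As a quick sanity check on this arithmetic, here is a small Python sketch. The 64-bit instruction, 8-Lambda wire pitch, and 1K-Lambda element size come from Slide 7; the count of roughly two instruction buses per 1K-Lambda of array side (and hence 2 x SQRT(N) instructions per side) follows the slide's own accounting.

    import math

    INSTR_BITS = 64      # bits per pinst (Slide 7)
    WIRE_PITCH = 8       # Lambda per wire
    AE_SIDE = 1000       # each array element is ~1K-Lambda on a side

    # ~2 instruction buses fit across each 1K-Lambda of array side:
    # 1000 Lambda / (8 Lambda/wire * 64 wires/instruction) ~= 2
    buses_per_element_width = round(AE_SIDE / (WIRE_PITCH * INSTR_BITS))

    def perimeter_supply(n):
        """Instructions deliverable into the array per cycle from 4 sides."""
        return 4 * buses_per_element_width * math.sqrt(n)

    for n in (16, 64, 256):
        print(f"N = {n:4d}: supply = {perimeter_supply(n):6.1f}, demand = {n}")
    # At N = 64 the supply, 8 * sqrt(64) = 64, is exactly consumed.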


Slide 9: To accommodate more than 64 elements, we have to increase the size of each AE to provide additional wiring channel capacity for the instructions needed by the extra AEs. Assuming that the aggregate channel capacity available for instructions is fully utilized, the area of each AE becomes 16K-Lambda^2 x N. This means that the total area of the array (AE area x N) is proportional to N^2. The implication of this quadratic dependence is that the required instruction distribution bandwidth has a strong influence on how dense arrays of processing elements can be made. Density, defined as the number of processing elements per unit area, is proportional to 1/N.
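
The quadratic blow-up is easy to see numerically; a short sketch (the 16K-Lambda^2 x N figure is taken from the slide):

    # Scaling of AE area, total array area, and density with N (Slide 9).
    def ae_area(n):
        """Area (Lambda^2) each AE must grow to so that the perimeter can
        still deliver one 64-bit instruction per AE per cycle."""
        return 16_000 * n                # 16K-Lambda^2 x N, per the slide

    for n in (64, 256, 1024):
        total = ae_area(n) * n           # total array area grows as N^2
        density = n / total              # elements per Lambda^2, falls as 1/N
        print(f"N = {n:5d}: AE = {ae_area(n):>12,} Lambda^2, "
              f"array = {total:>14,} Lambda^2, density = {density:.2e} AE per Lambda^2")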


Slide 10: What if, instead of fetching instructions from memories external to the array, we were to store all instructions inside the array? This would seem to solve the bandwidth problem. However, the problem now is that the area required for instruction storage can quickly dominate. To illustrate this point, let's assume that each bit of storage takes up 1.2K-Lambda^2 of area (i.e., ~80K-Lambda^2 of area per 64-bit instruction). Each AE will now require an additional N x 80K-Lambda^2 of area to store its own N instructions locally. For N = 13, half of the total area of each AE is dedicated to instruction storage!
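
The break-even depth quoted on the slide follows directly from these constants (Python sketch):

    # Local instruction storage vs. compute area (Slide 10).
    AE_BASE_AREA = 1_000_000             # 1K-Lambda x 1K-Lambda array element
    BIT_AREA = 1_200                     # ~1.2K-Lambda^2 per bit of storage
    INSTR_AREA = BIT_AREA * 64           # ~80K-Lambda^2 per 64-bit instruction

    # Depth at which locally stored instructions occupy as much area as the
    # compute portion of the AE: N * INSTR_AREA == AE_BASE_AREA
    n_half = AE_BASE_AREA / INSTR_AREA
    print(f"storage equals compute area at about {n_half:.0f} instructions")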


Slide 11: The bottom line is that instruction distribution requirements can easily dominate the available hardware resources, and we cannot afford to provide a new instruction to every bit-processing element on every cycle. Is there anything we can do about this problem? Sure! Try reducing the instruction distribution requirements. See if there are any properties or patterns that can be exploited to achieve that end.


Slide 12: Two examples that represent extreme means of reducing instruction distribution requirements are: (1) SIMD Array: all processing elements execute the same instruction every cycle. A single channel carries the same instruction to all elements. This technique amortizes the cost of a pinst across many PEs. The wide datapaths of conventional processors exploit this technique by having each processing element perform the same operation on different bits of an n-bit data word. (2) FPGA: each processing element has its own locally stored instruction and executes it every cycle. Temporal locality is exploited to the limit, where the same operation is performed from cycle to cycle. The array can be reprogrammed, albeit extremely slowly.


Slide 13: And then there are hybrid solutions: (1) VLIW/Superscalar Processors: only a few pinsts are issued every cycle, and each pinst is shared across w bit-processing elements that process w-bit words of data. (2) DPGA: each processing element has a few locally stored contexts that define its behavior at different times. This allows a DPGA to tolerate temporal variations in an application somewhat better than an FPGA, but a DPGA is still limited in how much temporal variation it can tolerate.


Slide 14: This slide presents a taxonomy of computing architectures based on the instruction distribution mechanisms that are used. The key parameters are the datapath width w (how many bit-processing elements share a single pinst) and the instruction or context depth c (how many pinsts are stored locally per element); these two parameters organize the rest of the discussion.
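
To make the taxonomy concrete, here is an illustrative placement of some of the architectures from the surrounding slides in the (w, c) space. The FPGA and DPGA points follow the values used later in the lecture; the SIMD and processor numbers are merely representative examples, not figures from the slides.

    # Illustrative (w, c) coordinates for a few architecture classes.
    architectures = {
        "FPGA (single context)": (1, 1),             # w = 1, c = 1
        "DPGA": (1, 16),                             # c = 16, as on Slide 24
        "SIMD array": (1024, 1),                     # one pinst shared by many PEs (illustrative)
        "Processor (VLIW/Superscalar)": (64, 1024),  # wide word, deep instruction memory (illustrative)
    }
    for name, (w, c) in architectures.items():
        print(f"{name:30s} w = {w:5d}  c = {c:5d}")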


Slide 15: The question that arises now is: what if the parameters of a given architecture do not match the corresponding parameters of a given application? What if the datapath is too narrow or too wide? What if the instruction memory is too deep or too shallow? What inefficiencies do these mismatches cause, and what price do we pay for them?


Slide 16: To answer that question, we need to build a model, and this model has to capture two things: the area of a given device used to implement an application, and a measure of how well that device is utilized.


Slide 17: To develop this model, we will assume that our computing devices are based on the composition shown in this slide. There are three components to account for: bit-processing elements, spatial and temporal interconnect, and instruction memory.


Slide 18: Here's the model for the area of each bit-processing element. The areas associated with each of the components shown in the previous slide are added up. The parameters of this model, along with the assumed values, are fully defined in Table 9.1 of Andre' DeHon's thesis; the key ones that we will use for the rest of this discussion are the datapath width (w) and the context depth (c).
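
The structure of the model can be sketched in a few lines of Python. The constants below are illustrative placeholders, not the calibrated values from Table 9.1, but the trends they produce match the density plot discussed on Slide 20.

    # Minimal sketch of a per-bit-element area model in the spirit of Slide 18.
    # Constants are illustrative placeholders (NOT the values from Table 9.1).
    A_COMPUTE = 40_000     # fixed area of one bit-processing element (Lambda^2)
    A_NETWORK = 600_000    # word-wide interconnect overhead, shared by w bits
    A_CONTEXT = 80_000     # one stored instruction per word, per context

    def area_per_bit(w, c):
        """Area of one bit-processing element when instruction memory and
        word-wide overhead are amortized over a w-bit datapath."""
        return A_COMPUTE + (A_NETWORK + c * A_CONTEXT) / w

    # Peak computational density (Slide 20) is the reciprocal of this area.
    for w in (1, 8, 64):
        for c in (1, 16, 256):
            a = area_per_bit(w, c)
            print(f"w = {w:2d}, c = {c:3d}: area = {a:12,.0f} Lambda^2, "
                  f"density = {1.0 / a:.2e} bit-ops per Lambda^2")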


Slide 19: Just to illustrate that this model is reasonably accurate, here is a comparison of estimated and actual values for a few computing devices.


Slide 20: This slide shows a plot of peak computational density against w and c. Increasing w results in smaller bit-processing elements and higher density because there is more sharing of instruction memories and fewer switches in the interconnect. Increasing c results in lower computational density because there are more instructions per processing element, resulting in a larger area per element. This effect is less severe for larger w because the cost of additional instruction memory locations is amortized over a larger number of processing elements.


Slide 21: This is another spot check of the validity of our model against actual devices.


Slide 22: The next important question is the efficiency of a given device for implementing a given application. Efficiency is a measure of how fully the available hardware resources are utilized. The important observation here is that the datapath width (w) and the context depth (c) have duals in any given application: the native bit-width of the application's data values and the application's path length, respectively.

For example, if we wanted to execute 16-bit operations on a 32-bit device, then half of the datapath resources would be wasted. If we were to execute 16-bit operations using 8-bit devices (at the same throughput), then we would have to use two devices with two sets of instruction memories; now half of the instruction memory is wasted. Executing 16-bit operations on a 16-bit device would be ideal, as no hardware resources would be left unused. For an architecture to be efficient, the corresponding parameters should match well. We define efficiency as the ratio of the area of a perfectly matched architecture (one with just the right values of w and c) to the area of the architecture being evaluated.
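
As a toy illustration of this definition along the width axis alone (context depth held fixed; the two area constants are placeholders, not values from the thesis):

    import math

    # Toy efficiency-vs-datapath-width model for the Slide 22 example.
    A_BIT = 50_000        # area per bit of datapath (illustrative)
    A_INSTR = 500_000     # area of one instruction memory/controller (illustrative)

    def area_for_app(w_device, w_app):
        """Area needed to perform one w_app-bit operation per cycle using
        devices with a w_device-bit datapath."""
        n_devices = max(1, math.ceil(w_app / w_device))
        return n_devices * (w_device * A_BIT + A_INSTR)

    def efficiency(w_device, w_app):
        return area_for_app(w_app, w_app) / area_for_app(w_device, w_app)

    for w_dev in (8, 16, 32):
        print(f"16-bit operations on a {w_dev:2d}-bit device: "
              f"efficiency = {efficiency(w_dev, 16):.0%}")

The mismatched cases each waste half of one resource (datapath bits on the 32-bit device, instruction memory on the pair of 8-bit devices) and so score below the matched 16-bit device, which comes out at 100%.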


Slide 23: In this slide, efficiency is plotted against path length and context depth. Note that path length is an application characteristic: it is the number of operations that must occur in series before we can repeat the programmed operation (or, alternately, the number of operation time slots we can afford to take for each round of a computation), whereas context depth is a hardware parameter. The two are duals of each other. The highest efficiency is achieved when the path length of a given application is equal to the context depth of the architecture used to implement it; in other words, we have a perfect match and no resources are wasted. When context depth is much larger than path length, a great deal of area is wasted in the form of unused instruction memory locations, and efficiency drops sharply. When context depth is too shallow, large numbers of processing elements sit unused for long periods, waiting for their turn to "kick in" and contribute to the computation; in this case, it is the processing elements that are being wasted.


Slide 24: In the previous slide we can see that efficiency drops sharply to near-zero values for a single-context FPGA (c = 1) as path length is increased. This is an indicator of how sensitive this kind of device is to parameter mismatches. The next logical question is whether a different value of context depth can provide less sensitivity to architectural mismatches, and the answer, as shown in this slide, is affirmative. For c = 16 (e.g., a DPGA with 16 contexts), efficiency does not drop much below 50% over large variations in path length. The reason is that in this scenario the area of the instruction memory is roughly equal to that of the active elements; therefore, at either extreme of parameter mismatch, at most half of the device area is unused.
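
These two behaviours, the sharp drop for c = 1 and the roughly 50% floor for c = 16, fall out of even a toy one-axis model (w = 1 throughout). The constants below are illustrative and chosen only so that at c = 16 the instruction memory roughly equals the active area, as the slide assumes.

    import math

    # Toy efficiency-vs-path-length model (Slides 23-24), w = 1 throughout.
    A_ACTIVE = 1_280_000    # active area of one processing element (illustrative)
    A_CONTEXT = 80_000      # one locally stored instruction (illustrative)

    def pe_area(c):
        return A_ACTIVE + c * A_CONTEXT

    def efficiency(c, path_length):
        """Matched area (c == path length) over the area actually consumed.
        If c < path length, extra PEs must hold the overflow instructions and
        sit idle; if c > path length, the spare contexts are dead weight."""
        matched = pe_area(path_length)
        n_pes = math.ceil(path_length / c)
        return matched / (n_pes * pe_area(c))

    for c in (1, 16, 256):
        for l in (1, 16, 256):
            print(f"c = {c:3d}, path length = {l:3d}: "
                  f"efficiency = {efficiency(c, l):6.1%}")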


Slide 25: In the previous slides the assumption was that w = 1. For other values of w, the robust point, where instruction memory and active elements each take up roughly half of the available area, shifts. For larger w, the instruction memory is amortized over a larger group of bit-processing elements; therefore, maximum robustness occurs at larger context depths.


Slide 26: In this slide, the efficiencies of two extreme architectures are plotted against data width and path length. The efficiency of a single-context, single-bit-datapath FPGA drops rapidly to less than 1% for wide data and long path lengths. In the case of a "Processor" with a wide datapath and a deep instruction memory, efficiency is high for long programs and wide data values, but poor for small programs and narrow data values (efficiency drops to less than 0.5% for a path length and data size of one). The two architectures accommodate complementary points in the application space.


Slide 27: In the previous slide we saw that both architectures were very inefficient at their cross points, and neither was very efficient in the middle of the space marked out. Inspired by the robust point we found for c in the fixed-w case, we might try to find a more robust design point for this space. Here, we have picked a fairly good intermediate point. Note that it has the robustness property when one of the parameters is matched (i.e., at the w = 8, c = 64 cross-section). However, it drops off toward the corners, falling below 10% efficiency at the c = w = 1 and c = 1024, w = 128 corner points. This further illustrates that the space of architectures is large. With fixed instruction assignment, it is not possible to pick a single set of parameters with bounded efficiency across the whole space; here we see that it is hard to maintain a reasonable bound even in a small subspace.


Slide 28: Keep in mind that in this analysis some second-order effects were necessarily abstracted away. Real applications have a heterogeneous mix of characteristics, and the application space is much larger than what can be captured by the parameters considered here. Having said that, the models we have developed provide valuable insight and capture the essence of what is going on.


Slide 29: And here's a summary of what was discussed in this lecture.