Slide 3:
Systolic architectures "pump" data through various points and processing
elements in the design. Data flows rhythmically, like blood pumped by the
human heart, and is synchronized with the clock.
Slide 4:
One of the motivations here is to minimize the number of memory accesses.
Each data item comes from either memory or the outside world, and
intermediate results are used directly, without a round trip to memory or the
outside world.
Slide 5:
The philosophy of Systolic architectures is to emphasize local interconnect.
Placing communicating elements next to each other reduces both the area
taken up by wires and the wire delay.
Slide 6:
Researchers in the manufacturing industry faced similar problems
in trying to optimize their process.
They realized that they were performing many overhead operations on the
assembly line that were not "value added". Generally, they were checking
parts out from a central inventory, doing real work and checking them in
again. Checking things in and out took up considerable time and energy,
often dominating assembly-line processing time.
Slide 7:
This slide presents the "rules of engagement" for Systolic architectures, on
top of which algorithms can be implemented to solve real problems. Note that
pipelined communication between cells is the central concept that gives
Systolic its behavior.
SPLASH, discussed earlier, is an example of a Systolic Architecture.
Historical note: these architectures came around when people started
building VLSI (100 transistors, lambda = 2 microns, 1979).
Slide 8:
Many DSP architectures, such as FIR filters, use a Systolic methodology.
As we implement RC's, systolic solutions should be in our bag-of-tricks
(sort of like numerical recipes).
Historical note: another reason people started using Systolic was the
realization that VLSI wires were extremely important: there was, at that
time, only one layer of metal, and long wires killed performance.
Slide 9:
FIR Filters are sort of the "canonical" example of Systolic computation.
Note that the equation on the slide is incorrect. Instead, it should be
  y_i = w_1 * x_i + w_2 * x_(i-1) + ... + w_k * x_(i-(k-1))
Also, observe that there is no broadcast, since Systolic only allows local
interconnect. Therefore, input samples are shifted in and results are
shifted out.
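As a concrete illustration, here is a small, hypothetical cycle-accurate
sketch in Python of one such FIR chain (the function name and register
arrangement are assumptions for this sketch, not details from the slides):
each cell holds one fixed weight, partial sums advance one cell per cycle,
and samples advance at half that rate, so each partial sum meets
progressively older samples without any broadcast.

    def systolic_fir(weights, samples):
        """Cycle-accurate sketch of a k-cell systolic FIR chain."""
        k = len(weights)
        x1 = [0.0] * k          # first sample register in each cell
        x2 = [0.0] * k          # second sample register in each cell
        y = [0.0] * k           # partial-sum register in each cell
        outputs = []
        for t in range(len(samples) + k):        # extra k cycles drain the pipe
            x_feed = samples[t] if t < len(samples) else 0.0
            if t >= k:                           # y_i leaves the last cell at cycle i + k
                outputs.append(y[k - 1])
            new_x1, new_x2, new_y = [0.0] * k, [0.0] * k, [0.0] * k
            for j in range(k):                   # all k cells do one MAC per cycle
                x_in = x_feed if j == 0 else x2[j - 1]   # neighbour links only
                y_in = 0.0 if j == 0 else y[j - 1]
                new_y[j] = y_in + weights[j] * x_in
                new_x1[j] = x_in
                new_x2[j] = x1[j]
            x1, x2, y = new_x1, new_x2, new_y    # one synchronous clock edge
        return outputs

    # e.g. systolic_fir([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 5.0]) -> [1.0, 2.0, 3.0, 5.0]

Note that regardless of k, only one sample enters and one result leaves the
chain per cycle; this is the bandwidth observation made on the next slide.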
Slide 10:
Note that for the FIR architecture above, there is k-way parallelism: all k
cells perform a multiply-accumulate on every cycle. Note also that the
intermediate storage is essentially fixed by the problem and does not grow as
the number of processors increases. Finally, the structure means that even
though the throughput increases, the bandwidth requirement does not: only one
sample enters and one result leaves the array per cycle.
Slide 11:
The a's and b's are shifted in, and computations are performed where they
intersect. Systolic matrix multiply is a well-studied problem. Algorithms
are available, for example, to handle matrices that are larger than the
array size hard-wired in the design.
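As an illustration, here is a small, hypothetical Python sketch of one common
arrangement (the feed skew and register layout are assumptions for this
sketch, not details taken from the slides): the a's flow rightward, the b's
flow downward, and each cell accumulates its c_ij in place as the operands
meet.

    import numpy as np

    def systolic_matmul(A, B):
        """Sketch of an n x n systolic array computing C = A @ B."""
        n = A.shape[0]
        a_reg = np.zeros((n, n))        # A values moving left-to-right
        b_reg = np.zeros((n, n))        # B values moving top-to-bottom
        c_acc = np.zeros((n, n))        # stationary accumulators
        for t in range(3 * n - 2):      # last cell does its final MAC at cycle 3n - 3
            # skewed boundary feeds: row i of A is delayed by i cycles, column j
            # of B by j cycles, so only 2n words enter the array per cycle
            a_feed = [A[i, t - i] if 0 <= t - i < n else 0.0 for i in range(n)]
            b_feed = [B[t - j, j] if 0 <= t - j < n else 0.0 for j in range(n)]
            new_a = np.zeros((n, n))
            new_b = np.zeros((n, n))
            for i in range(n):
                for j in range(n):
                    a_in = a_feed[i] if j == 0 else a_reg[i, j - 1]  # left neighbour
                    b_in = b_feed[j] if i == 0 else b_reg[i - 1, j]  # top neighbour
                    c_acc[i, j] += a_in * b_in       # one MAC per cell per cycle
                    new_a[i, j] = a_in               # pass operands along
                    new_b[i, j] = b_in
            a_reg, b_reg = new_a, new_b
        return c_acc

    # e.g. A = np.arange(9.0).reshape(3, 3); np.allclose(systolic_matmul(A, np.eye(3)), A)

In this sketch the last c entry is complete after roughly 3n cycles, which
matches the figure quoted on the next slide; shifting the results out of the
array can then overlap with the next problem.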
Slide 12:
This slide is just a worksheet showing the position of data at each time step.
Note that the final result is produced in 3n cycles. One can overlap
another set of multiplies as data is flowing out.
Slide 13:
Note that n * n units of work are performed on each cycle. The perimeter
bandwidth is equal to the number of rows plus the number of columns.
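To make the ratio concrete, a quick back-of-the-envelope calculation in
Python (the array size here is illustrative, not one from the slides):

    n = 32                        # illustrative array size
    macs_per_cycle = n * n        # one multiply-accumulate in every cell
    perimeter_words = n + n       # a's enter along one edge, b's along another
    print(macs_per_cycle, perimeter_words)   # 1024 MACs vs. 64 I/O words per cycle

So compute per cycle grows as n^2 while the required bandwidth grows only as
n, which is what makes the structure attractive.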
Slide 14:
The slide above gives a list of various algorithms that can be implemented
using Systolic architectures. The ones in bold indicate those that will be
(or have been) discussed in lecture.
Slide 15:
Once again, Systolic has the limitation of allowing only local interconnect
and pipelined compute elements. Semisystolic allows these restrictions to be
relaxed.
Slide 16:
This slide presents a review of Systolic architectures given in this lecture.