Slide 3:
Systolic architectures "pump" data through various points and processing
elements in the design. Data flows rhythmically, like blood pumped by the
human heart, and is synchronized with the clock.
Slide 4:
One of the motivations here is to minimize the number of memory accesses.
Each data item comes from either memory or the outside world, and
intermediate results are used directly, without a round trip to memory or the
outside world.
Slide 5:
The philosophy of Systolic architectures is to emphasize local interconnect.
Placing communicating elements next to each other reduces both the area
taken up by wires and the wire delay.
Slide 6:
Researchers in the manufacturing industry faced similar problems
in trying to optimize their process.
They realized that they were performing many overhead operations on the
assembly line that were not "value added". Generally, they were checking
parts out from a central inventory, doing real work and checking them in
again. Checking things in and out took up considerable time and energy,
often dominating assembly-line processing time.
Slide 7:
This slide presents the "rules of engagement" for Systolic architectures, on
top of which algorithms can be implemented to solve real problems. Note that
pipelined communication between cells is the central concept that gives
Systolic its behavior.
SPLASH, discussed earlier, is an example of a Systolic Architecture.
Historical note: these architectures came around when people started
building VLSI (100 transistors, lambda = 2 microns, 1979).
Slide 8:
Many DSP architectures, such as FIR filters, use a Systolic methodology.
As we implement RC's, systolic solutions should be in our bag-of-tricks
(sort of like numerical recipes).
Historical note: another reason people started using Systolic was the
realization that VLSI wires were extremely important: there was, at that
time, only one layer of metal, and long wires killed performance.
Slide 9:
FIR Filters are sort of the "canonical" example of Systolic computation.
Note that the equation on the slide is incorrect. Instead, it should be
  y_i = w_1 * x_i + w_2 * x_(i-1) + ... + w_k * x_(i-(k-1))
Also, observe that there is no broadcast, since Systolic only allows local
interconnect. Therefore, input samples are shifted in and results are
shifted out.
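As a concrete illustration, here is a small, hypothetical cycle-accurate
sketch in Python of one such FIR chain (the function name and register
arrangement are assumptions for this sketch, not details from the slides):
each cell holds one fixed weight, partial sums advance one cell per cycle,
and samples advance at half that rate, so each partial sum meets
progressively older samples without any broadcast.

    def systolic_fir(weights, samples):
        """Cycle-accurate sketch of a k-cell systolic FIR chain."""
        k = len(weights)
        x1 = [0.0] * k          # first sample register in each cell
        x2 = [0.0] * k          # second sample register in each cell
        y = [0.0] * k           # partial-sum register in each cell
        outputs = []
        for t in range(len(samples) + k):        # extra k cycles drain the pipe
            x_feed = samples[t] if t < len(samples) else 0.0
            if t >= k:                           # y_i leaves the last cell at cycle i + k
                outputs.append(y[k - 1])
            new_x1, new_x2, new_y = [0.0] * k, [0.0] * k, [0.0] * k
            for j in range(k):                   # all k cells do one MAC per cycle
                x_in = x_feed if j == 0 else x2[j - 1]   # neighbour links only
                y_in = 0.0 if j == 0 else y[j - 1]
                new_y[j] = y_in + weights[j] * x_in
                new_x1[j] = x_in
                new_x2[j] = x1[j]
            x1, x2, y = new_x1, new_x2, new_y    # one synchronous clock edge
        return outputs

    # e.g. systolic_fir([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 5.0]) -> [1.0, 2.0, 3.0, 5.0]

Note that regardless of k, only one sample enters and one result leaves the
chain per cycle; this is the bandwidth observation made on the next slide.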
Slide 10:
Note that for the FIR architecture above, there is k-way parallelism: all k
cells perform a multiply-accumulate on every cycle. Note also that the
intermediate storage is essentially fixed by the problem and does not grow as
the number of processors increases. Finally, the structure means that even
though the throughput increases, the bandwidth requirement does not: only one
sample enters and one result leaves the array per cycle.
Slide 11:
The a's and b's are shifted in, and computations are performed where they
intersect. Systolic matrix multiply is a well-studied problem. Algorithms
are available, for example, to handle matrices that are larger than the
array size hard-wired in the design.
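As an illustration, here is a small, hypothetical Python sketch of one common
arrangement (the feed skew and register layout are assumptions for this
sketch, not details taken from the slides): the a's flow rightward, the b's
flow downward, and each cell accumulates its c_ij in place as the operands
meet.

    import numpy as np

    def systolic_matmul(A, B):
        """Sketch of an n x n systolic array computing C = A @ B."""
        n = A.shape[0]
        a_reg = np.zeros((n, n))        # A values moving left-to-right
        b_reg = np.zeros((n, n))        # B values moving top-to-bottom
        c_acc = np.zeros((n, n))        # stationary accumulators
        for t in range(3 * n - 2):      # last cell does its final MAC at cycle 3n - 3
            # skewed boundary feeds: row i of A is delayed by i cycles, column j
            # of B by j cycles, so only 2n words enter the array per cycle
            a_feed = [A[i, t - i] if 0 <= t - i < n else 0.0 for i in range(n)]
            b_feed = [B[t - j, j] if 0 <= t - j < n else 0.0 for j in range(n)]
            new_a = np.zeros((n, n))
            new_b = np.zeros((n, n))
            for i in range(n):
                for j in range(n):
                    a_in = a_feed[i] if j == 0 else a_reg[i, j - 1]  # left neighbour
                    b_in = b_feed[j] if i == 0 else b_reg[i - 1, j]  # top neighbour
                    c_acc[i, j] += a_in * b_in       # one MAC per cell per cycle
                    new_a[i, j] = a_in               # pass operands along
                    new_b[i, j] = b_in
            a_reg, b_reg = new_a, new_b
        return c_acc

    # e.g. A = np.arange(9.0).reshape(3, 3); np.allclose(systolic_matmul(A, np.eye(3)), A)

In this sketch the last c entry is complete after roughly 3n cycles, which
matches the figure quoted on the next slide; shifting the results out of the
array can then overlap with the next problem.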
Slide 12:
This slide is just a worksheet showing the position of data at each time step.
Note that the final result is produced in 3n cycles. One can overlap
another set of multiplies as data is flowing out.
Slide 13:
Note that n * n units of work are performed on each cycle. The perimeter
bandwidth is equal to the number of rows plus the number of columns.
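To make the ratio concrete, a quick back-of-the-envelope calculation in
Python (the array size here is illustrative, not one from the slides):

    n = 32                        # illustrative array size
    macs_per_cycle = n * n        # one multiply-accumulate in every cell
    perimeter_words = n + n       # a's enter along one edge, b's along another
    print(macs_per_cycle, perimeter_words)   # 1024 MACs vs. 64 I/O words per cycle

So compute per cycle grows as n^2 while the required bandwidth grows only as
n, which is what makes the structure attractive.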
Slide 14:
The slide above gives a list of various algorithms that can be implemented
using Systolic architectures. The ones in bold indicate those that will be
(or have been) discussed in lecture.
Slide 15:
Once again, Systolic has the limitation of allowing only local interconnect
and pipelined compute elements. Semisystolic allows these restrictions to be
relaxed.
Slide 16:
This slide presents a review of Systolic architectures given in this lecture.