Computer Science 294-7, Lecture #9

Systolic Structures

Slide 1:

Slide 2: Admin

Xilinx CAD tools still not all available. Schematic capture works, but back-end to map to FPGA does not.

Slide 3: Systolic

A systolic architecture uses a regular array composed of a few simple compute-cells which exchange pipelined data with their adjacent neighbors at clocked intervals. Limited I/O bandwidth reaches the border cells of the array, and input data is essentially reused as it is pipelined through the array.

Slide 4: Today's Reading

Slide 5: Transitive Closure

A systolic array can operate on graphs by manipulating edge matrices. The transitive closure of a graph essentially places additional edges between each pair of nodes which are originally connected by a directed path. In the input graph, matrix entries a(i,j) denote edges between nodes i and j. In the transitive closure (output), matrix entries a(i,j) denote the existence of a directed path in the input graph between i & j. A systolic implementation for computing the transitive closure of a graph operates as a matrix multiplication (see day 8 notes) on the binary edge matrices.

Slide 6: Priority Queue

A priority queue maintains a list of records sorted by key, so that it can extract (pop) the record with smallest (or largest) key. Each insertion must be sorted by key into proper position in the queue.

Slide 7: Priority Queue

In the systolic implementation of a priority queue (discussed in today's reading), each compute-cell holds a single record and a key comparator. At each compute cycle, a compute-cell compares its key to that of its neighbor, interchanging records if necessary. Thus sorting is accomplished by a kind of parallelized bubble sort.

Slide 8: Priority Queue - Example

This example depicts the logical structure of a priority queue as elements are inserted and extracted.

Slide 9: Priority Queue - Example

This slide depicts the actual contents of the systolic array priority queue (for the example in the previous slide) as they evolve over time. Although insertions and extractions are dispatched at the head of the queue in constant time, the contents of the queue continue to "bubble" for several cycles while insertions/extractions are sorted. This slide depicts insertions with an "I" (looks suspiciously like a "1") and extractions with an "E". Alternatively, one could treat both operations as insertions, using a key of infinity for extractions operations. Note that infinite keys correspond to null records and bubble down to the tail of the queue.

Slide 10: Priority Queue

Important - the interface to the queue (insertions/extractions at the head) operates in constant time - the head of the queue is always immediately available in correct sorted order.

If the queue is mostly empty, the processing resources at the tail of the queue sit idle.

Slide 11: Systolic Tree Multiqueue

A priority queue can also be implemented using a sorted tree. In a typical systolic implementation, compute-cells represent the leafs of a binary tree, and they must exchange data at each insertion/extraction to keep the tree balanced.

The benefit of the tree is that it will allow several queues to share processing resources. The drawback is that each lookup on the tree requires takes O(log(n)) time to produce a result.

Slide 12: Systolic Priority Multiqueue

To manage multiple queues, a systolic tree can be attached to the tails of a collection of systolic queues to collectively manage their overflows. The heads of the queues continue to provide constant-time insertions and extractions, while the tree allows capacity and processing resources to be shared between queues.

Slide 13: Relational Database

Slide 14: DB: Intersection

To compute the intersection of two sets of database records (tuples), we must locate those records which are in both sets. For a systolic implementation, records could pass by each other (spatially) for comparison in the same order as a convolution operation.

Slide 15: Tuple Comparison

The comparison of record fields can be pipelined so that different cells along an array row compare different fields of the same record. Each cell would pipe the true/false result of its field comparison to the next cell in the row, so that the logical and of all comparisons emerges at the end of the row.

Note that we always compare every field because of the dataflow in the systolic architecture. A serial processor can short-circuit some of this work -- once a result turns false, it does not need to evaluate the rest of the field comparisons.

Slide 16: DB: Intersect

Note that each column of the array is dedicated to comparing a different kind of field. Inputs streams A & B traverse up & down the array (respectively), with record fields staggered so that the fields belonging to a single record from the A stream and a single record from the B stream are compared by a single row of the array.

Slide 17: DB: Intersect

The intersection operation is completed by propagating the logical or of record comparisons to collect identical records.

Slide 18: DB: Intersect --> Subtract

These 3 slides discuss the use of the intersection operation to implement other set operations:

subtract one set from another
remove duplicates from a set
union of two sets

Slide 19: DB: Intersect --> Unique

Note that we can perform this initialization by placing a piece of logic at the front of each row which is set to initialize the propogating row value with true or false accoringly.

Slide 20: DB: Unique --> Union

Slide 21: DB: Unique --> Projection

Set projection is the process of projecting records into a lower "dimensionality" by removing (ignoring) selected fields from each record of a set. The resulting set may contain duplicates which were originally differentiated by the fields which were removed. Thus set projection can be implemented using an intersection-based systolic array for removing duplicates.

Slide 22: DB: Match Tuples --> Join

This may need an example to help motivate what a join is:


Relation A:
    Field name        Field Job

1  Ben Bitdiddle      Intern
2  Audrey T. Tour     Accountant
3  Alyssa P. Hacker   Programmer
4  Sam Suem           Lawyer

Relation B:
      Field Job         Field Department
1  Intern             Engineering 
2  Programmer         Engineering 
3  Lawyer             Business
4  Accountant         Business

If we join these two relations over their "Job" field, we get a T matrix:


                B
       1     2     3     4
  1  true  false false false
  2  false false false true 
A 3  false true  false false
  4  false false true  false

These get filtered down the the true cases in the join: (A1,B1), (A2,B4), (A2,B2), (A4,B3)

Then we know how to form the result of the join:
A Join over Column Job B:

Joined Relation
  Field name        Field Job     Field Department

 Ben Bitdiddle      Intern        Engineering 
 Audrey T. Tour     Accountant    Business
 Alyssa P. Hacker   Programmer    Engineering 
 Sam Suem           Lawyer        Business

Slide 23: DB Ops

Note that if we sort database records before performing the set operations, most of the operations become O(n) rather than O(n^2). Including the sort makes operations of complexity O(n log(n)). In practice, most database applications do actually sort the data first.

This suggests that the systolic implementations aren't work-optimal in some sense. Of course, the sorting operations require non-local connections and data access, so the fact that they require less operations should be weighted against their greater interconnect requirements.

FYI: The best sorting algorithms on systolic hardware, sort n numbers on O(n) hardware in O(sqrt(n)) time.

Slide 24: DB Ops

The figures "1500 bit tuples" and "10,000 records" are taken from the 1979 paper Systolic (VLSI) Arrays for Relational Database Operations and are typical of OLD database applications.
With regards to the figure "20 TeraLambda squared", recall that current VLSI technology places on the order of 10 MegaLambda squared on a single die.

These numbers underscore the fact that, in practice, we will partition the problem and operate on it in pieces rather than assembling enough hardware to solve the entire task in one path.

Slide 25: Operations Decomposable

For large data sets, the systolic array structure can typically be segmented into "bands" (with respect to one of the data sets), so that a single band's worth of hardware can sequentially process the entire array's data set. Thus one can avoid expressing in hardware the entirety of a systolic array that is prohibitively large.

Slide 26: Review

Systolic arrays employ parallelism to effectively trade processing area for time. For instance, an O(n^3) problem solved in O(n) time on O(n^2) processors. Note that there typically exist multiple systolic implementations to solve any given problem, some in 1D (O(n) processors), some in 2D (O(n^2) processors).