Computer Science 294-7, Lecture #9
Systolic Structures
Notes by Eylon Caspi
Slide 1:
Slide 2: Admin
The Xilinx CAD tools are still not all available.
Schematic capture works, but the back-end that maps to the FPGA does not.
Slide 3: Systolic
A systolic architecture uses a regular array of simple compute-cells (drawn
from only a few cell types) which exchange pipelined data with their adjacent
neighbors at clocked intervals. I/O bandwidth is limited to the border cells
of the array, and input data is effectively reused as it is pipelined through
the array.
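As a rough software analogy (made up for these notes, not from the slides),
the sketch below models those properties: data enters only at the left border,
moves one cell per clock tick, and each cell applies a small local operation
to whatever its left neighbor hands it.

    # Toy model of a 1D systolic array: I/O only at the border, data moves
    # one cell to the right per clock tick, each cell does a simple operation.
    def systolic_pipeline(cell_ops, stream, idle=None):
        """cell_ops: one small function per compute-cell.
        stream: values fed into the leftmost cell, one per clock tick.
        Returns the values that emerge from the rightmost cell."""
        n = len(cell_ops)
        regs = [idle] * n               # each cell's output register
        out = []
        # Feed the stream, then keep clocking with 'idle' until the array drains.
        for x in list(stream) + [idle] * n:
            new_regs = [None] * n
            for i, op in enumerate(cell_ops):
                src = x if i == 0 else regs[i - 1]   # data from the left neighbor
                new_regs[i] = op(src) if src is not idle else idle
            regs = new_regs
            if regs[-1] is not idle:
                out.append(regs[-1])
        return out

    # Three cells: +1, *2, -3 applied in sequence as data flows through.
    print(systolic_pipeline([lambda v: v + 1, lambda v: v * 2, lambda v: v - 3],
                            [0, 1, 2]))    # [-1, 1, 3]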
Slide 4: Today's Reading
Slide 5: Transitive Closure
A systolic array can operate on graphs by manipulating edge matrices.
The transitive closure of a graph adds an edge between each pair of nodes
that is connected by a directed path in the original graph. In the input
graph, matrix entry a(i,j) denotes an edge between nodes i and j. In the
transitive closure (output), matrix entry a(i,j) denotes the existence of a
directed path in the input graph between i & j.
A systolic implementation for computing the transitive closure of a
graph operates as a matrix multiplication (see day 8 notes) on the binary
edge matrices.
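As a software illustration of the computation (not a hardware description),
here is a small Python sketch: the binary edge matrix is repeatedly
"multiplied", with AND/OR taking the place of multiply/add, until no new
paths appear. The function names and example graph are made up for these notes.

    def bool_matmul(a, b):
        """Boolean matrix product: c[i][j] = OR over k of (a[i][k] AND b[k][j])."""
        n = len(a)
        return [[any(a[i][k] and b[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    def transitive_closure(adj):
        """adj[i][j] = True if there is an edge i -> j.
        Returns reach[i][j] = True if there is a directed path from i to j."""
        n = len(adj)
        reach = [row[:] for row in adj]          # start from the edge matrix
        for _ in range(n):                       # paths have at most n-1 edges
            step = bool_matmul(reach, adj)
            new = [[reach[i][j] or step[i][j] for j in range(n)] for i in range(n)]
            if new == reach:                     # no new paths found; done
                break
            reach = new
        return reach

    # Example: edges 0 -> 1 and 1 -> 2 imply a closure entry for 0 -> 2.
    edges = [[False, True,  False],
             [False, False, True ],
             [False, False, False]]
    print(transitive_closure(edges)[0][2])       # True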
Slide 6: Priority Queue
A priority queue maintains a list of records sorted by key,
so that it can extract (pop) the record with smallest (or largest) key.
Each insertion must be sorted by key into proper position in the queue.
Slide 7: Priority Queue
In the systolic implementation of a priority queue (discussed in today's
reading), each compute-cell holds a single record and a key comparator.
At each compute cycle, a compute-cell compares its key to that of its
neighbor, interchanging records if necessary. Thus sorting is accomplished
by a kind of parallelized bubble sort.
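Here is a toy Python model of that behavior (illustrative, not the exact
design from the reading). On an insertion each cell keeps the smaller of the
arriving key and its resident key and passes the larger toward the tail; in
hardware that ripple advances one cell per clock, while the model below
completes it instantly. Empty cells hold a key of infinity, matching the
"null record" convention on the next slides.

    import math

    INF = math.inf      # an infinite key marks an empty cell (a "null record")

    class SystolicPQ:
        """Toy simulation of a linear systolic priority queue.
        Insertions and extractions happen only at the head (cell 0)."""

        def __init__(self, n_cells):
            self.cells = [INF] * n_cells

        def insert(self, key):
            # Each cell keeps the smaller key and passes the larger toward the
            # tail; in hardware this ripples one cell per clock.
            incoming = key
            for i in range(len(self.cells)):
                if incoming < self.cells[i]:
                    incoming, self.cells[i] = self.cells[i], incoming
            assert incoming == INF, "queue overflow"

        def extract_min(self):
            # The head always holds the minimum; the rest shift up toward it.
            head = self.cells[0]
            self.cells[:-1] = self.cells[1:]
            self.cells[-1] = INF
            return head

    pq = SystolicPQ(8)
    for k in (5, 2, 9, 1):
        pq.insert(k)
    print(pq.extract_min(), pq.extract_min())    # 1 2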
Slide 8: Priority Queue - Example
This example depicts the logical structure of a priority queue
as elements are inserted and extracted.
Slide 9: Priority Queue - Example
This slide depicts the actual contents of the systolic array
priority queue (for the example in the previous slide) as they
evolve over time. Although insertions and extractions are dispatched
at the head of the queue in constant time, the contents of the queue
continue to "bubble" for several cycles while insertions/extractions
are sorted. This slide depicts insertions with an "I" (looks suspiciously
like a "1") and extractions with an "E". Alternatively, one could treat
both operations as insertions, using a key of infinity for extraction
operations. Note that infinite keys correspond to null records and bubble
down to the tail of the queue.
Slide 10: Priority Queue
Important - the interface to the queue (insertions/extractions at the
head) operates in constant time - the head of the queue is always
immediately available in correct sorted order.
If the queue is mostly empty, the processing resources at the
tail of the queue sit idle.
Slide 11: Systolic Tree Multiqueue
A priority queue can also be implemented using a sorted tree. In a typical
systolic implementation, compute-cells represent the leaves of a binary
tree, and they must exchange data at each insertion/extraction to keep
the tree balanced.
The benefit of the tree is that it allows several queues to share
processing resources. The drawback is that each lookup on the tree
takes O(log(n)) time to produce a result.
Slide 12: Systolic Priority Multiqueue
To manage multiple queues, a systolic tree can be attached to the tails
of a collection of systolic queues to collectively manage their overflows.
The heads of the queues continue to provide constant-time insertions and
extractions, while the tree allows capacity and processing resources to be
shared between queues.
Slide 13: Relational Database
Slide 14: DB: Intersection
To compute the intersection of two sets of database records (tuples),
we must locate those records which are in both sets. For a systolic
implementation, records could pass by each other (spatially) for comparison,
in the same manner as the operands of a convolution operation.
Slide 15: Tuple Comparison
The comparison of record fields can be pipelined so that different
cells along an array row compare different fields of the same record.
Each cell would pipe the true/false result of its field comparison to the
next cell in the row, so that the logical and of all comparisons
emerges at the end of the row.
Note that we always compare every field because of the dataflow in the
systolic architecture. A serial processor can short-circuit some of this
work -- once a result turns false, it does not need to evaluate the rest of
the field comparisons.
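A short Python sketch of the two behaviors (illustrative only; the example
tuples are made up):

    def row_compare(tuple_a, tuple_b):
        """One row of comparison cells: cell j compares field j and ANDs the
        result with the flag arriving from cell j-1.  In the systolic dataflow
        every cell always performs its comparison."""
        match = True                       # flag entering the first cell
        for fa, fb in zip(tuple_a, tuple_b):
            equal = (fa == fb)             # this comparison happens regardless
            match = match and equal        # AND flag piped to the next cell
        return match

    def serial_compare(tuple_a, tuple_b):
        """A serial processor can short-circuit: stop at the first mismatch."""
        for fa, fb in zip(tuple_a, tuple_b):
            if fa != fb:
                return False
        return True

    print(row_compare(("Ben", "Intern"), ("Ben", "Lawyer")),
          serial_compare(("Ben", "Intern"), ("Ben", "Lawyer")))   # False False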
Slide 16: DB: Intersect
Note that each column of the array is dedicated to comparing a different
kind of field. Input streams A & B traverse up & down the array
(respectively), with record fields staggered so that
the fields belonging to a single record from the A stream
and a single record from the B stream are compared by a single row of the
array.
Slide 17: DB: Intersect
The intersection operation is completed by propagating the logical
or of record comparisons to collect identical records.
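Putting slides 15-17 together, here is a minimal sequential Python model of
the intersection dataflow (not the parallel hardware): every A tuple is
compared field-by-field against every B tuple (the AND pipeline), and the
per-comparison results are ORed to decide membership. Example data is made up.

    def intersect(relation_a, relation_b):
        """Intersection of two sets of tuples, structured like the array."""
        def tuples_match(a, b):
            match = True
            for fa, fb in zip(a, b):
                equal = (fa == fb)          # every field cell does its comparison
                match = match and equal     # AND flag piped along the row
            return match

        result = []
        for a in relation_a:
            in_both = False
            for b in relation_b:
                hit = tuples_match(a, b)
                in_both = in_both or hit    # OR propagated to collect matches
            if in_both:
                result.append(a)
        return result

    set_a = [("Ben Bitdiddle", "Intern"), ("Sam Suem", "Lawyer")]
    set_b = [("Sam Suem", "Lawyer"), ("Alyssa P. Hacker", "Programmer")]
    print(intersect(set_a, set_b))          # [('Sam Suem', 'Lawyer')]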
Slide 18: DB: Intersect --> Subtract
These 3 slides discuss the use of the intersection operation
to implement other set operations:
- subtract one set from another
- remove duplicates from a set
- union of two sets
Slide 19: DB: Intersect --> Unique
Note that we can perform this initialization by placing a piece of
logic at the front of each row which initializes the propagating
row value to true or false accordingly.
Slide 20: DB: Unique --> Union
Slide 21: DB: Unique --> Projection
Set projection is the process of projecting records into a lower
"dimensionality" by removing (ignoring) selected fields from each
record of a set. The resulting set may contain duplicates which
were originally differentiated by the fields which were removed.
Thus set projection can be implemented using an intersection-based
systolic array for removing duplicates.
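A minimal Python sketch of projection followed by duplicate removal (the
field indices and example records are made up for these notes):

    def project(relation, keep_fields):
        """Projection: keep only the fields at the given indices, then remove
        the duplicates that projection can introduce (the 'unique' step)."""
        projected = [tuple(t[i] for i in keep_fields) for t in relation]
        unique = []
        for t in projected:
            # In the array, uniqueness is decided by comparing each tuple
            # against those already accepted; here a linear scan models that.
            if not any(t == u for u in unique):
                unique.append(t)
        return unique

    employees = [("Ben Bitdiddle", "Intern"),
                 ("Alyssa P. Hacker", "Programmer"),
                 ("Louis Reasoner", "Intern")]
    print(project(employees, keep_fields=[1]))   # [('Intern',), ('Programmer',)]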
Slide 22: DB: Match Tuples --> Join
This may need an example to help motivate what a join is:
Relation A:
      Field name            Field Job
  1   Ben Bitdiddle         Intern
  2   Audrey T. Tour        Accountant
  3   Alyssa P. Hacker      Programmer
  4   Sam Suem              Lawyer
Relation B:
      Field Job             Field Department
  1   Intern                Engineering
  2   Programmer            Engineering
  3   Lawyer                Business
  4   Accountant            Business
If we join these two relations over their "Job" field, we get a T matrix:
             B1      B2      B3      B4
     A1      true    false   false   false
     A2      false   false   false   true
     A3      false   true    false   false
     A4      false   false   true    false
These get filtered down to the true cases in the join:
(A1,B1), (A2,B4), (A3,B2), (A4,B3)
Then we know how to form the result of the join:
A Join over Column Job B:
Joined Relation
      Field name            Field Job      Field Department
      Ben Bitdiddle         Intern         Engineering
      Audrey T. Tour        Accountant     Business
      Alyssa P. Hacker      Programmer     Engineering
      Sam Suem              Lawyer         Business
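The same example in a short Python sketch: build the T matrix of field
matches, then emit one joined tuple per true entry. The column indices and
tuple layout are chosen here purely for illustration.

    def join(rel_a, rel_b, col_a, col_b):
        """Join rel_a and rel_b on the given columns.
        Build T (T[i][j] = the join fields agree), then emit one joined tuple
        per true entry, dropping the duplicated join field from B."""
        t = [[a[col_a] == b[col_b] for b in rel_b] for a in rel_a]
        joined = []
        for i, a in enumerate(rel_a):
            for j, b in enumerate(rel_b):
                if t[i][j]:
                    joined.append(a + tuple(f for k, f in enumerate(b) if k != col_b))
        return joined

    rel_a = [("Ben Bitdiddle", "Intern"),
             ("Audrey T. Tour", "Accountant"),
             ("Alyssa P. Hacker", "Programmer"),
             ("Sam Suem", "Lawyer")]
    rel_b = [("Intern", "Engineering"),
             ("Programmer", "Engineering"),
             ("Lawyer", "Business"),
             ("Accountant", "Business")]
    for row in join(rel_a, rel_b, col_a=1, col_b=0):
        print(row)
    # ('Ben Bitdiddle', 'Intern', 'Engineering')
    # ('Audrey T. Tour', 'Accountant', 'Business')
    # ('Alyssa P. Hacker', 'Programmer', 'Engineering')
    # ('Sam Suem', 'Lawyer', 'Business')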
Slide 23: DB Ops
Note that if we sort database records before performing the set
operations, most of the operations become O(n) rather than O(n^2).
Including the sort makes operations of complexity O(n log(n)). In
practice, most database applications do actually sort the data first.
This suggests that the systolic implementations aren't work-optimal in some
sense. Of course, the sorting operations require non-local connections and
data access, so the fact that they require fewer operations should be
weighed against their greater interconnect requirements.
FYI: The best sorting algorithms on systolic hardware sort n
numbers on O(n) hardware in O(sqrt(n)) time.
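To see why sorted inputs help, here is a standard (serial, non-systolic)
merge-style intersection of two sorted lists: a single O(n) pass, compared to
the O(n^2) pairwise comparisons of the unsorted scheme. Purely illustrative.

    def sorted_intersect(a, b):
        """Intersection of two sorted, duplicate-free lists via one merge scan."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1          # advance whichever list holds the smaller key
            else:
                j += 1
        return out

    print(sorted_intersect([1, 3, 5, 7], [3, 4, 5, 8]))   # [3, 5]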
Slide 24: DB Ops
- The figures "1500 bit tuples" and "10,000 records" are taken from
the 1979 paper Systolic (VLSI) Arrays for Relational Database
Operations and are typical of OLD database applications.
- With regard to the figure "20 TeraLambda squared",
recall that current VLSI technology places on the order of
10 MegaLambda squared on a single die.
These numbers underscore the fact that, in practice, we will
partition the problem and operate on it in pieces rather than assembling
enough hardware to solve the entire task in one pass.
Slide 25: Operations Decomposable
For large data sets, the systolic array structure can typically be
segmented into "bands" (with respect to one of the data sets), so that
a single band's worth of hardware can sequentially process the entire
array's data set. Thus one can avoid expressing in hardware the entirety
of a systolic array that is prohibitively large.
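A rough software analogy of the banding idea (made up for these notes, not
from the slides): stream one of the data sets through a fixed-size band of
work at a time and combine the partial results, so the resources needed are
proportional to the band rather than to the whole data set.

    def intersect_banded(relation_a, relation_b, band_size):
        """Intersection computed band by band: only one band of B is 'resident'
        at any time; the per-band results are ORed together afterwards."""
        in_both = [False] * len(relation_a)
        for start in range(0, len(relation_b), band_size):
            band = relation_b[start:start + band_size]
            for i, a in enumerate(relation_a):
                # Compare a against the tuples in the current band only.
                hit = any(a == b for b in band)
                in_both[i] = in_both[i] or hit
        return [a for i, a in enumerate(relation_a) if in_both[i]]

    a = [(1, "x"), (2, "y"), (3, "z")]
    b = [(2, "y"), (9, "q"), (3, "z"), (4, "w")]
    print(intersect_banded(a, b, band_size=2))   # [(2, 'y'), (3, 'z')]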
Slide 26: Review
Systolic arrays employ parallelism to effectively trade processing area
for time. For instance, an O(n^3) problem can be solved in O(n) time on
O(n^2) processors. Note that there typically exist multiple systolic
implementations for any given problem, some in 1D (O(n) processors),
some in 2D (O(n^2) processors).
Slide 27: Review