Slide 1:
The homepage for both SPLASH-1 and SPLASH-2 can be found
here.
Slide 2:
Notice the change of time for the next class.
Slide 3:
The SPLASH-1 architecture is built upon
the Xilinx XC3090. Its interconnection is that of a linear systolic array
(no mesh connections allowed). The board was built around one
application (DNA sequence matching) and was developed in 1988. The SPLASH-1
board likely cost more than a custom IC would have. The 1 MHz
FIFO is a major limitation that sets the maximum system throughput.
Slide 4:
The SPLASH-2 architecture added a crossbar and was built with Xilinx
XC4010 FPGAs. It was developed in 1992. The 43 giga-lambda-squared area
figure does not include the crossbar, which likely does not contribute
significantly to the area.
Slide 5:
This slide motivates the problem that the SPLASH-1 and SPLASH-2 architectures
solve.
Slide 6:
A DNA sequence is a string composed of characters drawn from a
four-character alphabet. The path for a transformation from one sequence to
another is not unique.
All we really need to find is the cost of the minimum-cost path. The
transformation can be described in terms of three operations: insert,
delete, and substitute, each with an associated cost. It should be noted that
"match" is not a separate operation. It is associated with the substitution
operation, which has a cost of 2 when an actual substitution is required and
0 when the characters already match.
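This cost recurrence can be sketched as a small dynamic program. This is a
sketch, not code from the lecture; it assumes insert and delete each cost 1,
and substitution costs 2 on a mismatch and 0 on a match, as on the slides.

```python
def edit_distance(source, target):
    """Edit distance with the slides' costs: insert = 1, delete = 1,
    substitute = 2 on a mismatch and 0 on a match."""
    m, n = len(source), len(target)
    # d[i][j] = minimum cost of transforming source[:i] into target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i                               # delete every source char
    for j in range(1, n + 1):
        d[0][j] = j                               # insert every target char
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 2
            d[i][j] = min(d[i - 1][j] + 1,        # delete source char
                          d[i][j - 1] + 1,        # insert target char
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[m][n]
```

For the example sequences used later, `edit_distance("SHMOO", "SHORE")`
evaluates to 4, agreeing with the result on slide 15.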
Slide 7:
In this example we examine the steps one, or nature,
might take to transform the sequence "SHMOO" to "SHORE."
Slide 8:
Notice that the delete operation depends only on the source character and the
insert operation depends only on the target character. However, the
substitute operation depends on both the source and target characters. Given
this formulation, elements can be computed along parallel lines that
are perpendicular to the diagonal running from d(0, 0) to d(m, n); these
lines are called antidiagonals.
Slide 9:
In this example the target is "SHORE" and the source is "SHMOO." As
defined, d(0, 0) = 0. From there we can begin computing along the parallel
line, as described previously, formed by, in this case, the two elements
d(0, 1) and d(1, 0). For element d(0, 1), we only need to compute the cost
of insertion, which in this case is 1, since the cost of d(0, 0) = 0.
Similarly we compute the cost of deletion for element d(1, 0) to be 1. In
the next parallel (composed of the three elements d(0, 2), d(1, 1), d(2,
0)), let's examine the element d(1, 1). To compute it we must
determine the minimum cost among the deletion of the source, the
insertion of the target, and the substitution of the target (for the
source). In this case the associated costs are 1, 1, and 0 (respectively).
Therefore the cost is 0. In this manner the remaining parallels
are computed.
Slide 10:
Here is the completed cost table for our example. Notice the element d(4,
3) = 1. In this case the move from d(3, 2) with a substitution has the
minimum cost, since the source and target characters match. This agrees
with intuition: to transform "SHMO" to "SHO," the operation of
minimum cost is deleting the "M" from the source.
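The completed table can be reproduced with a short sketch (same assumed
costs as before; `cost_table` is a hypothetical helper name, not from the
lecture):

```python
def cost_table(source, target):
    """Build the full d(i, j) cost table: insert/delete cost 1,
    substitute costs 2 on a mismatch and 0 on a match."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                d[i][j] = j          # insert j target characters
            elif j == 0:
                d[i][j] = i          # delete i source characters
            else:
                sub = 0 if source[i - 1] == target[j - 1] else 2
                d[i][j] = min(d[i - 1][j] + 1,
                              d[i][j - 1] + 1,
                              d[i - 1][j - 1] + sub)
    return d

table = cost_table("SHMOO", "SHORE")
```

In the resulting table, `table[1][1]` is 0 (the matching leading "S" costs
nothing), `table[4][3]` is 1 as discussed above, and the final entry
`table[5][5]` is the edit distance, 4.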
Slide 11:
The algorithm displayed runs concurrently on each PE. The PEs are
interconnected in a systolic array. The source characters and target
characters move through the array in opposite directions.
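The parallelism the systolic array exploits can be modeled in software by
filling the cost table one antidiagonal per simulated timestep: every cell
with i + j = k depends only on antidiagonals k-1 and k-2, so a real array
can update an entire antidiagonal at once. This is a sequential stand-in
for the concurrent PE updates, not the actual PE algorithm from the slide.

```python
def antidiagonal_fill(source, target):
    """Fill the cost table antidiagonal by antidiagonal, mirroring the
    order in which a systolic array would compute the cells."""
    m, n = len(source), len(target)
    d = [[None] * (n + 1) for _ in range(m + 1)]
    for k in range(m + n + 1):                    # one k per (simulated) timestep
        for i in range(max(0, k - n), min(m, k) + 1):
            j = k - i                             # cell (i, j) lies on antidiagonal k
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            else:
                sub = 0 if source[i - 1] == target[j - 1] else 2
                d[i][j] = min(d[i - 1][j] + 1,    # from antidiagonal k-1
                              d[i][j - 1] + 1,    # from antidiagonal k-1
                              d[i - 1][j - 1] + sub)  # from antidiagonal k-2
    return d[m][n]
```

This produces the same distances as a row-by-row fill; only the evaluation
order changes.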
Slide 12:
Each sequence is separated by a null character for proper timing. As
previously described, the source flows from the left while the target
flows from the right. This slide shows three timesteps. The first
timestep shows the initial conditions, namely the cost of deletions for
the source and the cost of insertions for the target. In the second
timestep the first characters of the source and target enter a common
element. The algorithm dictates that the cost is 0, given the
match (and a prior history of 0 cost). In the third timestep the costs
of the two intersecting characters are both 1, derived from the
cost (0) of the input target (source) added to the cost (1) of deletion
(insertion).
Slide 13:
In this slide the example continues as described in the previous slide.
For example, in the second timestep, for the cost computation of the
character pair (S, R) (target, source), the minimum-cost result is 3. This
is the target distance (2) plus the cost of deletion (1).
Slide 14:
This slide continues the example in similar fashion.
Slide 15:
This slide concludes the example with the total cost of transformation,
i.e., the edit distance, calculated to be 4.
The final distance ends up at the tail of the source and target sequences
and follows them out of the array.
Slide 16:
This slide summarizes the statistics for the bidirectional technique. As
will be seen in a few slides, techniques exist which increase the steady-
state computational density and the utilization of the hardware.
The SPLASH-1 board could operate on strings on the order of 340-350
characters long. Using the unidirectional scheme, the size of the
operable strings almost doubles (PEs do a bit more computation and therefore
require more logic).
Slide 17:
In practice, these systems are used to compare a given
source string against large numbers of targets. In this mode of
use, the bidirectional algorithm suffers greatly, since half of its
processing elements are, on average, idle. Only when computing the
maximum-length antidiagonal is the array fully utilized.
Slide 18:
The unidirectional algorithm was developed for SPLASH-2. In this case, the
computation proceeds across rows rather than antidiagonals. Notice that
a setup cycle is required between sequences. In this formulation, all
PEs are fully utilized during the comparison.
Slide 19:
This slide summarizes the results for the unidirectional scheme.
Slide 20:
Since the character set is composed of four symbols, the datapath need only
be 2 bits wide. Also notice the regularity in the cost metric, namely that
there is, at most, a difference of 2 between adjacent entries. Given this,
the registers need only two bits to encode the change in distance.
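The two-bit claim can be checked empirically: a sketch (same assumed costs
as earlier) that builds the cost table and measures the largest difference
between adjacent entries, including diagonal neighbors.

```python
def max_neighbor_delta(source, target):
    """Largest absolute difference between adjacent (horizontal, vertical,
    or diagonal) entries of the cost table, with insert/delete cost 1 and
    substitute cost 0 or 2."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0:
                d[i][j] = j
            elif j == 0:
                d[i][j] = i
            else:
                sub = 0 if source[i - 1] == target[j - 1] else 2
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + sub)
    deltas = []
    for i in range(m + 1):
        for j in range(n + 1):
            if j > 0:
                deltas.append(abs(d[i][j] - d[i][j - 1]))      # horizontal
            if i > 0:
                deltas.append(abs(d[i][j] - d[i - 1][j]))      # vertical
            if i > 0 and j > 0:
                deltas.append(abs(d[i][j] - d[i - 1][j - 1]))  # diagonal
    return max(deltas)
```

For the lecture's example the maximum delta is indeed at most 2, so a
two-bit "change in distance" encoding suffices.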
Slide 21:
Because this is a systolic realization, only small amounts of interconnect
are required. As will be
seen in later slides, the SPLASH system performs three to four
orders of magnitude better than a workstation. The inherent parallelism
in the algorithm is heavily exploited.
Slide 22:
The path length seemed excessively long.
Better floorplanning might have helped. Additional pipelining might be
possible. If the critical path length still forced a long cycle, it might
then be possible to interleave several comparisons at once (since the
normal style is to compare the source against many targets).
Slide 23:
This slide describes the metric used in the following slide to compare the
throughput of several machines. In our prior examples, each cell in the
grid requires one cell update; "CUPS" counts cell updates per second.
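As a hypothetical illustration of the metric (all numbers below are
invented, not measured results from the lecture), CUPS is simply the number
of grid cells computed divided by the elapsed time:

```python
def cups(source_len, target_len, num_targets, seconds):
    """Cell updates per second: each cell of every m-by-n comparison grid
    counts as one update. All arguments here are hypothetical inputs."""
    cells = source_len * target_len * num_targets
    return cells / seconds

# Purely illustrative: a 350-character source against 1000 targets in 1 s.
rate = cups(350, 350, 1000, 1.0)
```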
Slide 24:
Here we see several orders of magnitude between the optimized SPLASH-2
and the SPARC-1. One might be able to improve the results for the last
two items by optimizing the code for these workstation architectures.
Slide 25:
This slide compares the computational density of the architectures by
dividing the cell updates per second by the area required for the various
architectures. For the metric of computational density, the performance
spread narrows to three orders of magnitude.
While SPLASH had higher raw performance
than the custom implementation, when we look at the performance density, we see
that the custom implementation actually did offer more performance per unit
area. The algorithm did not use the memories on the SPLASH board.
One might also note that
the additional resources available in the SPARC-10, as compared to the
SPARC-1, do not proportionally increase its throughput metric. Therefore
its computational density, for this algorithm, is poorer.
Slide 26:
This slide concludes the lecture with a few pertinent results.