Computer Science 294-7 Lecture #6
SPLASH

Notes by Roy A Sutton


Slide 1: The homepage for both SPLASH-1 and SPLASH-2 can be found here.


Slide 2: Notice the change of time for the next class.


Slide 3: The SPLASH-1 architecture is built upon the Xilinx XC3090. Its interconnection is that of a linear systolic array (no mesh connections allowed). The board was built around a single application (DNA sequence matching) and was developed in 1988. The SPLASH-1 board likely cost more than a custom IC would have. The 1 MHz FIFO is a major limitation that sets the maximum system throughput.


Slide 4: The SPLASH-2 architecture added a crossbar and was built with Xilinx XC4010 FPGAs. It was developed in 1992. The figure of 43 giga-lambda squared does not include the crossbar, but the crossbar likely does not contribute significantly to the area.


Slide 5: This slide motivates the problem that the SPLASH-1 and SPLASH-2 architectures solve.


Slide 6: A DNA sequence is composed of a string of characters drawn from a four-character alphabet. The path for a transformation from one sequence to another is not unique; all we really need to find is the cost of a minimum-cost path. The transformation can be described in terms of three operations, each with an associated cost: insert, delete, and substitute. It should be noted that "match" is not a separate operation; it is the zero-cost case of the substitution operation, which costs either 2 or 0 depending on whether a true substitution is required or not (respectively).
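
As a point of reference, here is a minimal Python sketch of this cost model (the function names are ours; the unit insert and delete costs are those used in the worked example on Slide 9):

    def delete_cost(src_char):
        # Deleting a source character always costs 1.
        return 1

    def insert_cost(tgt_char):
        # Inserting a target character always costs 1.
        return 1

    def substitute_cost(src_char, tgt_char):
        # "Match" is just the zero-cost case of substitution.
        return 0 if src_char == tgt_char else 2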


Slide 7: In this example we examine the steps one, or nature, might take to transform the sequence "SHMOO" to "SHORE."


Slide 8: Notice that the delete operation depends only on the source character, and the insert operation depends only on the target character; the substitute operation, however, depends on both the source and target characters. Given this formulation, elements can be computed in parallel along lines perpendicular to the diagonal running from d(0, 0) to d(m, n); these lines are called the antidiagonals.
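
To make the dependency structure concrete, here is a hedged Python sketch that fills the cost table in antidiagonal order (the function and variable names are ours; the costs are those given on Slides 6 and 9). Every element on the antidiagonal i + j = k depends only on elements with smaller i + j, so all of its elements could be computed simultaneously in hardware:

    def edit_distance_antidiagonal(source, target):
        # d[i][j] = minimum cost of transforming source[:i] into target[:j].
        m, n = len(source), len(target)
        d = [[None] * (n + 1) for _ in range(m + 1)]
        d[0][0] = 0
        for k in range(1, m + n + 1):                   # antidiagonal index
            for i in range(max(0, k - n), min(m, k) + 1):
                j = k - i
                candidates = []
                if i > 0:                               # delete source[i-1]
                    candidates.append(d[i - 1][j] + 1)
                if j > 0:                               # insert target[j-1]
                    candidates.append(d[i][j - 1] + 1)
                if i > 0 and j > 0:                     # substitute (free on a match)
                    sub = 0 if source[i - 1] == target[j - 1] else 2
                    candidates.append(d[i - 1][j - 1] + sub)
                d[i][j] = min(candidates)
        return d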


Slide 9: In this example the target is "SHORE" and the source is "SHMOO." As defined, d(0, 0) = 0. From there we can begin computing along the parallel lines described previously, the first of which is formed by the two elements d(0, 1) and d(1, 0). For element d(0, 1) we need only compute the cost of insertion, which in this case is 1, since d(0, 0) = 0. Similarly, we compute the cost of deletion for element d(1, 0) to be 1. In the next parallel (composed of the three elements d(0, 2), d(1, 1), and d(2, 0)), let us examine the element d(1, 1). To compute it we must determine the minimum cost among deleting the source character, inserting the target character, and substituting the target character for the source character. The incremental operation costs are 1, 1, and 0 (respectively), the substitution being free because the characters match; adding the predecessor costs d(0, 1) = 1, d(1, 0) = 1, and d(0, 0) = 0 gives candidate totals of 2, 2, and 0, so d(1, 1) = 0. In this manner the remaining parallels are computed.


Slide 10: Here is the completed cost table for our example. Notice that the element d(4, 3) = 1. In this case, moving from d(3, 2) via substitution has the minimum cost, since the source and target characters match. This agrees with intuition: to transform "SHMO" into "SHO," the operation of minimum cost is deleting the "M" from the source.
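
Using the antidiagonal sketch from Slide 8, the entries called out on these slides can be checked directly:

    d = edit_distance_antidiagonal("SHMOO", "SHORE")
    print(d[4][3])   # -> 1, the entry discussed above
    print(d[5][5])   # -> 4, the final edit distance (see Slide 15)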


Slide 11: The algorithm displayed runs concurrently on each PE. The PEs are interconnected in a systolic array. The source characters and target characters move through the array in opposite directions.
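
A hedged Python model of what a single PE computes when a source character meets a target character follows. The convention that the source character carries the running distance d(i, j-1), the target character carries d(i-1, j), and the PE stores the diagonal value d(i-1, j-1) from its previous meeting is our reconstruction of the scheme, not taken verbatim from the slides:

    def pe_update(stored_diag, src, tgt):
        # src = (source char, carried distance d(i, j-1))
        # tgt = (target char, carried distance d(i-1, j))
        # stored_diag = d(i-1, j-1), computed at this PE's previous meeting.
        s_char, s_dist = src
        t_char, t_dist = tgt
        sub = 0 if s_char == t_char else 2
        new_dist = min(stored_diag + sub,  # substitute (free on a match)
                       s_dist + 1,         # insert the target character
                       t_dist + 1)         # delete the source character
        # Both characters leave carrying new_dist, and the PE stores it
        # as the diagonal value for its next meeting.
        return new_dist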


Slide 12: Each sequence is separated by a null character for proper timing. As previously described, the source flows in from the left while the target flows in from the right. This slide shows three timesteps. The first timestep shows the initial conditions, namely the cost of deletions for the source and the cost of insertions for the target. In the second timestep the first characters of the source and target enter a common element; the algorithm dictates that the cost is 0, given the match (and the prior history of 0 cost). In the third timestep the costs at the two meeting points are both 1: each is derived from the cost (0) carried by the incoming target (source) character plus the cost (1) of a deletion (insertion).
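
These first meetings can be checked against the pe_update sketch from Slide 11 (the carried distances entering the array are the boundary costs d(i, 0) = i and d(0, j) = j):

    # Timestep 2: s1='S' (carrying d(1,0)=1) meets t1='S' (carrying d(0,1)=1)
    # over stored diagonal d(0,0)=0; the characters match, so the cost is 0.
    print(pe_update(0, ('S', 1), ('S', 1)))   # -> 0
    # Timestep 3: s1 meets t2 at cell (1,2), and s2 meets t1 at cell (2,1),
    # each over a stored diagonal value of 1.
    print(pe_update(1, ('S', 0), ('H', 2)))   # -> 1, cell (1,2), via insertion
    print(pe_update(1, ('H', 2), ('S', 0)))   # -> 1, cell (2,1), via deletion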


Slide 13: In this slide the example continues as described on the previous slide. For example, in the second timestep shown, consider the cost computation where the characters S and R meet: the minimum-cost result is 3, which is the carried distance (2) plus the unit cost (1) of the insert/delete operation.


Slide 14: This slide continues the example in similar fashion.


Slide 15: This slide concludes the example with the total cost of the transformation, i.e., the edit distance, calculated to be 4. The final distance ends up at the tails of the source and target sequences and follows them out of the array.


Slide 16: This slide summarizes the statistics for the bidirectional technique. As will be seen in a few slides, techniques exist which increase the steady-state computational density and the utilization of the hardware. The SPLASH-1 board could operate on strings on the order of 340-350 characters long. Using the unidirectional scheme, the size of the operable strings almost doubles (the PEs do a bit more computation and therefore require more logic).


Slide 17: In practice, these systems are used to compare a given source string against large numbers of targets. In this mode of use, the bidirectional algorithm suffers greatly, since half of its processing elements are, on average, idle: only when computing the maximum-length antidiagonal is the array fully utilized.


Slide 18: The unidirectional algorithm was developed for SPLASH-2. In this case, the computation proceeds across rows rather than antidiagonals. Notice that a setup cycle is required between sequences. In this formulation, all PEs are fully utilized during the comparison.
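
A hedged software model of this scheme follows. The register arrangement is our reconstruction: we assume each PE is loaded with one target character during the setup cycle and holds two values, its last output d(i-1, j) and the diagonal value d(i-1, j-1), while source characters stream through one per step:

    def unidirectional_systolic(source, target):
        n = len(target)
        prev_out = list(range(1, n + 1))     # d(0, j): cost of inserting target[:j]
        prev_diag = list(range(0, n))        # d(0, j-1): last step's left input
        for i, s in enumerate(source, start=1):
            left = i                         # d(i, 0): cost of deleting source[:i]
            for j in range(n):               # PE j+1 holds target[j]
                sub = 0 if s == target[j] else 2
                out = min(prev_diag[j] + sub,   # diagonal: substitute / match
                          prev_out[j] + 1,      # above: delete source char
                          left + 1)             # left: insert target char
                prev_diag[j] = left          # becomes the diagonal next row
                prev_out[j] = out
                left = out                   # handed on to the next PE
        return prev_out[-1]                  # d(m, n), the edit distance

    print(unidirectional_systolic("SHMOO", "SHORE"))   # -> 4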


Slide 19: This slide summarizes the results for the unidirectional scheme.


Slide 20: Since the character set is composed of 4 symbols, the character datapath need only be 2 bits wide. Also notice the regularity in the cost metric, namely that there is, at most, a difference of 2 between adjacent entries. Given this, the registers need only two bits to encode the change in distance rather than the full distance.
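
This property is easy to check on the running example, reusing the antidiagonal sketch from Slide 8 (the specific delta sets below are what this cost model produces and are our observation, not the slide's): row-adjacent entries differ by -1, 0, or +1, and diagonal-adjacent entries by -2, 0, or +2, so either difference fits in a two-bit code.

    d = edit_distance_antidiagonal("SHMOO", "SHORE")
    for i in range(len(d)):
        for j in range(1, len(d[0])):
            assert d[i][j] - d[i][j - 1] in (-1, 0, 1)        # row-adjacent
    for i in range(1, len(d)):
        for j in range(1, len(d[0])):
            assert d[i][j] - d[i - 1][j - 1] in (-2, 0, 2)    # diagonal-adjacent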


Slide 21: Because this is a systolic realization, only small amounts of interconnect are required. As will be seen in later slides, the SPLASH system performs three to four orders of magnitude better than a workstation. The inherent parallelism in the algorithm is thoroughly exploited.


Slide 22: The critical path length seemed excessively long. Better floorplanning might have helped, and additional pipelining might be possible. If the critical path still forced a long cycle, it might then be possible to interleave several comparisons at once (since the normal mode of use is to compare the source against many targets).


Slide 23: This slide describes the metric, cell updates per second (CUPS), used in the following slide to compare the throughput of several machines. Each cell in the grid, in our prior examples, requires a separate cell update.
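
As a hedged, illustrative calculation (the numbers here are ours, not from the slides): comparing a 100-character source against a target database totalling 10^7 characters requires 100 x 10^7 = 10^9 cell updates, so a machine sustaining 10^8 CUPS would need about 10 seconds.

    m, db_chars, cups = 100, 10**7, 10**8     # illustrative values only
    cell_updates = m * db_chars               # one update per grid cell
    print(cell_updates / cups, "seconds")     # -> 10.0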


Slide 24: Here we see several orders of magnitude between the optimized SPLASH-2 and the SPARC-1. One might be able to improve the results for the last two items by optimizing the code for these workstation architectures.


Slide 25: This slide compares the computational density of the architectures by dividing cell updates per second by the area each architecture requires. Under this metric, the performance spread narrows to three orders of magnitude. While SPLASH had higher raw performance than the custom implementation, the custom implementation actually offered more performance per unit area. Note that the algorithm did not use the memories on the SPLASH board. One might also note that the additional resources available in the SPARC-10, as compared to the SPARC-1, do not proportionally increase its throughput; its computational density for this algorithm is therefore poorer.


Slide 26: This slide concludes the lecture with a few pertinent results.