Computer Science 294-7 Lecture #5
Instructions

Notes by Amit Mehrotra


Slide 3: An overview of the Programmable Active Memory (PAM) as detailed in "Programmable Active Memories: Reconfigurable Systems Come of Age". The core consists of a 4x4 array of FPGAs (XC3090) with four 1 MB RAMs for local memory. The maximum memory bandwidth is 400 MB/s. It also has four 32-bit external connectors, three of which have 400 MB/s capacity. The fourth connector links to the host interface through input and output FIFOs. The clock on the board is tunable, to accommodate designs with different critical-path lengths. The clock can also be retuned dynamically, so that an occasional slow operation can run at a lower frequency.


Slide 4: The authors claim that synthesis tools were too wasteful of area and performance, especially in an FPGA environment that already suffers a performance penalty over a raw silicon implementation. Schematic capture is too laborious to be viable. So a netlist generator has been developed, which takes an algorithmic description of the structure. The language used for "hardware description" is C++.


Slide 5: This slide illustrates an example of a netlist generator which generates an adder. Logic operators can also be easily annotated with placement directives. This information is optional; however, the tool performs best when regular blocks like datapaths are placed relative to each other and automatic placement is reserved for the control logic.
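
As a rough illustration of the style, here is a minimal C++ sketch of such a generator for a ripple-carry adder. The Net type, the gate() emitter, and the place() directive are hypothetical stand-ins for the paper's actual C++ library; the point is only the overall shape: an ordinary C++ loop runs at generation time and unrolls into netlist cells, with optional placement hints on the regular datapath.

    #include <cstdio>
    #include <vector>

    // Hypothetical netlist-generation primitives (not the paper's real API).
    struct Net { int id; };
    static int nextId = 0;

    Net gate(const char* op, Net a, Net b) {        // emit one 2-input gate
        Net out{nextId++};
        std::printf("n%d = %s(n%d, n%d)\n", out.id, op, a.id, b.id);
        return out;
    }
    void place(Net n, int col, int row) {           // optional placement hint
        std::printf("place n%d at (%d,%d)\n", n.id, col, row);
    }

    // Generate an n-bit ripple-carry adder: the loop executes in the
    // generator program, unrolling into n full-adder cells in the netlist.
    std::vector<Net> adder(const std::vector<Net>& a,
                           const std::vector<Net>& b, Net carry) {
        std::vector<Net> sum;
        for (std::size_t i = 0; i < a.size(); ++i) {
            Net p = gate("xor", a[i], b[i]);
            sum.push_back(gate("xor", p, carry));
            carry = gate("or", gate("and", a[i], b[i]), gate("and", p, carry));
            place(sum.back(), 0, (int)i);   // pin the datapath into one column
        }
        sum.push_back(carry);               // final carry-out
        return sum;
    }

    int main() {                            // generate a 4-bit adder netlist
        std::vector<Net> a, b;
        for (int i = 0; i < 4; ++i) { a.push_back({nextId++}); b.push_back({nextId++}); }
        adder(a, b, {nextId++});
    }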


Slide 6: This slide describes the interfacing scheme of the PAM with the host processor. As mentioned before, the host interfaces through input and output FIFOs, allowing the host and PAM clocks to be asynchronous. The TURBOchannel bus achieves 100 MB/s. However, the PAM device sits at the same level of the processor's memory hierarchy as main memory, which results in latencies on the order of microseconds when interfacing with the host.
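
A consequence for host software is that data should be streamed through the FIFOs in large batches, since word-at-a-time round trips would pay the microsecond latency on every transfer. The sketch below is a purely illustrative software mock of the two FIFOs; the paper does not spell out the actual driver API.

    #include <cstdint>
    #include <cstddef>
    #include <deque>

    // Software mock of the hardware input/output FIFOs (illustrative only).
    struct Fifo { std::deque<uint32_t> q; };

    void host_to_pam(Fifo& in, const uint32_t* words, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) in.q.push_back(words[i]);
    }
    std::size_t pam_to_host(Fifo& out, uint32_t* words, std::size_t n) {
        std::size_t got = 0;
        while (got < n && !out.q.empty()) { // drain whatever is ready
            words[got++] = out.q.front();
            out.q.pop_front();
        }
        return got;                          // latency amortizes over batches
    }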


Slide 7: This slide lists the various applications to which the PAM has been configured and its performance relative to state-of-the-art machines (where available). This emphasizes the breadth of applications that have been successfully tackled by the PAM architecture.


Slide 8: This describes the results of the PAM configured as a long-integer multiplier. The configuration computes A*B+S. This is achieved by multiplying the long word 2 bits at a time, with the 2-bit partial result shifted out each cycle. It operates at 33 MHz, which is about an order of magnitude faster than a Cray implementation. The authors cite another FPGA implementation which uses 4-bit multiplication and is twice as fast as theirs, but takes 3 times the area. Note that one could use either implementation on their machine, depending on the relative importance of performance and area in a particular application.
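
To make the digit-serial scheme concrete, here is a small software model of a 2-bit-per-cycle (radix-4) multiply-accumulate, written for 32-bit operands purely for illustration; the hardware works on much longer words, consuming one 2-bit digit of the multiplier per 33 MHz cycle.

    #include <cstdint>

    // Software model of a radix-4 multiply-accumulate: computes A*B + S by
    // consuming two bits of A per step and shifting two result bits out of
    // the accumulator each step, as the hardware does each clock cycle.
    uint64_t mac_radix4(uint32_t A, uint32_t B, uint32_t S) {
        uint64_t acc = S;        // running partial sum (high part)
        uint64_t result_low = 0; // bits already shifted out
        for (int step = 0; step < 16; ++step) {      // 32 bits, 2 per step
            uint32_t digit = (A >> (2 * step)) & 3;  // next 2-bit digit of A
            acc += (uint64_t)digit * B;              // add 0, 1, 2, or 3 times B
            result_low |= (acc & 3) << (2 * step);   // shift out low 2 bits
            acc >>= 2;
        }
        return (acc << 32) | result_low;             // reassemble full result
    }

For example, mac_radix4(3, 5, 2) returns 17, which is 3*5 + 2.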


Slide 9: This describes the results of the PAM configured as an RSA decryption system. It operates at 185 kb/s for 1024-bit keys and 300 kb/s for 512-bit keys.


Slide 10: This architecture has been specialized around the modular arithmetic at the core of RSA. It remains unclear how much larger (or slower) the design would have been had this approach not been adopted. The implementation exploits the structure of the problem and heavy pipelining to achieve the fastest speed. The basic computations in the decryption scheme (for instance Chinese remainders, precomputed powers, Hensel's division, carry completion, quotient pipelining) have each been hand-optimized and sped up.
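
Of the techniques listed, the Chinese-remainder shortcut is the easiest to sketch. The toy C++ model below (32-bit primes, not cryptographic) shows the idea: two half-size exponentiations modulo p and q replace one full-size exponentiation modulo pq, followed by a recombination step. The parameter names and the Garner-style recombination are illustrative, not the paper's exact formulation.

    #include <cstdint>

    // Square-and-multiply modular exponentiation; operands stay below 2^32,
    // so the 64-bit products cannot overflow.
    static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m) {
        uint64_t r = 1 % m;
        for (b %= m; e; e >>= 1) {
            if (e & 1) r = r * b % m;
            b = b * b % m;
        }
        return r;
    }

    // p, q: distinct primes; d: private exponent; qinv = q^-1 mod p
    // (precomputed, like the precomputed powers the slide mentions).
    uint64_t rsa_decrypt_crt(uint64_t c, uint32_t p, uint32_t q,
                             uint64_t d, uint32_t qinv) {
        uint64_t mp = powmod(c, d % (p - 1), p);         // c^d mod p (Fermat)
        uint64_t mq = powmod(c, d % (q - 1), q);         // c^d mod q
        uint64_t h  = (mp + p - mq % p) % p * qinv % p;  // recombination
        return mq + h * q;                               // m = c^d mod p*q
    }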


Slide 11: For sensor processing applications, the input is a continuous stream of data at very high rates (GB/s), and the main task is pre-processing and reduction of that data.


Slide 12: In this application, input images arrive at a rate of 100 kHz, which translates to a raw input rate of 160 MB/s. The PAM is configured to compute simple statistics on the image data (pixel-wise sums, moment generation, discrimination, etc.) and to declare in real time whether an image is interesting or not. This example shows that the PAM can achieve a level of performance not possible with contemporary microprocessors: the 129x speed difference suggests it would take at least 129 processors to match its speed, assuming the problem could be reasonably distributed across that many processors.
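
The per-image work is simple enough to sketch. The C++ model below computes the kinds of statistics named on the slide: a pixel-wise sum, first moments (a centroid), and a threshold discriminant. The image geometry is an assumption chosen to match the stated rates (40x40 8-bit pixels = 1600 bytes per image, which at 100 kHz gives 160 MB/s).

    #include <cstdint>
    #include <cstddef>

    // Assumed geometry: 40x40 8-bit pixels (1600 bytes/image, illustrative).
    struct ImageStats { uint64_t sum; double cx, cy; bool interesting; };

    ImageStats analyze(const uint8_t img[40][40], uint64_t threshold) {
        uint64_t s = 0, sx = 0, sy = 0;
        for (std::size_t y = 0; y < 40; ++y)
            for (std::size_t x = 0; x < 40; ++x) {
                uint64_t p = img[y][x];
                s  += p;           // pixel-wise sum (zeroth moment)
                sx += p * x;       // first moment in x
                sy += p * y;       // first moment in y
            }
        double cx = s ? (double)sx / s : 0.0;  // centroid = normalized moment
        double cy = s ? (double)sy / s : 0.0;
        return { s, cx, cy, s > threshold };   // discriminate on total signal
    }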


Slide 13: This describes another custom computing machine, targeted specifically at video processing. This narrower domain permits more coarse-grained interconnected elements: the system consists of several dedicated compute elements interconnected by a monolithic crossbar.


Slide 14: Since the target domain is video processing, the system needs to be reconfigured only to the extent that the various video standards and compression algorithms differ from one another. All of them, however, involve the same basic digital signal processing, so these "macro" blocks are integrated into the system as stream processors. The system contains a filter processor, a discrete cosine transform processor, a motion estimator, a color-space converter, a remap/composite processor, a superposition processor, and a sequenced lookup table, along with two FPGAs and a RISC processor.


Slide 15: This indicates that the system performs stream processing on the input video data, using pipelined dataflow between the large-grained blocks that implement the operations mentioned above.
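
The dataflow structure can be sketched in a few lines. In the model below, each stage stands in for one dedicated stream processor (filter, DCT, motion estimator, etc.), and the crossbar is abstracted as simply wiring one stage's output stream to the next stage's input; in hardware the stages run concurrently on successive frames, while this software model shows only the dataflow order. All names are illustrative, not the system's API.

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    using Frame = std::vector<uint8_t>;
    using Stage = Frame (*)(const Frame&);   // one coarse-grained processor

    // A frame flows stage to stage through the crossbar, one hop per
    // stream processor; the pipeline order is the only "program".
    Frame run_pipeline(Frame f, const std::vector<Stage>& stages) {
        for (Stage s : stages)
            f = s(f);
        return f;
    }

    // Illustrative stage: a trivial placeholder for a filter processor.
    Frame invert(const Frame& in) {
        Frame out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = 255 - in[i];
        return out;
    }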


Slide 16: This slide shows an example application implemented in the paper.


Slide 17: This slide reviews the common themes of the two custom computing machines.