Slide 3:
An overview of the Programmable Active Memory (PAM) as detailed in
"Programmable Active Memories: Reconfigurable Systems Come of Age". The
core consists of a 4x4 array of FPGAs (Xilinx XC3090) with four 1 MB RAM
banks for local memory. The maximum memory bandwidth is 400 MB/s. The board
also has four 32-bit external connectors, three of which have 400 MB/s
capacity. The fourth connector links to the host interface through input
and output FIFOs. The on-board clock is tunable to accommodate designs with
critical paths of different lengths, and it can also be slowed dynamically
for an occasional slow operation.
Slide 4:
The authors claim the synthesis tools of the time wasted too much area and
performance, a serious drawback in an FPGA environment that already pays a
performance penalty over raw silicon. Schematic capture is too laborious to
be viable. So a netlist generator has been developed that takes an
algorithmic description of the structure. The language used for "hardware
description" is C++.
Slide 5:
This slide illustrates an example of a netlist generator which generates
an adder. Also, logic operators can be easily annotated with placement
directives. This information is optional. However, the tool performs best
when regular blocks like datapaths are placed relative to each other and
automatic placement is performed for the control logic.
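As a rough illustration of the idea (not the paper's actual generator API; all names here are invented), an algorithmic C++ netlist generator for a ripple-carry adder can be a plain loop that emits one full-adder cell per bit:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical netlist model: each gate is recorded as one text line.
struct Netlist {
    std::vector<std::string> gates;
    void add(const std::string& g) { gates.push_back(g); }
};

// Algorithmic generator for an n-bit ripple-carry adder: the loop emits
// a sum gate and a carry (majority) gate per bit, chaining c[i] to c[i+1].
void gen_adder(Netlist& nl, int n) {
    for (int i = 0; i < n; ++i) {
        std::string s = std::to_string(i);
        nl.add("sum" + s + " = a" + s + " ^ b" + s + " ^ c" + s);
        nl.add("c" + std::to_string(i + 1) +
               " = maj(a" + s + ", b" + s + ", c" + s + ")");
    }
}
```

Placement directives would be attached to each cell inside the same loop, which is what makes relative placement of regular datapath blocks cheap to express.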
Slide 6:
This slide describes the interfacing scheme between PAM and the host
processor. As mentioned before, the host interfaces through input and output
FIFOs, allowing the host and PAM clocks to be asynchronous. The TURBOchannel
bus achieves 100 MB/s. However, the PAM device sits at the same level of the
processor's memory hierarchy as main memory, which results in latencies on
the order of microseconds when interfacing with the host.
Slide 7:
This slide describes various applications for which PAM has been configured
and its performance relative to state-of-the-art machines (where available).
This emphasizes the breadth of applications that have been successfully
tackled by the PAM architecture.
Slide 8:
This describes the results of the PAM configured as a long integer
multiplier. The configuration computes A*B + S. This is achieved by
multiplying the long word two bits at a time and shifting the two-bit
partial result out each cycle. It operates at 33 MHz, which is about an
order of magnitude faster than a Cray implementation. They cite another
FPGA implementation which uses 4-bit multiplication and is twice as fast as
theirs but takes three times the area. Note that either implementation
could be used on their machine, depending on the relative importance of
performance and area in a particular application.
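A software sketch of the radix-4 (two-bits-per-cycle) scheme, assuming 32-bit operands for illustration (the actual design handles much longer words, entirely in hardware):

```cpp
#include <cassert>
#include <cstdint>

// Radix-4 serial multiply-accumulate: each step consumes one 2-bit digit
// of B, adds digit*A into a running sum, then retires the two low bits of
// the sum as final result bits, mirroring the shift-out design.
uint64_t mul_add_radix4(uint32_t a, uint32_t b, uint32_t s) {
    uint64_t acc = s;          // the S input seeds the running sum
    uint64_t result = 0;
    for (int step = 0; step < 16; ++step) {   // 32 bits / 2 per step
        acc += (uint64_t)(b & 3) * a;         // add 2-bit digit * A
        result |= (acc & 3) << (2 * step);    // shift two bits out
        acc >>= 2;
        b >>= 2;
    }
    return result | (acc << 32);              // remaining high part
}
```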
Slide 9:
This describes the result of the PAM configured as an RSA decryption system.
It operates at 185 kb/s with 1-Kb keys and 300 kb/s with 512-bit keys.
Slide 10:
This architecture has been specialized around the modulus computation. It
remains unclear how much larger (and slower) the design would have been had
this approach not been adopted. The implementation exploits the structure
of the problem and heavy pipelining to achieve its speed. The basic
computations in the decryption scheme (for instance Chinese remainders,
precomputed powers, Hensel's division, carry completion, quotient
pipelining) have each been hand-tuned and sped up.
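The Chinese-remainder step can be illustrated with a toy-sized software sketch (textbook parameters p = 61, q = 53, d = 2753 are assumptions for illustration; the hardware works on 512-bit and larger moduli and also uses Hensel's division and quotient pipelining, omitted here):

```cpp
#include <cassert>
#include <cstdint>

// Square-and-multiply modular exponentiation (uses the GCC/Clang
// __uint128_t extension for the intermediate products).
uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1 % m;
    b %= m;
    while (e) {
        if (e & 1) r = (__uint128_t)r * b % m;
        b = (__uint128_t)b * b % m;
        e >>= 1;
    }
    return r;
}

// CRT decryption: two half-size exponentiations mod p and mod q
// (exponents reduced via Fermat's little theorem), then Garner
// recombination. q_inv_p is the precomputed inverse of q mod p.
uint64_t rsa_crt_decrypt(uint64_t c, uint64_t d,
                         uint64_t p, uint64_t q, uint64_t q_inv_p) {
    uint64_t mp = modpow(c % p, d % (p - 1), p);
    uint64_t mq = modpow(c % q, d % (q - 1), q);
    uint64_t h = (__uint128_t)((mp + p - mq % p) % p) * q_inv_p % p;
    return mq + q * h;   // m = mq + q * ((mp - mq) * q^-1 mod p)
}
```

The two half-size exponentiations are roughly four times cheaper each than one full-size exponentiation, which is why decryption designs exploit the CRT.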
Slide 11:
For sensor processing applications, the input is a continuous stream of
data at very high speeds (GB/s), and the main task is pre-processing and
reduction of the data.
Slide 12:
In this application input arrives at a rate of 100 kHz, which translates to
a 160 MB/s raw input rate. The PAM is configured to compute some simple
statistics on the image data (pixel-wise sum, moment generation,
discrimination, etc.) and to declare in real time whether the image is
interesting. This example shows that PAM can achieve a level of performance
not possible with contemporary microprocessors: the 129x speed difference
suggests it would take at least 129 processors to match its speed, assuming
the problem could be reasonably distributed across that many processors.
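A minimal sketch of the kind of per-frame statistics involved (the `interesting` discriminant below is a placeholder threshold of my own, not the experiment's actual trigger logic):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One pass over a frame: pixel-wise sum plus first moments in x and y.
struct Stats { uint64_t sum = 0, mx = 0, my = 0; };

Stats frame_stats(const std::vector<uint8_t>& px, int w, int h) {
    Stats s;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            uint8_t v = px[y * w + x];
            s.sum += v;
            s.mx += (uint64_t)v * x;   // first moment in x
            s.my += (uint64_t)v * y;   // first moment in y
        }
    return s;
}

// Placeholder discriminant: flag the frame when its total intensity
// exceeds a threshold.
bool interesting(const Stats& s, uint64_t threshold) {
    return s.sum > threshold;
}
```

In the PAM configuration these accumulations run concurrently in dedicated logic at line rate, which is where the 129x advantage over a sequential processor comes from.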
Slide 13:
This describes another custom computing machine, targeted specifically at
video processing. The narrower domain allows coarser-grained compute
elements: the system consists of several dedicated compute elements
interconnected by a monolithic crossbar.
Slide 14:
Since the target domain is video processing, the system needs to be
reconfigured only to the extent that the various video standards and
compression algorithms differ from each other. All of them, however,
involve some basic digital signal processing, so these "macro" blocks are
integrated into the system as stream processors: a filter processor, a
discrete cosine transform processor, a motion estimator, a color-space
converter, a remap/composite processor, a superposition processor, and a
sequenced lookup table, along with two FPGAs and a RISC processor.
Slide 15:
This indicates that the system performs stream processing on the input
video data, using pipelined dataflow between the large-grain blocks that
implement the various operations mentioned before.
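The dataflow idea can be sketched as a chain of stream-processing stages, each consuming the previous stage's output (the stage bodies below are placeholders; a crossbar would additionally let any stage feed any other):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Each coarse-grained block is modeled as a function on a sample; the
// pipeline threads a sample through the stages in order. In hardware the
// stages run concurrently, each working on a successive sample.
using Stage = std::function<int(int)>;

int run_pipeline(const std::vector<Stage>& stages, int sample) {
    for (const auto& st : stages) sample = st(sample);
    return sample;
}
```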
Slide 16:
This slide shows an example application implemented in the paper.
Slide 17:
This slide reviews the common themes of the two custom computing machines.