Administrivia:
Anyone who has not yet signed up to scribe notes should see Andre soon. The lecture room has been set to Soda 505. Due to the conflict with FPGA'97, the Tuesday lecture (Feb. 11) is rescheduled to 5-6:30PM in 405 Soda.
Slide 3:
In the last lecture we discussed many issues of instruction organization and described some concepts of the RP architecture space for reconfigurable computing. The goal of this lecture is to give you background knowledge and to build a bridge from the high-level design space to low-level implementation. The main topics are what an FPGA looks like and how it is laid out and programmed. This tutorial should also help people prepare for the first assignment (implementing a computation on a Xilinx part) and for the next two lectures on FCCMs.
Slide 4:
An FPGA has three main components: i) processing units; ii) interconnect; and iii) instruction memory. In the RP space, the FPGA's interconnect is described by p=0.5 (a reasonable estimate of the Rent's-rule parameter used to meter interconnect requirements), d=1 (the one optional flip-flop), and c=1 (one pinst stored locally in each block).
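Rent's rule relates the I/O (wiring) demand of a partition to the number of blocks it contains as IO = k * N^p. The following sketch illustrates the slide's p = 0.5; the coefficient k is an assumed value chosen for illustration, not a vendor figure.

```python
# Hypothetical Rent's-rule illustration: IO = k * N^p with p = 0.5.
# k (pins per single block) is an assumption for illustration only.
def rent_io(n_blocks, k=4, p=0.5):
    """Estimated wiring demand of a partition containing n_blocks blocks."""
    return k * n_blocks ** p

# With p = 0.5, quadrupling the block count only doubles wiring demand:
demand_100 = rent_io(100)   # 4 * 10 = 40
demand_400 = rent_io(400)   # 4 * 20 = 80
```

The sublinear growth (p < 1) is what makes a hierarchy of mostly-short wires a reasonable match for typical logic.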
Slide 5:
This slide shows the yielded efficiency versus the application's critical path length and granularity for a conventional FPGA architecture with a single context, a single retiming register, and single-bit granularity. We can easily see that this architecture favors applications with short critical paths and fine granularity.
Slide 6:
The 4-input look-up table (LUT) is the basic processing unit of an FPGA. Since there is an optional flip-flop for retiming, each LUT can be used for combinational or sequential logic.
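Conceptually, a 4-LUT is just a 16-entry truth table addressed by its four inputs, with the optional flip-flop on the output. A minimal sketch (a model for intuition, not vendor behavior):

```python
# Minimal model of a 4-input LUT with an optional output flip-flop.
class Lut4:
    def __init__(self, truth_table):
        assert len(truth_table) == 16      # one output bit per input combo
        self.table = truth_table
        self.ff = 0                        # the optional retiming flip-flop

    def combinational(self, a, b, c, d):
        # the four inputs simply address the configuration bits
        return self.table[(d << 3) | (c << 2) | (b << 1) | a]

    def clock(self, a, b, c, d):
        # registered use: sample the LUT, emit the previous state
        out, self.ff = self.ff, self.combinational(a, b, c, d)
        return out

# Example: program the LUT as a 4-input AND gate.
and4 = Lut4([0] * 15 + [1])
```

Any function of up to four variables is just a different 16-bit configuration, which is why the LUT is "universal" at this granularity.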
Slide 7:
Flexibility, area, and delay are the three main issues in FPGA interconnect
design, and there are many tradeoffs among them. For instance, FPGAs do not
use crossbar or mesh interconnect due to area and delay constraints. The
main feature of FPGA interconnect is a hierarchy of short, long, and
dedicated connections. There are also special features such as corner turns
and joined segments. Vendors usually distribute the different connection
types statistically.
Slide 8:
XC3000 devices are Xilinx's second-generation FPGA family. They were the first devices used heavily for reconfigurable computing systems such as PAM and SPLASH-1, and they are still widely used.
Each XC3000 CLB contains a combinational logic section (two 4-LUTs or an equivalent 5-LUT), two flip-flops, and a program-memory-controlled multiplexer that selects the function.
Slide 9:
The figure in the left corner gives an overall view of the possible connections in XC3000 devices. Note that most of the wires are segments local to a CLB; there are also a limited number of long vertical and horizontal lines. Switch boxes join the local segments and thus allow programmed interconnection between CLBs. The upper-right figure shows the legitimate switching-matrix combinations for each pin. Bidirectional interconnect buffers (adjacent to the switch matrices) are available to propagate signals in either direction on a given general interconnect segment. Long lines can bypass the switch matrices for long-distance communication, reducing the delays due to switching.
Slide 10:
Through the rest of this lecture, we will focus on the more recent and more popular XC4000 family. We will use these parts for Project 1.
Each CLB contains two 4-LUTs (the F and G function generators). A third, 3-input function generator, H, is also provided; its inputs can be the outputs of F and G or can come from outside the CLB. This configuration gives considerable versatility: for instance, a CLB can implement certain functions of up to nine variables. There are two flip-flops for storage or retiming. The flip-flops can be used as registers or shift registers without blocking the function generators from performing a different, perhaps unrelated, task. In an optional mode, the LUTs can be configured as an array of RAM cells. There is also dedicated arithmetic logic for fast generation and propagation of the carry signal.
Let's estimate the programming speed: assuming serial programming at a bandwidth of 1 bit per 50 ns and about 400 unencoded instruction bits per CLB, an FPGA with 2000 CLBs takes 40 ms to program.
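The estimate above, reproduced as explicit arithmetic:

```python
# Back-of-the-envelope programming time for the configuration above:
# serial configuration at 1 bit / 50 ns, ~400 instruction bits per CLB.
BITS_PER_CLB = 400
N_CLBS = 2000
NS_PER_BIT = 50

total_bits = BITS_PER_CLB * N_CLBS                # 800,000 bits
program_time_ms = total_bits * NS_PER_BIT / 1e6   # 40 ms
```

Forty milliseconds is an eternity relative to a single cycle, which is one reason reconfiguration cost matters so much in this design space.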
Slide 11:
The dedicated fast carry logic is an important feature of XC4000 devices; it can speed arithmetic and counting up into the 70 MHz range. No general interconnect is involved in the carry chain: the extra carry output is passed directly to the function generator in the adjacent CLB. The fast carry logic greatly increases the efficiency and performance of adders, subtractors, accumulators, comparators, and counters. However, synthesis generally cannot map to the fast carry logic, so we instead use macros (e.g., adders and counters) that explicitly make use of this resource.
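A sketch of the ripple-carry structure this hardware implements: per bit, the sum can come from a function generator, while the carry propagates on the dedicated chain rather than through general interconnect.

```python
# Bit-level ripple-carry addition, as the dedicated chain computes it.
# Operands are little-endian bit lists (bit 0 first).
def ripple_add(a_bits, b_bits, carry_in=0):
    """Return (sum_bits, carry_out) for two equal-length bit lists."""
    carry = carry_in
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)             # sum: fits in a LUT
        carry = (a & b) | (carry & (a ^ b))   # carry: dedicated logic
    return out, carry

# 5 + 3 = 8 with 4-bit operands (little-endian):
s, c = ripple_add([1, 0, 1, 0], [1, 1, 0, 0])
```

The point of the dedicated chain is that the carry recurrence on the second line never pays general-interconnect delay, so the critical path per bit is very short.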
Slide 12:
One important feature of XC4000 devices is that the LUTs can be used as
RAM. Although this is less dense than regular RAM, it is much denser than
flip-flops, and each CLB can be configured into several operating modes:

Supported RAM modes:

                16 x 1   16 x 2   32 x 1   Edge-triggered   Level-triggered
  Single-port    yes      yes      yes         yes               yes
  Dual-port      yes       -        -          yes                -
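The idea is that the same 16 configuration bits that hold a truth table can serve as a small writable memory. A minimal sketch of the 16 x 1 single-port mode:

```python
# Sketch of a LUT used as a 16 x 1 RAM: the truth-table bits become
# writable storage addressed by the four LUT inputs.
class LutRam16x1:
    def __init__(self):
        self.bits = [0] * 16   # the LUT's 16 configuration bits

    def write(self, addr, bit):
        self.bits[addr] = bit

    def read(self, addr):
        return self.bits[addr]

ram = LutRam16x1()
ram.write(5, 1)
```

Sixteen bits of state from one LUT, versus one bit from a flip-flop, is the density argument the slide is making.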
Slide 13:
As we have already emphasized, there are tradeoffs among flexibility, area,
and delay in interconnect layout. This slide shows a clear hierarchy of
wires: for instance, a 'Double' is a wire spanning 2 CLBs, while a 'Quad'
spans 4. There are two tri-state buffers per CLB hanging on the long lines.
Note that each column carries several global clock lines and a carry chain.
Therefore the vertical (bottom-to-top) direction is preferred for the
datapath, and the horizontal direction for data flow (the tri-state-buffered
long lines run horizontally). The darker lines are additions in the
XC4000EX series parts; it can be seen that larger devices require more
interconnect and control signals (four more global clock wires).
Slide 14:
This slide gives a more detailed interconnect layout of the XC4000. Each CLB has two tri-state-buffered long lines, possibly suitable for global buses. One common scheme for routing lines spanning multiple CLBs is to transpose them with each other, which ensures that every CLB is identical. This is helpful for mapping, layout, programming, and CAD, and it may also help reduce cross-coupling noise.
Slide 15:
This table summarizes the area and delay parameters for some conventional
FPGAs. We can see that the area per 4-LUT is around 600K lambda^2 and
generally less than 1M lambda^2. The ORCA 2C takes slightly more than 1M
lambda^2 per 4-LUT because it supports a larger maximum array size and more
datapath operations per PLC. The last four data points come from our area
model (we'll see the details of the scaling in Lecture 12). In particular,
the model shows that we expect the larger device families to require more
area per 4-LUT due to their greater interconnect requirements. The parameter
"cycle" is the delay of communication between two adjacent CLBs.
Slide 16:
A basic relationship between the areas of logic, memory, and interconnect: Area(interconnect) = 10 x Area(memory) = 100 x Area(logic).
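These ratios imply that interconnect dominates the die. A quick computation under the 100 : 10 : 1 relationship above:

```python
# Area fractions implied by Area(interconnect) = 10 x Area(memory)
#                                              = 100 x Area(logic).
LOGIC, MEMORY, INTERCONNECT = 1, 10, 100
total = LOGIC + MEMORY + INTERCONNECT

interconnect_frac = INTERCONNECT / total   # ~90% of the area
memory_frac = MEMORY / total               # ~9%
logic_frac = LOGIC / total                 # <1%
```

Roughly 90% of the silicon goes to interconnect, about 9% to configuration memory, and under 1% to the logic itself.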
Slide 17:
Longer communication pays a larger delay penalty on the interconnect. We can expect interconnect to account for most of the delay (around 70%).
Slide 18:
This is an abstract-level description of the flow of FPGA programming.
Slide 19:
We can start with standard tools such as SIS and Synopsys for logic
optimization and technology-independent mapping. One important issue is
adapting the cost models to map properly to FPGAs, since the relative costs
of logic functions differ from those in custom chips. For instance, since
flip-flops are already built into each CLB, we should pipeline as much as
possible. Another example: one-hot encoded FSMs are preferable to densely
encoded FSM state bits because FFs are relatively inexpensive.
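The flip-flop cost of the two FSM encodings can be made concrete (a sketch of the counting argument, not of any tool's encoder):

```python
import math

# Flip-flops required by each FSM state encoding.
def one_hot_ffs(n_states):
    return n_states                        # one FF per state

def dense_ffs(n_states):
    return math.ceil(math.log2(n_states))  # binary-encoded state bits

# For a 12-state FSM: 12 FFs one-hot vs. 4 FFs densely encoded.
```

One-hot spends more FFs but gives much shallower next-state logic; since the FFs in each CLB are essentially free, one-hot usually wins on an FPGA even though it would be wasteful in a custom chip.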
Slide 20:
This slide shows an example of LUT mapping. We usually first pack logic
into LUTs, and then try to build logic elements out of these LUTs and the
surrounding resources.
Slide 21:
Placement typically starts with min-cut partitioning to minimize wiring
bisection bandwidth. Essentially, one must solve a constraint-satisfaction
problem to get a good placement: maximize locality of communication, match
the wiring resources, and minimize delay. Since this is an NP-hard problem,
solving it optimally is impractical; state-of-the-art solutions usually
employ simulated annealing.
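A toy simulated-annealing placer illustrates the approach the slide names (this is a sketch under simplifying assumptions: two-pin nets, Manhattan wirelength as the only cost, cell swaps as the only move):

```python
import math, random

def wirelength(placement, nets):
    """Total Manhattan length of two-pin nets; placement maps cell -> (x, y)."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

def anneal(placement, nets, temp=5.0, cooling=0.95, steps=2000, seed=0):
    """Swap random cell pairs; accept uphill moves with decaying probability."""
    rng = random.Random(seed)
    cells = list(placement)
    cost = wirelength(placement, nets)
    for _ in range(steps):
        a, b = rng.sample(cells, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost                  # accept the swap
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo
        temp *= cooling                      # cool the schedule
    return placement, cost
```

The early high-temperature phase lets the placer climb out of local minima; as the temperature decays, it converges to greedy improvement. Real placers use the same skeleton with far richer cost functions (wiring-resource matching, timing).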
Slide 22:
The general approach to routing is the standard two-stage method of global
routing followed by detailed routing. This separates two distinct problems:
balancing the densities of the routing channels, and assigning specific
wire segments to each connection. Since the routing tools don't generally
do a good job on their own, people often resort to manual placement to get
the highest density and performance. Compared to the current state of the
art in autoplacement, manual placement improves both the quality of the
result (minimizing critical paths) and the running time, by avoiding random
search.
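The first (global) stage can be sketched as a channel-balancing problem. The toy router below greedily assigns each net to the least-loaded channel; it is an illustration of density balancing only and ignores net spans, pins, and segment assignment (the detailed-routing stage).

```python
# Toy global router: balance channel densities before detailed routing.
def global_route(nets, n_channels):
    """Assign each net id to the least-congested channel so far."""
    load = [0] * n_channels
    assignment = {}
    for net in nets:
        ch = load.index(min(load))   # pick the least-loaded channel
        assignment[net] = ch
        load[ch] += 1
    return assignment, load
```

Detailed routing then has the easier job of binding each net in a channel to specific wire segments, which is exactly why the two problems are separated.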
Slide 23:
This is a summary of the topics covered in this lecture. As we have
seen, interconnect is the dominant contributor to silicon area and delay.
Further, there is a big tradeoff between reconfigurability and area, and
limited interconnect resources make mapping difficult and time-consuming.