With continued scaling, all transistors are no longer created equal. The delay of a length 4 horizontal routing segment at coordinates (23,17) will differ from one at (12,14) in the same FPGA and from the same segment in another FPGA. The vendor tools give conservative values for these delays, but knowing exactly what these delays are can be invaluable. In this paper, we show how to obtain this information, inexpensively, using only components that already exist on the FPGA (configurable PLLs, registers, logic, and interconnect). The techniques we present are general and can be used to measure the delays of any resource on any FPGA with these components. We provide general algorithms for identifying the set of useful delay components, the set of measurements necessary to compute these delay components, and the calculations necessary to perform the computation. We demonstrate our techniques on the interconnect for an Altera Cyclone-III (65nm). As a result, we are able to quantify over a 100ps spread in delays for nominally identical routing segments on a single FPGA.
Timing Extraction identifies the delay of fine-grained components within an FPGA. From these computed delays, the delay of any path can be calculated. Moreover, a comparison of the fine-grained delays allows a detailed understanding of the amount and type of process variation that exists in the FPGA. To obtain these delays, Timing Extraction measures, using only resources already available in the FPGA, the delay of a small subset of the total paths in the FPGA. We apply Timing Extraction to the Logic Array Block (LAB) on an Altera Cyclone III FPGA to obtain a view of the delay down to near individual LUT granularity, characterizing components with delays on the order of a few hundred picoseconds with a resolution of ±3.2 ps. This information reveals that the 65nm process used has, on average, random variation of σ/μ=4.0% with components having an average maximum spread of 83ps. Timing Extraction also shows that as VDD decreases from 1.2V to 0.9V in a Cyclone IV 60nm FPGA, paths slow down and variation increases from σ/μ=4.3% to σ/μ=5.8%, a clear indication that lowering VDD magnifies the impact of random variation.
Nanowire building blocks provide a promising path to small feature size and thus the ability to more densely pack logic. However, the small feature size and novel, bottomup manufacturing process will exhibit extreme variation and produce many devices that operate outside acceptable operating ranges. One-mapping-ﬁts-all, prefabrication assignment of logical functions to physical transistors that exhibit high threshold variation will not work—combining the wide range of physical variation in transistor threshold voltage with the wide range of fanouts in the design produces an unworkably large composite range of possible delays. Nonetheless, by carefully matching the fanout of each net to the physical threshold voltages of devices after fabrication, it is possible to reduce the net range of path delays sufficiently to achieve high system yield. By adding a modest amount of extra resources, we achieve 100% yield for systems built out of devices with 38% variation, the ITRS prediction for threshold variation in 5 nm transistors. Moreover, for these systems, we maintain delay, energy and area close to the variation-free nominal case.
In nanowire-based logic, the semiconducting material (e.g., Si, GaN, SiGe) is grown into individual nanowires rather than being part of the substrate. This offers us the opportunity to stack multiple layers of nanowires to create a three-dimensional logic structure which has high quality semiconductors in all vertical layers. The authors detail a feasible three-dimensional programmable logic architecture which can plausibly be realized from layers of semiconducting nanowires, making only modest assumptions about the control and placement of individual nanowires in the assembly. This shows a natural path for continuing to scale areal logic density once nanowire pitches approach fundamental limits. The authors show that the three dimensional systems are volumetrically efficient, with the surface area reducing roughly in proportion to the number of vertical layers. The authors further show that, on average, delay is reduced 18% from compact layout in three dimensions. For only a 20% area impact, the authors show how to avoid adding any manufacturing steps to physically isolate portions of nanowire layers
A key challenge facing nanotechnologies will be controlling nanoarrays, two orthogonal sets of nanowires that form a crossbar, using a moderate number of mesoscale wires. Three methods have been proposed to use mesoscale wires to control individual nanowires. The first is based on nanowire differentiation during manufacture, the second makes random doped connections between nanowires and mesoscale wires, and the third, a mask-based approach, interposes high-K dielectric regions between nanowires and mesoscale wires. All three addressing schemes involve a stochastic step in their implementation. In this paper, we analyze the mask-based approach and show that a large number of mesoscale control wires is necessary for its realization.
A key challenge that face nanotechnologies is controlling the uncertainty introduced by stochastic self-assembly. In this paper we explore architectural and manufacturing strategies to cope with this uncertainty when assembling nanoarrays, crossbars composed of two orthogonal sets of coded parallel nanowires. Because the encodings of nanowires that are assembled into a nanoarray cannot be predicted in advance, a discovery process is needed and specialized decoding circuitry must be employed. We have developed a probabilistic method of analysis so that various design strategies can be evaluated.
Timing Extraction identifies the delay of fine-grained components within an FPGA. From these computed delays, the delay of any path can be calculated. Moreover, a comparison of the fine-grained delays allows a detailed understanding of the amount and type of process variation that exists in the FPGA. To obtain these delays, Timing Extraction measures, using only resources already available in the FPGA, the delay of a small subset of the total paths in the FPGA. We apply Timing Extraction to the Logic Array Block (LAB) on an Altera Cyclone III FPGA to obtain a view of the delay down to near individual LUT SRAM cell granularity, characterizing components with delays on the order of tens to a few hundred picoseconds with a resolution of ±3.2 ps, matching the expected error bounds. This information reveals that the 65 nm process used has, on average, random variation of σ/μ = 4.0% with components having an average maximum spread of 83 ps. Timing Extraction also shows that as VDD decreases from 1.2 V to 0.9 V in a Cyclone IV 60 nm FPGA, paths slow down and variation increases from σ/μ = 4.3% to σ/μ = 5.8%, a clear indication that lowering VDD magnifies the impact of random variation.
Suitable architectures and paradigm shifts in assembly and usage models will make it possible to exploit the compactness and energy benefits of single-nanometer dimension devices and allow extending these structures into the third dimension without depending on top-down lithography to define the smallest feature sizes in a system.
A programmable logic array (PLA) needs its inputs available in both the positive and negative polarities. In lithographic-scale VLSI PLAs, programmable array logics (PALs) and programmable logic devices (PLDs) a buffer and inverter at the PLA input typically produces both polarities from a single polarity input. However, the extreme regularity required for sublithographic designs has driven nanoscale architectures to consider alternate solutions. Consequently, the authors compare three schemes: one based on producing both polarities in a restoration stage (selective inversion), one based on a local inversion stage and one based on a full dual-rail logic implementation. The authors develop a mapping flow for the dual-rail logic and quantify its cost in both logical product terms and physical implementation area and also develop area and timing models for all three schemes. Mapping benchmarks from the Toronto 20 set, the authors are able to show that the local inversion scheme is faster (less than one-fifth the latency), lower energy (one-half the energy) and comparable size to the selective inversion scheme and faster (less than half the latency), smaller (one-third of the area) and lower energy (one-ninth the energy) than the dual-rail scheme.
A key challenge facing nanotechnologies is learning to control uncertainty introduced by stochastic self-assembly. In this article, we explore architectural and manufacturing strategies to cope with this uncertainty when assembling nanoarrays, crossbars composed of two orthogonal sets of parallel nanowires (NWs) that are differentiated at their time of manufacture. NW deposition is a stochastic process and the NW encodings present in an array cannot be known in advance. We explore the reliable construction of memories from stochastically assembled arrays. This is accomplished by describing several families of NW encodings and developing strategies to map external binary addresses onto internal NW encodings using programmable circuitry. We explore a variety of different mapping strategies and develop probabilistic methods of analysis. This is the first article that makes clear the wide range of choices that are available.
This paper addresses the issue of reducing transient faults that affect instructions while they are in the instruction queue waiting to be executed. Previous work has shown that for an in-order processor, squashing instructions triggered by a cache miss can reduce the number of transient faults. This paper shows that for an outof-order processor, reducing the size of the instruction queue can have a bigger impact than more adaptive techniques such as fetch halting. Ongoing work will explore more effective techniques for selective fetch halting to provide a reduction in faults committed while having a minimal impact on performance.
Traditional solutions to variation and aging cost energy. Adding static margins to tolerate high device variance and potential device degradation prevent aggressive voltage scaling to reduce energy. Post-fabrication configuration, as we have in FPGAs, provides an opportunity to avoid the high costs of static margins. Rather than assuming worst-case device characteristics, we can deploy devices based on their fabricated or aged characteristics. This allows us to place the high-speed/leaky devices as needed on critical paths and slower/less-leaky devices on non-critical paths. As a result, it becomes possible to meet system timing requirements at lower voltages than conservative margins. To exploit this post-fabrication configurability, we must customize the assignment of logical functions to resources based on the resource characteristics of a particular component after it has been fabricated and the resource characteristics have been determined—that is, component-specific mapping. When we perform this component-specific mapping, we can accommodate extremely high defect rates (e.g., 10%), high variation (e.g., σ_Vt=38 %), as well as lifetime aging effects with low overhead. As the magnitude of aging effects increase, the mapping of functions to resources becomes an adaptive process that is continually refined in-system, throughout the lifetime of the component.