# ECE 252 / CPS 220 Advanced Computer Architecture I

# Lecture 4 Reduced Instruction Set Computers

Benjamin Lee Electrical and Computer Engineering Duke University

www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class\_ece252fall11.html



# Technology

- ROMs were faster than DRAMs

Instructions

- For complex instruction sets (CISC), datapath and controller were cheaper and simpler

- New instructions (e.g., floating point) supported without datapath modifications

Compatibility

- ISA compatibility across machine models were cheaper and simpler

- Fixing bugs in the controller was easier

In the 1970s, except for cheapest and fastest machines, all computers were microprogrammed

# Microprogramming in 1980's

# Increasing Complexity

- CISC ISAs led to subroutine and call stacks in microcode
- Fixing bugs in control conflicts with read-only nature of ROMs
- Technology
  - Advent of VLSI technology
  - Assumptions about ROM vs RAM speed became invalid

## Instructions

- Better compilers made complex instructions less important
- Compilers had difficulty using complex instructions

# Microarchitecture

- Microarchitectural innovations: pipelining, caches and buffers, etc.
- Make multiple-cycle execution of reg-reg instructions unattractive

# Modern Microprogramming

# Microprogramming is far from extinct

- Played crucial role in microprocessors of 1980's (e.g., Intel 386, 486)
- Plays assisting role in modern microprocessors

# Important role in early microprocessors (1980s)

- Example: Intel 386, 486

## Assisting role in modern microprocessors

- Example: AMD Athlon, Intel Core 2 Duo, IBM Power PC
- Most instructions executed directly (hardwired control)
- Infrequently-used, complicated instructions invoke microcode engine

# Assisting role in modern microprocessors

- Patchable microcode common for post-fabrication bug fixes
- Example: Intel Pentiums load microcode patches at bootup



# Instruction Management

- Shift away from fixed hardware microcode, microroutines
- Exploit fast RAM to build instruction cache of user-visible instructions
- Adapt contents of fast instruction memory to fit what application needs at the moment.

### Simple Instruction Set

- Shift away from complex CISC instructions, which are rarely used
- Enable hardwired, pipelined implementation

### Greater Integration

- In early 1980s, able to fit 32-bit datapath and small cache on die
- Allow faster operation by avoiding chip crossings in common case



# Seymore Cray, 1964

- Fast, pipelined machine with 60-bit words
- Ten functional units (floating-point, integer, etc.)

# Control

- Hardwired control, no microcoding
- Dynamic instruction scheduling with a scoreboard

# System Organization

- Ten peripheral processors for input/output
- Fast time-shared 12-bit integer ALU
- Very fast clock, 10MHz
- Novel Freon-based technology for cooling





# CDC 6600: Load/Store Architecture

# Separate instructions manipulate three register types

- 8, 60-bit data registers
- 8, 18-bit address registers
- 8, 18-bit index registers

# Arithmetic and logic instructions are reg-to-reg

- Hardwired control, no microcoding
- Dynamic instruction scheduling with a scoreboard

Ri ←(Rj) op (Rk)

# Only load and store instructions refer to memory



# Simplified Pipeline Implementation

- Three-address, register-register ALU instructions
- No implicit dependencies between inputs and outputs

# Multiple Outstanding Memory Accesses

- Decouple (a) setting of address register from (b) retreiving value from data register
- SW can schedule load for address register before use of loaded value
- SW can interleave independent instructions

# Multiple parallel functional units

- Example: two separate multipliers
- Functional units initially unpipelined
- CDC7600 pipelines functional units (foreshadowing later RISC machines)



#### Latency = (Instructions / Program) x (Cycles / Instruction) / (Seconds / Cycle)

- Instructions per program depends on source code, compiler technology, ISA
- Cycles per instruction (CPI) depends on the ISA and the microarchitecture
- Time per cycle depends on the microarchitecture, underlying technology

| <u>Microarchitecture</u> | CPI | Cycle Time |
|--------------------------|-----|------------|
| Microcoded               | >1  | short      |
| Single-cycle unpipelined | 1   | long       |
| Pipelined                | 1   | short      |

- This lecture presents single-cycle unpipelined microarchitecture



# **Combinational Circuits**







# Synchronous State Elements

- Flipflop, Register, Register file, SRAM, DRAM
- Edge-triggered elements where data is sampled on rising edge





#### Register









Highly ported register files difficult to design

- Almost all MIPS instructions have exactly 2 register source operands
- Intel's Itanium, GPR File has 128 registers with 8 read ports, 4 write ports!!!





Reads, Writes complete in one cycle

- Read can be done any time (i.e. combinational)
- Write is performed at the rising clock edge if it is enabled
- Write address and data must be stable at the clock edge

# **MIPS Instruction Set Architecture**

## **Processor State**

- 32 32-bit GPRs, R0 always contains a 0
- 32 single precision FPRs, may also be viewed as16 double precision FPRs
- FP status register, used for FP compares & exceptions
- PC, the program counter
- some other special registers

### Data types

- 8-bit byte, 16-bit half word
- 32-bit word for integers
- 32-bit word for single precision floating point
- 64-bit word for double precision floating point

### Load/Store style instruction set

- data addressing modes- immediate & indexed
- branch addressing modes- PC relative & register indirect
- byte addressable memory- big endian mode

# All instructions are 32 bits



- 1. Instruction Fetch
- 2. Decode and register access
- 3. ALU operation
- 4. Memory operation (optional)
- 5. Write back

And the computation of the address of the next instruction









# Conflicts in Merging Datapath



• ECE 252 / CPS 220





• ECE 252 / CPS 220

# Load/Store Instructions (Harvard)





# Conditional PC-relative branch

| 6      | 5  | 5 | 16     |            |
|--------|----|---|--------|------------|
| opcode | rs |   | offset | BEQZ, BNEZ |

# Unconditional register-indirect jumps

opcode rs

JR, JALR

## Unconditional absolute jumps

6 26 J, JAL

- PC-relative branches add offset×4 to PC+4 to calculate the target address (offset is in words): ±128 KB range
- Absolute jumps append target×4 to PC<31:28> to calculate the target address: 256 MB range
- jump-&-link stores PC+4 into the link register (R31)
- All Control Transfers are delayed by 1 instruction
  we will worry about the branch delay slot later

# Conditional Branches (BEQZ, BNEZ)





















Hardware control is pure combinational logic

# ALU Control, Immediate Extension



## D U Hardwired Control Table

| Opcode              | ExtSel             | BSrc | OpSel | MemW | RegW | WBSrc | RegDst | PCSrc |
|---------------------|--------------------|------|-------|------|------|-------|--------|-------|
| ALU                 | *                  | Reg  | Func  | no   | yes  | ALU   | rd     | pc+4  |
| ALUi                | sExt <sub>16</sub> | Imm  | Op    | no   | yes  | ALU   | rt     | pc+4  |
| ALUiu               | uExt <sub>16</sub> | Imm  | Op    | no   | yes  | ALU   | rt     | pc+4  |
| LW                  | sExt <sub>16</sub> | Imm  | +     | no   | yes  | Mem   | rt     | pc+4  |
| SW                  | sExt <sub>16</sub> | Imm  | +     | yes  | no   | *     | *      | pc+4  |
| BEQZ <sub>z=0</sub> | sExt <sub>16</sub> | *    | 0?    | no   | no   | *     | *      | br    |
| BEQZ <sub>z=1</sub> | sExt <sub>16</sub> | *    | 0?    | no   | no   | *     | *      | pc+4  |
| J                   | *                  | *    | *     | no   | no   | *     | *      | jabs  |
| JAL                 | *                  | *    | *     | no   | yes  | PC    | R31    | jabs  |
| JR                  | *                  | *    | *     | no   | no   | *     | *      | rind  |
| JALR                | *                  | *    | *     | no   | yes  | PC    | R31    | rind  |

RegDst = rt / rd / R31

BSrc = Reg / Imm WBSrc = ALU / Mem / PC PCSrc = pc+4 / br / rind / jabs



### Assumptions

- Clock period is sufficiently long to complete:

- 1. instruction fetch
- 2. decode and register access
- 3. ALU operation
- 4. data fetch if required
- 5. register write-back setup time
- $t_{C} > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$

### Update at end of Cycle

- Updates occur on rising edge of following clock
- Update architectural state
- Program counter (PC), register file, memory



All objects go through the same stages

No sharing of resources between any two stages

Propagation delay through all pipeline stages is equal

Scheduling of an object entering the pipeline is not affected by objects in other stages

These conditions hold for industrial assembly lines but do they hold for an instruction pipeline?



# Strategy

- First, build MIPS without pipelining, CPI = 1
- Then, add pipeline registers to reduce cycle time, maintaining CPI=1
- Clock period reduced by dividing the execution of an instruction into
- multiple cycles,  $t_c > max \{t_{IM}, t_{RF}, t_{ALU}, t_{DM}, t_{RW}\} (= t_{DM} probably)$  However, CPI will increase unless instructions are pipelined



# Dividing Datapath into Stages

Suppose memory is slower than other stages.

Since slowest stage determines the clock, it may be possible to combine stages without loss of performance

$$\begin{array}{ll}t_{IM} &= 10 \text{ units} \\ t_{DM} &= 10 \text{ units} \\ t_{ALU} &= 5 \text{ units} \\ t_{RF} &= 1 \text{ unit} \\ t_{RW} &= 1 \text{ unit} \end{array}$$





- Write-back requires much less time, combine with memory access -  $t_c > max \{t_{IM}, t_{RF} + t_{ALU}, t_{DM} + t_{RW}\} = t_{DM} + t_{RW}$ 



| Assumptions                                                                            | Unpipelined t <sub>C</sub> | Unpipelined t <sub>c</sub> | Speedup |
|----------------------------------------------------------------------------------------|----------------------------|----------------------------|---------|
| $t_{IM} = t_{DM} = 10;$<br>$t_{ALU} = 5,$<br>$t_{RF} = t_{RW} = 1$<br>4-stage pipeline | 27                         | 10                         | 2.7     |
| $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RV}$<br>4-stage pipeline                      | <sub>N</sub> = 5 25        | 10                         | 2.5     |
| $t_{IM} = t_{DM} = t_{ALU} = t_{RF} = t_{RV}$<br>5-stage pipeline                      | <sub>N</sub> = 5 25        | 5                          | 5.0     |

Higher speedup possible w/ more pipeline stages



Microcoding became less attractive as gap between RAM and ROM speeds

Complex instruction sets difficult to pipeline, so it was difficult to increase performance as gate count grew

Processor performance depends on (a) instructions per program, (b) cycles per instruction and (c) time per cycle

Load/Store RISC instruction sets designed for efficient, pipelined implementations



These slides contain material developed and copyright by

- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)