# Rethinking DRAM Power Modes for Energy Proportionality

Krishna Malladi<sup>1</sup>, Ian Shaeffer<sup>2</sup>, Liji Gopalakrishnan<sup>2</sup>, David Lo<sup>1</sup>, Benjamin Lee<sup>3</sup>, Mark Horowitz<sup>1</sup>

Stanford University<sup>1</sup>, Rambus Inc<sup>2</sup>, Duke University<sup>3</sup>

ktej@stanford.edu

### **Main Memory in Datacenters**



- Server power is the main energy bottleneck in datacenters
  - PUE of ~1.1 → the rest of the system is energy efficient
- Significant main memory (DRAM) power
  - 25-40% of server power across all utilization points
  - Low dynamic range  $\rightarrow$  No energy proportionality
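One way to read "low dynamic range" is as the fraction of peak power that actually scales with load. The sketch below assumes that definition and a linear power model; the function names and the 8 W / 10 W figures are hypothetical, chosen only to illustrate the idea, and are not measurements from this work.

```python
def dynamic_range(p_idle_w: float, p_peak_w: float) -> float:
    """Fraction of peak power that varies with utilization (assumed definition)."""
    return (p_peak_w - p_idle_w) / p_peak_w


def power_at(utilization: float, p_idle_w: float, p_peak_w: float) -> float:
    """Power at a given utilization in [0, 1], assuming a linear model."""
    return p_idle_w + utilization * (p_peak_w - p_idle_w)


# Hypothetical memory subsystem: 8 W when idle, 10 W at peak.
P_IDLE_W, P_PEAK_W = 8.0, 10.0
print(f"dynamic range: {dynamic_range(P_IDLE_W, P_PEAK_W):.0%}")
for util in (0.1, 0.5, 1.0):
    share = power_at(util, P_IDLE_W, P_PEAK_W) / P_PEAK_W
    print(f"utilization {util:.0%} -> {share:.0%} of peak power")
```

An energy-proportional component would have a dynamic range close to 100%, so its power share would track utilization.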

## Outline

- Inefficiencies of DRAM interfaces
- Energy-proportionality via fast DRAM interfaces
  - MemBlaze
  - MemCorrect
  - MemDrowsy

## Inefficiencies of DRAM Interfaces



- DDR3 is optimized for high bandwidth
  - High-speed interface with DLLs, CLKs, ODTs
  - Very high static power in active-idle
- Hard to power down to deep states
  - Impractically long wakeup time to power the interface back up
  - Insufficient idleness in workloads  $\rightarrow$  significant active-idle time



#### **DRAM Power**

- Reduce active-idle power
- Reduce time in active-idle
- Increase time in power-down
- Reduce power-down power

[Figure legend: Active, Active-Idle, Powerdown]



## **DRAM Interfaces**



Bits are short: the sampling window is only 625ps

- Data (DQ) and Clock (CLK) signals forwarded to DRAM
- Write data aligned to Clock edges
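As a quick sanity check on the 625ps figure: if the sampling window is taken to be one full bit time (an assumption, since the slide does not state the data rate), it corresponds to 1600 MT/s. The helper name below is illustrative only.

```python
def unit_interval_ps(data_rate_mtps: float) -> float:
    """Bit time in picoseconds for a data rate given in mega-transfers per second."""
    # 1 / (rate * 1e6 transfers/s) seconds = 1e6 / rate picoseconds
    return 1e6 / data_rate_mtps


for rate in (1333, 1600):
    print(f"{rate} MT/s -> {unit_interval_ps(rate):.0f} ps per bit")
```

Under the same assumption, the DDR3-1333 parts used in the evaluation would have a slightly wider window of about 750ps.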

## **DRAM Interfaces**



#### Dynamic chip variations affect Reads

- PVT variations  $\rightarrow$  Misaligned DQS and CLK signals
- Non-deterministic Read timing  $\rightarrow$  Incorrect sampling

## **DRAM Interfaces**



- Adjust delay to match chip temperature, voltage variations
- Align DQS, DQ to CLK

## Live with Slow Powerup

#### S/W mechanisms

- Batch requests, use a subset of ranks, or predict idleness
  - Degrades application performance
  - Degrades device density

#### H/W mechanisms

- Statically disable DLLs in BIOS  $\rightarrow$  statically lowers bandwidth
  - Worse performance
- Use current deep power modes
  - Long memory wake-up latency

#### With Wakeup = 1 µs



Can't win with long wakeups
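A rough way to see why long wakeups cannot win is to compute the shortest idle gap for which powering down saves energy. The model below is a simplified sketch, not the evaluation methodology: it assumes the rank idles for a timer period, then powers down, and that the wakeup itself burns active-idle power. All names and power numbers are hypothetical.

```python
def gap_energy_pj(gap_ns, wakeup_ns, timer_ns, p_idle_mw, p_down_mw):
    """Energy spent over one idle gap, in pJ (mW x ns = pJ)."""
    if gap_ns <= timer_ns:
        # The timer never expires, so the rank stays in active-idle.
        return gap_ns * p_idle_mw
    return (timer_ns * p_idle_mw                 # waiting for the timer
            + (gap_ns - timer_ns) * p_down_mw    # powered down
            + wakeup_ns * p_idle_mw)             # wakeup penalty


def break_even_gap_ns(wakeup_ns, timer_ns, p_idle_mw, p_down_mw):
    """Shortest idle gap for which powering down beats staying in active-idle."""
    if p_down_mw >= p_idle_mw:
        raise ValueError("power-down power must be below active-idle power")
    return timer_ns + wakeup_ns * p_idle_mw / (p_idle_mw - p_down_mw)


# Hypothetical numbers, chosen only to show the trend.
P_IDLE_MW, P_DOWN_MW, TIMER_NS = 100.0, 10.0, 10.0
for wakeup_ns in (1000.0, 100.0, 10.0):
    gap = break_even_gap_ns(wakeup_ns, TIMER_NS, P_IDLE_MW, P_DOWN_MW)
    print(f"wakeup {wakeup_ns:6.0f} ns -> break-even idle gap {gap:7.1f} ns")
```

In this model the break-even gap grows linearly with wakeup latency, and every request that finds the rank powered down also waits for the full wakeup.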

### Faster Wakeups



Powerup times should be much shorter

100ns

## Energy-Proportionality via Fast DRAM Interfaces

- Enabling deep powerdown needs low-latency wakeups
- Rearchitect the interface to reduce wakeup latency



## Fast Wakeup with MemBlaze

#### No DLL

- Periodic timing reference signal stores DRAM offset in controller
- Current-mode logic (CML) clocking has fewer variations

#### Fast turn-on of datapath

- Capacitive boosting quickly restores bias values

Exit latency ~ 10ns

#### MemBlaze DRAM + Controller



- Integrated into DRAMs; fabricated and tested
- More details in the paper

## **Silicon Results**





## Methodology

#### Workloads

- Memcached
  - Key/value pairs with 100B and 10KB values
  - Zipf popularity distribution with exponential inter-arrival times
- Yahoo! Cloud Serving Benchmark (YCSB), SPECjbb
- Multiprogrammed (MP) and Multithreaded (MT)
  - SPEC CPU2006, SPEC OMP2001, PARSEC
  - High BW (HB), Medium BW (MB), Low BW (LB)
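The Memcached request stream (Zipf key popularity, exponential inter-arrival times) can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the generator used in the evaluation: the function names, the skew of 1.0, and the mean gap are all assumptions.

```python
import random


def zipf_weights(num_keys: int, skew: float = 1.0) -> list:
    """Unnormalized Zipf weights: the key of rank k gets weight 1 / k**skew."""
    return [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]


def generate_requests(num_requests, num_keys, mean_gap_ns, skew=1.0, seed=0):
    """Return (timestamp_ns, key) pairs with Zipf keys and exponential gaps."""
    rng = random.Random(seed)
    keys = rng.choices(range(num_keys), weights=zipf_weights(num_keys, skew),
                       k=num_requests)
    now_ns = 0.0
    requests = []
    for key in keys:
        # expovariate takes the rate, i.e. 1 / mean inter-arrival time.
        now_ns += rng.expovariate(1.0 / mean_gap_ns)
        requests.append((now_ns, key))
    return requests


trace = generate_requests(num_requests=5, num_keys=100, mean_gap_ns=1000.0)
for timestamp_ns, key in trace:
    print(f"t = {timestamp_ns:9.1f} ns  key = {key}")
```

The gaps between consecutive timestamps are the idle periods that a power-down policy has to work with.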

#### Architecture

- 8 OoO Nehalem cores at 3GHz, 8MB shared L3 cache
- 32 GB DRAM, 2Gb DDR3-1333 chips
- Fast powerdown baseline, 15 cycle powerdown timer

### **MemBlaze Evaluation**



- 66% lower memory energy with MemBlaze fastlock
- No performance penalty





## Speculative Wakeup with MemCorrect



- Use deep power-down, which powers off the DLL and CLK
- Transfer speculatively before the long DLL recalibration

#### Error Detection/Correction

- Detector fires if power-down period accumulated large skew
- Corrector waits for recalibration before transfer
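The detector/corrector behaviour can be sketched as a simple decision: speculate when the accumulated skew is tolerable, otherwise wait for recalibration. The names, the skew threshold, and the assumption that a successful speculative transfer adds no delay are all hypothetical; the ~700ns recalibration time is the figure quoted on the MemDrowsy slide.

```python
from dataclasses import dataclass


@dataclass
class WakeupDecision:
    speculate: bool   # True: transfer immediately after wakeup
    delay_ns: float   # extra delay added before the transfer


def memcorrect_wakeup(skew_ps, tolerance_ps, recal_ns):
    """Detector: fire when the skew accumulated during power-down is too large."""
    if abs(skew_ps) <= tolerance_ps:
        return WakeupDecision(speculate=True, delay_ns=0.0)
    # Corrector: fall back to waiting for the DLL recalibration.
    return WakeupDecision(speculate=False, delay_ns=recal_ns)


def expected_delay_ns(p_correct, recal_ns):
    """Average added delay when timing is correct with probability p_correct."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be a probability")
    return (1.0 - p_correct) * recal_ns


for p in (0.5, 0.9, 0.99):
    print(f"p = {p:.2f} -> expected added delay {expected_delay_ns(p, 700.0):6.1f} ns")
```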

### **MemCorrect Evaluation**



- Vary probability of correct timing (p)
- 40% energy savings (esp. for datacenters)
- Degrades performance for high-BW apps
- Increases energy/bit





## Lazy Wakeup with MemDrowsy



#### Fast wakeup

- Wakeup from deep-powerdown
- Transfer at lower rate before DLL recalibration completes
- Reduced Sampling Rate
  - Lower data rate for READs during calibration time (~ 700ns)
  - Transfer each bit multiple times  $\rightarrow$  Wider sampling window
  - Eliminates timing uncertainty
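The effect of repeating each bit can be put in numbers. The sketch below assumes the 625ps bit time quoted earlier in the deck and an 8-beat burst (the DDR3 burst length); the function names are illustrative only.

```python
def sampling_window_ps(bit_time_ps, z):
    """Sampling window when each bit is transferred z times back to back."""
    if z < 1:
        raise ValueError("z must be at least 1")
    return bit_time_ps * z


def burst_time_ns(beats, bit_time_ps, z):
    """Time to move one burst when every beat is stretched by a factor of z."""
    return beats * sampling_window_ps(bit_time_ps, z) / 1000.0


BIT_TIME_PS, BEATS = 625, 8
for z in (1, 2, 4):
    window = sampling_window_ps(BIT_TIME_PS, z)
    burst = burst_time_ns(BEATS, BIT_TIME_PS, z)
    print(f"Z = {z}: window {window} ps, burst {burst:.1f} ns, data rate 1/{z}")
```

The wider window buys timing margin at the cost of a proportionally lower data rate, which only applies while recalibration is in progress.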

## **MemDrowsy Evaluation**



- Vary sampling reduction rate (Z)
- 40% energy savings for datacenter apps
- High Z harms both performance and energy/bit
  - Energy per bit increases from wake-ups, higher bus activity
  - Z=2 more realistic

### MemCorrect + MemDrowsy



- Combine MemCorrect and MemDrowsy
- If error detected, halve sampling rate instead of backoff
- ≤10% performance penalty
- 50% energy/bit savings
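To see why halving the sampling rate is cheaper than backing off, compare the expected time to deliver one burst after a wakeup under each policy. This is a back-of-the-envelope sketch under stated assumptions: a 5ns burst, a 700ns recalibration, and timing that is correct with probability p; the function names are hypothetical.

```python
def wait_for_recalibration_ns(p_correct, burst_ns, recal_ns):
    """MemCorrect alone: a detected error stalls until recalibration completes."""
    return burst_ns + (1.0 - p_correct) * recal_ns


def halve_rate_ns(p_correct, burst_ns, z=2):
    """Combined scheme: a detected error sends each bit z times instead."""
    return p_correct * burst_ns + (1.0 - p_correct) * z * burst_ns


BURST_NS, RECAL_NS = 5.0, 700.0
for p in (0.5, 0.9):
    backoff = wait_for_recalibration_ns(p, BURST_NS, RECAL_NS)
    drowsy = halve_rate_ns(p, BURST_NS)
    print(f"p = {p}: backoff {backoff:6.1f} ns, halved rate {drowsy:4.1f} ns")
```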

### Conclusion

- DDR3 is energy-disproportional
  - DRAMs dissipate high static power
- DDR3 interfaces are efficiency bottlenecks
  - High active-idle power
  - Long wake-ups from power modes
- Re-architect interfaces with MemBlaze
- Or use MemCorrect + MemDrowsy
  - Provide fast wake-up from power modes
  - Energy efficiency improves by 40-70%
  - Performance impact is  $\leq 10\%$

#### Thank you for your attention!

Questions?

ktej@stanford.edu