# CPR: Composable Performance Regression for Scalable Multiprocessor Models

#### Benjamin C. Lee

Computer Architecture Group Microsoft Research

#### Jamison Collins, Hong Wang

Microarchitecture Research Lab Intel Corporation

# (intel)

#### **David Brooks**

Engineering and Applied Sciences Harvard University







イロト イポト イヨト イヨト

International Symposium on Microarchitecture 11 November 2008

Benjamin C. Lee, et al 1 :: MICRO :: 11 Nov 08

# **Technology Trends**

- Moore's Law and increasing transistor densities
- Performance and power efficiency
- Transition to multi-core and parallelism



Technology Trends Simulation Challenges Simulation Paradigm

# **Simulation Challenges**

### Cycle-Accurate Simulation

- Accurately identifies trends in design space
- Tracks instructions' progress through microprocessor

### Simulation Costs

- Costs per simulation :: minutes, hours per design
- Number of simulations :: scales exponentially (m<sup>p</sup>)
  - *p*, *m* :: parameter count, resolution
- Costs per simulation :: scales superlinearity (n<sup>γ</sup>)
  - *n* cores,  $\gamma > 1$

・ 同 ト ・ ヨ ト ・ ヨ ト ・

| Motivation     | Technology Trends   |
|----------------|---------------------|
| Uniprocessor   |                     |
| Multiprocessor | Simulation Paradigm |

### Statistical Inference

- Construct inferential models from samples [ASPLOS'06]
- Use models as efficient surrogates for simulator



◆□▶ ◆□▶ ◆ □▶ ◆ □▶ ● □ ● の Q @

Motivation Uniprocessor Multiprocessor Technology Trends Simulation Challenges Simulation Paradigm

### Multiprocessor Inference

### • Expensive CMP Simulations

- Physical resource contention increases host cycles
- Logical resource contention increases simulated cycles
- Synchronization increases cost per simulated cycle

#### • Composable Performance Regression

- Leverage core models to minimize CMP simulations
- Core :: Uniprocessor performance
- Contention :: Shared resource contention
- Penalty :: Performance penalty from contention

く 同 と く ヨ と く ヨ と

Motivation Regression Theor Uniprocessor Model Evaluation Multiprocessor Evolutionary Desi

### Outline

#### **Motivation**

Technology Trends Simulation Challenges Simulation Paradigm

#### Uniprocessor

Regression Theory Model Evaluation Evolutionary Design

#### **Multiprocessor**

CPR Model Evaluation Scalability

・ 同 ト ・ ヨ ト ・ ヨ ト

Motivation Regression Theor Uniprocessor Model Evaluation Multiprocessor Evolutionary Desig

### Outline

#### Motivation

Technology Trends Simulation Challenges Simulation Paradigm

#### Uniprocessor

Regression Theory Model Evaluation Evolutionary Design

#### Multiprocessor

CPR Model Evaluation Scalability

イロト イポト イヨト イヨト

э

Motivation Regression Theory Uniprocessor Model Evaluation Multiprocessor Evolutionary Design

## **Regression Theory**

#### Statistical Inference

- Models relationships between data
- Requires initial data to train, formulate model
- Leverages correlation from initial data for prediction

#### Regression Models

- Low training costs (sample 300 from 4.3B designs)
- Accurate inference (1.5% median error)
- Efficient computation (100's of predictions per second)

・ 同 ト ・ ヨ ト ・ ヨ ト



### Formulation

- n simulated design samples, p design parameters
- Response :: Y design metrics (*e.g.*, performance)
- Predictor :: X design parameters (e.g., ROB, cache)

$$\mathbf{Y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \qquad \qquad \mathbf{X} = \begin{bmatrix} x_{11} & \dots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{np} \end{bmatrix}$$

- Coefficients ::  $\beta = [\beta_0, \dots, \beta_p]^T$
- Errors ::  $\varepsilon = [\varepsilon_1, \dots, \varepsilon_n]^T$  where  $\varepsilon_i \sim N(0, \sigma^2)$

$$\begin{aligned} \mathbf{Y} &= \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \\ F(\mathbf{Y}) &= G(\mathbf{X})\boldsymbol{\beta}_G + \boldsymbol{\varepsilon} \end{aligned}$$

◆□ ▶ ◆□ ▶ ◆ □ ▶ ◆ □ ▶ ◆ □ ● ● ○ ○ ○

Motivation Regression Theory Uniprocessor Model Evaluation Multiprocessor Evolutionary Design

## Prediction

### Requirements

- β known from least squares model training
- X known for a given set of queries

### Expected Response

- Response as weighted sum of predictor values
- Computed efficiently as matrix-vector product

$$E[\mathbf{Y}] = E[\mathbf{X}\beta + \varepsilon]$$
  
=  $E[\mathbf{X}\beta] + E[\varepsilon]$   
=  $\mathbf{X}\beta$ 

・ 同 ト ・ ヨ ト ・ ヨ ト

MotivationRegression TheoUniprocessorModel EvaluationMultiprocessorEvolutionary Des

## Experimental Methodology

### Intel Product Simulators

- Models consecutive generations of x86 µ-arch
- Supports dual-, quad-core architectures.

### Sampling Uniformly at Random (UAR)

- Parameter space includes predictors, ROB, caches
- 15 parameters, 4.3B designs
- Sample 300 designs for simulation

#### Statistical Framework

- R :: software environment for statistical computing
- Hmisc and Design packages [Harrell]

・ 同 ト ・ ヨ ト ・ ヨ ト

| Motivation     | Regression Theory |
|----------------|-------------------|
| Uniprocessor   | Model Evaluation  |
| Multiprocessor |                   |

### **Benchmarks**

| Digi   | tal Home     |                                       |  |  |
|--------|--------------|---------------------------------------|--|--|
| 1      | audio        | audio conversion                      |  |  |
| 2      | video        | video compression                     |  |  |
| 3      | photo        | photoshop album                       |  |  |
| Gan    | nes          |                                       |  |  |
| 4      | unreal       | Unreal Tournament                     |  |  |
| 5      | halflife     | Half-Life, modified Quake engine      |  |  |
| 6      | homeworld    | Homeworld, three-dimensional movement |  |  |
| Mul    | timedia      |                                       |  |  |
| 7      | mentalray    | rendering, ray tracing                |  |  |
| 8      | painter      | raster graphics package               |  |  |
| 9      | tachyon      | ray tracing                           |  |  |
| Offi   | се           |                                       |  |  |
| 10     | outlook      | personal information manager          |  |  |
| 11     | access       | relational database management system |  |  |
| 12     | excel        | spreadsheet application               |  |  |
| Pro    | Productivity |                                       |  |  |
| 13     | md2          | OpenSSL cryptographic hash function   |  |  |
| 14     | encrypt      | file encryption                       |  |  |
| 15     | flash        | multimedia player                     |  |  |
| Server |              |                                       |  |  |
| 16     | specweb      | web server                            |  |  |
| 17     | tpcc         | on-line transaction processing        |  |  |
| 18     | specjapp     | J2EE 1.3 application servers          |  |  |

◆□ > ◆□ > ◆豆 > ◆豆 > ̄豆 \_ のへで

Motivation Regression Theor Uniprocessor Model Evaluation Multiprocessor Evolutionary Desi

### Uniprocessor Model Accuracy

- Obtain 50 additional random samples for validation
- Core :: 1.5% median error





< 🗇

ヨト・モート

Motivation Regression Theory Uniprocessor Model Evaluation Multiprocessor Evolutionary Design

# **Evolutionary Design**

#### Evolutionary Approach

- Optimize ProcX
- Design ProcY, enhancing ProcX with μ-arch features
- Re-construct models, accounting for  $\mu$ -arch features
- Optimize ProcY

### Case Study

- Consecutive generations of x86 μ-arch
- Improve FE (e.g., branch prediction)
- Improve MEM (e.g., prefetching)
- Improve OOO (e.g., memory disambiguation)

・ 同 ト ・ ヨ ト ・ ヨ ト

| Motivation     | Regression Theory   |
|----------------|---------------------|
| Uniprocessor   |                     |
| Multiprocessor | Evolutionary Design |

### **Evolving Caches**

#### Improve MEM: similar performance with smaller caches



Parameter Values for Varying Delay Targets

・ 同 ト ・ ヨ ト ・ ヨ ト

ъ



# **Evolving ROB**

- Improve FE: more instructions inflight, suggests larger ROB
- Improve MEM: fewer cache misses, suggests smaller ROB



Motivation CPR Uniprocessor Model E Multiprocessor Scalabil

#### CPR Model Evaluation Scalability

### Outline

### Motivation

Technology Trends Simulation Challenges Simulation Paradigm

#### Uniprocessor

Regression Theory Model Evaluation Evolutionary Design

#### **Multiprocessor**

CPR Model Evaluation Scalability

・ロット (雪) (山) (山)

ъ

### Composable Performance Regression

- CPR :: build separate core, contention, penalty models
- Requires simulations  $N_{uni} > N_{con} \ge N_{pen}$
- Suppose core sims require  $T_1$ , multi-core sims require  $T_1n^{\gamma}$



ヘロン 人間 とくほ とくほ とう

3



### **CPR:** Core

- Train with uniprocessor sims from full parameter space
- Estimate per core delay from all design parameters



◆□ ▶ ◆□ ▶ ◆ □ ▶ ◆ □ ▶ ● □ ● ● ● ●

## **CPR:** Contention

- Train with CMP, cache-only sims from reduced subspace
- Estimate cache hits/misses from shared cache parameters



◆□ > ◆□ > ◆臣 > ◆臣 > ─ 臣 ─ のへで



### **CPR:** Penalty

- Train with composed predictions, few CMP sims from full space
- Estimate CMP core delays from core, contention predictions



ヘロン 人間 とくほ とくほ とう

3

Motivation Uniprocessor Multiprocessor

CPR Model Evaluation Scalability

### **Benchmarks**

| Dual-Core Benchmarks |           |           |  |  |
|----------------------|-----------|-----------|--|--|
| Set                  | .1        | .2        |  |  |
| 1                    | painter   | homeworld |  |  |
| 2                    | access    | mentalray |  |  |
| 3                    | specjapp  | specweb   |  |  |
| 4                    | homeworld | tachyon   |  |  |
| 5                    | dense     | flash     |  |  |

| Quad-Core Benchmarks |         |           |          |           |  |
|----------------------|---------|-----------|----------|-----------|--|
| Set                  | .1      | .2        | .3       | .4        |  |
| 1                    | dense   | excel     | flash    | md2       |  |
| 2                    | video   | specjapp  | specweb  | tachyon   |  |
| 3                    | excel   | homeworld | audio    | unreal    |  |
| 4                    | outlook | encrypt   | halflife | homeworld |  |
| 5                    | painter | mentalray | outlook  | encrypt   |  |

Benjamin C. Lee, et al 22 :: MICRO :: 11 Nov 08

Motivation CPR Uniprocessor Model Evaluation Multiprocessor Scalability

### Multiprocessor Model Accuracy

- Dual-core :: 6.6% median error
- Quad-core :: 4.8% median error





< 🗇 >

▶ < ∃ >



# **Scaling Trends**

- Lower bound CPR costs 0.33x of naïve costs
- Approach lower bound as uniprocessor models built



э

### Conclusion

#### Inference in Industry

- Effective inference for x86  $\mu$ -arch
- 1.5% median errors relative to simulation
- Evolutionary design for new features across generations

#### • Composable Performance Regression

- Leverage core models to minimize CMP simulations
- Construct separate core, contention, penalty models
- 4.8 to 6.6% median errors for dual-, quad-core
- 0.33x training costs of prior approaches

伺 とくき とくき とう

### **Future Directions**

### Efficient Multiprogramming Analysis

Evaluate combinations without modeling every combination

#### Multi-Threaded Workloads

- Extend for homogeneous, heterogeneous threads.
- Account for synchronization events

#### Many-Core Architectures

- Construct models without many-core simulators
- Consider other shared resources (e.g., network)

(雪) (ヨ) (ヨ)

# CPR: Composable Performance Regression for Scalable Multiprocessor Models

#### Benjamin C. Lee

Computer Architecture Group Microsoft Research

#### Jamison Collins, Hong Wang

Microarchitecture Research Lab Intel Corporation

# (intel)

#### David Brooks

Engineering and Applied Sciences Harvard University





イロト イポト イヨト イヨト

International Symposium on Microarchitecture 11 November 2008

Benjamin C. Lee, et al 27 :: MICRO :: 11 Nov 08