#### ESE5320: System-on-a-Chip Architecture

Day 27: December 6, 2023 Real Time (Optional)

ESE5320 Fall 2023 -- DeHon



#### Today

#### Real Time

- Part 1: Demands
- Part 2: Challenges
  - Algorithms
  - Architecture
- Part 3: Disciplines to achieve

#### Message

- Real-time applications demand a different discipline from best-effort tasks
- Look more like synchronous circuits
- Can sequentialize, like a processor
  - But must avoid/rethink typical general-purpose processor common-case optimizations

#### Real Time

- "Real" refers to physical time
  - Connection to Real or Physical World
- Contrast with "virtual" or "variable" time
- Handles events with absolute guarantees on timing

#### **Real-Time Tasks**

- What timing guarantees might you like for the following tasks?
  - Turn steering wheel on a drive-by-wire car
    - Delay from turning the wheel until it is recognized and the car turns
  - Self-driving car detects an object in its path
    - Delay from object appearing to detection
  - Pacemaker stimulates your heart
  - Video playback (frame to frame delay)

#### **Real-Time Guarantees**

- Attention/processing within fixed interval
  - Sample new value every XX ms
  - Produce new frame every 30 ms
  - Both: schedule to act and complete action
- Bounded response time
  - Respond to keypress within 20 ms
  - Detect object within 100 ms
  - Return search results within 200 ms



#### Real-Time Response

- What if your car gave you a spinning wait wheel for 5 seconds when you
  - Turned the wheel?
  - Stepped on the brakes?




#### Synchronous Circuit Model

- A simple synchronous circuit is a good "model" for a real-time task
  - Run at fixed clock rate
  - Take input every "cycle" (application cycle)
  - Produce output every "cycle" (application cycle)
  - Complete computation between input and output
  - Designed to run at fixed frequency
    - Critical path meets frequency requirement

#### Preclass 2

- (Circuit) cycle time at which this could operate?
- Assume clocked at 100 Hz (application cycle)
- Worst-case delay from (L)eft press to change in heading (posx, posy)?

[Figure: controller circuit from Lpress through clocked stages into posx, posy; three intervals annotated 10 ms each]

#### Historically

- Real-time concerns grew up in EE
  - Because an analog circuit was the only way to meet frequency demands
  - ...later a dedicated digital circuit...
- Applications
  - Signal processing, video, control, ...

#### **Technological Change**

- Area units for spatial design shown (Preclass 2c)
- Fraction of processor capacity required (Preclass 2d)
- Why might we prefer using a processor to using the spatial circuit?
  - Hint: What do Preclass 2c,d suggest?

#### Performance Scaling

- As circuit speeds increased
  - Can meet real-time performance demands with heavy sequentialization
- Circuit and processor clocks
  - From MHz to GHz
- Many real-time task rates unchanged
  - 44 kHz audio, 33 frames/second video
- Even a 100 MHz processor
  - Can implement audio in a small fraction of its computational throughput capacity
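To make the "small fraction" claim concrete, here is a minimal sketch of the cycle budget per application cycle. The function name `cycles_per_event` is hypothetical, and the model assumes one useful cycle per clock tick with no other overheads.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per application cycle (one audio sample or one
 * video frame), rounded down: clock_hz / task_rate_hz.
 * Hypothetical helper for illustration only. */
uint64_t cycles_per_event(uint64_t clock_hz, uint64_t task_rate_hz)
{
    assert(task_rate_hz > 0);
    return clock_hz / task_rate_hz;
}
```

Under these assumptions, `cycles_per_event(100000000, 44000)` gives a budget of 2272 cycles per audio sample on a 100 MHz processor, so an audio task needing a few tens of cycles per sample uses only a small share of the machine.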


#### HW/SW Co-Design

- Computer engineers know we can implement anything as hardware or software
- Want freedom to move between hardware and software to meet requirements
  - Performance, costs, energy

#### Real-Time Challenge

- Meet real-time demands / guarantees
  - Economically using programmable architectures
- Sequentialize and share resources with deterministic, guaranteed timing
- Spatial (all hardware, HLS synthesized) implementations are good at meeting real-time guarantees, but may be bigger than necessary

Day 3


#### **CHALLENGES**


#### **Processor Data Caches**

- Traditional processor data caches are a heuristic instance of this
  - Add a small memory local to the processor
    - It is fast, low latency
  - Store anything fetched from large/remote memory in local memory
    - Hoping for reuse in near future
  - On every fetch, check local memory before going to large memory

Day 3

#### **Processor Data Caches**

- Demands more than a small memory
  - Need to sparsely store address/data mappings from large memory
  - Makes it more expensive in area/delay/energy than a simple memory of the same capacity
- Don't need explicit data movement
- Cannot control when data is moved/saved
  - Bad for determinism
- Limited ability to control what stays in small memory simultaneously

#### Preclass 3: Processor Cache Timing

- Assume
  - Cache miss (go to large memory) takes 10 cycles
  - Cache hit (small memory) takes 1 cycle
  - Start with empty cache
- Due to memory delay, how long to execute:

 $\begin{array}{lll} b = a[0] + a[1]; & b = a[i] + a[i]; \\ c = a[1] + a[2]; & c = a[k] + a[i]; \\ d = a[2] + a[0]; & d = a[m] + a[n]; \end{array}$ 
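One way to explore this question is a small simulation of the stated timing model. The sketch below is an illustration under extra assumptions that the slide does not state: each array element sits in its own cache line, nothing is ever evicted, and only memory delay is counted. `memory_cycles` and `MAX_INDEX` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MISS_CYCLES 10   /* cache miss, from the preclass assumptions */
#define HIT_CYCLES   1   /* cache hit, from the preclass assumptions */
#define MAX_INDEX   64   /* hypothetical bound on array indices */

/* Total memory cycles for the reads a[idx[0]], a[idx[1]], ...,
 * starting from an empty cache. Assumes one array element per cache
 * line and no evictions. */
unsigned memory_cycles(const size_t *idx, size_t n)
{
    bool cached[MAX_INDEX] = { false };
    unsigned total = 0;
    for (size_t k = 0; k < n; k++) {
        assert(idx[k] < MAX_INDEX);
        if (cached[idx[k]]) {
            total += HIT_CYCLES;       /* already in small memory */
        } else {
            total += MISS_CYCLES;      /* fetch from large memory */
            cached[idx[k]] = true;
        }
    }
    return total;
}
```

With the constant indices of the left column, passed as `{0, 1, 1, 2, 2, 0}`, this model gives 33 cycles every time. With variable indices the same six reads range from 15 cycles (all the same element) to 60 cycles (all distinct), which is the data-dependent timing the question is probing.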


#### **Processor Data Caches**

- Traditional processor data caches are a heuristic instance of this
  - Store anything fetched from large/remote memory in local memory
    - Hoping for reuse in near future
  - On every fetch, check local memory before going to large memory
  - Stall processor while waiting for data

#### Scratchpad

- Recall, scratchpad memory
  - Small
  - Explicitly managed (not dynamic like cache)
- If data is moved (DMA) to scratchpad memory, access timing would be deterministic

 $\begin{array}{lll} b = a[0] + a[1]; & b = a[i] + a[i]; \\ c = a[1] + a[2]; & c = a[k] + a[i]; \\ d = a[2] + a[0]; & d = a[m] + a[n]; \end{array}$ 


#### Observe

- Instructions on "General Purpose" processors take a variable number of cycles

#### Preclass 4

- How many cycles?
  - sin, cos 100 cycles each
  - Assignments take 1 cycle

else

{sh=sin(heading);
ch=cos(heading);}


#### Preclass 5

- How many cycles?

```
// assumes a, b, sum are unsigned
sum=0;
for (i=0;i<32;i++) {
    // (0-(b%2)) is an all-ones mask when the low bit of b is 1, else 0
    sum+=(0-(b%2)) & a;
    b=b>>1;
    a=a<<1;
}
```

#### Observe

- Data-dependent branching, looping
  - Means variable time for operations

#### Algorithm

- What programming constructs are data-dependent (variable delay)?

#### Preclass 5

- How many cycles?

```
sum=0;
for (;b!=0;b=b>>1) {  // trip count depends on the highest set bit of b
     if (b%2==1)      // data-dependent branch
        sum+=a;
     a=a<<1;
}
```
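To compare the two Preclass 5 loops side by side, the following sketch wraps each in a function and counts loop iterations. The function names and the `iters` out-parameter are hypothetical additions for illustration, and the operands are assumed to be unsigned 32-bit values.

```c
#include <assert.h>
#include <stdint.h>

/* Data-dependent version: the trip count depends on the position of
 * the highest set bit in b. Returns the product (mod 2^32). */
uint32_t mul_variable(uint32_t a, uint32_t b, unsigned *iters)
{
    uint32_t sum = 0;
    assert(iters != 0);
    *iters = 0;
    for (; b != 0; b >>= 1) {
        if (b % 2 == 1)
            sum += a;
        a <<= 1;
        (*iters)++;
    }
    return sum;
}

/* Fixed version: always 32 iterations and no branch in the loop body. */
uint32_t mul_fixed(uint32_t a, uint32_t b, unsigned *iters)
{
    uint32_t sum = 0;
    assert(iters != 0);
    *iters = 0;
    for (int i = 0; i < 32; i++) {
        sum += (0u - (b % 2u)) & a;   /* mask is all ones when low bit is set */
        b >>= 1;
        a <<= 1;
        (*iters)++;
    }
    return sum;
}
```

Both return the same product, but `mul_variable(6, 5, &n)` runs 3 iterations while `mul_fixed(6, 5, &n)` always runs 32. The fixed form gives up common-case speed in exchange for a run time that does not depend on the data.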


#### Two Challenges

1. Architecture: hardware has variable (data-dependent) delay
   - Esp. for general-purpose processors
     - Instructions take different numbers of cycles
2. Algorithm: computational specification has variable (data-dependent) operations
   - Different number of instructions

 $Time = \sum_{i} Cycles(i)$ 

#### **Programming Constructs**

- Conditionals: if/then/else
- Loops without compile-time determined bounds
  - While with termination expressions
  - For with data-dependent bounds
- Data-dependent recursion
- Interrupts
  - I/O events, time-slice
- Note: first three were issues for HLS
  - For the same reason; how did we address them?


#### Hardware Architecture

- Some typical (4710,5710) processor "optimizations" can cause variable delay
  - Caches
  - Branch prediction
  - Common-case optimizations
  - Pipeline stalls
  - Speculative issue


#### What can we do to make architecture more deterministic?

- Explicitly managed memory
- Eliminate branching (too severe?)
- Unpipelined processors
- Fixed-delay pipelines
  - Offline-scheduled resource sharing
  - Multi-threaded
- Deadlines

#### **DISCIPLINES TO ACHIEVE REAL-TIME**

#### **Explicitly Managed Memory**

[Figure: small memory (1 cycle) and large memory (20 cycles)]

- Make memory hierarchy visible
  - Use Scratchpad memories instead of caches
- Explicitly move data between memories
  - E.g. movement into local memory
- Already do for Register File in Processor
  - Load/store between memory and RF slot
  - ...but don't do for memory hierarchy
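A minimal sketch of what explicit movement into a scratchpad can look like in software. Everything here is an illustrative assumption: `memcpy` stands in for a DMA transfer, the `scratchpad` array stands in for on-chip memory, the capacity is made up, and the cycle model (20 cycles per word moved from large memory, 1 cycle per local access) is a crude reading of the 1-cycle/20-cycle memories above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_WORDS 256   /* hypothetical scratchpad capacity */
#define LOCAL_CYCLES    1   /* small-memory access */
#define REMOTE_CYCLES  20   /* large-memory access */

static int scratchpad[SCRATCH_WORDS];   /* stands in for on-chip memory */

/* Explicitly copy n words from large memory into the scratchpad.
 * Returns the modeled transfer cost, assuming one large-memory access
 * per word (a real DMA burst would likely be cheaper). */
unsigned load_scratchpad(const int *large, size_t n)
{
    assert(n <= SCRATCH_WORDS);
    memcpy(scratchpad, large, n * sizeof large[0]);
    return (unsigned)n * REMOTE_CYCLES;
}

/* Sum n words from the scratchpad. Every access is local, so the
 * modeled cost is n * LOCAL_CYCLES whatever the data or access order. */
int sum_scratchpad(size_t n, unsigned *cycles)
{
    int sum = 0;
    assert(n <= SCRATCH_WORDS);
    for (size_t i = 0; i < n; i++)
        sum += scratchpad[i];
    *cycles = (unsigned)n * LOCAL_CYCLES;
    return sum;
}
```

The point of the split is that the expensive, known-size transfer happens at a time the program chooses, and the compute phase afterwards has a fixed memory cost.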

#### Offline Schedule Resource Sharing

- Don't arbitrate
- Decide up-front when each shared resource can be used by each thread or processor
  - Simple fixed schedule
  - Detailed schedule
- What
  - Memory bank, bus, I/O, network link, ...



#### Time-Multiplexed Bus

- Regular schedule
- Fixed bus slot schedule of length N > number of masters
  - (probably a multiple)
- Assign owner for each slot
  - Can assign more slots to one master
- E.g. N=8, for 4 masters
  - Schedule (1 2 1 3 1 2 1 4)
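The example schedule can be sketched as a lookup table, which also makes each master's worst-case wait easy to compute. The function names are hypothetical, and "wait" here means the largest number of slots between two consecutive turns of the same master.

```c
#include <assert.h>

#define N_SLOTS 8

/* Slot owners from the example: N=8 slots shared by masters 1..4. */
static const int schedule[N_SLOTS] = { 1, 2, 1, 3, 1, 2, 1, 4 };

/* Owner of the bus in a given cycle: no arbitration, just a table lookup. */
int bus_owner(unsigned long cycle)
{
    return schedule[cycle % N_SLOTS];
}

/* Largest cyclic gap, in slots, between consecutive slots owned by
 * `master`. Returns 0 if the master owns no slot. */
int worst_case_wait(int master)
{
    int first = -1, prev = -1, worst = 0;
    assert(master >= 1);
    for (int s = 0; s < N_SLOTS; s++) {
        if (schedule[s] != master)
            continue;
        if (first < 0)
            first = s;
        else if (s - prev > worst)
            worst = s - prev;
        prev = s;
    }
    if (first < 0)
        return 0;
    /* wrap-around gap from the last owned slot back to the first */
    int wrap = first + N_SLOTS - prev;
    return wrap > worst ? wrap : worst;
}
```

For this schedule the gaps are 2 slots for master 1, 4 for master 2, and 8 for masters 3 and 4. Giving a master more slots shortens its guaranteed wait, and the bound holds regardless of what the other masters do.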




#### Simple Deterministic Processor with Multiplier

- No branching
- Unpipelined
- Every operation completes in fixed time
- Cycle time as shown?
- Retimed cycle time?
- What's inefficient about this design?

[Figure: processor datapath annotated with component delays of 0.5 ns and 1 ns, including 1 ns memories]

#### Simple Deterministic Processor with some Pipelining

- No branching
- Every operation completes in fixed time
- Retimed cycle time?
  - Hint: what are the cycles?
- How do the added pipeline stages change behavior?
  - Hint: what is the sequence of addresses into Instr. Mem?



#### **Deadline Instruction**

- Deal with algorithmic (branching) variability
- Set a hardware counter for thread
- Decrement counter on each cycle
- Demand that counter reach 0 before thread is allowed to continue past the deadline instruction
  - Stall if thread gets there early
  - Similar to flip-flop on a logic path
    - Wait for clock edge to change or sample value
- Model: fixed execution time
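The counter behavior described above can be modeled in software as a rough sketch. The `deadline_timer` type and all function names are hypothetical, and the model is cycle-counting only, not a description of any particular processor's deadline instruction.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a per-thread deadline counter. */
typedef struct {
    unsigned counter;   /* cycles remaining until the deadline */
} deadline_timer;

/* Start of a timed region: load the counter with the cycle budget. */
void deadline_set(deadline_timer *t, unsigned budget)
{
    t->counter = budget;
}

/* One elapsed cycle: the counter decrements toward 0 and stays there. */
void deadline_tick(deadline_timer *t)
{
    if (t->counter > 0)
        t->counter--;
}

/* Model one timed region: the thread computes for `work` cycles, then
 * reaches the deadline instruction and stalls until the counter is 0.
 * Returns total elapsed cycles; *missed is set when work exceeds budget. */
unsigned run_region(unsigned work, unsigned budget, bool *missed)
{
    deadline_timer t;
    unsigned elapsed = 0;
    assert(missed != 0);
    deadline_set(&t, budget);
    for (unsigned c = 0; c < work; c++) {   /* variable-length computation */
        deadline_tick(&t);
        elapsed++;
    }
    *missed = (work > budget);
    while (t.counter > 0) {                 /* stall at deadline instruction */
        deadline_tick(&t);
        elapsed++;
    }
    return elapsed;
}
```

Any `work` of at most `budget` cycles yields exactly `budget` elapsed cycles, so `run_region(3, 10, &m)` and `run_region(10, 10, &m)` both return 10. Only when the work overruns the budget does timing vary, which is why the budget has to come from a worst-case analysis.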


#### **WCET**

- WCET: Worst-Case Execution Time
- Analysis when working with algorithms and architectures with data-dependent delay
  - Need to meet real time
  - Calculate the worst-case runtime of a task
    - Like calculating the critical path (but harder)
    - Worst-case delay of instructions
    - Worst-case path through code
    - Worst-case # loop iterations
  - Rationale for setting deadlines (like a cycle time)
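The three worst-case ingredients listed above compose in a simple way for structured code. The sketch below is a simplified illustration, not a real WCET tool: the function names are hypothetical, and the cycle counts passed in are assumed to already be worst-case instruction delays.

```c
#include <assert.h>
#include <limits.h>

/* WCET of an if/else: cost of the condition plus the slower branch
 * (worst-case path through code). */
unsigned wcet_branch(unsigned cond_cycles, unsigned then_cycles,
                     unsigned else_cycles)
{
    unsigned slower = then_cycles > else_cycles ? then_cycles : else_cycles;
    return cond_cycles + slower;
}

/* WCET of a loop: setup plus worst-case trip count times the
 * worst-case cost of one iteration. */
unsigned wcet_loop(unsigned setup_cycles, unsigned max_iters,
                   unsigned body_cycles)
{
    /* guard the multiplication against unsigned overflow */
    assert(body_cycles == 0 || max_iters <= UINT_MAX / body_cycles);
    return setup_cycles + max_iters * body_cycles;
}
```

As a usage example with made-up costs: if a loop body is a 1-cycle test guarding a 1-cycle add, plus 2 cycles of shifts and loop control, then `wcet_branch(1, 1, 0) + 2` is 4 cycles per iteration, and with at most 32 iterations and 1 setup cycle `wcet_loop(1, 32, 4)` gives 129 cycles. That number is the kind of budget a deadline would be set from.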


#### Deterministic Pipelines

- Not how ARM, Intel (4710, 5710) processors are pipelined
- Those include operations that make timing variable
  - Dynamic data hazards, branch speculation
- Here, data becomes available after a predictable time
- Branches take effect at a fixed time
  - Likely delayed
- Schedule to the delays to get correct data

#### **Different Goals**

#### Real-Time

- Willing to recompile to new hardware
- Want time on hardware predictable
- Willing to schedule for delays in particular hardware

#### General Purpose/Best Effort

- ISA fixed
- Want to run same assembly on different implementations
- Tolerate different delays for different hardware
- Run faster on newer, larger implementations

#### SoC Opportunity

- Can choose which resources are shared
- Can dedicate resources to tasks
- Isolate real-time tasks/portions of tasks from best-effort
  - Separate hardware/processors
  - Separate memories, network

#### UltraScale+ Zynq

- Has 2 "Real-Time Processors"
  - ARM Cortex-R5
    - 32b (vs. 64b for A53 APU processor)
    - ARMv7-R (vs. ARMv8)
    - Single ALU, dual issue
    - Branch prediction
- Explicitly managed scratchpads
  - Tightly-Coupled Memories
  - On-Chip Memory (OCM)

#### Admin

- Feedback
- Project due Monday 12/11
- Sign up for Demo Day
  - Will invite signup on Ed Discuss
- Final: Tuesday, Dec. 19, 3pm, Moore 216
- Review: watch Ed Discuss
- No DQ for today

#### Big Ideas:

- Real-time applications demand a different discipline from best-effort tasks
- Look more like synchronous circuits and hardware discipline
- Avoid or use care with variable-delay programming constructs
- Can sequentialize, like a processor
  - But must avoid/rethink typical processor common-case optimizations
  - Calculate a static schedule offline for computation and sharing