• Modern CPUs exploit parallelism internally (caching, vectorization, ILP, pipelining), even on a single core
  • Code structure determines how effectively these mechanisms can be exploited

History and Preamble

  • Moore’s Law: Transistor counts on a chip double approximately every two years

  • Until the early 2000s: Shrinking transistors enabled both higher transistor densities and higher clock speeds, directly driving computational power growth

  • A computer executes one instruction per clock cycle. An instruction describes an operation for the processor to perform (modifying the computer’s state)

Single Core

Instruction level parallelism

These optimizations make the control unit more complex: out-of-order scheduling, branch prediction, memory prefetching, caches, etc.

But limits:

  • Power consumption and heat
  • ILP wall
  • Memory wall

So we moved to

Multi Core

Thread level parallelism

We need to parallelize tasks: programmers must design software to split work into independent threads.


Caching

Von Neumann architecture

Memory access incurs latency → invention of caches

Example: summing an array

Best performance with sequential access: efficient due to cache locality, and cache prefetching works

Jumping around the array is cache-unfriendly, causing cache misses because prefetching is ineffective

→ a computer has multiple cache levels. Each L stands for a cache level, e.g. L1 is 5x faster than L2, etc.

Compute optimization

3 Approaches to apply parallelism:

Vectorization

Different, independent operations of the same type can be parallelized through compiler optimizations

SIMD

SIMD: Single Instruction, Multiple Data. One instruction is broadcast to multiple ALUs (arithmetic logic units)

The compiler spots the optimization opportunities

ILP

The processor finds independent instructions and executes them in parallel (“superscalar”). Independent: the result of one instruction is not an input of the other. The hardware enables ILP.

ILP: happens within a single thread, so no concurrency.
Multi-threading: multiple cores

Combination: a thread is scheduled on a core; ILP then parallelizes instructions within that same thread

Prefetching: preloading instructions into the CPU cache before they are needed.

x = a + b; // this and the one below are independent
y = c + d; // can execute these 2 instructions in parallel
z = x * y; // this one depends on results above, so has to wait

ILP ==can also reorder statements==, if they are independent of one another.

Independent Statements

Two statements are independent if they have

  • different register names
  • different memory addresses

Slides

Pipelining

→ split a “process” into different stages, each with a clear function, e.g. a clothes-washing process

Balanced

Balanced pipeline: all steps require same time

1. Instruction Fetch

  • CPU reads the instruction from memory
  • may prefetch multiple instructions

2. Instruction Decode

  • decode the instructions, prepare instructions for execution
  • instructions are then dispatched to the correct execution units

3. Execution

  • CPU executes decoded instruction
  • Superscalar execution: instructions dispatched to multiple execution units.
  • SIMD: apply one single instruction to multiple pieces of data all at once

4. Memory Access

  • If needed, exchange data between CPU and memory
  • pre-fetching, caching, etc.

5. Writeback

  • Save the final result of the execution
  • outcome of the execution stage is written back into the CPU’s registers
  • Superscalar Execution: Multiple instructions can write their results back to different registers or memory locations at the same time

Throughput

Throughput = amount of work we can do in a given amount of time; larger is better

→ state the UNITS (exam!)

  • For a specific number of instances n: throughput = n / (time to complete all n instances)
  • For an infinite pipeline (per execution unit): throughput = 1 / (duration of the longest stage)

Examples:

  • a throughput of 1/3 instructions per second means 1 instruction per 3 s
  • 1 product per 15 min

Latency

Latency = time needed to perform a given computation; lower is better

  • Latency is not constant; it can change

  • Latency bound = (number of stages) * (duration of the longest stage)

  • Constant: simply take the first instance; latency = time for the first instance

  • Not constant: latency = time for the first instance + (longest stage - time of the first stage) * (n-1)

Example: an unbalanced pipeline creates waiting delays

Example: washing loads

→ make the pipeline balanced by increasing the time of each stage to match the longest stage

→ add 2 dryers working in a row