- Modern CPUs exploit parallelism internally (caching, vectorization, ILP, pipelining), even on a single core
- Code structure determines how effectively these mechanisms can be used
History and Preamble
- Moore's Law: Transistor counts on a chip double approximately every two years
- Until the early 2000s: Shrinking transistors enabled both higher transistor densities and higher clock speeds, directly driving computational power growth
- A computer executes one instruction per clock cycle. An instruction describes an operation for the processor to perform (modify the computer's state)
Single Core
Instruction level parallelism
These optimizations make the control unit more complex: out-of-order scheduling, branch prediction, memory prefetching, caches, etc.
But limits:
- Power consumption and heat
- ILP wall
- Memory wall
So we got
Multi Core
Thread level parallelism
We need to parallelize tasks: programmers must design software to split work into independent threads.
Caching
Von Neumann architecture
Memory access incurs latency → invention of the cache
Example: summing an array
Best performance with sequential access: efficient due to cache locality, cache prefetching works
Jumping around the array is cache-unfriendly, causing cache misses because prefetching is ineffective
→ a computer has multiple caches. Here each L stands for a cache level, e.g. L1 is ~5x faster than L2, etc.
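The effect above can be sketched in Python (array size and stride are assumed numbers; in CPython the measurable speedup is much smaller than in compiled code, but the access patterns are the same): both walks compute the same sum, yet the strided walk touches a far-apart element on almost every access, defeating cache locality and prefetching.

```python
N = 1_000_000
data = list(range(N))

# Sequential walk: neighboring elements, so cache lines are reused
# and the hardware prefetcher can stay ahead of the loop.
def sum_sequential(a):
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total

# Strided walk: jumps `stride` elements ahead each step, touching a new,
# far-away cache line almost every access (prefetching cannot keep up).
def sum_strided(a, stride=4096):
    total = 0
    for start in range(stride):
        for i in range(start, len(a), stride):
            total += a[i]
    return total

# Same result either way — only the memory-access pattern differs.
assert sum_sequential(data) == sum_strided(data) == N * (N - 1) // 2
```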
Compute optimization
3 Approaches to apply parallelism:
Vectorization
Different, independent operations of the same type can be parallelized through compiler optimizations
SIMD
SIMD: Single Instruction, Multiple Data. One instruction broadcasts to multiple ALUs (calculation units)
The compiler spots optimization opportunities
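A minimal sketch of the SIMD idea in pure Python (the lane width of 4 is an illustrative assumption, e.g. 4 × 32-bit values in a 128-bit register): one "instruction" operates on a whole lane of data at once, and the vectorized loop advances 4 elements per step instead of 1.

```python
LANES = 4  # assumed lane width, e.g. 4 x 32-bit floats in a 128-bit register

def simd_add(xs, ys):
    """One 'instruction': adds LANES element pairs at once."""
    assert len(xs) == len(ys) == LANES
    return [x + y for x, y in zip(xs, ys)]

def vector_add(a, b):
    """Vectorized loop: processes the arrays LANES elements per step."""
    out = []
    for i in range(0, len(a), LANES):  # assumes len(a) is a multiple of LANES
        out.extend(simd_add(a[i:i + LANES], b[i:i + LANES]))
    return out

assert vector_add([1, 2, 3, 4, 5, 6, 7, 8],
                  [10, 20, 30, 40, 50, 60, 70, 80]) \
       == [11, 22, 33, 44, 55, 66, 77, 88]
```

In real code the compiler (auto-vectorization) or intrinsics generate such lane-wise instructions; this sketch only mirrors the shape of the computation.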
ILP
Processor finds independent instructions and executes them in parallel (“Superscalar”). Independent: result of one instruction isn’t input of the other. Hardware enables ILP
ILP: Happens on a single thread, so no concurrency.
Multi-threading: multiple cores
Combination: each thread is scheduled on a core; ILP parallelizes instructions within the same thread
Prefetching: Preloading instructions into the CPU cache before needed.
x = a + b; // this and the one below are independent
y = c + d; // can execute these 2 instructions in parallel
z = x * y; // this one depends on the results above, so it has to wait
==ILP can also reorder statements==, if they are independent of one another.
Independent Statements
Two statements are independent if they use
- different register names
- different memory addresses
Slides
Pipelining
→ split a “process” into different stages, each with a clear function, e.g. a laundry-washing process
Balanced
Balanced pipeline: all steps require the same amount of time
1. Instruction Fetch
- CPU reads the instruction from memory
- may prefetch multiple instructions
2. Instruction Decode
- decode the instructions, prepare instructions for execution
- instructions are then dispatched to the correct execution units
3. Execution
- CPU executes decoded instruction
- Superscalar execution: instructions dispatched to multiple execution units.
- SIMD: apply one single instruction to multiple pieces of data all at once
4. Memory Access
- If needed, exchange data between CPU and memory
- pre-fetching, caching, etc.
5. Writeback
- Save the final result of the execution
- outcome of the execution stage is written back into the CPU’s registers
- Superscalar Execution: Multiple instructions can write their results back to different registers or memory locations at the same time
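The five stages above can be simulated with a small scheduling sketch (stage times are assumed, arbitrary units): each instruction can only enter a stage once its own previous stage is done *and* the stage has been vacated by its predecessor. With a balanced pipeline, after the initial fill, one instruction completes per time unit.

```python
def completion_times(stage_times, n):
    """finish[i][s] = time at which instance i leaves stage s."""
    S = len(stage_times)
    finish = [[0.0] * S for _ in range(n)]
    for i in range(n):
        for s in range(S):
            ready = finish[i][s - 1] if s > 0 else 0.0  # own previous stage done
            free = finish[i - 1][s] if i > 0 else 0.0   # stage vacated by predecessor
            finish[i][s] = max(ready, free) + stage_times[s]
    return finish

# Balanced pipeline: Fetch, Decode, Execute, Memory, Writeback, 1 unit each.
times = completion_times([1, 1, 1, 1, 1], 3)
assert times[0][-1] == 5  # first instruction done after 5 units (pipeline fill)
assert times[1][-1] == 6  # afterwards, one instruction completes per unit
assert times[2][-1] == 7
```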
Throughput
Throughput = amount of work we can get done in a given time; larger is better
→ always state the UNITS (exam!)
- For a specific number of instances n: throughput(n) = n / t(n), where t(n) is the total time for n instances
- For an infinite pipeline (per execution unit): throughput(∞) = 1 / (duration of the longest stage)
Examples:
- throughput = 1/3 instructions per second means 1 instruction per 3 s
- throughput = 1/15 products per minute means 1 product per 15 min
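The two throughput figures above can be computed directly (a sketch; the stage times in the second example are made-up numbers):

```python
def throughput_n(n, total_time):
    """Throughput for a concrete run: n instances in total_time. State the units!"""
    return n / total_time

def throughput_inf(stage_times):
    """Infinite pipeline: limited by the slowest stage (per execution unit)."""
    return 1 / max(stage_times)

# 1 instruction per 3 s  ->  1/3 instructions per second
assert throughput_n(1, 3) == 1 / 3
# bottleneck stage takes 15 min  ->  1/15 products per minute
assert throughput_inf([5, 15, 10]) == 1 / 15
```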
Latency
Latency = time needed to perform a given computation; lower is better
- Latency is not constant, it can change
- Latency bound = (number of stages) × duration of the longest step
- Constant: simply take the first instance
  latency = time for the first instance
- Not constant:
  latency(n) = time for first instance + (longest stage − time of first stage) × (n − 1)
Example: an unbalanced pipeline creates waiting delays
Example: washing loads
→ make the pipeline balanced by increasing the time of each stage to match the longest stage
→ or add 2 dryers working in turn on alternating loads
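Both fixes can be checked with a quick calculation (the washer/dryer times of 30 and 60 minutes are assumed example numbers): duplicating the bottleneck stage halves its effective time, which balances the pipeline and doubles the steady-state throughput.

```python
def effective_stage_times(stage_times, copies):
    """copies[s] parallel units for stage s -> effective time / copies[s]."""
    return [t / c for t, c in zip(stage_times, copies)]

def steady_throughput(stage_times):
    """Instances per time unit once the pipeline is full: 1 / bottleneck."""
    return 1 / max(stage_times)

washer_dryer = [30, 60]  # minutes; assumed example numbers
assert steady_throughput(washer_dryer) == 1 / 60  # 1 load per hour

# Add a second dryer: the two dryers take turns, halving the effective time.
balanced = effective_stage_times(washer_dryer, [1, 2])
assert balanced == [30.0, 30.0]
assert steady_throughput(balanced) == 1 / 30      # 2 loads per hour
```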