Advanced Processor Architecture
Modern processors use a combination of architectural techniques to achieve high performance, low power consumption, parallel execution, and efficient resource utilization. Here are the major concepts:
1. Superscalar Architecture
A superscalar processor can execute multiple instructions per clock cycle. It includes:
- Multiple execution units
- Parallel pipelines
- Instruction dispatch logic
Goal: Increase throughput by running several independent instructions at the same time.
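The effect of issue width can be sketched with a toy model. This is an illustrative sketch, not a real scheduler: it assumes in-order issue, up to `width` instructions per cycle, and that an instruction can only issue after all of its producers issued in an earlier cycle.

```python
def issue_cycles(deps, width):
    """deps[i] = list of earlier instruction indices that i depends on."""
    cycle_of = {}
    cycle, slots = 1, 0
    for i, producers in enumerate(deps):
        # Earliest cycle this instruction can issue: after all producers.
        ready = max((cycle_of[p] + 1 for p in producers), default=1)
        if ready > cycle or slots == width:  # start a new cycle
            cycle = max(cycle + 1, ready)
            slots = 0
        cycle_of[i] = cycle
        slots += 1
    return cycle

# Four independent instructions: a 2-wide machine needs 2 cycles, not 4.
print(issue_cycles([[], [], [], []], width=2))  # 2
print(issue_cycles([[], [], [], []], width=1))  # 4
# A chain of dependent instructions gains nothing from extra width.
print(issue_cycles([[], [0], [1], [2]], width=2))  # 4
```

The last call shows why superscalar width alone is not enough: dependent chains serialize regardless, which is what out-of-order execution (section 3) attacks.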
2. Pipelining
Instructions are broken into stages (Fetch → Decode → Execute → Memory → Write-back). Multiple instructions occupy different stages at once, keeping the CPU busy continuously.
Advanced versions include:
- Deep pipelines (the Pentium 4 had 20+ stages)
- Dynamic pipeline resizing (modern ARM big.LITTLE designs)
- Out-of-order pipelining
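The payoff is easy to quantify with the classic back-of-the-envelope model: an ideal k-stage pipeline finishes n instructions in (k + n − 1) cycles, versus n × k cycles without pipelining. This assumes no stalls or hazards, which real pipelines of course have.

```python
def pipelined_cycles(n, k):
    # First instruction takes k cycles; each later one retires 1 cycle apart.
    return k + n - 1

def unpipelined_cycles(n, k):
    # Each instruction occupies the whole machine for k cycles.
    return n * k

n, k = 100, 5  # 100 instructions on a classic 5-stage pipeline
print(pipelined_cycles(n, k))    # 104
print(unpipelined_cycles(n, k))  # 500
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))  # ~4.8x speedup
```

Note the speedup approaches k only for long instruction streams, which is one reason deep pipelines looked attractive before branch-misprediction penalties (section 5) caught up with them.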
3. Out-of-Order Execution (OoO)
Instructions are not executed in the original program order. Instead, an instruction executes as soon as:
- Its required data is available
- An execution unit is free
Hardware components involved:
- Reservation stations
- Reorder buffer (ROB)
- Register renaming
This hides latency and boosts throughput.
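The latency-hiding effect can be shown with a minimal dataflow sketch: each instruction starts as soon as its producers finish. For simplicity this assumes unlimited execution units, which real reservation stations bound.

```python
def ooo_finish_times(instrs):
    """instrs: list of (latency, [producer indices]); returns finish cycles."""
    done = []
    for latency, producers in instrs:
        # Start as soon as all operands are ready, not in program order.
        start = max((done[p] for p in producers), default=0)
        done.append(start + latency)
    return done

# i0: slow load (4 cycles); i1 depends on i0; i2 is independent.
# In-order, i2 would wait behind the load; out of order it finishes at cycle 1.
times = ooo_finish_times([(4, []), (1, [0]), (1, [])])
print(times)  # [4, 5, 1]
```

The independent instruction completing at cycle 1 while the load is still in flight is exactly the latency hiding described above.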
4. Register Renaming
Prevents false data hazards (WAR, WAW) by giving each instruction its own physical register instead of reusing architectural registers.
Result: more parallel execution without conflicts.
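A minimal sketch of the renaming step itself: every write to an architectural register allocates a fresh physical register, so WAR/WAW name conflicts disappear and only true (read-after-write) dependences remain. Register names here (`r*`, `p*`) are illustrative, not any real ISA's.

```python
def rename(instrs, num_arch=4):
    """instrs: list of (dest reg, [source regs]) using architectural names."""
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch)}
    next_phys = num_arch
    renamed = []
    for dst, srcs in instrs:
        srcs = [mapping[s] for s in srcs]   # read current mappings first
        mapping[dst] = f"p{next_phys}"      # then allocate a fresh destination
        next_phys += 1
        renamed.append((mapping[dst], srcs))
    return renamed

# r1 is written twice (a WAW hazard); after renaming the two writes target
# different physical registers and can proceed independently.
prog = [("r1", ["r2", "r3"]), ("r2", ["r1"]), ("r1", ["r3", "r0"])]
for line in rename(prog):
    print(line)
# ('p4', ['p2', 'p3'])
# ('p5', ['p4'])
# ('p6', ['p3', 'p0'])
```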
5. Branch Prediction
To keep the pipeline full, processors predict the next instruction after a branch.
Modern CPUs use:
- Two-level adaptive predictors
- Branch history tables
- Global/local prediction
- Neural predictors (in some ARM & Apple chips)
Bad prediction → pipeline flush → penalty.
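The building block behind the branch history tables mentioned above is the 2-bit saturating counter: states 0–1 predict not-taken, states 2–3 predict taken, and each outcome nudges the counter by one. A sketch of a single counter:

```python
class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start at "weakly taken"

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturate at the ends so one anomaly doesn't flip the prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times, then not taken on exit: the 2-bit counter
# mispredicts only the final iteration.
p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits)  # 9 correct out of 10
```

The two-bit hysteresis is the key design choice: a 1-bit predictor would also mispredict the first iteration of the *next* run of the loop.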
6. Speculative Execution
The CPU executes instructions before knowing whether they are actually needed.
If the prediction is correct → performance ↑
If wrong → the results are discarded.
Used heavily in Intel, AMD, and Apple M-series chips.
7. Multi-Core Architecture
Instead of increasing clock speed, modern CPUs add multiple cores inside one chip.
Types:
- Single-core → Dual-core → Quad-core
- Many-core (20+ cores in server CPUs)
- Heterogeneous cores (big.LITTLE architecture)
8. Heterogeneous Computing (big.LITTLE)
Used in ARM-based mobile chips & Apple Silicon:
- Performance cores (P-cores) → high speed
- Efficiency cores (E-cores) → low power
The OS scheduler chooses which core runs each task.
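The scheduling decision can be caricatured as a utilization threshold. This is a deliberately simplified heuristic, not any real OS policy (real schedulers also weigh thermals, latency sensitivity, and core availability); the task names and threshold are made up for illustration.

```python
def pick_core(utilization, threshold=0.5):
    """Route demanding tasks to P-cores, light background work to E-cores."""
    return "P-core" if utilization >= threshold else "E-core"

tasks = {"video_encode": 0.9, "mail_sync": 0.1, "game": 0.8, "timer": 0.05}
for name, load in tasks.items():
    print(name, "->", pick_core(load))
# video_encode -> P-core
# mail_sync -> E-core
# game -> P-core
# timer -> E-core
```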
9. Simultaneous Multithreading (SMT / Hyper-Threading)
A single physical core appears to the operating system as two logical processors.
It allows:
- Higher resource utilization
- Overlapping of stalls
- Better throughput
Intel → Hyper-Threading
AMD → SMT
10. Cache Hierarchy & Advanced Memory Architecture
Modern CPUs depend heavily on caches:
- L1 (fastest, smallest)
- L2 (larger, slower)
- L3 (shared across cores)
- L4/eDRAM (in some server chips)
Advanced techniques:
- Cache coherence protocols (MESI, MOESI)
- Victim caches
- Smart prefetching algorithms
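A tiny direct-mapped cache model makes the hit/miss behaviour of one level concrete: each address maps to exactly one line, so two addresses sharing an index evict each other. The 4-line, 16-byte geometry is chosen for illustration only; real L1 caches are larger and set-associative.

```python
class DirectMappedCache:
    def __init__(self, num_lines=4, line_size=16):
        self.num_lines, self.line_size = num_lines, line_size
        self.tags = [None] * num_lines  # one tag per cache line

    def access(self, addr):
        block = addr // self.line_size       # which memory block
        index = block % self.num_lines       # which cache line it maps to
        tag = block // self.num_lines        # identifies the block in that line
        hit = self.tags[index] == tag
        self.tags[index] = tag               # fill the line on a miss
        return hit

cache = DirectMappedCache()
# Addresses 0 and 64 both map to line 0 (64/16 = block 4, 4 % 4 = 0),
# so they keep evicting each other: a conflict miss.
print(cache.access(0))   # False (cold miss)
print(cache.access(4))   # True  (same 16-byte line)
print(cache.access(64))  # False (conflict miss, evicts block 0)
print(cache.access(0))   # False (was just evicted)
```

Victim caches, mentioned above, exist precisely to catch the lines bouncing out in that last pattern.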
11. Instruction Set Innovations
- Intel/AMD → CISC (x86_64), but internally convert instructions to RISC-like micro-ops
- ARM → RISC (simpler, power-efficient instructions)
Vector Extensions
- Intel → AVX, AVX2, AVX-512
- ARM → NEON, SVE
- Used in AI, multimedia, and scientific computing
12. Accelerator Integration
Modern processors integrate accelerators for specialized workloads:
- AI accelerators / NPUs
- GPU cores (APU architecture)
- Cryptographic engines
- Image signal processors (ISPs)
Example: Apple M1/M2/M3 chips integrate CPU + GPU + Neural Engine.
13. Chiplet Architecture
Instead of one large die, CPUs now use multiple smaller chiplets connected via high-speed interconnects.
AMD uses:
- CCD (Core Complex Die)
- IOD (I/O Die)
Benefits:
- Better yields
- Lower manufacturing cost
- Higher scalability
14. Power Management Technologies
To save power and battery life:
- Dynamic Voltage and Frequency Scaling (DVFS)
- Turbo Boost (Intel) / Precision Boost (AMD)
- Thermal throttling
- Adaptive power gating
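The idea behind DVFS governors can be sketched as a feedback loop: raise the clock when utilization is high, drop it when the core is mostly idle. The frequency table and thresholds below are hypothetical, not a real driver interface.

```python
FREQ_STEPS_MHZ = [800, 1600, 2400, 3200]  # hypothetical P-states

def next_freq(current_idx, utilization, up=0.8, down=0.3):
    """Step the frequency index up or down based on recent utilization."""
    if utilization > up and current_idx < len(FREQ_STEPS_MHZ) - 1:
        return current_idx + 1   # scale up under load
    if utilization < down and current_idx > 0:
        return current_idx - 1   # scale down when idle
    return current_idx           # hold steady in between

idx = 0
for load in [0.9, 0.95, 0.5, 0.1]:
    idx = next_freq(idx, load)
    print(FREQ_STEPS_MHZ[idx])  # 1600, 2400, 2400, 1600
```

Because dynamic power scales roughly with voltage squared times frequency, stepping down even one level under light load saves disproportionately more energy than the lost performance would suggest.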
Summary Table
| Feature | Purpose |
|---|---|
| Superscalar | Execute multiple instructions per cycle |
| Pipelining | Overlap instruction stages |
| Out-of-order | Maximize performance by ignoring program order |
| Branch prediction | Avoid stalls during conditional jumps |
| Speculative execution | Boost performance using prediction |
| SMT | Use idle resources more efficiently |
| Multi-core | Parallel processing |
| Heterogeneous cores | Balance performance & power |
| Chiplets | High scalability & efficiency |
| Vector engines | High-speed math operations |