
Modern CPU Microarchitecture 101

Take a look at this unassuming picture of AMD’s Ryzen 7 9700X:

image

Billions of CPUs like the Ryzen 7 9700X (though mostly not as powerful) sit in motherboards and circuit boards all around the world. The 9700X is built on the AMD Zen 5 microarchitecture. Here is what happens if you take the cover off:

image

There are two dies in Zen 5 chips. The larger one in the top center supports PCIe and media encoding, and it contains the memory controller, the display engine, and many other modern components. The die with the CPU cores themselves is in the lower right; here is a zoomed-in, labeled die shot:

image

Why do caches take up so much space? What is inside of each core? How are modern CPUs faster than ever? These are some of the questions that we hope to answer in this walkthrough. Modern processors are not simple fetch, decode, and execute pipelines anymore. They often try to predict the future, execute out-of-order, and have support for executing several instructions in parallel. We will look at several mechanisms that enable this to work flawlessly for countless devices every second.

Disclaimer

Modern Intel and AMD processors are extremely complex. Many proprietary technologies are at play, and both companies go to great lengths to make their designs hard to reverse engineer. x86 is also a variable-length ISA, which makes decoding more involved than on RISC architectures; x86 decoding warrants its own article, so it will not be discussed here. Instead, we will use MIPS (Microprocessor without Interlocked Pipelined Stages) in the running examples and diagrams.

Single-Cycle Processor

These are very basic processors that nobody uses nowadays outside of teaching examples. Each instruction gets executed in one cycle, which may seem great, but cycle times are horrendously long. Consider the following delays:

In a single-cycle processor, the clock cycle time is set by the slowest instruction, i.e. the sum of all of these delays. So even though an instruction like addu r1, r1, r2 never uses data memory, it still takes 975 picoseconds to complete.

image

The only upside of single-cycle processors is their simple design, as seen in the diagram above, which is why they are primarily used in classrooms. Their inefficiency makes them a poor fit for any real workload.

Pipelined Processor

The biggest shortcoming of the single-cycle processor was the lack of utilization. An instruction would only occupy one component at a time, leaving the other components sitting around and doing nothing. Pipelined processors instead group the components by stages. A classic version has 5 stages: (F)etch, (D)ecode, (E)xecute, (M)emory, and (W)riteback. As seen in the diagram below, pipelined processors are way more complicated in design:

image

If we apply the above delays to our pipelined processor, the clock cycle time reduces to 350 ps, an almost 3x speedup. To see why this is the case, consider the following simple workload:

add  r4, r2, r1
sw   r2, 2(r5)
slt  r3, r5, r4

Because instructions cannot overlap in the single-cycle processor, the same work occupies 15 cycles in the following diagram (each cycle here represents one stage-length time slot):

cycle # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
add r4, r2, r1 F D E M W                    
sw r2, 2(r5)           F D E M W          
slt r3, r5, r4                     F D E M W

Whereas a pipelined processor would only take 7 cycles:

cycle # 1 2 3 4 5 6 7
add r4, r2, r1 F D E M W    
sw r2, 2(r5)   F D E M W  
slt r3, r5, r4     F D E M W

Even though the forwarding unit and hazard detector handle most dependency situations, sometimes the processor still has to stall. A notable example is a load-use hazard:

cycle # 1 2 3 4 5 6 7 8 9 10
add r1, r4, r5 F D E M W          
sw r2, 2(r1)   F D E M W        
lw r3, 8(r6)     F D E M W      
add r5, r1, r3       F D D** E M W  
sub r1, r8, r2         F F D E M W

**The decode stage stalls for one cycle because add needs the result of lw, which is only produced at the end of lw's memory stage, too late to forward to add's execute stage without the stall.

Superscalar Processor

A natural way to further speed up execution is to widen the pipeline. Rather than fetching one instruction at a time, we can fetch two or more instructions. This makes the processor superscalar. For example, consider a 2-wide superscalar processor running these instructions:

cycle # 1 2 3 4 5 6 7 8 9 10
add r3, r2, r6 F D E M W          
lw r1, 10(r5) F D E M W          
sub r4, r4, r1   F D D E M W      
and r6, r4, r2   F D D D E M W    
or r4, r6, r3     F F D D E M W  
addi r5, r5, #1     F F F D E M W  
slt r9, r7, r6         F F D E M W

Even though more instructions can be fetched at a time, dependencies are still the biggest bottleneck. In the example above, more than 70% of the instructions stalled because of dependences. Wouldn't it make more sense to execute instructions that do not depend on each other first, instead of stalling? That is exactly the question processor designers set out to answer, and it is the topic of the next section.

Out-of-Order Processor

Up until now, our optimizations all boil down to one goal: keeping all components of the CPU busy. With pipelining, we managed to continuously feed instructions to the processor rather than waiting for the previous one to fully complete. Nevertheless, dependent instructions lead to hazards, which partially negated the benefit of pipelining. Consider the following program plagued with load-use hazards and its pipeline diagram:

cycle # 1 2 3 4 5 6 7 8 9 10 11 12
lw r1, 0(r10) F D E M W              
add r2, r1, r3   F D D E M W          
lw r4, 4(r10)     F F D E M W        
add r5, r4, r6         F D D E M W    
lw r7, 8(r10)           F F D E M W  
add r8, r7, r9               F D E M W

We can change the ordering of these instructions and remove all stalls:

cycle # 1 2 3 4 5 6 7 8 9 10
lw r1, 0(r10) F D E M W          
lw r4, 4(r10)   F D E M W        
lw r7, 8(r10)     F D E M W      
add r2, r1, r3       F D E M W    
add r5, r4, r6         F D E M W  
add r8, r7, r9           F D E M W

Programmers could reorder instructions themselves to realize this speedup, but doing so is tedious and error-prone. Instead, most modern processors do it automatically. This is the origin of the name "out-of-order": the CPU internally schedules instructions in an order different from the one in which they were written. We will see several hardware mechanisms that make this possible.

Register Renaming

The first major problem to tackle is the limited number of architectural registers available. Consider the following program:

add  r1, r2, r3
mul  r4, r1, r5
add  r1, r6, r7
sub  r8, r1, r9

What dependencies exist?

  1. mul needs r1 from the first add
  2. sub needs r1 from the second add

This creates two types of hazards:

  1. True data hazards (read after write): mul and sub must wait for the version of r1 produced by the add before them.
  2. Name hazards (write after read and write after write): the second add reuses r1, so out-of-order execution could overwrite it before mul reads the old value, or let the two writes to r1 land in the wrong order.

But notice: r1 gets completely recomputed in the second add instruction. The first version of r1 holds a different value from the second version of r1, so the name hazards are "false" dependencies on the register name rather than on data. This is where register renaming comes in. Since different microarchitectures have different numbers of registers, program compatibility would quickly break if we did not maintain two kinds of registers:

  1. Architectural registers: the registers defined by the ISA, which programs and compilers refer to.
  2. Physical registers: the larger pool of registers that actually exists in the hardware and holds in-flight values.

The processor keeps track of which architectural register maps to which physical register using the register alias table. The processor also has a free list of registers that it can use. Going back to our earlier example, let us assume this is the alias table at the start of the program:

arch reg phys reg
r1 t11
r2 t12
r3 t13
r4 t14
r5 t15
r6 t16
r7 t17
r8 t18
r9 t19

The free list at this point starts at t20 and goes onwards. We can rename the program:

add  t20, t12, t13
mul  t21, t20, t15
add  t22, t16, t17  # can run early since t22 and t20 are independent of each other
sub  t23, t22, t19

The renaming process is straightforward and successfully removes the hazards that we were talking about. Each time an instruction writes a destination register, we grab a register from the free list and update the register alias table. The following changes are made:

arch reg phys reg
r1 t11, t20, t22
r2 t12
r3 t13
r4 t14, t21
r5 t15
r6 t16
r7 t17
r8 t18, t23
r9 t19

You might think this is wasteful since we keep consuming free registers, but physical registers are eventually freed and reused; we will discuss that below (see the section on the commit stage). In OoO processors, register renaming is itself a pipeline stage, commonly just called "rename." Besides renaming, it has a few other responsibilities…
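To make the bookkeeping concrete, here is a minimal Python sketch of how a rename stage might consult a register alias table and a free list. The table contents mirror the example above; the function name and structure are illustrative, not how any real core implements it.

from collections import deque

# Register alias table from the example above: architectural -> physical.
rat = {f"r{i}": f"t{i + 10}" for i in range(1, 10)}   # r1->t11 ... r9->t19
free_list = deque(f"t{i}" for i in range(20, 32))     # free physical registers

def rename(op, dst, src1, src2):
    """Rename one instruction: sources read the current mapping,
    the destination is assigned a fresh physical register."""
    ps1, ps2 = rat[src1], rat[src2]       # look up source operands
    pdst = free_list.popleft()            # grab a free physical register
    rat[dst] = pdst                       # later instructions now see the new mapping
    return (op, pdst, ps1, ps2)

program = [("add", "r1", "r2", "r3"),
           ("mul", "r4", "r1", "r5"),
           ("add", "r1", "r6", "r7"),
           ("sub", "r8", "r1", "r9")]

for instr in program:
    print(rename(*instr))
# prints: add t20,t12,t13 / mul t21,t20,t15 / add t22,t16,t17 / sub t23,t22,t19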

Zeroing Idioms and Move Elimination

Two clever optimizations stem from the fact that we can now rename registers. Imagine a common compiler-generated pattern for clearing a register:

add  r1, r2, r3   # do stuff with r1
xor  r1, r1, r1   # set r1 to 0

Modern CPUs automatically know that r1 ^= r1 will always result in r1 = 0. This is known as a zeroing idiom. The rename stage can recognize the pattern xor rX, rX, rX and simply set the physical register mapped to rX to 0. The instruction never makes it to the ALU because there is no need, meaning it was never really executed at all! Some more zeroing idioms: sub rX, rX, rX works the same way, and x86 CPUs recognize vector forms such as pxor xmm0, xmm0.

Depending on the specifics, these instruction patterns may never actually be executed, making them near zero-cost. Move elimination is a similar idea: any instruction that simply copies one register into another can be eliminated during renaming. Consider the following:

move r2, r1

When the processor updates the register alias table, the following happens:

arch reg phys reg
r1 t45
r2 also t45

The move instruction disappears since the processor knows that r2 = r1 and r1 maps to t45. By the transitive property, the processor maps r2 to t45 as well.
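Extending the rename sketch above, here is one hedged way the rename stage might intercept zeroing idioms and moves so that neither ever reaches a functional unit. The dedicated zero register tzero is an assumption made for illustration.

def rename_with_elimination(rat, free_list, op, dst, src1, src2=None):
    """Illustrative only: intercept zeroing idioms and moves at rename time
    so they never occupy a functional unit."""
    if op == "xor" and dst == src1 == src2:
        rat[dst] = "tzero"        # zeroing idiom: map dst to a register holding 0
        return None               # nothing to execute
    if op == "move":
        rat[dst] = rat[src1]      # move elimination: dst aliases src1's physical register
        return None               # nothing to execute
    ps1 = rat[src1]
    ps2 = rat[src2] if src2 is not None else None
    pdst = free_list.popleft()
    rat[dst] = pdst
    return (op, pdst, ps1, ps2)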

Re-order Buffer

The last major responsibility of the rename stage is to allocate an ROB entry. The re-order buffer is a very important structure responsible for ensuring instructions retire correctly. Essentially, each entry contains:

  1. The destination architectural register.
  2. The destination physical register it was renamed to.
  3. A ready bit indicating whether the result has been produced.

To allocate a new ROB entry, the rename stage takes the instruction and extracts the destination architectural register and the destination physical register. The entry is marked as not ready, except for zeroing idioms and eliminated moves, which are complete the moment they are renamed. Zen 5 cores have ROBs with 448 entries.
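As a rough sketch (the field names are illustrative, not Zen 5's actual layout), an ROB entry allocated at rename could be modeled like this:

from dataclasses import dataclass

@dataclass
class ROBEntry:
    dest_arch: str        # destination architectural register, e.g. "r1"
    dest_phys: str        # physical register the result is renamed to, e.g. "t20"
    ready: bool = False   # set by the writeback stage; commit waits for it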

Instruction Dispatch

The renamed instructions get added to a queue and sent along to the dispatch stage. This stage is responsible for scheduling instructions to different functional units by keeping track of a scoreboard:

phys reg status
t1 ready
t2 pending
t3 pending
t4 ready
t5 ready

The scoreboard tells the processor which physical registers have already been written back to. The dispatch stage uses this information to check the instruction queue for instructions to “wake up.” For example, the dispatcher can tell add t2, t1, t5 to proceed but not or t6, t2, t5 because t2 is still marked as pending. The dispatch stage also keeps track of which functional units are free or in progress. If all functional units are occupied, the processor stalls.
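Here is a minimal sketch of that wake-up check, assuming the dict-based scoreboard above; real schedulers implement this with dedicated wake-up logic, not software.

scoreboard = {"t1": "ready", "t2": "pending", "t3": "pending",
              "t4": "ready", "t5": "ready"}

def can_dispatch(source_regs):
    """An instruction wakes up once every source physical register is ready."""
    return all(scoreboard.get(reg) == "ready" for reg in source_regs)

print(can_dispatch(["t1", "t5"]))   # add t2, t1, t5 -> True, may proceed
print(can_dispatch(["t2", "t5"]))   # or  t6, t2, t5 -> False, t2 still pending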

Functional Units

In modern microarchitectures like Zen 5, a few types of components actually carry out the execution:

  1. Arithmetic logic units (ALUs) for integer operations.
  2. Address generation units (AGUs), which compute effective addresses for loads and stores.
  3. Floating-point and vector units.
  4. Branch units, which resolve branch conditions and targets.

There may be some specialized functional units not listed here, but these are the common ones. After execution, functional units broadcast their results through the bypass network, or common data bus, which is what makes forwarding possible.

Writeback Stage

The writeback stage takes the result from functional units and writes it into the corresponding physical register. It also updates the ROB entry’s ready bit to be 1, indicating that the result can be committed. After the result is written, this stage additionally updates the scoreboard so the dispatcher knows which physical registers have results available.

Commit Stage

This is the last stage to process the instruction. Even though the instructions were actually executed out-of-order, the commit stage needs to retire them in order. Otherwise, the program’s flow from an external view would appear wrong, and debugging it would be a nightmare. The whole point of microarchitecture is to make optimizations without breaking the ISA’s guarantees. The commit stage selects the oldest ROB entry and retires it using the following steps:

  1. Write the result from the ROB’s destination physical register to the destination architectural register.
  2. If the instruction stores data to memory, this is where the data is actually written to the cache. It is a common pattern for the AGUs to compute the effective store address early, while the data is not written until the instruction is ready to commit. The reason is that stores must become visible in program order; writing the cache out of order could violate memory consistency.
  3. Recycle the physical register into the free list after writing to the architectural register.
  4. Retire the ROB entry and clear it.
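Tying the pieces together, here is a hedged sketch of in-order retirement from the head of the ROB, reusing the ROBEntry fields and free list from earlier; store draining and exception handling are omitted.

def commit(rob, arch_regfile, phys_regfile, free_list):
    """Retire instructions strictly from the oldest end of the ROB, in program order.
    `rob` is assumed to be a collections.deque of ROBEntry objects, oldest first."""
    while rob and rob[0].ready:
        entry = rob.popleft()                  # oldest completed instruction
        # Step 1: copy the result into the architectural register file.
        arch_regfile[entry.dest_arch] = phys_regfile[entry.dest_phys]
        # Step 2 (stores) is omitted: store data would drain to the cache here.
        # Step 3: recycle the physical register (simplified; a real core waits
        # until no in-flight instruction can still read it).
        free_list.append(entry.dest_phys)
        # Step 4: the popped entry's ROB slot is now free for the rename stage.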

Full OoO Example

Here is a full example of how an OoO core might execute this program for two iterations:

  add   r7, r0, r0
loop:
  add   r11, r7, r1
  add   r12, r7, r5
  beq   r7, r6, end
  lw    r10, 0(r11)
  lw    r11, 1(r11)
  sub   r11, r10, r11
  sw    r11, 0(r12)
  addi  r7, r7, #2
  beq   r0, r0, loop
end:

Program after register renaming:

  add   t2, t1, t1
loop:
  add   t4, t2, t3
  add   t6, t2, t5
  beq   t2, t7, end
  lw    t8, 0(t4)
  lw    t9, 1(t4)
  sub   t10, t8, t9
  sw    t10, 0(t6)
  addi  t11, t2, #2
  beq   t1, t1, loop
loop:               # second iteration, renamed with fresh registers
  add   t12, t11, t3
  add   t13, t11, t5
  beq   t11, t7, end
  lw    t14, 0(t12)
  lw    t15, 1(t12)
  sub   t16, t14, t15
  sw    t16, 0(t13)
  addi  t17, t11, #2
  beq   t1, t1, loop

Register alias table after renaming:

arch reg phys reg
r0 t1
r1 t3
r5 t5
r6 t7
r7 t2, t11, t17
r10 t8, t14
r11 t4, t9, t10, t12, t15, t16
r12 t6, t13

Pipeline diagram of back end execution engine (assuming 2-wide superscalar processor):

cycle # 1 2 3 4 5 6 7 8 9 10 11 12 13 14
add t2, t1, t1 D E W C                    
beq t1, t1, loop D E W             C        
add t4, t2, t3   D E W C                  
add t6, t2, t5   D E W C                  
beq t2, t7, end     D E W C                
lw t8, 0(t4)     D E W C                
lw t9, 1(t4)       D E W C              
addi t11, t2, #2       D E W     C          
sub t10, t8, t9         D E W C            
add t12, t11, t3         D E W     C        
sw t10, 0(t6)           D E W C          
add t13, t11, t5           D E W     C      
beq t11, t7, end             D E W   C      
lw t14, 0(t12)             D E W     C    
lw t15, 1(t12)               D E W   C    
addi t17, t11, #2               D E W       C
sub t16, t14, t15                 D E W   C  
beq t1, t1, loop                 D E W     C
sw t16, 0(t13)                   D E W C  

Branch Prediction

So far, we have been talking about improving performance through instruction execution. At a certain point, even OoO processors hit a limit in speedup. That is why modern CPUs do something dramatic: they try to predict the future using a technique called speculative execution. One common thing CPUs speculate on is control flow, which is ubiquitous in modern programming: every conditional, loop, and function call redirects execution. The general term in microarchitecture is branching, since code execution paths can branch at certain instructions. Branch instructions cannot be executed out-of-order in the traditional sense because the CPU does not know where to redirect the program counter until the instruction resolves. Therefore, the CPU has to guess whether the branch is taken or not taken. A wrong guess is expensive, since the CPU has to roll back everything executed down the wrong path. Branch predictors are specialized units that try to guess wrong as rarely as possible.

Static Predictor

This type of predictor operates on a static rule:

If the branch jumps forward (target > PC), predict not taken; if it jumps backward, predict taken.

The rule is built around loops. If a branch jumps to earlier code (backwards), chances are it is a loop, and a loop body usually repeats far more often than it exits via the forward path, so taking the backward branch is the better default. Nevertheless, this basic rule yields fairly mediocre accuracy, which is why it is not used on its own in modern microarchitectures.
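The rule fits in one line; a small sketch, where pc and target are instruction addresses:

def static_predict(pc, target):
    """Forward branches (target > pc) predicted not taken; backward branches taken."""
    return target <= pc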

1-Bit History Predictor

The premise of this one is simple: predict that a branch will do whatever it did last time. This gives our predictor some history. The rationale behind history-based predictors is that branch outcomes usually aren't decided in isolation; prior behavior is a strong hint for whether a branch will be taken again. A 1-bit predictor just remembers the last outcome for each (hashed) PC:

image

These predictors aren't very good either, since two different branches can hash to the same table entry and nothing in the table tells them apart. Adding tags to fix this would make the table much larger and slower (it would almost become a cache).

Markov Predictor

These predictors use Markov chains to model branch outcomes as a stochastic process. A table typically tracks, for each recently observed pattern of outcomes, how often the next outcome was 0 (not taken) versus 1 (taken):

pattern count(next = 0) / count(next = 1)
00 0 / 0
01 2 / 1
10 0 / 3
11 1 / 0

Bimodal Saturating Counter

Now we move on to 2-bit predictors. This predictor uses a state machine to keep track of predictions:

image

The predictor initializes to weakly not taken. Each time it observes a branch taken, it transitions closer to strongly taken, reinforcing its decision. The opposite is also true for not taken. This helps the branch predictor “learn” patterns given what it has seen before.
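A minimal sketch of the four-state saturating counter, encoding the states as 0 through 3, where 0 and 1 predict not taken and 2 and 3 predict taken; it initializes to weakly not taken as described above.

STRONG_NT, WEAK_NT, WEAK_T, STRONG_T = 0, 1, 2, 3

class BimodalCounter:
    def __init__(self):
        self.state = WEAK_NT                    # initialize to weakly not taken

    def predict(self):
        return self.state >= WEAK_T             # states 2 and 3 predict taken

    def update(self, taken):
        # Saturate: step one state toward the observed outcome.
        if taken:
            self.state = min(self.state + 1, STRONG_T)
        else:
            self.state = max(self.state - 1, STRONG_NT)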

Local Predictor

It seems like storing history is the right way to go about predicting branches. The local predictor is a two-level indexing scheme: the program counter indexes into a pattern history table (PHT), which stores the recent taken/not-taken patterns observed for that branch (hence "local"). Each pattern then maps into a branch history table holding the bimodal counters from above. Here is an example showing the second iteration of a pattern:

image

Local predictors could keep growing, but hardware structures are finite, while there is no fixed limit on the number of distinct branches and patterns a program can produce. This causes conflicts, where unrelated branches alias onto the same entries and pollute each other's history.

Correlating Predictor

We can level up the local predictor by combining it with global history. gshare, for example, is a branch predictor that combines a global history register (GHR) with the PC. Correlating predictors roughly look like this:

image

Here is how they work:

  1. Access the global history register.
    • The PC contains the instruction address of the branch.
    • The GHR contains the pattern history of the last few branches that we have executed.
  2. Use the pattern in the GHR to index into the BHT.
    • Make prediction using the BHT entry.
    • BHT entry contains the branch bias (strongly/weakly taken/not taken).
  3. When the actual outcome becomes available…
    • GHR <<= 1: this will effectively remove the oldest history entry.
    • Replace the LSB of the GHR with the actual branch outcome.
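Putting these steps together, here is a minimal gshare-style sketch that reuses the BimodalCounter from the previous section; the table size and the exact PC hashing are illustrative assumptions.

class GsharePredictor:
    def __init__(self, history_bits=10):
        self.history_bits = history_bits
        self.ghr = 0                                         # global history register
        self.table = [BimodalCounter() for _ in range(1 << history_bits)]

    def _index(self, pc):
        # gshare: XOR the global history with low-order PC bits to pick an entry.
        return (self.ghr ^ (pc >> 2)) & ((1 << self.history_bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)].predict()

    def update(self, pc, taken):
        self.table[self._index(pc)].update(taken)
        # Shift the actual outcome into the GHR, dropping the oldest bit.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)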

Tagged Geometric History Length Predictor

The predictors that we have talked about so far are fundamental, but not what is used today. TAGE is a modern predictor that uses multiple PHTs indexed by GHRs of different lengths. It intelligently allocates PHT entries to different branches based on their history requirements.

image

Advantages of TAGE:

  • It captures correlations over both very short and very long histories, spending table space only on the branches that need it.
  • Its accuracy is among the best known, which is why TAGE-like predictors appear in many shipping designs.

Disadvantages of TAGE:

  • Several tagged tables must be looked up and compared in parallel, which costs area, power, and latency.
  • The allocation and replacement heuristics make it considerably more complex to implement and verify than the predictors above.

Perceptron Predictors

This modern and advanced prediction technique applies a simple neural model, the perceptron, to decide whether a branch will be taken. Instead of memorizing history patterns in a table, the perceptron predictor learns a set of weights over the GHR from past outcomes, much as perceptrons are trained in machine learning.

image

Each branch is associated with a perceptron carrying a set of weights. Each weight corresponds to one bit in the GHR and represents how strongly that history bit is correlated with the direction of the branch.
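Here is a minimal sketch along those lines: each history bit is encoded as +1 or -1, the dot product with the weights decides the prediction, and the weights are trained on a misprediction or a low-confidence output. A single perceptron is shown for simplicity, and the threshold value is an illustrative assumption.

class PerceptronPredictor:
    """A single perceptron; real designs keep a table of them indexed by the PC."""
    def __init__(self, history_bits=8, threshold=16):
        self.history = [1] * history_bits    # GHR bits encoded as +1 (taken) / -1 (not taken)
        self.weights = [0] * history_bits
        self.bias = 0
        self.threshold = threshold           # training threshold (illustrative value)

    def _output(self):
        return self.bias + sum(w * x for w, x in zip(self.weights, self.history))

    def predict(self):
        return self._output() >= 0           # non-negative output predicts taken

    def update(self, taken):
        outcome = 1 if taken else -1
        y = self._output()
        # Train on a misprediction or when the output was not confident enough.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            self.bias += outcome
            for i, bit in enumerate(self.history):
                self.weights[i] += outcome * bit   # strengthen correlated history bits
        self.history = self.history[1:] + [outcome]  # shift in the new outcome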

Branch Target Buffer

We have talked a lot about predicting the outcome of a branch, but the other half of the problem is predicting where the branch will take us. The branch target buffer is a cache that stores predicted target addresses, looked up with the current PC:

image

The branch target buffer uses the same tag-index-offset scheme seen in caches, except that we break the PC, rather than a data memory address, into its tag, index, and offset.

Return Address Stack

A similar variant of branch target buffers exists for function calls. Return address stacks are structures used to predict where return instructions would go. Here is a simple 4-entry RAS:

image

On a call instruction, we increment the index and save the return address in that slot. On a ret instruction, we read the prediction from the current slot and decrement the index. Note that this is a different structure from the regular call stack, which also holds return addresses: that stack lives in RAM, while the RAS is baked into the CPU itself.
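A small sketch of a 4-entry RAS with the push-on-call, pop-on-return behavior just described; wrapping around on overflow is one common policy and is assumed here.

class ReturnAddressStack:
    """A tiny RAS; wrap-around on overflow is assumed for illustration."""
    def __init__(self, entries=4):
        self.slots = [0] * entries
        self.top = 0                                   # index of the most recent call

    def on_call(self, return_addr):
        self.top = (self.top + 1) % len(self.slots)    # advance, wrapping when full
        self.slots[self.top] = return_addr             # save the return address

    def on_return(self):
        predicted = self.slots[self.top]               # predicted return target
        self.top = (self.top - 1) % len(self.slots)
        return predicted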

Modern Speculation

Real branch predictors like the ones found inside Zen 5 are largely proprietary. After all, speculation is one of the main elements driving competition in the processor market: a fast CPU has to guess constantly and be right nearly every time. Regardless, here is what a modern CPU fetch stage might look like:

image

Besides predicting branches, modern processors also have a ton of more advanced ways to speculate:

Caching

One major bottleneck of modern computer architecture is the relatively slow speed of random-access memory. Now that we have a blazingly fast OoO core doing speculation, spending hundreds of cycles on every DRAM access is just not going to cut it. Therefore, we need to cache data so the CPU does not stall waiting for it to arrive. If you think about it, RAM itself acts as a cache for data on the disk, which is slower still. The concept of caching is nothing new, and its performance gains have been demonstrated in many applications. That is why CPUs dedicate so much die area to caches: at a certain point, when the functional units cannot be meaningfully optimized further, the remaining space goes to the cache. A few goals make CPU caches different from other types of caches:

Placement Policies

To exploit temporal and spatial locality, we split the memory address into a format consisting of three parts: address = [tag | index | offset]

To best model how each type of cache behaves, concrete examples will be shown instead of theoretical properties.

Direct-Mapped Cache

Direct-mapped caches are the simplest type of CPU caches. If $i$ bits contribute to the index, and $f$ bits contribute to the offset, then the cache is:

set entry
0 [valid bit] / [tag bits] / [block of $2^f$ bytes]
1 [valid bit] / [tag bits] / [block of $2^f$ bytes]
2 [valid bit] / [tag bits] / [block of $2^f$ bytes]
$2^i-1$ [valid bit] / [tag bits] / [block of $2^f$ bytes]

effective data capacity $= 2^{i+f}$ bytes
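To make the [tag | index | offset] split concrete, here is a sketch of a direct-mapped lookup with illustrative sizes ($i = 6$ index bits and $f = 6$ offset bits, i.e. 64 sets of 64-byte blocks, 4 KiB of data):

INDEX_BITS, OFFSET_BITS = 6, 6            # i and f from the text (illustrative sizes)
NUM_SETS = 1 << INDEX_BITS                # 64 sets of 64-byte blocks = 4 KiB of data

# Each set holds one entry: a valid bit, tag bits, and a block of 2**f bytes.
cache = [{"valid": False, "tag": None, "dirty": False,
          "block": bytearray(1 << OFFSET_BITS)} for _ in range(NUM_SETS)]

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def read_byte(addr):
    tag, index, offset = split(addr)
    entry = cache[index]
    if entry["valid"] and entry["tag"] == tag:
        return entry["block"][offset]     # hit: data comes straight from the cache
    return None                           # miss: fetch from the next level / memory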

Set-Associative Cache

Set-associative caches introduce $w$ ways, which serve as columns to our table:

set way 0 way 1 way $w-1$
0 [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes]
1 [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit]/[tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes]
$2^i-1$ [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes]

effective data capacity $= w \times 2^{i+f}$ bytes

Fully Associative Cache

This type of cache only has one set:

set way 0 way 1 way $w-1$
0 [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes]

effective data capacity $= w \times 2^f$ bytes

Eviction Policies

There are several methods to decide which way gets evicted when a set is full and all ways are occupied: least recently used (LRU), pseudo-LRU, first-in first-out (FIFO), and random replacement are the most common. For example, a 2-way cache can track true LRU with a single bit per set:

set way 0 way 1 LRU
0 [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes] 0 (way 1 was just accessed)
$2^i - 1$ [valid bit] / [tag bits] / [block of $2^f$ bytes] [valid bit] / [tag bits] / [block of $2^f$ bytes] 1 (way 0 was recently accessed)

Write Policies

How do we keep RAM and the cache consistent?

  • Write-through: every write updates both the cache and the next level immediately.
  • Write-back: writes update only the cache; the modified (dirty) block is written to the next level when it is evicted.

Should we make room in the cache when a write results in a miss?

  • Write-allocate: fetch the block into the cache on a write miss, then write to it.
  • No-write-allocate (write-around): send the write to the next level without filling the cache.

You can mix these policies by choosing one from the first grouping and one from the second. Most modern caches are write-back and write-allocate.
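As a small illustration of the write-back, write-allocate combination, here is a sketch that reuses split() and cache from the direct-mapped example above; it glosses over writing back the victim block and fetching the new block's old contents.

def write_byte(addr, value):
    """Write-back + write-allocate, in sketch form."""
    tag, index, offset = split(addr)
    entry = cache[index]
    if not (entry["valid"] and entry["tag"] == tag):
        # Write miss: allocate the block in the cache (write-allocate). A real cache
        # would first write back the old block if dirty, then fetch the new block.
        entry["valid"], entry["tag"] = True, tag
    entry["block"][offset] = value
    entry["dirty"] = True   # write-back: memory is updated only when the block is evicted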

Cache Hierarchy

There are different levels of caches depending on the microarchitecture. Having several levels of cache means there must be an inclusivity policy. Inclusive caches mean that the larger caches (closer to RAM) must contain all blocks present in the smaller caches (closer to the execution units). Exclusive caches mean that a block in one level must not be present in another. Non-inclusive caches guarantee neither: inclusion is not enforced, but not forbidden either. Each level serves a different purpose:

image

  1. The level 1 (L1) cache is split into an instruction cache and a data cache. The fetch stage reads from L1i and the memory stage accesses L1d. Modern CPUs hold virtual addresses in their registers and rely on the translation lookaside buffer to cache physical addresses. Paged memory deserves its own article but for the purposes of this one, L1 caches are usually virtually indexed and physically tagged (VIPT), meaning the index bits come from the virtual address, but the tag is derived from the physical address (requires TLB).
  2. The level 2 (L2) cache is unified in most designs. It is a per-core cache that helps support the L1 cache and is physically indexed and physically tagged (PIPT), meaning it completely operates on physical addresses.
  3. The level 3 (L3) cache is usually the last-level cache. It is shared between cores and serves as a general cache for the entire chip.
  4. Victim caches are small, fully associative caches that sit between two levels of cache (usually between L1 and L2). When a block is evicted from the main cache, it is placed in the victim cache to give it a "second chance." This is best for useful blocks that are evicted not because they have gone cold, but because the main cache ran out of space.

Cache Coherence

The shared L3 exposes a unique problem: multiple cores can each hold a copy of the same memory location in their private caches, so how can we ensure they all see consistent values? There are two broad techniques that tackle this:

  1. Snooping: each cache watches (snoops on) a shared bus and reacts to other cores' reads and writes to blocks it holds.
  2. Directory-based coherence: a directory tracks which caches hold each block and sends targeted messages instead of broadcasting.

Here is the metadata that each block contains:

The cache controller can transition between metadata states using this state machine:

image

Zen 5 Specs

After going through how superscalar OoO cores speculate and several types of caches and their purpose, you hopefully have the knowledge to understand the majority of this diagram:

image

We covered many topics related to the microarchitectural design of Zen 5, but several notable advanced topics were not covered in this tutorial:

Computer microarchitecture is a broad and evolving area, and this overview only covers a small subset of the concepts involved. In practice, real designs vary widely and are tightly linked to performance, power efficiency, and cost, with decisions at this level shaping everything from product competitiveness to broader shifts in the computing market.