Below are notes from Architectural and Operating System Support for Virtual Memory.
VM is on the critical path of every instruction and data reference, so most modern architectures dedicate hardware to make VM as efficient as possible. This chapter introduces the hardware stack of VM, covering both architecture and microarchitecture.
1. Inverted Page Tables
Multi-level page tables are common, but there is another way to save page table space: the Inverted Page Table, which maintains one entry for every physical page in the system. Each entry records the process ID and virtual page number, so a single inverted page table serves all processes (whereas a multi-level page table is maintained per process).
Linear search is expensive, so the PowerPC architecture uses a Hashed Inverted Page Table: the virtual page number is first hashed, and the inverted page table is searched starting from the hashed index.
Different VAs may hash to the same value, so a collision chain is set up: a list of alternative positions to search when the initially probed entry does not match. The hashed position is looked up first; if it matches, the search is complete, otherwise the entries on the collision chain are looked up in turn. If the search reaches the end of the chain without a hit, a page fault is raised.

The Hashed Anchor Table (HAT) balances collision chain length against page table size. The HAT is indexed by the hash value and accessed before the inverted page table; each HAT entry points to the head of a collision chain in the inverted page table. Increasing the HAT size reduces collision chain length without changing the size of the inverted page table. (A HAT entry is smaller than an inverted page table entry, so enlarging the HAT is the more area-efficient option.)
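A minimal lookup sketch that combines the HAT and the collision chain; the structures, hash function, and entry format are assumptions for illustration only.

```c
/* Hashed inverted page table lookup through a HAT (illustrative sketch). */
#include <stdint.h>
#include <stdbool.h>

#define NO_CHAIN UINT32_MAX

typedef struct {
    uint32_t pid;        /* owning process                            */
    uint64_t vpn;        /* virtual page number mapped to this frame  */
    uint32_t next;       /* next frame index on the collision chain   */
    bool     valid;
} ipt_entry_t;

typedef struct {
    ipt_entry_t *ipt;    /* one entry per physical frame              */
    uint32_t    *hat;    /* HAT: hash value -> head of a collision chain */
    uint32_t     hat_size;
} inverted_pt_t;

static uint32_t hash_vpn(uint32_t pid, uint64_t vpn, uint32_t hat_size) {
    return (uint32_t)((vpn ^ pid) % hat_size);     /* toy hash function */
}

/* Returns the physical frame number, or UINT32_MAX to signal a page fault. */
uint32_t ipt_lookup(const inverted_pt_t *pt, uint32_t pid, uint64_t vpn) {
    uint32_t frame = pt->hat[hash_vpn(pid, vpn, pt->hat_size)];   /* chain head */
    while (frame != NO_CHAIN) {
        const ipt_entry_t *e = &pt->ipt[frame];
        if (e->valid && e->pid == pid && e->vpn == vpn)
            return frame;                /* frame index == physical page number */
        frame = e->next;                 /* follow the collision chain          */
    }
    return UINT32_MAX;                   /* end of chain: page fault            */
}
```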

2. TLB Arrangement
2.1 Multi-level TLBs
Q1: How to balance lookup latency against growing application memory footprints?
A1: Multi-level TLBs. The L1 TLB is small, while last-level TLBs might have as many as thousands of entries. Today, many CPU vendors implement two levels of TLB per CPU.

Q2: Should the last-level TLB be private to each core or shared among multiple cores?
A2: It is a balance between last-level TLB hit rate and latency (a private TLB usually has lower latency than a shared one). Area efficiency must also be considered when sharing.

Q3: Should TLBs be unified or split between instructions and data?
A3: Keeping instruction and data TLBs separate ensures that access to each by the relevant parts of the pipeline is fast and uncontended; instruction and data accesses can both be on the critical path of instruction execution. Higher-level TLBs tend to be unified because they are not on the critical path of a hit.
2.2 TLB Placement Relative to Caches
| Cache Organization | TLB Lookup and Cache Lookup | Advantage | Disadvantage | Notes |
| --- | --- | --- | --- | --- |
| PIPT Cache | TLB lookup is performed before the cache is looked up. | Simple cache management (multiprogramming, snooping). | Latency: the TLB must be looked up before the cache lookup. | Typically, L2 and LLC are PIPT. L1 is usually not PIPT, except in simple embedded CPUs. |
| VIVT Cache | No TLB lookup is needed for the cache lookup; on a hit, the TLB is consulted only for the permission check. | Good latency; the TLB can be scaled to a larger size. | On a hit, the TLB must still be looked up for the permission check before the access retires (or permission bits must be stored in the cache; for security, the check is needed before hit data is used to form an address). Multiprogramming is challenging: different processes use the same VA mapped to different PAs, so each line needs a PID (Process Identifier) or ASID (Address Space Identifier) tag, increasing cache storage (alternatively, caches can be flushed when processes switch). Within a single process, different synonym VAs can map to the same PA; synonyms are difficult to manage in VIVT caches. | VIVT is not common in general-purpose CPUs. |
| VIPT Cache | TLB lookup is performed in parallel with the cache lookup (the VA is used for indexing, the PA for tag comparison). | Hides TLB latency behind the cache access. No multiprogramming or synonym problems, since tags are compared on PAs. | The cache set index must come from the page offset (which needs no translation), so it is limited (e.g. only 6 bits are available for the set index with 4 KB pages and 64 B lines); otherwise the same PA could be indexed into different sets by different synonyms. | Armv6 cores use page coloring to solve synonyms in VIPT caches. (https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/page-colouring-on-armv6-and-a-bit-on-armv7) |
| PIVT Cache | TLB lookup is performed before the cache lookup. |  |  | PIVT is not practical: if the PA is already known at index time, there is no reason to use the VA for tag comparison. |
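The 6-bit index limit in the VIPT row follows from simple arithmetic: only the untranslated page-offset bits can be used for indexing, and the line-offset bits are spent within a cache line. A small sketch, assuming 4 KB pages, 64 B lines, and an 8-way L1 (illustrative parameters):

```c
#include <stdio.h>

int main(void) {
    const unsigned page_bits = 12;   /* log2(4096): untranslated offset bits */
    const unsigned line_bits = 6;    /* log2(64): line-offset bits           */
    const unsigned ways      = 8;

    unsigned index_bits = page_bits - line_bits;   /* bits left for the set index */
    unsigned sets       = 1u << index_bits;        /* 64 sets                     */
    unsigned max_size   = sets * (1u << line_bits) * ways;

    /* Prints: index bits: 6, sets: 64, max VIPT cache size: 32 KiB.
       Growing a VIPT cache beyond this means adding ways, not sets. */
    printf("index bits: %u, sets: %u, max VIPT cache size: %u KiB\n",
           index_bits, sets, max_size / 1024);
    return 0;
}
```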
3. TLB Replacement Policies
TLB replacement policy is not as widely studied as cache replacement policy, as the latter will generally have a much larger effect on performance.
L2 TLBs can use an LRU, pseudo-LRU, or FIFO policy. L1 TLBs can use a randomized policy if timing is tight.
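For reference, a minimal tree pseudo-LRU sketch for one 4-way set; this illustrates the policy named above and is not any vendor's implementation.

```c
/* Three tree bits per 4-way set: b0 chooses between way pairs {0,1} and {2,3},
   b1 within {0,1}, b2 within {2,3}. Each bit points toward the side to evict next. */
#include <stdint.h>

typedef struct { uint8_t b0, b1, b2; } plru4_t;

/* On an access, flip the bits on the path so they point away from `way`. */
void plru_touch(plru4_t *s, unsigned way) {
    if (way < 2) { s->b0 = 1; s->b1 = (way == 0); }   /* left pair was used  */
    else         { s->b0 = 0; s->b2 = (way == 2); }   /* right pair was used */
}

/* Follow the bits to find the way to replace on a miss. */
unsigned plru_victim(const plru4_t *s) {
    if (s->b0 == 0) return s->b1 == 0 ? 0 : 1;        /* evict from {0,1} */
    else            return s->b2 == 0 ? 2 : 3;        /* evict from {2,3} */
}
```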
4. Multiple Page Sizes

Many modern OSes and architectures support multiple page sizes concurrently.
On a TLB lookup, the lower bits of the VA are needed to select the TLB set, but the location of these bits is not known until the page size is known, and the page size is not known until the TLB lookup is performed.
The first approach is split TLBs, one per page size, looked up in parallel. This wastes energy, since the translation exists in only one of the TLBs, and it underutilizes TLB capacity: for example, if the OS allocates only small pages, the superpage TLBs sit idle.
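A sketch of the split-TLB lookup, assuming two page sizes (4 KB and 2 MB) and small direct-mapped structures for brevity; the sizes and layout are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define SETS 16                              /* sets per TLB (power of two) */

typedef struct { uint64_t vpn; uint64_t pfn; bool valid; } tlb_entry_t;

static tlb_entry_t tlb_4k[SETS];             /* direct-mapped, for brevity */
static tlb_entry_t tlb_2m[SETS];

static bool probe(const tlb_entry_t *tlb, uint64_t va, unsigned page_shift,
                  uint64_t *pfn_out) {
    uint64_t vpn = va >> page_shift;         /* page-size-dependent VPN */
    const tlb_entry_t *e = &tlb[vpn & (SETS - 1)];
    if (e->valid && e->vpn == vpn) { *pfn_out = e->pfn; return true; }
    return false;
}

/* Hardware probes both structures at once; the translation can live in only
   one of them, which is why the duplicate probe wastes energy. */
bool lookup(uint64_t va, uint64_t *pfn_out) {
    return probe(tlb_4k, va, 12, pfn_out) || probe(tlb_2m, va, 21, pfn_out);
}
```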
5. Page Table Entry Metadata
5.1 Permission Information
The permission information in a page table entry varies across architectures: e.g. a dedicated read-permission bit versus assuming any valid region is readable, W^X protection enforced in hardware versus left to the OS, and the presence or absence of User/Supervisor bits.
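As one concrete point in this design space, here is an x86-64-style set of PTE permission and metadata bits; the helper below is only an illustration of how these bits might be consulted, not a specific implementation.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT  (1ull << 0)   /* translation is valid                                */
#define PTE_WRITE    (1ull << 1)   /* writable (no separate read bit: valid => readable)  */
#define PTE_USER     (1ull << 2)   /* user/supervisor bit                                 */
#define PTE_ACCESSED (1ull << 5)   /* set when the page is accessed                       */
#define PTE_DIRTY    (1ull << 6)   /* set when a store writes to the page                 */
#define PTE_NX       (1ull << 63)  /* no-execute; enables W^X-style policies              */

/* A store by user code is allowed only if the PTE is present, writable,
   and user-accessible. */
static bool user_store_allowed(uint64_t pte) {
    return (pte & PTE_PRESENT) && (pte & PTE_WRITE) && (pte & PTE_USER);
}
```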
5.2 Accessed and Dirty Bits
Accessed and dirty bits are for OS page replacement decisions.
Accessed Bits:
- mark pages that have been accessed in the recent past
- on modern CPUs, the page table walker is responsible for setting the accessed bit on load and store operations
- they let the OS identify cold pages as prime candidates for eviction
- the OS can clear accessed bits periodically; after a sufficiently long window, pages whose accessed bit is still clear are considered cold
Dirty Bits:
- mark pages with data that needs to be written back to backing storage
- set when a store operation writes to a page
- dirty bits are usually maintained by the TLB logic rather than the page table walker, because a store may hit in the TLB and never invoke the walker
The accessed bit is a performance hint, while the dirty bit is needed for correctness: if a dirty page is chosen for eviction, its data must be written back to backing storage before the physical page is reused.
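A minimal sketch of how an OS might use these bits: write_back_page() and reclaim_frame() are hypothetical helpers, and the TLB shootdowns a real kernel would issue are ignored.

```c
#include <stdint.h>
#include <stddef.h>

#define PTE_ACCESSED (1ull << 5)
#define PTE_DIRTY    (1ull << 6)

void write_back_page(size_t frame);   /* hypothetical: flush frame to backing storage */
void reclaim_frame(size_t frame);     /* hypothetical: put frame on the free list     */

/* Periodic scan: clear accessed bits; pages still clear on the next pass are
   cold and may be evicted, after writing back dirty data. */
void scan_for_cold_pages(uint64_t *pte, size_t npages) {
    for (size_t i = 0; i < npages; i++) {
        if (pte[i] & PTE_ACCESSED) {
            pte[i] &= ~PTE_ACCESSED;          /* recently used: give it another window */
        } else {
            if (pte[i] & PTE_DIRTY) {         /* cold and dirty: write back first      */
                write_back_page(i);
                pte[i] &= ~PTE_DIRTY;
            }
            reclaim_frame(i);                 /* cold: reuse the physical frame        */
        }
    }
}
```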
5.3 Address Space Identifiers and Global Bits
Historically, TLBs did not track context information with each TLB entry, so the OS had to flush the TLB on every context switch. TLB flushes degrade performance (around 10% on ARM and x86). Today, many architectures add hardware support to reduce the frequency of TLB flushes.
ASIDs and global bits are added:
- Each process has its own ASID, and a TLB lookup must match the ASID for a hit.
- The global bit identifies translations that are global to all processes (usually addresses used by the kernel). When the global bit is set in a TLB entry, no ASID comparison is needed (and such entries need not be flushed, even on systems without ASIDs).
An ASID is not necessarily the same as the OS process ID: x86 provides a 12-bit ASID, while a Linux PID is 32 bits, so the OS is responsible for tracking the mapping between PIDs and ASIDs.
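A sketch of the resulting hit condition for a single TLB entry; the structure below is illustrative, not any particular ISA's entry format.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t vpn;      /* virtual page number                        */
    uint64_t pfn;      /* physical frame number                      */
    uint16_t asid;     /* address space identifier of the owner      */
    bool     global;   /* translation valid for every address space  */
    bool     valid;
} tlb_entry_t;

static bool tlb_hit(const tlb_entry_t *e, uint64_t vpn, uint16_t current_asid) {
    if (!e->valid || e->vpn != vpn)
        return false;
    /* Global entries (e.g. kernel mappings) skip the ASID comparison. */
    return e->global || e->asid == current_asid;
}
```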
6. Page Table Walkers
6.1 Software-Managed TLBs
In the early days of VM, page table walks were performed purely by the OS (popular on SPARC, MIPS, and ARM in the 1990s and early 2000s). A TLB miss triggers an interrupt, and the OS runs an interrupt handler that performs the page table walk. When the walk completes, the TLB is filled via a dedicated instruction and control returns to the user-level process.
Performance is poor:
- a TLB miss requires a switch into the OS, and the pipeline must be flushed before and after the interrupt handler
- the interrupt handler pollutes caches and predictors (branch predictors and memory disambiguation predictors)
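A minimal software TLB-miss handler sketch in the spirit of the description above; the 32-bit two-level page table layout, tlb_fill(), and raise_page_fault() are assumptions, not any specific ISA's interface.

```c
#include <stdint.h>

#define PTE_PRESENT 1u

extern uint32_t *page_dir;                         /* root of the current page table */
extern void tlb_fill(uint32_t vpn, uint32_t pte);  /* hypothetical "write TLB" op    */
extern void raise_page_fault(uint32_t va);

void tlb_miss_handler(uint32_t faulting_va) {
    uint32_t l1 = faulting_va >> 22;               /* top 10 bits: directory index */
    uint32_t l0 = (faulting_va >> 12) & 0x3ff;     /* next 10 bits: table index    */

    uint32_t l1e = page_dir[l1];
    if (!(l1e & PTE_PRESENT)) { raise_page_fault(faulting_va); return; }

    /* Assumes the kernel can address page table memory directly. */
    uint32_t *l0_table = (uint32_t *)(uintptr_t)(l1e & ~0xfffu);
    uint32_t pte = l0_table[l0];
    if (!(pte & PTE_PRESENT)) { raise_page_fault(faulting_va); return; }

    /* Install the translation and return; the faulting access is replayed. */
    tlb_fill(faulting_va >> 12, pte);
}
```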
6.2 Hardware-Managed TLBs
Benefits:
- avoids interrupts, pipeline flushes, and predictor pollution
- the pipeline can execute other instructions in the program, overlapping them with the page table walk
For the base address of the current context's page table, x86 uses the CR3 register, and ARM uses TTBR0 and TTBR1 (the latter points to page tables that are globally shared among processes).
For performance, multiple page table walkers can work in parallel for multiple in-flight TLB misses.
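A sketch of what one hardware walker does for a 4 KB translation in the x86-64 four-level layout (PML4 -> PDPT -> PD -> PT), starting from the CR3-held root; phys_to_virt() is a hypothetical helper, and large pages and permission checks are omitted.

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT (1ull << 0)
#define ADDR_MASK   0x000ffffffffff000ull   /* bits 12..51 hold the next-level frame */

extern void *phys_to_virt(uint64_t pa);     /* hypothetical: map a PA for the walker */

/* Returns true and fills *pte_out on success, false on a page fault. */
bool walk(uint64_t cr3, uint64_t va, uint64_t *pte_out) {
    uint64_t table_pa = cr3 & ADDR_MASK;
    for (int level = 3; level >= 0; level--) {
        /* 9 index bits per level, starting at bit 12 + 9*level of the VA. */
        unsigned idx = (va >> (12 + 9 * level)) & 0x1ff;
        uint64_t entry = ((uint64_t *)phys_to_virt(table_pa))[idx];
        if (!(entry & PTE_PRESENT))
            return false;                    /* not mapped: page fault    */
        if (level == 0) {
            *pte_out = entry;                /* leaf PTE for a 4 KB page  */
            return true;
        }
        table_pa = entry & ADDR_MASK;        /* descend to the next level */
    }
    return false;                            /* unreachable */
}
```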
6.3 MMU Caches
MMU caches hold page table entries from the earlier levels of a multi-level page table: the TLB caches PTEs from the last level of the walk, while MMU caches store the PTEs of the other levels.
At each level of the walk, the walker first checks the MMU cache for that level; on a hit, it can proceed directly to the next level of the page table walk.
It makes sense to dedicate caching to upper-level PTEs because each one maps a larger portion of the address space. In the x86 four-level scheme, a single entry at each level covers 512 GB, 1 GB, 2 MB, and 4 KB respectively, so upper-level entries are highly likely to be reused.
Intel uses Paging Structure Caches (PSCs), with separate PSCs maintained per page table walk level: L4 entries are tagged with the L4 index, L3 entries with the L4 and L3 indices, and L2 entries with the L4, L3, and L2 indices. All PSCs are searched in parallel on a TLB miss, and the longest match is used as the starting point for the rest of the page table walk.

AMD uses Page Walk Caches (PWCs): PWCs are simply dedicated PIPT caches for each page table level, so the levels must be looked up sequentially.
(MMU cache access time is around 8-15 cycles today.)
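A sketch of the PSC-style longest-match lookup described above; the single-entry-per-level structure is a simplification for illustration, not Intel's actual organization.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t tag;       /* concatenated upper VA indices covered by this entry */
    uint64_t next_pa;   /* physical address of the next-level table            */
    bool     valid;
} psc_entry_t;

typedef struct {
    psc_entry_t l2;     /* tagged with L4+L3+L2 indices -> points to the PT   */
    psc_entry_t l3;     /* tagged with L4+L3 indices    -> points to the PD   */
    psc_entry_t l4;     /* tagged with L4 index         -> points to the PDPT */
} psc_t;

/* Returns the level at which the walk must resume (0 = only the leaf PT access
   remains, 3 = full walk from CR3) and the table to start from via *table_pa. */
int psc_lookup(const psc_t *psc, uint64_t va, uint64_t cr3, uint64_t *table_pa) {
    uint64_t i4 = (va >> 39) & 0x1ff, i3 = (va >> 30) & 0x1ff, i2 = (va >> 21) & 0x1ff;

    if (psc->l2.valid && psc->l2.tag == ((i4 << 18) | (i3 << 9) | i2)) {
        *table_pa = psc->l2.next_pa; return 0;   /* longest match: skip three levels */
    }
    if (psc->l3.valid && psc->l3.tag == ((i4 << 9) | i3)) {
        *table_pa = psc->l3.next_pa; return 1;
    }
    if (psc->l4.valid && psc->l4.tag == i4) {
        *table_pa = psc->l4.next_pa; return 2;
    }
    *table_pa = cr3; return 3;                   /* no match: full walk */
}
```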
6.4 Translation Storage Buffers
SPARC’s Translation Storage Buffer (TSB) is software-managed caching support for address translation. The TSB is held in memory and cached in the conventional on-chip hardware caches; a special CPU register holds a pointer to the physical address of the TSB’s root.
SPARC CPUs implement TLB miss handling via an OS trap. On a TLB miss, the interrupt handler searches the TSB for an entry whose tag matches the virtual address of the missing translation. If an entry is found, it is loaded into the TLB; otherwise a full page table walk proceeds.
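A sketch of the trap handler's TSB probe (direct-mapped TSB, 8 KB base pages as on SPARC); the entry format and the helper functions are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t tag; uint64_t data; } tsb_entry_t;

extern tsb_entry_t *tsb_base;                    /* root of the TSB (from the CPU register) */
extern size_t       tsb_size;                    /* number of entries (power of two)        */
extern void tlb_fill(uint64_t vpn, uint64_t data);     /* hypothetical TLB write            */
extern void slow_path_page_table_walk(uint64_t va);    /* hypothetical full walk            */

void tsb_miss_handler(uint64_t faulting_va) {
    uint64_t vpn = faulting_va >> 13;            /* 8 KB pages */
    tsb_entry_t *e = &tsb_base[vpn & (tsb_size - 1)];

    if (e->tag == vpn)
        tlb_fill(vpn, e->data);                  /* TSB hit: install and return */
    else
        slow_path_page_table_walk(faulting_va);  /* TSB miss: do the full walk  */
}
```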
7. Summary
This chapter covers the hardware and ISA-level design space of the VM subsystem, including alternative page table designs and the TLBs and other caching structures used to speed up translation.