Below are notes from Architectural and Operating System Support for Virtual Memory.
Advanced hardware designs to reduce address translation overheads. This chapter covers both native and virtualized systems.
Hardware-based improvements do not require changes to the software stack, but they consume on-chip area and power and must be verified for correctness. Proposed hardware therefore needs to be sufficiently simple and readily implementable.
1. Improving TLB Reach
TLB reach: the effective capacity offered by the TLB. The higher the reach, the lower the frequency of TLB misses.
- e.g., a 1024-entry TLB holding 4 KB pages corresponds to a total reach of 4 MB of memory (see the sketch below)
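A quick sketch of this arithmetic; the entry counts and page sizes below are just illustrative values:

```c
#include <stdio.h>

/* TLB reach = number of entries x bytes covered per entry. */
static unsigned long long tlb_reach(unsigned long long entries,
                                    unsigned long long page_bytes) {
    return entries * page_bytes;
}

int main(void) {
    /* 1024 entries of 4 KB pages -> 4 MB of reach. */
    printf("4KB pages: %llu MB\n", tlb_reach(1024, 4ULL << 10) >> 20);
    /* The same 1024 entries holding 2 MB superpages -> 2 GB of reach. */
    printf("2MB pages: %llu GB\n", tlb_reach(1024, 2ULL << 20) >> 30);
    return 0;
}
```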
1.1 Shared Last-level TLBs
Modern processors generally employ two levels of TLB, private to each core. Recent studies have considered the potential benefits of a last-level TLB shared among multiple cores.
- when multiple threads running on different cores access the same translations (e.g., threads collaborating on shared data structures)
    - one core brings a translation into the shared TLB, and another core can then hit on that entry
- for non-shared TLB entries
    - a shared last-level TLB is more flexible than private L2 TLBs in where entries can be placed

(private TLB can be more levels, such as L1/L2 TLB is private to each core and there is an L3 shared last-level TLB)
The book discusses the following design spaces:
- TLB entry design
    - shared last-level TLB entries store the same information as L1 TLB entries
- Multi-level inclusion
    - strictly inclusive, mostly inclusive, or exclusive
    - strict inclusion requires back-invalidations, because the L2 and L1 TLBs make independent eviction decisions
- Translation coherence
    - on a TLB shootdown, with strict inclusion, if the L2 TLB holds no matching entry then the L1 TLBs need not be invalidated
    - the decision to be strictly inclusive must balance back-invalidation overhead against the ability to filter coherence messages
- Centralized vs. distributed design
- Integrating prefetching strategies
    - on a TLB miss, past work inserts the requested translation into the shared TLB and also prefetches entries for virtual pages consecutive to the current one
    - these prefetches do not require extra page table walks
    - for 64-byte cache lines, the translations for 8 adjacent virtual pages reside on the same cache line, so past work prefetches these entries into the shared TLB with no additional page walk requirements (as sketched below)
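A minimal sketch of which neighboring translations arrive "for free" on a fill, assuming 8-byte page table entries and 64-byte cache lines as on x86-64; the VPN value is made up:

```c
#include <stdio.h>

#define PTE_SIZE        8   /* bytes per x86-64 page table entry */
#define CACHELINE_SIZE 64   /* bytes */
#define PTES_PER_LINE  (CACHELINE_SIZE / PTE_SIZE)  /* 8 */

int main(void) {
    unsigned long long vpn = 0x12345;   /* virtual page that missed (illustrative) */
    /* The leaf PTEs of 8 consecutive virtual pages share one cache line, so a
     * single page-walk memory reference brings in all 8 translations.
     * These neighbors are the natural prefetch candidates for the shared TLB. */
    unsigned long long first = vpn & ~(unsigned long long)(PTES_PER_LINE - 1);
    printf("VPN 0x%llx missed; PTEs for VPNs 0x%llx..0x%llx arrive on the same line\n",
           vpn, first, first + PTES_PER_LINE - 1);
    return 0;
}
```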
1.2 Part-of-memory TLBs
An extremely large shared TLB that is placed directly within main memory.
The L1/L2 TLBs provide fast access times, while the part-of-memory TLB provides large TLB reach, which is particularly helpful when TLB miss penalties are severe.
Part-of-memory TLB entries can also be cached in the hardware data caches to further accelerate their access.
- when the private TLBs miss, instead of immediately starting a page table walk, the cache hierarchy and the part-of-memory TLB are looked up (the lookup address is the base address of the part-of-memory TLB plus an offset derived from the virtual address; see the sketch below)
- if the part-of-memory TLB also misses, it informs the core to start a page table walk
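A minimal sketch of forming the lookup address, assuming a direct-mapped part-of-memory TLB; the base address, entry count, and entry size below are hypothetical choices for illustration, not the design from the book:

```c
#include <stdio.h>
#include <stdint.h>

#define POM_TLB_BASE    0x100000000ULL  /* hypothetical base physical address */
#define POM_TLB_ENTRIES (1ULL << 20)    /* hypothetical number of entries */
#define POM_ENTRY_SIZE  16ULL           /* hypothetical bytes per entry */

/* Physical address used to look up the part-of-memory TLB for a given VA:
 * base address of the structure plus an offset derived from the virtual page. */
static uint64_t pom_tlb_lookup_addr(uint64_t vaddr) {
    uint64_t vpn = vaddr >> 12;                  /* 4 KB pages */
    uint64_t idx = vpn & (POM_TLB_ENTRIES - 1);  /* direct-mapped index */
    return POM_TLB_BASE + idx * POM_ENTRY_SIZE;
}

int main(void) {
    uint64_t va = 0x7f3a12345678ULL;
    printf("lookup address for VA 0x%llx is PA 0x%llx\n",
           (unsigned long long)va,
           (unsigned long long)pom_tlb_lookup_addr(va));
    return 0;
}
```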

1.3 TLB Coalescing
Use simple compression schemes to record information about multiple translations in a single hardware entry (available in commercial systems such as AMD’s Ryzen).
- contiguous virtual pages are often assigned to contiguous physical pages, so contiguous translations can be compressed into one TLB entry

Sources of intermediate contiguity:
- the memory allocator
    - e.g., when 4 pages are allocated at once, the virtual addresses are contiguous
- the memory defragmentation engine
    - the OS often runs a defragmentation thread that makes page frames contiguous
- the application
    - applications often access virtual pages in order
Real-world workloads often see an average of 8–16 page table entries exhibiting contiguity in this manner, so taking full advantage of it can increase TLB reach by 8–16x.


Coalesced TLB design options
- Lookup
    - in the book's example, without coalescing, va[0] is used as the set index; with bundles of four coalesced translations, va[2] is used instead, so all four pages of a bundle map to the same set (a small sketch of this lookup follows the Fill bullet below)
    - after the set index is found, the tag is compared, and a bitmap is checked to determine whether the requested translation is actually in the bundle
    - e.g., a bitmap of 0b1101 means V0, V1, and V3 are in the bundle while V2 is not; if V0 maps to P1, then V1 maps to P2 and V3 maps to P4
    - if V0 maps to P1, V1 to P2, V2 to P5, and V3 to P6, then set 0 holds two entries
        - one for V0/V1, with bitmap 0b1100
        - another for V2/V3, with bitmap 0b0011
    - if the page table exhibits little contiguity, this set indexing scheme may increase TLB misses
    - to mitigate this, a design can coalesce only part of the TLB, or coalesce only at the L2 TLB


- Fill
    - Consider an x86-64 system, where 8 page table entries reside in a 64-byte cache line. With coalesced TLBs, combinational logic is added on the fill path to detect contiguous translations within the cache line; once detected, they are coalesced and filled into the coalesced TLB.
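A minimal C sketch of the coalesced lookup described above, using the notes' V0–V3/P1–P4 example; the entry layout and hit logic are simplified, and the bitmap here uses bit i for page i of the bundle (the notes list pages left to right):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define BUNDLE 4   /* translations per coalesced entry */

struct coalesced_entry {
    uint64_t base_vpn;   /* bundle-aligned virtual page number (V0) */
    uint64_t base_ppn;   /* physical page that V0 maps to ("P1") */
    uint8_t  bitmap;     /* bit i set => page V(i) belongs to this bundle */
};

/* Returns true on a hit and writes the translated physical page number.
 * A real design would first select the set using vpn / BUNDLE (the va[2]
 * indexing above); here we check a single entry for clarity. */
static bool lookup(const struct coalesced_entry *e, uint64_t vpn, uint64_t *ppn) {
    if ((vpn & ~(uint64_t)(BUNDLE - 1)) != e->base_vpn)
        return false;                        /* tag mismatch */
    unsigned off = vpn & (BUNDLE - 1);       /* position within the bundle */
    if (!(e->bitmap & (1u << off)))
        return false;                        /* page not part of this bundle */
    *ppn = e->base_ppn + off;                /* contiguity: P = P0 + offset */
    return true;
}

int main(void) {
    /* The notes' example: V0, V1, V3 coalesced (V2 absent), with V0 -> P1. */
    struct coalesced_entry e = {
        .base_vpn = 0x1000,   /* V0 */
        .base_ppn = 0x51,     /* "P1" (arbitrary frame number) */
        .bitmap   = 0x0B      /* bits 0, 1, 3 set: V0, V1, V3 */
    };
    for (uint64_t v = 0x1000; v < 0x1004; v++) {
        uint64_t ppn;
        if (lookup(&e, v, &ppn))
            printf("V%llu -> PPN 0x%llx\n",
                   (unsigned long long)(v - 0x1000), (unsigned long long)ppn);
        else
            printf("V%llu -> miss, fall back to the page table walk\n",
                   (unsigned long long)(v - 0x1000));
    }
    return 0;
}
```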

In summary, TLB coalescing extends TLB capacity by caching information about multiple translations in a single entry, and it also performs a form of prefetching (the TLB is filled with neighboring translations).
2. Hardware Support for Multiple Page Sizes
To support multiple page sizes, most vendors use split, statically partitioned TLBs at the L1 level, one per page size. This approach suffers from poor utilization: when the OS allocates mostly small pages, the superpage TLBs are wasted; when it allocates mostly superpages, the small-page TLBs lie unused.
There have been several proposals to support multiple page sizes concurrently. While some of them have partially been adopted by commercial products at the L2 TLB level, which can afford slightly higher lookup times, L1 TLBs remain split.
2.1 Multi-indexing Approaches
Use different index functions for different page sizes.
Fully associative TLBs
One way to accommodate multiple page sizes concurrently in a single TLB is to implement it with full associativity. In this approach, each TLB entry maintains a mask based on the page size of its translation. The main problem with fully associative TLBs is that they have high access latencies and consume more power than set-associative designs; Intel’s Skylake chips, for example, use 12-way set-associative L2 TLBs.
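A minimal sketch of how a per-entry mask lets one fully associative structure hold multiple page sizes; the entry layout and the addresses are illustrative, not any vendor's design:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct fa_tlb_entry {
    uint64_t vtag;   /* virtual address with this entry's page-offset bits cleared */
    uint64_t mask;   /* ~(page_size - 1): which bits participate in the match */
    uint64_t pbase;  /* physical base address of the page */
    bool     valid;
};

/* One compare works for any page size because the entry's own mask decides
 * how many low-order bits are ignored during the tag match. */
static bool fa_match(const struct fa_tlb_entry *e, uint64_t vaddr, uint64_t *paddr) {
    if (!e->valid || ((vaddr & e->mask) != e->vtag))
        return false;
    *paddr = e->pbase | (vaddr & ~e->mask);   /* keep the within-page offset */
    return true;
}

int main(void) {
    /* A 4 KB entry and a 2 MB entry coexisting in the same structure. */
    struct fa_tlb_entry tlb[2] = {
        { .vtag = 0x00007f0000123000ULL, .mask = ~0xFFFULL,    .pbase = 0x4567000ULL,  .valid = true },
        { .vtag = 0x00007f0000200000ULL, .mask = ~0x1FFFFFULL, .pbase = 0x80000000ULL, .valid = true },
    };
    uint64_t queries[2] = { 0x00007f0000123abcULL, 0x00007f00002abcdeULL };
    for (int q = 0; q < 2; q++) {
        uint64_t pa;
        for (int i = 0; i < 2; i++)
            if (fa_match(&tlb[i], queries[q], &pa))
                printf("VA 0x%llx -> PA 0x%llx\n",
                       (unsigned long long)queries[q], (unsigned long long)pa);
    }
    return 0;
}
```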
Hash-rehash
Initially, the TLB is looked up using a “hash” operation that assumes a particular page size, typically the base page size. On a miss, the TLB is probed again using “rehash” operations that assume other page sizes. This process continues until a TLB hit occurs or all page sizes have been looked up.
TLB hits therefore have variable latency, and TLB misses take longer, since every page size must be probed. The lookups for the different page sizes can be issued in parallel, but that increases energy consumption.
Hash-rehash is only used in a few products (e.g., Intel's Broadwell and Haswell architectures support 4 KB and 2 MB pages with this approach, but not 1 GB pages).
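A minimal sketch of the hash-rehash probe order; probe_tlb() is a hypothetical stand-in for a single set-indexed, tag-matched TLB access, and the resident translation is made up so the second probe hits:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* One resident translation, so the probes have something to find: a 2 MB page. */
static const uint64_t resident_vbase = 0x7f0000200000ULL;   /* 2 MB-aligned VA */
static const uint64_t resident_pbase = 0x80000000ULL;
static const uint64_t resident_size  = 2ULL << 20;

/* Stand-in for one TLB probe that assumes a particular page size. */
static bool probe_tlb(uint64_t vaddr, uint64_t page_size, uint64_t *paddr) {
    if (page_size == resident_size &&
        (vaddr & ~(page_size - 1)) == resident_vbase) {
        *paddr = resident_pbase | (vaddr & (page_size - 1));
        return true;
    }
    return false;
}

static bool hash_rehash_lookup(uint64_t vaddr, uint64_t *paddr) {
    static const uint64_t sizes[] = { 4ULL << 10, 2ULL << 20, 1ULL << 30 };
    /* Probe the base page size first ("hash"), then larger sizes ("rehash").
     * Each extra probe adds latency, so hits on larger pages are slower and a
     * true miss pays for probing every supported size. */
    for (int i = 0; i < 3; i++)
        if (probe_tlb(vaddr, sizes[i], paddr))
            return true;
    return false;    /* all sizes probed: genuine miss, start a page walk */
}

int main(void) {
    uint64_t pa;
    if (hash_rehash_lookup(0x7f00002abcdeULL, &pa))
        printf("hit on the second (2 MB) probe: PA 0x%llx\n", (unsigned long long)pa);
    else
        printf("miss\n");
    return 0;
}
```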
Skewing
A virtual address is hashed using several concurrent hash functions. The functions are chosen so that if a group of translations conflict in one way, they conflict with a different group on other ways. Translations of different page sizes reside in different sets. For example, if our TLB supports 3 page sizes, each cacheable in 2 separate ways, we need a 6-way skew-associative TLB.
❓ Need to read the original paper for a deeper understanding: https://ieeexplore.ieee.org/document/1321052
2.2 Using Prediction to Enhance Multiple Indices
Reduce the need to look up different page sizes in parallel by predicting, before looking up the TLB, what the page size of a virtual address is likely to be. The hash-rehash or skewed TLB is then first looked up with the predicted page size.
A PC-based predictor learns the page size separately for each instruction, which increases the size of the pattern history table (PHT).
A register-based predictor exploits the following phenomenon: most virtual addresses are calculated from a combination of source register(s) and immediate values, and prior research shows that the value of the base register dominates the virtual address calculation. Simply looking up the value stored at the base register's location in the register file therefore suffices to identify the desired PHT counter, and the prediction can run in parallel with the address calculation.
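A minimal sketch of a register-value-based page-size predictor; the PHT size, the hash of the base register value, and the 2-bit saturating counters are illustrative assumptions, not the published design:

```c
#include <stdio.h>
#include <stdint.h>

#define PHT_ENTRIES 256

static uint8_t pht[PHT_ENTRIES];   /* 2-bit counters: >= 2 predicts "superpage" */

/* Index the PHT with a hash of the base register's value, which is available
 * before (or in parallel with) the full address calculation. */
static unsigned pht_index(uint64_t base_reg_value) {
    return (unsigned)(((base_reg_value >> 21) ^ (base_reg_value >> 12)) % PHT_ENTRIES);
}

static int predict_superpage(uint64_t base_reg_value) {
    return pht[pht_index(base_reg_value)] >= 2;
}

/* Train the counter once the true page size is known from the TLB/page walk. */
static void train(uint64_t base_reg_value, int was_superpage) {
    uint8_t *c = &pht[pht_index(base_reg_value)];
    if (was_superpage && *c < 3) (*c)++;
    else if (!was_superpage && *c > 0) (*c)--;
}

int main(void) {
    uint64_t base = 0x7f0000200000ULL;   /* base register of a load/store */
    train(base, 1); train(base, 1);      /* earlier accesses resolved to a superpage */
    printf("predict superpage: %d\n", predict_superpage(base));
    return 0;
}
```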

2.3 Using Coalesced Approaches
Use a single indexing function for all page sizes, e.g. MIX TLB.


In the book's example, the lookup for B misses. With a split TLB, a page table walk fetches the page table entry, which is filled into the 2 MB page TLB. With a MIX TLB, the entry for B is mirrored in both sets, because the MIX TLB derives the set index from the small-page index bits (starting at bit 12). B and C are contiguous, so B/C are coalesced in the TLB; this coalescing offsets the capacity wasted by mirroring, since each entry holds more page table entries.

Why do MIX TLBs use the index bits corresponding to small pages?
Using the index bits of superpages (e.g., bit 21 and above for 2 MB pages) would place all 512 contiguous 4 KB pages of a 2 MB region in the same set, increasing conflict misses.
Why do MIX TLBs perform well?
When superpages are scarce, all TLB resources can be used for small pages. When the OS can generate superpages, it usually defragments physical memory sufficiently to allocate superpages adjacently too, and MIX TLBs use all hardware resources to coalesce these adjacent superpages.
How many mirrors can a superpage produce, and how much contiguity is needed?
Consider superpages made up of N 4 KB regions and a MIX TLB with M sets. N is 512 for 2 MB superpages and 262,144 for 1 GB superpages, while commercial L1 and L2 TLBs tend to have 16–128 sets. Therefore N is greater than M in modern systems, meaning that a superpage has a mirror in every set (M mirrors).
If the number of contiguous superpages is equal to (or sufficiently near) the number of mirrors, we achieve good performance; on modern 16–128 set TLBs, that means 16–128 contiguous superpages (see the sketch below).
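A quick sketch of the mirror arithmetic; the set counts are just typical values from the text:

```c
#include <stdio.h>

int main(void) {
    /* N = number of 4 KB regions per superpage. */
    long long n_2mb = (2LL << 20) / (4 << 10);    /* 512 */
    long long n_1gb = (1LL << 30) / (4 << 10);    /* 262,144 */
    int sets[] = { 16, 64, 128 };                 /* typical L1/L2 TLB set counts (M) */

    printf("N(2MB) = %lld, N(1GB) = %lld\n", n_2mb, n_1gb);
    /* Since N > M, a superpage is mirrored once per set (M mirrors), and roughly
     * M contiguous superpages are needed to coalesce that mirroring away. */
    for (int i = 0; i < 3; i++)
        printf("M = %d sets -> %d mirrors per superpage, want ~%d contiguous superpages\n",
               sets[i], sets[i], sets[i]);
    return 0;
}
```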
3. TLB Speculation
Use speculation to move expensive page table walks off the critical path of processor execution.
If VA-to-PA patterns repeat across translations and can be learned, it may be possible to speculate the physical page frame number from the virtual page number even when the TLB misses. The processor continues execution using the speculative frame number, while a page table walk runs in parallel to verify whether the speculation was correct.
Prior work on TLB speculation essentially builds on patterns with virtual and physical address contiguity (similar to the work that motivates coalesced TLBs).
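A minimal sketch of contiguity-based speculation with a single learned VPN-to-PPN delta; real proposals use more elaborate predictors and recovery machinery, so treat this only as an illustration of the idea:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

static int64_t learned_delta;   /* last observed PPN - VPN */
static bool    delta_valid;

/* Record the offset seen on a verified translation. */
static void observe(uint64_t vpn, uint64_t ppn) {
    learned_delta = (int64_t)ppn - (int64_t)vpn;
    delta_valid = true;
}

/* On a TLB miss, guess the frame number so execution can continue. */
static bool speculate(uint64_t vpn, uint64_t *ppn) {
    if (!delta_valid) return false;
    *ppn = (uint64_t)((int64_t)vpn + learned_delta);
    return true;
}

int main(void) {
    observe(0x1000, 0x8000);                 /* a translation seen earlier */
    uint64_t spec_ppn, walk_ppn = 0x8003;    /* what the page walk later returns */
    if (speculate(0x1003, &spec_ppn))
        printf("speculated 0x%llx, walk returned 0x%llx -> %s\n",
               (unsigned long long)spec_ppn, (unsigned long long)walk_ppn,
               spec_ppn == walk_ppn ? "commit" : "squash and replay");
    return 0;
}
```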
❓security impact of TLB speculation

4. Translation-Triggered Prefetching
Replay:
When the TLB misses, after the page table walk obtains the VA-to-PA mapping, the memory access instruction is replayed to finish the access.
Recent work shows that replays in real-world workloads are often just as harmful to performance as TLB misses. Innovations like MMU caches and multiple page table walkers mean that TLB miss penalties now mostly consist of the cost of a single memory reference for the leaf page table entry, since higher level entries from the page table are generally (90%+ of the time) found in the MMU caches. Therefore, the bulk of TLB miss overheads end up being composed of the one reference for the leaf page table entry, and one for the replay.
Recent work shows that if page table entries are poorly cached (i.e., they miss throughout the L1/L2/LLC hierarchy), the physical page that they point to is likely to be poorly cached too. This presents an opportunity: when main memory must be accessed to satisfy a page table lookup, can we deduce the physical memory address that the subsequent replay will need and prefetch it into the LLC to enable better performance? (Prior work did not consider prefetching into caches closer to the core, e.g., the L1, but that is likely a ripe direction for the future.)
Constructing the target physical address of the replay
The cache line address within the page (taken from the page offset bits of the virtual address) is propagated along with the page table walk and, finally, to the memory controller (MC), where it is combined with the frame number from the leaf page table entry to construct the target physical address (see the sketch below).
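A minimal sketch of this address construction, assuming a simplified leaf PTE layout with the frame number in bits 12 and above; the PTE and virtual address values are made up:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define LINE_MASK  (~0x3FULL)   /* 64-byte cache lines */

/* Combine the frame number from the leaf PTE with the page offset carried
 * along from the faulting virtual address, then align to a cache line:
 * this is the line the replay will touch, so it is what we prefetch into the LLC. */
static uint64_t replay_target(uint64_t leaf_pte, uint64_t vaddr) {
    uint64_t frame  = leaf_pte & ~0xFFFULL;                 /* physical frame base */
    uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);   /* page offset from VA */
    return (frame | offset) & LINE_MASK;
}

int main(void) {
    uint64_t pte = 0x00000000DEAD1067ULL;   /* hypothetical leaf PTE (frame 0xDEAD1000) */
    uint64_t va  = 0x00007f00001234a8ULL;   /* faulting access's virtual address */
    printf("prefetch line PA = 0x%llx\n",
           (unsigned long long)replay_target(pte, va));
    return 0;
}
```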


Performing timely prefetch
The replay target data must be prefetched into the LLC before the replay's LLC lookup. By launching the prefetch when DRAM is accessed for the page table entry, most modern systems provide a window of roughly 100–150 clock cycles for the replay target to be filled into the LLC. Prior work discusses this in more detail, but this window is usually enough for an LLC prefetch.

5. Other Important Hardware Improvements for Virtual Memory
In particular, we point readers to research on an emerging area of increasing importance in VM design: translation coherence. With the emergence of heterogeneous memory architectures, it is becoming increasingly important to move pages of memory between devices with complementary performance characteristics.
For example, it may be beneficial to migrate pages of memory from DRAM devices with high capacity to higher-bandwidth but smaller die-stacked DRAM. In these situations, page tables must be changed and TLB coherence can become a performance bottleneck.
6. Summary
Emerging research questions on VM, particularly in the context of hardware innovations.