Below are notes from Architectural and Operating System Support for Virtual Memory.
A computer today is built with many different types of compute units, many specialized for specific tasks. Backing these heterogeneous components is a set of caches, buffers, and networks. These components are moving away from relying on physical memory as the communication mechanism and toward participating in the VM system, so they can benefit from VM (e.g., avoiding manual memory management).
CPUs, GPUs, and DSPs may communicate with each other via shared VM, while network cards and disk controllers may communicate only via physical memory and DMA.
1. Accelerators and Shared Virtual Memory
Shared VM can
- ❓enable more interesting use cases than the alternative of coarse-grained bulk task offloading
- free the programmer from manual memory management
So, GPUs, DSPs, and Intel Xeon Phi chips now share a virtual address space with their host CPU.
The ARM CPU, Adreno GPU, and Hexagon DSP in Qualcomm Snapdragon 800 SoCs are “architected to look like a multi-core with communication through shared memory”.
This chapter takes the GPU as an example of a shared-VM device.
- originally, GPUs were used for rendering graphics
- the computation model followed a bulk-offload paradigm: coarse-grained tasks are sent from the CPU master to the GPU peripheral
- over time, GPUs have become more programmable and general-purpose (like extremely parallel CPUs)
- the computation model now shares a virtual address space; GPUs can, to some extent, manage their own memory and dynamically launch new tasks, and can share data with the CPU and other GPUs at fine granularity
- but they lack system-level support for VM management
- no OS runs on the GPU, so there is no easy way to handle page faults
- a fault may trap the system, result in undefined behavior, or be sent back to the CPU to handle (which is very expensive)
- OS support on the CPU for VM across devices is somewhat limited
- Linux Heterogeneous Memory Management (HMM)
- https://www.kernel.org/doc/html/v4.19/vm/hmm.html
- provides a simple helper layer for device drivers to mirror a CPU process’s address space on an external device, so memory allocated on the CPU can be easily accessed on the device
- GPUs provide extremely weak memory models (the microarchitecture aggressively buffers, coalesces, and reorders memory requests, so reliable shared-memory communication is difficult)
- GPUs are not always hardware-coherent with the CPU
- so the CPU and GPU cannot concurrently access the same data
- or they must use mechanisms like on-demand page-level migration (which is slow)
| Heterogeneity-related Organization/Language | Notes |
| --- | --- |
| Heterogeneous System Architecture (HSA) Foundation | |
| OpenCL | |
| Nvidia PTX | PTX formal consistency model: https://research.nvidia.com/publication/2019-04_formal-analysis-nvidia-ptx-memory-consistency-model |
2. Memory Heterogeneity
So far, physical memory has been assumed to be homogeneous (every region of physical memory has roughly the same latency and bandwidth to each core): a UMA (Uniform Memory Access) system.
NUMA (Non-Uniform Memory Access) system: some regions of memory behave differently than others:
- memory placement
- the memory itself is homogeneous, but some regions of memory are farther away from some cores
- memory technology
- memory is heterogeneous (different memory technology)

2.1 Non-Uniform Memory Access (NUMA)
Today, NUMA arises when memory is distributed across multiple processors within a single system. E.g., many server-class systems have multiple processor sockets on a single motherboard; cache coherence is maintained with MOESI/MESIF protocols, whose Owned (O)/Forward (F) states mitigate some of the need for heavy cross-socket communication.
Memory placement affects performance:
To optimize system performance, a system may choose to allocate memory as close as possible to the core performing the allocation, and employ a dynamic page-migration policy to move pages around in response to changes in system state.
numactl on Linux runs processes with a specific NUMA scheduling or memory-placement policy.
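As an illustration (not from the book), a minimal C sketch of explicit memory placement using libnuma, the library shipped alongside numactl; it allocates a buffer on NUMA node 0 (node number and size are arbitrary choices for the example):
```c
/* Build with: gcc placement.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    size_t size = 64 * 1024 * 1024;          /* 64 MiB */
    void *buf = numa_alloc_onnode(size, 0);  /* ask for physical pages on node 0 */
    if (!buf) return 1;
    memset(buf, 0, size);                    /* touch the pages so they are actually allocated */
    printf("allocated %zu bytes on node 0\n", size);
    numa_free(buf, size);
    return 0;
}
```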

2.2 Emerging Memory Technologies
Performance
- High-end GPUs use Graphics DDR (GDDR) for better throughput at the expense of more power; they are now moving from GDDR to HBM (High Bandwidth Memory)
Persistent or non-volatile memory
- historically, non-volatile memory (e.g. NAND flash) was too slow
- Intel and Micron’s 3D-XPoint aims to sit right between NAND and DRAM

- phase-change memory or memristors are now being considered
- many non-volatile memories have much better read performance than write performance
- allocation and migration policies must account for this asymmetry
3. Cross-Device Communication
3.1 Direct Memory Access (DMA)
When memory is distributed across more than one node, DMA is used for data communication.
DMA is not always coherent with CPU caches:
- if coherent, DMA requests automatically probe the caches and fetch or flush dirty data from the caches into memory before the DMA operation
- some DMA engines can inject data received from external devices directly into caches to minimize read latency for the core
- if non-coherent, these operations must be performed by software
- the OS must ensure dirty data is flushed and prevent other code from interfering with the memory region during the DMA operation (see the sketch below)
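A minimal C sketch of the non-coherent (software-managed) case; every function here is a hypothetical placeholder for platform-specific primitives, not a real API:
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform/device primitives (placeholders only; real code would
 * use architecture-specific cache-maintenance ops and the device's driver). */
static void cache_flush_range(void *a, size_t n)      { (void)a; (void)n; } /* write back dirty lines */
static void cache_invalidate_range(void *a, size_t n) { (void)a; (void)n; } /* drop stale lines */
static void device_dma_to_device(uintptr_t pa, size_t n)   { (void)pa; (void)n; }
static void device_dma_from_device(uintptr_t pa, size_t n) { (void)pa; (void)n; }
static void dma_wait_complete(void) { }
static uintptr_t virt_to_phys(void *a) { return (uintptr_t)a; } /* identity map for the sketch */

/* CPU -> device: the device reads main memory, so dirty cache lines must be
 * written back first or the device sees stale data. */
void dma_send(void *buf, size_t len) {
    cache_flush_range(buf, len);
    device_dma_to_device(virt_to_phys(buf), len);
    dma_wait_complete();
}

/* Device -> CPU: the device writes main memory behind the cache's back, so the
 * CPU must drop its (now stale) cached copies before reading the received data. */
void dma_receive(void *buf, size_t len) {
    cache_invalidate_range(buf, len);  /* also ensures no dirty line is later written back over the DMA data */
    device_dma_from_device(virt_to_phys(buf), len);
    dma_wait_complete();
}
```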
3.2 Input/Output MMUs (IOMMUs)
VM on accelerators is supported by the IOMMU: it translates virtual addresses (or “device addresses”, if the device operates within its own private virtual address space) into physical memory addresses.
All accesses from the peripheral to CPU memory that miss in the device TLBs (if present) pass through the IOMMU to get translated or take a fault. The IOMMU can access the page tables managed by the host CPU.
The MMU sits near the CPU core, whereas the IOMMU may be distant from the accelerators it serves.

Case Study: GPUs and IOMMUs
The case study is an AMD x86-64 system with a GPU, drawing on the following papers:
- Border control: Sandboxing accelerators (https://pages.cs.wisc.edu/~lena/bc.pdf)
- Observations and opportunities in architecting shared virtual memory for heterogeneous systems (https://www.csa.iisc.ac.in/~arkapravab/papers/ispass16.pdf)
The GPU maintains a private TLB (device TLB) for recent translations. AMD’s IOMMU is not coherent with the CPU cache, so system software must keep the caches and the IOMMU consistent.
When the GPU’s device TLB misses, a translation request is sent to the IOMMU in the form of a PCI-Express Address Translation Service (ATS) request. If the IOMMU translates successfully, an ATS response is sent back to the GPU. If the IOMMU faults, it sends an ATS response informing the GPU of the fault. The GPU then sends another request, called a Peripheral Page Request (PPR), to the IOMMU. The IOMMU puts the request into a memory-mapped queue (which can hold multiple PPRs) and raises an interrupt to the CPU. The OS uses its IOMMU driver to process the interrupt and the PPR requests. Once the requests are serviced, the driver notifies the IOMMU, and the IOMMU notifies the GPU. The GPU then sends another ATS request to translate the original faulting address.
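A schematic C sketch of that ATS/PPR handshake, written as a trace of the steps above; all functions are hypothetical stand-ins for hardware/driver behavior, not a real API:
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Pretend the first ATS request always faults, so the full PPR path is shown. */
static bool iommu_translation_valid(uint64_t va) { (void)va; return false; }

static void os_iommu_driver_service_ppr(uint64_t va) {
    printf("OS driver: service PPR for %#llx (allocate/pin page, fix page table)\n",
           (unsigned long long)va);
}

/* One GPU memory access that misses in the device TLB. */
static void gpu_access(uint64_t va) {
    printf("GPU: device TLB miss, send ATS request for %#llx\n", (unsigned long long)va);
    if (!iommu_translation_valid(va)) {
        printf("IOMMU: translation fault, ATS response reports the fault\n");
        printf("GPU: send PPR for %#llx\n", (unsigned long long)va);
        printf("IOMMU: queue PPR in memory-mapped queue, raise interrupt to CPU\n");
        os_iommu_driver_service_ppr(va);
        printf("driver -> IOMMU -> GPU: fault serviced\n");
        printf("GPU: retry ATS request for %#llx\n", (unsigned long long)va);
    }
    printf("IOMMU: ATS response with translation; GPU fills its device TLB\n");
}

int main(void) { gpu_access(0x7f0012345000ULL); return 0; }
```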
3.3 Memory-Mapped Input/Output (MMIO)
Some external communication mechanisms (most commonly configuration registers of a device) are mapped into the physical address space seen by a CPU process, so the CPU can communicate with devices using normal read/write instructions.
During boot, the bootloader and OS enumerate the registers, and drivers map them into the kernel’s virtual address space.
The MMIO abstraction is convenient for the programming model, but special care must be taken because these accesses are not like normal reads/writes (a sketch follows the list):
- not cached; the access goes directly to the device
- MMIO reads may not return the value of the latest write (the target is not “memory” but a device register)
- reads are not harmless; they can be destructive or trigger side effects
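A minimal C sketch of the usual MMIO access pattern, assuming a hypothetical device with a status and a data register at a made-up base address; volatile forces every access to actually reach the device, in program order (real drivers additionally rely on memory barriers and kernel accessor functions):
```c
#include <stdint.h>

/* Hypothetical register layout of a memory-mapped device. A real base address
 * would come from device enumeration (e.g. a PCI BAR), not a constant. */
#define MMIO_BASE   0xFE000000UL
#define REG_STATUS  0x00   /* reading may clear pending bits: not a harmless read */
#define REG_DATA    0x04

static inline uint32_t mmio_read32(uintptr_t base, uintptr_t off) {
    /* volatile: the compiler must not cache, reorder, or elide this access */
    return *(volatile uint32_t *)(base + off);
}

static inline void mmio_write32(uintptr_t base, uintptr_t off, uint32_t val) {
    *(volatile uint32_t *)(base + off) = val;
}

int device_ready(void) {
    /* Unlike normal memory, the value can change between reads without any
     * CPU store: the device itself updates the register. */
    return mmio_read32(MMIO_BASE, REG_STATUS) & 0x1;
}

void device_send(uint32_t word) {
    while (!device_ready())
        ;                              /* spin until the device can accept data */
    mmio_write32(MMIO_BASE, REG_DATA, word);
}
```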
3.4 Non-Cacheable/Coalescing Accesses
In order to allow MMIO to work properly, most processors provide some mechanism for indicating that certain memory regions and/or certain memory accesses should be treated as uncacheable.
Some applications want a special kind of uncacheable region that still maximizes throughput (normally, uncacheable accesses cannot be cached, buffered, or coalesced like normal memory):
- e.g., the CPU fills a frame with polygons and pixels; when complete, the frame is passed to the GPU to be rendered
- here the memory system should be allowed to batch, reorder, and buffer the writes so they can be burst out at the end to maximize throughput
- so such uncacheable regions can be given write-combining/write-coalescing attributes
4. Virtualization
Virtualization is a critical technology enabling cloud infrastructure.
VM is one of the primary contributors to the performance gap between native and virtualized applications. The main problem is that virtualization requires two levels of address translation:
- 1st level: gVA —> gPA
- a guest virtual address (gVA) is converted to a guest physical address (gPA) via per-process guest OS page table (gPT)
- 2nd level: gPA —> hPA
- gPA is converted to a host physical address (hPA) using a per-VM host page table (hPT)
Two ways to manage these page tables: nested page tables and shadow page tables.
4.1 Nested Page Tables

The guest CR3 register, combined with the gVP (guest virtual page), gives the gPP of level 4 of the guest page table (gPP_gL4).
This gPP must be converted into an hPP to find the actual host-physical location of that page-table page.
So, first use the gPP to look up the nested page table (nL4/nL3/nL2/nL1) to obtain hPP_gL4 (gL4/gL3/gL2/gL1 denote the levels of the guest page table).
Looking up gL4 then gives gPP_gL3, and so on.
The procedure is as follows, 24 memory references in total (without virtualization, only 4 memory references are needed); a counting sketch follows the list:
- guest_CR3 + gVA —> gPA_gL4
- gPA_gL4 is nested-translated to hPA_gL4 (4 references through nL4/nL3/nL2/nL1)
- the lookup in gL4 gives gPA_gL3 (❓actually gives the base of gL3; combine it with index bits from the gVA to get the full gPA_gL3) (1 reference)
- gPA_gL3 is nested-translated to hPA_gL3 (4 references)
- the lookup in gL3 gives gPA_gL2 (1 reference)
- gPA_gL2 is nested-translated to hPA_gL2 (4 references)
- the lookup in gL2 gives gPA_gL1 (1 reference)
- gPA_gL1 is nested-translated to hPA_gL1 (4 references)
- the lookup in gL1 gives the final gPA (1 reference)
- gPA is nested-translated to hPA (4 references)
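A toy C sketch (not from the book) that just counts page-table memory references to confirm the 24 vs. 4 figure; the “page tables” are faked with identity lookups, only the access pattern matters:
```c
#include <stdio.h>

#define LEVELS 4
static int mem_refs = 0;             /* every page-table read counts as one memory reference */

/* Fake page-table read: we only care about counting accesses. */
static unsigned long pt_read(unsigned long addr) {
    mem_refs++;
    return addr;
}

/* Native 4-level walk: one read per level. */
static unsigned long native_walk(unsigned long va) {
    unsigned long entry = va;
    for (int lvl = 0; lvl < LEVELS; lvl++)
        entry = pt_read(entry);
    return entry;
}

/* Nested translation of a guest-physical address: a full walk of the
 * host (nested) page table. */
static unsigned long nested_translate(unsigned long gpa) {
    return native_walk(gpa);
}

/* Two-dimensional walk: each guest level's table address is guest-physical,
 * so it must be nested-translated before the guest PTE can be read; the
 * final gPA is nested-translated as well. */
static unsigned long two_dimensional_walk(unsigned long gva) {
    unsigned long gpa = gva;                         /* stands in for guest_CR3 + gVA index */
    for (int lvl = 0; lvl < LEVELS; lvl++) {
        unsigned long hpa = nested_translate(gpa);   /* 4 references */
        gpa = pt_read(hpa);                          /* 1 reference: read the guest PTE */
    }
    return nested_translate(gpa);                    /* final gPA -> hPA: 4 references */
}

int main(void) {
    mem_refs = 0; native_walk(0x1234);
    printf("native walk:          %d memory references\n", mem_refs);   /* 4  */
    mem_refs = 0; two_dimensional_walk(0x1234);
    printf("two-dimensional walk: %d memory references\n", mem_refs);   /* 24 */
    return 0;
}
```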
How to accelerate the translation procedure:
- private per-CPU TLBs cache gVP —> hPP translations
- private per-CPU MMU caches store intermediate page-table information
- private per-CPU nTLBs cache gPP —> hPP translations
| Topic | Reference |
| --- | --- |
| Accelerating nested page tables | Accelerating two-dimensional page walks for virtualized systems |
| Translation caching | Translation Caching: Skip, Don’t Walk (the Page Table) |
| | Large-reach memory management unit caches |
4.2 Shadow Page Tables
An alternative to nested page tables: the hypervisor creates a shadow page table (merging gPT and hPT) that holds gVA to hPA translations directly. So, on a TLB miss, only 4 memory references are needed to translate a gVA to an hPA.
The drawback is that page-table updates are expensive: the shadow page table must be kept consistent with both the guest and host page tables. In particular, guest page-table updates can be frequent, and they suffer costly VM exits to update the hypervisor-managed shadow page table (a VM exit can cost hundreds to thousands of cycles).
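A toy C sketch of the idea (not real hypervisor code), using single-level arrays as stand-ins for the multi-level tables: on an intercepted guest page-table update, the hypervisor recomputes the shadow entry by composing the guest and host translations, so the hardware later walks only one table:
```c
#include <stdint.h>

/* Hypothetical single-level "page tables" indexed by page number, just to show
 * the composition; real tables are multi-level radix trees. */
#define NPAGES 1024
static uint64_t gPT[NPAGES];       /* gVA page -> gPA page (guest-managed)   */
static uint64_t hPT[NPAGES];       /* gPA page -> hPA page (hypervisor)      */
static uint64_t shadowPT[NPAGES];  /* gVA page -> hPA page (hypervisor)      */

/* Called when the hypervisor intercepts a guest page-table write
 * (this interception is what causes the costly VM exit). */
void shadow_update(uint64_t gva_page, uint64_t gpa_page) {
    gPT[gva_page]      = gpa_page;
    shadowPT[gva_page] = hPT[gpa_page];   /* compose gVA->gPA with gPA->hPA */
}

/* Hardware walks only the shadow table: one dimension, so a TLB miss costs a
 * native-length walk (4 references for a 4-level radix table). */
uint64_t shadow_lookup(uint64_t gva_page) {
    return shadowPT[gva_page];
}
```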
5. Summary
With VM-enabled hardware accelerators, address-translation hardware is placed in the device TLBs and the IOMMU.
With virtualization, hardware support is added to accelerate address translation.
The author believes a potentially interesting question is how virtualization support should be extended to hardware accelerators.