Below are notes from Architectural and Operating System Support for Virtual Memory.
A computer today is built with many different types of compute units, many specialized for specific tasks. Backing these heterogeneous components is a set of caches, buffers, and networks. These components are moving away from relying on physical memory as the communication mechanism and toward participating in the VM system, so they can benefit from VM (e.g., avoiding manual memory management).
CPUs, GPUs, and DSPs may communicate with each other via shared VM, while network cards and disk controllers may communicate only via physical memory and DMA.
1. Accelerators and Shared Virtual Memory
Shared VM can
- ❓enable more interesting use cases than the alternative of coarse-grained bulk task offloading
- free the programmer from manual memory management
So, GPUs, DSPs, and Intel Xeon Phi chips now share a virtual address space with their host CPU.
The ARM CPU, Adreno GPU, and Hexagon DSP in Qualcomm Snapdragon 800 SoCs are “architected to look like a multi-core with communication through shared memory”.
This chapter takes the GPU as an example of a shared-VM device.
- originally, GPUs were used for rendering graphics
- the computation model followed a bulk-offload paradigm: coarse-grained tasks are sent from the CPU master to the GPU peripheral
- over time, GPUs have become more programmable and general-purpose (like extremely parallel CPUs)
- the computation model now shares a virtual address space; GPUs can, to some extent, manage their own memory and dynamically launch new tasks, and can share data with the CPU and other GPUs at fine granularity
- but they lack system-level support for VM management
- no OS runs on the GPU, so there is no easy way to handle page faults
- a fault may trap the system, result in undefined behavior, or be sent back to the CPU to handle (which is very expensive)
- OS support on the CPU for VM across devices is somewhat limited
- Linux Heterogeneous Memory Management (HMM)
- https://www.kernel.org/doc/html/v4.19/vm/hmm.html
- provides a simple helper layer for device drivers to mirror a CPU process’s address space on an external device, so memory allocated on the CPU can be easily accessed on the device
- GPUs provide extremely weak memory models (the microarchitecture aggressively buffers, coalesces, and reorders memory requests, so reliable shared-memory communication is difficult)
- GPUs are not always hardware-coherent with the CPU
- so the CPU and GPU cannot concurrently access the same data
- or they must use mechanisms like on-demand page-level migration (which is slow)
| Heterogeneity-related Organization/Language | Notes |
| --- | --- |
| Heterogeneous System Architecture (HSA) Foundation | |
| OpenCL | |
| Nvidia PTX | PTX formal consistency model: https://research.nvidia.com/publication/2019-04_formal-analysis-nvidia-ptx-memory-consistency-model |
2. Memory Heterogeneity
So far, physical memory has been assumed to be homogeneous (every region of physical memory has roughly the same latency and bandwidth to each core): a UMA (Uniform Memory Access) system.
NUMA (Non-Uniform Memory Access) system: some regions of memory behave differently than others:
- memory placement
- the memory itself is homogeneous, but some regions of memory are farther away from some cores
- memory technology
- memory is heterogeneous (different memory technology)

2.1 Non-Uniform Memory Access (NUMA)
Today, NUMA arises when memory is distributed across multiple processors within a single system. E.g., many server-class systems have multiple processor sockets on a single motherboard; cache coherence is maintained with MOESI/MESIF protocols, whose Owned (O)/Forward (F) states mitigate some of the need for heavy cross-socket communication.
Memory placement affects performance:
To optimize system performance, a system may choose to allocate memory as close as possible to the core performing the allocation, and employ a dynamic page-migration policy to move pages around in response to changes in system state.
numactl on Linux runs processes with a specific NUMA scheduling or memory-placement policy.
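As an illustration (not from the book), a minimal C sketch of explicit memory placement using libnuma, the library shipped alongside numactl; it allocates a buffer on NUMA node 0 (node number and size are arbitrary choices for the example):
```c
/* Build with: gcc placement.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    size_t size = 64 * 1024 * 1024;          /* 64 MiB */
    void *buf = numa_alloc_onnode(size, 0);  /* ask for physical pages on node 0 */
    if (!buf) return 1;
    memset(buf, 0, size);                    /* touch the pages so they are actually allocated */
    printf("allocated %zu bytes on node 0\n", size);
    numa_free(buf, size);
    return 0;
}
```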

2.2 Emerging Memory Technologies
Performance
- High-end GPUs use Graphics DDR (GDDR) for better throughput at the expense of more power; they are now moving from GDDR to HBM (High Bandwidth Memory)
Persistent or non-volatile memory
- historically, non-volatile memory (e.g. NAND flash) was too slow
- Intel and Micron’s 3D-XPoint aims to sit right between NAND and DRAM

- phase-change memory or memristors are now being considered
- many non-volatile memories have much better read performance than write performance
- allocation and migration policies must account for this asymmetry
3. Cross-Device Communication
3.1 Direct Memory Access (DMA)
When memory is distributed across more than one node, DMA is used for data communication.
DMA is not always coherent with CPU caches:
- if coherent, DMA requests automatically probe the caches and fetch or flush dirty data from the caches into memory before the DMA operation
- some DMA engines can inject data received from external devices directly into caches to minimize read latency for the core
- if non-coherent, these operations must be performed by software
- the OS must ensure dirty data is flushed and prevent other code from interfering with the memory region during the DMA operation (see the sketch below)
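A minimal C sketch of the non-coherent (software-managed) case; every function here is a hypothetical placeholder for platform-specific primitives, not a real API:
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform/device primitives (placeholders only; real code would
 * use architecture-specific cache-maintenance ops and the device's driver). */
static void cache_flush_range(void *a, size_t n)      { (void)a; (void)n; } /* write back dirty lines */
static void cache_invalidate_range(void *a, size_t n) { (void)a; (void)n; } /* drop stale lines */
static void device_dma_to_device(uintptr_t pa, size_t n)   { (void)pa; (void)n; }
static void device_dma_from_device(uintptr_t pa, size_t n) { (void)pa; (void)n; }
static void dma_wait_complete(void) { }
static uintptr_t virt_to_phys(void *a) { return (uintptr_t)a; } /* identity map for the sketch */

/* CPU -> device: the device reads main memory, so dirty cache lines must be
 * written back first or the device sees stale data. */
void dma_send(void *buf, size_t len) {
    cache_flush_range(buf, len);
    device_dma_to_device(virt_to_phys(buf), len);
    dma_wait_complete();
}

/* Device -> CPU: the device writes main memory behind the cache's back, so the
 * CPU must drop its (now stale) cached copies before reading the received data. */
void dma_receive(void *buf, size_t len) {
    cache_invalidate_range(buf, len);  /* also ensures no dirty line is later written back over the DMA data */
    device_dma_from_device(virt_to_phys(buf), len);
    dma_wait_complete();
}
```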
3.2 Input/Output MMUs (IOMMUs)
VM on accelerators is supported by the IOMMU: it translates virtual addresses (or “device addresses”, if the device operates within its own private virtual address space) into physical memory addresses.
All accesses from the peripheral to CPU memory that miss in the device TLBs (if present) pass through the IOMMU to get translated or take a fault. The IOMMU can access the page tables managed by the host CPU.
The MMU sits near the CPU core, whereas the IOMMU may be distant from the accelerators it serves.

Case Study: GPUs and IOMMUs
The case study is an AMD x86-64 system with a GPU, drawing on the following papers:
- Border control: Sandboxing accelerators (https://pages.cs.wisc.edu/~lena/bc.pdf)
- Observations and opportunities in architecting shared virtual memory for heterogeneous systems (https://www.csa.iisc.ac.in/~arkapravab/papers/ispass16.pdf)
The GPU maintains a private TLB (device TLB) for recent translations. AMD’s IOMMU is not coherent with the CPU cache, so system software must keep the caches and the IOMMU consistent.
When the GPU’s device TLB misses, a translation request is sent to the IOMMU in the form of a PCI-Express Address Translation Service (ATS) request. If the IOMMU translates successfully, an ATS response is sent back to the GPU. If the IOMMU faults, it sends an ATS response informing the GPU of the fault. The GPU then sends another request, called a Peripheral Page Request (PPR), to the IOMMU. The IOMMU puts the request into a memory-mapped queue (which can hold multiple PPRs) and raises an interrupt to the CPU. The OS uses its IOMMU driver to process the interrupt and the PPR requests. Once the requests are serviced, the driver notifies the IOMMU, and the IOMMU notifies the GPU. The GPU then sends another ATS request to translate the original faulting address.
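A schematic C sketch of that ATS/PPR handshake, written as a trace of the steps above; all functions are hypothetical stand-ins for hardware/driver behavior, not a real API:
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Pretend the first ATS request always faults, so the full PPR path is shown. */
static bool iommu_translation_valid(uint64_t va) { (void)va; return false; }

static void os_iommu_driver_service_ppr(uint64_t va) {
    printf("OS driver: service PPR for %#llx (allocate/pin page, fix page table)\n",
           (unsigned long long)va);
}

/* One GPU memory access that misses in the device TLB. */
static void gpu_access(uint64_t va) {
    printf("GPU: device TLB miss, send ATS request for %#llx\n", (unsigned long long)va);
    if (!iommu_translation_valid(va)) {
        printf("IOMMU: translation fault, ATS response reports the fault\n");
        printf("GPU: send PPR for %#llx\n", (unsigned long long)va);
        printf("IOMMU: queue PPR in memory-mapped queue, raise interrupt to CPU\n");
        os_iommu_driver_service_ppr(va);
        printf("driver -> IOMMU -> GPU: fault serviced\n");
        printf("GPU: retry ATS request for %#llx\n", (unsigned long long)va);
    }
    printf("IOMMU: ATS response with translation; GPU fills its device TLB\n");
}

int main(void) { gpu_access(0x7f0012345000ULL); return 0; }
```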
3.3 Memory-Mapped Input/Output (MMIO)
Some external communication mechanisms (most commonly configuration registers of a device) are mapped into the physical address space seen by a CPU process, so the CPU can communicate with devices using normal read/write instructions.
During boot, the bootloader and OS enumerate the registers, and drivers map them into the kernel’s virtual address space.
The MMIO abstraction is convenient for the programming model, but special care must be taken because these accesses are not like normal reads/writes (a sketch follows the list):
- not cached; the access goes directly to the device
- MMIO reads may not return the value of the latest write (the target is not “memory” but a device register)
- reads are not harmless; they can be destructive or trigger side effects
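A minimal C sketch of the usual MMIO access pattern, assuming a hypothetical device with a status and a data register at a made-up base address; volatile forces every access to actually reach the device, in program order (real drivers additionally rely on memory barriers and kernel accessor functions):
```c
#include <stdint.h>

/* Hypothetical register layout of a memory-mapped device. A real base address
 * would come from device enumeration (e.g. a PCI BAR), not a constant. */
#define MMIO_BASE   0xFE000000UL
#define REG_STATUS  0x00   /* reading may clear pending bits: not a harmless read */
#define REG_DATA    0x04

static inline uint32_t mmio_read32(uintptr_t base, uintptr_t off) {
    /* volatile: the compiler must not cache, reorder, or elide this access */
    return *(volatile uint32_t *)(base + off);
}

static inline void mmio_write32(uintptr_t base, uintptr_t off, uint32_t val) {
    *(volatile uint32_t *)(base + off) = val;
}

int device_ready(void) {
    /* Unlike normal memory, the value can change between reads without any
     * CPU store: the device itself updates the register. */
    return mmio_read32(MMIO_BASE, REG_STATUS) & 0x1;
}

void device_send(uint32_t word) {
    while (!device_ready())
        ;                              /* spin until the device can accept data */
    mmio_write32(MMIO_BASE, REG_DATA, word);
}
```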
3.4 Non-Cacheable/Coalescing Accesses
In order to allow MMIO to work properly, most processors provide some mechanism for indicating that certain memory regions and/or certain memory accesses should be treated as uncacheable.
Some applications want a special kind of uncacheable region that still maximizes throughput (normally, uncacheable accesses cannot be cached, buffered, or coalesced like normal memory):
- e.g., the CPU fills a frame with polygons and pixels; when complete, the frame is passed to the GPU to be rendered
- here the memory system should be allowed to batch, reorder, and buffer the writes so they can be burst out at the end to maximize throughput
- so such uncacheable regions can be given write-combining/write-coalescing attributes
4. Virtualization
Virtualization is a critical technology enabling cloud infrastructure.
VM is one of the primary contributors to the performance gap between native and virtualized applications. The main problem is that virtualization requires two levels of address translation:
- 1st level: gVA —> gPA
- a guest virtual address (gVA) is converted to a guest physical address (gPA) via per-process guest OS page table (gPT)
- 2nd level: gPA —> hPA
- gPA is converted to a host physical address (hPA) using a per-VM host page table (hPT)
Two ways to manage these page tables: nested page tables and shadow page tables.
4.1 Nested Page Tables

The guest CR3 register, combined with the gVP (guest virtual page), gives the gPP of level 4 of the guest page table (gPP_gL4).
This gPP must be converted into an hPP to find the actual host-physical location of that page-table page.
So, first use the gPP to look up the nested page table (nL4/nL3/nL2/nL1) to obtain hPP_gL4 (gL4/gL3/gL2/gL1 denote the levels of the guest page table).
Looking up gL4 then gives gPP_gL3, and so on.
The procedure is as follows, 24 memory references in total (without virtualization, only 4 memory references are needed); a counting sketch follows the list:
- guest_CR3 + gVA —> gPA_gL4
- gPA_gL4 is nested-translated to hPA_gL4 (4 references through nL4/nL3/nL2/nL1)
- the lookup in gL4 gives gPA_gL3 (❓actually gives the base of gL3; combine it with index bits from the gVA to get the full gPA_gL3) (1 reference)
- gPA_gL3 is nested-translated to hPA_gL3 (4 references)
- the lookup in gL3 gives gPA_gL2 (1 reference)
- gPA_gL2 is nested-translated to hPA_gL2 (4 references)
- the lookup in gL2 gives gPA_gL1 (1 reference)
- gPA_gL1 is nested-translated to hPA_gL1 (4 references)
- the lookup in gL1 gives the final gPA (1 reference)
- gPA is nested-translated to hPA (4 references)
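A toy C sketch (not from the book) that just counts page-table memory references to confirm the 24 vs. 4 figure; the “page tables” are faked with identity lookups, only the access pattern matters:
```c
#include <stdio.h>

#define LEVELS 4
static int mem_refs = 0;             /* every page-table read counts as one memory reference */

/* Fake page-table read: we only care about counting accesses. */
static unsigned long pt_read(unsigned long addr) {
    mem_refs++;
    return addr;
}

/* Native 4-level walk: one read per level. */
static unsigned long native_walk(unsigned long va) {
    unsigned long entry = va;
    for (int lvl = 0; lvl < LEVELS; lvl++)
        entry = pt_read(entry);
    return entry;
}

/* Nested translation of a guest-physical address: a full walk of the
 * host (nested) page table. */
static unsigned long nested_translate(unsigned long gpa) {
    return native_walk(gpa);
}

/* Two-dimensional walk: each guest level's table address is guest-physical,
 * so it must be nested-translated before the guest PTE can be read; the
 * final gPA is nested-translated as well. */
static unsigned long two_dimensional_walk(unsigned long gva) {
    unsigned long gpa = gva;                         /* stands in for guest_CR3 + gVA index */
    for (int lvl = 0; lvl < LEVELS; lvl++) {
        unsigned long hpa = nested_translate(gpa);   /* 4 references */
        gpa = pt_read(hpa);                          /* 1 reference: read the guest PTE */
    }
    return nested_translate(gpa);                    /* final gPA -> hPA: 4 references */
}

int main(void) {
    mem_refs = 0; native_walk(0x1234);
    printf("native walk:          %d memory references\n", mem_refs);   /* 4  */
    mem_refs = 0; two_dimensional_walk(0x1234);
    printf("two-dimensional walk: %d memory references\n", mem_refs);   /* 24 */
    return 0;
}
```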
How to accelerate the translation procedure:
- private per-CPU TLBs cache gVP —> hPP translations
- private per-CPU MMU caches store intermediate page-table information
- private per-CPU nTLBs cache gPP —> hPP translations
| Topic | Reference |
| --- | --- |
| Accelerating nested page tables | Accelerating two-dimensional page walks for virtualized systems |
| Translation caching | Translation Caching: Skip, Don’t Walk (the Page Table) |
| | Large-reach memory management unit caches |
4.2 Shadow Page Tables
An alternative to nested page tables: the hypervisor creates a shadow page table (merging gPT and hPT) that holds gVA to hPA translations directly. So, on a TLB miss, only 4 memory references are needed to translate a gVA to an hPA.
The drawback is that page-table updates are expensive: the shadow page table must be kept consistent with both the guest and host page tables. In particular, guest page-table updates can be frequent, and they suffer costly VM exits to update the hypervisor-managed shadow page table (a VM exit can cost hundreds to thousands of cycles).
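A toy C sketch of the idea (not real hypervisor code), using single-level arrays as stand-ins for the multi-level tables: on an intercepted guest page-table update, the hypervisor recomputes the shadow entry by composing the guest and host translations, so the hardware later walks only one table:
```c
#include <stdint.h>

/* Hypothetical single-level "page tables" indexed by page number, just to show
 * the composition; real tables are multi-level radix trees. */
#define NPAGES 1024
static uint64_t gPT[NPAGES];       /* gVA page -> gPA page (guest-managed)   */
static uint64_t hPT[NPAGES];       /* gPA page -> hPA page (hypervisor)      */
static uint64_t shadowPT[NPAGES];  /* gVA page -> hPA page (hypervisor)      */

/* Called when the hypervisor intercepts a guest page-table write
 * (this interception is what causes the costly VM exit). */
void shadow_update(uint64_t gva_page, uint64_t gpa_page) {
    gPT[gva_page]      = gpa_page;
    shadowPT[gva_page] = hPT[gpa_page];   /* compose gVA->gPA with gPA->hPA */
}

/* Hardware walks only the shadow table: one dimension, so a TLB miss costs a
 * native-length walk (4 references for a 4-level radix table). */
uint64_t shadow_lookup(uint64_t gva_page) {
    return shadowPT[gva_page];
}
```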
5. Summary
With VM-enabled hardware accelerators, address-translation hardware is placed in the device TLBs and the IOMMU.
With virtualization, hardware support is added to accelerate address translation.
The author believes a potentially interesting question is how virtualization support should be extended to hardware accelerators.