Below are notes from Architectural and Operating System Support for Virtual Memory.
OS contributions to memory management are divided into two components
- low-level OS code that matches the details of the VM hardware (TLB, page table walkers)
- higher-level software operations abstracted from the low-level hardware details ⇒ this chapter focuses on the higher-level operations
VM allocation: the OS must lay out memory regions across the virtual address space such that dynamically sized regions do not collide and such that fragmentation does not prevent future regions from being created. The allocator is transparent to the VM system.
Page frame allocation: the assignment of a physical page frame to each virtual page. Physical memory is often capacity-constrained and/or fragmented on today's systems. Emerging hardware optimizations such as TLB coalescing require the OS to make intelligent page frame allocation decisions.
1 Virtual Memory Management

Linux uses a VMA (Virtual Memory Area) tree to track the VM regions associated with each process. A pointer to the VMA tree is maintained in the per-process
mm_struct, or memory descriptor; there is one memory descriptor per address space. The VMA tree records all the VMA regions used by the process. Each VMA is a contiguous, non-overlapping range of virtual addresses, and the size of each VMA is a multiple of the system page size. Each process may have several VMAs. VMA trees are implemented as red-black trees for O(log n) search time. Linux tracks two types of VMA mappings in the VMA tree. For each mapping, it maintains pointers to the start and end virtual addresses, page protection bits (read/write/execute), and VMA protection bits (a superset of the page protection bits).
- file-backed mappings
- allocated using mmap() system calls, which are used to represent code pages, libraries, data files, and devices
- anonymous mappings
- stack and the heap not backed by a file
When enlarging a VMA, the kernel tries to merge the new pages with existing adjacent VMAs.
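A minimal Python sketch of the bookkeeping described above. All class and field names here are illustrative, and a real kernel keeps the VMAs in a red-black tree rather than a sorted list:

```python
# Hypothetical sketch of VMA tracking: a sorted list of non-overlapping,
# page-aligned [start, end) ranges, merging adjacent regions with
# identical protection bits when a region is enlarged.
PAGE_SIZE = 4096

class VMA:
    def __init__(self, start, end, prot):
        assert start % PAGE_SIZE == 0 and end % PAGE_SIZE == 0
        self.start, self.end, self.prot = start, end, prot

class AddressSpace:
    """Stands in for mm_struct; one per address space."""
    def __init__(self):
        self.vmas = []          # kept sorted by start address

    def find(self, addr):
        # O(n) scan for illustration; the rb-tree makes this O(log n)
        for vma in self.vmas:
            if vma.start <= addr < vma.end:
                return vma
        return None

    def map(self, start, length, prot):
        end = start + length
        # try to merge with an adjacent VMA that has the same protection
        for vma in self.vmas:
            if vma.end == start and vma.prot == prot:
                vma.end = end
                return vma
        new = VMA(start, end, prot)
        self.vmas.append(new)
        self.vmas.sort(key=lambda v: v.start)
        return new

mm = AddressSpace()
mm.map(0x1000, 2 * PAGE_SIZE, "rw")               # anonymous region
mm.map(0x1000 + 2 * PAGE_SIZE, PAGE_SIZE, "rw")   # enlargement: merges
```

The second `map()` call lands exactly at the end of the first region with the same protection, so the two become a single three-page VMA.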

1.1 Demand Paging and Lazy Allocation
Two approaches to manage new VM allocations
- all virtual pages added to a VMA are immediately assigned new physical page frames, and the page tables are immediately updated to reflect the assignments ⇒ wastes memory, since the process may not actually access all newly allocated virtual pages
- lazy allocation
- only allocate a physical page when the program accesses a new virtual page for the first time
- step 1: the program calls malloc(), which makes a brk()/sbrk() system call to grow the heap
- step 2: brk() enlarges the heap VMA
- step 3: the program accesses a new page for the first time, and a page fault occurs
- the heap needs no file transfer, so this is a minor page fault
- if it is a file-backed page requiring data to be transferred from disk, it is a major page fault
- step 4: the OS assigns a physical page frame to the virtual address and creates a page table entry
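The steps above can be sketched as a toy model (all names are illustrative, not real kernel interfaces): `brk()` only moves the heap boundary, and the first touch of each page raises a minor fault that assigns a frame.

```python
# Toy model of lazy allocation: growing the heap VMA allocates nothing;
# a physical frame is assigned on first access (a minor page fault).
PAGE = 4096

class Process:
    def __init__(self):
        self.heap_end = 0          # top of the heap VMA (brk)
        self.page_table = {}       # virtual page number -> physical frame
        self.minor_faults = 0
        self.next_frame = 0

    def brk(self, new_end):
        # step 2: enlarge the heap VMA; no frames are allocated yet
        self.heap_end = new_end

    def access(self, addr):
        vpn = addr // PAGE
        if addr >= self.heap_end:
            raise MemoryError("segfault: address outside any VMA")
        if vpn not in self.page_table:
            # steps 3-4: first touch -> minor fault, assign a frame
            self.minor_faults += 1
            self.page_table[vpn] = self.next_frame
            self.next_frame += 1
        return self.page_table[vpn]

p = Process()
p.brk(4 * PAGE)       # malloc() grows the heap by four pages
p.access(0)           # first touch of page 0: minor fault
p.access(100)         # same page: no fault
p.access(PAGE)        # first touch of page 1: minor fault
```

Note that although four pages were "allocated", only two frames exist, because only two pages were touched.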

1.2 Copy-on-Write (CoW)
Just as a page frame is not allocated when a new virtual page is allocated, when a page is duplicated the new page frame is not allocated immediately; allocation waits until data in one of the two pages (the original and the duplicate) is actually modified.
- E.g. pages are duplicated by fork()
- forked processes often quickly discard their inherited address space and begin executing the code of another program, so duplicated page frames would be quickly thrown away
- as long as the duplicated pages are only read, there is no need to duplicate the page frames
When a page is duplicated
- the original page frame remains, both virtual addresses point to it, and both mappings are marked read-only
- the page table entry is read-only, but the OS-internal data structure (the VMA) records the region as writeable
- when a process tries to write to the page, the OS traps, duplicates the page frame, and points each virtual address to its own frame
- the write triggers a protection fault because the PTE is read-only, but the OS finds in its internal data structure that the page is writeable, so it recognizes the page as copy-on-write
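A sketch of that CoW bookkeeping, with hypothetical data structures (a frame table, per-mapping PTEs, and the VMA writability flag reduced to a boolean): the write fault is resolved by copying the frame and restoring write permission.

```python
# Sketch of copy-on-write: after fork(), both PTEs share one frame and
# are read-only; a write to a read-only PTE inside a writeable VMA is
# recognized as CoW and resolved by copying the frame.
frames = {0: b"hello"}            # frame number -> contents
next_frame = 1

class PTE:
    def __init__(self, frame, writable):
        self.frame, self.writable = frame, writable

def fork_page(parent_pte):
    # share the frame, downgrade both mappings to read-only
    parent_pte.writable = False
    return PTE(parent_pte.frame, False)

def write(pte, vma_writable, data):
    global next_frame
    if not pte.writable:
        if not vma_writable:
            raise PermissionError("genuine protection fault")
        # PTE read-only but VMA writeable => CoW: copy the frame,
        # repoint the PTE, restore write permission
        frames[next_frame] = frames[pte.frame]
        pte.frame = next_frame
        next_frame += 1
        pte.writable = True
    frames[pte.frame] = data

parent = PTE(0, True)
child = fork_page(parent)
write(child, True, b"world")      # first write triggers the copy
```

After the write, parent and child point at different frames, and the parent's data is untouched.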
1.3 Address Space Layout Randomization
ASLR improves memory security (e.g. against buffer overflow exploits).
- randomly arranges the address space positions of key data areas, including the base of the executable and the positions of the stack, heap, and libraries ⇒ makes it more difficult for an attacker to predict target addresses in the process
- e.g. return-to-libc attacks and shellcode injected on the stack both rely on predicting library and stack positions
OS adoption
- Linux
- enabled a weak form of ASLR in 2005
- the subsequent PaX and ExecShield patch sets enabled stronger ASLR
- PIE (position-independent executable): random base address for main executable binary
- KASLR (kernel ASLR): ASLR support for the kernel pages by randomizing where the kernel code is placed at boot time, merged to mainline Linux since 2014
- Windows
- since Vista in 2007, ASLR has been enabled for executables and dynamically linked libraries; other processes do not enable ASLR by default, for backward compatibility
- Apple
- ASLR for applications in 2011, KASLR from 2012
2 Managing Locality
When physical memory is capacity-constrained, the OS must choose which pages to evict (the replacement algorithm) and optimize disk write-back performance (page buffering).
2.1 Working Set
The working set defines the amount of memory that a process requires in a given time interval.
- the working set W(t, a) of a process at time t is the collection of information referenced by the process during the process-time interval (t − a, t)
- due to program locality, the working set in (t, t + a) can be approximated by the working set in (t − a, t)
- the working set must be predicted with care
- if too many pages are kept in capacity-constrained main memory, fewer other processes can be resident and ready
- if too few pages are kept in main memory, page faults increase and the process spends time waiting for disk transfers
Three problems must be solved when physical memory is capacity-constrained
- identify the program's working set
- choose which pages in main memory to evict for incoming pages from disk
- choose when to bring pages in from disk to main memory
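The definition of W(t, a) above can be computed directly from a page reference trace, as a small sketch (the trace below is an assumed example):

```python
# W(t, a): the set of distinct pages referenced in the last `a` time
# steps before time t (here the half-open interval [t - a, t)).
def working_set(trace, t, a):
    """trace[i] is the page referenced at process time i (0-based)."""
    return set(trace[max(0, t - a):t])

trace = [1, 2, 1, 3, 2, 2, 4, 5, 4, 4]
ws = working_set(trace, t=6, a=4)   # pages touched at times 2..5
```

Keeping exactly `len(ws)` frames resident would (by the locality argument above) approximately cover the references in the next window (t, t + a).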
2.2 Naive Page Replacement Policies
A page replacement policy needs to make good use of the page-table accessed bits.
The optimal policy is Belady's algorithm, which evicts the page that will not be used for the longest time.
In the example below, Belady's algorithm incurs 6 page faults and FIFO replacement incurs 10 page faults.
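The example trace behind those fault counts is in a figure not reproduced here, so the sketch below uses an assumed trace (a classic 12-reference sequence) and 3 frames, just to show how the two policies are simulated; its fault counts differ from the 6/10 quoted above.

```python
# FIFO vs. Belady's optimal replacement on the same reference trace.
from collections import deque

def fifo_faults(trace, nframes):
    frames, queue, faults = set(), deque(), 0
    for page in trace:
        if page not in frames:
            faults += 1
            if len(frames) == nframes:
                frames.remove(queue.popleft())   # evict oldest arrival
            frames.add(page)
            queue.append(page)
    return faults

def belady_faults(trace, nframes):
    frames, faults = set(), 0
    for i, page in enumerate(trace):
        if page not in frames:
            faults += 1
            if len(frames) == nframes:
                # evict the page whose next use is farthest in the future
                def next_use(p):
                    rest = trace[i + 1:]
                    return rest.index(p) if p in rest else float("inf")
                frames.remove(max(frames, key=next_use))
            frames.add(page)
    return faults

trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
```

On this trace with 3 frames, FIFO incurs 9 faults while Belady's incurs 7; Belady's is unrealizable in practice since it needs future knowledge, so it serves only as a lower bound for comparing real policies.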


2.3 LRU Page Replacement Policies
In software, LRU can be implemented as a linked list. When a page is referenced, it is moved to the head of the list, so pages at the tail of the list are the prime candidates for eviction.
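A minimal sketch of that list-based LRU: Python's `OrderedDict` is backed by a doubly linked list and provides exactly the move-to-head and evict-from-tail operations described.

```python
# Exact LRU: most-recently-used page at the head, victim at the tail.
from collections import OrderedDict

class LRU:
    def __init__(self, nframes):
        self.nframes = nframes
        self.pages = OrderedDict()     # front = most recent, back = victim

    def touch(self, page):
        """Reference a page; returns True if the reference faults."""
        if page in self.pages:
            self.pages.move_to_end(page, last=False)   # move to head
            return False
        if len(self.pages) == self.nframes:
            self.pages.popitem(last=True)              # evict LRU tail
        self.pages[page] = None
        self.pages.move_to_end(page, last=False)       # insert at head
        return True

lru = LRU(2)
faults = [lru.touch(p) for p in [1, 2, 1, 3, 2]]
```

Every reference updates the list, which is exactly the "complex pointer management" that pushes real OSes toward the approximate schemes below.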

Due to the complex pointer management required by exact LRU, most modern OSes use approximate LRU.
The most widely adopted approximate LRU is the CLOCK algorithm. When a page is accessed for the first time, its accessed bit is set. After a time interval, the OS traverses the page table entries and clears the accessed bits. If a page is accessed again, its bit is set again. The accessed bits thus indicate which pages were recently accessed and track the working set of a process.
One CLOCK variant is called LRU with second chance. All pages are kept in a circular list, and a pointer (the clock hand) is maintained to the next replacement victim. When a page must be evicted, the OS checks the page pointed to by the clock hand: if its accessed bit is set, the bit is cleared and the hand moves to the next page. This continues until a page with a cleared accessed bit is found; that page is the victim.
(in the figure below, the arrow is the clock hand; 1/1 means page 1 with the accessed bit set, and 1/0 means page 1 with the accessed bit cleared)
(from the figure below, you can see that LRU with second chance is a mix of FIFO and LRU)

Most CLOCK implementations first try to evict a page whose dirty bit is cleared (writing a dirty page back to disk is more expensive than dropping a clean page); if there is no such page, they choose a page whose accessed bit is cleared but whose dirty bit is set.
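The second-chance scan can be sketched as follows (a simplified model tracking only accessed bits, not dirty bits):

```python
# Second-chance (CLOCK): a circular list of pages with accessed bits.
# A set bit buys the page one pass: the bit is cleared and the hand
# moves on; a page reached with a cleared bit is the victim.
class Clock:
    def __init__(self, pages):
        self.pages = list(pages)
        self.bits = {p: 1 for p in pages}   # 1 = recently accessed
        self.hand = 0

    def evict(self):
        while True:
            page = self.pages[self.hand]
            if self.bits[page]:
                self.bits[page] = 0         # second chance: clear and skip
                self.hand = (self.hand + 1) % len(self.pages)
            else:
                return page                 # victim found

clock = Clock([1, 2, 3])
clock.bits[2] = 0          # pretend page 2 was not recently accessed
victim = clock.evict()
```

The hand sweeps past page 1 (clearing its bit) and stops at page 2, whose bit was already clear; page 3 keeps its bit, since the hand never reached it.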
2.4 Page Buffering
(comment: page buffering seems similar to an evict buffer)
The OS implements an intelligent buffering scheme to take the penalty of performing I/O off the critical path.
- intelligent: do not write pages to disk right away
The OS keeps a pool of free physical page frames. When the number of frames in the pool drops below a threshold, the OS preemptively writes back and evicts pages according to the replacement algorithm, adding their frames to the pool. The OS can choose to do this eviction when I/O traffic is light.
The physical pages in the pool do not lose their contents.
- If a page is reused in the near future, it can be remapped from the free pool quickly, with no disk transfer needed.
- If a pooled frame is recycled for an incoming page, its contents were already written back at eviction time, so no further disk write is needed.
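A sketch of that free-frame pool under the assumed policy above: evicted frames keep their contents, so a page touched again before its frame is recycled is rescued without a disk read.

```python
# Free-frame pool: eviction moves a frame (still holding its page) into
# the pool; a fault first tries to rescue the page from the pool.
class FreePool:
    def __init__(self):
        self.pool = {}               # frame -> page it still holds
        self.disk_reads = 0

    def evict(self, frame, page):
        self.pool[frame] = page      # contents preserved in the pool

    def fault(self, page):
        # fast path: the page is still sitting in a pooled frame
        for frame, held in list(self.pool.items()):
            if held == page:
                del self.pool[frame]
                return frame          # remapped, no disk transfer
        # slow path: read from disk into some recycled pooled frame
        self.disk_reads += 1
        frame, _ = self.pool.popitem()
        return frame

pool = FreePool()
pool.evict(7, page="A")
pool.evict(9, page="B")
f1 = pool.fault("A")     # rescued from the pool: no disk read
f2 = pool.fault("C")     # not pooled: disk read, recycles a frame
```

In a real kernel the slow path would also handle an empty pool by triggering more eviction; the sketch omits that.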
3 Physical Memory Allocation
The primary challenge is to manage memory fragmentation effectively
- external fragmentation
- visible to the allocation system; refers to free memory that is unusable because each free hole is smaller than the size of the new memory to be allocated
- internal fragmentation
- visible within the process; refers to the wasted space inside a single allocation unit
3.1 Naive Memory Allocators
| Allocator | Mechanism | Note |
| --- | --- | --- |
| Best Fit | Search all blocks, and choose the smallest block that can satisfy the request. | Search time is large. Tends to leave large and small holes in memory. |
| First Fit | Search the blocks in order, and choose the first block that can satisfy the request. | Avoids a full search, but tends to fragment the beginning of memory. |
| Worst Fit | The inverse of Best Fit: choose the largest block that can satisfy the request. | Leaves the largest possible leftover hole after each allocation. |


(comment: in the figure above, for the requests of 8, 12, and 13, it seems that First Fit should succeed for all requests)
❓in the figure above, why does Worst Fit allocate the request of 8 to the "15 pages" hole rather than the "20 pages" hole?
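Since the figure is not reproduced here, the hole sizes below are assumed; this sketch just contrasts the three placement policies on one free list:

```python
# Three placement policies choosing a free hole for a request.
def best_fit(holes, req):
    fits = [h for h in holes if h >= req]
    return min(fits) if fits else None       # smallest sufficient hole

def first_fit(holes, req):
    return next((h for h in holes if h >= req), None)  # first in order

def worst_fit(holes, req):
    fits = [h for h in holes if h >= req]
    return max(fits) if fits else None       # largest hole

holes = [15, 9, 20]          # free hole sizes, in address order (assumed)
```

For a request of 8, Best Fit picks the 9-page hole (leaving a tiny 1-page fragment), First Fit picks 15, and Worst Fit picks 20 (leaving the biggest reusable remainder).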
3.2 Buddy Allocation
The units of allocation are restricted to powers of 2. The buddy allocator maintains a number of lists, numbered from 0 to K, maintaining information on 2^0, 2^1, …, 2^K free contiguous units.
To allocate 2^k units of memory, list k is searched first, then successively larger lists, until a free chunk of at least 2^k units is found. ⇒ pre-sorted by size, so allocation is fast and avoids the search overhead of Best Fit or First Fit
- E.g. to allocate 2^0 units, first search List 0, then List 1, and so on up to List 3 to find free memory.
If the only free blocks available are bigger than 2^k, a free block is recursively halved until a region of the appropriate size is available. ⇒ this continues to track available blocks with as much contiguity as possible
- E.g. the List 3 block P0–P7 is halved into List 2 blocks, which are not the appropriate size,
- then P0–P3 is halved into List 1 blocks, which are not the appropriate size,
- then P0–P1 is halved into List 0 blocks,
- and P0 is allocated
When memory is freed, it is merged with its buddy if the buddy is also free, recursively coalescing into larger blocks. ⇒ reduces fragmentation
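A minimal buddy allocator sketch over 2^3 = 8 pages, reproducing the P0 example above: allocating one page splits the List 3 block down through Lists 2, 1, and 0, and freeing it coalesces the buddies back.

```python
# Buddy allocator sketch: free_lists[k] holds the start page of each
# free 2^k-page block. A block's buddy at order k starts at (start XOR 2^k).
K = 3                                    # lists 0..3, 8 pages total

free_lists = {k: [] for k in range(K + 1)}
free_lists[K] = [0]                      # one free 8-page block: P0-P7

def alloc(order):
    """Allocate a 2^order-page block; return its start page (or None)."""
    k = order
    while k <= K and not free_lists[k]:
        k += 1                           # search successively larger lists
    if k > K:
        return None                      # out of memory
    start = free_lists[k].pop()
    while k > order:                     # halve down to the right size
        k -= 1
        free_lists[k].append(start + 2 ** k)   # upper half stays free
    return start

def free(start, order):
    buddy = start ^ (2 ** order)         # buddy block at this order
    if order < K and buddy in free_lists[order]:
        free_lists[order].remove(buddy)  # buddy also free: merge
        free(min(start, buddy), order + 1)
    else:
        free_lists[order].append(start)

p = alloc(0)        # splits P0-P7 down to a single page, returns P0
```

After `alloc(0)`, P4–P7 sit on List 2, P2–P3 on List 1, and P1 on List 0, exactly as in the halving example; `free(p, 0)` merges everything back into the single List 3 block.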


3.3 Memory Pools and Slab Allocation
The buddy allocator allocates memory in power-of-2 sizes, so there may be wasted space at the end of an allocation if the program object is not exactly a power-of-2 size.
User libraries and/or code expecting to allocate objects at sizes not close to a power of 2 will therefore often establish memory pools within larger blocks of VM, and then provide a separate (user-level) allocator for objects placed in that block.
The slab allocator is a dedicated kernel-level memory allocator. Kernel allocation requests are often for particular objects with well-known size requirements.
- made up of several software "caches" of slabs; each slab maintains memory slots for one type of kernel object
- a request for a specific object is routed to the appropriate cache of slabs
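A sketch of the slab idea with assumed shapes and sizes (the object size and slots-per-slab below are illustrative, not real kernel values): each cache carves slabs into fixed-size slots, so no space is lost to power-of-2 rounding.

```python
# One cache per kernel object type; each slab is carved into
# object-sized slots, and freed slots are reused directly.
class Cache:
    def __init__(self, obj_size, slots_per_slab=4):
        self.obj_size = obj_size
        self.slots_per_slab = slots_per_slab
        self.free_slots = []
        self.slabs = 0

    def alloc(self):
        if not self.free_slots:
            # grab a new slab and carve it into object-sized slots
            base = self.slabs * self.slots_per_slab * self.obj_size
            self.slabs += 1
            self.free_slots = [base + i * self.obj_size
                               for i in range(self.slots_per_slab)]
        return self.free_slots.pop()

    def free(self, addr):
        self.free_slots.append(addr)     # slot goes back to its cache

# requests for a given object type are routed to that type's cache
caches = {"task_struct": Cache(obj_size=192)}   # size is illustrative
a = caches["task_struct"].alloc()
b = caches["task_struct"].alloc()
```

Freeing a slot returns it to its cache's free list, where the next allocation of the same object type can reuse it immediately.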
3.4 Page Coloring
(chapter 4.2.2 used page coloring for VIPT caches, to guarantee that the same physical address cannot map to different cache sets)
Here page coloring means distributing page frames across cache sets as evenly as possible, to avoid memory hot spots in one set of the cache (in the most pessimistic case, all page frames map to the same cache set, and the cache is thrashed). However, the performance benefits of page coloring have not been universally demonstrated today.
3.5 Reverse Mappings
Multiple page table entries may map to the same physical page. When page replacement happens, the OS only knows the identity of the physical page, not the virtual pages and page tables pointing to it. Reverse mappings are used to identify the corresponding virtual pages and page table entries. They are not as critical to performance, since the operations that need them are less common.
4 Summary
The OS is responsible for the layout of the VA and PA spaces, including allocator algorithms for fragmentation management, lazy allocation and CoW (copy-on-write) for resource management, and replacement and buffering for page eviction.
From the next chapter, the book shifts away from exploring the design space and begins to discuss advanced use cases and subtle correctness issues in VM (parallelism, synchronization).