Linux Memory Management

Memory Overview

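The Linux kernel does not hand physical memory to processes directly; it uses a virtual memory architecture instead. Each process has its own virtual address space and its own page table, which records the mapping from virtual addresses to physical addresses. When the CPU executes an instruction that reads from or writes to memory, the MMU (Memory Management Unit) translates the virtual address into a physical address.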

[Figure: mm-physical-virtual]

Physical Memory

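Physical memory is divided into zones according to use, for example ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. Each zone is further divided into pages, and the page is the smallest unit in which physical memory is allocated. The page size is determined by the CPU architecture. Some architectures support several page sizes, one of which is selected through the kernel configuration; the default is 4 KB.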

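On machines with multiple processors, memory may be organized as NUMA (Non-Uniform Memory Access): the speed at which a CPU reads memory depends on whether that memory is adjacent to it (memory adjacent to a CPU is called that CPU's local memory). With UMA (Uniform Memory Access), by contrast, all memory is read at the same speed. NUMA divides memory into banks; each bank corresponds to a node, and each node is divided into zones.

Virtual Memory

Because physical memory is allocated in units of pages, virtual memory is also allocated in units of pages. The mapping from virtual addresses to physical addresses is recorded as entries in the page table. A single physical page can be mapped by several virtual pages, which is how memory is shared between processes.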

[Figure: pagetable]

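A virtual memory architecture brings the following benefits:

Isolation between user processes and the kernel, and between user processes:
A process uses virtual addresses, and the MMU maps them to physical addresses, so a process cannot access another process's memory unless that memory is deliberately shared.
Hardware abstraction:
Physical memory is often complicated: several banks of memory, or even NUMA. Virtual memory hides these details and presents each process with one uniform virtual address space. Because the kernel is free to change the virtual-to-physical mapping, it can save physical memory by not mapping virtual memory right away; it maps a page only when the process actually uses it, and only as much as is used. This is lazy allocation, or demand paging. In addition, when physical memory is tight, the kernel can swap data that has not been used recently and is used infrequently out to disk, and swap it back into memory when it is needed again.
Moving data from memory to the swap partition on disk
Access permission control for memory regions
Sharing of physical memory, by mapping a physical page into the virtual address spaces of several processes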

MMU

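The MMU (Memory Management Unit) is a piece of hardware between the CPU and memory that translates virtual memory addresses into physical memory addresses. Besides address translation, the MMU enforces memory access permissions and manages the TLB cache.

To speed up virtual-to-physical translation, the CPU uses a TLB (Translation Lookaside Buffer). Its access speed lies between that of the CPU itself and that of the CPU cache, and it holds the most recently used mapping entries. The CPU first looks for a mapping in the TLB and consults the page table in memory only when the entry is not there.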

[Figure: tlb]

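Virtual Address Space

In Linux, not only user-space processes but also the kernel uses virtual addresses. The virtual address space is divided into kernel space and user space, and there are three kinds of virtual address:

Kernel space
- Kernel Logical Address
- Kernel Virtual Address
User space
- User Virtual Address

Every process has its own virtual address space, and the kernel maps its own memory into the virtual address space of every process, but user processes have no permission to access the kernel region.

The virtual address space is divided by function into regions, shown from low to high addresses in the figure below, which assumes a 4 GB virtual address space.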

[Figure: virtual_mm_addr]

swapping

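When physical memory runs short, the kernel writes the data in pages that have not been used recently to disk, marks those pages as free, and clears the corresponding entries in the TLB and the page table. When the process later uses data that has been swapped out, the instruction that touches it triggers a page fault, and the kernel then does the following:

Puts the process to sleep
Copies the data from disk into a free physical page (not necessarily the original one) and writes the virtual-to-physical mapping entry back into the page table
Wakes the process up

All of this is transparent to the process: it keeps using the same virtual addresses and never notices that anything happened.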

page fault

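A page fault is a CPU exception raised by the MMU when a process accesses a virtual address that cannot be translated at that moment or that it is not allowed to access. The following three situations cause a page fault:

The virtual address has not yet been mapped to physical memory
The process accessed memory it has no permission to access
The virtual address is valid, but its physical page has been swapped out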

Other Concepts

huge pages

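Memory is slower than the CPU, so address translation needs to be sped up. One method, described above, is the TLB cache, but TLB capacity is limited. Huge pages reduce the number of mapping entries by mapping a much larger block of physical memory with a single entry. x86 systems support 2 MB and even 1 GB pages, hence the name huge pages.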

page cache

Covers buffered I/O and memory mapping, both described below.

OOM killer

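When memory runs out, the OOM (out-of-memory) killer picks processes that use a lot of memory and are less important, and kills them to reclaim memory.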

Memory Management Tools

To be continued...

Heap Memory Allocation

See another blog post: Memory Management

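Direct I/O and Buffered I/O

Physical memory is volatile and cannot hold data long term; persistent data lives on slower storage such as hard disks. To reduce disk accesses, data is usually read into the page cache in kernel-space memory first and then copied into a buffer in the process's user space; when the same data is read again, it is copied straight from the kernel page cache. Writes go the same way in reverse: the process writes into a buffer in its own address space, the data is copied into the kernel page cache (where the pages become dirty pages), and it is later synced to disk. This is buffered I/O.

Some programs prefer not to work this way. Databases, for example, would rather use their own caching and transfer data directly between user space and disk, for writes and reads alike. This is direct I/O.

Buffered I/O

[Figure: bufferd-IO]

Example: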

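#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat fileInfo;

    /* Buffered I/O: reads go through the kernel page cache */
    int fd = open("/tmp/a.txt", O_RDONLY);
    if (fd == -1 || fstat(fd, &fileInfo) == -1) {
        perror("open/fstat");
        return 1;
    }

    /* Size the buffer from the file size instead of using a fixed stack array */
    char *word = malloc((size_t)fileInfo.st_size + 1);
    if (word == NULL) {
        perror("malloc");
        close(fd);
        return 1;
    }

    if (read(fd, word, (size_t)fileInfo.st_size) == -1) {
        perror("read");
    }
    free(word);
    close(fd);
    return 0;
}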

direct I/O

[Figure: direct-IO]

Example:

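#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define ALIGNMENT 4096  /* assumed block size; check the device's real value */

int main(void) {
    struct stat fileInfo;
    void *word;

    /* Direct I/O: bypass the kernel page cache */
    int fd = open("/tmp/a.txt", O_RDONLY | O_DIRECT);
    if (fd == -1 || fstat(fd, &fileInfo) == -1) {
        perror("open/fstat");
        return 1;
    }

    /* O_DIRECT generally needs an aligned buffer and a block-multiple length */
    size_t len = ((size_t)fileInfo.st_size / ALIGNMENT + 1) * ALIGNMENT;
    if (posix_memalign(&word, ALIGNMENT, len) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        close(fd);
        return 1;
    }

    if (read(fd, word, len) == -1) {
        perror("read");
    }
    free(word);
    close(fd);
    return 0;
}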

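Memory Mapping

As described above, buffered I/O caches repeatedly read files in kernel space, which reduces disk accesses. It has a drawback, though: data must be copied between the kernel page cache and the user-space buffer. The shared file mapping introduced below avoids this copy, and memory mapping is useful for more than that.

Types of memory mapping:

Private file mapping
Shared file mapping
Private anonymous mapping
Shared anonymous mapping

Private File Mapping

For reads, the virtual addresses of several processes map to the same physical pages. When a process writes, the kernel uses copy-on-write to give that process its own copy of the physical page, and the written data is not synced back to disk.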

[Figure: private-file-mapping]

Example code

The program creates a child process with fork(); when the child writes data, the kernel performs copy-on-write.

private_file_mapping.c

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    printf("parent process...\n");

    /* open file */
    int fd = open("/tmp/mmapped.txt", O_RDWR | O_CREAT, 0644);
    if(fd == -1) {
        perror("open mmapped");
        exit(1);
    }

    /* get file information */
    struct stat fileInfo;
    fstat(fd, &fileInfo);

    /* private file mapping */
    char *map = mmap(NULL, fileInfo.st_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        perror("mmap");
        exit(1);
    }

    printf("current mapped content: %s\n", map);

    /* create a new process */
    int pid;
    pid = fork();

    /* child process */
    if(pid == 0) {
        printf("child process...\n");
        printf("current mapped content: %s", map);

        /* This is a private file mapping, so writing to the mapped
           physical page makes the kernel use copy-on-write to give
           this process a new physical page; the original page is
           left untouched.
        */
        char *text = "hello, world";
        size_t i;
        for (i = 0; i < strlen(text); i++) {
            map[i] = text[i];
        }

        if(msync(map, strlen(text), MS_SYNC) == -1) {
            perror("msync");
        }

        printf("current mapped content: %s", map);
        printf("exit child process\n\n");
        exit(0);
    }


    /* parent process */
    if(pid > 0) {
        waitpid(pid, NULL, 0);
        printf("current mapped content: %s\n", map);

        /* free the mapped memory */
        if(munmap(map, fileInfo.st_size) == -1) {
            close(fd);
            perror("munmap");
            exit(1);
        }
        close(fd);
        return 0;
    }
}

Compile and run the program above:

$ gcc private_file_mapping.c -o private_file_mapping
$ echo 'so beautiful' > /tmp/mmapped.txt
# Run
$ ./private_file_mapping 
parent process...
current mapped content: so beautiful

child process...
current mapped content: so beautiful
current mapped content: hello, world
exit child process

current mapped content: so beautiful

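Shared File Mapping

Here both reads and writes share the same physical pages: a change made by any process is visible to all the others and is synced to disk.

[Figure: shared-file-mapping]

Example code

shared_file_mapping.c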

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/wait.h>

int main() {
    printf("parent process...\n");

    /* open file */
    int fd = open("/tmp/mmapped.txt", O_RDWR | O_CREAT, 0644);
    if(fd == -1) {
        perror("open mmapped");
        exit(1);
    }

    /* get file information */
    struct stat fileInfo;
    fstat(fd, &fileInfo);

    /* shared file mapping */
    char *map = mmap(NULL, fileInfo.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        perror("mmap");
        exit(1);
    }

    printf("current mapped content: %s\n", map);

    /* create a new process */
    int pid;
    pid = fork();

    /* child process */
    if(pid == 0) {
        printf("child process...\n");
        printf("current mapped content: %s", map);

        /* This is a shared file mapping, so the write goes to the
           physical page shared with the parent: the parent sees the
           change, and msync() below flushes it to the file.
        */
        char *text = "hello, world";
        size_t i;
        for (i = 0; i < strlen(text); i++) {
            map[i] = text[i];
        }

        if(msync(map, strlen(text), MS_SYNC) == -1) {
            perror("msync");
        }

        printf("current mapped content: %s", map);
        printf("exit child process\n\n");
        exit(0);
    }


    /* parent process */
    if(pid > 0) {
        waitpid(pid, NULL, 0);
        printf("current mapped content: %s\n", map);

        /* free the mapped memory */
        if(munmap(map, fileInfo.st_size) == -1) {
            close(fd);
            perror("munmap");
            exit(1);
        }
        close(fd);
        return 0;
    }
}

Compile and run the program above:

$ gcc shared_file_mapping.c -o shared_file_mapping
$ echo 'so beautiful' > /tmp/mmapped.txt
# Run
$ ./shared_file_mapping 
parent process...
current mapped content: so beautiful

child process...
current mapped content: so beautiful
current mapped content: hello, world
exit child process

current mapped content: hello, world

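Private Anonymous Mapping

An anonymous mapping has no backing file. With a private anonymous mapping, processes share the same physical pages for reads, but when a process modifies the data, copy-on-write gives that process its own copy of the physical page.

[Figure: private-anon-mapping]

Example code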

private-anonymous-mapping.c

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int pid;
    char *map;

    printf("parent process...\n");
    /* MAP_PRIVATE: writes made after fork() stay local to the writing process */
    map = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("%s\n", map);

    map[0] = 'h';
    map[1] = 'l';

    if((pid = fork()) == -1) {
        perror("fork");
    }

    if(pid == 0) {
        printf("child process...\n");
        printf("%s\n", map);

        map[2] = 'l';
        getchar();
        exit(0);
    }

    if(pid > 0) {
        waitpid(pid, NULL, 0);
        printf("%s\n", map);
    }

    return 0;
}

Compile and run the program above:

$ gcc private-anonymous-mapping.c -o private-anonymous-mapping
$ ./private-anonymous-mapping 
parent process...

child process...
hl

hl

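Shared Anonymous Mapping

With a shared anonymous mapping, all processes share the same physical pages, which lets them exchange data.

[Figure: shared-anon-mapping]

Example code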

shared-anonymous-mapping.c

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
    int pid;
    char *map;

    printf("parent process...\n");
    map = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    printf("%s\n", map);

    map[0] = 'h';
    map[1] = 'l';

    if((pid = fork()) == -1) {
        perror("fork");
    }

    if(pid == 0) {
        printf("child process...\n");
        printf("%s\n", map);

        map[2] = 'l';
        getchar();
        exit(0);
    }

    if(pid > 0) {
        waitpid(pid, NULL, 0);
        printf("%s\n", map);
    }

    return 0;
}

Compile and run the program above:

$ gcc shared-anonymous-mapping.c -o shared-anonymous-mapping
$ ./shared-anonymous-mapping 
parent process...

child process...
hl

hll

zhubiao
