NUMA Architecture - souloss

2026-02-05

CPU and Computer Architecture

CPU / Architecture

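On a typical single-socket system, every CPU core sees the same memory latency; this is UMA (Uniform Memory Access). On a multi-socket system, each CPU socket has its own local memory: accesses to local memory are fast, and accesses to another socket's memory are slower. This is NUMA (Non-Uniform Memory Access).

NUMA is not a "problem"; it is an inherent consequence of multi-socket design. Understanding NUMA is what lets you get the expected performance out of a multi-socket server.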

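1. From UMA to NUMA

1.1 Why NUMA?

As the number of CPU cores grows, the front-side bus (FSB) becomes the bottleneck, because every core competes for the same bus to reach memory. NUMA's answer is to give each CPU socket its own local memory and memory controller, and to connect the sockets through an interconnect.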

graph TB
    subgraph UMA["UMA: front-side bus architecture"]
        C1["CPU 0"] --> FSB["Front-side bus (bottleneck)"]
        C2["CPU 1"] --> FSB
        C3["CPU 2"] --> FSB
        C4["CPU 3"] --> FSB
        FSB --> MEM["Shared memory controller"]
        MEM --> DRAM["DRAM"]
    end
    subgraph NUMA["NUMA: distributed memory"]
        N0["CPU 0 + memory controller"] --- N1["CPU 1 + memory controller"]
        N0 --- DRAM0["Local DRAM 0"]
        N1 --- DRAM1["Local DRAM 1"]
        N0 -.->|"Cross-socket access, 1.5-2x latency"| DRAM1
        N1 -.->|"Cross-socket access, 1.5-2x latency"| DRAM0
    end
    style UMA fill:#ffcdd2,stroke:#c62828
    style NUMA fill:#e8f5e9,stroke:#2e7d32

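1.2 NUMA latency differences

Access type	Typical latency	Relative latency
Local L1	~1 ns	1x
Local L2	~4 ns	4x
Local L3	~12 ns	12x
Local DRAM	~80 ns	80x
Remote DRAM (1 hop)	~140 ns	140x
Remote DRAM (2 hops)	~200 ns	200x
Remote DRAM (3 hops)	~260 ns	260x

Note

"1 hop" means the data crosses one interconnect link. In a 4-socket ring, the two most distant sockets are 2 hops apart; larger systems may need 3 hops. Intel's UPI (Ultra Path Interconnect) and AMD's Infinity Fabric are different interconnect technologies, but their latency characteristics are similar. The figures above are illustrative orders of magnitude; actual latencies vary by platform.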
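To see how these numbers combine, the average DRAM latency a thread sees can be estimated as a weighted mean of the local and remote latencies. The sketch below is only an illustration: the helper name `effective_latency_ns` is hypothetical, and the model ignores caches, prefetching, and queuing effects.

```c
#include <assert.h>

/* Hypothetical helper: average DRAM latency when a fraction of accesses
 * (remote_fraction, between 0 and 1) goes to a remote node. */
double effective_latency_ns(double local_ns, double remote_ns,
                            double remote_fraction) {
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns;
}
```

With the table's ~80 ns local and ~140 ns one-hop figures, `effective_latency_ns(80.0, 140.0, 0.5)` gives 110 ns, so a thread whose data is split evenly across two nodes pays roughly 40% more per DRAM access than a fully local one.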

2. NUMA topology

2.1 Common topologies

graph TB
    %% The eight-socket ring is a simplification: a plain 8-node ring has a
    %% 4-hop worst case, and real systems add cross links to shorten it.
    subgraph 双插槽["Two sockets (1 hop)"]
        S0["Socket 0 / Node 0"] --- S1["Socket 1 / Node 1"]
    end
    subgraph 四插槽环["Four-socket ring (1-2 hops)"]
        Q0["Node 0"] --- Q1["Node 1"]
        Q1 --- Q2["Node 2"]
        Q2 --- Q3["Node 3"]
        Q3 --- Q0
    end
    subgraph 八插槽["Eight sockets (1-3 hops)"]
        E0["Node 0"] --- E1["Node 1"]
        E1 --- E2["Node 2"]
        E2 --- E3["Node 3"]
        E3 --- E4["Node 4"]
        E4 --- E5["Node 5"]
        E5 --- E6["Node 6"]
        E6 --- E7["Node 7"]
        E7 --- E0
    end
    style 双插槽 fill:#e8f5e9,stroke:#2e7d32
    style 四插槽环 fill:#fff9c4,stroke:#f9a825
    style 八插槽 fill:#ffcdd2,stroke:#c62828

2.2 Inspecting the NUMA topology

# Show NUMA node information
numactl --hardware
# available: 2 nodes (0-1)
# node 0 size: 64217 MB
# node 0 free: 58123 MB
# node 1 size: 64509 MB
# node 1 free: 58456 MB
# node distances:
# node   0   1
#   0:  10  21
#   1:  21  10

# Distance matrix: 10 = local, 21 = remote (about 2.1x the local cost, as reported by firmware)

# Show how a process's memory is spread across NUMA nodes
numastat -p $(pidof your_program)

# Show the CPU list of each NUMA node
lscpu | grep -i numa
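`lscpu` prints each node's CPUs as a range list such as `0-15,32-47`, and the same format appears in `/sys/devices/system/node/nodeN/cpulist`. The following is a minimal sketch of parsing such a list in C; the function name `parse_cpulist` and the flag-array representation are assumptions made for illustration, not part of any library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: parse a CPU list such as "0-3,8,10-11" and mark
 * each listed CPU in present[0..max_cpus-1]. Returns the number of CPUs
 * marked, or -1 on a malformed list. */
int parse_cpulist(const char *list, unsigned char *present, int max_cpus) {
    assert(list != NULL && present != NULL && max_cpus > 0);
    memset(present, 0, (size_t)max_cpus);
    int count = 0;
    const char *p = list;
    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p || lo < 0) return -1;        /* no digits found */
        long hi = lo;
        p = end;
        if (*p == '-') {                          /* range "lo-hi" */
            hi = strtol(p + 1, &end, 10);
            if (end == p + 1 || hi < lo) return -1;
            p = end;
        }
        if (hi >= max_cpus) return -1;            /* does not fit in present[] */
        for (long c = lo; c <= hi; c++) {
            if (!present[c]) { present[c] = 1; count++; }
        }
        if (*p == ',') p++;                       /* move to the next item */
        else if (*p != '\0' && *p != '\n') return -1;
    }
    return count;
}
```

The resulting flags can then be copied into a `cpu_set_t` with `CPU_SET` when libnuma is not available on the target system.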

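2.3 The NUMA distance matrix

	Node 0	Node 1	Node 2	Node 3
Node 0	10	21	31	21
Node 1	21	10	21	31
Node 2	31	21	10	21
Node 3	21	31	21	10

A distance of 10 means local, 21 means 1 hop, and 31 means 2 hops. These values are relative costs reported by the firmware, normalized so that local access is 10. A value of 21 therefore suggests roughly 2.1x the local latency, but real latencies should be measured on the target machine.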

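3. How NUMA affects performance

3.1 Memory bandwidth

Access type	Bandwidth	Relative bandwidth
Local DRAM	~50 GB/s	1.0x
Remote DRAM (1 hop)	~35 GB/s	0.7x
Remote DRAM (2 hops)	~25 GB/s	0.5x

3.2 Performance impact in typical scenarios

Scenario	NUMA-aware	NUMA-unaware	Performance difference
Database (OLTP)	Local memory	Randomly distributed	20-40%
In-memory database	Local memory	Cross-node	30-50%
Big data processing	Local allocation	Default allocation	10-30%
Multithreaded compute	Threads pinned	Threads not pinned	15-35%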

3.3 Cache coherence across NUMA nodes

Cache-coherence messages between NUMA nodes have to cross the interconnect, so they take longer:

sequenceDiagram
    participant Core0 as Node 0 core
    participant Ctrl0 as Node 0 controller
    participant Link as Interconnect
    participant Ctrl1 as Node 1 controller
    participant Core1 as Node 1 core
    Core0->>Ctrl0: Write shared variable (Invalidate)
    Ctrl0->>Link: Invalidate message
    Link->>Ctrl1: Forward Invalidate
    Ctrl1->>Core1: Perform Invalidate
    Core1->>Ctrl1: Ack
    Ctrl1->>Link: Invalidate Ack
    Link->>Ctrl0: Forward Ack
    Ctrl0->>Core0: Write complete
    Note over Link: Interconnect latency 40-80 ns
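One common way to avoid this round trip on hot counters is to give each node its own counter on its own cache line, so that writers never invalidate each other's lines. The sketch below assumes 64-byte cache lines and at most 8 nodes; both constants and all names are illustrative choices, not values from the text above.

```c
#include <assert.h>
#include <stdalign.h>

#define CACHE_LINE 64   /* assumed cache-line size in bytes */
#define MAX_NODES  8    /* assumed upper bound on NUMA nodes */

/* One counter per NUMA node, each on its own cache line so that a write
 * from one node does not invalidate the line another node is updating. */
struct node_counter {
    alignas(CACHE_LINE) long value;
};

struct sharded_counter {
    struct node_counter per_node[MAX_NODES];
};

/* Called by a thread running on `node`: touches only that node's line.
 * Assumes a single writer per node; use atomics if several threads on
 * the same node update concurrently. */
void sharded_add(struct sharded_counter *c, int node, long delta) {
    assert(c != 0 && node >= 0 && node < MAX_NODES);
    c->per_node[node].value += delta;
}

/* Readers pay the cross-node cost only when they need the total. */
long sharded_total(const struct sharded_counter *c) {
    long total = 0;
    for (int i = 0; i < MAX_NODES; i++) total += c->per_node[i].value;
    return total;
}
```

The trade-off is that reading the total touches every node's line, so this layout suits counters that are written often and read rarely.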

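4. NUMA-aware programming

4.1 Memory allocation policies

Policy	Behavior	Use case
default	Allocates on the NUMA node of the thread that first touches the page	General purpose
bind	Restricts allocation to the specified NUMA node(s)	Needs deterministic latency
interleave	Spreads pages round-robin across all nodes	Bandwidth-bound
preferred	Tries the specified node first, falls back to other nodes when it is full	Flexible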
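The interleave policy is easiest to picture as a page-by-page round robin. The sketch below is an idealized model of it, assuming page `i` of an allocation lands on node `i % num_nodes` starting at node 0; the real kernel behaves similarly but may start at a different node. The function name `interleave_page_counts` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Idealized model of the interleave policy: page i of an allocation is
 * placed on node (i % num_nodes). Fills pages_on_node[0..num_nodes-1]
 * with the number of pages each node would receive. */
void interleave_page_counts(size_t bytes, size_t page_size, int num_nodes,
                            size_t *pages_on_node) {
    assert(page_size > 0 && num_nodes > 0 && pages_on_node != NULL);
    size_t pages = (bytes + page_size - 1) / page_size;  /* round up */
    for (int n = 0; n < num_nodes; n++) {
        pages_on_node[n] = pages / (size_t)num_nodes;
        /* the first (pages % num_nodes) nodes get one extra page */
        if ((size_t)n < pages % (size_t)num_nodes) pages_on_node[n]++;
    }
}
```

For example, ten 4 KiB pages over four nodes come out as 3, 3, 2, 2, which is why interleaving balances load on the memory controllers but leaves about (N-1)/N of the pages remote to any single thread.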

4.2 The numactl tool

# Allocate memory only from Node 0
numactl --membind=0 ./your_program

# Interleave memory across all nodes (suits bandwidth-bound workloads)
numactl --interleave=all ./your_program

# Bind both CPUs and memory to the same node
numactl --cpunodebind=0 --membind=0 ./your_program

# Show the current NUMA policy
numactl --show

4.3 NUMA awareness in code

#define _GNU_SOURCE   // required for sched_getcpu()
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>

// Link with -lnuma
void numa_aware_allocation(void) {
    if (numa_available() < 0) {
        printf("NUMA is not available\n");
        return;
    }

    int num_nodes = numa_num_configured_nodes();
    printf("NUMA nodes: %d\n", num_nodes);

    // Allocate on the NUMA node the current thread is running on
    int preferred_node = numa_node_of_cpu(sched_getcpu());
    void *local_mem = numa_alloc_onnode(1024 * 1024, preferred_node);

    // Interleave pages across all nodes
    void *interleaved_mem = numa_alloc_interleaved(1024 * 1024 * 4);

    // Allocate on a specific node (node 0)
    void *bound_mem = numa_alloc_onnode(1024 * 1024, 0);

    // The numa_alloc_* functions return NULL on failure, so check before freeing
    if (local_mem) numa_free(local_mem, 1024 * 1024);
    if (interleaved_mem) numa_free(interleaved_mem, 1024 * 1024 * 4);
    if (bound_mem) numa_free(bound_mem, 1024 * 1024);
}

4.4 Thread binding (CPU affinity)

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <numa.h>

// Pin the calling thread to all CPUs of one NUMA node (link with -lnuma -lpthread).
// Returns 0 on success, non-zero on failure.
int bind_to_node(int node) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);

    // Get the list of CPUs that belong to this node
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) < 0) {
        numa_free_cpumask(cpus);
        return -1;
    }
    for (int i = 0; i < numa_num_configured_cpus(); i++) {
        if (numa_bitmask_isbitset(cpus, i)) {
            CPU_SET(i, &cpuset);
        }
    }
    numa_free_cpumask(cpus);

    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

5. NUMA and databases

5.1 NUMA tuning for PostgreSQL

# Shell: start PostgreSQL under numactl, bound to one NUMA node
numactl --cpunodebind=0 --membind=0 pg_ctl start

# postgresql.conf: size the shared buffers with per-node memory in mind
shared_buffers = 32GB  # no more than the memory of a single NUMA node

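5.2 The MySQL NUMA problem

MySQL's Buffer Pool is one large memory region, and by default it ends up on the NUMA node of the thread that starts the server. Threads that later run on other nodes then reach the Buffer Pool through remote accesses.

Solutions:

# Option 1: interleave the allocation
numactl --interleave=all mysqld

# Option 2: one MySQL instance per NUMA node
# (each instance also needs its own data directory and socket)
numactl --cpunodebind=0 --membind=0 mysqld --port=3306
numactl --cpunodebind=1 --membind=1 mysqld --port=3307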

6. NUMA and virtualization

6.1 NUMA topology of virtual machines

KVM can bind a virtual machine's vCPUs and memory to physical NUMA nodes:

<!-- libvirt XML configuration -->
<numatune>
    <memory mode='strict' nodeset='0'/>
</numatune>
<vcpu placement='static' cpuset='0-15'>16</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <!-- ... -->
</cputune>

6.2 NUMA awareness for containers

# Docker: bind a container to one NUMA node
docker run --cpuset-cpus=0-15 --cpuset-mems=0 your_image

# Kubernetes NUMA-aware scheduling
# Requires the CPU Manager and the Topology Manager
# kubelet configuration:
# --cpu-manager-policy=static
# --topology-manager-policy=best-effort

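7. Hands-on experiments

7.1 Experiment 1: measuring the NUMA access-time difference

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>
#include <numaif.h>

#define SIZE (64 * 1024 * 1024)  // 64M ints = 256 MB

// Build with optimization and link with -lnuma
int main(void) {
    if (numa_available() < 0) {
        printf("NUMA is not available\n");
        return 1;
    }

    int num_nodes = numa_num_configured_nodes();
    printf("NUMA nodes: %d\n", num_nodes);

    // Keep the thread on node 0 so every run is measured from the same place
    numa_run_on_node(0);

    for (int node = 0; node < num_nodes; node++) {
        int *arr = numa_alloc_onnode(SIZE * sizeof(int), node);
        if (!arr) {
            fprintf(stderr, "allocation on node %d failed\n", node);
            return 1;
        }
        // Touch every page so the memory is actually placed on the node
        for (int i = 0; i < SIZE; i++) arr[i] = i;

        // Sequential scan: hardware prefetching hides part of the latency,
        // so this mostly reflects bandwidth
        clock_t start = clock();
        long long sum = 0;
        for (int i = 0; i < SIZE; i++) sum += arr[i];
        clock_t end = clock();

        // Printing sum keeps the compiler from discarding the read loop
        printf("Memory on node %d: %.3f s (sum=%lld)\n", node,
               (double)(end - start) / CLOCKS_PER_SEC, sum);
        numa_free(arr, SIZE * sizeof(int));
    }
    return 0;
}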
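A sequential scan like the one above lets the hardware prefetcher hide much of the DRAM latency. A common alternative for isolating latency is pointer chasing, where every load depends on the previous one. The sketch below builds such a chain with Sattolo's algorithm; it is a technique swapped in for illustration, not the author's method, and the names `build_chase_ring` and `chase` are hypothetical. It uses a small xorshift generator so the result does not depend on `rand()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small deterministic pseudo-random generator (xorshift64). */
static uint64_t next_random(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill next[0..n-1] with a random single-cycle permutation (Sattolo's
 * algorithm), so that following next[] visits every slot once before
 * returning to the start. */
void build_chase_ring(size_t *next, size_t n, uint64_t seed) {
    assert(next != NULL && n > 0 && seed != 0);
    for (size_t i = 0; i < n; i++) next[i] = i;
    uint64_t state = seed;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(next_random(&state) % i);   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the ring for `steps` dependent loads and return the final slot.
 * Each load's address depends on the previous load's result. */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t pos = start;
    for (size_t s = 0; s < steps; s++) pos = next[pos];
    return pos;
}
```

To use it for the experiment, one would allocate `next` with `numa_alloc_onnode` on the node under test, make the ring much larger than the last-level cache, time a call to `chase`, and divide the elapsed time by `steps` to get an approximate per-access latency.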

7.2 Experiment 2: the effect of numactl

# Local access
numactl --cpunodebind=0 --membind=0 ./memory_benchmark

# Remote access
numactl --cpunodebind=0 --membind=1 ./memory_benchmark

# Interleaved allocation
numactl --interleave=all ./memory_benchmark

# Compare the timings of the three runs

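7.3 Experiment 3: observing NUMA statistics

# Show NUMA allocation statistics (numa_hit, numa_miss, numa_foreign, ...)
numastat

# Show how a process's memory is spread across NUMA nodes
numastat -p $(pidof your_program)

# Read the raw per-node hit/miss counters from sysfs
# (numa_hit, numa_miss and numa_foreign are kernel counters, not perf events)
cat /sys/devices/system/node/node*/numastat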

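8. NUMA-aware programming in depth

8.1 Memory allocation policies in detail

The policies differ widely in their effect on latency and bandwidth, and choosing the wrong one can double latency:

flowchart TD
    APP["Application type"] --> LAT["Latency-sensitive: databases, KV stores"]
    APP --> BW["Bandwidth-sensitive: big data, graph computing"]
    APP --> MIX["Mixed: web servers"]
    LAT --> BIND["Policy: bind. Threads and memory on the same node, lowest latency"]
    BW --> INTER["Policy: interleave. Pages spread across all nodes, highest aggregate bandwidth"]
    MIX --> PREF["Policy: preferred. Local first, remote when full, balances latency and bandwidth"]
    style BIND fill:#e8f5e9,stroke:#2e7d32
    style INTER fill:#fff9c4,stroke:#f9a825
    style PREF fill:#e3f2fd,stroke:#1565c0

Policy	Command	Latency	Bandwidth	Use case
bind	`--membind=0`	Lowest (local)	Single-node bandwidth	Latency-sensitive
interleave	`--interleave=all`	Medium (averaged)	Highest aggregate bandwidth	Bandwidth-sensitive
preferred	`--preferred=0`	Low (local first)	Medium	General purpose
default	Default	Depends on first touch	Depends on placement	Not recommended when placement matters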

8.2 Thread placement

// NUMA-aware thread pool entry point (link with -lnuma -lpthread)
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <numa.h>
#include <numaif.h>   // set_mempolicy(), MPOL_BIND

struct thread_config {
    int numa_node;
    int cpu_within_node;   // non-negative index among the node's CPUs
    void* (*func)(void*);
    void* arg;
};

void* numa_thread_entry(void* arg) {
    struct thread_config *cfg = (struct thread_config*)arg;

    // 1. Pin the thread to one CPU of the requested NUMA node
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(cfg->numa_node, cpus);
    int ncpus = numa_num_configured_cpus();
    int n = 0;   // number of CPUs on this node
    for (int i = 0; i < ncpus; i++) {
        if (numa_bitmask_isbitset(cpus, i)) n++;
    }
    if (n > 0) {
        // Pick the k-th CPU of the node, wrapping around when there are
        // more threads than CPUs (no fixed-size CPU array needed)
        int k = cfg->cpu_within_node % n;
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        for (int i = 0; i < ncpus; i++) {
            if (numa_bitmask_isbitset(cpus, i) && k-- == 0) {
                CPU_SET(i, &cpuset);
                break;
            }
        }
        pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    }
    numa_free_cpumask(cpus);

    // 2. Restrict this thread's memory allocations to the local node
    struct bitmask *nodemask = numa_allocate_nodemask();
    numa_bitmask_setbit(nodemask, cfg->numa_node);
    set_mempolicy(MPOL_BIND, nodemask->maskp, nodemask->size + 1);
    numa_free_nodemask(nodemask);

    // 3. Run the thread function
    return cfg->func(cfg->arg);
}

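8.3 NUMA-aware data structures

Partitioned hash table: one partition per NUMA node, and each thread touches only its local partition:

// NUMA-partitioned hash table (link with -lnuma)
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <numa.h>

// Placeholder types and single-partition lookup, assumed to be defined elsewhere
struct hash_entry { int key; int value; struct hash_entry *next; };
struct hash_bucket { struct hash_entry *head; };
struct hash_entry* hash_lookup(struct hash_bucket *partition, int key);

struct numa_hash_table {
    int num_nodes;
    struct hash_bucket **partitions;  // one partition per node
};

// Create: each partition is allocated on its own NUMA node
struct numa_hash_table* numa_ht_create(int capacity) {
    struct numa_hash_table *ht = malloc(sizeof(*ht));
    if (!ht) return NULL;
    ht->num_nodes = numa_num_configured_nodes();
    ht->partitions = malloc(ht->num_nodes * sizeof(*ht->partitions));
    if (!ht->partitions) { free(ht); return NULL; }

    for (int i = 0; i < ht->num_nodes; i++) {
        // Allocate partition i on NUMA node i
        ht->partitions[i] = numa_alloc_onnode(
            capacity / ht->num_nodes * sizeof(struct hash_bucket), i);
    }
    return ht;
}

// Lookup: touches only the local partition (no cross-node access).
// Assumes requests are routed so that the calling thread runs on the node
// that owns the key, that is, key % num_nodes == node.
struct hash_entry* numa_ht_lookup(struct numa_hash_table *ht, int key) {
    int node = numa_node_of_cpu(sched_getcpu());  // node of the current thread
    int local_key = key / ht->num_nodes;  // key as seen inside the local partition
    return hash_lookup(ht->partitions[node], local_key);
}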
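The lookup above only finds a key if the calling thread runs on the node that stores it, so some routing rule has to decide which node owns which key. A minimal sketch of one such rule follows, assuming keys are spread round-robin over the nodes; `route_key`, `unroute_key`, and `struct key_route` are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical routing result for a table split across num_nodes
 * partitions, with keys spread round-robin over the nodes. */
struct key_route {
    int owner_node;   /* partition (NUMA node) that stores the key */
    int local_key;    /* key as seen inside that partition */
};

struct key_route route_key(unsigned int key, int num_nodes) {
    assert(num_nodes > 0);
    struct key_route r;
    r.owner_node = (int)(key % (unsigned int)num_nodes);
    r.local_key  = (int)(key / (unsigned int)num_nodes);
    return r;
}

/* Inverse mapping, useful for checking that no two keys collide. */
unsigned int unroute_key(struct key_route r, int num_nodes) {
    assert(num_nodes > 0);
    return (unsigned int)r.local_key * (unsigned int)num_nodes
         + (unsigned int)r.owner_node;
}
```

A dispatcher would call `route_key` and hand the request to a worker pinned on `owner_node`, which can then use `local_key` against its own partition. With four nodes, key 10 maps to node 2 with local key 2.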

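8.4 Measured performance impact

// NUMA access-time benchmark (sequential scan); link with -lnuma
#define _GNU_SOURCE   // required for sched_getcpu()
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <numa.h>

#define SIZE (64 * 1024 * 1024)  // 64M ints = 256 MB

// Fill and then scan an array placed on mem_node; returns the scan time in ms,
// or a negative value if the allocation fails
static double scan_ms(int mem_node) {
    int *arr = numa_alloc_onnode(SIZE * sizeof(int), mem_node);
    if (!arr) return -1.0;
    for (int i = 0; i < SIZE; i++) arr[i] = i;

    clock_t start = clock();
    volatile long long sum = 0;
    for (int i = 0; i < SIZE; i++) sum += arr[i];
    clock_t end = clock();

    numa_free(arr, SIZE * sizeof(int));
    return (double)(end - start) / CLOCKS_PER_SEC * 1000;
}

int main(void) {
    if (numa_available() < 0) { printf("NUMA is not available\n"); return 1; }

    int nn = numa_num_configured_nodes();
    int cpu_node = numa_node_of_cpu(sched_getcpu());
    numa_run_on_node(cpu_node);  // stay on this node for the whole run

    printf("Current thread is on node %d\n\n", cpu_node);
    printf("%-15s %-15s %-10s\n", "Memory node", "Scan time (ms)", "Relative");

    // Measure local memory first and use it as the baseline,
    // instead of dividing by a hard-coded constant
    double local_ms = scan_ms(cpu_node);
    for (int mem_node = 0; mem_node < nn; mem_node++) {
        double ms = (mem_node == cpu_node) ? local_ms : scan_ms(mem_node);
        printf("Node %-8d  %-12.1f  %.1fx\n", mem_node, ms, ms / local_ms);
    }
    return 0;
}

Scenario	Local access	Remote access	Difference
Sequential scan of 256 MB	~80 ms	~140 ms	1.75x
Random access over 256 MB	~500 ms	~900 ms	1.8x
Atomic operations	~50 ns/op	~120 ns/op	2.4x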

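Warning

A common mistake in NUMA-aware programming is "over-binding": pinning every thread and all memory to Node 0 and leaving the other nodes idle. The right approach is to spread the load evenly, so that each node carries 1/N of the threads and memory and each thread mostly accesses local memory.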
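The even split can be computed from the thread index alone. The sketch below is one simple way to do it, assuming threads are numbered from 0 and every node has the same number of CPUs; the field names mirror the `thread_config` struct from section 8.2, while `place_thread` and `struct placement` are hypothetical.

```c
#include <assert.h>

/* Hypothetical placement: spread thread indices round-robin so each of
 * the num_nodes nodes gets about 1/N of the threads. */
struct placement {
    int numa_node;        /* node the thread should run on */
    int cpu_within_node;  /* index among that node's CPUs */
};

struct placement place_thread(int thread_index, int num_nodes) {
    assert(thread_index >= 0 && num_nodes > 0);
    struct placement p;
    p.numa_node = thread_index % num_nodes;
    p.cpu_within_node = thread_index / num_nodes;
    return p;
}
```

With eight threads on four nodes, each node receives two threads; thread 5, for instance, lands on node 1 as that node's second thread.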

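9. Summary

The previous chapter gave a high-level overview of the TLB and page-table structures.

Concept	Key point	Impact on software
NUMA topology	Each socket has local memory	Remote access latency is 1.5-2x
Distance matrix	Quantifies the cost between nodes	Basis for scheduling decisions
Memory allocation policy	bind/interleave/preferred	Choose the policy per workload
Thread binding	CPU affinity	Avoids thread migration
Database NUMA	Buffer Pool placement	Interleave or run one instance per node
Virtualization NUMA	vCPU pinning	NUMA-aware virtual machines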