containerd-shim - souloss Blog

A question that puzzles many people: why does containerd put a shim in between instead of calling runc directly? The answer comes down to three scenarios:

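Upgrade safety: running containers are unaffected when containerd is upgraded or restarted
Process supervision: each container process has its own parent process, so it does not become an orphan if containerd crashes
runc exit: runc can exit once the container has been created and does not need to keep running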

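The shim grew out of an operational pain point. In the early Docker architecture, the daemon managed container processes directly, so every container sat underneath the daemon in the process tree. Upgrading or restarting the daemon broke that parent-child relationship: containers either became orphans or were restarted along with it. In 2016, containerd introduced the shim as a "middleman": the shim runs independently as the parent of the container process, and containerd talks to it over RPC (early shim v1 used gRPC; shim v2 later switched to the lighter ttrpc). As a result, restarting containerd for an upgrade leaves running containers untouched, and runc can exit as soon as the container has been created. For the cost of one lightweight daemon, the shim buys a significant operational benefit, a classic case of "trading architecture for reliability" in the evolution of container runtimes.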

Prerequisites

Important

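Ch07 containerd architecture: the shim is a sub-component of containerd, so understanding containerd's overall architecture comes first
Ch06 runc source code analysis: the shim calls runc to create containers, so you should understand how runc manages the container lifecycle
Linux process relationships: parent processes, orphan processes, zombie processes, prctl(PR_SET_CHILD_SUBREAPER)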

Note

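The idea behind the shim, "use an independent process to decouple a parent-child relationship", is not unusual in systems programming. systemd's socket activation and SSH's jump-host proxying use a similar decoupling pattern.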

1. Why a Shim Is Needed

1.1 The Problem Without a Shim

Suppose containerd called runc directly, with no shim in between:

graph TB
    subgraph NoShim["The problem without a shim"]
        CTNRD1["containerd PID 1000"]
        RUNC1["runc PID 2000 (keeps running)"]
        APP1["container process PID 3000"]
        CTNRD1 -->|"calls directly"| RUNC1
        RUNC1 -->|"parent of"| APP1
        PROBLEM1["containerd restarts → runc may be affected"]
        PROBLEM2["runc must keep running → wasted resources"]
        PROBLEM3["containerd crashes → container becomes an orphan"]
    end
    style NoShim fill:#ffcdd2,stroke:#c62828

1.2 The Solution With a Shim

graph TB
    subgraph WithShim["The solution with a shim"]
        CTNRD2["containerd PID 1000"]
        SHIM2["containerd-shim PID 2000"]
        RUNC2["runc PID 3000 (exits after creation)"]
        APP2["container process PID 4000"]
        CTNRD2 -->|"fork + exec"| SHIM2
        SHIM2 -->|"exec runc create"| RUNC2
        RUNC2 -.->|"exits after creation"| EXIT["runc exits"]
        SHIM2 -->|"parent of"| APP2
        BENEFIT1["containerd restarts → shim is unaffected"]
        BENEFIT2["runc exits after creation → no wasted resources"]
        BENEFIT3["shim supervises the container → no orphan"]
    end
    style WithShim fill:#c8e6c9,stroke:#2e7d32

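1.3 The Core Value of the Shim

Feature	No shim (old Docker)	shim v1	shim v2 (shim-runc-v2)
Impact of a daemon restart	All containers exit	Containers keep running	Containers keep running
Shim process model	None	One shim per container	One shim per container
Multi-container support	N/A	Each container has its own shim	Per-container by default; an optional sandbox mode lets one shim manage every container in a Pod
Resource overhead	None	~5 MB per container	~2 MB per container (shared in sandbox mode)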

Tip

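containerd 1.4+ uses shim v2 by default. When Kubernetes runs a large number of short-lived containers, the shared-process model of shim v2 (one shim for a whole Pod instead of one per container) can noticeably reduce memory overhead.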
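To make the memory argument concrete, here is a back-of-the-envelope sketch in Go. The function name and parameters are made up for illustration, and the per-shim footprint is simply the rough 2 MB figure quoted in the table above, not a measured value.

```go
package main

import "fmt"

// shimOverheadMB estimates the total memory used by shim processes, in MB.
// perShimMB is an assumed footprint for a single shim process.
// shared=true models one shim per Pod; shared=false models one shim per container.
func shimOverheadMB(pods, containersPerPod, perShimMB int, shared bool) int {
	if shared {
		return pods * perShimMB
	}
	return pods * containersPerPod * perShimMB
}

func main() {
	// 100 Pods with 3 containers each, assuming ~2 MB per shim.
	fmt.Println("per-container shims:", shimOverheadMB(100, 3, 2, false), "MB")
	fmt.Println("per-Pod shims:      ", shimOverheadMB(100, 3, 2, true), "MB")
}
```

Under these assumptions the shared model saves a factor equal to the number of containers per Pod; real savings depend on the actual footprint of the shim on your nodes.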

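Concern	Without a shim	With a shim
containerd upgrade	Affects every container	Containers are unaffected
containerd crash	Containers become orphans	The shim keeps supervising
runc lifecycle	Must keep running	Can exit after creation
Container stdout/stderr	Must be handled by the container runtime	Collected by the shim
Container exit status	The container runtime must wait for it	Collected by the shim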

2. Shim Process Relationships

2.1 The Full Process Tree

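# Show the container-related process tree
# (-x matches the name exactly so containerd-shim is excluded; -o picks the oldest match)
pstree -p "$(pgrep -xo containerd)" | head -30

# Typical output:
# containerd(1000)───containerd-shim(2000)───nginx(3000)
#                                             └─nginx(3001)
#                  ───containerd-shim(4000)───redis(5000)
#                  ───containerd-shim(6000)───myapp(7000)
#
# With shim v2 the shim usually re-parents to PID 1, so if no shim shows up here,
# look under systemd instead: pstree -p 1 | grep containerd-shim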
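If pstree is not available, the same parent-child information can be read from /proc. The following is a minimal Go sketch that extracts the parent PID from the text of /proc/<pid>/stat; the sample line is invented to match the PIDs in the tree above, and `parsePPID` is a name chosen for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePPID extracts the parent PID from the contents of /proc/<pid>/stat.
// The second field (comm) is wrapped in parentheses and may itself contain
// spaces or parentheses, so we only look at what follows the LAST ')'.
func parsePPID(stat string) (int, error) {
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line: %q", stat)
	}
	fields := strings.Fields(stat[end+1:])
	// fields[0] is the process state, fields[1] is the parent PID.
	if len(fields) < 2 {
		return 0, fmt.Errorf("stat line too short: %q", stat)
	}
	return strconv.Atoi(fields[1])
}

func main() {
	// Invented sample shaped like /proc/3000/stat for the nginx process.
	sample := "3000 (nginx: master) S 2000 3000 3000 0 -1 4194560"
	ppid, err := parsePPID(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("parent PID:", ppid)
}
```

On a real host you would pass the result of `os.ReadFile("/proc/<pid>/stat")` and walk upward until you reach the shim.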

2.2 Process Relationships in Detail

graph TB
    subgraph HostTree["Host process tree"]
        SYSTEMD["systemd PID 1"]
        CTNRD["containerd PID 1000"]
        SHIM1["containerd-shim PID 2000 (container A)"]
        SHIM2["containerd-shim PID 4000 (container B)"]
        APP1["nginx PID 3000"]
        APP2["redis PID 5000"]
    end
    SYSTEMD --> CTNRD
    CTNRD --> SHIM1
    CTNRD --> SHIM2
    SHIM1 --> APP1
    SHIM2 --> APP2
    subgraph KeyProps["Key properties"]
        F1["1. The shim is the container's parent process"]
        F2["2. containerd talks to the shim over RPC (v1: gRPC, v2: ttrpc)"]
        F3["3. containerd reconnects to the shim after a restart"]
        F4["4. The shim collects the container's exit status"]
    end
    style HostTree fill:#e8eaf6,stroke:#283593
    style KeyProps fill:#c8e6c9,stroke:#2e7d32

2.3 How the Shim Interacts With runc

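# Sequence of shim-to-runc calls:
# 1. After it starts, the shim calls runc create to create the container
# 2. Once runc create finishes, the runc process exits
# 3. The container process is re-parented to the shim (a child subreaper), which supervises it
# 4. When the container should start, the shim calls runc start
# 5. Once runc start finishes, the runc process exits

# Show the shim's command-line arguments
# (the [c] keeps grep from matching its own command line)
ps aux | grep '[c]ontainerd-shim'
# containerd-shim -namespace default -id mynginx -address /run/containerd/containerd.sock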
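Why does the container end up under the shim once runc exits? On Linux, the children of an exiting process are handed to the nearest ancestor that marked itself with prctl(PR_SET_CHILD_SUBREAPER), or to PID 1 if there is none. The sketch below models that rule on a toy process table; it is a simulation for illustration (the `proc` type and function names are invented), not code that touches real processes.

```go
package main

import "fmt"

// proc is one row of a toy process table.
type proc struct {
	ppid      int
	subreaper bool // on a real system: set via prctl(PR_SET_CHILD_SUBREAPER)
}

// reparentOnExit removes pid from the table and re-parents its children to
// the nearest ancestor marked as a subreaper, falling back to PID 1.
func reparentOnExit(table map[int]*proc, pid int) {
	exiting, ok := table[pid]
	if !ok {
		return
	}
	newParent := 1
	for p := exiting.ppid; p > 1; {
		anc, ok := table[p]
		if !ok {
			break
		}
		if anc.subreaper {
			newParent = p
			break
		}
		p = anc.ppid
	}
	delete(table, pid)
	for _, child := range table {
		if child.ppid == pid {
			child.ppid = newParent
		}
	}
}

// newTable builds: init(1) -> shim(2000) -> runc(3000) -> container(4000).
func newTable(shimIsSubreaper bool) map[int]*proc {
	return map[int]*proc{
		1:    {ppid: 0},
		2000: {ppid: 1, subreaper: shimIsSubreaper},
		3000: {ppid: 2000},
		4000: {ppid: 3000},
	}
}

func main() {
	for _, flag := range []bool{true, false} {
		t := newTable(flag)
		reparentOnExit(t, 3000) // runc exits after creating the container
		fmt.Printf("shim subreaper=%v -> container's parent is %d\n", flag, t[4000].ppid)
	}
}
```

With the subreaper flag set, the container lands under the shim (2000), which can then wait on it; without the flag it would fall through to PID 1 and the shim could no longer collect its exit status.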

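3. The Shim API

3.1 Task Service API

The shim exposes a Task Service API over RPC, and containerd manages containers through it. The wire protocol changed between versions: shim v1 uses gRPC, while shim v2 uses ttrpc (a lighter protobuf RPC that drops the HTTP/2 stack). The set of Task operations is essentially the same in both: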

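Method	Purpose	Corresponding runc command
Create	Create the container	runc create
Start	Start the container	runc start
Kill	Send a signal	runc kill
Delete	Delete the container	runc delete
State	Query state	runc state
Exec	Run a command in the container	runc exec
Pids	List PIDs	(none)
ResizePty	Resize the terminal	(none)
CloseIO	Close IO	(none)
Checkpoint	Checkpoint the container	runc checkpoint
Restore	Restore from a checkpoint (requested through Create with a checkpoint path)	runc restore
Update	Update resources	runc update
Wait	Wait for exit	(none)
Stats	Collect statistics	(none)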
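The mapping in this table can be pictured as a small dispatch function. The Go sketch below covers only a few rows and uses a hypothetical helper name (`runcArgs`); the real shim drives runc through a client library with many more options, so treat this purely as an illustration of the idea.

```go
package main

import (
	"fmt"
	"strings"
)

// runcArgs returns the runc command line a shim might run for a Task API
// call. ok is false for methods the shim answers by itself without runc.
func runcArgs(method, id, bundle string) (args []string, ok bool) {
	switch method {
	case "Create":
		return []string{"runc", "create", "--bundle", bundle, id}, true
	case "Start":
		return []string{"runc", "start", id}, true
	case "Kill":
		// A real request carries the signal; SIGTERM is used here as an example.
		return []string{"runc", "kill", id, "SIGTERM"}, true
	case "Delete":
		return []string{"runc", "delete", id}, true
	case "State":
		return []string{"runc", "state", id}, true
	default:
		// Wait, CloseIO, ResizePty and friends are handled inside the shim.
		return nil, false
	}
}

func main() {
	for _, m := range []string{"Create", "Start", "Wait"} {
		if args, ok := runcArgs(m, "mynginx", "/path/to/bundle"); ok {
			fmt.Printf("%-6s -> %s\n", m, strings.Join(args, " "))
		} else {
			fmt.Printf("%-6s -> handled by the shim itself\n", m)
		}
	}
}
```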

3.2 Shim RPC Communication

// Task service definition for the shim (protobuf; gRPC in v1, ttrpc in v2; simplified)
service Task {
    rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
    rpc Start(StartRequest) returns (StartResponse);
    rpc Kill(KillRequest) returns (google.protobuf.Empty);
    rpc Delete(DeleteRequest) returns (DeleteResponse);
    rpc State(StateRequest) returns (StateResponse);
    rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
    rpc Pids(PidsRequest) returns (PidsResponse);
    rpc Wait(WaitRequest) returns (WaitResponse);
    rpc Stats(StatsRequest) returns (StatsResponse);
}

message CreateTaskRequest {
    string id = 1;                 // container ID
    string bundle = 2;             // path to the OCI bundle
    repeated Mount rootfs = 3;     // mounts that make up the root filesystem
    bool terminal = 4;             // whether to allocate a terminal
    string stdin = 5;              // FIFO path for stdin
    string stdout = 6;             // FIFO path for stdout
    string stderr = 7;             // FIFO path for stderr
    string checkpoint = 8;         // checkpoint path to restore from, if any
    string parent_checkpoint = 9;
}

3.3 Communication Flow Between containerd and the Shim

sequenceDiagram
    participant CTNRD as containerd
    participant SHIM as containerd-shim
    participant RUNC as runc
    participant APP as container process
    Note over CTNRD,APP: Container creation
    CTNRD->>SHIM: fork + exec shim
    SHIM-->>CTNRD: shim started
    Note over CTNRD,SHIM: RPC calls below use gRPC with shim v1 and ttrpc with shim v2
    CTNRD->>SHIM: RPC: Create(bundle, rootfs)
    SHIM->>RUNC: exec: runc create --bundle <path> <id>
    RUNC->>APP: clone(CLONE_NEWPID|...) + exec
    RUNC-->>SHIM: container created (runc exits)
    SHIM-->>CTNRD: RPC: CreateResponse{pid}
    Note over CTNRD,APP: Container start
    CTNRD->>SHIM: RPC: Start(id)
    SHIM->>RUNC: exec: runc start <id>
    RUNC-->>SHIM: container started (runc exits)
    SHIM-->>CTNRD: RPC: StartResponse{pid}
    Note over CTNRD,APP: Container exit
    APP-->>SHIM: process exits (wait returns the exit code)
    SHIM->>CTNRD: event: TaskExit{exit_status}
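The flow in this diagram amounts to a small state machine: Create, then Start, then an exit event. Here is a minimal Go sketch of it; the state and event names are chosen for this example and are simpler than the status values containerd actually tracks.

```go
package main

import "fmt"

type status string

const (
	unknown status = "unknown"
	created status = "created"
	running status = "running"
	stopped status = "stopped"
)

// next returns the state reached by applying event to s, or an error if
// the event is not allowed in that state.
func next(s status, event string) (status, error) {
	switch {
	case s == unknown && event == "Create":
		return created, nil
	case s == created && event == "Start":
		return running, nil
	case s == running && event == "TaskExit":
		return stopped, nil
	}
	return s, fmt.Errorf("event %q is not valid in state %q", event, s)
}

func main() {
	s := unknown
	for _, ev := range []string{"Create", "Start", "TaskExit"} {
		var err error
		if s, err = next(s, ev); err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%-8s -> %s\n", ev, s)
	}
}
```

Rejecting out-of-order events (for example Start before Create) is the part worth noticing: the shim has to enforce ordering because containerd may retry or reconnect at any point.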

4. The Shim's Upgrade-Safety Mechanism

4.1 Recovery After a containerd Restart

When containerd restarts, it has to reconnect to every shim that is still running:

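// Shim recovery after a containerd restart (simplified illustration, not the actual containerd source).
// Assumes the imports context, fmt and path/filepath; the shim and task packages are stand-ins.
func (m *TaskManager) LoadTasks(ctx context.Context) error {
    // 1. Scan for shim socket files
    shimSockets, err := filepath.Glob("/run/containerd/s/*.sock")
    if err != nil {
        return fmt.Errorf("scan shim sockets: %w", err)
    }

    // 2. Connect to each shim in turn
    for _, socket := range shimSockets {
        // Named client so the variable does not shadow the shim package
        client, err := shim.Connect(ctx, socket)
        if err != nil {
            continue // the shim may already have exited
        }

        // 3. Ask the shim for the state of the container it manages
        state, err := client.State(ctx, &task.StateRequest{})
        if err != nil {
            continue
        }

        // 4. Rebuild the in-memory Task object
        m.tasks[state.ID] = &Task{
            shim:   client,
            id:     state.ID,
            pid:    state.Pid,
            status: state.Status,
        }
    }

    return nil
}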
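To see the reconnect idea in action without containerd, the following self-contained Go sketch runs a fake "shim" that listens on a Unix socket and holds the task state, then queries it twice over separate connections, the way a daemon would before and after a restart. The one-line text protocol and all names here are invented for the demo; real shims speak ttrpc or gRPC.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
)

// serveShim plays the shim: it owns the task state and answers every
// incoming connection with it. It returns when the listener is closed.
func serveShim(ln net.Listener, state string) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		fmt.Fprintln(conn, state)
		conn.Close()
	}
}

// queryShim plays containerd: connect to the socket and read the state line.
func queryShim(socket string) (string, error) {
	conn, err := net.Dial("unix", socket)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return line[:len(line)-1], nil
}

// demo starts a fake shim and queries it over two independent connections.
func demo() (before, after string, err error) {
	dir, err := os.MkdirTemp("", "shim-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	socket := filepath.Join(dir, "task.sock")
	ln, err := net.Listen("unix", socket)
	if err != nil {
		return "", "", err
	}
	defer ln.Close()
	go serveShim(ln, "mynginx RUNNING pid=12345")

	if before, err = queryShim(socket); err != nil {
		return "", "", err
	}
	// The first "daemon" connection is gone; a fresh one finds the same state.
	after, err = queryShim(socket)
	return before, after, err
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("before restart:", before)
	fmt.Println("after restart: ", after)
}
```

The point of the design is that the state lives with the long-running side of the socket, so the client can disappear and come back without anything being lost.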

4.2 The Shim's Socket File

Each shim creates a Unix socket file under /run/containerd/s/:

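# List the shim socket files
ls /run/containerd/s/

# Example output:
# 5a3b2c1d4e5f.sock
# 7f8e9d0c1b2a.sock

# Each socket file name maps to one container
# (on recent versions it is a hash derived from the namespace and container ID)
# After a restart, containerd reconnects to the shims through these sockets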
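For a restarted daemon to find the right socket again, the socket name has to be recomputable from information the daemon already has. One way to do that is to hash the identifying fields. The Go sketch below is modeled on that idea; the exact inputs and layout are assumptions made for illustration and may differ from what your containerd version does.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"path/filepath"
)

// socketPath derives a stable socket path from the daemon address, the
// namespace and the container ID. The same inputs always yield the same
// path, so no extra bookkeeping is needed to find it again after a restart.
func socketPath(stateDir, daemonAddr, namespace, id string) string {
	sum := sha256.Sum256([]byte(filepath.Join(daemonAddr, namespace, id)))
	return filepath.Join(stateDir, fmt.Sprintf("%x", sum))
}

func main() {
	p := socketPath("/run/containerd/s", "/run/containerd/containerd.sock", "default", "mynginx")
	fmt.Println(p)
}
```

Hashing also keeps the path length fixed, which matters because Unix socket paths are limited to roughly a hundred bytes.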

4.3 What Guarantees Upgrade Safety

flowchart TB
    subgraph Before["Before the upgrade"]
        CTNRD1["containerd v1.4"]
        SHIM1A["shim (container A)"]
        SHIM1B["shim (container B)"]
        APP1A["nginx"]
        APP1B["redis"]
        CTNRD1 --> SHIM1A
        CTNRD1 --> SHIM1B
        SHIM1A --> APP1A
        SHIM1B --> APP1B
    end
    subgraph During["During the upgrade"]
        CTNRD_STOP["containerd stopped"]
        SHIM2A["shim (container A) keeps running"]
        SHIM2B["shim (container B) keeps running"]
        APP2A["nginx keeps running"]
        APP2B["redis keeps running"]
        SHIM2A --> APP2A
        SHIM2B --> APP2B
    end
    subgraph After["After the upgrade"]
        CTNRD3["containerd v1.5"]
        SHIM3A["shim (container A) reconnected"]
        SHIM3B["shim (container B) reconnected"]
        APP3A["nginx keeps running"]
        APP3B["redis keeps running"]
        CTNRD3 -->|"reconnects to socket"| SHIM3A
        CTNRD3 -->|"reconnects to socket"| SHIM3B
        SHIM3A --> APP3A
        SHIM3B --> APP3B
    end
    Before --> During --> After
    style Before fill:#bbdefb,stroke:#1565c0
    style During fill:#fff3e0,stroke:#e65100
    style After fill:#c8e6c9,stroke:#2e7d32

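5. Shim v1 vs v2

5.1 Architecture Comparison

Dimension	Shim v1	Shim v2
Process model	One shim per container	One shim per container
Transport	gRPC over a Unix socket	ttrpc over a Unix socket
Binary	containerd-shim	A runtime-specific shim binary (for example containerd-shim-runc-v2)
Runtime support	runc only	runc/gVisor/Kata/Wasm
Configuration	Command-line arguments	containerd configuration file
IO handling	FIFO + external process	Built-in FIFO handling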

5.2 Improvements in Shim v2

The biggest improvement in shim v2 is support for custom runtimes. Each runtime can ship its own shim binary:

# containerd configuration: register custom runtimes
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"  # gVisor shim

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"   # Kata shim
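How does a runtime_type string turn into a binary? containerd takes the last two dot-separated components and looks for an executable named containerd-shim-<name>-<version> on its PATH. A small Go sketch of that naming rule (the function name `shimBinary` is chosen for this example):

```go
package main

import (
	"fmt"
	"strings"
)

// shimBinary maps a runtime_type such as "io.containerd.runc.v2" to the
// shim executable name built from its last two components.
func shimBinary(runtimeType string) (string, error) {
	parts := strings.Split(runtimeType, ".")
	if len(parts) < 2 {
		return "", fmt.Errorf("invalid runtime type %q", runtimeType)
	}
	name, version := parts[len(parts)-2], parts[len(parts)-1]
	if name == "" || version == "" {
		return "", fmt.Errorf("invalid runtime type %q", runtimeType)
	}
	return "containerd-shim-" + name + "-" + version, nil
}

func main() {
	for _, rt := range []string{
		"io.containerd.runc.v2",
		"io.containerd.runsc.v1",
		"io.containerd.kata.v2",
	} {
		bin, err := shimBinary(rt)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-24s -> %s\n", rt, bin)
	}
}
```

This is why installing a new runtime usually means dropping one extra binary next to containerd and adding a few lines of configuration.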

5.3 The Interface for a Custom Shim

// Shim v2 interface (simplified for illustration; the placeholder types are not containerd's)
package shim

import (
    "context"
    "syscall"
)

// Placeholder types so the interface compiles on its own.
type (
    StartOpts   struct{}
    TaskConfig  struct{}
    Task        interface{}
    ConsoleSize struct{ Width, Height uint32 }
)

type Shim interface {
    // Start the shim
    Start(ctx context.Context, id string, opts StartOpts) (Shim, error)

    // Container operations
    Create(ctx context.Context, task *TaskConfig) (Task, error)
    Delete(ctx context.Context, id string) error

    // Lifecycle
    Wait(ctx context.Context, id string) (uint32, error)
    Kill(ctx context.Context, id string, signal syscall.Signal) error

    // IO
    ResizePty(ctx context.Context, id string, size ConsoleSize) error
    CloseIO(ctx context.Context, id string) error

    // Cleanup
    Shutdown(ctx context.Context) error
}
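To get a feel for what implementing such an interface involves, here is a toy in-memory implementation of a reduced version of it. Everything in this Go sketch is hypothetical (the `miniShim` interface, the `fakeShim` type, the 128+signal exit status convention borrowed from shells); it starts no real processes and only shows how Create, Kill and Wait relate to each other.

```go
package main

import (
	"fmt"
	"sync"
	"syscall"
)

// miniShim is a reduced, hypothetical version of the shim interface.
type miniShim interface {
	Create(id string) error
	Kill(id string, sig syscall.Signal) error
	Wait(id string) (uint32, error)
}

// fakeShim keeps tasks in memory instead of starting real processes.
type fakeShim struct {
	mu    sync.Mutex
	exits map[string]chan uint32 // one buffered exit-status channel per task
}

var _ miniShim = (*fakeShim)(nil) // compile-time interface check

func newFakeShim() *fakeShim {
	return &fakeShim{exits: make(map[string]chan uint32)}
}

func (f *fakeShim) Create(id string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, ok := f.exits[id]; ok {
		return fmt.Errorf("task %q already exists", id)
	}
	f.exits[id] = make(chan uint32, 1)
	return nil
}

func (f *fakeShim) lookup(id string) (chan uint32, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	ch, ok := f.exits[id]
	if !ok {
		return nil, fmt.Errorf("task %q not found", id)
	}
	return ch, nil
}

// Kill records an exit status of 128+signal for the task.
func (f *fakeShim) Kill(id string, sig syscall.Signal) error {
	ch, err := f.lookup(id)
	if err != nil {
		return err
	}
	select {
	case ch <- 128 + uint32(sig):
	default: // an exit status is already pending
	}
	return nil
}

// Wait blocks until the task has exited and returns its status once.
func (f *fakeShim) Wait(id string) (uint32, error) {
	ch, err := f.lookup(id)
	if err != nil {
		return 0, err
	}
	return <-ch, nil
}

// demo creates a task, kills it with SIGKILL and waits for the status.
func demo() (uint32, error) {
	s := newFakeShim()
	if err := s.Create("mynginx"); err != nil {
		return 0, err
	}
	if err := s.Kill("mynginx", syscall.SIGKILL); err != nil {
		return 0, err
	}
	return s.Wait("mynginx")
}

func main() {
	status, err := demo()
	fmt.Println(status, err)
}
```

A real shim replaces the channel with an actual wait on the container process, but the contract is the same: Wait must eventually return whatever status the process exited with.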

6. IO Handling in the Shim

6.1 The Path of Container IO

flowchart LR
    subgraph Container["Container process"]
        STDIN["stdin"]
        STDOUT["stdout"]
        STDERR["stderr"]
    end
    subgraph Shim["containerd-shim"]
        FIFO_IN["FIFO: stdin"]
        FIFO_OUT["FIFO: stdout"]
        FIFO_ERR["FIFO: stderr"]
        IO_COPY["IO forwarding"]
    end
    subgraph Client["Client"]
        CLIENT_STDIN["stdin"]
        CLIENT_STDOUT["stdout"]
        CLIENT_STDERR["stderr"]
    end
    CLIENT_STDIN --> FIFO_IN --> STDIN
    STDOUT --> FIFO_OUT --> IO_COPY --> CLIENT_STDOUT
    STDERR --> FIFO_ERR --> IO_COPY --> CLIENT_STDERR
    style Container fill:#ffcdd2,stroke:#c62828
    style Shim fill:#c8e6c9,stroke:#2e7d32
    style Client fill:#bbdefb,stroke:#1565c0

6.2 Where the FIFO Files Live

# List the container's FIFO files
ls /run/containerd/fifo/

# Example output:
# 5a3b2c1d4e5f-stdin
# 5a3b2c1d4e5f-stdout
# 5a3b2c1d4e5f-stderr

# The shim opens these FIFOs and forwards IO through them
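A FIFO is just a named pipe on the filesystem: one side opens it for writing, the other for reading, and bytes flow through. The Go sketch below creates a FIFO in a temporary directory and relays a message through it, with a goroutine standing in for the container side. The file name and helper are made up for the demo, and it needs a Unix-like system because it calls mkfifo.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
)

// relayThroughFIFO creates a named pipe, writes msg into it from one
// goroutine (the "container" side) and reads it back on the calling
// goroutine (the "shim" side).
func relayThroughFIFO(msg string) (string, error) {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "demo-stdout")
	if err := syscall.Mkfifo(path, 0o600); err != nil {
		return "", err
	}

	writeErr := make(chan error, 1)
	go func() {
		// Opening for write blocks until a reader opens the FIFO.
		w, err := os.OpenFile(path, os.O_WRONLY, 0)
		if err != nil {
			writeErr <- err
			return
		}
		_, err = io.WriteString(w, msg)
		w.Close() // closing the write end is what signals EOF to the reader
		writeErr <- err
	}()

	r, err := os.Open(path) // blocks until the writer opens its end
	if err != nil {
		return "", err
	}
	defer r.Close()

	data, err := io.ReadAll(r) // returns once the writer has closed
	if err != nil {
		return "", err
	}
	if err := <-writeErr; err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	out, err := relayThroughFIFO("hello from the container\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Because the pipe lives at a path rather than in an inherited file descriptor, a client can open it long after the container has started, which is what makes attaching to a running container possible.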

Warning

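The shim runs with root privileges, and each container has its own shim. If an attacker escapes the container and gains control of the shim process, they can use the shim's RPC interface to operate on that container (kill, exec, and so on). Where security requirements are strict, consider using user namespaces so that the shim also runs in an unprivileged context.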

7. Hands-On Practice

7.1 Observing Shim Behavior

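#!/bin/bash
# Observe how containerd-shim behaves (run as root, or prefix the ctr commands with sudo)

# 1. Start a container (pull the image first, since ctr run does not pull it automatically)
ctr images pull docker.io/library/nginx:alpine
ctr run -d docker.io/library/nginx:alpine mynginx /bin/sh -c "while true; do echo hello; sleep 1; done"

# 2. Find the shim process and note its PID
ps aux | grep "[c]ontainerd-shim.*mynginx"

# 3. Look at the shim's socket
ls -la /run/containerd/s/

# 4. Look at the container's FIFOs
ls -la /run/containerd/fifo/

# 5. Restart containerd
sudo systemctl restart containerd

# 6. Verify the container is still running
ctr tasks list
# TASK       PID     STATUS
# mynginx    12345   RUNNING  ← container unaffected

# 7. Check the shim process again (same PID as in step 2: the shim was never restarted)
ps aux | grep "[c]ontainerd-shim.*mynginx"

# 8. Clean up (SIGKILL, because the shell loop runs as PID 1 and ignores the default SIGTERM)
ctr tasks kill -s SIGKILL mynginx
sleep 1
ctr tasks delete mynginx
ctr containers delete mynginx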

7.2 Tracing Shim and runc Interaction With strace

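#!/bin/bash
# Trace the shim as it calls runc (the mynginx container from 7.1 must still be running)

# 1. Find the shim process
SHIM_PID=$(pgrep -f "containerd-shim.*mynginx")

# 2. Trace the shim and its children (runc)
sudo strace -f -p "$SHIM_PID" -e trace=execve,clone,wait4 -o /tmp/shim-strace.log &
sleep 1  # give strace time to attach

# 3. Operate on the container through containerd
ctr tasks kill -s SIGUSR1 mynginx

# 4. Stop tracing and inspect the result
sudo pkill -f "strace -f -p $SHIM_PID"
grep "execve" /tmp/shim-strace.log
# You should see the shim invoking runc kill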

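8. Chapter Summary

The previous chapter covered containerd's architecture; this chapter looked at the shim that sits between containerd and runc.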

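Feature	Description
Decoupling	The shim decouples containerd from container processes
Upgrade safety	Upgrading or restarting containerd does not affect running containers
Process supervision	The shim is the parent of the container process and collects its exit status
IO forwarding	The shim forwards the container's stdin/stdout/stderr through FIFOs
Custom runtimes	Shim v2 supports runtime-specific shim binaries (gVisor/Kata/Wasm)
Transport	The shim talks to containerd over gRPC or ttrpc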