Kubernetes Pod Creation Flow: From kubectl to a Running Container
2024-01-25

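You run kubectl run nginx --image=nginx, and in under two seconds the terminal prints pod/nginx created. That message only confirms that the request passed authentication, authorization, and admission control and that the Pod object was written to etcd. The rest happens asynchronously: the Scheduler picks a node, the Kubelet on that node pulls the image, and the container runtime creates the containers. When a Pod belongs to a workload such as a Deployment, controllers in the Controller Manager create the Pod object in the first place. At least seven components take part, and this article walks through the complete flow.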

Kubernetes Architecture Overview

flowchart TB
    subgraph Client
        A[kubectl]
    end
    subgraph Control Plane
        B[API Server]
        C[etcd]
        D[Controller Manager]
        E[Scheduler]
    end
    subgraph Node 1
        F[Kubelet]
        G[Container Runtime]
        H[kube-proxy]
        I[Pod 1]
        J[Pod 2]
    end
    subgraph Node 2
        K[Kubelet]
        L[Container Runtime]
        M[kube-proxy]
        N[Pod 3]
    end
    A --> B
    B <--> C
    D --> B
    E --> B
    F --> B
    K --> B
    F --> G --> I
    F --> G --> J
    K --> L --> N

1. kubectl Processing

1.1 Command Parsing

flowchart TB
    A["kubectl run nginx --image=nginx"] --> B[Parse arguments]
    B --> C[Generate Pod definition]
    C --> D[Validate resource]
    D --> E[Send to API Server]

# Pod definition generated by kubectl
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    run: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always

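1.2 Client-Side Validation

// How kubectl creates the resource (simplified sketch, not the actual kubectl source)
func (o *RunOptions) Run() error {
	// 1. Build the Pod object from the command-line flags
	pod := o.createPod()
	// 2. Client-side validation; the API Server validates the object again
	if errs := validation.ValidatePod(pod); len(errs) > 0 {
		return errs.ToAggregate()
	}
	// 3. Send the Pod to the API Server; only the error is returned to the caller
	_, err := o.Client.CoreV1().Pods(o.Namespace).Create(
		context.TODO(), pod, metav1.CreateOptions{})
	return err
}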

2. API Server Processing

2.1 Request Handling Flow

sequenceDiagram
    participant K as kubectl
    participant A as API Server
    participant E as etcd
    participant W as Webhook
    participant AD as Admission
    K->>A: POST /api/v1/namespaces/default/pods
    A->>A: Authentication
    A->>A: Authorization
    A->>AD: Admission control
    AD->>W: Mutating Webhook
    W-->>AD: Mutated object
    AD->>AD: Validating admission controllers
    AD->>W: Validating Webhook
    W-->>AD: Validation result
    AD->>A: Admission complete
    A->>E: Store in etcd
    E-->>A: Stored
    A-->>K: Return Pod object

2.2 Authentication and Authorization

flowchart TB
    A[Request arrives] --> B{Authentication}
    B -->|Failure| C[401 Unauthorized]
    B -->|Success| D{Authorization}
    D -->|Denied| E[403 Forbidden]
    D -->|Allowed| F[Admission control]
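The order of these checks can be illustrated with a small HTTP handler. This is a minimal sketch, not API Server code: the `users` and `canCreatePods` maps and the token strings are hypothetical stand-ins for real authenticators and authorizers, and the raw header value is used as the token for brevity.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Hypothetical in-memory stand-ins for authenticators (certificates, tokens)
// and authorizers (RBAC, Node, ...).
var users = map[string]string{"token-alice": "alice", "token-bob": "bob"}
var canCreatePods = map[string]bool{"alice": true}

// handlePodCreate applies the checks in order: authenticate, then authorize,
// then hand the request over to admission control.
func handlePodCreate(w http.ResponseWriter, r *http.Request) {
	user, ok := users[r.Header.Get("Authorization")]
	if !ok {
		http.Error(w, "Unauthorized", http.StatusUnauthorized) // 401: unknown identity
		return
	}
	if !canCreatePods[user] {
		http.Error(w, "Forbidden", http.StatusForbidden) // 403: known identity, no permission
		return
	}
	w.WriteHeader(http.StatusCreated) // admission control and storage would follow here
}

// statusFor sends a fake Pod creation request and returns the HTTP status code.
func statusFor(token string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/namespaces/default/pods", nil)
	if token != "" {
		req.Header.Set("Authorization", token)
	}
	rec := httptest.NewRecorder()
	handlePodCreate(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""), statusFor("token-bob"), statusFor("token-alice"))
}
```

An anonymous request stops at authentication, an authenticated user without permission stops at authorization, and only the third request would reach admission control.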

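Authentication methods:

| Method | Description |
| --- | --- |
| X509 certificate | Client certificate authentication |
| Bearer Token | Token sent in the Authorization header, for example a JWT |
| Service Account | Identity used by processes running inside a Pod |
| OIDC | External identity provider |

Authorization modes:

| Mode | Description |
| --- | --- |
| Node | Authorizes requests made by kubelets |
| ABAC | Attribute-based access control |
| RBAC | Role-based access control |
| Webhook | Delegates the decision to an external service |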

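2.3 Admission Control

Admission controller chain:

flowchart LR
    A[Request object] --> B[NamespaceExists]
    B --> C[NamespaceLifecycle]
    C --> D[LimitRanger]
    D --> E[ServiceAccount]
    E --> F[NodeRestriction]
    F --> G[MutatingAdmissionWebhook]
    G --> H[ValidatingAdmissionWebhook]
    H --> I[ResourceQuota]
    I --> J[Store in etcd]

Common admission controllers:

| Controller | Purpose |
| --- | --- |
| NamespaceLifecycle | Rejects requests in namespaces that do not exist or are terminating |
| LimitRanger | Applies default requests and limits and enforces LimitRange constraints |
| ServiceAccount | Assigns a service account and mounts its credentials |
| NodeRestriction | Limits what a kubelet may modify |
| PodSecurityPolicy | Pod security policy (deprecated in v1.21, removed in v1.25) |
| ResourceQuota | Enforces namespace resource quotas |
| DefaultStorageClass | Sets the default storage class on PersistentVolumeClaims |
| MutatingAdmissionWebhook | Calls mutating webhooks |
| ValidatingAdmissionWebhook | Calls validating webhooks |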
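The two-phase shape of admission (mutate first, validate afterwards) can be sketched in a few lines. Everything here is illustrative: the `Pod` struct, `defaultServiceAccount`, and `cpuQuota` are hypothetical simplifications that only loosely imitate the ServiceAccount and ResourceQuota plugins.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a tiny stand-in for the real Pod type.
type Pod struct {
	Name            string
	ServiceAccount  string
	CPURequestMilli int
}

type mutator func(*Pod)
type validator func(Pod) error

// admit runs every mutator first and then every validator.
// The first validation error rejects the request.
func admit(p Pod, mutators []mutator, validators []validator) (Pod, error) {
	for _, m := range mutators {
		m(&p)
	}
	for _, v := range validators {
		if err := v(p); err != nil {
			return Pod{}, err
		}
	}
	return p, nil
}

// defaultServiceAccount fills in a missing service account name.
func defaultServiceAccount(p *Pod) {
	if p.ServiceAccount == "" {
		p.ServiceAccount = "default"
	}
}

// cpuQuota rejects Pods whose CPU request exceeds a hypothetical limit.
func cpuQuota(limitMilli int) validator {
	return func(p Pod) error {
		if p.CPURequestMilli > limitMilli {
			return errors.New("pod exceeds the CPU quota")
		}
		return nil
	}
}

func main() {
	mutators := []mutator{defaultServiceAccount}
	validators := []validator{cpuQuota(1000)}

	pod, err := admit(Pod{Name: "nginx", CPURequestMilli: 500}, mutators, validators)
	fmt.Println(pod.ServiceAccount, err)

	_, err = admit(Pod{Name: "big", CPURequestMilli: 4000}, mutators, validators)
	fmt.Println(err)
}
```

Running validators after all mutators means they always see the final form of the object, which is why quota-style checks sit at the end of the chain.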

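2.4 Storing in etcd

// API Server storage interface (simplified)
type Registry interface {
	Create(ctx context.Context, obj runtime.Object) error
	Update(ctx context.Context, obj runtime.Object) error
	Get(ctx context.Context, name string) (runtime.Object, error)
	Delete(ctx context.Context, name string) error
}

// etcd-backed implementation (simplified)
func (s *store) Create(ctx context.Context, obj runtime.Object) error {
	key := s.keyFunc(obj) // for example /registry/pods/default/nginx
	data, err := s.codec.Encode(obj)
	if err != nil {
		return err
	}
	// The real implementation uses a transaction that succeeds only if the key does not exist yet
	_, err = s.client.Put(ctx, key, string(data))
	return err
}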
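A create differs from a plain put in that it must fail when the key already exists. The sketch below shows that semantic with an in-memory map; `memStore` and `ErrKeyExists` are hypothetical names, and a mutex stands in for the etcd transaction.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrKeyExists is returned when a create hits an existing key.
var ErrKeyExists = errors.New("key already exists")

// memStore is an in-memory stand-in for etcd with a global revision counter.
type memStore struct {
	mu       sync.Mutex
	data     map[string]string
	revision int64
}

func newMemStore() *memStore {
	return &memStore{data: map[string]string{}}
}

// Create stores value under key only if the key is absent and returns the
// revision assigned to the write.
func (s *memStore) Create(key, value string) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.data[key]; exists {
		return 0, ErrKeyExists
	}
	s.revision++
	s.data[key] = value
	return s.revision, nil
}

func main() {
	s := newMemStore()
	rev, err := s.Create("/registry/pods/default/nginx", `{"kind":"Pod"}`)
	fmt.Println(rev, err)

	_, err = s.Create("/registry/pods/default/nginx", `{"kind":"Pod"}`)
	fmt.Println(err)
}
```

The second call fails, which corresponds to the AlreadyExists error you get when running kubectl run nginx twice in the same namespace.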

3. Controller Manager

3.1 The Controller Pattern

flowchart TB
    A[Informer] -->|Watch| B[API Server]
    B -->|Events| A
    A -->|Enqueue| C[WorkQueue]
    C -->|Dequeue| D[Controller]
    D -->|Process| E[Reconcile]
    E -->|Update| B
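The loop in the diagram can be sketched with a buffered channel acting as the work queue. This is an assumption-laden toy, not client-go: the `cluster` type and `run` function are hypothetical, and the real WorkQueue additionally deduplicates keys and rate-limits retries.

```go
package main

import "fmt"

// cluster holds desired and actual replica counts keyed by ReplicaSet name.
type cluster struct {
	desired map[string]int
	actual  map[string]int
}

// reconcile moves the actual state to the desired state for one key and
// reports how many Pods were created (positive) or deleted (negative).
func (c *cluster) reconcile(key string) int {
	diff := c.desired[key] - c.actual[key]
	c.actual[key] += diff
	return diff
}

// run enqueues one key per event, then lets a single worker drain the queue.
func run(c *cluster, events []string) []string {
	queue := make(chan string, len(events))
	for _, key := range events { // the informer enqueues keys, not full objects
		queue <- key
	}
	close(queue)

	var log []string
	for key := range queue { // the worker dequeues and reconciles
		log = append(log, fmt.Sprintf("%s:%+d", key, c.reconcile(key)))
	}
	return log
}

func main() {
	c := &cluster{
		desired: map[string]int{"web": 3},
		actual:  map[string]int{"web": 0},
	}
	fmt.Println(run(c, []string{"web", "web"}))
}
```

The second event for the same key results in no change because reconcile compares current state with desired state instead of replaying the event. That level-triggered behaviour is what makes controllers tolerant of duplicate or missed events.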

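3.2 Controllers That Create Pods

A bare Pod created with kubectl run skips this stage. The controllers below come into play when the Pod belongs to a Deployment.

// Deployment Controller (simplified; helper names are illustrative)
func (dc *DeploymentController) reconcileDeployment(d *apps.Deployment) error {
	// 1. List the ReplicaSets owned by this Deployment
	rsList := dc.getReplicaSetsForDeployment(d)
	// 2. Find the ReplicaSet that matches the current Pod template
	newRS := dc.findNewReplicaSet(d, rsList)
	// 3. Create it if it does not exist yet; it inherits d.Spec.Replicas
	if newRS == nil {
		return dc.createReplicaSet(d)
	}
	// 4. Otherwise make sure its replica count matches the Deployment
	return dc.scaleReplicaSet(newRS, *d.Spec.Replicas)
}

// ReplicaSet Controller (simplified; helper names are illustrative)
func (rsc *ReplicaSetController) syncReplicaSet(rs *apps.ReplicaSet) error {
	// 1. List the Pods owned by this ReplicaSet
	podList := rsc.getPodsForReplicaSet(rs)
	// 2. Compare the actual count with the desired count
	diff := len(podList) - int(*rs.Spec.Replicas)
	// 3. Create missing Pods or delete surplus ones
	switch {
	case diff < 0:
		for i := 0; i < -diff; i++ {
			if err := rsc.createPod(rs); err != nil {
				return err
			}
		}
	case diff > 0:
		return rsc.deletePods(rs, podList[:diff])
	}
	return nil
}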

3.3 How the Controllers Cooperate

sequenceDiagram
    participant D as Deployment Controller
    participant R as ReplicaSet Controller
    participant A as API Server
    participant E as etcd
    Note over A: Deployment object written to etcd
    A->>D: Deployment Informer event
    D->>A: Create ReplicaSet
    A->>E: Store ReplicaSet
    A->>R: ReplicaSet Informer event
    R->>A: Create Pod
    A->>E: Store Pod

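4. Scheduler

4.1 Scheduling Flow

flowchart TB
    A[Unscheduled Pod] --> B[Received by Informer]
    B --> C[Added to scheduling queue]
    C --> D[Filter / Predicates]
    D --> E{Any feasible node?}
    E -->|No| F[Scheduling fails, Pod stays Pending]
    E -->|Yes| G[Score / Priorities]
    G --> H[Pick the best node]
    H --> I[Bind Pod to node]
    I --> J[Pod spec.nodeName is set]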

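4.2 Filtering (Predicates)

Filtering removes the nodes that cannot run the Pod. The names below come from the legacy predicate-based scheduler; the current scheduling framework implements the same checks as Filter plugins such as NodeResourcesFit, NodePorts, NodeAffinity, TaintToleration, and NodeUnschedulable.

// Predicates (partial list)
var predicates = map[string]Predicate{
	"PodFitsResources":       PodFitsResources,       // enough free resources
	"PodFitsHostPorts":       PodFitsHostPorts,       // requested host ports are free
	"PodMatchNodeSelector":   PodMatchNodeSelector,   // node selector matches
	"PodToleratesNodeTaints": PodToleratesNodeTaints, // taints are tolerated
	"CheckNodeCondition":     CheckNodeCondition,     // node is in a healthy condition
	"CheckNodeUnschedulable": CheckNodeUnschedulable, // node is not marked unschedulable
}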

Resource check example:

# Pod resource requests
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1
    memory: 1Gi
# Node resources
Allocatable:
  cpu: 4
  memory: 8Gi
Allocated:
  cpu: 2.5
  memory: 4Gi
# Free: cpu 1.5, memory 4Gi -> the requests (500m, 512Mi) fit
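The arithmetic above can be written as a small fit check. This is a minimal sketch with hypothetical types (`Resources`, `Node`, `fits`); CPU is expressed in millicores and memory in MiB so that integer math is exact. Note that the scheduler compares requests, not limits, against the node's free capacity.

```go
package main

import "fmt"

// Resources holds CPU in millicores and memory in MiB.
type Resources struct {
	MilliCPU  int64
	MemoryMiB int64
}

// Node tracks what a node can hand out and what running Pods already requested.
type Node struct {
	Allocatable Resources
	Requested   Resources
}

// Free returns the capacity that is not yet claimed by Pod requests.
func (n Node) Free() Resources {
	return Resources{
		MilliCPU:  n.Allocatable.MilliCPU - n.Requested.MilliCPU,
		MemoryMiB: n.Allocatable.MemoryMiB - n.Requested.MemoryMiB,
	}
}

// fits reports whether the Pod's requests fit into the node's free capacity.
func fits(req Resources, n Node) bool {
	free := n.Free()
	return req.MilliCPU <= free.MilliCPU && req.MemoryMiB <= free.MemoryMiB
}

func main() {
	node := Node{
		Allocatable: Resources{MilliCPU: 4000, MemoryMiB: 8192}, // 4 CPU, 8Gi
		Requested:   Resources{MilliCPU: 2500, MemoryMiB: 4096}, // 2.5 CPU, 4Gi
	}
	pod := Resources{MilliCPU: 500, MemoryMiB: 512} // 500m, 512Mi

	fmt.Println(node.Free(), fits(pod, node))
}
```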

4.3 Scoring (Priorities)

Scoring ranks the nodes that passed filtering:

// Priorities (partial list, legacy names)
var priorities = map[string]Priority{
	"LeastRequestedPriority":     LeastRequestedPriority,     // prefer nodes with the least requested resources
	"BalancedResourceAllocation": BalancedResourceAllocation, // prefer balanced CPU and memory usage
	"NodeAffinityPriority":       NodeAffinityPriority,       // node affinity
	"InterPodAffinityPriority":   InterPodAffinityPriority,   // inter-Pod affinity
	"TaintTolerationPriority":    TaintTolerationPriority,    // taint toleration
	"ImageLocalityPriority":      ImageLocalityPriority,      // prefer nodes that already have the image
}

Scoring example:

Node score calculation (weights: LeastRequested 1, BalancedResource 1, ImageLocality 2):
Node A:
  - LeastRequested: 80
  - BalancedResource: 70
  - ImageLocality: 90
  - Total: 80 * 1 + 70 * 1 + 90 * 2 = 330
Node B:
  - LeastRequested: 60
  - BalancedResource: 80
  - ImageLocality: 50
  - Total: 60 * 1 + 80 * 1 + 50 * 2 = 240
Node A is selected
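The weighted sum from this example can be reproduced with a short sketch. The function names `total` and `best` are hypothetical, and visiting nodes in sorted name order is an assumption made only to keep the result deterministic when scores tie.

```go
package main

import (
	"fmt"
	"sort"
)

// total returns the weighted sum of one node's per-plugin scores.
func total(scores, weights map[string]int64) int64 {
	var sum int64
	for plugin, score := range scores {
		sum += score * weights[plugin]
	}
	return sum
}

// best returns the node with the highest weighted score. Nodes are visited
// in sorted name order so that ties resolve deterministically.
func best(nodes map[string]map[string]int64, weights map[string]int64) (string, int64) {
	names := make([]string, 0, len(nodes))
	for name := range nodes {
		names = append(names, name)
	}
	sort.Strings(names)

	bestName, bestScore := "", int64(-1)
	for _, name := range names {
		if s := total(nodes[name], weights); s > bestScore {
			bestName, bestScore = name, s
		}
	}
	return bestName, bestScore
}

func main() {
	weights := map[string]int64{"LeastRequested": 1, "BalancedResource": 1, "ImageLocality": 2}
	nodes := map[string]map[string]int64{
		"Node A": {"LeastRequested": 80, "BalancedResource": 70, "ImageLocality": 90},
		"Node B": {"LeastRequested": 60, "BalancedResource": 80, "ImageLocality": 50},
	}
	name, score := best(nodes, weights)
	fmt.Println(name, score)
}
```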

4.4 Binding

// Bind the Pod to the chosen node
func (sched *Scheduler) bind(assumed *v1.Pod, targetNode string) error {
	binding := &v1.Binding{
		ObjectMeta: assumed.ObjectMeta,
		Target: v1.ObjectReference{
			Kind: "Node",
			Name: targetNode,
		},
	}
	// The API Server sets spec.nodeName on the Pod when it accepts the Binding
	return sched.Client.CoreV1().Pods(assumed.Namespace).Bind(
		context.TODO(), binding, metav1.CreateOptions{})
}

5. Kubelet Processing

5.1 Kubelet Architecture

flowchart TB
    subgraph Kubelet
        A[Pod Manager] --> B[PLEG]
        B --> D[Runtime Manager]
        D --> E[CRI Runtime]
        F[Volume Manager] --> G[Volume Plugins]
        H[Probe Manager] --> I[Health Checks]
    end
    subgraph Container Runtime
        E --> J[containerd]
        J --> K[runc]
    end

5.2 Pod Sync Flow

sequenceDiagram
    participant A as API Server
    participant K as Kubelet
    participant P as Pod Manager
    participant V as Volume Manager
    participant R as Runtime
    A->>K: Pod assigned to this node
    K->>P: Add Pod to the manager
    K->>V: Mount volumes
    V-->>K: Volumes mounted
    K->>R: Create Pod sandbox
    R-->>K: Sandbox created
    K->>R: Create containers
    R-->>K: Containers started
    K->>A: Update Pod status

5.3 Pod Lifecycle Management

// Kubelet Pod sync (simplified; helper names are illustrative)
func (kl *Kubelet) syncPod(pod *v1.Pod) error {
	// 1. Check whether the Pod is allowed to run on this node
	if err := kl.canRunPod(pod); err != nil {
		return err
	}
	// 2. Create the Pod's data directories
	if err := kl.makePodDataDirs(pod); err != nil {
		return err
	}
	// 3. Wait until the volumes are attached and mounted
	if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
		return err
	}
	// 4. Create the Pod sandbox (CRI RunPodSandbox)
	podSandboxID, err := kl.runtimeService.RunPodSandbox(pod)
	if err != nil {
		return err
	}
	// 5. For each container: pull the image, create the container, start it
	for _, container := range pod.Spec.Containers {
		if err := kl.imageManager.PullImage(container.Image); err != nil {
			return err
		}
		containerID, err := kl.createContainer(pod, container, podSandboxID)
		if err != nil {
			return err
		}
		if err := kl.runtimeService.StartContainer(containerID); err != nil {
			return err
		}
	}
	return nil
}

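5.4 PLEG (Pod Lifecycle Event Generator)

// PLEG periodically relists containers and turns state changes into events
func (pleg *GenericPLEG) relist() {
	// 1. Fetch the current state of all Pods from the runtime
	pods := pleg.runtime.GetPods()
	for _, pod := range pods {
		// 2. Look up the state cached by the previous relist
		oldPod := pleg.podRecords[pod.ID]
		// 3. Detect changes
		events := pleg.generateEvents(oldPod, pod)
		// 4. Send the events to the kubelet sync loop
		for _, event := range events {
			pleg.eventChannel <- event
		}
		// 5. Remember the new state for the next relist
		pleg.podRecords[pod.ID] = pod
	}
}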
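The event generation step boils down to diffing two snapshots of container states. The sketch below assumes a hypothetical representation (container ID mapped to a state string) and only covers two event kinds, named after the kubelet's ContainerStarted and ContainerDied events.

```go
package main

import (
	"fmt"
	"sort"
)

// diffStates compares the previous and current container states
// (container ID -> state) and returns one event per observed change.
func diffStates(old, cur map[string]string) []string {
	ids := make([]string, 0, len(cur))
	for id := range cur {
		ids = append(ids, id)
	}
	sort.Strings(ids) // stable event order for the example

	var events []string
	for _, id := range ids {
		if old[id] == cur[id] {
			continue // nothing changed since the last relist
		}
		switch cur[id] {
		case "running":
			events = append(events, id+":ContainerStarted")
		case "exited":
			events = append(events, id+":ContainerDied")
		}
	}
	return events
}

func main() {
	previous := map[string]string{"c1": "running"}
	current := map[string]string{"c1": "exited", "c2": "running"}
	fmt.Println(diffStates(previous, current))
}
```

Because events are derived from snapshots, a container that starts and dies between two relists shows up only as a died event; the relist interval bounds how quickly the kubelet notices changes.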

6. Container Runtime

6.1 The CRI Interface

// CRI service definition (simplified)
service RuntimeService {
	// Sandbox lifecycle; the call that creates and starts a sandbox is RunPodSandbox
	rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
	rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
	rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);
	// Container lifecycle
	rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
	rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
	rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
	rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse);
	rpc ListContainers(ListContainersRequest) returns (ListContainersResponse);
	rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse);
}

6.2 containerd Processing Flow

sequenceDiagram
    participant K as Kubelet
    participant C as containerd
    participant S as containerd-shim
    participant R as runc
    K->>C: RunPodSandbox
    C->>S: Create shim
    S->>R: Create pause container
    R-->>S: Container created
    S-->>C: Sandbox ID
    C-->>K: Sandbox ID
    K->>C: CreateContainer
    C->>S: Create container
    S->>R: Create container
    R-->>S: Container created
    S-->>C: Container ID
    C-->>K: Container ID
    K->>C: StartContainer
    C->>S: Start container
    S->>R: Start container
    R-->>S: Container started
    S-->>C: Success
    C-->>K: Success

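6.3 The Pause Container

# The pause container is the Pod's root container,
# also called the sandbox container.
# Responsibilities:
# 1. Holds the Pod's shared namespaces (network, IPC) so they outlive individual containers
# 2. When the PID namespace is shared, runs as PID 1 for the Pod's processes
# 3. Reaps orphaned (zombie) processes in that case
# Dockerfile of the pause image (simplified)
FROM scratch
ARG ARCH
ADD bin/pause-${ARCH} /pause
ENTRYPOINT ["/pause"]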

7. Network Configuration

7.1 CNI Plugins

CNI configuration file: /etc/cni/net.d/10-flannel.conflist

{
  "name": "cbr0",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true}
    }
  ]
}
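A conflist describes a chain: on ADD, the plugins are invoked in the order they are listed. The sketch below parses the configuration shown above and prints that order. The `confList` struct and `pluginChain` function are hypothetical and decode only the fields needed here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// conflist is the configuration from this section, embedded for the example.
const conflist = `{
  "name": "cbr0",
  "plugins": [
    {"type": "flannel", "delegate": {"hairpinMode": true, "isDefaultGateway": true}},
    {"type": "portmap", "capabilities": {"portMappings": true}}
  ]
}`

// confList decodes only the network name and the plugin types.
type confList struct {
	Name    string `json:"name"`
	Plugins []struct {
		Type string `json:"type"`
	} `json:"plugins"`
}

// pluginChain returns the network name and the plugin types in invocation order.
func pluginChain(raw []byte) (string, []string, error) {
	var c confList
	if err := json.Unmarshal(raw, &c); err != nil {
		return "", nil, err
	}
	types := make([]string, 0, len(c.Plugins))
	for _, p := range c.Plugins {
		types = append(types, p.Type)
	}
	return c.Name, types, nil
}

func main() {
	name, chain, err := pluginChain([]byte(conflist))
	if err != nil {
		fmt.Println("invalid conflist:", err)
		return
	}
	fmt.Println(name, chain)
}
```

Each plugin type is the name of an executable that the runtime looks up in its CNI binary directory, so flannel sets up connectivity first and portmap then adds host port mappings on top.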

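7.2 Network Setup Flow

With a CRI runtime such as containerd, the runtime creates the Pod's network namespace and invokes the CNI plugin while it sets up the sandbox.

sequenceDiagram
    participant K as Container Runtime
    participant C as CNI Plugin
    participant N as Network Namespace
    participant B as Bridge/Veth
    K->>N: Create network namespace
    K->>C: CNI ADD
    C->>B: Create veth pair
    C->>B: Attach to bridge
    C->>N: Configure IP address
    C->>N: Configure routes
    C-->>K: Return network configuration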

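8. Status Updates

8.1 Pod Phase Transitions

stateDiagram-v2
    [*] --> Pending: Created
    Pending --> Running: Scheduled, containers started
    Running --> Succeeded: All containers exited successfully
    Running --> Failed: A container exited with an error
    Pending --> Failed: Rejected by the kubelet or startup failed
    Succeeded --> [*]
    Failed --> [*]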
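The transitions in this diagram can be encoded as a lookup table, which is handy when reasoning about which phase changes are possible. This sketch mirrors only the edges drawn above; the `transitions` table and `canTransition` function are hypothetical and do not cover the Unknown phase.

```go
package main

import "fmt"

// transitions lists the phase changes drawn in the state diagram.
// Succeeded and Failed are terminal and therefore have no entries.
var transitions = map[string][]string{
	"Pending": {"Running", "Failed"},
	"Running": {"Succeeded", "Failed"},
}

// canTransition reports whether a Pod may move directly from one phase to another.
func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("Pending", "Running"))   // allowed edge
	fmt.Println(canTransition("Succeeded", "Running")) // terminal phase, no way back
}
```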

8.2 Status Update Flow

// Kubelet status update (simplified)
func (kl *Kubelet) updateStatus(pod *v1.Pod) error {
	// 1. Collect the container statuses
	containerStatuses := kl.getContainerStatuses(pod)
	// 2. Build the new status on a copy so the cached object is not mutated
	updated := pod.DeepCopy()
	updated.Status = v1.PodStatus{
		Phase:             kl.getPhase(pod, containerStatuses),
		Conditions:        kl.getConditions(pod),
		ContainerStatuses: containerStatuses,
	}
	// 3. Write the status subresource to the API Server
	_, err := kl.kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(
		context.TODO(), updated, metav1.UpdateOptions{})
	return err
}

9. End-to-End Summary

9.1 Sequence Diagram

sequenceDiagram
    participant U as User
    participant K as kubectl
    participant A as API Server
    participant E as etcd
    participant C as Controller
    participant S as Scheduler
    participant N as Kubelet
    participant R as Runtime
    U->>K: kubectl run nginx
    K->>A: POST /api/v1/namespaces/default/pods
    A->>A: Authentication / Authorization / Admission
    A->>E: Store Pod
    Note over E: Pod phase: Pending
    A-->>K: Return Pod object
    K-->>U: pod/nginx created
    C->>A: Watch resource events
    C->>A: Create ReplicaSet (if needed)
    S->>A: Watch unscheduled Pods
    S->>S: Filter / Score
    S->>A: Bind Pod (sets spec.nodeName)
    Note over E: Pod phase: Pending, nodeName: node1
    N->>A: Watch Pods assigned to this node
    N->>N: Mount volumes
    N->>R: Create sandbox
    N->>R: Create containers
    N->>R: Start containers
    N->>A: Update Pod status
    Note over E: Pod phase: Running

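9.2 Key Takeaways

  1. kubectl: builds the Pod definition and sends the request
  2. API Server: authentication, authorization, admission, storage
  3. etcd: persists every resource
  4. Controller: keeps actual state in line with desired state
  5. Scheduler: picks the best node
  6. Kubelet: manages the Pods on its node
  7. Runtime: creates and runs the containers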

https://blog.souloss.com/posts/principles/principles-kubernetes-pod-creation/
Author: Souloss
Published: 2024-01-25
License: CC BY-NC-SA 4.0

Some of the information may be out of date.