Kubernetes Scheduling and Resource Management
2024-05-01

1. Scheduling Flow

1.1 Scheduling Phases

sequenceDiagram
    participant API as kube-apiserver
    participant SCH as kube-scheduler
    participant Node as kubelet
    API->>SCH: Watch reports new Pod (Pending)
    SCH->>SCH: Filtering
    SCH->>SCH: Scoring
    SCH->>API: Bind Pod to Node
    API->>Node: Notify kubelet
    Node->>Node: Pull image, start containers
    Node->>API: Update Pod status (Running)

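1.2 Scheduling Algorithm

# The names below are the legacy predicate/priority names; current releases
# implement the same checks as scheduling-framework plugins (named in parentheses).
# Filtering phase: rule out nodes that cannot run the Pod
# - PodFitsResources (NodeResourcesFit): does the node have enough unrequested resources?
# - PodFitsHostPorts (NodePorts): are the requested host ports free?
# - HostName (NodeName): does the node match spec.nodeName, if set?
# - MatchNodeSelector (NodeAffinity): does the node match nodeSelector / node affinity?
# - NoDiskConflict (VolumeRestrictions): do the Pod's volumes conflict with ones already mounted?
# Scoring phase: rank the remaining nodes and pick the best one
# - LeastRequestedPriority (NodeResourcesFit, LeastAllocated): prefer nodes with the fewest requested resources
# - BalancedResourceAllocation (NodeResourcesBalancedAllocation): prefer balanced CPU and memory usage
# - ImageLocalityPriority (ImageLocality): prefer nodes that already have the image cached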
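The two phases can be sketched in a few lines of Python. This is a minimal illustration rather than the scheduler's actual code: the `Node` class, the helper names, and the sample figures are all hypothetical, and the score is a simplified take on the least-allocated idea (the average free fraction of CPU and memory after placing the Pod).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    # Hypothetical, simplified view of a node: CPU in millicores, memory in MiB.
    name: str
    cpu_allocatable: int
    mem_allocatable: int
    cpu_requested: int = 0
    mem_requested: int = 0


def fits(node: Node, cpu: int, mem: int) -> bool:
    """Filtering: keep only nodes with enough unrequested CPU and memory."""
    return (node.cpu_allocatable - node.cpu_requested >= cpu
            and node.mem_allocatable - node.mem_requested >= mem)


def least_allocated_score(node: Node, cpu: int, mem: int) -> float:
    """Scoring: average free fraction after placing the Pod, scaled to 0-100."""
    cpu_free = (node.cpu_allocatable - node.cpu_requested - cpu) / node.cpu_allocatable
    mem_free = (node.mem_allocatable - node.mem_requested - mem) / node.mem_allocatable
    return 100 * (cpu_free + mem_free) / 2


def pick_node(nodes: List[Node], cpu: int, mem: int) -> Optional[str]:
    """Filter first, then return the name of the highest-scoring node."""
    feasible = [n for n in nodes if fits(n, cpu, mem)]
    if not feasible:
        return None  # no feasible node: the Pod would stay Pending
    return max(feasible, key=lambda n: least_allocated_score(n, cpu, mem)).name


nodes = [
    Node("node-a", 4000, 8192, cpu_requested=3000, mem_requested=6144),
    Node("node-b", 4000, 8192, cpu_requested=1000, mem_requested=2048),
    Node("node-c", 2000, 4096, cpu_requested=1900, mem_requested=1024),
]
print(pick_node(nodes, cpu=500, mem=512))
```

Here `node-c` is filtered out for lack of CPU, and `node-b` outscores `node-a` because it has more free capacity left.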

2. Node Selection

2.1 nodeSelector

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:
    disktype: ssd        # label key/value pairs; the node must carry all of them
    region: us-east-1

# Label a node
kubectl label nodes node-1 disktype=ssd
kubectl label nodes node-1 region=us-east-1
# Show labels
kubectl get nodes --show-labels
# Remove a label
kubectl label nodes node-1 disktype-

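2.2 Node Affinity

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the node must match
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-east-1a
                  - us-east-1b
      # Soft preference: matching nodes score higher
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: memory
                operator: Gt
                values:
                  - "8"

Operator        Description
In              The label value is in the list
NotIn           The label value is not in the list
Exists          The label key exists
DoesNotExist    The label key does not exist
Gt              The label value, parsed as an integer, is greater than the given value
Lt              The label value, parsed as an integer, is less than the given value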
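To make the operator semantics concrete, here is a small Python sketch of how a single `matchExpressions` entry could be evaluated against a node's labels. The function name and the sample labels are hypothetical. It assumes that `Gt` and `Lt` take exactly one value and compare integers, and that `NotIn` also matches when the key is absent.

```python
from typing import Dict, Sequence


def match_expression(labels: Dict[str, str], key: str, operator: str,
                     values: Sequence[str] = ()) -> bool:
    """Evaluate one node-affinity match expression against node labels."""
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    if operator in ("Gt", "Lt"):
        if key not in labels:
            return False
        try:
            label_value = int(labels[key])
            bound = int(values[0])
        except (ValueError, IndexError):
            return False  # non-integer label or missing value never matches
        return label_value > bound if operator == "Gt" else label_value < bound
    raise ValueError(f"unknown operator: {operator}")


node_labels = {"topology.kubernetes.io/zone": "us-east-1a", "memory": "16"}
print(match_expression(node_labels, "memory", "Gt", ["8"]))
```

The `memory` example shows why the integer comparison matters: as strings, "16" would sort before "8".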

3. Pod Affinity and Anti-Affinity

3.1 Pod Affinity

# Prefer nodes that already run a redis Pod
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: redis
            topologyKey: kubernetes.io/hostname

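3.2 Pod Anti-Affinity (Spreading Replicas)

# Spread Pods across nodes to avoid a single point of failure
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx           # the Pod must carry the label that the selector below matches
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: nginx
          topologyKey: kubernetes.io/hostname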
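The role of `topologyKey` is easier to see in code. The sketch below is a simplified model, not the scheduler's implementation: it treats nodes as label dictionaries and Pods as dictionaries with `node` and `labels` fields (all hypothetical names) and handles only hard (required) rules with `matchLabels`.

```python
from typing import Dict, List, Set


def occupied_domains(nodes: Dict[str, Dict[str, str]], pods: List[dict],
                     selector: Dict[str, str], topology_key: str) -> Set[str]:
    """Topology values (hostnames, zones, ...) that already run a matching Pod."""
    domains = set()
    for pod in pods:
        if all(pod["labels"].get(k) == v for k, v in selector.items()):
            domains.add(nodes[pod["node"]].get(topology_key))
    return domains


def feasible_nodes(nodes: Dict[str, Dict[str, str]], pods: List[dict],
                   selector: Dict[str, str], topology_key: str,
                   anti: bool) -> List[str]:
    """Affinity keeps nodes inside occupied domains; anti-affinity keeps the rest."""
    occupied = occupied_domains(nodes, pods, selector, topology_key)
    return sorted(name for name, labels in nodes.items()
                  if (labels.get(topology_key) in occupied) != anti)


ZONE = "topology.kubernetes.io/zone"
HOST = "kubernetes.io/hostname"
nodes = {
    "node-1": {HOST: "node-1", ZONE: "us-east-1a"},
    "node-2": {HOST: "node-2", ZONE: "us-east-1a"},
    "node-3": {HOST: "node-3", ZONE: "us-east-1b"},
}
pods = [{"node": "node-1", "labels": {"app": "nginx"}}]

print(feasible_nodes(nodes, pods, {"app": "nginx"}, HOST, anti=True))
print(feasible_nodes(nodes, pods, {"app": "nginx"}, ZONE, anti=True))
```

With the hostname key, only `node-1` is excluded. With the zone key, the whole `us-east-1a` zone is excluded, which is why the choice of `topologyKey` determines how far replicas are spread.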

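3.3 Topology Spread Constraints

# Pod topology spread constraints
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx           # counted by the labelSelector in each constraint
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: nginx
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app: nginx

whenUnsatisfiable    Description
DoNotSchedule        Do not schedule the Pod if the constraint cannot be met
ScheduleAnyway       Schedule anyway, preferring nodes that minimize the skew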
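The `maxSkew` check can be sketched as follows. This is a simplified model with hypothetical names: it takes the number of matching Pods per topology domain and returns the domains where one more Pod keeps the skew (domain count minus the smallest domain count) within `maxSkew`. Details such as `minDomains` and domains excluded by node affinity are ignored.

```python
from typing import Dict, List


def allowed_domains(pods_per_domain: Dict[str, int], max_skew: int) -> List[str]:
    """Domains where adding one matching Pod keeps the skew within max_skew."""
    minimum = min(pods_per_domain.values())
    return sorted(domain for domain, count in pods_per_domain.items()
                  if count + 1 - minimum <= max_skew)


zones = {"us-east-1a": 2, "us-east-1b": 1, "us-east-1c": 1}
print(allowed_domains(zones, max_skew=1))
```

With `DoNotSchedule`, nodes in the other domains are filtered out. With `ScheduleAnyway`, they stay eligible but score lower.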

4. Taints and Tolerations

4.1 Configuring Taints

# Add taints
kubectl taint nodes node1 key=value:NoSchedule
kubectl taint nodes node1 dedicated=gpu:NoExecute
kubectl taint nodes node1 temporary=true:NoSchedule
# Show taints
kubectl describe node node1 | grep Taints
# Remove taints (a trailing "-" removes; a bare key removes every taint with that key)
kubectl taint nodes node1 key=value:NoSchedule-
kubectl taint nodes node1 dedicated-

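4.2 Configuring Tolerations

apiVersion: v1
kind: Pod
metadata:
  name: gpu-app
spec:
  tolerations:
    # Tolerate the "dedicated" taint with any value
    - key: "dedicated"
      operator: "Exists"
      effect: "NoSchedule"
    # Tolerate the "dedicated" taint only when its value is "gpu"
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    # Tolerate the not-ready taint for 300 seconds before being evicted
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300
    # Tolerate every taint (empty key with Exists)
    - operator: "Exists"

Taint effect        Description
NoSchedule          New Pods without a matching toleration are not scheduled onto the node
PreferNoSchedule    The scheduler tries to avoid the node but may still use it
NoExecute           Pods without a matching toleration are not scheduled, and running ones are evicted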
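The matching rules between a toleration and a taint can be written down as a short Python sketch. The function names and dictionary shapes are hypothetical. The logic follows these rules: an empty `effect` matches any effect, an empty `key` matches any key, `Exists` ignores the value, and `Equal` (the default operator) compares values.

```python
from typing import Dict, List


def tolerates(toleration: Dict[str, str], taint: Dict[str, str]) -> bool:
    """Return True if this toleration matches the given taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if operator == "Equal":
        return toleration.get("value", "") == taint.get("value", "")
    return False


def schedulable(tolerations: List[Dict[str, str]], taints: List[Dict[str, str]]) -> bool:
    """Every NoSchedule or NoExecute taint needs at least one matching toleration."""
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoExecute"}
print(tolerates({"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}, gpu_taint))
print(tolerates({"operator": "Exists"}, gpu_taint))
```

The first call shows a common pitfall: a toleration whose effect is `NoSchedule` does not cover a `NoExecute` taint with the same key.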

5. Resource Quotas

5.1 ResourceQuota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: default
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "10"
    persistentvolumeclaims: "5"

# Show quota usage
kubectl get resourcequota -o wide
kubectl describe resourcequota compute-quota
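Conceptually, a quota is an admission check: a new object is rejected if current usage plus its own demand would exceed any `hard` value. The sketch below illustrates that arithmetic only. The function name and the sample usage figures are hypothetical, and quantities are given as plain integers (CPU in millicores) to avoid parsing.

```python
from typing import Dict, List


def quota_violations(hard: Dict[str, int], used: Dict[str, int],
                     requested: Dict[str, int]) -> List[str]:
    """Names of quota entries that the new request would push over the limit."""
    return sorted(name for name, amount in requested.items()
                  if name in hard and used.get(name, 0) + amount > hard[name])


hard = {"requests.cpu": 10000, "limits.cpu": 20000, "pods": 50}
used = {"requests.cpu": 9800, "limits.cpu": 12000, "pods": 12}
new_pod = {"requests.cpu": 250, "limits.cpu": 500, "pods": 1}
print(quota_violations(hard, used, new_pod))
```

In this example the Pod would be rejected because the namespace has only 200m of `requests.cpu` left.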

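5.2 LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: limits
  namespace: default
spec:
  limits:
    - type: Container
      max:                     # upper bound for limits
        cpu: "2"
        memory: 1Gi
      min:                     # lower bound for requests
        cpu: "100m"
        memory: 64Mi
      default:                 # default limits when none are set
        cpu: "500m"
        memory: 256Mi
      defaultRequest:          # default requests when none are set
        cpu: "200m"
        memory: 128Mi
      maxLimitRequestRatio:    # maximum limit/request ratio
        cpu: "10"
        memory: "4"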
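The effect of this LimitRange on a container can be sketched in Python: fill in missing requests and limits, then check the bounds and the limit/request ratio. This is a simplified model with hypothetical helper names. The quantity parser handles only the forms used in the manifest (`500m`, `2`, `64Mi`, `1Gi`), and the sketch assumes that a container with an explicit limit but no request gets a request equal to that limit.

```python
from typing import Dict, List

LIMIT_RANGE = {
    "max": {"cpu": "2", "memory": "1Gi"},
    "min": {"cpu": "100m", "memory": "64Mi"},
    "default": {"cpu": "500m", "memory": "256Mi"},
    "defaultRequest": {"cpu": "200m", "memory": "128Mi"},
    "maxLimitRequestRatio": {"cpu": "10", "memory": "4"},
}
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_quantity(quantity: str) -> float:
    """Parse '500m', '2', '64Mi' or '1Gi' into cores or bytes."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(quantity[:-2]) * factor
    return float(quantity)


def apply_defaults(resources: Dict[str, Dict[str, str]], limit_range: dict) -> dict:
    """Fill in missing requests and limits for CPU and memory."""
    given_limits = resources.get("limits", {})
    requests = dict(resources.get("requests", {}))
    limits = dict(given_limits)
    for name in ("cpu", "memory"):
        if name not in requests:
            # Assumption: an explicit limit without a request sets request = limit.
            requests[name] = given_limits.get(name, limit_range["defaultRequest"][name])
        if name not in limits:
            limits[name] = limit_range["default"][name]
    return {"requests": requests, "limits": limits}


def violations(resources: dict, limit_range: dict) -> List[str]:
    """Check min, max and limit/request ratio for fully specified resources."""
    found = []
    for name in ("cpu", "memory"):
        request = parse_quantity(resources["requests"][name])
        limit = parse_quantity(resources["limits"][name])
        if request < parse_quantity(limit_range["min"][name]):
            found.append(f"{name}: request below min")
        if limit > parse_quantity(limit_range["max"][name]):
            found.append(f"{name}: limit above max")
        if limit / request > parse_quantity(limit_range["maxLimitRequestRatio"][name]):
            found.append(f"{name}: limit/request ratio too high")
    return found


bursty = apply_defaults({"requests": {"cpu": "100m"}, "limits": {"cpu": "2"}}, LIMIT_RANGE)
print(violations(bursty, LIMIT_RANGE))
```

The `bursty` container stays inside `min` and `max` but is still rejected, because a 2-core limit on a 100m request is a ratio of 20, above the allowed 10.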

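5.3 Resource Requests and Limits

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx:1.25
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"

# requests: what the scheduler uses to place the Pod
# limits: the maximum the container may consume
# QoS classes
# 1. Guaranteed: every container sets CPU and memory limits, and requests equal limits
# 2. Burstable: at least one container sets a request or limit, but the Pod is not Guaranteed
# 3. BestEffort: no container sets any requests or limits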
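The QoS rules can be expressed as a small classifier. The sketch below is illustrative only: `qos_class` is a hypothetical helper that takes one `resources` dictionary per container, considers only CPU and memory, and compares quantities as strings, so it assumes equal amounts are written identically (for example `500m` on both sides, not `0.5`).

```python
from typing import Dict, List


def qos_class(containers: List[Dict[str, Dict[str, str]]]) -> str:
    """Classify a Pod from the resources of its containers."""
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            limit = limits.get(name)
            request = requests.get(name, limit)  # a missing request defaults to the limit
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


nginx = {"requests": {"memory": "64Mi", "cpu": "250m"},
         "limits": {"memory": "128Mi", "cpu": "500m"}}
print(qos_class([nginx]))
```

The nginx Pod above is Burstable because its requests are lower than its limits. Setting both to the same values would make it Guaranteed.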

6. Priority and Preemption

6.1 PriorityClass

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100000
globalDefault: false
description: "High-priority Pods"

6.2 Using a Priority Class

apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: high-priority
  containers:
    - name: app
      image: app:latest

6.3 Preemption

sequenceDiagram
    participant P as New high-priority Pod
    participant SCH as Scheduler
    participant L as Low-priority Pod
    P->>SCH: Request scheduling
    SCH->>SCH: No feasible node
    SCH->>L: Preempt low-priority Pod
    Note over L: Pod is evicted
    SCH->>P: Schedule onto the freed node
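The victim selection in the diagram can be approximated with a greedy sketch for a single node. This is a deliberate simplification and not the scheduler's real algorithm (which also weighs PodDisruptionBudgets and tries to spare as many Pods as possible). The function name, the dictionary fields, and the CPU-only accounting are all hypothetical.

```python
from typing import Dict, List, Optional


def find_victims(pods_on_node: List[Dict], free_cpu: int,
                 needed_cpu: int, priority: int) -> Optional[List[str]]:
    """Evict the lowest-priority Pods first until the new Pod fits.

    Returns the names of the victims, or None if preemption on this
    node cannot free enough CPU. Only Pods with a strictly lower
    priority than the incoming Pod may be evicted.
    """
    candidates = sorted((p for p in pods_on_node if p["priority"] < priority),
                        key=lambda p: p["priority"])
    victims = []
    for pod in candidates:
        if free_cpu >= needed_cpu:
            break
        victims.append(pod["name"])
        free_cpu += pod["cpu"]
    return victims if free_cpu >= needed_cpu else None


running = [
    {"name": "batch", "priority": 0, "cpu": 1000},
    {"name": "web", "priority": 1000, "cpu": 500},
    {"name": "db", "priority": 200000, "cpu": 1000},
]
print(find_victims(running, free_cpu=200, needed_cpu=1000, priority=100000))
```

A Pod using the `high-priority` class (value 100000) can displace `batch` and `web` here, but never `db`, whose priority is higher.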

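7. Scheduler Configuration

7.1 Configuring the Scheduler

# The v1beta3 API is deprecated; the v1 API is available from Kubernetes 1.25
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated   # prefer nodes with the fewest requested resources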

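7.2 Multiple Schedulers

# Store the custom scheduler's configuration in a ConfigMap
kubectl create configmap my-scheduler-config \
  --from-file=kube-scheduler-config.yaml \
  -n kube-system
# Create a Pod that uses the custom scheduler
# (pod.yaml must set spec.schedulerName to the custom scheduler's name)
kubectl create -f pod.yaml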

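8. Scheduling Optimization in Practice

8.1 Resource Reservation

# Reserve resources for system components
# (on each node this is set through the kubelet's kubeReserved and systemReserved settings)
# kube-apiserver: at least 500m CPU, 256Mi memory
# kubelet: at least 500m CPU, 500Mi memory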

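8.2 Affinity Strategies

Strategy                  Scenario
Node affinity             Specific hardware requirements (GPU, SSD)
Pod affinity              Keeping related services close together for fast communication
Pod anti-affinity         Spreading replicas for high availability
Taints and tolerations    Dedicated nodes for special workloads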

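8.3 Scheduling Decision Factors

# Factors the scheduler takes into account
# 1. Resource requests (CPU, memory, GPU)
# 2. Affinity and anti-affinity rules
# 3. Taints and tolerations
# 4. Priority
# 5. Topology spread constraints
# 6. Toleration duration (tolerationSeconds)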


Kubernetes Scheduling and Resource Management
https://blog.souloss.com/posts/interview/kubernetes-scheduling-and-resource-management/
Author: Souloss
Published: 2024-05-01
License: CC BY-NC-SA 4.0
