Kubernetes Operations in Practice: A Deep Dive
2024-03-02

1. Cluster Management

1.1 Cluster Component Management

graph TB
  subgraph "Control Plane"
    A["kube-apiserver"] --> B["etcd"]
    C["kube-scheduler"] --> A
    D["kube-controller-manager"] --> A
  end
  subgraph "Node"
    E["kubelet"] --> F["kube-proxy"]
    E --> G["Container Runtime"]
  end
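| Component | Key metrics | Check interval |
|---|---|---|
| etcd | Disk latency, election timeouts | Every minute |
| apiserver | Request latency, error rate | Every minute |
| scheduler | Scheduling latency, Pending Pods | Every minute |
| kubelet | PLEG duration, container status | Every minute |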
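# Check control plane component status
# (componentstatus has been deprecated since v1.19; prefer the Pod listing)
kubectl get componentstatus
kubectl get pods -n kube-system
# Check etcd health (on kubeadm clusters etcdctl also needs the --cacert/--cert/--key flags)
kubectl exec -n kube-system etcd-<node> -- etcdctl endpoint health
# Check API server request latency
kubectl get --raw '/metrics' | grep apiserver_request_duration_seconds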
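
The raw histogram output from the last command is hard to read at a glance. As a rough sketch, the `_sum` and `_count` series can be folded into a mean latency per verb. The function name `mean_latency_by_verb` is made up for this example, and the parser assumes the plain Prometheus text format with a `verb` label on each series.

```python
import re
from collections import defaultdict

# Matches e.g. apiserver_request_duration_seconds_sum{verb="GET",...} 12.5
_SAMPLE = re.compile(
    r'^apiserver_request_duration_seconds_(sum|count)\{([^}]*)\}\s+(\S+)$'
)
_VERB = re.compile(r'verb="([^"]*)"')


def mean_latency_by_verb(metrics_text):
    """Return {verb: mean request latency in seconds} from /metrics text."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0.0})
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments, bucket series, and unrelated metrics
        kind, labels, value = match.groups()
        verb = _VERB.search(labels)
        if verb:
            totals[verb.group(1)][kind] += float(value)
    # Skip verbs without any observed requests to avoid dividing by zero.
    return {
        verb: t["sum"] / t["count"]
        for verb, t in totals.items()
        if t["count"] > 0
    }
```

Feed it the output of `kubectl get --raw '/metrics'`. A mean hides tail latency, so percentile queries in Prometheus remain the better tool for alerting.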

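1.2 Node Management

# Node maintenance
# 1. Evict Pods (drain also cordons the node)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# 2. Mark the node unschedulable without evicting anything
kubectl cordon <node-name>
# 3. End maintenance
kubectl uncordon <node-name>
# Show node resource allocation
kubectl describe node <node-name> | grep -A 5 "Allocated resources"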
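
When the same maintenance steps are repeated across many nodes, it can help to generate the commands rather than type them. The following is only a sketch: `maintenance_commands` and `run` are hypothetical helper names, and the dry-run default is an assumption chosen so that nothing is evicted by accident.

```python
import subprocess


def maintenance_commands(node, action):
    """Return the kubectl argv lists for starting or finishing maintenance."""
    if action == "start":
        # drain cordons the node itself before evicting Pods
        return [["kubectl", "drain", node,
                 "--ignore-daemonsets", "--delete-emptydir-data"]]
    if action == "finish":
        return [["kubectl", "uncordon", node]]
    raise ValueError(f"unknown action: {action!r}")


def run(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

For example, `run(maintenance_commands("worker-1", "start"))` prints the drain command without executing it; pass `dry_run=False` to actually run it.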

2. Backup and Restore

2.1 etcd Backup

# Option 1: snapshot etcd directly
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /tmp/etcd-backup.db
# Option 2: export resources through the API server
# ("get all" covers only common workload types; ConfigMaps, Secrets, CRDs, and RBAC objects are not included)
kubectl get all --all-namespaces -o yaml > /tmp/cluster-backup.yaml
# Scheduled backup script
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/var/backups/k8s"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "${BACKUP_DIR}"
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save "${BACKUP_DIR}/etcd-${DATE}.db"
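
A scheduled backup fills the disk unless old snapshots are pruned. The sketch below selects expired snapshots by parsing the timestamp that the script embeds in each filename. The seven-day retention and the name `expired_snapshots` are assumptions for illustration, not part of the original script.

```python
from datetime import datetime, timedelta

# Filename layout used by the backup script: etcd-<YYYYmmdd_HHMMSS>.db
PREFIX, SUFFIX, STAMP = "etcd-", ".db", "%Y%m%d_%H%M%S"


def expired_snapshots(filenames, now, keep_days=7):
    """Return snapshot filenames older than keep_days, oldest first."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for name in filenames:
        if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
            continue  # not produced by the backup script
        try:
            taken = datetime.strptime(name[len(PREFIX):-len(SUFFIX)], STAMP)
        except ValueError:
            continue  # unexpected name, leave it alone
        if taken < cutoff:
            expired.append((taken, name))
    return [name for _, name in sorted(expired)]
```

The function only decides what to delete. Pair it with `os.listdir` and `os.remove` on the backup directory once the selection looks right.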

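2.2 Restore Procedure

# Restore etcd (single node)
systemctl stop etcd
# Move the old data directory aside instead of deleting it, so it can still be recovered
mv /var/lib/etcd "/var/lib/etcd.bak-$(date +%Y%m%d_%H%M%S)"
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
  --initial-cluster=new-etcd-0=https://192.168.1.10:2380 \
  --initial-advertise-peer-urls=https://192.168.1.10:2380 \
  --name=new-etcd-0 \
  --data-dir=/var/lib/etcd
systemctl start etcd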

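3. Version Upgrades

3.1 Upgrade Strategy

graph TB
  subgraph "Before the upgrade"
    A["Back up etcd"] --> B["Validate in a test environment"]
    B --> C["Prepare a rollback plan"]
  end
  subgraph "Upgrade the control plane"
    D["Upgrade etcd"] --> E["Upgrade kube-apiserver"]
    E --> F["Upgrade controller-manager"]
    F --> G["Upgrade scheduler"]
  end
  subgraph "Upgrade nodes"
    H["Upgrade kubelet"] --> I["Upgrade kube-proxy"]
    I --> J["Verify applications"]
  end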

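3.2 Performing the Upgrade

# List available versions
apt-get update
apt-cache madison kubeadm
# Upgrade the control plane (run on the first control plane node)
apt-get install -y kubeadm=1.28.0-*
kubeadm upgrade plan
kubeadm upgrade apply v1.28.0
# Upgrade each node (drain it first, uncordon it afterwards)
apt-get install -y kubeadm=1.28.0-*
kubeadm upgrade node
apt-get install -y kubelet=1.28.0-*
systemctl daemon-reload
systemctl restart kubelet
# Upgrade add-ons (kubeadm upgrade apply already updates kube-proxy and CoreDNS;
# apply a manifest by hand only if kube-proxy is managed outside kubeadm)
kubectl apply -f kube-proxy.yaml
kubectl rollout status daemonset/kube-proxy -n kube-system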
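
kubeadm upgrades one minor version at a time, so it is worth checking the path before running `kubeadm upgrade apply`. The sketch below assumes plain `vX.Y.Z` version strings (not apt package versions), and the helper names are made up for this example.

```python
def parse_version(version):
    """Parse "v1.28.0" or "1.28.0" into a (major, minor, patch) tuple."""
    major, minor, patch = version.lstrip("v").split(".")[:3]
    return int(major), int(minor), int(patch)


def check_upgrade_path(current, target):
    """Return None if the upgrade is allowed, otherwise a short reason."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt <= cur:
        return "target is not newer than the current version"
    if tgt[0] != cur[0]:
        return "major version changes are not covered by this check"
    if tgt[1] - cur[1] > 1:
        # Skipping a minor version is not supported; go through the next one.
        return f"upgrade to v{cur[0]}.{cur[1] + 1} first; minor versions cannot be skipped"
    return None
```

For instance, `check_upgrade_path("v1.26.4", "v1.28.0")` returns a message pointing at v1.27, while a 1.27 to 1.28 upgrade returns `None`.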

4. Resource Quota Management

4.1 ResourceQuota and LimitRange

# ResourceQuota: namespace-level resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: my-namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "100"
    services: "20"
---
# LimitRange: per-container defaults and bounds
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: my-namespace
spec:
  limits:
    - max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: "100m"
        memory: 128Mi
      default:
        cpu: "500m"
        memory: 512Mi
      defaultRequest:
        cpu: "200m"
        memory: 256Mi
      type: Container
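
To see how the LimitRange above affects a container, the following sketch applies its defaults and then checks the min/max bounds. It is a simplified model of the admission behavior, not the actual implementation; `to_number` and `apply_limit_range` are hypothetical names, and only the quantity suffixes used in this article are handled.

```python
from decimal import Decimal

# Only the suffixes that appear in the manifests above are supported.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "m": Decimal("0.001")}

LIMIT_RANGE = {  # values copied from the LimitRange manifest
    "max": {"cpu": "4", "memory": "8Gi"},
    "min": {"cpu": "100m", "memory": "128Mi"},
    "default": {"cpu": "500m", "memory": "512Mi"},
    "defaultRequest": {"cpu": "200m", "memory": "256Mi"},
}


def to_number(quantity):
    """Convert "500m", "512Mi", or "4" to a Decimal (cores or bytes)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[:-len(suffix)]) * factor
    return Decimal(quantity)


def apply_limit_range(resources, limit_range=LIMIT_RANGE):
    """Fill in missing requests/limits, then check them against min/max."""
    given_limits = resources.get("limits", {})
    given_requests = resources.get("requests", {})
    requests, limits, errors = {}, {}, []
    for name in ("cpu", "memory"):
        limits[name] = given_limits.get(name, limit_range["default"][name])
        if name in given_requests:
            requests[name] = given_requests[name]
        elif name in given_limits:
            # An explicit limit without a request: the request follows the limit.
            requests[name] = given_limits[name]
        else:
            requests[name] = limit_range["defaultRequest"][name]
        if to_number(limits[name]) > to_number(limit_range["max"][name]):
            errors.append(f"{name} limit {limits[name]} exceeds max")
        if to_number(requests[name]) < to_number(limit_range["min"][name]):
            errors.append(f"{name} request {requests[name]} is below min")
        if to_number(requests[name]) > to_number(limits[name]):
            errors.append(f"{name} request exceeds its limit")
    return {"requests": requests, "limits": limits}, errors
```

Calling `apply_limit_range({})` shows what a container with no resources section would receive; a non-empty error list corresponds to a Pod that the API server would reject.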

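4.2 Resource Quota Calculation

from decimal import Decimal

# Multipliers for Kubernetes quantity suffixes (binary, decimal, and milli).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "m": Decimal("0.001"),
}

def parse_resources(quantity):
    """Convert a quantity string such as "500m" or "1Gi" to a Decimal."""
    quantity = str(quantity)
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[:-len(suffix)]) * factor
    return Decimal(quantity)

class ResourceCalculator:
    @staticmethod
    def calculate_node_capacity(node):
        """Return the node's total capacity (CPU in cores, memory in bytes)."""
        return {
            "cpu": parse_resources(node.status.capacity["cpu"]),
            "memory": parse_resources(node.status.capacity["memory"]),
        }

    @staticmethod
    def calculate_allocatable(node):
        """Estimate allocatable resources after a fixed system reservation.

        The kubelet also reports its own figure in node.status.allocatable.
        """
        capacity = ResourceCalculator.calculate_node_capacity(node)
        system_reserved = {
            "cpu": "500m",  # reserved for system daemons
            "memory": "1Gi",
        }
        return {
            "cpu": capacity["cpu"] - parse_resources(system_reserved["cpu"]),
            "memory": capacity["memory"] - parse_resources(system_reserved["memory"]),
        }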

5. Log Management

5.1 Logging Architecture

graph LR
  A["Pod logs"] --> B["Node"]
  B --> C["Log agent"]
  C --> D["Log storage"]
  C --> E["Log indexing"]
  E --> F["Log visualization"]
# Fluent Bit configuration
# Notes:
# - "Parser docker" matches Docker's JSON log format; nodes running containerd or CRI-O need the cri parser.
# - HTTP_Passwd "changeme" is a placeholder; supply real credentials from a Secret.
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info
    [INPUT]
        Name         tail
        Path         /var/log/containers/*.log
        Parser       docker
        Tag          kube.*
    [FILTER]
        Name             kubernetes
        Match            kube.*
        Kube_URL         https://kubernetes.default.svc:443
        Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
    [OUTPUT]
        Name             es
        Match            kube.*
        Host             elasticsearch.logging.svc
        Port             9200
        HTTP_User        elastic
        HTTP_Passwd      changeme
        Logstash_Format  On
        Logstash_Prefix  kubernetes
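
To make the pipeline less of a black box, here is a small sketch of the two parsing steps involved: recovering Pod metadata from a file name under `/var/log/containers/`, and decoding one line in Docker's json-file format. It only mimics what the tail input and kubernetes filter do conceptually; the function names are invented for this example.

```python
import json
import re

# /var/log/containers/<pod>_<namespace>_<container>-<64-hex container id>.log
_LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)"
    r"-(?P<id>[0-9a-f]{64})\.log$"
)


def metadata_from_path(path):
    """Extract pod, namespace, and container name from a container log path."""
    match = _LOG_NAME.match(path.rsplit("/", 1)[-1])
    if not match:
        return None  # not a kubelet-managed container log
    return {key: match.group(key) for key in ("pod", "namespace", "container")}


def parse_docker_line(line):
    """Decode one line written by Docker's json-file logging driver."""
    record = json.loads(line)
    return {
        "message": record["log"].rstrip("\n"),
        "stream": record["stream"],
        "time": record["time"],
    }
```

Lines written by containerd or CRI-O use a different, space-separated layout, so `parse_docker_line` would fail on them with a JSON decoding error.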

5.2 Querying Logs

# View Pod logs
kubectl logs <pod-name> --previous # previous container instance
kubectl logs <pod-name> -c <container> # a specific container
# Follow logs in real time
kubectl logs -f <pod-name>
# Show logs with timestamps
kubectl logs <pod-name> --timestamps
# Example Pod that appends log lines to a file on an emptyDir volume
# (kubectl logs shows only stdout/stderr, so file logs need a sidecar or node agent)
apiVersion: v1
kind: Pod
metadata:
  name: logger
spec:
  containers:
    - name: logger
      image: busybox
      args:
        - /bin/sh
        - -c
        - |
          while true; do
            echo "$(date) - Log message" >> /var/log/app.log
            sleep 10
          done
      volumeMounts:
        - name: log
          mountPath: /var/log
  volumes:
    - name: log
      emptyDir: {}

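6. Operations and Monitoring Together

6.1 Day-to-Day Monitoring Essentials

Operations work depends on solid monitoring. Monitoring surfaces problems early and limits the impact of failures:

graph LR
  A["Daily operations"] --> B["Resource monitoring"]
  A --> C["Log analysis"]
  A --> D["Alert response"]
  B --> E["Prometheus"]
  C --> F["Loki/ELK"]
  D --> G["Alertmanager"]
  E --> H["Preventive maintenance"]
  F --> H
  G --> H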

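Key operational metrics

| Scenario | What to monitor | Alert threshold |
|---|---|---|
| Node maintenance | CPU/memory usage, Pod evictions | Usage > 80% |
| Cluster upgrade | API server latency, etcd health | Latency > 100ms |
| Capacity expansion | Resource allocation rate, Pending Pods | Allocation rate > 90% |
| Backup verification | etcd snapshot size, backup freshness | Backup older than 24h |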
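
The thresholds in the table can be expressed as a small lookup, which is roughly what an alerting rule does. This is a sketch only: the metric keys such as `node_usage_percent` are hypothetical names, not real Prometheus metrics, and a production setup would define these rules in Prometheus and Alertmanager instead.

```python
# Hypothetical metric keys mapped to (limit, description) from the table.
THRESHOLDS = {
    "node_usage_percent": (80, "node CPU/memory usage > 80%"),
    "apiserver_latency_ms": (100, "API server latency > 100 ms"),
    "allocation_percent": (90, "resource allocation rate > 90%"),
    "backup_age_hours": (24, "latest etcd backup older than 24 h"),
}


def triggered_alerts(metrics):
    """Return a description for every threshold the readings exceed."""
    alerts = []
    for name, (limit, description) in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{description} (current: {value})")
    return alerts
```

Readings that are missing from `metrics` are skipped rather than treated as healthy or failing, which keeps partial data from producing false alerts.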

Tip: for the complete monitoring configuration, see Kubernetes Monitoring and Observability.

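6.2 Collaborative Troubleshooting

When a monitoring alert fires, operators need to respond quickly:

# Example quick-diagnosis script
#!/bin/bash
echo "=== Cluster Status ==="
kubectl get nodes
# Hide healthy and completed Pods so only problem Pods remain
kubectl get pods -A | grep -Ev 'Running|Completed' | head -20
echo "=== Resource Usage ==="
kubectl top nodes
# Top 10 Pods by CPU (kubectl top requires metrics-server)
kubectl top pods -A --sort-by=cpu | head -10
echo "=== Recent Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | tail -20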
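
Filtering text output with grep misses Pods that report `Running` while a container is stuck in a crash loop. As a sketch, the JSON output of `kubectl get pods -A -o json` can be inspected instead; `unhealthy_pods` and `fetch_pods` are hypothetical helper names, and the health rules here are deliberately simple.

```python
import json
import subprocess


def unhealthy_pods(pod_list):
    """Return (namespace, name, reason) for Pods that look unhealthy."""
    problems = []
    for pod in pod_list.get("items", []):
        meta, status = pod["metadata"], pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase == "Succeeded":
            continue  # completed Jobs are not a problem
        reason = None if phase == "Running" else phase
        for container in status.get("containerStatuses", []):
            waiting = container.get("state", {}).get("waiting")
            if waiting:  # e.g. CrashLoopBackOff, ImagePullBackOff
                reason = waiting.get("reason", "Waiting")
        if reason:
            problems.append((meta["namespace"], meta["name"], reason))
    return problems


def fetch_pods():
    """Fetch all Pods as a dict via kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With cluster access, `unhealthy_pods(fetch_pods())` gives a compact list to paste into an incident channel.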

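7. Summary

graph TB
  A["Kubernetes operations"] --> B["Cluster management"]
  A --> C["Backup and restore"]
  A --> D["Version upgrades"]
  A --> E["Resource quotas"]
  A --> F["Logging and monitoring"]
  A --> G["Troubleshooting"]
  B --> B1["Component status"]
  B --> B2["Node maintenance"]
  C --> C1["etcd backup"]
  C --> C2["Resource export"]
  D --> D1["Control plane"]
  D --> D2["Worker nodes"]
  F --> F1["Log collection"]
  F --> F2["Log queries"]
  G --> G1["Status checks"]
  G --> G2["Resource analysis"]
  A --> H["Monitoring integration"]
  H --> I["Preventive maintenance"]
  H --> J["Fast response"]

Operations best practices

  • Back up etcd and resource definitions regularly
  • Use drain-based eviction for node maintenance
  • Configure resource quotas to prevent resource exhaustion
  • Collect logs centrally to simplify troubleshooting
  • Build a monitoring and alerting system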

Kubernetes Operations in Practice: A Deep Dive
https://blog.souloss.com/posts/kubernetes/k8s-operations/
Author: Souloss
Published: 2024-03-02
License: CC BY-NC-SA 4.0
