Kubernetes Monitoring and Observability
1. The Three Pillars of Observability
1.1 Metrics, Logs, and Traces
graph TB
subgraph "Three pillars of observability"
A["Metrics"] --> D["Prometheus"]
B["Logs"] --> E["Loki/ELK"]
C["Traces"] --> F["Jaeger/Tempo"]
end
subgraph "Unified visualization"
D --> G["Grafana"]
E --> G
F --> G
end
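| Pillar | Tools | Purpose |
|---|---|---|
| Metrics | Prometheus | Time-series metrics such as resource utilization |
| Logs | Loki/ELK | Log aggregation and analysis |
| Traces | Jaeger/Tempo | Request tracing across services |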
2. Prometheus Architecture
2.1 Components
graph TB
subgraph "Prometheus ecosystem"
A["Prometheus Server"] --> B["Alertmanager"]
A --> C["Pushgateway"]
A --> D["Exporters"]
D --> E["Node Exporter"]
D --> F["Kube-state-metrics"]
D --> G["cAdvisor"]
end
subgraph "Service discovery"
H["Kubernetes SD"] --> A
end
2.2 Deploying on Kubernetes
# Prometheus custom resource, managed by the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s-prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: 15d
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: monitoring
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 2Gi
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: ssd
        resources:
          requests:
            storage: 50Gi
3. Core Metrics
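3.1 Node Metrics
| Metric | Description | Alert threshold |
|---|---|---|
| node_cpu_seconds_total | CPU time per mode; usage is derived from the idle rate | > 80% |
| node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | Memory usage | > 85% |
| node_filesystem_avail_bytes / node_filesystem_size_bytes | Disk usage | > 90% |
| node_network_receive_bytes_total | Network receive rate | Abnormal rate |
| node_network_transmit_bytes_total | Network transmit rate | Abnormal rate |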
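CPU usage is not exported as a ready-made percentage; it is derived from how fast the idle-time counter grows. The arithmetic can be sketched as follows, assuming two scrapes of per-core idle seconds. The function name and the sample values are illustrative and not part of node_exporter.

```python
def cpu_usage_percent(idle_start, idle_end, interval_seconds):
    """Return CPU usage (%) from two scrapes of per-core idle counters.

    idle_start / idle_end: idle seconds per core at the two scrape times.
    """
    if interval_seconds <= 0 or not idle_start or len(idle_start) != len(idle_end):
        raise ValueError("need matching, non-empty samples and a positive interval")
    # Per-core idle rate: fraction of each second the core spent idle.
    idle_rates = [(end - start) / interval_seconds
                  for start, end in zip(idle_start, idle_end)]
    # Average across cores, then turn the idle fraction into a usage percentage.
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Two cores over 60 s: one was idle for 30 s, the other for 54 s.
usage = cpu_usage_percent([1000.0, 2000.0], [1030.0, 2054.0], 60)
print(round(usage, 1))
```

Averaging across cores keeps the result within 0 to 100 regardless of how many cores the node has.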
# CPU usage (%), averaged across all cores of each instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage (%)
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage (%)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
3.2 Pod Metrics
| Metric | Description | Purpose |
|---|---|---|
| kube_pod_container_resource_requests | Resource requests | Basis for scheduling |
| kube_pod_container_resource_limits | Resource limits | Basis for throttling |
| kube_pod_status_phase | Pod phase | Status monitoring |
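Counters such as container_cpu_usage_seconds_total and kube_pod_container_status_restarts_total only go up, except when the exporting process restarts and the counter drops back to zero. Below is a simplified sketch of how an increase over a window can tolerate such resets. It leaves out the extrapolation to window boundaries that Prometheus also performs, and the function name is illustrative.

```python
def counter_increase(samples):
    """Total increase across time-ordered counter samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the counter restarted from zero, so the whole
            # current value counts as new increase.
            total += cur
    return total


samples = [3, 4, 4, 0, 1, 2]  # the counter reset between the 3rd and 4th sample
print(counter_increase(samples))
```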
# Pod CPU usage (cores)
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod, namespace)
# Pod memory usage (bytes)
sum(container_memory_working_set_bytes) by (pod, namespace)
# Pod restarts over the last hour
increase(kube_pod_container_status_restarts_total[1h])
3.3 Kubernetes Component Metrics
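The component queries in this section all rely on histogram_quantile, which estimates a quantile from cumulative bucket counters instead of raw observations. Here is a minimal sketch of that estimation, assuming buckets are given as (upper bound, cumulative count) pairs with at least one finite bucket. It mirrors the linear interpolation idea only and is not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0) if idx == 0 else buckets[idx - 1]
    if count == below:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
print(round(p95, 3))
```

Because the result is interpolated inside a bucket, its accuracy depends on how finely the bucket boundaries are chosen around the latencies of interest.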
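# API Server request latency (p99)
histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))
# Scheduler scheduling latency (p95); newer Kubernetes releases may expose scheduler_scheduling_attempt_duration_seconds instead
histogram_quantile(0.95, sum by (le) (rate(scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
# etcd WAL fsync latency (p99)
histogram_quantile(0.99, sum by (le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
4. Grafana Dashboard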
4.1 Common Dashboards
# Dashboard JSON stored in a ConfigMap for import into Grafana
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-k8s
  namespace: monitoring
data:
  k8s-cluster.json: |
    {
      "dashboard": {
        "title": "Kubernetes Cluster",
        "uid": "k8s-cluster",
        "panels": [
          {
            "title": "CPU usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)",
                "legendFormat": "{{namespace}}"
              }
            ]
          },
          {
            "title": "Memory usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(container_memory_working_set_bytes) by (namespace)",
                "legendFormat": "{{namespace}}"
              }
            ]
          }
        ]
      }
    }
4.2 Key Panel Configuration
# Kubernetes cluster overview
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview
  namespace: monitoring
data:
  dashboard.json: |
    {
      "panels": [
        {
          "title": "Nodes",
          "type": "stat",
          "gridPos": {"h": 8, "w": 6},
          "targets": [
            {"expr": "count(kube_node_info)"}
          ]
        },
        {
          "title": "Total Pods",
          "type": "stat",
          "gridPos": {"h": 8, "w": 6},
          "targets": [
            {"expr": "sum(kube_pod_info)"}
          ]
        },
        {
          "title": "CPU allocation ratio",
          "type": "gauge",
          "gridPos": {"h": 8, "w": 6},
          "targets": [
            {"expr": "sum(kube_pod_container_resource_requests{resource=\"cpu\"}) / sum(kube_node_status_allocatable{resource=\"cpu\"}) * 100"}
          ]
        },
        {
          "title": "Memory allocation ratio",
          "type": "gauge",
          "gridPos": {"h": 8, "w": 6},
          "targets": [
            {"expr": "sum(kube_pod_container_resource_requests{resource=\"memory\"}) / sum(kube_node_status_allocatable{resource=\"memory\"}) * 100"}
          ]
        }
      ]
    }
5. Alerting Rules
5.1 Alert Rule Definitions
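Each rule in this section pairs an expression with a `for:` duration: the alert stays pending until the expression has been true on every evaluation for that long, and only then fires. A simplified sketch of that state machine follows, assuming one boolean result per evaluation. The function and state names are illustrative.

```python
def alert_states(results, eval_interval, for_duration):
    """Map per-evaluation boolean results to inactive/pending/firing states."""
    states = []
    active_since = None  # time at which the expression first became true
    for i, is_true in enumerate(results):
        now = i * eval_interval
        if not is_true:
            # One false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states


# Evaluated every 60 s with for: 5m (300 s).
results = [True] * 3 + [False] + [True] * 7
states = alert_states(results, 60, 300)
print(states)
```

The sample run shows why a short `for:` value makes alerts noisy and a long one delays them: the brief recovery in the middle restarts the full five-minute wait.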
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes
      rules:
        # Kubernetes component alerts
        - alert: K8sApiserverDown
          expr: up{job="kube-apiserver"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API Server is down"
            description: "API Server has been down for more than 5 minutes"

        - alert: K8sNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            description: "Node {{ $labels.node }} has been not ready for 10 minutes"

        # Resource alerts
        - alert: HighCPUUsage
          # Fraction of non-idle CPU time per node_exporter instance
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "Node {{ $labels.instance }} CPU usage is above 80%"

        - alert: HighMemoryUsage
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"

        # Pod alerts
        - alert: PodRestartingTooMuch
          # increase() counts restarts within the window, whereas rate() is per second.
          # The threshold of 3 is an assumed example value; tune it for your workloads.
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting too much"
5.2 Alertmanager Configuration
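The receiver templates in this section use `.Status` and `.GroupLabels.alertname`, which implies that alerts are grouped by name before a notification is rendered. Below is a rough sketch of that grouping and of the resulting title, assuming each alert is a plain dict. That structure is hypothetical and far simpler than Alertmanager's own data model.

```python
from collections import defaultdict


def render_notifications(alerts):
    """Group alerts by alertname and render one (title, body) pair per group."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"]["alertname"]].append(alert)

    notifications = []
    for name, members in groups.items():
        # A group counts as firing while any of its alerts is still firing.
        firing = any(a["status"] == "firing" for a in members)
        status = "firing" if firing else "resolved"
        title = f"[{status.upper()}] {name}"
        body = "\n".join(f"*Alert:* {a['annotations']['summary']}" for a in members)
        notifications.append((title, body))
    return notifications


alerts = [
    {"status": "firing", "labels": {"alertname": "HighCPUUsage"},
     "annotations": {"summary": "High CPU usage on node-1"}},
    {"status": "resolved", "labels": {"alertname": "HighCPUUsage"},
     "annotations": {"summary": "High CPU usage on node-2"}},
    {"status": "resolved", "labels": {"alertname": "K8sNodeNotReady"},
     "annotations": {"summary": "Node node-3 is not ready"}},
]
for title, body in render_notifications(alerts):
    print(title)
    print(body)
```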
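apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-config
  namespace: monitoring
spec:
  # Route alerts to a receiver, grouped by alert name so that
  # .GroupLabels.alertname is populated in the templates below
  route:
    receiver: "default"
    groupBy: ["alertname"]
  receivers:
    - name: "default"
      emailConfigs:
        - to: "ops-team@example.com"
          headers:
            - key: Subject
              value: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
    - name: "slack"
      slackConfigs:
        - channel: "#alerts"
          # The webhook URL (https://hooks.slack.com/xxx) is read from a Secret;
          # the Secret name and key below are hypothetical
          apiURL:
            name: slack-webhook
            key: url
          title: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            {{ end }}
6. Custom Metrics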
6.1 Prometheus Operator
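The operator connects resources through label selectors: serviceMonitorSelector on the Prometheus resource picks ServiceMonitors, and selector.matchLabels on a ServiceMonitor picks Services. Here is a minimal sketch of matchLabels semantics, assuming labels are plain dicts. The helper name and the sample services are illustrative.

```python
def matches(match_labels, labels):
    """True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


services = {
    "my-app": {"app": "my-app", "tier": "backend"},
    "other-app": {"app": "other-app"},
}
selector = {"app": "my-app"}

# Extra labels on the object do not prevent a match; a missing or different one does.
selected = [name for name, labels in services.items() if matches(selector, labels)]
print(selected)
```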
# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    team: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
6.2 Exposing Application Metrics
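What Prometheus scrapes from an application's /metrics path is plain text in the Prometheus exposition format. The following standard-library sketch shows how a labeled counter is laid out in that format. It skips label-value escaping, and real applications would normally let a client library produce this output. The function name is illustrative.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


output = render_counter(
    "http_requests_total",
    "Total HTTP requests",
    {
        (("method", "GET"), ("status", "200")): 3,
        (("method", "GET"), ("status", "500")): 1,
    },
)
print(output, end="")
```

Each distinct label combination becomes its own time series, which is why high-cardinality label values such as user IDs are best avoided.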
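# Exposing Prometheus metrics from a Python application (Flask is assumed here)
from flask import Flask, jsonify
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

ACTIVE_CONNECTIONS = Gauge(
    'http_active_connections',
    'Active HTTP connections'
)

# Use the metrics
@app.route('/api/users')
def get_users():
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time():
        users = []  # placeholder: handle the request here
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return jsonify(users)

# Endpoint scraped by Prometheus (matches path: /metrics in the ServiceMonitor)
@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}
7. Linking Monitoring to Troubleshooting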
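7.1 From Monitoring to Troubleshooting
A solid monitoring system is the foundation of troubleshooting. When an alert fires, the workflow for locating the problem quickly is: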
graph LR
A["Alert fires"] --> B["Open Grafana"]
B --> C["Analyze metric trends"]
C --> D["Correlate logs"]
D --> E["Identify root cause"]
E --> F["Apply fix"]
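Common alerts and where to look:
| Alert | Where to look | Tools |
|---|---|---|
| HighCPUUsage | Application performance, resource quotas | pprof, top |
| HighMemoryUsage | Memory leaks, caching strategy | jmap, pprof |
| PodRestartingTooMuch | Application crashes, health checks | kubectl logs |
| K8sNodeNotReady | Node resources, networking | kubectl describe node |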
7.2 Logs and Traces
When metrics look abnormal, combine logs and traces for deeper analysis:
# View logs of the affected Pod
kubectl logs -n <namespace> <pod-name> --tail=100
# View traces (Jaeger)
# Open the Jaeger UI and look up the full call chain by trace ID
8. Summary
graph TB
A["Monitoring system"] --> B["Metric collection"]
A --> C["Log collection"]
A --> D["Tracing"]
B --> E["Prometheus"]
C --> F["Loki/ELK"]
D --> G["Jaeger"]
E --> H["Alertmanager"]
H --> I["Alert notifications"]
E --> J["Grafana"]
F --> J
G --> J
A --> K["Troubleshooting"]
K --> L["Root cause analysis"]
L --> M["Fast recovery"]
| Component | Role | Key configuration |
|---|---|---|
| Prometheus | Metric collection and storage | ServiceMonitor |
| Grafana | Visualization | Dashboard |
| Alertmanager | Alert management | Route/Receiver |
| Loki | Log aggregation | Promtail |
| Jaeger | Distributed tracing | Instrumentation |
Monitoring best practices:
- USE method (Utilization, Saturation, Errors) for resources
- RED method (Rate, Errors, Duration) for services
- Follow the Four Golden Signals (latency, traffic, errors, saturation)
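As a concrete illustration of the RED method, the sketch below derives Rate, Errors and Duration from a batch of request records. It assumes each record is a (status code, duration in seconds) tuple and treats 5xx responses as errors. Both the record shape and the function name are hypothetical.

```python
import statistics


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for (status_code, duration) records."""
    if not requests:
        return {"rate": 0.0, "error_ratio": 0.0, "p50_duration": None}
    durations = [duration for _, duration in requests]
    errors = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate": len(requests) / window_seconds,       # requests per second
        "error_ratio": errors / len(requests),        # share of failed requests
        "p50_duration": statistics.median(durations), # median latency in seconds
    }


requests = [(200, 0.1), (200, 0.2), (500, 0.9), (200, 0.3)]
summary = red_summary(requests, window_seconds=60)
print(summary)
```

In a Prometheus setup the same three numbers would typically come from a request counter and a latency histogram, not from raw records.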
https://blog.souloss.com/posts/kubernetes/k8s-observability/ (some information may be outdated)