Kubernetes Monitoring and Observability
2024-02-17

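1. The Three Pillars of Observability

1.1 Metrics, Logs, and Traces

graph TB
    subgraph "Three Pillars of Observability"
        A["Metrics"] --> D["Prometheus"]
        B["Logs"] --> E["Loki/ELK"]
        C["Traces"] --> F["Jaeger/Tempo"]
    end
    subgraph "Unified Visualization"
        D --> G["Grafana"]
        E --> G
        F --> G
    end

| Pillar | Tool | Purpose |
| --- | --- | --- |
| Metrics | Prometheus | Time-series metrics, resource utilization |
| Logs | Loki/ELK | Log aggregation and analysis |
| Traces | Jaeger | Request tracing |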

2. Prometheus Architecture

2.1 Component Architecture

graph TB
    subgraph "Prometheus Ecosystem"
        A["Prometheus Server"] --> B["Alertmanager"]
        A --> C["Pushgateway"]
        A --> D["Exporters"]
        D --> E["Node Exporter"]
        D --> F["Kube-state-metrics"]
        D --> G["cAdvisor"]
    end
    subgraph "Service Discovery"
        H["Kubernetes SD"] --> A
    end

2.2 Deploying on Kubernetes

# Prometheus custom resource, managed by the Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s-prometheus
  namespace: monitoring
spec:
  replicas: 2
  retention: 15d
  serviceAccountName: prometheus
  # Only ServiceMonitors carrying this label are picked up
  serviceMonitorSelector:
    matchLabels:
      team: monitoring
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 2Gi
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: ssd
        resources:
          requests:
            storage: 50Gi
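
Whether a 50Gi volume covers 15 days of retention depends on the ingestion rate. The Prometheus storage documentation gives a rough capacity formula: retention time in seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. The sketch below works through that arithmetic; the ingestion rate of 10,000 samples/s is an assumed figure for illustration, not a measurement.

```python
# Rough Prometheus disk sizing based on the capacity formula from the
# Prometheus storage documentation. All inputs are assumptions.

SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days: int,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Return the estimated on-disk size in bytes for the given retention."""
    retention_seconds = retention_days * SECONDS_PER_DAY
    return retention_seconds * samples_per_second * bytes_per_sample


# 15 days of retention, an assumed 10,000 samples/s, 2 bytes per sample.
estimate = estimate_disk_bytes(retention_days=15, samples_per_second=10_000)
print(f"estimated size: {estimate / 2**30:.1f} GiB")
```

Under these assumptions the estimate comes to about 24 GiB, so a 50Gi claim would leave headroom for the write-ahead log and compaction. A cluster with a higher ingestion rate needs the numbers redone.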

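3. Core Metrics

3.1 Node Metrics

| Metric | Description | Alert threshold |
| --- | --- | --- |
| node_cpu_seconds_total | CPU time per mode (basis for CPU utilization) | > 80% |
| node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes | Memory utilization | > 85% |
| node_filesystem_avail_bytes / node_filesystem_size_bytes | Disk utilization | > 90% |
| node_network_receive_bytes_total | Network receive rate | Abnormal rate |
| node_network_transmit_bytes_total | Network transmit rate | Abnormal rate |

# CPU utilization (%), averaged across cores per instance
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization (%)
100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk utilization (%)
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100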
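
The CPU figure comes from counters rather than a ready-made percentage, so it helps to see the arithmetic. The sketch below takes the per-second increase of each core's idle counter, averages it across cores, and converts the result to a busy percentage. The sample values and function names are hypothetical, and counter resets (which PromQL's rate() handles) are ignored.

```python
# How a CPU utilization percentage is derived from idle-time counters.
# Each core exposes a monotonically increasing "idle seconds" counter.


def idle_rate(start: float, end: float, window_seconds: float) -> float:
    """Per-second increase of a counter over the window, similar to rate()."""
    return (end - start) / window_seconds


def cpu_utilization_percent(idle_samples: list[tuple[float, float]],
                            window_seconds: float) -> float:
    """Average the idle fraction across cores and convert it to busy percent."""
    rates = [idle_rate(start, end, window_seconds) for start, end in idle_samples]
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


# Hypothetical 5-minute window on a 2-core node:
# core 0 was idle for 240 s, core 1 for 180 s.
samples = [(1000.0, 1240.0), (2000.0, 2180.0)]
print(round(cpu_utilization_percent(samples, 300), 1))
```

Summing the per-core idle rates instead of averaging them would give 1.4 on this node and therefore a negative percentage, which is why the aggregation has to be an average (or a sum divided by the core count).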

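3.2 Pod Metrics

| Metric | Description | Use |
| --- | --- | --- |
| kube_pod_container_resource_requests | Resource requests | Basis for scheduling |
| kube_pod_container_resource_limits | Resource limits | Basis for CPU throttling and OOM kills |
| kube_pod_status_phase | Pod phase | Status monitoring |

# Pod CPU usage in cores; container!="" skips the pod-level cgroup series to avoid double counting
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, namespace)
# Pod memory usage (working set)
sum(container_memory_working_set_bytes{container!=""}) by (pod, namespace)
# Pod restarts over the past hour
increase(kube_pod_container_status_restarts_total[1h])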

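3.3 Kubernetes Component Metrics

# API Server request latency (P99)
histogram_quantile(0.99,
  sum by (le) (rate(apiserver_request_duration_seconds_bucket[5m])))
# Scheduler scheduling latency (P95); newer Kubernetes releases deprecate this
# metric in favor of scheduler_scheduling_attempt_duration_seconds_bucket
histogram_quantile(0.95,
  sum by (le) (rate(scheduler_e2e_scheduling_duration_seconds_bucket[5m])))
# etcd WAL fsync latency (P99)
histogram_quantile(0.99,
  sum by (le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))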
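
All three latency queries rely on histogram_quantile, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified model of that idea, not Prometheus's implementation; the bucket boundaries and counts are made up for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper bound and
    ending with the +Inf bucket, mirroring the `le` label.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency histogram: 50 requests <= 0.1 s, 90 <= 0.5 s, 99 <= 1 s, 100 total.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, latency_buckets), 3))
```

Because the result is interpolated, its precision is limited by the bucket layout: a P99 can never be reported more finely than the width of the bucket it lands in.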

4. Grafana Dashboards

4.1 Common Dashboards

# Import a dashboard via JSON
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-k8s
  namespace: monitoring
data:
  k8s-cluster.json: |
    {
      "dashboard": {
        "title": "Kubernetes Cluster",
        "uid": "k8s-cluster",
        "panels": [
          {
            "title": "CPU Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)",
                "legendFormat": "{{namespace}}"
              }
            ]
          },
          {
            "title": "Memory Usage",
            "type": "graph",
            "targets": [
              {
                "expr": "sum(container_memory_working_set_bytes) by (namespace)",
                "legendFormat": "{{namespace}}"
              }
            ]
          }
        ]
      }
    }

4.2 Key Panel Configuration

# Kubernetes cluster overview
# The allocation expressions use kube-state-metrics v2 names with a resource label.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-overview
  namespace: monitoring
data:
  dashboard.json: |
    {
      "panels": [
        {
          "title": "Nodes",
          "type": "stat",
          "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
          "targets": [
            {"expr": "count(kube_node_info)"}
          ]
        },
        {
          "title": "Total Pods",
          "type": "stat",
          "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
          "targets": [
            {"expr": "sum(kube_pod_info)"}
          ]
        },
        {
          "title": "CPU Allocation (%)",
          "type": "gauge",
          "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
          "targets": [
            {"expr": "sum(kube_pod_container_resource_requests{resource=\"cpu\"}) / sum(kube_node_status_allocatable{resource=\"cpu\"}) * 100"}
          ]
        },
        {
          "title": "Memory Allocation (%)",
          "type": "gauge",
          "gridPos": {"h": 8, "w": 6, "x": 18, "y": 0},
          "targets": [
            {"expr": "sum(kube_pod_container_resource_requests{resource=\"memory\"}) / sum(kube_node_status_allocatable{resource=\"memory\"}) * 100"}
          ]
        }
      ]
    }
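
Grafana places panels on a 24-column grid, and each panel's gridPos takes a position (x, y) in addition to its width and height. When a dashboard is generated rather than hand-written, the positions can be computed. The helper below is a hypothetical sketch of that layout step, filling rows left to right; the function name and defaults are illustrative.

```python
import json

GRID_COLUMNS = 24  # Grafana dashboards use a 24-column grid


def layout_panels(panels: list[dict], width: int = 6, height: int = 8) -> list[dict]:
    """Return copies of the panels with gridPos filled in, wrapping to new rows."""
    per_row = GRID_COLUMNS // width
    placed = []
    for index, panel in enumerate(panels):
        row, col = divmod(index, per_row)
        grid_pos = {"h": height, "w": width, "x": col * width, "y": row * height}
        placed.append({**panel, "gridPos": grid_pos})
    return placed


panels = layout_panels([
    {"title": "Nodes", "type": "stat", "targets": [{"expr": "count(kube_node_info)"}]},
    {"title": "Total Pods", "type": "stat", "targets": [{"expr": "sum(kube_pod_info)"}]},
])
print(json.dumps({"panels": panels}, indent=2))
```

With the default width of 6, four panels fit on a row and the fifth wraps to the next row at y = 8.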

5. Alerting Rules

5.1 Defining Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: k8s-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes
      rules:
        # Kubernetes component alerts
        - alert: K8sApiserverDown
          expr: up{job="kube-apiserver"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "API Server is down"
            description: "API Server has been down for more than 5 minutes"
        - alert: K8sNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            description: "Node {{ $labels.node }} has been not ready for 10 minutes"
        # Resource alerts (node_exporter series are identified by the instance label)
        - alert: HighCPUUsage
          expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "Node {{ $labels.instance }} CPU usage is above 80%"
        - alert: HighMemoryUsage
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"
        # Pod alerts
        - alert: PodRestartingTooMuch
          # More than 3 restarts within an hour; the threshold is illustrative
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting too much"
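
Each rule's `for` clause delays the notification: an alert stays pending while its expression keeps returning true and only fires once that has held for the full duration, and a single false evaluation resets the clock. The sketch below models that state machine over a series of evaluations. It is a simplified illustration with hypothetical names, not Prometheus's rule engine.

```python
def alert_states(results: list[bool], interval_s: int, for_s: int) -> list[str]:
    """Map successive rule evaluations to inactive / pending / firing."""
    states = []
    active_since = None  # time at which the expression first became true
    for step, is_true in enumerate(results):
        now = step * interval_s
        if not is_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Evaluated every 60 s with `for: 5m`: the alert fires on the sixth evaluation.
print(alert_states([True] * 6, interval_s=60, for_s=300))
```

This is why a short spike does not page anyone: with `for: 5m`, a condition that clears after two minutes never leaves the pending state.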

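5.2 Alertmanager Configuration

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-config
  namespace: monitoring
spec:
  # Receivers are only used when a route points at them;
  # grouping by alertname populates .GroupLabels.alertname in the templates
  route:
    receiver: "default"
    groupBy: ["alertname"]
  receivers:
    - name: "default"
      emailConfigs:
        - to: "ops-team@example.com"
          headers:
            - key: subject
              value: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
    - name: "slack"
      slackConfigs:
        - channel: "#alerts"
          # apiURL references a Secret holding the webhook URL
          # (https://hooks.slack.com/xxx); the Secret name and key are placeholders
          apiURL:
            name: slack-webhook
            key: url
          title: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
          text: |
            {{ range .Alerts }}
            *Alert:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            {{ end }}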

6. Custom Metrics

6.1 Prometheus Operator

# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    team: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
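
A ServiceMonitor selects Services in two steps: the Service has to live in one of the namespaces listed under namespaceSelector, and its labels have to contain every key/value pair in selector.matchLabels. The `port: metrics` entry then refers to a named port on the selected Service. The sketch below models only the selection logic with plain dictionaries; the function name and the sample Services are hypothetical.

```python
def service_matches(monitor: dict, service: dict) -> bool:
    """Return True if the Service would be selected by the ServiceMonitor-like spec."""
    wanted = monitor["selector"]["matchLabels"]
    namespaces = monitor["namespaceSelector"]["matchNames"]
    labels = service.get("labels", {})
    # Every wanted label must be present with the same value; extra labels are fine.
    labels_ok = all(labels.get(key) == value for key, value in wanted.items())
    return labels_ok and service["namespace"] in namespaces


monitor_spec = {
    "selector": {"matchLabels": {"app": "my-app"}},
    "namespaceSelector": {"matchNames": ["production"]},
}
prod_service = {"namespace": "production", "labels": {"app": "my-app", "tier": "web"}}
staging_service = {"namespace": "staging", "labels": {"app": "my-app"}}

print(service_matches(monitor_spec, prod_service))
print(service_matches(monitor_spec, staging_service))
```

A mismatch in either step is a common reason for a target not showing up in Prometheus, so both are worth checking before looking at the application itself.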

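6.2 Exposing Metrics from an Application

# Python application exposing Prometheus metrics
# (Flask is assumed here, as implied by the @app.route decorator)
from flask import Flask, Response, jsonify
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)
ACTIVE_CONNECTIONS = Gauge(
    'http_active_connections',
    'Active HTTP connections'
)

# Use metrics
@app.route('/api/users')
def get_users():
    with REQUEST_LATENCY.labels(method='GET', endpoint='/api/users').time():
        # Handle the request (placeholder data)
        users = []
    REQUEST_COUNT.labels(method='GET', endpoint='/api/users', status='200').inc()
    return jsonify(users)

# Expose the metrics endpoint scraped by the ServiceMonitor
@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)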
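
What Prometheus actually scrapes from the application is plain text in the exposition format: a HELP line, a TYPE line, and one line per label combination. The standard-library sketch below renders a counter in that format to show what sits behind the /metrics path. It is an illustration only; a real service would normally rely on a client library, and the function names here are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str,
                   samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
        )
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


print(render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "endpoint": "/api/users", "status": "200"}, 3)],
))
```

Every distinct combination of label values becomes its own time series, which is why unbounded values such as user IDs are usually kept out of labels.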

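7. Linking Monitoring to Troubleshooting

7.1 From Monitoring to Troubleshooting

A solid monitoring setup is the foundation of troubleshooting. When an alert fires, the workflow for locating the problem quickly is:

graph LR
    A["Alert fires"] --> B["Check Grafana"]
    B --> C["Analyze metric trends"]
    C --> D["Correlate logs"]
    D --> E["Identify root cause"]
    E --> F["Apply fix"]

Common alerts and where to look

| Alert | Where to look | Tools |
| --- | --- | --- |
| HighCPUUsage | Application performance, resource quotas | pprof, top |
| HighMemoryUsage | Memory leaks, caching strategy | jmap, pprof |
| PodRestartingTooMuch | Application crashes, health checks | kubectl logs |
| K8sNodeNotReady | Node resources, networking | kubectl describe node |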

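7.2 Logs and Traces

When metrics look abnormal, combine logs and traces for a deeper analysis:

# View logs of the affected Pod
kubectl logs -n <namespace> <pod-name> --tail=100
# View traces (Jaeger)
# Open the Jaeger UI and look up the full call chain by trace ID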

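8. Summary

graph TB
    A["Monitoring stack"] --> B["Metrics collection"]
    A --> C["Log collection"]
    A --> D["Tracing"]
    B --> E["Prometheus"]
    C --> F["Loki/ELK"]
    D --> G["Jaeger"]
    E --> H["Alertmanager"]
    H --> I["Alert notifications"]
    E --> J["Grafana"]
    F --> J
    G --> J
    A --> K["Troubleshooting"]
    K --> L["Root cause analysis"]
    L --> M["Fast recovery"]

| Component | Role | Key configuration |
| --- | --- | --- |
| Prometheus | Metrics collection and storage | ServiceMonitor |
| Grafana | Visualization | Dashboard |
| Alertmanager | Alert management | Route/Receiver |
| Loki | Log aggregation | Promtail |
| Jaeger | Distributed tracing | Instrumentation |

Monitoring best practices

  • USE method (Utilization, Saturation, Errors) for resources
  • RED method (Rate, Errors, Duration) for services
  • Follow the Four Golden Signals (latency, traffic, errors, saturation)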
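
To make the RED method concrete, the sketch below derives its three numbers from a batch of request records: requests per second, the share of failed requests, and a latency percentile. The record layout, the function name, and the choice to count only 5xx responses as errors are assumptions made for this illustration.

```python
import math


def red_summary(requests: list[tuple[float, int]], window_seconds: float) -> dict:
    """Compute Rate, Errors, Duration from (duration_seconds, status_code) records."""
    count = len(requests)
    if count == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_seconds": 0.0}
    errors = sum(1 for _, status in requests if status >= 500)
    durations = sorted(duration for duration, _ in requests)
    # Nearest-rank 95th percentile of the observed durations.
    p95 = durations[math.ceil(0.95 * count) - 1]
    return {
        "rate": count / window_seconds,
        "error_ratio": errors / count,
        "p95_seconds": p95,
    }


# Hypothetical one-minute window with ten requests, two of them failing.
records = [(0.1, 200), (0.2, 200), (0.3, 200), (0.4, 200), (0.5, 200),
           (0.6, 200), (0.7, 200), (0.8, 200), (0.9, 500), (1.0, 503)]
print(red_summary(records, window_seconds=60))
```

In a Prometheus setup the same three numbers would typically come from a request counter and a latency histogram rather than from raw records.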


Kubernetes Monitoring and Observability
https://blog.souloss.com/posts/kubernetes/k8s-observability/
Author: Souloss
Published: 2024-02-17
License: CC BY-NC-SA 4.0
