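Hands-On Capstone: Building an Observability Platform
After 15 chapters of theory, it is time to build. A single Docker Compose file brings up a complete observability platform: the OTel Collector ingests telemetry, Prometheus stores metrics, Tempo stores traces, Loki stores logs, and Grafana handles visualization, with SLO alert rules and signal correlation on top. This is not a toy demo; it is an architectural skeleton you can carry into production.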
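1. Platform Architecture
The full stack covers four storage backends, one per type of observability data. The Compose file in this chapter deploys the first three; Pyroscope is listed for completeness and is not part of the deployment below.
| Backend | Data type | Query language | Retention | Use cases |
|---|---|---|---|---|
| Prometheus | Metrics (time series) | PromQL | 15 days (default) | Alerting, SLOs, dashboards |
| Tempo | Distributed traces | TraceQL | 30 days | Latency analysis, request paths |
| Loki | Structured logs | LogQL | 30 days | Error triage, contextual search |
| Pyroscope | Continuous profiling | Flame graph | 7 days | CPU/memory hotspot analysis |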
1.1 Overall Architecture
graph TB
    subgraph "Application layer"
        APP["Go application<br/>OTel SDK"]
    end
    subgraph "Collection layer"
        OTEL["OTel Collector<br/>receive → process → route"]
    end
    subgraph "Storage layer"
        PROM["Prometheus<br/>metrics storage"]
        TEMPO["Tempo<br/>trace storage"]
        LOKI["Loki<br/>log storage"]
    end
    subgraph "Visualization layer"
        GRAFANA["Grafana<br/>unified querying"]
    end
    APP -->|"OTLP"| OTEL
    OTEL -->|"metrics"| PROM
    OTEL -->|"traces"| TEMPO
    OTEL -->|"logs"| LOKI
    PROM --> GRAFANA
    TEMPO --> GRAFANA
    LOKI --> GRAFANA
    style APP fill:#e3f2fd,stroke:#1565c0
    style OTEL fill:#fff9c4,stroke:#f9a825
    style GRAFANA fill:#e8f5e9,stroke:#2e7d32
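1.2 Component Responsibilities
| Component | Responsibility | Ports |
|---|---|---|
| OTel Collector | Receives telemetry; processes, routes, and samples it | 4317 (gRPC), 4318 (HTTP) |
| Prometheus | Stores and queries metrics | 9090 |
| Tempo | Stores and queries traces | 3200 |
| Loki | Stores and queries logs | 3100 |
| Grafana | Unified visualization, linking the three signals | 3000 |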
2. Docker Compose Deployment
2.1 Full Configuration
version: "3.8"

services:
  # ===== Application =====
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      # The OTLP endpoint variable expects a URL that includes the scheme
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_SERVICE_NAME=demo-app
    depends_on:
      - otel-collector

  # ===== Collection layer =====
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - prometheus
      - tempo
      - loki

  # ===== Storage layer =====
  prometheus:
    image: prom/prometheus:v2.52.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Accept remote-write pushes from the Collector and from Tempo
      - --web.enable-remote-write-receiver
      # Store exemplars so that metrics can link to traces
      - --enable-feature=exemplar-storage
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
      # rule_files in prometheus.yaml resolves relative to /etc/prometheus
      - ./alert_rules.yaml:/etc/prometheus/alert_rules.yaml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  tempo:
    image: grafana/tempo:2.4.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo_data:/var/tempo
    ports:
      - "3200:3200"   # Tempo API

  loki:
    image: grafana/loki:2.9.4
    command: ["-config.file=/etc/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki.yaml
      - loki_data:/loki
    ports:
      - "3100:3100"

  # ===== Visualization layer =====
  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
      - ./grafana-dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml
      - ./dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
      - tempo
      - loki

volumes:
  prometheus_data:
  tempo_data:
  loki_data:
2.2 OTel Collector Configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    # The tail_sampling processor reads its rules from the `policies` key
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: everything_else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/tempo:
    endpoint: http://tempo:4318
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      # Sample first, then batch what survives
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
Note
The tail_sampling processor implements tail-based sampling: traces that contain an error or take longer than 1000 ms are always kept, and 10% of the remaining traces are sampled. This keeps the traces that matter while bounding storage cost. In the traces pipeline, tail_sampling runs before batch so that the sampler sees spans before they are regrouped into batches.
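The decision logic described in the note can be sketched in a few lines of Go. This is only an illustration of the three policies, not the Collector's implementation: `TraceSummary`, `keepTrace`, and the hash-based 10% selection are hypothetical names and choices made for this sketch. In the real processor a trace is kept when any policy matches; the ordering here only determines which policy name is reported.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// TraceSummary is a hypothetical, simplified view of a finished trace.
type TraceSummary struct {
	TraceID    string
	HasError   bool
	DurationMs int64
}

// keepTrace applies the three policies from the note: keep errors, keep
// slow traces, and keep a fixed percentage of everything else.
// It returns the decision and the name of the policy that made it.
func keepTrace(t TraceSummary, slowMs int64, percent uint32) (bool, string) {
	if t.HasError {
		return true, "errors"
	}
	if t.DurationMs >= slowMs {
		return true, "slow"
	}
	// Hash the trace ID so the same trace always gets the same decision.
	h := fnv.New32a()
	h.Write([]byte(t.TraceID))
	if h.Sum32()%100 < percent {
		return true, "everything_else"
	}
	return false, "dropped"
}

func main() {
	traces := []TraceSummary{
		{TraceID: "a1", HasError: true, DurationMs: 20},
		{TraceID: "b2", DurationMs: 1500},
		{TraceID: "c3", DurationMs: 30},
	}
	for _, t := range traces {
		keep, policy := keepTrace(t, 1000, 10)
		fmt.Printf("%s keep=%v policy=%s\n", t.TraceID, keep, policy)
	}
}
```

Hashing the trace ID rather than drawing a random number is a design choice for this sketch: it makes the outcome reproducible, which is convenient when reasoning about why a given trace was or was not stored.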
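3. Instrumenting the Application
3.1 OTel Initialization in the Go Application
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(ctx context.Context) func() {
	// Create the OTLP gRPC exporter
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Create the TracerProvider
	provider := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("demo-app"),
			semconv.ServiceVersionKey.String("1.0.0"),
		)),
	)

	otel.SetTracerProvider(provider)
	// Register the W3C Trace Context propagator; the global default is a no-op
	otel.SetTextMapPropagator(propagation.TraceContext{})

	return func() {
		// Flush pending spans with a fresh context, since ctx may already be cancelled
		shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := provider.Shutdown(shutdownCtx); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}
}
3.2 HTTP Middleware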
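Trace context travels between services in the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. To make the format concrete, here is a minimal parser sketch. It is for illustration only: `Traceparent` and `parseTraceparent` are hypothetical names, the sketch accepts only version `00`, and in a real service the OTel propagator does this work.

```go
package main

import (
	"fmt"
	"strings"
)

// Traceparent holds the fields of a version-00 traceparent header.
type Traceparent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// allZero reports whether s consists only of '0' characters.
func allZero(s string) bool { return strings.Trim(s, "0") == "" }

// parseTraceparent validates and splits a version-00 traceparent value.
func parseTraceparent(header string) (Traceparent, bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return Traceparent{}, false
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || allZero(traceID) {
		return Traceparent{}, false
	}
	if !isLowerHex(spanID, 16) || allZero(spanID) {
		return Traceparent{}, false
	}
	if !isLowerHex(flags, 2) {
		return Traceparent{}, false
	}
	// The sampled flag is the lowest bit of the flags byte, so only the
	// last hex digit matters.
	last := flags[1]
	nibble := last - '0'
	if last >= 'a' {
		nibble = last - 'a' + 10
	}
	return Traceparent{TraceID: traceID, SpanID: spanID, Sampled: nibble&1 == 1}, true
}

func main() {
	tp, ok := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(ok, tp.TraceID, tp.SpanID, tp.Sampled)
}
```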
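import (
	"context"
	"encoding/json"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

func main() {
	// Initialize tracing and flush buffered spans on exit
	shutdown := initTracer(context.Background())
	defer shutdown()

	mux := http.NewServeMux()
	mux.HandleFunc("/orders", handleOrders)
	// The {id} path wildcard requires Go 1.22 or later.
	// handleOrderByID is defined elsewhere in the application.
	mux.HandleFunc("/orders/{id}", handleOrderByID)

	// OTel HTTP middleware: creates a span per request and propagates trace context
	handler := otelhttp.NewHandler(mux, "http-server")

	// Log instead of log.Fatal so the deferred shutdown still runs
	if err := http.ListenAndServe(":8080", handler); err != nil {
		log.Printf("server stopped: %v", err)
	}
}

func handleOrders(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	tracer := otel.Tracer("demo-app")

	ctx, span := tracer.Start(ctx, "handleOrders")
	defer span.End()

	// Business logic; getOrders is defined elsewhere in the application
	orders, err := getOrders(ctx)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	json.NewEncoder(w).Encode(orders)
}
4. Prometheus Configuration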
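4.1 Base Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Port 8889 is only served if a `prometheus` exporter is added to the Collector config.
  # The Collector config in section 2.2 pushes metrics through remote write instead.
  - job_name: 'otel-collector'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8889']

rule_files:
  - 'alert_rules.yaml'
4.2 SLO Alert Rules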
groups:
  - name: slo-alerts
    rules:
      # SLO: 99.9% availability (30-day window)
      - alert: SLOAvailabilityBurnRate
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 1m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "SLO availability burn rate too high"
          description: "Error budget is burning at more than 14.4x; one hour at this rate consumes 2% of the 30-day budget"

      # SLO: 99% of requests complete within 500 ms
      - alert: SLOLatencyBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
              /
              sum(rate(http_request_duration_seconds_count[1h]))
            )
          ) > (1 - 0.99) * 14.4
        for: 1m
        labels:
          severity: critical
          slo: latency
        annotations:
          summary: "SLO latency burn rate too high"
          description: "The share of requests slower than 500 ms exceeds 14.4x the budgeted rate; one hour at this rate consumes 2% of the 30-day latency budget"

      # Multi-window, multi-burn-rate alert
      - alert: SLOAvailabilityMultiWindow
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 6
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (1 - 0.999) * 3
        for: 2m
        labels:
          severity: warning
Tip
Multi-window, multi-burn-rate alerting is the recommended practice for SLO alerts. The short window (5 minutes) reacts to sudden spikes, and the long window (30 minutes) confirms that the problem persists. The alert fires only when both conditions hold, which cuts false positives.
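The arithmetic behind these thresholds is simple enough to check by hand. The sketch below, with hypothetical helper names, computes the error-ratio threshold for a given SLO and burn rate, the fraction of the budget a burn rate consumes over a window, and the two-window condition. It assumes a 30-day (720-hour) SLO period.

```go
package main

import "fmt"

// errorRatioThreshold converts an SLO target and a burn rate into the
// error ratio that the alert expression compares against.
func errorRatioThreshold(sloTarget, burnRate float64) float64 {
	return (1 - sloTarget) * burnRate
}

// budgetConsumed returns the fraction of the error budget spent when the
// service burns at burnRate for windowHours out of periodHours.
func budgetConsumed(burnRate, windowHours, periodHours float64) float64 {
	return burnRate * windowHours / periodHours
}

// multiWindowFires requires both the short and the long window to exceed
// their thresholds before the alert fires.
func multiWindowFires(shortRatio, shortThreshold, longRatio, longThreshold float64) bool {
	return shortRatio > shortThreshold && longRatio > longThreshold
}

func main() {
	const periodHours = 30 * 24.0

	fmt.Printf("99.9%% SLO at 14.4x: error ratio threshold %.4f\n",
		errorRatioThreshold(0.999, 14.4))
	fmt.Printf("1h at 14.4x consumes %.1f%% of the 30-day budget\n",
		100*budgetConsumed(14.4, 1, periodHours))

	// A 1% error ratio over 5m and 0.4% over 30m against the 6x and 3x thresholds.
	fires := multiWindowFires(
		0.01, errorRatioThreshold(0.999, 6),
		0.004, errorRatioThreshold(0.999, 3),
	)
	fmt.Println("multi-window alert fires:", fires)
}
```

A burn rate of 1 means the budget runs out exactly at the end of the period, so a burn rate of 14.4 sustained for one hour spends 14.4/720 of a 30-day budget.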
5. Tempo and Loki Configuration
5.1 Tempo Configuration
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
5.2 Loki Configuration
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/storage
6. Grafana Configuration
6.1 Data Sources
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    # Explicit uid so other data sources can reference this one via datasourceUid
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      # Link exemplars to Tempo; the exemplar label name trace_id is an assumption
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo

  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    editable: false
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: service
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: true

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    editable: false
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: '$${__value}'
6.2 RED Method Dashboard
{
  "uid": "red-method",
  "title": "RED Method Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(http_requests_total[5m])) by (service)" }
      ]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(http_requests_total{code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)" }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))" }
      ]
    }
  ]
}
Because this file is loaded through file provisioning from ./dashboards, it contains the dashboard model directly, without the "dashboard" wrapper object used by the HTTP API. The explicit uid gives the dashboard a stable identifier.
7. Signal Correlation
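7.1 Exemplars: From Metrics to Traces
# Exemplars: click a data point in a metric panel to jump to the corresponding trace
# Exemplar storage has been available since Prometheus 2.26 and must be enabled
# with --enable-feature=exemplar-storage

# Request latency; each data point can carry a trace ID
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# In Grafana:
# 1. Hover over a spike in the P99 latency panel
# 2. Click the exemplar link
# 3. Grafana jumps to Tempo and opens the corresponding trace
7.2 Correlating Logs by Trace ID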
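import (
	"context"
	"log/slog"
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// loggerKey is the context key under which the request-scoped logger is stored
type loggerKeyType struct{}

var loggerKey = loggerKeyType{}

// loggingMiddleware injects the trace ID into the request-scoped logger.
// It assumes slog's default logger uses a JSON handler, so that Loki and the
// Grafana derived field can parse the trace_id field.
func loggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		span := trace.SpanFromContext(ctx)
		sc := span.SpanContext()

		// Attach the trace ID and span ID to every log record
		logger := slog.Default().With(
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)

		ctx = context.WithValue(ctx, loggerKey, logger)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

# Loki: find all logs that belong to a trace ID
{service="demo-app"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"

# Loki: find the error logs of a service
{service="demo-app"} |= "error" | json | level="error"
7.3 End-to-End Troubleshooting Flow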
sequenceDiagram
    participant S as SLO alert
    participant G as Grafana
    participant P as Prometheus
    participant T as Tempo
    participant L as Loki
    S->>G: P99 latency exceeds 500 ms
    G->>P: Inspect latency metrics
    P-->>G: Latency chart with exemplars
    G->>T: Click an exemplar to open the trace
    T-->>G: Trace details
    Note over G: DB query is slow
    G->>L: Look up logs by trace ID
    L-->>G: Related logs
    Note over G: Logs show the slow SQL query
    G->>P: Inspect DB connection pool metrics
    P-->>G: Connection pool exhausted
    Note over G: Root cause: connection pool too small
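The log lookup step in this flow depends on each log line carrying a `trace_id` field that the Loki derived-field regex can pick up. As a rough check of that contract, the sketch below writes one JSON log record with Go's `log/slog` and extracts the trace ID with the same regular expression used in the Loki data source. `logLine` and `extractTraceID` are hypothetical helpers for this illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"regexp"
)

// traceIDPattern is the expression used for the Loki derived field.
var traceIDPattern = regexp.MustCompile(`"trace_id":"(\w+)"`)

// logLine renders one JSON log record that carries the given IDs.
func logLine(traceID, spanID, msg string) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil)).With(
		"trace_id", traceID,
		"span_id", spanID,
	)
	logger.Error(msg)
	return buf.String()
}

// extractTraceID returns the trace ID found in a log line, or "" if the
// line does not contain one.
func extractTraceID(line string) string {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	line := logLine("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "slow query")
	fmt.Print(line)
	fmt.Println("extracted:", extractTraceID(line))
}
```

If the application logged in slog's default text format instead, the field would appear as `trace_id=...` and this regex would not match, so the log format and the derived-field pattern need to be chosen together.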
8. End-to-End Verification
8.1 Start the Platform
# Start all services
docker compose up -d

# Wait for the services to become ready
sleep 10

# Check each component
curl -s http://localhost:3000/api/health   # Grafana
curl -s http://localhost:9090/-/healthy    # Prometheus
curl -s http://localhost:3100/ready        # Loki
curl -s http://localhost:3200/ready        # Tempo
8.2 Generate Traffic
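Before sending traffic, note that a fixed `sleep 10` can be too short on a slow machine and wastefully long on a fast one. One alternative is to poll each health endpoint until it answers with HTTP 200. The sketch below shows the idea; `waitUntil` and `httpReady` are hypothetical helpers, and `main` polls an in-process stand-in server so the example runs without the stack. In practice you would pass URLs such as `http://localhost:3100/ready`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// waitUntil calls check up to attempts times and invokes pause between
// failed tries. It reports whether check ever succeeded.
func waitUntil(check func() bool, attempts int, pause func()) bool {
	for i := 0; i < attempts; i++ {
		if check() {
			return true
		}
		if i < attempts-1 {
			pause()
		}
	}
	return false
}

// httpReady returns a check that succeeds when a GET to url answers 200.
func httpReady(client *http.Client, url string) func() bool {
	return func() bool {
		resp, err := client.Get(url)
		if err != nil {
			return false
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}
}

func main() {
	// Stand-in for a backend that needs a moment to start:
	// it answers 503 twice and 200 afterwards.
	var hits atomic.Int32
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if hits.Add(1) <= 2 {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	ok := waitUntil(httpReady(client, srv.URL+"/ready"), 10, func() {
		time.Sleep(50 * time.Millisecond)
	})
	fmt.Println("ready:", ok, "after", hits.Load(), "requests")
}
```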
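# Generate HTTP traffic with hey
hey -z 5m -q 50 http://localhost:8080/orders

# Inject some errors at the same time
# (run this in a second terminal, since hey blocks for 5 minutes)
for i in $(seq 1 100); do
  curl -s http://localhost:8080/orders/error
  sleep 1
done
8.3 Verify the Signals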
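The verification queries in this section contain brackets, braces, and quotes, all of which need URL encoding when they travel in a query string. As a small illustration, the sketch below builds such URLs with Go's `net/url`; `queryURL` is a hypothetical helper, and the hosts and paths are the ones used in this chapter.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryURL joins base and path and appends params as a URL-encoded
// query string. Keys are emitted in sorted order.
func queryURL(base, path string, params map[string]string) string {
	values := url.Values{}
	for k, v := range params {
		values.Set(k, v)
	}
	return base + path + "?" + values.Encode()
}

func main() {
	// PromQL: the range selector brackets must be encoded.
	fmt.Println(queryURL("http://localhost:9090", "/api/v1/query", map[string]string{
		"query": "rate(http_requests_total[5m])",
	}))

	// LogQL: braces, quotes, and the equals sign must be encoded.
	fmt.Println(queryURL("http://localhost:3100", "/loki/api/v1/query_range", map[string]string{
		"query": `{service="demo-app"}`,
		"limit": "10",
	}))
}
```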
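# Verify Prometheus metrics
# (-G with --data-urlencode keeps curl from treating [ ] and { } as glob patterns)
curl -s -G 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(http_requests_total[5m])' | jq

# Verify Tempo traces through the search API
curl -s -G 'http://localhost:3200/api/search' --data-urlencode 'tags=service.name=demo-app' --data-urlencode 'limit=10' | jq

# Verify Loki logs
curl -s -G 'http://localhost:3100/loki/api/v1/query_range' --data-urlencode 'query={service="demo-app"}' --data-urlencode 'limit=10' | jq

# Verify the Grafana dashboard (assumes it is provisioned with the uid "red-method")
curl -s -u admin:admin 'http://localhost:3000/api/dashboards/uid/red-method' | jq '.meta.slug'
flowchart TB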
    APP["Application service"] --> SDK["OTel SDK"] --> COL["OTel Collector"]
    COL --> PROM["Prometheus<br/>metrics storage"]
    COL --> TEMPO["Tempo<br/>trace storage"]
    COL --> LOKI["Loki<br/>log storage"]
    PROM --> GRAFANA["Grafana<br/>unified visualization"]
    TEMPO --> GRAFANA
    LOKI --> GRAFANA
    style COL fill:#fff9c4,stroke:#f9a825
    style GRAFANA fill:#c8e6c9,stroke:#2e7d32
flowchart LR
    ALERT2["Alert rules"] --> EVAL["Prometheus<br/>rule evaluation"] --> FIRE["Alert fires"]
    FIRE --> AM["Alertmanager<br/>dedup / group / route"] --> NOTIFY["Notification channels<br/>Slack/Email/PagerDuty"]
    style AM fill:#ffcdd2,stroke:#c62828
9. Summary
| Step | Component | Config file | How to verify |
|---|---|---|---|
| 1. Deploy the infrastructure | Docker Compose | docker-compose.yml | docker compose ps |
| 2. Configure the OTel Collector | OTel Collector | otel-collector-config.yaml | Send test telemetry |
| 3. Instrument the application | OTel SDK | Go code | Generate a trace |
| 4. Configure metrics storage | Prometheus | prometheus.yaml | PromQL query |
| 5. Configure trace storage | Tempo | tempo.yaml | Trace ID lookup |
| 6. Configure log storage | Loki | loki.yaml | LogQL query |
| 7. Configure visualization | Grafana | datasources + dashboards | View the dashboard |
| 8. Configure SLO alerts | Prometheus | alert_rules.yaml | Trigger an alert |
| 9. Verify signal correlation | Grafana | Exemplar + trace ID | End-to-end navigation |
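Tip
Standing up the platform is only the beginning. The real challenges are ongoing:
- Keep tuning the sampling strategy to balance cost against visibility.
- Keep improving the dashboards so that they reflect business health.
- Keep iterating on SLOs, adjusting targets as the business evolves.
- Keep building an observability culture so that every engineer thinks in terms of observability-driven development (ODD).
This series started from the observability landscape and worked through structured logging, metrics, distributed tracing, OpenTelemetry, continuous profiling, eBPF observability, data pipelines, storage backends, signal correlation, SLO alerting, design for scale, platform design, production debugging, and observability-driven development, before ending with the complete platform built in this capstone chapter. I hope it helps you move from "knowing about observability" to "being able to build and run an observability platform".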