Hands-On: Building an Observability Platform
2025-12-10

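After 15 chapters of theory, it is time to build. This chapter assembles a complete observability platform in a single Docker Compose file: the OTel Collector ingests telemetry, Prometheus stores metrics, Tempo stores traces, Loki stores logs, and Grafana handles visualization, with SLO alert rules and signal correlation on top. This is not a toy demo; it is an architectural skeleton you can carry into production.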

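1. Platform Architecture

The platform design covers four storage backends, each handling a different kind of observability data. The Compose file in this chapter deploys the first three; Pyroscope is listed for completeness. Only the Prometheus retention value is a default; the other retention periods are targets and are not set by the configuration files in this chapter.

| Backend | Data type | Query language | Retention | Use cases |
|---|---|---|---|---|
| Prometheus | Metrics (time series) | PromQL | 15 days (default) | Alerting, SLOs, dashboards |
| Tempo | Distributed traces | TraceQL | 30 days | Latency analysis, request paths |
| Loki | Structured logs | LogQL | 30 days | Error investigation, contextual search |
| Pyroscope | Continuous profiling | Flamegraph | 7 days | CPU/memory hotspot analysis |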

1.1 Overall Architecture

graph TB
    subgraph "Application Layer"
        APP["Go Application<br/>OTel SDK"]
    end
    subgraph "Collection Layer"
        OTEL["OTel Collector<br/>receive → process → route"]
    end
    subgraph "Storage Layer"
        PROM["Prometheus<br/>Metrics storage"]
        TEMPO["Tempo<br/>Trace storage"]
        LOKI["Loki<br/>Log storage"]
    end
    subgraph "Visualization Layer"
        GRAFANA["Grafana<br/>Unified querying"]
    end
    APP -->|"OTLP"| OTEL
    OTEL -->|"metrics"| PROM
    OTEL -->|"traces"| TEMPO
    OTEL -->|"logs"| LOKI
    PROM --> GRAFANA
    TEMPO --> GRAFANA
    LOKI --> GRAFANA
    style APP fill:#e3f2fd,stroke:#1565c0
    style OTEL fill:#fff9c4,stroke:#f9a825
    style GRAFANA fill:#e8f5e9,stroke:#2e7d32

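1.2 Component Responsibilities

| Component | Responsibility | Ports |
|---|---|---|
| OTel Collector | Receives telemetry; processes, routes, and samples it | 4317 (gRPC), 4318 (HTTP) |
| Prometheus | Stores and queries metrics | 9090 |
| Tempo | Stores and queries traces | 3200 |
| Loki | Stores and queries logs | 3100 |
| Grafana | Unified visualization across the three signals | 3000 |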

2. Docker Compose Deployment

2.1 Full Configuration

docker-compose.yml
version: "3.8"
services:
  # ===== Application =====
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      # The OTLP endpoint variable expects a URL including the scheme
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - OTEL_SERVICE_NAME=demo-app
    depends_on:
      - otel-collector
  # ===== Collection layer =====
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    depends_on:
      - prometheus
      - tempo
      - loki
  # ===== Storage layer =====
  prometheus:
    image: prom/prometheus:v2.52.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      # Accept remote write from the Collector and from Tempo
      - --web.enable-remote-write-receiver
      # Store exemplars so metric data points can link to traces
      - --enable-feature=exemplar-storage
    volumes:
      - ./prometheus.yaml:/etc/prometheus/prometheus.yml
      # rule_files in prometheus.yaml is resolved relative to /etc/prometheus
      - ./alert_rules.yaml:/etc/prometheus/alert_rules.yaml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
  tempo:
    image: grafana/tempo:2.4.0
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
      - tempo_data:/var/tempo
    ports:
      - "3200:3200" # Tempo API
  loki:
    image: grafana/loki:2.9.4
    command: ["-config.file=/etc/loki.yaml"]
    volumes:
      - ./loki.yaml:/etc/loki.yaml
      - loki_data:/loki
    ports:
      - "3100:3100"
  # ===== Visualization layer =====
  grafana:
    image: grafana/grafana:10.4.0
    volumes:
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
      - ./grafana-dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml
      - ./dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    environment:
      # Demo credentials only; change before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
      - tempo
      - loki
volumes:
  prometheus_data:
  tempo_data:
  loki_data:

2.2 OTel Collector Configuration

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    send_batch_size: 1024
    timeout: 5s
  tail_sampling:
    decision_wait: 10s
    # Demo-sized buffer of in-flight traces; raise it for real traffic
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: everything_else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/tempo:
    endpoint: http://tempo:4318
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: false
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      # Sample first, then batch what survives
      processors: [tail_sampling, batch]
      exporters: [otlphttp/tempo]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
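Note

The tail_sampling processor implements tail-based sampling: traces that contain an error or take longer than 1 second are always kept, and only 10% of the remaining traces are retained. Critical traces are never lost, while storage cost stays under control.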
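To make the decision logic concrete, here is a minimal Go sketch of the three policies applied to an already completed trace. It is an illustration only, not the Collector's implementation: `TraceSummary` and `keepTrace` are hypothetical names, and hashing the trace ID is an assumed way of making the 10% decision repeatable.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// TraceSummary is a hypothetical, simplified view of a completed trace.
type TraceSummary struct {
	TraceID  string
	HasError bool
	Duration time.Duration
}

// keepTrace applies the three policies in order: errors, slow traces,
// then a fixed percentage of everything else. Hashing the trace ID makes
// the probabilistic decision repeatable for the same trace.
func keepTrace(t TraceSummary, slowThreshold time.Duration, percent uint32) bool {
	if t.HasError {
		return true
	}
	if t.Duration > slowThreshold {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(t.TraceID))
	return h.Sum32()%100 < percent
}

func main() {
	traces := []TraceSummary{
		{TraceID: "trace-a", HasError: true, Duration: 20 * time.Millisecond},
		{TraceID: "trace-b", Duration: 1500 * time.Millisecond},
		{TraceID: "trace-c", Duration: 30 * time.Millisecond},
	}
	for _, t := range traces {
		fmt.Printf("%s keep=%v\n", t.TraceID, keepTrace(t, time.Second, 10))
	}
}
```

The decision can only be made once the whole trace has been seen, which is why the processor buffers spans for `decision_wait` before choosing.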

3. Instrumenting the Application

3.1 OTel Initialization in the Go Application

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// initTracer installs a global TracerProvider and returns a shutdown function.
func initTracer(ctx context.Context) func() {
	// Create the OTLP gRPC exporter (plaintext inside the Compose network)
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}
	// Create the TracerProvider with batching and service identity
	provider := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("demo-app"),
			semconv.ServiceVersionKey.String("1.0.0"),
		)),
	)
	otel.SetTracerProvider(provider)
	return func() {
		// Flush buffered spans before the process exits
		if err := provider.Shutdown(ctx); err != nil {
			log.Printf("tracer shutdown: %v", err)
		}
	}
}

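3.2 HTTP Middleware

import (
	"context"
	"encoding/json"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

func main() {
	// Install the tracer provider from section 3.1 and flush spans on exit
	shutdown := initTracer(context.Background())
	defer shutdown()

	mux := http.NewServeMux()
	mux.HandleFunc("/orders", handleOrders)
	// Path wildcards such as {id} require Go 1.22 or later;
	// handleOrderByID is defined elsewhere (not shown)
	mux.HandleFunc("/orders/{id}", handleOrderByID)
	// OTel HTTP middleware: creates a span per request and propagates trace context
	handler := otelhttp.NewHandler(mux, "http-server")
	log.Println(http.ListenAndServe(":8080", handler))
}

func handleOrders(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	tracer := otel.Tracer("demo-app")
	ctx, span := tracer.Start(ctx, "handleOrders")
	defer span.End()
	// Business logic; getOrders is defined elsewhere (not shown)
	orders, err := getOrders(ctx)
	if err != nil {
		// Mark the span as failed so tail sampling keeps the trace
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(orders)
}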
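Context propagation relies on the W3C `traceparent` header that the middleware reads from incoming requests. The following standard-library sketch shows what that header carries; `TraceParent` and `parseTraceParent` are hypothetical names for illustration, and the parsing is deliberately simplified (version `00` only, no all-zero ID check). The OTel SDK performs this work for you.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the fields of a W3C traceparent header.
type TraceParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// parseTraceParent parses "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
func parseTraceParent(header string) (TraceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return TraceParent{}, errors.New("unsupported traceparent format")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 {
		return TraceParent{}, errors.New("wrong field length")
	}
	for _, field := range []string{traceID, spanID, flags} {
		if _, err := hex.DecodeString(field); err != nil {
			return TraceParent{}, fmt.Errorf("field %q is not hex: %w", field, err)
		}
	}
	flagByte, _ := hex.DecodeString(flags) // already validated above
	return TraceParent{
		TraceID: traceID,
		SpanID:  spanID,
		Sampled: flagByte[0]&0x01 == 1, // lowest bit is the sampled flag
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("trace=%s span=%s sampled=%v\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

A downstream service that receives this header continues the same trace, which is what lets Tempo show one request across several services.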

4. Prometheus Configuration

4.1 Basic Configuration

prometheus.yaml
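global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Application metrics arrive through remote write; this job scrapes the
  # Collector's own internal metrics (default telemetry port 8888)
  - job_name: 'otel-collector'
    scrape_interval: 15s
    static_configs:
      - targets: ['otel-collector:8888']
rule_files:
  - 'alert_rules.yaml'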

4.2 SLO Alert Rules

alert_rules.yaml
groups:
  - name: slo-alerts
    rules:
      # SLO: 99.9% availability over a 30-day window
      # Assumes the application exposes http_requests_total with a `code` label
      - alert: SLOAvailabilityBurnRate
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 1m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Availability SLO burn rate too high"
          description: "The 1-hour error ratio exceeds a 14.4x burn rate; at this rate one hour consumes 2% of the 30-day error budget"
      # SLO: 99% of requests complete within 500ms
      - alert: SLOLatencyBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h]))
              /
              sum(rate(http_request_duration_seconds_count[1h]))
            )
          ) > (1 - 0.99) * 14.4
        for: 1m
        labels:
          severity: critical
          slo: latency
        annotations:
          summary: "Latency SLO burn rate too high"
          description: "The 1-hour share of requests slower than 500ms exceeds a 14.4x burn rate; at this rate one hour consumes 2% of the 30-day latency budget"
      # Multi-window, multi-burn-rate alert
      - alert: SLOAvailabilityMultiWindow
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (1 - 0.999) * 6
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (1 - 0.999) * 3
        for: 2m
        labels:
          severity: warning
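Tip

Multi-window, multi-burn-rate alerting is a widely recommended practice for SLO alerts. The short window (5 minutes) reacts to sudden problems, and the long window (30 minutes) confirms that the problem persists. The alert fires only when both conditions hold, which reduces false positives.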
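The arithmetic behind the burn-rate thresholds is easy to check by hand. This sketch, with hypothetical helper names, computes the burn rate from an error ratio and the share of the error budget consumed when that rate is sustained over a window; the request counts in `main` are made-up illustration values.

```go
package main

import "fmt"

// burnRate is how many times faster than "exactly on budget" the error
// budget is being spent: observed error ratio divided by allowed error ratio.
func burnRate(errorRatio, slo float64) float64 {
	return errorRatio / (1 - slo)
}

// budgetConsumed is the fraction of the whole budget used when a burn rate
// is sustained for windowHours within an SLO period of periodHours.
func budgetConsumed(rate, windowHours, periodHours float64) float64 {
	return rate * windowHours / periodHours
}

// latencyErrorRatio is the share of requests slower than the latency target,
// given the count of requests within the target and the total count.
func latencyErrorRatio(fastRequests, totalRequests float64) float64 {
	if totalRequests == 0 {
		return 0
	}
	return 1 - fastRequests/totalRequests
}

func main() {
	const periodHours = 30 * 24
	fmt.Printf("error ratio that triggers the 1h availability alert: %.4f\n", 14.4*(1-0.999))
	fmt.Printf("budget consumed after 1h at 14.4x: %.1f%%\n", 100*budgetConsumed(14.4, 1, periodHours))

	// Illustration: 9700 of 10000 requests finished within 500ms
	ratio := latencyErrorRatio(9700, 10000)
	fmt.Printf("latency error ratio %.3f, burn rate %.1fx\n", ratio, burnRate(ratio, 0.99))
}
```

A burn rate of 14.4 sustained for one hour uses 14.4 × 1 / 720 = 2% of a 30-day budget, which is why 14.4 is a common paging threshold for a 1-hour window.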

5. Tempo and Loki Configuration

5.1 Tempo Configuration

tempo.yaml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal
metrics_generator:
  registry:
    external_labels:
      source: tempo
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
# The metrics generator stays idle until processors are enabled per tenant
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]

5.2 Loki Configuration

loki.yaml
auth_enabled: false
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  # Single-node setup: one replica and an in-memory ring, no external KV store
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  filesystem:
    directory: /loki/storage

6. Grafana Configuration

6.1 Data Sources

grafana-datasources.yaml
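apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Fixed uid so other data sources can reference this one
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      # Link exemplars on metric panels to traces in Tempo
      exemplarTraceIdDestinations:
        - name: trace_id
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    access: proxy
    url: http://tempo:3200
    editable: false
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: service
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: true
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    editable: false
    jsonData:
      derivedFields:
        # Turn the trace_id field of JSON log lines into a link to Tempo
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: '$${__value}'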
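The derived field only works when the log line matches the regular expression. As a quick check of what the pattern `"trace_id":"(\w+)"` extracts, here is a small Go sketch; `extractTraceID` is a hypothetical helper and the log lines are made-up samples.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as matcherRegex in the Loki data source.
var traceIDPattern = regexp.MustCompile(`"trace_id":"(\w+)"`)

// extractTraceID returns the first captured trace ID, if the line has one.
func extractTraceID(line string) (string, bool) {
	m := traceIDPattern.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	jsonLine := `{"level":"ERROR","msg":"query failed","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736"}`
	logfmtLine := `level=error msg="query failed" trace_id=4bf92f3577b34da6a3ce929d0e0e4736`

	for _, line := range []string{jsonLine, logfmtLine} {
		id, ok := extractTraceID(line)
		fmt.Printf("matched=%v id=%q\n", ok, id)
	}
}
```

The JSON line matches and the logfmt line does not, so the application's log format and the regex have to be chosen together.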

6.2 RED Method Dashboard

Save the JSON as a file under ./dashboards. File-provisioned dashboards contain the bare dashboard model, without the "dashboard" wrapper that the HTTP API uses.

{
  "uid": "red-method",
  "title": "RED Method Dashboard",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(http_requests_total[5m])) by (service)"
      }]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "sum(rate(http_requests_total{code=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)"
      }]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [{
        "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))"
      }]
    }
  ]
}
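The P99 panel depends on how `histogram_quantile` estimates a quantile from cumulative buckets. The sketch below shows the core idea, linear interpolation inside the bucket that contains the target rank. It is a simplified illustration with hypothetical names and made-up bucket counts, not Prometheus's implementation, and it ignores edge cases such as non-monotonic buckets.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Bucket is one cumulative histogram bucket.
type Bucket struct {
	UpperBound float64 // value of the `le` label
	Count      float64 // observations less than or equal to UpperBound
}

// exampleBuckets is made-up data: 100 requests, most of them fast.
var exampleBuckets = []Bucket{
	{0.1, 50}, {0.25, 80}, {0.5, 95}, {1, 99}, {math.Inf(1), 100},
}

// histogramQuantile expects buckets sorted by UpperBound, ending with +Inf,
// and q in (0, 1].
func histogramQuantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	last := len(buckets) - 1
	total := buckets[last].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank
	idx := sort.Search(last, func(i int) bool { return buckets[i].Count >= rank })
	if idx == last {
		// Rank falls into the +Inf bucket: report the highest finite bound
		return buckets[last-1].UpperBound
	}
	lower, below := 0.0, 0.0
	if idx > 0 {
		lower, below = buckets[idx-1].UpperBound, buckets[idx-1].Count
	}
	inBucket := buckets[idx].Count - below
	if inBucket == 0 {
		return lower
	}
	// Assume observations are spread evenly inside the bucket
	return lower + (buckets[idx].UpperBound-lower)*(rank-below)/inBucket
}

func main() {
	for _, q := range []float64{0.5, 0.9, 0.99} {
		fmt.Printf("q=%.2f estimate=%.3fs\n", q, histogramQuantile(q, exampleBuckets))
	}
}
```

Because the estimate is interpolated, its accuracy depends on bucket boundaries; a 500ms SLO is best served by a bucket boundary at exactly 0.5.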

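7. Signal Correlation

7.1 Exemplars: From Metrics to Traces

# Exemplar: click a data point in a metric panel to jump to the matching trace
# Prometheus has supported exemplar storage since v2.26, behind the
# --enable-feature=exemplar-storage flag
# Query request latency; data points can carry a TraceID as an exemplar
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# In Grafana:
# 1. Hover over a spike in the P99 latency panel
# 2. Click the exemplar link
# 3. Grafana opens the corresponding trace in Tempo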
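An exemplar is a small annotation attached to a single sample. In the OpenMetrics text format it follows the sample value after a `#`. The sketch below renders one such line so the shape is visible; `exemplarLine` is a hypothetical helper, the numbers are made up, and real applications would let their metrics library emit exemplars.

```go
package main

import "fmt"

// exemplarLine renders one histogram bucket sample with an exemplar in
// OpenMetrics text form: <sample> # {<exemplar labels>} <exemplar value>
func exemplarLine(metric, le string, count uint64, traceID string, value float64) string {
	return fmt.Sprintf("%s_bucket{le=%q} %d # {trace_id=%q} %g", metric, le, count, traceID, value)
}

func main() {
	fmt.Println(exemplarLine(
		"http_request_duration_seconds", "0.5", 42,
		"4bf92f3577b34da6a3ce929d0e0e4736", 0.43,
	))
}
```

The `trace_id` label is the piece Grafana uses to build the link from a data point to the trace in Tempo.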

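7.2 Correlating Logs by TraceID

// Inject the TraceID into logs.
// Requires imports: context, log/slog, net/http, os,
// and go.opentelemetry.io/otel/trace.
// Install this middleware inside otelhttp.NewHandler so a span already exists.

// ctxKey is unexported so the context key cannot collide with other packages.
type ctxKey struct{}

var loggerKey ctxKey

// JSON output produces "trace_id":"..." fields, the shape the Loki derived field matches.
var baseLogger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

func loggingMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := r.Context()
		sc := trace.SpanFromContext(ctx).SpanContext()
		// Attach TraceID and SpanID to every line written through this logger
		logger := baseLogger.With(
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)
		ctx = context.WithValue(ctx, loggerKey, logger)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
# Loki: find all logs that belong to one TraceID
# (a plain line filter on the ID works for both JSON and logfmt output)
{service="demo-app"} |= "4bf92f3577b34da6a3ce929d0e0e4736"
# Loki: find error logs for a service
{service="demo-app"} |= "error" | json | level="error"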
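To see what a correlated log line looks like on the wire, the following sketch writes one record through `log/slog`'s JSON handler with trace fields attached. `renderLogLine` is a hypothetical helper; the timestamp is dropped only to keep the output stable, and the IDs are the W3C sample values rather than real ones.

```go
package main

import (
	"bytes"
	"fmt"
	"log/slog"
	"strings"
)

// renderLogLine returns the JSON line produced for msg with trace fields attached.
func renderLogLine(traceID, spanID, msg string) string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		// Drop the top-level time attribute so the output is deterministic
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) == 0 && a.Key == slog.TimeKey {
				return slog.Attr{}
			}
			return a
		},
	}
	logger := slog.New(slog.NewJSONHandler(&buf, opts)).With(
		"trace_id", traceID,
		"span_id", spanID,
	)
	logger.Info(msg)
	return strings.TrimSpace(buf.String())
}

func main() {
	fmt.Println(renderLogLine("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "order created"))
}
```

Every line carries the same `trace_id` as the span that was active for the request, so one ID is enough to pull all related logs from Loki.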

7.3 End-to-End Troubleshooting Flow

sequenceDiagram
    participant S as SLO Alert
    participant G as Grafana
    participant P as Prometheus
    participant T as Tempo
    participant L as Loki
    S->>G: P99 latency exceeds 500ms
    G->>P: Inspect latency metrics
    P-->>G: Latency chart + exemplars
    G->>T: Click an exemplar to open the trace
    T-->>G: Trace details
    Note over G: DB query is slow
    G->>L: Query logs by TraceID
    L-->>G: Related logs
    Note over G: Logs show the slow SQL query
    G->>P: Inspect DB connection pool metrics
    P-->>G: Connection pool exhausted
    Note over G: Root cause: connection pool too small

8. End-to-End Verification

8.1 Starting the Platform

# Start all services
docker compose up -d
# Wait for the services to become ready
sleep 10
# Check each component
curl -s http://localhost:3000/api/health # Grafana
curl -s http://localhost:9090/-/healthy # Prometheus
curl -s http://localhost:3100/ready # Loki
curl -s http://localhost:3200/status/build # Tempo
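A fixed `sleep 10` can be too short on a slow machine. As an alternative, here is a minimal polling sketch that retries each health endpoint until it answers 200. `waitUntil` and `httpOK` are hypothetical helpers, and the attempt count and interval are arbitrary assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// waitUntil calls check up to attempts times, sleeping interval between
// tries, and reports whether check ever succeeded.
func waitUntil(check func() bool, attempts int, interval time.Duration) bool {
	for i := 0; i < attempts; i++ {
		if check() {
			return true
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return false
}

// httpOK returns a check that succeeds when url answers with HTTP 200.
func httpOK(url string) func() bool {
	client := &http.Client{Timeout: 2 * time.Second}
	return func() bool {
		resp, err := client.Get(url)
		if err != nil {
			return false
		}
		defer resp.Body.Close()
		return resp.StatusCode == http.StatusOK
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: waitready URL...")
		return
	}
	for _, u := range os.Args[1:] {
		if !waitUntil(httpOK(u), 30, time.Second) {
			fmt.Println("not ready:", u)
			os.Exit(1)
		}
		fmt.Println("ready:", u)
	}
}
```

It could be run as `go run . http://localhost:3000/api/health http://localhost:9090/-/healthy http://localhost:3100/ready` before generating traffic.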

8.2 Generating Traffic

# Generate HTTP traffic with hey
hey -z 5m -q 50 http://localhost:8080/orders
# Inject some errors at the same time
for i in $(seq 1 100); do
  curl -s http://localhost:8080/orders/error
  sleep 1
done

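8.3 Verifying the Signals

# Verify Prometheus metrics (-G with --data-urlencode avoids curl's [] and {} globbing)
curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(http_requests_total[5m])' | jq
# Verify Tempo traces through the search API
curl -sG 'http://localhost:3200/api/search' --data-urlencode 'q={ resource.service.name = "demo-app" }' --data-urlencode 'limit=10' | jq
# Verify Loki logs
curl -sG 'http://localhost:3100/loki/api/v1/query_range' --data-urlencode 'query={service="demo-app"}' --data-urlencode 'limit=10' | jq
# Verify the Grafana dashboard was provisioned
curl -s -u admin:admin 'http://localhost:3000/api/search?query=RED' | jq '.[].title'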
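PromQL and LogQL expressions contain characters such as `[`, `]`, `{`, and `"` that must be percent-encoded in a query string. When the verification is scripted in Go rather than with curl, `net/url` takes care of this. `queryURL` below is a hypothetical helper; the hosts and paths are the ones used in this chapter.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryURL builds base+path with params percent-encoded as a query string.
// url.Values.Encode sorts parameters by key, so the result is stable.
func queryURL(base, path string, params map[string]string) string {
	v := url.Values{}
	for key, value := range params {
		v.Set(key, value)
	}
	return base + path + "?" + v.Encode()
}

func main() {
	fmt.Println(queryURL("http://localhost:9090", "/api/v1/query", map[string]string{
		"query": "rate(http_requests_total[5m])",
	}))
	fmt.Println(queryURL("http://localhost:3100", "/loki/api/v1/query_range", map[string]string{
		"query": `{service="demo-app"}`,
		"limit": "10",
	}))
}
```

The resulting URLs can be passed straight to `http.Get`.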
flowchart TB
    APP["Application service"] --> SDK["OTel SDK"] --> COL["OTel Collector"]
    COL --> PROM["Prometheus<br/>Metrics storage"]
    COL --> TEMPO["Tempo<br/>Trace storage"]
    COL --> LOKI["Loki<br/>Log storage"]
    PROM --> GRAFANA["Grafana<br/>Unified visualization"]
    TEMPO --> GRAFANA
    LOKI --> GRAFANA
    style COL fill:#fff9c4,stroke:#f9a825
    style GRAFANA fill:#c8e6c9,stroke:#2e7d32

flowchart LR
    ALERT2["Alert rules"] --> EVAL["Prometheus<br/>Rule evaluation"] --> FIRE["Alert fires"]
    FIRE --> AM["Alertmanager<br/>Dedup / group / route"] --> NOTIFY["Notification channels<br/>Slack/Email/PagerDuty"]
    style AM fill:#ffcdd2,stroke:#c62828

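9. Summary

| Step | Component | Configuration | Verification |
|---|---|---|---|
| 1. Deploy the infrastructure | Docker Compose | docker-compose.yml | docker compose ps |
| 2. Configure the OTel Collector | OTel Collector | otel-collector-config.yaml | Send test telemetry |
| 3. Instrument the application | OTel SDK | Go code | Generate a trace |
| 4. Configure metrics storage | Prometheus | prometheus.yaml | PromQL query |
| 5. Configure trace storage | Tempo | tempo.yaml | TraceID lookup |
| 6. Configure log storage | Loki | loki.yaml | LogQL query |
| 7. Configure visualization | Grafana | datasources + dashboards | View the dashboard |
| 8. Configure SLO alerts | Prometheus | alert_rules.yaml | Trigger an alert |
| 9. Verify signal correlation | Grafana | Exemplar + TraceID | End-to-end navigation |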
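Tip

Building the observability platform is only the beginning. The real challenges are:

  1. Keep tuning the sampling strategy: balance cost against visibility

  2. Keep improving the dashboards: make them reflect business health

  3. Keep iterating on SLOs: adjust the targets as the business evolves

  4. Keep building an observability culture: every engineer should think in terms of observability-driven development (ODD)

This series started from the observability landscape and moved through structured logging, metrics, distributed tracing, OpenTelemetry, continuous profiling, eBPF observability, data pipelines, storage backends, signal correlation, SLO alerting, design at scale, platform design, production debugging, and observability-driven development, ending with this hands-on chapter that builds a complete observability platform. I hope the series helps you move from "knowing about observability" to "being able to build and operate an observability platform".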

Hands-On: Building an Observability Platform
https://blog.souloss.com/posts/observability/observability-hands-on-practice/
Author: Souloss
Published: 2025-12-10
License: CC BY-NC-SA 4.0