Machine Learning System Design

Souloss

2025-03-03

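1. ML System Architecture

1.1 End-to-End ML Workflow

graph TB
    subgraph "Data Layer"
        A["Data Sources"] --> B["Data Warehouse"]
        B --> C["Feature Store"]
    end
    subgraph "Training Layer"
        C --> D["Feature Engineering"]
        D --> E["Model Training"]
        E --> F["Model Evaluation"]
    end
    subgraph "Serving Layer"
        F --> G["Model Registry"]
        G --> H["Model Serving"]
        H --> I["Inference Endpoint"]
    end
    subgraph "Monitoring Layer"
        I --> J["Monitoring Metrics"]
        J --> K["Feature Drift Detection"]
        K --> D
    end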

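Stage	Key tasks	Tools
Data collection	ETL, cleaning	Airflow, Spark
Feature engineering	Feature extraction and storage	Feast, Tecton
Model training	Distributed training	TensorFlow, PyTorch
Model serving	Online inference	Triton, Seldon
Monitoring	Drift detection	Evidently, Prometheus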

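1.2 Batch Processing vs. Online Learning

# Batch system
class BatchMLSystem:
    """
    Suited to models that are updated infrequently and have relaxed latency requirements.
    fetch_batch_data, compute_features, and train are placeholders for a concrete subclass.
    """
    def __init__(self):
        self.model = None
        self.schedule = "daily"

    def retrain(self):
        """Retrain from scratch on a daily or weekly schedule."""
        data = self.fetch_batch_data()
        features = self.compute_features(data)
        self.model = self.train(features)
        self.model.save()


# Online learning system
class OnlineMLSystem:
    """
    Suited to models that must adapt quickly as the data distribution changes.
    compute_features is a placeholder for a concrete subclass.
    """
    def __init__(self, model=None):
        # The model must expose partial_fit before any incremental update is applied
        self.model = model
        self.learning_rate = 0.01

    def partial_fit(self, new_data):
        """Update the model incrementally with newly arrived data."""
        features = self.compute_features(new_data)
        self.model.partial_fit(features)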
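To make the incremental update concrete, here is a minimal sketch of an online learner that adjusts its weights one example at a time. It is an illustration only: `OnlineLogisticRegression`, `make_stream`, and the synthetic data are hypothetical names and assumptions, not part of the original classes.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogisticRegression:
    """Hypothetical logistic regression updated one example at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.01):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def partial_fit(self, x, y):
        # Gradient of the log loss for a single example is (prediction - label) * x
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def make_stream(n, seed=0):
    """Synthetic stream (an assumption): the label is 1 when x0 + x1 > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


model = OnlineLogisticRegression(n_features=2, learning_rate=0.1)
for x, y in make_stream(5000):
    model.partial_fit(x, y)  # no full retraining, just one small step per event
```

A batch system would instead collect all 5000 examples and fit once; the online version trades that stability for the ability to follow a changing distribution.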

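2. Feature Engineering

2.1 Feature Types

Type	Description	Examples
Numerical features	Continuous values	Age, income
Categorical features	Discrete values	Gender, country
Temporal features	Derived from timestamps	Day of week, month
Cross features	Combinations of features	Age × gender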
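The four feature types in the table can be sketched as a single function that derives them from a raw record. The field names (`signup_time`, the age buckets) and the `build_features` helper are hypothetical and chosen only for illustration.

```python
from datetime import datetime


def build_features(record):
    """Derive one feature of each type from a raw record (field names are assumptions)."""
    ts = datetime.fromisoformat(record["signup_time"])
    age_bucket = "under_30" if record["age"] < 30 else "30_plus"
    return {
        # Numerical features: kept as continuous values
        "age": float(record["age"]),
        "income": float(record["income"]),
        # Categorical features: kept as discrete labels
        "gender": record["gender"],
        "country": record["country"],
        # Temporal features: extracted from the timestamp
        "day_of_week": ts.weekday(),  # Monday is 0
        "month": ts.month,
        # Cross feature: bucketed age combined with gender
        "age_x_gender": f"{age_bucket}|{record['gender']}",
    }


example = build_features(
    {
        "age": 27,
        "income": 52000,
        "gender": "F",
        "country": "DE",
        "signup_time": "2025-03-03T09:30:00",
    }
)
```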

2.2 Feature Processing

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Standardize numerical features
numeric_features = ['age', 'income', 'score']
numeric_transformer = StandardScaler()

# Encode categorical features; categories unseen during fit are encoded as all zeros
categorical_features = ['country', 'gender']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine the transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply to the data (X is assumed to be a DataFrame containing the columns above)
X_processed = preprocessor.fit_transform(X)
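As a rough illustration of what the two transformers compute, the sketch below reimplements standardization and one-hot encoding in plain Python. The helper names are hypothetical; it mirrors the documented behavior (population standard deviation, all-zero vector for unknown categories) rather than scikit-learn's actual implementation.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, std or 1.0  # fall back to 1.0 for a constant column


def standardize(value, mean, std):
    return (value - mean) / std


def fit_one_hot(values):
    """Learn the category vocabulary from the training data."""
    return sorted(set(values))


def one_hot(value, categories):
    # A category not seen during fitting yields an all-zero vector,
    # which is what handle_unknown='ignore' asks for
    return [1.0 if value == c else 0.0 for c in categories]


ages = [20, 30, 40]
age_mean, age_std = fit_standardizer(ages)
countries = fit_one_hot(["US", "DE", "US"])

scaled_age = standardize(40, age_mean, age_std)
known_country = one_hot("US", countries)
unknown_country = one_hot("FR", countries)
```

Fitting on training data and reusing the same statistics at inference time is the point of wrapping both steps in one `ColumnTransformer`.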

2.3 Feature Store

# Feature store service (Feast)
from datetime import timedelta

from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float64, Int64, String

# Define the entity
user = Entity(name="user_id", join_keys=["user_id"])

# Define the feature source
user_profile_source = FileSource(
    path="data/user_features.parquet",
    timestamp_field="event_timestamp"
)

# Define the feature view
user_profile_view = FeatureView(
    name="user_profile",
    entities=[user],
    ttl=timedelta(days=30),
    schema=[
        Field(name="age", dtype=Int64),
        Field(name="gender", dtype=String),
        Field(name="country", dtype=String),
        Field(name="income", dtype=Float64),
    ],
    source=user_profile_source
)

# Retrieve features
# (user_df is assumed to hold user_id and event_timestamp columns)
feature_store = FeatureStore(config_path="feature_repo/feature_store.yaml")
training_df = feature_store.get_historical_features(
    entity_df=user_df,
    features=[
        "user_profile:age",
        "user_profile:income",
    ]
).to_df()
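Historical retrieval is useful because it joins each training row with the feature value that was valid at that row's timestamp, which avoids leaking future data into training. The following is a conceptual sketch of that point-in-time lookup with a TTL, not Feast's implementation; `point_in_time_lookup` and the sample rows are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def point_in_time_lookup(feature_rows, event_time, ttl):
    """Return the latest feature value at or before event_time, or None.

    feature_rows is a list of (timestamp, value) pairs sorted by timestamp.
    Values older than ttl relative to event_time are treated as expired.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, event_time) - 1
    if idx < 0:
        return None  # no feature value existed yet at event_time
    ts, value = feature_rows[idx]
    if event_time - ts > ttl:
        return None  # the most recent value is too stale
    return value


income_rows = [
    (datetime(2025, 1, 1), 50000.0),
    (datetime(2025, 2, 1), 52000.0),
]
ttl = timedelta(days=30)

mid_january = point_in_time_lookup(income_rows, datetime(2025, 1, 15), ttl)
before_any = point_in_time_lookup(income_rows, datetime(2024, 12, 31), ttl)
expired = point_in_time_lookup(income_rows, datetime(2025, 4, 1), ttl)
```

Here the mid-January event sees the January value even though a newer February value exists in the store.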

3. Model Training

3.1 Distributed Training

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Multi-GPU training; launch with torchrun, which sets LOCAL_RANK for each process
def train_distributed(model, train_loader, criterion, optimizer, num_epochs):
    # Initialize the distributed environment
    dist.init_process_group(backend='nccl')

    # Move the model to this process's GPU. dist.get_rank() is the global rank
    # and only matches the GPU index on a single node, so use LOCAL_RANK instead.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)

    # Wrap the model so gradients are synchronized across processes
    model = DistributedDataParallel(model, device_ids=[local_rank])

    # Training loop
    for epoch in range(num_epochs):
        for batch in train_loader:
            inputs, labels = batch
            inputs = inputs.cuda(local_rank)
            labels = labels.cuda(local_rank)

            outputs = model(inputs)
            loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()
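The core idea behind data-parallel training is that each worker computes a gradient on its own shard and the gradients are averaged before every update. The sketch below simulates that on one machine with a one-parameter model; `all_reduce_mean` is a hypothetical stand-in for the collective operation, and the data is made up.

```python
def gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce that averages a value across workers."""
    return sum(values) / len(values)


# Toy data generated by y = 2x, split into two equal-sized shards ("workers")
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[0::2], data[1::2]]

w = 0.0
for step in range(200):
    local_grads = [gradient(w, shard) for shard in shards]  # computed in parallel
    w -= 0.01 * all_reduce_mean(local_grads)  # every worker applies the same update
```

With equal-sized shards the averaged gradient equals the full-batch gradient, which is why every replica stays in sync without exchanging weights.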

3.2 Hyperparameter Tuning

from ray import tune
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X_train and y_train are assumed to be defined elsewhere
def train_model(config):
    model = RandomForestClassifier(
        n_estimators=config["n_estimators"],
        max_depth=config["max_depth"],
        min_samples_split=config["min_samples_split"]
    )

    scores = cross_val_score(model, X_train, y_train, cv=3)
    # Keyword-style reporting is the legacy Tune API; recent Ray releases expect a metrics dict
    tune.report(mean_accuracy=scores.mean())

# Hyperparameter search
analysis = tune.run(
    train_model,
    config={
        "n_estimators": tune.choice([50, 100, 200, 300]),
        "max_depth": tune.choice([5, 10, 15, None]),
        "min_samples_split": tune.choice([2, 5, 10]),
    },
    metric="mean_accuracy",  # needed so that best_config knows what to rank by
    mode="max",
    num_samples=50,
    resources_per_trial={"cpu": 2, "gpu": 0}
)

# Best parameters
best_config = analysis.best_config
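Stripped of the scheduling and resource management, the search above amounts to random search: sample a configuration, score it, keep the best. A minimal sketch under that assumption follows; `toy_score` is a made-up stand-in for the cross-validated accuracy, and the helper names are hypothetical.

```python
import random

SEARCH_SPACE = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
}


def sample_config(rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}


def toy_score(config):
    """Made-up objective standing in for cross-validated accuracy."""
    depth = config["max_depth"] or 20  # treat an unlimited depth as 20
    return (
        0.7
        + 0.0005 * config["n_estimators"]
        - 0.002 * abs(depth - 10)
        - 0.003 * config["min_samples_split"]
    )


def random_search(num_samples, seed=0):
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_samples)]
    return max(trials, key=toy_score)


best = random_search(num_samples=50)
```

In a real run each call to the objective is an expensive training job, which is why a tool that runs trials in parallel is worth the extra dependency.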

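4. Model Serving

4.1 Model Serving Architecture

graph TB
    subgraph "Request Entry"
        A["API Gateway"] --> B["Model Service"]
    end
    subgraph "Model Serving"
        B --> C["Model Cache"]
        B --> D["Model Inference"]
        C --> E["Feature Preprocessing"]
    end
    subgraph "Backend"
        D --> F["Feature Store"]
        E --> G["Data Sources"]
    end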

4.2 Model Serving Implementation

# Triton Inference Server
import tritonclient.http as httpclient
import numpy as np

# Create the inference client
client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare the input data; the tensor name, shape, and dtype must match the model configuration
inputs = [
    httpclient.InferInput("input", [1, 10], "FP32")
]
inputs[0].set_data_from_numpy(np.random.randn(1, 10).astype(np.float32))

# Send the inference request
outputs = [httpclient.InferRequestedOutput("output")]
response = client.infer("model_name", inputs, outputs=outputs)

# Read the result
result = response.as_numpy("output")

4.3 A/B Testing

import hashlib

# Model A/B test
class ModelABTest:
    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b
        self.traffic_split = 0.1  # 10% of traffic goes to B

    def predict(self, features, user_id):
        # Choose the model from a stable hash of the user ID. The built-in hash()
        # is salted per process for strings, so it would not keep a user in the
        # same bucket across restarts or replicas.
        digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
        if int(digest, 16) % 100 < self.traffic_split * 100:
            return self.model_b.predict(features)
        return self.model_a.predict(features)

    def evaluate(self, test_data):
        """Evaluate both models; assumes evaluate() returns a scalar where higher is better."""
        a_results = self.model_a.evaluate(test_data)
        b_results = self.model_b.evaluate(test_data)

        return {
            "model_a": a_results,
            "model_b": b_results,
            "improvement": (b_results - a_results) / a_results
        }
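A relative improvement on its own does not say whether the difference could be noise, especially when model B only sees a small share of traffic. One common way to check is a two-proportion z-test on a binary outcome such as conversions. The sketch below is a minimal version of that test; the function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both models perform equally
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: A converts 900 of 9000 users, B converts 130 of 1000
z, p_value = two_proportion_z_test(900, 9000, 130, 1000)
significant = p_value < 0.05
```

The 90/10 split in these counts matches the `traffic_split` of 0.1 used above; the smaller arm dominates the standard error, so a small B group needs a larger effect to reach significance.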

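5. Model Monitoring

5.1 Monitoring Metrics

Metric	Description	Alert threshold
Latency	P99 inference latency	> 100 ms
Throughput	QPS	More than 20% below baseline
Error rate	Share of failed inference requests	> 1%
Feature drift	Drift detection via PSI	PSI > 0.2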
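The thresholds in the table can be expressed as a small rule check. The sketch below assumes a nearest-rank percentile and made-up argument names; a production setup would normally express these rules in the monitoring system itself rather than in application code.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(latencies_ms, qps, baseline_qps, errors, requests, psi):
    """Return the names of the metrics that breach their alert threshold."""
    alerts = []
    if percentile(latencies_ms, 99) > 100:  # P99 latency above 100 ms
        alerts.append("latency")
    if qps < 0.8 * baseline_qps:  # more than 20% below baseline
        alerts.append("throughput")
    if errors / requests > 0.01:  # more than 1% failed requests
        alerts.append("error_rate")
    if psi > 0.2:  # feature drift
        alerts.append("feature_drift")
    return alerts


alerts = check_alerts(
    latencies_ms=list(range(1, 101)),
    qps=70,
    baseline_qps=100,
    errors=5,
    requests=1000,
    psi=0.25,
)
```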

5.2 Drift Detection

import numpy as np

def detect_drift(reference_data, current_data, threshold=0.2):
    """
    Detect feature drift with the Population Stability Index (PSI).
    """
    # Bin edges at the deciles of the reference data
    bins = np.percentile(reference_data, np.linspace(0, 100, 11))

    # Clip current values into the reference range; np.histogram silently drops
    # values outside the outermost edges, which would hide drift in the tails
    current_data = np.clip(current_data, bins[0], bins[-1])

    # Share of samples falling into each bin
    reference_perc = np.histogram(reference_data, bins=bins)[0] / len(reference_data)
    current_perc = np.histogram(current_data, bins=bins)[0] / len(current_data)

    # Avoid division by zero and log(0)
    reference_perc = np.where(reference_perc == 0, 0.0001, reference_perc)
    current_perc = np.where(current_perc == 0, 0.0001, current_perc)

    # Compute PSI
    psi = np.sum((current_perc - reference_perc) *
                  np.log(current_perc / reference_perc))

    return {
        "psi": psi,
        "drifted": psi > threshold,
        "severity": "high" if psi > 0.2 else "medium" if psi > 0.1 else "low"
    }
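For reference, the quantity computed by `detect_drift` can be written as a formula. The symbols are notation chosen here rather than taken from the original: $B$ is the number of bins, $r_i$ the share of reference samples in bin $i$, and $c_i$ the share of current samples in the same bin.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left( c_i - r_i \right) \ln \frac{c_i}{r_i}
```

Every term is non-negative, because $c_i - r_i$ and $\ln(c_i / r_i)$ always share the same sign, so PSI is zero only when the two distributions agree in every bin. As a rough sense of scale, a single bin whose share doubles from 0.1 to 0.2 contributes $0.1 \ln 2 \approx 0.069$.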

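6. Summary

graph TB
    A["ML System Design"] --> B["Data Pipeline"]
    A --> C["Feature Engineering"]
    A --> D["Model Training"]
    A --> E["Model Serving"]
    A --> F["Monitoring and Operations"]
    B --> B1["ETL"]
    C --> C1["Feature Store"]
    D --> D1["Distributed Training"]
    E --> E1["A/B Testing"]
    F --> F1["Drift Detection"]

Key points of an ML system: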