Agent Evaluation: How to Measure an Agent's Capabilities
2025-03-26

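Preface

Evaluating an agent is a hard engineering problem. Unlike a traditional API, an agent's output is not simply right or wrong, so it has to be assessed along several dimensions: task completion, tool calling, and cost efficiency. This chapter covers the methodology and implementation of agent evaluation, from quantitative metrics and benchmark suites to A/B testing frameworks and LLM-as-Judge.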

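1. Challenges Specific to Agent Evaluation

1.1 Evaluation Difficulties

| Difficulty | Description |
|------|------|
| Open-ended tasks | There is no single reference answer |
| Tool calls | They depend on external APIs that can be unstable |
| Multi-step reasoning | Intermediate steps are hard to trace |
| Hallucination risk | Incorrect information may propagate |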

1.2 Evaluation Dimensions

[Radar chart: agent evaluation dimensions, with axes for task completion, tool use, cost efficiency, response time, and safety]

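1.3 Types of Evaluation

Agent evaluation falls into three broad categories, each suited to a different stage:

| Evaluation type | When it runs | Purpose | Example |
|------|------|------|------|
| Offline evaluation | Development / testing | Verify basic capabilities | Benchmark scores |
| Online evaluation | Production | Monitor real-world behaviour | Success-rate and latency monitoring |
| Comparative evaluation | Iterative optimisation | Decide A/B experiments | Old version vs. new version |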

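2. Evaluating Task Completion

2.1 Defining Completion Criteria

from dataclasses import dataclass, field

@dataclass
class TaskResult:
    task_id: str
    user_intent: str
    agent_response: str
    tools_used: list[str]
    completed: bool            # human-labelled
    partial_completed: bool
    # Automatic metrics
    intent_match_score: float  # 0-1
    key_points_covered: list[str]
    # Fields read by the metric and cost functions later in this chapter
    required_entities: list[str] = field(default_factory=list)
    covered_entities: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    correct_tool_calls: list[bool] = field(default_factory=list)  # one flag per tool call
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    latency_ms: int = 0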

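2.2 Automated Evaluation Metrics

def evaluate_task_completion(result: TaskResult) -> dict:
    """Compute automated task-completion metrics."""
    def ratio(numerator: float, denominator: float) -> float:
        # Avoid ZeroDivisionError when a task has no entities or tool calls
        return numerator / denominator if denominator else 0.0

    metrics = {
        # 1. Key-entity coverage
        "entity_recall": ratio(len(result.covered_entities), len(result.required_entities)),
        # 2. Intent match
        "intent_match": result.intent_match_score,
        # 3. Tool-call accuracy (correct_tool_calls holds one bool per call)
        "tool_accuracy": ratio(sum(result.correct_tool_calls), len(result.tool_calls)),
        # 4. Hallucination check; detect_hallucination_rate is an external helper
        #    (not defined in this article) expected to return a rate in [0, 1]
        "hallucination_score": 1 - detect_hallucination_rate(result.agent_response),
    }
    return metrics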
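
The `entity_recall` metric presupposes that something has already decided which required entities the answer covers. One way to derive that list is sketched below, assuming plain case-insensitive substring matching is good enough; real systems often need alias handling or an NER model. The function name `compute_entity_recall` is hypothetical and not part of the original code.

```python
def compute_entity_recall(response: str, required_entities: list[str]) -> tuple[list[str], float]:
    """Return the required entities found in the response and the recall ratio."""
    text = response.casefold()
    # Assumption: an entity counts as covered if its surface form appears verbatim
    covered = [e for e in required_entities if e.casefold() in text]
    recall = len(covered) / len(required_entities) if required_entities else 0.0
    return covered, recall


covered, recall = compute_entity_recall(
    "Flights from Beijing to Shanghai depart daily.",
    ["Beijing", "Shanghai", "Guangzhou"],
)
print(covered, round(recall, 2))  # ['Beijing', 'Shanghai'] 0.67
```

Substring matching over-counts short entities that occur inside longer words, so it is best treated as a cheap first pass.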

2.3 A Formula for Task Completion

Quantifying an agent's task-completion ability precisely requires combining several metrics:

$$\text{Task Score} = w_1 \cdot \text{Intent Match} + w_2 \cdot \text{Entity Recall} + w_3 \cdot \text{Tool Accuracy} + w_4 \cdot \text{Factuality}$$

The weights $w_1, \dots, w_4$ can be adjusted for each business scenario:

@dataclass
class TaskScoreWeights:
    """Weight configuration for one scenario."""
    intent_match: float = 0.3
    entity_recall: float = 0.25
    tool_accuracy: float = 0.25
    factuality: float = 0.2

# Weights per scenario (each set sums to 1.0)
WEIGHTS = {
    "research": TaskScoreWeights(intent_match=0.2, entity_recall=0.35, tool_accuracy=0.15, factuality=0.3),
    "coding": TaskScoreWeights(intent_match=0.3, entity_recall=0.1, tool_accuracy=0.4, factuality=0.2),
    "customer_support": TaskScoreWeights(intent_match=0.4, entity_recall=0.2, tool_accuracy=0.2, factuality=0.2),
}

def compute_task_score(result: TaskResult, scenario: str = "research") -> float:
    weights = WEIGHTS[scenario]
    metrics = evaluate_task_completion(result)
    return (
        weights.intent_match * metrics["intent_match"]
        + weights.entity_recall * metrics["entity_recall"]
        + weights.tool_accuracy * metrics["tool_accuracy"]
        # The Factuality term of the formula is taken from hallucination_score
        + weights.factuality * metrics["hallucination_score"]
    )
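
The Task Score only stays in the 0-1 range if the weights sum to 1, which is easy to break when tuning them per scenario. The sketch below, under that assumption, checks the weights before scoring and works through the arithmetic for the "coding" weights; `weighted_task_score` and the sample metric values are hypothetical.

```python
import math


def weighted_task_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of metrics; raises if the weights do not sum to 1."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(w * metrics[name] for name, w in weights.items())


coding_weights = {"intent_match": 0.3, "entity_recall": 0.1, "tool_accuracy": 0.4, "factuality": 0.2}
sample_metrics = {"intent_match": 0.9, "entity_recall": 0.5, "tool_accuracy": 0.75, "factuality": 1.0}

# 0.3*0.9 + 0.1*0.5 + 0.4*0.75 + 0.2*1.0 = 0.27 + 0.05 + 0.30 + 0.20
score = weighted_task_score(sample_metrics, coding_weights)
print(round(score, 2))  # 0.82
```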

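2.4 Factuality Checking

class FactualityChecker:
    """Check the factual accuracy of an agent's answer."""

    def __init__(self, knowledge_base):
        self.kb = knowledge_base

    async def check(self, response: str, references: list[str]) -> dict:
        """Check every claim in the answer against evidence.

        Note: `references` is accepted but not used here; evidence is
        retrieved from the knowledge base instead.
        """
        claims = await self._extract_claims(response)
        results = []
        for claim in claims:
            evidence = await self.kb.search(claim, k=3)
            is_supported = await self._verify_claim(claim, evidence)
            results.append({
                "claim": claim,
                "supported": is_supported,
                "evidence": evidence[:1] if evidence else None,
            })
        supported_count = sum(1 for r in results if r["supported"])
        return {
            "total_claims": len(results),
            "supported": supported_count,
            "factuality_score": supported_count / len(results) if results else 0,
            "details": results,
        }

    async def _extract_claims(self, text: str) -> list[str]:
        """Extract verifiable claims from the answer."""
        extraction_prompt = f"""Extract the verifiable factual claims from the text below.
Each claim should be a self-contained sentence that can be checked.
Text: {text}
Output a JSON list."""
        # `llm` and `parse_json_list` are external helpers not defined in this article
        response = await llm.complete(extraction_prompt)
        return parse_json_list(response)

    async def _verify_claim(self, claim: str, evidence: list) -> bool:
        """Decide whether the evidence supports the claim."""
        verification_prompt = f"""Decide whether the claim below is supported by the evidence.
Claim: {claim}
Evidence: {evidence}
Answer "supported" or "not_supported"."""
        result = await llm.complete(verification_prompt)
        # "not_supported" contains "supported", so a substring test would
        # accept both answers; match the start of the verdict instead.
        return result.strip().strip('"').lower().startswith("supported")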

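3. Evaluating Tool Calls

3.1 Tool-Call Accuracy

def evaluate_tool_calls(ground_truth: list, predictions: list) -> dict:
    """Evaluate tool-call accuracy."""
    expected, predicted = set(ground_truth), set(predictions)
    hits = len(expected & predicted)
    return {
        # Guard against empty inputs to avoid ZeroDivisionError
        "precision": hits / len(predicted) if predicted else 0.0,
        "recall": hits / len(expected) if expected else 0.0,
        # Exact match of the full call sequence, order included
        "order_accuracy": ground_truth == predictions,
    }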
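
Exact sequence equality is an all-or-nothing ordering check: a single extra call makes it fail even when every expected call happened in the right order. A softer alternative, sketched here as an assumption rather than as part of the original metric set, scores ordering by the longest common subsequence (LCS) between the expected and actual tool sequences. The name `order_score` is hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two tool-name sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_score(ground_truth: list[str], predictions: list[str]) -> float:
    """Fraction of expected calls that appear in the expected relative order."""
    if not ground_truth:
        return 1.0 if not predictions else 0.0
    return lcs_length(ground_truth, predictions) / len(ground_truth)


expected = ["search", "fetch_page", "summarize"]
with_extra_call = ["search", "calculator", "fetch_page", "summarize"]
swapped = ["fetch_page", "search", "summarize"]

print(order_score(expected, with_extra_call))    # 1.0: order preserved despite the extra call
print(round(order_score(expected, swapped), 2))  # 0.67: one call is out of order
```

The extra call is not penalised here on purpose; it is better counted separately as an unnecessary call.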

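3.2 Evaluating Tool Selection

| Item | Metric |
|------|------|
| Tool selection | Correctness rate (%) |
| Parameter filling | Accuracy (%) |
| Tool-call order | Whether the order is reasonable |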

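3.3 End-to-End Tool-Call Evaluation

A complete tool-call evaluation covers the whole path from understanding the intent to filling in parameters:

@dataclass
class ToolCallEvaluation:
    """End-to-end evaluation of a single tool call."""
    # 1. Was the right tool selected?
    correct_tool_selected: bool
    # 2. Are all parameters present?
    all_params_filled: bool
    # 3. Are the parameter values correct?
    correct_param_values: dict[str, bool]  # param_name -> correct?
    # 4. Was the call made at a sensible time?
    call_timing_appropriate: bool  # False if the tool was called when it should not have been
    # 5. Was the result handled properly?
    result_utilized: bool  # whether the tool's output was actually used

def evaluate_tool_call_chain(
    agent_trace: list[dict],
    ground_truth: list[dict]
) -> dict:
    """Evaluate an entire tool-call chain."""
    metrics = {
        "tool_selection_accuracy": 0.0,
        "param_completeness": 0.0,
        "param_accuracy": 0.0,
        "ordering_correctness": 0.0,
        "unnecessary_calls": 0,
    }
    n = len(ground_truth)
    if n == 0:
        # No calls were expected, so every call the agent made is unnecessary
        metrics["unnecessary_calls"] = len(agent_trace)
        return metrics
    # Compare step by step (zip stops at the shorter sequence)
    for actual, expected in zip(agent_trace, ground_truth):
        if actual.get("tool") == expected.get("tool"):
            metrics["tool_selection_accuracy"] += 1
        # Check parameters
        actual_params = actual.get("params", {})
        expected_params = expected.get("params", {})
        if not expected_params:
            # Nothing to fill in: count the step as complete and accurate
            metrics["param_completeness"] += 1
            metrics["param_accuracy"] += 1
            continue
        filled = sum(1 for k in expected_params if k in actual_params)
        metrics["param_completeness"] += filled / len(expected_params)
        correct = sum(
            1 for k in expected_params
            if actual_params.get(k) == expected_params[k]
        )
        metrics["param_accuracy"] += correct / len(expected_params)
    metrics["tool_selection_accuracy"] /= n
    metrics["param_completeness"] /= n
    metrics["param_accuracy"] /= n
    # Simple assumption: the order is correct only if the tool-name sequences match exactly
    metrics["ordering_correctness"] = float(
        [s.get("tool") for s in agent_trace] == [s.get("tool") for s in ground_truth]
    )
    # Count surplus calls
    metrics["unnecessary_calls"] = max(0, len(agent_trace) - len(ground_truth))
    return metrics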

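3.4 Common Tool-Call Failure Modes

| Problem | Symptom | Root cause |
|------|------|------|
| Wrong tool selected | Uses the calculator for a search task | Tool descriptions in the prompt are unclear |
| Missing parameter | Calls search without a query argument | Flawed parameter-extraction logic |
| Hallucinated parameter value | Asks for the weather in a city that does not exist | The LLM invented a value |
| Unnecessary call | Calls search for a simple question | No check for information it already knows |
| Ignored result | The tool returned a result that was never used | Poor context-window management |
| Call loop | Calls the same tool repeatedly with the same arguments | Missing termination condition |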
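
Of the failure modes above, call loops are the easiest to catch mechanically, because the symptom is purely structural. A minimal sketch, assuming each trace step is a dict with `tool` and `params` keys as in `evaluate_tool_call_chain`; the function name and the `max_repeats` threshold are hypothetical choices.

```python
import json
from collections import Counter


def find_repeated_calls(trace: list[dict], max_repeats: int = 2) -> list[tuple[str, int]]:
    """Return (tool, count) for calls made with identical params more than max_repeats times."""
    counts = Counter(
        # Serialise params with sorted keys so that key order does not matter
        (step.get("tool"), json.dumps(step.get("params", {}), sort_keys=True))
        for step in trace
    )
    return [(tool, n) for (tool, _), n in counts.items() if n > max_repeats]


trace = [
    {"tool": "search", "params": {"query": "weather in Paris"}},
    {"tool": "search", "params": {"query": "weather in Paris"}},
    {"tool": "search", "params": {"query": "weather in Paris"}},
    {"tool": "calculator", "params": {"expr": "1+1"}},
]
print(find_repeated_calls(trace))  # [('search', 3)]
```

The same check can run online as a circuit breaker, stopping the agent once a call repeats too often.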

4. Evaluating Cost Efficiency

4.1 Token Cost

@dataclass
class CostMetrics:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    latency_ms: int
    cost_usd: float

def calculate_cost(result: TaskResult) -> CostMetrics:
    """Compute the cost of one task."""
    return CostMetrics(
        prompt_tokens=result.prompt_tokens,
        completion_tokens=result.completion_tokens,
        total_tokens=result.total_tokens,
        latency_ms=result.latency_ms,
        # calculate_cost_usd is an external pricing helper not defined in this article
        cost_usd=calculate_cost_usd(result.total_tokens)
    )

4.2 Cost-Efficiency Formula

$$\text{Cost Efficiency} = \frac{\text{Task Quality Score}}{\text{Cost in USD}}$$
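
As a worked example of the ratio, the sketch below prices a single LLM call from its token counts and divides a quality score by that cost. The helper names are hypothetical, and the prices are illustrative values matching the `gpt-4o` entry in the pricing table of section 4.4, not a statement of current pricing.

```python
def call_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Cost of one call, given prices in USD per one million tokens."""
    return (prompt_tokens * input_price_per_m + completion_tokens * output_price_per_m) / 1_000_000


def cost_efficiency(quality_score: float, cost_usd: float) -> float:
    """Quality points obtained per dollar spent."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return quality_score / cost_usd


# 2000 * 2.50 + 500 * 10.00 = 10,000 micro-dollars = $0.01
cost = call_cost_usd(2000, 500, input_price_per_m=2.50, output_price_per_m=10.00)
print(f"{cost:.4f}")                           # 0.0100
print(round(cost_efficiency(0.82, cost), 1))   # 82.0
```

Because the ratio is unbounded, it is only meaningful for comparing variants on the same task set, not as an absolute score.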

4.3 Detailed Cost Attribution

The cost of one agent task comes from several stages, which need to be tracked separately:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class CostRecord:
    """Cost record for a single call."""
    timestamp: datetime
    agent_name: str
    task_type: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    cost_usd: float
    success: bool

class CostTracker:
    """Cost tracker."""

    def __init__(self):
        self.records: list[CostRecord] = []

    def record(self, **kwargs):
        self.records.append(CostRecord(timestamp=datetime.now(), **kwargs))

    def summary(self, group_by: str = "agent_name") -> dict:
        """Aggregate cost by one dimension (any CostRecord field name)."""
        groups = {}
        for r in self.records:
            key = getattr(r, group_by)
            if key not in groups:
                groups[key] = {"total_cost": 0, "total_tokens": 0, "count": 0, "success": 0}
            groups[key]["total_cost"] += r.cost_usd
            groups[key]["total_tokens"] += r.prompt_tokens + r.completion_tokens
            groups[key]["count"] += 1
            groups[key]["success"] += 1 if r.success else 0
        return groups

    def cost_per_success(self) -> float:
        """Average cost of the successful tasks (spend on failed tasks is excluded)."""
        successful = [r for r in self.records if r.success]
        if not successful:
            return float("inf")
        return sum(r.cost_usd for r in successful) / len(successful)

4.4 Comparing Costs Across Models

The choice of underlying model has a large effect on agent cost:

# Common model pricing (reference prices as of early 2026, in USD per 1M tokens)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "claude-haiku-3.5": {"input": 0.80, "output": 4.00},
    "deepseek-chat": {"input": 0.14, "output": 0.28},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19},
    "gemini-2.0-flash": {"input": 0.10, "output": 0.40},
    "gemini-2.5-pro": {"input": 1.25, "output": 10.00},
}

def estimate_monthly_cost(
    daily_requests: int,
    avg_prompt_tokens: int = 2000,
    avg_completion_tokens: int = 500,
    avg_tool_calls_per_request: int = 3,
    model: str = "gpt-4o",
) -> dict:
    """Estimate the monthly cost."""
    pricing = MODEL_PRICING[model]
    # Cost of one LLM call
    cost_per_call = (
        avg_prompt_tokens * pricing["input"] / 1_000_000
        + avg_completion_tokens * pricing["output"] / 1_000_000
    )
    # LLM calls per request = initial call + tool calls * 2 (call + result handling) + final summary
    calls_per_request = 1 + avg_tool_calls_per_request * 2 + 1
    daily_cost = daily_requests * calls_per_request * cost_per_call
    monthly_cost = daily_cost * 30
    return {
        "model": model,
        "cost_per_call": cost_per_call,
        "calls_per_request": calls_per_request,
        "daily_cost_usd": daily_cost,
        "monthly_cost_usd": monthly_cost,
    }

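5. Agent Benchmarks

5.1 Common Benchmarks

| Benchmark | What it evaluates | Suited to | Size |
|------|------|------|------|
| GAIA | Real-world tasks | General-purpose agents | 466 tasks |
| AgentBench | Multi-dimensional agent capability | Overall assessment | 8 sub-tasks |
| WebArena | Web operation | Web agents | 812 tasks |
| MMLU | Multi-subject knowledge | Question-answering agents | 14,000+ questions |
| HotpotQA | Multi-hop reasoning | Reasoning agents | 113,000+ questions |
| API-Bank | Tool use | Tool agents | 53 APIs |
| TAU-Bench | Airline / retail tasks | Real-world tasks | Conversational |
| HumanEval | Code generation | Coding agents | 164 problems |
| SWE-bench | Real-world software engineering | Coding agents | 2,294 tasks |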

5.2 The Benchmarks in Detail

flowchart TB
    subgraph general["General-capability benchmarks"]
        GAIA["GAIA<br/>Real-world reasoning tasks"]
        AgentBench["AgentBench<br/>Eight-dimension overall assessment"]
    end
    subgraph tools["Tool-use benchmarks"]
        APIBank["API-Bank<br/>API-calling ability"]
        WebArena["WebArena<br/>Web operation"]
        TAU["TAU-Bench<br/>Real business scenarios"]
    end
    subgraph specialised["Specialised benchmarks"]
        HumanEval["HumanEval<br/>Code generation"]
        SWEBench["SWE-bench<br/>Software engineering"]
        HotpotQA["HotpotQA<br/>Multi-hop reasoning"]
    end

5.3 AgentBench in Practice

AgentBench is one of the most comprehensive agent evaluation suites available, covering eight sub-tasks:

class AgentBenchRunner:
    """Illustrative AgentBench-style runner (not the official harness)."""
    # The eight AgentBench environments; the keys are shorthand chosen for this example
    TASKS = [
        "os",                # operating-system interaction
        "db",                # database (SQL) operations
        "kg",                # knowledge-graph question answering
        "digital_card",      # digital card game
        "lateral_thinking",  # lateral-thinking puzzles
        "householding",      # household tasks
        "webshop",           # online shopping
        "web_browsing",      # web browsing (Mind2Web)
    ]

    async def evaluate_agent(self, agent, tasks: list[str] | None = None) -> dict:
        tasks = tasks or self.TASKS
        results = {}
        for task in tasks:
            # _create_env is a placeholder for environment setup, not shown here
            task_env = self._create_env(task)
            score = 0
            total = 0
            for episode in task_env.episodes:
                obs = task_env.reset(episode)
                done = False
                steps = 0
                while not done and steps < task_env.max_steps:
                    action = await agent.act(obs, task_env.available_actions)
                    obs, reward, done, info = task_env.step(action)
                    score += reward
                    steps += 1
                total += 1
            results[task] = {
                "score": score,
                "total_episodes": total,
                "average_score": score / total if total > 0 else 0,
            }
        return results

5.4 WebArena: Evaluating Web Agents

WebArena focuses on how well an agent operates in a realistic web environment:

class WebArenaEvaluator:
    """Illustrative WebArena-style evaluator."""
    MAX_STEPS = 30  # step budget per test case

    async def evaluate(self, agent, test_cases: list[dict]) -> dict:
        results = {
            "total": len(test_cases),
            "success": 0,
            "partial_success": 0,
            "failure": 0,
            "avg_steps": 0,
        }
        for case in test_cases:
            # Start the browser environment (WebEnvironment is assumed, not defined here)
            env = WebEnvironment(
                start_url=case["start_url"],
                intent=case["intent"],
                target_url=case.get("target_url"),
            )
            steps = 0
            done = False
            while not done and steps < self.MAX_STEPS:
                # Read the page state
                page_state = env.get_state()
                # Let the agent decide
                action = await agent.decide(page_state, case["intent"])
                # Execute the action
                env.execute(action)
                steps += 1
                # Check for completion
                if env.check_completion(case["eval_criteria"]):
                    results["success"] += 1
                    done = True
                elif steps >= self.MAX_STEPS:
                    # Out of steps: check for partial completion
                    if env.check_partial_completion(case["eval_criteria"]):
                        results["partial_success"] += 1
                    else:
                        results["failure"] += 1
            results["avg_steps"] += steps
        total = results["total"]
        # Guard against an empty test set
        results["avg_steps"] = results["avg_steps"] / total if total else 0
        results["success_rate"] = results["success"] / total if total else 0
        return results

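5.5 The AgentEval Framework

# Illustrative interface only: the import path, constructor arguments and
# load_benchmark helper are shown as in the original post and have not been
# checked against a published package.
from agenteval import AgentEvaluator

evaluator = AgentEvaluator(
    tasks=load_benchmark("GAIA"),
    metrics=["task_completion", "tool_usage", "cost"]
)
results = evaluator.evaluate(your_agent)
print(f"Overall Score: {results.overall_score}")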

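6. A/B Testing Framework

6.1 Designing A/B Tests for Agents

Iterating on an agent system requires rigorous A/B testing. Unlike traditional web A/B tests, an agent's output is non-deterministic, so the experiment needs extra care in its design:

flowchart LR
    A["User request"] --> B{"Traffic split<br/>50/50"}
    B -->|"Group A"| C["Agent v1<br/>Current version"]
    B -->|"Group B"| D["Agent v2<br/>New version"]
    C --> E["Collect metrics"]
    D --> E
    E --> F["Statistical significance test"]
    F --> G["Decision: ship or not"]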

6.2 Implementing an A/B Test Engine

import hashlib
from dataclasses import dataclass

from scipy import stats

@dataclass
class ABTestConfig:
    test_name: str
    variants: dict[str, float]  # variant_name -> traffic_ratio (ratios should sum to 1)
    metrics: list[str]
    min_sample_size: int = 1000
    significance_level: float = 0.05

class ABTestEngine:
    """A/B test engine for agents."""

    def __init__(self):
        # test_name -> variant_name -> list of per-request metric dicts
        self.results: dict[str, dict[str, list[dict]]] = {}
        self.active_tests: dict[str, ABTestConfig] = {}

    def register(self, config: ABTestConfig):
        self.active_tests[config.test_name] = config
        self.results[config.test_name] = {v: [] for v in config.variants}

    def assign_variant(self, test_name: str, user_id: str) -> str:
        """Assign a variant deterministically from the user ID."""
        config = self.active_tests[test_name]
        hash_val = int(hashlib.md5(f"{test_name}:{user_id}".encode()).hexdigest(), 16)
        bucket = (hash_val % 100) / 100  # one of 0.00, 0.01, ..., 0.99
        cumulative = 0.0
        for variant, ratio in config.variants.items():
            cumulative += ratio
            # Strict "<" so that a 50/50 split gets exactly 50 buckets per variant
            if bucket < cumulative:
                return variant
        return list(config.variants.keys())[-1]

    def record_result(self, test_name: str, variant: str, metrics: dict):
        """Record one experiment result."""
        self.results[test_name][variant].append(metrics)

    def analyze(self, test_name: str) -> dict:
        """Analyse the experiment results."""
        config = self.active_tests[test_name]
        report = {}
        for metric in config.metrics:
            variant_stats = {}
            for variant, data in self.results[test_name].items():
                values = [d[metric] for d in data if metric in d]
                variant_stats[variant] = {
                    "mean": sum(values) / len(values) if values else 0,
                    "std": self._std(values),
                    "n": len(values),
                }
            # Significance is only tested for two-variant experiments
            variants = list(variant_stats.keys())
            if len(variants) == 2:
                is_significant = self._t_test(
                    variant_stats[variants[0]],
                    variant_stats[variants[1]],
                    config.significance_level,
                )
            else:
                is_significant = False
            report[metric] = {
                "variant_stats": variant_stats,
                "is_significant": is_significant,
            }
        return report

    def _std(self, values: list[float]) -> float:
        """Sample standard deviation (n - 1 in the denominator)."""
        if len(values) < 2:
            return 0
        mean = sum(values) / len(values)
        return (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5

    def _t_test(self, a: dict, b: dict, alpha: float) -> bool:
        """Welch's two-sample t-test computed from summary statistics."""
        if a["n"] < 30 or b["n"] < 30:
            return False  # too few samples to draw a conclusion
        _, p_value = stats.ttest_ind_from_stats(
            a["mean"], a["std"], a["n"],
            b["mean"], b["std"], b["n"],
            equal_var=False,
        )
        return bool(p_value < alpha)

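6.3 Things to Watch in Agent A/B Tests

| Consideration | Description |
|------|------|
| Minimum sample size | At least 1000 samples per variant |
| Prefer new users | Avoid confusing a user by exposing them to different versions |
| Time balance | Traffic differs by time of day, so spread variants evenly across time |
| Metric selection | Track quality metrics and cost metrics together |
| Gradual rollout | After a significant win, increase traffic step by step rather than switching all at once |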
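
A fixed minimum of 1000 samples per variant is a rule of thumb; the sample size actually needed depends on how small a difference the experiment must detect. The sketch below uses the standard normal-approximation formula for comparing two proportions (for example, task success rates). It is a planning aid under that approximation, and the function name and default power of 0.8 are assumptions.

```python
import math
from statistics import NormalDist


def required_sample_size(
    p_baseline: float,
    p_variant: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Samples per variant needed to detect p_baseline vs p_variant (two-sided test)."""
    if p_baseline == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    n = (z_alpha + z_beta) ** 2 * variance / (p_baseline - p_variant) ** 2
    return math.ceil(n)


# Detecting an improvement in success rate from 80% to 85%
print(required_sample_size(0.80, 0.85))
```

For an 80% to 85% improvement this comes out at roughly 900 samples per variant, the same order of magnitude as the 1000-sample guideline; detecting a smaller lift requires considerably more.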

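7. In Practice: Building an Evaluation System

7.1 Evaluation Workflow

graph TB
    A["Collect user feedback"] --> B["Compute automated metrics"]
    B --> C["LLM-as-Judge"]
    C --> D["Combined score"]
    D --> E["Generate report"]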

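7.2 LLM-as-Judge

async def llm_judge_evaluate(response: str, criteria: list[str]) -> float:
    """Score agent output with a stronger LLM."""
    criteria_text = "\n".join(criteria)
    judge_prompt = f"""
Evaluate the following agent response:
Response: {response}
Evaluation criteria:
{criteria_text}
Give a score from 0 to 10 and a brief justification.
"""
    # judge_llm and parse_score are external helpers not defined in this article
    result = await judge_llm.complete(judge_prompt)
    return parse_score(result)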

7.3 A More Robust LLM-as-Judge

Plain LLM scoring tends to be unstable. The following is a more robust implementation:

from collections import Counter

class RobustLLMJudge:
    """A more robust LLM-as-Judge evaluator."""

    def __init__(self, judge_model: str = "claude-opus-4"):
        self.judge_model = judge_model
        self.calibration_samples = []  # calibration samples

    async def judge(
        self,
        query: str,
        response: str,
        reference: str | None = None,
        rubric: list[str] | None = None,
    ) -> dict:
        """Evaluate a single agent response."""
        rubric_text = "\n".join(f"- {r}" for r in (rubric or self._default_rubric()))
        reference_line = f"Reference answer: {reference}" if reference else ""
        prompt = f"""You are an expert evaluator of agent output. Apply the rubric strictly.
User question: {query}
Agent answer: {response}
{reference_line}
Rubric:
{rubric_text}
Respond in JSON:
{{
"scores": {{
"relevance": <0-10>,
"accuracy": <0-10>,
"completeness": <0-10>,
"clarity": <0-10>,
"safety": <0-10>
}},
"overall": <0-10>,
"reasoning": "<brief justification>",
"improvement_suggestions": ["<suggestion 1>", "<suggestion 2>"]
}}"""
        # `llm` and `parse_json` are external helpers not defined in this article
        result = await llm.complete(prompt, model=self.judge_model)
        parsed = parse_json(result)
        return {
            "scores": parsed.get("scores", {}),
            "overall": parsed.get("overall", 0),
            "reasoning": parsed.get("reasoning", ""),
            "suggestions": parsed.get("improvement_suggestions", []),
        }

    async def judge_pairwise(
        self,
        query: str,
        response_a: str,
        response_b: str,
    ) -> dict:
        """Pairwise comparison, which reduces the instability of absolute LLM scores."""
        prompt = f"""Compare the two agent answers to the same question and decide which is better.
Question: {query}
Answer A: {response_a}
Answer B: {response_b}
Respond in JSON:
{{
"winner": "A" or "B" or "tie",
"confidence": <0.0-1.0>,
"reasoning": "<brief justification>"
}}"""
        result = await llm.complete(prompt, model=self.judge_model)
        return parse_json(result)

    async def batch_judge(
        self,
        queries_and_responses: list[tuple[str, str]],
        reference_answers: list[str] | None = None,
    ) -> dict:
        """Evaluate a batch and compute summary statistics."""
        results = []
        for i, (query, response) in enumerate(queries_and_responses):
            ref = reference_answers[i] if reference_answers else None
            result = await self.judge(query, response, reference=ref)
            results.append(result)
        if not results:
            return {"individual_results": [], "summary": {}}
        # Summary statistics
        overall_scores = [r["overall"] for r in results]
        return {
            "individual_results": results,
            "summary": {
                "mean_score": sum(overall_scores) / len(overall_scores),
                "min_score": min(overall_scores),
                "max_score": max(overall_scores),
                "score_distribution": self._compute_distribution(overall_scores),
            }
        }

    def _compute_distribution(self, scores: list[float]) -> dict[int, int]:
        """Histogram of scores rounded to the nearest integer.

        Assumed definition: the original helper is not shown in the article.
        """
        return dict(sorted(Counter(round(s) for s in scores).items()))

    def _default_rubric(self) -> list[str]:
        return [
            "Relevance: does the answer address the question?",
            "Accuracy: are the facts correct?",
            "Completeness: does it cover all the key points?",
            "Clarity: is it clearly expressed and easy to follow?",
            "Safety: is it free of harmful content?",
        ]
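
Pairwise judging has its own source of instability: LLM judges can favour whichever answer is shown first. A common mitigation is to ask twice with the positions swapped and keep the verdict only when both runs agree. The sketch below shows that logic with stub judges standing in for a real call such as `judge_pairwise`; the `PairwiseJudge` signature and all function names are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable

# A judge takes (query, answer_a, answer_b) and returns "A", "B" or "tie"
PairwiseJudge = Callable[[str, str, str], Awaitable[str]]


async def debiased_pairwise(judge: PairwiseJudge, query: str, resp_1: str, resp_2: str) -> str:
    """Judge twice with positions swapped; return "1", "2" or "tie"."""
    first = await judge(query, resp_1, resp_2)   # resp_1 is shown as answer A
    second = await judge(query, resp_2, resp_1)  # resp_1 is shown as answer B
    if first == "A" and second == "B":
        return "1"
    if first == "B" and second == "A":
        return "2"
    return "tie"  # tied or inconsistent verdicts are not trusted


async def always_prefers_a(query: str, a: str, b: str) -> str:
    """Stub judge with pure position bias."""
    return "A"


async def prefers_longer(query: str, a: str, b: str) -> str:
    """Stub judge with a position-independent preference."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


biased = asyncio.run(debiased_pairwise(always_prefers_a, "q", "x", "y"))
consistent = asyncio.run(debiased_pairwise(prefers_longer, "q", "short", "a longer answer"))
print(biased)      # tie: the position-biased verdict is discarded
print(consistent)  # 2: the second response wins in both orderings
```

The cost is two judge calls per comparison, which is usually acceptable for offline evaluation.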

7.4 An Automated Evaluation Pipeline

All of the evaluation methods above can be combined into one automated pipeline:

class EvaluationPipeline:
    """Automated evaluation pipeline."""

    def __init__(self, config: dict):
        # Fall back to the judge's default model when none is configured
        self.judge = RobustLLMJudge(config.get("judge_model", "claude-opus-4"))
        self.cost_tracker = CostTracker()
        self.benchmarks = config.get("benchmarks", [])

    async def evaluate(self, agent, test_set: list[dict]) -> dict:
        """Run the full evaluation."""
        results = {
            "task_metrics": [],
            "tool_metrics": [],
            "cost_metrics": [],
            "judge_scores": [],
        }
        for test_case in test_set:
            # Run the agent; the result is assumed to expose .trace and .response
            # in addition to the TaskResult fields
            agent_result = await agent.run(test_case["query"])
            # 1. Task-completion metrics
            task_metrics = evaluate_task_completion(agent_result)
            results["task_metrics"].append(task_metrics)
            # 2. Tool-call metrics
            tool_metrics = evaluate_tool_call_chain(
                agent_result.trace,
                test_case.get("expected_tools", [])
            )
            results["tool_metrics"].append(tool_metrics)
            # 3. Cost metrics
            cost = calculate_cost(agent_result)
            results["cost_metrics"].append(cost)
            # 4. LLM-as-Judge
            judge_result = await self.judge.judge(
                query=test_case["query"],
                response=agent_result.response,
                reference=test_case.get("reference_answer"),
            )
            results["judge_scores"].append(judge_result)
        # Summary report
        return self._generate_report(results)

    def _generate_report(self, results: dict) -> dict:
        """Generate the evaluation report."""
        task_scores = [m.get("intent_match", 0) for m in results["task_metrics"]]
        judge_scores = [r["overall"] for r in results["judge_scores"]]
        costs = [c.cost_usd for c in results["cost_metrics"]]
        if not judge_scores:
            return {"details": results}  # empty test set: nothing to aggregate
        avg_judge = sum(judge_scores) / len(judge_scores)
        avg_cost = sum(costs) / len(costs)
        return {
            "overall_score": avg_judge,
            # A task counts as completed when its intent-match score is at least 0.8
            "task_completion_rate": sum(1 for s in task_scores if s >= 0.8) / len(task_scores),
            "avg_cost_per_task": avg_cost,
            "cost_efficiency": avg_judge / avg_cost if avg_cost else float("inf"),
            "details": results,
        }

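8. Comparing Evaluation Frameworks

8.1 Open-Source Evaluation Frameworks

| Framework | Language | Characteristics | Supported benchmarks |
|------|------|------|------|
| AgentBench | Python | The most comprehensive multi-dimensional evaluation | 8 built-in sub-tasks |
| WebArena | Python | Tests in a realistic web environment | 812 web-operation tasks |
| AgentEval | Python | Lightweight and easy to extend | Custom |
| Promptfoo | TS/JS | Prompt-level comparative evaluation | Custom + built-in |
| LangSmith Eval | Python | Deep integration with LangChain | Custom |
| AutoEval | Python | Generates evaluation cases automatically | Custom |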

8.2 Choosing a Framework

flowchart TD
    A["Choose an evaluation framework"] --> B{"What is the main thing to evaluate?"}
    B -->|"General agent capability"| C["AgentBench"]
    B -->|"Web agent"| D["WebArena"]
    B -->|"Prompt quality"| E["Promptfoo"]
    B -->|"LangChain project"| F["LangSmith Eval"]
    B -->|"Custom scenario"| G{"Need generated test cases?"}
    G -->|"Yes"| H["AutoEval"]
    G -->|"No"| I["AgentEval"]

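9. Summary of Quantitative Metrics

9.1 Metric Overview

| Category | Metric | How it is computed | Target |
|------|------|------|------|
| Task completion | Task Completion Rate | Successful tasks / total tasks | > 0.85 |
| Task completion | Intent Match Score | Intent-match score (0-1) | > 0.9 |
| Task completion | Partial Completion | Partial-completion rate | < 0.1 |
| Tool use | Tool Accuracy | Correct tool calls / total calls | > 0.9 |
| Tool use | Param Completeness | Fill rate of required parameters | > 0.95 |
| Tool use | Unnecessary Calls | Number of surplus calls | < 2 |
| Quality | Factuality Score | Share of claims backed by evidence | > 0.9 |
| Quality | LLM Judge Score | Overall LLM-judge score | > 7.0/10 |
| Quality | Hallucination Rate | Share of hallucinated claims | < 0.05 |
| Efficiency | Avg Latency (ms) | Mean response latency | < 3000 |
| Efficiency | Avg Token/Task | Mean tokens consumed per task | Scenario-dependent |
| Efficiency | Cost/Success ($) | Cost per successful task | Scenario-dependent |
| Safety | Injection Block Rate | Share of injection attacks blocked | > 0.95 |
| Safety | Data Leak Rate | Sensitive-data leak rate | = 0 |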
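
Targets like these are most useful when they are checked automatically, for instance as a gate in CI. The sketch below encodes a few of the thresholds from the table and reports which measured metrics miss them; the dictionary keys and the `failing_metrics` function are hypothetical names.

```python
import operator

# metric name -> (comparison that must hold, target value), taken from the table
THRESHOLDS = {
    "task_completion_rate": (operator.gt, 0.85),
    "tool_accuracy": (operator.gt, 0.9),
    "hallucination_rate": (operator.lt, 0.05),
    "avg_latency_ms": (operator.lt, 3000),
    "data_leak_rate": (operator.eq, 0),
}


def failing_metrics(measured: dict[str, float]) -> list[str]:
    """Names of measured metrics that miss their target; unmeasured metrics are skipped."""
    return [
        name
        for name, (compare, target) in THRESHOLDS.items()
        if name in measured and not compare(measured[name], target)
    ]


measured = {
    "task_completion_rate": 0.88,
    "tool_accuracy": 0.86,
    "hallucination_rate": 0.07,
    "avg_latency_ms": 2100,
}
print(failing_metrics(measured))  # ['tool_accuracy', 'hallucination_rate']
```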

9.2 Evaluation Report Template

from datetime import datetime

class EvaluationReport:
    """Evaluation report generator."""

    @staticmethod
    def _status(ok: bool) -> str:
        # Status marker for the last table column
        return "PASS" if ok else "FAIL"

    def generate(self, results: dict) -> str:
        s = self._status
        report = f"""
# Agent Evaluation Report
## Summary
- Evaluated at: {datetime.now().isoformat()}
- Test cases: {results['total_cases']}
- Overall score: {results['overall_score']:.2f} / 10
## Task Completion
| Metric | Value | Target | Status |
|------|-----|------|------|
| Completion rate | {results['completion_rate']:.1%} | > 85% | {s(results['completion_rate'] > 0.85)} |
| Intent match | {results['intent_match']:.2f} | > 0.9 | {s(results['intent_match'] > 0.9)} |
## Tool Use
| Metric | Value | Target | Status |
|------|-----|------|------|
| Accuracy | {results['tool_accuracy']:.1%} | > 90% | {s(results['tool_accuracy'] > 0.9)} |
| Parameter completeness | {results['param_completeness']:.1%} | > 95% | {s(results['param_completeness'] > 0.95)} |
## Cost Efficiency
| Metric | Value | Target | Status |
|------|-----|------|------|
| Average latency | {results['avg_latency_ms']:.0f}ms | < 3000ms | {s(results['avg_latency_ms'] < 3000)} |
| Cost per task | ${results['cost_per_task']:.4f} | < $0.10 | {s(results['cost_per_task'] < 0.1)} |
"""
        return report

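10. Summary

| Evaluation dimension | Key metrics |
|------|------|
| Task completion | Completion rate, intent match |
| Tool use | Precision, recall |
| Cost efficiency | Token consumption, latency |
| Safety | Hallucination rate, jailbreak attempts |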

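10.1 Implementation Roadmap

  1. Week 1: build the automated evaluation pipeline and wire in the Task Completion and Tool Accuracy metrics
  2. Week 2: add LLM-as-Judge and establish a golden dataset
  3. Week 3: launch the A/B testing framework and start comparative experiments
  4. Week 4: add benchmark evaluation (for example AgentBench) and compare against industry results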


Agent Evaluation: How to Measure an Agent's Capabilities
https://blog.souloss.com/posts/machine-learning/agent-guide/agent-evaluation-system/
Author
Souloss
Published
2025-03-26
License
CC BY-NC-SA 4.0
