Evaluations
Llama Stack provides a set of APIs to support running evaluations of LLM applications:

- /datasetio + /datasets API
- /scoring + /scoring_functions API
- /eval + /benchmarks API
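Each of these API groups is surfaced on the Python client. The sketch below is a quick, hedged orientation: it lists what is registered on a running Llama Stack server, but method names such as client.benchmarks.list() are assumptions that may vary with your llama-stack-client version, so verify them against your installed client.

from llama_stack_client import LlamaStackClient
from rich.pretty import pprint

# HOST and PORT point at your running Llama Stack server
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

# Datasets registered through the /datasetio + /datasets APIs
pprint(client.datasets.list())

# Scoring functions available through the /scoring + /scoring_functions APIs
pprint(client.scoring_functions.list())

# Benchmarks registered through the /eval + /benchmarks APIs
# (assumption: newer client versions expose this as `client.benchmarks`)
pprint(client.benchmarks.list())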
This guide walks you through evaluating an LLM application built with Llama Stack. Check out the evaluation reference guide, which covers the API set and the developer-experience flows for running benchmark and application evaluations with Llama Stack in detail. Also check out our Colab notebook with working examples of evaluations here.
Application Evaluation
Llama Stack offers a library of scoring functions and the /scoring API, which allow you to run evaluations on pre-annotated AI application datasets.

In this example, we will show you how to:

1. Build an Agent with Llama Stack
2. Query the agent's sessions, turns, and steps
3. Evaluate the results
Building a Search Agent
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions. ",
    tools=["builtin::websearch"],
)
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()
Query Agent Execution Steps

Now, let's look deeper into the agent's execution steps and see how well our agent performed.
# query the agent's session
from rich.pretty import pprint

session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)
As a sanity check, we will first verify that every user prompt is followed by a tool call to brave_search.
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
Evaluate Agent Responses

Now, we want to evaluate the agent's responses to the user prompts.
1. First, we will process the agent's execution history into a list of rows that can be used for evaluation.
2. Next, we will label the rows with the expected answers.
3. Finally, we will use the /scoring API to score the agent's responses.
eval_rows = []

expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)

scoring_params = {
    "basic::subset_of": None,
}
scoring_response = client.scoring.score(
    input_rows=eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)
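To dig into the scores, the response can be broken down per scoring function. The sketch below assumes the response shape used by recent llama-stack-client versions, where results is keyed by scoring function id and each entry carries score_rows and aggregated_results; verify these field names against your installed client.

# Assumed response shape: `results` keyed by scoring function id,
# each entry holding per-row scores and aggregated metrics.
subset_of_result = scoring_response.results["basic::subset_of"]

for row, score_row in zip(eval_rows, subset_of_result.score_rows):
    print(f"{row['input_query']} -> {score_row}")

print("Aggregated:", subset_of_result.aggregated_results)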