Evaluations

Llama Stack provides a set of APIs to support running evaluations of LLM applications (a short sketch of listing these resources via the Python client follows the list below):

  • /datasetio + /datasets API

  • /scoring + /scoring_functions API

  • /eval + /benchmarks API
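
As a quick orientation, here is a minimal sketch of listing the resources behind each of these API groups with the Python client. It assumes a running Llama Stack server with the corresponding providers configured; the base URL and exact resource names may vary across distributions and client versions.

from llama_stack_client import LlamaStackClient

# Assumes a local Llama Stack server; adjust the base URL for your deployment
client = LlamaStackClient(base_url="http://localhost:8321")

# Datasets registered for the /datasetio + /datasets APIs
print(client.datasets.list())

# Scoring functions available through the /scoring + /scoring_functions APIs
print(client.scoring_functions.list())

# Benchmarks registered for the /eval + /benchmarks APIs
print(client.benchmarks.list())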

This guide will walk you through the process of evaluating an LLM application built with Llama Stack. For the full set of APIs and the developer-experience flow for running benchmark and application evaluations with Llama Stack, see the Evaluation Reference Guide. Working examples of evaluation are available in our Colab notebook here

Application Evaluation


Llama Stack offers a library of scoring functions and the /scoring API, which allow you to run evaluations on your pre-annotated AI application datasets.

In this example, we will show you how to:

  1. Build an Agent with Llama Stack

  2. Query the agent's sessions, turns, and steps

  3. Evaluate the results

Build a Search Agent

from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger

# Point the client at your running Llama Stack server
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")

# Create an agent that uses the built-in web search tool
agent = Agent(
    client,
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions. ",
    tools=["builtin::websearch"],
)
user_prompts = [
    "Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
]

session_id = agent.create_session("test-session")

# Run each prompt as a separate turn in the session and stream the events
for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in AgentEventLogger().log(response):
        log.print()

Query Agent Execution Steps

Now, let's dig deeper into the agent's execution steps to see how well our agent performed.

# Query the agent's session
from rich.pretty import pprint

session_response = client.agents.session.retrieve(
    session_id=session_id,
    agent_id=agent.agent_id,
)

pprint(session_response)

As a sanity check, we will first verify that every user prompt is followed by a tool call to `brave_search`.

# Count the turns whose steps include a tool call to brave_search
num_tool_call = 0
for turn in session_response.turns:
    for step in turn.steps:
        if (
            step.step_type == "tool_execution"
            and step.tool_calls[0].tool_name == "brave_search"
        ):
            num_tool_call += 1

print(
    f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
)
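
To see how the agent actually used the tool, you can also print the arguments of each tool call. This is a small sketch; it assumes each tool call object exposes an arguments field, which may differ across client versions.

# Print the arguments issued by each tool call (assumes `arguments` is exposed on the call)
for turn in session_response.turns:
    for step in turn.steps:
        if step.step_type == "tool_execution":
            for tool_call in step.tool_calls:
                print(f"{tool_call.tool_name} called with arguments: {tool_call.arguments}")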

Evaluate Agent Responses

Now, we want to evaluate the agent's responses to the user prompts.

  1. First, we will process the agent's execution history into a list of rows that can be used for evaluation.

  2. Next, we will label each row with the expected answer.

  3. Finally, we will use the /scoring API to score the agent's responses.

eval_rows = []

expected_answers = [
    "Dallas Mavericks and the Minnesota Timberwolves",
    "Season 4, Episode 12",
    "King Cobra",
]

for i, turn in enumerate(session_response.turns):
    eval_rows.append(
        {
            "input_query": turn.input_messages[0].content,
            "generated_answer": turn.output_message.content,
            "expected_answer": expected_answers[i],
        }
    )

pprint(eval_rows)

scoring_params = {
    # `basic::subset_of` checks whether the expected answer is contained in the generated answer
    "basic::subset_of": None,
}
scoring_response = client.scoring.score(
    input_rows=eval_rows, scoring_functions=scoring_params
)
pprint(scoring_response)
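
To drill into the scores, here is a short sketch of iterating over the per-function results. It assumes the response exposes results keyed by scoring function, each with score_rows and aggregated_results; treat this as illustrative rather than a guaranteed schema.

# Inspect per-row and aggregated scores (assumed response structure)
for fn_name, result in scoring_response.results.items():
    print(f"Scoring function: {fn_name}")
    print(f"Aggregated results: {result.aggregated_results}")
    for row, score_row in zip(eval_rows, result.score_rows):
        print(f"  {row['input_query'][:50]}... -> {score_row['score']}")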