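Evaluations

The Llama Stack evaluation flow lets you run evaluations on your GenAI application datasets or on pre-registered benchmarks.

Llama Stack provides a set of APIs to support evaluating LLM applications: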

/datasetio + /datasets API
/scoring + /scoring_functions API
/eval + /benchmarks API

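This guide walks through these APIs and the developer workflow for running evaluations with Llama Stack across different use cases. Check out our Colab notebook for working examples of evaluations.

Evaluation Concepts

The Evaluation APIs are associated with a set of Resources, as shown in the figure below. See the Resources section of our Core Concepts guide for a better high-level understanding.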

Eval Concepts

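DatasetIO: defines the interface to datasets and data loaders.
- Associated with the Dataset resource.
Scoring: evaluates the outputs of the system.
- Associated with the ScoringFunction resource. We provide a suite of out-of-the-box scoring functions and also let you add custom evaluators. These scoring functions are the core part of defining an evaluation task that outputs evaluation metrics.
Eval: generates outputs (via Inference or Agents) and performs scoring.
- Associated with the Benchmark resource.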
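The way the three concepts fit together can be illustrated with a toy pipeline in plain Python. This is only a sketch under stated assumptions: the functions load_rows, generate, score, and toy_candidate are hypothetical stand-ins, not Llama Stack APIs, and exact-match accuracy is used purely for illustration.

```python
from typing import Callable

Row = dict[str, str]


def load_rows() -> list[Row]:
    """Stand-in for DatasetIO: return rows from an in-memory dataset."""
    return [
        {"input_query": "What is 2 + 2?", "expected_answer": "4"},
        {"input_query": "What is the capital of France?", "expected_answer": "Paris"},
    ]


def generate(row: Row, candidate: Callable[[str], str]) -> Row:
    """Stand-in for the generation half of Eval: attach the candidate's answer."""
    return {**row, "generated_answer": candidate(row["input_query"])}


def score(rows: list[Row]) -> dict[str, float]:
    """Stand-in for Scoring: exact-match accuracy over all rows."""
    hits = sum(r["generated_answer"].strip() == r["expected_answer"] for r in rows)
    return {"accuracy": hits / len(rows)}


def toy_candidate(query: str) -> str:
    """Hypothetical candidate that only gets the arithmetic question right."""
    return "4" if "2 + 2" in query else "Lyon"


scored = score([generate(r, toy_candidate) for r in load_rows()])
print(scored)
```

In the real APIs, generation and scoring run on the server; the sketch only mirrors the order of the steps.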

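Evaluation Examples Walkthrough

It is best to open this notebook in Colab to follow along with the examples.

1. Open Benchmark Model Evaluation

This first example walks you through evaluating a candidate model on the open benchmarks provided by Llama Stack. We will use the following benchmarks:

MMMU (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): a benchmark designed to evaluate multimodal models.
SimpleQA: a benchmark designed to evaluate a model's ability to answer short, fact-seeking questions.

1.1. Running MMMU

We will use a preprocessed MMMU dataset from llamastack/mmmu. The preprocessing code is shown in this GitHub Gist. The dataset was obtained by transforming the original MMMU/MMMU dataset into the format accepted by the inference/chat-completion API.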

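import datasets

# MMMU subset and split to evaluate; both are reused below
# for the system prompt and the dataset_id.
subset = "Agriculture"
split = "dev"

ds = datasets.load_dataset(path="llamastack/mmmu", name=subset, split=split)
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
eval_rows = ds.to_pandas().to_dict(orient="records")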

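Next, we will run the evaluation on the candidate model. To do so, we need to:
- Define a system prompt
- Define an EvalCandidate
- Run the evaluation on the dataset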

from rich.pretty import pprint

SYSTEM_PROMPT_TEMPLATE = """
You are an expert in {subject} whose job is to answer questions from the user using images.

First, reason about the correct answer.

Then write the answer in the following format where X is exactly one of A,B,C,D:

Answer: X

Make sure X is one of A,B,C,D.

If you are uncertain of the correct answer, guess the most likely one.
"""

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT_TEMPLATE.format(subject=subset),
}

# Register the evaluation benchmark with its dataset and scoring function.
# `client` is assumed to be an already-initialized LlamaStackClient.
client.benchmarks.register(
    benchmark_id="meta-reference::mmmu",
    dataset_id=f"mmmu-{subset}-{split}",
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::mmmu",
    input_rows=eval_rows,
    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "top_p",
                    "temperature": 1.0,
                    "top_p": 0.95,
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
            "system_message": system_message,
        },
    },
)
pprint(response)

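1.2. Running SimpleQA

We will use a preprocessed SimpleQA dataset from llamastack/evals, obtained by transforming the input queries into the format accepted by the inference/chat-completion API.
Because we will reuse this dataset for agent evaluation in the next example, we register it with the /datasets API and interact with it through the /datasetio API.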

simpleqa_dataset_id = "huggingface::simpleqa"

_ = client.datasets.register(
    purpose="eval/messages-answer",
    source={
        "type": "uri",
        "uri": "huggingface://datasets/llamastack/simpleqa?split=train",
    },
    dataset_id=simpleqa_dataset_id,
)

eval_rows = client.datasets.iterrows(
    dataset_id=simpleqa_dataset_id,
    limit=5,
)

client.benchmarks.register(
    benchmark_id="meta-reference::simpleqa",
    dataset_id=simpleqa_dataset_id,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
)

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.data,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
            "sampling_params": {
                "strategy": {
                    "type": "greedy",
                },
                "max_tokens": 4096,
                "repeat_penalty": 1.0,
            },
        },
    },
)
pprint(response)

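2. Agentic Evaluation

In this example, we demonstrate how to evaluate an agent candidate served by Llama Stack via the /agents API.
We will continue to use the SimpleQA dataset from the previous example.
Instead of running the evaluation on a model, we will run it on a search agent with access to a search tool. We define the agent evaluation candidate through AgentConfig.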

agent_config = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "instructions": "You are a helpful assistant that has access to a tool to search the web.",
    "sampling_params": {
        "strategy": {
            "type": "top_p",
            "temperature": 0.5,
            "top_p": 0.9,
        }
    },
    "toolgroups": [
        "builtin::websearch",
    ],
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "input_shields": [],
    "output_shields": [],
    "enable_session_persistence": False,
}

response = client.eval.evaluate_rows(
    benchmark_id="meta-reference::simpleqa",
    input_rows=eval_rows.data,
    scoring_functions=["llm-as-judge::405b-simpleqa"],
    benchmark_config={
        "eval_candidate": {
            "type": "agent",
            "config": agent_config,
        },
    },
)
pprint(response)

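3. Agentic Application Dataset Scoring

Llama Stack offers a library of scoring functions and the /scoring API, which let you run evaluations on your pre-annotated AI application datasets.

In this example, we take a sample row from a RAG dataset you have built previously, label it with an annotation, and score it using LLM-As-Judge with a custom judge prompt. See our Llama Stack Playground for an interactive interface to upload datasets and run scoring.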

judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"

JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.

Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
  The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
  (B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
  (C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
  (D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.

Give your answer in the format "Answer: One of ABCDE, Explanation: ".

Your actual task:

QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""

input_query = (
    "What are the top 5 topics that were explained? Only list succinct bullet points."
)
generated_answer = """
Here are the top 5 topics that were explained in the documentation for Torchtune:

* What is LoRA and how does it work?
* Fine-tuning with LoRA: memory savings and parameter-efficient finetuning
* Running a LoRA finetune with Torchtune: overview and recipe
* Experimenting with different LoRA configurations: rank, alpha, and attention modules
* LoRA finetuning
"""
expected_answer = """LoRA"""

dataset_rows = [
    {
        "input_query": input_query,
        "generated_answer": generated_answer,
        "expected_answer": expected_answer,
    },
]

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Answer: (A|B|C|D|E)"],
    },
    "basic::subset_of": None,
    "braintrust::factuality": None,
}

response = client.scoring.score(
    input_rows=dataset_rows, scoring_functions=scoring_params
)
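The judge_score_regexes entry above suggests that the judge's free-text reply is reduced to a grade by regex matching. A minimal sketch of that step follows, assuming the first capture group of the first matching pattern is taken as the score. The helper name parse_judge_score and the sample judge replies are hypothetical, and this is not claimed to be the provider's actual implementation.

```python
import re
from typing import Optional

# Same pattern as in scoring_params above.
judge_score_regexes = ["Answer: (A|B|C|D|E)"]


def parse_judge_score(judge_response: str, regexes: list[str]) -> Optional[str]:
    """Return the first captured grade found by any pattern, or None if nothing matches."""
    for pattern in regexes:
        match = re.search(pattern, judge_response)
        if match:
            return match.group(1)
    return None


# Hypothetical judge replies, for illustration only.
grade = parse_judge_score(
    "Answer: B, Explanation: the response adds consistent detail.",
    judge_score_regexes,
)
no_grade = parse_judge_score("I cannot decide.", judge_score_regexes)
print(grade, no_grade)
```

A reply that does not follow the requested "Answer: ..." format yields no grade, which is why the judge prompt spells out the output format explicitly.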

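Running Evaluations via CLI

The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.

Benchmark Evaluation CLI

Running a benchmark evaluation takes 3 required inputs:

list of benchmark_ids: the IDs of the benchmarks to run the evaluation on
model_id: the ID of the model to evaluate
output_dir: the path where the evaluation results are stored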

llama-stack-client eval run-benchmark <benchmark_id_1> <benchmark_id_2> ... \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>

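You can run

llama-stack-client eval run-benchmark help

to see descriptions of all the flags for running a benchmark evaluation.

The output log contains the path to the file holding the evaluation results. Open that file to see the aggregated evaluation results.

Application Evaluation CLI

Usage: to run an application evaluation, you need an existing dataset from your application. You need to specify: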

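scoring-fn-id: a list of ScoringFunction identifiers you wish to run on your application.
The Dataset used for evaluation, given as one of:
- (1) --dataset-path: path to a dataset on the local file system to run the evaluation on
- (2) --dataset-id: a dataset pre-registered in Llama Stack
(Optional) --scoring-params-config: optionally parameterize the scoring functions with custom params (e.g. judge_prompt, judge_model, parsing_regexes).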

llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n> \
--dataset-path <path-to-local-dataset> \
--output-dir ./

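Defining BenchmarkConfig

BenchmarkConfig is a user-specified config that defines:

The EvalCandidate to run generation on:
- ModelCandidate: the model used for generation through the Llama Stack /inference API.
- AgentCandidate: the agentic system, specified by AgentConfig, used for generation through the Llama Stack /agents API.
Optional scoring function params that customize the behavior of scoring functions. This is useful for parameterizing generic scoring functions such as LLMAsJudge with a custom judge_model / judge_prompt.

Example BenchmarkConfig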

{
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.1-405B-Instruct",
        "sampling_params": {
            "strategy": {
                "type": "greedy"
            },
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    },
    "scoring_params": {
        "llm-as-judge::llm_as_judge_base": {
            "type": "llm_as_judge",
            "judge_model": "meta-llama/Llama-3.1-8B-Instruct",
            "prompt_template": "Your job is to look at a question, a gold target ........",
            "judge_score_regexes": [
                "(A|B|C)"
            ]
        }
    }
}
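When calling the API from Python, the same structure can be assembled as plain dictionaries. The sketch below uses only the keys that appear in this guide; the helper names model_candidate, agent_candidate, and benchmark_config are hypothetical conveniences, not part of the Llama Stack client.

```python
from typing import Any, Optional


def model_candidate(
    model: str,
    sampling_params: dict[str, Any],
    system_message: Optional[dict[str, str]] = None,
) -> dict[str, Any]:
    """Build a ModelCandidate-shaped dict (generation via /inference)."""
    candidate: dict[str, Any] = {
        "type": "model",
        "model": model,
        "sampling_params": sampling_params,
    }
    if system_message is not None:
        candidate["system_message"] = system_message
    return candidate


def agent_candidate(agent_config: dict[str, Any]) -> dict[str, Any]:
    """Build an AgentCandidate-shaped dict (generation via /agents)."""
    return {"type": "agent", "config": agent_config}


def benchmark_config(
    eval_candidate: dict[str, Any],
    scoring_params: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Combine a candidate with optional per-function scoring params."""
    config: dict[str, Any] = {"eval_candidate": eval_candidate}
    if scoring_params:
        config["scoring_params"] = scoring_params
    return config


greedy = {"strategy": {"type": "greedy"}, "max_tokens": 4096}
model_cfg = benchmark_config(
    model_candidate("meta-llama/Llama-3.2-90B-Vision-Instruct", greedy)
)
agent_cfg = benchmark_config(
    agent_candidate({"model": "meta-llama/Llama-3.3-70B-Instruct"})
)
print(model_cfg["eval_candidate"]["type"], agent_cfg["eval_candidate"]["type"])
```

Either resulting dictionary could then be passed as the benchmark_config argument shown in the earlier evaluate_rows examples.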

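Open Benchmark Contributing Guide

Create the new dataset for your new benchmark

An eval open benchmark essentially consists of two parts:

raw data: the raw dataset associated with the benchmark. You typically need to search for the original paper that introduces the benchmark and find the canonical dataset (usually hosted on Hugging Face).
prompt template: how to ask the candidate model to generate the answer (the prompt template plays a critical role in the evaluation results). Typically, you can find a reference prompt template for the benchmark in the benchmark author's repo (example) or in other popular open-source repos (example).

To create a new open benchmark in Llama Stack, you need to merge the prompt template and the raw data into the chat_completion_input column of the evaluation dataset.

Llama Stack requires the evaluation dataset schema to contain at least 3 columns:

chat_completion_input: the actual input to the model used to run generation for the eval
input_query: the raw input from the raw dataset, without the prompt template
expected_answer: the ground truth that the scoring functions use to calculate the score

You need to write a script (example conversion script) that converts the benchmark's raw dataset into a Llama Stack format evaluation dataset and uploads the dataset to Hugging Face (example benchmark dataset).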
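The core of such a conversion script might look like the sketch below. It assumes that chat_completion_input holds a JSON-encoded list of chat messages, and that the raw dataset has question, choices, and answer fields; the prompt template, the raw field names, and the function to_eval_row are all hypothetical and should be replaced by the benchmark's own.

```python
import json

# Hypothetical prompt template; use the benchmark's reference template instead.
PROMPT_TEMPLATE = (
    "Answer the following multiple-choice question.\n\n"
    "{question}\n\n{choices}\n\n"
    "Reply in the format 'Answer: X' where X is one of {letters}."
)


def to_eval_row(raw: dict) -> dict:
    """Merge the prompt template and one raw row into the 3-column eval schema."""
    letters = "ABCDEFGH"[: len(raw["choices"])]
    choices = "\n".join(f"{l}. {c}" for l, c in zip(letters, raw["choices"]))
    prompt = PROMPT_TEMPLATE.format(
        question=raw["question"], choices=choices, letters=", ".join(letters)
    )
    messages = [{"role": "user", "content": prompt}]
    return {
        "chat_completion_input": json.dumps(messages),  # templated model input
        "input_query": raw["question"],  # raw input, no template
        "expected_answer": raw["answer"],  # ground truth for scoring
    }


raw_row = {
    "question": "Which gas do plants absorb?",
    "choices": ["Oxygen", "Carbon dioxide", "Helium"],
    "answer": "B",
}
eval_row = to_eval_row(raw_row)
print(eval_row["chat_completion_input"])
```

Keeping input_query free of the template makes it possible to change the prompt later without losing the original question.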

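Find scoring functions for your new benchmark

A scoring function calculates the score for each example based on the candidate model's generation and the expected_answer. It also aggregates the scores from all examples and produces the final evaluation results.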
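The two responsibilities described above, per-example scoring and aggregation, can be sketched for a multiple-choice benchmark as follows. This is an illustrative stand-alone example, not the implementation of any built-in Llama Stack scoring function; the names ANSWER_PATTERN, score_row, and aggregate and the sample generations are hypothetical.

```python
import re

# Expects the generation to contain "Answer: X" with X in A-D.
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-D])")


def score_row(generated_answer: str, expected_answer: str) -> dict:
    """Score one example: 1.0 if the parsed letter equals the expected answer."""
    match = ANSWER_PATTERN.search(generated_answer)
    parsed = match.group(1) if match else None
    return {"parsed_answer": parsed, "score": 1.0 if parsed == expected_answer else 0.0}


def aggregate(score_rows: list[dict]) -> dict:
    """Aggregate per-example scores into a final accuracy."""
    total = len(score_rows)
    accuracy = sum(r["score"] for r in score_rows) / total if total else 0.0
    return {"accuracy": accuracy, "num_total": total}


rows = [
    score_row("The leaf shows rust. Answer: B", "B"),  # parsed and correct
    score_row("Answer: A", "D"),  # parsed but wrong
    score_row("I think it is C", "C"),  # unparseable, counted as wrong
    score_row("Answer:  C", "C"),  # extra whitespace still parses
]
result = aggregate(rows)
print(result)
```

Counting unparseable generations as wrong is a design choice; a benchmark's reference implementation may handle them differently.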

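First, check whether the existing Llama Stack scoring functions meet your needs. If not, you need to write a new scoring function based on what the benchmark author or other open-source repos describe.

Add the new benchmark into the template

First, add the evaluation dataset associated with your benchmark under the datasets resource in the open-benchmark template.

Second, add the new benchmark you just created under the benchmarks resource in the same template. To add a new benchmark, you need:

benchmark_id: the identifier of the benchmark
dataset_id: the identifier of the dataset associated with your benchmark
scoring_functions: the scoring functions used to calculate the score from the generation results and the expected_answer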
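Before editing the template, it can help to sanity-check a new benchmark entry against the three required fields and the datasets already listed. The following is a hypothetical stand-alone check in plain Python; it does not read the template file, and the function check_benchmark and the sample identifiers are illustrative only.

```python
REQUIRED_BENCHMARK_KEYS = {"benchmark_id", "dataset_id", "scoring_functions"}


def check_benchmark(benchmark: dict, dataset_ids: set[str]) -> list[str]:
    """Return a list of problems found in a benchmark entry (empty if none)."""
    problems = []
    missing = REQUIRED_BENCHMARK_KEYS - set(benchmark)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    dataset_id = benchmark.get("dataset_id")
    if dataset_id is not None and dataset_id not in dataset_ids:
        problems.append(f"unknown dataset_id: {dataset_id}")
    if "scoring_functions" in benchmark and not benchmark["scoring_functions"]:
        problems.append("scoring_functions must be a non-empty list")
    return problems


# Hypothetical identifiers, for illustration only.
registered_datasets = {"my_benchmark_dataset"}
good = {
    "benchmark_id": "my-new-benchmark",
    "dataset_id": "my_benchmark_dataset",
    "scoring_functions": ["basic::regex_parser_multiple_choice_answer"],
}
bad = {"benchmark_id": "my-new-benchmark", "dataset_id": "not_registered"}
print(check_benchmark(good, registered_datasets))
print(check_benchmark(bad, registered_datasets))
```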

Test the new benchmark

Start the Llama Stack server with the 'open-benchmark' template:

llama stack run llama_stack/templates/open-benchmark/run.yaml

Run the eval benchmark CLI with your new benchmark ID:

llama-stack-client eval run-benchmark <new_benchmark_id> \
--model_id <model id to evaluate on> \
--output_dir <directory to store the evaluation results>