Detailed Tutorial
In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple agent. A Llama Stack agent is a simple integrated system that can perform tasks by combining a Llama model for reasoning with tools (e.g., RAG, web search, code execution, etc.) for taking actions. In Llama Stack, we provide a server exposing multiple APIs. These APIs are backed by implementations from different providers.
Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. The server can be run in a variety of ways, including as a standalone binary, Docker container, or hosted service. You can build and test using a local server first and deploy to a hosted endpoint for production.
In this guide, we'll walk through how to build a RAG agent locally using Llama Stack with Ollama as the inference provider for a Llama Model.
Step 1: Installation and Setup
Install Ollama by following the instructions on the Ollama website, then download the Llama 3.2 3B model, and start the Ollama service.
ollama pull llama3.2:3b
ollama run llama3.2:3b --keepalive 60m
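If you want to confirm the model downloaded successfully before moving on, you can list the models available locally (this uses Ollama's standard CLI; the exact output format may vary by version):
ollama list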
Install uv to set up your virtual environment.
Use curl to download the script and execute it with sh:
curl -LsSf https://astral.sh/uv/install.sh | sh
Or on Windows, use irm to download the script and execute it with iex:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Set up your virtual environment.
uv sync --python 3.10
source .venv/bin/activate
Step 2: Run Llama Stack
Llama Stack is a server that exposes multiple APIs; you connect with it using the Llama Stack client SDK.
You can use Python to build and run the Llama Stack server, which is useful for testing and development.
Llama Stack uses a YAML configuration file to specify the stack setup, which defines the providers and their settings. Now let's build and run the Llama Stack config for Ollama. To build and run the server in a virtual environment (venv):
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
Alternatively, to build and run the server in a conda environment:
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type conda --image-name llama3-3b-conda --run
You can also use a container image to run the Llama Stack server. We provide several container images for the server component that work with different inference providers out of the box. For this guide, we will use llamastack/distribution-ollama as the container image. If you'd like to build your own image or customize the configuration, please check out this guide. First, let's set up some environment variables and create a local directory to mount into the container's file system.
export INFERENCE_MODEL="llama3.2:3b"
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
Then start the server using the container tool of your choice. For example, if you are running Docker you can use the following command:
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
Note: to start the container with Podman, do the same but replace docker at the start of the command with podman. If you are using a podman version older than 4.7.0, please also replace host.docker.internal in the OLLAMA_URL with host.containers.internal.
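For reference, a Podman invocation following the substitutions above might look like this (a sketch assuming a podman version older than 4.7.0; with 4.7.0 or newer, keep host.docker.internal):
podman run -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-ollama \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.containers.internal:11434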
The configuration YAML for the Ollama distribution is available at distributions/ollama/run.yaml.
Tip
Docker containers run in their own isolated network namespaces on Linux. To allow the container to communicate with services running on the host via localhost, you need --network=host. This makes the container use the host's network directly so it can connect to Ollama running on localhost:11434.
Linux users having issues running the above command should instead try the following:
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--network=host \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
You will see output like below:
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
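As a quick sanity check from another terminal, you can probe the server's health endpoint (this assumes the default /v1/health route; the exact path may differ across Llama Stack versions):
curl http://localhost:8321/v1/health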
Now you can use the Llama Stack client to run inference and build agents!
You can reuse the server setup or use the Llama Stack Client. Note that the client package is already included in the llama-stack package.
Step 3: Run Client CLI
Open a new terminal and navigate to the same directory you started the server from. Then set up a new or activate your existing server virtual environment.
# The client is included in the llama-stack package so we just activate the server venv
source .venv/bin/activate
Or, create a new client-only virtual environment:
uv venv client --python 3.10
source client/bin/activate
pip install llama-stack-client
Or, set up a conda environment for the client:
yes | conda create -n stack-client python=3.10
conda activate stack-client
pip install llama-stack-client
Now let's use the llama-stack-client CLI to check the connectivity to the server.
llama-stack-client configure --endpoint http://localhost:8321 --api-key none
You will see the below:
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
List the models:
llama-stack-client models list
Available Models
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ model_type ┃ identifier ┃ provider_resource_id ┃ metadata ┃ provider_id ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ embedding │ all-MiniLM-L6-v2 │ all-minilm:latest │ {'embedding_dimension': 384.0} │ ollama │
├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼─────────────────┤
│ llm │ llama3.2:3b │ llama3.2:3b │ │ ollama │
└─────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────────┴─────────────────┘
Total models: 2
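Besides models, the CLI can inspect other server resources. Assuming your installed llama-stack-client version ships the corresponding subcommand, you can list the providers configured on the server with:
llama-stack-client providers list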
You can test basic Llama inference completion using the CLI.
llama-stack-client inference chat-completion --message "tell me a joke"
Sample output:
ChatCompletionResponse(
completion_message=CompletionMessage(
content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
role="assistant",
stop_reason="end_of_turn",
tool_calls=[],
),
logprobs=None,
metrics=[
Metric(metric="prompt_tokens", value=14.0, unit=None),
Metric(metric="completion_tokens", value=27.0, unit=None),
Metric(metric="total_tokens", value=41.0, unit=None),
],
)
Step 4: Run the Demos
Note that these demos show the Python Client SDK. Other SDKs are also available; please refer to the list of Client SDKs for the complete options.
Now you can run inference using the Llama Stack client SDK.
i. Create the Script
Create a file inference.py and add the following code:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
# List available models
models = client.models.list()
# Select the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
print("Model:", model_id)
response = client.inference.chat_completion(
model_id=model_id,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"},
],
)
print(response.completion_message.content)
ii. Run the Script
Let's run the script using uv:
uv run python inference.py
Which will output:
Model: llama3.2:3b
Here is a haiku about coding:
Lines of code unfold
Logic flows through digital night
Beauty in the bits
Next we can move beyond simple inference and build an agent that can perform tasks using the Llama Stack server.
i. Create the Script
Create a file agent.py and add the following code:
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from rich.pretty import pprint
import uuid
client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
agent = Agent(client, model=model_id, instructions="You are a helpful assistant.")
s_id = agent.create_session(session_name=f"s{uuid.uuid4().hex}")
print("Non-streaming ...")
response = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}],
session_id=s_id,
stream=False,
)
print("agent>", response.output_message.content)
print("Streaming ...")
stream = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}], session_id=s_id, stream=True
)
for event in stream:
pprint(event)
print("Streaming with print helper...")
stream = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}], session_id=s_id, stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
ii. Run the Script
Let's run the script using uv:
uv run python agent.py
Sample output:
Non-streaming ...
agent> I'm an artificial intelligence designed to assist and communicate with users like you. I don't have a personal identity, but I'm here to provide information, answer questions, and help with tasks to the best of my abilities.
I can be used for a wide range of purposes, such as:
* Providing definitions and explanations
* Offering suggestions and ideas
* Helping with language translation
* Assisting with writing and proofreading
* Generating text or responses to questions
* Playing simple games or chatting about topics of interest
I'm constantly learning and improving my abilities, so feel free to ask me anything, and I'll do my best to help!
Streaming ...
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepStartPayload(
│ │ │ event_type='step_start',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference',
│ │ │ metadata={}
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepProgressPayload(
│ │ │ delta=TextDelta(text='As', type='text'),
│ │ │ event_type='step_progress',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepProgressPayload(
│ │ │ delta=TextDelta(text=' a', type='text'),
│ │ │ event_type='step_progress',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
...
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepCompletePayload(
│ │ │ event_type='step_complete',
│ │ │ step_details=InferenceStep(
│ │ │ │ api_model_response=CompletionMessage(
│ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface – I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ role='assistant',
│ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ tool_calls=[]
│ │ │ │ ),
│ │ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ │ step_type='inference',
│ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 716174, tzinfo=TzInfo(UTC)),
│ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28823, tzinfo=TzInfo(UTC))
│ │ │ ),
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseTurnCompletePayload(
│ │ │ event_type='turn_complete',
│ │ │ turn=Turn(
│ │ │ │ input_messages=[UserMessage(content='Who are you?', role='user', context=None)],
│ │ │ │ output_message=CompletionMessage(
│ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface – I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ role='assistant',
│ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ tool_calls=[]
│ │ │ │ ),
│ │ │ │ session_id='abd4afea-4324-43f4-9513-cfe3970d92e8',
│ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28722, tzinfo=TzInfo(UTC)),
│ │ │ │ steps=[
│ │ │ │ │ InferenceStep(
│ │ │ │ │ │ api_model_response=CompletionMessage(
│ │ │ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface – I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ │ │ role='assistant',
│ │ │ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ │ │ tool_calls=[]
│ │ │ │ │ │ ),
│ │ │ │ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ │ │ │ step_type='inference',
│ │ │ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 716174, tzinfo=TzInfo(UTC)),
│ │ │ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28823, tzinfo=TzInfo(UTC))
│ │ │ │ │ )
│ │ │ │ ],
│ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 727364, tzinfo=TzInfo(UTC)),
│ │ │ │ output_attachments=[]
│ │ │ )
│ │ )
│ )
)
Streaming with print helper...
inference> Déjà vu!
As I mentioned earlier, I'm an artificial intelligence language model. I don't have a personal identity or consciousness like humans do. I exist solely to process and respond to text-based inputs, providing information and assistance on a wide range of topics.
I'm a computer program designed to simulate human-like conversations, using natural language processing (NLP) and machine learning algorithms to understand and generate responses. My purpose is to help users like you with their questions, provide information, and engage in conversation.
Think of me as a virtual companion, a helpful tool designed to make your interactions more efficient and enjoyable. I don't have personal opinions, emotions, or biases, but I'm here to provide accurate and informative responses to the best of my abilities.
So, who am I? I'm just a computer program designed to help you!
For our last demo, we can build a RAG agent that can answer questions about the Torchtune project using the documents in a vector database.
i. Create the Script
Create a file rag_agent.py and add the following code:
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.types import Document
import uuid
client = LlamaStackClient(base_url="http://localhost:8321")
# Create a vector database instance
embed_lm = next(m for m in client.models.list() if m.model_type == "embedding")
embedding_model = embed_lm.identifier
vector_db_id = f"v{uuid.uuid4().hex}"
client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=embedding_model,
)
# Create Documents
urls = [
"memory_optimizations.rst",
"chat.rst",
"llama3.rst",
"qat_finetune.rst",
"lora_finetune.rst",
]
documents = [
Document(
document_id=f"num-{i}",
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
mime_type="text/plain",
metadata={},
)
for i, url in enumerate(urls)
]
# Insert documents
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
# Get the model being served
llm = next(m for m in client.models.list() if m.model_type == "llm")
model = llm.identifier
# Create the RAG agent
rag_agent = Agent(
client,
model=model,
instructions="You are a helpful assistant. Use the RAG tool to answer questions as needed.",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": [vector_db_id]},
}
],
)
session_id = rag_agent.create_session(session_name=f"s{uuid.uuid4().hex}")
turns = ["what is torchtune", "tell me about dora"]
for t in turns:
print("user>", t)
stream = rag_agent.create_turn(
messages=[{"role": "user", "content": t}], session_id=session_id, stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
ii. Run the Script
Let's run the script using uv:
uv run python rag_agent.py
Sample output:
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent: conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Here is a high-level overview of the text:
**LoRA Finetuning with PyTorch Tune**
PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
You're Ready to Build Your Own Apps!
Congratulations! 🥳 Now you're ready to build your own Llama Stack applications! 🚀