Detailed Tutorial
In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple agent. A Llama Stack agent is a simple integrated system that can perform tasks by combining a Llama model for reasoning with tools (e.g., RAG, web search, code execution, etc.) for taking actions. In Llama Stack, we provide a server exposing multiple APIs. These APIs are backed by implementations from different providers.
Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. The server can be run in a variety of ways, including as a standalone binary, Docker container, or hosted service. You can build and test using a local server first and deploy to a hosted endpoint for production.
In this guide, we'll walk through how to build a RAG agent locally using Llama Stack with Ollama as the inference provider for a Llama Model.
Step 1: Installation and Setup
Install Ollama by following the instructions on the Ollama website, then download the Llama 3.2 3B model, and start the Ollama service.
ollama pull llama3.2:3b
ollama run llama3.2:3b --keepalive 60m
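If you want to confirm the model downloaded successfully before moving on, you can list the models available locally (this uses Ollama's standard CLI; the exact output format may vary by version):
ollama list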
Install uv to set up your virtual environment.
Use curl to download the script and execute it with sh:
curl -LsSf https://astral.sh/uv/install.sh | sh
Or on Windows, use irm to download the script and execute it with iex:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Set up your virtual environment.
uv sync --python 3.10
source .venv/bin/activate
Step 2: Run Llama Stack
Llama Stack is a server that exposes multiple APIs; you connect with it using the Llama Stack client SDK.
You can use Python to build and run the Llama Stack server, which is useful for testing and development.
Llama Stack uses a YAML configuration file to specify the stack setup, which defines the providers and their settings. Now let's build and run the Llama Stack config for Ollama. To build and run the server in a virtual environment (venv):
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
Alternatively, to build and run the server in a conda environment:
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type conda --image-name llama3-3b-conda --run
You can also use a container image to run the Llama Stack server. We provide several container images for the server component that work with different inference providers out of the box. For this guide, we will use llamastack/distribution-ollama as the container image. If you'd like to build your own image or customize the configuration, please check out this guide. First, let's set up some environment variables and create a local directory to mount into the container's file system.
export INFERENCE_MODEL="llama3.2:3b"
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
Then start the server using the container tool of your choice. For example, if you are running Docker you can use the following command:
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
Note: to start the container with Podman, do the same but replace docker at the start of the command with podman. If you are using a podman version older than 4.7.0, please also replace host.docker.internal in the OLLAMA_URL with host.containers.internal.
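For reference, a Podman invocation following the substitutions above might look like this (a sketch assuming a podman version older than 4.7.0; with 4.7.0 or newer, keep host.docker.internal):
podman run -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-ollama \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.containers.internal:11434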
The configuration YAML for the Ollama distribution is available at distributions/ollama/run.yaml.
Tip
Docker containers run in their own isolated network namespaces on Linux. To allow the container to communicate with services running on the host via localhost, you need --network=host. This makes the container use the host's network directly so it can connect to Ollama running on localhost:11434.
Linux users having issues running the above command should instead try the following:
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--network=host \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
You will see output like below:
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
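As a quick sanity check from another terminal, you can probe the server's health endpoint (this assumes the default /v1/health route; the exact path may differ across Llama Stack versions):
curl http://localhost:8321/v1/health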
Now you can use the Llama Stack client to run inference and build agents!
You can reuse the server setup or use the Llama Stack Client. Note that the client package is already included in the llama-stack package.
Step 3: Run Client CLI
Open a new terminal and navigate to the same directory you started the server from. Then set up a new or activate your existing server virtual environment.
# The client is included in the llama-stack package so we just activate the server venv
source .venv/bin/activate
Or, create a new client-only virtual environment:
uv venv client --python 3.10
source client/bin/activate
pip install llama-stack-client
Or, set up a conda environment for the client:
yes | conda create -n stack-client python=3.10
conda activate stack-client
pip install llama-stack-client
Now let's use the llama-stack-client CLI to check the connectivity to the server.
llama-stack-client configure --endpoint http://localhost:8321 --api-key none
You will see the below:
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
List the models:
llama-stack-client models list
Available Models
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ model_type ┃ identifier ┃ provider_resource_id ┃ metadata ┃ provider_id ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ embedding │ all-MiniLM-L6-v2 │ all-minilm:latest │ {'embedding_dimension': 384.0} │ ollama │
├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼─────────────────┤
│ llm │ llama3.2:3b │ llama3.2:3b │ │ ollama │
└─────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────────┴─────────────────┘
Total models: 2
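Besides models, the CLI can inspect other server resources. Assuming your installed llama-stack-client version ships the corresponding subcommand, you can list the providers configured on the server with:
llama-stack-client providers list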
You can test basic Llama inference completion using the CLI.
llama-stack-client inference chat-completion --message "tell me a joke"
Sample output:
ChatCompletionResponse(
completion_message=CompletionMessage(
content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
role="assistant",
stop_reason="end_of_turn",
tool_calls=[],
),
logprobs=None,
metrics=[
Metric(metric="prompt_tokens", value=14.0, unit=None),
Metric(metric="completion_tokens", value=27.0, unit=None),
Metric(metric="total_tokens", value=41.0, unit=None),
],
)
Step 4: Run the Demos
Note that these demos show the Python Client SDK. Other SDKs are also available; please refer to the list of Client SDKs for the complete options.
Now you can run inference using the Llama Stack client SDK.
i. Create the Script
Create a file inference.py and add the following code:
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:8321")
# List available models
models = client.models.list()
# Select the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
print("Model:", model_id)
response = client.inference.chat_completion(
model_id=model_id,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"},
],
)
print(response.completion_message.content)
ii. Run the Script
Let's run the script using uv:
uv run python inference.py
Which will output:
Model: llama3.2:3b
Here is a haiku about coding:
Lines of code unfold
Logic flows through digital night
Beauty in the bits
Next we can move beyond simple inference and build an agent that can perform tasks using the Llama Stack server.
i. Create the Script
Create a file agent.py and add the following code:
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from rich.pretty import pprint
import uuid
client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
agent = Agent(client, model=model_id, instructions="You are a helpful assistant.")
s_id = agent.create_session(session_name=f"s{uuid.uuid4().hex}")
print("Non-streaming ...")
response = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}],
session_id=s_id,
stream=False,
)
print("agent>", response.output_message.content)
print("Streaming ...")
stream = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}], session_id=s_id, stream=True
)
for event in stream:
pprint(event)
print("Streaming with print helper...")
stream = agent.create_turn(
messages=[{"role": "user", "content": "Who are you?"}], session_id=s_id, stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
ii. Run the Script
Let's run the script using uv:
uv run python agent.py
Sample output:
Non-streaming ...
agent> I'm an artificial intelligence designed to assist and communicate with users like you. I don't have a personal identity, but I'm here to provide information, answer questions, and help with tasks to the best of my abilities.
I can be used for a wide range of purposes, such as:
* Providing definitions and explanations
* Offering suggestions and ideas
* Helping with language translation
* Assisting with writing and proofreading
* Generating text or responses to questions
* Playing simple games or chatting about topics of interest
I'm constantly learning and improving my abilities, so feel free to ask me anything, and I'll do my best to help!
Streaming ...
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepStartPayload(
│ │ │ event_type='step_start',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference',
│ │ │ metadata={}
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepProgressPayload(
│ │ │ delta=TextDelta(text='As', type='text'),
│ │ │ event_type='step_progress',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepProgressPayload(
│ │ │ delta=TextDelta(text=' a', type='text'),
│ │ │ event_type='step_progress',
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
...
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseStepCompletePayload(
│ │ │ event_type='step_complete',
│ │ │ step_details=InferenceStep(
│ │ │ │ api_model_response=CompletionMessage(
│ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface – I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ role='assistant',
│ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ tool_calls=[]
│ │ │ │ ),
│ │ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ │ step_type='inference',
│ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 716174, tzinfo=TzInfo(UTC)),
│ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28823, tzinfo=TzInfo(UTC))
│ │ │ ),
│ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ step_type='inference'
│ │ )
│ )
)
AgentTurnResponseStreamChunk(
│ event=TurnResponseEvent(
│ │ payload=AgentTurnResponseTurnCompletePayload(
│ │ │ event_type='turn_complete',
│ │ │ turn=Turn(
│ │ │ │ input_messages=[UserMessage(content='Who are you?', role='user', context=None)],
│ │ │ │ output_message=CompletionMessage(
│ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface – I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ role='assistant',
│ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ tool_calls=[]
│ │ │ │ ),
│ │ │ │ session_id='abd4afea-4324-43f4-9513-cfe3970d92e8',
│ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28722, tzinfo=TzInfo(UTC)),
│ │ │ │ steps=[
│ │ │ │ │ InferenceStep(
│ │ │ │ │ │ api_model_response=CompletionMessage(
│ │ │ │ │ │ │ content='As a conversational AI, I don\'t have a personal identity in the classical sense. I exist as a program running on computer servers, designed to process and respond to text-based inputs.\n\nI\'m an instance of a type of artificial intelligence called a "language model," which is trained on vast amounts of text data to generate human-like responses. My primary function is to understand and respond to natural language inputs, like our conversation right now.\n\nThink of me as a virtual assistant, a chatbot, or a conversational interface – I\'m here to provide information, answer questions, and engage in conversation to the best of my abilities. I don\'t have feelings, emotions, or consciousness like humans do, but I\'m designed to simulate human-like interactions to make our conversations feel more natural and helpful.\n\nSo, that\'s me in a nutshell! What can I help you with today?',
│ │ │ │ │ │ │ role='assistant',
│ │ │ │ │ │ │ stop_reason='end_of_turn',
│ │ │ │ │ │ │ tool_calls=[]
│ │ │ │ │ │ ),
│ │ │ │ │ │ step_id='69831607-fa75-424a-949b-e2049e3129d1',
│ │ │ │ │ │ step_type='inference',
│ │ │ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 716174, tzinfo=TzInfo(UTC)),
│ │ │ │ │ │ started_at=datetime.datetime(2025, 4, 3, 1, 15, 14, 28823, tzinfo=TzInfo(UTC))
│ │ │ │ │ )
│ │ │ │ ],
│ │ │ │ turn_id='8b360202-f7cb-4786-baa9-166a1b46e2ca',
│ │ │ │ completed_at=datetime.datetime(2025, 4, 3, 1, 15, 21, 727364, tzinfo=TzInfo(UTC)),
│ │ │ │ output_attachments=[]
│ │ │ )
│ │ )
│ )
)
Streaming with print helper...
inference> Déjà vu!
As I mentioned earlier, I'm an artificial intelligence language model. I don't have a personal identity or consciousness like humans do. I exist solely to process and respond to text-based inputs, providing information and assistance on a wide range of topics.
I'm a computer program designed to simulate human-like conversations, using natural language processing (NLP) and machine learning algorithms to understand and generate responses. My purpose is to help users like you with their questions, provide information, and engage in conversation.
Think of me as a virtual companion, a helpful tool designed to make your interactions more efficient and enjoyable. I don't have personal opinions, emotions, or biases, but I'm here to provide accurate and informative responses to the best of my abilities.
So, who am I? I'm just a computer program designed to help you!
For our last demo, we can build a RAG agent that can answer questions about the Torchtune project using the documents in a vector database.
i. Create the Script
Create a file rag_agent.py and add the following code:
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.types import Document
import uuid
client = LlamaStackClient(base_url="http://localhost:8321")
# Create a vector database instance
embed_lm = next(m for m in client.models.list() if m.model_type == "embedding")
embedding_model = embed_lm.identifier
vector_db_id = f"v{uuid.uuid4().hex}"
client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=embedding_model,
)
# Create Documents
urls = [
"memory_optimizations.rst",
"chat.rst",
"llama3.rst",
"qat_finetune.rst",
"lora_finetune.rst",
]
documents = [
Document(
document_id=f"num-{i}",
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
mime_type="text/plain",
metadata={},
)
for i, url in enumerate(urls)
]
# Insert documents
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
# Get the model being served
llm = next(m for m in client.models.list() if m.model_type == "llm")
model = llm.identifier
# Create the RAG agent
rag_agent = Agent(
client,
model=model,
instructions="You are a helpful assistant. Use the RAG tool to answer questions as needed.",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {"vector_db_ids": [vector_db_id]},
}
],
)
session_id = rag_agent.create_session(session_name=f"s{uuid.uuid4().hex}")
turns = ["what is torchtune", "tell me about dora"]
for t in turns:
print("user>", t)
stream = rag_agent.create_turn(
messages=[{"role": "user", "content": t}], session_id=session_id, stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
ii. Run the Script
Let's run the script using uv:
uv run python rag_agent.py
Sample output:
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent: conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Here is a high-level overview of the text:
**LoRA Finetuning with PyTorch Tune**
PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
You're Ready to Build Your Own Apps!
Congratulations! 🥳 Now you're ready to build your own Llama Stack applications! 🚀