iOS SDK

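We offer both remote and on-device use of Llama Stack in Swift through a single SDK, llama-stack-client-swift, which contains two components:

LlamaStackClient for remote inference
Local Inference for on-device inference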

Seamlessly switching between local, on-device inference and remote hosted inference

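Remote Only

If you don't want to run inference on-device, you can use the first component, LlamaStackClient, to connect to any hosted Llama Stack distribution.

Add https://github.com/meta-llama/llama-stack-client-swift/ as a package dependency in Xcode
Add LlamaStackClient as a framework to your app target
Call an API: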

import LlamaStackClient

// Set the URL to your hosted Llama Stack distribution.
let agents = RemoteAgents(url: URL(string: "https://:8321")!)
let request = Components.Schemas.CreateAgentTurnRequest(
  agent_id: agentId,
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Hello Llama!"),
      role: .user
    ))
  ],
  session_id: self.agenticSystemSessionId,
  stream: true
)

// Consume the streamed response chunk by chunk.
for try await chunk in try await agents.createTurn(request: request) {
  let payload = chunk.event.payload
  // ...
}

Check out iOSCalendarAssistant for a complete app demo.

LocalInference

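LocalInference provides a local inference implementation powered by executorch.

Llama Stack currently supports on-device inference for iOS, with Android coming soon. Today, you can already run on-device inference on Android using executorch, PyTorch's on-device inference library.

The APIs work the same way as the remote ones; the only difference is that you use the LocalAgents / LocalInference classes instead and pass in a DispatchQueue: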

private let runnerQueue = DispatchQueue(label: "org.llamastack.stacksummary")
let inference = LocalInference(queue: runnerQueue)
let agents = LocalAgents(inference: self.inference)

Check out iOSCalendarAssistantWithLocalInf for a complete app demo.
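Because the remote and local classes are used the same way, an app can hide the choice behind a small abstraction of its own. The following is a minimal sketch under stated assumptions: `TurnProvider`, `RemoteTurnProvider`, `LocalTurnProvider`, `InferenceMode`, and `makeProvider` are hypothetical names invented for illustration and are not part of llama-stack-client-swift. The stand-in types only echo the prompt; a real wrapper would forward to the SDK's remote or local agents.

```swift
import Foundation

// Hypothetical abstraction for illustration; these names are not part of
// llama-stack-client-swift.
protocol TurnProvider {
    /// Streams reply text for a single user prompt.
    func reply(to prompt: String) -> AsyncStream<String>
}

// Stand-in for a wrapper around the SDK's remote agents client.
struct RemoteTurnProvider: TurnProvider {
    let baseURL: URL

    func reply(to prompt: String) -> AsyncStream<String> {
        AsyncStream<String> { continuation in
            // A real wrapper would send the prompt to the hosted distribution.
            continuation.yield("[remote \(baseURL.absoluteString)] \(prompt)")
            continuation.finish()
        }
    }
}

// Stand-in for a wrapper around the SDK's local agents, which need a DispatchQueue.
struct LocalTurnProvider: TurnProvider {
    let queue: DispatchQueue

    func reply(to prompt: String) -> AsyncStream<String> {
        AsyncStream<String> { continuation in
            // A real wrapper would run the prompt through on-device inference.
            continuation.yield("[local \(queue.label)] \(prompt)")
            continuation.finish()
        }
    }
}

enum InferenceMode {
    case remote(URL)
    case local
}

// Single place where the app decides which backend to use.
func makeProvider(for mode: InferenceMode) -> any TurnProvider {
    switch mode {
    case .remote(let url):
        return RemoteTurnProvider(baseURL: url)
    case .local:
        return LocalTurnProvider(queue: DispatchQueue(label: "org.llamastack.stacksummary"))
    }
}
```

Call sites would then iterate `for await text in provider.reply(to: prompt)` without knowing which backend is active, so switching modes becomes a one-line change where the provider is created.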

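Installation

We're working on making LocalInference easier to set up. For now, you'll need to import it via .xcframework:

Clone the executorch submodule in this repo and its dependencies: git submodule update --init --recursive
Install CMake for the executorch build
Drag LocalInference.xcodeproj into your project
Add LocalInference as a framework to your app target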

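Preparing a model

Prepare a .pte file by following the executorch docs
Bundle the .pte and tokenizer.model files into your app

We now support models quantized using SpinQuant and QAT-LoRA, which offer a significant performance boost (measured with the demo app on an iPhone 13 Pro):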

Llama 3.2 1B	Tokens / Second (total)		Time-to-First-Token (sec)
	Haiku	Paragraph	Haiku	Paragraph
BF16	2.2	2.5	2.3	1.9
QAT+LoRA	7.1	3.3	0.37	0.24
SpinQuant	10.1	5.2	0.2	0.2
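To put the table in relative terms, the short sketch below divides each quantized row by the BF16 baseline. The `Benchmark` struct and the two helper functions are hypothetical names used only for this calculation; the numbers are copied from the table. For example, SpinQuant on the haiku prompt works out to 10.1 / 2.2, or roughly 4.6 times the BF16 throughput.

```swift
import Foundation

// Figures copied from the table above (Llama 3.2 1B).
struct Benchmark {
    let name: String
    let haikuTokensPerSecond: Double
    let paragraphTokensPerSecond: Double
    let haikuTimeToFirstToken: Double      // seconds
    let paragraphTimeToFirstToken: Double  // seconds
}

let bf16 = Benchmark(name: "BF16", haikuTokensPerSecond: 2.2, paragraphTokensPerSecond: 2.5,
                     haikuTimeToFirstToken: 2.3, paragraphTimeToFirstToken: 1.9)
let qatLora = Benchmark(name: "QAT+LoRA", haikuTokensPerSecond: 7.1, paragraphTokensPerSecond: 3.3,
                        haikuTimeToFirstToken: 0.37, paragraphTimeToFirstToken: 0.24)
let spinQuant = Benchmark(name: "SpinQuant", haikuTokensPerSecond: 10.1, paragraphTokensPerSecond: 5.2,
                          haikuTimeToFirstToken: 0.2, paragraphTimeToFirstToken: 0.2)

// How many times faster tokens are produced compared with the baseline.
func throughputGain(of model: Benchmark, over base: Benchmark) -> (haiku: Double, paragraph: Double) {
    (model.haikuTokensPerSecond / base.haikuTokensPerSecond,
     model.paragraphTokensPerSecond / base.paragraphTokensPerSecond)
}

// How many times shorter the wait for the first token is compared with the baseline.
func latencyReduction(of model: Benchmark, over base: Benchmark) -> (haiku: Double, paragraph: Double) {
    (base.haikuTimeToFirstToken / model.haikuTimeToFirstToken,
     base.paragraphTimeToFirstToken / model.paragraphTimeToFirstToken)
}

for model in [qatLora, spinQuant] {
    let gain = throughputGain(of: model, over: bf16)
    let latency = latencyReduction(of: model, over: bf16)
    let gainText = String(format: "%.1fx / %.1fx", gain.haiku, gain.paragraph)
    let latencyText = String(format: "%.1fx / %.1fx", latency.haiku, latency.paragraph)
    print("\(model.name): throughput \(gainText), time-to-first-token \(latencyText) (haiku / paragraph)")
}
```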

Using LocalInference

Instantiate LocalInference with a DispatchQueue. Optionally, pass it into your agents service:

  init () {
    runnerQueue = DispatchQueue(label: "org.meta.llamastack")
    inferenceService = LocalInferenceService(queue: runnerQueue)
    agentsService = LocalAgentsService(inference: inferenceService)
  }

Before making any inference calls, load your model from your bundle:

let mainBundle = Bundle.main
inferenceService.loadModel(
    modelPath: mainBundle.url(forResource: "llama32_1b_spinquant", withExtension: "pte"),
    tokenizerPath: mainBundle.url(forResource: "tokenizer", withExtension: "model"),
    completion: {_ in } // use to handle load failures
)
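`Bundle.url(forResource:withExtension:)` returns an optional, so a missing or misnamed file only shows up later as a load failure. One way to surface it earlier is to resolve both URLs up front and fail with a descriptive error. The sketch below assumes this approach; `ModelResourceError` and `requiredURL(in:name:ext:)` are hypothetical helpers, not part of the SDK.

```swift
import Foundation

// Hypothetical error type describing which bundled file could not be found.
enum ModelResourceError: Error, Equatable {
    case missing(name: String, ext: String)
}

// Hypothetical helper: returns the resource URL or throws if it is not in the bundle.
func requiredURL(in bundle: Bundle, name: String, ext: String) throws -> URL {
    guard let url = bundle.url(forResource: name, withExtension: ext) else {
        throw ModelResourceError.missing(name: name, ext: ext)
    }
    return url
}

do {
    // Resource names match the loadModel call above.
    let modelURL = try requiredURL(in: .main, name: "llama32_1b_spinquant", ext: "pte")
    let tokenizerURL = try requiredURL(in: .main, name: "tokenizer", ext: "model")
    // Both files exist; pass modelURL and tokenizerURL to loadModel here.
    print("Found \(modelURL.lastPathComponent) and \(tokenizerURL.lastPathComponent)")
} catch {
    print("Model files are not bundled correctly: \(error)")
}
```

If either file is missing, the error names it, which is usually quicker to diagnose than a failure reported from inside the completion handler.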

Make inference calls (or agents calls) as you normally would with LlamaStack:

for await chunk in try await agentsService.initAndCreateTurn(
  messages: [
    .UserMessage(Components.Schemas.UserMessage(
      content: .case1("Call functions as needed to handle any actions in the following text:\n\n" + text),
      role: .user
    ))
  ]
) {
  // Handle each streamed chunk here.
  // ...
}

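Troubleshooting

If you receive errors like "missing package product" or "invalid checksum", try cleaning the build folder and resetting the Swift package cache:

(Opt+Click) Product > Clean Build Folder Immediately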

rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData