vLLM Client
A Rust client library for the OpenAI-compatible vLLM API.
Features
- OpenAI-compatible: uses the same API structure as OpenAI, making migration straightforward
- Streaming: full support for Server-Sent Events (SSE) streaming responses
- Tool calling: supports function/tool calls, including streaming incremental updates
- Reasoning models: built-in support for reasoning/thinking models (e.g. Qwen with thinking mode enabled)
- Async: fully asynchronous implementation on the Tokio runtime
- Type-safe: strongly typed definitions serialized with Serde
Getting Started
Add to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
Basic Usage
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, world!"}
        ]))
        .send()
        .await?;

    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}
Documentation
Languages
- English - English documentation
- 中文 - Chinese documentation (this page)
License
Licensed under either the Apache License, Version 2.0 or the MIT License, at your option.
Quick Start
Installation
Add vllm-client to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
Quick Start
Basic chat completion
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create the client
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    // Print the response
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}
Streaming
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Reasoning(delta) => print!("{}", delta),
            StreamEvent::Content(delta) => print!("{}", delta),
            _ => {}
        }
    }
    println!();
    Ok(())
}
Configuration
API Key
If your vLLM server requires authentication:
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");
Custom Timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(std::time::Duration::from_secs(60));
Next Steps
Installation
Requirements
- Rust: version 1.70 or later
- Cargo: installed automatically with Rust
Adding the Dependency
Add the dependency to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
Or run directly:
cargo add vllm-client
Dependencies
This library depends on the tokio async runtime; add it to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
For convenience the library re-exports serde_json::json; you can optionally add serde_json directly as well:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
Feature Flags
vllm-client currently has no optional feature flags; all functionality is enabled by default.
Verifying the Installation
Write a small program to verify the installation:
use vllm_client::VllmClient;

fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");
    println!("Client created, base URL: {}", client.base_url());
}
Run it:
cargo run
Starting a vLLM Server
Before using this client, you need a running vLLM server:
# Install vLLM
pip install vllm
# Start the server and load a model
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
Once started, the server exposes its API at http://localhost:8000/v1.
Troubleshooting
Connection failures
If you hit connection errors, check that:
- the vLLM server is actually running
- the server address is correct (default: http://localhost:8000/v1)
- no firewall is blocking the port
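Before digging deeper, a plain TCP probe can distinguish "server down / port blocked" from protocol-level problems. A minimal sketch using only the standard library (the address is a placeholder):

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
fn is_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.to_socket_addrs() {
        Ok(mut addrs) => addrs
            .next()
            .map(|a| TcpStream::connect_timeout(&a, timeout).is_ok())
            .unwrap_or(false),
        Err(_) => false,
    }
}

fn main() {
    let up = is_reachable("127.0.0.1:8000", Duration::from_secs(2));
    println!("vLLM port reachable: {}", up);
}
```

If this prints false, the problem is connectivity rather than the client or the API.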
TLS/SSL errors
If the vLLM server uses a self-signed HTTPS certificate, you will need to deal with certificate verification yourself, for example by adding the certificate to your system's trust store.
Request timeouts
For long-running requests, increase the timeout:
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes
Next Steps
Getting Started
This section walks you through your first API call.
Prerequisites
- Rust 1.70 or later
- A running vLLM server
Basic Chat Completion
The simplest usage looks like this:
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client pointing at the vLLM server
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how have you been?"}
        ]))
        .send()
        .await?;

    // Print the response content
    println!("Reply: {}", response.content.unwrap_or_default());
    Ok(())
}
Streaming
For real-time output, use streaming mode:
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Create a streaming request
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    // Handle stream events
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Reasoning(delta) => eprint!("[thinking: {}]", delta),
            StreamEvent::Done => println!("\n[done]"),
            StreamEvent::Error(e) => eprintln!("\nerror: {}", e),
            _ => {}
        }
    }
    Ok(())
}
Using the Builder Pattern
When you need more configuration, use the builder:
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key") // optional
    .timeout_secs(120)       // optional
    .build();
Complete Example
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    println!("Reply: {}", response.content.unwrap_or_default());

    // Print token usage statistics, if present
    if let Some(usage) = response.usage {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    Ok(())
}
Error Handling
It is good practice to handle errors explicitly:
use vllm_client::{VllmClient, json, VllmError};

async fn chat() -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat().await {
        Ok(text) => println!("Reply: {}", text),
        Err(VllmError::ApiError { status_code, message, .. }) => {
            eprintln!("API error ({}): {}", status_code, message);
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out");
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
Next Steps
Configuration
This document covers every configuration option of vllm-client.
Client Configuration
Basic configuration
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
Builder pattern
For more complex configuration, use the builder pattern:
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")
    .timeout_secs(120)
    .build();
Configuration Options
Base URL
The address of the vLLM server. It must include the /v1 path for OpenAI compatibility.
// Local development
let client = VllmClient::new("http://localhost:8000/v1");

// Remote server
let client = VllmClient::new("https://api.example.com/v1");

// Trailing slashes are handled automatically
let client = VllmClient::new("http://localhost:8000/v1/");
// equivalent to: "http://localhost:8000/v1"
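The trailing-slash handling amounts to normalizing the URL before endpoint paths are appended. A sketch of the idea (not the library's actual implementation):

```rust
/// Strip trailing slashes so "http://host/v1/" and "http://host/v1"
/// produce identical endpoint URLs when paths are joined on.
fn normalize_base_url(url: &str) -> String {
    url.trim_end_matches('/').to_string()
}

fn main() {
    let base = normalize_base_url("http://localhost:8000/v1/");
    // Joining "/chat/completions" now never yields a double slash.
    println!("{}/chat/completions", base);
}
```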
API Key
If the vLLM server requires authentication, configure an API key:
// Chained call
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");

// Builder pattern
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .build();
The API key is sent as a Bearer token in the Authorization request header.
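Concretely, every authenticated request then carries a header of this shape (a sketch of the header value only, not the client's internals):

```rust
/// Build the Authorization header value for Bearer authentication.
fn authorization_header_value(api_key: &str) -> String {
    format!("Bearer {}", api_key)
}

fn main() {
    // Sent as: Authorization: Bearer sk-your-api-key
    println!("Authorization: {}", authorization_header_value("sk-your-api-key"));
}
```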
Timeouts
Long-running tasks need a larger timeout:
// Chained call
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes

// Builder pattern
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)
    .build();
By default the underlying HTTP client's timeout applies (typically 30 seconds).
Request Parameters
When sending a request, you can configure the following parameters:
Model selection
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
Sampling parameters
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .temperature(0.7) // 0.0 - 2.0
    .top_p(0.9)       // 0.0 - 1.0
    .top_k(50)        // vLLM extension
    .max_tokens(1024) // maximum output tokens
    .send()
    .await?;
| Parameter | Type | Range | Description |
|---|---|---|---|
| temperature | f32 | 0.0 - 2.0 | Controls randomness; higher values give more random output |
| top_p | f32 | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | i32 | 1+ | Top-K sampling (vLLM extension) |
| max_tokens | u32 | 1+ | Maximum number of generated tokens |
Stop sequences
use serde_json::json;

// Multiple stop sequences
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!(["END", "STOP", "\n\n"]))
    .send()
    .await?;

// A single stop sequence
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!("END"))
    .send()
    .await?;
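As in the OpenAI API, generation halts at the first stop sequence, and the matched stop text itself is normally not included in the output. The effect can be illustrated with a client-side sketch:

```rust
/// Truncate `text` at the earliest occurrence of any stop sequence,
/// mirroring how the server cuts off generation.
fn apply_stop_sequences(text: &str, stops: &[&str]) -> String {
    let cut = stops
        .iter()
        .filter_map(|s| text.find(s))
        .min()
        .unwrap_or(text.len());
    text[..cut].to_string()
}

fn main() {
    let out = apply_stop_sequences("Hello world END trailing", &["END", "STOP"]);
    println!("{}", out); // prints "Hello world "
}
```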
Extra parameters
vLLM accepts additional parameters through the extra() method:
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Please think about this problem"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        },
        "reasoning_effort": "high"
    }))
    .send()
    .await?;
Environment Variables
You can configure the client through environment variables:
use std::env;
use vllm_client::VllmClient;

let base_url = env::var("VLLM_BASE_URL")
    .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
let api_key = env::var("VLLM_API_KEY").ok();

let mut client_builder = VllmClient::builder()
    .base_url(&base_url);

if let Some(key) = api_key {
    client_builder = client_builder.api_key(&key);
}

let client = client_builder.build();
Common environment variables
| Variable | Description | Example |
|---|---|---|
| VLLM_BASE_URL | vLLM server address | http://localhost:8000/v1 |
| VLLM_API_KEY | API key (optional) | sk-xxx |
| VLLM_TIMEOUT | Timeout in seconds | 300 |
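VLLM_TIMEOUT is not read automatically by the client (the builder example above only wires up the URL and key, which is an assumption worth checking), so parse it yourself. A sketch:

```rust
/// Parse a timeout value in seconds, falling back to a default
/// when the variable is unset or not a number.
fn parse_timeout(raw: Option<String>, default_secs: u64) -> u64 {
    raw.and_then(|v| v.parse().ok()).unwrap_or(default_secs)
}

fn main() {
    let secs = parse_timeout(std::env::var("VLLM_TIMEOUT").ok(), 30);
    println!("timeout: {}s", secs);
    // Then pass `secs` to the builder: .timeout_secs(secs)
}
```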
Best Practices
Reuse the client
Create the client once and reuse it:
// Recommended: reuse one client
let client = VllmClient::new("http://localhost:8000/v1");
for prompt in prompts {
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;
}

// Avoid: creating a client per request
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // inefficient!
    // ...
}
Choosing a timeout
Pick a timeout that matches the workload:
| Use case | Suggested timeout |
|---|---|
| Simple Q&A | 30 seconds |
| Complex reasoning | 2-5 minutes |
| Long-form generation | 10+ minutes |
Error handling
Always handle errors properly:
use vllm_client::{VllmClient, VllmError};

match client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await
{
    Ok(response) => println!("{}", response.content.unwrap_or_default()),
    Err(VllmError::Timeout) => eprintln!("Request timed out"),
    Err(VllmError::ApiError { status_code, message, .. }) => {
        eprintln!("API error ({}): {}", status_code, message);
    }
    Err(e) => eprintln!("Error: {}", e),
}
Next Steps
API Reference
This document provides a complete reference for the vLLM Client API.
Contents
Client
VllmClient
The main client for interacting with the vLLM API.
use vllm_client::VllmClient;

// Create a new client
let client = VllmClient::new("http://localhost:8000/v1");

// With an API key
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// With a custom timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(std::time::Duration::from_secs(60));
Methods
| Method | Description |
|---|---|
| new(base_url: &str) | Creates a new client with the given base URL |
| with_api_key(key: &str) | Sets the API key used for authentication |
| with_timeout(duration) | Sets the request timeout |
| chat | Accesses the chat completions API |
Chat Completions
Creating a completion request
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]))
    .temperature(0.7)
    .max_tokens(1000)
    .send()
    .await?;
Builder methods
| Method | Type | Description |
|---|---|---|
| model(name) | &str | Name of the model to use |
| messages(msgs) | Value | Array of chat messages |
| temperature(temp) | f32 | Sampling temperature (0.0-2.0) |
| max_tokens(tokens) | u32 | Maximum number of generated tokens |
| top_p(p) | f32 | Nucleus sampling parameter |
| top_k(k) | u32 | Top-k sampling parameter |
| stream(enable) | bool | Enables streaming responses |
| tools(tools) | Value | Tool definitions for function calling |
| extra(json) | Value | Extra (vendor-specific) parameters |
Response structure
pub struct ChatCompletionResponse {
    pub id: String,
    pub object: String,
    pub created: u64,
    pub model: String,
    pub choices: Vec<Choice>,
    pub usage: Usage,
}

pub struct Choice {
    pub index: u32,
    pub message: Message,
    pub finish_reason: Option<String>,
}

pub struct Message {
    pub role: String,
    pub content: Option<String>,
    pub tool_calls: Option<Vec<ToolCall>>,
}

pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
}
Streaming
Streaming completions
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let client = VllmClient::new("http://localhost:8000/v1");

let mut stream = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "Write a poem"}
    ]))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match &event {
        StreamEvent::Reasoning(delta) => {
            // Reasoning content (thinking models)
            print!("{}", delta);
        }
        StreamEvent::Content(delta) => {
            // Regular content
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { tool_call_id, delta } => {
            // Incremental tool call update
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            // Complete tool call
        }
        StreamEvent::Usage(usage) => {
            // Token usage information
        }
        StreamEvent::Done => {
            // Stream finished
            break;
        }
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
StreamEvent variants
| Variant | Description |
|---|---|
| Reasoning(String) | Reasoning/thinking content |
| Content(String) | Regular content delta |
| ToolCallDelta { tool_call_id, delta } | Streaming tool call |
| ToolCallComplete(ToolCall) | Complete tool call |
| Usage(Usage) | Token usage statistics |
| Done | Stream finished |
| Error(VllmError) | An error occurred |
Tool Calling
Defining tools
use vllm_client::json;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);

let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// Handle tool calls
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
    for tool_call in tool_calls {
        println!("Function: {}", tool_call.function.name);
        println!("Arguments: {}", tool_call.function.arguments);
    }
}
ToolCall structure
pub struct ToolCall {
    pub id: String,
    pub r#type: String,
    pub function: FunctionCall,
}

pub struct FunctionCall {
    pub name: String,
    pub arguments: String, // JSON string
}
Returning tool results
// After executing the tool, send its result back
let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"},
        {"role": "assistant", "tool_calls": [
            {
                "id": "call_123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"Tokyo\"}"
                }
            }
        ]},
        {
            "role": "tool",
            "tool_call_id": "call_123",
            "content": "{\"temperature\": 25, \"condition\": \"sunny\"}"
        }
    ]))
    .tools(tools)
    .send()
    .await?;
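Between the assistant's tool_calls message and the final tool message, your code has to execute the function itself. A minimal name-based dispatch sketch (the handler and its canned result are illustrative, not part of vllm-client):

```rust
use std::collections::HashMap;

/// A handler takes the raw JSON arguments string and returns
/// a JSON result string for the "tool" message.
type Handler = fn(&str) -> String;

fn get_weather(_arguments: &str) -> String {
    // A real handler would parse `arguments` and call a weather API.
    r#"{"temperature": 25, "condition": "sunny"}"#.to_string()
}

/// Look up the handler by function name and run it.
fn dispatch(name: &str, arguments: &str, handlers: &HashMap<&str, Handler>) -> Option<String> {
    handlers.get(name).map(|h| h(arguments))
}

fn main() {
    let mut handlers: HashMap<&str, Handler> = HashMap::new();
    handlers.insert("get_weather", get_weather);

    // Name/arguments as they would come from a ToolCall.
    let result = dispatch("get_weather", r#"{"location": "Tokyo"}"#, &handlers);
    println!("{:?}", result);
}
```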
Type Definitions
Message types
// System message
json!({"role": "system", "content": "You are a helpful assistant."})

// User message
json!({"role": "user", "content": "Hello!"})

// Assistant message
json!({"role": "assistant", "content": "Hello!"})

// Tool result message
json!({
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "result"
})
vLLM-Specific Parameters
Use .extra() to pass vLLM-specific parameters:
client
    .chat
    .completions()
    .create()
    .model("qwen-3")
    .messages(json!([{"role": "user", "content": "Think about this problem"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .send()
    .await?;
Error Handling
VllmError
use vllm_client::VllmError;

match client.chat.completions().create().send().await {
    Ok(response) => { /* ... */ },
    Err(VllmError::HttpError(e)) => {
        eprintln!("HTTP error: {}", e);
    }
    Err(VllmError::ApiError { message, code }) => {
        eprintln!("API error ({}): {}", code, message);
    }
    Err(VllmError::StreamError(e)) => {
        eprintln!("Stream error: {}", e);
    }
    Err(VllmError::ParseError(e)) => {
        eprintln!("Parse error: {}", e);
    }
    Err(e) => {
        eprintln!("Other error: {}", e);
    }
}
Error variants
| Variant | Description |
|---|---|
| HttpError | HTTP request/response error |
| ApiError | API-level error (rate limiting, etc.) |
| StreamError | Streaming-specific error |
| ParseError | JSON parse error |
| IoError | I/O error |
Complete Example
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .with_api_key("your-api-key");

    // Streaming example
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "Write a haiku about programming"}
        ]))
        .temperature(0.7)
        .max_tokens(100)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => eprintln!("Error: {}", e),
            _ => {}
        }
    }
    println!();
    Ok(())
}
Client API
VllmClient is the main entry point for using the vLLM API.
Creating a Client
Simple creation
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
With an API key
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");
Setting a timeout
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120); // 2 minutes
Using the Builder Pattern
For complex configuration, use the builder:
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .timeout_secs(300)
    .build();
Method Reference
new(base_url: impl Into<String>) -> Self
Creates a client with the given base URL.
let client = VllmClient::new("http://localhost:8000/v1");
Parameters:
- base_url - base URL of the vLLM server (must include the /v1 path)
Notes:
- trailing slashes are removed automatically
- client creation is cheap, but reusing one client is still recommended
with_api_key(self, api_key: impl Into<String>) -> Self
Sets the API key (builder style).
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
Parameters:
- api_key - API key used for Bearer authentication
Notes:
- the API key is sent as a Bearer token in the Authorization request header
- this method returns a new client instance
timeout_secs(self, secs: u64) -> Self
Sets the request timeout (builder style).
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300);
Parameters:
- secs - timeout in seconds
Notes:
- applies to all requests made by this client
- increase it for long-running generation tasks
base_url(&self) -> &str
Returns the client's base URL.
let client = VllmClient::new("http://localhost:8000/v1");
assert_eq!(client.base_url(), "http://localhost:8000/v1");
api_key(&self) -> Option<&str>
Returns the configured API key, if any.
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
assert_eq!(client.api_key(), Some("sk-xxx"));
builder() -> VllmClientBuilder
Creates a new client builder with additional configuration options.
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
API Modules
The client exposes several API modules:
chat - chat completions API
Access the chat completion endpoint:
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
completions - legacy completions API
Access the legacy text completion endpoint:
let response = client.completions.create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .prompt("Once upon a time")
    .send()
    .await?;
VllmClientBuilder
The builder provides a flexible way to configure the client.
Methods
| Method | Type | Description |
|---|---|---|
| base_url(url) | impl Into<String> | Sets the base URL |
| api_key(key) | impl Into<String> | Sets the API key |
| timeout_secs(secs) | u64 | Sets the timeout in seconds |
| build() | - | Builds the client |
Defaults
| Option | Default |
|---|---|
| base_url | http://localhost:8000/v1 |
| api_key | None |
| timeout_secs | HTTP client default (30 seconds) |
Examples
Basic usage
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
Using environment variables
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    let base_url = env::var("VLLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
    let api_key = env::var("VLLM_API_KEY").ok();

    let mut builder = VllmClient::builder().base_url(&base_url);
    if let Some(key) = api_key {
        builder = builder.api_key(&key);
    }
    builder.build()
}
Multiple requests
Reuse the client across requests:
use vllm_client::{VllmClient, json};

async fn process_prompts(client: &VllmClient, prompts: &[&str]) -> Vec<String> {
    let mut results = Vec::new();

    for prompt in prompts {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match response {
            Ok(r) => results.push(r.content.unwrap_or_default()),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
    results
}
Thread Safety
VllmClient is thread-safe and can be shared across threads:
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Clone the Arc and pass it between threads
let client_clone = Arc::clone(&client);
See Also
Chat Completions API
The chat completions API is the primary interface for generating text responses.
Overview
Access it via client.chat.completions():
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;
Request Builder
Required parameters
model(name: impl Into<String>)
Sets the model used for generation.
.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")
messages(messages: Value)
Sets the conversation messages as a JSON array.
.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))
Message roles
| Role | Description |
|---|---|
| system | Sets the assistant's behavior |
| user | User input |
| assistant | Assistant replies (for multi-turn conversations) |
| tool | Tool results (for function calling) |
Sampling parameters
temperature(temp: f32)
Controls randomness. Range: 0.0 to 2.0.
.temperature(0.7) // typical behavior
.temperature(0.0) // deterministic output
.temperature(1.5) // more creative
max_tokens(tokens: u32)
Maximum number of generated tokens.
.max_tokens(1024)
.max_tokens(4096)
top_p(p: f32)
Nucleus sampling threshold. Range: 0.0 to 1.0.
.top_p(0.9)
top_k(k: i32)
Top-K sampling (vLLM extension). Restricts sampling to the top K tokens.
.top_k(50)
stop(sequences: Value)
Stops generation when any of these sequences is encountered.
// Multiple stop sequences
.stop(json!(["END", "STOP", "\n\n"]))

// A single stop sequence
.stop(json!("---"))
Tool parameters
tools(tools: Value)
Defines tools/functions the model may call.
.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))
tool_choice(choice: Value)
Controls tool selection behavior.
.tool_choice(json!("auto"))     // model decides
.tool_choice(json!("none"))     // never use tools
.tool_choice(json!("required")) // force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
Advanced parameters
stream(enable: bool)
Enables streaming responses.
.stream(true)
extra(params: Value)
Passes vLLM-specific or other extra parameters.
.extra(json!({
    "chat_template_kwargs": {
        "think_mode": true
    },
    "reasoning_effort": "high"
}))
Sending the Request
send() - complete response
Returns the full response in one piece.
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
send_stream() - streaming response
Returns a stream of events for real-time output.
let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
See Streaming for details.
Response Structure
ChatCompletionResponse
| Field | Type | Description |
|---|---|---|
| raw | Value | Raw JSON response |
| id | String | Response ID |
| object | String | Object type |
| model | String | Model used |
| content | Option<String> | Generated content |
| reasoning_content | Option<String> | Reasoning content (thinking models) |
| tool_calls | Option<Vec<ToolCall>> | Tool calls |
| finish_reason | Option<String> | Why generation stopped |
| usage | Option<Usage> | Token usage statistics |
Example
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Get the content
println!("Content: {}", response.content.unwrap_or_default());

// Check the reasoning content (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check why generation stopped
match response.finish_reason.as_deref() {
    Some("stop") => println!("Finished naturally"),
    Some("length") => println!("Hit the max token limit"),
    Some("tool_calls") => println!("Made tool calls"),
    _ => {}
}

// Token usage statistics
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}
Complete Example
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a programming assistant."},
            {"role": "user", "content": "Write a Rust function that reverses a string"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }
    Ok(())
}
Multi-Turn Conversations
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First turn
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Xiao Ming"}
    ]))
    .send()
    .await?;

// Continue the conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Xiao Ming"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What is my name?"}
    ]))
    .send()
    .await?;
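The growing message list is easy to mismanage by hand, so a small history helper keeps it tidy. A sketch with plain structs (nothing here is part of vllm-client — you would still convert the list into the json! array when sending):

```rust
/// A single chat message: role ("user", "assistant", ...) plus content.
#[derive(Debug, Clone)]
struct Msg {
    role: String,
    content: String,
}

#[derive(Default)]
struct History {
    messages: Vec<Msg>,
}

impl History {
    /// Append one turn to the conversation.
    fn push(&mut self, role: &str, content: &str) {
        self.messages.push(Msg {
            role: role.to_string(),
            content: content.to_string(),
        });
    }
}

fn main() {
    let mut history = History::default();
    history.push("user", "My name is Xiao Ming");
    // ...send the request, then record the assistant's reply...
    history.push("assistant", "Nice to meet you, Xiao Ming!");
    history.push("user", "What is my name?");

    // `history.messages` now holds all three turns, ready to be
    // serialized into the `messages` array of the next request.
    println!("{} messages queued", history.messages.len());
}
```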
See Also
Streaming API
Streaming lets you process model output in real time, token by token, instead of waiting for the complete response.
Overview
vLLM Client streams via Server-Sent Events (SSE). Call send_stream() instead of send() to get a streaming response.
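For intuition, each SSE frame on the wire is a line of the form `data: {json}`, terminated by a literal `data: [DONE]`. The client parses this for you; a simplified sketch of the framing:

```rust
/// What a single SSE line from the server means.
#[derive(Debug, PartialEq)]
enum SseLine<'a> {
    Data(&'a str), // JSON payload of one chunk
    Done,          // the "data: [DONE]" terminator
    Ignore,        // comments, blank keep-alive lines, etc.
}

fn parse_sse_line(line: &str) -> SseLine<'_> {
    match line.strip_prefix("data: ") {
        Some("[DONE]") => SseLine::Done,
        Some(payload) => SseLine::Data(payload),
        None => SseLine::Ignore,
    }
}

fn main() {
    println!("{:?}", parse_sse_line(r#"data: {"choices":[]}"#));
}
```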
Basic Streaming
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    println!();
    Ok(())
}
StreamEvent Variants
The StreamEvent enum represents the different kinds of streaming events:
| Variant | Description |
|---|---|
| Content(String) | Regular content token delta |
| Reasoning(String) | Reasoning/thinking content (thinking models) |
| ToolCallDelta | Incremental streamed tool call |
| ToolCallComplete(ToolCall) | Complete tool call, ready to execute |
| Usage(Usage) | Token usage statistics |
| Done | Streaming finished |
| Error(VllmError) | An error occurred |
Content events
The most common event type, carrying text tokens:
match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        std::io::Write::flush(&mut std::io::stdout()).ok();
    }
    _ => {}
}
Reasoning events
Emitted by models with reasoning capabilities (e.g. Qwen with thinking mode enabled):
match event {
    StreamEvent::Reasoning(delta) => {
        eprintln!("[thinking] {}", delta);
    }
    StreamEvent::Content(delta) => {
        print!("{}", delta);
    }
    _ => {}
}
Tool call events
Tool calls are streamed incrementally, with a notification once complete:
match event {
    StreamEvent::ToolCallDelta { index, id, name, arguments } => {
        println!("tool delta: index={}, name={}", index, name);
        // `arguments` is a partial JSON string
    }
    StreamEvent::ToolCallComplete(tool_call) => {
        println!("tool ready: {}({})", tool_call.name, tool_call.arguments);
        // execute the tool and return its result
    }
    _ => {}
}
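The argument fragments have to be concatenated per tool-call index until the call completes; the client does this internally before emitting ToolCallComplete. A simplified sketch of the accumulation:

```rust
use std::collections::BTreeMap;

/// Accumulate streamed argument fragments per tool-call index.
fn accumulate(deltas: &[(u32, &str)]) -> BTreeMap<u32, String> {
    let mut calls: BTreeMap<u32, String> = BTreeMap::new();
    for (index, fragment) in deltas {
        calls.entry(*index).or_default().push_str(fragment);
    }
    calls
}

fn main() {
    // Fragments as they might arrive from ToolCallDelta events.
    let deltas = [
        (0, r#"{"loca"#),
        (0, r#"tion": "#),
        (0, r#""Tokyo"}"#),
    ];
    let calls = accumulate(&deltas);
    println!("{}", calls[&0]); // prints {"location": "Tokyo"}
}
```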
Usage events
Token usage information is usually sent last:
match event {
    StreamEvent::Usage(usage) => {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    _ => {}
}
MessageStream
The MessageStream type is an async iterator that yields StreamEvent values.
Methods
| Method | Returns | Description |
|---|---|---|
| next() | Option<StreamEvent> | Gets the next event (async) |
| collect_content() | String | Collects all content into a string |
| into_stream() | impl Stream | Converts into a generic stream |
Collecting all content
For convenience, you can collect all content at once:
let content = stream.collect_content().await?;
println!("Full response: {}", content);
Note: this waits for the complete response, which forfeits the benefit of streaming. Use it only when you want the streaming transport but just need the final text.
Complete Streaming Example
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();
    let mut usage = None;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => {
                println!("\n[stream finished]");
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e);
            }
            _ => {}
        }
    }

    // Print a summary
    if !reasoning.is_empty() {
        eprintln!("\n--- Reasoning ---");
        eprintln!("{}", reasoning);
    }
    if let Some(usage) = usage {
        eprintln!("\n--- Token usage ---");
        eprintln!("Prompt: {}, completion: {}, total: {}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    Ok(())
}
Streaming Tool Calls
When tools are in use, tool calls arrive incrementally:
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let mut stream = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::ToolCallComplete(tool_call) => {
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute the tool calls
for tool_call in tool_calls {
    println!("Tool: {} args: {}", tool_call.name, tool_call.arguments);
    // execute, then return the result in the next message
}
Error Handling
Errors can occur at any point during streaming:
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Error(e) => return Err(e),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    Ok(content)
}
Best Practices
Flush output
When displaying output in real time, flush stdout after each write:
use std::io::{self, Write};

match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        io::stdout().flush().ok();
    }
    _ => {}
}
Handle interruption
In interactive applications, handle Ctrl+C gracefully:
use tokio::signal;

tokio::select! {
    result = process_stream(&mut stream) => {
        // completed normally
    }
    _ = signal::ctrl_c() => {
        println!("\n[interrupted]");
    }
}
Idle-stream timeout
Guard against streams that stall:
use tokio::time::{timeout, Duration};

let result = timeout(
    Duration::from_secs(60),
    stream.next()
).await;

match result {
    Ok(Some(event)) => { /* handle the event */ }
    Ok(None) => { /* stream ended */ }
    Err(_) => { /* timed out */ }
}
Completions 流式 API
vLLM Client 同时支持旧版 /v1/completions API 的流式调用,使用 CompletionStreamEvent。
CompletionStreamEvent 类型
| 变体 | 说明 |
|---|---|
| `Text(String)` | 文本 token 增量 |
| `FinishReason(String)` | 流结束原因(如 `"stop"`、`"length"`) |
| `Usage(Usage)` | Token 使用统计 |
| `Done` | 流式传输完成 |
| `Error(VllmError)` | 发生错误 |
Completions 流式示例
```rust
use std::io::Write; // flush() 需要 Write trait 在作用域内

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("写一首关于春天的诗")
        .max_tokens(1024)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[结束原因: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!(
                    "\nTokens: 提示词={}, 补全={}, 总计={}",
                    usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[流式传输完成]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("错误: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}
```
CompletionStream 方法
| 方法 | 返回类型 | 说明 |
|---|---|---|
| `next()` | `Option<CompletionStreamEvent>` | 获取下一个事件(异步) |
| `collect_text()` | `String` | 收集所有文本为字符串 |
| `into_stream()` | `impl Stream` | 转换为通用流 |
工具调用 API
工具调用(也称为函数调用)允许模型在生成过程中调用外部函数,实现与外部 API、数据库和自定义逻辑的集成。
概述
vLLM Client 支持 OpenAI 兼容的工具调用:
```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "东京的天气怎么样?"}
    ]))
    .tools(tools)
    .send()
    .await?;
```
定义工具
基础工具定义
工具使用遵循 OpenAI 规范的 JSON 格式定义:
```rust
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "获取指定地点的当前天气",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "城市名称,如东京"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "温度单位"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);
```
多个工具
```rust
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "获取天气信息",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "搜索网页信息",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"}
                },
                "required": ["query"]
            }
        }
    }
]);
```
工具选择
控制模型如何选择工具:
```rust
// 让模型自行决定(默认)
.tool_choice(json!("auto"))

// 禁止使用工具
.tool_choice(json!("none"))

// 强制使用工具
.tool_choice(json!("required"))

// 强制使用特定工具
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
```
处理工具调用
检查工具调用
```rust
use vllm_client::{VllmClient, json, VllmError};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "东京的天气怎么样?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// 检查响应是否包含工具调用
if response.has_tool_calls() {
    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            println!("函数: {}", tool_call.name);
            println!("参数: {}", tool_call.arguments);
        }
    }
}
```
ToolCall 结构
```rust
pub struct ToolCall {
    pub id: String,        // 调用的唯一标识
    pub name: String,      // 函数名称
    pub arguments: String, // 参数的 JSON 字符串
}
```
解析参数
将参数字符串解析为类型化数据:
```rust
use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

if let Some(tool_call) = response.first_tool_call() {
    // 解析为特定类型
    match tool_call.parse_args_as::<WeatherArgs>() {
        Ok(args) => {
            println!("地点: {}", args.location);
            if let Some(unit) = args.unit {
                println!("单位: {}", unit);
            }
        }
        Err(e) => {
            eprintln!("解析参数失败: {}", e);
        }
    }

    // 或解析为通用 JSON
    let args: Value = tool_call.parse_args()?;
}
```
工具结果方法
创建工具结果消息:
```rust
// 创建工具结果消息
let tool_result = tool_call.result(json!({
    "temperature": 25,
    "condition": "sunny",
    "humidity": 60
}));

// 返回一个可直接加入消息的 JSON 对象:
// {
//   "role": "tool",
//   "tool_call_id": "...",
//   "content": "{\"temperature\": 25, ...}"
// }
```
完整工具调用流程
```rust
use vllm_client::{VllmClient, json, ToolCall};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
}

// 模拟天气 API(忽略参数,返回固定结果)
fn get_weather(_location: &str) -> WeatherResult {
    WeatherResult {
        temperature: 25.0,
        condition: "sunny".to_string(),
    }
}

async fn chat_with_tools(
    client: &VllmClient,
    user_message: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "获取当前天气",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // 第一次请求
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    // 检查模型是否要调用工具
    if response.has_tool_calls() {
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // 将助手的工具调用加入消息
        if let Some(tool_calls) = &response.tool_calls {
            let assistant_msg = response.assistant_message();
            messages.push(assistant_msg);

            // 执行每个工具并加入结果
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location);
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // 带工具结果继续对话
        let final_response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        return Ok(final_response.content.unwrap_or_default());
    }

    Ok(response.content.unwrap_or_default())
}
```
流式工具调用
流式响应中,工具调用会增量推送:
```rust
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "东京和巴黎的天气怎么样?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();
let mut content = String::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => {
            content.push_str(&delta);
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { index, id, name, arguments } => {
            println!("[工具增量 {}] {}({})", index, name, arguments);
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            println!("[工具完成] {}({})", tool_call.name, tool_call.arguments);
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// 执行所有收集到的工具调用
for tool_call in tool_calls {
    // 执行并返回结果...
}
```
多轮工具调用
```rust
// `tools` 与 `execute_tool` 为前文定义的工具列表与工具执行函数
async fn multi_round_tool_calling(
    client: &VllmClient,
    user_message: &str,
    max_rounds: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for _ in 0..max_rounds {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // 加入带工具调用的助手消息
            messages.push(response.assistant_message());

            // 执行工具并加入结果
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // 没有更多工具调用,返回内容
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err("超过最大轮数".into())
}
```
最佳实践
清晰的工具描述
写清楚、详细的描述:
```
// 推荐
"description": "获取指定城市的当前天气状况。返回温度、湿度和天气状况。"

// 避免
"description": "获取天气"
```
精确的参数 Schema
定义准确的 JSON Schema:
```json
"parameters": {
  "type": "object",
  "properties": {
    "location": {
      "type": "string",
      "description": "城市名称或坐标"
    },
    "days": {
      "type": "integer",
      "minimum": 1,
      "maximum": 7,
      "description": "预报天数"
    }
  },
  "required": ["location"]
}
```
错误处理
优雅地处理工具执行错误:
```rust
let tool_result = match execute_tool(&tool_call.name, &tool_call.arguments) {
    Ok(result) => json!({"success": true, "data": result}),
    Err(e) => json!({"success": false, "error": e.to_string()}),
};
messages.push(tool_call.result(tool_result));
```
错误处理
本文档介绍 vLLM Client 中的错误处理机制。
VllmError 枚举
vLLM Client 中的所有错误都通过 VllmError 枚举表示:
```rust
use thiserror::Error;

#[derive(Debug, Error, Clone)]
pub enum VllmError {
    #[error("HTTP request failed: {0}")]
    Http(String),

    #[error("JSON error: {0}")]
    Json(String),

    #[error("API error (status {status_code}): {message}")]
    ApiError {
        status_code: u16,
        message: String,
        error_type: Option<String>,
    },

    #[error("Stream error: {0}")]
    Stream(String),

    #[error("Connection timeout")]
    Timeout,

    #[error("Model not found: {0}")]
    ModelNotFound(String),

    #[error("Missing required parameter: {0}")]
    MissingParameter(String),

    #[error("No response content")]
    NoContent,

    #[error("Invalid response format: {0}")]
    InvalidResponse(String),

    #[error("{0}")]
    Other(String),
}
```
错误类型
| 变体 | 发生场景 |
|---|---|
| `Http` | 网络错误、连接失败 |
| `Json` | 序列化/反序列化错误 |
| `ApiError` | 服务器返回错误响应 |
| `Stream` | 流式响应过程中的错误 |
| `Timeout` | 请求超时 |
| `ModelNotFound` | 指定的模型不存在 |
| `MissingParameter` | 缺少必需参数 |
| `NoContent` | 响应无内容 |
| `InvalidResponse` | 响应格式不符合预期 |
| `Other` | 其他错误 |
基础错误处理
```rust
use vllm_client::{VllmClient, json, VllmError};

async fn chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat("你好!").await {
        Ok(text) => println!("响应: {}", text),
        Err(e) => eprintln!("错误: {}", e),
    }
}
```
详细错误处理
针对不同错误类型进行不同处理:
```rust
use vllm_client::{VllmClient, json, VllmError};

#[tokio::main]
async fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "你好!"}]))
        .send()
        .await;

    match result {
        Ok(response) => {
            println!("成功: {}", response.content.unwrap_or_default());
        }
        Err(VllmError::ApiError { status_code, message, error_type }) => {
            eprintln!("API 错误 (HTTP {}): {}", status_code, message);
            if let Some(etype) = error_type {
                eprintln!("错误类型: {}", etype);
            }
        }
        Err(VllmError::Timeout) => {
            eprintln!("请求超时,请尝试增加超时时间。");
        }
        Err(VllmError::Http(msg)) => {
            eprintln!("网络错误: {}", msg);
        }
        Err(VllmError::ModelNotFound(model)) => {
            eprintln!("模型 '{}' 未找到,请检查可用模型。", model);
        }
        Err(VllmError::MissingParameter(param)) => {
            eprintln!("缺少必需参数: {}", param);
        }
        Err(e) => {
            eprintln!("其他错误: {}", e);
        }
    }
}
```
HTTP 状态码
常见的 API 错误状态码:
| 状态码 | 含义 | 处理建议 |
|---|---|---|
| 400 | 请求格式错误 | 检查请求参数 |
| 401 | 未授权 | 检查 API Key |
| 403 | 禁止访问 | 检查权限 |
| 404 | 未找到 | 检查端点或模型名称 |
| 429 | 请求频率限制 | 实现退避重试 |
| 500 | 服务器内部错误 | 重试或联系管理员 |
| 502 | 网关错误 | 检查 vLLM 服务器状态 |
| 503 | 服务不可用 | 等待后重试 |
| 504 | 网关超时 | 增加超时时间或重试 |
可重试错误
检查错误是否可重试:
```rust
use vllm_client::VllmError;

fn should_retry(error: &VllmError) -> bool {
    error.is_retryable()
}

// 手动检查
match error {
    VllmError::Timeout => true,
    VllmError::ApiError { status_code: 429, .. } => true,       // 频率限制
    VllmError::ApiError { status_code: 500..=504, .. } => true, // 服务器错误
    _ => false,
}
```
指数退避重试
```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn chat_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut retries = 0;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && retries < max_retries => {
                retries += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(retries - 1));
                eprintln!("第 {} 次重试,等待 {:?}: {}", retries, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
流式响应错误处理
处理流式响应过程中的错误:
```rust
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => return Err(e),
            _ => {}
        }
    }

    Ok(content)
}
```
错误上下文
为错误添加上下文信息,便于调试:
```rust
use vllm_client::{VllmClient, json};

async fn chat_with_context(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("获取对话响应失败: {}", e))?;

    Ok(response.content.unwrap_or_default())
}
```
使用 anyhow 或 eyre
对于使用 anyhow 或 eyre 的应用程序:
```rust
use vllm_client::{VllmClient, json};
use anyhow::{Context, Result};

async fn chat(prompt: &str) -> Result<String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .context("发送对话请求失败")?;

    Ok(response.content.unwrap_or_default())
}
```
最佳实践
1. 始终处理错误
```rust
// 不好的做法
let response = client.chat.completions().create()
    .send().await.unwrap();

// 好的做法
match client.chat.completions().create().send().await {
    Ok(r) => { /* 处理响应 */ },
    Err(e) => eprintln!("错误: {}", e),
}
```
2. 设置适当的超时时间
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 长时间任务设置为 5 分钟
```
3. 记录带上下文的错误
```rust
Err(e) => {
    log::error!("对话请求失败: {}", e);
    log::debug!("请求详情: model={}, prompt_len={}", model, prompt.len());
}
```
4. 实现优雅降级
```rust
match primary_client.chat.completions().create().send().await {
    Ok(r) => r,
    Err(e) => {
        log::warn!("主客户端失败: {}, 尝试备用客户端", e);
        fallback_client.chat.completions().create().send().await?
    }
}
```
示例代码
本节包含各种使用场景的代码示例。
基础聊天
简单对话
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "你好,请介绍一下你自己。"} ])) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
带系统提示的对话
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "system", "content": "你是一个专业的 Rust 编程助手,回答简洁准确。"}, {"role": "user", "content": "什么是所有权?"} ])) .temperature(0.7) .max_tokens(500) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
多轮对话
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "我叫张三"}, {"role": "assistant", "content": "你好,张三!很高兴认识你。有什么我可以帮助你的吗?"}, {"role": "user", "content": "我叫什么名字?"} ])) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
流式聊天
基本流式输出
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "写一首关于春天的诗"} ])) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { match &event { StreamEvent::Content(delta) => print!("{}", delta), StreamEvent::Done => break, StreamEvent::Error(e) => eprintln!("错误: {}", e), _ => {} } } println!(); Ok(()) }
带思考模式的流式输出
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "user", "content": "解释相对论"} ])) .extra(json!({"chat_template_kwargs": {"enable_thinking": true}})) .stream(true) .send_stream() .await?; println!("=== 思考过程 ==="); while let Some(event) = stream.next().await { match &event { StreamEvent::Reasoning(delta) => { // 思考内容 print!("{}", delta); } StreamEvent::Content(delta) => { // 正式回复内容 print!("{}", delta); } StreamEvent::Done => break, _ => {} } } println!(); Ok(()) }
流式 Completions
旧版 Completions API 流式调用
```rust
use std::io::Write; // flush() 需要 Write trait 在作用域内

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("什么是机器学习?")
        .max_tokens(500)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[结束原因: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!(
                    "\nTokens: 提示词={}, 补全={}, 总计={}",
                    usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[流式传输完成]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("错误: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}
```
注意: 对于新项目,推荐使用 Chat Completions API(`client.chat.completions()`),它提供更灵活的功能和更好的消息格式。
工具调用
定义和使用工具
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 定义工具 let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定城市的当前天气", "parameters": { "type": "object", "properties": { "city": { "type": "string", "description": "城市名称,如:北京、上海" } }, "required": ["city"] } } }, { "type": "function", "function": { "name": "get_time", "description": "获取指定城市的当前时间", "parameters": { "type": "object", "properties": { "city": { "type": "string", "description": "城市名称" } }, "required": ["city"] } } } ]); // 发送请求 let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "北京现在天气怎么样?"} ])) .tools(tools) .send() .await?; // 检查是否有工具调用 if let Some(tool_calls) = &response.choices[0].message.tool_calls { for tool_call in tool_calls { println!("工具: {}", tool_call.function.name); println!("参数: {}", tool_call.function.arguments); // 在这里执行实际的工具调用 // let result = execute_tool(&tool_call.function.name, &tool_call.function.arguments); } } Ok(()) }
返回工具结果
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取天气信息", "parameters": { "type": "object", "properties": { "city": {"type": "string"} }, "required": ["city"] } } } ]); // 模拟对话流程 let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "上海天气如何?"}, { "role": "assistant", "tool_calls": [{ "id": "call_001", "type": "function", "function": { "name": "get_weather", "arguments": "{\"city\": \"上海\"}" } }] }, { "role": "tool", "tool_call_id": "call_001", "content": "{\"temperature\": 28, \"condition\": \"多云\", \"humidity\": 65}" } ])) .tools(tools) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
多模态
图像理解
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 使用 base64 编码的图像 let image_base64 = "data:image/png;base64,iVBORw0KGgo..."; let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "这张图片里有什么?"}, { "type": "image_url", "image_url": {"url": image_base64} } ] } ])) .max_tokens(500) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
使用图像 URL
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "描述这张图片"}, { "type": "image_url", "image_url": {"url": "https://example.com/image.jpg"} } ] } ])) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
思考模式
启用思考模式
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "system", "content": "你是一个善于深度思考的AI助手。"}, {"role": "user", "content": "为什么天空是蓝色的?"} ])) .extra(json!({ "chat_template_kwargs": { "enable_thinking": true } })) .stream(true) .send_stream() .await?; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match &event { StreamEvent::Reasoning(delta) => reasoning.push_str(delta), StreamEvent::Content(delta) => content.push_str(delta), StreamEvent::Done => break, _ => {} } } println!("=== 思考过程 ==="); println!("{}", reasoning); println!("\n=== 回答 ==="); println!("{}", content); Ok(()) }
禁用思考模式
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "user", "content": "你好"} ])) .extra(json!({ "chat_template_kwargs": { "enable_thinking": false } })) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
更多示例
完整的示例代码可以在项目的 examples/ 目录中找到:
- `simple.rs` - 基础聊天示例
- `simple_streaming.rs` - 流式聊天示例
- `streaming_chat.rs` - 带思考模式的流式聊天
- `tool_calling.rs` - 工具调用示例
基础聊天示例
本页演示 vLLM Client 的基础聊天补全使用模式。
简单聊天
发送聊天消息的最简单方式:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "你好,你好吗?"} ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
带系统消息
添加系统消息来控制助手的行为:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "system", "content": "你是一个有帮助的编程助手。你编写整洁、文档完善的代码。"}, {"role": "user", "content": "用 Rust 写一个检查数字是否为质数的函数"} ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
多轮对话
在多轮消息中保持上下文:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 构建对话历史 let mut messages = vec![ json!({"role": "system", "content": "你是一个有帮助的助手。"}), ]; // 第一轮 messages.push(json!({"role": "user", "content": "我叫小明"})); let response1 = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages.clone())) .send() .await?; let assistant_reply = response1.content.unwrap_or_default(); println!("助手: {}", assistant_reply); // 将助手回复添加到历史 messages.push(json!({"role": "assistant", "content": assistant_reply})); // 第二轮 messages.push(json!({"role": "user", "content": "我叫什么名字?"})); let response2 = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .send() .await?; println!("助手: {}", response2.content.unwrap_or_default()); Ok(()) }
对话辅助工具
一个可复用的对话构建辅助工具:
use vllm_client::{VllmClient, json, VllmError}; use serde_json::Value; struct Conversation { client: VllmClient, model: String, messages: Vec<Value>, } impl Conversation { fn new(client: VllmClient, model: impl Into<String>) -> Self { Self { client, model: model.into(), messages: vec![ json!({"role": "system", "content": "你是一个有帮助的助手。"}) ], } } fn with_system(mut self, content: &str) -> Self { self.messages[0] = json!({"role": "system", "content": content}); self } async fn send(&mut self, user_message: &str) -> Result<String, VllmError> { self.messages.push(json!({ "role": "user", "content": user_message })); let response = self.client .chat .completions() .create() .model(&self.model) .messages(json!(&self.messages)) .send() .await?; let content = response.content.unwrap_or_default(); self.messages.push(json!({ "role": "assistant", "content": &content })); Ok(content) } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut conv = Conversation::new(client, "Qwen/Qwen2.5-7B-Instruct") .with_system("你是一个数学辅导员。简单地解释概念。"); println!("用户: 2 + 2 等于几?"); let reply = conv.send("2 + 2 等于几?").await?; println!("助手: {}", reply); println!("\n用户: 那乘以 3 等于几?"); let reply = conv.send("那乘以 3 等于几?").await?; println!("助手: {}", reply); Ok(()) }
使用采样参数
通过采样参数控制生成:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一个关于机器人的创意故事"} ])) .temperature(1.2) // 更高的温度增加创意性 .top_p(0.95) // 核采样 .top_k(50) // vLLM 扩展参数 .max_tokens(512) // 限制输出长度 .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
确定性输出
要获得可重复的结果,将温度设置为 0:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "2 + 2 等于几?"} ])) .temperature(0.0) // 确定性输出 .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
使用停止序列
在特定序列处停止生成:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "列出三种水果,每行一个"} ])) .stop(json!(["\n\n", "END"])) // 在双换行或 END 处停止 .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
Token 使用追踪
追踪 token 使用情况以监控成本:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "解释量子计算"} ])) .send() .await?; println!("响应: {}", response.content.unwrap_or_default()); if let Some(usage) = response.usage { println!("\n--- Token 使用统计 ---"); println!("提示词 tokens: {}", usage.prompt_tokens); println!("补全 tokens: {}", usage.completion_tokens); println!("总 tokens: {}", usage.total_tokens); } Ok(()) }
批量处理
高效处理多个提示:
use vllm_client::{VllmClient, json, VllmError}; async fn process_prompts( client: &VllmClient, prompts: &[&str], ) -> Vec<Result<String, VllmError>> { let mut results = Vec::new(); for prompt in prompts { let result = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map(|r| r.content.unwrap_or_default()); results.push(result); } results } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(120); let prompts = [ "Rust 是什么?", "Python 是什么?", "Go 是什么?", ]; let results = process_prompts(&client, &prompts).await; for (prompt, result) in prompts.iter().zip(results.iter()) { match result { Ok(response) => println!("问: {}\n答: {}\n", prompt, response), Err(e) => eprintln!("'{}' 出错: {}", prompt, e), } } Ok(()) }
错误处理
生产代码的正确错误处理:
use vllm_client::{VllmClient, json, VllmError}; async fn safe_chat(prompt: &str) -> Result<String, String> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(60); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map_err(|e| format!("请求失败: {}", e))?; response.content.ok_or_else(|| "响应中无内容".to_string()) } #[tokio::main] async fn main() { match safe_chat("你好!").await { Ok(text) => println!("响应: {}", text), Err(e) => eprintln!("错误: {}", e), } }
流式聊天示例
本示例演示如何使用流式响应实现实时输出。
基础流式响应
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一个关于机器人学习绘画的短篇故事。"} ])) .temperature(0.8) .max_tokens(1024) .stream(true) .send_stream() .await?; print!("响应: "); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!(); Ok(()) }
带推理过程的流式响应(思考模型)
对于支持思考/推理模式的模型:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "计算: 15 * 23 + 47 等于多少?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { reasoning.push_str(&delta); eprintln!("[思考中] {}", delta); } StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n"); if !reasoning.is_empty() { println!("--- 推理过程 ---"); println!("{}", reasoning); } Ok(()) }
带进度指示器的流式响应
在等待第一个 token 时显示输入指示器:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use std::time::Duration; use std::sync::atomic::{AtomicBool, Ordering}; use std::sync::Arc; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let waiting = Arc::new(AtomicBool::new(true)); let waiting_clone = Arc::clone(&waiting); // 启动输入指示器任务 let mut indicator = tokio::spawn(async move { let chars = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏']; let mut i = 0; while waiting_clone.load(Ordering::Relaxed) { print!("\r{} 思考中...", chars[i]); std::io::Write::flush(&mut std::io::stdout()).ok(); i = (i + 1) % chars.len(); tokio::time::sleep(Duration::from_millis(80)).await; } print!("\r          \r"); // 清除指示器 }); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "用简单的语言解释量子纠缠。"} ])) .stream(true) .send_stream() .await?; let mut first_token = true; let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { if first_token { waiting.store(false, Ordering::Relaxed); // 在循环体内只能通过 &mut 等待 JoinHandle,按值 await 会触发所有权移动错误 (&mut indicator).await.ok(); first_token = false; println!("响应:"); println!("---------"); } content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { waiting.store(false, Ordering::Relaxed); eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n"); Ok(()) }
多轮流式对话
处理带有流式响应的对话:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use std::io::{self, BufRead, Write}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut messages: Vec<serde_json::Value> = Vec::new(); println!("与 AI 聊天(输入 'quit' 退出)"); println!("----------------------------------------\n"); let stdin = io::stdin(); for line in stdin.lock().lines() { let input = line?; if input.trim() == "quit" { break; } if input.trim().is_empty() { continue; } // 添加用户消息 messages.push(json!({"role": "user", "content": input})); // 流式响应 let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .stream(true) .send_stream() .await?; print!("AI: "); io::stdout().flush().ok(); let mut response_content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { response_content.push_str(&delta); print!("{}", delta); io::stdout().flush().ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n"); // 将助手响应添加到历史 messages.push(json!({"role": "assistant", "content": response_content})); } println!("再见!"); Ok(()) }
带超时的流式响应
为慢速响应添加超时处理:
use vllm_client::{VllmClient, json, StreamEvent, VllmError}; use futures::StreamExt; use tokio::time::{timeout, Duration}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(300); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一篇关于人工智能的详细论文。"} ])) .stream(true) .send_stream() .await?; let mut content = String::new(); loop { // 每个事件 30 秒超时 match timeout(Duration::from_secs(30), stream.next()).await { Ok(Some(event)) => { match event { StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n流式错误: {}", e); return Err(e.into()); } _ => {} } } Ok(None) => break, Err(_) => { eprintln!("\n等待下一个 token 超时"); break; } } } println!("\n\n生成了 {} 个字符", content.len()); Ok(()) }
收集使用统计
在流式响应过程中追踪 token 使用情况:
use vllm_client::{VllmClient, json, StreamEvent, Usage}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一首关于海洋的诗。"} ])) .stream(true) .send_stream() .await?; let mut content = String::new(); let mut usage: Option<Usage> = None; let mut start_time = std::time::Instant::now(); let mut token_count = 0; while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { content.push_str(&delta); token_count += 1; print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Usage(u) => { usage = Some(u); } StreamEvent::Done => break, _ => {} } } let elapsed = start_time.elapsed(); println!("\n"); println!("--- 统计信息 ---"); println!("耗时: {:.2}秒", elapsed.as_secs_f64()); println!("字符数: {}", content.len()); if let Some(usage) = usage { println!("提示词 tokens: {}", usage.prompt_tokens); println!("补全 tokens: {}", usage.completion_tokens); println!("总 tokens: {}", usage.total_tokens); println!("每秒 tokens: {:.2}", usage.completion_tokens as f64 / elapsed.as_secs_f64()); } Ok(()) }
Streaming Completions 示例
本示例演示如何使用旧版 /v1/completions API 进行流式调用。
基础流式 Completions
use vllm_client::{VllmClient, json, CompletionStreamEvent}; use futures::StreamExt; use std::io::Write; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); println!("=== 流式 Completions 示例 ===\n"); println!("模型: Qwen/Qwen2.5-7B-Instruct\n"); println!("提示词: 什么是机器学习?"); println!("\n生成文本: "); let mut stream = client .completions .create() .model("Qwen/Qwen2.5-7B-Instruct") .prompt("什么是机器学习?") .max_tokens(500) .temperature(0.7) .stream(true) .send_stream() .await?; // 处理流式事件 while let Some(event) = stream.next().await { match event { CompletionStreamEvent::Text(delta) => { // 打印文本增量(实时输出) print!("{}", delta); // 刷新缓冲区,实现实时显示 std::io::stdout().flush().ok(); } CompletionStreamEvent::FinishReason(reason) => { println!("\n\n--- 结束原因: {} ---", reason); } CompletionStreamEvent::Usage(usage) => { // 流结束时输出 token 使用统计 println!("\n\n--- Token 使用统计 ---"); println!("提示词 tokens: {}", usage.prompt_tokens); println!("生成 tokens: {}", usage.completion_tokens); println!("总计 tokens: {}", usage.total_tokens); } CompletionStreamEvent::Done => { println!("\n\n=== 生成完成 ==="); break; } CompletionStreamEvent::Error(e) => { eprintln!("\n错误: {}", e); return Err(e.into()); } } } Ok(()) }
与 Chat 流式的区别
| 方面 | Chat Completions | Completions |
|---|---|---|
| 事件类型 | StreamEvent | CompletionStreamEvent |
| 内容变体 | Content(String) | Text(String) |
| 额外事件 | Reasoning, ToolCall | FinishReason |
| 适用场景 | 对话式 | 单提示词 |
何时使用 Completions API
- 简单的单提示词文本生成
- 与 OpenAI API 的旧版兼容
- 不需要聊天消息格式的场景
对于新项目,建议使用 Chat Completions API (client.chat.completions()),它支持消息历史、工具调用和推理输出等更丰富的功能。
工具调用示例
本示例演示如何在 vLLM Client 中使用工具调用(函数调用)。
基础工具调用
定义工具,让模型决定何时调用它们:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 定义可用工具 let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的当前天气", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "城市名称,如:东京、纽约" }, "unit": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "温度单位" } }, "required": ["location"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "东京的天气怎么样?"} ])) .tools(tools) .send() .await?; // 检查模型是否要调用工具 if response.has_tool_calls() { if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { println!("函数: {}", tool_call.name); println!("参数: {}", tool_call.arguments); } } } else { println!("响应: {}", response.content.unwrap_or_default()); } Ok(()) }
完整工具调用流程
执行工具并返回结果以继续对话:
use vllm_client::{VllmClient, json, ToolCall}; use serde::{Deserialize, Serialize}; #[derive(Deserialize)] struct WeatherArgs { location: String, unit: Option<String>, } #[derive(Serialize)] struct WeatherResult { temperature: f32, condition: String, humidity: u32, } // 模拟天气函数 fn get_weather(location: &str, unit: Option<&str>) -> WeatherResult { // 实际代码中,调用真实的天气 API let temp = match location { "Tokyo" => 25.0, "New York" => 20.0, "London" => 15.0, _ => 22.0, }; WeatherResult { temperature: if unit == Some("fahrenheit") { temp * 9.0 / 5.0 + 32.0 } else { temp }, condition: "晴朗".to_string(), humidity: 60, } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的当前天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} }, "required": ["location"] } } } ]); let user_message = "东京和纽约的天气怎么样?"; // 第一次请求 - 模型可能调用工具 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": user_message} ])) .tools(tools.clone()) .send() .await?; if response.has_tool_calls() { // 构建消息历史 let mut messages = vec![ json!({"role": "user", "content": user_message}) ]; // 添加助手的工具调用 messages.push(response.assistant_message()); // 执行每个工具并添加结果 if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { if tool_call.name == "get_weather" { let args: WeatherArgs = tool_call.parse_args_as()?; let result = get_weather(&args.location, args.unit.as_deref()); messages.push(tool_call.result(json!(result))); } } } // 使用工具结果继续对话 let final_response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .tools(tools) .send() .await?; println!("{}", final_response.content.unwrap_or_default()); } else { println!("{}", response.content.unwrap_or_default()); } Ok(()) }
多个工具
为不同目的定义多个工具:
use vllm_client::{VllmClient, json}; use serde::Deserialize; #[derive(Deserialize)] struct SearchArgs { query: String, limit: Option<u32>, } #[derive(Deserialize)] struct CalcArgs { expression: String, } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "web_search", "description": "在网络上搜索信息", "parameters": { "type": "object", "properties": { "query": { "type": "string", "description": "搜索查询" }, "limit": { "type": "integer", "description": "最大结果数" } }, "required": ["query"] } } }, { "type": "function", "function": { "name": "calculate", "description": "执行数学计算", "parameters": { "type": "object", "properties": { "expression": { "type": "string", "description": "要计算的数学表达式,如 '2 + 2 * 3'" } }, "required": ["expression"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "搜索 Rust 编程语言并计算 42 * 17"} ])) .tools(tools) .send() .await?; if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { match tool_call.name.as_str() { "web_search" => { let args: SearchArgs = tool_call.parse_args_as()?; println!("搜索: {} (限制: {:?})", args.query, args.limit); } "calculate" => { let args: CalcArgs = tool_call.parse_args_as()?; println!("计算: {}", args.expression); } _ => println!("未知工具: {}", tool_call.name), } } } Ok(()) }
流式工具调用
实时流式传输工具调用更新:
use vllm_client::{VllmClient, json, StreamEvent, ToolCall}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "东京、巴黎和伦敦的天气怎么样?"} ])) .tools(tools) .stream(true) .send_stream() .await?; let mut tool_calls: Vec<ToolCall> = Vec::new(); let mut content = String::new(); println!("流式响应:\n"); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); } StreamEvent::ToolCallDelta { index, id, name, arguments } => { println!("[工具 {}] {} - 部分参数: {}", index, name, arguments); } StreamEvent::ToolCallComplete(tool_call) => { println!("[工具完成] {}({})", tool_call.name, tool_call.arguments); tool_calls.push(tool_call); } StreamEvent::Done => { println!("\n--- 流式完成 ---"); break; } StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n收集到 {} 个工具调用", tool_calls.len()); for (i, tc) in tool_calls.iter().enumerate() { println!(" {}. {}({})", i + 1, tc.name, tc.arguments); } Ok(()) }
多轮工具调用
处理多轮工具调用:
use vllm_client::{VllmClient, json, VllmError}; use serde_json::Value; async fn run_agent( client: &VllmClient, user_message: &str, tools: &Value, max_rounds: usize, ) -> Result<String, VllmError> { let mut messages = vec![ json!({"role": "user", "content": user_message}) ]; for round in 0..max_rounds { println!("--- 第 {} 轮 ---", round + 1); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(&messages)) .tools(tools.clone()) .send() .await?; if response.has_tool_calls() { // 添加包含工具调用的助手消息 messages.push(response.assistant_message()); // 执行工具并添加结果 if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { println!("调用: {}({})", tool_call.name, tool_call.arguments); // 执行工具 let result = execute_tool(&tool_call.name, &tool_call.arguments); println!("结果: {}", result); // 将工具结果添加到消息 messages.push(tool_call.result(result)); } } } else { // 没有更多工具调用,返回最终响应 return Ok(response.content.unwrap_or_default()); } } Err(VllmError::Other("超过最大轮数".to_string())) } fn execute_tool(name: &str, args: &str) -> Value { // 在这里实现工具执行逻辑 match name { "get_weather" => json!({"temperature": 22, "condition": "晴朗"}), "web_search" => json!({"results": ["结果1", "结果2"]}), _ => json!({"error": "未知工具"}), } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } }, { "type": "function", "function": { "name": "web_search", "description": "在网络上搜索", "parameters": { "type": "object", "properties": { "query": {"type": "string"} }, "required": ["query"] } } } ]); let result = run_agent( &client, "东京的天气怎么样?并查找关于樱花的信息", &tools, 5 ).await?; println!("\n最终答案: {}", result); Ok(()) }
工具选择选项
控制工具选择行为:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); // 选项 1: 让模型决定(默认) let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "你好!"} ])) .tools(tools.clone()) .tool_choice(json!("auto")) .send() .await?; // 选项 2: 禁止工具使用 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "东京的天气怎么样?"} ])) .tools(tools.clone()) .tool_choice(json!("none")) .send() .await?; // 选项 3: 强制使用工具 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "我需要天气信息"} ])) .tools(tools.clone()) .tool_choice(json!("required")) .send() .await?; // 选项 4: 强制使用特定工具 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "查看东京天气"} ])) .tools(tools.clone()) .tool_choice(json!({ "type": "function", "function": {"name": "get_weather"} })) .send() .await?; Ok(()) }
错误处理
优雅地处理工具执行错误:
use vllm_client::{VllmClient, json, ToolCall}; use serde_json::Value; fn execute_tool_safely(tool_call: &ToolCall) -> Value { match tool_call.name.as_str() { "get_weather" => { // 安全地解析参数 match tool_call.parse_args() { Ok(args) => { // 执行工具 match get_weather_internal(&args) { Ok(result) => json!({"success": true, "data": result}), Err(e) => json!({"success": false, "error": e.to_string()}), } } Err(e) => json!({ "success": false, "error": format!("无效参数: {}", e) }), } } _ => json!({ "success": false, "error": format!("未知工具: {}", tool_call.name) }), } } fn get_weather_internal(args: &Value) -> Result<Value, String> { let location = args["location"].as_str() .ok_or("location 是必需的")?; // 模拟 API 调用 Ok(json!({ "location": location, "temperature": 22, "condition": "晴朗" })) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "天气怎么样?"} ])) .tools(tools) .send() .await?; if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { let result = execute_tool_safely(tool_call); println!("工具结果: {}", result); } } Ok(()) }
多模态示例
多模态功能允许你将图像和其他媒体类型与文本一起发送给模型。
概述
vLLM 通过 OpenAI 兼容的 API 支持多模态输入。你可以使用 base64 编码或 URL 在聊天消息中包含图像。
基础图像输入(Base64)
发送 base64 编码的图像:
use vllm_client::{VllmClient, json}; use base64::{Engine as _, engine::general_purpose}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 读取并编码图像 let image_data = std::fs::read("image.png")?; let base64_image = general_purpose::STANDARD.encode(&image_data); let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") // 视觉模型 .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "这张图片里有什么?" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .max_tokens(512) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
使用 URL 引用图像
通过 URL 引用图像:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "详细描述这张图片。" }, { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } } ] } ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
图像消息辅助函数
创建可复用的图像消息辅助函数:
use vllm_client::{VllmClient, json}; use serde_json::Value; fn image_message(text: &str, image_path: &str) -> Result<Value, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let image_data = std::fs::read(image_path)?; let base64_image = general_purpose::STANDARD.encode(&image_data); // 根据扩展名检测图像类型 let mime_type = match image_path.to_lowercase().rsplit('.').next() { Some("png") => "image/png", Some("jpg") | Some("jpeg") => "image/jpeg", Some("gif") => "image/gif", Some("webp") => "image/webp", _ => "image/png", }; Ok(json!({ "role": "user", "content": [ { "type": "text", "text": text }, { "type": "image_url", "image_url": { "url": format!("data:{};base64,{}", mime_type, base64_image) } } ] })) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let user_msg = image_message("这张图片里有什么?", "photo.jpg")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([user_msg])) .max_tokens(1024) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
多图像处理
在单个请求中发送多张图像:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 读取并编码多张图像 let image1 = encode_image("image1.png")?; let image2 = encode_image("image2.png")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "比较这两张图片。它们有什么不同?" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", image1) } }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", image2) } } ] } ])) .max_tokens(1024) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
带图像的流式响应
对图像查询进行流式响应:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let base64_image = encode_image("chart.png")?; let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "分析这个图表并解释趋势。" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { if let StreamEvent::Content(delta) = event { print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } } println!(); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
带图像的多轮对话
在对话中保持图像上下文:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let base64_image = encode_image("screenshot.png")?; // 第一条带图像的消息 let messages = json!([ { "role": "user", "content": [ {"type": "text", "text": "这个截图里有什么?"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ]); let response1 = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(messages.clone()) .send() .await?; // clone 避免移出 response1.content,后面还要再用一次 println!("第一次响应: {}", response1.content.clone().unwrap_or_default()); // 继续对话(消息历史中保留原图像) let messages2 = json!([ { "role": "user", "content": [ {"type": "text", "text": "这个截图里有什么?"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] }, { "role": "assistant", "content": response1.content.unwrap_or_default() }, { "role": "user", "content": "你能翻译图片中的文本吗?" } ]); let response2 = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(messages2) .send() .await?; println!("\n第二次响应: {}", response2.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
OCR 和文档分析
使用视觉模型进行 OCR 和文档分析:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let document_image = encode_image("document.png")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "system", "content": "你是一个 OCR 助手。准确提取图像中的文本并正确格式化。" }, { "role": "user", "content": [ { "type": "text", "text": "从这个文档图像中提取所有文本。尽可能保留格式。" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", document_image) } } ] } ])) .max_tokens(2048) .send() .await?; println!("提取的文本:\n{}", response.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
图像大小考虑
正确处理大图像:
use vllm_client::{VllmClient, json}; fn encode_and_resize_image(path: &str, max_size: u32) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; use image::ImageReader; // 加载并调整图像大小 let img = ImageReader::open(path)?.decode()?; let img = img.resize(max_size, max_size, image::imageops::FilterType::Lanczos3); // 转换为 PNG let mut buffer = std::io::Cursor::new(Vec::new()); img.write_to(&mut buffer, image::ImageFormat::Png)?; Ok(general_purpose::STANDARD.encode(&buffer.into_inner())) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 调整大小到最大 1024px,保持宽高比 let base64_image = encode_and_resize_image("large_image.jpg", 1024)?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "描述这张图片。"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
支持的模型
对于多模态输入,请使用支持视觉的模型:
| 模型 | 描述 |
|---|---|
| Qwen/Qwen2-VL-7B-Instruct | Qwen2 视觉语言模型 |
| Qwen/Qwen2-VL-72B-Instruct | Qwen2 视觉语言大模型 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | Llama 3.2 视觉模型 |
使用以下命令检查 vLLM 服务器的可用模型:
curl http://localhost:8000/v1/models
必需的依赖
对于图像处理,添加以下依赖:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
base64 = "0.22"
image = "0.25" # 可选,用于图像处理
故障排除
图像过大
如果遇到图像大小错误,请减小图像尺寸:
#![allow(unused)] fn main() { // 发送前调整大小 let img = image::load_from_memory(&image_data)?; let resized = img.resize(1024, 1024, image::imageops::FilterType::Lanczos3); }
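image 的 resize 会在给定上限内保持宽高比。缩放后的目标尺寸可以用下面的纯函数估算(仅为示意,取整细节未必与 image 库完全一致):

```rust
/// 在不超过 (max_w, max_h) 的前提下按比例缩放图像尺寸
fn fit_within(w: u32, h: u32, max_w: u32, max_h: u32) -> (u32, u32) {
    // 取宽、高两个方向上较小的缩放比,保证两边都不超限
    let ratio = (max_w as f64 / w as f64).min(max_h as f64 / h as f64);
    (
        ((w as f64 * ratio).round() as u32).max(1),
        ((h as f64 * ratio).round() as u32).max(1),
    )
}

fn main() {
    // 2048x1536 的照片限制在 1024x1024 内
    println!("{:?}", fit_within(2048, 1536, 1024, 1024)); // (1024, 768)
}
```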
不支持的格式
将图像转换为支持的格式:
#![allow(unused)] fn main() { // 转换为 PNG let img = image::load_from_memory(&image_data)?; let mut output = Vec::new(); img.write_to(&mut std::io::Cursor::new(&mut output), image::ImageFormat::Png)?; }
模型不支持视觉
确保使用支持视觉的模型。向非视觉模型发送图像输入通常会直接报错,部分实现也可能静默忽略图像。
高级主题
本文档介绍 vLLM Client 的高级功能和用法。
思考模式
某些模型(如 Qwen-3)支持"思考模式",可以输出推理过程。
启用思考模式
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "user", "content": "请解释什么是递归"} ])) .extra(json!({ "chat_template_kwargs": { "enable_thinking": true } })) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { match &event { // 思考/推理内容 StreamEvent::Reasoning(delta) => { print!("[思考] {}", delta); } // 常规回复内容 StreamEvent::Content(delta) => { print!("{}", delta); } _ => {} } } }
思考内容格式
在思考模式下,模型的输出分为两部分:
| 事件类型 | 描述 |
|---|---|
| StreamEvent::Reasoning | 模型的推理/思考过程 |
| StreamEvent::Content | 最终的回复内容 |
思考内容通常包含在 <think> 标签中,客户端会自动解析。
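这个解析过程大致可以这样示意(简化实现,仅用于说明原理,并非库的真实代码):

```rust
/// 从模型输出中分离 <think> 推理内容与最终回复(简化示意)
fn split_thinking(output: &str) -> (Option<String>, String) {
    if let (Some(start), Some(end)) = (output.find("<think>"), output.find("</think>")) {
        if start < end {
            let reasoning = output[start + "<think>".len()..end].trim().to_string();
            let content = output[end + "</think>".len()..].trim().to_string();
            return (Some(reasoning), content);
        }
    }
    // 没有 <think> 标签时,全部视为最终回复
    (None, output.trim().to_string())
}

fn main() {
    let raw = "<think>先算乘法:15*23=345,再加 47</think>答案是 392。";
    let (reasoning, content) = split_thinking(raw);
    println!("推理: {:?}", reasoning);
    println!("回复: {}", content);
}
```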
禁用思考模式
#![allow(unused)] fn main() { .extra(json!({ "chat_template_kwargs": { "enable_thinking": false } })) }
自定义请求头
如果需要添加自定义请求头(如代理认证、追踪ID等):
#![allow(unused)] fn main() { use vllm_client::VllmClient; let client = VllmClient::new("http://localhost:8000/v1") .with_header("X-Custom-Header", "custom-value") .with_header("X-Request-ID", "req-12345"); }
常见用例
#![allow(unused)] fn main() { // 添加代理认证 let client = VllmClient::new("http://localhost:8000/v1") .with_header("Proxy-Authorization", "Bearer proxy-token"); // 添加追踪ID用于调试 let client = VllmClient::new("http://localhost:8000/v1") .with_header("X-Trace-ID", &uuid::Uuid::new_v4().to_string()); }
超时与重试
设置超时
#![allow(unused)] fn main() { use std::time::Duration; use vllm_client::VllmClient; // 设置60秒超时 let client = VllmClient::new("http://localhost:8000/v1") .with_timeout(Duration::from_secs(60)); // 设置5分钟超时(适用于长文本生成) let client = VllmClient::new("http://localhost:8000/v1") .with_timeout(Duration::from_secs(300)); }
实现重试逻辑
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, VllmError}; use std::time::Duration; use tokio::time::sleep; async fn send_with_retry( client: &VllmClient, messages: serde_json::Value, max_retries: u32, ) -> Result<vllm_client::ChatCompletionResponse, VllmError> { let mut attempts = 0; loop { match client .chat .completions() .create() .model("llama-3-70b") .messages(messages.clone()) .send() .await { Ok(response) => return Ok(response), Err(e) => { attempts += 1; if attempts >= max_retries { return Err(e); } // 指数退避 sleep(Duration::from_millis(100 * 2u64.pow(attempts))).await; } } } } }
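上面的重试间隔按 100ms × 2^attempts 指数增长。把退避计算抽成纯函数便于单独验证(示意):

```rust
use std::time::Duration;

/// 第 attempt 次失败后的退避时长:100ms * 2^attempt
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(100 * 2u64.pow(attempt))
}

fn main() {
    for attempt in 1..=4 {
        println!("第 {} 次失败后等待 {:?}", attempt, backoff_delay(attempt));
    }
}
```

实际使用时通常还会给退避设置上限,并在间隔上叠加随机抖动,避免多个客户端同时重试。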
多模态支持
图像输入
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let client = VllmClient::new("http://localhost:8000/v1"); // 使用图像URL let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "这张图片里有什么?"}, { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } } ] } ])) .send() .await?; // 使用Base64编码图像 let base64_image = "data:image/jpeg;base64,/9j/4AAQ..."; let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "描述这张图片"}, { "type": "image_url", "image_url": {"url": base64_image} } ] } ])) .send() .await?; }
多图像支持
#![allow(unused)] fn main() { let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "比较这两张图片"}, {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}}, {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}} ] } ])) .send() .await?; }
最佳实践
1. 连接池管理
对于高并发场景,建议复用客户端实例:
#![allow(unused)] fn main() { // 推荐:共享客户端实例 use std::sync::Arc; let client = Arc::new(VllmClient::new("http://localhost:8000/v1")); // 在多个任务中使用 let client_clone = client.clone(); tokio::spawn(async move { client_clone.chat.completions().create() .model("llama-3") .messages(json!([{"role": "user", "content": "Hello"}])) .send() .await }); }
2. 错误处理
#![allow(unused)] fn main() { use vllm_client::{VllmClient, VllmError}; match client.chat.completions().create().send().await { Ok(response) => { println!("成功: {:?}", response); } Err(VllmError::ApiError { message, code }) => { eprintln!("API 错误 ({}): {}", code, message); // 根据错误码处理 match code { 429 => println!("被限流,请稍后重试"), 401 => println!("认证失败,检查API密钥"), _ => {} } } Err(e) => { eprintln!("其他错误: {}", e); } } }
3. 流式响应的资源管理
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; let mut stream = client .chat .completions() .create() .model("llama-3") .messages(json!([{"role": "user", "content": "Hello"}])) .stream(true) .send_stream() .await?; // 使用 take 限制处理的事件数量(take 会消费原 stream,只能调用一次) let mut stream = stream.take(1000); while let Some(event) = stream.next().await { match &event { StreamEvent::Content(delta) => print!("{}", delta), StreamEvent::Done | StreamEvent::Error(_) => break, _ => {} } } }
思考模式
思考模式(也称为推理模式)允许模型在给出最终答案之前输出其推理过程。这对于复杂推理任务特别有用。
概述
一些模型,如启用思考模式的 Qwen,可以输出两种类型的内容:
- 推理内容 - 模型的内部"思考"过程
- 内容 - 给用户的最终响应
启用思考模式
Qwen 模型
对于 Qwen 模型,通过 extra 参数启用思考模式:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "计算: 15 * 23 + 47 等于多少?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .send() .await?; }
检查推理内容
在非流式响应中,单独访问推理内容:
#![allow(unused)] fn main() { // 检查推理内容 if let Some(reasoning) = response.reasoning_content { println!("推理: {}", reasoning); } // 获取最终内容 if let Some(content) = response.content { println!("答案: {}", content); } }
带思考模式的流式响应
使用思考模式的最佳方式是流式响应:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "逐步思考: 如果我有 5 个苹果,给朋友 2 个,然后又买了 3 个,我有多少个?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; println!("=== 思考过程 ===\n"); let mut in_thinking = true; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { reasoning.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Content(delta) => { if in_thinking { in_thinking = false; println!("\n\n=== 最终答案 ===\n"); } content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!(); Ok(()) }
Use Cases
Mathematical Reasoning

```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

async fn solve_math_problem(
    client: &VllmClient,
    problem: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a math tutor. Show your work clearly."},
            {"role": "user", "content": problem}
        ]))
        .extra(json!({
            "chat_template_kwargs": { "think_mode": true }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut answer = String::new();
    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            answer.push_str(&delta);
        }
    }
    Ok(answer)
}
```
Code Analysis

```rust
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Analyze this code for potential bugs and security issues:\n\n```rust\nfn process_input(input: &str) -> String {\n    let mut result = String::new();\n    for c in input.chars() {\n        result.push(c);\n    }\n    result\n}\n```"}
    ]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true }
    }))
    .send()
    .await?;
```
Complex Decision-Making

```rust
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "system", "content": "You are a decision-support assistant. Weigh all the options carefully."},
        {"role": "user", "content": "I need to choose between company A (higher salary, long commute) and company B (average salary, remote work). Help me decide."}
    ]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true }
    }))
    .max_tokens(2048)
    .send()
    .await?;
```
Separating Reasoning from the Answer
For applications that need to keep the reasoning separate from the final answer:

```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

struct ThinkingResponse {
    reasoning: String,
    content: String,
}

async fn think_and_respond(
    client: &VllmClient,
    prompt: &str,
) -> Result<ThinkingResponse, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": prompt}
        ]))
        .extra(json!({
            "chat_template_kwargs": { "think_mode": true }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut response = ThinkingResponse {
        reasoning: String::new(),
        content: String::new(),
    };

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => response.reasoning.push_str(&delta),
            StreamEvent::Content(delta) => response.content.push_str(&delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    Ok(response)
}
```
Model Support

| Model | Thinking Mode Support |
|---|---|
| Qwen/Qwen2.5-72B-Instruct | ✅ Supported |
| Qwen/Qwen2.5-32B-Instruct | ✅ Supported |
| Qwen/Qwen2.5-7B-Instruct | ✅ Supported |
| DeepSeek-R1 | ✅ Supported (built in) |
| Other models | ❌ Model-dependent |

Check your vLLM server configuration to verify thinking-mode support.
Configuration Options
Thinking Model Detection
Thinking markers are handled automatically:

```rust
// Reasoning content is parsed out of special markers,
// typically structured as <think>...</think> or a similar format
```
Non-Streaming Access
For non-streaming requests with reasoning:

```rust
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Explain quantum entanglement"}
    ]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true }
    }))
    .send()
    .await?;

// Access the reasoning content (if present)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning:\n{}\n", reasoning);
}

// Access the final answer
println!("Answer:\n{}", response.content.unwrap_or_default());
```
Best Practices
1. Use It for Complex Tasks
Thinking mode is most valuable for:
- Multi-step reasoning
- Math problems
- Code analysis
- Complex decision-making

```rust
// Good: a complex reasoning task
.messages(json!([
    {"role": "user", "content": "Solve this: a father is 4 times as old as his son. In 20 years he will only be twice as old. How old are they now?"}
]))

// Less useful: a trivial query
.messages(json!([
    {"role": "user", "content": "What is 2 + 2?"}
]))
```
2. Show Reasoning Selectively
You may want to hide the reasoning in production but show it while debugging:

```rust
let show_reasoning = std::env::var("SHOW_REASONING").is_ok();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => {
            if show_reasoning {
                eprintln!("[thinking] {}", delta);
            }
        }
        StreamEvent::Content(delta) => print!("{}", delta),
        _ => {}
    }
}
```
3. Combine with System Prompts
Use a system prompt to steer the thinking process:

```rust
.messages(json!([
    {
        "role": "system",
        "content": "Think through the problem step by step. Consider multiple approaches before settling on an answer."
    },
    {"role": "user", "content": problem}
]))
```
4. Adjust the Max Token Count
Thinking mode consumes more tokens. Adjust accordingly:

```rust
.max_tokens(4096)  // Account for both the reasoning and the answer
```
Troubleshooting
No Reasoning Content
If you don't see any reasoning content:
- Make sure thinking mode is enabled in the `extra` parameter
- Verify that the model supports thinking mode
- Check the vLLM server configuration

```shell
# Check the vLLM server logs for problems
```
Incomplete Streaming Responses
If a streamed response appears incomplete:

```rust
// Make sure you handle every event type
while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => { /* handle */ },
        StreamEvent::Content(delta) => { /* handle */ },
        StreamEvent::Done => break,
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
            break;
        }
        _ => {}  // Don't forget the other events
    }
}
```

Related Links
Custom Request Headers
This page explains how to use custom HTTP headers with vLLM Client.
Overview
While vLLM Client handles standard authentication via an API key, you may need custom headers for:
- Custom authentication schemes
- Request tracing and debugging
- Rate-limiting identifiers
- Custom metadata
Current Limitations
The current version of vLLM Client does not provide a built-in way to set custom headers, but there are several workarounds.
Workaround: Environment Variables
If your vLLM server accepts configuration through environment variables or specific API parameters:

```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key(std::env::var("MY_API_KEY").unwrap_or_default());
```
Workaround: Extra Parameters
Some custom configuration can be passed through the `extra()` method:

```rust
use vllm_client::{VllmClient, json};

let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .extra(json!({
        "custom_field": "custom_value",
        "request_id": "req-12345"
    }))
    .send()
    .await?;
```
Future Support
Custom header support is planned for a future release. The API will likely look something like:

```rust
// Future API (not yet implemented)
let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("X-Custom-Header", "value")
    .with_header("X-Request-ID", "req-123");
```
Common Use Cases
Tracing Headers
For distributed tracing (once supported):

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Trace-ID", trace_id)
    .header("X-Span-ID", span_id)
    .build();
```
Custom Authentication
For non-standard authentication schemes:

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-API-Key", "custom-key")
    .header("X-Tenant-ID", "tenant-123")
    .build();
```
Request Metadata
Attach metadata for logging or analytics:

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Request-Source", "mobile-app")
    .header("X-User-ID", "user-456")
    .build();
```
Alternative: a Custom HTTP Client
For advanced use cases, you can use the underlying reqwest client directly:

```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let response = client
        .post("http://localhost:8000/v1/chat/completions")
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer your-api-key")
        .header("X-Custom-Header", "custom-value")
        .json(&json!({
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}]
        }))
        .send()
        .await?;

    let result: serde_json::Value = response.json().await?;
    println!("{:?}", result);
    Ok(())
}
```
Best Practices
1. Prefer Standard Authentication

```rust
// Recommended
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// Avoid custom authentication unless you have to
```
2. Document Your Custom Headers
When using custom headers, document their purpose:

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    // For request tracing in logs
    .header("X-Request-ID", &request_id)
    // For multi-tenant identification
    .header("X-Tenant-ID", &tenant_id)
    .build();
```
3. Verify Server Support
Make sure your vLLM server actually accepts and processes custom headers. Some proxies or load balancers strip unknown headers.
Security Considerations
Don't Expose Sensitive Headers
Avoid logging headers that contain sensitive information:

```rust
// Be careful when logging
let auth_header = "Bearer secret-key";
// Never log this directly!
```
Use HTTPS
Always use HTTPS when transmitting sensitive headers:

```rust
// Good
let client = VllmClient::new("https://api.example.com/v1");

// Avoid for sensitive data
let client = VllmClient::new("http://api.example.com/v1");
```
Requesting This Feature
If you need custom header support, please open an issue on GitHub that includes:
- Your use case
- The headers you need
- How you would like the API to be designed
Related Links
Timeouts and Retries
This page covers timeout configuration and retry strategies for building robust production applications.
Setting Timeouts
Client-Level Timeouts
Set a timeout when creating the client:

```rust
use vllm_client::VllmClient;

// Simple timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120);

// Using the builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)  // 5 minutes
    .build();
```
Choosing an Appropriate Timeout

| Use case | Recommended timeout |
|---|---|
| Simple queries | 30-60 seconds |
| Code generation | 2-3 minutes |
| Long document generation | 5-10 minutes |
| Complex reasoning tasks | 10+ minutes |
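As a rough sketch of the table above, a small helper can map a task category to a timeout. The category names and the chosen durations here are illustrative assumptions (the upper end of each range in the table); tune them to your workload:

```rust
use std::time::Duration;

/// Task categories mirroring the timeout table above (hypothetical names).
enum TaskKind {
    SimpleQuery,
    CodeGeneration,
    LongDocument,
    ComplexReasoning,
}

/// Map a task kind to a conservative timeout (upper end of each range).
fn recommended_timeout(kind: &TaskKind) -> Duration {
    match kind {
        TaskKind::SimpleQuery => Duration::from_secs(60),
        TaskKind::CodeGeneration => Duration::from_secs(3 * 60),
        TaskKind::LongDocument => Duration::from_secs(10 * 60),
        TaskKind::ComplexReasoning => Duration::from_secs(20 * 60),
    }
}

fn main() {
    assert_eq!(recommended_timeout(&TaskKind::SimpleQuery).as_secs(), 60);
    println!(
        "code generation timeout: {:?}",
        recommended_timeout(&TaskKind::CodeGeneration)
    );
}
```

The resulting `Duration` can then be converted to seconds and passed to `timeout_secs()`.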
Factors Affecting Request Duration
How long a request takes depends on:
- Prompt length - longer prompts take more processing time
- Output token count - more tokens mean longer generation
- Model size - larger models are slower
- Server load - a busy server responds more slowly
Timeout Errors
Handling Timeouts

```rust
use vllm_client::{VllmClient, json, VllmError};

async fn chat_with_timeout(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(60);

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await;

    match result {
        Ok(response) => Ok(response.content.unwrap_or_default()),
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out after 60 seconds");
            Err(VllmError::Timeout)
        }
        Err(e) => Err(e),
    }
}
```
Retry Strategies
Basic Retries
Retry failed requests with exponential backoff:

```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn send_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut attempts = 0;
    loop {
        match client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
        {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && attempts < max_retries => {
                attempts += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(attempts - 1));
                eprintln!("Retry {} after {:?}: {}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
Retries with Jitter
Add jitter to avoid thundering-herd effects:

```rust
use rand::Rng;
use std::time::Duration;
use tokio::time::sleep;

fn backoff_with_jitter(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    let exponential = base_ms * 2u64.pow(attempt);
    let jitter = rand::thread_rng().gen_range(0..base_ms);
    let delay = (exponential + jitter).min(max_ms);
    Duration::from_millis(delay)
}

async fn retry_with_jitter<F, T, E>(
    mut f: F,
    max_retries: u32,
) -> Result<T, E>
where
    F: FnMut() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, E>> + Send>>,
    E: std::fmt::Debug,
{
    let mut attempts = 0;
    loop {
        match f().await {
            Ok(result) => return Ok(result),
            Err(e) if attempts < max_retries => {
                attempts += 1;
                let delay = backoff_with_jitter(attempts, 100, 10_000);
                eprintln!("Retry {} after {:?}: {:?}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
Retry Only Retryable Errors
Not every error should be retried:

```rust
use vllm_client::{VllmClient, json, VllmError};

async fn smart_retry(
    client: &VllmClient,
    prompt: &str,
) -> Result<String, VllmError> {
    let mut attempts = 0;
    let max_retries = 3;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => return Ok(response.content.unwrap_or_default()),
            Err(e) => {
                // Check whether the error is retryable
                if !e.is_retryable() {
                    return Err(e);
                }
                if attempts >= max_retries {
                    return Err(e);
                }
                attempts += 1;
                tokio::time::sleep(std::time::Duration::from_secs(2u64.pow(attempts))).await;
            }
        }
    }
}
```
Retryable Errors

| Error | Retryable | Reason |
|---|---|---|
| Timeout | Yes | The server may just be slow |
| 429 Rate limited | Yes | Retry after waiting |
| 500 Server error | Yes | Transient server problem |
| 502 Bad gateway | Yes | The server may be restarting |
| 503 Service unavailable | Yes | Temporary overload |
| 504 Gateway timeout | Yes | The upstream timed out |
| 400 Bad request | No | Client-side error |
| 401 Unauthorized | No | Authentication problem |
| 404 Not found | No | Resource does not exist |
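The table above can be expressed as a small status-code predicate. This is only a sketch of the same classification (the client itself exposes this decision through `VllmError::is_retryable()`, whose internals may differ):

```rust
/// Decide whether an HTTP status code is worth retrying,
/// following the retryable-errors table above.
fn is_retryable_status(status: u16) -> bool {
    match status {
        429 | 500 | 502 | 503 | 504 => true, // transient: rate limit / server-side trouble
        400 | 401 | 404 => false,            // client errors: retrying won't help
        _ => status >= 500,                  // default: retry other 5xx, never 4xx
    }
}

fn main() {
    assert!(is_retryable_status(503));
    assert!(!is_retryable_status(401));
    println!("502 retryable: {}", is_retryable_status(502));
}
```

Timeouts are a transport-level condition rather than a status code, so they are handled by matching `VllmError::Timeout` as shown earlier.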
Circuit Breaker Pattern
Use a circuit breaker to prevent cascading failures:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct CircuitBreaker {
    failures: AtomicU32,
    last_failure: Mutex<Option<Instant>>,
    threshold: u32,
    reset_duration: Duration,
}

impl CircuitBreaker {
    fn new(threshold: u32, reset_duration: Duration) -> Self {
        Self {
            failures: AtomicU32::new(0),
            last_failure: Mutex::new(None),
            threshold,
            reset_duration,
        }
    }

    fn can_attempt(&self) -> bool {
        let failures = self.failures.load(Ordering::Relaxed);
        if failures < self.threshold {
            return true;
        }
        let last = self.last_failure.lock().unwrap();
        if let Some(time) = *last {
            if time.elapsed() > self.reset_duration {
                // Reset the breaker
                self.failures.store(0, Ordering::Relaxed);
                return true;
            }
        }
        false
    }

    fn record_success(&self) {
        self.failures.store(0, Ordering::Relaxed);
    }

    fn record_failure(&self) {
        self.failures.fetch_add(1, Ordering::Relaxed);
        *self.last_failure.lock().unwrap() = Some(Instant::now());
    }
}
```
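To illustrate how such a breaker behaves over time, here is a condensed, self-contained simulation. It is single-threaded, so plain fields stand in for the atomics, and the threshold and reset window are arbitrary demo values:

```rust
use std::time::{Duration, Instant};

/// Minimal single-threaded variant of the circuit breaker above.
struct SimpleBreaker {
    failures: u32,
    last_failure: Option<Instant>,
    threshold: u32,
    reset_duration: Duration,
}

impl SimpleBreaker {
    fn new(threshold: u32, reset_duration: Duration) -> Self {
        Self { failures: 0, last_failure: None, threshold, reset_duration }
    }

    fn can_attempt(&mut self) -> bool {
        if self.failures < self.threshold {
            return true;
        }
        // Open state: allow attempts again only after the reset window elapses.
        if let Some(t) = self.last_failure {
            if t.elapsed() > self.reset_duration {
                self.failures = 0;
                return true;
            }
        }
        false
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        self.last_failure = Some(Instant::now());
    }
}

fn main() {
    let mut breaker = SimpleBreaker::new(3, Duration::from_millis(50));
    // Three consecutive failures trip the breaker...
    for _ in 0..3 {
        assert!(breaker.can_attempt());
        breaker.record_failure();
    }
    assert!(!breaker.can_attempt()); // ...so further attempts are rejected.
    // After the reset window, attempts are allowed again.
    std::thread::sleep(Duration::from_millis(60));
    assert!(breaker.can_attempt());
    println!("breaker closed again after the reset window");
}
```

In a real client you would call `can_attempt()` before each request and `record_success()` / `record_failure()` after it, returning early with an error while the breaker is open.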
Timeouts for Streaming Responses
Handle timeouts while a response is streaming:

```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use tokio::time::{timeout, Duration};

async fn stream_with_timeout(
    client: &VllmClient,
    prompt: &str,
    per_event_timeout: Duration,
) -> Result<String, vllm_client::VllmError> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    loop {
        match timeout(per_event_timeout, stream.next()).await {
            Ok(Some(event)) => match event {
                StreamEvent::Content(delta) => content.push_str(&delta),
                StreamEvent::Done => break,
                StreamEvent::Error(e) => return Err(e),
                _ => {}
            },
            Ok(None) => break,
            Err(_) => {
                return Err(vllm_client::VllmError::Timeout);
            }
        }
    }
    Ok(content)
}
```
Rate Limiting
Implement client-side rate limiting:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct RateLimitedClient {
    client: vllm_client::VllmClient,
    semaphore: Arc<Semaphore>,
}

impl RateLimitedClient {
    fn new(base_url: &str, max_concurrent: usize) -> Self {
        Self {
            client: vllm_client::VllmClient::new(base_url),
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, vllm_client::VllmError> {
        let _permit = self.semaphore.acquire().await.unwrap();
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(vllm_client::json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }
}
```
Production Configuration
Complete Example

```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

struct RobustClient {
    client: VllmClient,
    max_retries: u32,
    base_backoff_ms: u64,
    max_backoff_ms: u64,
}

impl RobustClient {
    fn new(base_url: &str, timeout_secs: u64) -> Self {
        Self {
            client: VllmClient::builder()
                .base_url(base_url)
                .timeout_secs(timeout_secs)
                .build(),
            max_retries: 3,
            base_backoff_ms: 100,
            max_backoff_ms: 10_000,
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, VllmError> {
        let mut attempts = 0;
        loop {
            match self.send_request(prompt).await {
                Ok(response) => return Ok(response),
                Err(e) if self.should_retry(&e, attempts) => {
                    attempts += 1;
                    let delay = self.calculate_backoff(attempts);
                    eprintln!("Retry {} after {:?}: {}", attempts, delay, e);
                    sleep(delay).await;
                }
                Err(e) => return Err(e),
            }
        }
    }

    async fn send_request(&self, prompt: &str) -> Result<String, VllmError> {
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }

    fn should_retry(&self, error: &VllmError, attempts: u32) -> bool {
        attempts < self.max_retries && error.is_retryable()
    }

    fn calculate_backoff(&self, attempt: u32) -> Duration {
        let delay = self.base_backoff_ms * 2u64.pow(attempt);
        Duration::from_millis(delay.min(self.max_backoff_ms))
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = RobustClient::new("http://localhost:8000/v1", 300);
    match client.chat("Hello!").await {
        Ok(response) => println!("Response: {}", response),
        Err(e) => eprintln!("Still failing after retries: {}", e),
    }
    Ok(())
}
```
Best Practices
- Set appropriate timeouts based on expected response times
- Use exponential backoff to avoid overwhelming the server
- Add jitter to prevent thundering-herd problems
- Retry only retryable errors - don't retry client errors
- Implement a circuit breaker for production systems
- Log retry attempts for debugging and monitoring
- Cap the number of retries to avoid infinite loops
Related Links
Contributing Guide
Thank you for your interest in contributing to vLLM Client! This document provides guidelines and instructions for contributing.
Table of Contents
Code of Conduct
Please be respectful and inclusive. We welcome contributions from everyone.
Getting Started
- Fork the repository on GitHub
- Clone your fork locally
- Create a branch for your changes

```shell
git clone https://github.com/YOUR_USERNAME/vllm-client.git
cd vllm-client
git checkout -b my-feature
```
Development Setup
Prerequisites
- Rust 1.70 or later
- Cargo (installed with Rust)
- A vLLM server for integration tests (optional)
Building

```shell
# Build the library
cargo build

# Build with all features
cargo build --all-features
```
Running Tests

```shell
# Run the unit tests
cargo test

# Run tests with output shown
cargo test -- --nocapture

# Run a specific test
cargo test test_name

# Run the integration tests (requires a vLLM server)
cargo test --test integration
```
Making Changes
Branch Naming
Use descriptive branch names:
- `feature/add-new-feature` - for new features
- `fix/bug-description` - for bug fixes
- `docs/documentation-update` - for documentation changes
- `refactor/code-cleanup` - for refactoring
Commit Messages
Follow the Conventional Commits format:

```
type(scope): description

[optional body]

[optional footer]
```

Types:
- `feat`: a new feature
- `fix`: a bug fix
- `docs`: documentation changes
- `style`: code style changes (formatting, etc.)
- `refactor`: code refactoring
- `test`: adding or updating tests
- `chore`: maintenance tasks
Examples:

```
feat(client): add connection pooling support
fix(streaming): handle empty chunks correctly
docs(api): update the streaming documentation
```
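As an illustration of the header format above, a minimal check can be sketched in a few lines. This is a hypothetical helper, not part of the project's tooling, and it only validates the `type(scope): description` header line:

```rust
/// Check that a commit header looks like "type(scope): description"
/// or "type: description", with one of the types listed above.
fn is_conventional(header: &str) -> bool {
    const TYPES: [&str; 7] = ["feat", "fix", "docs", "style", "refactor", "test", "chore"];
    // Split into the "type(scope)" prefix and the description.
    let Some((prefix, description)) = header.split_once(": ") else {
        return false;
    };
    if description.is_empty() {
        return false;
    }
    // Strip an optional "(scope)" suffix from the type.
    let ty = match prefix.split_once('(') {
        Some((ty, rest)) if rest.ends_with(')') => ty,
        Some(_) => return false,
        None => prefix,
    };
    TYPES.contains(&ty)
}

fn main() {
    assert!(is_conventional("feat(client): add connection pooling"));
    assert!(is_conventional("docs: update streaming docs"));
    assert!(!is_conventional("added some stuff"));
    println!("all headers checked");
}
```

A check like this could run in a git `commit-msg` hook, though established tools for this already exist.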
Testing
Unit Tests
All new functionality should come with unit tests:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_new_feature() {
        // Test implementation
    }
}
```
Integration Tests
Integration tests live in the `tests/` directory:

```rust
// tests/integration_test.rs
use vllm_client::{VllmClient, json};

#[tokio::test]
async fn test_chat_completion() {
    let client = VllmClient::new("http://localhost:8000/v1");
    // ... test code
}
```
Test Coverage
We aim for good test coverage. Generate a coverage report with:

```shell
cargo tarpaulin --out Html
```
Documentation
Code Documentation
Document all public APIs with doc comments:

```rust
/// Creates a new chat completion request.
///
/// # Arguments
///
/// * `model` - The model name to use for generation
///
/// # Returns
///
/// A new `ChatCompletionsRequest` builder
///
/// # Example
///
/// ```rust
/// use vllm_client::{VllmClient, json};
///
/// let client = VllmClient::new("http://localhost:8000/v1");
/// let response = client.chat.completions().create()
///     .model("Qwen/Qwen2.5-7B-Instruct")
///     .messages(json!([{"role": "user", "content": "Hello"}]))
///     .send()
///     .await?;
/// ```
pub fn create(&self) -> ChatCompletionsRequest {
    // Implementation
}
```
Updating Documentation
When adding a new feature:
- Update the inline documentation
- Update the API reference in `docs/src/api/`
- Add an example in `docs/src/examples/`
- Update the changelog
Building the Documentation

```shell
# Build and preview the documentation
cd docs && mdbook serve --open
```
Pull Request Process
- Update the documentation: make sure the docs reflect your changes
- Add tests: include tests for new functionality
- Run the tests: make sure all tests pass
- Format the code: run `cargo fmt`
- Check the lints: run `cargo clippy`
- Update the CHANGELOG: add an entry to the changelog
Pre-PR Checklist

```shell
# Format the code
cargo fmt

# Check the lints
cargo clippy -- -D warnings

# Run all tests
cargo test

# Build the documentation
mdbook build docs
mdbook build docs/zh
```
Submitting the PR
- Push your branch to your fork
- Open a PR against the `main` branch
- Fill out the PR template
- Wait for review
PR Template

```markdown
## Description

A brief description of the changes

## Type of Change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed

## Checklist

- [ ] Code formatted with `cargo fmt`
- [ ] No clippy warnings
- [ ] Documentation updated
- [ ] Changelog updated
```
Coding Standards
Rust Style
Follow standard Rust conventions:
- Format with `cargo fmt`
- Resolve all `clippy` warnings
- Follow the Rust API Guidelines
Naming Conventions
- Types: PascalCase (`ChatCompletionResponse`)
- Functions/methods: snake_case (`send_stream`)
- Constants: SCREAMING_SNAKE_CASE (`MAX_RETRIES`)
- Modules: snake_case (`chat`, `completions`)
Error Handling
Use `VllmError` for all errors:

```rust
// Good
pub fn parse_response(data: &str) -> Result<Response, VllmError> {
    serde_json::from_str(data).map_err(VllmError::Json)
}

// Avoid
pub fn parse_response(data: &str) -> Result<Response, String> {
    // ...
}
```
Async Code
Use async/await for all asynchronous operations:

```rust
// Good
pub async fn send(&self) -> Result<Response, VllmError> {
    let response = self.http.post(&url).send().await?;
    // ...
}

// Avoid blocking in async contexts
pub async fn bad_example(&self) -> Result<Response, VllmError> {
    std::thread::sleep(Duration::from_secs(1)); // Don't do this
    // ...
}
```
Project Structure

```
vllm-client/
├── src/
│   ├── lib.rs           # Library entry point
│   ├── client.rs        # Client implementation
│   ├── chat.rs          # Chat API
│   ├── completions.rs   # Legacy completions
│   ├── types.rs         # Type definitions
│   └── error.rs         # Error types
├── tests/
│   └── integration/     # Integration tests
├── docs/
│   ├── src/             # English documentation
│   └── zh/src/          # Chinese documentation
├── examples/
│   └── *.rs             # Example programs
└── Cargo.toml
```
Getting Help
- Open an issue for bugs or feature requests
- Start a discussion for questions
- Check existing issues before creating a new one
License
By contributing, you agree that your contributions will be licensed under MIT OR Apache-2.0.
Acknowledgments
Contributors are recognized in our README and release notes.
Thank you for contributing to vLLM Client!
Changelog
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.1.0 - 2024-01-XX
Added
- Initial release of vLLM Client
- `VllmClient` for connecting to vLLM servers
- Chat completions API (`client.chat.completions()`)
- Streaming support via `MessageStream`
- Tool/function calling support
- Reasoning/thinking mode support for compatible models
- Error handling via the `VllmError` enum
- Builder pattern for client configuration
- Request builder pattern for chat completions
- vLLM-specific parameters via `extra()`
- Token usage tracking in responses
- Timeout configuration
- API key authentication
Features
Client
- `VllmClient::new(base_url)` - create a new client
- `VllmClient::builder()` - create a client with the builder pattern
- `with_api_key()` - set the API key for authentication
- `timeout_secs()` - set the request timeout
Chat Completions
- `model()` - set the model name
- `messages()` - set the conversation messages
- `temperature()` - set the sampling temperature
- `max_tokens()` - set the maximum number of output tokens
- `top_p()` - set the nucleus sampling parameter
- `top_k()` - set top-k sampling (vLLM extension)
- `stop()` - set stop sequences
- `stream()` - enable streaming mode
- `tools()` - define the available tools
- `tool_choice()` - control tool selection
- `extra()` - pass vLLM-specific parameters
Streaming
- `StreamEvent::Content` - content tokens
- `StreamEvent::Reasoning` - reasoning content (thinking models)
- `StreamEvent::ToolCallDelta` - streamed tool call updates
- `StreamEvent::ToolCallComplete` - complete tool calls
- `StreamEvent::Usage` - token usage statistics
- `StreamEvent::Done` - stream finished
- `StreamEvent::Error` - error events
Response Types
- `ChatCompletionResponse` - chat completion response
- `ToolCall` - tool call data with parsing helpers
- `Usage` - token usage statistics
Dependencies
- `reqwest` - HTTP client
- `serde` / `serde_json` - JSON serialization
- `tokio` - async runtime
- `thiserror` - error handling
[Unreleased]
Planned
- Custom HTTP header support
- Connection pool configuration
- Request/response logging
- Retry middleware
- Multimodal input helpers
- Async iterators for batch processing
- OpenTelemetry integration
- WebSocket transport
Version History

| Version | Date | Highlights |
|---|---|---|
| 0.1.0 | 2024-01 | Initial release |