vLLM Client
A Rust client library for the OpenAI-compatible vLLM API.
Features
- OpenAI-compatible: uses the same API structure as OpenAI, making migration straightforward
- Streaming: full support for Server-Sent Events (SSE) streaming responses
- Tool calling: supports function/tool calls, including streaming incremental updates
- Reasoning models: built-in support for reasoning/thinking models (e.g. Qwen with thinking mode enabled)
- Async: fully asynchronous implementation on the Tokio runtime
- Type-safe: strongly typed definitions serialized with Serde
Getting Started
Add to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
Basic Usage
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, world!"}
        ]))
        .send()
        .await?;

    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}
Documentation
Languages
- English - English documentation
- 中文 - Chinese documentation (this page)
License
Licensed under either the Apache License, Version 2.0 or the MIT License, at your option.
Quick Start
Installation
Add vllm-client to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
Quick Start
Basic chat completion
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create the client
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    // Print the response
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}
Streaming
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Reasoning(delta) => print!("{}", delta),
            StreamEvent::Content(delta) => print!("{}", delta),
            _ => {}
        }
    }
    println!();
    Ok(())
}
Configuration
API Key
If your vLLM server requires authentication:
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");
Custom Timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(std::time::Duration::from_secs(60));
Next Steps
Installation
Requirements
- Rust: version 1.70 or later
- Cargo: installed automatically with Rust
Adding the Dependency
Add the dependency to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
Or run directly:
cargo add vllm-client
Dependencies
This library depends on the tokio async runtime; add it to your Cargo.toml:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
For convenience the library re-exports serde_json::json; you can optionally add serde_json directly as well:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
Feature Flags
vllm-client currently has no optional feature flags; all functionality is enabled by default.
Verifying the Installation
Write a small program to verify the installation:
use vllm_client::VllmClient;

fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");
    println!("Client created, base URL: {}", client.base_url());
}
Run it:
cargo run
Starting a vLLM Server
Before using this client, you need a running vLLM server:
# Install vLLM
pip install vllm
# Start the server and load a model
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
Once started, the server exposes its API at http://localhost:8000/v1.
Troubleshooting
Connection failures
If you hit connection errors, check that:
- the vLLM server is actually running
- the server address is correct (default: http://localhost:8000/v1)
- no firewall is blocking the port
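Before digging deeper, a plain TCP probe can distinguish "server down / port blocked" from protocol-level problems. A minimal sketch using only the standard library (the address is a placeholder):

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
fn is_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.to_socket_addrs() {
        Ok(mut addrs) => addrs
            .next()
            .map(|a| TcpStream::connect_timeout(&a, timeout).is_ok())
            .unwrap_or(false),
        Err(_) => false,
    }
}

fn main() {
    let up = is_reachable("127.0.0.1:8000", Duration::from_secs(2));
    println!("vLLM port reachable: {}", up);
}
```

If this prints false, the problem is connectivity rather than the client or the API.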
TLS/SSL errors
If the vLLM server uses a self-signed HTTPS certificate, you will need to deal with certificate verification yourself, for example by adding the certificate to your system's trust store.
Request timeouts
For long-running requests, increase the timeout:
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes
Next Steps
Getting Started
This section walks you through your first API call.
Prerequisites
- Rust 1.70 or later
- A running vLLM server
Basic Chat Completion
The simplest usage looks like this:
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client pointing at the vLLM server
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how have you been?"}
        ]))
        .send()
        .await?;

    // Print the response content
    println!("Reply: {}", response.content.unwrap_or_default());
    Ok(())
}
Streaming
For real-time output, use streaming mode:
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Create a streaming request
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    // Handle stream events
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Reasoning(delta) => eprint!("[thinking: {}]", delta),
            StreamEvent::Done => println!("\n[done]"),
            StreamEvent::Error(e) => eprintln!("\nerror: {}", e),
            _ => {}
        }
    }
    Ok(())
}
Using the Builder Pattern
When you need more configuration, use the builder:
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key") // optional
    .timeout_secs(120)       // optional
    .build();
Complete Example
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    println!("Reply: {}", response.content.unwrap_or_default());

    // Print token usage statistics, if present
    if let Some(usage) = response.usage {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    Ok(())
}
Error Handling
It is good practice to handle errors explicitly:
use vllm_client::{VllmClient, json, VllmError};

async fn chat() -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat().await {
        Ok(text) => println!("Reply: {}", text),
        Err(VllmError::ApiError { status_code, message, .. }) => {
            eprintln!("API error ({}): {}", status_code, message);
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out");
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
Next Steps
Configuration
This document covers every configuration option of vllm-client.
Client Configuration
Basic configuration
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
Builder pattern
For more complex configuration, use the builder pattern:
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")
    .timeout_secs(120)
    .build();
Configuration Options
Base URL
The address of the vLLM server. It must include the /v1 path for OpenAI compatibility.
// Local development
let client = VllmClient::new("http://localhost:8000/v1");

// Remote server
let client = VllmClient::new("https://api.example.com/v1");

// Trailing slashes are handled automatically
let client = VllmClient::new("http://localhost:8000/v1/");
// equivalent to: "http://localhost:8000/v1"
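The trailing-slash handling amounts to normalizing the URL before endpoint paths are appended. A sketch of the idea (not the library's actual implementation):

```rust
/// Strip trailing slashes so "http://host/v1/" and "http://host/v1"
/// produce identical endpoint URLs when paths are joined on.
fn normalize_base_url(url: &str) -> String {
    url.trim_end_matches('/').to_string()
}

fn main() {
    let base = normalize_base_url("http://localhost:8000/v1/");
    // Joining "/chat/completions" now never yields a double slash.
    println!("{}/chat/completions", base);
}
```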
API Key
If the vLLM server requires authentication, configure an API key:
// Chained call
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");

// Builder pattern
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .build();
The API key is sent as a Bearer token in the Authorization request header.
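Concretely, every authenticated request then carries a header of this shape (a sketch of the header value only, not the client's internals):

```rust
/// Build the Authorization header value for Bearer authentication.
fn authorization_header_value(api_key: &str) -> String {
    format!("Bearer {}", api_key)
}

fn main() {
    // Sent as: Authorization: Bearer sk-your-api-key
    println!("Authorization: {}", authorization_header_value("sk-your-api-key"));
}
```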
Timeouts
Long-running tasks need a larger timeout:
// Chained call
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes

// Builder pattern
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)
    .build();
By default the underlying HTTP client's timeout applies (typically 30 seconds).
Request Parameters
When sending a request, you can configure the following parameters:
Model selection
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
Sampling parameters
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .temperature(0.7) // 0.0 - 2.0
    .top_p(0.9)       // 0.0 - 1.0
    .top_k(50)        // vLLM extension
    .max_tokens(1024) // maximum output tokens
    .send()
    .await?;
| Parameter | Type | Range | Description |
|---|---|---|---|
| temperature | f32 | 0.0 - 2.0 | Controls randomness; higher values give more random output |
| top_p | f32 | 0.0 - 1.0 | Nucleus sampling threshold |
| top_k | i32 | 1+ | Top-K sampling (vLLM extension) |
| max_tokens | u32 | 1+ | Maximum number of generated tokens |
Stop sequences
use serde_json::json;

// Multiple stop sequences
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!(["END", "STOP", "\n\n"]))
    .send()
    .await?;

// A single stop sequence
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!("END"))
    .send()
    .await?;
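As in the OpenAI API, generation halts at the first stop sequence, and the matched stop text itself is normally not included in the output. The effect can be illustrated with a client-side sketch:

```rust
/// Truncate `text` at the earliest occurrence of any stop sequence,
/// mirroring how the server cuts off generation.
fn apply_stop_sequences(text: &str, stops: &[&str]) -> String {
    let cut = stops
        .iter()
        .filter_map(|s| text.find(s))
        .min()
        .unwrap_or(text.len());
    text[..cut].to_string()
}

fn main() {
    let out = apply_stop_sequences("Hello world END trailing", &["END", "STOP"]);
    println!("{}", out); // prints "Hello world "
}
```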
Extra parameters
vLLM accepts additional parameters through the extra() method:
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Please think about this problem"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        },
        "reasoning_effort": "high"
    }))
    .send()
    .await?;
Environment Variables
You can configure the client through environment variables:
use std::env;
use vllm_client::VllmClient;

let base_url = env::var("VLLM_BASE_URL")
    .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
let api_key = env::var("VLLM_API_KEY").ok();

let mut client_builder = VllmClient::builder()
    .base_url(&base_url);

if let Some(key) = api_key {
    client_builder = client_builder.api_key(&key);
}

let client = client_builder.build();
Common environment variables
| Variable | Description | Example |
|---|---|---|
| VLLM_BASE_URL | vLLM server address | http://localhost:8000/v1 |
| VLLM_API_KEY | API key (optional) | sk-xxx |
| VLLM_TIMEOUT | Timeout in seconds | 300 |
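VLLM_TIMEOUT is not read automatically by the client (the builder example above only wires up the URL and key, which is an assumption worth checking), so parse it yourself. A sketch:

```rust
/// Parse a timeout value in seconds, falling back to a default
/// when the variable is unset or not a number.
fn parse_timeout(raw: Option<String>, default_secs: u64) -> u64 {
    raw.and_then(|v| v.parse().ok()).unwrap_or(default_secs)
}

fn main() {
    let secs = parse_timeout(std::env::var("VLLM_TIMEOUT").ok(), 30);
    println!("timeout: {}s", secs);
    // Then pass `secs` to the builder: .timeout_secs(secs)
}
```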
Best Practices
Reuse the client
Create the client once and reuse it:
// Recommended: reuse one client
let client = VllmClient::new("http://localhost:8000/v1");
for prompt in prompts {
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;
}

// Avoid: creating a client per request
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // inefficient!
    // ...
}
Choosing a timeout
Pick a timeout that matches the workload:
| Use case | Suggested timeout |
|---|---|
| Simple Q&A | 30 seconds |
| Complex reasoning | 2-5 minutes |
| Long-form generation | 10+ minutes |
Error handling
Always handle errors properly:
use vllm_client::{VllmClient, VllmError};

match client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await
{
    Ok(response) => println!("{}", response.content.unwrap_or_default()),
    Err(VllmError::Timeout) => eprintln!("Request timed out"),
    Err(VllmError::ApiError { status_code, message, .. }) => {
        eprintln!("API error ({}): {}", status_code, message);
    }
    Err(e) => eprintln!("Error: {}", e),
}
Next Steps
API Reference
This document provides a complete reference for the vLLM Client API.
Contents
Client
VllmClient
The main client for interacting with the vLLM API.
use vllm_client::VllmClient;

// Create a new client
let client = VllmClient::new("http://localhost:8000/v1");

// With an API key
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// With a custom timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(std::time::Duration::from_secs(60));
Methods
| Method | Description |
|---|---|
| new(base_url: &str) | Creates a new client with the given base URL |
| with_api_key(key: &str) | Sets the API key used for authentication |
| with_timeout(duration) | Sets the request timeout |
| chat | Accesses the chat completions API |
Chat Completions
Creating a completion request
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]))
    .temperature(0.7)
    .max_tokens(1000)
    .send()
    .await?;
Builder methods
| Method | Type | Description |
|---|---|---|
| model(name) | &str | Name of the model to use |
| messages(msgs) | Value | Array of chat messages |
| temperature(temp) | f32 | Sampling temperature (0.0-2.0) |
| max_tokens(tokens) | u32 | Maximum number of generated tokens |
| top_p(p) | f32 | Nucleus sampling parameter |
| top_k(k) | u32 | Top-k sampling parameter |
| stream(enable) | bool | Enables streaming responses |
| tools(tools) | Value | Tool definitions for function calling |
| extra(json) | Value | Extra (vendor-specific) parameters |
Response structure
pub struct ChatCompletionResponse {
    pub id: String,
    pub object: String,
    pub created: u64,
    pub model: String,
    pub choices: Vec<Choice>,
    pub usage: Usage,
}

pub struct Choice {
    pub index: u32,
    pub message: Message,
    pub finish_reason: Option<String>,
}

pub struct Message {
    pub role: String,
    pub content: Option<String>,
    pub tool_calls: Option<Vec<ToolCall>>,
}

pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
}
Streaming
Streaming completions
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let client = VllmClient::new("http://localhost:8000/v1");

let mut stream = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "Write a poem"}
    ]))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match &event {
        StreamEvent::Reasoning(delta) => {
            // Reasoning content (thinking models)
            print!("{}", delta);
        }
        StreamEvent::Content(delta) => {
            // Regular content
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { tool_call_id, delta } => {
            // Incremental tool call update
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            // Complete tool call
        }
        StreamEvent::Usage(usage) => {
            // Token usage information
        }
        StreamEvent::Done => {
            // Stream finished
            break;
        }
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
StreamEvent variants
| Variant | Description |
|---|---|
| Reasoning(String) | Reasoning/thinking content |
| Content(String) | Regular content delta |
| ToolCallDelta { tool_call_id, delta } | Streaming tool call |
| ToolCallComplete(ToolCall) | Complete tool call |
| Usage(Usage) | Token usage statistics |
| Done | Stream finished |
| Error(VllmError) | An error occurred |
Tool Calling
Defining tools
use vllm_client::json;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);

let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// Handle tool calls
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
    for tool_call in tool_calls {
        println!("Function: {}", tool_call.function.name);
        println!("Arguments: {}", tool_call.function.arguments);
    }
}
ToolCall structure
pub struct ToolCall {
    pub id: String,
    pub r#type: String,
    pub function: FunctionCall,
}

pub struct FunctionCall {
    pub name: String,
    pub arguments: String, // JSON string
}
Returning tool results
// After executing the tool, send its result back
let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"},
        {"role": "assistant", "tool_calls": [
            {
                "id": "call_123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"Tokyo\"}"
                }
            }
        ]},
        {
            "role": "tool",
            "tool_call_id": "call_123",
            "content": "{\"temperature\": 25, \"condition\": \"sunny\"}"
        }
    ]))
    .tools(tools)
    .send()
    .await?;
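Between the assistant's tool_calls message and the final tool message, your code has to execute the function itself. A minimal name-based dispatch sketch (the handler and its canned result are illustrative, not part of vllm-client):

```rust
use std::collections::HashMap;

/// A handler takes the raw JSON arguments string and returns
/// a JSON result string for the "tool" message.
type Handler = fn(&str) -> String;

fn get_weather(_arguments: &str) -> String {
    // A real handler would parse `arguments` and call a weather API.
    r#"{"temperature": 25, "condition": "sunny"}"#.to_string()
}

/// Look up the handler by function name and run it.
fn dispatch(name: &str, arguments: &str, handlers: &HashMap<&str, Handler>) -> Option<String> {
    handlers.get(name).map(|h| h(arguments))
}

fn main() {
    let mut handlers: HashMap<&str, Handler> = HashMap::new();
    handlers.insert("get_weather", get_weather);

    // Name/arguments as they would come from a ToolCall.
    let result = dispatch("get_weather", r#"{"location": "Tokyo"}"#, &handlers);
    println!("{:?}", result);
}
```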
Type Definitions
Message types
// System message
json!({"role": "system", "content": "You are a helpful assistant."})

// User message
json!({"role": "user", "content": "Hello!"})

// Assistant message
json!({"role": "assistant", "content": "Hello!"})

// Tool result message
json!({
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "result"
})
vLLM-Specific Parameters
Use .extra() to pass vLLM-specific parameters:
client
    .chat
    .completions()
    .create()
    .model("qwen-3")
    .messages(json!([{"role": "user", "content": "Think about this problem"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .send()
    .await?;
Error Handling
VllmError
use vllm_client::VllmError;

match client.chat.completions().create().send().await {
    Ok(response) => { /* ... */ },
    Err(VllmError::HttpError(e)) => {
        eprintln!("HTTP error: {}", e);
    }
    Err(VllmError::ApiError { message, code }) => {
        eprintln!("API error ({}): {}", code, message);
    }
    Err(VllmError::StreamError(e)) => {
        eprintln!("Stream error: {}", e);
    }
    Err(VllmError::ParseError(e)) => {
        eprintln!("Parse error: {}", e);
    }
    Err(e) => {
        eprintln!("Other error: {}", e);
    }
}
Error variants
| Variant | Description |
|---|---|
| HttpError | HTTP request/response error |
| ApiError | API-level error (rate limiting, etc.) |
| StreamError | Streaming-specific error |
| ParseError | JSON parse error |
| IoError | I/O error |
Complete Example
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .with_api_key("your-api-key");

    // Streaming example
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "Write a haiku about programming"}
        ]))
        .temperature(0.7)
        .max_tokens(100)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => eprintln!("Error: {}", e),
            _ => {}
        }
    }
    println!();
    Ok(())
}
Client API
VllmClient is the main entry point for using the vLLM API.
Creating a Client
Simple creation
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
With an API key
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");
Setting a timeout
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120); // 2 minutes
Using the Builder Pattern
For complex configuration, use the builder:
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .timeout_secs(300)
    .build();
Method Reference
new(base_url: impl Into<String>) -> Self
Creates a client with the given base URL.
let client = VllmClient::new("http://localhost:8000/v1");
Parameters:
- base_url - base URL of the vLLM server (must include the /v1 path)
Notes:
- trailing slashes are removed automatically
- client creation is cheap, but reusing one client is still recommended
with_api_key(self, api_key: impl Into<String>) -> Self
Sets the API key (builder style).
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
Parameters:
- api_key - API key used for Bearer authentication
Notes:
- the API key is sent as a Bearer token in the Authorization request header
- this method returns a new client instance
timeout_secs(self, secs: u64) -> Self
Sets the request timeout (builder style).
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300);
Parameters:
- secs - timeout in seconds
Notes:
- applies to all requests made by this client
- increase it for long-running generation tasks
base_url(&self) -> &str
Returns the client's base URL.
let client = VllmClient::new("http://localhost:8000/v1");
assert_eq!(client.base_url(), "http://localhost:8000/v1");
api_key(&self) -> Option<&str>
Returns the configured API key, if any.
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
assert_eq!(client.api_key(), Some("sk-xxx"));
builder() -> VllmClientBuilder
Creates a new client builder with additional configuration options.
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
API Modules
The client exposes several API modules:
chat - chat completions API
Access the chat completion endpoint:
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
completions - legacy completions API
Access the legacy text completion endpoint:
let response = client.completions.create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .prompt("Once upon a time")
    .send()
    .await?;
VllmClientBuilder
The builder provides a flexible way to configure the client.
Methods
| Method | Type | Description |
|---|---|---|
| base_url(url) | impl Into<String> | Sets the base URL |
| api_key(key) | impl Into<String> | Sets the API key |
| timeout_secs(secs) | u64 | Sets the timeout in seconds |
| build() | - | Builds the client |
Defaults
| Option | Default |
|---|---|
| base_url | http://localhost:8000/v1 |
| api_key | None |
| timeout_secs | HTTP client default (30 seconds) |
Examples
Basic usage
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
Using environment variables
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    let base_url = env::var("VLLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
    let api_key = env::var("VLLM_API_KEY").ok();

    let mut builder = VllmClient::builder().base_url(&base_url);
    if let Some(key) = api_key {
        builder = builder.api_key(&key);
    }
    builder.build()
}
Multiple requests
Reuse the client across requests:
use vllm_client::{VllmClient, json};

async fn process_prompts(client: &VllmClient, prompts: &[&str]) -> Vec<String> {
    let mut results = Vec::new();

    for prompt in prompts {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match response {
            Ok(r) => results.push(r.content.unwrap_or_default()),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
    results
}
Thread Safety
VllmClient is thread-safe and can be shared across threads:
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Clone the Arc and pass it between threads
let client_clone = Arc::clone(&client);
See Also
Chat Completions API
The chat completions API is the primary interface for generating text responses.
Overview
Access it via client.chat.completions():
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;
Request Builder
Required parameters
model(name: impl Into<String>)
Sets the model used for generation.
.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")
messages(messages: Value)
Sets the conversation messages as a JSON array.
.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))
Message roles
| Role | Description |
|---|---|
| system | Sets the assistant's behavior |
| user | User input |
| assistant | Assistant replies (for multi-turn conversations) |
| tool | Tool results (for function calling) |
Sampling parameters
temperature(temp: f32)
Controls randomness. Range: 0.0 to 2.0.
.temperature(0.7) // typical behavior
.temperature(0.0) // deterministic output
.temperature(1.5) // more creative
max_tokens(tokens: u32)
Maximum number of generated tokens.
.max_tokens(1024)
.max_tokens(4096)
top_p(p: f32)
Nucleus sampling threshold. Range: 0.0 to 1.0.
.top_p(0.9)
top_k(k: i32)
Top-K sampling (vLLM extension). Restricts sampling to the top K tokens.
.top_k(50)
stop(sequences: Value)
Stops generation when any of these sequences is encountered.
// Multiple stop sequences
.stop(json!(["END", "STOP", "\n\n"]))

// A single stop sequence
.stop(json!("---"))
Tool parameters
tools(tools: Value)
Defines tools/functions the model may call.
.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))
tool_choice(choice: Value)
Controls tool selection behavior.
.tool_choice(json!("auto"))     // model decides
.tool_choice(json!("none"))     // never use tools
.tool_choice(json!("required")) // force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
Advanced parameters
stream(enable: bool)
Enables streaming responses.
.stream(true)
extra(params: Value)
Passes vLLM-specific or other extra parameters.
.extra(json!({
    "chat_template_kwargs": {
        "think_mode": true
    },
    "reasoning_effort": "high"
}))
Sending the Request
send() - complete response
Returns the full response in one piece.
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
send_stream() - streaming response
Returns a stream of events for real-time output.
let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
See Streaming for details.
Response Structure
ChatCompletionResponse
| Field | Type | Description |
|---|---|---|
| raw | Value | Raw JSON response |
| id | String | Response ID |
| object | String | Object type |
| model | String | Model used |
| content | Option<String> | Generated content |
| reasoning_content | Option<String> | Reasoning content (thinking models) |
| tool_calls | Option<Vec<ToolCall>> | Tool calls |
| finish_reason | Option<String> | Why generation stopped |
| usage | Option<Usage> | Token usage statistics |
Example
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Get the content
println!("Content: {}", response.content.unwrap_or_default());

// Check the reasoning content (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check why generation stopped
match response.finish_reason.as_deref() {
    Some("stop") => println!("Finished naturally"),
    Some("length") => println!("Hit the max token limit"),
    Some("tool_calls") => println!("Made tool calls"),
    _ => {}
}

// Token usage statistics
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}
Complete Example
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a programming assistant."},
            {"role": "user", "content": "Write a Rust function that reverses a string"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }
    Ok(())
}
Multi-Turn Conversations
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First turn
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Xiao Ming"}
    ]))
    .send()
    .await?;

// Continue the conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Xiao Ming"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What is my name?"}
    ]))
    .send()
    .await?;
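The growing message list is easy to mismanage by hand, so a small history helper keeps it tidy. A sketch with plain structs (nothing here is part of vllm-client — you would still convert the list into the json! array when sending):

```rust
/// A single chat message: role ("user", "assistant", ...) plus content.
#[derive(Debug, Clone)]
struct Msg {
    role: String,
    content: String,
}

#[derive(Default)]
struct History {
    messages: Vec<Msg>,
}

impl History {
    /// Append one turn to the conversation.
    fn push(&mut self, role: &str, content: &str) {
        self.messages.push(Msg {
            role: role.to_string(),
            content: content.to_string(),
        });
    }
}

fn main() {
    let mut history = History::default();
    history.push("user", "My name is Xiao Ming");
    // ...send the request, then record the assistant's reply...
    history.push("assistant", "Nice to meet you, Xiao Ming!");
    history.push("user", "What is my name?");

    // `history.messages` now holds all three turns, ready to be
    // serialized into the `messages` array of the next request.
    println!("{} messages queued", history.messages.len());
}
```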
See Also
Streaming API
Streaming lets you process model output in real time, token by token, instead of waiting for the complete response.
Overview
vLLM Client streams via Server-Sent Events (SSE). Call send_stream() instead of send() to get a streaming response.
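For intuition, each SSE frame on the wire is a line of the form `data: {json}`, terminated by a literal `data: [DONE]`. The client parses this for you; a simplified sketch of the framing:

```rust
/// What a single SSE line from the server means.
#[derive(Debug, PartialEq)]
enum SseLine<'a> {
    Data(&'a str), // JSON payload of one chunk
    Done,          // the "data: [DONE]" terminator
    Ignore,        // comments, blank keep-alive lines, etc.
}

fn parse_sse_line(line: &str) -> SseLine<'_> {
    match line.strip_prefix("data: ") {
        Some("[DONE]") => SseLine::Done,
        Some(payload) => SseLine::Data(payload),
        None => SseLine::Ignore,
    }
}

fn main() {
    println!("{:?}", parse_sse_line(r#"data: {"choices":[]}"#));
}
```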
Basic Streaming
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    println!();
    Ok(())
}
StreamEvent Variants
The StreamEvent enum represents the different kinds of streaming events:
| Variant | Description |
|---|---|
| Content(String) | Regular content token delta |
| Reasoning(String) | Reasoning/thinking content (thinking models) |
| ToolCallDelta | Incremental streamed tool call |
| ToolCallComplete(ToolCall) | Complete tool call, ready to execute |
| Usage(Usage) | Token usage statistics |
| Done | Streaming finished |
| Error(VllmError) | An error occurred |
Content events
The most common event type, carrying text tokens:
match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        std::io::Write::flush(&mut std::io::stdout()).ok();
    }
    _ => {}
}
Reasoning events
Emitted by models with reasoning capabilities (e.g. Qwen with thinking mode enabled):
match event {
    StreamEvent::Reasoning(delta) => {
        eprintln!("[thinking] {}", delta);
    }
    StreamEvent::Content(delta) => {
        print!("{}", delta);
    }
    _ => {}
}
Tool call events
Tool calls are streamed incrementally, with a notification once complete:
match event {
    StreamEvent::ToolCallDelta { index, id, name, arguments } => {
        println!("tool delta: index={}, name={}", index, name);
        // `arguments` is a partial JSON string
    }
    StreamEvent::ToolCallComplete(tool_call) => {
        println!("tool ready: {}({})", tool_call.name, tool_call.arguments);
        // execute the tool and return its result
    }
    _ => {}
}
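The argument fragments have to be concatenated per tool-call index until the call completes; the client does this internally before emitting ToolCallComplete. A simplified sketch of the accumulation:

```rust
use std::collections::BTreeMap;

/// Accumulate streamed argument fragments per tool-call index.
fn accumulate(deltas: &[(u32, &str)]) -> BTreeMap<u32, String> {
    let mut calls: BTreeMap<u32, String> = BTreeMap::new();
    for (index, fragment) in deltas {
        calls.entry(*index).or_default().push_str(fragment);
    }
    calls
}

fn main() {
    // Fragments as they might arrive from ToolCallDelta events.
    let deltas = [
        (0, r#"{"loca"#),
        (0, r#"tion": "#),
        (0, r#""Tokyo"}"#),
    ];
    let calls = accumulate(&deltas);
    println!("{}", calls[&0]); // prints {"location": "Tokyo"}
}
```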
Usage events
Token usage information is usually sent last:
match event {
    StreamEvent::Usage(usage) => {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    _ => {}
}
MessageStream
The MessageStream type is an async iterator that yields StreamEvent values.
Methods
| Method | Returns | Description |
|---|---|---|
| next() | Option<StreamEvent> | Gets the next event (async) |
| collect_content() | String | Collects all content into a string |
| into_stream() | impl Stream | Converts into a generic stream |
Collecting all content
For convenience, you can collect all content at once:
let content = stream.collect_content().await?;
println!("Full response: {}", content);
Note: this waits for the complete response, which forfeits the benefit of streaming. Use it only when you want the streaming transport but just need the final text.
Complete Streaming Example
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();
    let mut usage = None;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => {
                println!("\n[stream finished]");
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e);
            }
            _ => {}
        }
    }

    // Print a summary
    if !reasoning.is_empty() {
        eprintln!("\n--- Reasoning ---");
        eprintln!("{}", reasoning);
    }
    if let Some(usage) = usage {
        eprintln!("\n--- Token usage ---");
        eprintln!("Prompt: {}, completion: {}, total: {}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    Ok(())
}
Streaming Tool Calls
When tools are in use, tool calls arrive incrementally:
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let mut stream = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::ToolCallComplete(tool_call) => {
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute the tool calls
for tool_call in tool_calls {
    println!("Tool: {} args: {}", tool_call.name, tool_call.arguments);
    // execute, then return the result in the next message
}
Error Handling
Errors can occur at any point during streaming:
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Error(e) => return Err(e),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    Ok(content)
}
Best Practices
Flush output
When displaying output in real time, flush stdout after each write:
use std::io::{self, Write};

match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        io::stdout().flush().ok();
    }
    _ => {}
}
Handle interruption
In interactive applications, handle Ctrl+C gracefully:
use tokio::signal;

tokio::select! {
    result = process_stream(&mut stream) => {
        // completed normally
    }
    _ = signal::ctrl_c() => {
        println!("\n[interrupted]");
    }
}
Idle-stream timeout
Guard against streams that stall:
use tokio::time::{timeout, Duration};

let result = timeout(
    Duration::from_secs(60),
    stream.next()
).await;

match result {
    Ok(Some(event)) => { /* handle the event */ }
    Ok(None) => { /* stream ended */ }
    Err(_) => { /* timed out */ }
}
Completions 流式 API
vLLM Client 同时支持旧版 /v1/completions API 的流式调用,使用 CompletionStreamEvent。
CompletionStreamEvent 类型
| 变体 | 说明 |
|---|---|
| `Text(String)` | 文本 token 增量 |
| `FinishReason(String)` | 流结束原因(如 `"stop"`、`"length"`) |
| `Usage(Usage)` | Token 使用统计 |
| `Done` | 流式传输完成 |
| `Error(VllmError)` | 发生错误 |
Completions 流式示例
```rust
use std::io::Write; // flush() 需要 Write trait 在作用域内

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("写一首关于春天的诗")
        .max_tokens(1024)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[结束原因: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!(
                    "\nTokens: 提示词={}, 补全={}, 总计={}",
                    usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[流式传输完成]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("错误: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}
```
CompletionStream 方法
| 方法 | 返回类型 | 说明 |
|---|---|---|
| `next()` | `Option<CompletionStreamEvent>` | 获取下一个事件(异步) |
| `collect_text()` | `String` | 收集所有文本为字符串 |
| `into_stream()` | `impl Stream` | 转换为通用流 |
工具调用 API
工具调用(也称为函数调用)允许模型在生成过程中调用外部函数,实现与外部 API、数据库和自定义逻辑的集成。
概述
vLLM Client 支持 OpenAI 兼容的工具调用:
```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "东京的天气怎么样?"}
    ]))
    .tools(tools)
    .send()
    .await?;
```
定义工具
基础工具定义
工具使用遵循 OpenAI 规范的 JSON 格式定义:
```rust
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "获取指定地点的当前天气",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "城市名称,如东京"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "温度单位"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);
```
多个工具
```rust
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "获取天气信息",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "搜索网页信息",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"}
                },
                "required": ["query"]
            }
        }
    }
]);
```
工具选择
控制模型如何选择工具:
```rust
// 让模型自行决定(默认)
.tool_choice(json!("auto"))

// 禁止使用工具
.tool_choice(json!("none"))

// 强制使用工具
.tool_choice(json!("required"))

// 强制使用特定工具
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
```
处理工具调用
检查工具调用
```rust
use vllm_client::{VllmClient, json, VllmError};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "东京的天气怎么样?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// 检查响应是否包含工具调用
if response.has_tool_calls() {
    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            println!("函数: {}", tool_call.name);
            println!("参数: {}", tool_call.arguments);
        }
    }
}
```
ToolCall 结构
```rust
pub struct ToolCall {
    pub id: String,        // 调用的唯一标识
    pub name: String,      // 函数名称
    pub arguments: String, // 参数的 JSON 字符串
}
```
解析参数
将参数字符串解析为类型化数据:
```rust
use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

if let Some(tool_call) = response.first_tool_call() {
    // 解析为特定类型
    match tool_call.parse_args_as::<WeatherArgs>() {
        Ok(args) => {
            println!("地点: {}", args.location);
            if let Some(unit) = args.unit {
                println!("单位: {}", unit);
            }
        }
        Err(e) => {
            eprintln!("解析参数失败: {}", e);
        }
    }

    // 或解析为通用 JSON
    let args: Value = tool_call.parse_args()?;
}
```
工具结果方法
创建工具结果消息:
```rust
// 创建工具结果消息
let tool_result = tool_call.result(json!({
    "temperature": 25,
    "condition": "sunny",
    "humidity": 60
}));

// 返回一个可直接加入消息的 JSON 对象:
// {
//   "role": "tool",
//   "tool_call_id": "...",
//   "content": "{\"temperature\": 25, ...}"
// }
```
完整工具调用流程
```rust
use vllm_client::{VllmClient, json, ToolCall};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
}

// 模拟天气 API(忽略参数,返回固定结果)
fn get_weather(_location: &str) -> WeatherResult {
    WeatherResult {
        temperature: 25.0,
        condition: "sunny".to_string(),
    }
}

async fn chat_with_tools(
    client: &VllmClient,
    user_message: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "获取当前天气",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // 第一次请求
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    // 检查模型是否要调用工具
    if response.has_tool_calls() {
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // 将助手的工具调用加入消息
        if let Some(tool_calls) = &response.tool_calls {
            let assistant_msg = response.assistant_message();
            messages.push(assistant_msg);

            // 执行每个工具并加入结果
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location);
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // 带工具结果继续对话
        let final_response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        return Ok(final_response.content.unwrap_or_default());
    }

    Ok(response.content.unwrap_or_default())
}
```
流式工具调用
流式响应中,工具调用会增量推送:
```rust
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "东京和巴黎的天气怎么样?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();
let mut content = String::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => {
            content.push_str(&delta);
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { index, id, name, arguments } => {
            println!("[工具增量 {}] {}({})", index, name, arguments);
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            println!("[工具完成] {}({})", tool_call.name, tool_call.arguments);
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// 执行所有收集到的工具调用
for tool_call in tool_calls {
    // 执行并返回结果...
}
```
多轮工具调用
```rust
// `tools` 与 `execute_tool` 为前文定义的工具列表与工具执行函数
async fn multi_round_tool_calling(
    client: &VllmClient,
    user_message: &str,
    max_rounds: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for _ in 0..max_rounds {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // 加入带工具调用的助手消息
            messages.push(response.assistant_message());

            // 执行工具并加入结果
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // 没有更多工具调用,返回内容
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err("超过最大轮数".into())
}
```
最佳实践
清晰的工具描述
写清楚、详细的描述:
```
// 推荐
"description": "获取指定城市的当前天气状况。返回温度、湿度和天气状况。"

// 避免
"description": "获取天气"
```
精确的参数 Schema
定义准确的 JSON Schema:
```json
"parameters": {
  "type": "object",
  "properties": {
    "location": {
      "type": "string",
      "description": "城市名称或坐标"
    },
    "days": {
      "type": "integer",
      "minimum": 1,
      "maximum": 7,
      "description": "预报天数"
    }
  },
  "required": ["location"]
}
```
错误处理
优雅地处理工具执行错误:
```rust
let tool_result = match execute_tool(&tool_call.name, &tool_call.arguments) {
    Ok(result) => json!({"success": true, "data": result}),
    Err(e) => json!({"success": false, "error": e.to_string()}),
};
messages.push(tool_call.result(tool_result));
```
错误处理
本文档介绍 vLLM Client 中的错误处理机制。
VllmError 枚举
vLLM Client 中的所有错误都通过 VllmError 枚举表示:
```rust
use thiserror::Error;

#[derive(Debug, Error, Clone)]
pub enum VllmError {
    #[error("HTTP request failed: {0}")]
    Http(String),

    #[error("JSON error: {0}")]
    Json(String),

    #[error("API error (status {status_code}): {message}")]
    ApiError {
        status_code: u16,
        message: String,
        error_type: Option<String>,
    },

    #[error("Stream error: {0}")]
    Stream(String),

    #[error("Connection timeout")]
    Timeout,

    #[error("Model not found: {0}")]
    ModelNotFound(String),

    #[error("Missing required parameter: {0}")]
    MissingParameter(String),

    #[error("No response content")]
    NoContent,

    #[error("Invalid response format: {0}")]
    InvalidResponse(String),

    #[error("{0}")]
    Other(String),
}
```
错误类型
| 变体 | 发生场景 |
|---|---|
| `Http` | 网络错误、连接失败 |
| `Json` | 序列化/反序列化错误 |
| `ApiError` | 服务器返回错误响应 |
| `Stream` | 流式响应过程中的错误 |
| `Timeout` | 请求超时 |
| `ModelNotFound` | 指定的模型不存在 |
| `MissingParameter` | 缺少必需参数 |
| `NoContent` | 响应无内容 |
| `InvalidResponse` | 响应格式不符合预期 |
| `Other` | 其他错误 |
基础错误处理
```rust
use vllm_client::{VllmClient, json, VllmError};

async fn chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat("你好!").await {
        Ok(text) => println!("响应: {}", text),
        Err(e) => eprintln!("错误: {}", e),
    }
}
```
详细错误处理
针对不同错误类型进行不同处理:
```rust
use vllm_client::{VllmClient, json, VllmError};

#[tokio::main]
async fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "你好!"}]))
        .send()
        .await;

    match result {
        Ok(response) => {
            println!("成功: {}", response.content.unwrap_or_default());
        }
        Err(VllmError::ApiError { status_code, message, error_type }) => {
            eprintln!("API 错误 (HTTP {}): {}", status_code, message);
            if let Some(etype) = error_type {
                eprintln!("错误类型: {}", etype);
            }
        }
        Err(VllmError::Timeout) => {
            eprintln!("请求超时,请尝试增加超时时间。");
        }
        Err(VllmError::Http(msg)) => {
            eprintln!("网络错误: {}", msg);
        }
        Err(VllmError::ModelNotFound(model)) => {
            eprintln!("模型 '{}' 未找到,请检查可用模型。", model);
        }
        Err(VllmError::MissingParameter(param)) => {
            eprintln!("缺少必需参数: {}", param);
        }
        Err(e) => {
            eprintln!("其他错误: {}", e);
        }
    }
}
```
HTTP 状态码
常见的 API 错误状态码:
| 状态码 | 含义 | 处理建议 |
|---|---|---|
| 400 | 请求格式错误 | 检查请求参数 |
| 401 | 未授权 | 检查 API Key |
| 403 | 禁止访问 | 检查权限 |
| 404 | 未找到 | 检查端点或模型名称 |
| 429 | 请求频率限制 | 实现退避重试 |
| 500 | 服务器内部错误 | 重试或联系管理员 |
| 502 | 网关错误 | 检查 vLLM 服务器状态 |
| 503 | 服务不可用 | 等待后重试 |
| 504 | 网关超时 | 增加超时时间或重试 |
可重试错误
检查错误是否可重试:
```rust
use vllm_client::VllmError;

fn should_retry(error: &VllmError) -> bool {
    error.is_retryable()
}

// 手动检查
match error {
    VllmError::Timeout => true,
    VllmError::ApiError { status_code: 429, .. } => true,       // 频率限制
    VllmError::ApiError { status_code: 500..=504, .. } => true, // 服务器错误
    _ => false,
}
```
指数退避重试
```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn chat_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut retries = 0;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && retries < max_retries => {
                retries += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(retries - 1));
                eprintln!("第 {} 次重试,等待 {:?}: {}", retries, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
流式响应错误处理
处理流式响应过程中的错误:
```rust
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => return Err(e),
            _ => {}
        }
    }

    Ok(content)
}
```
错误上下文
为错误添加上下文信息,便于调试:
```rust
use vllm_client::{VllmClient, json};

async fn chat_with_context(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("获取对话响应失败: {}", e))?;

    Ok(response.content.unwrap_or_default())
}
```
使用 anyhow 或 eyre
对于使用 anyhow 或 eyre 的应用程序:
```rust
use vllm_client::{VllmClient, json};
use anyhow::{Context, Result};

async fn chat(prompt: &str) -> Result<String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .context("发送对话请求失败")?;

    Ok(response.content.unwrap_or_default())
}
```
最佳实践
1. 始终处理错误
```rust
// 不好的做法
let response = client.chat.completions().create()
    .send().await.unwrap();

// 好的做法
match client.chat.completions().create().send().await {
    Ok(r) => { /* 处理响应 */ },
    Err(e) => eprintln!("错误: {}", e),
}
```
2. 设置适当的超时时间
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 长时间任务设置为 5 分钟
```
3. 记录带上下文的错误
```rust
Err(e) => {
    log::error!("对话请求失败: {}", e);
    log::debug!("请求详情: model={}, prompt_len={}", model, prompt.len());
}
```
4. 实现优雅降级
```rust
match primary_client.chat.completions().create().send().await {
    Ok(r) => r,
    Err(e) => {
        log::warn!("主客户端失败: {}, 尝试备用客户端", e);
        fallback_client.chat.completions().create().send().await?
    }
}
```
示例代码
本节包含各种使用场景的代码示例。
基础聊天
简单对话
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "你好,请介绍一下你自己。"} ])) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
带系统提示的对话
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "system", "content": "你是一个专业的 Rust 编程助手,回答简洁准确。"}, {"role": "user", "content": "什么是所有权?"} ])) .temperature(0.7) .max_tokens(500) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
多轮对话
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "我叫张三"}, {"role": "assistant", "content": "你好,张三!很高兴认识你。有什么我可以帮助你的吗?"}, {"role": "user", "content": "我叫什么名字?"} ])) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
流式聊天
基本流式输出
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "写一首关于春天的诗"} ])) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { match &event { StreamEvent::Content(delta) => print!("{}", delta), StreamEvent::Done => break, StreamEvent::Error(e) => eprintln!("错误: {}", e), _ => {} } } println!(); Ok(()) }
带思考模式的流式输出
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "user", "content": "解释相对论"} ])) .extra(json!({"chat_template_kwargs": {"enable_thinking": true}})) .stream(true) .send_stream() .await?; println!("=== 思考过程 ==="); while let Some(event) = stream.next().await { match &event { StreamEvent::Reasoning(delta) => { // 思考内容 print!("{}", delta); } StreamEvent::Content(delta) => { // 正式回复内容 print!("{}", delta); } StreamEvent::Done => break, _ => {} } } println!(); Ok(()) }
流式 Completions
旧版 Completions API 流式调用
```rust
use std::io::Write; // flush() 需要 Write trait 在作用域内

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("什么是机器学习?")
        .max_tokens(500)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[结束原因: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!(
                    "\nTokens: 提示词={}, 补全={}, 总计={}",
                    usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[流式传输完成]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("错误: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}
```
注意: 对于新项目,推荐使用 Chat Completions API(`client.chat.completions()`),它提供更灵活的功能和更好的消息格式。
工具调用
定义和使用工具
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 定义工具 let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定城市的当前天气", "parameters": { "type": "object", "properties": { "city": { "type": "string", "description": "城市名称,如:北京、上海" } }, "required": ["city"] } } }, { "type": "function", "function": { "name": "get_time", "description": "获取指定城市的当前时间", "parameters": { "type": "object", "properties": { "city": { "type": "string", "description": "城市名称" } }, "required": ["city"] } } } ]); // 发送请求 let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "北京现在天气怎么样?"} ])) .tools(tools) .send() .await?; // 检查是否有工具调用 if let Some(tool_calls) = &response.choices[0].message.tool_calls { for tool_call in tool_calls { println!("工具: {}", tool_call.function.name); println!("参数: {}", tool_call.function.arguments); // 在这里执行实际的工具调用 // let result = execute_tool(&tool_call.function.name, &tool_call.function.arguments); } } Ok(()) }
返回工具结果
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取天气信息", "parameters": { "type": "object", "properties": { "city": {"type": "string"} }, "required": ["city"] } } } ]); // 模拟对话流程 let response = client .chat .completions() .create() .model("llama-3-70b") .messages(json!([ {"role": "user", "content": "上海天气如何?"}, { "role": "assistant", "tool_calls": [{ "id": "call_001", "type": "function", "function": { "name": "get_weather", "arguments": "{\"city\": \"上海\"}" } }] }, { "role": "tool", "tool_call_id": "call_001", "content": "{\"temperature\": 28, \"condition\": \"多云\", \"humidity\": 65}" } ])) .tools(tools) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
多模态
图像理解
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 使用 base64 编码的图像 let image_base64 = "data:image/png;base64,iVBORw0KGgo..."; let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "这张图片里有什么?"}, { "type": "image_url", "image_url": {"url": image_base64} } ] } ])) .max_tokens(500) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
使用图像 URL
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "描述这张图片"}, { "type": "image_url", "image_url": {"url": "https://example.com/image.jpg"} } ] } ])) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
思考模式
启用思考模式
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "system", "content": "你是一个善于深度思考的AI助手。"}, {"role": "user", "content": "为什么天空是蓝色的?"} ])) .extra(json!({ "chat_template_kwargs": { "enable_thinking": true } })) .stream(true) .send_stream() .await?; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match &event { StreamEvent::Reasoning(delta) => reasoning.push_str(delta), StreamEvent::Content(delta) => content.push_str(delta), StreamEvent::Done => break, _ => {} } } println!("=== 思考过程 ==="); println!("{}", reasoning); println!("\n=== 回答 ==="); println!("{}", content); Ok(()) }
禁用思考模式
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "user", "content": "你好"} ])) .extra(json!({ "chat_template_kwargs": { "enable_thinking": false } })) .send() .await?; println!("{}", response.choices[0].message.content.unwrap()); Ok(()) }
更多示例
完整的示例代码可以在项目的 examples/ 目录中找到:
- `simple.rs` - 基础聊天示例
- `simple_streaming.rs` - 流式聊天示例
- `streaming_chat.rs` - 带思考模式的流式聊天
- `tool_calling.rs` - 工具调用示例
基础聊天示例
本页演示 vLLM Client 的基础聊天补全使用模式。
简单聊天
发送聊天消息的最简单方式:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "你好,你好吗?"} ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
带系统消息
添加系统消息来控制助手的行为:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "system", "content": "你是一个有帮助的编程助手。你编写整洁、文档完善的代码。"}, {"role": "user", "content": "用 Rust 写一个检查数字是否为质数的函数"} ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
多轮对话
在多轮消息中保持上下文:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 构建对话历史 let mut messages = vec![ json!({"role": "system", "content": "你是一个有帮助的助手。"}), ]; // 第一轮 messages.push(json!({"role": "user", "content": "我叫小明"})); let response1 = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages.clone())) .send() .await?; let assistant_reply = response1.content.unwrap_or_default(); println!("助手: {}", assistant_reply); // 将助手回复添加到历史 messages.push(json!({"role": "assistant", "content": assistant_reply})); // 第二轮 messages.push(json!({"role": "user", "content": "我叫什么名字?"})); let response2 = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .send() .await?; println!("助手: {}", response2.content.unwrap_or_default()); Ok(()) }
对话辅助工具
一个可复用的对话构建辅助工具:
use vllm_client::{VllmClient, json, VllmError}; use serde_json::Value; struct Conversation { client: VllmClient, model: String, messages: Vec<Value>, } impl Conversation { fn new(client: VllmClient, model: impl Into<String>) -> Self { Self { client, model: model.into(), messages: vec![ json!({"role": "system", "content": "你是一个有帮助的助手。"}) ], } } fn with_system(mut self, content: &str) -> Self { self.messages[0] = json!({"role": "system", "content": content}); self } async fn send(&mut self, user_message: &str) -> Result<String, VllmError> { self.messages.push(json!({ "role": "user", "content": user_message })); let response = self.client .chat .completions() .create() .model(&self.model) .messages(json!(&self.messages)) .send() .await?; let content = response.content.unwrap_or_default(); self.messages.push(json!({ "role": "assistant", "content": &content })); Ok(content) } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut conv = Conversation::new(client, "Qwen/Qwen2.5-7B-Instruct") .with_system("你是一个数学辅导员。简单地解释概念。"); println!("用户: 2 + 2 等于几?"); let reply = conv.send("2 + 2 等于几?").await?; println!("助手: {}", reply); println!("\n用户: 那乘以 3 等于几?"); let reply = conv.send("那乘以 3 等于几?").await?; println!("助手: {}", reply); Ok(()) }
使用采样参数
通过采样参数控制生成:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一个关于机器人的创意故事"} ])) .temperature(1.2) // 更高的温度增加创意性 .top_p(0.95) // 核采样 .top_k(50) // vLLM 扩展参数 .max_tokens(512) // 限制输出长度 .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
确定性输出
要获得可重复的结果,将温度设置为 0:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "2 + 2 等于几?"} ])) .temperature(0.0) // 确定性输出 .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
使用停止序列
在特定序列处停止生成:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "列出三种水果,每行一个"} ])) .stop(json!(["\n\n", "END"])) // 在双换行或 END 处停止 .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
Token 使用追踪
追踪 token 使用情况以监控成本:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "解释量子计算"} ])) .send() .await?; println!("响应: {}", response.content.unwrap_or_default()); if let Some(usage) = response.usage { println!("\n--- Token 使用统计 ---"); println!("提示词 tokens: {}", usage.prompt_tokens); println!("补全 tokens: {}", usage.completion_tokens); println!("总 tokens: {}", usage.total_tokens); } Ok(()) }
批量处理
高效处理多个提示:
use vllm_client::{VllmClient, json, VllmError}; async fn process_prompts( client: &VllmClient, prompts: &[&str], ) -> Vec<Result<String, VllmError>> { let mut results = Vec::new(); for prompt in prompts { let result = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map(|r| r.content.unwrap_or_default()); results.push(result); } results } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(120); let prompts = [ "Rust 是什么?", "Python 是什么?", "Go 是什么?", ]; let results = process_prompts(&client, &prompts).await; for (prompt, result) in prompts.iter().zip(results.iter()) { match result { Ok(response) => println!("问: {}\n答: {}\n", prompt, response), Err(e) => eprintln!("'{}' 出错: {}", prompt, e), } } Ok(()) }
错误处理
生产代码的正确错误处理:
use vllm_client::{VllmClient, json, VllmError}; async fn safe_chat(prompt: &str) -> Result<String, String> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(60); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map_err(|e| format!("请求失败: {}", e))?; response.content.ok_or_else(|| "响应中无内容".to_string()) } #[tokio::main] async fn main() { match safe_chat("你好!").await { Ok(text) => println!("响应: {}", text), Err(e) => eprintln!("错误: {}", e), } }
流式聊天示例
本示例演示如何使用流式响应实现实时输出。
基础流式响应
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一个关于机器人学习绘画的短篇故事。"} ])) .temperature(0.8) .max_tokens(1024) .stream(true) .send_stream() .await?; print!("响应: "); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!(); Ok(()) }
带推理过程的流式响应(思考模型)
对于支持思考/推理模式的模型:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "计算: 15 * 23 + 47 等于多少?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { reasoning.push_str(&delta); eprintln!("[思考中] {}", delta); } StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n"); if !reasoning.is_empty() { println!("--- 推理过程 ---"); println!("{}", reasoning); } Ok(()) }
带进度指示器的流式响应
在等待第一个 token 时显示输入指示器:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use std::time::Duration; use std::sync::atomic::{AtomicBool, Ordering}; use std::sync::Arc; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let waiting = Arc::new(AtomicBool::new(true)); let waiting_clone = Arc::clone(&waiting); // 启动输入指示器任务 let mut indicator = tokio::spawn(async move { let chars = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏']; let mut i = 0; while waiting_clone.load(Ordering::Relaxed) { print!("\r{} 思考中...", chars[i]); std::io::Write::flush(&mut std::io::stdout()).ok(); i = (i + 1) % chars.len(); tokio::time::sleep(Duration::from_millis(80)).await; } print!("\r          \r"); // 清除指示器 }); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "用简单的语言解释量子纠缠。"} ])) .stream(true) .send_stream() .await?; let mut first_token = true; let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { if first_token { waiting.store(false, Ordering::Relaxed); // 在循环体内只能通过 &mut 等待 JoinHandle,按值 await 会触发所有权移动错误 (&mut indicator).await.ok(); first_token = false; println!("响应:"); println!("---------"); } content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { waiting.store(false, Ordering::Relaxed); eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n"); Ok(()) }
多轮流式对话
处理带有流式响应的对话:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use std::io::{self, BufRead, Write}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut messages: Vec<serde_json::Value> = Vec::new(); println!("与 AI 聊天(输入 'quit' 退出)"); println!("----------------------------------------\n"); let stdin = io::stdin(); for line in stdin.lock().lines() { let input = line?; if input.trim() == "quit" { break; } if input.trim().is_empty() { continue; } // 添加用户消息 messages.push(json!({"role": "user", "content": input})); // 流式响应 let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .stream(true) .send_stream() .await?; print!("AI: "); io::stdout().flush().ok(); let mut response_content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { response_content.push_str(&delta); print!("{}", delta); io::stdout().flush().ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n"); // 将助手响应添加到历史 messages.push(json!({"role": "assistant", "content": response_content})); } println!("再见!"); Ok(()) }
带超时的流式响应
为慢速响应添加超时处理:
use vllm_client::{VllmClient, json, StreamEvent, VllmError}; use futures::StreamExt; use tokio::time::{timeout, Duration}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(300); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一篇关于人工智能的详细论文。"} ])) .stream(true) .send_stream() .await?; let mut content = String::new(); loop { // 每个事件 30 秒超时 match timeout(Duration::from_secs(30), stream.next()).await { Ok(Some(event)) => { match event { StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n流式错误: {}", e); return Err(e.into()); } _ => {} } } Ok(None) => break, Err(_) => { eprintln!("\n等待下一个 token 超时"); break; } } } println!("\n\n生成了 {} 个字符", content.len()); Ok(()) }
收集使用统计
在流式响应过程中追踪 token 使用情况:
use vllm_client::{VllmClient, json, StreamEvent, Usage}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "写一首关于海洋的诗。"} ])) .stream(true) .send_stream() .await?; let mut content = String::new(); let mut usage: Option<Usage> = None; let mut start_time = std::time::Instant::now(); let mut token_count = 0; while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { content.push_str(&delta); token_count += 1; print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Usage(u) => { usage = Some(u); } StreamEvent::Done => break, _ => {} } } let elapsed = start_time.elapsed(); println!("\n"); println!("--- 统计信息 ---"); println!("耗时: {:.2}秒", elapsed.as_secs_f64()); println!("字符数: {}", content.len()); if let Some(usage) = usage { println!("提示词 tokens: {}", usage.prompt_tokens); println!("补全 tokens: {}", usage.completion_tokens); println!("总 tokens: {}", usage.total_tokens); println!("每秒 tokens: {:.2}", usage.completion_tokens as f64 / elapsed.as_secs_f64()); } Ok(()) }
Streaming Completions 示例
本示例演示如何使用旧版 /v1/completions API 进行流式调用。
基础流式 Completions
use vllm_client::{VllmClient, json, CompletionStreamEvent}; use futures::StreamExt; use std::io::Write; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); println!("=== 流式 Completions 示例 ===\n"); println!("模型: Qwen/Qwen2.5-7B-Instruct\n"); println!("提示词: 什么是机器学习?"); println!("\n生成文本: "); let mut stream = client .completions .create() .model("Qwen/Qwen2.5-7B-Instruct") .prompt("什么是机器学习?") .max_tokens(500) .temperature(0.7) .stream(true) .send_stream() .await?; // 处理流式事件 while let Some(event) = stream.next().await { match event { CompletionStreamEvent::Text(delta) => { // 打印文本增量(实时输出) print!("{}", delta); // 刷新缓冲区,实现实时显示 std::io::stdout().flush().ok(); } CompletionStreamEvent::FinishReason(reason) => { println!("\n\n--- 结束原因: {} ---", reason); } CompletionStreamEvent::Usage(usage) => { // 流结束时输出 token 使用统计 println!("\n\n--- Token 使用统计 ---"); println!("提示词 tokens: {}", usage.prompt_tokens); println!("生成 tokens: {}", usage.completion_tokens); println!("总计 tokens: {}", usage.total_tokens); } CompletionStreamEvent::Done => { println!("\n\n=== 生成完成 ==="); break; } CompletionStreamEvent::Error(e) => { eprintln!("\n错误: {}", e); return Err(e.into()); } } } Ok(()) }
与 Chat 流式的区别
| 方面 | Chat Completions | Completions |
|---|---|---|
| 事件类型 | StreamEvent | CompletionStreamEvent |
| 内容变体 | Content(String) | Text(String) |
| 额外事件 | Reasoning, ToolCall | FinishReason |
| 适用场景 | 对话式 | 单提示词 |
何时使用 Completions API
- 简单的单提示词文本生成
- 与 OpenAI API 的旧版兼容
- 不需要聊天消息格式的场景
对于新项目,建议使用 Chat Completions API (client.chat.completions()),它支持消息历史、工具调用和推理输出等更丰富的功能。
工具调用示例
本示例演示如何在 vLLM Client 中使用工具调用(函数调用)。
基础工具调用
定义工具,让模型决定何时调用它们:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 定义可用工具 let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的当前天气", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "城市名称,如:东京、纽约" }, "unit": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "温度单位" } }, "required": ["location"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "东京的天气怎么样?"} ])) .tools(tools) .send() .await?; // 检查模型是否要调用工具 if response.has_tool_calls() { if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { println!("函数: {}", tool_call.name); println!("参数: {}", tool_call.arguments); } } } else { println!("响应: {}", response.content.unwrap_or_default()); } Ok(()) }
完整工具调用流程
执行工具并返回结果以继续对话:
use vllm_client::{VllmClient, json, ToolCall}; use serde::{Deserialize, Serialize}; #[derive(Deserialize)] struct WeatherArgs { location: String, unit: Option<String>, } #[derive(Serialize)] struct WeatherResult { temperature: f32, condition: String, humidity: u32, } // 模拟天气函数 fn get_weather(location: &str, unit: Option<&str>) -> WeatherResult { // 实际代码中,调用真实的天气 API let temp = match location { "Tokyo" => 25.0, "New York" => 20.0, "London" => 15.0, _ => 22.0, }; WeatherResult { temperature: if unit == Some("fahrenheit") { temp * 9.0 / 5.0 + 32.0 } else { temp }, condition: "晴朗".to_string(), humidity: 60, } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的当前天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} }, "required": ["location"] } } } ]); let user_message = "东京和纽约的天气怎么样?"; // 第一次请求 - 模型可能调用工具 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": user_message} ])) .tools(tools.clone()) .send() .await?; if response.has_tool_calls() { // 构建消息历史 let mut messages = vec![ json!({"role": "user", "content": user_message}) ]; // 添加助手的工具调用 messages.push(response.assistant_message()); // 执行每个工具并添加结果 if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { if tool_call.name == "get_weather" { let args: WeatherArgs = tool_call.parse_args_as()?; let result = get_weather(&args.location, args.unit.as_deref()); messages.push(tool_call.result(json!(result))); } } } // 使用工具结果继续对话 let final_response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .tools(tools) .send() .await?; println!("{}", final_response.content.unwrap_or_default()); } else { println!("{}", response.content.unwrap_or_default()); } Ok(()) }
多个工具
为不同目的定义多个工具:
use vllm_client::{VllmClient, json}; use serde::Deserialize; #[derive(Deserialize)] struct SearchArgs { query: String, limit: Option<u32>, } #[derive(Deserialize)] struct CalcArgs { expression: String, } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "web_search", "description": "在网络上搜索信息", "parameters": { "type": "object", "properties": { "query": { "type": "string", "description": "搜索查询" }, "limit": { "type": "integer", "description": "最大结果数" } }, "required": ["query"] } } }, { "type": "function", "function": { "name": "calculate", "description": "执行数学计算", "parameters": { "type": "object", "properties": { "expression": { "type": "string", "description": "要计算的数学表达式,如 '2 + 2 * 3'" } }, "required": ["expression"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "搜索 Rust 编程语言并计算 42 * 17"} ])) .tools(tools) .send() .await?; if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { match tool_call.name.as_str() { "web_search" => { let args: SearchArgs = tool_call.parse_args_as()?; println!("搜索: {} (限制: {:?})", args.query, args.limit); } "calculate" => { let args: CalcArgs = tool_call.parse_args_as()?; println!("计算: {}", args.expression); } _ => println!("未知工具: {}", tool_call.name), } } } Ok(()) }
流式工具调用
实时流式传输工具调用更新:
use vllm_client::{VllmClient, json, StreamEvent, ToolCall}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "东京、巴黎和伦敦的天气怎么样?"} ])) .tools(tools) .stream(true) .send_stream() .await?; let mut tool_calls: Vec<ToolCall> = Vec::new(); let mut content = String::new(); println!("流式响应:\n"); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); } StreamEvent::ToolCallDelta { index, id, name, arguments } => { println!("[工具 {}] {} - 部分参数: {}", index, name, arguments); } StreamEvent::ToolCallComplete(tool_call) => { println!("[工具完成] {}({})", tool_call.name, tool_call.arguments); tool_calls.push(tool_call); } StreamEvent::Done => { println!("\n--- 流式完成 ---"); break; } StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!("\n收集到 {} 个工具调用", tool_calls.len()); for (i, tc) in tool_calls.iter().enumerate() { println!(" {}. {}({})", i + 1, tc.name, tc.arguments); } Ok(()) }
多轮工具调用
处理多轮工具调用:
use vllm_client::{VllmClient, json, VllmError}; use serde_json::Value; async fn run_agent( client: &VllmClient, user_message: &str, tools: &Value, max_rounds: usize, ) -> Result<String, VllmError> { let mut messages = vec![ json!({"role": "user", "content": user_message}) ]; for round in 0..max_rounds { println!("--- 第 {} 轮 ---", round + 1); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(&messages)) .tools(tools.clone()) .send() .await?; if response.has_tool_calls() { // 添加包含工具调用的助手消息 messages.push(response.assistant_message()); // 执行工具并添加结果 if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { println!("调用: {}({})", tool_call.name, tool_call.arguments); // 执行工具 let result = execute_tool(&tool_call.name, &tool_call.arguments); println!("结果: {}", result); // 将工具结果添加到消息 messages.push(tool_call.result(result)); } } } else { // 没有更多工具调用,返回最终响应 return Ok(response.content.unwrap_or_default()); } } Err(VllmError::Other("超过最大轮数".to_string())) } fn execute_tool(name: &str, args: &str) -> Value { // 在这里实现工具执行逻辑 match name { "get_weather" => json!({"temperature": 22, "condition": "晴朗"}), "web_search" => json!({"results": ["结果1", "结果2"]}), _ => json!({"error": "未知工具"}), } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } }, { "type": "function", "function": { "name": "web_search", "description": "在网络上搜索", "parameters": { "type": "object", "properties": { "query": {"type": "string"} }, "required": ["query"] } } } ]); let result = run_agent( &client, "东京的天气怎么样?并查找关于樱花的信息", &tools, 5 ).await?; println!("\n最终答案: {}", result); Ok(()) }
工具选择选项
控制工具选择行为:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); // 选项 1: 让模型决定(默认) let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "你好!"} ])) .tools(tools.clone()) .tool_choice(json!("auto")) .send() .await?; // 选项 2: 禁止工具使用 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "东京的天气怎么样?"} ])) .tools(tools.clone()) .tool_choice(json!("none")) .send() .await?; // 选项 3: 强制使用工具 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "我需要天气信息"} ])) .tools(tools.clone()) .tool_choice(json!("required")) .send() .await?; // 选项 4: 强制使用特定工具 let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "查看东京天气"} ])) .tools(tools.clone()) .tool_choice(json!({ "type": "function", "function": {"name": "get_weather"} })) .send() .await?; Ok(()) }
错误处理
优雅地处理工具执行错误:
use vllm_client::{VllmClient, json, ToolCall}; use serde_json::Value; fn execute_tool_safely(tool_call: &ToolCall) -> Value { match tool_call.name.as_str() { "get_weather" => { // 安全地解析参数 match tool_call.parse_args() { Ok(args) => { // 执行工具 match get_weather_internal(&args) { Ok(result) => json!({"success": true, "data": result}), Err(e) => json!({"success": false, "error": e.to_string()}), } } Err(e) => json!({ "success": false, "error": format!("无效参数: {}", e) }), } } _ => json!({ "success": false, "error": format!("未知工具: {}", tool_call.name) }), } } fn get_weather_internal(args: &Value) -> Result<Value, String> { let location = args["location"].as_str() .ok_or("location 是必需的")?; // 模拟 API 调用 Ok(json!({ "location": location, "temperature": 22, "condition": "晴朗" })) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "获取指定地点的天气", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "天气怎么样?"} ])) .tools(tools) .send() .await?; if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { let result = execute_tool_safely(tool_call); println!("工具结果: {}", result); } } Ok(()) }
多模态示例
多模态功能允许你将图像和其他媒体类型与文本一起发送给模型。
概述
vLLM 通过 OpenAI 兼容的 API 支持多模态输入。你可以使用 base64 编码或 URL 在聊天消息中包含图像。
基础图像输入(Base64)
发送 base64 编码的图像:
use vllm_client::{VllmClient, json}; use base64::{Engine as _, engine::general_purpose}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 读取并编码图像 let image_data = std::fs::read("image.png")?; let base64_image = general_purpose::STANDARD.encode(&image_data); let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") // 视觉模型 .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "这张图片里有什么?" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .max_tokens(512) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
使用 URL 引用图像
通过 URL 引用图像:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "详细描述这张图片。" }, { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } } ] } ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
图像消息辅助函数
创建可复用的图像消息辅助函数:
use vllm_client::{VllmClient, json}; use serde_json::Value; fn image_message(text: &str, image_path: &str) -> Result<Value, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let image_data = std::fs::read(image_path)?; let base64_image = general_purpose::STANDARD.encode(&image_data); // 根据扩展名检测图像类型 let mime_type = match image_path.to_lowercase().rsplit('.').next() { Some("png") => "image/png", Some("jpg") | Some("jpeg") => "image/jpeg", Some("gif") => "image/gif", Some("webp") => "image/webp", _ => "image/png", }; Ok(json!({ "role": "user", "content": [ { "type": "text", "text": text }, { "type": "image_url", "image_url": { "url": format!("data:{};base64,{}", mime_type, base64_image) } } ] })) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let user_msg = image_message("这张图片里有什么?", "photo.jpg")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([user_msg])) .max_tokens(1024) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
多图像处理
在单个请求中发送多张图像:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 读取并编码多张图像 let image1 = encode_image("image1.png")?; let image2 = encode_image("image2.png")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "比较这两张图片。它们有什么不同?" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", image1) } }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", image2) } } ] } ])) .max_tokens(1024) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
带图像的流式响应
对图像查询进行流式响应:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let base64_image = encode_image("chart.png")?; let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "分析这个图表并解释趋势。" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { if let StreamEvent::Content(delta) = event { print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } } println!(); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
带图像的多轮对话
在对话中保持图像上下文:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let base64_image = encode_image("screenshot.png")?; // 第一条带图像的消息 let messages = json!([ { "role": "user", "content": [ {"type": "text", "text": "这个截图里有什么?"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ]); let response1 = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(messages.clone()) .send() .await?; // clone 避免移出 response1.content,后面还要再用一次 println!("第一次响应: {}", response1.content.clone().unwrap_or_default()); // 继续对话(消息历史中保留原图像) let messages2 = json!([ { "role": "user", "content": [ {"type": "text", "text": "这个截图里有什么?"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] }, { "role": "assistant", "content": response1.content.unwrap_or_default() }, { "role": "user", "content": "你能翻译图片中的文本吗?" } ]); let response2 = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(messages2) .send() .await?; println!("\n第二次响应: {}", response2.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
OCR 和文档分析
使用视觉模型进行 OCR 和文档分析:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let document_image = encode_image("document.png")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "system", "content": "你是一个 OCR 助手。准确提取图像中的文本并正确格式化。" }, { "role": "user", "content": [ { "type": "text", "text": "从这个文档图像中提取所有文本。尽可能保留格式。" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", document_image) } } ] } ])) .max_tokens(2048) .send() .await?; println!("提取的文本:\n{}", response.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
图像大小考虑
正确处理大图像:
use vllm_client::{VllmClient, json}; fn encode_and_resize_image(path: &str, max_size: u32) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; use image::ImageReader; // 加载并调整图像大小 let img = ImageReader::open(path)?.decode()?; let img = img.resize(max_size, max_size, image::imageops::FilterType::Lanczos3); // 转换为 PNG let mut buffer = std::io::Cursor::new(Vec::new()); img.write_to(&mut buffer, image::ImageFormat::Png)?; Ok(general_purpose::STANDARD.encode(&buffer.into_inner())) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // 调整大小到最大 1024px,保持宽高比 let base64_image = encode_and_resize_image("large_image.jpg", 1024)?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "描述这张图片。"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
支持的模型
对于多模态输入,请使用支持视觉的模型:
| 模型 | 描述 |
|---|---|
| Qwen/Qwen2-VL-7B-Instruct | Qwen2 视觉语言模型 |
| Qwen/Qwen2-VL-72B-Instruct | Qwen2 视觉语言大模型 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | Llama 3.2 视觉模型 |
使用以下命令检查 vLLM 服务器的可用模型:
curl http://localhost:8000/v1/models
必需的依赖
对于图像处理,添加以下依赖:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
base64 = "0.22"
image = "0.25" # 可选,用于图像处理
故障排除
图像过大
如果遇到图像大小错误,请减小图像尺寸:
#![allow(unused)] fn main() { // 发送前调整大小 let img = image::load_from_memory(&image_data)?; let resized = img.resize(1024, 1024, image::imageops::FilterType::Lanczos3); }
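image 的 resize 会在给定上限内保持宽高比。缩放后的目标尺寸可以用下面的纯函数估算(仅为示意,取整细节未必与 image 库完全一致):

```rust
/// 在不超过 (max_w, max_h) 的前提下按比例缩放图像尺寸
fn fit_within(w: u32, h: u32, max_w: u32, max_h: u32) -> (u32, u32) {
    // 取宽、高两个方向上较小的缩放比,保证两边都不超限
    let ratio = (max_w as f64 / w as f64).min(max_h as f64 / h as f64);
    (
        ((w as f64 * ratio).round() as u32).max(1),
        ((h as f64 * ratio).round() as u32).max(1),
    )
}

fn main() {
    // 2048x1536 的照片限制在 1024x1024 内
    println!("{:?}", fit_within(2048, 1536, 1024, 1024)); // (1024, 768)
}
```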
不支持的格式
将图像转换为支持的格式:
#![allow(unused)] fn main() { // 转换为 PNG let img = image::load_from_memory(&image_data)?; let mut output = Vec::new(); img.write_to(&mut std::io::Cursor::new(&mut output), image::ImageFormat::Png)?; }
模型不支持视觉
确保使用支持视觉的模型。向非视觉模型发送图像输入通常会直接报错,部分实现也可能静默忽略图像。
高级主题
本文档介绍 vLLM Client 的高级功能和用法。
思考模式
某些模型(如 Qwen-3)支持"思考模式",可以输出推理过程。
启用思考模式
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("qwen-3") .messages(json!([ {"role": "user", "content": "请解释什么是递归"} ])) .extra(json!({ "chat_template_kwargs": { "enable_thinking": true } })) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { match &event { // 思考/推理内容 StreamEvent::Reasoning(delta) => { print!("[思考] {}", delta); } // 常规回复内容 StreamEvent::Content(delta) => { print!("{}", delta); } _ => {} } } }
思考内容格式
在思考模式下,模型的输出分为两部分:
| 事件类型 | 描述 |
|---|---|
| StreamEvent::Reasoning | 模型的推理/思考过程 |
| StreamEvent::Content | 最终的回复内容 |
思考内容通常包含在 <think> 标签中,客户端会自动解析。
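这个解析过程大致可以这样示意(简化实现,仅用于说明原理,并非库的真实代码):

```rust
/// 从模型输出中分离 <think> 推理内容与最终回复(简化示意)
fn split_thinking(output: &str) -> (Option<String>, String) {
    if let (Some(start), Some(end)) = (output.find("<think>"), output.find("</think>")) {
        if start < end {
            let reasoning = output[start + "<think>".len()..end].trim().to_string();
            let content = output[end + "</think>".len()..].trim().to_string();
            return (Some(reasoning), content);
        }
    }
    // 没有 <think> 标签时,全部视为最终回复
    (None, output.trim().to_string())
}

fn main() {
    let raw = "<think>先算乘法:15*23=345,再加 47</think>答案是 392。";
    let (reasoning, content) = split_thinking(raw);
    println!("推理: {:?}", reasoning);
    println!("回复: {}", content);
}
```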
禁用思考模式
#![allow(unused)] fn main() { .extra(json!({ "chat_template_kwargs": { "enable_thinking": false } })) }
自定义请求头
如果需要添加自定义请求头(如代理认证、追踪ID等):
#![allow(unused)] fn main() { use vllm_client::VllmClient; let client = VllmClient::new("http://localhost:8000/v1") .with_header("X-Custom-Header", "custom-value") .with_header("X-Request-ID", "req-12345"); }
常见用例
#![allow(unused)] fn main() { // 添加代理认证 let client = VllmClient::new("http://localhost:8000/v1") .with_header("Proxy-Authorization", "Bearer proxy-token"); // 添加追踪ID用于调试 let client = VllmClient::new("http://localhost:8000/v1") .with_header("X-Trace-ID", &uuid::Uuid::new_v4().to_string()); }
超时与重试
设置超时
#![allow(unused)] fn main() { use std::time::Duration; use vllm_client::VllmClient; // 设置60秒超时 let client = VllmClient::new("http://localhost:8000/v1") .with_timeout(Duration::from_secs(60)); // 设置5分钟超时(适用于长文本生成) let client = VllmClient::new("http://localhost:8000/v1") .with_timeout(Duration::from_secs(300)); }
实现重试逻辑
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, VllmError}; use std::time::Duration; use tokio::time::sleep; async fn send_with_retry( client: &VllmClient, messages: serde_json::Value, max_retries: u32, ) -> Result<vllm_client::ChatCompletionResponse, VllmError> { let mut attempts = 0; loop { match client .chat .completions() .create() .model("llama-3-70b") .messages(messages.clone()) .send() .await { Ok(response) => return Ok(response), Err(e) => { attempts += 1; if attempts >= max_retries { return Err(e); } // 指数退避 sleep(Duration::from_millis(100 * 2u64.pow(attempts))).await; } } } } }
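上面的重试间隔按 100ms × 2^attempts 指数增长。把退避计算抽成纯函数便于单独验证(示意):

```rust
use std::time::Duration;

/// 第 attempt 次失败后的退避时长:100ms * 2^attempt
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(100 * 2u64.pow(attempt))
}

fn main() {
    for attempt in 1..=4 {
        println!("第 {} 次失败后等待 {:?}", attempt, backoff_delay(attempt));
    }
}
```

实际使用时通常还会给退避设置上限,并在间隔上叠加随机抖动,避免多个客户端同时重试。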
多模态支持
图像输入
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let client = VllmClient::new("http://localhost:8000/v1"); // 使用图像URL let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "这张图片里有什么?"}, { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } } ] } ])) .send() .await?; // 使用Base64编码图像 let base64_image = "data:image/jpeg;base64,/9j/4AAQ..."; let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "描述这张图片"}, { "type": "image_url", "image_url": {"url": base64_image} } ] } ])) .send() .await?; }
多图像支持
#![allow(unused)] fn main() { let response = client .chat .completions() .create() .model("llava-v1.6") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "比较这两张图片"}, {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}}, {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}} ] } ])) .send() .await?; }
最佳实践
1. 连接池管理
对于高并发场景,建议复用客户端实例:
#![allow(unused)] fn main() { // 推荐:共享客户端实例 use std::sync::Arc; let client = Arc::new(VllmClient::new("http://localhost:8000/v1")); // 在多个任务中使用 let client_clone = client.clone(); tokio::spawn(async move { client_clone.chat.completions().create() .model("llama-3") .messages(json!([{"role": "user", "content": "Hello"}])) .send() .await }); }
2. 错误处理
#![allow(unused)] fn main() { use vllm_client::{VllmClient, VllmError}; match client.chat.completions().create().send().await { Ok(response) => { println!("成功: {:?}", response); } Err(VllmError::ApiError { message, code }) => { eprintln!("API 错误 ({}): {}", code, message); // 根据错误码处理 match code { 429 => println!("被限流,请稍后重试"), 401 => println!("认证失败,检查API密钥"), _ => {} } } Err(e) => { eprintln!("其他错误: {}", e); } } }
3. 流式响应的资源管理
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; let mut stream = client .chat .completions() .create() .model("llama-3") .messages(json!([{"role": "user", "content": "Hello"}])) .stream(true) .send_stream() .await?; // 使用 take 限制处理的事件数量(take 会消费原 stream,只能调用一次) let mut stream = stream.take(1000); while let Some(event) = stream.next().await { match &event { StreamEvent::Content(delta) => print!("{}", delta), StreamEvent::Done | StreamEvent::Error(_) => break, _ => {} } } }
思考模式
思考模式(也称为推理模式)允许模型在给出最终答案之前输出其推理过程。这对于复杂推理任务特别有用。
概述
一些模型,如启用思考模式的 Qwen,可以输出两种类型的内容:
- 推理内容 - 模型的内部"思考"过程
- 内容 - 给用户的最终响应
启用思考模式
Qwen 模型
对于 Qwen 模型,通过 extra 参数启用思考模式:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "计算: 15 * 23 + 47 等于多少?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .send() .await?; }
检查推理内容
在非流式响应中,单独访问推理内容:
#![allow(unused)] fn main() { // 检查推理内容 if let Some(reasoning) = response.reasoning_content { println!("推理: {}", reasoning); } // 获取最终内容 if let Some(content) = response.content { println!("答案: {}", content); } }
带思考模式的流式响应
使用思考模式的最佳方式是流式响应:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "逐步思考: 如果我有 5 个苹果,给朋友 2 个,然后又买了 3 个,我有多少个?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; println!("=== 思考过程 ===\n"); let mut in_thinking = true; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { reasoning.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Content(delta) => { if in_thinking { in_thinking = false; println!("\n\n=== 最终答案 ===\n"); } content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\n错误: {}", e); break; } _ => {} } } println!(); Ok(()) }
Use Cases
Mathematical Reasoning

```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

async fn solve_math_problem(
    client: &VllmClient,
    problem: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a math tutor. Show your work clearly."},
            {"role": "user", "content": problem}
        ]))
        .extra(json!({
            "chat_template_kwargs": { "think_mode": true }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut answer = String::new();
    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            answer.push_str(&delta);
        }
    }
    Ok(answer)
}
```
Code Analysis

```rust
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Analyze this code for potential bugs and security issues:\n\n```rust\nfn process_input(input: &str) -> String {\n    let mut result = String::new();\n    for c in input.chars() {\n        result.push(c);\n    }\n    result\n}\n```"}
    ]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true }
    }))
    .send()
    .await?;
```
Complex Decision-Making

```rust
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "system", "content": "You are a decision-support assistant. Weigh all the options carefully."},
        {"role": "user", "content": "I need to choose between company A (higher salary, long commute) and company B (average salary, remote work). Help me decide."}
    ]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true }
    }))
    .max_tokens(2048)
    .send()
    .await?;
```
Separating Reasoning from the Answer
For applications that need to keep the reasoning separate from the final answer:

```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

struct ThinkingResponse {
    reasoning: String,
    content: String,
}

async fn think_and_respond(
    client: &VllmClient,
    prompt: &str,
) -> Result<ThinkingResponse, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": prompt}
        ]))
        .extra(json!({
            "chat_template_kwargs": { "think_mode": true }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut response = ThinkingResponse {
        reasoning: String::new(),
        content: String::new(),
    };

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => response.reasoning.push_str(&delta),
            StreamEvent::Content(delta) => response.content.push_str(&delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    Ok(response)
}
```
Model Support

| Model | Thinking Mode Support |
|---|---|
| Qwen/Qwen2.5-72B-Instruct | ✅ Supported |
| Qwen/Qwen2.5-32B-Instruct | ✅ Supported |
| Qwen/Qwen2.5-7B-Instruct | ✅ Supported |
| DeepSeek-R1 | ✅ Supported (built in) |
| Other models | ❌ Model-dependent |

Check your vLLM server configuration to verify thinking-mode support.
Configuration Options
Thinking Model Detection
Thinking markers are handled automatically:

```rust
// Reasoning content is parsed out of special markers,
// typically structured as <think>...</think> or a similar format
```
Non-Streaming Access
For non-streaming requests with reasoning:

```rust
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Explain quantum entanglement"}
    ]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true }
    }))
    .send()
    .await?;

// Access the reasoning content (if present)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning:\n{}\n", reasoning);
}

// Access the final answer
println!("Answer:\n{}", response.content.unwrap_or_default());
```
Best Practices
1. Use It for Complex Tasks
Thinking mode is most valuable for:
- Multi-step reasoning
- Math problems
- Code analysis
- Complex decision-making

```rust
// Good: a complex reasoning task
.messages(json!([
    {"role": "user", "content": "Solve this: a father is 4 times as old as his son. In 20 years he will only be twice as old. How old are they now?"}
]))

// Less useful: a trivial query
.messages(json!([
    {"role": "user", "content": "What is 2 + 2?"}
]))
```
2. Show Reasoning Selectively
You may want to hide the reasoning in production but show it while debugging:

```rust
let show_reasoning = std::env::var("SHOW_REASONING").is_ok();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => {
            if show_reasoning {
                eprintln!("[thinking] {}", delta);
            }
        }
        StreamEvent::Content(delta) => print!("{}", delta),
        _ => {}
    }
}
```
3. Combine with System Prompts
Use a system prompt to steer the thinking process:

```rust
.messages(json!([
    {
        "role": "system",
        "content": "Think through the problem step by step. Consider multiple approaches before settling on an answer."
    },
    {"role": "user", "content": problem}
]))
```
4. Adjust the Max Token Count
Thinking mode consumes more tokens. Adjust accordingly:

```rust
.max_tokens(4096)  // Account for both the reasoning and the answer
```
Troubleshooting
No Reasoning Content
If you don't see any reasoning content:
- Make sure thinking mode is enabled in the `extra` parameter
- Verify that the model supports thinking mode
- Check the vLLM server configuration

```shell
# Check the vLLM server logs for problems
```
Incomplete Streaming Responses
If a streamed response appears incomplete:

```rust
// Make sure you handle every event type
while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => { /* handle */ },
        StreamEvent::Content(delta) => { /* handle */ },
        StreamEvent::Done => break,
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
            break;
        }
        _ => {}  // Don't forget the other events
    }
}
```

Related Links
Custom Request Headers
This page explains how to use custom HTTP headers with vLLM Client.
Overview
While vLLM Client handles standard authentication via an API key, you may need custom headers for:
- Custom authentication schemes
- Request tracing and debugging
- Rate-limiting identifiers
- Custom metadata
Current Limitations
The current version of vLLM Client does not provide a built-in way to set custom headers, but there are several workarounds.
Workaround: Environment Variables
If your vLLM server accepts configuration through environment variables or specific API parameters:

```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key(std::env::var("MY_API_KEY").unwrap_or_default());
```
Workaround: Extra Parameters
Some custom configuration can be passed through the `extra()` method:

```rust
use vllm_client::{VllmClient, json};

let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .extra(json!({
        "custom_field": "custom_value",
        "request_id": "req-12345"
    }))
    .send()
    .await?;
```
Future Support
Custom header support is planned for a future release. The API will likely look something like:

```rust
// Future API (not yet implemented)
let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("X-Custom-Header", "value")
    .with_header("X-Request-ID", "req-123");
```
Common Use Cases
Tracing Headers
For distributed tracing (once supported):

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Trace-ID", trace_id)
    .header("X-Span-ID", span_id)
    .build();
```
Custom Authentication
For non-standard authentication schemes:

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-API-Key", "custom-key")
    .header("X-Tenant-ID", "tenant-123")
    .build();
```
Request Metadata
Attach metadata for logging or analytics:

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Request-Source", "mobile-app")
    .header("X-User-ID", "user-456")
    .build();
```
Alternative: a Custom HTTP Client
For advanced use cases, you can use the underlying reqwest client directly:

```rust
use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let response = client
        .post("http://localhost:8000/v1/chat/completions")
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer your-api-key")
        .header("X-Custom-Header", "custom-value")
        .json(&json!({
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}]
        }))
        .send()
        .await?;

    let result: serde_json::Value = response.json().await?;
    println!("{:?}", result);
    Ok(())
}
```
Best Practices
1. Prefer Standard Authentication

```rust
// Recommended
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// Avoid custom authentication unless you have to
```
2. Document Your Custom Headers
When using custom headers, document their purpose:

```rust
// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    // For request tracing in logs
    .header("X-Request-ID", &request_id)
    // For multi-tenant identification
    .header("X-Tenant-ID", &tenant_id)
    .build();
```
3. Verify Server Support
Make sure your vLLM server actually accepts and processes custom headers. Some proxies or load balancers strip unknown headers.
Security Considerations
Don't Expose Sensitive Headers
Avoid logging headers that contain sensitive information:

```rust
// Be careful when logging
let auth_header = "Bearer secret-key";
// Never log this directly!
```
Use HTTPS
Always use HTTPS when transmitting sensitive headers:

```rust
// Good
let client = VllmClient::new("https://api.example.com/v1");

// Avoid for sensitive data
let client = VllmClient::new("http://api.example.com/v1");
```
Requesting This Feature
If you need custom header support, please open an issue on GitHub that includes:
- Your use case
- The headers you need
- How you would like the API to be designed
Related Links
Timeouts and Retries
This page covers timeout configuration and retry strategies for building robust production applications.
Setting Timeouts
Client-Level Timeouts
Set a timeout when creating the client:

```rust
use vllm_client::VllmClient;

// Simple timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120);

// Using the builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)  // 5 minutes
    .build();
```
Choosing an Appropriate Timeout

| Use case | Recommended timeout |
|---|---|
| Simple queries | 30-60 seconds |
| Code generation | 2-3 minutes |
| Long document generation | 5-10 minutes |
| Complex reasoning tasks | 10+ minutes |
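As a rough sketch of the table above, a small helper can map a task category to a timeout. The category names and the chosen durations here are illustrative assumptions (the upper end of each range in the table); tune them to your workload:

```rust
use std::time::Duration;

/// Task categories mirroring the timeout table above (hypothetical names).
enum TaskKind {
    SimpleQuery,
    CodeGeneration,
    LongDocument,
    ComplexReasoning,
}

/// Map a task kind to a conservative timeout (upper end of each range).
fn recommended_timeout(kind: &TaskKind) -> Duration {
    match kind {
        TaskKind::SimpleQuery => Duration::from_secs(60),
        TaskKind::CodeGeneration => Duration::from_secs(3 * 60),
        TaskKind::LongDocument => Duration::from_secs(10 * 60),
        TaskKind::ComplexReasoning => Duration::from_secs(20 * 60),
    }
}

fn main() {
    assert_eq!(recommended_timeout(&TaskKind::SimpleQuery).as_secs(), 60);
    println!(
        "code generation timeout: {:?}",
        recommended_timeout(&TaskKind::CodeGeneration)
    );
}
```

The resulting `Duration` can then be converted to seconds and passed to `timeout_secs()`.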
Factors Affecting Request Duration
How long a request takes depends on:
- Prompt length - longer prompts take more processing time
- Output token count - more tokens mean longer generation
- Model size - larger models are slower
- Server load - a busy server responds more slowly
Timeout Errors
Handling Timeouts

```rust
use vllm_client::{VllmClient, json, VllmError};

async fn chat_with_timeout(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(60);

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await;

    match result {
        Ok(response) => Ok(response.content.unwrap_or_default()),
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out after 60 seconds");
            Err(VllmError::Timeout)
        }
        Err(e) => Err(e),
    }
}
```
Retry Strategies
Basic Retries
Retry failed requests with exponential backoff:

```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn send_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut attempts = 0;
    loop {
        match client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
        {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && attempts < max_retries => {
                attempts += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(attempts - 1));
                eprintln!("Retry {} after {:?}: {}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
Retries with Jitter
Add jitter to avoid thundering-herd effects:

```rust
use rand::Rng;
use std::time::Duration;
use tokio::time::sleep;

fn backoff_with_jitter(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    let exponential = base_ms * 2u64.pow(attempt);
    let jitter = rand::thread_rng().gen_range(0..base_ms);
    let delay = (exponential + jitter).min(max_ms);
    Duration::from_millis(delay)
}

async fn retry_with_jitter<F, T, E>(
    mut f: F,
    max_retries: u32,
) -> Result<T, E>
where
    F: FnMut() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, E>> + Send>>,
    E: std::fmt::Debug,
{
    let mut attempts = 0;
    loop {
        match f().await {
            Ok(result) => return Ok(result),
            Err(e) if attempts < max_retries => {
                attempts += 1;
                let delay = backoff_with_jitter(attempts, 100, 10_000);
                eprintln!("Retry {} after {:?}: {:?}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
Retry Only Retryable Errors
Not every error should be retried:

```rust
use vllm_client::{VllmClient, json, VllmError};

async fn smart_retry(
    client: &VllmClient,
    prompt: &str,
) -> Result<String, VllmError> {
    let mut attempts = 0;
    let max_retries = 3;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => return Ok(response.content.unwrap_or_default()),
            Err(e) => {
                // Check whether the error is retryable
                if !e.is_retryable() {
                    return Err(e);
                }
                if attempts >= max_retries {
                    return Err(e);
                }
                attempts += 1;
                tokio::time::sleep(std::time::Duration::from_secs(2u64.pow(attempts))).await;
            }
        }
    }
}
```
Retryable Errors

| Error | Retryable | Reason |
|---|---|---|
| Timeout | Yes | The server may just be slow |
| 429 Rate limited | Yes | Retry after waiting |
| 500 Server error | Yes | Transient server problem |
| 502 Bad gateway | Yes | The server may be restarting |
| 503 Service unavailable | Yes | Temporary overload |
| 504 Gateway timeout | Yes | The upstream timed out |
| 400 Bad request | No | Client-side error |
| 401 Unauthorized | No | Authentication problem |
| 404 Not found | No | Resource does not exist |
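The table above can be expressed as a small status-code predicate. This is only a sketch of the same classification (the client itself exposes this decision through `VllmError::is_retryable()`, whose internals may differ):

```rust
/// Decide whether an HTTP status code is worth retrying,
/// following the retryable-errors table above.
fn is_retryable_status(status: u16) -> bool {
    match status {
        429 | 500 | 502 | 503 | 504 => true, // transient: rate limit / server-side trouble
        400 | 401 | 404 => false,            // client errors: retrying won't help
        _ => status >= 500,                  // default: retry other 5xx, never 4xx
    }
}

fn main() {
    assert!(is_retryable_status(503));
    assert!(!is_retryable_status(401));
    println!("502 retryable: {}", is_retryable_status(502));
}
```

Timeouts are a transport-level condition rather than a status code, so they are handled by matching `VllmError::Timeout` as shown earlier.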
Circuit Breaker Pattern
Use a circuit breaker to prevent cascading failures:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct CircuitBreaker {
    failures: AtomicU32,
    last_failure: Mutex<Option<Instant>>,
    threshold: u32,
    reset_duration: Duration,
}

impl CircuitBreaker {
    fn new(threshold: u32, reset_duration: Duration) -> Self {
        Self {
            failures: AtomicU32::new(0),
            last_failure: Mutex::new(None),
            threshold,
            reset_duration,
        }
    }

    fn can_attempt(&self) -> bool {
        let failures = self.failures.load(Ordering::Relaxed);
        if failures < self.threshold {
            return true;
        }
        let last = self.last_failure.lock().unwrap();
        if let Some(time) = *last {
            if time.elapsed() > self.reset_duration {
                // Reset the breaker
                self.failures.store(0, Ordering::Relaxed);
                return true;
            }
        }
        false
    }

    fn record_success(&self) {
        self.failures.store(0, Ordering::Relaxed);
    }

    fn record_failure(&self) {
        self.failures.fetch_add(1, Ordering::Relaxed);
        *self.last_failure.lock().unwrap() = Some(Instant::now());
    }
}
```
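To illustrate how such a breaker behaves over time, here is a condensed, self-contained simulation. It is single-threaded, so plain fields stand in for the atomics, and the threshold and reset window are arbitrary demo values:

```rust
use std::time::{Duration, Instant};

/// Minimal single-threaded variant of the circuit breaker above.
struct SimpleBreaker {
    failures: u32,
    last_failure: Option<Instant>,
    threshold: u32,
    reset_duration: Duration,
}

impl SimpleBreaker {
    fn new(threshold: u32, reset_duration: Duration) -> Self {
        Self { failures: 0, last_failure: None, threshold, reset_duration }
    }

    fn can_attempt(&mut self) -> bool {
        if self.failures < self.threshold {
            return true;
        }
        // Open state: allow attempts again only after the reset window elapses.
        if let Some(t) = self.last_failure {
            if t.elapsed() > self.reset_duration {
                self.failures = 0;
                return true;
            }
        }
        false
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        self.last_failure = Some(Instant::now());
    }
}

fn main() {
    let mut breaker = SimpleBreaker::new(3, Duration::from_millis(50));
    // Three consecutive failures trip the breaker...
    for _ in 0..3 {
        assert!(breaker.can_attempt());
        breaker.record_failure();
    }
    assert!(!breaker.can_attempt()); // ...so further attempts are rejected.
    // After the reset window, attempts are allowed again.
    std::thread::sleep(Duration::from_millis(60));
    assert!(breaker.can_attempt());
    println!("breaker closed again after the reset window");
}
```

In a real client you would call `can_attempt()` before each request and `record_success()` / `record_failure()` after it, returning early with an error while the breaker is open.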
Timeouts for Streaming Responses
Handle timeouts while a response is streaming:

```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use tokio::time::{timeout, Duration};

async fn stream_with_timeout(
    client: &VllmClient,
    prompt: &str,
    per_event_timeout: Duration,
) -> Result<String, vllm_client::VllmError> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    loop {
        match timeout(per_event_timeout, stream.next()).await {
            Ok(Some(event)) => match event {
                StreamEvent::Content(delta) => content.push_str(&delta),
                StreamEvent::Done => break,
                StreamEvent::Error(e) => return Err(e),
                _ => {}
            },
            Ok(None) => break,
            Err(_) => {
                return Err(vllm_client::VllmError::Timeout);
            }
        }
    }
    Ok(content)
}
```
Rate Limiting
Implement client-side rate limiting:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

struct RateLimitedClient {
    client: vllm_client::VllmClient,
    semaphore: Arc<Semaphore>,
}

impl RateLimitedClient {
    fn new(base_url: &str, max_concurrent: usize) -> Self {
        Self {
            client: vllm_client::VllmClient::new(base_url),
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, vllm_client::VllmError> {
        let _permit = self.semaphore.acquire().await.unwrap();
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(vllm_client::json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }
}
```
Production Configuration
Complete Example

```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

struct RobustClient {
    client: VllmClient,
    max_retries: u32,
    base_backoff_ms: u64,
    max_backoff_ms: u64,
}

impl RobustClient {
    fn new(base_url: &str, timeout_secs: u64) -> Self {
        Self {
            client: VllmClient::builder()
                .base_url(base_url)
                .timeout_secs(timeout_secs)
                .build(),
            max_retries: 3,
            base_backoff_ms: 100,
            max_backoff_ms: 10_000,
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, VllmError> {
        let mut attempts = 0;
        loop {
            match self.send_request(prompt).await {
                Ok(response) => return Ok(response),
                Err(e) if self.should_retry(&e, attempts) => {
                    attempts += 1;
                    let delay = self.calculate_backoff(attempts);
                    eprintln!("Retry {} after {:?}: {}", attempts, delay, e);
                    sleep(delay).await;
                }
                Err(e) => return Err(e),
            }
        }
    }

    async fn send_request(&self, prompt: &str) -> Result<String, VllmError> {
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }

    fn should_retry(&self, error: &VllmError, attempts: u32) -> bool {
        attempts < self.max_retries && error.is_retryable()
    }

    fn calculate_backoff(&self, attempt: u32) -> Duration {
        let delay = self.base_backoff_ms * 2u64.pow(attempt);
        Duration::from_millis(delay.min(self.max_backoff_ms))
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = RobustClient::new("http://localhost:8000/v1", 300);
    match client.chat("Hello!").await {
        Ok(response) => println!("Response: {}", response),
        Err(e) => eprintln!("Still failing after retries: {}", e),
    }
    Ok(())
}
```
Best Practices
- Set appropriate timeouts based on expected response times
- Use exponential backoff to avoid overwhelming the server
- Add jitter to prevent thundering-herd problems
- Retry only retryable errors - don't retry client errors
- Implement a circuit breaker for production systems
- Log retry attempts for debugging and monitoring
- Cap the number of retries to avoid infinite loops
Related Links
Contributing Guide
Thank you for your interest in contributing to vLLM Client! This document provides guidelines and instructions for contributing.
Table of Contents
Code of Conduct
Please be respectful and inclusive. We welcome contributions from everyone.
Getting Started
- Fork the repository on GitHub
- Clone your fork locally
- Create a branch for your changes

```shell
git clone https://github.com/YOUR_USERNAME/vllm-client.git
cd vllm-client
git checkout -b my-feature
```
Development Setup
Prerequisites
- Rust 1.70 or later
- Cargo (installed with Rust)
- A vLLM server for integration tests (optional)
Building

```shell
# Build the library
cargo build

# Build with all features
cargo build --all-features
```
Running Tests

```shell
# Run the unit tests
cargo test

# Run tests with output shown
cargo test -- --nocapture

# Run a specific test
cargo test test_name

# Run the integration tests (requires a vLLM server)
cargo test --test integration
```
Making Changes
Branch Naming
Use descriptive branch names:
- `feature/add-new-feature` - for new features
- `fix/bug-description` - for bug fixes
- `docs/documentation-update` - for documentation changes
- `refactor/code-cleanup` - for refactoring
Commit Messages
Follow the Conventional Commits format:

```
type(scope): description

[optional body]

[optional footer]
```

Types:
- `feat`: a new feature
- `fix`: a bug fix
- `docs`: documentation changes
- `style`: code style changes (formatting, etc.)
- `refactor`: code refactoring
- `test`: adding or updating tests
- `chore`: maintenance tasks
Examples:

```
feat(client): add connection pooling support
fix(streaming): handle empty chunks correctly
docs(api): update the streaming documentation
```
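As an illustration of the header format above, a minimal check can be sketched in a few lines. This is a hypothetical helper, not part of the project's tooling, and it only validates the `type(scope): description` header line:

```rust
/// Check that a commit header looks like "type(scope): description"
/// or "type: description", with one of the types listed above.
fn is_conventional(header: &str) -> bool {
    const TYPES: [&str; 7] = ["feat", "fix", "docs", "style", "refactor", "test", "chore"];
    // Split into the "type(scope)" prefix and the description.
    let Some((prefix, description)) = header.split_once(": ") else {
        return false;
    };
    if description.is_empty() {
        return false;
    }
    // Strip an optional "(scope)" suffix from the type.
    let ty = match prefix.split_once('(') {
        Some((ty, rest)) if rest.ends_with(')') => ty,
        Some(_) => return false,
        None => prefix,
    };
    TYPES.contains(&ty)
}

fn main() {
    assert!(is_conventional("feat(client): add connection pooling"));
    assert!(is_conventional("docs: update streaming docs"));
    assert!(!is_conventional("added some stuff"));
    println!("all headers checked");
}
```

A check like this could run in a git `commit-msg` hook, though established tools for this already exist.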
Testing
Unit Tests
All new functionality should come with unit tests:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_new_feature() {
        // Test implementation
    }
}
```
Integration Tests
Integration tests live in the `tests/` directory:

```rust
// tests/integration_test.rs
use vllm_client::{VllmClient, json};

#[tokio::test]
async fn test_chat_completion() {
    let client = VllmClient::new("http://localhost:8000/v1");
    // ... test code
}
```
Test Coverage
We aim for good test coverage. Generate a coverage report with:

```shell
cargo tarpaulin --out Html
```
Documentation
Code Documentation
Document all public APIs with doc comments:

```rust
/// Creates a new chat completion request.
///
/// # Arguments
///
/// * `model` - The model name to use for generation
///
/// # Returns
///
/// A new `ChatCompletionsRequest` builder
///
/// # Example
///
/// ```rust
/// use vllm_client::{VllmClient, json};
///
/// let client = VllmClient::new("http://localhost:8000/v1");
/// let response = client.chat.completions().create()
///     .model("Qwen/Qwen2.5-7B-Instruct")
///     .messages(json!([{"role": "user", "content": "Hello"}]))
///     .send()
///     .await?;
/// ```
pub fn create(&self) -> ChatCompletionsRequest {
    // Implementation
}
```
Updating Documentation
When adding a new feature:
- Update the inline documentation
- Update the API reference in `docs/src/api/`
- Add an example in `docs/src/examples/`
- Update the changelog
Building the Documentation

```shell
# Build and preview the documentation
cd docs && mdbook serve --open
```
Pull Request Process
- Update the documentation: make sure the docs reflect your changes
- Add tests: include tests for new functionality
- Run the tests: make sure all tests pass
- Format the code: run `cargo fmt`
- Check the lints: run `cargo clippy`
- Update the CHANGELOG: add an entry to the changelog
Pre-PR Checklist

```shell
# Format the code
cargo fmt

# Check the lints
cargo clippy -- -D warnings

# Run all tests
cargo test

# Build the documentation
mdbook build docs
mdbook build docs/zh
```
Submitting the PR
- Push your branch to your fork
- Open a PR against the `main` branch
- Fill out the PR template
- Wait for review
PR Template

```markdown
## Description

A brief description of the changes

## Type of Change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed

## Checklist

- [ ] Code formatted with `cargo fmt`
- [ ] No clippy warnings
- [ ] Documentation updated
- [ ] Changelog updated
```
Coding Standards
Rust Style
Follow standard Rust conventions:
- Format with `cargo fmt`
- Resolve all `clippy` warnings
- Follow the Rust API Guidelines
Naming Conventions
- Types: PascalCase (`ChatCompletionResponse`)
- Functions/methods: snake_case (`send_stream`)
- Constants: SCREAMING_SNAKE_CASE (`MAX_RETRIES`)
- Modules: snake_case (`chat`, `completions`)
Error Handling
Use `VllmError` for all errors:

```rust
// Good
pub fn parse_response(data: &str) -> Result<Response, VllmError> {
    serde_json::from_str(data).map_err(VllmError::Json)
}

// Avoid
pub fn parse_response(data: &str) -> Result<Response, String> {
    // ...
}
```
Async Code
Use async/await for all asynchronous operations:

```rust
// Good
pub async fn send(&self) -> Result<Response, VllmError> {
    let response = self.http.post(&url).send().await?;
    // ...
}

// Avoid blocking in async contexts
pub async fn bad_example(&self) -> Result<Response, VllmError> {
    std::thread::sleep(Duration::from_secs(1)); // Don't do this
    // ...
}
```
Project Structure

```
vllm-client/
├── src/
│   ├── lib.rs           # Library entry point
│   ├── client.rs        # Client implementation
│   ├── chat.rs          # Chat API
│   ├── completions.rs   # Legacy completions
│   ├── types.rs         # Type definitions
│   └── error.rs         # Error types
├── tests/
│   └── integration/     # Integration tests
├── docs/
│   ├── src/             # English documentation
│   └── zh/src/          # Chinese documentation
├── examples/
│   └── *.rs             # Example programs
└── Cargo.toml
```
Getting Help
- Open an issue for bugs or feature requests
- Start a discussion for questions
- Check existing issues before creating a new one
License
By contributing, you agree that your contributions will be licensed under MIT OR Apache-2.0.
Acknowledgments
Contributors are recognized in our README and release notes.
Thank you for contributing to vLLM Client!
Changelog
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.1.0 - 2024-01-XX
Added
- Initial release of vLLM Client
- `VllmClient` for connecting to vLLM servers
- Chat completions API (`client.chat.completions()`)
- Streaming support via `MessageStream`
- Tool/function calling support
- Reasoning/thinking mode support for compatible models
- Error handling via the `VllmError` enum
- Builder pattern for client configuration
- Request builder pattern for chat completions
- vLLM-specific parameters via `extra()`
- Token usage tracking in responses
- Timeout configuration
- API key authentication
Features
Client
- `VllmClient::new(base_url)` - create a new client
- `VllmClient::builder()` - create a client with the builder pattern
- `with_api_key()` - set the API key for authentication
- `timeout_secs()` - set the request timeout
Chat Completions
- `model()` - set the model name
- `messages()` - set the conversation messages
- `temperature()` - set the sampling temperature
- `max_tokens()` - set the maximum number of output tokens
- `top_p()` - set the nucleus sampling parameter
- `top_k()` - set top-k sampling (vLLM extension)
- `stop()` - set stop sequences
- `stream()` - enable streaming mode
- `tools()` - define the available tools
- `tool_choice()` - control tool selection
- `extra()` - pass vLLM-specific parameters
Streaming
- `StreamEvent::Content` - content tokens
- `StreamEvent::Reasoning` - reasoning content (thinking models)
- `StreamEvent::ToolCallDelta` - streamed tool call updates
- `StreamEvent::ToolCallComplete` - complete tool calls
- `StreamEvent::Usage` - token usage statistics
- `StreamEvent::Done` - stream finished
- `StreamEvent::Error` - error events
Response Types
- `ChatCompletionResponse` - chat completion response
- `ToolCall` - tool call data with parsing helpers
- `Usage` - token usage statistics
Dependencies
- `reqwest` - HTTP client
- `serde` / `serde_json` - JSON serialization
- `tokio` - async runtime
- `thiserror` - error handling
[Unreleased]
Planned
- Custom HTTP header support
- Connection pool configuration
- Request/response logging
- Retry middleware
- Multimodal input helpers
- Async iterators for batch processing
- OpenTelemetry integration
- WebSocket transport
Version History

| Version | Date | Highlights |
|---|---|---|
| 0.1.0 | 2024-01 | Initial release |