vLLM Client

A Rust client library for the vLLM API, compatible with the OpenAI interface.

Features

  • OpenAI-compatible: uses the same API structure as OpenAI, making migration easy
  • Streaming responses: full support for Server-Sent Events (SSE) streaming
  • Tool calling: supports function/tool calls, including streamed incremental updates
  • Reasoning models: built-in support for reasoning/thinking models (e.g. Qwen with thinking mode enabled)
  • Async: fully asynchronous implementation on the Tokio runtime
  • Type safety: strongly typed definitions serialized with Serde

Quick Start

Add to your Cargo.toml:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }

Basic Usage

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, world!"}
        ]))
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or_default());
    Ok(())
}

Documentation

Languages

  • English - English documentation
  • 中文 - Chinese documentation

License

Licensed under either the Apache License, Version 2.0 or the MIT License, at your option.

Getting Started

Installation

Add vllm-client to your Cargo.toml:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }

Quick Start

Basic Chat Completion

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create the client
    let client = VllmClient::new("http://localhost:8000/v1");
    
    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;
    
    // Print the response
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or_default());
    
    Ok(())
}

Streaming Responses

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;
    
    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Reasoning(delta) => print!("{}", delta),
            StreamEvent::Content(delta) => print!("{}", delta),
            _ => {}
        }
    }
    
    println!();
    Ok(())
}

Configuration

API Key

If your vLLM server requires authentication:

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");
}

Custom Timeout

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(std::time::Duration::from_secs(60));
}

Next Steps

Installation

Requirements

  • Rust: 1.70 or later
  • Cargo: installed automatically with Rust

Adding the Dependency

Add the dependency to your Cargo.toml:

[dependencies]
vllm-client = "0.1"

Or run directly:

cargo add vllm-client

Dependencies

This library depends on the tokio async runtime; add it to your Cargo.toml:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }

For convenience, the library re-exports serde_json::json; you may optionally add:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"

Feature Flags

vllm-client currently has no additional feature flags; all functionality is enabled by default.

Verifying the Installation

Write a small program to verify the installation works:

use vllm_client::VllmClient;

fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");
    println!("Client created, base URL: {}", client.base_url());
}

Run:

cargo run

Starting a vLLM Server

Before using this client, start a vLLM server:

# Install vLLM
pip install vllm

# Start the server and load a model
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

Once started, the server exposes its API at http://localhost:8000/v1.

Troubleshooting

Connection Failures

If you hit a connection error, check that:

  1. the vLLM server is actually running
  2. the server address is correct (default: http://localhost:8000/v1)
  3. no firewall is blocking the port
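Independently of the client, a plain TCP probe from the standard library can confirm points 1-3 above (a diagnostic sketch, not part of vllm-client):

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Returns true if a TCP connection to `host:port` succeeds within `timeout`.
fn port_is_open(host: &str, port: u16, timeout: Duration) -> bool {
    match (host, port).to_socket_addrs() {
        Ok(mut addrs) => addrs
            .next()
            .map(|addr| TcpStream::connect_timeout(&addr, timeout).is_ok())
            .unwrap_or(false),
        Err(_) => false,
    }
}

fn main() {
    if port_is_open("localhost", 8000, Duration::from_secs(2)) {
        println!("vLLM port is reachable");
    } else {
        println!("cannot reach localhost:8000 - is the server running?");
    }
}
```

If the probe succeeds but requests still fail, the problem is above TCP (wrong path, auth, or TLS) rather than the network.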

TLS/SSL Errors

If the vLLM server uses a self-signed HTTPS certificate, you will need to handle certificate verification in your code.

Request Timeouts

For long-running requests, increase the timeout:

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes
}

Next Steps

Quick Start

This section walks you through your first API call.

Prerequisites

  • Rust 1.70 or later
  • A running vLLM server

Basic Chat Completion

The simplest usage looks like this:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client pointing at the vLLM server
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how have you been?"}
        ]))
        .send()
        .await?;

    // Print the response content
    println!("Reply: {}", response.content.unwrap_or_default());

    Ok(())
}

Streaming Responses

For real-time output, use streaming mode:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Create a streaming request
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    // Handle stream events
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Reasoning(delta) => eprint!("[thinking: {}]", delta),
            StreamEvent::Done => println!("\n[done]"),
            StreamEvent::Error(e) => eprintln!("\nError: {}", e),
            _ => {}
        }
    }

    Ok(())
}

Using the Builder

When you need more configuration, use the builder:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")  // optional
    .timeout_secs(120)         // optional
    .build();
}

Full Example

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    println!("Reply: {}", response.content.unwrap_or_default());
    
    // Print token usage stats (if present)
    if let Some(usage) = response.usage {
        println!("Token usage: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }

    Ok(())
}

Error Handling

It is good practice to handle errors explicitly:

use vllm_client::{VllmClient, json, VllmError};

async fn chat() -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat().await {
        Ok(text) => println!("Reply: {}", text),
        Err(VllmError::ApiError { status_code, message, .. }) => {
            eprintln!("API error ({}): {}", status_code, message);
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out");
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}

Next Steps

Configuration

This document covers every configuration option of vllm-client.

Client Configuration

Basic Configuration

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
}

Builder Pattern

For more complex configuration, use the builder:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")
    .timeout_secs(120)
    .build();
}

Configuration Options

Base URL

The address of the vLLM server; it must include the /v1 path for OpenAI compatibility.

#![allow(unused)]
fn main() {
// Local development
let client = VllmClient::new("http://localhost:8000/v1");

// Remote server
let client = VllmClient::new("https://api.example.com/v1");

// Trailing slashes are handled automatically
let client = VllmClient::new("http://localhost:8000/v1/");
// Equivalent to: "http://localhost:8000/v1"
}
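The trailing-slash handling amounts to trimming `/` from the end of the URL; a stand-alone sketch (hypothetical helper, not the library's implementation):

```rust
/// Normalize a base URL by stripping any trailing slashes.
fn normalize_base_url(url: &str) -> String {
    url.trim_end_matches('/').to_string()
}

fn main() {
    let normalized = normalize_base_url("http://localhost:8000/v1/");
    println!("normalized: {}", normalized);
}
```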

API Key

If the vLLM server requires authentication, configure an API key:

#![allow(unused)]
fn main() {
// Chained call
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");

// Builder pattern
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .build();
}

The API key is sent as a Bearer token in the Authorization header.
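The resulting header is `Authorization: Bearer <key>`; a one-line sketch of how that value is formed (illustrative only):

```rust
/// Build the Authorization header value for a given API key.
fn bearer_header(api_key: &str) -> String {
    format!("Bearer {}", api_key)
}

fn main() {
    println!("Authorization: {}", bearer_header("sk-your-api-key"));
}
```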

Timeouts

Increase the timeout for long-running tasks:

#![allow(unused)]
fn main() {
// Chained call
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes

// Builder pattern
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)
    .build();
}

By default, the underlying HTTP client's timeout applies (typically 30 seconds).

Request Parameters

When building a request, the following parameters are available:

Model Selection

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
}

Sampling Parameters

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "你好!"}]))
    .temperature(0.7)      // 0.0 - 2.0
    .top_p(0.9)            // 0.0 - 1.0
    .top_k(50)             // vLLM extension parameter
    .max_tokens(1024)      // max output tokens
    .send()
    .await?;
}
Parameter    Type  Range      Description
temperature  f32   0.0 - 2.0  Controls randomness; higher values give more random output
top_p        f32   0.0 - 1.0  Nucleus sampling threshold
top_k        i32   1+         Top-K sampling (vLLM extension)
max_tokens   u32   1+         Maximum number of generated tokens

Stop Sequences

#![allow(unused)]
fn main() {
use serde_json::json;

// Multiple stop sequences
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!(["END", "STOP", "\n\n"]))
    .send()
    .await?;

// A single stop sequence
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!("END"))
    .send()
    .await?;
}

Extra Parameters

vLLM-specific extras can be passed via the extra() method:

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Think about this problem"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        },
        "reasoning_effort": "high"
    }))
    .send()
    .await?;
}

Environment Variables

The client can be configured through environment variables:

#![allow(unused)]
fn main() {
use std::env;
use vllm_client::VllmClient;

let base_url = env::var("VLLM_BASE_URL")
    .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());

let api_key = env::var("VLLM_API_KEY").ok();

let mut client_builder = VllmClient::builder()
    .base_url(&base_url);

if let Some(key) = api_key {
    client_builder = client_builder.api_key(&key);
}

let client = client_builder.build();
}

Common Environment Variables

Variable       Description          Example
VLLM_BASE_URL  vLLM server address  http://localhost:8000/v1
VLLM_API_KEY   API key (optional)   sk-xxx
VLLM_TIMEOUT   Timeout in seconds   300
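VLLM_TIMEOUT arrives as a string, so it needs parsing with a fallback; a small sketch (the helper name is ours, not part of the library):

```rust
/// Parse a timeout in seconds from an optional env value, falling back to a default.
fn parse_timeout(raw: Option<&str>, default_secs: u64) -> u64 {
    raw.and_then(|s| s.parse::<u64>().ok()).unwrap_or(default_secs)
}

fn main() {
    // Read VLLM_TIMEOUT from the environment, defaulting to 30 seconds.
    let secs = parse_timeout(std::env::var("VLLM_TIMEOUT").ok().as_deref(), 30);
    println!("timeout: {}s", secs);
}
```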

Best Practices

Reuse the Client

Create the client once and reuse it:

#![allow(unused)]
fn main() {
// Recommended: reuse the client
let client = VllmClient::new("http://localhost:8000/v1");

for prompt in prompts {
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;
}

// Avoid: creating a client for every request
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // inefficient!
    // ...
}
}

Choosing a Timeout

Pick a timeout that matches the workload:

Scenario              Suggested timeout
Simple Q&A            30 seconds
Complex reasoning     2-5 minutes
Long-form generation  10+ minutes

Error Handling

Always handle errors properly:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, VllmError};

match client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await
{
    Ok(response) => println!("{}", response.content.unwrap()),
    Err(VllmError::Timeout) => eprintln!("Request timed out"),
    Err(VllmError::ApiError { status_code, message, .. }) => {
        eprintln!("API error ({}): {}", status_code, message);
    }
    Err(e) => eprintln!("Error: {}", e),
}
}

Next Steps

API Reference

This document provides a complete reference for the vLLM Client API.

Table of Contents

Client

VllmClient

The main client for interacting with the vLLM API.

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

// Create a new client
let client = VllmClient::new("http://localhost:8000/v1");

// With an API key
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// With a custom timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(std::time::Duration::from_secs(60));
}

Methods

Method                   Description
new(base_url: &str)      Create a new client with the given base URL
with_api_key(key: &str)  Set the API key used for authentication
with_timeout(duration)   Set the request timeout
chat                     Access the chat completions API

Chat Completions

Creating a Completion Request

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]))
    .temperature(0.7)
    .max_tokens(1000)
    .send()
    .await?;
}

Builder Methods

Method              Type   Description
model(name)         &str   Name of the model to use
messages(msgs)      Value  Array of chat messages
temperature(temp)   f32    Sampling temperature (0.0-2.0)
max_tokens(tokens)  u32    Maximum number of generated tokens
top_p(p)            f32    Nucleus sampling parameter
top_k(k)            u32    Top-k sampling parameter
stream(enable)      bool   Enable streaming responses
tools(tools)        Value  Tool definitions for function calling
extra(json)         Value  Extra (vendor-specific) parameters

Response Structure

#![allow(unused)]
fn main() {
pub struct ChatCompletionResponse {
    pub id: String,
    pub object: String,
    pub created: u64,
    pub model: String,
    pub choices: Vec<Choice>,
    pub usage: Usage,
}

pub struct Choice {
    pub index: u32,
    pub message: Message,
    pub finish_reason: Option<String>,
}

pub struct Message {
    pub role: String,
    pub content: Option<String>,
    pub tool_calls: Option<Vec<ToolCall>>,
}

pub struct Usage {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
    pub total_tokens: u32,
}
}

Streaming

Streaming Completions

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let client = VllmClient::new("http://localhost:8000/v1");

let mut stream = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "Write a poem"}
    ]))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match &event {
        StreamEvent::Reasoning(delta) => {
            // Reasoning content (for thinking models)
            print!("{}", delta);
        }
        StreamEvent::Content(delta) => {
            // Regular content
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { tool_call_id, delta } => {
            // Streamed tool-call update
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            // A complete tool call
        }
        StreamEvent::Usage(usage) => {
            // Token usage information
        }
        StreamEvent::Done => {
            // Streaming complete
            break;
        }
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
        }
        }
    }
}
}

StreamEvent Variants

Variant                                Description
Reasoning(String)                      Reasoning/thinking content
Content(String)                        Regular content delta
ToolCallDelta { tool_call_id, delta }  Streamed tool call
ToolCallComplete(ToolCall)             Complete tool call
Usage(Usage)                           Token usage statistics
Done                                   End of stream
Error(VllmError)                       An error occurred

Tool Calling

Defining Tools

#![allow(unused)]
fn main() {
use vllm_client::json;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);

let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// Handle tool calls
if let Some(tool_calls) = &response.choices[0].message.tool_calls {
    for tool_call in tool_calls {
        println!("Function: {}", tool_call.function.name);
        println!("Arguments: {}", tool_call.function.arguments);
    }
}
}

ToolCall Structure

#![allow(unused)]
fn main() {
pub struct ToolCall {
    pub id: String,
    pub r#type: String,
    pub function: FunctionCall,
}

pub struct FunctionCall {
    pub name: String,
    pub arguments: String, // JSON string
}
}

Returning Tool Results

#![allow(unused)]
fn main() {
// After executing the tool, return its result
let response = client
    .chat
    .completions()
    .create()
    .model("llama-3-70b")
    .messages(json!([
        {"role": "user", "content": "What's the weather like in Tokyo?"},
        {"role": "assistant", "tool_calls": [
            {
                "id": "call_123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"Tokyo\"}"
                }
            }
        ]},
        {
            "role": "tool",
            "tool_call_id": "call_123",
            "content": "{\"temperature\": 25, \"condition\": \"sunny\"}"
        }
    ]))
    .tools(tools)
    .send()
    .await?;
}

Type Definitions

Message Types

#![allow(unused)]
fn main() {
// System message
json!({"role": "system", "content": "You are a helpful assistant."})

// User message
json!({"role": "user", "content": "Hello!"})

// Assistant message
json!({"role": "assistant", "content": "Hello!"})

// Tool result message
json!({
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "result"
})
}

vLLM-Specific Parameters

Pass vLLM-specific parameters with .extra():

#![allow(unused)]
fn main() {
client
    .chat
    .completions()
    .create()
    .model("qwen-3")
    .messages(json!([{"role": "user", "content": "Think about this problem"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .send()
    .await?;
}

Error Handling

VllmError

#![allow(unused)]
fn main() {
use vllm_client::VllmError;

match client.chat.completions().create().send().await {
    Ok(response) => { /* ... */ },
    Err(VllmError::HttpError(e)) => {
        eprintln!("HTTP error: {}", e);
    }
    Err(VllmError::ApiError { status_code, message, .. }) => {
        eprintln!("API error ({}): {}", status_code, message);
    }
    Err(VllmError::StreamError(e)) => {
        eprintln!("Stream error: {}", e);
    }
    Err(VllmError::ParseError(e)) => {
        eprintln!("Parse error: {}", e);
    }
    Err(e) => {
        eprintln!("Other error: {}", e);
    }
}
}

Error Types

Variant      Description
HttpError    HTTP request/response error
ApiError     API-level error (rate limiting, etc.)
StreamError  Streaming-specific error
ParseError   JSON parsing error
IoError      I/O error

Full Example

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .with_api_key("your-api-key");

    // Streaming example
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "Write a haiku about programming"}
        ]))
        .temperature(0.7)
        .max_tokens(100)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => eprintln!("Error: {}", e),
            _ => {}
        }
    }

    println!();
    Ok(())
}

Client API

VllmClient is the main entry point for using the vLLM API.

Creating a Client

Simple Creation

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
}

With an API Key

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");
}

Setting a Timeout

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120); // 2 minutes
}

Using the Builder

For complex configuration, use the builder:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .timeout_secs(300)
    .build();
}

Method Reference

new(base_url: impl Into<String>) -> Self

Create a client with the given base URL.

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1");
}

Parameters:

  • base_url - base URL of the vLLM server (must include the /v1 path)

Notes:

  • Trailing slashes are removed automatically
  • Creating a client is cheap, but reuse is still recommended

with_api_key(self, api_key: impl Into<String>) -> Self

Set the API key (builder style).

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
}

Parameters:

  • api_key - API key used for Bearer authentication

Notes:

  • The API key is sent as a Bearer token in the Authorization header
  • This method returns a new client instance

timeout_secs(self, secs: u64) -> Self

Set the request timeout (builder style).

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300);
}

Parameters:

  • secs - timeout in seconds

Notes:

  • Applies to every request made with this client
  • Increase it for long-running generation tasks

base_url(&self) -> &str

Get the client's base URL.

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1");
assert_eq!(client.base_url(), "http://localhost:8000/v1");
}

api_key(&self) -> Option<&str>

Get the configured API key.

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
assert_eq!(client.api_key(), Some("sk-xxx"));
}

builder() -> VllmClientBuilder

Create a new client builder with additional configuration options.

#![allow(unused)]
fn main() {
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
}

API Modules

The client exposes several API modules:

chat - Chat Completions API

Access the chat completions endpoint:

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
}

completions - Legacy Completions API

Access the legacy text completions endpoint:

#![allow(unused)]
fn main() {
let response = client.completions.create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .prompt("Once upon a time, there was a mountain")
    .send()
    .await?;
}

VllmClientBuilder

The builder offers a flexible way to configure the client.

Methods

Method              Type               Description
base_url(url)       impl Into<String>  Set the base URL
api_key(key)        impl Into<String>  Set the API key
timeout_secs(secs)  u64                Set the timeout in seconds
build()             -                  Build the client

Defaults

Option        Default
base_url      http://localhost:8000/v1
api_key       None
timeout_secs  HTTP client default (30 seconds)

Usage Examples

Basic Usage

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;
    
    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Using Environment Variables

#![allow(unused)]
fn main() {
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    let base_url = env::var("VLLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
    
    let api_key = env::var("VLLM_API_KEY").ok();
    
    let mut builder = VllmClient::builder().base_url(&base_url);
    
    if let Some(key) = api_key {
        builder = builder.api_key(&key);
    }
    
    builder.build()
}
}

Multiple Requests

Reuse the client across multiple requests:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

async fn process_prompts(client: &VllmClient, prompts: &[&str]) -> Vec<String> {
    let mut results = Vec::new();
    
    for prompt in prompts {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;
        
        match response {
            Ok(r) => results.push(r.content.unwrap_or_default()),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
    
    results
}
}

Thread Safety

VllmClient is thread-safe and can be shared across threads:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Clone and pass between threads
let client_clone = Arc::clone(&client);
}

Related Links

Chat Completions API

The chat completions API is the primary interface for generating text responses.

Overview

Access it via client.chat.completions():

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;
}

Request Builder

Required Parameters

model(name: impl Into<String>)

Set the model used for generation.

#![allow(unused)]
fn main() {
.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")
}

messages(messages: Value)

Set the conversation messages as a JSON array.

#![allow(unused)]
fn main() {
.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))
}

Message Types

Role       Description
system     Sets the assistant's behavior
user       User input
assistant  Assistant replies (used in multi-turn conversations)
tool       Tool results (used with function calling)

Sampling Parameters

temperature(temp: f32)

Controls randomness. Range: 0.0 - 2.0.

#![allow(unused)]
fn main() {
.temperature(0.7)  // typical behavior
.temperature(0.0)  // deterministic output
.temperature(1.5)  // more creative
}
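For intuition, temperature rescales the model's logits before the softmax, so lower values concentrate probability on the top tokens; a self-contained illustration (not part of vllm-client; temperature 0.0 is handled as greedy decoding in practice, not by this formula):

```rust
/// Softmax over logits scaled by 1/temperature.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|&l| l / temperature).collect();
    // Subtract the max for numerical stability.
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|&l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.5];
    // Lower temperature concentrates mass on the top token.
    println!("T=0.5: {:?}", softmax_with_temperature(&logits, 0.5));
    println!("T=1.5: {:?}", softmax_with_temperature(&logits, 1.5));
}
```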

max_tokens(tokens: u32)

Maximum number of generated tokens.

#![allow(unused)]
fn main() {
.max_tokens(1024)
.max_tokens(4096)
}

top_p(p: f32)

Nucleus sampling threshold. Range: 0.0 - 1.0.

#![allow(unused)]
fn main() {
.top_p(0.9)
}
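For intuition, nucleus sampling keeps the smallest set of tokens whose cumulative probability reaches p and samples only from that set; a sketch of the selection step (illustrative, not the server's actual code):

```rust
/// Return the indices of the smallest set of tokens whose cumulative
/// probability is at least `p`, considering tokens in descending order.
fn nucleus_indices(probs: &[f64], p: f64) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for idx in order {
        kept.push(idx);
        cumulative += probs[idx];
        if cumulative >= p {
            break;
        }
    }
    kept
}

fn main() {
    let probs = [0.5, 0.3, 0.15, 0.05];
    println!("top-p=0.9 keeps indices {:?}", nucleus_indices(&probs, 0.9));
}
```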

top_k(k: i32)

Top-K sampling (vLLM extension). Restricts sampling to the top K tokens.

#![allow(unused)]
fn main() {
.top_k(50)
}

stop(sequences: Value)

Stop generation when any of these sequences is produced.

#![allow(unused)]
fn main() {
// Multiple stop sequences
.stop(json!(["END", "STOP", "\n\n"]))

// A single stop sequence
.stop(json!("---"))
}
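Conceptually, generation is cut at the first occurrence of any stop sequence; a local sketch of that truncation:

```rust
/// Truncate `text` at the earliest occurrence of any stop sequence.
fn apply_stop_sequences(text: &str, stops: &[&str]) -> String {
    let cut = stops
        .iter()
        .filter_map(|s| text.find(s)) // byte offset of each match, if any
        .min()
        .unwrap_or(text.len());
    text[..cut].to_string()
}

fn main() {
    let out = apply_stop_sequences("first part END second part", &["END", "STOP"]);
    println!("{:?}", out);
}
```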

Tool Calling Parameters

tools(tools: Value)

Define the tools/functions the model may call.

#![allow(unused)]
fn main() {
.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))
}

tool_choice(choice: Value)

Control tool-selection behavior.

#![allow(unused)]
fn main() {
.tool_choice(json!("auto"))       // model decides
.tool_choice(json!("none"))       // never use tools
.tool_choice(json!("required"))   // force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
}

Advanced Parameters

stream(enable: bool)

Enable streaming responses.

#![allow(unused)]
fn main() {
.stream(true)
}

extra(params: Value)

Pass vLLM-specific or other extra parameters.

#![allow(unused)]
fn main() {
.extra(json!({
    "chat_template_kwargs": {
        "think_mode": true
    },
    "reasoning_effort": "high"
}))
}

Sending the Request

send() - Complete Response

Returns the full response in one call.

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
}

send_stream() - Streaming Response

Returns a stream of data for real-time output.

#![allow(unused)]
fn main() {
let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
}

See Streaming Responses for details.

Response Structure

ChatCompletionResponse

Field              Type                   Description
raw                Value                  Raw JSON response
id                 String                 Response ID
object             String                 Object type
model              String                 Model used
content            Option<String>         Generated content
reasoning_content  Option<String>         Reasoning content (thinking models)
tool_calls         Option<Vec<ToolCall>>  Tool calls
finish_reason      Option<String>         Stop reason
usage              Option<Usage>          Token usage statistics

Usage Example

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Get the content
println!("Content: {}", response.content.unwrap_or_default());

// Check reasoning content (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check the finish reason
match response.finish_reason.as_deref() {
    Some("stop") => println!("stopped naturally"),
    Some("length") => println!("hit the max token limit"),
    Some("tool_calls") => println!("made tool calls"),
    _ => {}
}

// Token usage
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}
}

Full Example

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a programming assistant."},
            {"role": "user", "content": "Write a Rust function that reverses a string"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }

    Ok(())
}

Multi-Turn Conversations

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First turn
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Xiaoming"}
    ]))
    .send()
    .await?;

// Continue the conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Xiaoming"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What is my name?"}
    ]))
    .send()
    .await?;
}

Related Links

Streaming API

Streaming lets you process large-language-model output in real time, receiving tokens one by one instead of waiting for the complete response.

Overview

vLLM Client supports streaming via Server-Sent Events (SSE). Call send_stream() instead of send() to get a streaming response.
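On the wire, each SSE frame is a `data: <json>` line, and the stream typically ends with a `data: [DONE]` sentinel; a minimal std-only parser for that framing (independent of vllm-client):

```rust
/// Extract the payloads from raw SSE text, stopping at the [DONE] sentinel.
fn parse_sse_payloads(raw: &str) -> Vec<String> {
    let mut payloads = Vec::new();
    for line in raw.lines() {
        if let Some(data) = line.strip_prefix("data: ") {
            if data == "[DONE]" {
                break; // end-of-stream marker
            }
            payloads.push(data.to_string());
        }
    }
    payloads
}

fn main() {
    let raw = "data: {\"delta\":\"Hello\"}\n\ndata: {\"delta\":\" world\"}\n\ndata: [DONE]\n";
    for p in parse_sse_payloads(raw) {
        println!("chunk: {}", p);
    }
}
```

The client handles this framing for you; the sketch only shows what the transport looks like underneath.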

Basic Streaming

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    println!();
    Ok(())
}

StreamEvent Variants

The StreamEvent enum represents the different kinds of stream events:

Variant                     Description
Content(String)             Regular content token delta
Reasoning(String)           Reasoning/thinking content (thinking models)
ToolCallDelta               Streamed tool-call delta
ToolCallComplete(ToolCall)  Complete tool call, ready to execute
Usage(Usage)                Token usage statistics
Done                        Streaming finished
Error(VllmError)            An error occurred

Content Events

The most common event type, carrying text tokens:

#![allow(unused)]
fn main() {
match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        std::io::Write::flush(&mut std::io::stdout()).ok();
    }
    _ => {}
}
}

Reasoning Events

For reasoning-capable models (e.g. Qwen with thinking mode enabled):

#![allow(unused)]
fn main() {
match event {
    StreamEvent::Reasoning(delta) => {
        eprintln!("[thinking] {}", delta);
    }
    StreamEvent::Content(delta) => {
        print!("{}", delta);
    }
    _ => {}
}
}

Tool Call Events

Tool calls are streamed incrementally, then finalized:

#![allow(unused)]
fn main() {
match event {
    StreamEvent::ToolCallDelta { index, id, name, arguments } => {
        println!("Tool delta: index={}, name={}", index, name);
        // arguments is a partial JSON string
    }
    StreamEvent::ToolCallComplete(tool_call) => {
        println!("Tool ready: {}({})", tool_call.name, tool_call.arguments);
        // Execute the tool and return the result
    }
    _ => {}
}
}
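Since `arguments` arrives as partial JSON fragments, a common pattern is to buffer deltas per tool-call index until the call completes; a std-only sketch (the accumulator type is ours, not the library's):

```rust
use std::collections::HashMap;

/// Accumulate streamed argument fragments keyed by tool-call index.
#[derive(Default)]
struct ToolCallAccumulator {
    buffers: HashMap<usize, String>,
}

impl ToolCallAccumulator {
    /// Append a fragment to the buffer for the given tool-call index.
    fn push_delta(&mut self, index: usize, fragment: &str) {
        self.buffers.entry(index).or_default().push_str(fragment);
    }

    /// Take the completed argument string for an index, if any.
    fn finish(&mut self, index: usize) -> Option<String> {
        self.buffers.remove(&index)
    }
}

fn main() {
    let mut acc = ToolCallAccumulator::default();
    acc.push_delta(0, "{\"location\":");
    acc.push_delta(0, " \"Tokyo\"}");
    println!("arguments: {:?}", acc.finish(0));
}
```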

Usage Events

Token usage information is usually sent last:

#![allow(unused)]
fn main() {
match event {
    StreamEvent::Usage(usage) => {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    _ => {}
}
}

MessageStream

The MessageStream type is an async iterator that yields StreamEvent values.

Methods

Method             Return Type          Description
next()             Option<StreamEvent>  Get the next event (async)
collect_content()  String               Collect all content into a string
into_stream()      impl Stream          Convert into a generic stream

Collecting All Content

For convenience, you can collect all content in one call:

#![allow(unused)]
fn main() {
let content = stream.collect_content().await?;
println!("Full response: {}", content);
}

Note: this approach waits for the entire response, giving up the benefit of streaming. Use it only when you just need the final text; to display tokens as they arrive and also keep the full text, accumulate inside the event loop instead.

Complete Streaming Example

use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();
    let mut usage = None;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => {
                println!("\n[streaming complete]");
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e);
            }
            _ => {}
        }
    }

    // Print a summary
    if !reasoning.is_empty() {
        eprintln!("\n--- Reasoning ---");
        eprintln!("{}", reasoning);
    }

    if let Some(usage) = usage {
        eprintln!("\n--- Token Usage ---");
        eprintln!("Prompt: {}, Completion: {}, Total: {}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }

    Ok(())
}

Streaming Tool Calls

When tools are in play, tool calls are pushed incrementally:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let mut stream = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::ToolCallComplete(tool_call) => {
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute the tool calls
for tool_call in tool_calls {
    println!("Tool: {} args: {}", tool_call.name, tool_call.arguments);
    // Execute and return the result in the next message
}
}

Error Handling

Errors can occur at any point during streaming:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Error(e) => return Err(e),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    Ok(content)
}
}

Best Practices

Flushing Output

For real-time display, flush stdout after each write:

#![allow(unused)]
fn main() {
use std::io::{self, Write};

match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        io::stdout().flush().ok();
    }
    _ => {}
}
}

Handling Interrupts

In interactive applications, handle Ctrl+C gracefully:

#![allow(unused)]
fn main() {
use tokio::signal;

tokio::select! {
    result = process_stream(&mut stream) => {
        // completed normally
    }
    _ = signal::ctrl_c() => {
        println!("\n[interrupted]");
    }
}
}

Idle Stream Timeout

Set a timeout for streams that might stall:

#![allow(unused)]
fn main() {
use tokio::time::{timeout, Duration};

let result = timeout(
    Duration::from_secs(60),
    stream.next()
).await;

match result {
    Ok(Some(event)) => { /* handle the event */ }
    Ok(None) => { /* stream ended */ }
    Err(_) => { /* timed out */ }
}
}

Streaming Completions API

vLLM Client also supports streaming against the legacy /v1/completions API, using CompletionStreamEvent:

CompletionStreamEvent Types

| Variant | Description |
|---------|-------------|
| Text(String) | Text token delta |
| FinishReason(String) | Reason the stream ended (e.g. "stop", "length") |
| Usage(Usage) | Token usage statistics |
| Done | Streaming finished |
| Error(VllmError) | An error occurred |

Streaming Completions Example

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("Write a poem about spring")
        .max_tokens(1024)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[finish reason: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!("\nTokens: prompt={}, completion={}, total={}",
                    usage.prompt_tokens,
                    usage.completion_tokens,
                    usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[streaming complete]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("Error: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}

CompletionStream Methods

| Method | Return Type | Description |
|--------|-------------|-------------|
| next() | Option<CompletionStreamEvent> | Get the next event (async) |
| collect_text() | String | Collect all text into a single string |
| into_stream() | impl Stream | Convert into a generic stream |

Related Links

Tool Calling API

Tool calling (also known as function calling) lets the model invoke external functions during generation, enabling integration with external APIs, databases, and custom logic.

Overview

vLLM Client supports OpenAI-compatible tool calling:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools) // tool definitions shown in the next section
    .send()
    .await?;
}

Defining Tools

Basic Tool Definition

Tools are defined as JSON following the OpenAI specification:

#![allow(unused)]
fn main() {
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g. Tokyo"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);
}

Multiple Tools

#![allow(unused)]
fn main() {
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"}
                },
                "required": ["query"]
            }
        }
    }
]);
}

Tool Choice

Control how the model selects tools:

#![allow(unused)]
fn main() {
// Let the model decide (default)
.tool_choice(json!("auto"))

// Disallow tool use
.tool_choice(json!("none"))

// Require a tool call
.tool_choice(json!("required"))

// Force a specific tool
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
}

Handling Tool Calls

Checking for Tool Calls

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// Check whether the response contains tool calls
if response.has_tool_calls() {
    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            println!("Function: {}", tool_call.name);
            println!("Arguments: {}", tool_call.arguments);
        }
    }
}
}

The ToolCall Struct

#![allow(unused)]
fn main() {
pub struct ToolCall {
    pub id: String,           // Unique identifier for the call
    pub name: String,         // Function name
    pub arguments: String,    // Arguments as a JSON string
}
}

Parsing Arguments

Parse the argument string into typed data:

#![allow(unused)]
fn main() {
use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

if let Some(tool_call) = response.first_tool_call() {
    // Parse into a specific type
    match tool_call.parse_args_as::<WeatherArgs>() {
        Ok(args) => {
            println!("Location: {}", args.location);
            if let Some(unit) = args.unit {
                println!("Unit: {}", unit);
            }
        }
        Err(e) => {
            eprintln!("Failed to parse arguments: {}", e);
        }
    }
    
    // Or parse as generic JSON
    let args: Value = tool_call.parse_args()?;
}
}

Tool Result Helper

Create a tool result message:

#![allow(unused)]
fn main() {
// Build a tool result message
let tool_result = tool_call.result(json!({
    "temperature": 25,
    "condition": "sunny",
    "humidity": 60
}));

// Returns a JSON object that can be appended directly to the messages:
// {
//     "role": "tool",
//     "tool_call_id": "...",
//     "content": "{\"temperature\": 25, ...}"
// }
}

Complete Tool Calling Flow

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, ToolCall};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
}

// Mock weather API
fn get_weather(location: &str) -> WeatherResult {
    WeatherResult {
        temperature: 25.0,
        condition: "sunny".to_string(),
    }
}

async fn chat_with_tools(client: &VllmClient, user_message: &str) -> Result<String, Box<dyn std::error::Error>> {
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // First request
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    // Check whether the model wants to call tools
    if response.has_tool_calls() {
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // Append the assistant's tool-call message
        if let Some(tool_calls) = &response.tool_calls {
            let assistant_msg = response.assistant_message();
            messages.push(assistant_msg);

            // Execute each tool and append its result
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location);
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // Continue the conversation with the tool results
        let final_response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        return Ok(final_response.content.unwrap_or_default());
    }

    Ok(response.content.unwrap_or_default())
}
}

Streaming Tool Calls

In streaming responses, tool calls are pushed incrementally:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo and Paris?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();
let mut content = String::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => {
            content.push_str(&delta);
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { index, id, name, arguments } => {
            println!("[tool delta {}] {}({})", index, name, arguments);
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            println!("[tool complete] {}({})", tool_call.name, tool_call.arguments);
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute all collected tool calls
for tool_call in tool_calls {
    // Execute and return results...
}
}

Multi-Round Tool Calling

#![allow(unused)]
fn main() {
async fn multi_round_tool_calling(
    client: &VllmClient,
    user_message: &str,
    tools: serde_json::Value,
    max_rounds: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for _ in 0..max_rounds {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // Append the assistant message containing the tool calls
            messages.push(response.assistant_message());

            // Execute the tools and append the results
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // No more tool calls; return the content
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err("exceeded maximum rounds".into())
}
}

Best Practices

Clear Tool Descriptions

Write clear, detailed descriptions:

#![allow(unused)]
fn main() {
// Recommended
"description": "Get current weather conditions for a given city. Returns temperature, humidity, and conditions."

// Avoid
"description": "Get weather"
}

Precise Parameter Schemas

Define accurate JSON Schemas:

#![allow(unused)]
fn main() {
"parameters": {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": "City name or coordinates"
        },
        "days": {
            "type": "integer",
            "minimum": 1,
            "maximum": 7,
            "description": "Number of forecast days"
        }
    },
    "required": ["location"]
}
}
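Schema bounds like `minimum`/`maximum` constrain what the model should emit, but models can still produce out-of-range values, so re-checking the parsed argument before executing the tool is cheap insurance. A std-only sketch mirroring the `days` bounds above (the helper name is hypothetical):

```rust
/// Re-validates the parsed `days` argument against the schema bounds (1..=7).
fn validate_days(days: i64) -> Result<i64, String> {
    if (1..=7).contains(&days) {
        Ok(days)
    } else {
        Err(format!("days out of range 1..=7: {}", days))
    }
}

fn main() {
    println!("{:?}", validate_days(3));
    println!("{:?}", validate_days(0));
}
```

Returning the validation error as a tool result (rather than panicking) lets the model correct itself on the next round.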

Error Handling

Handle tool execution errors gracefully:

#![allow(unused)]
fn main() {
let tool_result = match execute_tool(&tool_call.name, &tool_call.arguments) {
    Ok(result) => json!({"success": true, "data": result}),
    Err(e) => json!({"success": false, "error": e.to_string()}),
};
messages.push(tool_call.result(tool_result));
}

Related Links

Error Handling

This document describes the error handling mechanisms in vLLM Client.

The VllmError Enum

All errors in vLLM Client are represented by the VllmError enum:

#![allow(unused)]
fn main() {
use thiserror::Error;

#[derive(Debug, Error, Clone)]
pub enum VllmError {
    #[error("HTTP request failed: {0}")]
    Http(String),

    #[error("JSON error: {0}")]
    Json(String),

    #[error("API error (status {status_code}): {message}")]
    ApiError {
        status_code: u16,
        message: String,
        error_type: Option<String>,
    },

    #[error("Stream error: {0}")]
    Stream(String),

    #[error("Connection timeout")]
    Timeout,

    #[error("Model not found: {0}")]
    ModelNotFound(String),

    #[error("Missing required parameter: {0}")]
    MissingParameter(String),

    #[error("No response content")]
    NoContent,

    #[error("Invalid response format: {0}")]
    InvalidResponse(String),

    #[error("{0}")]
    Other(String),
}
}

Error Types

| Variant | When It Occurs |
|---------|----------------|
| Http | Network errors, connection failures |
| Json | Serialization/deserialization errors |
| ApiError | The server returned an error response |
| Stream | An error during a streaming response |
| Timeout | The request timed out |
| ModelNotFound | The specified model does not exist |
| MissingParameter | A required parameter is missing |
| NoContent | The response has no content |
| InvalidResponse | The response format was unexpected |
| Other | Any other error |

Basic Error Handling

use vllm_client::{VllmClient, json, VllmError};

async fn chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat("Hello!").await {
        Ok(text) => println!("Response: {}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}

Detailed Error Handling

Handle each error type differently:

use vllm_client::{VllmClient, json, VllmError};

#[tokio::main]
async fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Hello!"}]))
        .send()
        .await;

    match result {
        Ok(response) => {
            println!("Success: {}", response.content.unwrap_or_default());
        }
        Err(VllmError::ApiError { status_code, message, error_type }) => {
            eprintln!("API error (HTTP {}): {}", status_code, message);
            if let Some(etype) = error_type {
                eprintln!("Error type: {}", etype);
            }
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out; try increasing the timeout.");
        }
        Err(VllmError::Http(msg)) => {
            eprintln!("Network error: {}", msg);
        }
        Err(VllmError::ModelNotFound(model)) => {
            eprintln!("Model '{}' not found; check the available models.", model);
        }
        Err(VllmError::MissingParameter(param)) => {
            eprintln!("Missing required parameter: {}", param);
        }
        Err(e) => {
            eprintln!("Other error: {}", e);
        }
    }
}

HTTP Status Codes

Common API error status codes:

| Status | Meaning | Suggested Handling |
|--------|---------|--------------------|
| 400 | Malformed request | Check the request parameters |
| 401 | Unauthorized | Check the API key |
| 403 | Forbidden | Check permissions |
| 404 | Not found | Check the endpoint or model name |
| 429 | Rate limited | Retry with backoff |
| 500 | Internal server error | Retry or contact the administrator |
| 502 | Bad gateway | Check the vLLM server status |
| 503 | Service unavailable | Wait, then retry |
| 504 | Gateway timeout | Increase the timeout or retry |

Retryable Errors

Check whether an error is retryable:

#![allow(unused)]
fn main() {
use vllm_client::VllmError;

fn should_retry(error: &VllmError) -> bool {
    error.is_retryable()
}

// Manual check
fn is_retryable_manual(error: &VllmError) -> bool {
    match error {
        VllmError::Timeout => true,
        VllmError::ApiError { status_code: 429, .. } => true,  // rate limited
        VllmError::ApiError { status_code: 500..=504, .. } => true,  // server errors
        _ => false,
    }
}
}
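The manual check above keys on `VllmError` variants; the same policy can be expressed purely over HTTP status codes, which is handy when inspecting raw responses. A std-only sketch, independent of the crate, following the status-code table above:

```rust
/// Whether an HTTP status is worth retrying, per the status-code table above:
/// 429 (rate limited) and 500-504 (server-side failures) are transient.
fn status_is_retryable(status: u16) -> bool {
    matches!(status, 429 | 500..=504)
}

fn main() {
    for s in [400u16, 429, 500, 503] {
        println!("{}: retry = {}", s, status_is_retryable(s));
    }
}
```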

Exponential Backoff Retry

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn chat_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut retries = 0;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && retries < max_retries => {
                retries += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(retries - 1));
                eprintln!("Retry {} after {:?}: {}", retries, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
}
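The delay schedule in the loop above doubles without bound; production retry loops usually cap it. A std-only sketch of the same schedule with a cap (the 100 ms base matches the example; the cap value is an assumption):

```rust
use std::time::Duration;

/// Exponential backoff: 100ms, 200ms, 400ms, ... capped at `max`.
fn backoff_delay(retry: u32, max: Duration) -> Duration {
    // retry is 1-based; saturating ops avoid overflow for large retry counts.
    let millis = 100u64.saturating_mul(2u64.saturating_pow(retry.saturating_sub(1)));
    Duration::from_millis(millis).min(max)
}

fn main() {
    let max = Duration::from_secs(5);
    for retry in 1..=8 {
        println!("retry {} -> {:?}", retry, backoff_delay(retry, max));
    }
}
```

Adding random jitter on top of this schedule further reduces thundering-herd retries against a recovering server.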

Streaming Error Handling

Handle errors that occur during a streaming response:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => return Err(e),
            _ => {}
        }
    }

    Ok(content)
}
}

Error Context

Add context to errors to make debugging easier:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

async fn chat_with_context(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("failed to get chat response: {}", e))?;

    Ok(response.content.unwrap_or_default())
}
}

Using anyhow or eyre

For applications that use anyhow or eyre:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use anyhow::{Context, Result};

async fn chat(prompt: &str) -> Result<String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .context("failed to send chat request")?;

    Ok(response.content.unwrap_or_default())
}
}

Best Practices

1. Always Handle Errors

#![allow(unused)]
fn main() {
// Bad
let response = client.chat.completions().create()
    .send().await.unwrap();

// Good
match client.chat.completions().create().send().await {
    Ok(r) => { /* handle the response */ },
    Err(e) => eprintln!("Error: {}", e),
}
}

2. Set Appropriate Timeouts

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes for long-running tasks
}

3. Log Errors with Context

#![allow(unused)]
fn main() {
if let Err(e) = result {
    log::error!("chat request failed: {}", e);
    log::debug!("request details: model={}, prompt_len={}", model, prompt.len());
}
}

4. Implement Graceful Fallback

#![allow(unused)]
fn main() {
match primary_client.chat.completions().create().send().await {
    Ok(r) => r,
    Err(e) => {
        log::warn!("primary client failed: {}, trying fallback client", e);
        fallback_client.chat.completions().create().send().await?
    }
}
}

Related Links

Examples

This section contains code examples for a variety of use cases.

Contents

Basic Chat

Simple Conversation

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "Hello, please introduce yourself."}
        ]))
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

Conversation with a System Prompt

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "system", "content": "You are a professional Rust programming assistant. Answer concisely and accurately."},
            {"role": "user", "content": "What is ownership?"}
        ]))
        .temperature(0.7)
        .max_tokens(500)
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

Multi-Turn Conversation

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "My name is Zhang San"},
            {"role": "assistant", "content": "Hello, Zhang San! Nice to meet you. How can I help you today?"},
            {"role": "user", "content": "What is my name?"}
        ]))
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

Streaming Chat

Basic Streaming Output

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;
    
    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => eprintln!("Error: {}", e),
            _ => {}
        }
    }
    
    println!();
    Ok(())
}

Streaming with Thinking Mode

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("qwen-3")
        .messages(json!([
            {"role": "user", "content": "Explain the theory of relativity"}
        ]))
        .extra(json!({"chat_template_kwargs": {"enable_thinking": true}}))
        .stream(true)
        .send_stream()
        .await?;
    
    println!("=== Thinking ===");
    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Reasoning(delta) => {
                // thinking content
                print!("{}", delta);
            }
            StreamEvent::Content(delta) => {
                // final answer content
                print!("{}", delta);
            }
            StreamEvent::Done => break,
            _ => {}
        }
    }
    
    println!();
    Ok(())
}

Streaming Completions

Streaming with the Legacy Completions API

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("What is machine learning?")
        .max_tokens(500)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[finish reason: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!("\nTokens: prompt={}, completion={}, total={}",
                    usage.prompt_tokens,
                    usage.completion_tokens,
                    usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[streaming complete]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("Error: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}

Note: for new projects, prefer the Chat Completions API (client.chat.completions()); it offers more flexible features and a better message format.


Tool Calling

Defining and Using Tools

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    // Define the tools
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a given city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "City name, e.g. Beijing or Shanghai"
                        }
                    },
                    "required": ["city"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "get_time",
                "description": "Get the current time for a given city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {
                            "type": "string",
                            "description": "City name"
                        }
                    },
                    "required": ["city"]
                }
            }
        }
    ]);
    
    // Send the request
    let response = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "What's the weather like in Beijing right now?"}
        ]))
        .tools(tools)
        .send()
        .await?;
    
    // Check for tool calls
    if let Some(tool_calls) = &response.choices[0].message.tool_calls {
        for tool_call in tool_calls {
            println!("Tool: {}", tool_call.function.name);
            println!("Arguments: {}", tool_call.function.arguments);
            
            // Execute the actual tool call here
            // let result = execute_tool(&tool_call.function.name, &tool_call.function.arguments);
        }
    }
    
    Ok(())
}

Returning Tool Results

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string"}
                    },
                    "required": ["city"]
                }
            }
        }
    ]);
    
    // Simulated conversation flow
    let response = client
        .chat
        .completions()
        .create()
        .model("llama-3-70b")
        .messages(json!([
            {"role": "user", "content": "How's the weather in Shanghai?"},
            {
                "role": "assistant",
                "tool_calls": [{
                    "id": "call_001",
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": "{\"city\": \"Shanghai\"}"
                    }
                }]
            },
            {
                "role": "tool",
                "tool_call_id": "call_001",
                "content": "{\"temperature\": 28, \"condition\": \"cloudy\", \"humidity\": 65}"
            }
        ]))
        .tools(tools)
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

Multimodal

Image Understanding

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    // Use a base64-encoded image
    let image_base64 = "data:image/png;base64,iVBORw0KGgo...";
    
    let response = client
        .chat
        .completions()
        .create()
        .model("llava-v1.6")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What's in this image?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": image_base64}
                    }
                ]
            }
        ]))
        .max_tokens(500)
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

Using an Image URL

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("llava-v1.6")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image"},
                    {
                        "type": "image_url",
                        "image_url": {"url": "https://example.com/image.jpg"}
                    }
                ]
            }
        ]))
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

Thinking Mode

Enabling Thinking Mode

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("qwen-3")
        .messages(json!([
            {"role": "system", "content": "You are an AI assistant skilled at deep reasoning."},
            {"role": "user", "content": "Why is the sky blue?"}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;
    
    let mut reasoning = String::new();
    let mut content = String::new();
    
    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Reasoning(delta) => reasoning.push_str(delta),
            StreamEvent::Content(delta) => content.push_str(delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    
    println!("=== Reasoning ===");
    println!("{}", reasoning);
    println!("\n=== Answer ===");
    println!("{}", content);
    
    Ok(())
}

Disabling Thinking Mode

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("qwen-3")
        .messages(json!([
            {"role": "user", "content": "Hello"}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "enable_thinking": false
            }
        }))
        .send()
        .await?;
    
    println!("{}", response.choices[0].message.content.unwrap());
    Ok(())
}

More Examples

Complete example code can be found in the project's examples/ directory:

  • simple.rs - basic chat example
  • simple_streaming.rs - streaming chat example
  • streaming_chat.rs - streaming chat with thinking mode
  • tool_calling.rs - tool calling example

Basic Chat Examples

This page demonstrates basic chat completion usage patterns for vLLM Client.

Simple Chat

The simplest way to send a chat message:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

With a System Message

Add a system message to control the assistant's behavior:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful programming assistant. You write clean, well-documented code."},
            {"role": "user", "content": "Write a Rust function that checks whether a number is prime"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Multi-turn Conversation

Maintain context across multiple messages:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Build the conversation history
    let mut messages = vec![
        json!({"role": "system", "content": "You are a helpful assistant."}),
    ];

    // First turn
    messages.push(json!({"role": "user", "content": "My name is Xiao Ming"}));
    
    let response1 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!(messages.clone()))
        .send()
        .await?;

    let assistant_reply = response1.content.unwrap_or_default();
    println!("Assistant: {}", assistant_reply);

    // Add the assistant's reply to the history
    messages.push(json!({"role": "assistant", "content": assistant_reply}));

    // Second turn
    messages.push(json!({"role": "user", "content": "What is my name?"}));

    let response2 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!(messages))
        .send()
        .await?;

    println!("Assistant: {}", response2.content.unwrap_or_default());
    Ok(())
}

Conversation Helper

A reusable helper for building conversations:

use vllm_client::{VllmClient, json, VllmError};
use serde_json::Value;

struct Conversation {
    client: VllmClient,
    model: String,
    messages: Vec<Value>,
}

impl Conversation {
    fn new(client: VllmClient, model: impl Into<String>) -> Self {
        Self {
            client,
            model: model.into(),
            messages: vec![
                json!({"role": "system", "content": "You are a helpful assistant."})
            ],
        }
    }

    fn with_system(mut self, content: &str) -> Self {
        self.messages[0] = json!({"role": "system", "content": content});
        self
    }

    async fn send(&mut self, user_message: &str) -> Result<String, VllmError> {
        self.messages.push(json!({
            "role": "user",
            "content": user_message
        }));

        let response = self.client
            .chat
            .completions()
            .create()
            .model(&self.model)
            .messages(json!(&self.messages))
            .send()
            .await?;

        let content = response.content.unwrap_or_default();
        self.messages.push(json!({
            "role": "assistant",
            "content": &content
        }));

        Ok(content)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut conv = Conversation::new(client, "Qwen/Qwen2.5-7B-Instruct")
        .with_system("You are a math tutor. Explain concepts simply.");

    println!("User: What is 2 + 2?");
    let reply = conv.send("What is 2 + 2?").await?;
    println!("Assistant: {}", reply);

    println!("\nUser: And what is that times 3?");
    let reply = conv.send("And what is that times 3?").await?;
    println!("Assistant: {}", reply);

    Ok(())
}

Using Sampling Parameters

Control generation with sampling parameters:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a creative story about a robot"}
        ]))
        .temperature(1.2)      // higher temperature for more creativity
        .top_p(0.95)           // nucleus sampling
        .top_k(50)             // vLLM extension parameter
        .max_tokens(512)       // limit the output length
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Deterministic Output

For reproducible results, set the temperature to 0:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What is 2 + 2?"}
        ]))
        .temperature(0.0)      // deterministic output
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Using Stop Sequences

Stop generation at specific sequences:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "List three fruits, one per line"}
        ]))
        .stop(json!(["\n\n", "END"]))  // stop at a double newline or END
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Tracking Token Usage

Track token usage to monitor costs:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Explain quantum computing"}
        ]))
        .send()
        .await?;

    println!("Response: {}", response.content.unwrap_or_default());

    if let Some(usage) = response.usage {
        println!("\n--- Token Usage ---");
        println!("Prompt tokens: {}", usage.prompt_tokens);
        println!("Completion tokens: {}", usage.completion_tokens);
        println!("Total tokens: {}", usage.total_tokens);
    }

    Ok(())
}

Batch Processing

Process multiple prompts efficiently:

use vllm_client::{VllmClient, json, VllmError};

async fn process_prompts(
    client: &VllmClient,
    prompts: &[&str],
) -> Vec<Result<String, VllmError>> {
    let mut results = Vec::new();

    for prompt in prompts {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default());

        results.push(result);
    }

    results
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(120);

    let prompts = [
        "What is Rust?",
        "What is Python?",
        "What is Go?",
    ];

    let results = process_prompts(&client, &prompts).await;

    for (prompt, result) in prompts.iter().zip(results.iter()) {
        match result {
            Ok(response) => println!("Q: {}\nA: {}\n", prompt, response),
            Err(e) => eprintln!("Error for '{}': {}", prompt, e),
        }
    }

    Ok(())
}

Error Handling

Proper error handling for production code:

use vllm_client::{VllmClient, json, VllmError};

async fn safe_chat(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(60);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("Request failed: {}", e))?;

    response.content.ok_or_else(|| "no content in response".to_string())
}

#[tokio::main]
async fn main() {
    match safe_chat("Hello!").await {
        Ok(text) => println!("Response: {}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}

Streaming Chat Example

This example demonstrates real-time output with streaming responses.

Basic Streaming

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short story about a robot learning to paint."}
        ]))
        .temperature(0.8)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    print!("Response: ");
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }
    println!();

    Ok(())
}

Streaming with Reasoning (Thinking Models)

For models that support a thinking/reasoning mode:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Calculate: what is 15 * 23 + 47?"}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
                eprintln!("[thinking] {}", delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!("\n");
    if !reasoning.is_empty() {
        println!("--- Reasoning ---");
        println!("{}", reasoning);
    }

    Ok(())
}

Streaming with a Progress Indicator

Show a typing indicator while waiting for the first token:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use std::time::Duration;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let waiting = Arc::new(AtomicBool::new(true));
    let waiting_clone = Arc::clone(&waiting);

    // Spawn the typing-indicator task
    let indicator = tokio::spawn(async move {
        let chars = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏'];
        let mut i = 0;
        while waiting_clone.load(Ordering::Relaxed) {
            print!("\r{} thinking...", chars[i]);
            std::io::Write::flush(&mut std::io::stdout()).ok();
            i = (i + 1) % chars.len();
            tokio::time::sleep(Duration::from_millis(80)).await;
        }
        print!("\r        \r"); // clear the indicator
    });

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Explain quantum entanglement in simple terms."}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    let mut first_token = true;
    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                if first_token {
                    waiting.store(false, Ordering::Relaxed);
                    indicator.await.ok();
                    first_token = false;
                    println!("Response:");
                    println!("---------");
                }
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                waiting.store(false, Ordering::Relaxed);
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!("\n");

    Ok(())
}

Multi-turn Streaming Conversation

Handle a conversation with streamed responses:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use std::io::{self, BufRead, Write};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    let mut messages: Vec<serde_json::Value> = Vec::new();

    println!("Chat with the AI (type 'quit' to exit)");
    println!("----------------------------------------\n");

    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let input = line?;
        if input.trim() == "quit" {
            break;
        }
        if input.trim().is_empty() {
            continue;
        }

        // Add the user message
        messages.push(json!({"role": "user", "content": input}));

        // Stream the response
        let mut stream = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!(messages))
            .stream(true)
            .send_stream()
            .await?;

        print!("AI: ");
        io::stdout().flush().ok();

        let mut response_content = String::new();

        while let Some(event) = stream.next().await {
            match event {
                StreamEvent::Content(delta) => {
                    response_content.push_str(&delta);
                    print!("{}", delta);
                    io::stdout().flush().ok();
                }
                StreamEvent::Done => break,
                StreamEvent::Error(e) => {
                    eprintln!("\nError: {}", e);
                    break;
                }
                _ => {}
            }
        }

        println!("\n");

        // Add the assistant's response to the history
        messages.push(json!({"role": "assistant", "content": response_content}));
    }

    println!("Goodbye!");
    Ok(())
}
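The chat loop above resends the full history on every turn, so long sessions eventually exceed the model's context window. One common mitigation is to keep the system message and only the most recent turns. A minimal pure-std sketch of that idea (the helper name is ours, not part of vllm-client, and plain strings stand in for the json! message values):

```rust
// Hypothetical helper (not part of vllm-client): cap the resent history.
// Index 0 is treated as a pinned system message; the oldest entries after
// it are dropped until only `keep_last` recent ones remain.
fn trim_history(messages: &mut Vec<String>, keep_last: usize) {
    if messages.len() > keep_last + 1 {
        let drop = messages.len() - keep_last - 1;
        messages.drain(1..1 + drop);
    }
}

fn main() {
    let mut history: Vec<String> = ["sys", "u1", "a1", "u2", "a2", "u3"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    trim_history(&mut history, 2);
    // The system message and the two most recent entries survive.
    assert_eq!(history, vec!["sys", "a2", "u3"]);
    println!("{:?}", history);
}
```

Note that dropping turns loses information the model previously saw; summarizing the dropped turns into the system message is a common refinement.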

Streaming with a Timeout

Add timeout handling for slow responses:

use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;
use tokio::time::{timeout, Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(300);

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a detailed essay about artificial intelligence."}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    loop {
        // 30-second timeout per event
        match timeout(Duration::from_secs(30), stream.next()).await {
            Ok(Some(event)) => {
                match event {
                    StreamEvent::Content(delta) => {
                        content.push_str(&delta);
                        print!("{}", delta);
                        std::io::Write::flush(&mut std::io::stdout()).ok();
                    }
                    StreamEvent::Done => break,
                    StreamEvent::Error(e) => {
                        eprintln!("\nStream error: {}", e);
                        return Err(e.into());
                    }
                    _ => {}
                }
            }
            Ok(None) => break,
            Err(_) => {
                eprintln!("\nTimed out waiting for the next token");
                break;
            }
        }
    }

    println!("\n\nGenerated {} characters", content.len());

    Ok(())
}

Collecting Usage Statistics

Track token usage during streaming:

use vllm_client::{VllmClient, json, StreamEvent, Usage};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about the ocean."}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    let mut usage: Option<Usage> = None;
    let start_time = std::time::Instant::now();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => break,
            _ => {}
        }
    }

    let elapsed = start_time.elapsed();

    println!("\n");
    println!("--- Statistics ---");
    println!("Elapsed: {:.2}s", elapsed.as_secs_f64());
    println!("Characters: {}", content.len());

    if let Some(usage) = usage {
        println!("Prompt tokens: {}", usage.prompt_tokens);
        println!("Completion tokens: {}", usage.completion_tokens);
        println!("Total tokens: {}", usage.total_tokens);
        println!("Tokens per second: {:.2}", 
            usage.completion_tokens as f64 / elapsed.as_secs_f64());
    }

    Ok(())
}
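The throughput figure printed above is simply the completion token count divided by elapsed wall time; if the stream finishes near-instantly, that division yields inf or NaN. A small pure-std sketch with a guard (the helper name is ours, not part of the client API):

```rust
// Hypothetical helper: tokens-per-second from a completion token count
// and elapsed seconds, guarding a zero-length interval.
fn tokens_per_second(completion_tokens: u32, elapsed_secs: f64) -> f64 {
    if elapsed_secs <= 0.0 {
        return 0.0;
    }
    completion_tokens as f64 / elapsed_secs
}

fn main() {
    // 256 tokens over 4 seconds is 64 tokens/s.
    assert_eq!(tokens_per_second(256, 4.0), 64.0);
    // A degenerate interval reports 0 instead of inf/NaN.
    assert_eq!(tokens_per_second(100, 0.0), 0.0);
    println!("{}", tokens_per_second(256, 4.0));
}
```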

Streaming Completions Example

This example demonstrates streaming calls with the legacy /v1/completions API.

Basic Streaming Completions

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    println!("=== Streaming Completions Example ===\n");
    println!("Model: Qwen/Qwen2.5-7B-Instruct\n");
    println!("Prompt: What is machine learning?");
    println!("\nGenerated text: ");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("What is machine learning?")
        .max_tokens(500)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    // Process the stream events
    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                // Print the text delta (real-time output)
                print!("{}", delta);
                // Flush the buffer so output appears immediately
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n\n--- Finish reason: {} ---", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                // Token usage statistics, emitted at the end of the stream
                println!("\n\n--- Token Usage ---");
                println!("Prompt tokens: {}", usage.prompt_tokens);
                println!("Completion tokens: {}", usage.completion_tokens);
                println!("Total tokens: {}", usage.total_tokens);
            }
            CompletionStreamEvent::Done => {
                println!("\n\n=== Generation complete ===");
                break;
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}

Differences from Chat Streaming

  Aspect            Chat Completions       Completions
  Event type        StreamEvent            CompletionStreamEvent
  Content variant   Content(String)        Text(String)
  Extra events      Reasoning, ToolCall    FinishReason
  Best for          conversations          single prompts

When to Use the Completions API

  • Simple single-prompt text generation
  • Legacy compatibility with the OpenAI API
  • Scenarios that don't need the chat message format

For new projects, prefer the Chat Completions API (client.chat.completions()), which offers more flexible features and a better message format.

Tool Calling Example

This example demonstrates tool calling (function calling) with vLLM Client.

Basic Tool Calling

Define tools and let the model decide when to call them:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Define the available tools
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name, e.g. Tokyo, New York"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "Temperature unit"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather like in Tokyo?"}
        ]))
        .tools(tools)
        .send()
        .await?;

    // Check whether the model wants to call a tool
    if response.has_tool_calls() {
        if let Some(tool_calls) = &response.tool_calls {
            for tool_call in tool_calls {
                println!("Function: {}", tool_call.name);
                println!("Arguments: {}", tool_call.arguments);
            }
        }
    } else {
        println!("Response: {}", response.content.unwrap_or_default());
    }

    Ok(())
}

Complete Tool Calling Flow

Execute the tools and return results to continue the conversation:

use vllm_client::{VllmClient, json, ToolCall};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
    humidity: u32,
}

// Mock weather function
fn get_weather(location: &str, unit: Option<&str>) -> WeatherResult {
    // In real code, call an actual weather API
    let temp = match location {
        "Tokyo" => 25.0,
        "New York" => 20.0,
        "London" => 15.0,
        _ => 22.0,
    };

    WeatherResult {
        temperature: if unit == Some("fahrenheit") {
            temp * 9.0 / 5.0 + 32.0
        } else {
            temp
        },
        condition: "sunny".to_string(),
        humidity: 60,
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let user_message = "What's the weather like in Tokyo and New York?";

    // First request - the model may call tools
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    if response.has_tool_calls() {
        // Build the message history
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // Add the assistant's tool calls
        messages.push(response.assistant_message());

        // Execute each tool and append its result
        if let Some(tool_calls) = &response.tool_calls {
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location, args.unit.as_deref());
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // Continue the conversation with the tool results
        let final_response = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        println!("{}", final_response.content.unwrap_or_default());
    } else {
        println!("{}", response.content.unwrap_or_default());
    }

    Ok(())
}

Multiple Tools

Define multiple tools for different purposes:

use vllm_client::{VllmClient, json};
use serde::Deserialize;

#[derive(Deserialize)]
struct SearchArgs {
    query: String,
    limit: Option<u32>,
}

#[derive(Deserialize)]
struct CalcArgs {
    expression: String,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web for information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query"
                        },
                        "limit": {
                            "type": "integer",
                            "description": "Maximum number of results"
                        }
                    },
                    "required": ["query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Perform a mathematical calculation",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string",
                            "description": "The math expression to evaluate, e.g. '2 + 2 * 3'"
                        }
                    },
                    "required": ["expression"]
                }
            }
        }
    ]);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Search for the Rust programming language and calculate 42 * 17"}
        ]))
        .tools(tools)
        .send()
        .await?;

    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            match tool_call.name.as_str() {
                "web_search" => {
                    let args: SearchArgs = tool_call.parse_args_as()?;
                    println!("Search: {} (limit: {:?})", args.query, args.limit);
                }
                "calculate" => {
                    let args: CalcArgs = tool_call.parse_args_as()?;
                    println!("Calculate: {}", args.expression);
                }
                _ => println!("Unknown tool: {}", tool_call.name),
            }
        }
    }

    Ok(())
}

Streaming Tool Calls

Stream tool call updates in real time:

use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather like in Tokyo, Paris, and London?"}
        ]))
        .tools(tools)
        .stream(true)
        .send_stream()
        .await?;

    let mut tool_calls: Vec<ToolCall> = Vec::new();
    let mut content = String::new();

    println!("Streaming response:\n");

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
            }
            StreamEvent::ToolCallDelta { index, name, arguments, .. } => {
                println!("[tool {}] {} - partial arguments: {}", index, name, arguments);
            }
            StreamEvent::ToolCallComplete(tool_call) => {
                println!("[tool complete] {}({})", tool_call.name, tool_call.arguments);
                tool_calls.push(tool_call);
            }
            StreamEvent::Done => {
                println!("\n--- Stream finished ---");
                break;
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!("\nCollected {} tool call(s)", tool_calls.len());
    for (i, tc) in tool_calls.iter().enumerate() {
        println!("  {}. {}({})", i + 1, tc.name, tc.arguments);
    }

    Ok(())
}

Multi-Round Tool Calls

Handle tool calls across multiple rounds:

use vllm_client::{VllmClient, json, VllmError};
use serde_json::Value;

async fn run_agent(
    client: &VllmClient,
    user_message: &str,
    tools: &Value,
    max_rounds: usize,
) -> Result<String, VllmError> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for round in 0..max_rounds {
        println!("--- Round {} ---", round + 1);

        let response = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // Append the assistant message containing the tool calls
            messages.push(response.assistant_message());

            // Execute the tools and append the results
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    println!("Calling: {}({})", tool_call.name, tool_call.arguments);

                    // Execute the tool
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    println!("Result: {}", result);

                    // Append the tool result to the messages
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // No more tool calls; return the final response
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err(VllmError::Other("Exceeded maximum rounds".to_string()))
}

fn execute_tool(name: &str, _args: &str) -> Value {
    // Implement your tool execution logic here (this stub ignores the arguments)
    match name {
        "get_weather" => json!({"temperature": 22, "condition": "sunny"}),
        "web_search" => json!({"results": ["result 1", "result 2"]}),
        _ => json!({"error": "unknown tool"}),
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"]
                }
            }
        }
    ]);

    let result = run_agent(
        &client,
        "What's the weather in Tokyo? Also find information about cherry blossoms",
        &tools,
        5
    ).await?;

    println!("\nFinal answer: {}", result);

    Ok(())
}

Tool Choice Options

Control tool-selection behavior:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // Option 1: let the model decide (default)
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!("auto"))
        .send()
        .await?;

    // Option 2: disable tool use
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather in Tokyo?"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!("none"))
        .send()
        .await?;

    // Option 3: force tool use
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "I need weather information"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!("required"))
        .send()
        .await?;

    // Option 4: force a specific tool
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Check the weather in Tokyo"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!({
            "type": "function",
            "function": {"name": "get_weather"}
        }))
        .send()
        .await?;

    Ok(())
}

Error Handling

Handle tool-execution errors gracefully:

use vllm_client::{VllmClient, json, ToolCall};
use serde_json::Value;

fn execute_tool_safely(tool_call: &ToolCall) -> Value {
    match tool_call.name.as_str() {
        "get_weather" => {
            // Parse the arguments safely
            match tool_call.parse_args() {
                Ok(args) => {
                    // Execute the tool
                    match get_weather_internal(&args) {
                        Ok(result) => json!({"success": true, "data": result}),
                        Err(e) => json!({"success": false, "error": e.to_string()}),
                    }
                }
                Err(e) => json!({
                    "success": false,
                    "error": format!("Invalid arguments: {}", e)
                }),
            }
        }
        _ => json!({
            "success": false,
            "error": format!("Unknown tool: {}", tool_call.name)
        }),
    }
}

fn get_weather_internal(args: &Value) -> Result<Value, String> {
    let location = args["location"].as_str()
        .ok_or("location is required")?;

    // Simulated API call
    Ok(json!({
        "location": location,
        "temperature": 22,
        "condition": "sunny"
    }))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather like?"}
        ]))
        .tools(tools)
        .send()
        .await?;

    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            let result = execute_tool_safely(tool_call);
            println!("Tool result: {}", result);
        }
    }

    Ok(())
}

Related Links

Multimodal Examples

Multimodal support lets you send images and other media types to the model alongside text.

Overview

vLLM supports multimodal input through its OpenAI-compatible API. You can include images in chat messages either base64-encoded or by URL.

Basic Image Input (Base64)

Send a base64-encoded image:

use vllm_client::{VllmClient, json};
use base64::{Engine as _, engine::general_purpose};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Read and encode the image
    let image_data = std::fs::read("image.png")?;
    let base64_image = general_purpose::STANDARD.encode(&image_data);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")  // vision model
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What's in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", base64_image)
                        }
                    }
                ]
            }
        ]))
        .max_tokens(512)
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Referencing Images by URL

Reference an image via URL:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in detail."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://example.com/image.jpg"
                        }
                    }
                ]
            }
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Image Message Helper

Create a reusable helper for building image messages:

use vllm_client::{VllmClient, json};
use serde_json::Value;

fn image_message(text: &str, image_path: &str) -> Result<Value, Box<dyn std::error::Error>> {
    use base64::{Engine as _, engine::general_purpose};

    let image_data = std::fs::read(image_path)?;
    let base64_image = general_purpose::STANDARD.encode(&image_data);

    // Detect the image type from the file extension
    let mime_type = match image_path.to_lowercase().rsplit('.').next() {
        Some("png") => "image/png",
        Some("jpg") | Some("jpeg") => "image/jpeg",
        Some("gif") => "image/gif",
        Some("webp") => "image/webp",
        _ => "image/png",
    };

    Ok(json!({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": text
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": format!("data:{};base64,{}", mime_type, base64_image)
                }
            }
        ]
    }))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let user_msg = image_message("What's in this image?", "photo.jpg")?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([user_msg]))
        .max_tokens(1024)
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Multiple Images

Send several images in a single request:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Read and encode multiple images
    let image1 = encode_image("image1.png")?;
    let image2 = encode_image("image2.png")?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Compare these two images. How do they differ?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", image1)
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", image2)
                        }
                    }
                ]
            }
        ]))
        .max_tokens(1024)
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    use base64::{Engine as _, engine::general_purpose};
    let data = std::fs::read(path)?;
    Ok(general_purpose::STANDARD.encode(&data))
}

Streaming Responses with Images

Stream the response to an image query:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let base64_image = encode_image("chart.png")?;

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Analyze this chart and explain the trends."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", base64_image)
                        }
                    }
                ]
            }
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            print!("{}", delta);
            std::io::Write::flush(&mut std::io::stdout()).ok();
        }
    }

    println!();
    Ok(())
}

Multi-Turn Conversations with Images

Maintain image context across a conversation:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let base64_image = encode_image("screenshot.png")?;

    // First message, including the image
    let messages = json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": format!("data:image/png;base64,{}", base64_image)
                    }
                }
            ]
        }
    ]);

    let response1 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(messages.clone())
        .send()
        .await?;

    println!("First response: {}", response1.content.as_deref().unwrap_or(""));

    // Continue the conversation (no new image needed)
    let messages2 = json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": format!("data:image/png;base64,{}", base64_image)
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": response1.content.unwrap_or_default()
        },
        {
            "role": "user",
            "content": "Can you translate the text in the image?"
        }
    ]);

    let response2 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(messages2)
        .send()
        .await?;

    println!("\nSecond response: {}", response2.content.unwrap_or_default());

    Ok(())
}

OCR and Document Analysis

Use a vision model for OCR and document analysis:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let document_image = encode_image("document.png")?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "system",
                "content": "You are an OCR assistant. Extract text from images accurately and format it properly."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract all text from this document image. Preserve the formatting as much as possible."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", document_image)
                        }
                    }
                ]
            }
        ]))
        .max_tokens(2048)
        .send()
        .await?;

    println!("Extracted text:\n{}", response.content.unwrap_or_default());
    Ok(())
}

Image Size Considerations

Handle large images properly:

use vllm_client::{VllmClient, json};

fn encode_and_resize_image(path: &str, max_size: u32) -> Result<String, Box<dyn std::error::Error>> {
    use base64::{Engine as _, engine::general_purpose};
    use image::ImageReader;

    // Load and resize the image
    let img = ImageReader::open(path)?.decode()?;
    let img = img.resize(max_size, max_size, image::imageops::FilterType::Lanczos3);

    // Convert to PNG
    let mut buffer = std::io::Cursor::new(Vec::new());
    img.write_to(&mut buffer, image::ImageFormat::Png)?;

    Ok(general_purpose::STANDARD.encode(&buffer.into_inner()))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Resize to at most 1024px, preserving the aspect ratio
    let base64_image = encode_and_resize_image("large_image.jpg", 1024)?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", base64_image)
                        }
                    }
                ]
            }
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Supported Models

Use a vision-capable model for multimodal input:

Model                                      Description
Qwen/Qwen2-VL-7B-Instruct                  Qwen2 vision-language model
Qwen/Qwen2-VL-72B-Instruct                 Qwen2 vision-language model (large)
meta-llama/Llama-3.2-11B-Vision-Instruct   Llama 3.2 vision model
openai/clip-vit-large-patch14              CLIP model

Check which models your vLLM server serves:

curl http://localhost:8000/v1/models

Required Dependencies

For image handling, add these dependencies:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
base64 = "0.22"
image = "0.25"  # optional, for image processing

Troubleshooting

Image Too Large

If you hit image-size errors, reduce the image dimensions:

#![allow(unused)]
fn main() {
// Resize before sending
let img = image::load_from_memory(&image_data)?;
let resized = img.resize(1024, 1024, image::imageops::FilterType::Lanczos3);
}

Unsupported Format

Convert the image to a supported format:

#![allow(unused)]
fn main() {
// Convert to PNG
let img = image::load_from_memory(&image_data)?;
let mut output = Vec::new();
img.write_to(&mut std::io::Cursor::new(&mut output), image::ImageFormat::Png)?;
}

Model Does Not Support Vision

Make sure you are using a vision-capable model. Non-vision models will ignore image input.
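As a rough guard before sending image content, you can sanity-check the model name. This helper is purely illustrative: the name patterns below are assumptions based on common naming conventions, and the authoritative source is always your server's /v1/models list.

```rust
/// Heuristic check for vision-capable models based on common naming
/// conventions ("VL", "Vision", "llava"). Illustrative only; verify
/// against your vLLM server's actual model list.
fn is_vision_model(model: &str) -> bool {
    let lower = model.to_lowercase();
    lower.contains("-vl-") || lower.contains("vision") || lower.contains("llava")
}

fn main() {
    assert!(is_vision_model("Qwen/Qwen2-VL-7B-Instruct"));
    assert!(is_vision_model("meta-llama/Llama-3.2-11B-Vision-Instruct"));
    assert!(!is_vision_model("Qwen/Qwen2.5-7B-Instruct"));
    println!("vision checks passed");
}
```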

Related Links

Advanced Topics

This document covers advanced features and usage of vLLM Client.

Contents

Thinking Mode

Some models (such as Qwen-3) support a "thinking mode" that exposes the reasoning process.

Enabling Thinking Mode

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let client = VllmClient::new("http://localhost:8000/v1");

let mut stream = client
    .chat
    .completions()
    .create()
    .model("qwen-3")
    .messages(json!([
        {"role": "user", "content": "Please explain what recursion is"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match &event {
        // Thinking/reasoning content
        StreamEvent::Reasoning(delta) => {
            print!("[thinking] {}", delta);
        }
        // Regular reply content
        StreamEvent::Content(delta) => {
            print!("{}", delta);
        }
        _ => {}
    }
}
}

Thinking Content Format

In thinking mode, the model's output has two parts:

Event type               Description
StreamEvent::Reasoning   The model's reasoning/thinking process
StreamEvent::Content     The final reply content

The thinking content is typically wrapped in <think> tags, which the client parses automatically.
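For illustration, a minimal sketch of that parsing. The client does this for you; the <think> tag name follows the Qwen convention, and this standalone function is an assumption about the format, not the client's actual implementation:

```rust
/// Split a raw completion into (reasoning, answer) parts by extracting
/// the <think>...</think> span, mirroring what the client does internally.
fn split_thinking(raw: &str) -> (Option<String>, String) {
    if let (Some(start), Some(end)) = (raw.find("<think>"), raw.find("</think>")) {
        if start < end {
            let reasoning = raw[start + "<think>".len()..end].trim().to_string();
            // The answer is everything outside the <think> span
            let mut answer = String::new();
            answer.push_str(&raw[..start]);
            answer.push_str(&raw[end + "</think>".len()..]);
            return (Some(reasoning), answer.trim().to_string());
        }
    }
    (None, raw.trim().to_string())
}

fn main() {
    let raw = "<think>15 * 23 = 345, plus 47 is 392.</think>The result is 392.";
    let (reasoning, answer) = split_thinking(raw);
    println!("reasoning: {:?}", reasoning);
    println!("answer: {}", answer);
}
```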

Disabling Thinking Mode

#![allow(unused)]
fn main() {
.extra(json!({
    "chat_template_kwargs": {
        "enable_thinking": false
    }
}))
}

Custom Request Headers

To add custom request headers (proxy authentication, trace IDs, etc.):

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("X-Custom-Header", "custom-value")
    .with_header("X-Request-ID", "req-12345");
}

Common Use Cases

#![allow(unused)]
fn main() {
// Add proxy authentication
let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("Proxy-Authorization", "Bearer proxy-token");

// Add a trace ID for debugging
let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("X-Trace-ID", &uuid::Uuid::new_v4().to_string());
}

Timeouts and Retries

Setting a Timeout

#![allow(unused)]
fn main() {
use std::time::Duration;
use vllm_client::VllmClient;

// Set a 60-second timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(Duration::from_secs(60));

// Set a 5-minute timeout (for long generations)
let client = VllmClient::new("http://localhost:8000/v1")
    .with_timeout(Duration::from_secs(300));
}

Implementing Retry Logic

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn send_with_retry(
    client: &VllmClient,
    messages: serde_json::Value,
    max_retries: u32,
) -> Result<vllm_client::ChatCompletionResponse, VllmError> {
    let mut attempts = 0;
    
    loop {
        match client
            .chat
            .completions()
            .create()
            .model("llama-3-70b")
            .messages(messages.clone())
            .send()
            .await
        {
            Ok(response) => return Ok(response),
            Err(e) => {
                attempts += 1;
                if attempts >= max_retries {
                    return Err(e);
                }
                // Exponential backoff
                sleep(Duration::from_millis(100 * 2u64.pow(attempts))).await;
            }
        }
    }
}
}

Multimodal Support

Image Input

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// Using an image URL
let response = client
    .chat
    .completions()
    .create()
    .model("llava-v1.6")
    .messages(json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]))
    .send()
    .await?;

// Using a base64-encoded image
let base64_image = "data:image/jpeg;base64,/9j/4AAQ...";
let response = client
    .chat
    .completions()
    .create()
    .model("llava-v1.6")
    .messages(json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image"},
                {
                    "type": "image_url",
                    "image_url": {"url": base64_image}
                }
            ]
        }
    ]))
    .send()
    .await?;
}

Multiple Image Support

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("llava-v1.6")
    .messages(json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}}
            ]
        }
    ]))
    .send()
    .await?;
}

Best Practices

1. Connection Pool Management

For high-concurrency workloads, reuse the client instance:

#![allow(unused)]
fn main() {
// Recommended: share a single client instance
use std::sync::Arc;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Use it across multiple tasks
let client_clone = client.clone();
tokio::spawn(async move {
    client_clone.chat.completions().create()
        .model("llama-3")
        .messages(json!([{"role": "user", "content": "Hello"}]))
        .send()
        .await
});
}

2. Error Handling

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, VllmError};

match client.chat.completions().create().send().await {
    Ok(response) => {
        println!("Success: {:?}", response);
    }
    Err(VllmError::ApiError { message, code }) => {
        eprintln!("API error ({}): {}", code, message);
        // Handle by error code
        match code {
            429 => println!("Rate limited; retry later"),
            401 => println!("Authentication failed; check your API key"),
            _ => {}
        }
    }
    Err(e) => {
        eprintln!("Other error: {}", e);
    }
}
}

3. Resource Management for Streaming Responses

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let mut stream = client
    .chat
    .completions()
    .create()
    .model("llama-3")
    .messages(json!([{"role": "user", "content": "Hello"}]))
    .stream(true)
    .send_stream()
    .await?;

// Use take to cap the number of events processed
let mut stream = stream.take(1000);
while let Some(event) = stream.next().await {
    match &event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::Done | StreamEvent::Error(_) => break,
        _ => {}
    }
}
}

Thinking Mode

Thinking mode (also called reasoning mode) lets the model emit its reasoning process before giving the final answer. It is especially useful for complex reasoning tasks.

Overview

Some models, such as Qwen with thinking mode enabled, produce two kinds of content:

  1. Reasoning content - the model's internal "thinking" process
  2. Content - the final response delivered to the user

Enabling Thinking Mode

Qwen Models

For Qwen models, enable thinking mode via the extra parameter:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Calculate: what is 15 * 23 + 47?"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .send()
    .await?;
}

Inspecting Reasoning Content

In a non-streaming response, access the reasoning content separately:

#![allow(unused)]
fn main() {
// Check the reasoning content
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Get the final content
if let Some(content) = response.content {
    println!("Answer: {}", content);
}
}

Streaming with Thinking Mode

Streaming is the best way to use thinking mode:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Think step by step: if I have 5 apples, give 2 to a friend, then buy 3 more, how many do I have?"}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    println!("=== Thinking Process ===\n");
    
    let mut in_thinking = true;
    let mut reasoning = String::new();
    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Content(delta) => {
                if in_thinking {
                    in_thinking = false;
                    println!("\n\n=== Final Answer ===\n");
                }
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!();

    Ok(())
}

Use Cases

Mathematical Reasoning

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

async fn solve_math_problem(client: &VllmClient, problem: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a math tutor. Show your work clearly."},
            {"role": "user", "content": problem}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut answer = String::new();

    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            answer.push_str(&delta);
        }
    }

    Ok(answer)
}
}

Code Analysis

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Analyze this code for potential bugs and security issues:\n\n```rust\nfn process_input(input: &str) -> String {\n    let mut result = String::new();\n    for c in input.chars() {\n        result.push(c);\n    }\n    result\n}\n```"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .send()
    .await?;
}

Complex Decision-Making

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "system", "content": "You are a decision-support assistant. Consider all options carefully."},
        {"role": "user", "content": "I need to choose between Company A (high salary, long commute) and Company B (average salary, remote work). Help me decide."}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "enable_thinking": true
        }
    }))
    .max_tokens(2048)
    .send()
    .await?;
}

Separating Reasoning from the Answer

For applications that need the reasoning separated from the final answer:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

struct ThinkingResponse {
    reasoning: String,
    content: String,
}

async fn think_and_respond(
    client: &VllmClient,
    prompt: &str,
) -> Result<ThinkingResponse, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": prompt}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "enable_thinking": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut response = ThinkingResponse {
        reasoning: String::new(),
        content: String::new(),
    };

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => response.reasoning.push_str(&delta),
            StreamEvent::Content(delta) => response.content.push_str(&delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    Ok(response)
}
}

Model Support

Model                          Thinking mode
Qwen/Qwen2.5-72B-Instruct      ✅ Supported
Qwen/Qwen2.5-32B-Instruct      ✅ Supported
Qwen/Qwen2.5-7B-Instruct       ✅ Supported
DeepSeek-R1                    ✅ Supported (built-in)
Other models                   ❌ Model-dependent

Check your vLLM server configuration to verify thinking-mode support.

Configuration Options

Thinking Model Detection

Thinking tokens from the model are handled automatically:

#![allow(unused)]
fn main() {
// Reasoning content is parsed from special delimiter tokens,
// typically structured as <think>...</think> or similar
}
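As a rough illustration of what that parsing amounts to, a raw completion can be split on such delimiters with plain string handling. The `<think>...</think>` markers are an assumption here; the exact tokens vary by model, and the client library performs this parsing for you:

```rust
/// Split a raw completion into (reasoning, answer) on <think>...</think>
/// delimiters. A minimal sketch for illustration only.
fn split_thinking(raw: &str) -> (Option<String>, String) {
    if let (Some(start), Some(end)) = (raw.find("<think>"), raw.find("</think>")) {
        if start < end {
            // Text between the delimiters is the reasoning
            let reasoning = raw[start + "<think>".len()..end].trim().to_string();
            // Everything after the closing delimiter is the final answer
            let answer = raw[end + "</think>".len()..].trim().to_string();
            return (Some(reasoning), answer);
        }
    }
    // No delimiters: the whole completion is the answer
    (None, raw.trim().to_string())
}

fn main() {
    let raw = "<think>2 + 2 is basic arithmetic.</think>The answer is 4.";
    let (reasoning, answer) = split_thinking(raw);
    println!("reasoning: {:?}", reasoning);
    println!("answer: {}", answer);
}
```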

Non-Streaming Access

For non-streaming requests with reasoning:

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Explain quantum entanglement"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        }
    }))
    .send()
    .await?;

// Access the reasoning content (if present)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning:\n{}\n", reasoning);
}

// Access the final answer
println!("Answer:\n{}", response.content.unwrap_or_default());
}

Best Practices

1. Use It for Complex Tasks

Thinking mode is most valuable for:

  • Multi-step reasoning
  • Math problems
  • Code analysis
  • Complex decision-making
#![allow(unused)]
fn main() {
// Good: a complex reasoning task
.messages(json!([
    {"role": "user", "content": "Solve this: A father is 4 times as old as his son. In 20 years, he will only be twice as old. How old are they now?"}
]))

// Less benefit: a simple query
.messages(json!([
    {"role": "user", "content": "What is 2 + 2?"}
]))
}

2. Show Reasoning Selectively

You may want to hide the reasoning in production but show it while debugging:

#![allow(unused)]
fn main() {
let show_reasoning = std::env::var("SHOW_REASONING").is_ok();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => {
            if show_reasoning {
                eprintln!("[thinking] {}", delta);
            }
        }
        StreamEvent::Content(delta) => print!("{}", delta),
        _ => {}
    }
}
}

3. Combine with System Prompts

Use a system prompt to guide the thinking process:

#![allow(unused)]
fn main() {
.messages(json!([
    {
        "role": "system", 
        "content": "Think through the problem step by step. Consider multiple approaches before settling on an answer."
    },
    {"role": "user", "content": problem}
]))
}

4. Adjust Max Tokens

Thinking mode consumes more tokens; adjust accordingly:

#![allow(unused)]
fn main() {
.max_tokens(4096)  // budget for both the reasoning and the answer
}
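One way to reason about that budget, sketched with a purely illustrative 3:1 split between reasoning and answer (the ratio is a hypothetical starting point, not a library default; tune it for your model and workload):

```rust
/// Split a max_tokens budget between reasoning and answer.
/// The 3:1 ratio is an illustrative assumption, not a library default.
fn split_budget(max_tokens: u32) -> (u32, u32) {
    let reasoning = max_tokens * 3 / 4; // reserve most of the budget for thinking
    (reasoning, max_tokens - reasoning)
}

fn main() {
    let (reasoning, answer) = split_budget(4096);
    println!("reasoning budget: {}, answer budget: {}", reasoning, answer);
}
```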

Troubleshooting

No Reasoning Content

If you don't see any reasoning content:

  1. Make sure thinking mode is enabled via the extra parameters
  2. Verify that the model supports thinking mode
  3. Check the vLLM server configuration
# Check the vLLM server logs for problems

Incomplete Streaming Responses

If the streamed response seems incomplete:

#![allow(unused)]
fn main() {
// Make sure every event type is handled
while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => { /* handle */ },
        StreamEvent::Content(delta) => { /* handle */ },
        StreamEvent::Done => break,
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
            break;
        }
        _ => {}  // don't forget the other events
    }
}
}

Custom Request Headers

This page describes how to use custom HTTP headers with vLLM Client.

Overview

While vLLM Client handles standard authentication via an API key, you may need custom headers for:

  • Custom authentication schemes
  • Request tracing and debugging
  • Rate-limit identifiers
  • Custom metadata

Current Limitations

The current version of vLLM Client does not provide a built-in way to set custom headers. There are, however, several workarounds.

Workaround: Environment Variables

If your vLLM server accepts configuration through environment variables or specific API parameters:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key(std::env::var("MY_API_KEY").unwrap_or_default());
}

Workaround: Extra Parameters

Some custom configuration can be passed through the extra() method:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .extra(json!({
        "custom_field": "custom_value",
        "request_id": "req-12345"
    }))
    .send()
    .await?;
}

Future Support

Custom header support is planned for a future release. The API will likely look something like:

// Future API (not yet implemented)
let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("X-Custom-Header", "value")
    .with_header("X-Request-ID", "req-123");

Common Use Cases

Tracing Headers

For distributed tracing (once supported):

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Trace-ID", trace_id)
    .header("X-Span-ID", span_id)
    .build();

Custom Authentication

For non-standard authentication schemes:

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-API-Key", "custom-key")
    .header("X-Tenant-ID", "tenant-123")
    .build();

Request Metadata

Add metadata for logging or analytics:

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Request-Source", "mobile-app")
    .header("X-User-ID", "user-456")
    .build();

Alternative: A Custom HTTP Client

For advanced use cases, you can use the underlying reqwest client directly:

use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    
    let response = client
        .post("http://localhost:8000/v1/chat/completions")
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer your-api-key")
        .header("X-Custom-Header", "custom-value")
        .json(&json!({
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}]
        }))
        .send()
        .await?;
    
    let result: serde_json::Value = response.json().await?;
    println!("{:?}", result);
    
    Ok(())
}

Best Practices

1. Use Standard Authentication When Possible

#![allow(unused)]
fn main() {
// Recommended
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// Avoid custom authentication unless necessary
}

2. Document Your Custom Headers

When you use custom headers, document their purpose:

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    // For request tracing in logs
    .header("X-Request-ID", &request_id)
    // For multi-tenant identification
    .header("X-Tenant-ID", &tenant_id)
    .build();

3. Verify Server Support

Make sure your vLLM server accepts and processes custom headers. Some proxies and load balancers strip unknown headers.

Security Considerations

Don't Expose Sensitive Headers

Avoid logging headers that contain sensitive information:

// Be careful when logging
let auth_header = "Bearer secret-key";
// Never log this directly!

Use HTTPS

Always use HTTPS when transmitting sensitive headers:

#![allow(unused)]
fn main() {
// Good
let client = VllmClient::new("https://api.example.com/v1");

// Avoid for sensitive data
let client = VllmClient::new("http://api.example.com/v1");
}

Requesting This Feature

If you need custom header support, please open an issue on GitHub with:

  1. Your use case
  2. The headers you need
  3. How you'd like the API to be designed

Timeouts and Retries

This page covers timeout configuration and retry strategies for building robust production applications.

Setting Timeouts

Client-Level Timeouts

Set a timeout when creating the client:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

// Simple timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120);

// Using the builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)  // 5 minutes
    .build();
}

Choosing an Appropriate Timeout

Use case                    Recommended timeout
Simple queries              30-60 seconds
Code generation             2-3 minutes
Long document generation    5-10 minutes
Complex reasoning tasks     10+ minutes

What Affects Request Duration

How long a request takes depends on:

  1. Prompt length - longer prompts take more processing time
  2. Output token count - more tokens mean longer generation
  3. Model size - larger models are slower
  4. Server load - a busy server responds more slowly
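These factors can be folded into a rough heuristic for picking a timeout. The throughput figures below are placeholders, not measurements; benchmark your own deployment before relying on anything like this:

```rust
/// Rough timeout estimate: prompt processing plus generation time,
/// padded with a 2x safety margin. All rates are illustrative assumptions.
fn estimate_timeout_secs(prompt_tokens: u64, max_output_tokens: u64, gen_tokens_per_sec: u64) -> u64 {
    let prefill_secs = prompt_tokens / 1000 + 1;                     // assume ~1000 prompt tokens/sec
    let decode_secs = max_output_tokens / gen_tokens_per_sec.max(1) + 1;
    (prefill_secs + decode_secs) * 2                                 // 2x safety margin
}

fn main() {
    // e.g. a 2k-token prompt, 1k output tokens, ~20 tokens/sec decode
    println!("{}s", estimate_timeout_secs(2000, 1000, 20));
}
```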

Timeout Errors

Handling Timeouts

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

async fn chat_with_timeout(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(60);

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await;

    match result {
        Ok(response) => Ok(response.content.unwrap_or_default()),
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out after 60 seconds");
            Err(VllmError::Timeout)
        }
        Err(e) => Err(e),
    }
}
}

Retry Strategies

Basic Retries

Retry failed requests with exponential backoff:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn send_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut attempts = 0;

    loop {
        match client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
        {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && attempts < max_retries => {
                attempts += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(attempts - 1));
                eprintln!("retry {} in {:?}: {}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
}

Retries with Jitter

Add jitter to prevent the thundering-herd problem:

#![allow(unused)]
fn main() {
use rand::Rng;
use std::time::Duration;
use tokio::time::sleep;

fn backoff_with_jitter(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    let exponential = base_ms * 2u64.pow(attempt);
    let jitter = rand::thread_rng().gen_range(0..base_ms);
    let delay = (exponential + jitter).min(max_ms);
    Duration::from_millis(delay)
}

async fn retry_with_jitter<F, T, E>(
    mut f: F,
    max_retries: u32,
) -> Result<T, E>
where
    F: FnMut() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, E>> + Send>>,
    E: std::fmt::Debug,
{
    let mut attempts = 0;

    loop {
        match f().await {
            Ok(result) => return Ok(result),
            Err(e) if attempts < max_retries => {
                attempts += 1;
                let delay = backoff_with_jitter(attempts, 100, 10_000);
                eprintln!("retry {} in {:?}: {:?}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
}

Retry Only Retryable Errors

Not every error should be retried:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

async fn smart_retry(
    client: &VllmClient,
    prompt: &str,
) -> Result<String, VllmError> {
    let mut attempts = 0;
    let max_retries = 3;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => return Ok(response.content.unwrap_or_default()),
            Err(e) => {
                // Check whether the error is retryable
                if !e.is_retryable() {
                    return Err(e);
                }

                if attempts >= max_retries {
                    return Err(e);
                }

                attempts += 1;
                tokio::time::sleep(std::time::Duration::from_secs(2u64.pow(attempts))).await;
            }
        }
    }
}
}

Retryable Errors

Error                    Retryable   Reason
Timeout                  ✅          The server may just be slow
429 Rate Limited         ✅          Wait and retry
500 Server Error         ✅          Transient server problem
502 Bad Gateway          ✅          The server may be restarting
503 Service Unavailable  ✅          Temporary overload
504 Gateway Timeout      ✅          Upstream server timed out
400 Bad Request          ❌          Client error
401 Unauthorized         ❌          Authentication problem
404 Not Found            ❌          The resource doesn't exist
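As an illustration, the table can be expressed as a status-code predicate. This is a sketch only; the examples above rely on the library's own VllmError::is_retryable() rather than raw status codes:

```rust
/// Illustrative classification matching the table above.
/// Only transient conditions (rate limits, 5xx) are worth retrying;
/// 4xx client errors will fail again identically.
fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500 | 502 | 503 | 504)
}

fn main() {
    assert!(is_retryable_status(503));  // temporary overload: retry
    assert!(!is_retryable_status(400)); // client error: don't retry
    println!("ok");
}
```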

Circuit Breaker Pattern

Use a circuit breaker to prevent cascading failures:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{Duration, Instant};
use std::sync::Mutex;

struct CircuitBreaker {
    failures: AtomicU32,
    last_failure: Mutex<Option<Instant>>,
    threshold: u32,
    reset_duration: Duration,
}

impl CircuitBreaker {
    fn new(threshold: u32, reset_duration: Duration) -> Self {
        Self {
            failures: AtomicU32::new(0),
            last_failure: Mutex::new(None),
            threshold,
            reset_duration,
        }
    }

    fn can_attempt(&self) -> bool {
        let failures = self.failures.load(Ordering::Relaxed);
        if failures < self.threshold {
            return true;
        }

        let last = self.last_failure.lock().unwrap();
        if let Some(time) = *last {
            if time.elapsed() > self.reset_duration {
                // Reset the breaker
                self.failures.store(0, Ordering::Relaxed);
                return true;
            }
        }

        false
    }

    fn record_success(&self) {
        self.failures.store(0, Ordering::Relaxed);
    }

    fn record_failure(&self) {
        self.failures.fetch_add(1, Ordering::Relaxed);
        *self.last_failure.lock().unwrap() = Some(Instant::now());
    }
}
}
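To see the breaker's state transitions without a server, here is a condensed, self-contained version of the same logic driven by hand (the threshold and reset window are arbitrary demo values):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Condensed version of the breaker above, just enough to demonstrate
// the open/reset transitions.
struct Breaker {
    failures: AtomicU32,
    last_failure: Mutex<Option<Instant>>,
    threshold: u32,
    reset: Duration,
}

impl Breaker {
    fn new(threshold: u32, reset: Duration) -> Self {
        Self { failures: AtomicU32::new(0), last_failure: Mutex::new(None), threshold, reset }
    }

    fn can_attempt(&self) -> bool {
        if self.failures.load(Ordering::Relaxed) < self.threshold {
            return true;
        }
        match *self.last_failure.lock().unwrap() {
            // Past the reset window: close the breaker and allow a probe
            Some(t) if t.elapsed() > self.reset => {
                self.failures.store(0, Ordering::Relaxed);
                true
            }
            _ => false,
        }
    }

    fn record_failure(&self) {
        self.failures.fetch_add(1, Ordering::Relaxed);
        *self.last_failure.lock().unwrap() = Some(Instant::now());
    }
}

fn main() {
    let breaker = Breaker::new(3, Duration::from_millis(50));
    for _ in 0..3 {
        breaker.record_failure();            // trip the breaker
    }
    assert!(!breaker.can_attempt());         // open: requests blocked
    std::thread::sleep(Duration::from_millis(60)); // wait past the reset window
    assert!(breaker.can_attempt());          // allowed to try again
    println!("breaker reset");
}
```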

Streaming Timeouts

Handle timeouts while consuming a streaming response:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use tokio::time::{timeout, Duration};

async fn stream_with_timeout(
    client: &VllmClient,
    prompt: &str,
    per_event_timeout: Duration,
) -> Result<String, vllm_client::VllmError> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    loop {
        match timeout(per_event_timeout, stream.next()).await {
            Ok(Some(event)) => {
                match event {
                    StreamEvent::Content(delta) => content.push_str(&delta),
                    StreamEvent::Done => break,
                    StreamEvent::Error(e) => return Err(e),
                    _ => {}
                }
            }
            Ok(None) => break,
            Err(_) => {
                return Err(vllm_client::VllmError::Timeout);
            }
        }
    }

    Ok(content)
}
}

Rate Limiting

Implement client-side rate limiting:

#![allow(unused)]
fn main() {
use tokio::sync::Semaphore;
use std::sync::Arc;

struct RateLimitedClient {
    client: vllm_client::VllmClient,
    semaphore: Arc<Semaphore>,
}

impl RateLimitedClient {
    fn new(base_url: &str, max_concurrent: usize) -> Self {
        Self {
            client: vllm_client::VllmClient::new(base_url),
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, vllm_client::VllmError> {
        let _permit = self.semaphore.acquire().await.unwrap();
        
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(vllm_client::json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }
}
}

Production Configuration

Complete Example

use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

struct RobustClient {
    client: VllmClient,
    max_retries: u32,
    base_backoff_ms: u64,
    max_backoff_ms: u64,
}

impl RobustClient {
    fn new(base_url: &str, timeout_secs: u64) -> Self {
        Self {
            client: VllmClient::builder()
                .base_url(base_url)
                .timeout_secs(timeout_secs)
                .build(),
            max_retries: 3,
            base_backoff_ms: 100,
            max_backoff_ms: 10_000,
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, VllmError> {
        let mut attempts = 0;

        loop {
            match self.send_request(prompt).await {
                Ok(response) => return Ok(response),
                Err(e) if self.should_retry(&e, attempts) => {
                    attempts += 1;
                    let delay = self.calculate_backoff(attempts);
                    eprintln!("retry {} in {:?}: {}", attempts, delay, e);
                    sleep(delay).await;
                }
                Err(e) => return Err(e),
            }
        }
    }

    async fn send_request(&self, prompt: &str) -> Result<String, VllmError> {
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }

    fn should_retry(&self, error: &VllmError, attempts: u32) -> bool {
        attempts < self.max_retries && error.is_retryable()
    }

    fn calculate_backoff(&self, attempt: u32) -> Duration {
        let delay = self.base_backoff_ms * 2u64.pow(attempt);
        Duration::from_millis(delay.min(self.max_backoff_ms))
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = RobustClient::new("http://localhost:8000/v1", 300);

    match client.chat("Hello!").await {
        Ok(response) => println!("Response: {}", response),
        Err(e) => eprintln!("Still failing after retries: {}", e),
    }

    Ok(())
}

Best Practices

  1. Set timeouts appropriate to the expected response time
  2. Use exponential backoff to avoid overwhelming the server
  3. Add jitter to prevent thundering-herd problems
  4. Retry only retryable errors - don't retry client errors
  5. Implement a circuit breaker for production systems
  6. Log retry attempts for debugging and monitoring
  7. Cap the number of retries to avoid infinite loops

Contributing Guide

Thank you for your interest in contributing to vLLM Client! This document provides guidelines and instructions for contributing.

Code of Conduct

Please be respectful and inclusive. We welcome contributions from everyone.

Getting Started

  1. Fork the repository on GitHub
  2. Clone your fork locally
  3. Create a branch for your changes
git clone https://github.com/YOUR_USERNAME/vllm-client.git
cd vllm-client
git checkout -b my-feature

Development Setup

Prerequisites

  • Rust 1.70 or later
  • Cargo (installed with Rust)
  • A vLLM server for integration tests (optional)

Building

# Build the library
cargo build

# Build with all features
cargo build --all-features

Running Tests

# Run unit tests
cargo test

# Run tests with output shown
cargo test -- --nocapture

# Run a specific test
cargo test test_name

# Run integration tests (requires a vLLM server)
cargo test --test integration

Making Changes

Branch Naming

Use descriptive branch names:

  • feature/add-new-feature - for new features
  • fix/bug-description - for bug fixes
  • docs/documentation-update - for documentation changes
  • refactor/code-cleanup - for refactoring

Commit Messages

Follow the Conventional Commits format:

type(scope): description

[optional body]

[optional footer]

Types:

  • feat: a new feature
  • fix: a bug fix
  • docs: documentation changes
  • style: code style changes (formatting, etc.)
  • refactor: code refactoring
  • test: adding or updating tests
  • chore: maintenance tasks

Examples:

feat(client): add connection pool support

fix(streaming): handle empty data chunks correctly

docs(api): update streaming documentation
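For illustration, a commit header in this format can be parsed with a few lines of plain Rust. This is a sketch, not part of the project's tooling:

```rust
/// Minimal sketch of parsing a Conventional Commits header into
/// (type, optional scope, description). Illustration only.
fn parse_commit_header(header: &str) -> Option<(String, Option<String>, String)> {
    // "type(scope): description" or "type: description"
    let (prefix, description) = header.split_once(": ")?;
    let (ty, scope) = match prefix.split_once('(') {
        Some((ty, rest)) => (ty, Some(rest.strip_suffix(')')?.to_string())),
        None => (prefix, None),
    };
    Some((ty.to_string(), scope, description.trim().to_string()))
}

fn main() {
    let parsed = parse_commit_header("feat(client): add connection pool support");
    println!("{:?}", parsed);
}
```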

Testing

Unit Tests

All new functionality should have unit tests:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_new_feature() {
        // test implementation
    }
}
}

Integration Tests

Integration tests live in the tests/ directory:

#![allow(unused)]
fn main() {
// tests/integration_test.rs
use vllm_client::{VllmClient, json};

#[tokio::test]
async fn test_chat_completion() {
    let client = VllmClient::new("http://localhost:8000/v1");
    // ... test code
}
}

Test Coverage

We aim for good test coverage. To generate a coverage report:

cargo tarpaulin --out Html

Documentation

Code Documentation

Document all public APIs with doc comments:

#![allow(unused)]
fn main() {
/// Creates a new chat completion request.
///
/// # Returns
///
/// A new `ChatCompletionsRequest` builder
///
/// # Example
///
/// ```rust
/// use vllm_client::{VllmClient, json};
///
/// let client = VllmClient::new("http://localhost:8000/v1");
/// let response = client.chat.completions().create()
///     .model("Qwen/Qwen2.5-7B-Instruct")
///     .messages(json!([{"role": "user", "content": "Hello"}]))
///     .send()
///     .await?;
/// ```
pub fn create(&self) -> ChatCompletionsRequest {
    // implementation
}
}

Updating Documentation

When adding a new feature:

  1. Update the inline documentation
  2. Update the API reference in docs/src/api/
  3. Add examples in docs/src/examples/
  4. Update the changelog

Building the Docs

# Build and preview the documentation
cd docs && mdbook serve --open

Pull Request Process

  1. Update documentation: make sure the docs reflect your changes
  2. Add tests: include tests for new functionality
  3. Run the tests: make sure all tests pass
  4. Format the code: run cargo fmt
  5. Check lints: run cargo clippy
  6. Update the CHANGELOG: add an entry to the changelog

Pre-PR Checklist

# Format the code
cargo fmt

# Check lints
cargo clippy -- -D warnings

# Run all tests
cargo test

# Build the docs
mdbook build docs
mdbook build docs/zh

Submitting a PR

  1. Push your branch to your fork
  2. Open a PR against the main branch
  3. Fill in the PR template
  4. Wait for review

PR Template

## Description

A brief description of the changes

## Type of Change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed

## Checklist

- [ ] Code formatted with `cargo fmt`
- [ ] No clippy warnings
- [ ] Documentation updated
- [ ] Changelog updated

Coding Standards

Rust Style

Follow standard Rust conventions:

  • Use cargo fmt for formatting
  • Resolve all clippy warnings
  • Follow the Rust API Guidelines

Naming Conventions

  • Types: PascalCase (ChatCompletionResponse)
  • Functions/methods: snake_case (send_stream)
  • Constants: SCREAMING_SNAKE_CASE (MAX_RETRIES)
  • Modules: snake_case (chat_completions)

Error Handling

Use VllmError for all errors:

#![allow(unused)]
fn main() {
// Good
pub fn parse_response(data: &str) -> Result<Response, VllmError> {
    serde_json::from_str(data).map_err(VllmError::Json)
}

// Avoid
pub fn parse_response(data: &str) -> Result<Response, String> {
    // ...
}
}

Async Code

Use async/await for all asynchronous operations:

#![allow(unused)]
fn main() {
// Good
pub async fn send(&self) -> Result<Response, VllmError> {
    let response = self.http.post(&url).send().await?;
    // ...
}

// Avoid blocking inside an async context
pub async fn bad_example(&self) -> Result<Response, VllmError> {
    std::thread::sleep(Duration::from_secs(1)); // don't do this
    // ...
}
}

Project Structure

vllm-client/
├── src/
│   ├── lib.rs         # Library entry point
│   ├── client.rs      # Client implementation
│   ├── chat.rs        # Chat API
│   ├── completions.rs # Legacy completions
│   ├── types.rs       # Type definitions
│   └── error.rs       # Error types
├── tests/
│   └── integration/   # Integration tests
├── docs/
│   ├── src/           # English documentation
│   └── zh/src/        # Chinese documentation
├── examples/
│   └── *.rs           # Example programs
└── Cargo.toml

Getting Help

  • For bugs or feature requests, open an issue
  • For questions, start a discussion
  • Check existing issues before creating a new one

License

By contributing, you agree that your contributions will be licensed under MIT OR Apache-2.0.

Acknowledgments

Contributors are recognized in our README and release notes.

Thank you for contributing to vLLM Client!

Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.1.0 - 2024-01-XX

Added

  • Initial release of vLLM Client
  • VllmClient for connecting to vLLM servers
  • Chat completions API (client.chat.completions())
  • Streaming response support via MessageStream
  • Tool/function calling support
  • Reasoning/thinking mode support for compatible models
  • Error handling via the VllmError enum
  • Builder pattern for client configuration
  • Request builder pattern for chat completions
  • vLLM-specific parameters via extra()
  • Token usage tracking in responses
  • Timeout configuration
  • API key authentication

Features

Client

  • VllmClient::new(base_url) - create a new client
  • VllmClient::builder() - create a client with the builder pattern
  • with_api_key() - set the API key for authentication
  • timeout_secs() - set the request timeout

Chat Completions

  • model() - set the model name
  • messages() - set the conversation messages
  • temperature() - set the sampling temperature
  • max_tokens() - set the maximum number of output tokens
  • top_p() - set the nucleus sampling parameter
  • top_k() - set top-k sampling (vLLM extension)
  • stop() - set stop sequences
  • stream() - enable streaming mode
  • tools() - define available tools
  • tool_choice() - control tool selection
  • extra() - pass vLLM-specific parameters

Streaming

  • StreamEvent::Content - content tokens
  • StreamEvent::Reasoning - reasoning content (thinking models)
  • StreamEvent::ToolCallDelta - streaming tool call updates
  • StreamEvent::ToolCallComplete - completed tool calls
  • StreamEvent::Usage - token usage statistics
  • StreamEvent::Done - stream finished
  • StreamEvent::Error - error events

Response Types

  • ChatCompletionResponse - chat completion response
  • ToolCall - tool call data with parsing helpers
  • Usage - token usage statistics

Dependencies

  • reqwest - HTTP client
  • serde / serde_json - JSON serialization
  • tokio - async runtime
  • thiserror - error handling

[Unreleased]

Planned

  • Custom HTTP header support
  • Connection pool configuration
  • Request/response logging
  • Retry middleware
  • Multimodal input helpers
  • Async iterators for batch processing
  • OpenTelemetry integration
  • WebSocket transport

Version History

Version   Date      Highlights
0.1.0     2024-01   Initial release