vLLM Client
A Rust client library for the vLLM API with an OpenAI-compatible interface.
Features
- OpenAI Compatible: Uses the same API structure as OpenAI, making it easy to switch
- Streaming Support: Full support for streaming responses with Server-Sent Events (SSE)
- Tool Calling: Support for function/tool calling with streaming delta updates
- Reasoning Models: Built-in support for reasoning/thinking models (like Qwen with thinking mode)
- Async/Await: Fully async, built on the Tokio runtime
- Type Safe: Strong types with Serde serialization
Quick Start
Add to your Cargo.toml:
```toml
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
```
Basic Usage
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, world!"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
Documentation
- Getting Started - Installation and basic setup
- API Reference - Complete API documentation
- Examples - Code examples
- Advanced Topics - Streaming, tools, and more
License
Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Getting Started
Installation
Add vllm-client to your Cargo.toml:
```toml
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
```
Quick Start
Basic Chat Completion
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    // Print the response
    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
Streaming Response
```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => print!("{}", delta),
            StreamEvent::Content(delta) => print!("{}", delta),
            _ => {}
        }
    }
    println!();
    Ok(())
}
```
Configuration
API Key
If your vLLM server requires authentication:
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");
```
Custom Timeout
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(60); // 1 minute
```
Next Steps
- API Reference - Complete API documentation
- Examples - More usage examples
- Advanced Features - Thinking mode, tool calling, etc.
Installation
Requirements
- Rust: 1.70 or later
- Cargo: Comes with Rust installation
Adding to Your Project
Add vllm-client to your Cargo.toml:
```toml
[dependencies]
vllm-client = "0.1"
```
Or use cargo add:
```sh
cargo add vllm-client
```
Required Dependencies
The library requires tokio for async runtime. Add it to your Cargo.toml:
```toml
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
```
Optional Dependencies
The library re-exports serde_json::json for convenience. If you need the full serde_json API, add it explicitly:
```toml
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
```
Feature Flags
Currently, vllm-client does not have additional feature flags. All functionality is included by default.
Verifying Installation
Create a simple test to verify the installation:
```rust
use vllm_client::VllmClient;

fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");
    println!("Client created with base URL: {}", client.base_url());
}
```
Run with:
```sh
cargo run
```
vLLM Server Setup
To use this client, you need a vLLM server running. Install and start vLLM:
```sh
# Install vLLM
pip install vllm

# Start vLLM server with a model
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
```
The server will be available at http://localhost:8000/v1.
Troubleshooting
Connection Refused
If you see connection errors, ensure:
- The vLLM server is running
- The server URL is correct (default: http://localhost:8000/v1)
- The port is not blocked by a firewall
TLS/SSL Issues
If your vLLM server uses HTTPS with a self-signed certificate, you may need to handle certificate validation in your application.
Timeout Errors
For long-running requests, configure a longer timeout:
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes
```
Next Steps
- Quick Start - Learn basic usage
- Configuration - Configure the client
Quick Start
This guide will help you make your first API call with vLLM Client.
Prerequisites
- Rust 1.70 or later
- A running vLLM server
Basic Chat Completion
The simplest way to use the client is a non-streaming chat completion:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client pointing to your vLLM server
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    // Print the response
    println!("Response: {}", response.content.unwrap_or_default());
    Ok(())
}
```
Streaming Response
For real-time output, use streaming:
```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Create a streaming request
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    // Process streaming events
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Reasoning(delta) => eprint!("[thinking: {}]", delta),
            StreamEvent::Done => println!("\n[Done]"),
            StreamEvent::Error(e) => eprintln!("\nError: {}", e),
            _ => {}
        }
    }
    Ok(())
}
```
Using the Builder Pattern
For more configuration options, use the builder:
```rust
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key") // Optional
    .timeout_secs(120)       // Optional
    .build();
```
Complete Example with Options
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    println!("Response: {}", response.content.unwrap_or_default());

    // Print usage statistics if available
    if let Some(usage) = response.usage {
        println!(
            "Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens, usage.completion_tokens, usage.total_tokens
        );
    }
    Ok(())
}
```
Error Handling
Handle errors gracefully:
```rust
use vllm_client::{VllmClient, json, VllmError};

async fn chat() -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat().await {
        Ok(text) => println!("Response: {}", text),
        Err(VllmError::ApiError { status_code, message, .. }) => {
            eprintln!("API Error ({}): {}", status_code, message);
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out");
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}
```
Next Steps
- Configuration - Learn about all configuration options
- API Reference - Detailed API documentation
- Examples - More usage examples
Configuration
This page covers all configuration options for vllm-client.
Client Configuration
Basic Setup
```rust
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
```
Using the Builder Pattern
For more complex configurations, use the builder pattern:
```rust
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")
    .timeout_secs(120)
    .build();
```
Configuration Options
Base URL
The base URL of your vLLM server. This should include the /v1 path for OpenAI compatibility.
```rust
// Local development
let client = VllmClient::new("http://localhost:8000/v1");

// Remote server
let client = VllmClient::new("https://api.example.com/v1");

// With trailing slash (automatically normalized)
let client = VllmClient::new("http://localhost:8000/v1/");
// Equivalent to: "http://localhost:8000/v1"
```
API Key
If your vLLM server requires authentication, configure the API key:
```rust
// Using method chaining
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");

// Using the builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .build();
```
The API key is sent as a Bearer token in the Authorization header.
Timeout
Configure the request timeout for long-running operations:
```rust
// Using method chaining
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes

// Using the builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)
    .build();
```
Default timeout uses the underlying HTTP client's default (usually 30 seconds).
Request Configuration
When making requests, you can configure various parameters:
Model Selection
```rust
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
```
Sampling Parameters
```rust
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .temperature(0.7)  // 0.0 - 2.0
    .top_p(0.9)        // 0.0 - 1.0
    .top_k(50)         // vLLM extension
    .max_tokens(1024)  // Max output tokens
    .send()
    .await?;
```
| Parameter | Type | Range | Description |
|---|---|---|---|
| `temperature` | `f32` | 0.0 - 2.0 | Controls randomness; higher = more random |
| `top_p` | `f32` | 0.0 - 1.0 | Nucleus sampling threshold |
| `top_k` | `i32` | 1+ | Top-K sampling (vLLM extension) |
| `max_tokens` | `u32` | 1+ | Maximum tokens to generate |
Stop Sequences
```rust
use serde_json::json;

// Multiple stop sequences
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!(["END", "STOP", "\n\n"]))
    .send()
    .await?;

// Single stop sequence
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!("END"))
    .send()
    .await?;
```
Extra Parameters
vLLM supports additional parameters via the extra() method:
```rust
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Think about this"}]))
    .extra(json!({
        "chat_template_kwargs": { "think_mode": true },
        "reasoning_effort": "high"
    }))
    .send()
    .await?;
```
Environment Variables
You can use environment variables to configure the client:
```rust
use std::env;
use vllm_client::VllmClient;

let base_url = env::var("VLLM_BASE_URL")
    .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
let api_key = env::var("VLLM_API_KEY").ok();

let mut client_builder = VllmClient::builder().base_url(&base_url);
if let Some(key) = api_key {
    client_builder = client_builder.api_key(&key);
}
let client = client_builder.build();
```
Recommended Environment Variables
| Variable | Description | Example |
|---|---|---|
| `VLLM_BASE_URL` | vLLM server URL | `http://localhost:8000/v1` |
| `VLLM_API_KEY` | API key (optional) | `sk-xxx` |
| `VLLM_TIMEOUT` | Timeout in seconds | `300` |
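The environment example above reads VLLM_BASE_URL and VLLM_API_KEY but not VLLM_TIMEOUT. A small std-only helper for parsing it with a fallback might look like this (a sketch; `parse_timeout` is a hypothetical name, not part of vllm-client):

```rust
use std::env;

/// Parse a timeout (in seconds) from an optional environment value,
/// falling back to `default_secs` when the value is unset or unparsable.
/// (Hypothetical helper; not part of vllm-client itself.)
fn parse_timeout(value: Option<&str>, default_secs: u64) -> u64 {
    value
        .and_then(|v| v.trim().parse::<u64>().ok())
        .unwrap_or(default_secs)
}

fn main() {
    // The pure parsing logic is easy to test:
    assert_eq!(parse_timeout(Some("300"), 30), 300);
    assert_eq!(parse_timeout(Some("not-a-number"), 30), 30);
    assert_eq!(parse_timeout(None, 30), 30);

    // Wired to the real environment:
    let timeout = parse_timeout(env::var("VLLM_TIMEOUT").ok().as_deref(), 30);
    println!("timeout = {}s", timeout);
}
```

The result can then be passed to the builder's `timeout_secs(...)`.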
Best Practices
Reusing the Client
Create the client once and reuse it for multiple requests:
```rust
// Good: reuse the client
let client = VllmClient::new("http://localhost:8000/v1");
for prompt in prompts {
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;
}

// Avoid: creating a client for each request
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // Inefficient!
    // ...
}
```
Timeout Selection
Choose appropriate timeouts based on your use case:
| Use Case | Recommended Timeout |
|---|---|
| Simple queries | 30 seconds |
| Complex reasoning | 2-5 minutes |
| Long document generation | 10+ minutes |
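The table above can be encoded directly if you select timeouts programmatically. This is an illustrative, std-only sketch (the `UseCase` enum and its numbers are assumptions drawn from the table, not part of the library):

```rust
/// Rough timeout recommendations from the table above, in seconds.
/// (Illustrative only; tune these for your deployment.)
#[derive(Clone, Copy)]
enum UseCase {
    SimpleQuery,
    ComplexReasoning,
    LongDocument,
}

fn recommended_timeout_secs(case: UseCase) -> u64 {
    match case {
        UseCase::SimpleQuery => 30,
        UseCase::ComplexReasoning => 300, // upper end of 2-5 minutes
        UseCase::LongDocument => 600,     // 10+ minutes
    }
}

fn main() {
    assert_eq!(recommended_timeout_secs(UseCase::SimpleQuery), 30);
    // A client could then be built with:
    // VllmClient::new(url).timeout_secs(recommended_timeout_secs(case));
}
```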
Error Handling
Always handle errors appropriately:
```rust
use vllm_client::{VllmClient, VllmError};

match client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await
{
    Ok(response) => println!("{}", response.content.unwrap_or_default()),
    Err(VllmError::Timeout) => eprintln!("Request timed out"),
    Err(VllmError::ApiError { status_code, message, .. }) => {
        eprintln!("API error ({}): {}", status_code, message);
    }
    Err(e) => eprintln!("Error: {}", e),
}
```
Next Steps
- Quick Start - Basic usage examples
- API Reference - Complete API documentation
- Error Handling - Detailed error handling guide
API Reference
This section provides detailed documentation for the vLLM Client API.
Design Philosophy
The vLLM Client API follows these design principles:
Builder Pattern
All request constructions use the builder pattern for ergonomic and flexible API calls:
```rust
let response = client.chat.completions().create()
    .model("model-name")
    .messages(json!([{"role": "user", "content": "Hello"}]))
    .temperature(0.7)
    .max_tokens(1024)
    .send()
    .await?;
```
Async-First
All API operations are async, built on Tokio. Use #[tokio::main] or integrate with your existing runtime:
```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Your async code here
    Ok(())
}
```
Type Safety
Strong types are used throughout the library with Serde serialization:
- `ChatCompletionResponse` - Response from chat completions
- `StreamEvent` - Events from streaming responses
- `ToolCall` - Tool/function call data
- `VllmError` - Comprehensive error types
OpenAI Compatibility
The API mirrors the OpenAI API structure, making it easy to migrate existing code:
| OpenAI | vLLM Client |
|---|---|
| `client.chat.completions.create(...)` | `client.chat.completions().create()...send().await` |
| `stream=True` | `.stream(true).send_stream().await` |
| `tools=[...]` | `.tools(json!([...]))` |
Module Structure
```text
VllmClient
├── chat
│   └── completions()        # Chat completions API
│       └── create()         # Create request builder
│           ├── send()         # Execute request
│           └── send_stream()  # Execute with streaming
├── completions              # Legacy completions API
└── builder()                # Client builder
```
Core Types
Request Types
| Type | Description |
|---|---|
| `ChatCompletionsRequest` | Builder for chat completion requests |
| `VllmClientBuilder` | Builder for client configuration |
Response Types
| Type | Description |
|---|---|
| `ChatCompletionResponse` | Response from chat completions |
| `CompletionResponse` | Response from legacy completions |
| `MessageStream` | Streaming response iterator |
| `StreamEvent` | Individual stream events |
| `ToolCall` | Tool/function call data |
| `Usage` | Token usage statistics |
Error Types
| Type | Description |
|---|---|
| `VllmError::Http` | HTTP request failed |
| `VllmError::Json` | JSON serialization error |
| `VllmError::ApiError` | API returned an error |
| `VllmError::Stream` | Streaming error |
| `VllmError::Timeout` | Request timed out |
Quick Reference
Creating a Client
```rust
use vllm_client::VllmClient;

// Simple
let client = VllmClient::new("http://localhost:8000/v1");

// With API key
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");

// With builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
```
Chat Completion
```rust
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .temperature(0.7)
    .max_tokens(1024)
    .send()
    .await?;

println!("{}", response.content.unwrap_or_default());
```
Streaming
```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::Reasoning(delta) => eprintln!("[thinking] {}", delta),
        StreamEvent::Done => break,
        _ => {}
    }
}
```
Sections
- Client - VllmClient configuration and methods
- Chat Completions - Chat completions API
- Streaming - Streaming response handling
- Tool Calling - Function/tool calling
- Error Handling - Error types and handling
Client API
The VllmClient is the main entry point for interacting with the vLLM API.
Creating a Client
Simple Construction
```rust
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
```
With API Key
```rust
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");
```
With Timeout
```rust
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120); // 2 minutes
```
Using the Builder Pattern
For more complex configurations, use the builder:
```rust
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .timeout_secs(300)
    .build();
```
Methods Reference
new(base_url: impl Into<String>) -> Self
Create a new client with the given base URL.
```rust
let client = VllmClient::new("http://localhost:8000/v1");
```
Parameters:
- `base_url` - The base URL of the vLLM server (should include the `/v1` path)
Notes:
- Trailing slashes are automatically removed
- The client is cheap to create but should be reused when possible
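The documented trailing-slash normalization can be sketched in plain std Rust (illustrative only; the crate's actual implementation may differ):

```rust
/// Normalize a base URL as the docs describe: strip trailing slashes.
/// (Sketch of the documented behavior, not the crate's actual code.)
fn normalize_base_url(url: &str) -> String {
    url.trim_end_matches('/').to_string()
}

fn main() {
    assert_eq!(
        normalize_base_url("http://localhost:8000/v1/"),
        "http://localhost:8000/v1"
    );
    // URLs without a trailing slash pass through unchanged.
    assert_eq!(
        normalize_base_url("http://localhost:8000/v1"),
        "http://localhost:8000/v1"
    );
}
```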
with_api_key(self, api_key: impl Into<String>) -> Self
Set the API key for authentication (builder pattern).
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
```
Parameters:
- `api_key` - The API key to use for Bearer authentication
Notes:
- The API key is sent as a Bearer token in the `Authorization` header
- This method returns a new client instance
timeout_secs(self, secs: u64) -> Self
Set the request timeout in seconds (builder pattern).
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300);
```
Parameters:
- `secs` - Timeout duration in seconds
Notes:
- Applies to all requests made by this client
- For long-running generation tasks, consider setting a higher timeout
base_url(&self) -> &str
Get the base URL of the client.
```rust
let client = VllmClient::new("http://localhost:8000/v1");
assert_eq!(client.base_url(), "http://localhost:8000/v1");
```
api_key(&self) -> Option<&str>
Get the API key, if configured.
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
assert_eq!(client.api_key(), Some("sk-xxx"));
```
builder() -> VllmClientBuilder
Create a new client builder for more configuration options.
```rust
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
```
API Modules
The client provides access to different API modules:
chat - Chat Completions API
Access the chat completions API for conversational interactions:
```rust
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
```
completions - Legacy Completions API
Access the legacy completions API for text completion:
```rust
let response = client.completions.create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .prompt("Once upon a time")
    .send()
    .await?;
```
VllmClientBuilder
The builder provides a flexible way to configure the client.
Methods
| Method | Type | Description |
|---|---|---|
| `base_url(url)` | `impl Into<String>` | Set the base URL |
| `api_key(key)` | `impl Into<String>` | Set the API key |
| `timeout_secs(secs)` | `u64` | Set timeout in seconds |
| `build()` | - | Build the client |
Default Values
| Option | Default |
|---|---|
| `base_url` | `http://localhost:8000/v1` |
| `api_key` | None |
| `timeout_secs` | HTTP client default (30s) |
Usage Examples
Basic Usage
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
With Environment Variables
```rust
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    let base_url = env::var("VLLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
    let api_key = env::var("VLLM_API_KEY").ok();

    let mut builder = VllmClient::builder().base_url(&base_url);
    if let Some(key) = api_key {
        builder = builder.api_key(&key);
    }
    builder.build()
}
```
Multiple Requests
Reuse the client for multiple requests:
```rust
use vllm_client::{VllmClient, json};

async fn process_prompts(client: &VllmClient, prompts: &[&str]) -> Vec<String> {
    let mut results = Vec::new();
    for prompt in prompts {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;
        match response {
            Ok(r) => results.push(r.content.unwrap_or_default()),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
    results
}
```
Thread Safety
The VllmClient is thread-safe and can be shared across threads:
```rust
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Can be cloned and shared across threads
let client_clone = Arc::clone(&client);
```
See Also
- Chat Completions - Chat completions API
- Streaming - Streaming response handling
- Configuration - Configuration options
Chat Completions API
The Chat Completions API is the primary interface for generating text responses from a language model.
Overview
Access the chat completions API through client.chat.completions():
```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;
```
Request Builder
Required Parameters
model(name: impl Into<String>)
Set the model name to use for generation.
```rust
.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")
```
messages(messages: Value)
Set the conversation messages as a JSON array.
```rust
.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))
```
Message Types
| Role | Description |
|---|---|
| `system` | Sets the behavior of the assistant |
| `user` | User input |
| `assistant` | Assistant response (for multi-turn conversations) |
| `tool` | Tool result (for function calling) |
Sampling Parameters
temperature(temp: f32)
Controls randomness. Range: 0.0 to 2.0.
```rust
.temperature(0.7) // Default-like behavior
.temperature(0.0) // Deterministic
.temperature(1.5) // More creative
```
max_tokens(tokens: u32)
Maximum number of tokens to generate.
```rust
.max_tokens(1024)
.max_tokens(4096)
```
top_p(p: f32)
Nucleus sampling threshold. Range: 0.0 to 1.0.
```rust
.top_p(0.9)
```
top_k(k: i32)
Top-K sampling (vLLM extension). Limits to top K tokens.
```rust
.top_k(50)
```
stop(sequences: Value)
Stop generation when encountering these sequences.
```rust
// Multiple sequences
.stop(json!(["END", "STOP", "\n\n"]))

// Single sequence
.stop(json!("---"))
```
Tool Calling Parameters
tools(tools: Value)
Define tools/functions that the model can call.
```rust
.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))
```
tool_choice(choice: Value)
Control tool selection behavior.
```rust
.tool_choice(json!("auto"))     // Model decides
.tool_choice(json!("none"))     // No tools
.tool_choice(json!("required")) // Force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
```
Advanced Parameters
stream(enable: bool)
Enable streaming response.
```rust
.stream(true)
```
extra(params: Value)
Pass vLLM-specific or additional parameters.
```rust
.extra(json!({
    "chat_template_kwargs": { "think_mode": true },
    "reasoning_effort": "high"
}))
```
Sending Requests
send() - Complete Response
Returns the complete response at once.
```rust
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
```
send_stream() - Streaming Response
Returns a stream for real-time output.
```rust
let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
```
See Streaming for detailed streaming documentation.
Response Structure
ChatCompletionResponse
| Field | Type | Description |
|---|---|---|
| `raw` | `Value` | Raw JSON response |
| `id` | `String` | Response ID |
| `object` | `String` | Object type |
| `model` | `String` | Model used |
| `content` | `Option<String>` | Generated content |
| `reasoning_content` | `Option<String>` | Reasoning content (thinking models) |
| `tool_calls` | `Option<Vec<ToolCall>>` | Tool calls made |
| `finish_reason` | `Option<String>` | Why generation stopped |
| `usage` | `Option<Usage>` | Token usage statistics |
Example Usage
```rust
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Access content
println!("Content: {}", response.content.unwrap_or_default());

// Check for reasoning (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check finish reason
match response.finish_reason.as_deref() {
    Some("stop") => println!("Natural stop"),
    Some("length") => println!("Max tokens reached"),
    Some("tool_calls") => println!("Tool calls made"),
    _ => {}
}

// Token usage
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}
```
Complete Example
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Write a function to reverse a string in Rust"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }
    Ok(())
}
```
Multi-turn Conversation
```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First message
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"}
    ]))
    .send()
    .await?;

// Continue the conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What's my name?"}
    ]))
    .send()
    .await?;
```
See Also
- Streaming - Streaming responses
- Tool Calling - Function calling
- Client - Client configuration
Streaming API
Streaming responses allow you to process LLM output in real-time, token by token, instead of waiting for the complete response.
Overview
vLLM Client provides streaming support through Server-Sent Events (SSE). Use send_stream() instead of send() to get a streaming response.
Basic Streaming
```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    println!();
    Ok(())
}
```
StreamEvent Types
The StreamEvent enum represents different types of streaming events:
| Variant | Description |
|---|---|
| `Content(String)` | Regular content token delta |
| `Reasoning(String)` | Reasoning/thinking content (for thinking models) |
| `ToolCallDelta` | Streaming tool call delta |
| `ToolCallComplete(ToolCall)` | Complete tool call ready to execute |
| `Usage(Usage)` | Token usage statistics |
| `Done` | Stream completed successfully |
| `Error(VllmError)` | An error occurred |
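On the wire, each of these events arrives as an SSE frame: a `data:` line carrying a JSON chunk, with an OpenAI-style `[DONE]` sentinel closing the stream. The framing can be sketched in std-only Rust (a simplified illustration; the crate's real parser also decodes the JSON and handles multi-line events):

```rust
/// Classification of one line of an SSE stream.
/// (Simplified sketch of the wire format, not the crate's actual parser.)
enum SseFrame<'a> {
    Data(&'a str), // a JSON chunk payload
    Done,          // the `[DONE]` sentinel
    Other,         // comments, blank keep-alive lines, etc.
}

fn classify(line: &str) -> SseFrame<'_> {
    match line.strip_prefix("data:") {
        Some(rest) if rest.trim() == "[DONE]" => SseFrame::Done,
        Some(rest) => SseFrame::Data(rest.trim_start()),
        None => SseFrame::Other,
    }
}

fn main() {
    assert!(matches!(classify("data: {\"x\":1}"), SseFrame::Data(s) if s == "{\"x\":1}"));
    assert!(matches!(classify("data: [DONE]"), SseFrame::Done));
    assert!(matches!(classify(": keep-alive"), SseFrame::Other));
}
```

The client turns each `Data` payload into the typed `StreamEvent` values above, so you never parse SSE yourself.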
Content Events
The most common event type, containing text tokens:
```rust
match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        std::io::Write::flush(&mut std::io::stdout()).ok();
    }
    _ => {}
}
```
Reasoning Events
For models with reasoning capabilities (like Qwen with thinking mode):
```rust
match event {
    StreamEvent::Reasoning(delta) => {
        eprintln!("[thinking] {}", delta);
    }
    StreamEvent::Content(delta) => {
        print!("{}", delta);
    }
    _ => {}
}
```
Tool Call Events
Tool calls are streamed incrementally and then completed:
```rust
match event {
    StreamEvent::ToolCallDelta { index, id, name, arguments } => {
        println!("Tool delta: index={}, name={}", index, name);
        // Arguments are streamed as partial JSON
    }
    StreamEvent::ToolCallComplete(tool_call) => {
        println!("Tool ready: {}({})", tool_call.name, tool_call.arguments);
        // Execute the tool and return the result
    }
    _ => {}
}
```
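Because arguments arrive as partial JSON fragments, a caller that works from the deltas alone typically buffers them per tool-call index until the call is complete. A std-only sketch of that accumulation (`ToolCallBuffer` is a hypothetical helper; in practice the library's `ToolCallComplete` event does this for you):

```rust
use std::collections::HashMap;

/// Buffer streamed tool-call argument fragments per tool-call index.
/// (Illustrative; vllm-client's `ToolCallComplete` event does this for you.)
#[derive(Default)]
struct ToolCallBuffer {
    by_index: HashMap<usize, String>,
}

impl ToolCallBuffer {
    /// Append one streamed fragment for the tool call at `index`.
    fn push_delta(&mut self, index: usize, fragment: &str) {
        self.by_index.entry(index).or_default().push_str(fragment);
    }

    /// The arguments accumulated so far for `index`, if any.
    fn arguments(&self, index: usize) -> Option<&str> {
        self.by_index.get(&index).map(String::as_str)
    }
}

fn main() {
    let mut buf = ToolCallBuffer::default();
    buf.push_delta(0, "{\"location\":");
    buf.push_delta(0, " \"Paris\"}");
    assert_eq!(buf.arguments(0), Some("{\"location\": \"Paris\"}"));
}
```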
Usage Events
Token usage information is typically sent at the end:
```rust
match event {
    StreamEvent::Usage(usage) => {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    _ => {}
}
```
MessageStream
The MessageStream type is an async iterator that yields StreamEvent values.
Methods
| Method | Return Type | Description |
|---|---|---|
| `next()` | `Option<StreamEvent>` | Get the next event (async) |
| `collect_content()` | `String` | Collect all content into a string |
| `into_stream()` | `impl Stream` | Convert to a generic stream |
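The contract behind these methods can be illustrated with a self-contained sketch. This uses a local stand-in for `StreamEvent` and a plain iterator in place of a live stream (none of the real crate types are involved), but the accumulation logic mirrors what `collect_content()` does over repeated `next()` calls:

```rust
// Local stand-ins for the crate's StreamEvent variants; illustrative only.
#[derive(Debug)]
enum StreamEvent {
    Content(String),
    Reasoning(String),
    Done,
}

// Accumulate Content deltas and stop at Done, ignoring everything else --
// the same behavior collect_content() provides over a live stream.
fn collect_content(events: impl IntoIterator<Item = StreamEvent>) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            StreamEvent::Content(delta) => out.push_str(&delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }
    out
}

fn main() {
    let events = vec![
        StreamEvent::Reasoning("planning...".into()),
        StreamEvent::Content("Hello, ".into()),
        StreamEvent::Content("world!".into()),
        StreamEvent::Done,
    ];
    println!("{}", collect_content(events)); // prints "Hello, world!"
}
```

The same skeleton extends naturally to the richer loops later in this chapter: add arms for `Reasoning`, `Usage`, and `Error` instead of dropping them.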
Collect All Content
For convenience, you can collect all content at once:
```rust
let content = stream.collect_content().await?;
println!("Full response: {}", content);
```
Note: `collect_content()` waits for the complete response, so you lose incremental display. Use it only when you need the full text and real-time output doesn't matter.
Complete Streaming Example
```rust
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();
    let mut usage = None;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => {
                println!("\n[Stream completed]");
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e);
            }
            _ => {}
        }
    }

    // Print summary
    if !reasoning.is_empty() {
        eprintln!("\n--- Reasoning ---");
        eprintln!("{}", reasoning);
    }
    if let Some(usage) = usage {
        eprintln!("\n--- Token Usage ---");
        eprintln!("Prompt: {}, Completion: {}, Total: {}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }

    Ok(())
}
```
Streaming with Tool Calling
When streaming with tools, you'll receive incremental tool call updates:
```rust
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

// `client` is created as in the earlier examples
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let mut stream = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::ToolCallComplete(tool_call) => {
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute tool calls
for tool_call in tool_calls {
    println!("Tool: {} with args: {}", tool_call.name, tool_call.arguments);
    // Execute and return the result in the next message
}
```
Error Handling
Streaming errors can occur at any point:
```rust
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Error(e) => return Err(e),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    Ok(content)
}
```
Best Practices
Flush Output
For real-time display, flush stdout after each token:
```rust
use std::io::{self, Write};

match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        io::stdout().flush().ok();
    }
    _ => {}
}
```
Handle Interruption
For interactive applications, handle Ctrl+C gracefully:
```rust
use tokio::signal;

tokio::select! {
    result = process_stream(&mut stream) => {
        // Normal completion
    }
    _ = signal::ctrl_c() => {
        println!("\n[interrupted]");
    }
}
```
Timeout for Idle Streams
Set a timeout for streams that may hang:
```rust
use tokio::time::{timeout, Duration};

let result = timeout(Duration::from_secs(60), stream.next()).await;

match result {
    Ok(Some(event)) => { /* process event */ }
    Ok(None) => { /* stream ended */ }
    Err(_) => { /* timeout */ }
}
```
Completions Streaming
The vLLM Client also supports streaming for the legacy /v1/completions API using CompletionStreamEvent.
CompletionStreamEvent Types
| Variant | Description |
|---|---|
| `Text(String)` | Text token delta |
| `FinishReason(String)` | Reason why the stream finished (e.g., "stop", "length") |
| `Usage(Usage)` | Token usage statistics |
| `Done` | Stream completed successfully |
| `Error(VllmError)` | An error occurred |
Completions Streaming Example
```rust
use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("Write a poem about spring")
        .max_tokens(1024)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[Finish reason: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!("\nTokens: prompt={}, completion={}, total={}",
                    usage.prompt_tokens,
                    usage.completion_tokens,
                    usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[Stream completed]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("Error: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}
```
CompletionStream Methods
| Method | Return Type | Description |
|---|---|---|
| `next()` | `Option<CompletionStreamEvent>` | Get the next event (async) |
| `collect_text()` | `String` | Collect all text into a string |
| `into_stream()` | `impl Stream` | Convert to a generic stream |
Next Steps
- Tool Calling - Using function calling
- Error Handling - Comprehensive error handling
- Examples - More streaming examples
Tool Calling API
Tool calling (also known as function calling) allows the model to call external functions during generation. This enables integration with external APIs, databases, and custom logic.
Overview
The vLLM Client supports OpenAI-compatible tool calling:
```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// `tools` is a JSON tool definition, as shown in the next section
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;
```
Defining Tools
Basic Tool Definition
Tools are defined as JSON following the OpenAI schema:
```rust
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name, e.g., Tokyo"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);
```
Multiple Tools
```rust
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"}
                },
                "required": ["query"]
            }
        }
    }
]);
```
Tool Choice
Control how the model selects tools:
```rust
// Let the model decide (default)
.tool_choice(json!("auto"))

// Prevent tool use
.tool_choice(json!("none"))

// Force tool use
.tool_choice(json!("required"))

// Force a specific tool
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
```
Handling Tool Calls
Checking for Tool Calls
```rust
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// Check if the response contains tool calls
if response.has_tool_calls() {
    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            println!("Function: {}", tool_call.name);
            println!("Arguments: {}", tool_call.arguments);
        }
    }
}
```
ToolCall Structure
```rust
pub struct ToolCall {
    pub id: String,        // Unique identifier for the call
    pub name: String,      // Function name
    pub arguments: String, // JSON string of arguments
}
```
Parsing Arguments
Parse the arguments string into typed data:
```rust
use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

if let Some(tool_call) = response.first_tool_call() {
    // Parse as a specific type
    match tool_call.parse_args_as::<WeatherArgs>() {
        Ok(args) => {
            println!("Location: {}", args.location);
            if let Some(unit) = args.unit {
                println!("Unit: {}", unit);
            }
        }
        Err(e) => {
            eprintln!("Failed to parse arguments: {}", e);
        }
    }

    // Or parse as generic JSON
    let args: Value = tool_call.parse_args()?;
}
```
Tool Result Method
Create a tool result message:
```rust
// Create a tool result message
let tool_result = tool_call.result(json!({
    "temperature": 25,
    "condition": "sunny",
    "humidity": 60
}));

// Returns a JSON object ready to be added to messages
// {
//   "role": "tool",
//   "tool_call_id": "...",
//   "content": "{\"temperature\": 25, ...}"
// }
```
Complete Tool Calling Flow
```rust
use vllm_client::{VllmClient, json};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
}

// Simulate a weather API
fn get_weather(_location: &str) -> WeatherResult {
    WeatherResult {
        temperature: 25.0,
        condition: "sunny".to_string(),
    }
}

async fn chat_with_tools(
    client: &VllmClient,
    user_message: &str,
) -> Result<String, Box<dyn std::error::Error>> {
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // First request
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    // Check if the model wants to call a tool
    if response.has_tool_calls() {
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // Add the assistant's tool calls to the messages
        if let Some(tool_calls) = &response.tool_calls {
            messages.push(response.assistant_message());

            // Execute each tool and add the results
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location);
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // Continue the conversation with the tool results
        let final_response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        return Ok(final_response.content.unwrap_or_default());
    }

    Ok(response.content.unwrap_or_default())
}
```
Streaming Tool Calls
Tool calls are streamed incrementally during streaming responses:
```rust
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

// `client` and `tools` are defined as in the earlier examples
let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo and Paris?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();
let mut content = String::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => {
            content.push_str(&delta);
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { index, id, name, arguments } => {
            println!("[Tool delta {}] {}({})", index, name, arguments);
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            println!("[Tool complete] {}({})", tool_call.name, tool_call.arguments);
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute all collected tool calls
for tool_call in tool_calls {
    // Execute and return results...
}
```
Tool Calling with Multiple Rounds
```rust
use vllm_client::VllmClient;
use serde_json::Value;

// `execute_tool` is an application-provided helper that runs a tool
// and returns its result as JSON.
async fn multi_round_tool_calling(
    client: &VllmClient,
    user_message: &str,
    tools: Value,
    max_rounds: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for _ in 0..max_rounds {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // Add the assistant message with tool calls
            messages.push(response.assistant_message());

            // Execute tools and add results
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // No more tool calls, return the content
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err("Max rounds exceeded".into())
}
```
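The multi-round loop above calls an `execute_tool` helper that the application must supply. Here is a minimal std-only sketch of such a dispatcher; the tool names and the JSON-string results are illustrative assumptions, not part of the crate:

```rust
// Dispatch a tool call by name. Arguments and results are JSON strings,
// matching the shape of `ToolCall.arguments`. Illustrative only: a real
// implementation would parse `arguments` with serde_json and call real APIs.
fn execute_tool(name: &str, arguments: &str) -> String {
    match name {
        "get_weather" => {
            format!(r#"{{"ok":true,"tool":"get_weather","args":{}}}"#, arguments)
        }
        "search_web" => {
            format!(r#"{{"ok":true,"tool":"search_web","args":{}}}"#, arguments)
        }
        other => format!(r#"{{"ok":false,"error":"unknown tool: {}"}}"#, other),
    }
}

fn main() {
    println!("{}", execute_tool("get_weather", r#"{"location":"Tokyo"}"#));
    println!("{}", execute_tool("fly", "{}"));
}
```

Returning an error object for unknown tools (rather than panicking) lets the model see the failure and recover in the next round.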
Best Practices
Clear Tool Descriptions
Write clear, detailed descriptions:
```rust
// Good
"description": "Get the current weather conditions for a specific city. Returns temperature, humidity, and weather condition."

// Avoid
"description": "Get weather"
```
Precise Parameter Schemas
Define accurate JSON schemas:
```rust
"parameters": {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": "City name or coordinates"
        },
        "days": {
            "type": "integer",
            "minimum": 1,
            "maximum": 7,
            "description": "Number of days for forecast"
        }
    },
    "required": ["location"]
}
```
Error Handling
Handle tool execution errors gracefully:
```rust
let tool_result = match execute_tool(&tool_call.name, &tool_call.arguments) {
    Ok(result) => json!({"success": true, "data": result}),
    Err(e) => json!({"success": false, "error": e.to_string()}),
};

messages.push(tool_call.result(tool_result));
```
See Also
- Chat Completions - Base chat API
- Streaming - Streaming responses
- Examples - More tool calling examples
Error Handling
This document covers error handling in vLLM Client.
VllmError Enum
All errors in vLLM Client are represented by the VllmError enum:
```rust
use thiserror::Error;

#[derive(Debug, Error, Clone)]
pub enum VllmError {
    #[error("HTTP request failed: {0}")]
    Http(String),

    #[error("JSON error: {0}")]
    Json(String),

    #[error("API error (status {status_code}): {message}")]
    ApiError {
        status_code: u16,
        message: String,
        error_type: Option<String>,
    },

    #[error("Stream error: {0}")]
    Stream(String),

    #[error("Connection timeout")]
    Timeout,

    #[error("Model not found: {0}")]
    ModelNotFound(String),

    #[error("Missing required parameter: {0}")]
    MissingParameter(String),

    #[error("No response content")]
    NoContent,

    #[error("Invalid response format: {0}")]
    InvalidResponse(String),

    #[error("{0}")]
    Other(String),
}
```
Error Types
| Variant | When It Occurs |
|---|---|
| `Http` | Network errors, connection failures |
| `Json` | Serialization/deserialization errors |
| `ApiError` | Server returned an error response |
| `Stream` | Errors during a streaming response |
| `Timeout` | Request timed out |
| `ModelNotFound` | Specified model doesn't exist |
| `MissingParameter` | Required parameter not provided |
| `NoContent` | Response has no content |
| `InvalidResponse` | Unexpected response format |
| `Other` | Miscellaneous errors |
Basic Error Handling
```rust
use vllm_client::{VllmClient, json, VllmError};

async fn chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat("Hello!").await {
        Ok(text) => println!("Response: {}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}
```
Detailed Error Handling
Handle specific error types differently:
```rust
use vllm_client::{VllmClient, json, VllmError};

#[tokio::main]
async fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Hello!"}]))
        .send()
        .await;

    match result {
        Ok(response) => {
            println!("Success: {}", response.content.unwrap_or_default());
        }
        Err(VllmError::ApiError { status_code, message, error_type }) => {
            eprintln!("API Error (HTTP {}): {}", status_code, message);
            if let Some(etype) = error_type {
                eprintln!("Error type: {}", etype);
            }
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out. Try increasing the timeout.");
        }
        Err(VllmError::Http(msg)) => {
            eprintln!("Network error: {}", msg);
        }
        Err(VllmError::ModelNotFound(model)) => {
            eprintln!("Model '{}' not found. Check available models.", model);
        }
        Err(VllmError::MissingParameter(param)) => {
            eprintln!("Missing required parameter: {}", param);
        }
        Err(e) => {
            eprintln!("Other error: {}", e);
        }
    }
}
```
HTTP Status Codes
Common API error status codes:
| Code | Meaning | Action |
|---|---|---|
| 400 | Bad Request | Check request parameters |
| 401 | Unauthorized | Check API key |
| 403 | Forbidden | Check permissions |
| 404 | Not Found | Check endpoint or model name |
| 429 | Rate Limited | Implement retry with backoff |
| 500 | Server Error | Retry or contact admin |
| 502 | Bad Gateway | Check vLLM server status |
| 503 | Service Unavailable | Wait and retry |
| 504 | Gateway Timeout | Increase timeout or retry |
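The retry column of the table above can be folded into a small predicate. This is a self-contained sketch, independent of the client's own `is_retryable()`, treating 429 and the listed 5xx codes as worth retrying:

```rust
// Retryable per the table: 429 (rate limited) and 500-504
// (server error, bad gateway, service unavailable, gateway timeout).
fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500..=504)
}

fn main() {
    assert!(is_retryable_status(429));
    assert!(is_retryable_status(500));
    assert!(is_retryable_status(504));
    assert!(!is_retryable_status(400)); // client error: fix the request instead
    assert!(!is_retryable_status(404));
    println!("ok"); // prints "ok"
}
```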
Retryable Errors
Check if an error is retryable:
```rust
use vllm_client::VllmError;

fn should_retry(error: &VllmError) -> bool {
    error.is_retryable()
}

// Manual check, given some `error: VllmError`
match error {
    VllmError::Timeout => true,
    VllmError::ApiError { status_code: 429, .. } => true,       // Rate limit
    VllmError::ApiError { status_code: 500..=504, .. } => true, // Server errors
    _ => false,
}
```
Retry with Exponential Backoff
```rust
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn chat_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut retries = 0;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && retries < max_retries => {
                retries += 1;
                // Exponential backoff: 100ms, 200ms, 400ms, ...
                let delay = Duration::from_millis(100 * 2u64.pow(retries - 1));
                eprintln!("Retry {} after {:?}: {}", retries, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
```
Streaming Error Handling
Handle errors during streaming:
```rust
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => return Err(e),
            _ => {}
        }
    }

    Ok(content)
}
```
Error Context
Add context to errors for better debugging:
```rust
use vllm_client::{VllmClient, json};

async fn chat_with_context(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("Failed to get chat response: {}", e))?;

    Ok(response.content.unwrap_or_default())
}
```
Using anyhow or eyre
For applications using anyhow or eyre:
```rust
use vllm_client::{VllmClient, json};
use anyhow::{Context, Result};

async fn chat(prompt: &str) -> Result<String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .context("Failed to send chat request")?;

    Ok(response.content.unwrap_or_default())
}
```
Best Practices
1. Always Handle Errors
```rust
// Bad
let response = client.chat.completions().create()
    .send().await.unwrap();

// Good
match client.chat.completions().create().send().await {
    Ok(r) => { /* handle */ }
    Err(e) => eprintln!("Error: {}", e),
}
```
2. Use Appropriate Timeout
```rust
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes for long tasks
```
3. Log Errors with Context
```rust
Err(e) => {
    log::error!("Chat request failed: {}", e);
    log::debug!("Request details: model={}, prompt_len={}", model, prompt.len());
}
```
4. Implement Graceful Degradation
```rust
match primary_client.chat.completions().create().send().await {
    Ok(r) => r,
    Err(e) => {
        log::warn!("Primary client failed: {}, trying fallback", e);
        fallback_client.chat.completions().create().send().await?
    }
}
```
See Also
- Client - Client configuration
- Streaming - Streaming error handling
- Timeouts & Retries - Advanced timeout configuration
Examples
This section contains practical code examples demonstrating vLLM Client usage patterns.
Available Examples
Basic Usage
| Example | Description |
|---|---|
| Basic Chat | Simple chat completion requests |
| Streaming Chat | Real-time streaming responses |
| Streaming Completions | Legacy completions streaming |
| Tool Calling | Function calling integration |
| Multi-modal | Image and multi-modal inputs |
Quick Examples
Hello World
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Hello!"}]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
Streaming Output
```rust
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Tell me a story"}]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            print!("{}", delta);
        }
    }
    println!();

    Ok(())
}
```
Tool Calling
```rust
use vllm_client::{VllmClient, json};

// `client` is created as in the earlier examples
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

if response.has_tool_calls() {
    // Execute tools and return results
}
```
Example Structure
Each example includes:
- Complete, runnable code
- Required dependencies
- Step-by-step explanations
- Common variations and use cases
Running Examples
Prerequisites
1. A running vLLM server:

   ```bash
   pip install vllm
   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
   ```

2. The Rust toolchain:

   ```bash
   curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
   ```
Running an Example
```bash
# Create a new project
cargo new my-vllm-app
cd my-vllm-app

# Add dependencies
cargo add vllm-client
cargo add tokio --features full
cargo add serde_json

# Copy example code to src/main.rs, then run:
cargo run
```
Common Patterns
Environment Configuration
```rust
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    VllmClient::builder()
        .base_url(env::var("VLLM_BASE_URL")
            .unwrap_or_else(|_| "http://localhost:8000/v1".to_string()))
        .api_key(env::var("VLLM_API_KEY").ok())
        .timeout_secs(300)
        .build()
}
```
Error Handling
```rust
use vllm_client::{VllmClient, json, VllmError};

async fn safe_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}
```
Reusing Client
```rust
use std::sync::Arc;
use vllm_client::VllmClient;

// Share the client across tasks
let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Clone the Arc for each async task
let client1 = Arc::clone(&client);
let client2 = Arc::clone(&client);
```
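The `Arc` sharing pattern can be exercised without the crate at all. Below is a std-only sketch using a stand-in `Client` struct (hypothetical, just to show the ownership semantics); with the real `VllmClient` you would clone the `Arc` into `tokio::spawn` tasks the same way:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for VllmClient; only the sharing pattern is the point here.
struct Client {
    base_url: String,
}

fn main() {
    let client = Arc::new(Client {
        base_url: "http://localhost:8000/v1".to_string(),
    });

    // Each worker gets its own Arc clone; the Client itself is never copied.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let client = Arc::clone(&client);
            thread::spawn(move || format!("task {} -> {}", i, client.base_url))
        })
        .collect();

    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```

Because `Arc::clone` only bumps a reference count, this is cheap even if the underlying client holds a connection pool.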
See Also
- Getting Started - Installation and setup
- API Reference - Detailed API documentation
- Advanced Topics - Advanced usage patterns
Basic Chat Examples
This page demonstrates basic chat completion usage patterns with vLLM Client.
Simple Chat
The simplest way to send a chat message:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
With System Message
Add a system message to control the assistant's behavior:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful coding assistant. You write clean, well-documented code."},
            {"role": "user", "content": "Write a function to check if a number is prime in Rust"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
Multi-turn Conversation
Maintain context across multiple messages:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Build the conversation history
    let mut messages = vec![
        json!({"role": "system", "content": "You are a helpful assistant."}),
    ];

    // First turn
    messages.push(json!({"role": "user", "content": "My name is Alice"}));

    let response1 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!(messages.clone()))
        .send()
        .await?;

    let assistant_reply = response1.content.unwrap_or_default();
    println!("Assistant: {}", assistant_reply);

    // Add the assistant reply to the history
    messages.push(json!({"role": "assistant", "content": assistant_reply}));

    // Second turn
    messages.push(json!({"role": "user", "content": "What's my name?"}));

    let response2 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!(messages))
        .send()
        .await?;

    println!("Assistant: {}", response2.content.unwrap_or_default());
    Ok(())
}
```
Conversation Helper
A reusable helper for building conversations:
```rust
use vllm_client::{VllmClient, json, VllmError};
use serde_json::Value;

struct Conversation {
    client: VllmClient,
    model: String,
    messages: Vec<Value>,
}

impl Conversation {
    fn new(client: VllmClient, model: impl Into<String>) -> Self {
        Self {
            client,
            model: model.into(),
            messages: vec![
                json!({"role": "system", "content": "You are a helpful assistant."})
            ],
        }
    }

    fn with_system(mut self, content: &str) -> Self {
        self.messages[0] = json!({"role": "system", "content": content});
        self
    }

    async fn send(&mut self, user_message: &str) -> Result<String, VllmError> {
        self.messages.push(json!({
            "role": "user",
            "content": user_message
        }));

        let response = self.client
            .chat
            .completions()
            .create()
            .model(&self.model)
            .messages(json!(&self.messages))
            .send()
            .await?;

        let content = response.content.unwrap_or_default();

        self.messages.push(json!({
            "role": "assistant",
            "content": &content
        }));

        Ok(content)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut conv = Conversation::new(client, "Qwen/Qwen2.5-7B-Instruct")
        .with_system("You are a math tutor. Explain concepts simply.");

    println!("User: What is 2 + 2?");
    let reply = conv.send("What is 2 + 2?").await?;
    println!("Assistant: {}", reply);

    println!("\nUser: And what is that multiplied by 3?");
    let reply = conv.send("And what is that multiplied by 3?").await?;
    println!("Assistant: {}", reply);

    Ok(())
}
```
With Sampling Parameters
Control the generation with sampling parameters:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a creative story about a robot"}
        ]))
        .temperature(1.2) // Higher temperature for more creativity
        .top_p(0.95)      // Nucleus sampling
        .top_k(50)        // vLLM extension
        .max_tokens(512)  // Limit output length
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
Deterministic Output
For reproducible results, set temperature to 0:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What is 2 + 2?"}
        ]))
        .temperature(0.0) // Deterministic output
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
With Stop Sequences
Stop generation at specific sequences:
```rust
use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "List three fruits, one per line"}
        ]))
        .stop(json!(["\n\n", "END"])) // Stop at a double newline or END
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
```
Token Usage Tracking
Track token usage for cost monitoring:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Explain quantum computing"} ])) .send() .await?; println!("Response: {}", response.content.unwrap_or_default()); if let Some(usage) = response.usage { println!("\n--- Token Usage ---"); println!("Prompt tokens: {}", usage.prompt_tokens); println!("Completion tokens: {}", usage.completion_tokens); println!("Total tokens: {}", usage.total_tokens); } Ok(()) }
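If you bill by token, the usage counts above convert directly into cost. A minimal sketch; the per-million-token prices are illustrative placeholders, not real rates:

```rust
/// Estimate request cost in dollars from token counts.
/// Prices are per million tokens (hypothetical values).
fn estimate_cost(
    prompt_tokens: u32,
    completion_tokens: u32,
    prompt_price_per_m: f64,
    completion_price_per_m: f64,
) -> f64 {
    (prompt_tokens as f64 * prompt_price_per_m
        + completion_tokens as f64 * completion_price_per_m)
        / 1_000_000.0
}

fn main() {
    // 1,000 prompt tokens at $0.50/M and 2,000 completion tokens at $1.50/M
    let cost = estimate_cost(1_000, 2_000, 0.50, 1.50);
    assert!((cost - 0.0035).abs() < 1e-12);
    println!("Estimated cost: ${:.4}", cost);
}
```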
Batch Processing
Process multiple prompts; the loop below runs them sequentially (for true concurrency, spawn tasks or use `futures::future::join_all`):
use vllm_client::{VllmClient, json, VllmError}; async fn process_prompts( client: &VllmClient, prompts: &[&str], ) -> Vec<Result<String, VllmError>> { let mut results = Vec::new(); for prompt in prompts { let result = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map(|r| r.content.unwrap_or_default()); results.push(result); } results } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(120); let prompts = [ "What is Rust?", "What is Python?", "What is Go?", ]; let results = process_prompts(&client, &prompts).await; for (prompt, result) in prompts.iter().zip(results.iter()) { match result { Ok(response) => println!("Q: {}\nA: {}\n", prompt, response), Err(e) => eprintln!("Error for '{}': {}", prompt, e), } } Ok(()) }
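The `Vec<Result<...>>` returned by a batch run is easy to split into successes and failures before reporting. A pure-std sketch, using `String` as a stand-in error type:

```rust
/// Split batch results into successful responses and error messages.
fn partition_results(results: Vec<Result<String, String>>) -> (Vec<String>, Vec<String>) {
    let mut ok = Vec::new();
    let mut err = Vec::new();
    for r in results {
        match r {
            Ok(v) => ok.push(v),
            Err(e) => err.push(e),
        }
    }
    (ok, err)
}

fn main() {
    let results = vec![
        Ok("Rust is a systems language".to_string()),
        Err("timeout".to_string()),
        Ok("Python is dynamic".to_string()),
    ];
    let (ok, err) = partition_results(results);
    assert_eq!(ok.len(), 2);
    assert_eq!(err, vec!["timeout".to_string()]);
    println!("{} succeeded, {} failed", ok.len(), err.len());
}
```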
Error Handling
Proper error handling for production code:
use vllm_client::{VllmClient, json, VllmError}; async fn safe_chat(prompt: &str) -> Result<String, String> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(60); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map_err(|e| format!("Request failed: {}", e))?; response.content.ok_or_else(|| "No content in response".to_string()) } #[tokio::main] async fn main() { match safe_chat("Hello!").await { Ok(text) => println!("Response: {}", text), Err(e) => eprintln!("Error: {}", e), } }
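For transient failures (timeouts, 5xx responses), retrying with exponential backoff is a common production pattern. The delay schedule itself is pure arithmetic; this is a sketch of that calculation, not a `vllm-client` feature:

```rust
use std::time::Duration;

/// Exponential backoff with a cap: base, 2x, 4x, ... up to `max_ms`.
fn backoff_delay(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    // Clamp the shift so large attempt numbers cannot overflow.
    let ms = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(ms.min(max_ms))
}

fn main() {
    assert_eq!(backoff_delay(0, 500, 8_000), Duration::from_millis(500));
    assert_eq!(backoff_delay(2, 500, 8_000), Duration::from_millis(2_000));
    assert_eq!(backoff_delay(10, 500, 8_000), Duration::from_millis(8_000));
    // In a real retry loop you would sleep between attempts, e.g.
    // tokio::time::sleep(backoff_delay(attempt, 500, 8_000)).await;
}
```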
See Also
- Streaming Chat - Real-time response streaming
- Tool Calling - Function calling examples
- API Reference - Complete API documentation
Streaming Chat Example
This example demonstrates how to use streaming responses for real-time output.
Basic Streaming
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Write a short story about a robot learning to paint."} ])) .temperature(0.8) .max_tokens(1024) .stream(true) .send_stream() .await?; print!("Response: "); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\nError: {}", e); break; } _ => {} } } println!(); Ok(()) }
Streaming with Reasoning (Thinking Models)
For models that support thinking/reasoning mode:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Solve: What is 15 * 23 + 47?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { reasoning.push_str(&delta); eprintln!("[thinking] {}", delta); } StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\nError: {}", e); break; } _ => {} } } println!("\n"); if !reasoning.is_empty() { println!("--- Reasoning Process ---"); println!("{}", reasoning); } Ok(()) }
Streaming with Progress Indicator
Add a typing indicator while waiting for the first token:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use std::time::{Duration, Instant}; use std::sync::atomic::{AtomicBool, Ordering}; use std::sync::Arc; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let waiting = Arc::new(AtomicBool::new(true)); let waiting_clone = Arc::clone(&waiting); // Spawn typing indicator task let indicator = tokio::spawn(async move { let chars = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏']; let mut i = 0; while waiting_clone.load(Ordering::Relaxed) { print!("\r{} Thinking...", chars[i]); std::io::Write::flush(&mut std::io::stdout()).ok(); i = (i + 1) % chars.len(); tokio::time::sleep(Duration::from_millis(80)).await; } print!("\r \r"); // Clear the indicator }); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Explain quantum entanglement in simple terms."} ])) .stream(true) .send_stream() .await?; let mut first_token = true; let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { if first_token { waiting.store(false, Ordering::Relaxed); indicator.await.ok(); first_token = false; println!("Response:"); println!("---------"); } content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { waiting.store(false, Ordering::Relaxed); eprintln!("\nError: {}", e); break; } _ => {} } } println!("\n"); Ok(()) }
Multi-turn Streaming Conversation
Handle a conversation with streaming responses:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use std::io::{self, BufRead, Write}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut messages: Vec<serde_json::Value> = Vec::new(); println!("Chat with the AI (type 'quit' to exit)"); println!("----------------------------------------\n"); let stdin = io::stdin(); for line in stdin.lock().lines() { let input = line?; if input.trim() == "quit" { break; } if input.trim().is_empty() { continue; } // Add user message messages.push(json!({"role": "user", "content": input})); // Stream response let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .stream(true) .send_stream() .await?; print!("AI: "); io::stdout().flush().ok(); let mut response_content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { response_content.push_str(&delta); print!("{}", delta); io::stdout().flush().ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\nError: {}", e); break; } _ => {} } } println!("\n"); // Add assistant response to history messages.push(json!({"role": "assistant", "content": response_content})); } println!("Goodbye!"); Ok(()) }
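Long conversations eventually exceed the model's context window. A common mitigation is to keep the system message plus only the most recent turns. A pure-std sketch over `(role, content)` pairs (the real loop above stores `serde_json::Value` messages, but the trimming logic is the same):

```rust
/// Keep the first (system) message plus at most `max_recent` most recent
/// messages. Messages are (role, content) pairs for illustration.
fn trim_history(messages: &mut Vec<(String, String)>, max_recent: usize) {
    if messages.len() <= max_recent + 1 {
        return; // Already short enough.
    }
    let keep_from = messages.len() - max_recent;
    let system = messages[0].clone();
    let mut trimmed = vec![system];
    trimmed.extend_from_slice(&messages[keep_from..]);
    *messages = trimmed;
}

fn main() {
    let mut msgs: Vec<(String, String)> =
        vec![("system".into(), "You are helpful.".into())];
    for i in 0..10 {
        msgs.push(("user".into(), format!("question {}", i)));
        msgs.push(("assistant".into(), format!("answer {}", i)));
    }
    trim_history(&mut msgs, 4);
    assert_eq!(msgs.len(), 5); // system + last 4 messages
    assert_eq!(msgs[0].0, "system");
    assert_eq!(msgs.last().unwrap().1, "answer 9");
}
```

Trimming by message count is crude; trimming by estimated token count is more precise but requires a tokenizer.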
Streaming with Timeout
Add timeout handling for slow responses:
use vllm_client::{VllmClient, json, StreamEvent, VllmError}; use futures::StreamExt; use tokio::time::{timeout, Duration}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(300); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Write a detailed essay about AI."} ])) .stream(true) .send_stream() .await?; let mut content = String::new(); loop { // 30 second timeout per event match timeout(Duration::from_secs(30), stream.next()).await { Ok(Some(event)) => { match event { StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\nStream error: {}", e); return Err(e.into()); } _ => {} } } Ok(None) => break, Err(_) => { eprintln!("\nTimeout waiting for next token"); break; } } } println!("\n\nGenerated {} characters", content.len()); Ok(()) }
Collecting Usage Statistics
Track token usage during streaming:
use vllm_client::{VllmClient, json, StreamEvent, Usage}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Write a poem about the ocean."} ])) .stream(true) .send_stream() .await?; let mut content = String::new(); let mut usage: Option<Usage> = None; let mut start_time = std::time::Instant::now(); let mut token_count = 0; while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { content.push_str(&delta); token_count += 1; print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Usage(u) => { usage = Some(u); } StreamEvent::Done => break, _ => {} } } let elapsed = start_time.elapsed(); println!("\n"); println!("--- Statistics ---"); println!("Time: {:.2}s", elapsed.as_secs_f64()); println!("Characters: {}", content.len()); if let Some(usage) = usage { println!("Prompt tokens: {}", usage.prompt_tokens); println!("Completion tokens: {}", usage.completion_tokens); println!("Total tokens: {}", usage.total_tokens); println!("Tokens/second: {:.2}", usage.completion_tokens as f64 / elapsed.as_secs_f64()); } Ok(()) }
See Also
- Basic Chat - Simple chat completion
- Tool Calling - Function calling examples
- Streaming API - Streaming API reference
Streaming Completions Example
This example demonstrates streaming completions using the legacy /v1/completions API.
Basic Streaming Completions
use vllm_client::{VllmClient, json, CompletionStreamEvent}; use futures::StreamExt; use std::io::Write; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); println!("=== Streaming Completions ===\n"); println!("Model: Qwen/Qwen2.5-7B-Instruct\n"); println!("Prompt: What is machine learning?"); println!("\nGenerated text: "); let mut stream = client .completions .create() .model("Qwen/Qwen2.5-7B-Instruct") .prompt("What is machine learning?") .max_tokens(500) .temperature(0.7) .stream(true) .send_stream() .await?; // Process streaming events while let Some(event) = stream.next().await { match event { CompletionStreamEvent::Text(delta) => { // Print text delta (real-time output) print!("{}", delta); // Flush buffer for real-time display std::io::stdout().flush().ok(); } CompletionStreamEvent::FinishReason(reason) => { println!("\n\n--- Finish reason: {} ---", reason); } CompletionStreamEvent::Usage(usage) => { // Output token usage statistics at the end println!("\n\n--- Token Usage ---"); println!("Prompt tokens: {}", usage.prompt_tokens); println!("Completion tokens: {}", usage.completion_tokens); println!("Total tokens: {}", usage.total_tokens); } CompletionStreamEvent::Done => { println!("\n\n=== Generation Complete ==="); break; } CompletionStreamEvent::Error(e) => { eprintln!("\nError: {}", e); return Err(e.into()); } } } Ok(()) }
Key Differences from Chat Streaming
| Aspect | Chat Completions | Completions |
|---|---|---|
| Event type | StreamEvent | CompletionStreamEvent |
| Content variant | Content(String) | Text(String) |
| Additional event | Reasoning, ToolCall | FinishReason |
| Use case | Conversation-based | Single prompt |
When to Use Completions API
- Simple text generation with a single prompt
- Legacy compatibility with OpenAI API
- Situations where chat messages format is not needed
For new projects, we recommend the Chat Completions API (client.chat.completions()), which provides more flexibility and better message formatting.
Related Links
- Streaming - Chat streaming examples
- API Streaming - Streaming API reference
- Basic Chat - Non-streaming completions example
Tool Calling Examples
This example demonstrates how to use tool calling (function calling) with vLLM Client.
Basic Tool Calling
Define tools and let the model decide when to call them:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // Define available tools let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "Get the current weather for a location", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "City name, e.g., Tokyo, New York" }, "unit": { "type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit" } }, "required": ["location"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "What's the weather like in Tokyo?"} ])) .tools(tools) .send() .await?; // Check if the model wants to call a tool if response.has_tool_calls() { if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { println!("Function: {}", tool_call.name); println!("Arguments: {}", tool_call.arguments); } } } else { println!("Response: {}", response.content.unwrap_or_default()); } Ok(()) }
Complete Tool Calling Flow
Execute tools and return results to continue the conversation:
use vllm_client::{VllmClient, json, ToolCall}; use serde::{Deserialize, Serialize}; #[derive(Deserialize)] struct WeatherArgs { location: String, unit: Option<String>, } #[derive(Serialize)] struct WeatherResult { temperature: f32, condition: String, humidity: u32, } // Simulated weather function fn get_weather(location: &str, unit: Option<&str>) -> WeatherResult { // In real code, call an actual weather API let temp = match location { "Tokyo" => 25.0, "New York" => 20.0, "London" => 15.0, _ => 22.0, }; WeatherResult { temperature: if unit == Some("fahrenheit") { temp * 9.0 / 5.0 + 32.0 } else { temp }, condition: "sunny".to_string(), humidity: 60, } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string"}, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]} }, "required": ["location"] } } } ]); let user_message = "What's the weather like in Tokyo and New York?"; // First request - model may call tools let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": user_message} ])) .tools(tools.clone()) .send() .await?; if response.has_tool_calls() { // Build message history let mut messages = vec![ json!({"role": "user", "content": user_message}) ]; // Add assistant's tool calls messages.push(response.assistant_message()); // Execute each tool and add results if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { if tool_call.name == "get_weather" { let args: WeatherArgs = tool_call.parse_args_as()?; let result = get_weather(&args.location, args.unit.as_deref()); messages.push(tool_call.result(json!(result))); } } } // Continue conversation with tool results let 
final_response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(messages)) .tools(tools) .send() .await?; println!("{}", final_response.content.unwrap_or_default()); } else { println!("{}", response.content.unwrap_or_default()); } Ok(()) }
Multiple Tools
Define multiple tools for different purposes:
use vllm_client::{VllmClient, json}; use serde::Deserialize; #[derive(Deserialize)] struct SearchArgs { query: String, limit: Option<u32>, } #[derive(Deserialize)] struct CalcArgs { expression: String, } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "web_search", "description": "Search the web for information", "parameters": { "type": "object", "properties": { "query": { "type": "string", "description": "Search query" }, "limit": { "type": "integer", "description": "Maximum number of results" } }, "required": ["query"] } } }, { "type": "function", "function": { "name": "calculate", "description": "Perform mathematical calculations", "parameters": { "type": "object", "properties": { "expression": { "type": "string", "description": "Math expression to evaluate, e.g., '2 + 2 * 3'" } }, "required": ["expression"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Search for Rust programming language and calculate 42 * 17"} ])) .tools(tools) .send() .await?; if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { match tool_call.name.as_str() { "web_search" => { let args: SearchArgs = tool_call.parse_args_as()?; println!("Searching for: {} (limit: {:?})", args.query, args.limit); } "calculate" => { let args: CalcArgs = tool_call.parse_args_as()?; println!("Calculating: {}", args.expression); } _ => println!("Unknown tool: {}", tool_call.name), } } } Ok(()) }
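As the number of tools grows, a dispatch table keeps the match arms manageable. A pure-std sketch mapping tool names to handler functions; the handlers and their string-in/string-out signatures are illustrative (real handlers would parse typed arguments as shown above):

```rust
use std::collections::HashMap;

type ToolHandler = fn(&str) -> String;

// Hypothetical handlers taking raw JSON argument strings.
fn web_search(args: &str) -> String {
    format!("searching with args {}", args)
}

fn calculate(args: &str) -> String {
    format!("calculating with args {}", args)
}

fn main() {
    let mut registry: HashMap<&str, ToolHandler> = HashMap::new();
    registry.insert("web_search", web_search);
    registry.insert("calculate", calculate);

    // Dispatch by tool-call name; unknown tools fall back to an error string.
    let result = match registry.get("calculate") {
        Some(handler) => handler(r#"{"expression": "42 * 17"}"#),
        None => "unknown tool".to_string(),
    };
    assert!(result.starts_with("calculating"));
    println!("{}", result);
}
```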
Streaming Tool Calls
Stream tool call updates in real-time:
use vllm_client::{VllmClient, json, StreamEvent, ToolCall}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "Get weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "What's the weather in Tokyo, Paris, and London?"} ])) .tools(tools) .stream(true) .send_stream() .await?; let mut tool_calls: Vec<ToolCall> = Vec::new(); let mut content = String::new(); println!("Streaming response:\n"); while let Some(event) = stream.next().await { match event { StreamEvent::Content(delta) => { content.push_str(&delta); print!("{}", delta); } StreamEvent::ToolCallDelta { index, id, name, arguments } => { println!("[Tool {}] {} - partial args: {}", index, name, arguments); } StreamEvent::ToolCallComplete(tool_call) => { println!("[Tool Complete] {}({})", tool_call.name, tool_call.arguments); tool_calls.push(tool_call); } StreamEvent::Done => { println!("\n--- Stream Complete ---"); break; } StreamEvent::Error(e) => { eprintln!("\nError: {}", e); break; } _ => {} } } println!("\nCollected {} tool calls", tool_calls.len()); for (i, tc) in tool_calls.iter().enumerate() { println!(" {}. {}({})", i + 1, tc.name, tc.arguments); } Ok(()) }
Multi-Round Tool Calling
Handle multiple rounds of tool calls:
use vllm_client::{VllmClient, json, VllmError}; use serde_json::Value; async fn run_agent( client: &VllmClient, user_message: &str, tools: &Value, max_rounds: usize, ) -> Result<String, VllmError> { let mut messages = vec![ json!({"role": "user", "content": user_message}) ]; for round in 0..max_rounds { println!("--- Round {} ---", round + 1); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!(&messages)) .tools(tools.clone()) .send() .await?; if response.has_tool_calls() { // Add assistant message with tool calls messages.push(response.assistant_message()); // Execute tools and add results if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { println!("Calling: {}({})", tool_call.name, tool_call.arguments); // Execute the tool let result = execute_tool(&tool_call.name, &tool_call.arguments); println!("Result: {}", result); // Add tool result to messages messages.push(tool_call.result(result)); } } } else { // No more tool calls, return the final response return Ok(response.content.unwrap_or_default()); } } Err(VllmError::Other("Max rounds exceeded".to_string())) } fn execute_tool(name: &str, args: &str) -> Value { // Your tool execution logic here match name { "get_weather" => json!({"temperature": 22, "condition": "sunny"}), "web_search" => json!({"results": ["result1", "result2"]}), _ => json!({"error": "Unknown tool"}), } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "Get weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } }, { "type": "function", "function": { "name": "web_search", "description": "Search the web", "parameters": { "type": "object", "properties": { "query": {"type": "string"} }, "required": ["query"] } } 
} ]); let result = run_agent( &client, "What's the weather in Tokyo and find information about cherry blossoms?", &tools, 5 ).await?; println!("\nFinal Answer: {}", result); Ok(()) }
Tool Choice Options
Control tool selection behavior:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "Get weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); // Option 1: Let the model decide (default) let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Hello!"} ])) .tools(tools.clone()) .tool_choice(json!("auto")) .send() .await?; // Option 2: Prevent tool use let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "What's the weather in Tokyo?"} ])) .tools(tools.clone()) .tool_choice(json!("none")) .send() .await?; // Option 3: Force tool use let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "I need weather info"} ])) .tools(tools.clone()) .tool_choice(json!("required")) .send() .await?; // Option 4: Force specific tool let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "Check Tokyo weather"} ])) .tools(tools.clone()) .tool_choice(json!({ "type": "function", "function": {"name": "get_weather"} })) .send() .await?; Ok(()) }
Error Handling
Handle tool execution errors gracefully:
use vllm_client::{VllmClient, json, ToolCall}; use serde_json::Value; fn execute_tool_safely(tool_call: &ToolCall) -> Value { match tool_call.name.as_str() { "get_weather" => { // Parse arguments safely match tool_call.parse_args() { Ok(args) => { // Execute tool match get_weather_internal(&args) { Ok(result) => json!({"success": true, "data": result}), Err(e) => json!({"success": false, "error": e.to_string()}), } } Err(e) => json!({ "success": false, "error": format!("Invalid arguments: {}", e) }), } } _ => json!({ "success": false, "error": format!("Unknown tool: {}", tool_call.name) }), } } fn get_weather_internal(args: &Value) -> Result<Value, String> { let location = args["location"].as_str() .ok_or("location is required")?; // Simulate API call Ok(json!({ "location": location, "temperature": 22, "condition": "sunny" })) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let tools = json!([ { "type": "function", "function": { "name": "get_weather", "description": "Get weather for a location", "parameters": { "type": "object", "properties": { "location": {"type": "string"} }, "required": ["location"] } } } ]); let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([ {"role": "user", "content": "What's the weather?"} ])) .tools(tools) .send() .await?; if let Some(tool_calls) = &response.tool_calls { for tool_call in tool_calls { let result = execute_tool_safely(tool_call); println!("Tool result: {}", result); } } Ok(()) }
See Also
- API: Tool Calling - Tool calling API reference
- Streaming Chat - Streaming responses
- Basic Chat - Basic chat completion
Multi-modal Examples
Multi-modal capabilities allow you to send images and other media types along with text to the model.
Overview
vLLM supports multi-modal inputs through the OpenAI-compatible API. You can include images in your chat messages using base64 encoding or URLs.
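Embedded images travel in the message `content` array as data URLs of the form `data:<mime>;base64,<payload>`. Building one is a single `format!` once you have the encoded bytes; a pure-std sketch (the payload below is a placeholder, not a real image):

```rust
/// Build a data URL for an already-base64-encoded image payload.
fn data_url(mime: &str, base64_payload: &str) -> String {
    format!("data:{};base64,{}", mime, base64_payload)
}

fn main() {
    let url = data_url("image/png", "iVBORw0KGgo=");
    assert_eq!(url, "data:image/png;base64,iVBORw0KGgo=");
    assert!(url.starts_with("data:image/png;base64,"));
}
```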
Basic Image Input (Base64)
Send an image encoded as base64:
use vllm_client::{VllmClient, json}; use base64::{Engine as _, engine::general_purpose}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // Read and encode image let image_data = std::fs::read("image.png")?; let base64_image = general_purpose::STANDARD.encode(&image_data); let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") // Vision model .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "What's in this image?" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .max_tokens(512) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
Image from URL
Reference an image by URL:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in detail." }, { "type": "image_url", "image_url": { "url": "https://example.com/image.jpg" } } ] } ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
Helper Function for Images
Create a reusable helper for image messages:
use vllm_client::{VllmClient, json}; use serde_json::Value; fn image_message(text: &str, image_path: &str) -> Result<Value, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let image_data = std::fs::read(image_path)?; let base64_image = general_purpose::STANDARD.encode(&image_data); // Detect image type from extension let mime_type = match image_path.to_lowercase().rsplit('.').next() { Some("png") => "image/png", Some("jpg") | Some("jpeg") => "image/jpeg", Some("gif") => "image/gif", Some("webp") => "image/webp", _ => "image/png", }; Ok(json!({ "role": "user", "content": [ { "type": "text", "text": text }, { "type": "image_url", "image_url": { "url": format!("data:{};base64,{}", mime_type, base64_image) } } ] })) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let user_msg = image_message("What do you see in this image?", "photo.jpg")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([user_msg])) .max_tokens(1024) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
Multiple Images
Send multiple images in a single request:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // Read and encode multiple images let image1 = encode_image("image1.png")?; let image2 = encode_image("image2.png")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "Compare these two images. What are the differences?" }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", image1) } }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", image2) } } ] } ])) .max_tokens(1024) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) } fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; let data = std::fs::read(path)?; Ok(general_purpose::STANDARD.encode(&data)) }
Streaming with Images
Stream responses for image-based queries:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let base64_image = encode_image("chart.png")?; let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ { "type": "text", "text": "Analyze this chart and explain the trends." }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { if let StreamEvent::Content(delta) = event { print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } } println!(); Ok(()) }
Multi-turn with Images
Maintain conversation context with images:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let base64_image = encode_image("screenshot.png")?; // First message with image let messages = json!([ { "role": "user", "content": [ {"type": "text", "text": "What's in this screenshot?"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ]); let response1 = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(messages.clone()) .send() .await?; println!("First response: {}", response1.content.unwrap_or_default()); // Continue conversation (no new image needed) let messages2 = json!([ { "role": "user", "content": [ {"type": "text", "text": "What's in this screenshot?"}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] }, { "role": "assistant", "content": response1.content.unwrap_or_default() }, { "role": "user", "content": "Can you translate any text you see in the image?" } ]); let response2 = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(messages2) .send() .await?; println!("\nSecond response: {}", response2.content.unwrap_or_default()); Ok(()) }
OCR and Document Analysis
Use vision models for OCR and document analysis:
use vllm_client::{VllmClient, json}; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let document_image = encode_image("document.png")?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "system", "content": "You are an OCR assistant. Extract text from images accurately and format it properly." }, { "role": "user", "content": [ { "type": "text", "text": "Extract all text from this document image. Preserve the formatting as much as possible." }, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", document_image) } } ] } ])) .max_tokens(2048) .send() .await?; println!("Extracted Text:\n{}", response.content.unwrap_or_default()); Ok(()) }
Image Size Considerations
Handle large images appropriately:
use vllm_client::{VllmClient, json}; fn encode_and_resize_image(path: &str, max_size: u32) -> Result<String, Box<dyn std::error::Error>> { use base64::{Engine as _, engine::general_purpose}; use image::ImageReader; // Load and resize image let img = ImageReader::open(path)?.decode()?; let img = img.resize(max_size, max_size, image::imageops::FilterType::Lanczos3); // Convert to PNG let mut buffer = std::io::Cursor::new(Vec::new()); img.write_to(&mut buffer, image::ImageFormat::Png)?; Ok(general_purpose::STANDARD.encode(&buffer.into_inner())) } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); // Resize to max 1024px while maintaining aspect ratio let base64_image = encode_and_resize_image("large_image.jpg", 1024)?; let response = client .chat .completions() .create() .model("Qwen/Qwen2-VL-7B-Instruct") .messages(json!([ { "role": "user", "content": [ {"type": "text", "text": "Describe this image."}, { "type": "image_url", "image_url": { "url": format!("data:image/png;base64,{}", base64_image) } } ] } ])) .send() .await?; println!("{}", response.content.unwrap_or_default()); Ok(()) }
Supported Models
For multi-modal inputs, use models that support vision:
| Model | Description |
|---|---|
| Qwen/Qwen2-VL-7B-Instruct | Qwen2 Vision Language |
| Qwen/Qwen2-VL-72B-Instruct | Qwen2 VL Large |
| meta-llama/Llama-3.2-11B-Vision-Instruct | Llama 3.2 Vision |
| openai/clip-vit-large-patch14 | CLIP (embedding model; not for chat completions) |
Check your vLLM server's available models with:
curl http://localhost:8000/v1/models
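The same check can be done programmatically before sending an expensive image request. A minimal sketch using only the standard library: `serves_model` is a hypothetical helper that does a quick substring check on the raw JSON body from `GET {base_url}/models`; a real client would parse the JSON properly.

```rust
/// Checks a raw `/v1/models` JSON body for a model id.
/// A full client would parse the JSON; matching on the quoted id is
/// enough for a quick health check.
fn serves_model(raw_body: &str, model_id: &str) -> bool {
    raw_body.contains(&format!("\"id\": \"{}\"", model_id))
        || raw_body.contains(&format!("\"id\":\"{}\"", model_id))
}

fn main() {
    // Canned response in the OpenAI list shape; in practice this comes
    // from the curl command above (or an HTTP client).
    let body = r#"{"object":"list","data":[{"id":"Qwen/Qwen2-VL-7B-Instruct","object":"model"}]}"#;
    assert!(serves_model(body, "Qwen/Qwen2-VL-7B-Instruct"));
    println!("vision model is served");
}
```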
Required Dependencies
For image handling, add these dependencies:
[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
base64 = "0.22"
image = "0.25" # Optional, for image processing
Troubleshooting
Image Too Large
If you get errors about image size, reduce the image dimensions:
#![allow(unused)] fn main() { // Resize before sending let img = image::load_from_memory(&image_data)?; let resized = img.resize(1024, 1024, image::imageops::FilterType::Lanczos3); }
Unsupported Format
Convert images to supported formats:
#![allow(unused)] fn main() { // Convert to PNG let img = image::load_from_memory(&image_data)?; let mut output = Vec::new(); img.write_to(&mut std::io::Cursor::new(&mut output), image::ImageFormat::Png)?; }
Model Doesn't Support Vision
Ensure you're using a vision-capable model. Non-vision models will reject or silently ignore image inputs, depending on server configuration.
See Also
- Basic Chat - Text-only examples
- Streaming Chat - Streaming responses
- API Reference - Complete API docs
Advanced Topics
This section covers advanced features and patterns for vLLM Client.
Available Topics
| Topic | Description |
|---|---|
| Thinking Mode | Reasoning models and thinking content |
| Custom Headers | Custom HTTP headers and authentication |
| Timeouts & Retries | Timeout configuration and retry strategies |
Thinking Mode
For models that support reasoning (like Qwen with thinking mode), access the reasoning_content field:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; let mut stream = client.chat.completions().create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([{"role": "user", "content": "Solve this puzzle"}])) .extra(json!({"chat_template_kwargs": {"think_mode": true}})) .stream(true) .send_stream() .await?; while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => eprintln!("[thinking] {}", delta), StreamEvent::Content(delta) => print!("{}", delta), _ => {} } } }
Custom Configuration
Environment-Based Configuration
#![allow(unused)] fn main() { use std::env; use vllm_client::VllmClient; fn create_client() -> VllmClient { VllmClient::builder() .base_url(env::var("VLLM_BASE_URL") .unwrap_or_else(|_| "http://localhost:8000/v1".to_string())) .api_key(env::var("VLLM_API_KEY").ok()) .timeout_secs(env::var("VLLM_TIMEOUT") .ok() .and_then(|s| s.parse().ok()) .unwrap_or(300)) .build() } }
Multiple Clients
#![allow(unused)] fn main() { use vllm_client::VllmClient; let primary = VllmClient::new("http://primary-server:8000/v1"); let fallback = VllmClient::new("http://fallback-server:8000/v1"); }
Production Patterns
Connection Pooling
The client reuses HTTP connections automatically. Create once and share:
#![allow(unused)] fn main() { use std::sync::Arc; use vllm_client::VllmClient; let client = Arc::new(VllmClient::new("http://localhost:8000/v1")); // Clone the Arc for each task let client1 = Arc::clone(&client); let client2 = Arc::clone(&client); }
Graceful Shutdown
Handle graceful shutdown with channels:
#![allow(unused)] fn main() { use tokio::sync::broadcast; let (shutdown_tx, _) = broadcast::channel::<()>(1); // Each worker subscribes for its own receiver let mut shutdown_rx = shutdown_tx.subscribe(); // In your request loop tokio::select! { result = make_request(&client) => { // Handle result } _ = shutdown_rx.recv() => { println!("Shutting down gracefully"); break; } } }
Request Queuing
For rate limiting, implement a queue:
#![allow(unused)] fn main() { use std::sync::Arc; use tokio::sync::Semaphore; use vllm_client::{VllmClient, VllmError, json}; // Pass the shared semaphore in so it is in scope for every caller async fn queued_request( client: &VllmClient, semaphore: &Arc<Semaphore>, // e.g. Semaphore::new(10) for max 10 concurrent prompt: &str, ) -> Result<String, VllmError> { let _permit = semaphore.acquire().await.unwrap(); client.chat.completions().create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map(|r| r.content.unwrap_or_default()) } }
Performance Tips
1. Reuse the Client
Creating a client has some overhead. Reuse it across requests:
#![allow(unused)] fn main() { // Good let client = VllmClient::new("http://localhost:8000/v1"); for prompt in prompts { let _ = client.chat.completions().create()...; } // Avoid for prompt in prompts { let client = VllmClient::new("http://localhost:8000/v1"); // Inefficient! let _ = client.chat.completions().create()...; } }
2. Use Streaming for Long Responses
Get faster time-to-first-token with streaming:
#![allow(unused)] fn main() { // Faster perceived latency let mut stream = client.chat.completions().create() .stream(true) .send_stream() .await?; }
3. Set Appropriate Timeouts
Match timeout to expected response time:
#![allow(unused)] fn main() { // Short queries let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(30); // Long generation tasks let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(600); }
4. Batch Requests
Process multiple prompts concurrently:
#![allow(unused)] fn main() { use futures::stream::{self, StreamExt}; let prompts = vec!["Hello", "Hi", "Hey"]; let results: Vec<_> = stream::iter(prompts) .map(|p| async move { // `async move` so each future owns its prompt client.chat.completions().create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": p}])) .send() .await }) .buffer_unordered(5) // Max 5 concurrent .collect() .await; }
Security Considerations
API Key Storage
Never hardcode API keys:
#![allow(unused)] fn main() { // Good: Use environment variables let api_key = std::env::var("VLLM_API_KEY")?; // Avoid: Hardcoded keys let api_key = "sk-secret-key"; // DON'T DO THIS! }
TLS Verification
The client uses reqwest which verifies TLS certificates by default. For development with self-signed certificates:
#![allow(unused)] fn main() { // Use a custom HTTP client if needed let http = reqwest::Client::builder() .danger_accept_invalid_certs(true) // Only for development! .timeout(std::time::Duration::from_secs(300)) .build()?; }
See Also
- API Reference - Complete API documentation
- Examples - Practical code examples
- Error Handling - Error handling strategies
Thinking Mode
Thinking mode (also known as reasoning mode) allows models to output their reasoning process before giving a final answer. This is particularly useful for complex reasoning tasks.
Overview
Some models, like Qwen with thinking mode enabled, can output two types of content:
- Reasoning Content - The model's internal "thinking" process
- Content - The final response to the user
Enabling Thinking Mode
Qwen Models
For Qwen models, enable thinking mode via the extra parameter:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "Solve: What is 15 * 23 + 47?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .send() .await?; }
Checking for Reasoning Content
In non-streaming responses, access reasoning content separately:
#![allow(unused)] fn main() { // Check for reasoning content if let Some(reasoning) = response.reasoning_content { println!("Reasoning: {}", reasoning); } // Get final content if let Some(content) = response.content { println!("Answer: {}", content); } }
Streaming with Thinking Mode
The best way to use thinking mode is with streaming:
use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = VllmClient::new("http://localhost:8000/v1"); let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "Think step by step: If I have 5 apples and give 2 to my friend, then buy 3 more, how many do I have?"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; println!("=== Thinking Process ===\n"); let mut in_thinking = true; let mut reasoning = String::new(); let mut content = String::new(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { reasoning.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Content(delta) => { if in_thinking { in_thinking = false; println!("\n\n=== Final Answer ===\n"); } content.push_str(&delta); print!("{}", delta); std::io::Write::flush(&mut std::io::stdout()).ok(); } StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("\nError: {}", e); break; } _ => {} } } println!(); Ok(()) }
Use Cases
Mathematical Reasoning
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; async fn solve_math_problem(client: &VllmClient, problem: &str) -> Result<String, Box<dyn std::error::Error>> { let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "system", "content": "You are a math tutor. Show your work clearly."}, {"role": "user", "content": problem} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; let mut answer = String::new(); while let Some(event) = stream.next().await { if let StreamEvent::Content(delta) = event { answer.push_str(&delta); } } Ok(answer) } }
Code Analysis
#![allow(unused)] fn main() { let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "Analyze this code for potential bugs and security issues:\n\n```rust\nfn process_input(input: &str) -> String {\n let mut result = String::new();\n for c in input.chars() {\n result.push(c);\n }\n result\n}\n```"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .send() .await?; }
Complex Decision Making
#![allow(unused)] fn main() { let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "system", "content": "You are a decision support assistant. Think through all options carefully."}, {"role": "user", "content": "I need to choose between job offers from Company A (high salary, long commute) and Company B (moderate salary, remote work). Help me decide."} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .max_tokens(2048) .send() .await?; }
Separating Reasoning from Answer
For applications that need to separate reasoning from the final answer:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; struct ThinkingResponse { reasoning: String, content: String, } async fn think_and_respond( client: &VllmClient, prompt: &str, ) -> Result<ThinkingResponse, Box<dyn std::error::Error>> { let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": prompt} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .stream(true) .send_stream() .await?; let mut response = ThinkingResponse { reasoning: String::new(), content: String::new(), }; while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => response.reasoning.push_str(&delta), StreamEvent::Content(delta) => response.content.push_str(&delta), StreamEvent::Done => break, _ => {} } } Ok(response) } }
Model Support
| Model | Thinking Mode Support |
|---|---|
| Qwen/Qwen2.5-72B-Instruct | ✅ Yes |
| Qwen/Qwen2.5-32B-Instruct | ✅ Yes |
| Qwen/Qwen2.5-7B-Instruct | ✅ Yes |
| DeepSeek-R1 | ✅ Yes (built-in) |
| Other models | ❌ Model dependent |
Check your vLLM server configuration to verify thinking mode support.
Configuration Options
Thinking Model Detection
The model automatically handles thinking tokens:
#![allow(unused)] fn main() { // Reasoning content is parsed from special tokens // Usually structured as: <think>...</think> or similar }
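If you ever receive raw text with the thinking tags still inline (for example from a server without a reasoning parser configured), the split can be done client-side. A sketch assuming Qwen-style `<think>` tags; the exact tag is model-dependent:

```rust
/// Splits `<think>...</think>` reasoning out of a raw completion string.
/// Returns (reasoning, answer); reasoning is None when no tags are present.
fn split_thinking(raw: &str) -> (Option<String>, String) {
    let (open, close) = ("<think>", "</think>");
    if let Some(start) = raw.find(open) {
        // Look for the closing tag only after the opening one
        if let Some(rel_end) = raw[start + open.len()..].find(close) {
            let end = start + open.len() + rel_end;
            let reasoning = raw[start + open.len()..end].trim().to_string();
            let answer = raw[end + close.len()..].trim().to_string();
            return (Some(reasoning), answer);
        }
    }
    (None, raw.trim().to_string())
}

fn main() {
    let raw = "<think>15 * 23 = 345, plus 47 is 392</think>The answer is 392.";
    let (reasoning, answer) = split_thinking(raw);
    assert!(reasoning.is_some());
    println!("answer: {}", answer);
}
```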
Non-Streaming Access
For non-streaming requests with reasoning:
#![allow(unused)] fn main() { let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-72B-Instruct") .messages(json!([ {"role": "user", "content": "Explain quantum entanglement"} ])) .extra(json!({ "chat_template_kwargs": { "think_mode": true } })) .send() .await?; // Access reasoning (if present) if let Some(reasoning) = response.reasoning_content { println!("Reasoning:\n{}\n", reasoning); } // Access final answer println!("Answer:\n{}", response.content.unwrap_or_default()); }
Best Practices
1. Use for Complex Tasks
Thinking mode is most beneficial for:
- Multi-step reasoning
- Mathematical problems
- Code analysis
- Complex decision making
#![allow(unused)] fn main() { // Good: Complex reasoning task .messages(json!([ {"role": "user", "content": "Solve this puzzle: A father is 4 times as old as his son. In 20 years, he will be only twice as old. How old are they now?"} ])) // Less beneficial: Simple query .messages(json!([ {"role": "user", "content": "What is 2 + 2?"} ])) }
2. Display Reasoning Selectively
You may want to hide reasoning in production but show it for debugging:
#![allow(unused)] fn main() { let show_reasoning = std::env::var("SHOW_REASONING").is_ok(); while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { if show_reasoning { eprintln!("[thinking] {}", delta); } } StreamEvent::Content(delta) => print!("{}", delta), _ => {} } } }
3. Combine with System Prompts
Guide the thinking process with system prompts:
#![allow(unused)] fn main() { .messages(json!([ { "role": "system", "content": "Think through problems step by step. Consider multiple approaches before settling on an answer." }, {"role": "user", "content": problem} ])) }
4. Adjust Max Tokens
Thinking mode uses more tokens. Adjust accordingly:
#![allow(unused)] fn main() { .max_tokens(4096) // Account for both reasoning and answer }
Troubleshooting
No Reasoning Content
If you don't see reasoning content:
- Ensure thinking mode is enabled via the `extra` parameters
- Verify the model supports thinking mode
- Check the vLLM server configuration and logs for any issues
Incomplete Streaming
If streaming seems incomplete:
#![allow(unused)] fn main() { // Ensure you handle all event types while let Some(event) = stream.next().await { match event { StreamEvent::Reasoning(delta) => { /* handle */ }, StreamEvent::Content(delta) => { /* handle */ }, StreamEvent::Done => break, StreamEvent::Error(e) => { eprintln!("Error: {}", e); break; } _ => {} // Don't forget other events } } }
See Also
- Streaming API - Streaming response documentation
- Examples - More usage examples
- Advanced Topics - Other advanced features
Custom Headers
This document explains how to use custom HTTP headers with vLLM Client.
Overview
While the vLLM Client handles standard authentication via API keys, you may need to add custom headers for:
- Custom authentication schemes
- Request tracing and debugging
- Rate limiting identifiers
- Custom metadata
Current Limitations
The current version of vLLM Client does not provide a built-in method for custom headers. However, you can work around this limitation in several ways.
Workaround: Environment Variables
If your vLLM server accepts configuration via environment variables or specific API parameters:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let client = VllmClient::new("http://localhost:8000/v1") .with_api_key(std::env::var("MY_API_KEY").unwrap_or_default()); }
Workaround: Via Extra Parameters
Some custom configurations can be passed through the extra() method:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json}; let response = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": "Hello!"}])) .extra(json!({ "custom_field": "custom_value", "request_id": "req-12345" })) .send() .await?; }
Future Support
Custom header support is planned for future versions. The API will likely look like:
// Future API (not yet implemented)
let client = VllmClient::new("http://localhost:8000/v1")
.with_header("X-Custom-Header", "value")
.with_header("X-Request-ID", "req-123");
Common Use Cases
Tracing Headers
For distributed tracing (when supported):
// Future API
let client = VllmClient::builder()
.base_url("http://localhost:8000/v1")
.header("X-Trace-ID", trace_id)
.header("X-Span-ID", span_id)
.build();
Custom Authentication
For non-standard authentication schemes:
// Future API
let client = VllmClient::builder()
.base_url("http://localhost:8000/v1")
.header("X-API-Key", "custom-key")
.header("X-Tenant-ID", "tenant-123")
.build();
Request Metadata
Add metadata for logging or analytics:
// Future API
let client = VllmClient::builder()
.base_url("http://localhost:8000/v1")
.header("X-Request-Source", "mobile-app")
.header("X-User-ID", "user-456")
.build();
Alternative: Custom HTTP Client
For advanced use cases, you can use the underlying reqwest client directly:
use reqwest::Client; use serde_json::json; #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = Client::new(); let response = client .post("http://localhost:8000/v1/chat/completions") .header("Content-Type", "application/json") .header("Authorization", "Bearer your-api-key") .header("X-Custom-Header", "custom-value") .json(&json!({ "model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello!"}] })) .send() .await?; let result: serde_json::Value = response.json().await?; println!("{:?}", result); Ok(()) }
Best Practices
1. Use Standard Authentication When Possible
#![allow(unused)] fn main() { // Preferred let client = VllmClient::new("http://localhost:8000/v1") .with_api_key("your-api-key"); // Avoid custom auth unless necessary }
2. Document Custom Headers
When using custom headers, document their purpose:
// Future API
let client = VllmClient::builder()
.base_url("http://localhost:8000/v1")
// For request tracing in logs
.header("X-Request-ID", &request_id)
// For multi-tenant identification
.header("X-Tenant-ID", &tenant_id)
.build();
3. Validate Server Support
Ensure your vLLM server accepts and processes custom headers. Some proxies or load balancers may strip unknown headers.
Security Considerations
Don't Expose Sensitive Headers
Avoid logging headers that contain sensitive information:
// Be careful with logging
let auth_header = "Bearer secret-key";
// Don't log this directly!
Use HTTPS
Always use HTTPS when transmitting sensitive headers:
#![allow(unused)] fn main() { // Good let client = VllmClient::new("https://api.example.com/v1"); // Avoid for sensitive data let client = VllmClient::new("http://api.example.com/v1"); }
Requesting This Feature
If you need custom header support, please open an issue on GitHub with:
- Your use case
- Required headers
- How you'd like the API to look
See Also
- Timeouts & Retries - Configure timeouts and retry logic
- Thinking Mode - Reasoning model support
- Client API - Client configuration options
Timeouts & Retries
This page covers timeout configuration and retry strategies for robust production applications.
Setting Timeouts
Client-Level Timeout
Set a timeout when creating the client:
#![allow(unused)] fn main() { use vllm_client::VllmClient; // Simple timeout let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(120); // Using builder let client = VllmClient::builder() .base_url("http://localhost:8000/v1") .timeout_secs(300) // 5 minutes .build(); }
Choosing the Right Timeout
| Use Case | Recommended Timeout |
|---|---|
| Simple queries | 30-60 seconds |
| Code generation | 2-3 minutes |
| Long document generation | 5-10 minutes |
| Complex reasoning tasks | 10+ minutes |
Request Duration Factors
The time a request takes depends on:
- Prompt length - Longer prompts take more time to process
- Output tokens - More tokens = longer generation time
- Model size - Larger models are slower
- Server load - Busy servers respond slower
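These factors can be folded into a rough timeout heuristic. The decode rate and overhead below are illustrative assumptions, not measured values; tune them for your model and hardware:

```rust
/// Rough client timeout from the expected output length.
/// `tokens_per_sec` is an assumed decode rate; measure your own deployment.
fn suggested_timeout_secs(max_tokens: u64, tokens_per_sec: u64) -> u64 {
    let generation_secs = max_tokens / tokens_per_sec.max(1);
    // Fixed overhead for prompt processing and queueing, doubled as a
    // safety margin, with a 30-second floor for short requests.
    ((generation_secs + 10) * 2).max(30)
}

fn main() {
    // 1024 output tokens at ~20 tok/s on a loaded server
    println!("suggested timeout: {}s", suggested_timeout_secs(1024, 20));
}
```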
Timeout Errors
Handling Timeout
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, VllmError}; async fn chat_with_timeout(prompt: &str) -> Result<String, VllmError> { let client = VllmClient::new("http://localhost:8000/v1") .timeout_secs(60); let result = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await; match result { Ok(response) => Ok(response.content.unwrap_or_default()), Err(VllmError::Timeout) => { eprintln!("Request timed out after 60 seconds"); Err(VllmError::Timeout) } Err(e) => Err(e), } } }
Retry Strategies
Basic Retry
Retry failed requests with exponential backoff:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, VllmError}; use std::time::Duration; use tokio::time::sleep; async fn send_with_retry( client: &VllmClient, prompt: &str, max_retries: u32, ) -> Result<String, VllmError> { let mut attempts = 0; loop { match client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await { Ok(response) => { return Ok(response.content.unwrap_or_default()); } Err(e) if e.is_retryable() && attempts < max_retries => { attempts += 1; let delay = Duration::from_millis(100 * 2u64.pow(attempts - 1)); eprintln!("Retry {} after {:?}: {}", attempts, delay, e); sleep(delay).await; } Err(e) => return Err(e), } } } }
Retry with Jitter
Add jitter to prevent thundering herd:
#![allow(unused)] fn main() { use rand::Rng; use std::time::Duration; use tokio::time::sleep; fn backoff_with_jitter(attempt: u32, base_ms: u64, max_ms: u64) -> Duration { let exponential = base_ms * 2u64.pow(attempt); let jitter = rand::thread_rng().gen_range(0..base_ms); let delay = (exponential + jitter).min(max_ms); Duration::from_millis(delay) } async fn retry_with_jitter<F, T, E>( mut f: F, max_retries: u32, ) -> Result<T, E> where F: FnMut() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, E>> + Send>>, E: std::fmt::Debug, { let mut attempts = 0; loop { match f().await { Ok(result) => return Ok(result), Err(e) if attempts < max_retries => { attempts += 1; let delay = backoff_with_jitter(attempts, 100, 10_000); eprintln!("Retry {} after {:?}: {:?}", attempts, delay, e); sleep(delay).await; } Err(e) => return Err(e), } } } }
Retry Only Retryable Errors
Not all errors should be retried:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, VllmError}; async fn smart_retry( client: &VllmClient, prompt: &str, ) -> Result<String, VllmError> { let mut attempts = 0; let max_retries = 3; loop { let result = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await; match result { Ok(response) => return Ok(response.content.unwrap_or_default()), Err(e) => { // Check if error is retryable if !e.is_retryable() { return Err(e); } if attempts >= max_retries { return Err(e); } attempts += 1; tokio::time::sleep(std::time::Duration::from_secs(2u64.pow(attempts))).await; } } } } }
Retryable Errors
| Error | Retryable | Reason |
|---|---|---|
| Timeout | Yes | Server may be slow |
| 429 Rate Limited | Yes | Wait and retry |
| 500 Server Error | Yes | Temporary server issue |
| 502 Bad Gateway | Yes | Server may be restarting |
| 503 Unavailable | Yes | Temporary overload |
| 504 Gateway Timeout | Yes | Upstream timed out |
| 400 Bad Request | No | Client error |
| 401 Unauthorized | No | Authentication issue |
| 404 Not Found | No | Resource doesn't exist |
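The table maps directly onto a small predicate. A sketch assuming you are matching on raw HTTP status codes yourself, rather than relying on the client's own `is_retryable()`:

```rust
/// Whether an HTTP status code is worth retrying, mirroring the table above.
/// Timeouts are handled separately: they surface as a client-side error,
/// not as a status code.
fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500 | 502 | 503 | 504)
}

fn main() {
    assert!(is_retryable_status(503));
    assert!(!is_retryable_status(400));
    println!("ok");
}
```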
Circuit Breaker Pattern
Prevent cascading failures with a circuit breaker:
#![allow(unused)] fn main() { use std::sync::atomic::{AtomicU32, Ordering}; use std::time::{Duration, Instant}; use std::sync::Mutex; struct CircuitBreaker { failures: AtomicU32, last_failure: Mutex<Option<Instant>>, threshold: u32, reset_duration: Duration, } impl CircuitBreaker { fn new(threshold: u32, reset_duration: Duration) -> Self { Self { failures: AtomicU32::new(0), last_failure: Mutex::new(None), threshold, reset_duration, } } fn can_attempt(&self) -> bool { let failures = self.failures.load(Ordering::Relaxed); if failures < self.threshold { return true; } let last = self.last_failure.lock().unwrap(); if let Some(time) = *last { if time.elapsed() > self.reset_duration { // Reset circuit breaker self.failures.store(0, Ordering::Relaxed); return true; } } false } fn record_success(&self) { self.failures.store(0, Ordering::Relaxed); } fn record_failure(&self) { self.failures.fetch_add(1, Ordering::Relaxed); *self.last_failure.lock().unwrap() = Some(Instant::now()); } } }
Streaming Timeout
Handle timeouts during streaming:
#![allow(unused)] fn main() { use vllm_client::{VllmClient, json, StreamEvent}; use futures::StreamExt; use tokio::time::{timeout, Duration}; async fn stream_with_timeout( client: &VllmClient, prompt: &str, per_event_timeout: Duration, ) -> Result<String, vllm_client::VllmError> { let mut stream = client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .stream(true) .send_stream() .await?; let mut content = String::new(); loop { match timeout(per_event_timeout, stream.next()).await { Ok(Some(event)) => { match event { StreamEvent::Content(delta) => content.push_str(&delta), StreamEvent::Done => break, StreamEvent::Error(e) => return Err(e), _ => {} } } Ok(None) => break, Err(_) => { return Err(vllm_client::VllmError::Timeout); } } } Ok(content) } }
Rate Limiting
Implement client-side rate limiting:
#![allow(unused)] fn main() { use tokio::sync::Semaphore; use std::sync::Arc; struct RateLimitedClient { client: vllm_client::VllmClient, semaphore: Arc<Semaphore>, } impl RateLimitedClient { fn new(base_url: &str, max_concurrent: usize) -> Self { Self { client: vllm_client::VllmClient::new(base_url), semaphore: Arc::new(Semaphore::new(max_concurrent)), } } async fn chat(&self, prompt: &str) -> Result<String, vllm_client::VllmError> { let _permit = self.semaphore.acquire().await.unwrap(); self.client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(vllm_client::json!([{"role": "user", "content": prompt}])) .send() .await .map(|r| r.content.unwrap_or_default()) } } }
Production Configuration
Complete Example
use vllm_client::{VllmClient, json, VllmError}; use std::time::Duration; use tokio::time::sleep; struct RobustClient { client: VllmClient, max_retries: u32, base_backoff_ms: u64, max_backoff_ms: u64, } impl RobustClient { fn new(base_url: &str, timeout_secs: u64) -> Self { Self { client: VllmClient::builder() .base_url(base_url) .timeout_secs(timeout_secs) .build(), max_retries: 3, base_backoff_ms: 100, max_backoff_ms: 10_000, } } async fn chat(&self, prompt: &str) -> Result<String, VllmError> { let mut attempts = 0; loop { match self.send_request(prompt).await { Ok(response) => return Ok(response), Err(e) if self.should_retry(&e, attempts) => { attempts += 1; let delay = self.calculate_backoff(attempts); eprintln!("Retry {} after {:?}: {}", attempts, delay, e); sleep(delay).await; } Err(e) => return Err(e), } } } async fn send_request(&self, prompt: &str) -> Result<String, VllmError> { self.client .chat .completions() .create() .model("Qwen/Qwen2.5-7B-Instruct") .messages(json!([{"role": "user", "content": prompt}])) .send() .await .map(|r| r.content.unwrap_or_default()) } fn should_retry(&self, error: &VllmError, attempts: u32) -> bool { attempts < self.max_retries && error.is_retryable() } fn calculate_backoff(&self, attempt: u32) -> Duration { let delay = self.base_backoff_ms * 2u64.pow(attempt); Duration::from_millis(delay.min(self.max_backoff_ms)) } } #[tokio::main] async fn main() -> Result<(), Box<dyn std::error::Error>> { let client = RobustClient::new("http://localhost:8000/v1", 300); match client.chat("Hello!").await { Ok(response) => println!("Response: {}", response), Err(e) => eprintln!("Failed after retries: {}", e), } Ok(()) }
Best Practices
- Set appropriate timeouts based on expected response times
- Use exponential backoff to avoid overwhelming the server
- Add jitter to prevent thundering herd problems
- Only retry retryable errors - don't retry client errors
- Implement circuit breakers for production systems
- Log retry attempts for debugging and monitoring
- Set a maximum retry count to avoid infinite loops
See Also
- Error Handling - Error types and handling
- Streaming - Streaming API
- Configuration - Client configuration
Contributing to vLLM Client
Thank you for your interest in contributing to vLLM Client! This document provides guidelines and instructions for contributing.
Table of Contents
- Code of Conduct
- Getting Started
- Development Setup
- Making Changes
- Testing
- Documentation
- Pull Request Process
- Coding Standards
Code of Conduct
Be respectful and inclusive. We welcome contributions from everyone.
Getting Started
- Fork the repository on GitHub
- Clone your fork locally
- Create a branch for your changes
git clone https://github.com/YOUR_USERNAME/vllm-client.git
cd vllm-client
git checkout -b my-feature
Development Setup
Prerequisites
- Rust 1.70 or later
- Cargo (comes with Rust)
- A vLLM server for integration testing (optional)
Building
# Build the library
cargo build
# Build with all features
cargo build --all-features
Running Tests
# Run unit tests
cargo test
# Run tests with output
cargo test -- --nocapture
# Run specific test
cargo test test_name
# Run integration tests (requires vLLM server)
cargo test --test integration
Making Changes
Branch Naming
Use descriptive branch names:
- `feature/add-new-feature` for new features
- `fix/bug-description` for bug fixes
- `docs/documentation-update` for documentation changes
- `refactor/code-cleanup` for refactoring
Commit Messages
Follow conventional commit format:
type(scope): description
[optional body]
[optional footer]
Types:
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, etc.)
- `refactor`: Code refactoring
- `test`: Adding or updating tests
- `chore`: Maintenance tasks
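Commit subjects in this format can be linted mechanically. A sketch (not part of the project tooling) that accepts the types listed above:

```rust
/// Returns true if the subject follows `type(scope)?: description`.
fn is_conventional(subject: &str) -> bool {
    const TYPES: [&str; 7] = ["feat", "fix", "docs", "style", "refactor", "test", "chore"];
    // Split off the description after "type(scope): "
    let Some((head, desc)) = subject.split_once(": ") else { return false };
    if desc.trim().is_empty() {
        return false;
    }
    // Strip an optional `(scope)` suffix from the type
    let ty = head.split_once('(').map_or(head, |(t, rest)| {
        if rest.ends_with(')') { t } else { head }
    });
    TYPES.contains(&ty)
}

fn main() {
    assert!(is_conventional("feat(client): add connection pooling support"));
    assert!(!is_conventional("update stuff"));
    println!("ok");
}
```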
Examples:
feat(client): add connection pooling support
fix(streaming): handle empty chunks correctly
docs(api): update streaming documentation
Testing
Unit Tests
All new functionality should have unit tests:
#![allow(unused)] fn main() { #[cfg(test)] mod tests { use super::*; #[test] fn test_new_feature() { // Test implementation } } }
Integration Tests
Integration tests go in the tests/ directory:
// tests/integration_test.rs
use vllm_client::{VllmClient, json};

#[tokio::test]
async fn test_chat_completion() {
    let client = VllmClient::new("http://localhost:8000/v1");
    // ... test code
}
Test Coverage
We aim for good test coverage. Run coverage reports:
cargo tarpaulin --out Html
Documentation
Code Documentation
Document all public APIs with doc comments:
/// Creates a new chat completion request.
///
/// # Arguments
///
/// * `model` - The model name to use for generation
///
/// # Returns
///
/// A new `ChatCompletionsRequest` builder
///
/// # Example
///
/// ```rust
/// use vllm_client::{VllmClient, json};
///
/// let client = VllmClient::new("http://localhost:8000/v1");
/// let response = client.chat.completions().create()
///     .model("Qwen/Qwen2.5-7B-Instruct")
///     .messages(json!([{"role": "user", "content": "Hello"}]))
///     .send()
///     .await?;
/// ```
pub fn create(&self) -> ChatCompletionsRequest {
    // Implementation
}
Updating Documentation
When adding new features:
- Update inline documentation
- Update the API reference in `docs/src/api/`
- Add examples to `docs/src/examples/`
- Update the changelog
Building Documentation
# Build and preview documentation
cd docs && mdbook serve --open
Pull Request Process
- Update Documentation: Ensure documentation reflects your changes
- Add Tests: Include tests for new functionality
- Run Tests: Make sure all tests pass
- Format Code: Run `cargo fmt`
- Check Lints: Run `cargo clippy`
- Update CHANGELOG: Add an entry to the changelog
Pre-PR Checklist
# Format code
cargo fmt
# Check for lints
cargo clippy -- -D warnings
# Run all tests
cargo test
# Build documentation
mdbook build docs
mdbook build docs/zh
Submitting the PR
- Push your branch to your fork
- Open a PR against the `main` branch
- Fill in the PR template
- Wait for review
PR Template
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing done
## Checklist
- [ ] Code formatted with `cargo fmt`
- [ ] No clippy warnings
- [ ] Documentation updated
- [ ] Changelog updated
Coding Standards
Rust Style
Follow standard Rust conventions:
- Use `cargo fmt` for formatting
- Address all `clippy` warnings
- Follow the Rust API Guidelines
Naming Conventions
- Types: PascalCase (`ChatCompletionResponse`)
- Functions/Methods: snake_case (`send_stream`)
- Constants: SCREAMING_SNAKE_CASE (`MAX_RETRIES`)
- Modules: snake_case (`chat`, `completions`)
Error Handling
Use VllmError for all errors:
// Good
pub fn parse_response(data: &str) -> Result<Response, VllmError> {
    serde_json::from_str(data).map_err(VllmError::Json)
}

// Avoid
pub fn parse_response(data: &str) -> Result<Response, String> {
    // ...
}
Async Code
Use async/await for all async operations:
// Good
pub async fn send(&self) -> Result<Response, VllmError> {
    let response = self.http.post(&url).send().await?;
    // ...
}

// Avoid blocking in async context
pub async fn bad_example(&self) -> Result<Response, VllmError> {
    std::thread::sleep(Duration::from_secs(1)); // Don't do this
    // ...
}
Project Structure
vllm-client/
├── src/
│ ├── lib.rs # Library entry point
│ ├── client.rs # Client implementation
│ ├── chat.rs # Chat API
│ ├── completions.rs # Legacy completions
│ ├── types.rs # Type definitions
│ └── error.rs # Error types
├── tests/
│ └── integration/ # Integration tests
├── docs/
│ ├── src/ # English documentation
│ └── zh/src/ # Chinese documentation
├── examples/
│ └── *.rs # Example programs
└── Cargo.toml
Getting Help
- Open an issue for bugs or feature requests
- Start a discussion for questions
- Check existing issues before creating new ones
License
By contributing, you agree that your contributions will be licensed under the MIT OR Apache-2.0 license.
Recognition
Contributors are recognized in our README and release notes.
Thank you for contributing to vLLM Client!
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
0.1.0 - 2024-01-XX
Added
- Initial release of vLLM Client
- `VllmClient` for connecting to vLLM servers
- Chat completions API (`client.chat.completions()`)
- Streaming response support with `MessageStream`
- Tool/function calling support
- Reasoning/thinking mode support for compatible models
- Error handling with the `VllmError` enum
- Builder pattern for client configuration
- Request builder pattern for chat completions
- Support for vLLM-specific parameters via `extra()`
- Token usage tracking in responses
- Timeout configuration
- API key authentication
Features
Client
- `VllmClient::new(base_url)` - Create a new client
- `VllmClient::builder()` - Create a client with the builder pattern
- `with_api_key()` - Set an API key for authentication
- `timeout_secs()` - Set the request timeout
Chat Completions
- `model()` - Set the model name
- `messages()` - Set conversation messages
- `temperature()` - Set the sampling temperature
- `max_tokens()` - Set maximum output tokens
- `top_p()` - Set the nucleus sampling parameter
- `top_k()` - Set top-k sampling (vLLM extension)
- `stop()` - Set stop sequences
- `stream()` - Enable streaming mode
- `tools()` - Define available tools
- `tool_choice()` - Control tool selection
- `extra()` - Pass vLLM-specific parameters
Streaming
- `StreamEvent::Content` - Content tokens
- `StreamEvent::Reasoning` - Reasoning content (thinking models)
- `StreamEvent::ToolCallDelta` - Streaming tool call updates
- `StreamEvent::ToolCallComplete` - Complete tool call
- `StreamEvent::Usage` - Token usage statistics
- `StreamEvent::Done` - Stream completion
- `StreamEvent::Error` - Error events
Response Types
- `ChatCompletionResponse` - Chat completion response
- `ToolCall` - Tool call data with parsing methods
- `Usage` - Token usage statistics
Dependencies
- `reqwest` - HTTP client
- `serde` / `serde_json` - JSON serialization
- `tokio` - Async runtime
- `thiserror` - Error handling
[Unreleased]
Planned
- Custom HTTP headers support
- Connection pooling configuration
- Request/response logging
- Retry middleware
- Multi-modal input helpers
- Async iterator for batch processing
- OpenTelemetry integration
- WebSocket transport
Version History
| Version | Date | Highlights |
|---|---|---|
| 0.1.0 | 2024-01 | Initial release |