vLLM Client

A Rust client library for the vLLM API with an OpenAI-compatible interface.

Features

  • OpenAI Compatible: Uses the same API structure as OpenAI, making it easy to switch
  • Streaming Support: Full support for streaming responses with Server-Sent Events (SSE)
  • Tool Calling: Support for function/tool calling with streaming delta updates
  • Reasoning Models: Built-in support for reasoning/thinking models (like Qwen with thinking mode)
  • Async/Await: Fully async using Tokio runtime
  • Type Safe: Strong types with Serde serialization

Quick Start

Add to your Cargo.toml:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }

Basic Usage

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, world!"}
        ]))
        .send()
        .await?;
    
    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

License

Licensed under either of Apache License, Version 2.0 or MIT license at your option.

Getting Started

Installation

Add vllm-client to your Cargo.toml:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }

Quick Start

Basic Chat Completion

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client
    let client = VllmClient::new("http://localhost:8000/v1");
    
    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;
    
    // Print the response
    println!("{}", response.content.unwrap_or_default());
    
    Ok(())
}

Streaming Response

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("your-model-name")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;
    
    while let Some(event) = stream.next().await {
        match &event {
            StreamEvent::Reasoning(delta) => print!("{}", delta),
            StreamEvent::Content(delta) => print!("{}", delta),
            _ => {}
        }
    }
    
    println!();
    Ok(())
}

Configuration

API Key

If your vLLM server requires authentication:

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");
}

Custom Timeout

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(60); // 1 minute
}

Installation

Requirements

  • Rust: 1.70 or later
  • Cargo: Comes with Rust installation

Adding to Your Project

Add vllm-client to your Cargo.toml:

[dependencies]
vllm-client = "0.1"

Or use cargo add:

cargo add vllm-client

Required Dependencies

The library requires tokio as its async runtime. Add it to your Cargo.toml:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }

Optional Dependencies

The library re-exports serde_json::json for convenience, so a direct serde_json dependency is only needed if you use other serde_json APIs:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"

Feature Flags

Currently, vllm-client does not have additional feature flags. All functionality is included by default.

Verifying Installation

Create a simple test to verify the installation:

use vllm_client::VllmClient;

fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");
    println!("Client created with base URL: {}", client.base_url());
}

Run with:

cargo run

vLLM Server Setup

To use this client, you need a vLLM server running. Install and start vLLM:

# Install vLLM
pip install vllm

# Start vLLM server with a model
vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

The server will be available at http://localhost:8000/v1.

Troubleshooting

Connection Refused

If you see connection errors, ensure:

  1. The vLLM server is running
  2. The server URL is correct (default: http://localhost:8000/v1)
  3. The port is not blocked by firewall

TLS/SSL Issues

If your vLLM server uses HTTPS with a self-signed certificate, requests may fail TLS verification. You will need to handle certificate validation in your application, for example by adding the certificate to your system trust store.

Timeout Errors

For long-running requests, configure a longer timeout:

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes
}

Quick Start

This guide will help you make your first API call with vLLM Client.

Prerequisites

  • Rust 1.70 or later
  • A running vLLM server

Basic Chat Completion

The simplest way to use the client is a non-streaming chat completion, which returns the complete response in a single call:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a client pointing to your vLLM server
    let client = VllmClient::new("http://localhost:8000/v1");

    // Send a chat completion request
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    // Print the response
    println!("Response: {}", response.content.unwrap_or_default());

    Ok(())
}

Streaming Response

For real-time output, use streaming:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Create a streaming request
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    // Process streaming events
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Reasoning(delta) => eprint!("[thinking: {}]", delta),
            StreamEvent::Done => println!("\n[Done]"),
            StreamEvent::Error(e) => eprintln!("\nError: {}", e),
            _ => {}
        }
    }

    Ok(())
}

Using the Builder Pattern

For more configuration options, use the builder:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")  // Optional
    .timeout_secs(120)         // Optional
    .build();
}

Complete Example with Options

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    println!("Response: {}", response.content.unwrap_or_default());
    
    // Print usage statistics if available
    if let Some(usage) = response.usage {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }

    Ok(())
}

Error Handling

Handle errors gracefully:

use vllm_client::{VllmClient, json, VllmError};

async fn chat() -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat().await {
        Ok(text) => println!("Response: {}", text),
        Err(VllmError::ApiError { status_code, message, .. }) => {
            eprintln!("API Error ({}): {}", status_code, message);
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out");
        }
        Err(e) => {
            eprintln!("Error: {}", e);
        }
    }
}

Configuration

This page covers all configuration options for vllm-client.

Client Configuration

Basic Setup

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
}

Using the Builder Pattern

For more complex configurations, use the builder pattern:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("your-api-key")
    .timeout_secs(120)
    .build();
}

Configuration Options

Base URL

The base URL of your vLLM server. This should include the /v1 path for OpenAI compatibility.

#![allow(unused)]
fn main() {
// Local development
let client = VllmClient::new("http://localhost:8000/v1");

// Remote server
let client = VllmClient::new("https://api.example.com/v1");

// With trailing slash (automatically normalized)
let client = VllmClient::new("http://localhost:8000/v1/");
// Equivalent to: "http://localhost:8000/v1"
}
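The normalization above can be sketched as a small helper. `normalize_base_url` is a hypothetical name chosen for illustration, not part of the library's public API:

```rust
/// Strip trailing slashes so request paths can be appended cleanly.
/// Illustrative sketch only; the client performs this normalization itself.
fn normalize_base_url(url: &str) -> String {
    url.trim_end_matches('/').to_string()
}

fn main() {
    // "http://localhost:8000/v1/" and "http://localhost:8000/v1" become equivalent.
    assert_eq!(
        normalize_base_url("http://localhost:8000/v1/"),
        "http://localhost:8000/v1"
    );
}
```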

API Key

If your vLLM server requires authentication, configure the API key:

#![allow(unused)]
fn main() {
// Using method chain
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");

// Using builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .build();
}

The API key is sent as a Bearer token in the Authorization header.
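Concretely, the header value has the standard Bearer shape. A minimal sketch (the client builds this for you; `authorization_header` is a hypothetical helper, not library API):

```rust
/// Build the Authorization header value for a given API key.
/// Illustration of the Bearer scheme the client uses internally.
fn authorization_header(api_key: &str) -> String {
    format!("Bearer {}", api_key)
}

fn main() {
    assert_eq!(authorization_header("sk-your-api-key"), "Bearer sk-your-api-key");
}
```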

Timeout

Configure the request timeout for long-running operations:

#![allow(unused)]
fn main() {
// Using method chain
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes

// Using builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)
    .build();
}

Default timeout uses the underlying HTTP client's default (usually 30 seconds).

Request Configuration

When making requests, you can configure various parameters:

Model Selection

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
}

Sampling Parameters

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .temperature(0.7)      // 0.0 - 2.0
    .top_p(0.9)            // 0.0 - 1.0
    .top_k(50)             // vLLM extension
    .max_tokens(1024)      // Max output tokens
    .send()
    .await?;
}

Parameter    Type  Range      Description
temperature  f32   0.0 - 2.0  Controls randomness. Higher = more random
top_p        f32   0.0 - 1.0  Nucleus sampling threshold
top_k        i32   1+         Top-K sampling (vLLM extension)
max_tokens   u32   1+         Maximum tokens to generate

Stop Sequences

#![allow(unused)]
fn main() {
use serde_json::json;

// Multiple stop sequences
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!(["END", "STOP", "\n\n"]))
    .send()
    .await?;

// Single stop sequence
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stop(json!("END"))
    .send()
    .await?;
}

Extra Parameters

vLLM supports additional parameters via the extra() method:

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Think about this"}]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        },
        "reasoning_effort": "high"
    }))
    .send()
    .await?;
}

Environment Variables

You can use environment variables to configure the client:

#![allow(unused)]
fn main() {
use std::env;
use vllm_client::VllmClient;

let base_url = env::var("VLLM_BASE_URL")
    .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());

let api_key = env::var("VLLM_API_KEY").ok();

let timeout_secs = env::var("VLLM_TIMEOUT")
    .ok()
    .and_then(|s| s.parse::<u64>().ok());

let mut client_builder = VllmClient::builder()
    .base_url(&base_url);

if let Some(key) = api_key {
    client_builder = client_builder.api_key(&key);
}

if let Some(secs) = timeout_secs {
    client_builder = client_builder.timeout_secs(secs);
}

let client = client_builder.build();
}

Variable       Description         Example
VLLM_BASE_URL  vLLM server URL     http://localhost:8000/v1
VLLM_API_KEY   API key (optional)  sk-xxx
VLLM_TIMEOUT   Timeout in seconds  300

Best Practices

Reusing the Client

Create the client once and reuse it for multiple requests:

#![allow(unused)]
fn main() {
// Good: Reuse client
let client = VllmClient::new("http://localhost:8000/v1");

for prompt in prompts {
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;
}

// Avoid: Creating client for each request
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // Inefficient!
    // ...
}
}

Timeout Selection

Choose appropriate timeouts based on your use case:

Use Case                  Recommended Timeout
Simple queries            30 seconds
Complex reasoning         2-5 minutes
Long document generation  10+ minutes

Error Handling

Always handle errors appropriately:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, VllmError};

match client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await
{
    Ok(response) => println!("{}", response.content.unwrap()),
    Err(VllmError::Timeout) => eprintln!("Request timed out"),
    Err(VllmError::ApiError { status_code, message, .. }) => {
        eprintln!("API error ({}): {}", status_code, message);
    }
    Err(e) => eprintln!("Error: {}", e),
}
}

API Reference

This section provides detailed documentation for the vLLM Client API.

Design Philosophy

The vLLM Client API follows these design principles:

Builder Pattern

All request constructions use the builder pattern for ergonomic and flexible API calls:

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("model-name")
    .messages(json!([{"role": "user", "content": "Hello"}]))
    .temperature(0.7)
    .max_tokens(1024)
    .send()
    .await?;
}

Async-First

All API operations are async, built on Tokio. Use #[tokio::main] or integrate with your existing runtime:

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Your async code here
}

Type Safety

Strong types are used throughout the library with Serde serialization:

  • ChatCompletionResponse - Response from chat completions
  • StreamEvent - Events from streaming responses
  • ToolCall - Tool/function call data
  • VllmError - Comprehensive error types

OpenAI Compatibility

The API mirrors the OpenAI API structure, making it easy to migrate existing code:

OpenAI                               vLLM Client
client.chat.completions.create(...)  client.chat.completions().create()...send().await
stream=True                          .stream(true).send_stream().await
tools=[...]                          .tools(json!([...]))

Module Structure

VllmClient
├── chat
│   └── completions()      # Chat completions API
│       └── create()       # Create request builder
│           ├── send()         # Execute request
│           └── send_stream()  # Execute with streaming
├── completions            # Legacy completions API
└── builder()              # Client builder

Core Types

Request Types

Type                    Description
ChatCompletionsRequest  Builder for chat completion requests
VllmClientBuilder       Builder for client configuration

Response Types

Type                    Description
ChatCompletionResponse  Response from chat completions
CompletionResponse      Response from legacy completions
MessageStream           Streaming response iterator
StreamEvent             Individual stream events
ToolCall                Tool/function call data
Usage                   Token usage statistics

Error Types

Type                 Description
VllmError::Http      HTTP request failed
VllmError::Json      JSON serialization error
VllmError::ApiError  API returned error
VllmError::Stream    Streaming error
VllmError::Timeout   Connection timeout

Quick Reference

Creating a Client

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

// Simple
let client = VllmClient::new("http://localhost:8000/v1");

// With API key
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");

// With builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
}

Chat Completion

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .temperature(0.7)
    .max_tokens(1024)
    .send()
    .await?;

println!("{}", response.content.unwrap());
}

Streaming

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::Reasoning(delta) => eprintln!("[thinking] {}", delta),
        StreamEvent::Done => break,
        _ => {}
    }
}
}

Client API

The VllmClient is the main entry point for interacting with the vLLM API.

Creating a Client

Simple Construction

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1");
}

With API Key

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-your-api-key");
}

With Timeout

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120); // 2 minutes
}

Using the Builder Pattern

For more complex configurations, use the builder:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-your-api-key")
    .timeout_secs(300)
    .build();
}

Methods Reference

new(base_url: impl Into<String>) -> Self

Create a new client with the given base URL.

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1");
}

Parameters:

  • base_url - The base URL of the vLLM server (should include /v1 path)

Notes:

  • Trailing slashes are automatically removed
  • The client is cheap to create but should be reused when possible

with_api_key(self, api_key: impl Into<String>) -> Self

Set the API key for authentication (builder pattern).

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
}

Parameters:

  • api_key - The API key to use for Bearer authentication

Notes:

  • The API key is sent as a Bearer token in the Authorization header
  • This method returns a new client instance

timeout_secs(self, secs: u64) -> Self

Set the request timeout in seconds (builder pattern).

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300);
}

Parameters:

  • secs - Timeout duration in seconds

Notes:

  • Applies to all requests made by this client
  • For long-running generation tasks, consider setting a higher timeout

base_url(&self) -> &str

Get the base URL of the client.

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1");
assert_eq!(client.base_url(), "http://localhost:8000/v1");
}

api_key(&self) -> Option<&str>

Get the API key, if configured.

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("sk-xxx");
assert_eq!(client.api_key(), Some("sk-xxx"));
}

builder() -> VllmClientBuilder

Create a new client builder for more configuration options.

#![allow(unused)]
fn main() {
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .api_key("sk-xxx")
    .timeout_secs(120)
    .build();
}

API Modules

The client provides access to different API modules:

chat - Chat Completions API

Access the chat completions API for conversational interactions:

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
}

completions - Legacy Completions API

Access the legacy completions API for text completion:

#![allow(unused)]
fn main() {
let response = client.completions.create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .prompt("Once upon a time")
    .send()
    .await?;
}

VllmClientBuilder

The builder provides a flexible way to configure the client.

Methods

Method              Type               Description
base_url(url)       impl Into<String>  Set the base URL
api_key(key)        impl Into<String>  Set the API key
timeout_secs(secs)  u64                Set timeout in seconds
build()             -                  Build the client

Default Values

Option        Default
base_url      http://localhost:8000/v1
api_key       None
timeout_secs  HTTP client default (30s)

Usage Examples

Basic Usage

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .send()
        .await?;
    
    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

With Environment Variables

#![allow(unused)]
fn main() {
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    let base_url = env::var("VLLM_BASE_URL")
        .unwrap_or_else(|_| "http://localhost:8000/v1".to_string());
    
    let api_key = env::var("VLLM_API_KEY").ok();
    
    let mut builder = VllmClient::builder().base_url(&base_url);
    
    if let Some(key) = api_key {
        builder = builder.api_key(&key);
    }
    
    builder.build()
}
}

Multiple Requests

Reuse the client for multiple requests:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

async fn process_prompts(client: &VllmClient, prompts: &[&str]) -> Vec<String> {
    let mut results = Vec::new();
    
    for prompt in prompts {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;
        
        match response {
            Ok(r) => results.push(r.content.unwrap_or_default()),
            Err(e) => eprintln!("Error: {}", e),
        }
    }
    
    results
}
}

Thread Safety

The VllmClient is thread-safe and can be shared across threads:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Can be cloned and shared across threads
let client_clone = Arc::clone(&client);
}

Chat Completions API

The Chat Completions API is the primary interface for generating text responses from a language model.

Overview

Access the chat completions API through client.chat.completions():

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;
}

Request Builder

Required Parameters

model(name: impl Into<String>)

Set the model name to use for generation.

#![allow(unused)]
fn main() {
.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")
}

messages(messages: Value)

Set the conversation messages as a JSON array.

#![allow(unused)]
fn main() {
.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))
}

Message Types

Role       Description
system     Set the behavior of the assistant
user       User input
assistant  Assistant response (for multi-turn)
tool       Tool result (for function calling)

Sampling Parameters

temperature(temp: f32)

Controls randomness. Range: 0.0 to 2.0.

#![allow(unused)]
fn main() {
.temperature(0.7)  // Default-like behavior
.temperature(0.0)  // Deterministic
.temperature(1.5)  // More creative
}

max_tokens(tokens: u32)

Maximum number of tokens to generate.

#![allow(unused)]
fn main() {
.max_tokens(1024)
.max_tokens(4096)
}

top_p(p: f32)

Nucleus sampling threshold. Range: 0.0 to 1.0.

#![allow(unused)]
fn main() {
.top_p(0.9)
}

top_k(k: i32)

Top-K sampling (vLLM extension). Limits to top K tokens.

#![allow(unused)]
fn main() {
.top_k(50)
}

stop(sequences: Value)

Stop generation when encountering these sequences.

#![allow(unused)]
fn main() {
// Multiple sequences
.stop(json!(["END", "STOP", "\n\n"]))

// Single sequence
.stop(json!("---"))
}

Tool Calling Parameters

tools(tools: Value)

Define tools/functions that the model can call.

#![allow(unused)]
fn main() {
.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))
}

tool_choice(choice: Value)

Control tool selection behavior.

#![allow(unused)]
fn main() {
.tool_choice(json!("auto"))       // Model decides
.tool_choice(json!("none"))       // No tools
.tool_choice(json!("required"))   // Force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
}

Advanced Parameters

stream(enable: bool)

Enable streaming response.

#![allow(unused)]
fn main() {
.stream(true)
}

extra(params: Value)

Pass vLLM-specific or additional parameters.

#![allow(unused)]
fn main() {
.extra(json!({
    "chat_template_kwargs": {
        "think_mode": true
    },
    "reasoning_effort": "high"
}))
}

Sending Requests

send() - Synchronous Response

Returns the complete response at once.

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;
}

send_stream() - Streaming Response

Returns a stream for real-time output.

#![allow(unused)]
fn main() {
let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
}

See Streaming for detailed streaming documentation.

Response Structure

ChatCompletionResponse

Field              Type                   Description
raw                Value                  Raw JSON response
id                 String                 Response ID
object             String                 Object type
model              String                 Model used
content            Option<String>         Generated content
reasoning_content  Option<String>         Reasoning content (thinking models)
tool_calls         Option<Vec<ToolCall>>  Tool calls made
finish_reason      Option<String>         Why generation stopped
usage              Option<Usage>          Token usage statistics

Example Usage

#![allow(unused)]
fn main() {
let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Access content
println!("Content: {}", response.content.unwrap_or_default());

// Check for reasoning (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check finish reason
match response.finish_reason.as_deref() {
    Some("stop") => println!("Natural stop"),
    Some("length") => println!("Max tokens reached"),
    Some("tool_calls") => println!("Tool calls made"),
    _ => {}
}

// Token usage
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}
}

Complete Example

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Write a function to reverse a string in Rust"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }

    Ok(())
}

Multi-turn Conversation

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First message
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"}
    ]))
    .send()
    .await?;

// Continue conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What's my name?"}
    ]))
    .send()
    .await?;
}

Streaming API

Streaming responses allow you to process LLM output in real-time, token by token, instead of waiting for the complete response.

Overview

vLLM Client provides streaming support through Server-Sent Events (SSE). Use send_stream() instead of send() to get a streaming response.
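On the wire, an SSE stream is a sequence of `data:` lines, with OpenAI-compatible servers sending a `data: [DONE]` sentinel at the end. A minimal, stdlib-only sketch of splitting such a body into payloads (the client does this for you; `sse_payloads` is a hypothetical helper, not library API):

```rust
/// Extract the payloads from a raw SSE body, stopping at the
/// "[DONE]" sentinel. Illustrative sketch only.
fn sse_payloads(body: &str) -> Vec<String> {
    let mut out = Vec::new();
    for line in body.lines() {
        if let Some(data) = line.strip_prefix("data: ") {
            if data == "[DONE]" {
                break; // end-of-stream sentinel
            }
            out.push(data.to_string());
        }
    }
    out
}

fn main() {
    let body = "data: {\"a\":1}\n\ndata: {\"a\":2}\n\ndata: [DONE]\n";
    assert_eq!(sse_payloads(body), vec!["{\"a\":1}", "{\"a\":2}"]);
}
```

Each payload is then deserialized into the chunk objects that back the `StreamEvent` variants below.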

Basic Streaming

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about spring"}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => print!("{}", delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    println!();
    Ok(())
}

StreamEvent Types

The StreamEvent enum represents different types of streaming events:

Variant                       Description
Content(String)               Regular content token delta
Reasoning(String)             Reasoning/thinking content (for thinking models)
ToolCallDelta                 Streaming tool call delta
ToolCallComplete(ToolCall)    Complete tool call ready to execute
Usage(Usage)                  Token usage statistics
Done                          Stream completed successfully
Error(VllmError)              An error occurred

Content Events

The most common event type, containing text tokens:

#![allow(unused)]
fn main() {
match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        std::io::Write::flush(&mut std::io::stdout()).ok();
    }
    _ => {}
}
}

Reasoning Events

For models with reasoning capabilities (like Qwen with thinking mode):

#![allow(unused)]
fn main() {
match event {
    StreamEvent::Reasoning(delta) => {
        eprintln!("[thinking] {}", delta);
    }
    StreamEvent::Content(delta) => {
        print!("{}", delta);
    }
    _ => {}
}
}

Tool Call Events

Tool calls are streamed incrementally and then completed:

#![allow(unused)]
fn main() {
match event {
    StreamEvent::ToolCallDelta { index, id, name, arguments } => {
        println!("Tool delta: index={}, name={}", index, name);
        // Arguments are streamed as partial JSON
    }
    StreamEvent::ToolCallComplete(tool_call) => {
        println!("Tool ready: {}({})", tool_call.name, tool_call.arguments);
        // Execute the tool and return result
    }
    _ => {}
}
}
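Because arguments arrive as partial JSON fragments keyed by tool-call index, a common pattern is to concatenate fragments per index until the matching ToolCallComplete event arrives. A standalone sketch of that accumulation (the library does this for you when it emits ToolCallComplete):

```rust
use std::collections::HashMap;

// Concatenate streamed argument fragments per tool-call index,
// yielding the complete JSON argument string for each call.
fn accumulate_tool_args(deltas: &[(usize, &str)]) -> HashMap<usize, String> {
    let mut args: HashMap<usize, String> = HashMap::new();
    for (index, fragment) in deltas {
        args.entry(*index).or_default().push_str(fragment);
    }
    args
}
```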

Usage Events

Token usage information is typically sent at the end:

#![allow(unused)]
fn main() {
match event {
    StreamEvent::Usage(usage) => {
        println!("Tokens: prompt={}, completion={}, total={}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }
    _ => {}
}
}

MessageStream

The MessageStream type is an async iterator that yields StreamEvent values.

Methods

Method               Return Type            Description
next()               Option<StreamEvent>    Get the next event (async)
collect_content()    String                 Collect all content into a string
into_stream()        impl Stream            Convert to a generic stream

Collect All Content

For convenience, you can collect all content at once:

#![allow(unused)]
fn main() {
let content = stream.collect_content().await?;
println!("Full response: {}", content);
}

Note: This waits for the complete response before returning, so you lose the incremental display that streaming provides. Use it when you want the streaming transport but only need the final text in one piece.

Complete Streaming Example

use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in simple terms"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();
    let mut usage = None;

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => {
                println!("\n[Stream completed]");
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e);
            }
            _ => {}
        }
    }

    // Print summary
    if !reasoning.is_empty() {
        eprintln!("\n--- Reasoning ---");
        eprintln!("{}", reasoning);
    }

    if let Some(usage) = usage {
        eprintln!("\n--- Token Usage ---");
        eprintln!("Prompt: {}, Completion: {}, Total: {}",
            usage.prompt_tokens,
            usage.completion_tokens,
            usage.total_tokens
        );
    }

    Ok(())
}

Streaming with Tool Calling

When streaming with tools, you'll receive incremental tool call updates:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let mut stream = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => print!("{}", delta),
        StreamEvent::ToolCallComplete(tool_call) => {
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute tool calls
for tool_call in tool_calls {
    println!("Tool: {} with args: {}", tool_call.name, tool_call.arguments);
    // Execute and return result in next message
}
}

Error Handling

Streaming errors can occur at any point:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Error(e) => return Err(e),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    Ok(content)
}
}

Best Practices

Flush Output

For real-time display, flush stdout after each token:

#![allow(unused)]
fn main() {
use std::io::{self, Write};

match event {
    StreamEvent::Content(delta) => {
        print!("{}", delta);
        io::stdout().flush().ok();
    }
    _ => {}
}
}

Handle Interruption

For interactive applications, handle Ctrl+C gracefully:

#![allow(unused)]
fn main() {
use tokio::signal;

tokio::select! {
    // process_stream is your own function that drains the stream
    result = process_stream(&mut stream) => {
        // Normal completion
    }
    _ = signal::ctrl_c() => {
        println!("\n[interrupted]");
    }
}
}

Timeout for Idle Streams

Set a timeout for streams that may hang:

#![allow(unused)]
fn main() {
use tokio::time::{timeout, Duration};

let result = timeout(
    Duration::from_secs(60),
    stream.next()
).await;

match result {
    Ok(Some(event)) => { /* process event */ }
    Ok(None) => { /* stream ended */ }
    Err(_) => { /* timeout */ }
}
}

Completions Streaming

The vLLM Client also supports streaming for the legacy /v1/completions API using CompletionStreamEvent.

CompletionStreamEvent Types

Variant                 Description
Text(String)            Text token delta
FinishReason(String)    Reason the stream finished (e.g., "stop", "length")
Usage(Usage)            Token usage statistics
Done                    Stream completed successfully
Error(VllmError)        An error occurred

Completions Streaming Example

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("Write a poem about spring")
        .max_tokens(1024)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                print!("{}", delta);
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n[Finish reason: {}]", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                println!("\nTokens: prompt={}, completion={}, total={}",
                    usage.prompt_tokens,
                    usage.completion_tokens,
                    usage.total_tokens
                );
            }
            CompletionStreamEvent::Done => {
                println!("\n[Stream completed]");
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("Error: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}

CompletionStream Methods

Method            Return Type                      Description
next()            Option<CompletionStreamEvent>    Get the next event (async)
collect_text()    String                           Collect all text into a string
into_stream()     impl Stream                      Convert to a generic stream

Next Steps

Tool Calling API

Tool calling (also known as function calling) allows the model to call external functions during generation. This enables integration with external APIs, databases, and custom logic.

Overview

The vLLM Client supports OpenAI-compatible tool calling:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools) // `tools` is defined in the "Defining Tools" section below
    .send()
    .await?;
}

Defining Tools

Basic Tool Definition

Tools are defined as JSON following the OpenAI schema:

#![allow(unused)]
fn main() {
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name, e.g., Tokyo"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]);
}

Multiple Tools

#![allow(unused)]
fn main() {
let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"}
                },
                "required": ["query"]
            }
        }
    }
]);
}

Tool Choice

Control how the model selects tools:

#![allow(unused)]
fn main() {
// Let the model decide (default)
.tool_choice(json!("auto"))

// Prevent tool use
.tool_choice(json!("none"))

// Force tool use
.tool_choice(json!("required"))

// Force a specific tool
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
}

Handling Tool Calls

Checking for Tool Calls

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

// Check if the response contains tool calls
if response.has_tool_calls() {
    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            println!("Function: {}", tool_call.name);
            println!("Arguments: {}", tool_call.arguments);
        }
    }
}
}

ToolCall Structure

#![allow(unused)]
fn main() {
pub struct ToolCall {
    pub id: String,           // Unique identifier for the call
    pub name: String,         // Function name
    pub arguments: String,    // JSON string of arguments
}
}

Parsing Arguments

Parse the arguments string into typed data:

#![allow(unused)]
fn main() {
use serde::Deserialize;
use serde_json::Value;

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

if let Some(tool_call) = response.first_tool_call() {
    // Parse as a specific type
    match tool_call.parse_args_as::<WeatherArgs>() {
        Ok(args) => {
            println!("Location: {}", args.location);
            if let Some(unit) = args.unit {
                println!("Unit: {}", unit);
            }
        }
        Err(e) => {
            eprintln!("Failed to parse arguments: {}", e);
        }
    }
    
    // Or parse as generic JSON
    let args: Value = tool_call.parse_args()?;
}
}

Tool Result Method

Create a tool result message:

#![allow(unused)]
fn main() {
// Create a tool result message
let tool_result = tool_call.result(json!({
    "temperature": 25,
    "condition": "sunny",
    "humidity": 60
}));

// Returns a JSON object ready to be added to messages
// {
//     "role": "tool",
//     "tool_call_id": "...",
//     "content": "{\"temperature\": 25, ...}"
// }
}

Complete Tool Calling Flow

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, ToolCall};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
}

// Simulate weather API
fn get_weather(location: &str) -> WeatherResult {
    WeatherResult {
        temperature: 25.0,
        condition: "sunny".to_string(),
    }
}

async fn chat_with_tools(client: &VllmClient, user_message: &str) -> Result<String, Box<dyn std::error::Error>> {
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // First request
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    // Check if model wants to call a tool
    if response.has_tool_calls() {
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // Add assistant's tool calls to messages
        if let Some(tool_calls) = &response.tool_calls {
            let assistant_msg = response.assistant_message();
            messages.push(assistant_msg);

            // Execute each tool and add results
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location);
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // Continue conversation with tool results
        let final_response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        return Ok(final_response.content.unwrap_or_default());
    }

    Ok(response.content.unwrap_or_default())
}
}

Streaming Tool Calls

Tool calls are streamed incrementally during streaming responses:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo and Paris?"}
    ]))
    .tools(tools)
    .stream(true)
    .send_stream()
    .await?;

let mut tool_calls: Vec<ToolCall> = Vec::new();
let mut content = String::new();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Content(delta) => {
            content.push_str(&delta);
            print!("{}", delta);
        }
        StreamEvent::ToolCallDelta { index, id, name, arguments } => {
            println!("[Tool delta {}] {}({})", index, name, arguments);
        }
        StreamEvent::ToolCallComplete(tool_call) => {
            println!("[Tool complete] {}({})", tool_call.name, tool_call.arguments);
            tool_calls.push(tool_call);
        }
        StreamEvent::Done => break,
        _ => {}
    }
}

// Execute all collected tool calls
for tool_call in tool_calls {
    // Execute and return results...
}
}

Tool Calling with Multiple Rounds

#![allow(unused)]
fn main() {
async fn multi_round_tool_calling(
    client: &VllmClient,
    user_message: &str,
    tools: &serde_json::Value,
    max_rounds: usize,
) -> Result<String, Box<dyn std::error::Error>> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for _ in 0..max_rounds {
        let response = client.chat.completions().create()
            .model("Qwen/Qwen2.5-72B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // Add assistant message with tool calls
            messages.push(response.assistant_message());

            // Execute tools and add results
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    // execute_tool is your own dispatcher over tool names
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // No more tool calls, return the content
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err("Max rounds exceeded".into())
}
}

Best Practices

Clear Tool Descriptions

Write clear, detailed descriptions:

#![allow(unused)]
fn main() {
// Good
"description": "Get the current weather conditions for a specific city. Returns temperature, humidity, and weather condition."

// Avoid
"description": "Get weather"
}

Precise Parameter Schemas

Define accurate JSON schemas:

#![allow(unused)]
fn main() {
"parameters": {
    "type": "object",
    "properties": {
        "location": {
            "type": "string",
            "description": "City name or coordinates"
        },
        "days": {
            "type": "integer",
            "minimum": 1,
            "maximum": 7,
            "description": "Number of days for forecast"
        }
    },
    "required": ["location"]
}
}
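Schema bounds like `minimum` and `maximum` guide the model but are not guaranteed to be honored, so it is prudent to re-validate tool arguments before acting on them. A sketch with a hypothetical helper (`validate_days` is not part of vllm-client):

```rust
// Re-check the "days" argument against the schema's 1..=7 range
// before executing the tool. Hypothetical helper for illustration.
fn validate_days(days: i64) -> Result<i64, String> {
    if (1..=7).contains(&days) {
        Ok(days)
    } else {
        Err(format!("days must be between 1 and 7, got {}", days))
    }
}
```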

Error Handling

Handle tool execution errors gracefully:

#![allow(unused)]
fn main() {
// execute_tool is your own fallible tool dispatcher
let tool_result = match execute_tool(&tool_call.name, &tool_call.arguments) {
    Ok(result) => json!({"success": true, "data": result}),
    Err(e) => json!({"success": false, "error": e.to_string()}),
};
messages.push(tool_call.result(tool_result));
}

See Also

Error Handling

This document covers error handling in vLLM Client.

VllmError Enum

All errors in vLLM Client are represented by the VllmError enum:

#![allow(unused)]
fn main() {
use thiserror::Error;

#[derive(Debug, Error, Clone)]
pub enum VllmError {
    #[error("HTTP request failed: {0}")]
    Http(String),

    #[error("JSON error: {0}")]
    Json(String),

    #[error("API error (status {status_code}): {message}")]
    ApiError {
        status_code: u16,
        message: String,
        error_type: Option<String>,
    },

    #[error("Stream error: {0}")]
    Stream(String),

    #[error("Connection timeout")]
    Timeout,

    #[error("Model not found: {0}")]
    ModelNotFound(String),

    #[error("Missing required parameter: {0}")]
    MissingParameter(String),

    #[error("No response content")]
    NoContent,

    #[error("Invalid response format: {0}")]
    InvalidResponse(String),

    #[error("{0}")]
    Other(String),
}
}

Error Types

Variant             When It Occurs
Http                Network errors, connection failures
Json                Serialization/deserialization errors
ApiError            Server returned an error response
Stream              Errors during a streaming response
Timeout             Request timed out
ModelNotFound       Specified model doesn't exist
MissingParameter    Required parameter not provided
NoContent           Response has no content
InvalidResponse     Unexpected response format
Other               Miscellaneous errors

Basic Error Handling

use vllm_client::{VllmClient, json, VllmError};

async fn chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

#[tokio::main]
async fn main() {
    match chat("Hello!").await {
        Ok(text) => println!("Response: {}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}

Detailed Error Handling

Handle specific error types differently:

use vllm_client::{VllmClient, json, VllmError};

#[tokio::main]
async fn main() {
    let client = VllmClient::new("http://localhost:8000/v1");

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Hello!"}]))
        .send()
        .await;

    match result {
        Ok(response) => {
            println!("Success: {}", response.content.unwrap_or_default());
        }
        Err(VllmError::ApiError { status_code, message, error_type }) => {
            eprintln!("API Error (HTTP {}): {}", status_code, message);
            if let Some(etype) = error_type {
                eprintln!("Error type: {}", etype);
            }
        }
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out. Try increasing timeout.");
        }
        Err(VllmError::Http(msg)) => {
            eprintln!("Network error: {}", msg);
        }
        Err(VllmError::ModelNotFound(model)) => {
            eprintln!("Model '{}' not found. Check available models.", model);
        }
        Err(VllmError::MissingParameter(param)) => {
            eprintln!("Missing required parameter: {}", param);
        }
        Err(e) => {
            eprintln!("Other error: {}", e);
        }
    }
}

HTTP Status Codes

Common API error status codes:

Code    Meaning                Action
400     Bad Request            Check request parameters
401     Unauthorized           Check API key
403     Forbidden              Check permissions
404     Not Found              Check endpoint or model name
429     Rate Limited           Implement retry with backoff
500     Server Error           Retry or contact admin
502     Bad Gateway            Check vLLM server status
503     Service Unavailable    Wait and retry
504     Gateway Timeout        Increase timeout or retry
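The actions in the table reduce to a rule of thumb: rate limiting (429) and 5xx responses are worth retrying, other 4xx client errors are not. A standalone sketch of that rule, separate from the library's own is_retryable():

```rust
// Classify an HTTP status code as retryable per the table above:
// 429 (rate limited) and 500..=504 (server-side failures) retry.
fn is_retryable_status(code: u16) -> bool {
    matches!(code, 429 | 500..=504)
}
```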

Retryable Errors

Check if an error is retryable:

#![allow(unused)]
fn main() {
use vllm_client::VllmError;

fn should_retry(error: &VllmError) -> bool {
    error.is_retryable()
}

// Manual check
match error {
    VllmError::Timeout => true,
    VllmError::ApiError { status_code: 429, .. } => true,  // Rate limit
    VllmError::ApiError { status_code: 500..=504, .. } => true,  // Server errors
    _ => false,
}
}

Retry with Exponential Backoff

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn chat_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut retries = 0;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && retries < max_retries => {
                retries += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(retries - 1));
                eprintln!("Retry {} after {:?}: {}", retries, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
}
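The delay computed inside the loop follows a simple doubling schedule; isolated as a pure helper for clarity (an illustration matching the `100 * 2^(retries - 1)` ms formula above):

```rust
// Exponential backoff: 100 ms on the first retry, doubling each time.
// retry is 1-based; saturating_sub avoids underflow if 0 is passed.
fn backoff_delay_ms(retry: u32) -> u64 {
    100 * 2u64.pow(retry.saturating_sub(1))
}
```

Production code often adds random jitter on top of this schedule so that concurrent clients do not retry in lockstep.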

Streaming Error Handling

Handle errors during streaming:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;

async fn stream_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => content.push_str(&delta),
            StreamEvent::Done => break,
            StreamEvent::Error(e) => return Err(e),
            _ => {}
        }
    }

    Ok(content)
}
}

Error Context

Add context to errors for better debugging:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

async fn chat_with_context(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("Failed to get chat response: {}", e))?;

    Ok(response.content.unwrap_or_default())
}
}

Using anyhow or eyre

For applications using anyhow or eyre:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use anyhow::{Context, Result};

async fn chat(prompt: &str) -> Result<String> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .context("Failed to send chat request")?;

    Ok(response.content.unwrap_or_default())
}
}

Best Practices

1. Always Handle Errors

#![allow(unused)]
fn main() {
// Bad
let response = client.chat.completions().create()
    .send().await.unwrap();

// Good
match client.chat.completions().create().send().await {
    Ok(r) => { /* handle */ },
    Err(e) => eprintln!("Error: {}", e),
}
}

2. Use Appropriate Timeout

#![allow(unused)]
fn main() {
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(300); // 5 minutes for long tasks
}

3. Log Errors with Context

#![allow(unused)]
fn main() {
Err(e) => {
    log::error!("Chat request failed: {}", e);
    log::debug!("Request details: model={}, prompt_len={}", model, prompt.len());
}
}

4. Implement Graceful Degradation

#![allow(unused)]
fn main() {
match primary_client.chat.completions().create().send().await {
    Ok(r) => r,
    Err(e) => {
        log::warn!("Primary client failed: {}, trying fallback", e);
        fallback_client.chat.completions().create().send().await?
    }
}
}

See Also

Examples

This section contains practical code examples demonstrating vLLM Client usage patterns.

Available Examples

Basic Usage

Example                  Description
Basic Chat               Simple chat completion requests
Streaming Chat           Real-time streaming responses
Streaming Completions    Legacy completions streaming
Tool Calling             Function calling integration
Multi-modal              Image and multi-modal inputs

Quick Examples

Hello World

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Hello!"}]))
        .send()
        .await?;
    
    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Streaming Output

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut stream = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": "Tell me a story"}]))
        .stream(true)
        .send_stream()
        .await?;
    
    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            print!("{}", delta);
        }
    }
    
    println!();
    Ok(())
}

Tool Calling

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let tools = json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]);

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ]))
    .tools(tools)
    .send()
    .await?;

if response.has_tool_calls() {
    // Execute tools and return results
}
}

Example Structure

Each example includes:

  • Complete, runnable code
  • Required dependencies
  • Step-by-step explanations
  • Common variations and use cases

Running Examples

Prerequisites

  1. A running vLLM server:

    pip install vllm
    vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
    
  2. Rust toolchain:

    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
    

Running an Example

# Create a new project
cargo new my-vllm-app
cd my-vllm-app

# Add dependencies
cargo add vllm-client
cargo add tokio --features full
cargo add serde_json

# Copy example code to src/main.rs
# Then run:
cargo run

Common Patterns

Environment Configuration

use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    VllmClient::builder()
        .base_url(env::var("VLLM_BASE_URL")
            .unwrap_or_else(|_| "http://localhost:8000/v1".to_string()))
        .api_key(env::var("VLLM_API_KEY").ok())
        .timeout_secs(300)
        .build()
}

Error Handling

use vllm_client::{VllmClient, json, VllmError};

async fn safe_chat(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await?;

    Ok(response.content.unwrap_or_default())
}

Reusing Client

use std::sync::Arc;
use vllm_client::VllmClient;

// Share one client across tasks; cloning the Arc is cheap.
let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Hand clones to concurrent async tasks.
let client1 = Arc::clone(&client);
let client2 = Arc::clone(&client);

Basic Chat Examples

This page demonstrates basic chat completion usage patterns with vLLM Client.

Simple Chat

The simplest way to send a chat message:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello, how are you?"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

With System Message

Add a system message to control the assistant's behavior:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful coding assistant. You write clean, well-documented code."},
            {"role": "user", "content": "Write a function to check if a number is prime in Rust"}
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Multi-turn Conversation

Maintain context across multiple messages:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Build conversation history
    let mut messages = vec![
        json!({"role": "system", "content": "You are a helpful assistant."}),
    ];

    // First turn
    messages.push(json!({"role": "user", "content": "My name is Alice"}));
    
    let response1 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!(messages.clone()))
        .send()
        .await?;

    let assistant_reply = response1.content.unwrap_or_default();
    println!("Assistant: {}", assistant_reply);

    // Add assistant reply to history
    messages.push(json!({"role": "assistant", "content": assistant_reply}));

    // Second turn
    messages.push(json!({"role": "user", "content": "What's my name?"}));

    let response2 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!(messages))
        .send()
        .await?;

    println!("Assistant: {}", response2.content.unwrap_or_default());
    Ok(())
}

Conversation Helper

A reusable helper for building conversations:

use vllm_client::{VllmClient, json, VllmError};
use serde_json::Value;

struct Conversation {
    client: VllmClient,
    model: String,
    messages: Vec<Value>,
}

impl Conversation {
    fn new(client: VllmClient, model: impl Into<String>) -> Self {
        Self {
            client,
            model: model.into(),
            messages: vec![
                json!({"role": "system", "content": "You are a helpful assistant."})
            ],
        }
    }

    fn with_system(mut self, content: &str) -> Self {
        self.messages[0] = json!({"role": "system", "content": content});
        self
    }

    async fn send(&mut self, user_message: &str) -> Result<String, VllmError> {
        self.messages.push(json!({
            "role": "user",
            "content": user_message
        }));

        let response = self.client
            .chat
            .completions()
            .create()
            .model(&self.model)
            .messages(json!(&self.messages))
            .send()
            .await?;

        let content = response.content.unwrap_or_default();
        self.messages.push(json!({
            "role": "assistant",
            "content": &content
        }));

        Ok(content)
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    
    let mut conv = Conversation::new(client, "Qwen/Qwen2.5-7B-Instruct")
        .with_system("You are a math tutor. Explain concepts simply.");

    println!("User: What is 2 + 2?");
    let reply = conv.send("What is 2 + 2?").await?;
    println!("Assistant: {}", reply);

    println!("\nUser: And what is that multiplied by 3?");
    let reply = conv.send("And what is that multiplied by 3?").await?;
    println!("Assistant: {}", reply);

    Ok(())
}

With Sampling Parameters

Control the generation with sampling parameters:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a creative story about a robot"}
        ]))
        .temperature(1.2)      // Higher temperature for more creativity
        .top_p(0.95)           // Nucleus sampling
        .top_k(50)             // vLLM extension
        .max_tokens(512)       // Limit output length
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
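For intuition, top_p (nucleus sampling) keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, while top_k simply truncates to the k most likely tokens. The std-only sketch below illustrates the nucleus cutoff; it is an illustrative model of the server-side behavior, not code from this crate:

```rust
/// Return the token indices that survive a nucleus (top-p) cutoff.
fn nucleus_indices(probs: &[f64], top_p: f64) -> Vec<usize> {
    // Sort token indices by probability, highest first.
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());

    let mut kept = Vec::new();
    let mut cumulative = 0.0;
    for i in order {
        kept.push(i);
        cumulative += probs[i];
        // Stop once the kept set covers at least `top_p` of the mass.
        if cumulative >= top_p {
            break;
        }
    }
    kept
}

fn main() {
    let probs = [0.5, 0.3, 0.15, 0.05];
    // With top_p = 0.9, tokens 0, 1, 2 (0.95 cumulative) are kept.
    println!("{:?}", nucleus_indices(&probs, 0.9));
}
```

Sampling then proceeds over the kept set only, which is why lower top_p values produce more focused output.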

Deterministic Output

For reproducible results, set temperature to 0:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What is 2 + 2?"}
        ]))
        .temperature(0.0)      // Deterministic output
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
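Temperature rescales the logits before sampling; as it approaches zero the distribution collapses onto the single most likely token, which is why temperature(0.0) yields deterministic (greedy) output. A std-only sketch of the scaling, shown for intuition rather than as the crate's or server's actual code:

```rust
/// Softmax over logits with temperature scaling; T = 0 degenerates to argmax.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    // Guard the T -> 0 case: all mass goes to the max logit (greedy).
    if temperature <= f64::EPSILON {
        let max_i = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .map(|(i, _)| i)
            .unwrap();
        return (0..logits.len())
            .map(|i| if i == max_i { 1.0 } else { 0.0 })
            .collect();
    }
    // Divide logits by T, then apply a numerically stable softmax.
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

fn main() {
    let logits = [2.0, 1.0, 0.5];
    // Higher temperature flattens the distribution; T = 0 is greedy.
    println!("{:?}", softmax_with_temperature(&logits, 1.0));
    println!("{:?}", softmax_with_temperature(&logits, 0.0)); // [1.0, 0.0, 0.0]
}
```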

With Stop Sequences

Stop generation at specific sequences:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "List three fruits, one per line"}
        ]))
        .stop(json!(["\n\n", "END"]))  // Stop at double newline or END
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}
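Conceptually, the server truncates the completion at the earliest occurrence of any stop sequence, and the stop string itself is not included in the returned text. A std-only sketch of that truncation logic (illustrative, not the server's implementation):

```rust
/// Truncate `text` at the earliest occurrence of any stop sequence.
fn apply_stop_sequences(text: &str, stops: &[&str]) -> String {
    let mut cut = text.len();
    for stop in stops {
        if let Some(pos) = text.find(stop) {
            // Keep the earliest cut point across all stop sequences.
            cut = cut.min(pos);
        }
    }
    text[..cut].to_string()
}

fn main() {
    let raw = "Apple\nBanana\nCherry\n\nEND of list";
    // Truncated at the double newline, before "END" is ever reached.
    println!("{:?}", apply_stop_sequences(raw, &["\n\n", "END"]));
}
```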

Token Usage Tracking

Track token usage for cost monitoring:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Explain quantum computing"}
        ]))
        .send()
        .await?;

    println!("Response: {}", response.content.unwrap_or_default());

    if let Some(usage) = response.usage {
        println!("\n--- Token Usage ---");
        println!("Prompt tokens: {}", usage.prompt_tokens);
        println!("Completion tokens: {}", usage.completion_tokens);
        println!("Total tokens: {}", usage.total_tokens);
    }

    Ok(())
}
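With the usage numbers in hand, a cost estimate is just a weighted sum of prompt and completion tokens. The per-token prices below are placeholders for illustration, not real rates:

```rust
/// Estimate request cost from token counts (prices are hypothetical).
fn estimate_cost(prompt_tokens: u32, completion_tokens: u32) -> f64 {
    const PROMPT_PRICE_PER_1K: f64 = 0.0005; // assumed rate, $ per 1K tokens
    const COMPLETION_PRICE_PER_1K: f64 = 0.0015; // assumed rate, $ per 1K tokens
    prompt_tokens as f64 / 1000.0 * PROMPT_PRICE_PER_1K
        + completion_tokens as f64 / 1000.0 * COMPLETION_PRICE_PER_1K
}

fn main() {
    // e.g. 120 prompt tokens and 480 completion tokens
    println!("Estimated cost: ${:.6}", estimate_cost(120, 480));
}
```

For a self-hosted vLLM deployment there is no per-token billing, but the same arithmetic is useful for capacity planning or chargeback.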

Batch Processing

Process multiple prompts sequentially, collecting a per-prompt result:

use vllm_client::{VllmClient, json, VllmError};

async fn process_prompts(
    client: &VllmClient,
    prompts: &[&str],
) -> Vec<Result<String, VllmError>> {
    let mut results = Vec::new();

    for prompt in prompts {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default());

        results.push(result);
    }

    results
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(120);

    let prompts = [
        "What is Rust?",
        "What is Python?",
        "What is Go?",
    ];

    let results = process_prompts(&client, &prompts).await;

    for (prompt, result) in prompts.iter().zip(results.iter()) {
        match result {
            Ok(response) => println!("Q: {}\nA: {}\n", prompt, response),
            Err(e) => eprintln!("Error for '{}': {}", prompt, e),
        }
    }

    Ok(())
}

Error Handling

Proper error handling for production code:

use vllm_client::{VllmClient, json};

async fn safe_chat(prompt: &str) -> Result<String, String> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(60);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map_err(|e| format!("Request failed: {}", e))?;

    response.content.ok_or_else(|| "No content in response".to_string())
}

#[tokio::main]
async fn main() {
    match safe_chat("Hello!").await {
        Ok(text) => println!("Response: {}", text),
        Err(e) => eprintln!("Error: {}", e),
    }
}

Streaming Chat Example

This example demonstrates how to use streaming responses for real-time output.

Basic Streaming

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a short story about a robot learning to paint."}
        ]))
        .temperature(0.8)
        .max_tokens(1024)
        .stream(true)
        .send_stream()
        .await?;

    print!("Response: ");
    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }
    println!();

    Ok(())
}

Streaming with Reasoning (Thinking Models)

For models that support thinking/reasoning mode:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Solve: What is 15 * 23 + 47?"}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "think_mode": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut reasoning = String::new();
    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
                eprintln!("[thinking] {}", delta);
            }
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!("\n");
    if !reasoning.is_empty() {
        println!("--- Reasoning Process ---");
        println!("{}", reasoning);
    }

    Ok(())
}

Streaming with Progress Indicator

Add a typing indicator while waiting for the first token:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use std::time::Duration;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let waiting = Arc::new(AtomicBool::new(true));
    let waiting_clone = Arc::clone(&waiting);

    // Spawn typing indicator task; wrap the handle in Option so it can be
    // awaited exactly once from inside the loop.
    let mut indicator = Some(tokio::spawn(async move {
        let chars = ['⠋', '⠙', '⠹', '⠸', '⠼', '⠴', '⠦', '⠧', '⠇', '⠏'];
        let mut i = 0;
        while waiting_clone.load(Ordering::Relaxed) {
            print!("\r{} Thinking...", chars[i]);
            std::io::Write::flush(&mut std::io::stdout()).ok();
            i = (i + 1) % chars.len();
            tokio::time::sleep(Duration::from_millis(80)).await;
        }
        print!("\r        \r"); // Clear the indicator
    }));

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Explain quantum entanglement in simple terms."}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    let mut first_token = true;
    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                if first_token {
                    waiting.store(false, Ordering::Relaxed);
                    // Take the handle so the task is awaited only once.
                    if let Some(handle) = indicator.take() {
                        handle.await.ok();
                    }
                    first_token = false;
                    println!("Response:");
                    println!("---------");
                }
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                waiting.store(false, Ordering::Relaxed);
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!("\n");

    Ok(())
}

Multi-turn Streaming Conversation

Handle a conversation with streaming responses:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use std::io::{self, BufRead, Write};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");
    let mut messages: Vec<serde_json::Value> = Vec::new();

    println!("Chat with the AI (type 'quit' to exit)");
    println!("----------------------------------------\n");

    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let input = line?;
        if input.trim() == "quit" {
            break;
        }
        if input.trim().is_empty() {
            continue;
        }

        // Add user message
        messages.push(json!({"role": "user", "content": input}));

        // Stream response
        let mut stream = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!(messages))
            .stream(true)
            .send_stream()
            .await?;

        print!("AI: ");
        io::stdout().flush().ok();

        let mut response_content = String::new();

        while let Some(event) = stream.next().await {
            match event {
                StreamEvent::Content(delta) => {
                    response_content.push_str(&delta);
                    print!("{}", delta);
                    io::stdout().flush().ok();
                }
                StreamEvent::Done => break,
                StreamEvent::Error(e) => {
                    eprintln!("\nError: {}", e);
                    break;
                }
                _ => {}
            }
        }

        println!("\n");

        // Add assistant response to history
        messages.push(json!({"role": "assistant", "content": response_content}));
    }

    println!("Goodbye!");
    Ok(())
}

Streaming with Timeout

Add timeout handling for slow responses:

use vllm_client::{VllmClient, json, StreamEvent, VllmError};
use futures::StreamExt;
use tokio::time::{timeout, Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(300);

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a detailed essay about AI."}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    loop {
        // 30 second timeout per event
        match timeout(Duration::from_secs(30), stream.next()).await {
            Ok(Some(event)) => {
                match event {
                    StreamEvent::Content(delta) => {
                        content.push_str(&delta);
                        print!("{}", delta);
                        std::io::Write::flush(&mut std::io::stdout()).ok();
                    }
                    StreamEvent::Done => break,
                    StreamEvent::Error(e) => {
                        eprintln!("\nStream error: {}", e);
                        return Err(e.into());
                    }
                    _ => {}
                }
            }
            Ok(None) => break,
            Err(_) => {
                eprintln!("\nTimeout waiting for next token");
                break;
            }
        }
    }

    println!("\n\nGenerated {} characters", content.len());

    Ok(())
}

Collecting Usage Statistics

Track token usage during streaming:

use vllm_client::{VllmClient, json, StreamEvent, Usage};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Write a poem about the ocean."}
        ]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();
    let mut usage: Option<Usage> = None;
    let start_time = std::time::Instant::now();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Usage(u) => {
                usage = Some(u);
            }
            StreamEvent::Done => break,
            _ => {}
        }
    }

    let elapsed = start_time.elapsed();

    println!("\n");
    println!("--- Statistics ---");
    println!("Time: {:.2}s", elapsed.as_secs_f64());
    println!("Characters: {}", content.len());

    if let Some(usage) = usage {
        println!("Prompt tokens: {}", usage.prompt_tokens);
        println!("Completion tokens: {}", usage.completion_tokens);
        println!("Total tokens: {}", usage.total_tokens);
        println!("Tokens/second: {:.2}", 
            usage.completion_tokens as f64 / elapsed.as_secs_f64());
    }

    Ok(())
}

Streaming Completions Example

This example demonstrates streaming completions using the legacy /v1/completions API.

Basic Streaming Completions

use vllm_client::{VllmClient, json, CompletionStreamEvent};
use futures::StreamExt;
use std::io::Write;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    println!("=== Streaming Completions ===\n");
    println!("Model: Qwen/Qwen2.5-7B-Instruct\n");
    println!("Prompt: What is machine learning?");
    println!("\nGenerated text: ");

    let mut stream = client
        .completions
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .prompt("What is machine learning?")
        .max_tokens(500)
        .temperature(0.7)
        .stream(true)
        .send_stream()
        .await?;

    // Process streaming events
    while let Some(event) = stream.next().await {
        match event {
            CompletionStreamEvent::Text(delta) => {
                // Print text delta (real-time output)
                print!("{}", delta);
                // Flush buffer for real-time display
                std::io::stdout().flush().ok();
            }
            CompletionStreamEvent::FinishReason(reason) => {
                println!("\n\n--- Finish reason: {} ---", reason);
            }
            CompletionStreamEvent::Usage(usage) => {
                // Output token usage statistics at the end
                println!("\n\n--- Token Usage ---");
                println!("Prompt tokens: {}", usage.prompt_tokens);
                println!("Completion tokens: {}", usage.completion_tokens);
                println!("Total tokens: {}", usage.total_tokens);
            }
            CompletionStreamEvent::Done => {
                println!("\n\n=== Generation Complete ===");
                break;
            }
            CompletionStreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                return Err(e.into());
            }
        }
    }

    Ok(())
}

Key Differences from Chat Streaming

Aspect            | Chat Completions     | Completions
------------------|----------------------|----------------------
Event type        | StreamEvent          | CompletionStreamEvent
Content variant   | Content(String)      | Text(String)
Additional events | Reasoning, ToolCall  | FinishReason
Use case          | Conversation-based   | Single prompt

When to Use Completions API

  • Simple text generation with a single prompt
  • Legacy compatibility with OpenAI API
  • Situations where the chat message format is not needed

For new projects, we recommend the Chat Completions API (client.chat.completions()), which provides more flexibility and better message formatting.

Tool Calling Examples

This example demonstrates how to use tool calling (function calling) with vLLM Client.

Basic Tool Calling

Define tools and let the model decide when to call them:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Define available tools
    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City name, e.g., Tokyo, New York"
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "Temperature unit"
                        }
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather like in Tokyo?"}
        ]))
        .tools(tools)
        .send()
        .await?;

    // Check if the model wants to call a tool
    if response.has_tool_calls() {
        if let Some(tool_calls) = &response.tool_calls {
            for tool_call in tool_calls {
                println!("Function: {}", tool_call.name);
                println!("Arguments: {}", tool_call.arguments);
            }
        }
    } else {
        println!("Response: {}", response.content.unwrap_or_default());
    }

    Ok(())
}

Complete Tool Calling Flow

Execute tools and return results to continue the conversation:

use vllm_client::{VllmClient, json, ToolCall};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct WeatherArgs {
    location: String,
    unit: Option<String>,
}

#[derive(Serialize)]
struct WeatherResult {
    temperature: f32,
    condition: String,
    humidity: u32,
}

// Simulated weather function
fn get_weather(location: &str, unit: Option<&str>) -> WeatherResult {
    // In real code, call an actual weather API
    let temp = match location {
        "Tokyo" => 25.0,
        "New York" => 20.0,
        "London" => 15.0,
        _ => 22.0,
    };

    WeatherResult {
        temperature: if unit == Some("fahrenheit") {
            temp * 9.0 / 5.0 + 32.0
        } else {
            temp
        },
        condition: "sunny".to_string(),
        humidity: 60,
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let user_message = "What's the weather like in Tokyo and New York?";

    // First request - model may call tools
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": user_message}
        ]))
        .tools(tools.clone())
        .send()
        .await?;

    if response.has_tool_calls() {
        // Build message history
        let mut messages = vec![
            json!({"role": "user", "content": user_message})
        ];

        // Add assistant's tool calls
        messages.push(response.assistant_message());

        // Execute each tool and add results
        if let Some(tool_calls) = &response.tool_calls {
            for tool_call in tool_calls {
                if tool_call.name == "get_weather" {
                    let args: WeatherArgs = tool_call.parse_args_as()?;
                    let result = get_weather(&args.location, args.unit.as_deref());
                    messages.push(tool_call.result(json!(result)));
                }
            }
        }

        // Continue conversation with tool results
        let final_response = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!(messages))
            .tools(tools)
            .send()
            .await?;

        println!("{}", final_response.content.unwrap_or_default());
    } else {
        println!("{}", response.content.unwrap_or_default());
    }

    Ok(())
}

Multiple Tools

Define multiple tools for different purposes:

use vllm_client::{VllmClient, json};
use serde::Deserialize;

#[derive(Deserialize)]
struct SearchArgs {
    query: String,
    limit: Option<u32>,
}

#[derive(Deserialize)]
struct CalcArgs {
    expression: String,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web for information",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "Search query"
                        },
                        "limit": {
                            "type": "integer",
                            "description": "Maximum number of results"
                        }
                    },
                    "required": ["query"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "calculate",
                "description": "Perform mathematical calculations",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "expression": {
                            "type": "string",
                            "description": "Math expression to evaluate, e.g., '2 + 2 * 3'"
                        }
                    },
                    "required": ["expression"]
                }
            }
        }
    ]);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Search for Rust programming language and calculate 42 * 17"}
        ]))
        .tools(tools)
        .send()
        .await?;

    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            match tool_call.name.as_str() {
                "web_search" => {
                    let args: SearchArgs = tool_call.parse_args_as()?;
                    println!("Searching for: {} (limit: {:?})", args.query, args.limit);
                }
                "calculate" => {
                    let args: CalcArgs = tool_call.parse_args_as()?;
                    println!("Calculating: {}", args.expression);
                }
                _ => println!("Unknown tool: {}", tool_call.name),
            }
        }
    }

    Ok(())
}

Streaming Tool Calls

Stream tool call updates in real-time:

use vllm_client::{VllmClient, json, StreamEvent, ToolCall};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather in Tokyo, Paris, and London?"}
        ]))
        .tools(tools)
        .stream(true)
        .send_stream()
        .await?;

    let mut tool_calls: Vec<ToolCall> = Vec::new();
    let mut content = String::new();

    println!("Streaming response:\n");

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Content(delta) => {
                content.push_str(&delta);
                print!("{}", delta);
            }
            StreamEvent::ToolCallDelta { index, id: _, name, arguments } => {
                println!("[Tool {}] {} - partial args: {}", index, name, arguments);
            }
            StreamEvent::ToolCallComplete(tool_call) => {
                println!("[Tool Complete] {}({})", tool_call.name, tool_call.arguments);
                tool_calls.push(tool_call);
            }
            StreamEvent::Done => {
                println!("\n--- Stream Complete ---");
                break;
            }
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!("\nCollected {} tool calls", tool_calls.len());
    for (i, tc) in tool_calls.iter().enumerate() {
        println!("  {}. {}({})", i + 1, tc.name, tc.arguments);
    }

    Ok(())
}

Multi-Round Tool Calling

Handle multiple rounds of tool calls:

use vllm_client::{VllmClient, json, VllmError};
use serde_json::Value;

async fn run_agent(
    client: &VllmClient,
    user_message: &str,
    tools: &Value,
    max_rounds: usize,
) -> Result<String, VllmError> {
    let mut messages = vec![
        json!({"role": "user", "content": user_message})
    ];

    for round in 0..max_rounds {
        println!("--- Round {} ---", round + 1);

        let response = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!(&messages))
            .tools(tools.clone())
            .send()
            .await?;

        if response.has_tool_calls() {
            // Add assistant message with tool calls
            messages.push(response.assistant_message());

            // Execute tools and add results
            if let Some(tool_calls) = &response.tool_calls {
                for tool_call in tool_calls {
                    println!("Calling: {}({})", tool_call.name, tool_call.arguments);

                    // Execute the tool
                    let result = execute_tool(&tool_call.name, &tool_call.arguments);
                    println!("Result: {}", result);

                    // Add tool result to messages
                    messages.push(tool_call.result(result));
                }
            }
        } else {
            // No more tool calls, return the final response
            return Ok(response.content.unwrap_or_default());
        }
    }

    Err(VllmError::Other("Max rounds exceeded".to_string()))
}

fn execute_tool(name: &str, _args: &str) -> Value {
    // Your tool execution logic here
    match name {
        "get_weather" => json!({"temperature": 22, "condition": "sunny"}),
        "web_search" => json!({"results": ["result1", "result2"]}),
        _ => json!({"error": "Unknown tool"}),
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"]
                }
            }
        }
    ]);

    let result = run_agent(
        &client,
        "What's the weather in Tokyo and find information about cherry blossoms?",
        &tools,
        5
    ).await?;

    println!("\nFinal Answer: {}", result);

    Ok(())
}

Tool Choice Options

Control tool selection behavior:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    // Option 1: Let the model decide (default)
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Hello!"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!("auto"))
        .send()
        .await?;

    // Option 2: Prevent tool use
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather in Tokyo?"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!("none"))
        .send()
        .await?;

    // Option 3: Force tool use
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "I need weather info"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!("required"))
        .send()
        .await?;

    // Option 4: Force specific tool
    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Check Tokyo weather"}
        ]))
        .tools(tools.clone())
        .tool_choice(json!({
            "type": "function",
            "function": {"name": "get_weather"}
        }))
        .send()
        .await?;

    Ok(())
}

Error Handling

Handle tool execution errors gracefully:

use vllm_client::{VllmClient, json, ToolCall};
use serde_json::Value;

fn execute_tool_safely(tool_call: &ToolCall) -> Value {
    match tool_call.name.as_str() {
        "get_weather" => {
            // Parse arguments safely
            match tool_call.parse_args() {
                Ok(args) => {
                    // Execute tool
                    match get_weather_internal(&args) {
                        Ok(result) => json!({"success": true, "data": result}),
                        Err(e) => json!({"success": false, "error": e.to_string()}),
                    }
                }
                Err(e) => json!({
                    "success": false,
                    "error": format!("Invalid arguments: {}", e)
                }),
            }
        }
        _ => json!({
            "success": false,
            "error": format!("Unknown tool: {}", tool_call.name)
        }),
    }
}

fn get_weather_internal(args: &Value) -> Result<Value, String> {
    let location = args["location"].as_str()
        .ok_or("location is required")?;

    // Simulate API call
    Ok(json!({
        "location": location,
        "temperature": 22,
        "condition": "sunny"
    }))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let tools = json!([
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    },
                    "required": ["location"]
                }
            }
        }
    ]);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([
            {"role": "user", "content": "What's the weather?"}
        ]))
        .tools(tools)
        .send()
        .await?;

    if let Some(tool_calls) = &response.tool_calls {
        for tool_call in tool_calls {
            let result = execute_tool_safely(tool_call);
            println!("Tool result: {}", result);
        }
    }

    Ok(())
}

See Also

Multi-modal Examples

Multi-modal capabilities allow you to send images and other media types along with text to the model.

Overview

vLLM supports multi-modal inputs through the OpenAI-compatible API. You can include images in your chat messages using base64 encoding or URLs.

Basic Image Input (Base64)

Send an image encoded as base64:

use vllm_client::{VllmClient, json};
use base64::{Engine as _, engine::general_purpose};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Read and encode image
    let image_data = std::fs::read("image.png")?;
    let base64_image = general_purpose::STANDARD.encode(&image_data);

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")  // Vision model
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What's in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", base64_image)
                        }
                    }
                ]
            }
        ]))
        .max_tokens(512)
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Image from URL

Reference an image by URL:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in detail."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://example.com/image.jpg"
                        }
                    }
                ]
            }
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Helper Function for Images

Create a reusable helper for image messages:

use vllm_client::{VllmClient, json};
use serde_json::Value;

fn image_message(text: &str, image_path: &str) -> Result<Value, Box<dyn std::error::Error>> {
    use base64::{Engine as _, engine::general_purpose};

    let image_data = std::fs::read(image_path)?;
    let base64_image = general_purpose::STANDARD.encode(&image_data);

    // Detect image type from extension
    let mime_type = match image_path.to_lowercase().rsplit('.').next() {
        Some("png") => "image/png",
        Some("jpg") | Some("jpeg") => "image/jpeg",
        Some("gif") => "image/gif",
        Some("webp") => "image/webp",
        _ => "image/png",
    };

    Ok(json!({
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": text
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": format!("data:{};base64,{}", mime_type, base64_image)
                }
            }
        ]
    }))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let user_msg = image_message("What do you see in this image?", "photo.jpg")?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([user_msg]))
        .max_tokens(1024)
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Multiple Images

Send multiple images in a single request:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Read and encode multiple images
    let image1 = encode_image("image1.png")?;
    let image2 = encode_image("image2.png")?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Compare these two images. What are the differences?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", image1)
                        }
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", image2)
                        }
                    }
                ]
            }
        ]))
        .max_tokens(1024)
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

fn encode_image(path: &str) -> Result<String, Box<dyn std::error::Error>> {
    use base64::{Engine as _, engine::general_purpose};
    let data = std::fs::read(path)?;
    Ok(general_purpose::STANDARD.encode(&data))
}

Streaming with Images

Stream responses for image-based queries:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let base64_image = encode_image("chart.png")?;

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Analyze this chart and explain the trends."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", base64_image)
                        }
                    }
                ]
            }
        ]))
        .stream(true)
        .send_stream()
        .await?;

    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            print!("{}", delta);
            std::io::Write::flush(&mut std::io::stdout()).ok();
        }
    }

    println!();
    Ok(())
}

Multi-turn with Images

Maintain conversation context with images:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let base64_image = encode_image("screenshot.png")?;

    // First message with image
    let messages = json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": format!("data:image/png;base64,{}", base64_image)
                    }
                }
            ]
        }
    ]);

    let response1 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(messages.clone())
        .send()
        .await?;

    println!("First response: {}", response1.content.clone().unwrap_or_default());

    // Continue the conversation by re-sending the full history,
    // including the original image message
    let messages2 = json!([
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this screenshot?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": format!("data:image/png;base64,{}", base64_image)
                    }
                }
            ]
        },
        {
            "role": "assistant",
            "content": response1.content.unwrap_or_default()
        },
        {
            "role": "user",
            "content": "Can you translate any text you see in the image?"
        }
    ]);

    let response2 = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(messages2)
        .send()
        .await?;

    println!("\nSecond response: {}", response2.content.unwrap_or_default());

    Ok(())
}

OCR and Document Analysis

Use vision models for OCR and document analysis:

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let document_image = encode_image("document.png")?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "system",
                "content": "You are an OCR assistant. Extract text from images accurately and format it properly."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Extract all text from this document image. Preserve the formatting as much as possible."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", document_image)
                        }
                    }
                ]
            }
        ]))
        .max_tokens(2048)
        .send()
        .await?;

    println!("Extracted Text:\n{}", response.content.unwrap_or_default());
    Ok(())
}

Image Size Considerations

Handle large images appropriately:

use vllm_client::{VllmClient, json};

fn encode_and_resize_image(path: &str, max_size: u32) -> Result<String, Box<dyn std::error::Error>> {
    use base64::{Engine as _, engine::general_purpose};
    use image::ImageReader;

    // Load and resize image
    let img = ImageReader::open(path)?.decode()?;
    let img = img.resize(max_size, max_size, image::imageops::FilterType::Lanczos3);

    // Convert to PNG
    let mut buffer = std::io::Cursor::new(Vec::new());
    img.write_to(&mut buffer, image::ImageFormat::Png)?;

    Ok(general_purpose::STANDARD.encode(&buffer.into_inner()))
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    // Resize to max 1024px while maintaining aspect ratio
    let base64_image = encode_and_resize_image("large_image.jpg", 1024)?;

    let response = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2-VL-7B-Instruct")
        .messages(json!([
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": format!("data:image/png;base64,{}", base64_image)
                        }
                    }
                ]
            }
        ]))
        .send()
        .await?;

    println!("{}", response.content.unwrap_or_default());
    Ok(())
}

Supported Models

For multi-modal inputs, use models that support vision:

Model                                       Description
Qwen/Qwen2-VL-7B-Instruct                   Qwen2 Vision Language
Qwen/Qwen2-VL-72B-Instruct                  Qwen2 VL Large
meta-llama/Llama-3.2-11B-Vision-Instruct    Llama 3.2 Vision
openai/clip-vit-large-patch14               CLIP model

Check your vLLM server's available models with:

curl http://localhost:8000/v1/models

Required Dependencies

For image handling, add these dependencies:

[dependencies]
vllm-client = "0.1"
tokio = { version = "1", features = ["full"] }
serde_json = "1"
base64 = "0.22"
image = "0.25"  # Optional, for image processing

Troubleshooting

Image Too Large

If you get errors about image size, reduce the image dimensions:

#![allow(unused)]
fn main() {
// Resize before sending
let img = image::load_from_memory(&image_data)?;
let resized = img.resize(1024, 1024, image::imageops::FilterType::Lanczos3);
}
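
Before resizing, it can help to estimate how large the encoded payload will be: standard base64 emits 4 output characters per 3 input bytes, so an image grows by roughly a third on the wire. A quick size check (pure std, no assumptions about the client API):

```rust
// Base64 encodes each 3-byte group as 4 characters, padding the last group,
// so the encoded length is 4 * ceil(raw_bytes / 3).
fn base64_len(raw_bytes: usize) -> usize {
    (raw_bytes + 2) / 3 * 4
}

fn main() {
    // A 3 MB image becomes 4 MB of base64 before it even reaches the server.
    println!("{}", base64_len(3 * 1024 * 1024));
}
```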

Unsupported Format

Convert images to supported formats:

#![allow(unused)]
fn main() {
// Convert to PNG
let img = image::load_from_memory(&image_data)?;
let mut output = Vec::new();
img.write_to(&mut std::io::Cursor::new(&mut output), image::ImageFormat::Png)?;
}

Model Doesn't Support Vision

Ensure you're using a vision-capable model. Non-vision models will ignore image inputs.

See Also

Advanced Topics

This section covers advanced features and patterns for vLLM Client.

Available Topics

Topic                 Description
Thinking Mode         Reasoning models and thinking content
Custom Headers        Custom HTTP headers and authentication
Timeouts & Retries    Timeout configuration and retry strategies

Thinking Mode

For models that support reasoning (like Qwen with thinking mode), access the reasoning_content field:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Solve this puzzle"}]))
    .extra(json!({"chat_template_kwargs": {"think_mode": true}}))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => eprintln!("[thinking] {}", delta),
        StreamEvent::Content(delta) => print!("{}", delta),
        _ => {}
    }
}
}

Custom Configuration

Environment-Based Configuration

#![allow(unused)]
fn main() {
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    VllmClient::builder()
        .base_url(env::var("VLLM_BASE_URL")
            .unwrap_or_else(|_| "http://localhost:8000/v1".to_string()))
        .api_key(env::var("VLLM_API_KEY").ok())
        .timeout_secs(env::var("VLLM_TIMEOUT")
            .ok()
            .and_then(|s| s.parse().ok())
            .unwrap_or(300))
        .build()
}
}

Multiple Clients

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let primary = VllmClient::new("http://primary-server:8000/v1");
let fallback = VllmClient::new("http://fallback-server:8000/v1");
}

Production Patterns

Connection Pooling

The client reuses HTTP connections automatically. Create once and share:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Clone the Arc for each task
let client1 = Arc::clone(&client);
let client2 = Arc::clone(&client);
}

Graceful Shutdown

Handle graceful shutdown with channels:

#![allow(unused)]
fn main() {
use tokio::signal;
use tokio::sync::broadcast;

let (shutdown_tx, mut shutdown_rx) = broadcast::channel::<()>(1);

// Trigger shutdown on Ctrl-C
tokio::spawn(async move {
    signal::ctrl_c().await.ok();
    shutdown_tx.send(()).ok();
});

// In your request loop
loop {
    tokio::select! {
        result = make_request(&client) => {
            // Handle result
        }
        _ = shutdown_rx.recv() => {
            println!("Shutting down gracefully");
            break;
        }
    }
}
}

Request Queuing

For rate limiting, implement a queue:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use tokio::sync::Semaphore;
use vllm_client::{VllmClient, VllmError, json};

let semaphore = Arc::new(Semaphore::new(10)); // Max 10 concurrent

async fn queued_request(
    client: &VllmClient,
    semaphore: &Semaphore,
    prompt: &str,
) -> Result<String, VllmError> {
    let _permit = semaphore.acquire().await.unwrap();
    client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map(|r| r.content.unwrap_or_default())
}
}

Performance Tips

1. Reuse the Client

Creating a client has some overhead. Reuse it across requests:

#![allow(unused)]
fn main() {
// Good
let client = VllmClient::new("http://localhost:8000/v1");
for prompt in prompts {
    let _ = client.chat.completions().create(/* ... */);
}

// Avoid
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // Inefficient!
    let _ = client.chat.completions().create(/* ... */);
}
}

2. Use Streaming for Long Responses

Get faster time-to-first-token with streaming:

#![allow(unused)]
fn main() {
// Faster perceived latency
let mut stream = client.chat.completions().create()
    .stream(true)
    .send_stream()
    .await?;
}

3. Set Appropriate Timeouts

Match timeout to expected response time:

#![allow(unused)]
fn main() {
// Short queries
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(30);

// Long generation tasks
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(600);
}

4. Batch Requests

Process multiple prompts concurrently:

#![allow(unused)]
fn main() {
use futures::stream::{self, StreamExt};

let prompts = vec!["Hello", "Hi", "Hey"];
let client = &client; // shared reference is Copy, so each future gets its own
let results: Vec<_> = stream::iter(prompts)
    .map(|p| async move {
        client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": p}]))
            .send()
            .await
    })
    .buffer_unordered(5) // Max 5 concurrent
    .collect()
    .await;
}

Security Considerations

API Key Storage

Never hardcode API keys:

#![allow(unused)]
fn main() {
// Good: Use environment variables
let api_key = std::env::var("VLLM_API_KEY")?;

// Avoid: Hardcoded keys
let api_key = "sk-secret-key"; // DON'T DO THIS!
}

TLS Verification

The client uses reqwest which verifies TLS certificates by default. For development with self-signed certificates:

#![allow(unused)]
fn main() {
// Use a custom HTTP client if needed
let http = reqwest::Client::builder()
    .danger_accept_invalid_certs(true) // Only for development!
    .timeout(std::time::Duration::from_secs(300))
    .build()?;
}

See Also

Thinking Mode

Thinking mode (also known as reasoning mode) allows models to output their reasoning process before giving a final answer. This is particularly useful for complex reasoning tasks.

Overview

Some models, like Qwen with thinking mode enabled, can output two types of content:

  1. Reasoning Content - The model's internal "thinking" process
  2. Content - The final response to the user

Enabling Thinking Mode

Qwen Models

For Qwen models, enable thinking mode via the extra parameter:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Solve: What is 15 * 23 + 47?"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        }
    }))
    .send()
    .await?;
}

Checking for Reasoning Content

In non-streaming responses, access reasoning content separately:

#![allow(unused)]
fn main() {
// Check for reasoning content
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Get final content
if let Some(content) = response.content {
    println!("Answer: {}", content);
}
}

Streaming with Thinking Mode

The best way to use thinking mode is with streaming:

use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Think step by step: If I have 5 apples and give 2 to my friend, then buy 3 more, how many do I have?"}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "think_mode": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    println!("=== Thinking Process ===\n");
    
    let mut in_thinking = true;
    let mut reasoning = String::new();
    let mut content = String::new();

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => {
                reasoning.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Content(delta) => {
                if in_thinking {
                    in_thinking = false;
                    println!("\n\n=== Final Answer ===\n");
                }
                content.push_str(&delta);
                print!("{}", delta);
                std::io::Write::flush(&mut std::io::stdout()).ok();
            }
            StreamEvent::Done => break,
            StreamEvent::Error(e) => {
                eprintln!("\nError: {}", e);
                break;
            }
            _ => {}
        }
    }

    println!();

    Ok(())
}

Use Cases

Mathematical Reasoning

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

async fn solve_math_problem(client: &VllmClient, problem: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a math tutor. Show your work clearly."},
            {"role": "user", "content": problem}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "think_mode": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut answer = String::new();

    while let Some(event) = stream.next().await {
        if let StreamEvent::Content(delta) = event {
            answer.push_str(&delta);
        }
    }

    Ok(answer)
}
}

Code Analysis

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Analyze this code for potential bugs and security issues:\n\n```rust\nfn process_input(input: &str) -> String {\n    let mut result = String::new();\n    for c in input.chars() {\n        result.push(c);\n    }\n    result\n}\n```"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        }
    }))
    .send()
    .await?;
}

Complex Decision Making

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "system", "content": "You are a decision support assistant. Think through all options carefully."},
        {"role": "user", "content": "I need to choose between job offers from Company A (high salary, long commute) and Company B (moderate salary, remote work). Help me decide."}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        }
    }))
    .max_tokens(2048)
    .send()
    .await?;
}

Separating Reasoning from Answer

For applications that need to separate reasoning from the final answer:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

struct ThinkingResponse {
    reasoning: String,
    content: String,
}

async fn think_and_respond(
    client: &VllmClient,
    prompt: &str,
) -> Result<ThinkingResponse, Box<dyn std::error::Error>> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": prompt}
        ]))
        .extra(json!({
            "chat_template_kwargs": {
                "think_mode": true
            }
        }))
        .stream(true)
        .send_stream()
        .await?;

    let mut response = ThinkingResponse {
        reasoning: String::new(),
        content: String::new(),
    };

    while let Some(event) = stream.next().await {
        match event {
            StreamEvent::Reasoning(delta) => response.reasoning.push_str(&delta),
            StreamEvent::Content(delta) => response.content.push_str(&delta),
            StreamEvent::Done => break,
            _ => {}
        }
    }

    Ok(response)
}
}

Model Support

| Model | Thinking Mode Support |
|-------|-----------------------|
| Qwen/Qwen2.5-72B-Instruct | ✅ Yes |
| Qwen/Qwen2.5-32B-Instruct | ✅ Yes |
| Qwen/Qwen2.5-7B-Instruct | ✅ Yes |
| DeepSeek-R1 | ✅ Yes (built-in) |
| Other models | Model dependent |

Check your vLLM server configuration to verify thinking mode support.

Configuration Options

Thinking Model Detection

The client automatically parses thinking tokens from the model output:

#![allow(unused)]
fn main() {
// Reasoning content is parsed from special tokens
// Usually structured as: <think>...</think> or similar
}
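If your deployment returns the raw thinking block inline instead of a separate reasoning field, you can split it out yourself. A minimal sketch, assuming `<think>...</think>` delimiters — check your model's chat template for the actual tokens:

```rust
/// Split raw model output into (reasoning, answer), assuming the
/// reasoning is wrapped in <think>...</think>. Returns the whole
/// input as the answer when no thinking block is present.
fn split_thinking(raw: &str) -> (Option<String>, String) {
    if let (Some(start), Some(end)) = (raw.find("<think>"), raw.find("</think>")) {
        if start < end {
            let reasoning = raw[start + "<think>".len()..end].trim().to_string();
            let answer = raw[end + "</think>".len()..].trim().to_string();
            return (Some(reasoning), answer);
        }
    }
    (None, raw.trim().to_string())
}
```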

Non-Streaming Access

For non-streaming requests with reasoning:

#![allow(unused)]
fn main() {
let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Explain quantum entanglement"}
    ]))
    .extra(json!({
        "chat_template_kwargs": {
            "think_mode": true
        }
    }))
    .send()
    .await?;

// Access reasoning (if present)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning:\n{}\n", reasoning);
}

// Access final answer
println!("Answer:\n{}", response.content.unwrap_or_default());
}

Best Practices

1. Use for Complex Tasks

Thinking mode is most beneficial for:

  • Multi-step reasoning
  • Mathematical problems
  • Code analysis
  • Complex decision making
#![allow(unused)]
fn main() {
// Good: Complex reasoning task
.messages(json!([
    {"role": "user", "content": "Solve this puzzle: A father is 4 times as old as his son. In 20 years, he will be only twice as old. How old are they now?"}
]))

// Less beneficial: Simple query
.messages(json!([
    {"role": "user", "content": "What is 2 + 2?"}
]))
}

2. Display Reasoning Selectively

You may want to hide reasoning in production but show it for debugging:

#![allow(unused)]
fn main() {
let show_reasoning = std::env::var("SHOW_REASONING").is_ok();

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => {
            if show_reasoning {
                eprintln!("[thinking] {}", delta);
            }
        }
        StreamEvent::Content(delta) => print!("{}", delta),
        _ => {}
    }
}
}

3. Combine with System Prompts

Guide the thinking process with system prompts:

#![allow(unused)]
fn main() {
.messages(json!([
    {
        "role": "system", 
        "content": "Think through problems step by step. Consider multiple approaches before settling on an answer."
    },
    {"role": "user", "content": problem}
]))
}

4. Adjust Max Tokens

Thinking mode uses more tokens. Adjust accordingly:

#![allow(unused)]
fn main() {
.max_tokens(4096)  // Account for both reasoning and answer
}
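One way to budget is to split a single total between reasoning and answer. A sketch with an illustrative 3:1 split — the ratio is an assumption, not a library default:

```rust
/// Split a total token budget between reasoning and final answer.
/// The 3:1 reasoning-to-answer ratio is a rough heuristic for
/// thinking-mode requests, not a measured value.
fn token_budget(total: u32) -> (u32, u32) {
    let reasoning = total * 3 / 4; // thinking usually dominates
    (reasoning, total - reasoning)
}
```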

Troubleshooting

No Reasoning Content

If you don't see reasoning content:

  1. Ensure thinking mode is enabled in extra parameters
  2. Verify the model supports thinking mode
  3. Check vLLM server configuration
# Check vLLM server logs for any issues

Incomplete Streaming

If streaming seems incomplete:

#![allow(unused)]
fn main() {
// Ensure you handle all event types
while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => { /* handle */ },
        StreamEvent::Content(delta) => { /* handle */ },
        StreamEvent::Done => break,
        StreamEvent::Error(e) => {
            eprintln!("Error: {}", e);
            break;
        }
        _ => {}  // Don't forget other events
    }
}
}

See Also

Custom Headers

This document explains how to use custom HTTP headers with vLLM Client.

Overview

While the vLLM Client handles standard authentication via API keys, you may need to add custom headers for:

  • Custom authentication schemes
  • Request tracing and debugging
  • Rate limiting identifiers
  • Custom metadata

Current Limitations

The current version of vLLM Client does not provide a built-in method for custom headers. However, you can work around this limitation in several ways.

Workaround: Environment Variables

If your vLLM server accepts configuration via environment variables or specific API parameters:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key(std::env::var("MY_API_KEY").unwrap_or_default());
}

Workaround: Via Extra Parameters

Some custom configurations can be passed through the extra() method:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json};

let response = client
    .chat
    .completions()
    .create()
    .model("Qwen/Qwen2.5-7B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .extra(json!({
        "custom_field": "custom_value",
        "request_id": "req-12345"
    }))
    .send()
    .await?;
}

Future Support

Custom header support is planned for future versions. The API will likely look like:

// Future API (not yet implemented)
let client = VllmClient::new("http://localhost:8000/v1")
    .with_header("X-Custom-Header", "value")
    .with_header("X-Request-ID", "req-123");

Common Use Cases

Tracing Headers

For distributed tracing (when supported):

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Trace-ID", trace_id)
    .header("X-Span-ID", span_id)
    .build();

Custom Authentication

For non-standard authentication schemes:

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-API-Key", "custom-key")
    .header("X-Tenant-ID", "tenant-123")
    .build();

Request Metadata

Add metadata for logging or analytics:

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .header("X-Request-Source", "mobile-app")
    .header("X-User-ID", "user-456")
    .build();

Alternative: Custom HTTP Client

For advanced use cases, you can use the underlying reqwest client directly:

use reqwest::Client;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    
    let response = client
        .post("http://localhost:8000/v1/chat/completions")
        .header("Content-Type", "application/json")
        .header("Authorization", "Bearer your-api-key")
        .header("X-Custom-Header", "custom-value")
        .json(&json!({
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": "Hello!"}]
        }))
        .send()
        .await?;
    
    let result: serde_json::Value = response.json().await?;
    println!("{:?}", result);
    
    Ok(())
}

Best Practices

1. Use Standard Authentication When Possible

#![allow(unused)]
fn main() {
// Preferred
let client = VllmClient::new("http://localhost:8000/v1")
    .with_api_key("your-api-key");

// Avoid custom auth unless necessary
}

2. Document Custom Headers

When using custom headers, document their purpose:

// Future API
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    // For request tracing in logs
    .header("X-Request-ID", &request_id)
    // For multi-tenant identification
    .header("X-Tenant-ID", &tenant_id)
    .build();

3. Validate Server Support

Ensure your vLLM server accepts and processes custom headers. Some proxies or load balancers may strip unknown headers.

Security Considerations

Don't Expose Sensitive Headers

Avoid logging headers that contain sensitive information:

// Be careful with logging
let auth_header = "Bearer secret-key";
// Don't log this directly!

Use HTTPS

Always use HTTPS when transmitting sensitive headers:

#![allow(unused)]
fn main() {
// Good
let client = VllmClient::new("https://api.example.com/v1");

// Avoid for sensitive data
let client = VllmClient::new("http://api.example.com/v1");
}

Requesting This Feature

If you need custom header support, please open an issue on GitHub with:

  1. Your use case
  2. Required headers
  3. How you'd like the API to look

See Also

Timeouts & Retries

This page covers timeout configuration and retry strategies for robust production applications.

Setting Timeouts

Client-Level Timeout

Set a timeout when creating the client:

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

// Simple timeout
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(120);

// Using builder
let client = VllmClient::builder()
    .base_url("http://localhost:8000/v1")
    .timeout_secs(300)  // 5 minutes
    .build();
}

Choosing the Right Timeout

| Use Case | Recommended Timeout |
|----------|---------------------|
| Simple queries | 30-60 seconds |
| Code generation | 2-3 minutes |
| Long document generation | 5-10 minutes |
| Complex reasoning tasks | 10+ minutes |

Request Duration Factors

The time a request takes depends on:

  1. Prompt length - Longer prompts take more time to process
  2. Output tokens - More tokens = longer generation time
  3. Model size - Larger models are slower
  4. Server load - Busy servers respond more slowly
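These factors can be folded into a rough client-side estimate. The throughput numbers below are made up for illustration — measure your own server before relying on anything like this:

```rust
use std::time::Duration;

/// Very rough timeout estimate: prompt processing time plus
/// generation time, with a fixed safety margin. All rates are
/// assumed values, not measurements.
fn estimate_timeout(prompt_tokens: u64, max_output_tokens: u64) -> Duration {
    let prefill_tps = 2_000; // tokens/sec for prompt processing (assumed)
    let decode_tps = 40;     // tokens/sec for generation (assumed)
    let secs = prompt_tokens / prefill_tps
        + max_output_tokens / decode_tps
        + 10; // safety margin
    Duration::from_secs(secs)
}
```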

Timeout Errors

Handling Timeout

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

async fn chat_with_timeout(prompt: &str) -> Result<String, VllmError> {
    let client = VllmClient::new("http://localhost:8000/v1")
        .timeout_secs(60);

    let result = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await;

    match result {
        Ok(response) => Ok(response.content.unwrap_or_default()),
        Err(VllmError::Timeout) => {
            eprintln!("Request timed out after 60 seconds");
            Err(VllmError::Timeout)
        }
        Err(e) => Err(e),
    }
}
}

Retry Strategies

Basic Retry

Retry failed requests with exponential backoff:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

async fn send_with_retry(
    client: &VllmClient,
    prompt: &str,
    max_retries: u32,
) -> Result<String, VllmError> {
    let mut attempts = 0;

    loop {
        match client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
        {
            Ok(response) => {
                return Ok(response.content.unwrap_or_default());
            }
            Err(e) if e.is_retryable() && attempts < max_retries => {
                attempts += 1;
                let delay = Duration::from_millis(100 * 2u64.pow(attempts - 1));
                eprintln!("Retry {} after {:?}: {}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
}

Retry with Jitter

Add jitter to avoid a thundering herd of simultaneous retries:

#![allow(unused)]
fn main() {
use rand::Rng;
use std::time::Duration;
use tokio::time::sleep;

fn backoff_with_jitter(attempt: u32, base_ms: u64, max_ms: u64) -> Duration {
    let exponential = base_ms * 2u64.pow(attempt);
    let jitter = rand::thread_rng().gen_range(0..base_ms);
    let delay = (exponential + jitter).min(max_ms);
    Duration::from_millis(delay)
}

async fn retry_with_jitter<F, T, E>(
    mut f: F,
    max_retries: u32,
) -> Result<T, E>
where
    F: FnMut() -> std::pin::Pin<Box<dyn std::future::Future<Output = Result<T, E>> + Send>>,
    E: std::fmt::Debug,
{
    let mut attempts = 0;

    loop {
        match f().await {
            Ok(result) => return Ok(result),
            Err(e) if attempts < max_retries => {
                attempts += 1;
                let delay = backoff_with_jitter(attempts, 100, 10_000);
                eprintln!("Retry {} after {:?}: {:?}", attempts, delay, e);
                sleep(delay).await;
            }
            Err(e) => return Err(e),
        }
    }
}
}

Retry Only Retryable Errors

Not all errors should be retried:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, VllmError};

async fn smart_retry(
    client: &VllmClient,
    prompt: &str,
) -> Result<String, VllmError> {
    let mut attempts = 0;
    let max_retries = 3;

    loop {
        let result = client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await;

        match result {
            Ok(response) => return Ok(response.content.unwrap_or_default()),
            Err(e) => {
                // Check if error is retryable
                if !e.is_retryable() {
                    return Err(e);
                }

                if attempts >= max_retries {
                    return Err(e);
                }

                attempts += 1;
                tokio::time::sleep(std::time::Duration::from_secs(2u64.pow(attempts))).await;
            }
        }
    }
}
}

Retryable Errors

| Error | Retryable | Reason |
|-------|-----------|--------|
| Timeout | Yes | Server may just be slow |
| 429 Rate Limited | Yes | Wait and retry |
| 500 Server Error | Yes | Temporary server issue |
| 502 Bad Gateway | Yes | Server may be restarting |
| 503 Unavailable | Yes | Temporary overload |
| 504 Gateway Timeout | Yes | Upstream timed out |
| 400 Bad Request | No | Client error |
| 401 Unauthorized | No | Authentication issue |
| 404 Not Found | No | Resource doesn't exist |
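The table maps to a simple status-code check. A sketch of the classification — the real `VllmError::is_retryable` may use different logic; this just mirrors the rows above:

```rust
/// Whether an HTTP status code from the table above is worth
/// retrying. Client errors (4xx other than 429) are excluded.
fn is_retryable_status(status: u16) -> bool {
    matches!(status, 429 | 500 | 502 | 503 | 504)
}
```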

Circuit Breaker Pattern

Prevent cascading failures with a circuit breaker:

#![allow(unused)]
fn main() {
use std::sync::atomic::{AtomicU32, Ordering};
use std::time::{Duration, Instant};
use std::sync::Mutex;

struct CircuitBreaker {
    failures: AtomicU32,
    last_failure: Mutex<Option<Instant>>,
    threshold: u32,
    reset_duration: Duration,
}

impl CircuitBreaker {
    fn new(threshold: u32, reset_duration: Duration) -> Self {
        Self {
            failures: AtomicU32::new(0),
            last_failure: Mutex::new(None),
            threshold,
            reset_duration,
        }
    }

    fn can_attempt(&self) -> bool {
        let failures = self.failures.load(Ordering::Relaxed);
        if failures < self.threshold {
            return true;
        }

        let last = self.last_failure.lock().unwrap();
        if let Some(time) = *last {
            if time.elapsed() > self.reset_duration {
                // Reset circuit breaker
                self.failures.store(0, Ordering::Relaxed);
                return true;
            }
        }

        false
    }

    fn record_success(&self) {
        self.failures.store(0, Ordering::Relaxed);
    }

    fn record_failure(&self) {
        self.failures.fetch_add(1, Ordering::Relaxed);
        *self.last_failure.lock().unwrap() = Some(Instant::now());
    }
}
}
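To show the breaker in action, here is a condensed, self-contained walk-through. The struct repeats a simplified version of `CircuitBreaker` above (the time-based reset is omitted for brevity), and `guarded_call` stands in for a real API request:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Condensed circuit breaker (time-based reset omitted) so this
/// example stands alone.
struct Breaker {
    failures: AtomicU32,
    threshold: u32,
}

impl Breaker {
    fn new(threshold: u32) -> Self {
        Self { failures: AtomicU32::new(0), threshold }
    }

    fn can_attempt(&self) -> bool {
        self.failures.load(Ordering::Relaxed) < self.threshold
    }

    fn record_failure(&self) {
        self.failures.fetch_add(1, Ordering::Relaxed);
    }

    fn record_success(&self) {
        self.failures.store(0, Ordering::Relaxed);
    }
}

/// Stand-in for a real request: checks the breaker first, then
/// records success or failure based on the simulated outcome.
fn guarded_call(breaker: &Breaker, succeed: bool) -> Result<(), &'static str> {
    if !breaker.can_attempt() {
        return Err("circuit open: skipping request");
    }
    if succeed {
        breaker.record_success();
        Ok(())
    } else {
        breaker.record_failure();
        Err("request failed")
    }
}
```

After three consecutive failures the breaker opens and further calls are rejected without touching the server; one recorded success closes it again.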

Streaming Timeout

Handle timeouts during streaming:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;
use tokio::time::{timeout, Duration};

async fn stream_with_timeout(
    client: &VllmClient,
    prompt: &str,
    per_event_timeout: Duration,
) -> Result<String, vllm_client::VllmError> {
    let mut stream = client
        .chat
        .completions()
        .create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .stream(true)
        .send_stream()
        .await?;

    let mut content = String::new();

    loop {
        match timeout(per_event_timeout, stream.next()).await {
            Ok(Some(event)) => {
                match event {
                    StreamEvent::Content(delta) => content.push_str(&delta),
                    StreamEvent::Done => break,
                    StreamEvent::Error(e) => return Err(e),
                    _ => {}
                }
            }
            Ok(None) => break,
            Err(_) => {
                return Err(vllm_client::VllmError::Timeout);
            }
        }
    }

    Ok(content)
}
}

Rate Limiting

Implement client-side rate limiting:

#![allow(unused)]
fn main() {
use tokio::sync::Semaphore;
use std::sync::Arc;

struct RateLimitedClient {
    client: vllm_client::VllmClient,
    semaphore: Arc<Semaphore>,
}

impl RateLimitedClient {
    fn new(base_url: &str, max_concurrent: usize) -> Self {
        Self {
            client: vllm_client::VllmClient::new(base_url),
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, vllm_client::VllmError> {
        let _permit = self.semaphore.acquire().await.unwrap();
        
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(vllm_client::json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }
}
}

Production Configuration

Complete Example

use vllm_client::{VllmClient, json, VllmError};
use std::time::Duration;
use tokio::time::sleep;

struct RobustClient {
    client: VllmClient,
    max_retries: u32,
    base_backoff_ms: u64,
    max_backoff_ms: u64,
}

impl RobustClient {
    fn new(base_url: &str, timeout_secs: u64) -> Self {
        Self {
            client: VllmClient::builder()
                .base_url(base_url)
                .timeout_secs(timeout_secs)
                .build(),
            max_retries: 3,
            base_backoff_ms: 100,
            max_backoff_ms: 10_000,
        }
    }

    async fn chat(&self, prompt: &str) -> Result<String, VllmError> {
        let mut attempts = 0;

        loop {
            match self.send_request(prompt).await {
                Ok(response) => return Ok(response),
                Err(e) if self.should_retry(&e, attempts) => {
                    attempts += 1;
                    let delay = self.calculate_backoff(attempts);
                    eprintln!("Retry {} after {:?}: {}", attempts, delay, e);
                    sleep(delay).await;
                }
                Err(e) => return Err(e),
            }
        }
    }

    async fn send_request(&self, prompt: &str) -> Result<String, VllmError> {
        self.client
            .chat
            .completions()
            .create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": prompt}]))
            .send()
            .await
            .map(|r| r.content.unwrap_or_default())
    }

    fn should_retry(&self, error: &VllmError, attempts: u32) -> bool {
        attempts < self.max_retries && error.is_retryable()
    }

    fn calculate_backoff(&self, attempt: u32) -> Duration {
        let delay = self.base_backoff_ms * 2u64.pow(attempt);
        Duration::from_millis(delay.min(self.max_backoff_ms))
    }
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = RobustClient::new("http://localhost:8000/v1", 300);

    match client.chat("Hello!").await {
        Ok(response) => println!("Response: {}", response),
        Err(e) => eprintln!("Failed after retries: {}", e),
    }

    Ok(())
}

Best Practices

  1. Set appropriate timeouts based on expected response times
  2. Use exponential backoff to avoid overwhelming the server
  3. Add jitter to prevent thundering herd problems
  4. Only retry retryable errors - don't retry client errors
  5. Implement circuit breakers for production systems
  6. Log retry attempts for debugging and monitoring
  7. Set a maximum retry count to avoid infinite loops

See Also

Contributing to vLLM Client

Thank you for your interest in contributing to vLLM Client! This document provides guidelines and instructions for contributing.

Table of Contents

Code of Conduct

Be respectful and inclusive. We welcome contributions from everyone.

Getting Started

  1. Fork the repository on GitHub
  2. Clone your fork locally
  3. Create a branch for your changes
git clone https://github.com/YOUR_USERNAME/vllm-client.git
cd vllm-client
git checkout -b my-feature

Development Setup

Prerequisites

  • Rust 1.70 or later
  • Cargo (comes with Rust)
  • A vLLM server for integration testing (optional)

Building

# Build the library
cargo build

# Build with all features
cargo build --all-features

Running Tests

# Run unit tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run specific test
cargo test test_name

# Run integration tests (requires vLLM server)
cargo test --test integration

Making Changes

Branch Naming

Use descriptive branch names:

  • feature/add-new-feature - for new features
  • fix/bug-description - for bug fixes
  • docs/documentation-update - for documentation changes
  • refactor/code-cleanup - for refactoring

Commit Messages

Follow conventional commit format:

type(scope): description

[optional body]

[optional footer]

Types:

  • feat: New feature
  • fix: Bug fix
  • docs: Documentation changes
  • style: Code style changes (formatting, etc.)
  • refactor: Code refactoring
  • test: Adding or updating tests
  • chore: Maintenance tasks

Examples:

feat(client): add connection pooling support

fix(streaming): handle empty chunks correctly

docs(api): update streaming documentation

Testing

Unit Tests

All new functionality should have unit tests:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_new_feature() {
        // Test implementation
    }
}
}

Integration Tests

Integration tests go in the tests/ directory:

#![allow(unused)]
fn main() {
// tests/integration_test.rs
use vllm_client::{VllmClient, json};

#[tokio::test]
async fn test_chat_completion() {
    let client = VllmClient::new("http://localhost:8000/v1");
    // ... test code
}
}

Test Coverage

We aim for good test coverage. Run coverage reports:

cargo tarpaulin --out Html

Documentation

Code Documentation

Document all public APIs with doc comments:

#![allow(unused)]
fn main() {
/// Creates a new chat completion request.
///
/// # Arguments
///
/// * `model` - The model name to use for generation
///
/// # Returns
///
/// A new `ChatCompletionsRequest` builder
///
/// # Example
///
/// ```rust
/// use vllm_client::{VllmClient, json};
///
/// let client = VllmClient::new("http://localhost:8000/v1");
/// let response = client.chat.completions().create()
///     .model("Qwen/Qwen2.5-7B-Instruct")
///     .messages(json!([{"role": "user", "content": "Hello"}]))
///     .send()
///     .await?;
/// ```
pub fn create(&self) -> ChatCompletionsRequest {
    // Implementation
}
}

Updating Documentation

When adding new features:

  1. Update inline documentation
  2. Update API reference in docs/src/api/
  3. Add examples to docs/src/examples/
  4. Update the changelog

Building Documentation

# Build and preview documentation
cd docs && mdbook serve --open

Pull Request Process

  1. Update Documentation: Ensure documentation reflects your changes
  2. Add Tests: Include tests for new functionality
  3. Run Tests: Make sure all tests pass
  4. Format Code: Run cargo fmt
  5. Check Lints: Run cargo clippy
  6. Update CHANGELOG: Add entry to changelog

Pre-PR Checklist

# Format code
cargo fmt

# Check for lints
cargo clippy -- -D warnings

# Run all tests
cargo test

# Build documentation
mdbook build docs
mdbook build docs/zh

Submitting the PR

  1. Push your branch to your fork
  2. Open a PR against the main branch
  3. Fill in the PR template
  4. Wait for review

PR Template

## Description

Brief description of changes

## Type of Change

- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing

- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing done

## Checklist

- [ ] Code formatted with `cargo fmt`
- [ ] No clippy warnings
- [ ] Documentation updated
- [ ] Changelog updated

Coding Standards

Rust Style

Follow standard Rust conventions:

Naming Conventions

  • Types: PascalCase (ChatCompletionResponse)
  • Functions/Methods: snake_case (send_stream)
  • Constants: SCREAMING_SNAKE_CASE (MAX_RETRIES)
  • Modules: snake_case (chat, completions)

Error Handling

Use VllmError for all errors:

#![allow(unused)]
fn main() {
// Good
pub fn parse_response(data: &str) -> Result<Response, VllmError> {
    serde_json::from_str(data).map_err(VllmError::Json)
}

// Avoid
pub fn parse_response(data: &str) -> Result<Response, String> {
    // ...
}
}

Async Code

Use async/await for all async operations:

#![allow(unused)]
fn main() {
// Good
pub async fn send(&self) -> Result<Response, VllmError> {
    let response = self.http.post(&url).send().await?;
    // ...
}

// Avoid blocking in async context
pub async fn bad_example(&self) -> Result<Response, VllmError> {
    std::thread::sleep(Duration::from_secs(1)); // Don't do this
    // ...
}
}

Project Structure

vllm-client/
├── src/
│   ├── lib.rs         # Library entry point
│   ├── client.rs      # Client implementation
│   ├── chat.rs        # Chat API
│   ├── completions.rs # Legacy completions
│   ├── types.rs       # Type definitions
│   └── error.rs       # Error types
├── tests/
│   └── integration/   # Integration tests
├── docs/
│   ├── src/           # English documentation
│   └── zh/src/        # Chinese documentation
├── examples/
│   └── *.rs           # Example programs
└── Cargo.toml

Getting Help

  • Open an issue for bugs or feature requests
  • Start a discussion for questions
  • Check existing issues before creating new ones

License

By contributing, you agree that your contributions will be licensed under the MIT OR Apache-2.0 license.

Recognition

Contributors are recognized in our README and release notes.

Thank you for contributing to vLLM Client!

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

0.1.0 - 2024-01-XX

Added

  • Initial release of vLLM Client
  • VllmClient for connecting to vLLM servers
  • Chat completions API (client.chat.completions())
  • Streaming response support with MessageStream
  • Tool/function calling support
  • Reasoning/thinking mode support for compatible models
  • Error handling with VllmError enum
  • Builder pattern for client configuration
  • Request builder pattern for chat completions
  • Support for vLLM-specific parameters via extra()
  • Token usage tracking in responses
  • Timeout configuration
  • API key authentication

Features

Client

  • VllmClient::new(base_url) - Create a new client
  • VllmClient::builder() - Create a client with builder pattern
  • with_api_key() - Set API key for authentication
  • timeout_secs() - Set request timeout

Chat Completions

  • model() - Set model name
  • messages() - Set conversation messages
  • temperature() - Set sampling temperature
  • max_tokens() - Set maximum output tokens
  • top_p() - Set nucleus sampling parameter
  • top_k() - Set top-k sampling (vLLM extension)
  • stop() - Set stop sequences
  • stream() - Enable streaming mode
  • tools() - Define available tools
  • tool_choice() - Control tool selection
  • extra() - Pass vLLM-specific parameters

Streaming

  • StreamEvent::Content - Content tokens
  • StreamEvent::Reasoning - Reasoning content (thinking models)
  • StreamEvent::ToolCallDelta - Streaming tool call updates
  • StreamEvent::ToolCallComplete - Complete tool call
  • StreamEvent::Usage - Token usage statistics
  • StreamEvent::Done - Stream completion
  • StreamEvent::Error - Error events

Response Types

  • ChatCompletionResponse - Chat completion response
  • ToolCall - Tool call data with parsing methods
  • Usage - Token usage statistics

Dependencies

  • reqwest - HTTP client
  • serde / serde_json - JSON serialization
  • tokio - Async runtime
  • thiserror - Error handling

[Unreleased]

Planned

  • Custom HTTP headers support
  • Connection pooling configuration
  • Request/response logging
  • Retry middleware
  • Multi-modal input helpers
  • Async iterator for batch processing
  • OpenTelemetry integration
  • WebSocket transport

Version History

| Version | Date | Highlights |
|---------|------|------------|
| 0.1.0 | 2024-01 | Initial release |