Advanced Topics

This section covers advanced features and patterns for vLLM Client.

Available Topics

TopicDescription
Thinking ModeReasoning models and thinking content
Custom HeadersCustom HTTP headers and authentication
Timeouts & RetriesTimeout configuration and retry strategies

Thinking Mode

For models that support reasoning (like Qwen with thinking mode), access the reasoning_content field:

#![allow(unused)]
fn main() {
use vllm_client::{VllmClient, json, StreamEvent};
use futures::StreamExt;

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Solve this puzzle"}]))
    .extra(json!({"chat_template_kwargs": {"think_mode": true}}))
    .stream(true)
    .send_stream()
    .await?;

while let Some(event) = stream.next().await {
    match event {
        StreamEvent::Reasoning(delta) => eprintln!("[thinking] {}", delta),
        StreamEvent::Content(delta) => print!("{}", delta),
        _ => {}
    }
}
}

Custom Configuration

Environment-Based Configuration

#![allow(unused)]
fn main() {
use std::env;
use vllm_client::VllmClient;

fn create_client() -> VllmClient {
    VllmClient::builder()
        .base_url(env::var("VLLM_BASE_URL")
            .unwrap_or_else(|_| "http://localhost:8000/v1".to_string()))
        .api_key(env::var("VLLM_API_KEY").ok())
        .timeout_secs(env::var("VLLM_TIMEOUT")
            .ok()
            .and_then(|s| s.parse().ok())
            .unwrap_or(300))
        .build()
}
}

Multiple Clients

#![allow(unused)]
fn main() {
use vllm_client::VllmClient;

let primary = VllmClient::new("http://primary-server:8000/v1");
let fallback = VllmClient::new("http://fallback-server:8000/v1");
}

Production Patterns

Connection Pooling

The client reuses HTTP connections automatically. Create once and share:

#![allow(unused)]
fn main() {
use std::sync::Arc;
use vllm_client::VllmClient;

let client = Arc::new(VllmClient::new("http://localhost:8000/v1"));

// Clone the Arc for each task
let client1 = Arc::clone(&client);
let client2 = Arc::clone(&client);
}

Graceful Shutdown

Handle graceful shutdown with channels:

#![allow(unused)]
fn main() {
use tokio::signal;
use tokio::sync::broadcast;

let (shutdown_tx, _) = broadcast::channel::<()>(1);

// In your request loop
tokio::select! {
    result = make_request(&client) => {
        // Handle result
    }
    _ = shutdown_rx.recv() => {
        println!("Shutting down gracefully");
        break;
    }
}
}

Request Queuing

For rate limiting, implement a queue:

#![allow(unused)]
fn main() {
use tokio::sync::Semaphore;

let semaphore = Arc::new(Semaphore::new(10)); // Max 10 concurrent

async fn queued_request(client: &VllmClient, prompt: &str) -> Result<String, VllmError> {
    let _permit = semaphore.acquire().await.unwrap();
    client.chat.completions().create()
        .model("Qwen/Qwen2.5-7B-Instruct")
        .messages(json!([{"role": "user", "content": prompt}]))
        .send()
        .await
        .map(|r| r.content.unwrap_or_default())
}
}

Performance Tips

1. Reuse the Client

Creating a client has some overhead. Reuse it across requests:

#![allow(unused)]
fn main() {
// Good
let client = VllmClient::new("http://localhost:8000/v1");
for prompt in prompts {
    let _ = client.chat.completions().create()...;
}

// Avoid
for prompt in prompts {
    let client = VllmClient::new("http://localhost:8000/v1"); // Inefficient!
    let _ = client.chat.completions().create()...;
}
}

2. Use Streaming for Long Responses

Get faster time-to-first-token with streaming:

#![allow(unused)]
fn main() {
// Faster perceived latency
let mut stream = client.chat.completions().create()
    .stream(true)
    .send_stream()
    .await?;
}

3. Set Appropriate Timeouts

Match timeout to expected response time:

#![allow(unused)]
fn main() {
// Short queries
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(30);

// Long generation tasks
let client = VllmClient::new("http://localhost:8000/v1")
    .timeout_secs(600);
}

4. Batch Requests

Process multiple prompts concurrently:

#![allow(unused)]
fn main() {
use futures::stream::{self, StreamExt};

let prompts = vec!["Hello", "Hi", "Hey"];
let results: Vec<_> = stream::iter(prompts)
    .map(|p| async {
        client.chat.completions().create()
            .model("Qwen/Qwen2.5-7B-Instruct")
            .messages(json!([{"role": "user", "content": p}]))
            .send()
            .await
    })
    .buffer_unordered(5) // Max 5 concurrent
    .collect()
    .await;
}

Security Considerations

API Key Storage

Never hardcode API keys:

#![allow(unused)]
fn main() {
// Good: Use environment variables
let api_key = std::env::var("VLLM_API_KEY")?;

// Avoid: Hardcoded keys
let api_key = "sk-secret-key"; // DON'T DO THIS!
}

TLS Verification

The client uses reqwest which verifies TLS certificates by default. For development with self-signed certificates:

#![allow(unused)]
fn main() {
// Use a custom HTTP client if needed
let http = reqwest::Client::builder()
    .danger_accept_invalid_certs(true) // Only for development!
    .timeout(std::time::Duration::from_secs(300))
    .build()?;
}

See Also