Chat Completions API

The Chat Completions API is the primary interface for generating text responses from a language model.

Overview

Access the chat completions API through client.chat.completions():

use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;

Request Builder

Required Parameters

model(name: impl Into<String>)

Set the model name to use for generation.

.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")

messages(messages: Value)

Set the conversation messages as a JSON array.

.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))

Message Types

| Role | Description |
|-------------|--------------------------------------------|
| `system` | Set the behavior of the assistant |
| `user` | User input |
| `assistant` | Assistant response (for multi-turn) |
| `tool` | Tool result (for function calling) |
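All four roles can appear in a single request body. As a hypothetical illustration, here is a four-turn history in which the `tool` message answers the assistant's earlier call; the `tool_call_id` linkage follows the OpenAI-compatible schema that vLLM serves, rather than anything specific to this crate:

```rust
use vllm_client::json;

// All four roles in one history. The `tool` turn answers the assistant's
// earlier call; its `tool_call_id` must match the call's `id`.
let messages = json!([
    {"role": "system", "content": "You are a weather assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": null, "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 18}"}
]);
```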

Sampling Parameters

temperature(temp: f32)

Controls randomness. Range: 0.0 to 2.0.

.temperature(0.7)  // Default-like behavior
.temperature(0.0)  // Deterministic
.temperature(1.5)  // More creative

max_tokens(tokens: u32)

Maximum number of tokens to generate.

.max_tokens(1024)
.max_tokens(4096)

top_p(p: f32)

Nucleus sampling threshold. Range: 0.0 to 1.0.

.top_p(0.9)

top_k(k: i32)

Top-K sampling (a vLLM extension). Restricts sampling to the K most probable tokens; pass -1 to disable it (the vLLM default).

.top_k(50)

stop(sequences: Value)

Stop generation when encountering these sequences.

// Multiple sequences
.stop(json!(["END", "STOP", "\n\n"]))

// Single sequence
.stop(json!("---"))

Tool Calling Parameters

tools(tools: Value)

Define tools/functions that the model can call.

.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))

tool_choice(choice: Value)

Control tool selection behavior.

.tool_choice(json!("auto"))       // Model decides
.tool_choice(json!("none"))       // No tools
.tool_choice(json!("required"))   // Force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
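Putting `tools` and `tool_choice` together, the typical round trip is: send the request with tools, check whether generation stopped for tool calls, execute the tool, and send its output back as a `tool` message. The sketch below reads the call out of the response's `raw` JSON field rather than assuming the typed `ToolCall` layout, and the inline weather result is a stand-in for your real tool:

```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let tools = json!([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]);

let first = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Weather in Paris?"}]))
    .tools(tools.clone())
    .tool_choice(json!("auto"))
    .send()
    .await?;

if first.finish_reason.as_deref() == Some("tool_calls") {
    // Read the call from the raw OpenAI-compatible JSON so we don't
    // depend on the exact field layout of the typed ToolCall struct.
    let assistant_msg = first.raw["choices"][0]["message"].clone();
    let call_id = assistant_msg["tool_calls"][0]["id"].clone();

    let second = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Weather in Paris?"},
            assistant_msg,
            // Stand-in result; substitute your tool's real output.
            {"role": "tool", "tool_call_id": call_id, "content": "{\"temp_c\": 18}"}
        ]))
        .tools(tools)
        .send()
        .await?;

    println!("{}", second.content.unwrap_or_default());
}
```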

Advanced Parameters

stream(enable: bool)

Enable streaming response.

.stream(true)

extra(params: Value)

Pass vLLM-specific or additional parameters.

.extra(json!({
    "chat_template_kwargs": {
        "think_mode": true
    },
    "reasoning_effort": "high"
}))

Sending Requests

send() - Non-streaming Response

Awaits the request and returns the complete response in a single payload.

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;

send_stream() - Streaming Response

Returns a stream for real-time output.

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
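How you consume the stream depends on the crate's chunk type, which this page doesn't pin down. As a sketch, assuming the stream implements `futures::Stream` and each chunk exposes the raw OpenAI-style streaming JSON via a `raw` field:

```rust
use futures_util::StreamExt; // assumption: the stream implements futures::Stream

while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    // Assumed chunk shape: raw OpenAI-compatible streaming JSON, where
    // incremental text arrives under choices[0].delta.content.
    if let Some(delta) = chunk.raw["choices"][0]["delta"]["content"].as_str() {
        print!("{delta}");
    }
}
```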

See Streaming for detailed streaming documentation.

Response Structure

ChatCompletionResponse

| Field | Type | Description |
|---------------------|--------------------------|-------------------------------------|
| `raw` | `Value` | Raw JSON response |
| `id` | `String` | Response ID |
| `object` | `String` | Object type |
| `model` | `String` | Model used |
| `content` | `Option<String>` | Generated content |
| `reasoning_content` | `Option<String>` | Reasoning content (thinking models) |
| `tool_calls` | `Option<Vec<ToolCall>>` | Tool calls made |
| `finish_reason` | `Option<String>` | Why generation stopped |
| `usage` | `Option<Usage>` | Token usage statistics |

Example Usage

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Access content
println!("Content: {}", response.content.unwrap_or_default());

// Check for reasoning (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check finish reason
match response.finish_reason.as_deref() {
    Some("stop") => println!("Natural stop"),
    Some("length") => println!("Max tokens reached"),
    Some("tool_calls") => println!("Tool calls made"),
    _ => {}
}

// Token usage
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}

Complete Example

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Write a function to reverse a string in Rust"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }

    Ok(())
}

Multi-turn Conversation

use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First message
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"}
    ]))
    .send()
    .await?;

// Continue conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What's my name?"}
    ]))
    .send()
    .await?;
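For longer conversations, keeping the history in a `Vec` and re-serializing it each turn avoids rebuilding the array by hand. A sketch assuming `json!` is the re-exported `serde_json` macro (so a `Vec` of values interpolates into a JSON array):

```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");
let model = "Qwen/Qwen2.5-72B-Instruct";

let mut history = vec![json!({"role": "user", "content": "My name is Alice"})];

let first = client.chat.completions().create()
    .model(model)
    .messages(json!(history))
    .send()
    .await?;

// Record the assistant's turn, then append the next user turn.
history.push(json!({
    "role": "assistant",
    "content": first.content.unwrap_or_default()
}));
history.push(json!({"role": "user", "content": "What's my name?"}));

let second = client.chat.completions().create()
    .model(model)
    .messages(json!(history))
    .send()
    .await?;
```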

See Also