Chat Completions API

The Chat Completions API is the primary interface for generating text responses from a language model.

Overview

Access the chat completions API through client.chat.completions():

use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "Hello!"}
    ]))
    .send()
    .await?;

Request Builder

Required Parameters

model(name: impl Into<String>)

Set the model name to use for generation.

.model("Qwen/Qwen2.5-72B-Instruct")
// or
.model("meta-llama/Llama-3-70b")

messages(messages: Value)

Set the conversation messages as a JSON array.

.messages(json!([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Rust?"}
]))

Message Types

| Role | Description |
|-------------|--------------------------------------------|
| `system` | Set the behavior of the assistant |
| `user` | User input |
| `assistant` | Assistant response (for multi-turn) |
| `tool` | Tool result (for function calling) |
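All four roles can appear in a single request body. As a hypothetical illustration, here is a four-turn history in which the `tool` message answers the assistant's earlier call; the `tool_call_id` linkage follows the OpenAI-compatible schema that vLLM serves, rather than anything specific to this crate:

```rust
use vllm_client::json;

// All four roles in one history. The `tool` turn answers the assistant's
// earlier call; its `tool_call_id` must match the call's `id`.
let messages = json!([
    {"role": "system", "content": "You are a weather assistant."},
    {"role": "user", "content": "What's the weather in Paris?"},
    {"role": "assistant", "content": null, "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}
    }]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"temp_c\": 18}"}
]);
```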

Sampling Parameters

temperature(temp: f32)

Controls randomness. Range: 0.0 to 2.0.

.temperature(0.7)  // Default-like behavior
.temperature(0.0)  // Deterministic
.temperature(1.5)  // More creative

max_tokens(tokens: u32)

Maximum number of tokens to generate.

.max_tokens(1024)
.max_tokens(4096)

top_p(p: f32)

Nucleus sampling threshold. Range: 0.0 to 1.0.

.top_p(0.9)

top_k(k: i32)

Top-K sampling (a vLLM extension). Restricts sampling to the K most probable tokens; pass -1 to disable it (the vLLM default).

.top_k(50)

stop(sequences: Value)

Stop generation when encountering these sequences.

// Multiple sequences
.stop(json!(["END", "STOP", "\n\n"]))

// Single sequence
.stop(json!("---"))

Tool Calling Parameters

tools(tools: Value)

Define tools/functions that the model can call.

.tools(json!([
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]))

tool_choice(choice: Value)

Control tool selection behavior.

.tool_choice(json!("auto"))       // Model decides
.tool_choice(json!("none"))       // No tools
.tool_choice(json!("required"))   // Force tool use
.tool_choice(json!({
    "type": "function",
    "function": {"name": "get_weather"}
}))
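Putting `tools` and `tool_choice` together, the typical round trip is: send the request with tools, check whether generation stopped for tool calls, execute the tool, and send its output back as a `tool` message. The sketch below reads the call out of the response's `raw` JSON field rather than assuming the typed `ToolCall` layout, and the inline weather result is a stand-in for your real tool:

```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

let tools = json!([{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]);

let first = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Weather in Paris?"}]))
    .tools(tools.clone())
    .tool_choice(json!("auto"))
    .send()
    .await?;

if first.finish_reason.as_deref() == Some("tool_calls") {
    // Read the call from the raw OpenAI-compatible JSON so we don't
    // depend on the exact field layout of the typed ToolCall struct.
    let assistant_msg = first.raw["choices"][0]["message"].clone();
    let call_id = assistant_msg["tool_calls"][0]["id"].clone();

    let second = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "user", "content": "Weather in Paris?"},
            assistant_msg,
            // Stand-in result; substitute your tool's real output.
            {"role": "tool", "tool_call_id": call_id, "content": "{\"temp_c\": 18}"}
        ]))
        .tools(tools)
        .send()
        .await?;

    println!("{}", second.content.unwrap_or_default());
}
```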

Advanced Parameters

stream(enable: bool)

Enable streaming response.

.stream(true)

extra(params: Value)

Pass vLLM-specific or additional parameters.

.extra(json!({
    "chat_template_kwargs": {
        "think_mode": true
    },
    "reasoning_effort": "high"
}))

Sending Requests

send() - Non-streaming Response

Awaits the request and returns the complete response in a single payload.

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .send()
    .await?;

send_stream() - Streaming Response

Returns a stream for real-time output.

let mut stream = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([{"role": "user", "content": "Hello!"}]))
    .stream(true)
    .send_stream()
    .await?;
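How you consume the stream depends on the crate's chunk type, which this page doesn't pin down. As a sketch, assuming the stream implements `futures::Stream` and each chunk exposes the raw OpenAI-style streaming JSON via a `raw` field:

```rust
use futures_util::StreamExt; // assumption: the stream implements futures::Stream

while let Some(chunk) = stream.next().await {
    let chunk = chunk?;
    // Assumed chunk shape: raw OpenAI-compatible streaming JSON, where
    // incremental text arrives under choices[0].delta.content.
    if let Some(delta) = chunk.raw["choices"][0]["delta"]["content"].as_str() {
        print!("{delta}");
    }
}
```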

See Streaming for detailed streaming documentation.

Response Structure

ChatCompletionResponse

| Field | Type | Description |
|---------------------|--------------------------|-------------------------------------|
| `raw` | `Value` | Raw JSON response |
| `id` | `String` | Response ID |
| `object` | `String` | Object type |
| `model` | `String` | Model used |
| `content` | `Option<String>` | Generated content |
| `reasoning_content` | `Option<String>` | Reasoning content (thinking models) |
| `tool_calls` | `Option<Vec<ToolCall>>` | Tool calls made |
| `finish_reason` | `Option<String>` | Why generation stopped |
| `usage` | `Option<Usage>` | Token usage statistics |

Example Usage

let response = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "What is 2+2?"}
    ]))
    .send()
    .await?;

// Access content
println!("Content: {}", response.content.unwrap_or_default());

// Check for reasoning (thinking models)
if let Some(reasoning) = response.reasoning_content {
    println!("Reasoning: {}", reasoning);
}

// Check finish reason
match response.finish_reason.as_deref() {
    Some("stop") => println!("Natural stop"),
    Some("length") => println!("Max tokens reached"),
    Some("tool_calls") => println!("Tool calls made"),
    _ => {}
}

// Token usage
if let Some(usage) = response.usage {
    println!("Prompt tokens: {}", usage.prompt_tokens);
    println!("Completion tokens: {}", usage.completion_tokens);
    println!("Total tokens: {}", usage.total_tokens);
}

Complete Example

use vllm_client::{VllmClient, json};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = VllmClient::new("http://localhost:8000/v1");

    let response = client.chat.completions().create()
        .model("Qwen/Qwen2.5-72B-Instruct")
        .messages(json!([
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": "Write a function to reverse a string in Rust"}
        ]))
        .temperature(0.7)
        .max_tokens(1024)
        .top_p(0.9)
        .send()
        .await?;

    if let Some(content) = response.content {
        println!("{}", content);
    }

    Ok(())
}

Multi-turn Conversation

use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");

// First message
let response1 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"}
    ]))
    .send()
    .await?;

// Continue conversation
let response2 = client.chat.completions().create()
    .model("Qwen/Qwen2.5-72B-Instruct")
    .messages(json!([
        {"role": "user", "content": "My name is Alice"},
        {"role": "assistant", "content": response1.content.unwrap()},
        {"role": "user", "content": "What's my name?"}
    ]))
    .send()
    .await?;
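For longer conversations, keeping the history in a `Vec` and re-serializing it each turn avoids rebuilding the array by hand. A sketch assuming `json!` is the re-exported `serde_json` macro (so a `Vec` of values interpolates into a JSON array):

```rust
use vllm_client::{VllmClient, json};

let client = VllmClient::new("http://localhost:8000/v1");
let model = "Qwen/Qwen2.5-72B-Instruct";

let mut history = vec![json!({"role": "user", "content": "My name is Alice"})];

let first = client.chat.completions().create()
    .model(model)
    .messages(json!(history))
    .send()
    .await?;

// Record the assistant's turn, then append the next user turn.
history.push(json!({
    "role": "assistant",
    "content": first.content.unwrap_or_default()
}));
history.push(json!({"role": "user", "content": "What's my name?"}));

let second = client.chat.completions().create()
    .model(model)
    .messages(json!(history))
    .send()
    .await?;
```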

See Also