feat: vLLM (in-process) backend never populates mot.usage #696

@planetf1

Summary

The in-process vLLM backend (mellea/backends/vllm.py) never sets mot.usage, so callers always receive None for token counts regardless of whether the generation succeeded.

Affected code

VLLMBackend.post_processing records tool calls, the generate log, and telemetry metadata, but contains no usage-population step.

The processing method accumulates only the decoded text from vllm.RequestOutput.outputs[0].text; the token ID arrays are discarded.

How other backends handle this

Every other backend that can compute token counts does so unconditionally in its post-processing step:

| Backend | Source of counts |
| --- | --- |
| HuggingFace | GenerateDecoderOnlyOutput.sequences shape |
| OpenAI / LiteLLM | usage field in API response |
| Ollama | prompt_eval_count / eval_count in response |
| WatsonX | usage field in API response |

vllm.RequestOutput exposes both prompt_token_ids and outputs[0].token_ids, so counts can be derived without any extra API call.

Expected behaviour

mot.usage should be set to {"prompt_tokens": N, "completion_tokens": M, "total_tokens": N+M} after every successful vLLM generation, consistent with other backends.
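
As a rough illustration of the expected behaviour, the sketch below derives the usage dict from the two token ID arrays mentioned above. It is not the mellea implementation: `usage_from_request_output` is a hypothetical helper name, and `FakeRequestOutput` / `FakeCompletionOutput` are stand-ins that mirror only the `prompt_token_ids`, `outputs`, `text`, and `token_ids` attributes of the vLLM objects, so the snippet runs without vLLM installed. It also assumes that a missing `prompt_token_ids` should count as zero prompt tokens.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FakeCompletionOutput:
    """Stand-in for vllm.CompletionOutput (only the fields used here)."""
    text: str
    token_ids: Sequence[int]


@dataclass
class FakeRequestOutput:
    """Stand-in for vllm.RequestOutput (only the fields used here)."""
    prompt_token_ids: Optional[Sequence[int]]
    outputs: List[FakeCompletionOutput]


def usage_from_request_output(request_output) -> dict:
    """Hypothetical helper: build a usage dict from a RequestOutput-like object."""
    # Assumption: treat an absent prompt_token_ids as zero prompt tokens.
    prompt_tokens = len(request_output.prompt_token_ids or [])
    # Only the first completion is counted, matching the use of outputs[0].text.
    completion_tokens = len(request_output.outputs[0].token_ids)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }


# Example with made-up token IDs: 4 prompt tokens, 3 completion tokens.
example = FakeRequestOutput(
    prompt_token_ids=[1, 15, 27, 9],
    outputs=[FakeCompletionOutput(text="hi there", token_ids=[42, 7, 2])],
)
usage = usage_from_request_output(example)
print(usage)
```

In the real backend, the result of such a helper would presumably be assigned to mot.usage inside VLLMBackend.post_processing, which would require the processing step to retain the token ID arrays (or their lengths) instead of only the decoded text.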

Notes
