test: intermittent failures from vllm tests on LSF cluster #699

@planetf1

I'm seeing intermittent failures from the vllm tests on the LSF cluster when run with:

uv run --all-extras --all-groups pytest --isolate-heavy -v

For example:

==== 723 passed, 142 skipped, 2 xfailed, 90 warnings in 1572.83s (0:26:12) =====

when all worked well, and

FAILED test/backends/test_openai_vllm.py::test_instruct - openai.NotFoundErro...
FAILED test/backends/test_openai_vllm.py::test_multiturn - openai.NotFoundErr...
FAILED test/backends/test_openai_vllm.py::test_chat - openai.NotFoundError: E...
FAILED test/backends/test_openai_vllm.py::test_chat_stream - openai.NotFoundE...
FAILED test/backends/test_openai_vllm.py::test_format - openai.NotFoundError:...
FAILED test/backends/test_openai_vllm.py::test_generate_from_raw - openai.Not...
FAILED test/backends/test_openai_vllm.py::test_generate_from_raw_with_format
= 7 failed, 716 passed, 142 skipped, 2 xfailed, 90 warnings in 1409.38s (0:23:29) =

at other times.

Across multiple runs, the failure rate appears to be roughly 50-75%.

On further investigation, the underlying error in all of these cases is:

E               openai.NotFoundError: Error code: 404 - {'error': {'message': 'The model `ibm-granite/granite-4.0-micro` does not exist.', 'type': 'NotFoundError', 'param': None, 'code': 404}}

Question to pursue: how is the vllm server initialized when tests are run with uv on a GPU-enabled cluster? Clearly we sometimes get access to a vllm environment with the right model, and at other times we don't.
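
One way to narrow this down might be to ask the server which models it is actually serving before the tests start. The sketch below is a rough diagnostic, not part of the test suite: it queries the `/v1/models` endpoint that OpenAI-compatible servers expose and checks whether the model from the error message is listed. The base URL is an assumption (vllm's default port on localhost); the tests may point somewhere else on the cluster. The helper names `served_model_ids`, `model_is_served` and `fetch_models` are hypothetical.

```python
import json
import urllib.request

# Model name taken from the 404 error message above.
EXPECTED_MODEL = "ibm-granite/granite-4.0-micro"

# ASSUMPTION: the server under test listens here. Adjust to whatever
# base URL the test fixtures actually use on the cluster.
BASE_URL = "http://localhost:8000/v1"


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def model_is_served(payload: dict, expected: str = EXPECTED_MODEL) -> bool:
    """Return True if the expected model appears in the /v1/models payload."""
    return expected in served_model_ids(payload)


def fetch_models(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET {base_url}/models and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)
```

Calling `served_model_ids(fetch_models())` on the compute node just before the vllm tests run would show which models the server reports. If a different model is listed on failing runs, that would suggest the tests are reaching a server that was started for something else (for example by another job on the same host) rather than one started for this test session.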
