The test suite has accumulated several rough edges that make it hard to run, hard
to interpret, and hard to trust as a CI signal. This issue collects eight
specific problems and proposes fixes grounded in the current repo layout.
Current state (snapshot)
pyproject.toml declares five markers: gpu, vllm, slow, deep,
requires_model.
Makefile chains targets sequentially: test-all → test-cpu (unit, composer,
hf) → test-gpu (vllm, integration). Each target invokes a separate pytest
process with -v -s --tb=short and no -n, no -x, no -p
reporting plugin.
tests/conftest.py already pins each xdist worker to one GPU via
CUDA_VISIBLE_DEVICES round-robin (pytest_configure).
- TP/PP tests under
tests/vllm/ (e.g. test_pipeline_parallelism_generation.py,
test_tp_integration.py, test_tp_lora.py) already gate themselves on
_visible_cuda_device_count() < 2 via pytest.mark.skipif.
Details
1. Marker usage is inconsistent
The five markers are declared but unevenly applied. Many tests that take
30 s have no slow marker; many GPU-required tests have no gpu marker
(they rely on implicit failures or runtime checks); vllm is sometimes used
as a skipif reason rather than a marker.
A grep across tests/ shows mixed conventions:
- Some files use
pytestmark = [pytest.mark.vllm, pytest.mark.gpu, pytest.mark.slow, ...] (good — see tests/vllm/test_pipeline_parallelism_generation.py:54).
- Others use only
@pytest.mark.skipif(not _VLLM_AVAILABLE, ...) without the vllm marker.
- Most tests in
tests/composer/ and tests/integration/ carry no markers at all.
This makes -m "not slow" or -m gpu unreliable as filters.
2. Multi-GPU tests under -n N
tests/conftest.py:13-38 pins each xdist worker to a single GPU. This is
correct for single-GPU tests but breaks multi-GPU tests: a test marked for
TP=2 or PP=2 sees only one GPU because its worker was pinned to one.
Tests that require multiple GPUs (test_tp_integration.py, test_tp_lora.py,
test_pipeline_parallelism_generation.py) skip under -n auto for this reason
even on a 2+ GPU machine. The skip is silent — see problem (6).
The -n value also matters: -n auto uses CPU count, which on a typical 8-core
GPU box is 8, far more workers than GPUs. Without per-test resource declarations,
this leads to GPU contention and OOM.
3. make test-all short-circuits on first failure
make test-all runs five pytest processes sequentially. If any one exits
non-zero, Make halts and the remaining suites are never invoked — they do
not appear as failed, skipped, errored, or xfailed. They are simply absent from
output. There is no aggregated cross-suite summary.
4. Test results are hard to consume
Output is plain pytest text streamed to the terminal. There is no machine-
readable summary (no JUnit XML, no JSON), no HTML report, and no aggregated
pass/fail count across the five suite invocations. CI integration and
historical tracking both require this.
5. Parallelism is opt-in instead of default
-n is not in addopts and not in Makefile PYTEST_FLAGS. Every developer
runs serially by default and discovers -n independently. With tests/vllm/
tests/integration/ taking ~22 minutes serially, this is a real productivity
cost.
6. Skip reasons are invisible in default output
A run ending in 243 passed, 1 skipped, 17 warnings in 1329.59s does not tell
the developer what was skipped or why. The skip reason is recorded by
pytest but only printed under -rs (or -rsxX for full reporting).
7. Sequential per-folder pytest invocations overwrite earlier summaries
make test-all runs five separate pytest processes. Each prints its own
===== N passed in Xs ===== summary. By the time the run finishes, the
terminal scrollback contains five independent summaries with no aggregation —
and if the last suite is large (e.g. tests/vllm/), earlier summaries scroll
off-screen on terminals with limited history. Reviewing results requires
manually scrolling and mentally combining five tail blocks.
8. Chronically-failing integration tests
=================================== FAILURES ===================================
________ test_hf_composed_adapter_indices[granite-4.0-micro-mid-131072] ________
[gw0] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:227: in test_hf_composed_adapter_indices
hf_model(input_ids=input_ids)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/granite_switch/hf/modeling_granite_switch.py:490: in forward
logits = logits / self.config.logits_scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.50 GiB. GPU 0 has a total capacity of 79.14 GiB of which 24.37 GiB is free. Process 947164 has 54.76 GiB memory in use. Of the allocated memory 47.29 GiB is allocated by PyTorch, and 6.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
_______ test_hf_composed_adapter_indices[granite-4.0-micro-late-131072] ________
[gw3] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:227: in test_hf_composed_adapter_indices
hf_model(input_ids=input_ids)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/granite_switch/hf/modeling_granite_switch.py:490: in forward
logits = logits / self.config.logits_scaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.50 GiB. GPU 0 has a total capacity of 79.14 GiB of which 24.08 GiB is free. Process 947174 has 55.05 GiB memory in use. Of the allocated memory 47.28 GiB is allocated by PyTorch, and 7.27 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
______________ test_hf_vllm_argmax_equivalence[granite-4.0-micro] ______________
[gw3] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:404: in test_hf_vllm_argmax_equivalence
assert not failures, (
E AssertionError: HF and vLLM disagree on top-1 token for 3 position(s) (base_model=ibm-granite/granite-4.0-micro):
E [early] mismatch at positions [0, 2, 3]
E hf_argmax = [596, 287, 596, 596, 337, 337, 596]
E vllm_argmax = [337, 287, 337, 337, 337, 337, 596]
E [mid] mismatch at positions [0, 1, 2, 3, 5, 6]
E hf_argmax = [596, 596, 596, 596, 287, 337, 596]
E vllm_argmax = [337, 337, 337, 337, 287, 596, 337]
E [late] mismatch at positions [0, 1, 2, 3, 4, 5]
E hf_argmax = [596, 596, 596, 596, 596, 596, 287]
E vllm_argmax = [337, 337, 337, 337, 337, 337, 287]
E This typically indicates a vLLM gain-compensation bug: the switch
E produced wrong adapter_indices, the wrong LoRA was applied, and
E the downstream logits diverged enough to flip the top token.
E assert not [('early', [0, 2, 3], [596, 287, 596, 596, 337, 337, ...], [337, 287, 337, 337, 337, 337, ...]), ('mid', [0, 1, 2, 3, 5, 6], [596, 596, 596, 596, 287, 337, ...], [337, 337, 337, 337, 287, 596, ...]), ('late', [0, 1, 2, 3, 4, 5], [596, 596, 596, 596, 596, 596, ...], [337, 337, 337, 337, 337, 337, ...])]
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: 56 warnings
/workspace/granite-switch/.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
warnings.warn(
tests/integration/test_hf_to_vllm_weights.py::TestSingleSwitchForwardEquivalence::test_forward_logit_equivalence
/workspace/granite-switch/.venv/lib/python3.12/site-packages/flashinfer/gemm/kernels/grouped_gemm_masked_blackwell.py:2059: DeprecationWarning: tcgen05.OperandMajorMode is deprecated, use cute.nvgpu.OperandMajorMode instead
a_major_mode: tcgen05.OperandMajorMode,
tests/integration/test_hf_to_vllm_weights.py::TestSingleSwitchForwardEquivalence::test_forward_logit_equivalence
/workspace/granite-switch/.venv/lib/python3.12/site-packages/flashinfer/gemm/kernels/grouped_gemm_masked_blackwell.py:2061: DeprecationWarning: tcgen05.OperandMajorMode is deprecated, use cute.nvgpu.OperandMajorMode instead
b_major_mode: tcgen05.OperandMajorMode,
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_composed_adapter_indices[granite-4.0-micro-mid-131072]
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_composed_adapter_indices[granite-4.0-micro-late-131072]
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_vllm_argmax_equivalence[granite-4.0-micro]
9. Skipped tests understanding
There are skipped tests, we should verify the reason and better reflect the impact
Proposed fixes
Fix 1 + 6: marker audit + default skip-reason reporting
Action:
- Audit every test file under
tests/ and apply markers consistently:
gpu — any test that imports torch and uses CUDA
vllm — any test that imports vllm (replaces skipif(not _VLLM_AVAILABLE))
slow — any test or class that takes > 30 s
multi_gpu — new marker, see fix (2)
requires_model — already used; keep as-is
- Use file-level
pytestmark = [...] for whole-file gating (already done in
tests/vllm/test_pipeline_parallelism_generation.py:54); use class-level
for partial coverage.
- Add
-rfsxX to addopts in pyproject.toml so failures, skips, xfails,
and xpasses are listed by reason in every run summary.
- Promote
.claude/commands/write-tests.md to a first-class skill so new
tests follow the audited conventions automatically. The existing command
already encodes the right rules — resource tier (Axis 1) maps to markers,
scope (Axis 2) maps to tests/ subdirectories, and the "Marker Rules"
section enforces the gating discipline this issue is trying to establish
(e.g. "If a test has skipif(not _CUDA_AVAILABLE), it MUST also have
@pytest.mark.gpu"). Action items:
- Update the skill to include the new
multi_gpu marker and its
interaction with the conftest worker-pinning logic (see fix (2)).
- Update the resource-tier table to drop any markers removed by the
audit and add ones it introduces.
- Reference the skill from
CLAUDE.md and docs/ so contributors see
it before writing tests.
- Once the audit lands, the skill is the enforcement mechanism that
keeps the suite from drifting back into inconsistency.
File to edit: pyproject.toml [tool.pytest.ini_options]:
addopts = ["--ignore=tests/_legacy", "-m", "not deep", "-rfsxX"]
markers = [
"gpu: requires CUDA GPU",
"multi_gpu: requires 2+ CUDA GPUs (TP/PP tests)",
"vllm: requires vLLM installed",
"slow: takes > 30s",
"deep: expensive code-theory tests",
"requires_model: needs a real model checkpoint",
]
Fix 2: dedicated marker + targeted skip logic for multi-GPU
Action:
- Add a
multi_gpu marker (see above).
- Apply it to
tests/vllm/test_tp_integration.py, tests/vllm/test_tp_lora.py,
tests/vllm/test_pipeline_parallelism_generation.py (it is already on the
last via pytestmark, but make it the standard path).
- Modify
tests/conftest.py:pytest_configure to not pin
CUDA_VISIBLE_DEVICES for tests carrying multi_gpu. The cleanest way:
detect the marker per-test in a pytest_collection_modifyitems hook and
un-pin (or skip) accordingly. Alternatively, run multi-GPU tests in a
separate, non-xdist invocation.
- Provide a dedicated Makefile target
test-multi-gpu that runs the
multi_gpu-marked subset without -n.
Why both: -n worker pinning and multi-GPU tests are fundamentally
incompatible. Splitting them is simpler than trying to make them coexist.
Fix 3 + 7: replace Makefile chain with single pytest invocation
Action: collapse test-all into one pytest run that collects from all
five directories at once. One process, one summary, no short-circuit.
PYTEST_FLAGS = -v --tb=short -rfsxX
test-all:
$(PYTHON) -m pytest tests/unit tests/composer tests/hf tests/vllm tests/integration $(PYTEST_FLAGS)
test-cpu:
$(PYTHON) -m pytest tests/unit tests/composer tests/hf $(PYTEST_FLAGS) -m "not gpu"
test-gpu:
$(PYTHON) -m pytest tests/vllm tests/integration $(PYTEST_FLAGS) -m "gpu and not multi_gpu"
test-multi-gpu:
$(PYTHON) -m pytest tests/vllm $(PYTEST_FLAGS) -m multi_gpu
Trade-off: dropping -s from PYTEST_FLAGS lets pytest capture stdout per
test, which is required for clean parallel output. The CLAUDE.md guidance
("always use -s") is for interactive debugging; the Makefile is for full-
suite runs. Developers who want streaming output can pass -s directly.
Also removes the per-suite Make targets' coupling, so a failure in one
directory does not hide later directories' results. The single pytest
process surfaces every failure in one summary.
Fix 4: machine-readable + human-readable reports
Action: add JUnit XML + optional HTML reporting to addopts:
addopts = [
"--ignore=tests/_legacy",
"-m", "not deep",
"-rfsxX",
"--junitxml=test-results.xml",
]
For HTML, add pytest-html to the test extra:
test = ["pytest", "pytest-html", "pytest-xdist", "bitsandbytes", "optimum-quanto"]
Then runs can do pytest --html=report.html --self-contained-html.
This unblocks CI integration (any CI consumes JUnit XML natively) and gives
local developers a browsable report.
Fix 4b: make [test] extra self-sufficient for full coverage
Current state: pyproject.toml:39 defines:
test = ["pytest", "bitsandbytes", "optimum-quanto"]
This is missing hf, vllm, and compose. A user running pip install -e ".[test]" cannot actually execute tests/hf/, tests/vllm/, tests/composer/,
or tests/integration/ — every import there fails. Today contributors install
[dev] (which transitively pulls hf/vllm/compose) to run tests, but [dev]
also drags in tooling that has nothing to do with test execution.
Action: redefine [test] to be the single authoritative "I want to run
the full test suite" extra. After the changes from fixes (4) and (5), it
should look like:
test = [
"granite-switch[hf,vllm,compose]",
"pytest",
"pytest-xdist",
"pytest-html",
"bitsandbytes",
"optimum-quanto",
]
Rules:
[test] = everything required to run make test-all and produce both
human- and machine-readable reports. Nothing more, nothing less.
[dev] keeps its role as the contributor convenience extra and references
[test] rather than duplicating its contents:
dev = ["granite-switch[test]", "ruff"] # add lint/format/etc here
- Document in
README.md and CLAUDE.md: "to run tests, install with
pip install -e \".[test]\"". This is the single command a CI image or a
new contributor needs.
- The
[test] extra also becomes the source of truth that CI installs from,
so test-runtime dependencies cannot drift between local and CI environments.
Why this matters for this issue: fix (4) adds pytest-xdist and
pytest-html, fix (5) makes -n a default. Both new dependencies need a
home. Putting them in [dev] repeats today's mistake (test deps scattered
across extras); putting them in [test] while leaving [test] incomplete is
worse. Fixing [test] to be self-sufficient is a prerequisite for landing
fixes (4) and (5) cleanly.
Fix 5: parallelism by default
Action: add pytest-xdist to the test extra (above) and put a sensible
default in addopts. Two reasonable choices:
"-n", "auto" — uses CPU count; risky on small/single-GPU boxes.
"-n", "4" — fixed, matches current ad-hoc usage.
Recommend -n 4 as the static default with a note in CONTRIBUTING/README that
multi-GPU tests need make test-multi-gpu separately. CPU-bound suites (unit,
composer, hf, integration) parallelize cleanly; vLLM single-GPU tests also work
under the existing worker-pinning logic.
addopts = [
"--ignore=tests/_legacy",
"-m", "not deep",
"-rfsxX",
"-n", "4",
"--junitxml=test-results.xml",
]
Developers wanting serial execution use pytest -n0; debugging uses pytest -s -n0.
Fix 8+9: identify and quarantine chronically-failing and skipped tests
The test suite has accumulated several rough edges that make it hard to run, hard
to interpret, and hard to trust as a CI signal. This issue collects eight
specific problems and proposes fixes grounded in the current repo layout.
Current state (snapshot)
pyproject.tomldeclares five markers:gpu,vllm,slow,deep,requires_model.Makefilechains targets sequentially:test-all→test-cpu(unit, composer,hf) →
test-gpu(vllm, integration). Each target invokes a separatepytestprocess with
-v -s --tb=shortand no-n, no-x, no-preporting plugin.
tests/conftest.pyalready pins eachxdistworker to one GPU viaCUDA_VISIBLE_DEVICESround-robin (pytest_configure).tests/vllm/(e.g.test_pipeline_parallelism_generation.py,test_tp_integration.py,test_tp_lora.py) already gate themselves on_visible_cuda_device_count() < 2viapytest.mark.skipif.Details
1. Marker usage is inconsistent
The five markers are declared but unevenly applied. Many tests that take
A grep across
tests/shows mixed conventions:pytestmark = [pytest.mark.vllm, pytest.mark.gpu, pytest.mark.slow, ...](good — seetests/vllm/test_pipeline_parallelism_generation.py:54).@pytest.mark.skipif(not _VLLM_AVAILABLE, ...)without thevllmmarker.tests/composer/andtests/integration/carry no markers at all.This makes
-m "not slow"or-m gpuunreliable as filters.2. Multi-GPU tests under
-n Ntests/conftest.py:13-38pins each xdist worker to a single GPU. This iscorrect for single-GPU tests but breaks multi-GPU tests: a test marked for
TP=2 or PP=2 sees only one GPU because its worker was pinned to one.
Tests that require multiple GPUs (
test_tp_integration.py,test_tp_lora.py,test_pipeline_parallelism_generation.py) skip under-n autofor this reasoneven on a 2+ GPU machine. The skip is silent — see problem (6).
The
-nvalue also matters:-n autouses CPU count, which on a typical 8-coreGPU box is 8, far more workers than GPUs. Without per-test resource declarations,
this leads to GPU contention and OOM.
3.
make test-allshort-circuits on first failuremake test-allruns fivepytestprocesses sequentially. If any one exitsnon-zero, Make halts and the remaining suites are never invoked — they do
not appear as failed, skipped, errored, or xfailed. They are simply absent from
output. There is no aggregated cross-suite summary.
4. Test results are hard to consume
Output is plain pytest text streamed to the terminal. There is no machine-
readable summary (no JUnit XML, no JSON), no HTML report, and no aggregated
pass/fail count across the five suite invocations. CI integration and
historical tracking both require this.
5. Parallelism is opt-in instead of default
-nis not inaddoptsand not inMakefilePYTEST_FLAGS. Every developerruns serially by default and discovers
-nindependently. Withtests/vllm/tests/integration/taking ~22 minutes serially, this is a real productivitycost.
6. Skip reasons are invisible in default output
A run ending in
243 passed, 1 skipped, 17 warnings in 1329.59sdoes not tellthe developer what was skipped or why. The skip reason is recorded by
pytest but only printed under
-rs(or-rsxXfor full reporting).7. Sequential per-folder pytest invocations overwrite earlier summaries
make test-allruns five separate pytest processes. Each prints its own===== N passed in Xs =====summary. By the time the run finishes, theterminal scrollback contains five independent summaries with no aggregation —
and if the last suite is large (e.g.
tests/vllm/), earlier summaries scrolloff-screen on terminals with limited history. Reviewing results requires
manually scrolling and mentally combining five tail blocks.
8. Chronically-failing integration tests
9. Skipped tests understanding
There are skipped tests, we should verify the reason and better reflect the impact
Proposed fixes
Fix 1 + 6: marker audit + default skip-reason reporting
Action:
tests/and apply markers consistently:gpu— any test that imports torch and uses CUDAvllm— any test that importsvllm(replacesskipif(not _VLLM_AVAILABLE))slow— any test or class that takes > 30 smulti_gpu— new marker, see fix (2)requires_model— already used; keep as-ispytestmark = [...]for whole-file gating (already done intests/vllm/test_pipeline_parallelism_generation.py:54); use class-levelfor partial coverage.
-rfsxXtoaddoptsinpyproject.tomlso failures, skips, xfails,and xpasses are listed by reason in every run summary.
.claude/commands/write-tests.mdto a first-class skill so newtests follow the audited conventions automatically. The existing command
already encodes the right rules — resource tier (Axis 1) maps to markers,
scope (Axis 2) maps to
tests/subdirectories, and the "Marker Rules"section enforces the gating discipline this issue is trying to establish
(e.g. "If a test has
skipif(not _CUDA_AVAILABLE), it MUST also have@pytest.mark.gpu"). Action items:multi_gpumarker and itsinteraction with the conftest worker-pinning logic (see fix (2)).
audit and add ones it introduces.
CLAUDE.mdanddocs/so contributors seeit before writing tests.
keeps the suite from drifting back into inconsistency.
File to edit:
pyproject.toml[tool.pytest.ini_options]:Fix 2: dedicated marker + targeted skip logic for multi-GPU
Action:
multi_gpumarker (see above).tests/vllm/test_tp_integration.py,tests/vllm/test_tp_lora.py,tests/vllm/test_pipeline_parallelism_generation.py(it is already on thelast via
pytestmark, but make it the standard path).tests/conftest.py:pytest_configureto not pinCUDA_VISIBLE_DEVICESfor tests carryingmulti_gpu. The cleanest way:detect the marker per-test in a
pytest_collection_modifyitemshook andun-pin (or skip) accordingly. Alternatively, run multi-GPU tests in a
separate, non-xdist invocation.
test-multi-gputhat runs themulti_gpu-marked subset without-n.Why both:
-nworker pinning and multi-GPU tests are fundamentallyincompatible. Splitting them is simpler than trying to make them coexist.
Fix 3 + 7: replace Makefile chain with single pytest invocation
Action: collapse
test-allinto one pytest run that collects from allfive directories at once. One process, one summary, no short-circuit.
Trade-off: dropping
-sfromPYTEST_FLAGSlets pytest capture stdout pertest, which is required for clean parallel output. The CLAUDE.md guidance
("always use
-s") is for interactive debugging; the Makefile is for full-suite runs. Developers who want streaming output can pass
-sdirectly.Also removes the per-suite Make targets' coupling, so a failure in one
directory does not hide later directories' results. The single pytest
process surfaces every failure in one summary.
Fix 4: machine-readable + human-readable reports
Action: add JUnit XML + optional HTML reporting to
addopts:For HTML, add
pytest-htmlto thetestextra:Then runs can do
pytest --html=report.html --self-contained-html.This unblocks CI integration (any CI consumes JUnit XML natively) and gives
local developers a browsable report.
Fix 4b: make
[test]extra self-sufficient for full coverageCurrent state:
pyproject.toml:39defines:This is missing
hf,vllm, andcompose. A user runningpip install -e ".[test]"cannot actually executetests/hf/,tests/vllm/,tests/composer/,or
tests/integration/— every import there fails. Today contributors install[dev](which transitively pulls hf/vllm/compose) to run tests, but[dev]also drags in tooling that has nothing to do with test execution.
Action: redefine
[test]to be the single authoritative "I want to runthe full test suite" extra. After the changes from fixes (4) and (5), it
should look like:
Rules:
[test]= everything required to runmake test-alland produce bothhuman- and machine-readable reports. Nothing more, nothing less.
[dev]keeps its role as the contributor convenience extra and references[test]rather than duplicating its contents:README.mdandCLAUDE.md: "to run tests, install withpip install -e \".[test]\"". This is the single command a CI image or anew contributor needs.
[test]extra also becomes the source of truth that CI installs from,so test-runtime dependencies cannot drift between local and CI environments.
Why this matters for this issue: fix (4) adds
pytest-xdistandpytest-html, fix (5) makes-na default. Both new dependencies need ahome. Putting them in
[dev]repeats today's mistake (test deps scatteredacross extras); putting them in
[test]while leaving[test]incomplete isworse. Fixing
[test]to be self-sufficient is a prerequisite for landingfixes (4) and (5) cleanly.
Fix 5: parallelism by default
Action: add
pytest-xdistto thetestextra (above) and put a sensibledefault in
addopts. Two reasonable choices:"-n", "auto"— uses CPU count; risky on small/single-GPU boxes."-n", "4"— fixed, matches current ad-hoc usage.Recommend
-n 4as the static default with a note in CONTRIBUTING/README thatmulti-GPU tests need
make test-multi-gpuseparately. CPU-bound suites (unit,composer, hf, integration) parallelize cleanly; vLLM single-GPU tests also work
under the existing worker-pinning logic.
Developers wanting serial execution use
pytest -n0; debugging usespytest -s -n0.Fix 8+9: identify and quarantine chronically-failing and skipped tests