Skip to content

Test suite standardization and cleanup #42

@antonpibm

Description

@antonpibm

The test suite has accumulated several rough edges that make it hard to run, hard
to interpret, and hard to trust as a CI signal. This issue collects eight
specific problems and proposes fixes grounded in the current repo layout.

Current state (snapshot)

  • pyproject.toml declares five markers: gpu, vllm, slow, deep,
    requires_model.
  • Makefile chains targets sequentially: test-alltest-cpu (unit, composer,
    hf) → test-gpu (vllm, integration). Each target invokes a separate pytest
    process with -v -s --tb=short and no -n, no -x, no -p
    reporting plugin.
  • tests/conftest.py already pins each xdist worker to one GPU via
    CUDA_VISIBLE_DEVICES round-robin (pytest_configure).
  • TP/PP tests under tests/vllm/ (e.g. test_pipeline_parallelism_generation.py,
    test_tp_integration.py, test_tp_lora.py) already gate themselves on
    _visible_cuda_device_count() < 2 via pytest.mark.skipif.

Details

1. Marker usage is inconsistent

The five markers are declared but unevenly applied. Many tests that take

30 s have no slow marker; many GPU-required tests have no gpu marker
(they rely on implicit failures or runtime checks); vllm is sometimes used
as a skipif reason rather than a marker.

A grep across tests/ shows mixed conventions:

  • Some files use pytestmark = [pytest.mark.vllm, pytest.mark.gpu, pytest.mark.slow, ...] (good — see tests/vllm/test_pipeline_parallelism_generation.py:54).
  • Others use only @pytest.mark.skipif(not _VLLM_AVAILABLE, ...) without the vllm marker.
  • Most tests in tests/composer/ and tests/integration/ carry no markers at all.

This makes -m "not slow" or -m gpu unreliable as filters.

2. Multi-GPU tests under -n N

tests/conftest.py:13-38 pins each xdist worker to a single GPU. This is
correct for single-GPU tests but breaks multi-GPU tests: a test marked for
TP=2 or PP=2 sees only one GPU because its worker was pinned to one.

Tests that require multiple GPUs (test_tp_integration.py, test_tp_lora.py,
test_pipeline_parallelism_generation.py) skip under -n auto for this reason
even on a 2+ GPU machine. The skip is silent — see problem (6).

The -n value also matters: -n auto uses CPU count, which on a typical 8-core
GPU box is 8, far more workers than GPUs. Without per-test resource declarations,
this leads to GPU contention and OOM.

3. make test-all short-circuits on first failure

make test-all runs five pytest processes sequentially. If any one exits
non-zero, Make halts and the remaining suites are never invoked — they do
not appear as failed, skipped, errored, or xfailed. They are simply absent from
output. There is no aggregated cross-suite summary.

4. Test results are hard to consume

Output is plain pytest text streamed to the terminal. There is no machine-
readable summary (no JUnit XML, no JSON), no HTML report, and no aggregated
pass/fail count across the five suite invocations. CI integration and
historical tracking both require this.

5. Parallelism is opt-in instead of default

-n is not in addopts and not in Makefile PYTEST_FLAGS. Every developer
runs serially by default and discovers -n independently. With tests/vllm/

  • tests/integration/ taking ~22 minutes serially, this is a real productivity
    cost.

6. Skip reasons are invisible in default output

A run ending in 243 passed, 1 skipped, 17 warnings in 1329.59s does not tell
the developer what was skipped or why. The skip reason is recorded by
pytest but only printed under -rs (or -rsxX for full reporting).

7. Sequential per-folder pytest invocations overwrite earlier summaries

make test-all runs five separate pytest processes. Each prints its own
===== N passed in Xs ===== summary. By the time the run finishes, the
terminal scrollback contains five independent summaries with no aggregation —
and if the last suite is large (e.g. tests/vllm/), earlier summaries scroll
off-screen on terminals with limited history. Reviewing results requires
manually scrolling and mentally combining five tail blocks.

8. Chronically-failing integration tests

=================================== FAILURES ===================================
________ test_hf_composed_adapter_indices[granite-4.0-micro-mid-131072] ________
[gw0] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:227: in test_hf_composed_adapter_indices
    hf_model(input_ids=input_ids)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/granite_switch/hf/modeling_granite_switch.py:490: in forward
    logits = logits / self.config.logits_scaling
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.50 GiB. GPU 0 has a total capacity of 79.14 GiB of which 24.37 GiB is free. Process 947164 has 54.76 GiB memory in use. Of the allocated memory 47.29 GiB is allocated by PyTorch, and 6.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
_______ test_hf_composed_adapter_indices[granite-4.0-micro-late-131072] ________
[gw3] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:227: in test_hf_composed_adapter_indices
    hf_model(input_ids=input_ids)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/granite_switch/hf/modeling_granite_switch.py:490: in forward
    logits = logits / self.config.logits_scaling
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.50 GiB. GPU 0 has a total capacity of 79.14 GiB of which 24.08 GiB is free. Process 947174 has 55.05 GiB memory in use. Of the allocated memory 47.28 GiB is allocated by PyTorch, and 7.27 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
______________ test_hf_vllm_argmax_equivalence[granite-4.0-micro] ______________
[gw3] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:404: in test_hf_vllm_argmax_equivalence
    assert not failures, (
E   AssertionError: HF and vLLM disagree on top-1 token for 3 position(s) (base_model=ibm-granite/granite-4.0-micro):
E       [early] mismatch at positions [0, 2, 3]
E         hf_argmax   = [596, 287, 596, 596, 337, 337, 596]
E         vllm_argmax = [337, 287, 337, 337, 337, 337, 596]
E       [mid] mismatch at positions [0, 1, 2, 3, 5, 6]
E         hf_argmax   = [596, 596, 596, 596, 287, 337, 596]
E         vllm_argmax = [337, 337, 337, 337, 287, 596, 337]
E       [late] mismatch at positions [0, 1, 2, 3, 4, 5]
E         hf_argmax   = [596, 596, 596, 596, 596, 596, 287]
E         vllm_argmax = [337, 337, 337, 337, 337, 337, 287]
E       This typically indicates a vLLM gain-compensation bug: the switch
E       produced wrong adapter_indices, the wrong LoRA was applied, and
E       the downstream logits diverged enough to flip the top token.
E   assert not [('early', [0, 2, 3], [596, 287, 596, 596, 337, 337, ...], [337, 287, 337, 337, 337, 337, ...]), ('mid', [0, 1, 2, 3, 5, 6], [596, 596, 596, 596, 287, 337, ...], [337, 337, 337, 337, 287, 596, ...]), ('late', [0, 1, 2, 3, 4, 5], [596, 596, 596, 596, 596, 596, ...], [337, 337, 337, 337, 337, 337, ...])]
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: 56 warnings
  /workspace/granite-switch/.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/integration/test_hf_to_vllm_weights.py::TestSingleSwitchForwardEquivalence::test_forward_logit_equivalence
  /workspace/granite-switch/.venv/lib/python3.12/site-packages/flashinfer/gemm/kernels/grouped_gemm_masked_blackwell.py:2059: DeprecationWarning: tcgen05.OperandMajorMode is deprecated, use cute.nvgpu.OperandMajorMode instead
    a_major_mode: tcgen05.OperandMajorMode,

tests/integration/test_hf_to_vllm_weights.py::TestSingleSwitchForwardEquivalence::test_forward_logit_equivalence
  /workspace/granite-switch/.venv/lib/python3.12/site-packages/flashinfer/gemm/kernels/grouped_gemm_masked_blackwell.py:2061: DeprecationWarning: tcgen05.OperandMajorMode is deprecated, use cute.nvgpu.OperandMajorMode instead
    b_major_mode: tcgen05.OperandMajorMode,

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_composed_adapter_indices[granite-4.0-micro-mid-131072]
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_composed_adapter_indices[granite-4.0-micro-late-131072]
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_vllm_argmax_equivalence[granite-4.0-micro]

9. Skipped tests understanding

There are skipped tests, we should verify the reason and better reflect the impact


Proposed fixes

Fix 1 + 6: marker audit + default skip-reason reporting

Action:

  1. Audit every test file under tests/ and apply markers consistently:
    • gpu — any test that imports torch and uses CUDA
    • vllm — any test that imports vllm (replaces skipif(not _VLLM_AVAILABLE))
    • slow — any test or class that takes > 30 s
    • multi_gpunew marker, see fix (2)
    • requires_model — already used; keep as-is
  2. Use file-level pytestmark = [...] for whole-file gating (already done in
    tests/vllm/test_pipeline_parallelism_generation.py:54); use class-level
    for partial coverage.
  3. Add -rfsxX to addopts in pyproject.toml so failures, skips, xfails,
    and xpasses are listed by reason in every run summary.
  4. Promote .claude/commands/write-tests.md to a first-class skill so new
    tests follow the audited conventions automatically.
    The existing command
    already encodes the right rules — resource tier (Axis 1) maps to markers,
    scope (Axis 2) maps to tests/ subdirectories, and the "Marker Rules"
    section enforces the gating discipline this issue is trying to establish
    (e.g. "If a test has skipif(not _CUDA_AVAILABLE), it MUST also have
    @pytest.mark.gpu"
    ). Action items:
    • Update the skill to include the new multi_gpu marker and its
      interaction with the conftest worker-pinning logic (see fix (2)).
    • Update the resource-tier table to drop any markers removed by the
      audit and add ones it introduces.
    • Reference the skill from CLAUDE.md and docs/ so contributors see
      it before writing tests.
    • Once the audit lands, the skill is the enforcement mechanism that
      keeps the suite from drifting back into inconsistency.

File to edit: pyproject.toml [tool.pytest.ini_options]:

addopts = ["--ignore=tests/_legacy", "-m", "not deep", "-rfsxX"]
markers = [
    "gpu: requires CUDA GPU",
    "multi_gpu: requires 2+ CUDA GPUs (TP/PP tests)",
    "vllm: requires vLLM installed",
    "slow: takes > 30s",
    "deep: expensive code-theory tests",
    "requires_model: needs a real model checkpoint",
]

Fix 2: dedicated marker + targeted skip logic for multi-GPU

Action:

  1. Add a multi_gpu marker (see above).
  2. Apply it to tests/vllm/test_tp_integration.py, tests/vllm/test_tp_lora.py,
    tests/vllm/test_pipeline_parallelism_generation.py (it is already on the
    last via pytestmark, but make it the standard path).
  3. Modify tests/conftest.py:pytest_configure to not pin
    CUDA_VISIBLE_DEVICES for tests carrying multi_gpu. The cleanest way:
    detect the marker per-test in a pytest_collection_modifyitems hook and
    un-pin (or skip) accordingly. Alternatively, run multi-GPU tests in a
    separate, non-xdist invocation.
  4. Provide a dedicated Makefile target test-multi-gpu that runs the
    multi_gpu-marked subset without -n.

Why both: -n worker pinning and multi-GPU tests are fundamentally
incompatible. Splitting them is simpler than trying to make them coexist.

Fix 3 + 7: replace Makefile chain with single pytest invocation

Action: collapse test-all into one pytest run that collects from all
five directories at once. One process, one summary, no short-circuit.

PYTEST_FLAGS = -v --tb=short -rfsxX

test-all:
	$(PYTHON) -m pytest tests/unit tests/composer tests/hf tests/vllm tests/integration $(PYTEST_FLAGS)

test-cpu:
	$(PYTHON) -m pytest tests/unit tests/composer tests/hf $(PYTEST_FLAGS) -m "not gpu"

test-gpu:
	$(PYTHON) -m pytest tests/vllm tests/integration $(PYTEST_FLAGS) -m "gpu and not multi_gpu"

test-multi-gpu:
	$(PYTHON) -m pytest tests/vllm $(PYTEST_FLAGS) -m multi_gpu

Trade-off: dropping -s from PYTEST_FLAGS lets pytest capture stdout per
test, which is required for clean parallel output. The CLAUDE.md guidance
("always use -s") is for interactive debugging; the Makefile is for full-
suite runs. Developers who want streaming output can pass -s directly.

Also removes the per-suite Make targets' coupling, so a failure in one
directory does not hide later directories' results. The single pytest
process surfaces every failure in one summary.

Fix 4: machine-readable + human-readable reports

Action: add JUnit XML + optional HTML reporting to addopts:

addopts = [
    "--ignore=tests/_legacy",
    "-m", "not deep",
    "-rfsxX",
    "--junitxml=test-results.xml",
]

For HTML, add pytest-html to the test extra:

test = ["pytest", "pytest-html", "pytest-xdist", "bitsandbytes", "optimum-quanto"]

Then runs can do pytest --html=report.html --self-contained-html.

This unblocks CI integration (any CI consumes JUnit XML natively) and gives
local developers a browsable report.

Fix 4b: make [test] extra self-sufficient for full coverage

Current state: pyproject.toml:39 defines:

test = ["pytest", "bitsandbytes", "optimum-quanto"]

This is missing hf, vllm, and compose. A user running pip install -e ".[test]" cannot actually execute tests/hf/, tests/vllm/, tests/composer/,
or tests/integration/ — every import there fails. Today contributors install
[dev] (which transitively pulls hf/vllm/compose) to run tests, but [dev]
also drags in tooling that has nothing to do with test execution.

Action: redefine [test] to be the single authoritative "I want to run
the full test suite" extra. After the changes from fixes (4) and (5), it
should look like:

test = [
    "granite-switch[hf,vllm,compose]",
    "pytest",
    "pytest-xdist",
    "pytest-html",
    "bitsandbytes",
    "optimum-quanto",
]

Rules:

  • [test] = everything required to run make test-all and produce both
    human- and machine-readable reports. Nothing more, nothing less.
  • [dev] keeps its role as the contributor convenience extra and references
    [test] rather than duplicating its contents:
    dev = ["granite-switch[test]", "ruff"]   # add lint/format/etc here
  • Document in README.md and CLAUDE.md: "to run tests, install with
    pip install -e \".[test]\""
    . This is the single command a CI image or a
    new contributor needs.
  • The [test] extra also becomes the source of truth that CI installs from,
    so test-runtime dependencies cannot drift between local and CI environments.

Why this matters for this issue: fix (4) adds pytest-xdist and
pytest-html, fix (5) makes -n a default. Both new dependencies need a
home. Putting them in [dev] repeats today's mistake (test deps scattered
across extras); putting them in [test] while leaving [test] incomplete is
worse. Fixing [test] to be self-sufficient is a prerequisite for landing
fixes (4) and (5) cleanly.

Fix 5: parallelism by default

Action: add pytest-xdist to the test extra (above) and put a sensible
default in addopts. Two reasonable choices:

  • "-n", "auto" — uses CPU count; risky on small/single-GPU boxes.
  • "-n", "4" — fixed, matches current ad-hoc usage.

Recommend -n 4 as the static default with a note in CONTRIBUTING/README that
multi-GPU tests need make test-multi-gpu separately. CPU-bound suites (unit,
composer, hf, integration) parallelize cleanly; vLLM single-GPU tests also work
under the existing worker-pinning logic.

addopts = [
    "--ignore=tests/_legacy",
    "-m", "not deep",
    "-rfsxX",
    "-n", "4",
    "--junitxml=test-results.xml",
]

Developers wanting serial execution use pytest -n0; debugging uses pytest -s -n0.

Fix 8+9: identify and quarantine chronically-failing and skipped tests

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions