Test suite standardization and cleanup

The test suite has accumulated several rough edges that make it hard to run, hard
to interpret, and hard to trust as a CI signal. This issue collects eight
specific problems and proposes fixes grounded in the current repo layout.

## Current state (snapshot)

- `pyproject.toml` declares five markers: `gpu`, `vllm`, `slow`, `deep`,
  `requires_model`.
- `Makefile` chains targets sequentially: `test-all` → `test-cpu` (unit, composer,
  hf) → `test-gpu` (vllm, integration). Each target invokes a separate `pytest`
  process with `-v -s --tb=short` and **no** `-n`, **no** `-x`, **no** `-p`
  reporting plugin.
- `tests/conftest.py` already pins each `xdist` worker to one GPU via
  `CUDA_VISIBLE_DEVICES` round-robin (`pytest_configure`).
- TP/PP tests under `tests/vllm/` (e.g. `test_pipeline_parallelism_generation.py`,
  `test_tp_integration.py`, `test_tp_lora.py`) already gate themselves on
  `_visible_cuda_device_count() < 2` via `pytest.mark.skipif`.

---

## Details

### 1. Marker usage is inconsistent

The five markers are declared but unevenly applied. Many tests that take
> 30 s have no `slow` marker; many GPU-required tests have no `gpu` marker
(they rely on implicit failures or runtime checks); `vllm` is sometimes used
as a `skipif` reason rather than a marker.

A grep across `tests/` shows mixed conventions:
- Some files use `pytestmark = [pytest.mark.vllm, pytest.mark.gpu, pytest.mark.slow, ...]` (good — see `tests/vllm/test_pipeline_parallelism_generation.py:54`).
- Others use only `@pytest.mark.skipif(not _VLLM_AVAILABLE, ...)` without the `vllm` marker.
- Most tests in `tests/composer/` and `tests/integration/` carry no markers at all.

This makes `-m "not slow"` or `-m gpu` unreliable as filters.

### 2. Multi-GPU tests under `-n N`

`tests/conftest.py:13-38` pins each xdist worker to a single GPU. This is
correct for single-GPU tests but **breaks multi-GPU tests**: a test marked for
TP=2 or PP=2 sees only one GPU because its worker was pinned to one.

Tests that require multiple GPUs (`test_tp_integration.py`, `test_tp_lora.py`,
`test_pipeline_parallelism_generation.py`) skip under `-n auto` for this reason
even on a 2+ GPU machine. The skip is silent — see problem (6).

The `-n` value also matters: `-n auto` uses CPU count, which on a typical 8-core
GPU box is 8, far more workers than GPUs. Without per-test resource declarations,
this leads to GPU contention and OOM.

### 3. `make test-all` short-circuits on first failure

`make test-all` runs five `pytest` processes sequentially. If any one exits
non-zero, Make halts and the remaining suites **are never invoked** — they do
not appear as failed, skipped, errored, or xfailed. They are simply absent from
output. There is no aggregated cross-suite summary.

### 4. Test results are hard to consume

Output is plain pytest text streamed to the terminal. There is no machine-
readable summary (no JUnit XML, no JSON), no HTML report, and no aggregated
pass/fail count across the five suite invocations. CI integration and
historical tracking both require this.

### 5. Parallelism is opt-in instead of default

`-n` is not in `addopts` and not in `Makefile` `PYTEST_FLAGS`. Every developer
runs serially by default and discovers `-n` independently. With `tests/vllm/`
+ `tests/integration/` taking ~22 minutes serially, this is a real productivity
cost.

### 6. Skip reasons are invisible in default output

A run ending in `243 passed, 1 skipped, 17 warnings in 1329.59s` does not tell
the developer **what** was skipped or **why**. The skip reason is recorded by
pytest but only printed under `-rs` (or `-rsxX` for full reporting).

### 7. Sequential per-folder pytest invocations overwrite earlier summaries

`make test-all` runs five separate pytest processes. Each prints its own
`===== N passed in Xs =====` summary. By the time the run finishes, the
terminal scrollback contains five independent summaries with no aggregation —
and if the last suite is large (e.g. `tests/vllm/`), earlier summaries scroll
off-screen on terminals with limited history. Reviewing results requires
manually scrolling and mentally combining five tail blocks.

### 8. Chronically-failing integration tests 

```
=================================== FAILURES ===================================
________ test_hf_composed_adapter_indices[granite-4.0-micro-mid-131072] ________
[gw0] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:227: in test_hf_composed_adapter_indices
    hf_model(input_ids=input_ids)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/granite_switch/hf/modeling_granite_switch.py:490: in forward
    logits = logits / self.config.logits_scaling
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.50 GiB. GPU 0 has a total capacity of 79.14 GiB of which 24.37 GiB is free. Process 947164 has 54.76 GiB memory in use. Of the allocated memory 47.29 GiB is allocated by PyTorch, and 6.98 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
_______ test_hf_composed_adapter_indices[granite-4.0-micro-late-131072] ________
[gw3] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:227: in test_hf_composed_adapter_indices
    hf_model(input_ids=input_ids)
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1776: in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.12/site-packages/torch/nn/modules/module.py:1787: in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src/granite_switch/hf/modeling_granite_switch.py:490: in forward
    logits = logits / self.config.logits_scaling
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E   torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.50 GiB. GPU 0 has a total capacity of 79.14 GiB of which 24.08 GiB is free. Process 947174 has 55.05 GiB memory in use. Of the allocated memory 47.28 GiB is allocated by PyTorch, and 7.27 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
______________ test_hf_vllm_argmax_equivalence[granite-4.0-micro] ______________
[gw3] linux -- Python 3.12.11 /workspace/granite-switch/.venv/bin/python3
tests/integration/test_switch_e2e_compose.py:404: in test_hf_vllm_argmax_equivalence
    assert not failures, (
E   AssertionError: HF and vLLM disagree on top-1 token for 3 position(s) (base_model=ibm-granite/granite-4.0-micro):
E       [early] mismatch at positions [0, 2, 3]
E         hf_argmax   = [596, 287, 596, 596, 337, 337, 596]
E         vllm_argmax = [337, 287, 337, 337, 337, 337, 596]
E       [mid] mismatch at positions [0, 1, 2, 3, 5, 6]
E         hf_argmax   = [596, 596, 596, 596, 287, 337, 596]
E         vllm_argmax = [337, 337, 337, 337, 287, 596, 337]
E       [late] mismatch at positions [0, 1, 2, 3, 4, 5]
E         hf_argmax   = [596, 596, 596, 596, 596, 596, 287]
E         vllm_argmax = [337, 337, 337, 337, 337, 337, 287]
E       This typically indicates a vLLM gain-compensation bug: the switch
E       produced wrong adapter_indices, the wrong LoRA was applied, and
E       the downstream logits diverged enough to flip the top token.
E   assert not [('early', [0, 2, 3], [596, 287, 596, 596, 337, 337, ...], [337, 287, 337, 337, 337, 337, ...]), ('mid', [0, 1, 2, 3, 5, 6], [596, 596, 596, 596, 287, 337, ...], [337, 337, 337, 337, 287, 596, ...]), ('late', [0, 1, 2, 3, 4, 5], [596, 596, 596, 596, 596, 596, ...], [337, 337, 337, 337, 337, 337, ...])]
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: 56 warnings
  /workspace/granite-switch/.venv/lib/python3.12/site-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/integration/test_hf_to_vllm_weights.py::TestSingleSwitchForwardEquivalence::test_forward_logit_equivalence
  /workspace/granite-switch/.venv/lib/python3.12/site-packages/flashinfer/gemm/kernels/grouped_gemm_masked_blackwell.py:2059: DeprecationWarning: tcgen05.OperandMajorMode is deprecated, use cute.nvgpu.OperandMajorMode instead
    a_major_mode: tcgen05.OperandMajorMode,

tests/integration/test_hf_to_vllm_weights.py::TestSingleSwitchForwardEquivalence::test_forward_logit_equivalence
  /workspace/granite-switch/.venv/lib/python3.12/site-packages/flashinfer/gemm/kernels/grouped_gemm_masked_blackwell.py:2061: DeprecationWarning: tcgen05.OperandMajorMode is deprecated, use cute.nvgpu.OperandMajorMode instead
    b_major_mode: tcgen05.OperandMajorMode,

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_composed_adapter_indices[granite-4.0-micro-mid-131072]
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_composed_adapter_indices[granite-4.0-micro-late-131072]
FAILED tests/integration/test_switch_e2e_compose.py::test_hf_vllm_argmax_equivalence[granite-4.0-micro]
```


### 9. Skipped tests understanding
There are skipped tests, we should verify the reason and better reflect the impact

---

## Proposed fixes

### Fix 1 + 6: marker audit + default skip-reason reporting

**Action**:
1. Audit every test file under `tests/` and apply markers consistently:
   - `gpu` — any test that imports torch and uses CUDA
   - `vllm` — any test that imports `vllm` (replaces `skipif(not _VLLM_AVAILABLE)`)
   - `slow` — any test or class that takes > 30 s
   - `multi_gpu` — **new marker**, see fix (2)
   - `requires_model` — already used; keep as-is
2. Use file-level `pytestmark = [...]` for whole-file gating (already done in
   `tests/vllm/test_pipeline_parallelism_generation.py:54`); use class-level
   for partial coverage.
3. Add `-rfsxX` to `addopts` in `pyproject.toml` so failures, skips, xfails,
   and xpasses are listed by reason in every run summary.
4. **Promote `.claude/commands/write-tests.md` to a first-class skill so new
   tests follow the audited conventions automatically.** The existing command
   already encodes the right rules — resource tier (Axis 1) maps to markers,
   scope (Axis 2) maps to `tests/` subdirectories, and the "Marker Rules"
   section enforces the gating discipline this issue is trying to establish
   (e.g. *"If a test has `skipif(not _CUDA_AVAILABLE)`, it MUST also have
   `@pytest.mark.gpu`"*). Action items:
   - Update the skill to include the new `multi_gpu` marker and its
     interaction with the conftest worker-pinning logic (see fix (2)).
   - Update the resource-tier table to drop any markers removed by the
     audit and add ones it introduces.
   - Reference the skill from `CLAUDE.md` and `docs/` so contributors see
     it before writing tests.
   - Once the audit lands, the skill is the enforcement mechanism that
     keeps the suite from drifting back into inconsistency.

**File to edit**: `pyproject.toml` `[tool.pytest.ini_options]`:
```toml
addopts = ["--ignore=tests/_legacy", "-m", "not deep", "-rfsxX"]
markers = [
    "gpu: requires CUDA GPU",
    "multi_gpu: requires 2+ CUDA GPUs (TP/PP tests)",
    "vllm: requires vLLM installed",
    "slow: takes > 30s",
    "deep: expensive code-theory tests",
    "requires_model: needs a real model checkpoint",
]
```

### Fix 2: dedicated marker + targeted skip logic for multi-GPU

**Action**:
1. Add a `multi_gpu` marker (see above).
2. Apply it to `tests/vllm/test_tp_integration.py`, `tests/vllm/test_tp_lora.py`,
   `tests/vllm/test_pipeline_parallelism_generation.py` (it is already on the
   last via `pytestmark`, but make it the standard path).
3. Modify `tests/conftest.py:pytest_configure` to **not** pin
   `CUDA_VISIBLE_DEVICES` for tests carrying `multi_gpu`. The cleanest way:
   detect the marker per-test in a `pytest_collection_modifyitems` hook and
   un-pin (or skip) accordingly. Alternatively, run multi-GPU tests in a
   separate, non-xdist invocation.
4. Provide a dedicated Makefile target `test-multi-gpu` that runs the
   `multi_gpu`-marked subset without `-n`.

**Why both**: `-n` worker pinning and multi-GPU tests are fundamentally
incompatible. Splitting them is simpler than trying to make them coexist.

### Fix 3 + 7: replace Makefile chain with single pytest invocation

**Action**: collapse `test-all` into one pytest run that collects from all
five directories at once. One process, one summary, no short-circuit.

```makefile
PYTEST_FLAGS = -v --tb=short -rfsxX

test-all:
	$(PYTHON) -m pytest tests/unit tests/composer tests/hf tests/vllm tests/integration $(PYTEST_FLAGS)

test-cpu:
	$(PYTHON) -m pytest tests/unit tests/composer tests/hf $(PYTEST_FLAGS) -m "not gpu"

test-gpu:
	$(PYTHON) -m pytest tests/vllm tests/integration $(PYTEST_FLAGS) -m "gpu and not multi_gpu"

test-multi-gpu:
	$(PYTHON) -m pytest tests/vllm $(PYTEST_FLAGS) -m multi_gpu
```

Trade-off: dropping `-s` from `PYTEST_FLAGS` lets pytest capture stdout per
test, which is required for clean parallel output. The CLAUDE.md guidance
("always use `-s`") is for interactive debugging; the Makefile is for full-
suite runs. Developers who want streaming output can pass `-s` directly.

Also removes the per-suite Make targets' coupling, so a failure in one
directory does not hide later directories' results. The single pytest
process surfaces every failure in one summary.

### Fix 4: machine-readable + human-readable reports

**Action**: add JUnit XML + optional HTML reporting to `addopts`:

```toml
addopts = [
    "--ignore=tests/_legacy",
    "-m", "not deep",
    "-rfsxX",
    "--junitxml=test-results.xml",
]
```

For HTML, add `pytest-html` to the `test` extra:
```toml
test = ["pytest", "pytest-html", "pytest-xdist", "bitsandbytes", "optimum-quanto"]
```

Then runs can do `pytest --html=report.html --self-contained-html`.

This unblocks CI integration (any CI consumes JUnit XML natively) and gives
local developers a browsable report.

### Fix 4b: make `[test]` extra self-sufficient for full coverage

**Current state**: `pyproject.toml:39` defines:
```toml
test = ["pytest", "bitsandbytes", "optimum-quanto"]
```
This is missing `hf`, `vllm`, and `compose`. A user running `pip install -e
".[test]"` cannot actually execute `tests/hf/`, `tests/vllm/`, `tests/composer/`,
or `tests/integration/` — every import there fails. Today contributors install
`[dev]` (which transitively pulls hf/vllm/compose) to run tests, but `[dev]`
also drags in tooling that has nothing to do with test execution.

**Action**: redefine `[test]` to be the single authoritative "I want to run
the full test suite" extra. After the changes from fixes (4) and (5), it
should look like:

```toml
test = [
    "granite-switch[hf,vllm,compose]",
    "pytest",
    "pytest-xdist",
    "pytest-html",
    "bitsandbytes",
    "optimum-quanto",
]
```

Rules:
- `[test]` = everything required to run `make test-all` and produce both
  human- and machine-readable reports. Nothing more, nothing less.
- `[dev]` keeps its role as the contributor convenience extra and references
  `[test]` rather than duplicating its contents:
  ```toml
  dev = ["granite-switch[test]", "ruff"]   # add lint/format/etc here
  ```
- Document in `README.md` and `CLAUDE.md`: *"to run tests, install with
  `pip install -e \".[test]\"`"*. This is the single command a CI image or a
  new contributor needs.
- The `[test]` extra also becomes the source of truth that CI installs from,
  so test-runtime dependencies cannot drift between local and CI environments.

**Why this matters for this issue**: fix (4) adds `pytest-xdist` and
`pytest-html`, fix (5) makes `-n` a default. Both new dependencies need a
home. Putting them in `[dev]` repeats today's mistake (test deps scattered
across extras); putting them in `[test]` while leaving `[test]` incomplete is
worse. Fixing `[test]` to be self-sufficient is a prerequisite for landing
fixes (4) and (5) cleanly.

### Fix 5: parallelism by default

**Action**: add `pytest-xdist` to the `test` extra (above) and put a sensible
default in `addopts`. Two reasonable choices:

- `"-n", "auto"` — uses CPU count; risky on small/single-GPU boxes.
- `"-n", "4"` — fixed, matches current ad-hoc usage.

Recommend `-n 4` as the static default with a note in CONTRIBUTING/README that
multi-GPU tests need `make test-multi-gpu` separately. CPU-bound suites (unit,
composer, hf, integration) parallelize cleanly; vLLM single-GPU tests also work
under the existing worker-pinning logic.

```toml
addopts = [
    "--ignore=tests/_legacy",
    "-m", "not deep",
    "-rfsxX",
    "-n", "4",
    "--junitxml=test-results.xml",
]
```

Developers wanting serial execution use `pytest -n0`; debugging uses `pytest -s -n0`.

### Fix 8+9: identify and quarantine chronically-failing and skipped tests




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test suite standardization and cleanup #42

Current state (snapshot)

Details

1. Marker usage is inconsistent

2. Multi-GPU tests under `-n N`

3. `make test-all` short-circuits on first failure

4. Test results are hard to consume

5. Parallelism is opt-in instead of default

6. Skip reasons are invisible in default output

7. Sequential per-folder pytest invocations overwrite earlier summaries

8. Chronically-failing integration tests

9. Skipped tests understanding

Proposed fixes

Fix 1 + 6: marker audit + default skip-reason reporting

Fix 2: dedicated marker + targeted skip logic for multi-GPU

Fix 3 + 7: replace Makefile chain with single pytest invocation

Fix 4: machine-readable + human-readable reports

Fix 4b: make `[test]` extra self-sufficient for full coverage

Fix 5: parallelism by default

Fix 8+9: identify and quarantine chronically-failing and skipped tests

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Test suite standardization and cleanup #42

Description

Current state (snapshot)

Details

1. Marker usage is inconsistent

2. Multi-GPU tests under -n N

3. make test-all short-circuits on first failure

4. Test results are hard to consume

5. Parallelism is opt-in instead of default

6. Skip reasons are invisible in default output

7. Sequential per-folder pytest invocations overwrite earlier summaries

8. Chronically-failing integration tests

9. Skipped tests understanding

Proposed fixes

Fix 1 + 6: marker audit + default skip-reason reporting

Fix 2: dedicated marker + targeted skip logic for multi-GPU

Fix 3 + 7: replace Makefile chain with single pytest invocation

Fix 4: machine-readable + human-readable reports

Fix 4b: make [test] extra self-sufficient for full coverage

Fix 5: parallelism by default

Fix 8+9: identify and quarantine chronically-failing and skipped tests

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

2. Multi-GPU tests under `-n N`

3. `make test-all` short-circuits on first failure

Fix 4b: make `[test]` extra self-sufficient for full coverage