Staging to Main#59
Merged
Adds end-to-end MTPLX support: isolated venv at ~/.chaosengine/mtplx-venv plus engine, capability probe, routing, install flow, and full UI wiring.
Backend
- _mtp.py model registry: MTP_MODEL_MAP + _MTP_ALIASES (Qwen3.5/3.6, DeepSeek V3/R1, Coder-Next, Youssofal/Qwen3.6-27B-MTPLX-Optimized-*).
- MtplxEngine subprocess wrapper (load/generate/stream over OpenAI-compatible HTTP). Failures raise RuntimeError; the controller catches it and falls back to MLXWorkerEngine with a runtimeNote.
- capabilities: cheap file-existence probe for mtplx-venv + version file.
- controller._select_engine routes MLX + speculativeDecoding + has_mtp_heads + mtplxAvailable -> MtplxEngine.
- controller orphan-prune extended to MTPLX subprocesses.
- controller: added the missing _LLAMA_HELP_LOCK import (pre-existing latent NameError).
- helpers/system.py exposes mtplx info to workspace system stats.
Install flow
- scripts/install-mtplx.sh + routes/setup/mtplx.py background-job pattern (POST start, GET status). Path resolution corrected from parents[4] to parents[3].
Frontend
- useMtplxInstall hook with poll + unmount cleanup.
- RuntimeControls renders MTPLX section when model has MTP heads; hides DFlash when MTPLX supersedes (backend already prefers MTPLX).
- Install terminal panel hidden when phase=="done" so it doesn't re-appear on every modal open after a successful install.
- ModelLaunchModal mtplx props plumbed through all four wrappers: Chat (LaunchModal), HTML Challenge (ChallengePickerModal), Compare (CompareView), Benchmarks (BenchmarkRunTab).
- MyModelsTab MTPLX strategy filter; OnlineModelsTab Acceleration row.
- summarizeLaunchSettings + modelUsesMtplx helper so summary label shows "MTPLX" when backend will actually route through it.
- MyModelsTab MTPLX filter tightened to modelName-only candidateKeys (was matching via fuzzy matchedVariant.repo, giving false positives on UD/Unsloth repacks + Gemma).
Cleanup
- Removed Compare button from chat sidebar (Compare is its own tab now).
Tests
- 10 integration tests in tests/test_mtplx_engine_integration.py exercising engine spawn -> /health -> /v1/chat/completions (JSON + SSE) -> done chunk via stub mtplx server at tests/fixtures/stub_mtplx_server.py. Plus controller-routing assertions.
- 1365 Python tests pass; 371 TypeScript tests pass; tsc clean.
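The routing rule and the RuntimeError fallback described in the Backend bullets can be sketched as below. This is a minimal illustration, not the controller's code: the function names, the string return values, and the wording of the note are assumptions; only the four routing conditions and the two engine names come from the commit message.

```python
from typing import Callable, Optional, Tuple


def select_engine(backend: str, speculative_decoding: bool,
                  has_mtp_heads: bool, mtplx_available: bool) -> str:
    """Pick an engine name for an MLX load (hypothetical helper)."""
    if (backend == "mlx" and speculative_decoding
            and has_mtp_heads and mtplx_available):
        return "MtplxEngine"
    return "MLXWorkerEngine"


def load_with_fallback(load_mtplx: Callable[[], str],
                       load_mlx: Callable[[], str]) -> Tuple[str, Optional[str]]:
    """Try the MTPLX loader; on RuntimeError fall back and attach a note."""
    try:
        return load_mtplx(), None
    except RuntimeError as exc:
        # The note text is illustrative; the real runtimeNote may differ.
        return load_mlx(), f"MTPLX failed, using MLX worker: {exc}"
```

All four conditions must hold before MTPLX is chosen, so a missing venv or a model without MTP heads silently keeps the standard MLX path.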
Feature/mtplx
Stdlib-only Python shim over the FastAPI backend's HTTP surface so features can be tested headlessly (load + prompt + bench + status) without GUI click-through.
- scripts/chaosengine-cli: serve/status/load/unload/prompt/bench/mtplx-install/mtplx-status subcommands; SSE streaming for live token output; JSON to stdout for jq composition.
- tests/test_cli_smoke.py: 12 unit tests covering parser + happy path for every subcommand via mocked urllib.
No new pip deps. No bundle-size impact (script lives in scripts/, not packaged into the .app).
Expands chaosengine-cli to wrap all 125 backend routes (95 typed shortcuts + generic call/routes/openapi dispatchers) and adds a phased E2E test suite that drives the CLI end-to-end against a live backend, mirroring every major app surface.
CLI (scripts/chaosengine-cli, 1656 LOC)
- Generic dispatcher: call <METHOD> <PATH> with --body/--file/--query/--stream; reaches every endpoint regardless of typed coverage.
- routes + openapi subcommands fetch /openapi.json so route inventory stays in sync without codegen.
- Typed shortcuts cover Chat (load/unload/prompt/bench/sessions/compare), HTML Challenges (list/get/file/create/repair/retry/validate/delete), Image Studio (generate/progress/cancel/outputs/library/catalog/download lifecycle), Video Studio (same shape + output-file binary save), Server (status/shutdown/logs SSE), Setup (mtplx/longlive/wan/cuda-torch/gpu-bundle install + status + probes), Diagnostics, and misc (gpu-status, cache-preview, prompts, settings, workspaces, plugins, tools, adapters, finetuning, v1-models).
- Fixed two real schema-shape bugs caught while building the E2E suite: image-generate sends modelId/guidance (was modelRef/guidanceScale); video-generate sends modelId/numFrames/guidance.
E2E suite (scripts/e2e_test_suite.py)
- 8 phases: 0 Environment probe, 1 Chat (MLX+GGUF+cache+DFlash+MTPLX+long-context+fused-attn), 2 Chat Compare, 3 HTML Challenge, 4 Image Studio, 5 Video Studio, 6 Setup probes (read-only), 7 Diagnostics + cleanup hygiene.
- Pass criteria are concrete: HTTP 200, tokS>0 for generation, expected substring in runtimeNote for DFlash/MTPLX routing, zero unclean orphan workers after the sweep.
- Auto-skip semantics: checks skip cleanly when a prerequisite is missing (model not on disk, install missing) rather than failing.
- Reports JSON + Markdown to ~/.chaosengine/test-results/.
- --smoke runs Phases 0,3,4,5,6,7 (≤60s, no heavy model loads).
- Live smoke pass green on M-series box: 6/6 phases, 26 checks, 45s wall; real FLUX image catalog probe, real LTX-2 video generate, real HTML challenge round-trip.
Tests (tests/test_cli_smoke.py, 39 → 39 passing)
- Updated to match new schema fields (modelId, numFrames).
- All 95 typed subcommands have parser coverage; representative behaviour tests for each category.
Docs (docs/E2E_TESTING.md)
- Standardised procedure + skip semantics + adding-new-checks guide.
CLAUDE.md
- New Build Checklist entries: --smoke + full E2E required for release builds and any PR touching inference routing.
- "New feature gate": every user-visible feature, engine wiring, catalog model family, install endpoint, or cache/spec-dec strategy must land with an E2E check in the relevant phase.
The resolver walked the parent's sibling subdirectories and the grandparent looking for vision projectors. Under the flat ~/AI_Models/<org>/<repo>/ layout that picked up an unrelated neighbour's mmproj — loading lmstudio-community/gemma-4-31B-it-GGUF attached Qwen3.6-27B's mmproj-Qwen3.6-27B-BF16.gguf and crashed llama-server with a text-vs-mmproj n_embd mismatch (5376 vs 5120). Restrict the scan to parent.iterdir() of the main .gguf — no recursion into subdirs, no walk into the grandparent. Returns None for text-only models so --mmproj is never passed. Adds two regression tests: (a) a flat sibling-dir layout with a stray mmproj in the neighbour returns None, (b) an mmproj inside a subdirectory of the model's own folder is also ignored.
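A minimal sketch of the restricted scan, assuming a vision projector is any sibling ``.gguf`` whose file name contains ``mmproj`` (the real matching rule is not stated in this message, and ``find_mmproj`` is a hypothetical name):

```python
from pathlib import Path
from typing import Optional


def find_mmproj(main_gguf: Path) -> Optional[Path]:
    """Return an mmproj file sitting next to main_gguf, or None.

    Only the model's own directory is listed: no recursion into
    subdirectories and no walk into the grandparent.
    """
    for candidate in sorted(main_gguf.parent.iterdir()):
        if not candidate.is_file() or candidate == main_gguf:
            continue
        name = candidate.name.lower()
        if name.endswith(".gguf") and "mmproj" in name:
            return candidate
    return None  # text-only model: caller must not pass --mmproj
```

Returning None rather than a best-effort guess is what keeps ``--mmproj`` off the llama-server command line for text-only models.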
… match
Two related bugs surfaced by the E2E suite:
1. After deleting an HF-cache snapshot on disk, /api/workspace still returned the stale entry (broken=true) because nothing rescanned the library. _library() now runs a stat-only existence check on each cached entry per request and kicks a background rescan when anything was pruned. Sub-millisecond on a 500-entry cap.
2. lifecycle.load_model() rejected loads with "Cannot load 'X': <reason>" whenever the library lookup matched a broken entry, even when the caller supplied an explicit request.path pointing at the real weights elsewhere. The broken-entry guard now defers to the caller when path is set AND exists on disk. _find_library_entry() also prefers a healthy match over a broken one when multiple share the same name.
Tests cover: per-request prune, end-to-end /api/workspace exclusion, two-pass healthy-over-broken lookup, path-trust escape hatch happy path, and the negative case where a non-existent path still falls through to the rejection.
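The three behaviours (per-request prune, healthy-over-broken lookup, path-trust guard) can be sketched as follows. The ``LibraryEntry`` shape and all function names here are assumptions made for illustration; the real entries and helpers are not shown in this message.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple


@dataclass
class LibraryEntry:  # hypothetical shape
    name: str
    path: str
    broken: bool = False


def prune_missing(entries: List[LibraryEntry]) -> Tuple[List[LibraryEntry], bool]:
    """Stat-only check: drop entries whose path is gone; flag if any were pruned."""
    kept = [e for e in entries if Path(e.path).exists()]
    return kept, len(kept) != len(entries)


def find_library_entry(entries: List[LibraryEntry], name: str) -> Optional[LibraryEntry]:
    """Two passes: a healthy match wins over a broken one with the same name."""
    matches = [e for e in entries if e.name == name]
    healthy = [e for e in matches if not e.broken]
    return (healthy or matches or [None])[0]


def should_reject(entry: Optional[LibraryEntry], request_path: Optional[str]) -> bool:
    """Reject a broken entry unless the caller's explicit path exists on disk."""
    if entry is None or not entry.broken:
        return False
    return not (request_path and Path(request_path).exists())
```

The boolean from ``prune_missing`` is the signal a caller would use to kick the background rescan.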
Loading Qwen3.5 / Qwen3.6 MoE (and any other model that mixes
self-attention with linear-attention layers) with cacheStrategy=
turboquant + cacheBits=4 crashed the first generation call with:
'TurboQuantKVCache' object is not subscriptable
Root cause: ``make_adaptive_cache`` unconditionally built every cache
slot as ``TurboQuantKVCache`` / ``KVCache``. The model's linear-attn
layer forward accesses ``cache[0]`` / ``cache[1]`` (it expects an
``ArraysCache(size=2)``), which raises ``TypeError`` on a KV cache.
Fix: when a ``model`` is passed and exposes ``make_cache()``, use it
as the base. Preserve every non-KV slot (ArraysCache, MambaCache, …)
verbatim and only swap the actual ``KVCache`` instances for
``TurboQuantKVCache``. Plain models without ``make_cache`` keep the
previous behaviour.
Added regression tests in ``test_cache_strategies.py`` covering both
the hybrid model path and the no-``make_cache`` fallback. Live-
verified against ``mlx-community/Qwen3.6-35B-A3B-4bit`` at 4-bit
TurboQuant: generation now completes at ~47 tok/s with no crash.
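A minimal sketch of the slot-preserving swap, using stand-in classes rather than the real mlx-lm / TurboQuant types. The class bodies and the ``make_adaptive_cache`` signature shown here are assumptions for illustration only.

```python
class KVCache:  # stand-in for the standard attention cache
    pass


class ArraysCache:  # stand-in for the linear-attention cache
    def __init__(self, size):
        self.cache = [None] * size

    def __getitem__(self, idx):
        return self.cache[idx]


class TurboQuantKVCache:  # stand-in; deliberately not subscriptable
    def __init__(self, bits):
        self.bits = bits


def make_adaptive_cache(num_layers, bits, model=None):
    """Build per-layer caches, quantising only true KV slots."""
    if model is not None and hasattr(model, "make_cache"):
        base = model.make_cache()
        # Keep ArraysCache (and any other non-KV slot) verbatim.
        return [TurboQuantKVCache(bits) if isinstance(slot, KVCache) else slot
                for slot in base]
    # Plain models without make_cache(): previous behaviour.
    return [TurboQuantKVCache(bits) for _ in range(num_layers)]
```

With a hybrid model the linear-attention layers still receive a subscriptable cache, so ``cache[0]`` / ``cache[1]`` no longer raise.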
CLI
- cmd_prompt/cmd_bench now read tokS, promptTokens, completionTokens,
responseSeconds, runtimeNote from the nested ``assistant.metrics``
payload instead of the (always-null) top level. Was effectively
hiding live tok/s numbers from every --metrics call.
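A sketch of the nested lookup, assuming the response shape ``{ session, runtime, assistant: { text, metrics: {...} } }`` quoted in the test-fixture commit further down. ``extract_metrics`` is a hypothetical helper name.

```python
METRIC_KEYS = ("tokS", "promptTokens", "completionTokens",
               "responseSeconds", "runtimeNote")


def extract_metrics(response: dict) -> dict:
    """Read metrics from assistant.metrics, tolerating missing levels."""
    metrics = (response.get("assistant") or {}).get("metrics") or {}
    return {key: metrics.get(key) for key in METRIC_KEYS}
```

Reading from the top level instead yields None for every key, which is the symptom the fix describes.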
E2E suite (scripts/e2e_test_suite.py)
- _load_unload_prompt: ``canonical_repo`` + ``load_timeout`` parameters
threaded through; subprocess timeout = load_timeout + 60s so that the
backend-level timeout cleanup beats the harness kill.
- Phase 1 picker uses Qwen3.6-35B-A3B-4bit (MoE) as the fast model for
every Chat check — much quicker to load than 80B Qwen3-Next while
exercising the same MLX / cache / spec / fused paths.
- Phase 1 MTPLX check uses leaf-name modelRef + --canonical-repo
Youssofal/... so the controller routes through MtplxEngine while
avoiding the broken-library-entry path-shadow that previously
blocked the load (separate bug fixed by Agent B's library-prune +
path-trust commit, suite belt-and-braces).
- Phase 1 GGUF check cycles through local .gguf files instead of
picking the first one; a single broken mmproj pairing no longer
fails the whole check (Agent A's mmproj scope fix made this less
necessary but the resilience is a keeper).
- Phase 7 ``no orphan workers`` tolerates ``terminated`` / ``killed``
records as expected backend cleanup; only ``kill_failed`` or
similar non-cleaned states count as failure.
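The Phase 7 tolerance rule can be sketched as a filter over worker records. The record shape (a dict with a ``status`` field) is an assumption; only the state names come from the bullet above.

```python
CLEAN_STATES = {"terminated", "killed"}  # expected backend cleanup


def unclean_orphans(records):
    """Return the records that should fail the 'no orphan workers' check."""
    return [r for r in records if r.get("status") not in CLEAN_STATES]
```

The check passes when the returned list is empty; a ``kill_failed`` record, or one with no status at all, stays in the list.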
Full sweep result with this commit + the three preceding ``fix:``
commits (mmproj scoping, library prune + path-trust, TurboQuant
ArraysCache preservation):
8/8 phases PASS — 32/32 checks PASS — 128s wall
Phase 1 detail:
- MLX native cache: PASS 8.1s
- MLX TurboQuant cache: PASS 5.9s (was 500 → fixed by 30441f9)
- MLX + DFlash speculative: PASS 19.1s
- MLX + MTPLX speculative: PASS 13.2s (was load-blocked → fixed by 566fd64)
- GGUF llama.cpp: PASS 12.8s (was mmproj-crash → fixed by 51305c6)
- long context cache-preview: PASS 1.2s
- fused attention flag: PASS 10.8s
MTPLX (https://github.com/youssofal/mtplx) is the native MTP speculative-decoding runtime ChaosEngineAI shells out to for MTP-bearing models. Installed on-demand into an isolated venv at ~/.chaosengine/mtplx-venv/, not bundled in the desktop .app, driven via subprocess + HTTP from backend_service/inference/mtplx_engine.py. Apache 2.0 — compatible with our MIT+Apache+BSD permissive licence gate (CLAUDE.md §2). Full LICENSE shipped with the wheel under mtplx-*.dist-info/licenses/.
…al response shape
Pre-build (scripts/pre-build-check.sh)
- New phase 9/9: runs ./scripts/e2e_test_suite.py --smoke when backend
is reachable on :8876; warn-skips when not (pre-build doesn't spawn
one). Full E2E sweep stays a release-time gate per CLAUDE.md +
docs/E2E_TESTING.md.
- Notices dep-check list synced with current THIRD_PARTY_NOTICES.md:
added mtplx + mlx-video; dropped stale ChaosEngine probe (vendored
package was removed in FU-030).
Tests (tests/test_cli_smoke.py)
- test_prompt_non_streaming_prints_text_and_metrics fixture rebuilt
around the real /api/chat/generate response shape: { session,
runtime, assistant: { text, metrics: {...} } }. The earlier flat
shape was a guess that masked the real CLI bug fixed in 2d1128c.
Verification: ./scripts/pre-build-check.sh — 10 passed, 0 failed,
1 warning (unrelated llama-server-turbo update available).
Adds three new feature surfaces to the top-level README without reflowing existing sections.
- MTPLX (Multi-Token Prediction) speculative decoding gets a feature-highlight bullet, a mention in the "Why ChaosEngineAI" speculative-decoding paragraph, and a dedicated subsection under "Speculative Decoding" alongside DFlash + DDTree. Covers Apple Silicon support, the isolated mtplx-venv, the model registry in backend_service/inference/_mtp.py, and the auto-routing fallback chain (MTPLX -> DFlash -> standard MLX).
- chaosengine-cli gets a new "Headless Automation" section between the Building a Release and Project Layout sections. Documents the generic call dispatcher + 95 typed shortcuts, four quick-start examples, the optional PATH symlink, and the no-GUI install path.
- E2E test suite gets a brief subsection inside the CLI section linking out to docs/E2E_TESTING.md.
Builds with `mkdocs build --strict` (zero warnings) into a publishable site covering install, usage, features (MTPLX, DFlash, cache strategies), CLI reference (driven from live /openapi.json), architecture (controller routing + engines + runtime paths), testing (importing the existing E2E_TESTING content), troubleshooting, contributing, and a reference section (HTTP API, env vars, third-party deps, changelog).
- mkdocs.yml: Material theme + tabs nav + standard pymdownx extensions.
- requirements-docs.txt: mkdocs, mkdocs-material, pymdown-extensions.
- .gitignore: exclude the site/ build output.
- exclude_docs in mkdocs.yml hides four pre-existing legacy docs (E2E_TESTING.md root copy + the image-discover/MVP/provenance notes) that are not part of the new site nav.
Builds the strict MkDocs site on every push to staging that touches docs/, mkdocs.yml, requirements-docs.txt, or this workflow, then rsyncs the output into cryptopoly/ChaosEngineAI-Site under docs/ and pushes to that repo's main branch. The marketing site serves the result at https://chaosengineai.com/docs/ — subdirectory hosting so backlinks accrue to the main domain for SEO. mkdocs.yml site_url updated from readthedocs.io to chaosengineai.com/docs/ so generated canonical URLs, sitemap, and OG tags point at the real host. Requires a single new secret on this repo: SITE_REPO_DEPLOY_KEY — SSH deploy key with write access to the ChaosEngineAI-Site repo. Generate with ssh-keygen, add the public half there as a deploy key (write enabled), private half here as an Actions secret. Documented inline in the workflow header. Manual workflow_dispatch is also wired for hot-fixes outside the push-trigger window.
Investigation of recent activity on spec-decoding + KV cache compression upstreams.
Findings:
- llama.cpp PR #22673 (MTP support) merged 2026-05-16; ships --spec-type draft-mtp --spec-draft-n-max N. Canonical MTP GGUFs published under ggml-org/ for Qwen3.6-27B and Qwen3.6-35B-A3B.
- turboquant-mlx-full unchanged at 0.3.0 (our current pin).
- WeianMao/triattention HEAD c3744ee6 = our pin; no new MLX work.
- TheTom/turboquant_plus has a C++ TriAttention V3 hybrid policy in the llama-cpp-turboquant fork's experiment branch; not yet independently reproduced.
- Tweet at leftcurvedev_ status could not be verified (X auth wall).
Includes diff sketch + recommended PR sequence + open questions. See doc for sources.
New follow-up row tracking the GGUF half of FU-028 now that PR #22673 merged upstream. Lists action plan (6 wiring steps), upstream caveats, and links to the upstream-research write-up.
Closes the GGUF half of FU-028. PR #22673 by am17an merged upstream
2026-05-16T12:06:24Z (merge commit 2555826) shipping
--spec-type draft-mtp --spec-draft-n-max N for models with baked-in
Multi-Token Prediction heads. Upstream-reported ~72% acceptance
@ N=3 on Qwen3.6-27B, ~2x tok/s vs no-spec baseline.
Code changes
- _mtp.py: new is_mtp_gguf_repo() + _MTP_GGUF_REPOS frozenset.
4 new aliases for the canonical mirrors (ggml-org/*) and author
preview (am17an/*) GGUF repos so has_mtp_heads + get_mtp_draft_n
return the right canonical N.
- llama_cpp_engine.py: _build_command grew speculative_decoding +
canonical_repo + model_ref kwargs. Emits --spec-type draft-mtp
--spec-draft-n-max <get_mtp_draft_n> when the binary supports
--spec-type AND the canonical repo matches is_mtp_gguf_repo.
Falls back to standard decode + clear runtimeNote when the
binary lacks --spec-type (older llama-server builds, e.g.
homebrew bottles built before 2026-05-16T12Z).
- base.py: new ggufMtpAvailable: bool on BackendCapabilities,
serialised in to_dict so the frontend can show MTP affordances
for GGUF models alongside the existing mtplxAvailable flag.
- capabilities.py: _probe_native_backends sets ggufMtpAvailable
from _llama_server_supports("--spec-type") against either the
standard or turbo binary.
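The gating in ``_build_command`` can be sketched as a pure function. The two flags are the ones named in this commit; the function name, its parameters, and the wording of the fallback note are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def mtp_spec_args(speculative_decoding: bool, binary_supports_spec_type: bool,
                  is_mtp_repo: bool, draft_n: int) -> Tuple[List[str], Optional[str]]:
    """Return (extra llama-server args, runtimeNote or None)."""
    if not speculative_decoding or not is_mtp_repo:
        return [], None  # non-MTP repo or spec-dec off: no-op
    if not binary_supports_spec_type:
        # Illustrative note text; the real runtimeNote may read differently.
        return [], "llama-server lacks --spec-type; using standard decode"
    return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n)], None
```

Keeping the fallback inside the same function means an older binary degrades to standard decode with a visible note instead of failing at spawn time.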
Catalog (text_models.py)
- ggml-org/Qwen3.6-27B-MTP-GGUF (Q8_0, 29 GB)
- ggml-org/Qwen3.6-35B-A3B-MTP-GGUF (Q8_0, 37 GB, MoE)
Both with the qwen3.6 family, vision via auto-detected mmproj
sibling, runtime note flags D2H prompt-processing caveat per
upstream PR body.
Tests (test_inference.py)
- 5 new cases: happy-path MTP flag emission; binary-lacks-spec-type
fallback runtimeNote; non-MTP repo no-op; canonical + author
alias coverage; draft-n lookup through aliases.
- Full suite: 1418 passed, 1 skipped (up from 1413, no regressions).
Tracker
- CLAUDE.md FU-047 row flipped to ~~shipped~~ with full landing
receipt. FU-028 stays open for the MLX side (mlx-lm has no
native MTP head loader; MTPLX subprocess remains the workaround).
Live-verification status
- Backend probe reports ggufMtpAvailable=True against homebrew
llama.cpp 9150 (advertises --spec-type) BUT homebrew bottle
9150 was built before PR #22673 merged today, so its
--spec-type help list still omits draft-mtp. Backend wiring
is correct; users need a llama-server built from master at or
after commit 2555826 to actually fire draft-mtp speculative
decoding. Next homebrew bottle picks this up automatically.
- MLX side comparison: MTPLX path (subprocess via /v1) runs the
same Qwen3.6-27B-MTPLX-Optimized-Speed model at ~24.7 tok/s
versus ~29.0 tok/s for the standard mlx-lm worker — currently
*slower* on this hardware (M5), likely from HTTP-proxy overhead
on per-token roundtrips eating the spec-dec acceptance gains.
Investigating separately; not blocking this commit.
Research write-up: docs/UPSTREAM_RESEARCH_2026-05-16.md
…s stale llama-server
Three independent fixes shipped together because they all surfaced while live-benching MTPLX vs MLX baseline on Qwen3.6-27B.
1. MTPLX no longer pops a browser window
``mtplx start`` defaults to MTPLX's interactive onboarding, which on first run picks the ``web`` surface and opens a chat UI in a browser tab. Users who only asked ChaosEngineAI to load a model got an unrelated browser window. Switched the subprocess invocation in MtplxEngine to ``mtplx quickstart --yes``, which is the server-only entry point: pure HTTP at /v1, no UI, no prompts. Also pass ``--host 127.0.0.1`` explicitly + ``--mtp --depth N`` so the speculative path actually fires with the registered draft-token count.
2. Draft depth bumped 1 -> 3 for Youssofal Optimised models
The earlier conservative N=1 made HTTP-proxy overhead dominate any spec-dec acceptance gain. Live bench: depth=1 ran the same model at ~24.7 tok/s vs ~29.0 tok/s for plain MLX (15% SLOWER). With depth=3 (matches MTPLX's own UI default), the same bench averaged ~27.2 tok/s with the first run hitting 30.4 tok/s: within about 6% of baseline and occasionally beating it. The remaining gap is HTTP-roundtrip overhead, not algorithm.
3. Pre-build gate warns when staged llama-server lacks draft-mtp
FU-047 wired GGUF MTP via llama.cpp PR #22673 merged today, but homebrew bottle 9150 was built before the merge; it advertises ``--spec-type`` but the value ``draft-mtp`` isn't in its help. Catalog rows for the MTP GGUFs will fail-load until the bundled binary is at master >= 2026-05-16. Pre-build now greps the help text and surfaces a WARN row pointing operators at ``brew upgrade llama.cpp`` or a rebuild-from-master.
Test fixtures (stub_mtplx_server.py)
- Accept both ``quickstart`` (new) and ``start`` (legacy) subcommands.
- Accept the new flags MtplxEngine emits (--host, --mtp, --no-mtp, --depth, --yes) so the 10 integration tests still pass against the new command shape.
Live verification
- ./scripts/chaosengine-cli load Qwen3.6-27B-MTPLX-Optimized-Speed --canonical-repo Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed --backend mlx --spec returns runtimeNote "MTPLX MTP speculative decoding active (draft tokens: 3, model: Qwen3.6-27B-MTPLX-Optimized-Speed)".
- Three sequential generations: 30.4, 27.5, 23.6 tok/s (avg 27.2); no browser window opens; subprocess stays clean.
Test suite: 1418 passed, 1 skipped (no regressions).
Answers two user-flagged questions from this session:
- "MTPLX page just opened in browser — won't happen to users right?" No: this commit silences the pop.
- "Homebrew llama-server too old — should ChaosEngineAI versions ship a newer one?" Stage-runtime.mjs auto-downloads the latest ggml-org/llama.cpp release at build time, so once a new tagged release lands post-merge the next ChaosEngineAI build picks it up automatically. The new pre-build warning makes the gap visible immediately rather than only at first user attempt.
Bumps the four version sources of truth that downstream code reads:
- pyproject.toml (backend uses this via _resolve_app_version + reports back through /api/health + /api/diagnostics/snapshot)
- package.json (frontend bundling, npm scripts)
- src-tauri/tauri.conf.json (desktop installer + auto-updater)
- src-tauri/Cargo.toml (Rust shell binary)
Cargo.lock regenerates on next ``cargo build``.
Headline for this release (full notes in RELEASE_NOTES_v0.9.2.md):
- chaosengine-cli: full headless automation, 95 typed shortcuts + 100% backend route coverage
- MTPLX native MTP speculative decoding on Apple Silicon (Apache 2.0)
- GGUF MTP via llama.cpp PR #22673 (--spec-type draft-mtp)
- Phased E2E test suite + auto-deployed MkDocs documentation site
- 5 backend bug fixes (TurboQuant hybrid-attn, stale library scan, mmproj scoping, CLI response shape, MTPLX browser-pop)
No v0.9.1 was tagged; going 0.9.0 -> 0.9.2.
Backend ships 9 diffusion cache strategies via cache_compression.registry
but the frontend only let users pick 2 (fbcache, teacache). Marketing
site claimed '9 strategies' — discrepancy.
Triaged the 4 hidden ones against value-vs-noise:
Worth UI exposure:
- TaylorSeer: native diffusers 0.38 core, generic across FLUX / SD3
/ Wan / Hunyuan / LTX / CogVideoX / Mochi. ~2.4x speedup.
- PAB (Pyramid Attention Broadcast): native diffusers 0.38 config,
~2x speedup. Different mechanism than FBCache (attention reuse vs
first-block-skip) so a real alternative not a duplicate.
Kept backend-only (CLI / API):
- MagCache: FLUX-only without calibration UX; footgun on other DiTs.
- FasterCache: ~1.9x — same ballpark as FBCache so adds choice
without adding capability.
Changes:
- ImageCacheStrategyId + VideoCacheStrategyId unions extended:
'none' | 'fbcache' | 'teacache' -> ... | 'taylorseer' | 'pab'
- IMAGE_CACHE_STRATEGIES list grows by 2 entries with hints describing
what each does.
- IMAGE_CACHE_STRATEGY_DEFAULT_THRESH + VIDEO_CACHE_STRATEGY_DEFAULT_THRESH
set taylorseer/pab thresholds to 0 (means 'use diffusers default skip
interval' — these adapters key off cache_interval not threshold).
- imageCacheStrategiesForRepo gating unchanged: UNet pipelines still
only see 'Off'; FLUX gets all 5; other DiTs get all 5 minus TeaCache
(no calibration tables for those pipelines).
Backend already accepts any string for cacheStrategy (Pydantic field is
'str | None' and registry.get() handles the lookup), so no Python
schema change needed — the new ids route straight through to the
existing cache_compression.{taylorseer,pab} adapters.
Tests: 214 cache/image/video tests pass; full Py + TS suites green;
tsc clean.
Two follow-ups from the v0.9.2 MTPLX bench:
1. MTPLX --profile performance-cold --max
The MTPLX subprocess defaults to ``sustained`` runtime profile,
which thermally throttles for long-running serves. For chat where
the user is staring at the textarea waiting on the first response,
``performance-cold --max`` is the right preset — full clocks, no
throttling. Live re-bench on M5 with N=3 + burst still didn't beat
plain mlx-lm (avg 23.9 tok/s vs 29 baseline; throughput degraded
over consecutive runs from 27.4 -> 20.8 suggesting M5 thermal
limits dominate regardless of profile). Keep the flag — burst is
the *right* config for interactive use; the gap is hardware, not
ChaosEngineAI's fault.
2. /api/setup/llama-server-status + chaosengine-cli llama-server-status
New read-only endpoint probes the resolved llama-server binary:
- Reports build number from ``llama-server --version``
- Greps ``llama-server --help`` for ``draft-mtp`` in --spec-type
- Returns platform-aware upgrade command (brew on macOS, tarball
on Linux, scoop on Windows)
- Surfaces a clear ``message`` field telling the user *why* MTP
GGUF won't fire on their current binary
The frontend can now show an "Outdated llama-server" banner under
any MTP GGUF catalog entry and link the upgrade command directly.
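A sketch of the two text checks behind the probe, written as a pure function over already-captured ``--version`` and ``--help`` output. The response field names, the message wording, and the assumption that the version line looks like ``version: <build> (<commit>)`` are all illustrative, not taken from the endpoint.

```python
import re
from typing import Optional


def llama_server_status(version_text: str, help_text: str) -> dict:
    """Summarise build number and draft-mtp support from CLI output."""
    match = re.search(r"version:\s*(\d+)", version_text)
    build: Optional[int] = int(match.group(1)) if match else None
    has_spec_type = "--spec-type" in help_text
    has_draft_mtp = has_spec_type and "draft-mtp" in help_text
    if has_draft_mtp:
        message = "llama-server supports draft-mtp"
    elif has_spec_type:
        message = "llama-server has --spec-type but no draft-mtp value; upgrade needed"
    else:
        message = "llama-server predates --spec-type; upgrade needed"
    return {"build": build, "supportsDraftMtp": has_draft_mtp, "message": message}
```

Separating "has ``--spec-type``" from "lists ``draft-mtp``" matters because a binary built shortly before the merge reports the first without the second.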
Why a read-only probe (not a one-click installer):
- Each OS has a preferred package-manager path (brew / apt / pacman /
scoop / chocolatey); wrapping all of them is a footgun until we
vendor our own llama.cpp build.
- The release pipeline already pulls ggml-org/llama.cpp's latest
GitHub release tarball at stage-runtime time. After a fresh ggml-org
release tag the bundled binary catches up automatically; users on
homebrew need ``brew upgrade llama.cpp`` once the bottle refreshes.
- This endpoint makes the gap legible without taking responsibility
for the install.
Test fixture (stub_mtplx_server.py) — accept the new --profile and
--max flags that MtplxEngine now passes so the integration tests
keep passing.
Full suite: 1418 passed, 1 skipped.
…3.6 MTP N to 3
Two follow-ups while running the FU-047 head-to-head benchmark:
1. Status probe was bypassing the env override
_resolve_llama_server() in routes/setup/llama_server.py only checked /opt/homebrew/bin and PATH, ignoring CHAOSENGINE_LLAMA_SERVER. The inference engine resolver in inference/binaries.py does honour the env var (set by the Tauri shell pointing at the bundled binary, or by developers pointing at a freshly-built source build). The status probe now follows the engine's resolution priority (env override > homebrew > PATH) so the UI banner is honest about which binary actually runs.
2. MTP_MODEL_MAP Qwen3.6 entries: N 1 -> 3
Earlier sustained-bench at N=1 left tokens on the table; upstream PR #22673 reports ~72% acceptance at N=3 on Qwen3.6-27B. Live bench on M5 with N=3 on Q8_0 GGUF: 1.51x speedup over Q8_0 baseline (20.9 vs 13.8 tok/s). N=1 was already 1.46x; bumping to 3 nudges it up another 4%. Diminishing returns past N=3 per the PR body.
Full head-to-head bench (M5, same prompt, 256 max tokens, 3 runs):
MLX baseline (Youssofal MTPLX-Optimized-Speed)    29.0 tok/s
MTPLX subprocess N=3 burst                        24-27 tok/s (variable)
GGUF Q4_K_M baseline (lmstudio Qwen3.6-27B)       18.4 tok/s
GGUF Q8_0 baseline (ggml-org Qwen3.6-27B-MTP)     13.8 tok/s
GGUF Q8_0 + MTP N=1 (this fix)                    20.1 tok/s (1.46x)
GGUF Q8_0 + MTP N=3 (this commit)                 20.9 tok/s (1.51x)
Net findings:
- FU-047 (GGUF MTP) delivers the real, measurable speedup. 1.5x on the same model + quant + hardware is the headline.
- MTPLX subprocess via HTTP underperforms on M5 even with depth=3 + burst profile. Subprocess overhead > MTP acceptance gain.
- Plain MLX-LM at Youssofal's BF16-ish encoding still wins absolute throughput because the per-token compute is just smaller.
Verified with /tmp/llama.cpp HEAD build (commit 6049906) installed to ~/.chaosengine/bin/llama-server.
ggml-org/llama.cpp release b9181 (2026-05-16 17:06 UTC) is one commit past the MTP merge and is what stage-runtime.mjs will pull on the next ChaosEngineAI build, so the bundled binary in .app installs will ship with draft-mtp out of the box, with no homebrew dependency for end users.
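The resolution priority (env override > homebrew > PATH) can be sketched as below. The env var name and the homebrew directory come from this message; the function signature and its injectable parameters are assumptions added so the order can be exercised without touching the real system.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_llama_server(
    env: Optional[Mapping[str, str]] = None,
    homebrew_path: str = "/opt/homebrew/bin/llama-server",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the llama-server binary the engine would run, or None."""
    env = os.environ if env is None else env
    override = env.get("CHAOSENGINE_LLAMA_SERVER")
    if override and Path(override).is_file():
        return override  # 1. explicit env override
    if Path(homebrew_path).is_file():
        return homebrew_path  # 2. homebrew install
    return which("llama-server")  # 3. first match on PATH
```

Sharing one resolver between the status probe and the engine is the simplest way to keep the UI banner and the running binary in agreement.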
Three colleague-feedback items after MTPLX root-cause investigation:
- _mtp.py: model_has_mtp_tensors() peeks GGUF header for mtp_decoder /
mtp_emb / mtp_heads byte strings, or probes safetensors index for
mtp_*. keys / mtp.safetensors shard. has_mtp_heads_strict(repo, path)
prefers tensor probe over name aliases — catches new MTP-bearing
repos we haven't enumerated and rejects name collisions that don't
carry the tensors (FU-041-style false positives).
- controller._select_engine + llama_cpp_engine._build_command both
switch to the strict / tensor-probe path; GGUF MTP gate falls back
to is_mtp_gguf_repo when no local path is available.
- routes/setup/mtplx.py: /api/setup/mtplx-status now reports
fanControl.{thermalforge,tgPro,anyAvailable,recommendedAction} so
the Setup tab can prompt for ThermalForge install before users hit
the silent-throttle ceiling on MTPLX --max burst runs.
- CLAUDE.md FU-048: deferred prefer-GGUF-MTP routing preference —
needs Settings UX before flipping the default since MTPLX-Optimized
quants aren't GGUF-mirrored.
- tests: 7 new tensor-probe cases in test_inference.py; bumped stale
N=1 assertion to N=3 for Qwen3.6 MTP GGUFs (matches MTP_MODEL_MAP).
PR #22673 names MTP weights as ``blk.{N}.nextn.*`` ("Next-N
prediction") and emits ``<arch>.nextn_predict_layers`` in the GGUF
metadata header, neither of which matched the legacy ``mtp_decoder``
/ ``mtp_emb`` / ``mtp_heads`` needles. As a result tensor probe
returned False on a real MTP-GGUF model and the engine never emitted
``--spec-type draft-mtp`` (verified live: ggml-org/Qwen3.6-27B-MTP-GGUF
ran at 14.5 tok/s instead of MTP-accelerated 23 tok/s).
The metadata key lives in the first few KB of the file, so a 2 MB
read window catches both the cheap canonical marker and the legacy
patterns. Probe now returns True for ggml-org/Qwen3.6-27B-MTP-GGUF.
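A minimal sketch of the byte-needle probe over the 2 MB read window. The needle strings and window size come from this message; the function name and the magic-bytes guard are assumptions, and the real probe may parse the header rather than substring-match.

```python
LEGACY_NEEDLES = (b"mtp_decoder", b"mtp_emb", b"mtp_heads")
NEXTN_NEEDLE = b"nextn_predict_layers"  # metadata key emitted per PR #22673
READ_WINDOW = 2 * 1024 * 1024  # 2 MB


def gguf_has_mtp_markers(path) -> bool:
    """Peek at the start of a GGUF file for MTP markers."""
    with open(path, "rb") as fh:
        head = fh.read(READ_WINDOW)
    if not head.startswith(b"GGUF"):  # GGUF magic bytes
        return False
    if NEXTN_NEEDLE in head:
        return True  # cheap canonical marker
    return any(needle in head for needle in LEGACY_NEEDLES)
```

A bounded read keeps the probe cheap on multi-gigabyte files while still covering the metadata header where the key lives.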
Head-to-head live numbers (M5 Max, 27B Qwen3.6, MTP enabled both sides):
- GGUF MTP Q8_0: 23.0 tok/s mean (14.5 baseline -> +58.6%)
- MTPLX MTP 4-bit (Optimized-Speed): 28.95 tok/s mean
Tests: rename legacy-tensor-name case + new case pinning the
nextn_predict metadata marker. 15 MTP tests green.
The v0.9.0 release shipped with package.json / Cargo.toml / tauri.conf.json at 0.9.0 but pyproject.toml still at 0.8.0, so users downloaded "v0.9.0" from the site and the bundled backend reported ``appVersion: 0.8.0`` because ``_resolve_app_version`` reads the staged pyproject. Nothing enforced cross-manifest sync. The pre-build gate now reads the version from all four sources and fails the build when any of them drift apart. Mirrors the existing dflash-mlx pin assert in both ``pre-build-check.mjs`` and ``pre-build-check.sh``.
…ies + LTX series
- Cache compression table gains TaylorSeer / MagCache / PAB / FasterCache rows (previously only TeaCache + FBCache appeared even though the four diffusers-0.38 strategies shipped via FU-026)
- DFlash family list adds Gemma-4 (FU-031), Kimi-K2.6, MiniMax-M2.5/M2.7, Qwen3.5-122B-A10B, all in DRAFT_MODEL_MAP
- Video model table splits Lightricks LTX-Video (base diffusers) from LTX-2 / LTX-2.3 (mlx-video subprocess); both ship in the catalog
- Feature-map line now reads "FBCache + TeaCache + TaylorSeer + MagCache + PAB + FasterCache" instead of TeaCache-only
…ound job)
Adds a self-contained path for users with a working CUDA torch install to
upgrade to a newer wheel without re-running the full 2.5 GB GPU bundle.
Surfaces as a compact pill in the Image / Video Studio runtime banners
when ``realGenerationAvailable`` AND the matching cu{N} pip index serves
a newer wheel than the one on disk; silent otherwise.
Backend
- ``_install_helpers.py``: ``_extract_cuda_tag``, ``_index_url_for_cuda_tag``,
``_parse_version_triple``, ``_classify_torch_upgrade``,
``_query_latest_torch_version`` (pip index versions parser, both output
shapes), ``_abi_dependents_present``, ``_move_torch_to_rollback`` +
``_restore_torch_from_rollback`` + ``_cleanup_old_torch_rollbacks``,
``_TORCH_ABI_DEPENDENT_PACKAGES`` constant.
- ``routes/setup/torch_upgrade.py``: GET /api/setup/torch-upgrade-available
(synchronous detection, returns ``{available, current, latest,
upgradeType, rebuildPackages, indexUrl}`` or ``{available: false,
reason}``) and POST /api/setup/upgrade-torch (background job mirroring
install-gpu-bundle pattern). Worker moves existing torch to
``.torch-rollback-<version>/`` instead of purging, installs target from
the same cu{N} index, re-pins constraint, force-reinstalls ABI deps on
minor/major bumps (bitsandbytes/torchao/nunchaku/sageattention),
verifies CUDA in a subprocess, restores rollback on verify failure,
keeps the most recent rollback as a safety net.
Frontend
- ``src/api/setup.ts``: ``checkTorchUpgradeAvailable``, ``startTorchUpgrade``,
``getTorchUpgradeStatus`` + 4 exported types. Re-exported via
``src/api/index.ts``.
- ``src/components/TorchUpgradePill.tsx``: one-shot probe on mount,
hides when ``available: false``, three display states (available /
in-progress / done-or-error), polls status at 1.5 Hz with cleanup
keyed by ``job.done``, inline collapsible install log with
phase-named markers. Restart Backend hook plumbed through.
- ``src/styles.css``: color-coded badges per upgrade type
(patch=green / minor=amber / major=red).
- Wired into ``ImageStudioRuntimeBanner`` + ``VideoStudioRuntimeBanner``;
renders only when ``realGenerationAvailable`` so users with broken
torch are not second-guessed.
Tests
- 24 new tests in ``tests/test_setup_routes.py`` covering every helper
(version parsing edge cases including ``2.6.0rc1`` that caught a real
bug in the first cut where digits across non-digit boundaries leaked
into the parsed triple), both pip output shapes, rollback move/restore
round-trip with simulated half-install in extras, cleanup mtime
ordering, and all 8 detection-response shapes plus the apple-silicon
rejection and running-job POST cases.
Drive-by: package-lock.json version field was lagging at 0.8.0 after
the 0.9.0 bump; synced.
Verified: 24/24 new tests pass, 80/80 ``test_setup_routes.py`` pass,
217/217 setup + backend + services + inference pass, 371/371 vitest
pass, ``tsc --noEmit`` clean. Pre-existing ``test_cache_strategies`` /
``test_sdcpp_*`` / ``test_preview_thumbnails`` failures verified to
exist on baseline (Windows env + optional diffusers deps), unrelated.
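A minimal sketch of the version handling described in this commit, assuming the helpers behave as the message implies. The function names mirror ``_parse_version_triple`` / ``_classify_torch_upgrade`` / ``_extract_cuda_tag`` but the bodies are illustrative guesses, not the shipped code. Anchoring the regex at the start of the string is one way to avoid the ``2.6.0rc1`` bug, where digits after a non-digit boundary leaked into the triple.

```python
import re

# Leading "major.minor[.patch]" only; anything after it (rc1, +cu124) is ignored.
_TRIPLE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")
# Local version suffix such as "+cu124" at the end of the string.
_CUDA_TAG = re.compile(r"\+(cu\d+)$")


def parse_version_triple(version):
    """Return (major, minor, patch) or None when the string has no numeric prefix."""
    match = _TRIPLE.match(version.strip())
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def classify_upgrade(current, latest):
    """Return "major", "minor", "patch", or None when latest is not newer."""
    cur = parse_version_triple(current)
    new = parse_version_triple(latest)
    if cur is None or new is None or new <= cur:
        return None
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


def extract_cuda_tag(version):
    """Return the cu{N} tag from a version like "2.5.1+cu124", else None."""
    match = _CUDA_TAG.search(version.strip())
    return match.group(1) if match else None
```

The upgrade type is what would decide whether ABI-dependent packages need a force-reinstall: per the commit, only minor and major bumps trigger it.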
Both branches added a new setup submodule + API client surface at the
same insertion points:
staging: routes/setup/mtplx.py + Mtplx{Attempt,JobState,Status}
this PR: routes/setup/torch_upgrade.py + TorchUpgrade{Availability,
Attempt,JobState,Type,UnavailableReason}
Resolution keeps both — order in routes/setup/__init__.py is now
alphabetical (longlive → mtplx → torch_upgrade → turbo → wan_install),
both routers register. In src/api/setup.ts the MTPLX block goes first
(it landed on staging first) and the Torch upgrade block follows;
src/api/index.ts re-exports both, alphabetised.
Verified post-merge: 80/80 test_setup_routes pass, 371/371 vitest
pass, ``tsc --noEmit`` clean, and both routers register the expected
paths (``/api/setup/install-mtplx`` + ``/api/setup/upgrade-torch``).
Run the CLI-driven E2E suite reliably on Windows by invoking the extensionless CLI through Python, writing reports as UTF-8, and treating missing video runtime prerequisites as skips. Also make the Vitest config ESM-safe for runner-mode loading and keep the Tauri lockfile version in sync.
feat: in-place torch upgrade with rollback (detection + pill + backgr…
Bundles the M4 Max test-suite session work: tracker rows, two real bugs, two user-requested features.
Tracker (CLAUDE.md):
- FU-049 Python 3.14 support gate (deferred; pyproject stays >=3.10)
- FU-050 matrix runner: reasoning-channel capture + max-tokens 96->512 + stale endpoint/path fixes (/api/chat/generate/stream, runtime.loadedModel)
- FU-051 /api/models/load echoes legacy cacheStrategy verbatim (open)
- FU-052 matrix grows 9->15 cells: MTPLX MLX, GGUF MTP, 4x vLLM (CUDA-gated)
- FU-053 distill variants flagged installed when only base repo on disk
- FU-054 same-repo siblings: per-file size + shares-storage badge
- FU-055 in-app storage explorer in Diagnostics tab
Bugs fixed:
- _distill_transformer_validation_error checks distillTransformerRepo + high/low-noise filenames before marking availableLocally true. Closes the FU-053 false positive on Wan2.2-I2V-A14B distill bf16/fp8.
- pre-build-check.sh + .mjs pointed at the wrong turbo fork (johndpope/...planarquant); corrected to TheTom/...turboquant-kv-cache matching build-llama-turbo.sh + CLAUDE.md.
Features:
- Star/favourite models on Chat -> My Models. New favoriteModelRefs in settings (+ UpdateSettingsRequest field, dedup-trimmed apply, payload), ActionIconName 'star'/'starOutline' SVG, .action-favorite CSS, toggle handler in App.tsx writes via PATCH /api/settings + refreshes. Starred rows lift to top of the library list.
- Diagnostics 'Disk usage - top 20 model repos' section. New GET /api/diagnostics/storage-top endpoint walks every enabled modelDirectories entry one level deep, sums via _path_size_bytes (inode-deduped so HF snapshot/blob symlinks count once). Closes the Stuff Diver gap on HF cache layouts. Live: 1213 GB total on this box.
Matrix runner (scripts/cache-strategy-matrix.py):
- Smoke models bumped Qwen2.5-0.5B -> Qwen3-0.6B (current gen)
- New cells for MTPLX MLX, GGUF MTP (FU-047), vLLM native/turboquant/triattention/dflash.
- BackendCapabilities adds mtplx_available, gguf_mtp_available, vllm_available probed from /api/health.
Tests:
- 4 new unit tests pin the FU-053 distill validator (no-distill, missing snapshot, partial snapshot, both files present).
- test_cache_strategy_matrix_runner kwargs widened for new caps.
- Full suite: 1455 passed, 1 skipped, 132 subtests passed.
- npx tsc --noEmit clean; npm test 32 files / 371 tests pass.
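The inode-deduplicated sizing mentioned for the storage-top endpoint might look roughly like the sketch below. It is an illustration of the idea only; the real ``_path_size_bytes`` is not shown in this PR text, so the name ``path_size_bytes`` and its exact behaviour are assumptions.

```python
import os


def path_size_bytes(root):
    """Sum file sizes under `root`, counting each underlying inode once.

    os.stat follows symlinks, so a Hugging Face snapshot symlink and the
    blob it points at resolve to the same (device, inode) pair and are
    only counted a single time.
    """
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # broken symlink or unreadable file: skip it
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size
    return total
```

Without the ``seen`` set, a cache laid out as ``blobs/`` plus ``snapshots/`` symlinks would report roughly double its real footprint.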
Brings PR #57 up to date with staging (29 commits behind). Two conflicts resolved manually: - src/api/index.ts: re-exports from ./setup. Both branches added MTPLX-related exports independently; took the union alphabetically sorted (getMtplxInstallStatus, getMtplxStatus, startMtplxInstall). - src/api/setup.ts: torch-upgrade machinery was introduced via parallel commits on both branches (our dca2c12 + staging f514ea4). Auto-merge produced duplicate TorchUpgradeAvailability / TorchUpgradeType / TorchUpgradeUnavailableReason / TorchUpgradeAttempt / TorchUpgradeJobState declarations + duplicate checkTorchUpgradeAvailable / startTorchUpgrade / getTorchUpgradeStatus functions. Removed the duplicate block; kept one canonical section with types + functions in proper order. Verified post-merge: - npx tsc --noEmit: clean - npm test: 32 files / 371 tests pass - pytest tests/: 1455 passed, 1 skipped, 132 subtests passed
Feature/chaos engine ai cli
Foundation for in-app install UX. Lazy importability + version probes for nunchaku / sageattention / dflash-mlx / dflash-cuda / triattention / kvpress, plus a Windows-only wsl2 detector that seeds the upcoming vLLM-via-WSL bridge. Eleven new fields on BackendCapabilities surface through /api/health; the placeholder probe primes them on first paint so the UI never flashes Install for a package that is actually present. Probes resilient to the half-baked-install failure mode we hit on Windows (torch directory present but Python source missing): find_spec swallows ValueError, version reads swallow ImportError and missing __version__. DFlash MLX vs CUDA flags delegate to the existing dflash.is_mlx_available / dflash.is_vllm_available helpers so the upstream package-layout dance stays in one place. Tests: 25 in tests/test_accelerator_capabilities.py covering present / absent / broken-install / WSL-status branches.
Tests should exercise the same install users have, not a parallel
.venv install. New tests/conftest.py calls ensure_extras_on_sys_path
at collection time, so pytest tests/ resolves torch / diffusers /
mlx / nunchaku / sageattention / triattention / vllm against the
persistent extras dir at:
Windows: %LOCALAPPDATA%\ChaosEngineAI\extras\cp{XY}\site-packages
macOS: ~/Library/Application Support/ChaosEngineAI/extras/cp{XY}/site-packages
Linux: ${XDG_DATA_HOME}/ChaosEngineAI/extras/cp{XY}/site-packages
A torch upgrade landing via the in-app installer is reflected in the
next pytest run automatically; no pip install dance in .venv. On a
fresh CI box without the extras dir the conftest is a silent no-op,
so existing test boxes keep working.
Set CHAOSENGINE_TEST_TRACE_EXTRAS=1 to log which extras path got
loaded for a given run.
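The per-platform extras lookup can be sketched as below. The three base directories come from the list above; the function names, the ``~/.local/share`` fallback when ``XDG_DATA_HOME`` is unset, the ``cp312``-style tag format, and inserting at the front of ``sys.path`` are assumptions made for illustration, not the behaviour of the real ``ensure_extras_on_sys_path``.

```python
import os
import sys


def extras_site_packages(platform=sys.platform, env=None, home=None,
                         version_info=sys.version_info):
    """Build the persistent extras site-packages path for one platform."""
    env = os.environ if env is None else env
    home = os.path.expanduser("~") if home is None else home
    tag = f"cp{version_info[0]}{version_info[1]}"  # e.g. cp312
    if platform.startswith("win"):
        base = env.get("LOCALAPPDATA") or home + "\\AppData\\Local"
        return "\\".join([base, "ChaosEngineAI", "extras", tag, "site-packages"])
    if platform == "darwin":
        base = home + "/Library/Application Support"
    else:
        base = env.get("XDG_DATA_HOME") or home + "/.local/share"
    return "/".join([base, "ChaosEngineAI", "extras", tag, "site-packages"])


def ensure_on_sys_path(path):
    """Add `path` to sys.path when it exists; a silent no-op otherwise."""
    if not os.path.isdir(path) or path in sys.path:
        return False
    sys.path.insert(0, path)
    return True
```

Returning ``False`` for a missing directory matches the "silent no-op on a fresh CI box" behaviour described above.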
Runners (e2e_test_suite.py, cache-strategy-matrix.py) now print an
actionable hint when the backend is not reachable, telling the user to
open the ChaosEngineAI app, rather than just "backend not reachable;
aborting". They still exit with codes 2 and 3 respectively so CI gates
stay reliable.
Docs (testing/overview.md, testing/e2e-testing.md) updated with the
canonical open-the-app-then-run-tests flow, with the headless dev
backend kept as an advanced option for contributors.
build-sdcpp.sh failed for the user with ``fatal: not a git repository`` because /tmp/stable-diffusion.cpp/.git survived a partial /tmp cleanup as an empty directory: the existence test ``[[ -d $DIR/.git ]]`` passed but ``git fetch`` immediately failed inside it. The same latent bug existed across 5 bash scripts + 2 PowerShell scripts. All now use ``git rev-parse --git-dir`` to validate that the checkout is a real repo, and ``rm -rf`` the stale dir before re-cloning when it is not.
Patched:
- scripts/build-sdcpp.sh
- scripts/build-llama-turbo.sh
- scripts/update-llama-turbo.sh
- scripts/update-sdcpp.sh
- scripts/update-llama-cpp.sh (improved error message)
- scripts/build-llama-turbo.ps1
- scripts/update-llama-turbo.ps1
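The difference between "a ``.git`` directory exists" and "this is a real checkout" can be sketched in Python as follows. The patched scripts are bash and PowerShell; this translation, the function name, and the extra comparison against ``<path>/.git`` (which guards against git discovering a parent repository) are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def is_real_git_checkout(path):
    """Return True only when git itself accepts `path` as a repository root."""
    path = Path(path)
    try:
        result = subprocess.run(
            ["git", "-C", str(path), "rev-parse", "--git-dir"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return False  # git binary is not installed
    if result.returncode != 0:
        return False  # e.g. "fatal: not a git repository"
    git_dir = Path(result.stdout.strip())
    if not git_dir.is_absolute():
        git_dir = path / git_dir
    # Reject the case where git walked up and found an enclosing repo.
    return git_dir.resolve() == (path / ".git").resolve()
```

A caller would delete and re-clone the directory whenever this returns ``False``, which covers the empty ``.git`` directory left behind by a partial ``/tmp`` cleanup.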
Reusable card for the six CUDA-side accelerators (nunchaku,
sageattention, dflash-mlx, dflash-cuda, triattention, kvpress).
Three placement variants share one component so the per-feature
surfaces in Phases 3-6 stay in sync without re-implementing the
four states (idle / installing / installed / failed) per surface:
- card: full banner with title, claim, applies-to, size pill,
primary action. Lands in the Image / Video Studio runtime
banners and the Diagnostics Boost Pack.
- pill: compact horizontal chip with 4-bit-style copy. Lands on
catalog variant cards in the Discover / Models tabs.
- row: table form for Diagnostics Boost Pack's scannable view.
State ownership: parent owns the install lifecycle (which package
is in flight, success/failure, captured pip output). The card
only owns the log-expanded toggle. Mirrors the CudaTorchLogPanel
contract so the card is cheap to render in many places without
duplicating polling work.
New catalog (src/components/acceleratorCatalog.ts) is the single
source of truth for each accelerator's pip name, capability flag,
speedup claim, size, install mode, and platform gate. Adding a
seventh accelerator is one entry here, one Phase 1 capability
flag, and one row in the backend's _INSTALLABLE_PIP_PACKAGES.
NativeBackendStatus (src/types/server.ts) extended with the 13
FU-056 Phase 1 fields plus the older vllm/mtplx/ggufMtp fields
that were already on the wire but missing from the TS interface.
All fields optional so a backend running an older build than the
frontend doesn't break the type contract.
Tests (28 new): catalog shape pinning + getAccelerator lookup +
isPlatformCompatible matrix + readInstalled / readVersion /
platformLabel / actionLabelFor branch coverage. Vitest harness
stays at pure-function level - no React Testing Library yet, per
the existing src/components/__tests__/ convention.
CSS: .accelerator-card / -pill / -row variants in styles.css,
matching the existing .torch-upgrade-pill colour vocabulary
(rgba(80, 140, 220, ...) for the not-installed accent,
rgba(80, 180, 100, ...) for installed, --border + --surface
tokens for the chrome).
First end-to-end UX slice for FU-056. The Diagnostics tab gains a Boost Pack section listing all six CUDA-side accelerators (nunchaku, sageattention, dflash-mlx, dflash-cuda, triattention, kvpress) as a single scannable table. Status pill + Install / Retry button per row; click installs via the existing POST /api/setup/install-package endpoint, output captured into a collapsible details, then capabilities re-probe so the "Installed v1.2.1" pill flips without a parent refetch. Self-probes capabilities on mount via refreshCapabilities() so the panel works standalone — DiagnosticsPanel only passes backendOnline. Per-accelerator install state lives in a record keyed by pip name, so multiple installs can run concurrently if the user is impatient (the backend serialises pip writes at the OS-FS layer). Renders every catalog row with showIncompatible=true: this is the "see everything" surface, not a per-feature gate. Apple-Silicon and CUDA accelerators both list; the platform column tells the user which apply to their box, and disabled state + tooltip blocks an ill-fitting install. Phases 3-5 will filter per surface. Closes the first observable loop: Phase 1 probe → Phase 2 card (row variant) → install → re-probe → installed state. Same Component renders pill + card + row, so the per-feature surfaces in Phases 3-5 ride the same diff. No new tests — the pure logic (readInstalled, readVersion, actionLabelFor, platformLabel, isPlatformCompatible) is already pinned by Phase 2's 28 unit tests. The Boost Pack itself is wiring: fetch capabilities, dispatch install, re-fetch on success. Mirrors the existing CudaTorchLogPanel pattern.
Wires accelerator install affordances into the three Image surfaces
users actually look at when picking + running a model:
1. Image Models tab — every installed FLUX / SD3.5 / Qwen-Image /
SANA / PixArt row gets read-only pills next to the style tags:
"🚀 SVDQuant 4-bit" + "🚀 Fast attention DiT" when the
accelerator is missing, "✓ ..." when present. UNet pipelines
(SD1.5 / SDXL) show no pills — neither nunchaku nor
sageattention applies.
2. Image Discover tab — same pills on catalog variant cards in
the same position. Lets users see acceleration potential
before committing to a download.
3. Image Studio runtime banner — new "Performance boosters"
section between the torch-upgrade pill and the model-load
summary. Card variants of the same accelerators with full
Install / Retry buttons. Self-contained install state: clicks
POST /api/setup/install-package, capture the response
capabilities, and overlay them onto the parent-provided
snapshot so the card flips to "✓ Installed v..." without
waiting for the next workspace refetch.
The pills on the Models / Discover tabs are deliberately
read-only — the install action lives in Studio's runtime banner so
install state stays concentrated. A new optional onInstall prop on
AcceleratorCard drives this: when omitted, the card renders as
passive info.
New helper getApplicableAccelerators(repo) maps a model repo to the
accelerator IDs that apply. Pattern-matches on the family slug
(FLUX.1, sd3.5, qwen-image, sana, pixart-sigma) so we don't have to
edit catalog/image_models.py to land this — the catalog-side
recommendedAccelerators metadata pattern is reserved for Phase 7
when the i18n + per-variant overrides land together. 7 new unit
tests pin the matrix (FLUX, SD3.5, Qwen-Image, SANA, PixArt for
nunchaku+sageattention; Wan / HunyuanVideo / LTX / CogVideoX /
Mochi for sageattention-only; Wan2.1-T2V-1.3B for the triattention
LongLive bonus; SDXL / SD1.5 return empty).
NativeBackendStatus threads from App.tsx → ImageModelsTab,
ImageDiscoverTab, ImageStudioTab → ImageStudioRuntimeBanner →
ImageStudioBoosters. The prop is optional everywhere so older
backends without FU-056 Phase 1 fields collapse pills to their
"available" state rather than crashing the tab.
Deferred to a follow-up commit: the post-generation suggestion
toast (fires when a non-Nunchaku FLUX gen takes >12s on CUDA). The
discovery + install surfaces in this commit already give users a
clean path to install accelerators contextually; the toast adds a
nudge but the install affordance is reachable without it.
Mirrors the Image-side wiring from Phase 3 onto the Video tabs:
1. Video Models tab - every Wan / HunyuanVideo / LTX / CogVideoX /
Mochi row gets read-only accelerator pills next to the style
tags. SageAttention applies to all CUDA video DiTs;
TriAttention surfaces specifically on Wan 2.1 T2V 1.3B for the
LongLive real-time long-clip mode.
2. Video Discover tab - same pills on catalog variant cards in
the same chip-row position.
3. Video Studio runtime banner - new "Performance boosters"
section between the torch-upgrade pill and the LongLive
install row. Full card variants with working Install / Retry
buttons + collapsible pip output.
Implementation note: the booster section was identical to the
image-side equivalent (same install state machine, same card
rendering, same overlay-on-install-success pattern). Renamed
ImageStudioBoosters -> MediaStudioBoosters and moved to
src/components/ so both surfaces share one file. The component
now takes a minimal {repo, name?} variant slice rather than a
concrete ImageModelVariant / VideoModelVariant - both shapes
carry those fields and the booster logic doesn't need anything
else. One source of truth for the install / overlay / re-probe
dance.
NativeBackendStatus threads from App.tsx -> VideoDiscoverTab,
VideoModelsTab, VideoStudioTab -> VideoStudioRuntimeBanner ->
MediaStudioBoosters. Prop is optional everywhere so older
backends without FU-056 Phase 1 fields collapse pills to their
"available" state rather than crashing the tab.
No new tests required - the getApplicableAccelerators repo-pattern
matrix is already pinned by Phase 3's 7 tests, including all six
relevant video repos (Wan2.1-T2V-1.3B with triattention bonus,
Wan2.2-T2V-A14B without, HunyuanVideo, LTX-Video, CogVideoX,
Mochi). MediaStudioBoosters internals match the previous
ImageStudioBoosters, no behavioural changes.
Brings the in-app accelerator install affordance to the chat
surface. When the user is chatting with a model that has a
registered DFlash draft AND the appropriate pip package isn't
installed yet, an unobtrusive nudge bar appears above the prompt
textarea:
DFlash speculative decoding can ~2x this model with no quality
loss. [Install DFlash]
Click installs the right package for the active backend
(``dflash-mlx`` on Apple Silicon MLX, ``dflash`` on CUDA vLLM)
via the existing ``handleInstallPackage`` dispatcher. The bar
self-hides when the package lands and capabilities re-probe.
Twin gating logic to the AcceleratorCard pattern: the hint only
renders when all three signals line up (model in supportedModels,
package missing for active backend, supported backend). The
backend probe + ``resolveDflashSupport`` helper already exist
from FU-034; this commit wires them into the composer.
Drive-by fix in RuntimeControls.tsx: the existing "Install DFlash"
button next to the launch-settings toggle hard-coded
``onInstallPackage("dflash-mlx")``, which silently installed the
Apple-Silicon package on CUDA / Windows boxes running vLLM. Both
the launch-settings button and the new composer hint now route
through a shared ``dflashPackageFor(backend)`` helper that picks
the right package per backend. 3 new unit tests pin the matrix
(mlx -> dflash-mlx, vllm -> dflash, null / unknown -> dflash-mlx
as safe default).
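The backend-to-package matrix pinned by those tests is small enough to sketch directly. The real ``dflashPackageFor`` helper is TypeScript; this Python rendering and its name are illustrative only.

```python
def dflash_package_for(backend):
    """Pick the DFlash pip package for the active backend.

    mlx -> dflash-mlx, vllm -> dflash; None or an unknown backend falls
    back to dflash-mlx, matching the "safe default" described above.
    """
    if backend == "vllm":
        return "dflash"
    return "dflash-mlx"
```

Routing both install buttons through one helper like this is what removes the hard-coded ``dflash-mlx`` install on CUDA / Windows boxes.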
Net change for the user: discover acceleration potential from
the place where you generate (chat composer / studio runtime
banner / catalog cards), not from a settings page you have to
remember to visit.
vLLM ships no native Windows wheels; this commit lets Windows users
install vLLM into an isolated WSL venv with one click. Three pieces:
1. **Detector** (backend_service/inference/accelerators.py):
four new probes layered on top of the existing wsl2_available
helper:
- wsl_default_distro() reads "Default Distribution: Ubuntu-X" out
of the UTF-16 ``wsl --status`` output
- wsl_cuda_available() runs ``wsl -- nvidia-smi -L`` to confirm
CUDA passthrough is working inside the distro
- wsl_vllm_available() runs an ``import vllm`` inside the managed
venv at ~/.chaosengine/vllm-venv
- wsl_vllm_version() reads __version__ from the same venv
Four matching fields on BackendCapabilities (wslDistroName,
wslCudaAvailable, wslVllmAvailable, wslVllmVersion). The detail
probes shell out via wsl.exe and can take a few seconds on a
cold WSL service start, so they're gated behind a wsl2_active
short-circuit — hosts without WSL pay zero subprocess cost.
2. **Install endpoint** (backend_service/routes/setup/vllm_wsl.py):
POST /api/setup/install-vllm-wsl + /status. Background-thread job
with five steps:
- preflight (verify CUDA visible in WSL)
- venv (python3 -m venv ~/.chaosengine/vllm-venv)
- pip-upgrade (pip + setuptools + wheel)
- pip-vllm (the long one, ~2 GB / 5-15 min)
- verify (import vllm)
Same single-job semantics as install-longlive: a second POST
while running returns the running job state. The venv is rooted
in the WSL user's $HOME (ext4-backed) so CUDA torch wheels don't
pay the ~10x IO penalty of being on /mnt/c/.
3. **WslBridgePanel** (src/features/settings/WslBridgePanel.tsx):
Windows-only Setup panel rendered alongside the Boost Pack on
the Diagnostics tab. Four bucket states:
- WSL2 not installed → ``wsl --install`` copy-paste hint + MS docs
- WSL2 ready, no CUDA → NVIDIA WSL driver kicker link
- WSL2 + CUDA ready, vLLM missing → one-click install button
- vLLM ready → green pill with version + "Reinstall" affordance
Self-probes capabilities on mount, polls install status at 1.5 Hz
while a job is in flight, refreshes capabilities on completion so
the bucket flips without a parent refetch. Uses the existing
InstallLogPanel for log tail (extended to accept the new
"vllm-wsl" variant).
Tests: 12 new probe tests covering the present / absent / cold-host
matrix for each WSL detail probe, plus 4 endpoint tests pinning the
job-state shape + the Windows platform gate + the start/status
contract. Live-verified on Windows + RTX 4090: detector returns
``distro=Ubuntu-24.04, cuda=True, vllm=False, version=None`` —
correct for the dev box right now.
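Reading the default distro out of the UTF-16 ``wsl --status`` output, as the detector description puts it, could be sketched like this. The function name, the little-endian decoding, and the English label are assumptions; localised Windows installs may print a different label.

```python
import re

_DEFAULT_DISTRO = re.compile(r"Default Distribution:\s*(.+)")


def parse_default_distro(raw):
    """Extract the default distro name from raw `wsl --status` bytes.

    Returns None when the label is absent (WSL missing, or localised output).
    """
    text = raw.decode("utf-16-le", errors="ignore")
    text = text.replace("\ufeff", "").replace("\x00", "")
    match = _DEFAULT_DISTRO.search(text)
    return match.group(1).strip() if match else None
```

Decoding explicitly matters because treating the bytes as UTF-8 leaves a NUL after every character, which makes a plain substring search for the label fail.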
Deferred to a follow-up commit: the actual engine routing so a
vLLM model load transparently launches inside the WSL venv. This
commit ships only the install path so users can stand up the venv
today; the engine wiring needs careful path translation
(/mnt/c/Users/... → Windows paths) and stdout streaming that
deserves its own focused PR.
Completes the WSL bridge so Windows users get transparent vLLM
inference. A model load with backend=vllm on Windows + wslVllm
installed transparently spawns the OpenAI-compatible server inside
the WSL Ubuntu venv and proxies /v1/chat/completions through it.
No user action beyond clicking "Install vLLM in WSL" once.
Three pieces:
1. **VllmWslEngine** (backend_service/inference/vllm_wsl_engine.py):
HTTP-bridge engine modelled on MtplxEngine. Subprocess shape:
wsl -- ~/.chaosengine/vllm-venv/bin/python
-m vllm.entrypoints.openai.api_server
--model <ref> --host 127.0.0.1 --port <free>
--max-model-len <ctx> --trust-remote-code
WSL2 mirrors loopback to the Windows host so the Windows backend
reaches the listener at 127.0.0.1:<port> without any port-forward
ceremony. Implements both generate() and stream_generate() so the
existing chat surface stream path works end to end.
2. **windows_path_to_wsl helper**: a local model at
C:\Users\Dan\AI_Models\Qwen3-7B gets translated to
/mnt/c/Users/Dan/AI_Models/Qwen3-7B before being passed to vLLM,
so a Windows-side download is reachable from inside WSL. HF repo
ids (Qwen/Qwen3.5-7B) pass through unchanged - vLLM downloads them
into its WSL-native HF cache, which avoids the ~10x IO penalty
of /mnt/c-based cache reads.
3. **Routing** (backend_service/inference/controller.py): when
``hint == "vllm"`` the controller now prefers VllmWslEngine on
Windows + wslVllmAvailable=True, falling through to the in-process
VLLMEngine on Linux. On Windows boxes without the bridge, the
error message points the user at Diagnostics → WSL2 vLLM bridge
instead of the bare "pip install vllm" hint that doesn't work on
Windows.
Speculative decoding via the WSL bridge isn't wired yet - the
in-process VLLMEngine uses vllm.LLM's speculative_config= kwarg, but
the OpenAI server entry-point uses --speculative-model /
--num-speculative-tokens which need separate wiring. The runtime
note honestly flags the gap rather than silently dropping requests.
Tests: 13 new in test_vllm_wsl_engine.py covering:
- windows_path_to_wsl matrix (backslash, forward-slash, drive
casing, WSL passthrough, repo-id passthrough, UNC, relative)
- load_model platform gate (off-Windows rejects)
- load_model capability gate (wslVllm missing rejects)
- argv composition (every required vllm flag present + ordered)
- happy-path lifecycle (Popen called once, /health polled,
LoadedModelInfo populated correctly, pid reachable)
- path translation on a Windows model path
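A minimal sketch of the path translation, assuming only the behaviour the commit spells out: drive-letter paths map to ``/mnt/<drive>/...``, while WSL paths and HF repo ids pass through. How the real ``windows_path_to_wsl`` treats UNC and relative paths is not stated here, so this sketch simply returns them unchanged.

```python
import re

# "C:\..." or "C:/..." with any drive-letter casing.
_DRIVE_PATH = re.compile(r"^([A-Za-z]):[\\/](.*)$")


def windows_path_to_wsl(ref):
    """Translate a Windows drive path to its /mnt/<drive>/ form.

    Anything that is not a drive-letter path (WSL paths, HF repo ids,
    and in this sketch also UNC and relative paths) is returned as-is.
    """
    match = _DRIVE_PATH.match(ref)
    if match is None:
        return ref
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")
```

Repo ids such as ``Qwen/Qwen3.5-7B`` never match the drive pattern, so they reach vLLM untouched and are downloaded into the WSL-native cache.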
Caught during a live end-to-end test on a fresh Ubuntu 24.04 WSL
install: ``python3 -m venv ~/.chaosengine/vllm-venv`` fails with
``ensurepip is not available`` because Ubuntu 24.04 ships python3
without the venv module. Before this commit the user would see a
confusing error mid-install ("Failed to create the WSL venv. See
output above.") with the real fix buried in stderr.
Now the preflight step explicitly probes ``python3 -c 'import
ensurepip'`` after the CUDA check. When it fails, the install
endpoint surfaces the exact apt command:
sudo apt update && sudo apt install -y python3-venv
instead of trying to create the venv and erroring out. Same
pattern as the existing NVIDIA-driver-not-found path: tell the
user what to do, don't pretend to recover.
…se 8)
End-to-end test validated against real CUDA + real vLLM 0.21.0 +
real WSL2 Ubuntu-24.04 on Windows + RTX 4090. Loaded
Qwen2.5-0.5B-Instruct in 96 s and generated "Paris." for the
prompt "The capital of France is" — 1.19 s HTTP round-trip from
the Windows backend into WSL and back.
Four fixes the live test surfaced, none of which would have been
caught by mocked unit tests:
1. **PATH plumbing through grandchild processes**: the engine
subprocess inside vLLM (EngineCore) couldn't find ``ninja`` for
flashinfer's JIT-compiled sampling kernels, even though it lived
in the venv's bin/. The command builder now wraps the python
invocation in ``bash -c`` so we can prepend
``~/.chaosengine/vllm-venv/bin`` to PATH explicitly. The PATH
value is double-quoted because WSL2 interop injects the Windows PATH
into bash, and that PATH contains paths with spaces
(``/mnt/c/Program Files/NVIDIA…``) which otherwise word-split
into ``export: 'Files/NVIDIA': not a valid identifier`` errors.
2. **vLLM 0.21+ flashinfer JIT escape hatches**: even with ninja
reachable, flashinfer needs ``nvcc`` for the second compile
stage. Setting ``VLLM_USE_FLASHINFER_SAMPLER=0`` +
``VLLM_ATTENTION_BACKEND=TORCH_SDPA`` routes through pre-built
PyTorch kernels. ``--enforce-eager`` disables CUDA-graph
compilation. Loses some perf but avoids the second JIT.
3. **/v1/models probe instead of /health**: vLLM's ``/health``
returns 200 with an empty body, which tripped ``_http_json``'s
``json.loads`` and made ``_wait_for_server`` retry indefinitely
until the timeout. ``/v1/models`` returns the loaded-model list
as JSON so the parse succeeds and we return on first OK.
4. **shlex-quoted model arg**: a model path with spaces (e.g. a
Windows-translated ``/mnt/c/My Models/Qwen3-7B``) would
word-split through the bash -c parse without quoting. New test
pins the round-trip.
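Fixes 1, 2 and 4 all land in the command builder, which might be sketched as below. The flags and environment variables are the ones named in this commit; the function name, the argument list, and the use of ``$HOME`` (a ``~`` does not expand inside double quotes) are assumptions, and the real builder may differ in detail.

```python
import shlex

VENV_BIN = "$HOME/.chaosengine/vllm-venv/bin"


def build_wsl_vllm_argv(model, port, max_model_len):
    """Compose a `wsl -- bash -c <script>` argv for the vLLM OpenAI server."""
    server_cmd = " ".join([
        f'"{VENV_BIN}/python"',
        "-m", "vllm.entrypoints.openai.api_server",
        "--model", shlex.quote(model),  # survives spaces in the path
        "--host", "127.0.0.1",
        "--port", str(port),
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",
        "--enforce-eager",  # skip CUDA-graph compilation
    ])
    script = (
        # Double quotes keep the space-containing Windows PATH in one word.
        f'export PATH="{VENV_BIN}:$PATH"; '
        "export VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ATTENTION_BACKEND=TORCH_SDPA; "
        f"exec {server_cmd}"
    )
    return ["wsl", "--", "bash", "-c", script]
```

Exporting ``PATH`` inside the bash script, rather than only on the Python child, is what lets grandchild processes such as EngineCore find ``ninja`` in the venv.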
Plus the install endpoint's preflight already grew a clear
"sudo apt install python3-venv" message (last commit) — caught the
same way, just earlier in the chain.
New file ``scripts/live_e2e_vllm_wsl.py`` — not part of the
regular test suite; one-shot script that probes capabilities,
constructs the engine, loads a tiny chat-tuned model
(Qwen/Qwen2.5-0.5B-Instruct), generates a deterministic prompt,
prints metrics, tears down. Run from Windows + WSL with
vllm-venv installed: ``.venv\Scripts\python.exe scripts\live_e2e_vllm_wsl.py``.
Exit 0 on success, 1 with full traceback on failure.
Tests: 15 in test_vllm_wsl_engine.py still pass (3 lifecycle +
3 command-shape + 5 path-translation + 2 platform-gate +
2 capability-gate). All 42 in the wider WSL-bridge test files green.
Live-test run output:
Loaded in 96.3s
engine: vllm-wsl
runtimeNote: vLLM 0.21.0 running inside WSL (Ubuntu-24.04).
pid: 34036
port: 58586
text: 'Paris.'
finishReason: stop
promptTokens: 34
completionTokens: 3
responseSeconds: 1.19
Closes the test-coverage gap on everything FU-056 has shipped over the
previous eight commits. Three small additions across the existing
test-gate scripts:
1. **scripts/cache-strategy-matrix.py**: capability probe now considers
the WSL vLLM bridge a valid vllm provider. Without this, all four vllm
matrix cells would skip with "vLLM not installed (CUDA-only)" on
Windows boxes even though the bridge route works (validated by the
live e2e in commit c4f3701). New ``wsl_vllm_available`` field on
BackendCapabilities; the skip-reason copy now names both routes so a
user reading a skip-row knows their actionable next step regardless
of OS.
2. **scripts/pre-build-check.mjs [5/8]**: extended with a new sub-probe
that walks ``src/components/acceleratorCatalog.ts`` for every
(pipPackage, capabilityField) pair and asserts each one exists in
(a) the backend's _INSTALLABLE_PIP_PACKAGES allow-list and (b) the
BackendCapabilities dataclass. Surface:
``PASS Accelerator catalog ↔ backend (6 entries)``. Catches drift:
adding a 7th catalog row without wiring its pip package + capability
flag would fail the gate at build time rather than at first user
click. Six entries today (nunchaku, sageattention, dflash-mlx,
dflash-cuda, triattention, kvpress).
3. **scripts/e2e_test_suite.py phase 6**: two new read-only probes
alongside the existing 7:
- ``vllm-wsl-status``: GETs /api/setup/install-vllm-wsl/status and
asserts the JSON shape (phase + done fields present). Verifies the
Phase 8 install endpoint at minimum returns the expected schema even
when no install has been started.
- ``fu-056-capability-flags``: GETs /api/health and asserts all 7
FU-056 Phase 1 capability fields are present on ``nativeBackends``.
The fields are optional in the schema (older backends shouldn't crash
the frontend), but the gate ensures release builds expose them.
Phase 6 grows from 7 to 9 checks. Verified live against the user's
running backend: PASS 9/9.
No new test files. Phase 9 is gate plumbing on existing scripts.
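The catalog ↔ backend drift check described above can be sketched in Python as follows. This is an illustration only: the real probe lives in ``scripts/pre-build-check.mjs`` and parses the TypeScript catalog, and the capability field names, the two-row catalog, and ``find_catalog_drift`` are all hypothetical stand-ins.

```python
from dataclasses import dataclass, fields


# Hypothetical stand-in for the backend dataclass; the field names are
# assumptions, not the project's real capability flags.
@dataclass
class BackendCapabilities:
    nunchaku_available: bool = False
    sageattention_available: bool = False


# Hypothetical stand-in for the backend's pip allow-list.
INSTALLABLE_PIP_PACKAGES = {"nunchaku", "sageattention"}

# (pipPackage, capabilityField) pairs as they might be read from the catalog.
CATALOG = [
    ("nunchaku", "nunchaku_available"),
    ("sageattention", "sageattention_available"),
]


def find_catalog_drift(catalog, allow_list, capabilities_cls):
    """Return one readable problem per unwired catalog entry."""
    known_fields = {f.name for f in fields(capabilities_cls)}
    problems = []
    for pip_package, capability_field in catalog:
        if pip_package not in allow_list:
            problems.append(f"{pip_package}: not in pip allow-list")
        if capability_field not in known_fields:
            problems.append(
                f"{pip_package}: no capability field {capability_field}"
            )
    return problems
```

An empty result corresponds to the PASS line; a new catalog row with no backend wiring yields one problem per missing piece, which is the failure the gate is meant to surface at build time.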
Caught during the live WSL test sweep: ``backend_service.app.main()``
hard-coded ``port=DEFAULT_PORT`` in the ``uvicorn.run`` call and ignored
the ``--port`` flag the test scripts have been passing. Worked
historically because DEFAULT_PORT already reads ``CHAOSENGINE_PORT``
env, so test runs that set the env var got the right port — but
``python -m backend_service.app --port 8877`` silently bound 8876.
Now ``main()`` uses argparse with env-var fallbacks:
--port → $CHAOSENGINE_PORT → 8876
--host → $CHAOSENGINE_HOST → 127.0.0.1
CLI > env > default. Surfaces ``--help`` properly (the user can
discover the args). The existing env-var path keeps working for the
Tauri shell + headless test scripts that already set ``CHAOSENGINE_PORT``.
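A minimal sketch of the CLI > env > default precedence, assuming a helper named ``parse_bind_args`` (hypothetical; the real ``main()`` may be structured differently). The env-var names and defaults are the ones listed above.

```python
import argparse
import os

FALLBACK_PORT = 8876
FALLBACK_HOST = "127.0.0.1"


def parse_bind_args(argv=None, environ=None):
    """Resolve host/port with CLI > env > default precedence."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Backend bind options")
    # The env var becomes the argparse default, so an explicit flag wins
    # and the built-in fallback applies only when neither is set.
    parser.add_argument(
        "--port",
        type=int,
        default=int(environ.get("CHAOSENGINE_PORT", FALLBACK_PORT)),
    )
    parser.add_argument(
        "--host",
        default=environ.get("CHAOSENGINE_HOST", FALLBACK_HOST),
    )
    return parser.parse_args(argv)
```

Feeding the env value in as the argparse default keeps the precedence in one place and means ``--help`` lists both flags without extra code.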
Three new helper scripts under ``scripts/`` for the WSL dev workflow:
- ``install_llama_server_wsl.sh`` — downloads the latest llama.cpp
Linux release into ``~/.chaosengine/bin/`` for the WSL backend.
- ``run_backend_wsl.sh`` — launches the backend on port 8877 with
auth disabled (env: ``CHAOSENGINE_REQUIRE_AUTH=0``), pointing at the
WSL-side llama-server. Detached via nohup + disown.
- ``probe_backend_wsl.sh`` — diagnostic helper; runs the backend
foreground for 3 s and surfaces import / bind errors.
WSL test sweep results (Ubuntu-24.04, RTX 4090, vllm-venv at 0.21.0):
- pytest tests/ — 1472 passed, 21 failed, 21 skipped
(49 more passes than Windows — fewer platform-specific failures)
- e2e_test_suite.py --smoke — 6/0/0 PASS including the two new
FU-056 Phase 9 phase-6 probes (vllm-wsl-status + capability flags)
- cache-strategy-matrix.py --quick — 0/0 ran, 15/15 skipped honestly
(only ``native`` strategy in dev venv; no turbo binary, no dflash,
no models in dev library — all skip reasons accurate)
Two related UX cleanups landed together because they share the same
plumbing pattern (App.tsx → tabs → leaf components):
1. **Hide MTPLX install affordances on Windows / Linux.** The MTPLX
block in RuntimeControls (the launch-settings modal that opens
from Chat / Compare / HTML Challenge / Benchmarks) used to render
the MTPLX checkbox + "Install MTPLX" button + info disclosure on
every host. MTPLX is Apple-Silicon-only — the install would error
on Windows and the checkbox would render disabled with no path to
recovery. Per the FU-034 rule (hide unrecoverable options, don't
grey them out), the whole block is now gated on a new
``isAppleSilicon`` prop threaded from App.tsx via:
App.tsx
→ LaunchModal / CompareView / HtmlChallengeTab /
BenchmarkRunTab
→ ChallengePickerModal (for HtmlChallengeTab)
→ ModelLaunchModal
→ RuntimeControls
Three call sites on RuntimeControls (the MTPLX label, the
info-panel expand, the info button) now ALL gate on the prop.
``dflash-mlx`` was already platform-gated via the FU-056 Phase 2
AcceleratorCard catalog (platformGate: "apple-silicon").
2. **Chat empty-state banner.** Fresh-install users opening the
Chat tab used to see "Send a message to start the conversation."
followed by a silent auto-load of the largest MLX direct variant
(a 15+ GB download that doesn't even work on Windows/Linux —
MLX backend doesn't exist there). Replaced with a
``<ChatEmptyStateBanner>`` that surfaces a clear CTA: "Browse
Discover" when library is empty, "Open Models" when models are
present but none loaded. No silent auto-loads, no confused users
waiting on the wrong download.
The banner is purely additive: the composer textarea above it stays
usable, so users can still type while the banner points them to Discover.
Plumbing:
- New ``src/utils/platform.ts`` with ``isAppleSiliconHost``,
``isCudaHost``, ``isIntelMac`` helpers. Reads from
``workspace.system`` (platform + arch) which the backend already
populates from ``platform.system()`` + ``platform.machine()``.
- 15 unit tests in ``src/utils/__tests__/platform.test.ts`` pin
every host-classifier branch (Darwin arm64, aarch64, Intel Mac,
Windows, Linux, null/undefined, case-insensitive).
- ``isAppleSiliconHost(workspace.system)`` computed once at App.tsx
top-level, threaded as ``isAppleSilicon`` prop to the four call
sites that own MTPLX surfaces.
- New ``<ChatEmptyStateBanner>`` component with two states
(no-models / no-loaded-model), each with appropriate CTA.
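The host classification can be sketched in Python for illustration; the real helpers are TypeScript in ``src/utils/platform.ts``. The dict shape (``platform`` and ``arch`` keys) and the exact arch strings accepted are assumptions inferred from the test branches listed above. ``isCudaHost`` is left out because its rule is not stated here.

```python
def _norm(value):
    """Lower-case and trim a possibly missing string field."""
    return (value or "").strip().lower()


def is_apple_silicon_host(system):
    """True for macOS on an ARM CPU; tolerant of None and mixed case."""
    if not system:
        return False
    return (
        _norm(system.get("platform")) == "darwin"
        and _norm(system.get("arch")) in {"arm64", "aarch64"}
    )


def is_intel_mac(system):
    """True for macOS on an x86-64 CPU (assumed arch spellings)."""
    if not system:
        return False
    return (
        _norm(system.get("platform")) == "darwin"
        and _norm(system.get("arch")) in {"x86_64", "amd64"}
    )
```

Treating a missing ``system`` object as "not Apple Silicon" means the MTPLX block stays hidden until the backend has reported the host, consistent with the hide-rather-than-grey-out rule.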
Tests: 35 files / 424 vitest tests pass (+15 from platform helper).
tsc clean. No new pytest needed — backend unchanged.
Not addressed in this commit (deferred):
- MLX-only image / video catalog variants still surface in Discover
/ Models tabs on Win/Linux. Filtering those is a larger UX call —
hide entirely vs. show with "Apple Silicon only" pill — deserves
its own decision before code.
- "llama-server installed by default" — already the case via
scripts/stage-runtime.mjs for release builds. No code change.
Extends the FU-034 "hide unrecoverable options" policy to whole
catalog rows. Windows / Linux users no longer see MLX / mlx-video /
mflux / MTPLX entries they can never run, and Apple Silicon users no
longer see vLLM / nunchaku / CUDA-only entries.
- src/utils/platform.ts: imageOrVideoVariantPlatformGate +
chatVariantPlatformGate + isVariantCompatibleWithHost derive a
PlatformGate ("apple-silicon" | "cuda" | "any") from existing variant
fields (runtime / backend / styleTags / repo prefix). No catalog
schema change required.
- ImageModelsTab / ImageDiscoverTab / VideoModelsTab / VideoDiscoverTab:
new hostSystem prop, filtered through isVariantCompatibleWithHost in
the rows/filteredResults useMemo.
- App.tsx: threaded workspace.system into all four tabs;
libraryChatOptions now also filtered so the launch dropdown drops
MLX backends on Win/Linux.
- AcceleratorsBoostPack: showIncompatible flipped off, the table now
surfaces only accelerators the current host can install.
16 new vitest cases pin the helper boundaries (Apple Silicon host
hides CUDA-only variants, Linux x86_64 hides Apple-Silicon-only
variants, "any" gate passes on every host, etc). All 440 frontend
tests pass; tsc clean.
Feature/accelerator install ux
Image / Video Studio
- LTX-2 mlx-video: snap dimensions to multiples of 32 (was crashing
16:9 / 9:16 / 21:9 presets with "Height must be divisible by 32").
Surfaces snap in runtimeNote + reports actual rendered dims.
- Warm-cache phase fix: second-generation of same model variant no
longer flashes "Loading..." — pre-checks variant_key + begins on
PHASE_ENCODING with "Reusing {modelName}" when pipeline cached.
- mlx-video progress wiring: subprocess stdout now feeds VIDEO_PROGRESS
via begin/finish lifecycle + on_progress callback. fraction=None
preserves step counter instead of jittering to 0.5.
- mlx-video device memory: populate deviceMemoryGb (was always null
on /api/video/mlx-runtime → frontend defaulted to 16 GB).
- Hide CUDA-only FP8 layerwise toggle on Apple Silicon.
- Hide platform-incompatible variants from Studio dropdown (Nunchaku
INT4 CUDA on macOS, mlx-video on Win/Linux).
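The LTX-2 dimension snap in the first bullet above might look roughly like this. The rounding direction (nearest multiple, ties rounding up) and the note wording are assumptions; the commit only states that dimensions are snapped to multiples of 32 and that the snap is reported.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round to the nearest multiple (ties up), never below one multiple."""
    snapped = (value + multiple // 2) // multiple * multiple
    return max(multiple, snapped)


def snap_dimensions(width: int, height: int, multiple: int = 32):
    """Return (width, height, note); note is None when nothing changed."""
    new_w = snap_to_multiple(width, multiple)
    new_h = snap_to_multiple(height, multiple)
    note = None
    if (new_w, new_h) != (width, height):
        # Hypothetical wording for the runtimeNote shown to the user.
        note = (
            f"Snapped {width}x{height} to {new_w}x{new_h} "
            f"(multiples of {multiple})"
        )
    return new_w, new_h, note
```

Returning the note alongside the snapped values lets the caller report the actual rendered dimensions rather than the requested ones.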
Discover
- FU-061: "Watching upstream" badge + disabled download CTA for
tracked-only seeds (ERNIE-Image, Nucleus-Image, Z-Image, HiDream,
GLM-Image, FLUX.2 family) that lack Studio pipeline routing.
- Tracked-seed model classifier: ERNIE-Image / Nucleus / Z-Image /
HiDream / GLM-Image keywords added so they don't leak into Chat
My Models.
Upstream pin bumps
- FU-058: vLLM floor >=0.8.0 → >=0.21.0 (gemma4 MTP, spec-dec
thinking budget, TurboQuant hybrid).
- FU-059: nunchaku pin >=1.2.1 → >=0.16.0 (1.2.1 unsatisfiable;
upstream version reset to 0.x).
- diffusers .venv upgrade 0.37.1 → 0.38.0 (pin already allowed).
- FU-057: dflash-mlx v0.1.7 migration documented (deferred — major
API rewrite). v0.1.6/v0.1.7 release notes + breaking changes
surfaced in tracker.
Test infra
- FU-060: memory-pressure gate mocked in test_video_routes.py +
test_backend_service.py setUp/tearDown — deterministic regardless
of host load (was flaky on busy dev boxes).
- e2e_test_suite.py phases 4 + 5 now treat memory-gate refusals as
SKIP (with explanatory reason), not FAIL — pre-build ships green
on memory-constrained hosts.
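The FU-060 pattern of mocking the memory-pressure gate in setUp/tearDown can be sketched as below. ``memory_gate``, ``has_headroom`` and ``start_generation`` are hypothetical stand-ins; the real gate and the routes it protects are not shown in this commit message.

```python
import unittest
from unittest import mock


class memory_gate:
    """Hypothetical stand-in for the backend's memory-pressure check."""

    @staticmethod
    def has_headroom() -> bool:
        # The real check would read live host memory, which is what made
        # the tests flaky on busy machines.
        raise RuntimeError("would query live host memory")


def start_generation() -> str:
    """Hypothetical route logic guarded by the gate."""
    if not memory_gate.has_headroom():
        return "refused"
    return "started"


class VideoRouteTests(unittest.TestCase):
    def setUp(self):
        # Pin the gate open so the result no longer depends on host load.
        self._gate_patch = mock.patch.object(
            memory_gate, "has_headroom", return_value=True
        )
        self._gate_patch.start()

    def tearDown(self):
        # Restore the real gate so other test classes are unaffected.
        self._gate_patch.stop()

    def test_generation_starts(self):
        self.assertEqual(start_generation(), "started")
```

Starting the patch in setUp and stopping it in tearDown scopes the mock to each test, so a failure in one test cannot leave the gate patched for the next.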
Tracker rows: FU-057, FU-058, FU-059, FU-060, FU-061 added to
CLAUDE.md.
Gates green:
- pytest 1541 passed, 1 skipped, 0 failed
- vitest 441 / 35 files passed
- tsc clean
- pre-build 13 / 13 passed
- E2E full suite 8 / 8 phases · 36 / 36 checks · 0 fail
Three threads landed in one commit because they surfaced from the same
session — a Windows-side pytest sweep, the matching WSL+CUDA dry run,
and a chat-tab UX gap the user flagged mid-flight.
**Chat empty-state banner (FU-056 follow-up)**
ChatEmptyStateBanner showed "No model is loaded yet. Pick one from
Models to start chatting." even while the header strip showed
"LOADING MODEL..." for a model already in flight. Two fixes:
- ChatThread.tsx hides the banner when serverLoading is non-null —
the ModelLoadingProgress bubble below already conveys the state.
- ChatEmptyStateBanner.tsx rewords the "models present but none
loaded" branch from "Pick one from Models" + "Open Models" to
"A model needs to be loaded before you can chat." + "Load Model"
for actionable copy. Fresh-install branch (no models downloaded)
is unchanged.
**Test suite — Windows 1526 -> 1528 / 16 fails -> 0; WSL 1510 -> 1544**
The cross-platform sweep caught 16 Windows pytest failures and 3 WSL
failures, all platform-portability test infra bugs rather than
product issues:
- test_cache_strategies.py: 2 turboquant tests imported mlx_lm /
turboquant_mlx at function scope without skip guards — added
@unittest.skipUnless(_MLX_LM_AVAILABLE) so they skip cleanly on
non-Apple-Silicon (caught on both Windows AND WSL Linux).
- test_mtplx_engine_integration.py: 5 tests build a #!/usr/bin/env
bash wrapper script — class-level @unittest.skipIf(sys.platform
== "win32") since Windows can't honour the bash shebang.
- test_sdcpp_image.py + test_sdcpp_video.py: 3 tests asserted str
equality against "/tmp/sd" but the source does str(Path(...))
which yields "\tmp\sd" on Windows. Centralized via
_FAKE_SD_BIN = str(Path("/tmp/sd")) so both sides of the
comparison round-trip through the OS-native separator.
- test_mlx_video_wan_convert.py: same POSIX-path issue with the
env-var override test — now uses tempfile.gettempdir().
- test_preview_vae.py: 5 tests shallowly guarded on `import
diffusers` but the real failure mode on Windows is the deeper
`from diffusers import AutoencoderTiny` (torchao ABI break vs
torch 2.6.0+cu124). Replaced with _autoencoder_tiny_importable()
probe that exercises the actual symbol.
- test_gpu.py::test_nvidia_smi_parsing: fixture said
"NVIDIA RTX 4090" but the real nvidia-smi emits "NVIDIA GeForce
RTX 4090" (live-confirmed via WSL2 GPU passthrough). Also
patched _snapshot_torch_cuda to None so the parser actually
exercises the nvidia-smi codepath instead of short-circuiting
through torch.cuda when a real GPU is present.
**Real product bug — VLLMEngine.generate() missing kwargs**
Live regression caught on WSL2 + RTX 4090: any chat turn with a
vLLM-loaded model raised
TypeError: VLLMEngine.generate() got an unexpected keyword
argument 'samplers'
because the controller passes through samplers / reasoning_effort /
json_schema (parity with LlamaCppEngine) but the vLLM engine's
signature only accepted the basic generate kwargs. Fix in
vllm_engine.py:185-219 — accepts all three, translates llama-server's
`repeat_penalty` to vLLM's `repetition_penalty`, floors temperature=0
to 0.01 (vLLM forbids exactly 0).
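A minimal sketch of the kwarg translation described above, kept free of any vLLM import. It assumes ``samplers`` arrives as a plain dict of llama-server-style keys, which the commit message does not confirm; ``to_vllm_sampling_kwargs`` is a hypothetical name, and the result would be passed on to vLLM's sampling parameters by the real engine.

```python
# Temperature floor taken from the commit message.
VLLM_MIN_TEMPERATURE = 0.01


def to_vllm_sampling_kwargs(temperature=0.7, samplers=None):
    """Translate controller-style sampler settings into vLLM-style names."""
    translated = dict(samplers or {})  # copy so the caller's dict is untouched
    if "repeat_penalty" in translated:
        # llama-server spelling -> vLLM spelling
        translated["repetition_penalty"] = translated.pop("repeat_penalty")
    kwargs = {"temperature": max(temperature, VLLM_MIN_TEMPERATURE)}
    kwargs.update(translated)
    return kwargs
```

Keeping the translation in a pure function makes the rename and the temperature floor testable without the vllm wheel installed.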
New tests/test_vllm_engine.py (5 cases) pin:
- the engine.generate() signature must match the controller's
full call shape (kwarg-name regression gate)
- samplers are forwarded into SamplingParams
- repeat_penalty renames correctly
- temperature=0 gets bumped above the vLLM floor
All skip on hosts without the vllm wheel installed, so the suite
stays cheap on macOS / Windows CPU-only paths.
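The kwarg-name regression gate can be sketched with ``inspect.signature``. ``OldEngine`` and ``FixedEngine`` are stand-in classes, not the real ``VLLMEngine``, and ``missing_kwargs`` is a hypothetical helper; the three kwarg names are the ones the controller is described as passing.

```python
import inspect

# Keyword arguments the controller passes through, per the description above.
CONTROLLER_KWARGS = {"samplers", "reasoning_effort", "json_schema"}


def missing_kwargs(func, required):
    """Return the required keyword names that func would reject."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all accepts any name, so nothing can be missing.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return set()
    return set(required) - set(params)


class OldEngine:
    """Stand-in for the pre-fix signature."""

    def generate(self, prompt, temperature=0.7):
        return prompt


class FixedEngine:
    """Stand-in for the post-fix signature."""

    def generate(self, prompt, temperature=0.7, samplers=None,
                 reasoning_effort=None, json_schema=None):
        return prompt
```

A test that asserts ``missing_kwargs(Engine.generate, CONTROLLER_KWARGS)`` is empty fails as soon as the controller starts passing a name the engine does not accept, which is the TypeError this commit fixed.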
**WSL + CUDA test plan (docs/WSL_CUDA_TESTING.md)**
7-phase practical guide, live-validated end-to-end against WSL2
Ubuntu 24.04 + RTX 4090 (CUDA 12.6.85 toolkit, GPU passthrough).
Final result of the dry run:
- Phase C (pytest with CUDA): 1544 / 0 / 3
- Phase D (E2E full): 7 / 0 / 1
- Phase E (matrix --full): vllm native Qwen3-0.6B PASS,
SHA d18c2b8cb410
- Phase G (real workload): Qwen3-0.6B via vLLM on RTX 4090,
24 GB VRAM allocated, 44% GPU utilization confirmed via
nvidia-smi
Doc captures 13 gotchas surfaced during the dry run — most useful:
- Don't pre-pin torch before installing vllm (vllm's resolver
drives the torch pin; pinning first creates a conflict)
- Use `tr -d '\r'` not `sed 's/\r$//'` for CRLF stripping
(the latter ate trailing `r` chars under WSL's bash-in-PowerShell
quoting chain — broke test_mlx_video_wan_installer.py with
"cannot import name 'mlx_video_wan_installe'")
- CHAOSENGINE_REQUIRE_AUTH=0 mandatory for headless e2e scripts
- ninja must be on PATH at backend launch time (flashinfer JIT)
- --backend auto picks MLX on safetensors even when MLX is
unavailable (FU-063 candidate — auto should fall through to
vllm on Linux)