Staging to Main by cryptopoly · Pull Request #59 · cryptopoly/ChaosEngineAI

cryptopoly · 2026-05-18T16:36:52Z

No description provided.

Adds end-to-end MTPLX support — isolated venv at ~/.chaosengine/mtplx-venv plus engine, capability probe, routing, install flow, and full UI wiring. Backend - _mtp.py model registry: MTP_MODEL_MAP + _MTP_ALIASES (Qwen3.5/3.6, DeepSeek V3/R1, Coder-Next, Youssofal/Qwen3.6-27B-MTPLX-Optimized-*) - MtplxEngine subprocess wrapper (load/generate/stream over OpenAI- compatible HTTP). Fails into RuntimeError; controller catches and falls back to MLXWorkerEngine with a runtimeNote. - capabilities: cheap file-existence probe for mtplx-venv + version file. - controller._select_engine routes MLX + speculativeDecoding + has_mtp_heads + mtplxAvailable -> MtplxEngine. - controller orphan-prune extended to MTPLX subprocesses. - controller _LLAMA_HELP_LOCK import (pre-existing latent NameError). - helpers/system.py exposes mtplx info to workspace system stats. Install flow - scripts/install-mtplx.sh + routes/setup/mtplx.py background-job pattern (POST start, GET status). Path was parents[4] -> parents[3]. Frontend - useMtplxInstall hook with poll + unmount cleanup. - RuntimeControls renders MTPLX section when model has MTP heads; hides DFlash when MTPLX supersedes (backend already prefers MTPLX). - Install terminal panel hidden when phase=="done" so it doesn't re-appear on every modal open after a successful install. - ModelLaunchModal mtplx props plumbed through all four wrappers: Chat (LaunchModal), HTML Challenge (ChallengePickerModal), Compare (CompareView), Benchmarks (BenchmarkRunTab). - MyModelsTab MTPLX strategy filter; OnlineModelsTab Acceleration row. - summarizeLaunchSettings + modelUsesMtplx helper so summary label shows "MTPLX" when backend will actually route through it. - MyModelsTab MTPLX filter tightened to modelName-only candidateKeys (was matching via fuzzy matchedVariant.repo, giving false positives on UD/Unsloth repacks + Gemma). Cleanup - Removed Compare button from chat sidebar (Compare is own tab now). 
Tests
- 10 integration tests in tests/test_mtplx_engine_integration.py exercising engine spawn -> /health -> /v1/chat/completions (JSON + SSE) -> done chunk via stub mtplx server at tests/fixtures/stub_mtplx_server.py. Plus controller-routing assertions.
- 1365 Python tests pass; 371 TypeScript tests pass; tsc clean.
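The routing rule and the RuntimeError fallback described in this commit can be sketched roughly as follows. This is an illustration only, not the controller's code: ``LaunchRequest``, ``Capabilities``, ``select_engine``, ``load_with_fallback`` and the ``LlamaCppEngine`` label are hypothetical names, and engines are represented by plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class LaunchRequest:
    """Hypothetical request shape; field names are assumptions."""
    backend: str                # "mlx" or "gguf"
    speculative_decoding: bool
    has_mtp_heads: bool


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical capability snapshot."""
    mtplx_available: bool


def select_engine(req: LaunchRequest, caps: Capabilities) -> str:
    """MLX + speculativeDecoding + has_mtp_heads + mtplxAvailable -> MtplxEngine."""
    if (req.backend == "mlx" and req.speculative_decoding
            and req.has_mtp_heads and caps.mtplx_available):
        return "MtplxEngine"
    if req.backend == "mlx":
        return "MLXWorkerEngine"
    return "LlamaCppEngine"  # assumed label for the GGUF path


def load_with_fallback(
    req: LaunchRequest,
    caps: Capabilities,
    start_engine: Callable[[str], None],
) -> Tuple[str, Optional[str]]:
    """Start the selected engine; on an MTPLX RuntimeError fall back to MLX.

    Returns (engine_name, runtime_note); the note is None when no fallback happened.
    """
    name = select_engine(req, caps)
    try:
        start_engine(name)
        return name, None
    except RuntimeError as exc:
        if name != "MtplxEngine":
            raise
        start_engine("MLXWorkerEngine")
        return "MLXWorkerEngine", f"MTPLX failed ({exc}); fell back to MLXWorkerEngine"
```

Passing ``start_engine`` as a callable keeps the sketch testable without spawning a subprocess.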

Feature/mtplx

Stdlib-only Python shim over the FastAPI backend's HTTP surface so features can be tested headlessly (load + prompt + bench + status) without GUI click-through.

- scripts/chaosengine-cli: serve/status/load/unload/prompt/bench/mtplx-install/mtplx-status subcommands; SSE streaming for live token output; JSON to stdout for jq composition
- tests/test_cli_smoke.py: 12 unit tests covering parser + happy path for every subcommand via mocked urllib

No new pip deps. No bundle-size impact (script lives in scripts/, not packaged into the .app).

Expands chaosengine-cli to wrap all 125 backend routes (95 typed shortcuts + generic call/routes/openapi dispatchers) and adds a phased E2E test suite that drives the CLI end-to-end against a live backend, mirroring every major app surface. CLI (scripts/chaosengine-cli, 1656 LOC) - Generic dispatcher: call <METHOD> <PATH> with --body/--file/--query /--stream; reaches every endpoint regardless of typed coverage. - routes + openapi subcommands fetch /openapi.json so route inventory stays in sync without codegen. - Typed shortcuts cover Chat (load/unload/prompt/bench/sessions/compare), HTML Challenges (list/get/file/create/repair/retry/validate/delete), Image Studio (generate/progress/cancel/outputs/library/catalog/ download lifecycle), Video Studio (same shape + output-file binary save), Server (status/shutdown/logs SSE), Setup (mtplx/longlive/wan/ cuda-torch/gpu-bundle install + status + probes), Diagnostics, and misc (gpu-status, cache-preview, prompts, settings, workspaces, plugins, tools, adapters, finetuning, v1-models). - Fixed two real schema-shape bugs caught while building the E2E suite: image-generate sends modelId/guidance (was modelRef/ guidanceScale); video-generate sends modelId/numFrames/guidance. E2E suite (scripts/e2e_test_suite.py) - 8 phases: 0 Environment probe, 1 Chat (MLX+GGUF+cache+DFlash+ MTPLX+long-context+fused-attn), 2 Chat Compare, 3 HTML Challenge, 4 Image Studio, 5 Video Studio, 6 Setup probes (read-only), 7 Diagnostics + cleanup hygiene. - Pass criteria concrete: HTTP 200, tokS>0 for generation, expected substring in runtimeNote for DFlash/MTPLX routing, zero unclean orphan workers after the sweep. - Auto-skip semantics: checks skip cleanly when a prerequisite is missing (model not on disk, install missing) rather than failing. - Reports JSON + Markdown to ~/.chaosengine/test-results/. - --smoke runs Phases 0,3,4,5,6,7 (≤60s, no heavy model loads). 
- Live smoke pass green on M-series box: 6/6 phases, 26 checks, 45s wall — real FLUX image catalog probe, real LTX-2 video generate, real HTML challenge round-trip. Tests (tests/test_cli_smoke.py, 39 → 39 passing) - Updated to match new schema fields (modelId, numFrames). - All 95 typed subcommands have parser coverage; representative behaviour tests for each category. Docs (docs/E2E_TESTING.md) - Standardised procedure + skip semantics + adding-new-checks guide. CLAUDE.md - New Build Checklist entries: --smoke + full E2E required for release builds and any PR touching inference routing. - "New feature gate": every user-visible feature, engine wiring, catalog model family, install endpoint, or cache/spec-dec strategy must land with an E2E check in the relevant phase.

The resolver walked the parent's sibling subdirectories and the grandparent looking for vision projectors. Under the flat ~/AI_Models/<org>/<repo>/ layout that picked up an unrelated neighbour's mmproj — loading lmstudio-community/gemma-4-31B-it-GGUF attached Qwen3.6-27B's mmproj-Qwen3.6-27B-BF16.gguf and crashed llama-server with a text-vs-mmproj n_embd mismatch (5376 vs 5120). Restrict the scan to parent.iterdir() of the main .gguf — no recursion into subdirs, no walk into the grandparent. Returns None for text-only models so --mmproj is never passed. Adds two regression tests: (a) a flat sibling-dir layout with a stray mmproj in the neighbour returns None, (b) an mmproj inside a subdirectory of the model's own folder is also ignored.
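A minimal sketch of the narrowed scan, assuming projector files can be recognised by "mmproj" in a ``.gguf`` filename (as in the example above). ``find_mmproj`` is a hypothetical name, not the resolver's real signature.

```python
from pathlib import Path
from typing import Optional


def find_mmproj(model_path: Path) -> Optional[Path]:
    """Look for a vision projector next to the main .gguf only.

    Only direct children of the model's own directory are considered:
    no recursion into subdirectories and no walk into the grandparent.
    """
    parent = model_path.parent
    candidates = sorted(
        p for p in parent.iterdir()
        if p.is_file()
        and p.suffix == ".gguf"
        and "mmproj" in p.name.lower()   # assumed naming convention
        and p != model_path
    )
    # None for text-only models, so the caller never passes --mmproj.
    return candidates[0] if candidates else None
```

Sorting makes the choice deterministic when a folder carries more than one projector; how the real resolver breaks such ties is not stated in the commit.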

… match

Two related bugs surfaced by the E2E suite:

1. After deleting an HF-cache snapshot on disk, /api/workspace still returned the stale entry (broken=true) because nothing rescanned the library. _library() now runs a stat-only existence check on each cached entry per request and kicks a background rescan when anything was pruned. Sub-millisecond at the 500-entry cap.

2. lifecycle.load_model() rejected loads with "Cannot load 'X': <reason>" whenever the library lookup matched a broken entry, even when the caller supplied an explicit request.path pointing at the real weights elsewhere. The broken-entry guard now defers to the caller when path is set AND exists on disk. _find_library_entry() also prefers a healthy match over a broken one when multiple share the same name.

Tests cover: per-request prune, end-to-end /api/workspace exclusion, two-pass healthy-over-broken lookup, path-trust escape hatch happy path, and the negative case where a non-existent path still falls through to the rejection.
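The three behaviours in this commit (stat-only prune, healthy-over-broken lookup, path-trust escape hatch) can be illustrated with the sketch below. ``LibraryEntry``, ``prune_missing``, ``find_library_entry`` and ``may_load`` are hypothetical stand-ins; the real library cache and guard carry more state than this.

```python
import os
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LibraryEntry:
    """Hypothetical cached library row."""
    name: str
    path: str
    broken: bool = False


def prune_missing(entries: List[LibraryEntry]) -> Tuple[List[LibraryEntry], bool]:
    """Stat-only existence check; the flag tells the caller to kick a rescan."""
    kept = [e for e in entries if os.path.exists(e.path)]
    return kept, len(kept) != len(entries)


def find_library_entry(entries: List[LibraryEntry], name: str) -> Optional[LibraryEntry]:
    """Two-pass lookup: prefer a healthy match, fall back to a broken one."""
    for entry in entries:
        if entry.name == name and not entry.broken:
            return entry
    for entry in entries:
        if entry.name == name:
            return entry
    return None


def may_load(entry: Optional[LibraryEntry], request_path: Optional[str]) -> bool:
    """A broken entry blocks the load unless the caller's path exists on disk."""
    if entry is None or not entry.broken:
        return True
    return request_path is not None and os.path.exists(request_path)
```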

Loading Qwen3.5 / Qwen3.6 MoE (and any other model that mixes self-attention with linear-attention layers) with cacheStrategy=turboquant + cacheBits=4 crashed the first generation call with:

    'TurboQuantKVCache' object is not subscriptable

Root cause: ``make_adaptive_cache`` unconditionally built every cache slot as ``TurboQuantKVCache`` / ``KVCache``. The model's linear-attn layer forward accesses ``cache[0]`` / ``cache[1]`` (it expects an ``ArraysCache(size=2)``), which raises ``TypeError`` on a KV cache.

Fix: when a ``model`` is passed and exposes ``make_cache()``, use it as the base. Preserve every non-KV slot (ArraysCache, MambaCache, …) verbatim and only swap the actual ``KVCache`` instances for ``TurboQuantKVCache``. Plain models without ``make_cache`` keep the previous behaviour.

Added regression tests in ``test_cache_strategies.py`` covering both the hybrid model path and the no-``make_cache`` fallback. Live-verified against ``mlx-community/Qwen3.6-35B-A3B-4bit`` at 4-bit TurboQuant: generation now completes at ~47 tok/s with no crash.
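The slot-preserving swap can be sketched with stand-in classes. The three cache classes below are toy stubs that only share names with the real ones; they are not the mlx-lm or TurboQuant implementations, and the function signature is an assumption.

```python
from typing import Any, List, Optional


class KVCache:
    """Stub for a standard attention KV cache (not subscriptable)."""


class ArraysCache:
    """Stub for the fixed-size state cache that linear-attention layers index into."""

    def __init__(self, size: int) -> None:
        self.state: List[Any] = [None] * size

    def __getitem__(self, idx: int) -> Any:
        return self.state[idx]


class TurboQuantKVCache:
    """Stub for the quantised replacement of KVCache."""

    def __init__(self, bits: int) -> None:
        self.bits = bits


def make_adaptive_cache(num_layers: int, bits: int, model: Optional[Any] = None) -> List[Any]:
    """Swap only real KVCache slots; keep every other slot type verbatim."""
    if model is not None and hasattr(model, "make_cache"):
        base = model.make_cache()
        return [
            TurboQuantKVCache(bits) if isinstance(slot, KVCache) else slot
            for slot in base
        ]
    # Plain models without make_cache(): previous behaviour, one KV slot per layer.
    return [TurboQuantKVCache(bits) for _ in range(num_layers)]
```

With a hybrid model whose ``make_cache()`` returns a mix of ``KVCache`` and ``ArraysCache`` slots, the linear-attention layers still receive an object that supports ``cache[0]`` / ``cache[1]``.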

CLI
- cmd_prompt/cmd_bench now read tokS, promptTokens, completionTokens, responseSeconds, runtimeNote from the nested ``assistant.metrics`` payload instead of the (always-null) top level. The old lookup was effectively hiding live tok/s numbers from every --metrics call.

E2E suite (scripts/e2e_test_suite.py)
- _load_unload_prompt: ``canonical_repo`` + ``load_timeout`` parameters threaded through; subprocess timeout = load_timeout + 60s so a backend-level timeout cleanup beats the harness kill.
- Phase 1 picker uses Qwen3.6-35B-A3B-4bit (MoE) as the fast model for every Chat check: much quicker to load than 80B Qwen3-Next while exercising the same MLX / cache / spec / fused paths.
- Phase 1 MTPLX check uses leaf-name modelRef + --canonical-repo Youssofal/... so the controller routes through MtplxEngine while avoiding the broken-library-entry path-shadow that previously blocked the load (separate bug fixed by Agent B's library-prune + path-trust commit; the suite change is belt-and-braces).
- Phase 1 GGUF check cycles through local .gguf files instead of picking the first one; a single broken mmproj pairing no longer fails the whole check (Agent A's mmproj scope fix made this less necessary but the resilience is a keeper).
- Phase 7 ``no orphan workers`` tolerates ``terminated`` / ``killed`` records as expected backend cleanup; only ``kill_failed`` or similar non-cleaned states count as failure.

Full sweep result with this commit + the three preceding ``fix:`` commits (mmproj scoping, library prune + path-trust, TurboQuant ArraysCache preservation):

8/8 phases PASS, 32/32 checks PASS, 128s wall

Phase 1 detail:
- MLX native cache: PASS 8.1s
- MLX TurboQuant cache: PASS 5.9s (was 500 → fixed by 30441f9)
- MLX + DFlash speculative: PASS 19.1s
- MLX + MTPLX speculative: PASS 13.2s (was load-blocked → fixed by 566fd64)
- GGUF llama.cpp: PASS 12.8s (was mmproj-crash → fixed by 51305c6)
- long context cache-preview: PASS 1.2s
- fused attention flag: PASS 10.8s
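As a rough illustration of the CLI fix, the nested lookup might look like this. ``extract_metrics`` and ``METRIC_KEYS`` are hypothetical names; the response shape ``{session, runtime, assistant: {text, metrics}}`` is the one described elsewhere in this PR.

```python
from typing import Any, Dict

# Metric fields named in the commit message.
METRIC_KEYS = ("tokS", "promptTokens", "completionTokens", "responseSeconds", "runtimeNote")


def extract_metrics(response: Dict[str, Any]) -> Dict[str, Any]:
    """Read metrics from response["assistant"]["metrics"], not the top level.

    Missing sections yield None values rather than raising, so the CLI can
    still print a partial line.
    """
    assistant = response.get("assistant") or {}
    metrics = assistant.get("metrics") or {}
    return {key: metrics.get(key) for key in METRIC_KEYS}
```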

MTPLX (https://github.com/youssofal/mtplx) is the native MTP speculative-decoding runtime ChaosEngineAI shells out to for MTP-bearing models. Installed on-demand into an isolated venv at ~/.chaosengine/mtplx-venv/, not bundled in the desktop .app, driven via subprocess + HTTP from backend_service/inference/mtplx_engine.py. Apache 2.0 — compatible with our MIT+Apache+BSD permissive licence gate (CLAUDE.md §2). Full LICENSE shipped with the wheel under mtplx-*.dist-info/licenses/.

…al response shape Pre-build (scripts/pre-build-check.sh) - New phase 9/9: runs ./scripts/e2e_test_suite.py --smoke when backend is reachable on :8876; warn-skips when not (pre-build doesn't spawn one). Full E2E sweep stays a release-time gate per CLAUDE.md + docs/E2E_TESTING.md. - Notices dep-check list synced with current THIRD_PARTY_NOTICES.md: added mtplx + mlx-video; dropped stale ChaosEngine probe (vendored package was removed in FU-030). Tests (tests/test_cli_smoke.py) - test_prompt_non_streaming_prints_text_and_metrics fixture rebuilt around the real /api/chat/generate response shape: { session, runtime, assistant: { text, metrics: {...} } }. The earlier flat shape was a guess that masked the real CLI bug fixed in 2d1128c. Verification: ./scripts/pre-build-check.sh — 10 passed, 0 failed, 1 warning (unrelated llama-server-turbo update available).

Adds three new feature surfaces to the top-level README without reflowing existing sections. - MTPLX (Multi-Token Prediction) speculative decoding gets a feature-highlight bullet, a mention in the "Why ChaosEngineAI" speculative-decoding paragraph, and a dedicated subsection under "Speculative Decoding" alongside DFlash + DDTree. Covers Apple Silicon support, the isolated mtplx-venv, the model registry in backend_service/inference/_mtp.py, and the auto-routing fallback chain (MTPLX -> DFlash -> standard MLX). - chaosengine-cli gets a new "Headless Automation" section between the Building a Release and Project Layout sections. Documents the generic call dispatcher + 95 typed shortcuts, four quick-start examples, the optional PATH symlink, and the no-GUI install path. - E2E test suite gets a brief subsection inside the CLI section linking out to docs/E2E_TESTING.md.

Builds with `mkdocs build --strict` (zero warnings) into a publishable site covering install, usage, features (MTPLX, DFlash, cache strategies), CLI reference (driven from live /openapi.json), architecture (controller routing + engines + runtime paths), testing (importing the existing E2E_TESTING content), troubleshooting, contributing, and a reference section (HTTP API, env vars, third-party deps, changelog). - mkdocs.yml: Material theme + tabs nav + standard pymdownx extensions. - requirements-docs.txt: mkdocs, mkdocs-material, pymdown-extensions. - .gitignore: exclude the site/ build output. - exclude_docs in mkdocs.yml hides four pre-existing legacy docs (E2E_TESTING.md root copy + the image-discover/MVP/provenance notes) that are not part of the new site nav.

Builds the strict MkDocs site on every push to staging that touches docs/, mkdocs.yml, requirements-docs.txt, or this workflow, then rsyncs the output into cryptopoly/ChaosEngineAI-Site under docs/ and pushes to that repo's main branch. The marketing site serves the result at https://chaosengineai.com/docs/ — subdirectory hosting so backlinks accrue to the main domain for SEO. mkdocs.yml site_url updated from readthedocs.io to chaosengineai.com/docs/ so generated canonical URLs, sitemap, and OG tags point at the real host. Requires a single new secret on this repo: SITE_REPO_DEPLOY_KEY — SSH deploy key with write access to the ChaosEngineAI-Site repo. Generate with ssh-keygen, add the public half there as a deploy key (write enabled), private half here as an Actions secret. Documented inline in the workflow header. Manual workflow_dispatch is also wired for hot-fixes outside the push-trigger window.

Investigation of recent activity on spec-decoding + KV cache compression upstreams.

Findings:
- llama.cpp PR #22673 (MTP support) merged 2026-05-16; ships --spec-type draft-mtp --spec-draft-n-max N. Canonical MTP GGUFs published under ggml-org/ for Qwen3.6-27B and Qwen3.6-35B-A3B.
- turboquant-mlx-full unchanged at 0.3.0 (our current pin).
- WeianMao/triattention HEAD c3744ee6 = our pin; no new MLX work.
- TheTom/turboquant_plus has a C++ TriAttention V3 hybrid policy in the llama-cpp-turboquant fork's experiment branch; not yet independently reproduced.
- Tweet at leftcurvedev_ status: unable to verify (X auth wall).

Includes diff sketch + recommended PR sequence + open questions. See doc for sources.

New follow-up row tracking the GGUF half of FU-028 now that PR #22673 merged upstream. Lists action plan (6 wiring steps), upstream caveats, and links to the upstream-research write-up.

Closes the GGUF half of FU-028. PR #22673 by am17an merged upstream 2026-05-16T12:06:24Z (merge commit 2555826) shipping --spec-type draft-mtp --spec-draft-n-max N for models with baked-in Multi-Token Prediction heads. Upstream-reported ~72% acceptance @ N=3 on Qwen3.6-27B, ~2x tok/s vs no-spec baseline. Code changes - _mtp.py: new is_mtp_gguf_repo() + _MTP_GGUF_REPOS frozenset. 4 new aliases for the canonical mirrors (ggml-org/*) and author preview (am17an/*) GGUF repos so has_mtp_heads + get_mtp_draft_n return the right canonical N. - llama_cpp_engine.py: _build_command grew speculative_decoding + canonical_repo + model_ref kwargs. Emits --spec-type draft-mtp --spec-draft-n-max <get_mtp_draft_n> when the binary supports --spec-type AND the canonical repo matches is_mtp_gguf_repo. Falls back to standard decode + clear runtimeNote when the binary lacks --spec-type (older llama-server builds, e.g. homebrew bottles built before 2026-05-16T12Z). - base.py: new ggufMtpAvailable: bool on BackendCapabilities, serialised in to_dict so the frontend can show MTP affordances for GGUF models alongside the existing mtplxAvailable flag. - capabilities.py: _probe_native_backends sets ggufMtpAvailable from _llama_server_supports("--spec-type") against either the standard or turbo binary. Catalog (text_models.py) - ggml-org/Qwen3.6-27B-MTP-GGUF (Q8_0, 29 GB) - ggml-org/Qwen3.6-35B-A3B-MTP-GGUF (Q8_0, 37 GB, MoE) Both with the qwen3.6 family, vision via auto-detected mmproj sibling, runtime note flags D2H prompt-processing caveat per upstream PR body. Tests (test_inference.py) - 5 new cases: happy-path MTP flag emission; binary-lacks-spec-type fallback runtimeNote; non-MTP repo no-op; canonical + author alias coverage; draft-n lookup through aliases. - Full suite: 1418 passed, 1 skipped (up from 1413, no regressions). Tracker - CLAUDE.md FU-047 row flipped to ~~shipped~~ with full landing receipt. 
FU-028 stays open for the MLX side (mlx-lm has no native MTP head loader; MTPLX subprocess remains the workaround). Live-verification status - Backend probe reports ggufMtpAvailable=True against homebrew llama.cpp 9150 (advertises --spec-type) BUT homebrew bottle 9150 was built before PR #22673 merged today, so its --spec-type help list still omits draft-mtp. Backend wiring is correct; users need a llama-server built from master at or after commit 2555826 to actually fire draft-mtp speculative decoding. Next homebrew bottle picks this up automatically. - MLX side comparison: MTPLX path (subprocess via /v1) runs the same Qwen3.6-27B-MTPLX-Optimized-Speed model at ~24.7 tok/s versus ~29.0 tok/s for the standard mlx-lm worker — currently *slower* on this hardware (M5), likely from HTTP-proxy overhead on per-token roundtrips eating the spec-dec acceptance gains. Investigating separately; not blocking this commit. Research write-up: docs/UPSTREAM_RESEARCH_2026-05-16.md
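The gating logic for the GGUF MTP flags can be sketched as a pure function. ``build_spec_args`` is a hypothetical helper, not the real ``_build_command``; only the two flags quoted in the commit message are emitted, and the wording of the fallback note is an assumption.

```python
from typing import List, Optional, Tuple


def build_spec_args(
    speculative_decoding: bool,
    binary_supports_spec_type: bool,
    is_mtp_repo: bool,
    draft_n: int,
) -> Tuple[List[str], Optional[str]]:
    """Return (extra llama-server args, runtime note).

    Flags are emitted only when the user asked for speculative decoding,
    the model is a known MTP GGUF, and the binary advertises --spec-type.
    """
    if not speculative_decoding or not is_mtp_repo:
        return [], None  # non-MTP repo or feature off: no-op
    if not binary_supports_spec_type:
        # Older llama-server builds: standard decode plus a clear note.
        return [], "llama-server lacks --spec-type; falling back to standard decode"
    return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n)], None
```

Note that, as the live-verification paragraph above points out, a binary can advertise ``--spec-type`` without accepting ``draft-mtp``, so a check on the flag alone is necessary but not sufficient.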

…s stale llama-server

Three independent fixes shipped together because they all surfaced while live-benching MTPLX vs MLX baseline on Qwen3.6-27B.

1. MTPLX no longer pops a browser window

``mtplx start`` defaults to MTPLX's interactive onboarding, which on first run picks the ``web`` surface and opens a chat UI in a browser tab. Users who only asked ChaosEngineAI to load a model got an unrelated browser window. Switched the subprocess invocation in MtplxEngine to ``mtplx quickstart --yes``, which is the server-only entry point: pure HTTP at /v1, no UI, no prompts. Also pass ``--host 127.0.0.1`` explicitly + ``--mtp --depth N`` so the speculative path actually fires with the registered draft-token count.

2. Draft depth bumped 1 -> 3 for Youssofal Optimised models

The earlier conservative N=1 made HTTP-proxy overhead dominate any spec-dec acceptance gain. Live bench: depth=1 ran the same model at ~24.7 tok/s vs ~29.0 tok/s for plain MLX (15% SLOWER). With depth=3 (matches MTPLX's own UI default), the same bench averaged ~27.2 tok/s with the first run hitting 30.4 tok/s: about 6% below baseline on average and occasionally beating it. The remaining gap is HTTP-roundtrip overhead, not algorithm.

3. Pre-build gate warns when staged llama-server lacks draft-mtp

FU-047 wired GGUF MTP via llama.cpp PR #22673 merged today, but homebrew bottle 9150 was built before the merge: it advertises ``--spec-type`` but the value ``draft-mtp`` isn't in its help. Catalog rows for the MTP GGUFs will fail-load until the bundled binary is at master >= 2026-05-16. Pre-build now greps the help text and surfaces a WARN row pointing operators at ``brew upgrade llama.cpp`` or a rebuild-from-master.

Test fixtures (stub_mtplx_server.py)
- Accept both ``quickstart`` (new) and ``start`` (legacy) subcommands.
- Accept the new flags MtplxEngine emits (--host, --mtp, --no-mtp, --depth, --yes) so the 10 integration tests still pass against the new command shape.

Live verification
- ./scripts/chaosengine-cli load Qwen3.6-27B-MTPLX-Optimized-Speed --canonical-repo Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed --backend mlx --spec returns runtimeNote "MTPLX MTP speculative decoding active (draft tokens: 3, model: Qwen3.6-27B-MTPLX-Optimized-Speed)".
- Three sequential generations: 30.4, 27.5, 23.6 tok/s (avg 27.2); no browser window opens; subprocess stays clean.

Test suite: 1418 passed, 1 skipped (no regressions).

Answers two user-flagged questions from this session:
- "MTPLX page just opened in browser - won't happen to users right?" No: this commit silences the pop.
- "Homebrew llama-server too old - should ChaosEngineAI versions ship a newer one?" Stage-runtime.mjs auto-downloads the latest ggml-org/llama.cpp release at build time, so once a new tagged release lands post-merge the next ChaosEngineAI build picks it up automatically. The new pre-build warning makes the gap visible immediately rather than only at first user attempt.

Bumps the four version sources of truth that downstream code reads: - pyproject.toml (backend uses this via _resolve_app_version + reports back through /api/health + /api/diagnostics/snapshot) - package.json (frontend bundling, npm scripts) - src-tauri/tauri.conf.json (desktop installer + auto-updater) - src-tauri/Cargo.toml (Rust shell binary) Cargo.lock regenerates on next ``cargo build``. Headline for this release (full notes in RELEASE_NOTES_v0.9.2.md): - chaosengine-cli — full headless automation, 95 typed shortcuts + 100% backend route coverage - MTPLX native MTP speculative decoding on Apple Silicon (Apache 2.0) - GGUF MTP via llama.cpp PR #22673 (--spec-type draft-mtp) - Phased E2E test suite + auto-deployed MkDocs documentation site - 5 backend bug fixes (TurboQuant hybrid-attn, stale library scan, mmproj scoping, CLI response shape, MTPLX browser-pop) No v0.9.1 was tagged — going 0.9.0 -> 0.9.2.

Backend ships 9 diffusion cache strategies via cache_compression.registry but the frontend only let users pick 2 (fbcache, teacache). Marketing site claimed '9 strategies' — discrepancy. Triaged the 4 hidden ones against value-vs-noise: Worth UI exposure: - TaylorSeer: native diffusers 0.38 core, generic across FLUX / SD3 / Wan / Hunyuan / LTX / CogVideoX / Mochi. ~2.4x speedup. - PAB (Pyramid Attention Broadcast): native diffusers 0.38 config, ~2x speedup. Different mechanism than FBCache (attention reuse vs first-block-skip) so a real alternative not a duplicate. Kept backend-only (CLI / API): - MagCache: FLUX-only without calibration UX; footgun on other DiTs. - FasterCache: ~1.9x — same ballpark as FBCache so adds choice without adding capability. Changes: - ImageCacheStrategyId + VideoCacheStrategyId unions extended: 'none' | 'fbcache' | 'teacache' -> ... | 'taylorseer' | 'pab' - IMAGE_CACHE_STRATEGIES list grows by 2 entries with hints describing what each does. - IMAGE_CACHE_STRATEGY_DEFAULT_THRESH + VIDEO_CACHE_STRATEGY_DEFAULT_THRESH set taylorseer/pab thresholds to 0 (means 'use diffusers default skip interval' — these adapters key off cache_interval not threshold). - imageCacheStrategiesForRepo gating unchanged: UNet pipelines still only see 'Off'; FLUX gets all 5; other DiTs get all 5 minus TeaCache (no calibration tables for those pipelines). Backend already accepts any string for cacheStrategy (Pydantic field is 'str | None' and registry.get() handles the lookup), so no Python schema change needed — the new ids route straight through to the existing cache_compression.{taylorseer,pab} adapters. Tests: 214 cache/image/video tests pass; full Py + TS suites green; tsc clean.

Two follow-ups from the v0.9.2 MTPLX bench: 1. MTPLX --profile performance-cold --max The MTPLX subprocess defaults to ``sustained`` runtime profile, which thermally throttles for long-running serves. For chat where the user is staring at the textarea waiting on the first response, ``performance-cold --max`` is the right preset — full clocks, no throttling. Live re-bench on M5 with N=3 + burst still didn't beat plain mlx-lm (avg 23.9 tok/s vs 29 baseline; throughput degraded over consecutive runs from 27.4 -> 20.8 suggesting M5 thermal limits dominate regardless of profile). Keep the flag — burst is the *right* config for interactive use; the gap is hardware, not ChaosEngineAI's fault. 2. /api/setup/llama-server-status + chaosengine-cli llama-server-status New read-only endpoint probes the resolved llama-server binary: - Reports build number from ``llama-server --version`` - Greps ``llama-server --help`` for ``draft-mtp`` in --spec-type - Returns platform-aware upgrade command (brew on macOS, tarball on Linux, scoop on Windows) - Surfaces a clear ``message`` field telling the user *why* MTP GGUF won't fire on their current binary The frontend can now show an "Outdated llama-server" banner under any MTP GGUF catalog entry and link the upgrade command directly. Why a read-only probe (not a one-click installer): - Each OS has a preferred package-manager path (brew / apt / pacman / scoop / chocolatey); wrapping all of them is a footgun until we vendor our own llama.cpp build. - The release pipeline already pulls ggml-org/llama.cpp's latest GitHub release tarball at stage-runtime time. After a fresh ggml-org release tag the bundled binary catches up automatically; users on homebrew need ``brew upgrade llama.cpp`` once the bottle refreshes. - This endpoint makes the gap legible without taking responsibility for the install. Test fixture (stub_mtplx_server.py) — accept the new --profile and --max flags that MtplxEngine now passes so the integration tests keep passing. 
Full suite: 1418 passed, 1 skipped.

…3.6 MTP N to 3

Two follow-ups while running the FU-047 head-to-head benchmark:

1. Status probe was bypassing the env override

_resolve_llama_server() in routes/setup/llama_server.py only checked /opt/homebrew/bin and PATH, ignoring CHAOSENGINE_LLAMA_SERVER. The inference engine resolver in inference/binaries.py does honour the env var (set by the Tauri shell pointing at the bundled binary, or by developers pointing at a freshly-built source build). The status probe now follows the engine's resolution priority (env override > homebrew > PATH) so the UI banner is honest about which binary actually runs.

2. MTP_MODEL_MAP Qwen3.6 entries: N 1 -> 3

Earlier sustained-bench at N=1 left tokens on the table; upstream PR #22673 reports ~72% acceptance at N=3 on Qwen3.6-27B. Live bench on M5 with N=3 on Q8_0 GGUF: 1.51x speedup over Q8_0 baseline (20.9 vs 13.8 tok/s). N=1 was already 1.46x; bumping to 3 nudges it up another 4%. Diminishing returns past N=3 per the PR body.

Full head-to-head bench (M5, same prompt, 256 max tokens, 3 runs):

  Configuration                                    Throughput
  MLX baseline (Youssofal MTPLX-Optimized-Speed)   29.0 tok/s
  MTPLX subprocess N=3 burst                       24-27 tok/s (variable)
  GGUF Q4_K_M baseline (lmstudio Qwen3.6-27B)      18.4 tok/s
  GGUF Q8_0 baseline (ggml-org Qwen3.6-27B-MTP)    13.8 tok/s
  GGUF Q8_0 + MTP N=1 (this fix)                   20.1 tok/s (1.46x)
  GGUF Q8_0 + MTP N=3 (this commit)                20.9 tok/s (1.51x)

Net findings:
- FU-047 (GGUF MTP) delivers the real, measurable speedup. 1.5x on the same model + quant + hardware is the headline.
- MTPLX subprocess via HTTP underperforms on M5 even with depth=3 + burst profile. Subprocess overhead > MTP acceptance gain.
- Plain MLX-LM at Youssofal's BF16-ish encoding still wins absolute throughput because the per-token compute is just smaller.

Verified with /tmp/llama.cpp HEAD build (commit 6049906) installed to ~/.chaosengine/bin/llama-server. ggml-org/llama.cpp release b9181 (2026-05-16 17:06 UTC) is one commit past the MTP merge and is what stage-runtime.mjs will pull on the next ChaosEngineAI build, so the bundled binary in .app installs will ship with draft-mtp out of the box, with no homebrew dependency for end users.
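The resolution priority (env override > homebrew > PATH) could be expressed along these lines. This is a sketch, not the code in routes/setup/llama_server.py: ``resolve_llama_server`` and its injectable ``exists`` / ``which`` parameters are assumptions made so the order can be tested without touching the filesystem.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

HOMEBREW_LLAMA_SERVER = "/opt/homebrew/bin/llama-server"


def resolve_llama_server(
    env: Optional[Mapping[str, str]] = None,
    exists: Callable[[str], bool] = os.path.isfile,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Resolve the llama-server binary: env override > homebrew > PATH."""
    env = os.environ if env is None else env
    override = env.get("CHAOSENGINE_LLAMA_SERVER")
    if override and exists(override):
        return override
    if exists(HOMEBREW_LLAMA_SERVER):
        return HOMEBREW_LLAMA_SERVER
    return which("llama-server")
```

Sharing one resolver between the status probe and the engine would avoid this class of drift entirely; whether the project does that is not stated in the commit.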

Three colleague-feedback items after MTPLX root-cause investigation: - _mtp.py: model_has_mtp_tensors() peeks GGUF header for mtp_decoder / mtp_emb / mtp_heads byte strings, or probes safetensors index for mtp_*. keys / mtp.safetensors shard. has_mtp_heads_strict(repo, path) prefers tensor probe over name aliases — catches new MTP-bearing repos we haven't enumerated and rejects name collisions that don't carry the tensors (FU-041-style false positives). - controller._select_engine + llama_cpp_engine._build_command both switch to the strict / tensor-probe path; GGUF MTP gate falls back to is_mtp_gguf_repo when no local path is available. - routes/setup/mtplx.py: /api/setup/mtplx-status now reports fanControl.{thermalforge,tgPro,anyAvailable,recommendedAction} so the Setup tab can prompt for ThermalForge install before users hit the silent-throttle ceiling on MTPLX --max burst runs. - CLAUDE.md FU-048: deferred prefer-GGUF-MTP routing preference — needs Settings UX before flipping the default since MTPLX-Optimized quants aren't GGUF-mirrored. - tests: 7 new tensor-probe cases in test_inference.py; bumped stale N=1 assertion to N=3 for Qwen3.6 MTP GGUFs (matches MTP_MODEL_MAP).

PR #22673 names MTP weights as ``blk.{N}.nextn.*`` ("Next-N prediction") and emits ``<arch>.nextn_predict_layers`` in the GGUF metadata header, neither of which matched the legacy ``mtp_decoder`` / ``mtp_emb`` / ``mtp_heads`` needles. As a result the tensor probe returned False on a real MTP-GGUF model and the engine never emitted ``--spec-type draft-mtp`` (verified live: ggml-org/Qwen3.6-27B-MTP-GGUF ran at 14.5 tok/s instead of the MTP-accelerated 23 tok/s).

The metadata key lives in the first few KB of the file, so a 2 MB read window catches both the cheap canonical marker and the legacy patterns. Probe now returns True for ggml-org/Qwen3.6-27B-MTP-GGUF.

Head-to-head live numbers (M5 Max, 27B Qwen3.6, MTP enabled both sides):
- GGUF MTP Q8_0: 23.0 tok/s mean (14.5 baseline -> +58.6%)
- MTPLX MTP 4-bit (Optimized-Speed): 28.95 tok/s mean

Tests: rename legacy-tensor-name case + new case pinning the nextn_predict metadata marker. 15 MTP tests green.
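A byte-level header probe of this kind can be sketched as follows. ``gguf_has_mtp_markers`` is a hypothetical name and the needle list is assembled from the strings quoted in this message; the real ``model_has_mtp_tensors()`` may parse the header differently.

```python
from pathlib import Path
from typing import Union

PROBE_WINDOW = 2 * 1024 * 1024  # 2 MB read window, as described in the commit

# Canonical markers from PR #22673 first, then the legacy patterns.
MTP_NEEDLES = (
    b"nextn_predict_layers",
    b".nextn.",
    b"mtp_decoder",
    b"mtp_emb",
    b"mtp_heads",
)


def gguf_has_mtp_markers(path: Union[str, Path], window: int = PROBE_WINDOW) -> bool:
    """Return True when any MTP marker appears in the first `window` bytes.

    This is a plain substring scan, not a GGUF parser: it does not check
    the value stored under the metadata key.
    """
    with open(path, "rb") as fh:
        head = fh.read(window)
    return any(needle in head for needle in MTP_NEEDLES)
```

A substring scan is cheap, but a model that stores the key with a value of zero would still match; a stricter probe would decode the key-value pair.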

v0.9.0 release shipped with package.json / Cargo.toml / tauri.conf.json at 0.9.0 but pyproject.toml still at 0.8.0. Users downloaded "v0.9.0" from the site and the bundled backend reported ``appVersion: 0.8.0`` because ``_resolve_app_version`` reads the staged pyproject. Nothing enforced cross-manifest sync. The pre-build gate now reads the version from all four sources and fails the build when any of them drift apart. Mirrors the existing dflash-mlx pin assert in both ``pre-build-check.mjs`` and ``pre-build-check.sh``.
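The cross-manifest check amounts to reading four version strings and comparing them. The sketch below is written in Python for illustration only (the actual gate lives in the ``.mjs`` / ``.sh`` scripts); all helper names are hypothetical, the TOML reader is a regex simplification that takes the first top-of-line ``version = "..."``, and the JSON reader assumes a top-level ``version`` key.

```python
import json
import re
from collections import Counter
from typing import Dict, Optional

_TOML_VERSION_RE = re.compile(r'^version\s*=\s*"([^"]+)"', re.MULTILINE)


def read_toml_version(text: str) -> Optional[str]:
    """First top-of-line version = "..." in a TOML document (simplified)."""
    match = _TOML_VERSION_RE.search(text)
    return match.group(1) if match else None


def read_json_version(text: str) -> Optional[str]:
    """Top-level "version" key of a JSON manifest (assumed layout)."""
    return json.loads(text).get("version")


def find_version_drift(versions: Dict[str, str]) -> Dict[str, str]:
    """Return the manifests whose version differs from the most common one.

    An empty dict means every source agrees and the build may proceed.
    """
    if not versions:
        return {}
    reference, _ = Counter(versions.values()).most_common(1)[0]
    return {name: ver for name, ver in versions.items() if ver != reference}
```

Applied to the v0.9.0 incident, the drift report would single out pyproject.toml at 0.8.0 against three manifests at 0.9.0.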

…ies + LTX series - Cache compression table gains TaylorSeer / MagCache / PAB / FasterCache rows (previously only TeaCache + FBCache appeared even though the four diffusers-0.38 strategies shipped via FU-026) - DFlash family list adds Gemma-4 (FU-031), Kimi-K2.6, MiniMax-M2.5/M2.7, Qwen3.5-122B-A10B — all in DRAFT_MODEL_MAP - Video model table splits Lightricks LTX-Video (base diffusers) from LTX-2 / LTX-2.3 (mlx-video subprocess) — both ship in the catalog - Feature-map line now reads "FBCache + TeaCache + TaylorSeer + MagCache + PAB + FasterCache" instead of TeaCache-only

…ound job) Adds a self-contained path for users with a working CUDA torch install to upgrade to a newer wheel without re-running the full 2.5 GB GPU bundle. Surfaces as a compact pill in the Image / Video Studio runtime banners when ``realGenerationAvailable`` AND the matching cu{N} pip index serves a newer wheel than the one on disk; silent otherwise. Backend - ``_install_helpers.py``: ``_extract_cuda_tag``, ``_index_url_for_cuda_tag``, ``_parse_version_triple``, ``_classify_torch_upgrade``, ``_query_latest_torch_version`` (pip index versions parser, both output shapes), ``_abi_dependents_present``, ``_move_torch_to_rollback`` + ``_restore_torch_from_rollback`` + ``_cleanup_old_torch_rollbacks``, ``_TORCH_ABI_DEPENDENT_PACKAGES`` constant. - ``routes/setup/torch_upgrade.py``: GET /api/setup/torch-upgrade-available (synchronous detection, returns ``{available, current, latest, upgradeType, rebuildPackages, indexUrl}`` or ``{available: false, reason}``) and POST /api/setup/upgrade-torch (background job mirroring install-gpu-bundle pattern). Worker moves existing torch to ``.torch-rollback-<version>/`` instead of purging, installs target from the same cu{N} index, re-pins constraint, force-reinstalls ABI deps on minor/major bumps (bitsandbytes/torchao/nunchaku/sageattention), verifies CUDA in a subprocess, restores rollback on verify failure, keeps the most recent rollback as a safety net. Frontend - ``src/api/setup.ts``: ``checkTorchUpgradeAvailable``, ``startTorchUpgrade``, ``getTorchUpgradeStatus`` + 4 exported types. Re-exported via ``src/api/index.ts``. - ``src/components/TorchUpgradePill.tsx``: one-shot probe on mount, hides when ``available: false``, three display states (available / in-progress / done-or-error), polls status at 1.5 Hz with cleanup keyed by ``job.done``, inline collapsible install log with phase-named markers. Restart Backend hook plumbed through. 
- ``src/styles.css``: color-coded badges per upgrade type (patch=green / minor=amber / major=red). - Wired into ``ImageStudioRuntimeBanner`` + ``VideoStudioRuntimeBanner``; renders only when ``realGenerationAvailable`` so users with broken torch are not second-guessed. Tests - 24 new tests in ``tests/test_setup_routes.py`` covering every helper (version parsing edge cases including ``2.6.0rc1`` that caught a real bug in the first cut where digits across non-digit boundaries leaked into the parsed triple), both pip output shapes, rollback move/restore round-trip with simulated half-install in extras, cleanup mtime ordering, and all 8 detection-response shapes plus the apple-silicon rejection and running-job POST cases. Drive-by: package-lock.json version field was lagging at 0.8.0 after the 0.9.0 bump; synced. Verified: 24/24 new tests pass, 80/80 ``test_setup_routes.py`` pass, 217/217 setup + backend + services + inference pass, 371/371 vitest pass, ``tsc --noEmit`` clean. Pre-existing ``test_cache_strategies`` / ``test_sdcpp_*`` / ``test_preview_thumbnails`` failures verified to exist on baseline (Windows env + optional diffusers deps), unrelated.
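A minimal sketch of the version-triple parsing and upgrade classification described above, including the ``2.6.0rc1`` edge case. ``parse_version_triple`` and ``classify_upgrade`` are hypothetical stand-ins for ``_parse_version_triple`` / ``_classify_torch_upgrade``; the real helpers may differ in signature and in how they treat pre-release or local-version suffixes.

```python
import re
from typing import Optional, Tuple

# Anchored at the start: only the leading "X.Y[.Z]" run is read, so digits
# in suffixes such as "rc1" or "+cu124" never leak into the triple.
_TRIPLE_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?")


def parse_version_triple(version: str) -> Optional[Tuple[int, int, int]]:
    """Return (major, minor, patch), or None when the string has no X.Y prefix."""
    match = _TRIPLE_RE.match(version.strip())
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def classify_upgrade(current: str, latest: str) -> Optional[str]:
    """Return "patch", "minor" or "major", or None when latest is not newer."""
    cur = parse_version_triple(current)
    new = parse_version_triple(latest)
    if cur is None or new is None or new <= cur:
        return None
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"
```

Under these assumptions ``classify_upgrade("2.5.1+cu124", "2.6.0")`` is a minor bump, which is the case where the ABI-dependent packages would be force-reinstalled.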

Both branches added a new setup submodule + API client surface at the same insertion points: staging: routes/setup/mtplx.py + Mtplx{Attempt,JobState,Status} this PR: routes/setup/torch_upgrade.py + TorchUpgrade{Availability, Attempt,JobState,Type,UnavailableReason} Resolution keeps both — order in routes/setup/__init__.py is now alphabetical (longlive → mtplx → torch_upgrade → turbo → wan_install), both routers register. In src/api/setup.ts the MTPLX block goes first (it landed on staging first) and the Torch upgrade block follows; src/api/index.ts re-exports both, alphabetised. Verified post-merge: 80/80 test_setup_routes pass, 371/371 vitest pass, ``tsc --noEmit`` clean, and both routers register the expected paths (``/api/setup/install-mtplx`` + ``/api/setup/upgrade-torch``).

Run the CLI-driven E2E suite reliably on Windows by invoking the extensionless CLI through Python, writing reports as UTF-8, and treating missing video runtime prerequisites as skips. Also make the Vitest config ESM-safe for runner-mode loading and keep the Tauri lockfile version in sync.

feat: in-place torch upgrade with rollback (detection + pill + backgr…

Bundles the M4 Max test-suite session work — tracker rows, two real bugs, two user-requested features. Tracker (CLAUDE.md): - FU-049 Python 3.14 support gate (deferred; pyproject stays >=3.10) - FU-050 matrix runner: reasoning-channel capture + max-tokens 96->512 + stale endpoint/path fixes (/api/chat/generate/stream, runtime.loadedModel) - FU-051 /api/models/load echoes legacy cacheStrategy verbatim (open) - FU-052 matrix grows 9->15 cells: MTPLX MLX, GGUF MTP, 4x vLLM (CUDA-gated) - FU-053 distill variants flagged installed when only base repo on disk - FU-054 same-repo siblings: per-file size + shares-storage badge - FU-055 in-app storage explorer in Diagnostics tab Bugs fixed: - _distill_transformer_validation_error checks distillTransformerRepo + high/low-noise filenames before marking availableLocally true. Closes the FU-053 false positive on Wan2.2-I2V-A14B distill bf16/fp8. - pre-build-check.sh + .mjs pointed at the wrong turbo fork (johndpope/...planarquant); corrected to TheTom/...turboquant-kv-cache matching build-llama-turbo.sh + CLAUDE.md. Features: - Star/favourite models on Chat -> My Models. New favoriteModelRefs in settings (+ UpdateSettingsRequest field, dedup-trimmed apply, payload), ActionIconName 'star'/'starOutline' SVG, .action-favorite CSS, toggle handler in App.tsx writes via PATCH /api/settings + refreshes. Starred rows lift to top of the library list. - Diagnostics 'Disk usage - top 20 model repos' section. New GET /api/diagnostics/storage-top endpoint walks every enabled modelDirectories entry one level deep, sums via _path_size_bytes (inode-deduped so HF snapshot/blob symlinks count once). Closes the Stuff Diver gap on HF cache layouts. Live: 1213 GB total on this box. Matrix runner (scripts/cache-strategy-matrix.py): - Smoke models bumped Qwen2.5-0.5B -> Qwen3-0.6B (current gen) - New cells for MTPLX MLX, GGUF MTP (FU-047), vLLM native/turboquant/ triattention/dflash. 
BackendCapabilities adds mtplx_available, gguf_mtp_available, vllm_available probed from /api/health. Tests: - 4 new unit tests pin the FU-053 distill validator (no-distill, missing snapshot, partial snapshot, both files present). - test_cache_strategy_matrix_runner kwargs widened for new caps. - Full suite: 1455 passed, 1 skipped, 132 subtests passed. - npx tsc --noEmit clean; npm test 32 files / 371 tests pass.
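A sketch of the inode-deduped size walk behind the storage-top endpoint, assuming the dedup key is the ``(st_dev, st_ino)`` pair. ``path_size_bytes`` is a hypothetical stand-in for ``_path_size_bytes``; the real helper may handle errors and directory symlinks differently.

```python
import os
from pathlib import Path


def path_size_bytes(root: Path) -> int:
    """Sum file sizes under root, counting each underlying inode once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # os.stat follows symlinks, so an HF snapshot symlink and the
                # blob it points at resolve to the same (device, inode) pair.
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # broken symlink or unreadable entry
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size
    return total
```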

Brings PR #57 up to date with staging (29 commits behind). Two conflicts resolved manually: - src/api/index.ts: re-exports from ./setup. Both branches added MTPLX-related exports independently; took the union alphabetically sorted (getMtplxInstallStatus, getMtplxStatus, startMtplxInstall). - src/api/setup.ts: torch-upgrade machinery was introduced via parallel commits on both branches (our dca2c12 + staging f514ea4). Auto-merge produced duplicate TorchUpgradeAvailability / TorchUpgradeType / TorchUpgradeUnavailableReason / TorchUpgradeAttempt / TorchUpgradeJobState declarations + duplicate checkTorchUpgradeAvailable / startTorchUpgrade / getTorchUpgradeStatus functions. Removed the duplicate block; kept one canonical section with types + functions in proper order. Verified post-merge: - npx tsc --noEmit: clean - npm test: 32 files / 371 tests pass - pytest tests/: 1455 passed, 1 skipped, 132 subtests passed

Feature/chaos engine ai cli

Foundation for in-app install UX. Lazy importability + version probes for nunchaku / sageattention / dflash-mlx / dflash-cuda / triattention / kvpress, plus a Windows-only wsl2 detector that seeds the upcoming vLLM-via-WSL bridge. Eleven new fields on BackendCapabilities surface through /api/health; the placeholder probe primes them on first paint so the UI never flashes Install for a package that is actually present. Probes resilient to the half-baked-install failure mode we hit on Windows (torch directory present but Python source missing): find_spec swallows ValueError, version reads swallow ImportError and missing __version__. DFlash MLX vs CUDA flags delegate to the existing dflash.is_mlx_available / dflash.is_vllm_available helpers so the upstream package-layout dance stays in one place. Tests: 25 in tests/test_accelerator_capabilities.py covering present / absent / broken-install / WSL-status branches.
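A minimal sketch of a probe with the resilience properties described above: ``find_spec`` failures and broken imports are swallowed rather than propagated. ``probe_package`` is a hypothetical name, and reporting a found-but-unimportable package as present with no version is an assumption, not necessarily what the real probes do.

```python
import importlib
import importlib.util
from typing import Optional, Tuple


def probe_package(module_name: str) -> Tuple[bool, Optional[str]]:
    """Return (available, version) without ever raising."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ValueError, ImportError):
        # ValueError: half-baked install where the module has no usable spec.
        # ImportError: a parent package of a dotted name is missing.
        return False, None
    if spec is None:
        return False, None
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Assumption: the package is on disk but broken, so report it as
        # present with an unknown version.
        return True, None
    version = getattr(module, "__version__", None)
    return True, (str(version) if version is not None else None)
```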

Tests should exercise the same install users have, not a parallel .venv install. New tests/conftest.py calls ensure_extras_on_sys_path at collection time, so pytest tests/ resolves torch / diffusers / mlx / nunchaku / sageattention / triattention / vllm against the persistent extras dir at:

Windows: %LOCALAPPDATA%\ChaosEngineAI\extras\cp{XY}\site-packages
macOS: ~/Library/Application Support/ChaosEngineAI/extras/cp{XY}/site-packages
Linux: ${XDG_DATA_HOME}/ChaosEngineAI/extras/cp{XY}/site-packages

A torch upgrade landing via the in-app installer is reflected in the next pytest run automatically; no pip install dance in .venv. On a fresh CI box without the extras dir the conftest is a silent no-op, so existing test boxes keep working. Set CHAOSENGINE_TEST_TRACE_EXTRAS=1 to log which extras path got loaded for a given run.

Runners (e2e_test_suite.py, cache-strategy-matrix.py) now print an actionable hint when the backend is not reachable (open the ChaosEngineAI app) rather than just "backend not reachable; aborting". They still exit with codes 2 and 3 respectively, so CI gates stay reliable.

Docs (testing/overview.md, testing/e2e-testing.md) updated with the canonical open-the-app-then-run-tests flow, with the headless dev backend kept as an advanced option for contributors.
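A sketch of how the per-platform extras path above could be resolved. ``extras_site_packages`` is a hypothetical helper, not the real ``ensure_extras_on_sys_path``; the ``~/.local/share`` fallback when ``XDG_DATA_HOME`` is unset follows the XDG default and is an assumption about this project.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import Mapping, Optional


def extras_site_packages(
    platform: str, env: Mapping[str, str], home: str, py_tag: str
) -> Optional[str]:
    """Return the extras site-packages path for a platform, or None if unknown.

    platform follows sys.platform ("win32", "darwin", "linux");
    py_tag is the cp{XY} directory name, e.g. "cp312".
    """
    tail = ("ChaosEngineAI", "extras", py_tag, "site-packages")
    if platform == "win32":
        base = env.get("LOCALAPPDATA")
        if not base:
            return None
        return str(PureWindowsPath(base).joinpath(*tail))
    if platform == "darwin":
        root = PurePosixPath(home) / "Library" / "Application Support"
        return str(root.joinpath(*tail))
    # Assumption: fall back to the XDG default when XDG_DATA_HOME is unset.
    base = env.get("XDG_DATA_HOME") or str(PurePosixPath(home) / ".local" / "share")
    return str(PurePosixPath(base).joinpath(*tail))
```

A conftest built on this would add the returned directory to ``sys.path`` only when it exists on disk, which gives the silent no-op on a fresh CI box.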

build-sdcpp.sh failed for the user with ``fatal: not a git repository`` because /tmp/stable-diffusion.cpp/.git survived a partial /tmp cleanup as an empty directory: the existence test ``[[ -d $DIR/.git ]]`` passed, but ``git fetch`` immediately failed inside it. The same latent bug existed across 5 bash scripts + 2 PowerShell scripts. All now use ``git rev-parse --git-dir`` to validate that the checkout is a real repo, and ``rm -rf`` the stale dir before re-cloning when it is not.

Patched:
- scripts/build-sdcpp.sh
- scripts/build-llama-turbo.sh
- scripts/update-llama-turbo.sh
- scripts/update-sdcpp.sh
- scripts/update-llama-cpp.sh (improved error message)
- scripts/build-llama-turbo.ps1
- scripts/update-llama-turbo.ps1
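The same check, sketched in Python for illustration only (the patched scripts are bash and PowerShell). ``is_real_git_checkout`` is a hypothetical helper; setting ``GIT_CEILING_DIRECTORIES`` is an addition of this sketch so that git does not walk up and accept an enclosing repository.

```python
import os
import subprocess
from pathlib import Path


def is_real_git_checkout(path: Path) -> bool:
    """True only when git itself accepts path as a repository."""
    if not (path / ".git").exists():
        return False
    # Stop git from searching parent directories for another repo.
    env = dict(os.environ, GIT_CEILING_DIRECTORIES=str(path.resolve().parent))
    try:
        result = subprocess.run(
            ["git", "-C", str(path), "rev-parse", "--git-dir"],
            env=env,
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return False  # git is not installed or not executable
    return result.returncode == 0
```

A caller would remove the stale directory and re-clone whenever this returns False, which covers the empty ``.git`` directory that a bare existence test lets through.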

Reusable card for the six CUDA-side accelerators (nunchaku, sageattention, dflash-mlx, dflash-cuda, triattention, kvpress). Three placement variants share one component so the per-feature surfaces in Phases 3-6 stay in sync without re-implementing the four states (idle / installing / installed / failed) per surface:
- card: full banner with title, claim, applies-to, size pill, primary action. Lands in the Image / Video Studio runtime banners and the Diagnostics Boost Pack.
- pill: compact horizontal chip with 4-bit-style copy. Lands on catalog variant cards in the Discover / Models tabs.
- row: table form for Diagnostics Boost Pack's scannable view.

State ownership: parent owns the install lifecycle (which package is in flight, success/failure, captured pip output). The card only owns the log-expanded toggle. Mirrors the CudaTorchLogPanel contract so the card is cheap to render in many places without duplicating polling work.

New catalog (src/components/acceleratorCatalog.ts) is the single source of truth for each accelerator's pip name, capability flag, speedup claim, size, install mode, and platform gate. Adding a seventh accelerator is one entry here, one Phase 1 capability flag, and one row in the backend's _INSTALLABLE_PIP_PACKAGES.

NativeBackendStatus (src/types/server.ts) extended with the 13 FU-056 Phase 1 fields plus the older vllm/mtplx/ggufMtp fields that were already on the wire but missing from the TS interface. All fields optional so a backend running an older build than the frontend doesn't break the type contract.

Tests (28 new): catalog shape pinning + getAccelerator lookup + isPlatformCompatible matrix + readInstalled / readVersion / platformLabel / actionLabelFor branch coverage. Vitest harness stays at pure-function level - no React Testing Library yet, per the existing src/components/__tests__/ convention.
CSS: .accelerator-card / -pill / -row variants in styles.css, matching the existing .torch-upgrade-pill colour vocabulary (rgba(80, 140, 220, ...) for the not-installed accent, rgba(80, 180, 100, ...) for installed, --border + --surface tokens for the chrome).

First end-to-end UX slice for FU-056. The Diagnostics tab gains a Boost Pack section listing all six CUDA-side accelerators (nunchaku, sageattention, dflash-mlx, dflash-cuda, triattention, kvpress) as a single scannable table. Status pill + Install / Retry button per row; click installs via the existing POST /api/setup/install-package endpoint, output captured into a collapsible details, then capabilities re-probe so the "Installed v1.2.1" pill flips without a parent refetch. Self-probes capabilities on mount via refreshCapabilities() so the panel works standalone — DiagnosticsPanel only passes backendOnline. Per-accelerator install state lives in a record keyed by pip name, so multiple installs can run concurrently if the user is impatient (the backend serialises pip writes at the OS-FS layer). Renders every catalog row with showIncompatible=true: this is the "see everything" surface, not a per-feature gate. Apple-Silicon and CUDA accelerators both list; the platform column tells the user which apply to their box, and disabled state + tooltip blocks an ill-fitting install. Phases 3-5 will filter per surface. Closes the first observable loop: Phase 1 probe → Phase 2 card (row variant) → install → re-probe → installed state. Same Component renders pill + card + row, so the per-feature surfaces in Phases 3-5 ride the same diff. No new tests — the pure logic (readInstalled, readVersion, actionLabelFor, platformLabel, isPlatformCompatible) is already pinned by Phase 2's 28 unit tests. The Boost Pack itself is wiring: fetch capabilities, dispatch install, re-fetch on success. Mirrors the existing CudaTorchLogPanel pattern.

Wires accelerator install affordances into the three Image surfaces users actually look at when picking + running a model: 1. Image Models tab — every installed FLUX / SD3.5 / Qwen-Image / SANA / PixArt row gets read-only pills next to the style tags: "🚀 SVDQuant 4-bit" + "🚀 Fast attention DiT" when the accelerator is missing, "✓ ..." when present. UNet pipelines (SD1.5 / SDXL) show no pills — neither nunchaku nor sageattention applies. 2. Image Discover tab — same pills on catalog variant cards in the same position. Lets users see acceleration potential before committing to a download. 3. Image Studio runtime banner — new "Performance boosters" section between the torch-upgrade pill and the model-load summary. Card variants of the same accelerators with full Install / Retry buttons. Self-contained install state: clicks POST /api/setup/install-package, capture the response capabilities, and overlay them onto the parent-provided snapshot so the card flips to "✓ Installed v..." without waiting for the next workspace refetch. The pills on the Models / Discover tabs are deliberately read-only — the install action lives in Studio's runtime banner so install state stays concentrated. A new optional onInstall prop on AcceleratorCard drives this: when omitted, the card renders as passive info. New helper getApplicableAccelerators(repo) maps a model repo to the accelerator IDs that apply. Pattern-matches on the family slug (FLUX.1, sd3.5, qwen-image, sana, pixart-sigma) so we don't have to edit catalog/image_models.py to land this — the catalog-side recommendedAccelerators metadata pattern is reserved for Phase 7 when the i18n + per-variant overrides land together. 7 new unit tests pin the matrix (FLUX, SD3.5, Qwen-Image, SANA, PixArt for nunchaku+sageattention; Wan / HunyuanVideo / LTX / CogVideoX / Mochi for sageattention-only; Wan2.1-T2V-1.3B for the triattention LongLive bonus; SDXL / SD1.5 return empty). 
NativeBackendStatus threads from App.tsx → ImageModelsTab, ImageDiscoverTab, ImageStudioTab → ImageStudioRuntimeBanner → ImageStudioBoosters. The prop is optional everywhere so older backends without FU-056 Phase 1 fields collapse pills to their "available" state rather than crashing the tab. Deferred to a follow-up commit: the post-generation suggestion toast (fires when a non-Nunchaku FLUX gen takes >12s on CUDA). The discovery + install surfaces in this commit already give users a clean path to install accelerators contextually; the toast adds a nudge but the install affordance is reachable without it.
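A sketch of the repo-to-accelerator matrix that ``getApplicableAccelerators(repo)`` is described as pinning, written in Python for illustration (the real helper is TypeScript). The substring markers below are assumptions; the actual family-slug patterns are not given in the commit message.

```python
from typing import List

# Hypothetical markers for the DiT families named in the commit message.
_IMAGE_DIT_MARKERS = (
    "flux.1", "sd3.5", "stable-diffusion-3.5", "qwen-image", "sana", "pixart",
)
_VIDEO_DIT_MARKERS = ("wan2", "hunyuanvideo", "ltx", "cogvideox", "mochi")


def get_applicable_accelerators(repo: str) -> List[str]:
    """Map a model repo id to the accelerator ids that apply to it."""
    slug = repo.lower()
    if any(marker in slug for marker in _IMAGE_DIT_MARKERS):
        return ["nunchaku", "sageattention"]
    if any(marker in slug for marker in _VIDEO_DIT_MARKERS):
        accelerators = ["sageattention"]
        # TriAttention surfaces only on the LongLive-capable Wan 2.1 variant.
        if "wan2.1-t2v-1.3b" in slug:
            accelerators.append("triattention")
        return accelerators
    return []  # UNet pipelines (SD1.5 / SDXL) and anything unrecognised
```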

Mirrors the Image-side wiring from Phase 3 onto the Video tabs:
1. Video Models tab - every Wan / HunyuanVideo / LTX / CogVideoX / Mochi row gets read-only accelerator pills next to the style tags. SageAttention applies to all CUDA video DiTs; TriAttention surfaces specifically on Wan 2.1 T2V 1.3B for the LongLive real-time long-clip mode.
2. Video Discover tab - same pills on catalog variant cards in the same chip-row position.
3. Video Studio runtime banner - new "Performance boosters" section between the torch-upgrade pill and the LongLive install row. Full card variants with working Install / Retry buttons + collapsible pip output.

Implementation note: the booster section was identical to the image-side equivalent (same install state machine, same card rendering, same overlay-on-install-success pattern). Renamed ImageStudioBoosters -> MediaStudioBoosters and moved to src/components/ so both surfaces share one file. The component now takes a minimal {repo, name?} variant slice rather than a concrete ImageModelVariant / VideoModelVariant - both shapes carry those fields and the booster logic doesn't need anything else. One source of truth for the install / overlay / re-probe dance.

NativeBackendStatus threads from App.tsx -> VideoDiscoverTab, VideoModelsTab, VideoStudioTab -> VideoStudioRuntimeBanner -> MediaStudioBoosters. Prop is optional everywhere so older backends without FU-056 Phase 1 fields collapse pills to their "available" state rather than crashing the tab.

No new tests required - the getApplicableAccelerators repo-pattern matrix is already pinned by Phase 3's 7 tests, including the relevant video repos (Wan2.1-T2V-1.3B with triattention bonus, Wan2.2-T2V-A14B without, HunyuanVideo, LTX-Video, CogVideoX, Mochi). MediaStudioBoosters internals match the previous ImageStudioBoosters, no behavioural changes.

Brings the in-app accelerator install affordance to the chat surface. When the user is chatting with a model that has a registered DFlash draft AND the appropriate pip package isn't installed yet, an unobtrusive nudge bar appears above the prompt textarea: DFlash speculative decoding can ~2x this model with no quality loss. [Install DFlash] Click installs the right package for the active backend (``dflash-mlx`` on Apple Silicon MLX, ``dflash`` on CUDA vLLM) via the existing ``handleInstallPackage`` dispatcher. The bar self-hides when the package lands and capabilities re-probe. Twin gating logic to the AcceleratorCard pattern: the hint only renders when all three signals line up (model in supportedModels, package missing for active backend, supported backend). The backend probe + ``resolveDflashSupport`` helper already exist from FU-034; this commit wires them into the composer. Drive-by fix in RuntimeControls.tsx: the existing "Install DFlash" button next to the launch-settings toggle hard-coded ``onInstallPackage("dflash-mlx")``, which silently installed the Apple-Silicon package on CUDA / Windows boxes running vLLM. Both the launch-settings button and the new composer hint now route through a shared ``dflashPackageFor(backend)`` helper that picks the right package per backend. 3 new unit tests pin the matrix (mlx -> dflash-mlx, vllm -> dflash, null / unknown -> dflash-mlx as safe default). Net change for the user: discover acceleration potential from the place where you generate (chat composer / studio runtime banner / catalog cards), not from a settings page you have to remember to visit.
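The package matrix pinned by the three unit tests can be sketched as follows, in Python for illustration (the real ``dflashPackageFor`` helper is TypeScript). ``dflash_package_for`` is a hypothetical name.

```python
from typing import Optional


def dflash_package_for(backend: Optional[str]) -> str:
    """Pick the DFlash pip package for the active backend."""
    if backend == "vllm":
        return "dflash"  # CUDA vLLM
    # "mlx", None and unknown backends all fall back to the Apple Silicon
    # package, matching the safe default described in the commit message.
    return "dflash-mlx"
```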

vLLM ships no native Windows wheels; this commit lets Windows users install vLLM into an isolated WSL venv with one click. Three pieces: 1. **Detector** (backend_service/inference/accelerators.py): four new probes layered on top of the existing wsl2_available helper: - wsl_default_distro() reads "Default Distribution: Ubuntu-X" out of the UTF-16 ``wsl --status`` output - wsl_cuda_available() runs ``wsl -- nvidia-smi -L`` to confirm CUDA passthrough is working inside the distro - wsl_vllm_available() runs an ``import vllm`` inside the managed venv at ~/.chaosengine/vllm-venv - wsl_vllm_version() reads __version__ from the same venv Four matching fields on BackendCapabilities (wslDistroName, wslCudaAvailable, wslVllmAvailable, wslVllmVersion). The detail probes shell out via wsl.exe and can take a few seconds on a cold WSL service start, so they're gated behind a wsl2_active short-circuit — hosts without WSL pay zero subprocess cost. 2. **Install endpoint** (backend_service/routes/setup/vllm_wsl.py): POST /api/setup/install-vllm-wsl + /status. Background-thread job with five steps: - preflight (verify CUDA visible in WSL) - venv (python3 -m venv ~/.chaosengine/vllm-venv) - pip-upgrade (pip + setuptools + wheel) - pip-vllm (the long one, ~2 GB / 5-15 min) - verify (import vllm) Same single-job semantics as install-longlive: a second POST while running returns the running job state. The venv is rooted in the WSL user's $HOME (ext4-backed) so CUDA torch wheels don't pay the ~10x IO penalty of being on /mnt/c/. 3. **WslBridgePanel** (src/features/settings/WslBridgePanel.tsx): Windows-only Setup panel rendered alongside the Boost Pack on the Diagnostics tab. 
Four bucket states: - WSL2 not installed → ``wsl --install`` copy-paste hint + MS docs - WSL2 ready, no CUDA → NVIDIA WSL driver kicker link - WSL2 + CUDA ready, vLLM missing → one-click install button - vLLM ready → green pill with version + "Reinstall" affordance Self-probes capabilities on mount, polls install status at 1.5 Hz while a job is in flight, refreshes capabilities on completion so the bucket flips without a parent refetch. Uses the existing InstallLogPanel for log tail (extended to accept the new "vllm-wsl" variant). Tests: 12 new probe tests covering the present / absent / cold-host matrix for each WSL detail probe, plus 4 endpoint tests pinning the job-state shape + the Windows platform gate + the start/status contract. Live-verified on Windows + RTX 4090: detector returns ``distro=Ubuntu-24.04, cuda=True, vllm=False, version=None`` — correct for the dev box right now. Deferred to a follow-up commit: the actual engine routing so a vLLM model load transparently launches inside the WSL venv. This commit ships only the install path so users can stand up the venv today; the engine wiring needs careful path translation (/mnt/c/Users/... → Windows paths) and stdout streaming that deserves its own focused PR.
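A sketch of how the default distro might be read out of the UTF-16 ``wsl --status`` output mentioned in the detector section. ``parse_default_distro`` is a hypothetical helper, not the real ``wsl_default_distro``; it assumes UTF-16LE bytes and the English "Default Distribution:" label, so localised Windows installs would need extra handling.

```python
import re
from typing import Optional

_DEFAULT_DISTRO_RE = re.compile(
    r"^Default Distribution:\s*(\S.*?)\s*$", re.MULTILINE
)


def parse_default_distro(raw: bytes) -> Optional[str]:
    """Extract the default distro name from raw `wsl --status` bytes."""
    # Decode as UTF-16LE and drop a BOM if one is present.
    text = raw.decode("utf-16-le", errors="ignore").lstrip("\ufeff")
    text = text.replace("\r\n", "\n")
    match = _DEFAULT_DISTRO_RE.search(text)
    return match.group(1) if match else None
```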

Completes the WSL bridge so Windows users get transparent vLLM inference. A model load with backend=vllm on Windows + wslVllm installed transparently spawns the OpenAI-compatible server inside the WSL Ubuntu venv and proxies /v1/chat/completions through it. No user action beyond clicking "Install vLLM in WSL" once. Three pieces: 1. **VllmWslEngine** (backend_service/inference/vllm_wsl_engine.py): HTTP-bridge engine modelled on MtplxEngine. Subprocess shape: wsl -- ~/.chaosengine/vllm-venv/bin/python -m vllm.entrypoints.openai.api_server --model <ref> --host 127.0.0.1 --port <free> --max-model-len <ctx> --trust-remote-code WSL2 mirrors loopback to the Windows host so the Windows backend reaches the listener at 127.0.0.1:<port> without any port-forward ceremony. Implements both generate() and stream_generate() so the existing chat surface stream path works end to end. 2. **windows_path_to_wsl helper**: a local model at C:\Users\Dan\AI_Models\Qwen3-7B gets translated to /mnt/c/Users/Dan/AI_Models/Qwen3-7B before being passed to vLLM, so a Windows-side download is reachable from inside WSL. HF repo ids (Qwen/Qwen3.5-7B) pass through unchanged - vLLM downloads them into its WSL-native HF cache, which avoids the ~10x IO penalty of /mnt/c-based cache reads. 3. **Routing** (backend_service/inference/controller.py): when ``hint == "vllm"`` the controller now prefers VllmWslEngine on Windows + wslVllmAvailable=True, falling through to the in-process VLLMEngine on Linux. On Windows boxes without the bridge, the error message points the user at Diagnostics → WSL2 vLLM bridge instead of the bare "pip install vllm" hint that doesn't work on Windows. Speculative decoding via the WSL bridge isn't wired yet - the in-process VLLMEngine uses vllm.LLM's speculative_config= kwarg, but the OpenAI server entry-point uses --speculative-model / --num-speculative-tokens which need separate wiring. The runtime note honestly flags the gap rather than silently dropping requests. 
Tests: 13 new in test_vllm_wsl_engine.py covering: - windows_path_to_wsl matrix (backslash, forward-slash, drive casing, WSL passthrough, repo-id passthrough, UNC, relative) - load_model platform gate (off-Windows rejects) - load_model capability gate (wslVllm missing rejects) - argv composition (every required vllm flag present + ordered) - happy-path lifecycle (Popen called once, /health polled, LoadedModelInfo populated correctly, pid reachable) - path translation on a Windows model path
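A minimal sketch of the path translation described for ``windows_path_to_wsl``. Passing UNC and relative paths through unchanged is an assumption; the commit message lists them in the test matrix without stating the expected result.

```python
import re

# Drive letter, colon, then either slash style: "C:\..." or "c:/...".
_DRIVE_PATH_RE = re.compile(r"^([A-Za-z]):[\\/](.*)$")


def windows_path_to_wsl(ref: str) -> str:
    """Translate a Windows drive path to its /mnt/<drive>/ form.

    Anything that is not a drive path (HF repo ids, existing WSL paths)
    is returned unchanged.
    """
    match = _DRIVE_PATH_RE.match(ref)
    if match is None:
        return ref
    drive, rest = match.groups()
    return f"/mnt/{drive.lower()}/" + rest.replace("\\", "/")
```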

Caught during a live end-to-end test on a fresh Ubuntu 24.04 WSL install: ``python3 -m venv ~/.chaosengine/vllm-venv`` fails with ``ensurepip is not available`` because Ubuntu 24.04 ships python3 without the venv module. Before this commit the user would see a confusing error mid-install ("Failed to create the WSL venv. See output above.") with the real fix buried in stderr. Now the preflight step explicitly probes ``python3 -c 'import ensurepip'`` after the CUDA check. When it fails, the install endpoint surfaces the exact apt command: sudo apt update && sudo apt install -y python3-venv instead of trying to create the venv and erroring out. Same pattern as the existing NVIDIA-driver-not-found path: tell the user what to do, don't pretend to recover.
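A sketch of the preflight probe, with the command runner injected so the logic can be exercised without WSL. ``venv_preflight_error`` and ``run_command`` are hypothetical names, and the exact argv the real endpoint uses to reach python3 inside WSL is an assumption.

```python
import subprocess
from typing import Callable, Optional, Sequence

APT_HINT = "sudo apt update && sudo apt install -y python3-venv"


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit code (127 if it cannot start)."""
    try:
        return subprocess.run(list(argv), capture_output=True, check=False).returncode
    except OSError:
        return 127


def venv_preflight_error(
    run: Callable[[Sequence[str]], int] = run_command,
) -> Optional[str]:
    """Return an actionable message when python3 in WSL lacks ensurepip."""
    code = run(["wsl", "--", "python3", "-c", "import ensurepip"])
    if code == 0:
        return None
    return f"python3 in WSL cannot create a venv (ensurepip missing). Run: {APT_HINT}"
```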

…se 8) End-to-end test validated against real CUDA + real vLLM 0.21.0 + real WSL2 Ubuntu-24.04 on Windows + RTX 4090. Loaded Qwen2.5-0.5B-Instruct in 96 s and generated "Paris." for the prompt "The capital of France is", with a 1.19 s HTTP round-trip from the Windows backend into WSL and back. Four fixes the live test surfaced, none of which would have been caught by mocked unit tests:

1. **PATH plumbing through grandchild processes**: the engine subprocess inside vLLM (EngineCore) couldn't find ``ninja`` for flashinfer's JIT-compiled sampling kernels, even though it lived in the venv's bin/. The command builder now wraps the python invocation in ``bash -c`` so we can prepend ``~/.chaosengine/vllm-venv/bin`` to PATH explicitly. The PATH value is double-quoted because WSL2 interop injects the Windows PATH into bash, and that PATH contains paths with spaces (``/mnt/c/Program Files/NVIDIA…``) which otherwise word-split into ``export: 'Files/NVIDIA': not a valid identifier`` errors.
2. **vLLM 0.21+ flashinfer JIT escape hatches**: even with ninja reachable, flashinfer needs ``nvcc`` for the second compile stage. Setting ``VLLM_USE_FLASHINFER_SAMPLER=0`` + ``VLLM_ATTENTION_BACKEND=TORCH_SDPA`` routes through pre-built PyTorch kernels. ``--enforce-eager`` disables CUDA-graph compilation. Loses some perf but avoids the second JIT.
3. **/v1/models probe instead of /health**: vLLM's ``/health`` returns 200 with an empty body, which tripped ``_http_json``'s ``json.loads`` and made ``_wait_for_server`` retry until the timeout. ``/v1/models`` returns the loaded-model list as JSON so the parse succeeds and we return on first OK.
4. **shlex-quoted model arg**: a model path with spaces (e.g. a Windows-translated ``/mnt/c/My Models/Qwen3-7B``) would word-split through the bash -c parse without quoting. New test pins the round-trip.
Plus the install endpoint's preflight already grew a clear "sudo apt install python3-venv" message (last commit) — caught the same way, just earlier in the chain. New file ``scripts/live_e2e_vllm_wsl.py`` — not part of the regular test suite; one-shot script that probes capabilities, constructs the engine, loads a tiny chat-tuned model (Qwen/Qwen2.5-0.5B-Instruct), generates a deterministic prompt, prints metrics, tears down. Run from Windows + WSL with vllm-venv installed: ``.venv\Scripts\python.exe scripts\live_e2e_vllm_wsl.py``. Exit 0 on success, 1 with full traceback on failure. Tests: 15 in test_vllm_wsl_engine.py still pass (3 lifecycle + 3 command-shape + 5 path-translation + 2 platform-gate + 2 capability-gate). All 42 in the wider WSL-bridge test files green. Live-test run output: Loaded in 96.3s engine: vllm-wsl runtimeNote: vLLM 0.21.0 running inside WSL (Ubuntu-24.04). pid: 34036 port: 58586 text: 'Paris.' finishReason: stop promptTokens: 34 completionTokens: 3 responseSeconds: 1.19
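A sketch of a command builder that combines fixes 1, 2 and 4 above: a double-quoted PATH export, the flashinfer escape hatches, and a ``shlex``-quoted model argument. ``build_vllm_wsl_argv`` is a hypothetical name; using ``$HOME`` rather than ``~`` (a tilde is not expanded inside double quotes) is a choice of this sketch, and how ``wsl.exe`` forwards the script to bash is not covered.

```python
import shlex
from typing import List

# $HOME instead of "~" so the path still expands inside double quotes.
VENV_BIN = "$HOME/.chaosengine/vllm-venv/bin"


def build_vllm_wsl_argv(model_ref: str, port: int, max_model_len: int) -> List[str]:
    """Compose the wsl -> bash -c argv that launches the vLLM server."""
    server = " ".join([
        f'"{VENV_BIN}/python"',
        "-m", "vllm.entrypoints.openai.api_server",
        "--model", shlex.quote(model_ref),  # survives spaces in the path
        "--host", "127.0.0.1",
        "--port", str(port),
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",
        "--enforce-eager",
    ])
    script = "; ".join([
        # Double-quoted so Windows PATH entries with spaces do not word-split.
        f'export PATH="{VENV_BIN}:$PATH"',
        "export VLLM_USE_FLASHINFER_SAMPLER=0 VLLM_ATTENTION_BACKEND=TORCH_SDPA",
        f"exec {server}",
    ])
    return ["wsl", "--", "bash", "-c", script]
```

Splitting the generated script with ``shlex.split`` and checking that the token after ``--model`` equals the original path is one way to pin the quoting round-trip.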

Closes the test-coverage gap on everything FU-056 has shipped over the previous eight commits. Three small additions across the existing test-gate scripts:

1. **scripts/cache-strategy-matrix.py**: capability probe now considers the WSL vLLM bridge a valid vllm provider. Without this, all four vllm matrix cells would skip with "vLLM not installed (CUDA-only)" on Windows boxes even though the bridge route works (validated by the live e2e in commit c4f3701). New ``wsl_vllm_available`` field on BackendCapabilities; the skip-reason copy now names both routes so a user reading a skip-row knows their actionable next step regardless of OS.

2. **scripts/pre-build-check.mjs [5/8]**: extended with a new sub-probe that walks ``src/components/acceleratorCatalog.ts`` for every (pipPackage, capabilityField) pair and asserts each one exists in (a) the backend's _INSTALLABLE_PIP_PACKAGES allow-list and (b) the BackendCapabilities dataclass. Surface: ``PASS Accelerator catalog ↔ backend (6 entries)``. Catches drift: adding a 7th catalog row without wiring its pip package + capability flag would fail the gate at build time rather than at first user click. Six entries today (nunchaku, sageattention, dflash-mlx, dflash-cuda, triattention, kvpress).

3. **scripts/e2e_test_suite.py phase 6**: two new read-only probes alongside the existing 7:
   - ``vllm-wsl-status``: GETs /api/setup/install-vllm-wsl/status and asserts the JSON shape (phase + done fields present). Verifies the Phase 8 install endpoint at minimum returns the expected schema even when no install has been started.
   - ``fu-056-capability-flags``: GETs /api/health and asserts all 7 FU-056 Phase 1 capability fields are present on ``nativeBackends``. The fields are optional in the schema (older backends shouldn't crash the frontend), but the gate ensures release builds expose them.

   Phase 6 grows from 7 to 9 checks. Verified live against the user's running backend: PASS 9/9.

No new test files. Phase 9 is gate plumbing on existing scripts.
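The catalog ↔ backend drift check described in item 2 can be sketched roughly as below. This is an illustration in Python only: the real probe lives in ``scripts/pre-build-check.mjs`` and parses the TypeScript catalog, and the capability field names, the two-row catalog, and the ``find_catalog_drift`` helper here are hypothetical stand-ins.

```python
from dataclasses import dataclass, fields


@dataclass
class BackendCapabilities:
    # Hypothetical subset; the real dataclass has more fields and
    # its field names may differ.
    nunchaku_available: bool = False
    kvpress_available: bool = False


# Hypothetical stand-in for the backend's _INSTALLABLE_PIP_PACKAGES allow-list.
INSTALLABLE_PIP_PACKAGES = {"nunchaku", "kvpress"}

# Hypothetical (pipPackage, capabilityField) pairs as they might be
# extracted from the accelerator catalog.
CATALOG = [
    ("nunchaku", "nunchaku_available"),
    ("kvpress", "kvpress_available"),
]


def find_catalog_drift(catalog, allow_list, capabilities_cls):
    """Return a list of human-readable drift problems (empty means PASS)."""
    known_fields = {f.name for f in fields(capabilities_cls)}
    problems = []
    for pip_package, capability_field in catalog:
        if pip_package not in allow_list:
            problems.append(f"{pip_package}: not in pip allow-list")
        if capability_field not in known_fields:
            problems.append(f"{capability_field}: missing from BackendCapabilities")
    return problems


if __name__ == "__main__":
    drift = find_catalog_drift(CATALOG, INSTALLABLE_PIP_PACKAGES, BackendCapabilities)
    print("PASS" if not drift else "FAIL: " + "; ".join(drift))
```

Returning a list of problems instead of raising on the first one lets the gate report every unwired row in a single run.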

Caught during the live WSL test sweep: ``backend_service.app.main()`` hard-coded ``port=DEFAULT_PORT`` in the ``uvicorn.run`` call and ignored the ``--port`` flag the test scripts have been passing. Worked historically because DEFAULT_PORT already reads the ``CHAOSENGINE_PORT`` env var, so test runs that set the env var got the right port, but ``python -m backend_service.app --port 8877`` silently bound 8876.

Now ``main()`` uses argparse with env-var fallbacks:

  --port → $CHAOSENGINE_PORT → 8876
  --host → $CHAOSENGINE_HOST → 127.0.0.1

CLI > env > default. Surfaces ``--help`` properly (the user can discover the args). The existing env-var path keeps working for the Tauri shell + headless test scripts that already set ``CHAOSENGINE_PORT``.

Three new helper scripts under ``scripts/`` for the WSL dev workflow:
- ``install_llama_server_wsl.sh``: downloads the latest llama.cpp Linux release into ``~/.chaosengine/bin/`` for the WSL backend.
- ``run_backend_wsl.sh``: launches the backend on port 8877 with auth disabled (env: ``CHAOSENGINE_REQUIRE_AUTH=0``), pointing at the WSL-side llama-server. Detached via nohup + disown.
- ``probe_backend_wsl.sh``: diagnostic helper; runs the backend in the foreground for 3 s and surfaces import / bind errors.

WSL test sweep results (Ubuntu-24.04, RTX 4090, vllm-venv at 0.21.0):
- pytest tests/: 1472 passed, 21 failed, 21 skipped (49 more passes than Windows; fewer platform-specific failures)
- e2e_test_suite.py --smoke: 6/0/0 PASS including the two new FU-056 Phase 9 phase-6 probes (vllm-wsl-status + capability flags)
- cache-strategy-matrix.py --quick: 0/0 ran, 15/15 skipped honestly (only ``native`` strategy in dev venv; no turbo binary, no dflash, no models in dev library; all skip reasons accurate)
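A minimal sketch of the CLI > env > default precedence described above, assuming a standalone ``parse_args`` helper (hypothetical; the actual ``main()`` may be structured differently). The ``env`` parameter is injectable purely so the precedence can be exercised without touching the real environment.

```python
import argparse
import os

# Defaults quoted in the commit message.
FALLBACK_PORT = 8876
FALLBACK_HOST = "127.0.0.1"


def parse_args(argv=None, env=None):
    """Resolve host/port with CLI > env > default precedence."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="backend_service.app")
    # The env var becomes the argparse default, so an explicit flag always wins.
    parser.add_argument(
        "--port",
        type=int,
        default=int(env.get("CHAOSENGINE_PORT", FALLBACK_PORT)),
        help="port to bind (env: CHAOSENGINE_PORT)",
    )
    parser.add_argument(
        "--host",
        default=env.get("CHAOSENGINE_HOST", FALLBACK_HOST),
        help="host to bind (env: CHAOSENGINE_HOST)",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"would bind {args.host}:{args.port}")
```

Feeding the env value in as the argparse default keeps a single resolution path, and ``--help`` documents both the flag and its env-var fallback.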

Two related UX cleanups landed together because they share the same plumbing pattern (App.tsx → tabs → leaf components):

1. **Hide MTPLX install affordances on Windows / Linux.** The MTPLX block in RuntimeControls (the launch-settings modal that opens from Chat / Compare / HTML Challenge / Benchmarks) used to render the MTPLX checkbox + "Install MTPLX" button + info disclosure on every host. MTPLX is Apple-Silicon-only: the install would error on Windows and the checkbox would render disabled with no path to recovery. Per the FU-034 rule (hide unrecoverable options, don't grey them out), the whole block is now gated on a new ``isAppleSilicon`` prop threaded from App.tsx via:

     App.tsx
       → LaunchModal / CompareView / HtmlChallengeTab / BenchmarkRunTab
       → ChallengePickerModal (for HtmlChallengeTab)
       → ModelLaunchModal
       → RuntimeControls

   Three call sites on RuntimeControls (the MTPLX label, the info-panel expand, the info button) now ALL gate on the prop. ``dflash-mlx`` was already platform-gated via the FU-056 Phase 2 AcceleratorCard catalog (platformGate: "apple-silicon").

2. **Chat empty-state banner.** Fresh-install users opening the Chat tab used to see "Send a message to start the conversation." followed by a silent auto-load of the largest MLX direct variant (a 15+ GB download that doesn't even work on Windows/Linux, where the MLX backend doesn't exist). Replaced with a ``<ChatEmptyStateBanner>`` that surfaces a clear CTA: "Browse Discover" when library is empty, "Open Models" when models are present but none loaded. No silent auto-loads, no confused users waiting on the wrong download. The banner is purely additive: composer textarea still usable above (users can also type + the banner suggests Discover).

Plumbing:
- New ``src/utils/platform.ts`` with ``isAppleSiliconHost``, ``isCudaHost``, ``isIntelMac`` helpers. Reads from ``workspace.system`` (platform + arch) which the backend already populates from ``platform.system()`` + ``platform.machine()``.
- 15 unit tests in ``src/utils/__tests__/platform.test.ts`` pin every host-classifier branch (Darwin arm64, aarch64, Intel Mac, Windows, Linux, null/undefined, case-insensitive).
- ``isAppleSiliconHost(workspace.system)`` computed once at App.tsx top-level, threaded as ``isAppleSilicon`` prop to the four call sites that own MTPLX surfaces.
- New ``<ChatEmptyStateBanner>`` component with two states (no-models / no-loaded-model), each with appropriate CTA.

Tests: 35 files / 424 vitest tests pass (+15 from platform helper). tsc clean. No new pytest needed; backend unchanged.

Not addressed in this commit (deferred):
- MLX-only image / video catalog variants still surface in Discover / Models tabs on Win/Linux. Filtering those is a larger UX call (hide entirely vs. show with "Apple Silicon only" pill) and deserves its own decision before code.
- "llama-server installed by default": already the case via scripts/stage-runtime.mjs for release builds. No code change.
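The host classification described under "Plumbing" could look roughly like the sketch below. It is written in Python against the same ``platform.system()`` + ``platform.machine()`` values the backend reports; the real helpers are TypeScript in ``src/utils/platform.ts``, and the exact matching rules here (which arch strings count as Apple Silicon or Intel) are assumptions.

```python
import platform
from typing import Mapping, Optional

# Assumption: both spellings are treated as Apple Silicon when the OS is Darwin.
APPLE_SILICON_ARCHES = {"arm64", "aarch64"}
# Assumption: arch strings treated as Intel on macOS.
INTEL_ARCHES = {"x86_64", "amd64"}


def _field(system: Optional[Mapping[str, str]], key: str) -> str:
    """Read a field defensively: tolerate a missing snapshot and mixed case."""
    if not system:
        return ""
    return str(system.get(key) or "").strip().lower()


def is_apple_silicon_host(system: Optional[Mapping[str, str]]) -> bool:
    return (
        _field(system, "platform") == "darwin"
        and _field(system, "arch") in APPLE_SILICON_ARCHES
    )


def is_intel_mac(system: Optional[Mapping[str, str]]) -> bool:
    return (
        _field(system, "platform") == "darwin"
        and _field(system, "arch") in INTEL_ARCHES
    )


def current_host() -> dict:
    """Snapshot shaped like the platform + arch pair the backend exposes."""
    return {"platform": platform.system(), "arch": platform.machine()}


if __name__ == "__main__":
    host = current_host()
    print(host, "apple-silicon:", is_apple_silicon_host(host))
```

Treating a missing or empty snapshot as "not Apple Silicon" means MTPLX surfaces stay hidden until the host is positively identified, in line with the hide-unrecoverable-options rule.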

Extends the FU-034 "hide unrecoverable options" policy to whole catalog rows. Windows / Linux users no longer see MLX / mlx-video / mflux / MTPLX entries they can never run, and Apple Silicon users no longer see vLLM / nunchaku / CUDA-only entries.

- src/utils/platform.ts: imageOrVideoVariantPlatformGate + chatVariantPlatformGate + isVariantCompatibleWithHost derive a PlatformGate ("apple-silicon" | "cuda" | "any") from existing variant fields (runtime / backend / styleTags / repo prefix). No catalog schema change required.
- ImageModelsTab / ImageDiscoverTab / VideoModelsTab / VideoDiscoverTab: new hostSystem prop, filtered through isVariantCompatibleWithHost in the rows/filteredResults useMemo.
- App.tsx: threaded workspace.system into all four tabs; libraryChatOptions now also filtered so the launch dropdown drops MLX backends on Win/Linux.
- AcceleratorsBoostPack: showIncompatible flipped off, so the table now surfaces only accelerators the current host can install.

16 new vitest cases pin the helper boundaries (Apple Silicon host hides CUDA-only variants, Linux x86_64 hides Apple-Silicon-only variants, "any" gate passes on every host, etc). All 440 frontend tests pass; tsc clean.

Feature/accelerator install ux

Image / Video Studio
- LTX-2 mlx-video: snap dimensions to multiples of 32 (was crashing 16:9 / 9:16 / 21:9 presets with "Height must be divisible by 32"). Surfaces snap in runtimeNote + reports actual rendered dims.
- Warm-cache phase fix: second generation of same model variant no longer flashes "Loading..."; pre-checks variant_key + begins on PHASE_ENCODING with "Reusing {modelName}" when pipeline cached.
- mlx-video progress wiring: subprocess stdout now feeds VIDEO_PROGRESS via begin/finish lifecycle + on_progress callback. fraction=None preserves step counter instead of jittering to 0.5.
- mlx-video device memory: populate deviceMemoryGb (was always null on /api/video/mlx-runtime → frontend defaulted to 16 GB).
- Hide CUDA-only FP8 layerwise toggle on Apple Silicon.
- Hide platform-incompatible variants from Studio dropdown (Nunchaku INT4 CUDA on macOS, mlx-video on Win/Linux).

Discover
- FU-061: "Watching upstream" badge + disabled download CTA for tracked-only seeds (ERNIE-Image, Nucleus-Image, Z-Image, HiDream, GLM-Image, FLUX.2 family) that lack Studio pipeline routing.
- Tracked-seed model classifier: ERNIE-Image / Nucleus / Z-Image / HiDream / GLM-Image keywords added so they don't leak into Chat My Models.

Upstream pin bumps
- FU-058: vLLM floor >=0.8.0 → >=0.21.0 (gemma4 MTP, spec-dec thinking budget, TurboQuant hybrid).
- FU-059: nunchaku pin >=1.2.1 → >=0.16.0 (1.2.1 unsatisfiable; upstream version reset to 0.x).
- diffusers .venv upgrade 0.37.1 → 0.38.0 (pin already allowed).
- FU-057: dflash-mlx v0.1.7 migration documented (deferred: major API rewrite). v0.1.6/v0.1.7 release notes + breaking changes surfaced in tracker.

Test infra
- FU-060: memory-pressure gate mocked in test_video_routes.py + test_backend_service.py setUp/tearDown, so results are deterministic regardless of host load (was flaky on busy dev boxes).
- e2e_test_suite.py phases 4 + 5 now treat memory-gate refusals as SKIP (with explanatory reason), not FAIL, so pre-build ships green on memory-constrained hosts.

Tracker rows: FU-057, FU-058, FU-059, FU-060, FU-061 added to CLAUDE.md.

Gates green:
- pytest 1541 passed, 1 skipped, 0 failed
- vitest 441 / 35 files passed
- tsc clean
- pre-build 13 / 13 passed
- E2E full suite 8 / 8 phases · 36 / 36 checks · 0 fail
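The LTX-2 dimension snap from the Image / Video Studio list can be illustrated with the sketch below. The commit does not say whether snapping rounds up, down, or to nearest; round-to-nearest with ties going up is assumed here, and the function names and note wording are hypothetical.

```python
def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Snap to the nearest multiple (ties round up), never below one multiple."""
    snapped = ((value + multiple // 2) // multiple) * multiple
    return max(multiple, snapped)


def snap_dimensions(width: int, height: int, multiple: int = 32):
    """Return (width, height, note); note is None when nothing changed."""
    new_w = snap_to_multiple(width, multiple)
    new_h = snap_to_multiple(height, multiple)
    note = None
    if (new_w, new_h) != (width, height):
        # Hypothetical wording for the runtimeNote surfaced to the user.
        note = (
            f"Snapped {width}x{height} to {new_w}x{new_h} "
            f"(dimensions must be divisible by {multiple})."
        )
    return new_w, new_h, note


if __name__ == "__main__":
    for w, h in [(1280, 720), (720, 1280), (1680, 720)]:
        print((w, h), "->", snap_dimensions(w, h))
```

Returning the snapped values alongside the note lets the caller report the dimensions actually rendered instead of the ones requested.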

Three threads landed in one commit because they surfaced from the same session: a Windows-side pytest sweep, the matching WSL+CUDA dry run, and a chat-tab UX gap the user flagged mid-flight.

**Chat empty-state banner (FU-056 follow-up)**

ChatEmptyStateBanner showed "No model is loaded yet. Pick one from Models to start chatting." even while the header strip showed "LOADING MODEL..." for a model already in flight. Two fixes:
- ChatThread.tsx hides the banner when serverLoading is non-null; the ModelLoadingProgress bubble below already conveys the state.
- ChatEmptyStateBanner.tsx rewords the "models present but none loaded" branch from "Pick one from Models" + "Open Models" to "A model needs to be loaded before you can chat." + "Load Model" for actionable copy. Fresh-install branch (no models downloaded) is unchanged.

**Test suite: Windows 1526 -> 1528 / 16 fails -> 0; WSL 1510 -> 1544**

The cross-platform sweep caught 16 Windows pytest failures and 3 WSL failures, all platform-portability test infra bugs rather than product issues:
- test_cache_strategies.py: 2 turboquant tests imported mlx_lm / turboquant_mlx at function scope without skip guards. Added @unittest.skipUnless(_MLX_LM_AVAILABLE) so they skip cleanly on non-Apple-Silicon (caught on both Windows AND WSL Linux).
- test_mtplx_engine_integration.py: 5 tests build a #!/usr/bin/env bash wrapper script. Added class-level @unittest.skipIf(sys.platform == "win32") since Windows can't honour the bash shebang.
- test_sdcpp_image.py + test_sdcpp_video.py: 3 tests asserted str equality against "/tmp/sd" but the source does str(Path(...)) which yields "\tmp\sd" on Windows. Centralized via _FAKE_SD_BIN = str(Path("/tmp/sd")) so both sides of the comparison round-trip through the OS-native separator.
- test_mlx_video_wan_convert.py: same POSIX-path issue with the env-var override test; now uses tempfile.gettempdir().
- test_preview_vae.py: 5 tests shallowly guarded on `import diffusers` but the real failure mode on Windows is the deeper `from diffusers import AutoencoderTiny` (torchao ABI break vs torch 2.6.0+cu124). Replaced with _autoencoder_tiny_importable() probe that exercises the actual symbol.
- test_gpu.py::test_nvidia_smi_parsing: fixture said "NVIDIA RTX 4090" but the real nvidia-smi emits "NVIDIA GeForce RTX 4090" (live-confirmed via WSL2 GPU passthrough). Also patched _snapshot_torch_cuda to None so the parser actually exercises the nvidia-smi codepath instead of short-circuiting through torch.cuda when a real GPU is present.

**Real product bug: VLLMEngine.generate() missing kwargs**

Live regression caught on WSL2 + RTX 4090: any chat turn with a vLLM-loaded model raised

  TypeError: VLLMEngine.generate() got an unexpected keyword argument 'samplers'

because the controller passes through samplers / reasoning_effort / json_schema (parity with LlamaCppEngine) but the vLLM engine's signature only accepted the basic generate kwargs. Fix in vllm_engine.py:185-219: accepts all three, translates llama-server's `repeat_penalty` to vLLM's `repetition_penalty`, floors temperature=0 to 0.01 (vLLM forbids exactly 0).

New tests/test_vllm_engine.py (5 cases) pin:
- the engine.generate() signature must match the controller's full call shape (kwarg-name regression gate)
- samplers are forwarded into SamplingParams
- repeat_penalty renames correctly
- temperature=0 gets bumped above the vLLM floor

All skip on hosts without the vllm wheel installed, so the test runs cheap on macOS / Windows CPU-only paths.

**WSL + CUDA test plan (docs/WSL_CUDA_TESTING.md)**

7-phase practical guide, live-validated end-to-end against WSL2 Ubuntu 24.04 + RTX 4090 (CUDA 12.6.85 toolkit, GPU passthrough). Final result of the dry run:
- Phase C (pytest with CUDA): 1544 / 0 / 3
- Phase D (E2E full): 7 / 0 / 1
- Phase E (matrix --full): vllm native Qwen3-0.6B PASS, SHA d18c2b8cb410
- Phase G (real workload): Qwen3-0.6B via vLLM on RTX 4090, 24 GB VRAM allocated, 44% GPU utilization confirmed via nvidia-smi

Doc captures 13 gotchas surfaced during the dry run. Most useful:
- Don't pre-pin torch before installing vllm (vllm's resolver drives the torch pin; pinning first creates a conflict)
- Use `tr -d '\r'` not `sed 's/\r$//'` for CRLF stripping (the latter ate trailing `r` chars under WSL's bash-in-PowerShell quoting chain and broke test_mlx_video_wan_installer.py with "cannot import name 'mlx_video_wan_installe'")
- CHAOSENGINE_REQUIRE_AUTH=0 mandatory for headless e2e scripts
- ninja must be on PATH at backend launch time (flashinfer JIT)
- --backend auto picks MLX on safetensors even when MLX is unavailable (FU-063 candidate: auto should fall through to vllm on Linux)
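The kwarg handling described for the VLLMEngine.generate() fix (accept the controller's extra kwargs, rename `repeat_penalty`, floor the temperature) can be sketched as a pure function. This is not the code in vllm_engine.py: the ``build_sampling_kwargs`` name, its defaults, and the dict return value are assumptions, and the real engine presumably constructs a vLLM SamplingParams object from values like these.

```python
from typing import Any, Dict, Optional

# Floor quoted in the commit message.
TEMPERATURE_FLOOR = 0.01

# llama-server sampler names mapped to the vLLM spelling.
SAMPLER_RENAMES = {"repeat_penalty": "repetition_penalty"}


def build_sampling_kwargs(
    temperature: float = 0.7,
    max_tokens: int = 256,
    samplers: Optional[Dict[str, Any]] = None,
    reasoning_effort: Optional[str] = None,
    json_schema: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Translate controller-style generate kwargs into sampling kwargs.

    reasoning_effort and json_schema are accepted so the call shape matches
    the controller; this sketch does not act on them.
    """
    params: Dict[str, Any] = {
        "temperature": max(temperature, TEMPERATURE_FLOOR),
        "max_tokens": max_tokens,
    }
    for name, value in (samplers or {}).items():
        params[SAMPLER_RENAMES.get(name, name)] = value
    return params


if __name__ == "__main__":
    print(
        build_sampling_kwargs(
            temperature=0,
            samplers={"repeat_penalty": 1.1, "top_k": 40},
            reasoning_effort="low",
        )
    )
```

Keeping the translation free of any vllm import means a signature or rename regression can be tested on hosts without the wheel installed.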

cryptopoly added 30 commits May 14, 2026 17:39

Merge pull request #54 from cryptopoly/feature/mtplx

2379f3e

Feature/mtplx

docs: CLAUDE.md FU-047 entry for GGUF MTP via llama.cpp #22673

c160fb3

New follow-up row tracking the GGUF half of FU-028 now that PR #22673 merged upstream. Lists action plan (6 wiring steps), upstream caveats, and links to the upstream-research write-up.

docs: v0.9.2 release notes + changelog entry

6442769

cryptopoly added 25 commits May 17, 2026 00:06

Update CLAUDE.md

d0f4b06

Merge pull request #56 from cryptopoly/claude/wizardly-shaw-53da59

a3de63e

feat: in-place torch upgrade with rollback (detection + pill + backgr…

Update e2e_test_suite.py

7ace451

Merge pull request #57 from cryptopoly/feature/ChaosEngineAI-CLI

687daad

Feature/chaos engine ai cli

Merge pull request #58 from cryptopoly/feature/accelerator-install-ux

4fda709

Feature/accelerator install ux

cryptopoly merged commit b843ead into main May 18, 2026
1 check failed

cryptopoly merged 55 commits into main from staging
