v4.0: Two-node vllm-spark:26.04 cluster, Qwen3.5-122B FP8, spark hostname by cmdlabtech · Pull Request #2 · cmdlabtech/dgx-spark-ai-stack

cmdlabtech · 2026-05-11T20:05:47Z

Summary

Hardware rename: all sparky hostname/directory references renamed to spark (keeping -01/-02 suffixes throughout)
Model upgrade: Intel/Qwen3.5-122B-A10B-int4-AutoRound → Qwen/Qwen3.5-122B-A10B-FP8; memory figures updated from ~37 GB (INT4) to ~57 GB (FP8), KV cache headroom ~71 GB
New container approach: replaced eugr/spark-vllm-docker clone/build/autodiscovery with pre-built vllm-spark:26.04 image + custom launch scripts mounted at runtime
Two-service systemd: vllm-head.service (spark-01) and vllm-worker.service (spark-02) replace the old single vllm-cluster.service
NCCL config: NCCL_SOCKET_IFNAME=enp1s0f0np0 replaces old IB HCA env vars; DAC subnet updated to 10.100.100.0/30 (10.100.100.1/2)
New Steps 1b/1c: create the vllm-head.sh and vllm-worker.sh launch scripts; rsync the scripts to spark-02 and copy the image there with docker save | ssh | docker load
LiteLLM model_info: added supports_function_calling, supports_tool_choice, max_context_window: 262144, token limit fields to both node configs
SVG diagram: real IPs, correct DAC subnet, updated NCCL env var labels
Security: removed all real usernames (cameron → YOUR_USERNAME); confirmed no secrets/keys/passwords exposed
Footer: bumped to v4.0
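
The memory figures in the summary can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes the quoted figures are per node and that each DGX Spark exposes 128 GB of unified memory, which this PR does not state. The function name `kv_headroom_gb` is hypothetical.

```python
# Assumption: 128 GB unified memory per node; the PR's figures are per node.
UNIFIED_MEMORY_GB = 128


def kv_headroom_gb(resident_gb: int, total_gb: int = UNIFIED_MEMORY_GB) -> int:
    """Memory left over for KV cache once the model weights are resident."""
    return total_gb - resident_gb


fp8_headroom = kv_headroom_gb(57)  # ~57 GB resident with FP8 weights
extra_weight_gb = 57 - 37          # FP8 vs. the previous INT4 footprint

print(f"FP8 KV cache headroom: ~{fp8_headroom} GB")
print(f"Extra resident memory vs. INT4: ~{extra_weight_gb} GB")
```

Under those assumptions the result agrees with the ~71 GB headroom quoted above. It does not account for any gpu-memory-utilization cap applied at launch.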

Test plan

Review updated SVG architecture diagram for correctness (IPs, subnet, env vars)
Verify Step 1b script content matches the vllm serve / ray start commands
Verify Step 1e Docker run commands have correct VLLM_HOST_IP per node
Verify systemd unit files reference correct entrypoints and env vars
Check LiteLLM config blocks include model_info on both node configs
Confirm no cameron, sparky, AutoRound, eugr, vllm-node-tf5, or run-recipe.sh text remains
Confirm no secrets or real credentials are visible in the guide
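
The leftover-text check in the test plan could be automated with a small scan. This is a minimal sketch, not part of the PR: the helper name `find_forbidden` and the sample text are hypothetical, and the match is case-insensitive by choice.

```python
import re

# Terms the test plan says must no longer appear in the guide.
FORBIDDEN = ["cameron", "sparky", "AutoRound", "eugr", "vllm-node-tf5", "run-recipe.sh"]


def find_forbidden(text, terms=FORBIDDEN):
    """Return (line_number, matched_text, stripped_line) for every hit."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((lineno, match.group(0), line.strip()))
    return hits


# Hypothetical sample: the first line is clean, the second still says "sparky".
sample = "ssh YOUR_USERNAME@spark-01\ncd ~/sparky-01/scripts"
for lineno, term, line in find_forbidden(sample):
    print(f"line {lineno}: found {term!r} in: {line}")
```

Because `spark-01` does not contain the string `sparky`, the renamed hostnames pass while stale references are reported with their line numbers.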

https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z

…park, no real usernames
- Rename all sparky→spark hostname/directory references throughout
- Replace model Intel/Qwen3.5-122B-A10B-int4-AutoRound with Qwen/Qwen3.5-122B-A10B-FP8
- Add Steps 1b/1c: create vllm-head.sh + vllm-worker.sh launch scripts, rsync to spark-02
- Replace vllm cluster launch with two-service systemd (vllm-head.service / vllm-worker.service)
- Update Docker run command: NCCL_SOCKET_IFNAME=enp1s0f0np0 replaces IB HCA env vars
- Add LiteLLM model_info block (function calling, tool choice, 262144 context)
- Update memory figures: ~57 GB resident FP8 (was ~37 GB INT4), ~71 GB KV cache headroom
- Update validation checklist: 57 GB and NCCL socket interface check
- Update performance table: FP8 model, GPU memory utilization row
- Update SVG architecture diagram: real IPs, DAC subnet 10.100.100.0/30, NCCL env vars
- Update file-locations section: new scripts tree, vllm-worker.service on spark-02
- Remove all real usernames (cameron); replace with YOUR_USERNAME throughout
- Ensure no secrets/keys/passwords in public guide
- Bump footer to v4.0
https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z
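
A /30 network has exactly two usable host addresses, which is why the DAC link between the two nodes uses 10.100.100.1 and 10.100.100.2. The following sketch confirms this with Python's standard `ipaddress` module; which node holds which address is not stated here and is left unassigned.

```python
import ipaddress

# DAC point-to-point subnet named in this PR.
dac = ipaddress.ip_network("10.100.100.0/30")

# hosts() excludes the network (.0) and broadcast (.3) addresses.
usable = [str(host) for host in dac.hosts()]
print(usable)  # ['10.100.100.1', '10.100.100.2']

# Membership test, e.g. for validating a VLLM_HOST_IP value before launch.
print(ipaddress.ip_address("10.100.100.2") in dac)  # True
print(ipaddress.ip_address("10.100.100.5") in dac)  # False
```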

…chas
- vllm-head.sh now starts Ray head inside container (ray start --head), waits 60 s, then calls vllm serve, reflecting the actual two-phase startup
- Update startup order throughout: spark-02 worker must start first; head starts after worker logs "Ray runtime started"
- Lower max-model-len 262144 → 131072 (higher values OOM on profiling phase)
- Lower gpu-memory-utilization 0.70 → 0.68 (hard limit on GB10 x2 with 122B FP8)
- Update both LiteLLM configs: max_context_window 131072, max_input_tokens 98304, add max_tokens: 32768 to litellm_params
- Update performance table: context 131,072, memory-utilization 0.68 row
- Update NCCL verification: expect enp1s0f0np0 socket interface (not NET/IB)
- Fix performance table Network row: NCCL_SOCKET_IFNAME replaces NCCL_IB_HCA
- Rewrite Step 1 intro paragraph to reflect correct container architecture
- Rewrite bootstrap 35B fallback: use sed to edit vllm-head.sh in-place (entrypoint override no longer works now that Ray starts inside the script)
- Add four new gotchas: fastsafetensors unsupported, OOM above 0.68, startup order failure mode, LiteLLM /v1/models no context_length
https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z
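
The LiteLLM limits in this commit appear chosen so that input and output budgets together fill the context window exactly. A minimal check, assuming that is the intent (the commit does not say so explicitly) and using a hypothetical helper name `fits_context`:

```python
# Values taken from the commit message above.
MAX_CONTEXT_WINDOW = 131072  # also the new max-model-len
MAX_INPUT_TOKENS = 98304
MAX_TOKENS = 32768           # output cap added to litellm_params


def fits_context(max_input: int, max_output: int, context: int) -> bool:
    """True if a worst-case prompt plus a worst-case completion fits."""
    return max_input + max_output <= context


total = MAX_INPUT_TOKENS + MAX_TOKENS
print(total, fits_context(MAX_INPUT_TOKENS, MAX_TOKENS, MAX_CONTEXT_WINDOW))
```

The two limits sum to 131072, so a request that uses both limits fully still fits within max-model-len.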

…se guide https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z

claude added 3 commits May 11, 2026 20:05

Remove Brave Search MCP section — personal preference, not part of ba…

3921133

cmdlabtech wants to merge 3 commits into main from claude/update-dgx-spark-guide-ym4sf
