v4.0: Two-node vllm-spark:26.04 cluster, Qwen3.5-122B FP8, spark hostname #2
Open
cmdlabtech wants to merge 3 commits into
…park, no real usernames
- Rename all sparky → spark hostname/directory references throughout
- Replace model Intel/Qwen3.5-122B-A10B-int4-AutoRound with Qwen/Qwen3.5-122B-A10B-FP8
- Add Steps 1b/1c: create vllm-head.sh + vllm-worker.sh launch scripts, rsync to spark-02
- Replace vllm cluster launch with two-service systemd (vllm-head.service / vllm-worker.service)
- Update Docker run command: NCCL_SOCKET_IFNAME=enp1s0f0np0 replaces IB HCA env vars
- Add LiteLLM model_info block (function calling, tool choice, 262144 context)
- Update memory figures: ~57 GB resident FP8 (was ~37 GB INT4), ~71 GB KV cache headroom
- Update validation checklist: 57 GB and NCCL socket interface check
- Update performance table: FP8 model, GPU memory utilization row
- Update SVG architecture diagram: real IPs, DAC subnet 10.100.100.0/30, NCCL env vars
- Update file-locations section: new scripts tree, vllm-worker.service on spark-02
- Remove all real usernames (cameron); replace with YOUR_USERNAME throughout
- Ensure no secrets/keys/passwords in public guide
- Bump footer to v4.0

https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z
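As a rough cross-check of the ~57 GB resident figure quoted above, the following back-of-envelope sketch estimates the FP8 weight footprint per node. It is not taken from the guide: it assumes a nominal 122e9 parameters, one byte per FP8 weight, an even split across the two nodes, and no overhead for scales or unquantized layers. The function and constant names are hypothetical.

```python
# Back-of-envelope estimate only. Assumptions (not from the PR): exactly 122e9
# parameters, 1 byte per FP8 weight, an even split across both nodes, and no
# extra memory for quantization scales or unquantized layers.
PARAMS = 122e9
BYTES_PER_PARAM_FP8 = 1
NODES = 2
GIB = 2**30


def weights_per_node_gib(params=PARAMS, bytes_per_param=BYTES_PER_PARAM_FP8, nodes=NODES):
    """Approximate weight memory held by each node, in GiB."""
    return params * bytes_per_param / nodes / GIB


if __name__ == "__main__":
    print(f"{weights_per_node_gib():.1f} GiB per node")
```

Under these assumptions the estimate lands close to the ~57 GB figure, which suggests the number is per node and counted in binary gigabytes; the real footprint depends on the checkpoint layout.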
…chas
- vllm-head.sh now starts Ray head inside container (ray start --head), waits 60 s, then calls vllm serve; this reflects the actual two-phase startup
- Update startup order throughout: spark-02 worker must start first; head starts after worker logs "Ray runtime started"
- Lower max-model-len 262144 → 131072 (higher values OOM on profiling phase)
- Lower gpu-memory-utilization 0.70 → 0.68 (hard limit on GB10 x2 with 122B FP8)
- Update both LiteLLM configs: max_context_window 131072, max_input_tokens 98304, add max_tokens: 32768 to litellm_params
- Update performance table: context 131,072, memory-utilization 0.68 row
- Update NCCL verification: expect enp1s0f0np0 socket interface (not NET/IB)
- Fix performance table Network row: NCCL_SOCKET_IFNAME replaces NCCL_IB_HCA
- Rewrite Step 1 intro paragraph to reflect correct container architecture
- Rewrite bootstrap 35B fallback: use sed to edit vllm-head.sh in-place (entrypoint override no longer works now that Ray starts inside the script)
- Add four new gotchas: fastsafetensors unsupported, OOM above 0.68, startup order failure mode, LiteLLM /v1/models no context_length

https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z
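The LiteLLM limits in this commit appear to partition the new context window exactly: 98304 input tokens plus 32768 output tokens equals 131072. A minimal sketch of that consistency check follows. The dictionary layout and function names are illustrative assumptions; only the three numbers come from the commit message, and the layout is not claimed to match LiteLLM's actual config schema.

```python
# Values copied from the commit message. The flat dict is a hypothetical
# stand-in for the real config, which splits these fields between
# model_info and litellm_params.
MODEL_LIMITS = {
    "max_context_window": 131072,  # equals the lowered max-model-len
    "max_input_tokens": 98304,
    "max_tokens": 32768,           # output cap
}


def budget_is_consistent(limits):
    """True when the input and output caps together fit in the context window."""
    return limits["max_input_tokens"] + limits["max_tokens"] <= limits["max_context_window"]


def unused_tokens(limits):
    """Tokens of context window left over after the input and output caps."""
    return limits["max_context_window"] - limits["max_input_tokens"] - limits["max_tokens"]


if __name__ == "__main__":
    print(budget_is_consistent(MODEL_LIMITS), unused_tokens(MODEL_LIMITS))
```

A check like this could run whenever max-model-len changes, so the proxy never advertises more tokens than the server accepts.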
Summary
- `sparky` hostname/directory references renamed to `spark` (keeping `-01`/`-02` suffixes throughout)
- `Intel/Qwen3.5-122B-A10B-int4-AutoRound` → `Qwen/Qwen3.5-122B-A10B-FP8`; memory figures updated from ~37 GB (INT4) to ~57 GB (FP8), KV cache headroom ~71 GB
- `eugr/spark-vllm-docker` clone/build/autodiscovery replaced with pre-built `vllm-spark:26.04` image + custom launch scripts mounted at runtime
- `vllm-head.service` (spark-01) and `vllm-worker.service` (spark-02) replace the old single `vllm-cluster.service`
- `NCCL_SOCKET_IFNAME=enp1s0f0np0` replaces old IB HCA env vars; DAC subnet updated to `10.100.100.0/30` (`10.100.100.1`/`2`)
- `vllm-head.sh` + `vllm-worker.sh` launch scripts; rsync scripts and `docker save | ssh | docker load` to spark-02
- `supports_function_calling`, `supports_tool_choice`, `max_context_window: 262144`, token limit fields added to both node configs
- Real usernames replaced (`cameron` → `YOUR_USERNAME`); confirmed no secrets/keys/passwords exposed

Test plan
- `VLLM_HOST_IP` per node
- `model_info` on both node configs
- No `cameron`, `sparky`, `AutoRound`, `eugr`, `vllm-node-tf5`, or `run-recipe.sh` text remains

https://claude.ai/code/session_013LoMSbGUaKy6gSCiuL1W9Z
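The leftover-text item in the test plan can be checked mechanically. The sketch below is one possible way to do it, not the procedure the author used: it scans a document's text for the retired strings named in the test plan and reports where they occur. The function name and the case-sensitive substring matching are assumptions.

```python
# Strings the test plan says must no longer appear in the guide.
FORBIDDEN = ("cameron", "sparky", "AutoRound", "eugr", "vllm-node-tf5", "run-recipe.sh")


def find_leftovers(text, terms=FORBIDDEN):
    """Return (line_number, term) pairs for every forbidden term found.

    Matching is a case-sensitive substring test, so "spark-01" does not
    trigger the "sparky" rule.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for term in terms:
            if term in line:
                hits.append((lineno, term))
    return hits


if __name__ == "__main__":
    sample = "ssh YOUR_USERNAME@spark-01\ncd ~/sparky && ./run-recipe.sh"
    for lineno, term in find_leftovers(sample):
        print(f"line {lineno}: {term}")
```

Running it over the guide's source file and expecting an empty result would cover that test-plan item; substring matching can over-report if a retired name legitimately appears inside another word.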
Generated by Claude Code