Release 1.0 changes #32
Open
eb3095 wants to merge 47 commits into
Conversation
…hercomputer/slurm-operator into syu/tcl-1682-fix-module-path
Fix the declared module path for our forked slurm-operator
Match the shmSize and existingDataClaims handling that was added to nodeset-cr.yaml, for consistency. This allows login pods to:
- have a configurable shared memory (/dev/shm) size
- mount existing PVCs for storage access (/data, etc.)
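A hedged sketch of what the matching login values might look like (the key names shmSize and existingDataClaims come from the commit; the surrounding structure and example values are assumptions, not taken from the chart):

```yaml
# Hypothetical values.yaml fragment for the login deployment.
login:
  shmSize: 8Gi              # size of the emptyDir backing /dev/shm
  existingDataClaims:
    - name: shared-data     # pre-existing PVC to mount
      mountPath: /data
```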
Upgrade slurm-operator fork to Slinky v1.0 with Together-specific features
TCL-3968 Fix login conflicts
TCL-4123 Fix login chart spec
Made-with: Cursor
Made-with: Cursor
1. Reconfigure sidecar: initialize lastHash to the current config hash on startup, so no spurious scontrol reconfigure fires while slurmctld is still initializing (avoids a deadlock in Slurm 25.11.2).
2. Login deployment: mount the slurm-auth-slurm and slurm-auth-jwths256 secrets alongside the slurm-config ConfigMap using a projected volume with mode 0600; sackd needs slurm.key for bootstrap auth.
3. Login SLURM_CONF_SERVER env: point to the controller service (slurm-controller) instead of the non-existent "slurm" service.

Made-with: Cursor
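The projected-volume wiring described in point 2 could look roughly like this (the names slurm-auth-slurm, slurm-auth-jwths256, slurm-config, and slurm-controller are from the commit; the exact manifest layout and volume name are assumptions):

```yaml
volumes:
  - name: slurm-etc
    projected:
      defaultMode: 0600                 # sackd needs slurm.key readable only by root
      sources:
        - configMap: { name: slurm-config }
        - secret:    { name: slurm-auth-slurm }
        - secret:    { name: slurm-auth-jwths256 }
# ...
env:
  - name: SLURM_CONF_SERVER
    value: slurm-controller             # controller service, not the non-existent "slurm"
```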
Made-with: Cursor
…onfig [TCL-2751] Slinky v1.0: fix chart rendering, reconfigure deadlock, login auth, and SLURM_CONF_SERVER
The projected volume set defaultMode 0600 on ALL files, including slurm.conf, making it unreadable by non-root LDAP users; sinfo failed with "Permission denied" for regular users. Fix: add an initContainer that copies from the read-only projected volume to an emptyDir, then sets config files to 644 and auth keys to 600. Same pattern as the accounting pod (TCL-4402 follow-up). Made-with: Cursor
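A sketch of the fix-perms initContainer pattern, assuming the volume names and mount paths shown here (only the copy-then-chmod split of 644 configs vs 600 keys is stated by the commit):

```yaml
initContainers:
  - name: fix-perms
    image: <login image>   # see follow-up commit: reuse the already-pulled login image
    command: ["sh", "-c"]
    args:
      - |
        cp /mnt/projected/* /mnt/etc-slurm/            # projected volume is read-only
        chmod 644 /mnt/etc-slurm/*.conf                # configs readable by LDAP users
        chmod 600 /mnt/etc-slurm/slurm.key             # auth keys stay root-only
    volumeMounts:
      - { name: slurm-projected, mountPath: /mnt/projected }
      - { name: slurm-etc,       mountPath: /mnt/etc-slurm }  # emptyDir shared with main container
```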
…onfig [TCL-4609] Fix login pod permissions: split config (644) from auth keys (600)
The fix-perms initContainer used docker.io/library/alpine:3.19 which hits Docker Hub's unauthenticated pull rate limit (429 Too Many Requests) on large clusters where many nodes pull simultaneously. Reuse the login image (already pulled for the main container) instead. This eliminates the extra image pull entirely since the kubelet already has the image cached. Discovered on Suno's 128-node cluster during Slinky v1.0 rollout. Made-with: Cursor
…-rate-limit fix: use login image for fix-perms initContainer to avoid Docker Hub rate limits
Add two new Helm values to mount custom init scripts at /root/init.sh:
- initScriptLogin: mounted on login nodes (login-deployment)
- initScriptNodes: mounted on compute nodes (nodesets)

Scripts are stored in ConfigMaps and mounted with mode 0755.
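Hypothetical usage (the value names come from the commit; whether each value takes inline script content, as shown, or a ConfigMap reference is an assumption):

```yaml
# values.yaml
initScriptLogin: |
  #!/bin/bash
  echo "login node init" >> /var/log/init.log
initScriptNodes: |
  #!/bin/bash
  echo "compute node init" >> /var/log/init.log
```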
TCL-5107: feat: add initScriptLogin and initScriptNodes values
Add UnkillableStepTimeout=600, HealthCheckInterval=60, HealthCheckNodeState=ANY, HealthCheckProgram to the generated slurm.conf as system defaults. These appear before ### EXTRA CONFIG ### so user extraConf can override if needed (Slurm uses last value). Covers both IC and BM clusters since both use the Slinky operator. No annotation gate needed — Slinky operator rebuilds slurm.conf natively on any Controller CR change. Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at /usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless "HealthCheckProgram not found" warnings. Made-with: Cursor
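The resulting slurm.conf section would look something like this (the four settings and the /usr/bin/gpu_healthcheck.sh path are from the commit; exact placement shown is illustrative):

```ini
# System defaults injected by buildSlurmConf
UnkillableStepTimeout=600
HealthCheckInterval=60
HealthCheckNodeState=ANY
HealthCheckProgram=/usr/bin/gpu_healthcheck.sh

### EXTRA CONFIG ###
# user extraConf lands here; Slurm takes the last value, so defaults above are overridable
```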
Without this, --mem in sbatch is just a scheduling hint — Slurm doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs that exceed their memory allocation get killed by Slurm instead of triggering the kernel OOM killer. MemSpecLimit is NOT added as a default because the right value depends on node memory size (per-cluster tuning via extraConf). Made-with: Cursor
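ConstrainRAMSpace lives in cgroup.conf; a minimal sketch of the change (the TaskPlugin line is an assumption about the existing config, since cgroup enforcement requires task/cgroup):

```ini
# cgroup.conf
ConstrainRAMSpace=yes   # jobs exceeding --mem are killed by Slurm, not the kernel OOM killer

# slurm.conf (assumed already present; required for cgroup enforcement)
TaskPlugin=task/cgroup
# MemSpecLimit deliberately not set as a default; tune per cluster via extraConf
```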
Prevents OOM → requeue → OOM loops. When a node OOMs and the job fails, Slurm auto-requeues it by default (JobRequeue=1). The requeued job runs again, OOMs again, creating a crash loop. Same category as UnkillableStepTimeout — every cluster should have this. Placed before EXTRA CONFIG so user can override if needed. Made-with: Cursor
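In slurm.conf this is a single line, placed before the extra-config marker so extraConf can still override it:

```ini
JobRequeue=0   # default is 1; 0 stops the OOM -> requeue -> OOM crash loop

### EXTRA CONFIG ###
```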
…-in-buildSlurmConf feat: add system defaults to buildSlurmConf (TCL-5588)
TCL-5576: feat: add topology spread for login pods
Add DeleteNode(ctx, nodeset, nodeName) to the SlurmControlInterface, enabling the operator to programmatically remove Slurm node registrations. Previously, scontrol delete was only possible from inside a dying pod's PreStop hook, leaving ghost entries when pods terminate abnormally. The implementation follows the established MakeNodeDrain pattern: lookupClient → Get → Delete, with tolerateError on 404/204 for idempotency. Takes a node name string (not a pod) so callers can delete orphaned entries that have no corresponding pod.
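The idempotent-delete behavior described above can be sketched in Go. The apiError type and status-code handling are stand-ins for however the Slurm client surfaces errors; only the "tolerate 404/204" rule comes from the commit:

```go
package main

import (
	"errors"
	"fmt"
)

// apiError is a minimal stand-in for the Slurm client's error type.
type apiError struct{ code int }

func (e *apiError) Error() string { return fmt.Sprintf("slurm api error: %d", e.code) }

// tolerateError treats "already gone" (404) and empty (204) responses as
// success, so DeleteNode can be retried safely and stays idempotent.
func tolerateError(err error) error {
	var ae *apiError
	if errors.As(err, &ae) && (ae.code == 404 || ae.code == 204) {
		return nil
	}
	return err
}

func main() {
	fmt.Println(tolerateError(&apiError{code: 404})) // <nil>: node already deleted
	fmt.Println(tolerateError(&apiError{code: 500})) // propagated error
}
```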
Add DeleteOrphanedNodes to the sync loop so ghost scontrol entries are cleaned up automatically. When a worker pod terminates without running its PreStop hook (force-delete, OOM, node crash), its Slurm node registration persists forever. This step compares the Slurm node list against current pods and deletes entries with no matching pod. Runs after RefreshNodeCache (cache is fresh) and before syncNodeSet (scale decisions) so the operator doesn't count ghosts when deciding replica count.
DeleteOrphanedNodes previously listed all Slurm nodes from the controller and deleted any without a matching pod in the current NodeSet's pod list. If multiple NodeSets share a controller, this would incorrectly delete other NodeSets' valid nodes. Filter by nodeNamePrefix (nodeset.Name + "-") so only nodes belonging to the reconciling NodeSet are considered for deletion.
Only delete Slurm nodes that match the current NodeSet's exact ordinal naming pattern so prefix-overlapping NodeSets cannot be touched. Co-authored-by: Cursor <cursoragent@cursor.com>
The NodeSet hostname template (e.g. "slinky-") determines the actual
Slurm node names ("slinky-0"), not the NodeSet name ("slurm-worker-slinky").
Use the template prefix so orphan detection works on real clusters.
Co-authored-by: Cursor <cursoragent@cursor.com>
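Orphan detection keyed on the hostname-template prefix plus an exact ordinal could be sketched like this (the function name is hypothetical; the commits specify only the "template prefix + ordinal" matching rule):

```go
package main

import (
	"fmt"
	"regexp"
)

// ownedBySet reports whether a Slurm node name was produced by the NodeSet
// whose hostname template yields the given prefix (e.g. "slinky-").
// Only "<prefix><ordinal>" matches, so a NodeSet with prefix "slinky-"
// cannot touch nodes of a prefix-overlapping set like "slinky-gpu-".
func ownedBySet(nodeName, prefix string) bool {
	re := regexp.MustCompile("^" + regexp.QuoteMeta(prefix) + `\d+$`)
	return re.MatchString(nodeName)
}

func main() {
	fmt.Println(ownedBySet("slinky-0", "slinky-"))     // true
	fmt.Println(ownedBySet("slinky-gpu-0", "slinky-")) // false: "gpu-0" is not an ordinal
	fmt.Println(ownedBySet("other-3", "slinky-"))      // false
}
```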
…rphaned-slurm-nodes TCL-5951: Reconcile orphaned Slurm node registrations
…node-interface TCL-5951: Add DeleteNode to SlurmControlInterface
When --namespace is set, the controller-runtime cache only watches that namespace (via cache.Options.DefaultNamespaces) and the leader-election ID is suffixed with the namespace so multiple operator instances on the same cluster don't share a lease. Enables one slurm-operator per tenant namespace, which is required for per-cluster substrate-managed slurm CPs. Default behavior (empty flag) is unchanged: cluster-wide watch with the original leader-election ID. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
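The ID-suffixing rule can be sketched as a pure function (the base lease name below is a placeholder, not the operator's actual ID); the same namespace value would also be passed to the controller-runtime cache via cache.Options.DefaultNamespaces to scope the watch:

```go
package main

import "fmt"

// leaderElectionID appends the watch namespace to the base lease name so two
// operator instances in different namespaces never contend for the same lease.
// An empty namespace keeps the original cluster-wide ID unchanged.
func leaderElectionID(base, namespace string) string {
	if namespace == "" {
		return base
	}
	return base + "-" + namespace
}

func main() {
	// "slurm-operator-lock" is an illustrative base ID, not the real one.
	fmt.Println(leaderElectionID("slurm-operator-lock", ""))         // cluster-wide default
	fmt.Println(leaderElectionID("slurm-operator-lock", "tenant-a")) // per-namespace lease
}
```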
- Add container-images workflow: build/push on main and PRs, with VERSION-dev-SHA vs release tag.
- Default REGISTRY to togethercomputer in Makefile and docker-bake.hcl.
- push-images: single buildx bake --push (no redundant build-images prerequisite).
- Bump VERSION to 1.0.1.
* TCL-6170 Fix github actions
* TCL-6170 Fix github actions
* TCL-6170: fix Actions: workflows on branch + correct push gates
  - container-images-1.0.yaml: must exist on slurm-1.0-together-changes; should-push uses RELEASE_BRANCH (not main).
  - container-images.yaml: main line; same RELEASE_REF pattern for push to main.
  - Both live in the repo so pushes to main vs the slurm branch each match a workflow (GitHub loads workflows from the pushed commit).
* TCL-6170 Fix github actions
No description provided.