
Release 1.0 changes #32

Open

eb3095 wants to merge 47 commits into release-1.0-clean from release-1.0-changes


Conversation

eb3095 (Collaborator) commented May 12, 2026

No description provided.

syutogether and others added 30 commits July 23, 2025 10:31
…hercomputer/slurm-operator into syu/tcl-1682-fix-module-path
Fix the declared module path for our forked slurm-operator
Match the shmSize and existingDataClaims handling
that was added to nodeset-cr.yaml for consistency.

This allows login pods to:
- Have configurable shared memory (/dev/shm) size
- Mount existing PVCs for storage access (/data, etc.)
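
For illustration, a values.yaml excerpt along these lines could drive both settings for the login deployment (the key names and claim layout are assumptions, not the chart's exact schema):

```yaml
# Hypothetical values.yaml excerpt; field names are illustrative only.
login:
  # Size of the tmpfs mounted at /dev/shm in login pods
  shmSize: 16Gi
  # Pre-existing PVCs to mount, mirroring the nodeset-cr.yaml handling
  existingDataClaims:
    - claimName: shared-data
      mountPath: /data
```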
Upgrade slurm-operator fork to Slinky v1.0 with Together-specific features
1. reconfigure sidecar: initialize lastHash to current config hash on
   startup so no spurious scontrol-reconfigure fires while slurmctld is
   still initializing (avoids a deadlock in slurm 25.11.2).

2. login deployment: mount slurm-auth-slurm and slurm-auth-jwths256
   secrets alongside the slurm-config configmap using a projected volume
   with mode 0600. sackd needs the slurm.key for bootstrap auth.

3. login SLURM_CONF_SERVER env: point to the controller service
   (slurm-controller) instead of the non-existent "slurm" service.

Made-with: Cursor
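
A minimal sketch of items 2 and 3, assuming typical projected-volume wiring in the login Deployment (the secret and ConfigMap names come from the commit message; the mount layout and surrounding fields are assumptions):

```yaml
# Illustrative login Deployment fragment; only the named secrets/ConfigMap are from the PR.
volumes:
  - name: slurm-config
    projected:
      defaultMode: 0600            # sackd reads slurm.key, so keep these files root-only
      sources:
        - configMap:
            name: slurm-config
        - secret:
            name: slurm-auth-slurm
        - secret:
            name: slurm-auth-jwths256
containers:
  - name: login
    env:
      - name: SLURM_CONF_SERVER
        value: slurm-controller    # controller Service, not the non-existent "slurm" Service
```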
…onfig

[TCL-2751] Slinky v1.0: fix chart rendering, reconfigure deadlock, login auth, and SLURM_CONF_SERVER
The projected volume set defaultMode 0600 on ALL files including
slurm.conf, making it unreadable by non-root LDAP users. sinfo
failed with "Permission denied" for regular users.

Fix: add initContainer that copies from projected volume (read-only)
to an emptyDir, then sets config files to 644 and auth keys to 600.
Same pattern used by the accounting pod (TCL-4402 follow-up).

Made-with: Cursor
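
A sketch of that initContainer pattern (paths, volume names, and key filenames are assumptions; only the 644/600 split is from the commit):

```yaml
# Illustrative fix-perms initContainer: the projected volume stays read-only,
# the emptyDir is what the login container actually mounts for /etc/slurm.
initContainers:
  - name: fix-perms
    image: docker.io/library/alpine:3.19
    command:
      - sh
      - -c
      - |
        cp /projected/* /etc-slurm/
        # configs readable by regular (LDAP) users
        chmod 644 /etc-slurm/*.conf
        # auth keys stay root-only
        chmod 600 /etc-slurm/slurm.key /etc-slurm/jwt_hs256.key
    volumeMounts:
      - name: slurm-config-projected
        mountPath: /projected
        readOnly: true
      - name: slurm-config
        mountPath: /etc-slurm
```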
…onfig

[TCL-4609] Fix login pod permissions: split config (644) from auth keys (600)
The fix-perms initContainer used docker.io/library/alpine:3.19 which
hits Docker Hub's unauthenticated pull rate limit (429 Too Many Requests)
on large clusters where many nodes pull simultaneously.

Reuse the login image (already pulled for the main container) instead.
This eliminates the extra image pull entirely since the kubelet already
has the image cached.

Discovered on Suno's 128-node cluster during Slinky v1.0 rollout.

Made-with: Cursor
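
The fix amounts to swapping the initContainer image for the one the kubelet has already pulled, roughly (the values path is an assumption):

```yaml
initContainers:
  - name: fix-perms
    # Reuse the login image instead of alpine so no extra pull hits Docker Hub.
    image: "{{ .Values.login.image.repository }}:{{ .Values.login.image.tag }}"  # assumed values layout
```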
…-rate-limit

fix: use login image for fix-perms initContainer to avoid Docker Hub rate limits
Add two new Helm values to mount custom init scripts to /root/init.sh:
- initScriptLogin: mounted on login nodes (login-deployment)
- initScriptNodes: mounted on compute nodes (nodesets)

Scripts are stored in ConfigMaps and mounted with mode 0755.
TCL-5107: feat: add initScriptLogin and initScriptNodes values
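
The commit does not show the value shapes; a hypothetical usage, assuming each value holds the script body that the chart renders into a ConfigMap mounted at /root/init.sh with mode 0755:

```yaml
# Hypothetical values.yaml usage; the script contents are examples only.
initScriptLogin: |
  #!/bin/bash
  # runs on login pods
  echo "login init ran" >> /var/log/init.log
initScriptNodes: |
  #!/bin/bash
  # runs on compute (nodeset) pods
  nvidia-smi -pm 1 || true
```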
Add UnkillableStepTimeout=600, HealthCheckInterval=60,
HealthCheckNodeState=ANY, HealthCheckProgram to the generated
slurm.conf as system defaults. These appear before ### EXTRA CONFIG ###
so user extraConf can override if needed (Slurm uses last value).

Covers both IC and BM clusters since both use the Slinky operator.
No annotation gate needed — Slinky operator rebuilds slurm.conf
natively on any Controller CR change.

Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at
/usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless
"HealthCheckProgram not found" warnings.

Made-with: Cursor
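
Rendered into the slurm-config ConfigMap, the new defaults sit just above the user section, roughly like this (the surrounding lines and exact ordering are assumptions; the four defaults and the marker are from the commit):

```yaml
# Sketch of the generated ConfigMap data.
data:
  slurm.conf: |
    # ... operator-generated cluster/node settings ...
    UnkillableStepTimeout=600
    HealthCheckInterval=60
    HealthCheckNodeState=ANY
    HealthCheckProgram=/usr/bin/gpu_healthcheck.sh
    ### EXTRA CONFIG ###
    # user extraConf is appended here, so it wins (Slurm keeps the last value)
```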
jhu-svg and others added 17 commits April 21, 2026 13:51
Without this, --mem in sbatch is just a scheduling hint — Slurm
doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs
that exceed their memory allocation get killed by Slurm instead
of triggering the kernel OOM killer.

MemSpecLimit is NOT added as a default because the right value
depends on node memory size (per-cluster tuning via extraConf).

Made-with: Cursor
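
ConstrainRAMSpace is a cgroup.conf parameter rather than a slurm.conf one; purely as illustration, if the chart exposed cgroup settings through a value it could look like this (the cgroupConf key is hypothetical, only ConstrainRAMSpace=yes comes from the commit):

```yaml
# Hypothetical values key for cgroup settings.
cgroupConf: |
  ConstrainRAMSpace=yes
```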
Prevents OOM → requeue → OOM loops. When a node OOMs and the job
fails, Slurm auto-requeues it by default (JobRequeue=1). The
requeued job runs again, OOMs again, creating a crash loop.

Same category as UnkillableStepTimeout — every cluster should have
this. Placed before EXTRA CONFIG so user can override if needed.

Made-with: Cursor
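
Since the default lands before the EXTRA CONFIG marker, a cluster that does want requeueing back can simply override it (the placement of extraConf in values.yaml is an assumption):

```yaml
extraConf: |
  # restore Slurm's stock requeue behavior for this cluster only
  JobRequeue=1
```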
…-in-buildSlurmConf

feat: add system defaults to buildSlurmConf (TCL-5588)
TCL-5576: feat: add topology spread for login pods
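
A plausible shape for that constraint on the login Deployment (all field values here are assumptions; the PR only names the feature):

```yaml
# Illustrative topologySpreadConstraints for login pods.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: login
```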
Add DeleteNode(ctx, nodeset, nodeName) to the SlurmControlInterface,
enabling the operator to programmatically remove Slurm node
registrations. Previously, scontrol delete was only possible from
inside a dying pod's PreStop hook, leaving ghost entries when pods
terminate abnormally.

The implementation follows the established MakeNodeDrain pattern:
lookupClient → Get → Delete, with tolerateError on 404/204 for
idempotency. Takes a node name string (not a pod) so callers can
delete orphaned entries that have no corresponding pod.
Add DeleteOrphanedNodes to the sync loop so ghost scontrol entries
are cleaned up automatically. When a worker pod terminates without
running its PreStop hook (force-delete, OOM, node crash), its Slurm
node registration persists forever. This step compares the Slurm node
list against current pods and deletes entries with no matching pod.

Runs after RefreshNodeCache (cache is fresh) and before syncNodeSet
(scale decisions) so the operator doesn't count ghosts when deciding
replica count.
DeleteOrphanedNodes previously listed all Slurm nodes from the
controller and deleted any without a matching pod in the current
NodeSet's pod list. If multiple NodeSets share a controller, this
would incorrectly delete other NodeSets' valid nodes.

Filter by nodeNamePrefix (nodeset.Name + "-") so only nodes belonging
to the reconciling NodeSet are considered for deletion.
Only delete Slurm nodes that match the current NodeSet's exact ordinal naming pattern so prefix-overlapping NodeSets cannot be touched.

Co-authored-by: Cursor <cursoragent@cursor.com>
The NodeSet hostname template (e.g. "slinky-") determines the actual
Slurm node names ("slinky-0"), not the NodeSet name ("slurm-worker-slinky").
Use the template prefix so orphan detection works on real clusters.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rphaned-slurm-nodes

TCL-5951: Reconcile orphaned Slurm node registrations
…node-interface

TCL-5951: Add DeleteNode to SlurmControlInterface
When --namespace is set, the controller-runtime cache only watches
that namespace (via cache.Options.DefaultNamespaces) and the
leader-election ID is suffixed with the namespace so multiple
operator instances on the same cluster don't share a lease.

Enables one slurm-operator per tenant namespace, which is required
for per-cluster substrate-managed slurm CPs.

Default behavior (empty flag) is unchanged: cluster-wide watch with
the original leader-election ID.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
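
Deployed per tenant, the operator just gets the flag in its args (the Deployment excerpt is illustrative; only the --namespace flag and the lease-ID suffixing are from the commit):

```yaml
containers:
  - name: slurm-operator
    args:
      # watch only this namespace; the leader-election ID is suffixed with "tenant-a"
      - --namespace=tenant-a
```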
- Add container-images workflow: build/push on main and PRs with VERSION-dev-SHA vs release tag.

- Default REGISTRY to togethercomputer in Makefile and docker-bake.hcl.

- push-images: single buildx bake --push (no redundant build-images prerequisite).

- Bump VERSION to 1.0.1.
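
The tag selection described above could look roughly like this in the workflow (step names and variable plumbing are assumptions; only the VERSION-dev-SHA vs release-tag split and the push-images target are from the commits):

```yaml
# Sketch of the container-images workflow's tag logic.
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Compute image tag
        run: |
          if [ "${GITHUB_EVENT_NAME}" = "push" ] && [ "${GITHUB_REF_NAME}" = "main" ]; then
            echo "TAG=$(cat VERSION)" >> "$GITHUB_ENV"                        # release tag
          else
            echo "TAG=$(cat VERSION)-dev-${GITHUB_SHA::7}" >> "$GITHUB_ENV"   # dev tag
          fi
      - name: Build and push
        run: make push-images VERSION="${TAG}"
```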
* TCL-6170 Fix github actions

* TCL-6170 Fix github actions

* TCL-6170: fix Actions — workflows on branch + correct push gates

- container-images-1.0.yaml: must exist on slurm-1.0-together-changes; should-push uses RELEASE_BRANCH (not main).

- container-images.yaml: main line; same RELEASE_REF pattern for push to main.

- Both in repo so pushes to main vs slurm each match a workflow (GitHub loads workflows from the pushed commit).

* TCL-6170 Fix github actions
