
Release 1.0 changes #32

Open

eb3095 wants to merge 47 commits into release-1.0-clean from release-1.0-changes


Conversation

eb3095 (Collaborator) commented May 12, 2026

No description provided.

syutogether and others added 30 commits July 23, 2025 10:31
…hercomputer/slurm-operator into syu/tcl-1682-fix-module-path
Fix the declared module path for our forked slurm-operator
Match the shmSize and existingDataClaims handling
that was added to nodeset-cr.yaml for consistency.

This allows login pods to:
- Have configurable shared memory (/dev/shm) size
- Mount existing PVCs for storage access (/data, etc.)
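
For illustration, a values.yaml excerpt along these lines could drive both settings for the login deployment (the key names and claim layout are assumptions, not the chart's exact schema):

```yaml
# Hypothetical values.yaml excerpt; field names are illustrative only.
login:
  # Size of the tmpfs mounted at /dev/shm in login pods
  shmSize: 16Gi
  # Pre-existing PVCs to mount, mirroring the nodeset-cr.yaml handling
  existingDataClaims:
    - claimName: shared-data
      mountPath: /data
```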
Upgrade slurm-operator fork to Slinky v1.0 with Together-specific features
1. reconfigure sidecar: initialize lastHash to current config hash on
   startup so no spurious scontrol-reconfigure fires while slurmctld is
   still initializing (avoids a deadlock in slurm 25.11.2).

2. login deployment: mount slurm-auth-slurm and slurm-auth-jwths256
   secrets alongside the slurm-config configmap using a projected volume
   with mode 0600. sackd needs the slurm.key for bootstrap auth.

3. login SLURM_CONF_SERVER env: point to the controller service
   (slurm-controller) instead of the non-existent "slurm" service.

Made-with: Cursor
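
A minimal sketch of items 2 and 3, assuming typical projected-volume wiring in the login Deployment (the secret and ConfigMap names come from the commit message; the mount layout and surrounding fields are assumptions):

```yaml
# Illustrative login Deployment fragment; only the named secrets/ConfigMap are from the PR.
volumes:
  - name: slurm-config
    projected:
      defaultMode: 0600            # sackd reads slurm.key, so keep these files root-only
      sources:
        - configMap:
            name: slurm-config
        - secret:
            name: slurm-auth-slurm
        - secret:
            name: slurm-auth-jwths256
containers:
  - name: login
    env:
      - name: SLURM_CONF_SERVER
        value: slurm-controller    # controller Service, not the non-existent "slurm" Service
```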
…onfig

[TCL-2751] Slinky v1.0: fix chart rendering, reconfigure deadlock, login auth, and SLURM_CONF_SERVER
The projected volume set defaultMode 0600 on ALL files including
slurm.conf, making it unreadable by non-root LDAP users. sinfo
failed with "Permission denied" for regular users.

Fix: add initContainer that copies from projected volume (read-only)
to an emptyDir, then sets config files to 644 and auth keys to 600.
Same pattern used by the accounting pod (TCL-4402 follow-up).

Made-with: Cursor
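
A sketch of that initContainer pattern (paths, volume names, and key filenames are assumptions; only the 644/600 split is from the commit):

```yaml
# Illustrative fix-perms initContainer: the projected volume stays read-only,
# the emptyDir is what the login container actually mounts for /etc/slurm.
initContainers:
  - name: fix-perms
    image: docker.io/library/alpine:3.19
    command:
      - sh
      - -c
      - |
        cp /projected/* /etc-slurm/
        # configs readable by regular (LDAP) users
        chmod 644 /etc-slurm/*.conf
        # auth keys stay root-only
        chmod 600 /etc-slurm/slurm.key /etc-slurm/jwt_hs256.key
    volumeMounts:
      - name: slurm-config-projected
        mountPath: /projected
        readOnly: true
      - name: slurm-config
        mountPath: /etc-slurm
```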
…onfig

[TCL-4609] Fix login pod permissions: split config (644) from auth keys (600)
The fix-perms initContainer used docker.io/library/alpine:3.19 which
hits Docker Hub's unauthenticated pull rate limit (429 Too Many Requests)
on large clusters where many nodes pull simultaneously.

Reuse the login image (already pulled for the main container) instead.
This eliminates the extra image pull entirely since the kubelet already
has the image cached.

Discovered on Suno's 128-node cluster during Slinky v1.0 rollout.

Made-with: Cursor
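
The fix amounts to swapping the initContainer image for the one the kubelet has already pulled, roughly (the values path is an assumption):

```yaml
initContainers:
  - name: fix-perms
    # Reuse the login image instead of alpine so no extra pull hits Docker Hub.
    image: "{{ .Values.login.image.repository }}:{{ .Values.login.image.tag }}"  # assumed values layout
```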
…-rate-limit

fix: use login image for fix-perms initContainer to avoid Docker Hub rate limits
Add two new Helm values to mount custom init scripts to /root/init.sh:
- initScriptLogin: mounted on login nodes (login-deployment)
- initScriptNodes: mounted on compute nodes (nodesets)

Scripts are stored in ConfigMaps and mounted with mode 0755.
TCL-5107: feat: add initScriptLogin and initScriptNodes values
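
The commit does not show the value shapes; a hypothetical usage, assuming each value holds the script body that the chart renders into a ConfigMap mounted at /root/init.sh with mode 0755:

```yaml
# Hypothetical values.yaml usage; the script contents are examples only.
initScriptLogin: |
  #!/bin/bash
  # runs on login pods
  echo "login init ran" >> /var/log/init.log
initScriptNodes: |
  #!/bin/bash
  # runs on compute (nodeset) pods
  nvidia-smi -pm 1 || true
```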
Add UnkillableStepTimeout=600, HealthCheckInterval=60,
HealthCheckNodeState=ANY, HealthCheckProgram to the generated
slurm.conf as system defaults. These appear before ### EXTRA CONFIG ###
so user extraConf can override if needed (Slurm uses last value).

Covers both IC and BM clusters since both use the Slinky operator.
No annotation gate needed — Slinky operator rebuilds slurm.conf
natively on any Controller CR change.

Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at
/usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless
"HealthCheckProgram not found" warnings.

Made-with: Cursor
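
Rendered into the slurm-config ConfigMap, the new defaults sit just above the user section, roughly like this (the surrounding lines and exact ordering are assumptions; the four defaults and the marker are from the commit):

```yaml
# Sketch of the generated ConfigMap data.
data:
  slurm.conf: |
    # ... operator-generated cluster/node settings ...
    UnkillableStepTimeout=600
    HealthCheckInterval=60
    HealthCheckNodeState=ANY
    HealthCheckProgram=/usr/bin/gpu_healthcheck.sh
    ### EXTRA CONFIG ###
    # user extraConf is appended here, so it wins (Slurm keeps the last value)
```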
jhu-svg and others added 17 commits April 21, 2026 13:51
Without this, --mem in sbatch is just a scheduling hint — Slurm
doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs
that exceed their memory allocation get killed by Slurm instead
of triggering the kernel OOM killer.

MemSpecLimit is NOT added as a default because the right value
depends on node memory size (per-cluster tuning via extraConf).

Made-with: Cursor
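
ConstrainRAMSpace is a cgroup.conf parameter rather than a slurm.conf one; purely as illustration, if the chart exposed cgroup settings through a value it could look like this (the cgroupConf key is hypothetical, only ConstrainRAMSpace=yes comes from the commit):

```yaml
# Hypothetical values key for cgroup settings.
cgroupConf: |
  ConstrainRAMSpace=yes
```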
Prevents OOM → requeue → OOM loops. When a node OOMs and the job
fails, Slurm auto-requeues it by default (JobRequeue=1). The
requeued job runs again, OOMs again, creating a crash loop.

Same category as UnkillableStepTimeout — every cluster should have
this. Placed before EXTRA CONFIG so user can override if needed.

Made-with: Cursor
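
Since the default lands before the EXTRA CONFIG marker, a cluster that does want requeueing back can simply override it (the placement of extraConf in values.yaml is an assumption):

```yaml
extraConf: |
  # restore Slurm's stock requeue behavior for this cluster only
  JobRequeue=1
```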
…-in-buildSlurmConf

feat: add system defaults to buildSlurmConf (TCL-5588)
TCL-5576: feat: add topology spread for login pods
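
A plausible shape for that constraint on the login Deployment (all field values here are assumptions; the PR only names the feature):

```yaml
# Illustrative topologySpreadConstraints for login pods.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/component: login
```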
Add DeleteNode(ctx, nodeset, nodeName) to the SlurmControlInterface,
enabling the operator to programmatically remove Slurm node
registrations. Previously, scontrol delete was only possible from
inside a dying pod's PreStop hook, leaving ghost entries when pods
terminate abnormally.

The implementation follows the established MakeNodeDrain pattern:
lookupClient → Get → Delete, with tolerateError on 404/204 for
idempotency. Takes a node name string (not a pod) so callers can
delete orphaned entries that have no corresponding pod.
Add DeleteOrphanedNodes to the sync loop so ghost scontrol entries
are cleaned up automatically. When a worker pod terminates without
running its PreStop hook (force-delete, OOM, node crash), its Slurm
node registration persists forever. This step compares the Slurm node
list against current pods and deletes entries with no matching pod.

Runs after RefreshNodeCache (cache is fresh) and before syncNodeSet
(scale decisions) so the operator doesn't count ghosts when deciding
replica count.
DeleteOrphanedNodes previously listed all Slurm nodes from the
controller and deleted any without a matching pod in the current
NodeSet's pod list. If multiple NodeSets share a controller, this
would incorrectly delete other NodeSets' valid nodes.

Filter by nodeNamePrefix (nodeset.Name + "-") so only nodes belonging
to the reconciling NodeSet are considered for deletion.
Only delete Slurm nodes that match the current NodeSet's exact ordinal naming pattern so prefix-overlapping NodeSets cannot be touched.

Co-authored-by: Cursor <cursoragent@cursor.com>
The NodeSet hostname template (e.g. "slinky-") determines the actual
Slurm node names ("slinky-0"), not the NodeSet name ("slurm-worker-slinky").
Use the template prefix so orphan detection works on real clusters.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rphaned-slurm-nodes

TCL-5951: Reconcile orphaned Slurm node registrations
…node-interface

TCL-5951: Add DeleteNode to SlurmControlInterface
When --namespace is set, the controller-runtime cache only watches
that namespace (via cache.Options.DefaultNamespaces) and the
leader-election ID is suffixed with the namespace so multiple
operator instances on the same cluster don't share a lease.

Enables one slurm-operator per tenant namespace, which is required
for per-cluster substrate-managed slurm CPs.

Default behavior (empty flag) is unchanged: cluster-wide watch with
the original leader-election ID.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
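
Deployed per tenant, the operator just gets the flag in its args (the Deployment excerpt is illustrative; only the --namespace flag and the lease-ID suffixing are from the commit):

```yaml
containers:
  - name: slurm-operator
    args:
      # watch only this namespace; the leader-election ID is suffixed with "tenant-a"
      - --namespace=tenant-a
```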
- Add container-images workflow: build/push on main and PRs with VERSION-dev-SHA vs release tag.

- Default REGISTRY to togethercomputer in Makefile and docker-bake.hcl.

- push-images: single buildx bake --push (no redundant build-images prerequisite).

- Bump VERSION to 1.0.1.
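
The tag selection described above could look roughly like this in the workflow (step names and variable plumbing are assumptions; only the VERSION-dev-SHA vs release-tag split and the push-images target are from the commits):

```yaml
# Sketch of the container-images workflow's tag logic.
on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Compute image tag
        run: |
          if [ "${GITHUB_EVENT_NAME}" = "push" ] && [ "${GITHUB_REF_NAME}" = "main" ]; then
            echo "TAG=$(cat VERSION)" >> "$GITHUB_ENV"                        # release tag
          else
            echo "TAG=$(cat VERSION)-dev-${GITHUB_SHA::7}" >> "$GITHUB_ENV"   # dev tag
          fi
      - name: Build and push
        run: make push-images VERSION="${TAG}"
```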
* TCL-6170 Fix github actions

* TCL-6170 Fix github actions

* TCL-6170: fix Actions — workflows on branch + correct push gates

- container-images-1.0.yaml: must exist on slurm-1.0-together-changes; should-push uses RELEASE_BRANCH (not main).

- container-images.yaml: main line; same RELEASE_REF pattern for push to main.

- Both in repo so pushes to main vs slurm each match a workflow (GitHub loads workflows from the pushed commit).

* TCL-6170 Fix github actions
