
[CloudRift] Fix NTP clock skew breaking Docker; handle amd-smi 7.x output #3701

Merged
peterschmidt85 merged 1 commit into master from fix/cloudrift-ntp-and-amd-smi-parsing
Mar 27, 2026

@peterschmidt85
Contributor

Summary

  • CloudRift VMs boot with an incorrect RTC clock (~1h ahead). When NTP corrects it backwards, Docker discards container exit events, leaving containers stuck as ghosts. Added an NTP sync wait before launching the shim.
  • Handle both amd-smi output formats: flat array (ROCm 6.x) and wrapped {"gpu_data": [...]} (ROCm 7.x)
  • Add a 2-minute timeout to AMD GPU detection to prevent the shim from hanging indefinitely
  • Select correct VM image (AMD vs NVIDIA driver) based on GPU vendor
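
For illustration, the two amd-smi layouts mentioned above could be normalized as in the following minimal Python sketch. It is not the code from this PR: the function name `extract_gpu_entries` and the sample `"gpu"` field are hypothetical, and only the two shapes named in the summary (a flat array and a `{"gpu_data": [...]}` wrapper) are assumed.

```python
import json
from typing import Any, Dict, List


def extract_gpu_entries(raw: str) -> List[Dict[str, Any]]:
    """Return the per-GPU entries from amd-smi JSON output in either layout.

    Hypothetical helper: accepts a flat array (described for ROCm 6.x) or an
    object wrapping the array under "gpu_data" (described for ROCm 7.x).
    """
    data = json.loads(raw)
    if isinstance(data, list):
        # Flat array layout: the document itself is the list of GPUs.
        return data
    if isinstance(data, dict) and isinstance(data.get("gpu_data"), list):
        # Wrapped layout: the list of GPUs sits under the "gpu_data" key.
        return data["gpu_data"]
    raise ValueError("unrecognized amd-smi JSON layout")
```

Rejecting any other shape with an explicit error, rather than returning an empty list, keeps an unexpected future format from being mistaken for a host with no GPUs.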

Test plan

  • Verified NTP sync fix on CloudRift MI350X instance — no ghost containers
  • Verified GPU detection works with ROCm 6.4 amd-smi output format
  • Verified workload container starts with GPU attached (gpus=[/dev/dri/renderD128])
  • All Python tests pass (2357 passed)
  • All Go tests pass
  • golangci-lint, ruff, pre-commit all clean

Depends on dstackai/gpuhunt#223

🤖 Generated with Claude Code

…tput format

CloudRift VMs boot with an incorrect RTC clock (~1h ahead). When NTP
corrects it backwards, Docker discards container exit events, leaving
containers stuck as ghosts forever. Add NTP sync wait before launching
the shim to prevent this.
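
A wait of this kind might look like the Python sketch below. This is an assumption-laden illustration, not the change in this PR: it assumes the VM uses systemd and queries `timedatectl show --property=NTPSynchronized --value`, and the names `ntp_synchronized` and `wait_for_ntp_sync` as well as the timeout and polling interval are hypothetical.

```python
import subprocess
import time
from typing import Callable


def ntp_synchronized() -> bool:
    """Ask systemd whether the system clock is NTP-synchronized.

    Assumes a systemd host with timedatectl available; returns False otherwise.
    """
    try:
        result = subprocess.run(
            ["timedatectl", "show", "--property=NTPSynchronized", "--value"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return False
    return result.stdout.strip() == "yes"


def wait_for_ntp_sync(
    check: Callable[[], bool] = ntp_synchronized,
    timeout: float = 120.0,   # hypothetical upper bound, in seconds
    interval: float = 2.0,    # hypothetical polling interval, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the clock reports as synchronized or the timeout elapses."""
    # time.monotonic() never steps backwards, so the deadline stays valid
    # even while NTP corrects the wall clock.
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

The caller would start the shim only after `wait_for_ntp_sync()` returns; what to do when it returns `False` (proceed anyway or fail) is a policy choice the source does not describe.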

Also handle both amd-smi output formats (flat array in ROCm 6.x,
wrapped {"gpu_data": [...]} in ROCm 7.x) and add a 2-minute timeout
to AMD GPU detection to prevent the shim from hanging indefinitely.
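
The idea of bounding detection time can be sketched in Python as follows. This is only an illustration under assumptions: the helper name `run_with_timeout` is hypothetical, the exact amd-smi invocation is not shown in the source and is therefore left to the caller, and treating a timeout the same as a failed command is a choice made here for brevity.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout: float = 120.0) -> Optional[str]:
    """Run a detection command, returning its stdout or None on failure.

    The 120-second default mirrors the 2-minute limit described above.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed if it exceeds this limit
            check=True,       # a non-zero exit status raises CalledProcessError
        )
    except subprocess.TimeoutExpired:
        # The tool hung: give up instead of blocking the caller indefinitely.
        return None
    except (subprocess.CalledProcessError, FileNotFoundError):
        # The tool failed or is not installed.
        return None
    return result.stdout
```

A `None` result lets the caller continue without GPU information rather than hang.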

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@peterschmidt85 peterschmidt85 requested a review from un-def March 26, 2026 13:33
@peterschmidt85 peterschmidt85 merged commit 77f3be1 into master Mar 27, 2026
28 checks passed
@peterschmidt85 peterschmidt85 deleted the fix/cloudrift-ntp-and-amd-smi-parsing branch March 27, 2026 09:23