
Migrate logging from internal/logger to log/slog for amd-ctk#112

Open
shiv-tyagi wants to merge 2 commits into ROCm:main from shiv-tyagi:migrate-logging

Conversation

@shiv-tyagi
Member

Replace the custom logger.Log (backed by standard log.Logger) with Go's structured logging (log/slog) across the project.

  • Configure slog in cmd/amd-ctk with --debug flag for level control (defaults to Info on stderr)
  • Configure slog in cmd/container-runtime via internal/logger.Init for file-based logging at Debug level to preserve the previous behavior where all log lines were captured
  • Update internal/logger to set slog as the default logger instead of exposing a log.Logger instance
  • Convert all logger.Log.Printf calls to appropriate slog levels (Info, Debug, Warn, Error) with structured key-value pairs
  • Remove logger.Log references from all internal packages and tests

The container runtime no longer leaks GPU allocation, release,
and exclusive-enforcement messages into Docker's error output.
These messages now go exclusively to the runtime log file. Update
the gpu-tracker doc's exclusive access example to show the new
Docker error output and direct users to the runtime log instead.

Also fix the wording of the stale GPU info mismatch error message, and
document the new --debug/-d flag for amd-ctk across the gpu-tracker,
troubleshooting, and CDI docs.
@shiv-tyagi shiv-tyagi marked this pull request as draft April 8, 2026 10:14
@shiv-tyagi
Member Author

shiv-tyagi commented Apr 8, 2026

Test Checklist

GPU Tracker — Enable/Disable/Reset

  • amd-ctk gpu-tracker init — succeeds silently (no output, exit 0)
  • amd-ctk gpu-tracker enable — prints "GPU Tracker has been enabled"
  • amd-ctk gpu-tracker enable (again) — prints "GPU Tracker is already enabled"
  • amd-ctk gpu-tracker disable — prints "GPU Tracker has been disabled"
  • amd-ctk gpu-tracker disable (again) — prints "GPU Tracker is already disabled"
  • amd-ctk gpu-tracker reset (was disabled) — prints "GPU Tracker has been reset" (no restart recommendation)
  • amd-ctk gpu-tracker enable then amd-ctk gpu-tracker reset — prints "GPU Tracker has been reset" followed by restart recommendation
  • All above as non-root — prints "Permission denied: Not running as root"

GPU Tracker — Status

  • amd-ctk gpu-tracker status (disabled) — prints "GPU Tracker is disabled"
  • amd-ctk gpu-tracker status (enabled, GPUs present) — prints table with correct GPU Id, UUID, Accessibility, Container Ids
  • amd-ctk gpu-tracker status (GPU info mismatch, e.g. GPU removed after enable) — prints error: "GPU info mismatch: please reset GPU Tracker"

GPU Tracker — Exclusive/Shared

  • amd-ctk gpu-tracker 0 exclusive — prints "GPUs [0] have been made exclusive"
  • amd-ctk gpu-tracker 0,1 exclusive (GPU 0 used by 2+ containers) — prints "GPUs [0] have not been made exclusive because more than one container is currently using it"
  • amd-ctk gpu-tracker 0,99 exclusive — prints success for 0, prints "Ignoring [99] GPUs as they are invalid"
  • amd-ctk gpu-tracker 0 shared — prints "GPUs [0] have been made shared"
  • amd-ctk gpu-tracker 0,1 exclusive (tracker disabled) — prints "GPU Tracker is disabled" (no operation performed)
  • amd-ctk gpu-tracker 0,1 exclusive (GPU info mismatch) — prints error: "GPU info mismatch: please reset GPU Tracker"

GPU Tracker — Release

  • amd-ctk gpu-tracker release <container_id> — succeeds silently (exit 0)
  • amd-ctk gpu-tracker release <container_id> (tracker file missing) — succeeds silently (tracker not initialized, nothing to release)
  • amd-ctk gpu-tracker release <container_id> (permission denied on tracker file) — prints error with container ID context

GPU Tracker — Error Scenarios

  • Delete /var/log/gpu-tracker.json and run amd-ctk gpu-tracker status — auto-initializes and shows status (or "disabled")
  • Make /var/log/gpu-tracker.json immutable and run amd-ctk gpu-tracker enable — prints clean error (no "Error: Error:" nesting)
  • Corrupt /var/log/gpu-tracker.json with invalid JSON and run amd-ctk gpu-tracker status — prints clean error: "decoding GPU tracker JSON: ..."
  • Hold the lock file (flock /var/log/gpu-tracker.lock sleep 30 &) and run amd-ctk gpu-tracker enable — times out after 10s with: "acquiring lock: timeout exceeded"

GPU Tracker — JSON Backward Compatibility

  • With an existing /var/log/gpu-tracker.json (from old version with "accessibility": 0), run amd-ctk gpu-tracker status — deserializes correctly, shows "Shared"
  • Same file with "accessibility": 1 — shows "Exclusive"
  • Run amd-ctk gpu-tracker init — overwrites with new file, still uses int format ("accessibility": 0)

Container Runtime — GPU Allocation

  • docker run --runtime=amd -e AMD_VISIBLE_DEVICES=all <image> rocm-smi — GPUs visible inside container
  • docker run --runtime=amd -e AMD_VISIBLE_DEVICES=0 <image> rocm-smi — only GPU 0 visible
  • docker run --runtime=amd <image> echo hello (no AMD_VISIBLE_DEVICES) — container starts without GPUs, no errors
  • Check /var/log/amd-container-runtime.log — contains structured slog entries: "Creating ROCm container runtime", "Launching container"
  • Verify no output on container's stdout from the runtime itself

Container Runtime — Exclusive GPU Enforcement

  • Enable tracker, make GPU 0 exclusive
  • Start container 1 with GPU 0 — succeeds, logs "GPUs allocated gpus=[0]"
  • Start container 2 with GPU 0 — fails with error: "GPUs [0] are exclusive and already in use"
  • Stop container 1 — poststop hook releases GPU 0
  • Start container 2 with GPU 0 — now succeeds

Container Runtime — Log File

  • As root: verify log written to /var/log/amd-container-runtime.log
  • Verify log format is structured: time=... level=... msg="amd-container-runtime ..." key=value
  • Verify Debug-level messages appear in log file (e.g. "Running runc", "Parsing topology node file")

CDI — Generate/Validate

  • amd-ctk cdi generate — prints "Generated CDI spec: /etc/cdi/amd.json" (or custom path)
  • amd-ctk cdi generate --output /tmp/amd.json — generates at custom path
  • amd-ctk cdi validate — prints "Validating CDI spec: ..." then "CDI spec is valid"
  • Modify the generated spec and run amd-ctk cdi validate — prints "CDI spec is invalid\nPlease regenerate CDI spec"
  • amd-ctk cdi validate with missing spec file — prints error: "validating CDI spec: opening CDI spec file ...: no such file or directory"
  • amd-ctk cdi validate with corrupt JSON — prints error with decode details

CLI — Debug Flag

  • amd-ctk gpu-tracker status — no debug output on stderr
  • amd-ctk --debug gpu-tracker status — debug-level slog output visible on stderr
  • amd-ctk -d gpu-tracker status — same as above (short flag)

Signal Handling

  • Start amd-ctk gpu-tracker enable and send SIGINT (Ctrl+C) during execution — process exits, lock file released (verify by immediately running another command)
  • Start amd-ctk gpu-tracker enable and send SIGKILL — lock file released by kernel (verify by immediately running another command)

Unit Tests

  • go test ./internal/gpu-tracker/ — passes
  • go test ./internal/cdi/ — passes
  • go test ./internal/amdgpu/ — passes
  • go test ./internal/oci/ — passes
  • go build ./cmd/... — builds without errors

I have performed these tests on a GPU machine and carefully verified all outputs.

@shiv-tyagi shiv-tyagi marked this pull request as ready for review April 8, 2026 11:25
@shiv-tyagi shiv-tyagi requested a review from sgopinath1 April 8, 2026 11:26