feat: add AMD GPU (amdflang/OpenMP offload) container #1422

Draft
sbryngelson wants to merge 7 commits into MFlowCode:master from sbryngelson:feat/amd-container

Conversation

@sbryngelson
Member

Summary

  • Dockerfile: adds a TARGET=amd branch that downloads the AFAR drop (rocm-afar-8873-drop-22.2.0) from repo.radeon.com, installs cmake 3.28 (Ubuntu 22.04 ships 3.22, which doesn't recognise LLVMFlang), and builds MPICH 3.4.3 with amdflang as the Fortran compiler so that mpi.mod is compiler-compatible. It also installs libnuma1, libdrm2, and libdrm-amdgpu1 so that --rocm is the only extra flag needed at Apptainer runtime
  • docker.yml: adds an amd matrix entry with full build/push/manifest steps; the $TAG-amd manifest is always pushed, latest-amd only on release
  • CMakeLists.txt: makes Cray-specific MPI/hipfft paths conditional on CRAY_MPICH_INC/CRAY_HIPFORT_LIB being set; falls back to standard find_package(MPI) and find_library(hipfft/amdhip64) with $OLCF_AFAR_ROOT hints so the self-contained container works without any OLCF env vars loaded
  • toolchain: adds amd90a cluster profile (HPCFund gfx90a / MI250); fixes module variable export loop so vars that reference previously exported vars expand in the right order
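
The export-order fix in the last bullet can be illustrated with a minimal shell sketch. This is not the code in toolchain/bootstrap/modules.sh; the function name `export_in_order` and the `DEMO_*` variables are hypothetical, and the sketch assumes entries arrive as trusted `NAME=value` strings whose values may reference names exported earlier.

```shell
# Hypothetical sketch: export NAME=value entries one at a time so that a
# later value can reference a variable exported by an earlier entry.
export_in_order() {
    for entry in "$@"; do
        name="${entry%%=*}"     # text before the first '='
        value="${entry#*=}"     # text after the first '='
        # Expand references such as $DEMO_ROOT against the current
        # environment, then export immediately so the next entry sees it.
        # eval is only acceptable here because the entries are trusted.
        eval "export ${name}=\"${value}\""
    done
}

# Illustrative names only; these are not variables from the MFC toolchain.
export_in_order 'DEMO_ROOT=/opt/demo' 'DEMO_LIB=$DEMO_ROOT/lib'
echo "$DEMO_LIB"
```

Exporting inside the loop, rather than collecting all entries and exporting them at the end, is what lets `DEMO_LIB` resolve `$DEMO_ROOT` to a non-empty value.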

Validation

  • Built mfc-amd-final.sif (Apptainer) on top of Ubuntu 22.04 + AFAR + cmake 3.28 + MPICH 3.4.3
  • All 32 dry-run tests passed
  • 1D Sod shock tube ran 1001 time steps on MI250X (gfx90a) GPU via apptainer exec --writable-tmpfs --rocm

Test plan

  • Docker CI build passes for TARGET=amd (compile + dry-run, no GPU runner needed in CI)
  • Existing cpu and gpu builds unaffected
  • latest-amd manifest pushed only on release trigger

…h timeout

The CPU container job was using QEMU to cross-compile linux/arm64 on a
single x86 runner, consistently hitting the 6-hour GitHub Actions limit.
All recent releases (v5.1.3 through v5.3.1) failed to publish latest-cpu.

Fix: split into two native jobs (ubuntu-22.04 and ubuntu-22.04-arm),
mirroring the existing GPU build pattern. Remove QEMU. Merge into a
multi-arch manifest in the manifests job using buildx imagetools.

Also: add weekly schedule trigger (Sunday midnight UTC) so the devcontainer
image stays fresh between releases, and bump build-push-action to v6.
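
The manifest merge described above could look roughly like the sketch below. The repository name `example/app` and the `-amd64`/`-arm64` tag suffixes are assumptions for illustration, not the names used in docker.yml; the function only prints the command so it can be inspected without a Docker daemon.

```shell
# Hypothetical sketch: print the buildx command that merges two
# single-architecture images into one multi-arch tag.
# The -amd64/-arm64 tag suffixes are an assumed naming scheme.
print_manifest_cmd() {
    image="$1"   # placeholder repository, e.g. example/app
    tag="$2"     # final multi-arch tag, e.g. latest-cpu
    printf 'docker buildx imagetools create -t %s:%s %s:%s-amd64 %s:%s-arm64\n' \
        "$image" "$tag" "$image" "$tag" "$image" "$tag"
}

print_manifest_cmd example/app latest-cpu
```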
- Dockerfile: add TARGET=amd branch — downloads AFAR drop from repo.radeon.com,
  installs cmake 3.28 (3.22 doesn't recognise LLVMFlang), builds MPICH 3.4.3
  with amdflang so mpi.mod is compiler-compatible; runtime libs libnuma1/libdrm2
  added so only --rocm is needed at apptainer runtime
- docker.yml: add amd matrix entry + build/push/manifest steps; fix cpu to run
  natively on amd64/arm64 instead of QEMU cross-build; add weekly cron
- CMakeLists.txt: make Cray-specific MPI/hipfft paths conditional on
  CRAY_MPICH_INC/CRAY_HIPFORT_LIB being set; fall back to standard
  find_package(MPI) and find_library(hipfft/amdhip64) so the self-contained
  container image works without any OLCF env vars loaded
- toolchain: add amd90a cluster profile (HPCFund gfx90a / MI250); fix module
  variable export loop so vars that reference previously exported vars expand correctly
@github-actions

Claude Code Review

Head SHA: 1557fc4

Files changed: 5

  • .github/Dockerfile
  • .github/workflows/docker.yml
  • CMakeLists.txt
  • toolchain/bootstrap/modules.sh
  • toolchain/modules

Findings:

1. OLCF_AFAR_ROOT is set to /opt/ in cpu and gpu Docker images

.github/Dockerfile — the unconditional ENV line after ARG AFAR_VERSION:

ENV OLCF_AFAR_ROOT=/opt/${AFAR_VERSION}

When the AFAR_VERSION build-arg is not supplied (the cpu and gpu matrix entries provide no AFAR_VERSION), Docker expands the ARG as an empty string, producing OLCF_AFAR_ROOT=/opt/. This real system directory is baked into the published cpu and gpu images. CMakeLists.txt uses HINTS "$ENV{OLCF_AFAR_ROOT}/lib" for find_library calls inside the LLVMFlang GPU path; on those images that resolves to /opt/lib, which could yield false positives if a matching library happens to be present there. The fix is to guard the ENV instruction under an ARG-conditional build stage, or give ARG AFAR_VERSION a sentinel default that is guaranteed absent from /opt/.
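
The empty-expansion behaviour in finding 1 can be reproduced with ordinary shell parameter expansion, used here as an analogue of the ENV line rather than as Dockerfile syntax. The function names and the `demo-version` value are hypothetical, and the `${1:+...}` guard is only one possible approach; any Dockerfile change would need its own validation.

```shell
# Shell analogue of the unconditional ENV line: an empty version
# still yields the real directory /opt/.
afar_root_unguarded() {
    printf '/opt/%s\n' "$1"
}

# Guarded variant: ${1:+...} expands to nothing when $1 is empty,
# so no path is produced unless a version was supplied.
afar_root_guarded() {
    printf '%s\n' "${1:+/opt/$1}"
}

afar_root_unguarded ""             # prints /opt/
afar_root_guarded ""               # prints an empty line
afar_root_guarded "demo-version"   # prints /opt/demo-version
```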

2. flang_rt.hostdevice removed from Cray CCE GPU (OpenMP) link path

CMakeLists.txt, the changed hunk around lines 703–710:

-                find_package(hipfort COMPONENTS hip CONFIG REQUIRED)
-                target_link_libraries(${a_target} PRIVATE hipfort::hip hipfort::hipfort-amdgcn flang_rt.hostdevice)

The post-change Cray block (context lines 704–705) now links only hipfort::hip hipfort::hipfort-amdgcn; flang_rt.hostdevice was moved exclusively to the new elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang") block. Frontier builds using PrgEnv-cray (compiler ID "Cray") with OpenMP GPU offload previously linked flang_rt.hostdevice. If Cray CCE requires that library for device-code linking on AMD GPUs, this removal is a regression. The change should be validated against a live Frontier build before merging.

ENV OLCF_AFAR_ROOT=/opt/${AFAR_VERSION} expanded to /opt/ in cpu/gpu
images because those builds supply no AFAR_VERSION. Introduce a
dedicated OLCF_AFAR_ROOT build-arg (default "") so cpu/gpu images get
an empty var and only the AMD build passes the real path.