
Fix XProf GCS upload bug in max_utils.Profiler #409

Open
eltsai wants to merge 1 commit into main from xprof_upload_fix



eltsai commented May 15, 2026

Creating this PR based on CL:

This change fixes a bug where XProf traces were not uploaded to GCS when the profiler was used manually via start() and stop() calls, which is the case in most MaxDiffusion trainers.

The GCS upload logic was previously located in Profiler.__exit__ (used by the context manager), but not in Profiler.stop(). This change moves the upload logic to stop() and guards it with _jax_profiler_enabled(self.config) to ensure it only runs when JAX profiling was active.
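The shape of the fix described above can be sketched roughly as follows. This is a minimal illustration, not the actual `max_utils.Profiler`: `ProfilerConfig`, its fields, the `upload_fn` parameter, and the body of `_jax_profiler_enabled` are hypothetical stand-ins, and having `__exit__` delegate to `stop()` is an assumption about how both code paths might share the upload.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProfilerConfig:
    # Hypothetical fields for illustration; not the real MaxDiffusion config.
    enable_profiler: bool = True
    local_trace_dir: str = "/tmp/xprof"
    gcs_trace_dir: str = "gs://example-bucket/xprof"


def _jax_profiler_enabled(config: ProfilerConfig) -> bool:
    # Stand-in for the guard named in the PR; the real check may differ.
    return config.enable_profiler


class Profiler:
    """Sketch of a profiler whose stop() owns the GCS upload."""

    def __init__(self, config: ProfilerConfig,
                 upload_fn: Callable[[str, str], None]):
        self.config = config
        self._upload_fn = upload_fn  # injected so the sketch needs no GCS access

    def start(self) -> None:
        # The real class would start a JAX trace here.
        pass

    def stop(self) -> None:
        # The real class would stop the JAX trace here.
        # Upload lives in stop(), so manual start()/stop() users get it too.
        if _jax_profiler_enabled(self.config):
            self._upload_fn(self.config.local_trace_dir,
                            self.config.gcs_trace_dir)

    def __enter__(self) -> "Profiler":
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        # Assumed: the context manager reuses stop() rather than
        # carrying its own copy of the upload logic.
        self.stop()
        return False
```

With this layout, a trainer that calls `start()` and `stop()` directly and one that uses `with Profiler(...)` both reach the same upload call, and a disabled profiler never triggers it.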

Also:

  • Added unit tests in profiler_test.py to verify GCS upload for both manual and context manager usage.
  • Added profiler_test target to BUILD.
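The contents of profiler_test.py are not shown in this thread. As a rough sketch of what tests covering both usage styles might look like, the following uses `unittest.mock` against a hypothetical `SketchProfiler` stand-in; the class, its `enabled` flag, and the `upload_fn` parameter are assumptions made for illustration, not the real `max_utils.Profiler` API.

```python
import unittest
from unittest import mock


class SketchProfiler:
    """Hypothetical stand-in for max_utils.Profiler; not the real class."""

    def __init__(self, enabled, upload_fn):
        self.enabled = enabled
        self.upload_fn = upload_fn

    def start(self):
        pass  # The real profiler would start a trace here.

    def stop(self):
        # Upload only when profiling was active.
        if self.enabled:
            self.upload_fn()

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop()
        return False


class ProfilerUploadTest(unittest.TestCase):

    def test_manual_start_stop_uploads(self):
        upload = mock.Mock()
        profiler = SketchProfiler(enabled=True, upload_fn=upload)
        profiler.start()
        profiler.stop()
        upload.assert_called_once_with()

    def test_context_manager_uploads(self):
        upload = mock.Mock()
        with SketchProfiler(enabled=True, upload_fn=upload):
            pass
        upload.assert_called_once_with()

    def test_disabled_profiler_skips_upload(self):
        upload = mock.Mock()
        profiler = SketchProfiler(enabled=False, upload_fn=upload)
        profiler.start()
        profiler.stop()
        upload.assert_not_called()
```

Saved as a module, a file like this can be run with `python -m unittest`. Injecting the upload callable keeps the tests free of real GCS access.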

eltsai self-assigned this May 15, 2026
eltsai requested a review from entrpn as a code owner May 15, 2026 22:18

eltsai force-pushed the xprof_upload_fix branch 2 times, most recently from e4b9135 to d6c9c2a May 15, 2026 22:25

eltsai commented May 15, 2026

I also disabled the profiler for the first two runs. XProf now works as expected:

gcloud storage ls gs://elisatsai-wan-maxdiffusion/wan22/8tpu/ulysses-custom-test-xprof3/ulysses-custom-test-xprof3/tensorboard/plugins/profile/2026_05_15_22_27_45/t1v-n-cd58a6ba-w-0.xplane.pb

gs://elisatsai-wan-maxdiffusion/wan22/8tpu/ulysses-custom-test-xprof3/ulysses-custom-test-xprof3/tensorboard/plugins/profile/2026_05_15_22_27_45/t1v-n-cd58a6ba-w-0.xplane.pb

eltsai force-pushed the xprof_upload_fix branch from d6c9c2a to 17cff6a May 15, 2026 22:43