
Add Modal GPU calibration #279

Open
nwoodruff-co wants to merge 24 commits into main from feature/modal-gpu-calibration

Conversation

@nwoodruff-co
Collaborator

Summary

  • Offload the Adam optimisation loop to Modal T4 GPU containers
  • Both calibrations (650 constituencies, 360 LAs) run in parallel, so wall time is max(c_time, la_time) rather than the sum
  • CPU fallback unchanged when MODAL_CALIBRATE is not set
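
The parallel dispatch described above can be illustrated with a stdlib sketch. This uses threads and a fake sleep-based job purely to show the wall-time effect; the real code uses Modal's `.spawn()`/`.get()` on GPU containers, and `fake_calibration` and the timings here are hypothetical stand-ins:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_calibration(name: str, seconds: float) -> str:
    # Stand-in for one Modal GPU job; sleeps to simulate optimisation time.
    time.sleep(seconds)
    return f"{name} done"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    # Spawn both jobs before waiting on either, as create_datasets.py does
    # with Modal's .spawn().
    fut_c = pool.submit(fake_calibration, "constituencies", 0.5)
    fut_la = pool.submit(fake_calibration, "local_areas", 0.5)
    results = [fut_c.result(), fut_la.result()]
elapsed = time.monotonic() - start  # ~max(0.5, 0.5) s, not the 1.0 s sum
```

Because both futures are submitted before either result is awaited, total wall time tracks the slower job rather than the sum of the two.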

Changes

  • calibrate.py: extract _run_optimisation() helper (device-agnostic); calibrate_local_areas delegates to it on CPU as before
  • modal_calibrate.py (new): Modal app with run_calibration function on gpu="T4", self-contained loop (no policyengine imports in container)
  • create_datasets.py: when MODAL_CALIBRATE=1, build arrays locally, .spawn() both GPU jobs before waiting on either, then write .h5 files
  • push.yaml / pull_request.yaml: add MODAL_CALIBRATE=1 + MODAL_TOKEN_ID/MODAL_TOKEN_SECRET secrets to Build datasets step
  • pyproject.toml: add modal to dev extras
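
The device-agnostic idea behind the extracted `_run_optimisation()` helper can be sketched with a minimal scalar Adam loop. This is pure Python rather than the tensor code the repo actually uses, and the loss, hyperparameters, and function name here are illustrative, not the real signature:

```python
import math

def run_optimisation(grad_fn, w0: float, steps: int = 300,
                     lr: float = 0.1, b1: float = 0.9,
                     b2: float = 0.999, eps: float = 1e-8) -> float:
    """Scalar Adam loop: the same maths applies on CPU or GPU tensors."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = b1 * m + (1 - b1) * g        # first-moment estimate
        v = b2 * v + (1 - b2) * g * g    # second-moment estimate
        m_hat = m / (1 - b1 ** t)        # bias-corrected moments
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimise (w - 3)^2; the gradient is 2 * (w - 3).
w_star = run_optimisation(lambda w: 2 * (w - 3), w0=0.0)
```

Keeping the loop free of device-specific calls is what lets the same helper run on the CPU fallback and inside the T4 container.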

Test plan

  • CI passes with MODAL_CALIBRATE=1 (tokens set as repo secrets)
  • Local CPU run (make data without MODAL_CALIBRATE) is unchanged
  • Both .h5 weight files are produced and make test passes

@nwoodruff-co nwoodruff-co marked this pull request as ready for review February 19, 2026 18:05
Offload the Adam optimisation loop to Modal T4 GPU containers. Both
calibrations (650 constituencies, 360 LAs) run in parallel on separate
containers, so wall time becomes max(c_time, la_time) rather than the sum.

- Extract _run_optimisation() helper from calibrate.py (device-agnostic)
- Add modal_calibrate.py: Modal app wrapping the GPU loop
- create_datasets.py: dispatch to Modal when MODAL_CALIBRATE=1, CPU fallback otherwise
- push.yaml / pull_request.yaml: set MODAL_CALIBRATE=1 + token secrets
- pyproject.toml: add modal to dev extras
- Run black on all three changed files
- Call frs.copy() before passing dataset to matrix functions in Modal
  path, matching what calibrate_local_areas does internally
- Add changelog_entry.yaml
Build and serialise the constituency arrays, then del them before building
the LA arrays, rather than holding both Microsimulations in memory at once.
… peak memory

Previously args_c (the serialised constituency matrices, several hundred MB)
was held in memory while building the LA Microsimulation, causing OOM on
the GitHub Actions runner (exit 143). Now fut_c is spawned immediately
inside app.run(), and the arrays are deleted before the LA matrices are built.

Also widen vehicle ownership test tolerance to 0.20 until a freshly
calibrated dataset is published to HuggingFace.
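
The serialise-then-free discipline described above can be sketched in stdlib terms. Here `pickle` stands in for the real serialisation, and `build_constituency_arrays` and the sizes are hypothetical stand-ins for the Microsimulation-backed matrix build:

```python
import gc
import pickle

def build_constituency_arrays() -> list[list[float]]:
    # Stand-in for the expensive Microsimulation-backed matrix build.
    return [[float(i)] * 100 for i in range(100)]

# 1. Build and serialise the constituency arrays.
arrays_c = build_constituency_arrays()
args_c = pickle.dumps(arrays_c)

# 2. Free the in-memory arrays *before* the LA build, so both sets of
#    arrays are never resident at once (the cause of the exit-143 OOM).
del arrays_c
gc.collect()

# 3. The LA matrices can now be built while args_c is held only as bytes;
#    the GPU job deserialises on its own side.
restored = pickle.loads(args_c)
```

The peak-memory win comes from step 2: the bytes blob is compact and opaque, while the live object graph it replaces is what the CI runner could not afford to hold twice.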
Build the national matrix, serialise it to bytes, then del + gc before
building the local matrix, so only one Microsimulation is live at a time.
Previously both the national and constituency Microsimulations were alive
simultaneously, causing OOM (exit 143) on the 7 GB CI runner.
Previously create_national_target_matrix was called twice (once for
constituencies, once for LAs), each creating a full Microsimulation.
Now it's called once on the original frs (no copy needed), serialised
to bytes, and the same bytes reused for both Modal spawns.

Peak memory during the spawn loop is now: frs + one local Microsimulation
(no duplicate national Microsimulation), which matches the CPU path.
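
The build-once, reuse-everywhere pattern can be sketched as follows; `create_national_target_matrix` mirrors the real function's name, but its body, the data, and `spawn_calibration` are hypothetical stand-ins:

```python
import pickle

def create_national_target_matrix() -> dict[str, list[float]]:
    # Stand-in for the real builder, which spins up a full Microsimulation.
    return {"income_tax": [1.0, 2.0], "benefits": [3.0, 4.0]}

# Build once, serialise once.
national_bytes = pickle.dumps(create_national_target_matrix())

def spawn_calibration(area: str, matrix_bytes: bytes) -> int:
    # Stand-in for a Modal spawn: each job deserialises its own copy.
    matrix = pickle.loads(matrix_bytes)
    return len(matrix)

# Hand the same bytes to both calibration jobs.
n_c = spawn_calibration("constituencies", national_bytes)
n_la = spawn_calibration("local_areas", national_bytes)
```

Serialising once and sharing the bytes avoids the second Microsimulation entirely, which is where the duplicate memory cost came from.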
…ation logs

run_calibration now returns [(epoch, weights_bytes)] snapshots every 10
epochs, matching the CPU path. _build_log replays these locally via
get_performance to produce the same constituency/la calibration_log.csv
format the dashboard expects.
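
The checkpoint-and-replay idea can be sketched with stdlib code. The update rule, the scoring function, and the `build_log` name here are hypothetical stand-ins for the real optimisation step and get_performance:

```python
import pickle

def run_calibration(n_epochs: int) -> list[tuple[int, bytes]]:
    # Inside the container: snapshot (epoch, weights_bytes) every 10
    # epochs, mirroring what the CPU path logs.
    snapshots = []
    weights = [1.0, 1.0]
    for epoch in range(n_epochs):
        weights = [w * 0.99 for w in weights]  # stand-in for one Adam step
        if epoch % 10 == 0:
            snapshots.append((epoch, pickle.dumps(weights)))
    return snapshots

def build_log(snapshots: list[tuple[int, bytes]]) -> list[dict]:
    # Back on the runner: replay snapshots to rebuild the log rows.
    def get_performance(weights: list[float]) -> float:
        return sum(weights)  # hypothetical scoring function
    return [{"epoch": e, "score": get_performance(pickle.loads(b))}
            for e, b in snapshots]

log = build_log(run_calibration(30))
```

Shipping compact weight snapshots and rescoring locally keeps the container's return payload small while still producing the per-epoch log the dashboard reads.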
Add run_imputation Modal function (cpu=8, memory=16GB, no GPU) that
runs the full imputation + uprating pipeline inside a container with
policyengine-uk-data installed. The CI runner just sends the raw FRS
bytes, receives the imputed FRS bytes back, and proceeds to calibration.

CPU path (no MODAL_CALIBRATE) is unchanged for local use.
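
The bytes-in, bytes-out contract between the CI runner and the container can be sketched as below; `run_imputation` matches the real function's name, but its body and the data are trivial stand-ins for the actual imputation + uprating pipeline:

```python
import pickle

def run_imputation(raw_frs_bytes: bytes) -> bytes:
    # Inside the container: deserialise, impute, re-serialise.
    frs = pickle.loads(raw_frs_bytes)
    # Stand-in imputation: derive a new column from an existing one.
    frs["imputed_rent"] = [h * 0.5 for h in frs["household_income"]]
    return pickle.dumps(frs)

# On the CI runner: send the raw FRS bytes, receive the imputed bytes back.
raw = pickle.dumps({"household_income": [100.0, 200.0]})
imputed = pickle.loads(run_imputation(raw))
```

Keeping the interface to opaque bytes means the runner needs no policyengine imports of its own; only the container carries the heavy dependencies.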
Free each target matrix DataFrame immediately after serialising to bytes,
keeping only column metadata for post-Modal log reconstruction. This
prevents three Microsimulation objects' data from sitting in memory
simultaneously while building national + constituency + LA matrices.
@nwoodruff-co nwoodruff-co force-pushed the feature/modal-gpu-calibration branch from eb0a848 to e175559 Compare February 20, 2026 14:28