fix(main): store label for GSP direct-MSE + unwrap TD3 tuple + bump GSP cadence #15
Merged
Three coordinated changes supporting the GSP direct-MSE training path from
GSP-RL PR #24.
1. `store_gsp_transition` call sites — pass `label` as the 2nd arg
Previously non-attention variants stored the previous prediction
(`old_heading_gsp[i]`) in the action field so the DDPG critic could
evaluate Q(state, prediction). The DDPG actor-critic training path is
gone, so the "action" slot is now used for the supervised target.
The attention variant already passed `label` here; now all branches do.
Three call sites updated: the independent-learning training store, the
shared-model training store, and the global-knowledge branch.
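A minimal sketch of the call-site change. The buffer class, argument names, and surrounding variables are hypothetical stand-ins; the real structures in Main.py may differ.

```python
# Hypothetical stand-in for the GSP replay buffer; names are illustrative only.
class GSPReplayBuffer:
    def __init__(self):
        self.transitions = []

    def store_gsp_transition(self, state, action, reward, next_state):
        # Under direct-MSE training the "action" slot carries the supervised target.
        self.transitions.append((state, action, reward, next_state))

buf = GSPReplayBuffer()
state, label, reward, next_state = [0.1, 0.2], 0.75, 0.0, [0.3, 0.4]

# Before (non-attention variants): the previous prediction went in the 2nd slot,
# e.g. buf.store_gsp_transition(state, old_heading_gsp[i], reward, next_state).
# After: pass the label so the MSE step reads it as the regression target.
buf.store_gsp_transition(state, label, reward, next_state)
```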
2. TD3 tuple-loss crash fix
`learn_TD3` returns `(0, 0)` on non-actor-update steps (where the critic
updated but the actor did not, because `UPDATE_ACTOR_ITER > 1`). Main.py
was storing this tuple in the per-step `loss` timeseries passed to
`hdf5_writer.write_episode`, which then crashed at episode end with:
`ValueError: setting an array element with a sequence. The detected
shape was (4500,) + inhomogeneous part.`
This crashed `td3_gsp_s123` in the live diagnostic batch. The tuple is
now unwrapped to a scalar at both learn call sites (independent + shared).
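A sketch of the unwrap, assuming `learn_TD3` returns `(0, 0)` on steps where only the critic updates and a scalar loss otherwise; the real return convention and signature may differ.

```python
# Toy stand-in for learn_TD3; the real method trains networks, this just
# reproduces the return-type asymmetry described in the PR.
def learn_TD3(step, update_actor_iter=2):
    critic_loss = 0.123  # placeholder scalar loss
    if step % update_actor_iter != 0:
        return (0, 0)  # critic updated, actor skipped
    return critic_loss

loss_timeseries = []
for step in range(4):
    result = learn_TD3(step)
    # Unwrap the tuple so the timeseries stays homogeneous (scalars only),
    # which keeps the later array conversion from raising ValueError.
    loss = result[0] if isinstance(result, tuple) else result
    loss_timeseries.append(loss)
```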
3. `GSP_LEARNING_FREQUENCY`: 500 → 4
Under direct-MSE GSP training the loss is supervised and cheap — there's
no need for the 500-step cadence that was a hedge for the actor-critic
path's sparse feedback. 4 matches `LEARN_EVERY`, so the GSP predictor
updates in lockstep with the primary network and gets ~125× more
gradient steps per episode.
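The cadence arithmetic, using the constant names from this PR (the values below come from the description; nothing else is taken from the codebase):

```python
LEARN_EVERY = 4
GSP_LEARNING_FREQUENCY = 4   # was 500

# Moving from a 500-step cadence to a 4-step cadence multiplies the number
# of GSP gradient steps per episode by 500 / 4 = 125.
speedup = 500 // GSP_LEARNING_FREQUENCY
```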
New contract test: tests/test_main_gsp_contract.py statically asserts that
every store_gsp_transition call in Main.py passes the label as its 2nd arg.
Catches regressions in the call signature.
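A hedged sketch of how such a static contract check can work; the actual implementation in tests/test_main_gsp_contract.py may differ, and the snippet below stands in for Main.py's source.

```python
import ast

# Illustrative stand-in for the contents of Main.py.
main_py_snippet = """
buffer.store_gsp_transition(state, label, reward, next_state)
agent.memory.store_gsp_transition(obs, label, r, next_obs)
"""

tree = ast.parse(main_py_snippet)
calls = [
    node for node in ast.walk(tree)
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Attribute)
    and node.func.attr == "store_gsp_transition"
]
for call in calls:
    second_arg = call.args[1]
    # The contract: the 2nd positional argument must be the variable `label`.
    assert isinstance(second_arg, ast.Name) and second_arg.id == "label"
```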
Full RL-CT suite: 112/112 pass (excluding pre-existing test_nan_guards.py
import error unrelated to this PR).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Companion PR (must merge together)
GSP-RL #24 — `learn_gsp_mse` method + `learn_gsp` dispatch rewrite. The two PRs are runtime-coupled. Merging only one leaves the GSP training path reading the wrong field of the replay buffer.
🤖 Generated with Claude Code