Skip to content

fix(spp_programs): release stuck cycle/program lock when async pipeline fails#188

Open
reichie020212 wants to merge 3 commits into19.0from
fix/cycle-stale-lock-recovery
Open

fix(spp_programs): release stuck cycle/program lock when async pipeline fails#188
reichie020212 wants to merge 3 commits into19.0from
fix/cycle-stale-lock-recovery

Conversation

@reichie020212
Copy link
Copy Markdown
Member

@reichie020212 reichie020212 commented May 5, 2026

Summary

  • Wires a paired main_job.on_error(...) callback at every async site so cycle.is_locked (and program.is_locked) is cleared on both success and failure paths — not only on success.
  • Adds mark_*_as_failed companions for every existing mark_*_as_done, and hardens the success-path methods so a chatter exception can't strand the lock either.
  • Adds an admin-only action_force_unlock (group base.group_system) on spp.cycle and spp.program, with audit chatter, for the residual case where neither callback fires at all (e.g. server killed mid-operation, or pre-fix data still stuck on production).

Why

The async entitlement / cycle / payment / enrollment pipelines acquire is_locked=True before scheduling a queue.job group, and only release the flag inside the on_done callback. Today, if any child job in the group fails, the on_done is cascade-failed and never executes — the cycle stays "Operation in progress" forever, even though no active jobs are left to drive it. The view (cycle_view.xml:238-249) renders the alert purely from is_locked and never inspects the queue, so users see a stuck warning.

Reported externally: a 4Ps cycle was stuck in "Operation in progress: Set entitlements to pending validation for cycle" while the Queue Jobs view (filtered to active states waiting/pending/started) showed nothing — the failed jobs were in failed state, hidden by the default filter.

Sites wired

main_job.on_error(...) added next to every existing main_job.on_done(...) — 9 call sites:

  • entitlement_manager_base.py: _set_pending_validation_entitlements_async, _validate_entitlements_async, _cancel_entitlements_async
  • entitlement_manager_inkind.py: _set_pending_validation_entitlements_async, _validate_entitlements_async
  • entitlement_manager_cash.py: _validate_entitlements_async
  • cycle_manager_base.py: _check_eligibility_async, _prepare_entitlements_async, _add_beneficiaries_async
  • payment_manager.py: _prepare_payments_async, _send_payments_async
  • program_manager.py: _enroll_eligible_registrants_async

Hardening of the success path

Every mark_*_as_done method now writes is_locked=False before calling message_post, with message_post wrapped in try/except that logs to _logger.exception. Previously a chatter failure here would also strand the lock.

Force Unlock escape hatch

When neither callback can fire — server killed mid-operation, or the lock was set before this fix landed — a manager-only "Force Unlock" button appears next to the "Operation in progress" alert. It clears is_locked / locked_reason and posts an audit chatter line naming the user and the prior reason.

Dependency

Requires job_worker >= 19.0.1.1.0 for on_error() / run_on_failure semantics. Companion PR: https://github.com/OpenSPP/odoo-job-worker/pull/new/fix/job-worker-on-error-callback. Manifest bumped 19.0.2.0.11 → 19.0.2.1.0.

…ne fails

Async entitlement / cycle / payment / enrollment pipelines acquired
is_locked=True before scheduling a queue.job group, but only released
the flag inside the on_done callback. If any child job failed, the
on_done was cascade-failed (never executed) and the cycle remained
"Operation in progress" forever — even though no active queue jobs
were left to drive it. Same vulnerability on spp.program for the
enrollment flow.

Wires a paired on_error() callback at every async site so the lock is
cleared on both success and failure paths. Adds mark_*_as_failed
companions for each existing mark_*_as_done, and hardens the success
path so a chatter exception can't strand the lock either.

Adds an admin-only action_force_unlock action (with audit chatter) on
spp.cycle and spp.program for the residual case where neither callback
fires at all (e.g. server killed mid-operation, pre-fix data).

Requires job_worker >= 19.0.1.1.0 for the on_error() / run_on_failure
semantics. Bumps spp_programs to 19.0.2.1.0.

Signed-off-by: Red <redickbutay02@gmail.com>
@reichie020212 reichie020212 marked this pull request as draft May 5, 2026 07:17
@codecov
Copy link
Copy Markdown

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 71.84466% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.55%. Comparing base (0627ab1) to head (da15f26).
⚠️ Report is 21 commits behind head on 19.0.

Files with missing lines Patch % Lines
spp_programs/models/managers/cycle_manager_base.py 70.00% 12 Missing ⚠️
spp_programs/models/managers/payment_manager.py 57.14% 6 Missing ⚠️
...ograms/models/managers/entitlement_manager_base.py 66.66% 5 Missing ⚠️
spp_programs/models/managers/program_manager.py 76.47% 4 Missing ⚠️
...rams/models/managers/entitlement_manager_inkind.py 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             19.0     #188      +/-   ##
==========================================
- Coverage   71.63%   71.55%   -0.08%     
==========================================
  Files         933      941       +8     
  Lines       55370    55030     -340     
==========================================
- Hits        39664    39379     -285     
+ Misses      15706    15651      -55     
Flag Coverage Δ
spp_api_v2 80.33% <ø> (ø)
spp_api_v2_change_request 66.85% <ø> (ø)
spp_api_v2_cycles 71.12% <ø> (ø)
spp_api_v2_data 64.41% <ø> (ø)
spp_api_v2_entitlements 70.19% <ø> (ø)
spp_api_v2_gis 71.52% <ø> (ø)
spp_api_v2_products 66.27% <ø> (ø)
spp_api_v2_service_points 70.94% <ø> (ø)
spp_api_v2_simulation 71.12% <ø> (ø)
spp_api_v2_vocabulary 57.26% <ø> (ø)
spp_audit 72.60% <ø> (+0.06%) ⬆️
spp_base_common 90.26% <ø> (ø)
spp_case_entitlements 97.61% <ø> (ø)
spp_case_programs 97.14% <ø> (ø)
spp_cel_event 85.11% <ø> (ø)
spp_claim_169 58.11% <ø> (ø)
spp_cr_type_assign_program 91.17% <ø> (?)
spp_dci_client_dr 55.87% <ø> (ø)
spp_programs 64.77% <71.84%> (+0.28%) ⬆️
spp_security 66.66% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
spp_programs/__manifest__.py 0.00% <ø> (ø)
spp_programs/models/cycle.py 65.74% <100.00%> (+0.47%) ⬆️
...ograms/models/managers/entitlement_manager_cash.py 63.00% <100.00%> (-0.32%) ⬇️
spp_programs/models/programs.py 87.96% <100.00%> (+0.43%) ⬆️
...rams/models/managers/entitlement_manager_inkind.py 54.76% <0.00%> (-0.66%) ⬇️
spp_programs/models/managers/program_manager.py 90.07% <76.47%> (-2.85%) ⬇️
...ograms/models/managers/entitlement_manager_base.py 72.46% <66.66%> (+0.12%) ⬆️
spp_programs/models/managers/payment_manager.py 58.57% <57.14%> (+1.61%) ⬆️
spp_programs/models/managers/cycle_manager_base.py 71.38% <70.00%> (+1.15%) ⬆️

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a comprehensive lock recovery mechanism for asynchronous pipelines across programs, cycles, and payments. Key changes include the addition of on_error callbacks to release 'Operation in progress' locks during job failures, wrapping chatter notifications in try-except blocks to ensure locks are cleared even if messaging fails, and introducing a 'Force Unlock' button for administrators. New unit tests verify these recovery paths. Review feedback suggests adding self.ensure_one() to the eligibility check completion and failure methods in the cycle manager to maintain consistency with other manager methods.

Comment thread spp_programs/models/managers/cycle_manager_base.py
Comment thread spp_programs/models/managers/cycle_manager_base.py
Two issues:

1. test_async_lock_recovery setUp didn't satisfy spp_cycle's NOT NULL
   start_date constraint and didn't request default managers, so 7 of
   the new tests errored on setUp. Helpers now pass start_date /
   end_date and use create_default_managers=True context (as the
   create-program wizards do) so get_manager(...) returns a record
   instead of None.

2. mark_check_eligibility_as_done and mark_check_eligibility_as_failed
   now call self.ensure_one() to match the rest of the mark_* family —
   raised by gemini-code-assist review.

Signed-off-by: Red <redickbutay02@gmail.com>
@reichie020212 reichie020212 marked this pull request as ready for review May 5, 2026 08:12
Copy link
Copy Markdown
Contributor

@emjay0921 emjay0921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — the core diagnosis (cascade-failed on_done strands the lock) is correct, and the fix surface is comprehensive. CI is green except codecov/patch, which is a coverage-percentage gate on the new mark_*_as_failed methods rather than a real failure — I wouldn't block on it.

What I verified:

  • on_error() wiring is complete. Every main_job.on_done(...) callsite (cycle import / prepare_entitlement / check_eligibility, entitlement set_pending / validate / cancel — base, cash, in-kind variants, payment prepare / send, program enroll_eligible) now has a paired on_error(...). No site I can find ships the on_done without the on_error.
  • mark_*_as_done hardening is the right call. Switching to cycle.write({"is_locked": False, "locked_reason": False}) before message_post, then catching exceptions around the chatter post, means even a chatter-side failure (e.g. transient mail-thread error) can't leave the lock set. That's the same bug pattern the on_error fix targets, just on the success path.
  • action_force_unlock is appropriately gated. groups="base.group_system" on the buttons + the confirm dialog + audit chatter line is exactly what an escape hatch should look like. The no-op-when-unlocked branch is also tested.
  • Tests cover all four manager classes plus the cycle/program force-unlock action. They use TransactionCase so rollback is clean. They don't drive a real async pipeline through to failure (which would need the full job_worker infra) — that's acceptable; the unit-level guarantees on the callbacks + the sites being correctly wired is a reasonable trade-off.

Three small notes (none blocking):

  1. Stats-refresh asymmetry on the program path. mark_enroll_eligible_as_done refreshes _compute_eligible_beneficiary_count + _compute_beneficiary_count; the as_failed counterpart doesn't. If a partial enrollment lands before the group fails, the displayed counts could be a touch stale until the next compute trigger. Probably not worth a follow-up given how rare partial async failure should be in practice, but worth noting.
  2. except Exception around message_post is broad. Intentional and correct here (better to log the chatter problem than strand the lock), but a Sentry-grade observability layer would want a specific exception class to filter on. Future thing.
  3. programs_view.xml Force Unlock button indentation is one level shallow — the new <button> opens at the column of <div>, not aligned with the prior <button>. Pre-commit accepted it, so non-blocking, but a prettier re-flow would tidy it.

Approving.

…r codecov

Adds 17 tests across 4 new classes:

- TestMarkAsDoneHardenedPaths (6): exercises every mark_*_as_done body for
  cycle (import / prepare / check_eligibility), entitlement, payment, and
  program. These are the lock-clear-before-chatter + try/except branches
  added in this PR that codecov reported as 0% covered.
- TestMarkAsFailedChatterContent (5): pins each mark_*_as_failed test to
  a body-content assertion so the cycle.message_post(body=msg) line carries
  a behavioral check.
- TestForceUnlockOnCycle (4): noop, audit-records-user, and (none)
  placeholder branches for spp.cycle.action_force_unlock.
- TestForceUnlockOnProgramExtra (2): noop and (none) placeholder branches
  for spp.program.action_force_unlock that the existing class skipped.

All 25 tests in the file pass: 0 failed, 0 error(s) of 25 tests.
@emjay0921
Copy link
Copy Markdown
Contributor

@gonzalesedwin1123 — pushed da15f26f adding 17 more tests to push patch coverage above the codecov threshold. CI is fully green now (codecov/patch ✅).

What was added (all in spp_programs/tests/test_async_lock_recovery.py):

  • TestMarkAsDoneHardenedPaths (6 tests) — exercises every mark_*_as_done body for cycle (mark_import_as_done / mark_prepare_entitlement_as_done / mark_check_eligibility_as_done), entitlement, payment, and program. These are the lock-clear-before-chatter + try/except branches added in this PR that codecov reported as 0% covered.
  • TestMarkAsFailedChatterContent (5 tests) — pins each mark_*_as_failed body to a content assertion (assertIn on the chatter body) so the cycle.message_post(body=msg) line carries a behavioral check, not just a smoke test on lock state.
  • TestForceUnlockOnCycle (4 tests) — covers the cycle-side action_force_unlock: the noop-when-not-locked branch, the audit-records-current-user branch, and the (none) placeholder branch when the lock had no reason set.
  • TestForceUnlockOnProgramExtra (2 tests) — same noop and (none) placeholder branches on spp.program.action_force_unlock that the existing class skipped.

What I deliberately didn't cover: the main_job.on_error(...) registration sites in the async launchers (_check_eligibility_async, _prepare_entitlements_async, etc.). Those reference Delayable.on_error() from odoo-job-worker PR #8, which hasn't merged to 19.0 yet. Tests that invoke them would AttributeError. They'll naturally pick up coverage once #8 lands and OpenSPP2 picks up the new job_worker.

Local results: 0 failed, 0 error(s) of 25 tests (was 8, now 25). Pre-commit (ruff lint, ruff format, pylint full suite, bandit, OpenSPP custom hooks) clean.

Mind giving this another look when you have a moment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants