fix(spp_programs): release stuck cycle/program lock when async pipeline fails by reichie020212 · Pull Request #188 · OpenSPP/OpenSPP2

reichie020212 · 2026-05-05T07:16:27Z

Summary

Wires a paired main_job.on_error(...) callback at every async site so cycle.is_locked (and program.is_locked) is cleared on both success and failure paths — not only on success.
Adds mark_*_as_failed companions for every existing mark_*_as_done, and hardens the success-path methods so a chatter exception can't strand the lock either.
Adds an admin-only action_force_unlock (group base.group_system) on spp.cycle and spp.program, with audit chatter, for the residual case where neither callback fires at all (e.g. server killed mid-operation, or pre-fix data still stuck on production).

Why

The async entitlement / cycle / payment / enrollment pipelines acquire is_locked=True before scheduling a queue.job group, and only release the flag inside the on_done callback. Today, if any child job in the group fails, the on_done is cascade-failed and never executes — the cycle stays "Operation in progress" forever, even though no active jobs are left to drive it. The view (cycle_view.xml:238-249) renders the alert purely from is_locked and never inspects the queue, so users see a stuck warning.

Reported externally: a 4Ps cycle was stuck in "Operation in progress: Set entitlements to pending validation for cycle" while the Queue Jobs view (filtered to active states waiting/pending/started) showed nothing — the failed jobs were in failed state, hidden by the default filter.

Sites wired

main_job.on_error(...) added next to every existing main_job.on_done(...) — 9 call sites:

entitlement_manager_base.py: _set_pending_validation_entitlements_async, _validate_entitlements_async, _cancel_entitlements_async
entitlement_manager_inkind.py: _set_pending_validation_entitlements_async, _validate_entitlements_async
entitlement_manager_cash.py: _validate_entitlements_async
cycle_manager_base.py: _check_eligibility_async, _prepare_entitlements_async, _add_beneficiaries_async
payment_manager.py: _prepare_payments_async, _send_payments_async
program_manager.py: _enroll_eligible_registrants_async

Hardening of the success path

Every mark_*_as_done method now writes is_locked=False before calling message_post, with message_post wrapped in try/except that logs to _logger.exception. Previously a chatter failure here would also strand the lock.

Force Unlock escape hatch

When neither callback can fire — server killed mid-operation, or the lock was set before this fix landed — a manager-only "Force Unlock" button appears next to the "Operation in progress" alert. It clears is_locked / locked_reason and posts an audit chatter line naming the user and the prior reason.

Dependency

Requires job_worker >= 19.0.1.1.0 for on_error() / run_on_failure semantics. Companion PR: https://github.com/OpenSPP/odoo-job-worker/pull/new/fix/job-worker-on-error-callback. Manifest bumped 19.0.2.0.11 → 19.0.2.1.0.

…ne fails Async entitlement / cycle / payment / enrollment pipelines acquired is_locked=True before scheduling a queue.job group, but only released the flag inside the on_done callback. If any child job failed, the on_done was cascade-failed (never executed) and the cycle remained "Operation in progress" forever — even though no active queue jobs were left to drive it. Same vulnerability on spp.program for the enrollment flow. Wires a paired on_error() callback at every async site so the lock is cleared on both success and failure paths. Adds mark_*_as_failed companions for each existing mark_*_as_done, and hardens the success path so a chatter exception can't strand the lock either. Adds an admin-only action_force_unlock action (with audit chatter) on spp.cycle and spp.program for the residual case where neither callback fires at all (e.g. server killed mid-operation, pre-fix data). Requires job_worker >= 19.0.1.1.0 for the on_error() / run_on_failure semantics. Bumps spp_programs to 19.0.2.1.0. Signed-off-by: Red <redickbutay02@gmail.com>

codecov · 2026-05-05T07:20:35Z

Codecov Report

❌ Patch coverage is 71.84466% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.55%. Comparing base (0627ab1) to head (da15f26).
⚠️ Report is 21 commits behind head on 19.0.

Files with missing lines	Patch %	Lines
spp_programs/models/managers/cycle_manager_base.py	70.00%	12 Missing ⚠️
spp_programs/models/managers/payment_manager.py	57.14%	6 Missing ⚠️
...ograms/models/managers/entitlement_manager_base.py	66.66%	5 Missing ⚠️
spp_programs/models/managers/program_manager.py	76.47%	4 Missing ⚠️
...rams/models/managers/entitlement_manager_inkind.py	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             19.0     #188      +/-   ##
==========================================
- Coverage   71.63%   71.55%   -0.08%     
==========================================
  Files         933      941       +8     
  Lines       55370    55030     -340     
==========================================
- Hits        39664    39379     -285     
+ Misses      15706    15651      -55

Flag	Coverage Δ
spp_api_v2	`80.33% <ø> (ø)`
spp_api_v2_change_request	`66.85% <ø> (ø)`
spp_api_v2_cycles	`71.12% <ø> (ø)`
spp_api_v2_data	`64.41% <ø> (ø)`
spp_api_v2_entitlements	`70.19% <ø> (ø)`
spp_api_v2_gis	`71.52% <ø> (ø)`
spp_api_v2_products	`66.27% <ø> (ø)`
spp_api_v2_service_points	`70.94% <ø> (ø)`
spp_api_v2_simulation	`71.12% <ø> (ø)`
spp_api_v2_vocabulary	`57.26% <ø> (ø)`
spp_audit	`72.60% <ø> (+0.06%)`	⬆️
spp_base_common	`90.26% <ø> (ø)`
spp_case_entitlements	`97.61% <ø> (ø)`
spp_case_programs	`97.14% <ø> (ø)`
spp_cel_event	`85.11% <ø> (ø)`
spp_claim_169	`58.11% <ø> (ø)`
spp_cr_type_assign_program	`91.17% <ø> (?)`
spp_dci_client_dr	`55.87% <ø> (ø)`
spp_programs	`64.77% <71.84%> (+0.28%)`	⬆️
spp_security	`66.66% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
spp_programs/__manifest__.py	`0.00% <ø> (ø)`
spp_programs/models/cycle.py	`65.74% <100.00%> (+0.47%)`	⬆️
...ograms/models/managers/entitlement_manager_cash.py	`63.00% <100.00%> (-0.32%)`	⬇️
spp_programs/models/programs.py	`87.96% <100.00%> (+0.43%)`	⬆️
...rams/models/managers/entitlement_manager_inkind.py	`54.76% <0.00%> (-0.66%)`	⬇️
spp_programs/models/managers/program_manager.py	`90.07% <76.47%> (-2.85%)`	⬇️
...ograms/models/managers/entitlement_manager_base.py	`72.46% <66.66%> (+0.12%)`	⬆️
spp_programs/models/managers/payment_manager.py	`58.57% <57.14%> (+1.61%)`	⬆️
spp_programs/models/managers/cycle_manager_base.py	`71.38% <70.00%> (+1.15%)`	⬆️

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

gemini-code-assist

Code Review

This pull request implements a comprehensive lock recovery mechanism for asynchronous pipelines across programs, cycles, and payments. Key changes include the addition of on_error callbacks to release 'Operation in progress' locks during job failures, wrapping chatter notifications in try-except blocks to ensure locks are cleared even if messaging fails, and introducing a 'Force Unlock' button for administrators. New unit tests verify these recovery paths. Review feedback suggests adding self.ensure_one() to the eligibility check completion and failure methods in the cycle manager to maintain consistency with other manager methods.

Two issues: 1. test_async_lock_recovery setUp didn't satisfy spp_cycle's NOT NULL start_date constraint and didn't request default managers, so 7 of the new tests errored on setUp. Helpers now pass start_date / end_date and use create_default_managers=True context (as the create-program wizards do) so get_manager(...) returns a record instead of None. 2. mark_check_eligibility_as_done and mark_check_eligibility_as_failed now call self.ensure_one() to match the rest of the mark_* family — raised by gemini-code-assist review. Signed-off-by: Red <redickbutay02@gmail.com>

emjay0921

LGTM — the core diagnosis (cascade-failed on_done strands the lock) is correct, and the fix surface is comprehensive. CI is green except codecov/patch, which is a coverage-percentage gate on the new mark_*_as_failed methods rather than a real failure — I wouldn't block on it.

What I verified:

on_error() wiring is complete. Every main_job.on_done(...) callsite (cycle import / prepare_entitlement / check_eligibility, entitlement set_pending / validate / cancel — base, cash, in-kind variants, payment prepare / send, program enroll_eligible) now has a paired on_error(...). No site I can find ships the on_done without the on_error.
mark_*_as_done hardening is the right call. Switching to cycle.write({"is_locked": False, "locked_reason": False}) before message_post, then catching exceptions around the chatter post, means even a chatter-side failure (e.g. transient mail-thread error) can't leave the lock set. That's the same bug pattern the on_error fix targets, just on the success path.
action_force_unlock is appropriately gated. groups="base.group_system" on the buttons + the confirm dialog + audit chatter line is exactly what an escape hatch should look like. The no-op-when-unlocked branch is also tested.
Tests cover all four manager classes plus the cycle/program force-unlock action. They use TransactionCase so rollback is clean. They don't drive a real async pipeline through to failure (which would need the full job_worker infra) — that's acceptable; the unit-level guarantees on the callbacks + the sites being correctly wired is a reasonable trade-off.

Three small notes (none blocking):

Stats-refresh asymmetry on the program path. mark_enroll_eligible_as_done refreshes _compute_eligible_beneficiary_count + _compute_beneficiary_count; the as_failed counterpart doesn't. If a partial enrollment lands before the group fails, the displayed counts could be a touch stale until the next compute trigger. Probably not worth a follow-up given how rare partial async failure should be in practice, but worth noting.
except Exception around message_post is broad. Intentional and correct here (better to log the chatter problem than strand the lock), but a Sentry-grade observability layer would want a specific exception class to filter on. Future thing.
programs_view.xml Force Unlock button indentation is one level shallow — the new <button> opens at the column of <div>, not aligned with the prior <button>. Pre-commit accepted it, so non-blocking, but a prettier re-flow would tidy it.

Approving.

…r codecov Adds 17 tests across 4 new classes: - TestMarkAsDoneHardenedPaths (6): exercises every mark_*_as_done body for cycle (import / prepare / check_eligibility), entitlement, payment, and program. These are the lock-clear-before-chatter + try/except branches added in this PR that codecov reported as 0% covered. - TestMarkAsFailedChatterContent (5): pins each mark_*_as_failed test to a body-content assertion so the cycle.message_post(body=msg) line carries a behavioral check. - TestForceUnlockOnCycle (4): noop, audit-records-user, and (none) placeholder branches for spp.cycle.action_force_unlock. - TestForceUnlockOnProgramExtra (2): noop and (none) placeholder branches for spp.program.action_force_unlock that the existing class skipped. All 25 tests in the file pass: 0 failed, 0 error(s) of 25 tests.

emjay0921 · 2026-05-05T10:00:24Z

@gonzalesedwin1123 — pushed da15f26f adding 17 more tests to push patch coverage above the codecov threshold. CI is fully green now (codecov/patch ✅).

What was added (all in spp_programs/tests/test_async_lock_recovery.py):

TestMarkAsDoneHardenedPaths (6 tests) — exercises every mark_*_as_done body for cycle (mark_import_as_done / mark_prepare_entitlement_as_done / mark_check_eligibility_as_done), entitlement, payment, and program. These are the lock-clear-before-chatter + try/except branches added in this PR that codecov reported as 0% covered.
TestMarkAsFailedChatterContent (5 tests) — pins each mark_*_as_failed body to a content assertion (assertIn on the chatter body) so the cycle.message_post(body=msg) line carries a behavioral check, not just a smoke test on lock state.
TestForceUnlockOnCycle (4 tests) — covers the cycle-side action_force_unlock: the noop-when-not-locked branch, the audit-records-current-user branch, and the (none) placeholder branch when the lock had no reason set.
TestForceUnlockOnProgramExtra (2 tests) — same noop and (none) placeholder branches on spp.program.action_force_unlock that the existing class skipped.

What I deliberately didn't cover: the main_job.on_error(...) registration sites in the async launchers (_check_eligibility_async, _prepare_entitlements_async, etc.). Those reference Delayable.on_error() from odoo-job-worker PR #8, which hasn't merged to 19.0 yet. Tests that invoke them would AttributeError. They'll naturally pick up coverage once #8 lands and OpenSPP2 picks up the new job_worker.

Local results: 0 failed, 0 error(s) of 25 tests (was 8, now 25). Pre-commit (ruff lint, ruff format, pylint full suite, bandit, OpenSPP custom hooks) clean.

Mind giving this another look when you have a moment?

reichie020212 marked this pull request as draft May 5, 2026 07:17

gemini-code-assist Bot reviewed May 5, 2026

View reviewed changes

Comment thread spp_programs/models/managers/cycle_manager_base.py

Comment thread spp_programs/models/managers/cycle_manager_base.py

reichie020212 marked this pull request as ready for review May 5, 2026 08:12

emjay0921 approved these changes May 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(spp_programs): release stuck cycle/program lock when async pipeline fails#188

fix(spp_programs): release stuck cycle/program lock when async pipeline fails#188
reichie020212 wants to merge 3 commits into19.0from
fix/cycle-stale-lock-recovery

reichie020212 commented May 5, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 5, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

emjay0921 left a comment

Uh oh!

emjay0921 commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

reichie020212 commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Sites wired

Hardening of the success path

Force Unlock escape hatch

Dependency

Uh oh!

codecov Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

emjay0921 left a comment

Choose a reason for hiding this comment

Uh oh!

emjay0921 commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

reichie020212 commented May 5, 2026 •

edited

Loading

codecov Bot commented May 5, 2026 •

edited

Loading