Reset LSPS5 `persistence_in_flight` counter on persist errors #4597

tnull wants to merge 1 commit into lightningdevkit:main

Conversation
👋 Thanks for assigning @TheBlueMatt as a reviewer!

I've reviewed the full diff of the LSPS5 changes. All paths correctly reset the counter. The fix is sound. No issues found.
```rust
let res = self.do_persist().await;
debug_assert!(res.is_err() || self.persistence_in_flight.load(Ordering::Acquire) == 0);
self.persistence_in_flight.store(0, Ordering::Release);
```
No, this races with a second writer that is started at the same time. We should move the loop out of the inner method and into this method to control the flag entirely in this method.
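The race the reviewer describes can be replayed by hand, single-threaded, by interleaving the atomic operations in the problematic order. This is a hypothetical model of the counter as a "pending persist" gate, not the actual handler code — only `persistence_in_flight` and the orderings come from the diff above:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Single-threaded replay of the racy interleaving: writer A's unconditional
// `store(0)` lands after writer B has already reserved the counter, so B's
// reservation is clobbered and a third writer C also becomes a "sole" runner.
fn clobbered_reservation() -> usize {
    let persistence_in_flight = AtomicUsize::new(0);

    // Writer A becomes the runner: 0 -> 1 (previous value 0 means "I run").
    assert_eq!(persistence_in_flight.fetch_add(1, Ordering::AcqRel), 0);

    // A's inner do_persist finishes and drops its reservation: 1 -> 0.
    persistence_in_flight.fetch_sub(1, Ordering::AcqRel);

    // Writer B starts concurrently: 0 -> 1, sees 0, so *B* is now the runner.
    assert_eq!(persistence_in_flight.fetch_add(1, Ordering::AcqRel), 0);

    // A's outer `store(0)` now executes, wiping out B's reservation.
    persistence_in_flight.store(0, Ordering::Release);

    // Writer C arrives: 0 -> 1, also sees 0, so C *also* thinks it is the
    // sole runner -- two writers now persist concurrently.
    assert_eq!(persistence_in_flight.fetch_add(1, Ordering::AcqRel), 0);

    persistence_in_flight.load(Ordering::Acquire)
}

fn main() {
    // The counter ends at 1 even though both B and C were admitted as runners.
    assert_eq!(clobbered_reservation(), 1);
    println!("unconditional store(0) let two runners through");
}
```

Keeping the loop and the flag in one method avoids this: only the single reserved runner ever modifies the counter downward.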
Codecov Report — ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #4597      +/-   ##
==========================================
- Coverage   86.84%   86.22%   -0.62%
==========================================
  Files         161      156       -5
  Lines      109260   108645     -615
==========================================
- Hits        94882    93677    -1205
- Misses      11797    12362     +565
- Partials     2581     2606      +25
==========================================
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer? If so, assign a second reviewer.
`LSPS5ServiceHandler::persist` incremented `persistence_in_flight` at the top as a single-runner gate, but only decremented it on the success path: each interior `?` on a `kv_store` future propagated the error out of the function while leaving the counter at >= 1. After one transient I/O failure (disk full, brief unavailability of a remote `KVStore`, EPERM, etc.) every subsequent `persist()` call hit the `fetch_add > 0` short-circuit and silently returned `Ok(false)`.

The in-memory `needs_persist` flags then continued to grow without ever reaching disk, so webhook state, removals, and notification cooldowns were lost on the next process restart — including the spec-mandated webhook retention/pruning state — without any error surfaced to the operator. The counter is monotonic, so recovery required a process restart.

Adopt the LSPS1 / LSPS2 pattern: split the body into an inner `do_persist` and an outer `persist` that unconditionally clears the counter via `store(0)` after the call returns, regardless of outcome. A failed write now still propagates `Err`, but the next `persist()` attempt actually retries the write instead of no-op'ing.

Co-Authored-By: HAL 9000
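Stripped of async and the `KVStore`, the failure mode and the described fix reduce to the sketch below. This is a toy model: `persistence_in_flight` and the method shapes follow the commit message, but the closure-based `write` parameter and both `persist_*` helpers are illustrative, not the actual handler API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Toy model of the handler's single-runner persistence gate.
struct Handler {
    persistence_in_flight: AtomicUsize,
}

impl Handler {
    fn new() -> Self {
        Handler { persistence_in_flight: AtomicUsize::new(0) }
    }

    // Pre-fix shape: the counter is bumped up front, but an early `?`-style
    // return on a write error leaves it stuck at >= 1 forever.
    fn persist_buggy(&self, write: impl Fn() -> Result<(), ()>) -> Result<bool, ()> {
        if self.persistence_in_flight.fetch_add(1, Ordering::AcqRel) > 0 {
            return Ok(false); // someone else is (supposedly) persisting
        }
        write()?; // on Err, the counter is never reset
        self.persistence_in_flight.store(0, Ordering::Release);
        Ok(true)
    }

    // Post-fix shape: the outer wrapper clears the counter after the inner
    // call returns, success or failure, so the next call retries the write.
    fn persist_fixed(&self, write: impl Fn() -> Result<(), ()>) -> Result<bool, ()> {
        if self.persistence_in_flight.fetch_add(1, Ordering::AcqRel) > 0 {
            return Ok(false);
        }
        let res = write(); // stands in for the inner do_persist
        self.persistence_in_flight.store(0, Ordering::Release);
        res.map(|()| true)
    }
}

fn main() {
    // One transient failure wedges the buggy version permanently:
    let h = Handler::new();
    assert!(h.persist_buggy(|| Err(())).is_err());
    assert_eq!(h.persist_buggy(|| Ok(())), Ok(false)); // silent no-op forever

    // The fixed version surfaces the error and then actually retries:
    let h = Handler::new();
    assert!(h.persist_fixed(|| Err(())).is_err());
    assert_eq!(h.persist_fixed(|| Ok(())), Ok(true));
    println!("fixed wrapper retries after a failed persist");
}
```

Note that `persist_fixed` models only what the commit message describes (unconditional `store(0)` after the inner call); the single-threaded retry behavior it demonstrates is orthogonal to the multi-writer ordering discussed in the review thread.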
Force-pushed from a23feec to b3544de (Compare)
Squashed without further changes. |