
Improve debugging experience #273

Open
hmgaudecker wants to merge 21 commits into pandas-ui from improve-debugging

Conversation


@hmgaudecker hmgaudecker commented Mar 9, 2026

Fixes #97.
Fixes #142.

Replace the boolean debug_mode flag with a structured logging/persistence system
that makes it easier to diagnose solve and simulation failures — especially NaN value
functions during parameter estimation.

Logging rework

debug_mode: bool → log_level: LogLevel on solve(), simulate(), and
solve_and_simulate(). Four levels:

| Level | Console output | Snapshots |
| --- | --- | --- |
| "off" | Nothing | No |
| "warning" | NaN/Inf warnings only | No |
| "progress" | Per-age timing + regime counts | No |
| "debug" | V-function stats per regime-age | Yes |

Per-period timing in both solve and simulate (elapsed seconds, active regime count).
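The four levels map naturally onto the standard library's logging thresholds. A minimal sketch of such a mapping (an illustration, not pylcm's actual implementation; names are assumptions):

```python
import logging

# Hypothetical mapping of the LogLevel literals onto stdlib thresholds;
# "off" suppresses everything by exceeding CRITICAL.
LOG_LEVELS = {
    "off": logging.CRITICAL + 10,
    "warning": logging.WARNING,
    "progress": logging.INFO,
    "debug": logging.DEBUG,
}

def make_logger(log_level: str) -> logging.Logger:
    """Return a logger whose threshold matches the requested level."""
    logger = logging.getLogger("lcm")
    logger.setLevel(LOG_LEVELS[log_level])
    return logger
```

With this mapping, "progress" output is just `logger.info(...)` calls that a "warning"-level logger silently drops.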

Debug snapshots

When log_level="debug", complete snapshots are saved to log_path:

solve_snapshot_000/
├── arrays.h5          # V_arr_dict in HDF5
├── model.pkl          # cloudpickle'd Model
├── params.pkl         # user params
├── metadata.json      # snapshot type, fields, platform
├── pixi.lock          # environment lock file
└── REPRODUCE.md       # instructions for reproducing

Three snapshot types: SolveSnapshot, SimulateSnapshot,
SolveAndSimulateSnapshot. Load with load_snapshot(path, exclude=(...)) to
skip large fields.

Automatic retention: log_keep_n_latest=3 (default) deletes oldest snapshots.

Better error messages

  • InvalidValueFunctionError now includes regime name and NaN count
    ("3 of 100 values are NaN in regime 'working_life'")
  • NaN/Inf in value function arrays logged as warnings before the hard error
  • Value function debug stats: min/max/mean per regime per age at debug level
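The quoted error text can be reproduced with a small helper (hypothetical; pylcm builds the string inside InvalidValueFunctionError, and the real code operates on JAX arrays rather than plain sequences):

```python
import math

def nan_summary(values, regime_name):
    """Count NaNs in a flat sequence and format the error text quoted above."""
    n_nan = sum(1 for v in values if math.isnan(v))
    return f"{n_nan} of {len(values)} values are NaN in regime '{regime_name}'"
```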

Validation improvements

  • Feasibility check now vmaps over action combos (consistent with solve/simulate),
    fixing shape mismatches with MappingLeaf parameters
  • Batched vmapped feasibility to cap memory at ~256 MB
  • Enhanced infeasibility error: shows a DataFrame of state values with discrete
    codes decoded to labels, plus constraint details and grid bounds
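The memory cap works by vmapping over fixed-size batches of action combinations instead of all combos at once. A schematic of the batch-size arithmetic (the helper and the exact formula are assumptions, not pylcm's code):

```python
def batch_indices(n_combos: int, bytes_per_combo: int, max_bytes: int = 256 * 2**20):
    """Yield (start, stop) slices so each vmapped batch stays under max_bytes."""
    batch_size = max(1, max_bytes // bytes_per_combo)
    for start in range(0, n_combos, batch_size):
        yield start, min(start + batch_size, n_combos)
```

Each (start, stop) slice would then be fed to the vmapped feasibility check in turn, bounding peak memory regardless of how many combos exist.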

Other

  • SimulationResult.to_pickle() / .from_pickle() for serialisation
  • save_solution() / load_solution() for value function arrays
  • Test files split for faster parallel runs (model setup is expensive, so batch
    by module):
    • test_solution_on_toy_model.py → test_solution_on_toy_model_deterministic.py +
      test_solution_on_toy_model_stochastic.py
    • test_shock_grids.py → grid tests stay, draw tests split into
      test_shock_draw.py
  • Test model factories cached with @functools.cache for faster test suites
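Caching the factories is a one-decorator change; a sketch (the factory name and body are illustrative — the real factories build pylcm models):

```python
import functools

@functools.cache
def make_toy_model():
    """Build the (expensive) test model once; later calls reuse the instance.

    functools.cache requires all arguments to be hashable (here: none).
    """
    return {"regimes": ("working_life", "retirement")}
```

Every test module that calls the factory after the first gets the cached object back, so the setup cost is paid once per worker process.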

Documentation

  • New user guide page: Debugging (docs/user_guide/debugging.md) — log levels,
    snapshot structure, NaN diagnosis recipes, value function inspection patterns

hmgaudecker and others added 4 commits March 9, 2026 14:56
- Create `lcm.persistence` module with `save_solution`/`load_solution` (extracted
  `_atomic_dump` from `simulation/result.py`)
- Add `debug_path` and `keep_n_latest` params to `solve()`, `simulate()`, and
  `solve_and_simulate()` for auto-persisting intermediate results to disk
- Rename `debug_mode` to `debug` across all public APIs
- Make cloudpickle a hard dependency (remove all try/import guards)
- Enhance `validate_value_function_array` with regime name and NaN count in errors
- Add debugging user guide with recipes (disable JIT, auto-persist, optimagic NaN
  debugging, inspecting value functions, understanding error messages)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

read-the-docs-community bot commented Mar 9, 2026

Documentation build overview

📚 pylcm | 🛠️ Build #31821026 | 📁 Comparing e7cf1e0 against latest (5f1d1ef)


🔍 Preview build

Files changed (27 in total): 📝 24 modified | ➕ 3 added | ➖ 0 deleted
File Status
index.html 📝 modified
approximating-continuous-shocks/index.html 📝 modified
debugging/index.html ➕ added
defining-models/index.html 📝 modified
dispatchers/index.html 📝 modified
function-representation/index.html 📝 modified
grids/index.html 📝 modified
index-1/index.html 📝 modified
index-2/index.html 📝 modified
index-3/index.html 📝 modified
index-4/index.html 📝 modified
installation/index.html 📝 modified
interpolation/index.html 📝 modified
mahler-yum-2024/index.html 📝 modified
mortality/index.html 📝 modified
pandas-interop/index.html ➕ added
parameters/index.html 📝 modified
precautionary-savings/index.html 📝 modified
precautionary-savings-health/index.html 📝 modified
regimes/index.html 📝 modified
shocks/index.html 📝 modified
solving-and-simulating/index.html 📝 modified
stochastic-transitions/index.html ➕ added
tiny/index.html 📝 modified
tiny-example/index.html 📝 modified
transitions/index.html 📝 modified
write-economics/index.html 📝 modified

@hmgaudecker hmgaudecker changed the title Improve debugging Improve debugging experience Mar 14, 2026

hmgaudecker commented Mar 14, 2026

Code review

Found 2 issues:

  1. regime_transition_probs_from_series is listed in __all__ but is never imported or defined anywhere. from lcm import regime_transition_probs_from_series raises ImportError. Either add the import or remove the entry.

"load_solution",
"regime_transition_probs_from_series",
"save_solution",

  2. Decorative section-separator comment added. CLAUDE.md says "Never add decorative section-separator comments".

pylcm/src/lcm/model.py

Lines 716 to 718 in 32ce4a3

# ======================================================================================
# Debug persistence helpers
# ======================================================================================

Generated with Claude Code


@hmgaudecker hmgaudecker marked this pull request as ready for review March 15, 2026 20:16
@hmgaudecker hmgaudecker requested a review from timmens March 15, 2026 20:25

@timmens timmens left a comment


Looks very good. I have a few comments. Some of the issues are only somewhat related to this PR, but the changes here make them easier to spot.

seed = draw_random_seed()

logger.info("Starting simulation")
has_multiple_regimes = len(internal_regimes) > 1
Member

Don't we always have at least two regimes?

Member Author

Indeed!

Comment on lines 150 to +181

# Check for NaN/Inf in V_arr
if jnp.any(jnp.isnan(result.V_arr)) or jnp.any(jnp.isinf(result.V_arr)):
logger.warning(
"NaN/Inf in V_arr for regime '%s' at age %s", regime_name, age
)

subject_regime_ids = new_subject_regime_ids

# Log regime transition counts at debug level
if has_multiple_regimes and logger.isEnabledFor(logging.DEBUG):
_log_regime_transitions(
logger=logger,
prev_regime_ids=prev_regime_ids,
new_regime_ids=subject_regime_ids,
ids_to_names=ids_to_names,
)

elapsed = time.monotonic() - period_start
if has_multiple_regimes:
logger.info(
"Age: %s regimes=%d (%.1fs)",
age,
len(active_regimes),
elapsed,
)
else:
logger.info("Age: %s (%.1fs)", age, elapsed)

total_elapsed = time.monotonic() - total_start
logger.info("Simulation complete (%.1fs)", total_elapsed)

Member

The simulation module / functions are already pretty logic packed. Should we maybe pull this out into a logging function / warning function that hides the internals?

Member Author

Very good idea.
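A sketch of such an extracted helper (all names hypothetical), which would keep the simulation loop free of the single-/multi-regime branching shown above:

```python
import logging
import time

def log_period_progress(logger, age, n_active_regimes, period_start, *, multi_regime):
    """Emit the per-period timing line, hiding the regime-count branch."""
    elapsed = time.monotonic() - period_start
    if multi_regime:
        logger.info("Age: %s regimes=%d (%.1fs)", age, n_active_regimes, elapsed)
    else:
        logger.info("Age: %s (%.1fs)", age, elapsed)
```

The loop body then shrinks to a single `log_period_progress(...)` call per period.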

Comment on lines +495 to +503

# Unwrap partials before renaming, then re-wrap — this avoids rename_arguments
# seeing the already-bound keywords and is simpler than letting dags handle it.
if isinstance(func, functools.partial):
renamed = rename_arguments(func.func, mapper=mapper)
return cast(
"InternalUserFunction",
functools.partial(renamed, *func.args, **func.keywords),
)
Member

If any of func.keywords overlap with mapper, the rebound keywords will reference the old (pre-rename) parameter names on the renamed function. Should the keywords be renamed too?

renamed_keywords = {mapper.get(k, k): v for k, v in func.keywords.items()}
functools.partial(renamed, *func.args, **renamed_keywords)
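A self-contained demonstration of the pitfall and the fix, using a minimal strict stand-in for rename_arguments (which in pylcm comes from the dags library; this stand-in only approximates its behavior):

```python
import functools
import inspect

def rename_arguments(func, mapper):
    """Minimal stand-in: expose only the renamed parameter names."""
    new_names = {mapper.get(p, p) for p in inspect.signature(func).parameters}
    inverse = {new: old for old, new in mapper.items()}
    def wrapper(**kwargs):
        unknown = set(kwargs) - new_names
        if unknown:
            raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
        return func(**{inverse.get(k, k): v for k, v in kwargs.items()})
    return wrapper

def utility(consumption, wealth):
    return consumption + 10 * wealth

func = functools.partial(utility, wealth=2)
mapper = {"wealth": "assets"}
renamed = rename_arguments(func.func, mapper)

# Buggy: re-binding the old keywords verbatim references the pre-rename name.
buggy = functools.partial(renamed, *func.args, **func.keywords)

# Fix: translate the bound keywords through the same mapper first.
renamed_keywords = {mapper.get(k, k): v for k, v in func.keywords.items()}
fixed = functools.partial(renamed, *func.args, **renamed_keywords)
```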



@dataclass(frozen=True)
class SolveAndSimulateSnapshot:
Member

This class is identical to the SimulateSnapshot class.

Member

Maybe we should also remove the solve_and_simulate logic by just having simulate where you can either pass the pre-computed vf_arr or not; and if not it behaves as solve_and_simulate.
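The merged API the comment proposes could look roughly like this (a sketch with stub solve/run_simulation functions, not pylcm's current signatures):

```python
def solve(model, params):
    """Stub solver: pretend to compute value-function arrays per regime."""
    return {regime: [0.0] for regime in model["regimes"]}

def run_simulation(model, params, vf_arr_dict):
    """Stub simulation driver."""
    return {"used_regimes": sorted(vf_arr_dict)}

def simulate(model, params, *, vf_arr_dict=None):
    """Solve first only when no pre-computed value functions are supplied.

    With vf_arr_dict given, this behaves like today's simulate; without it,
    like solve_and_simulate.
    """
    if vf_arr_dict is None:
        vf_arr_dict = solve(model, params)
    return run_simulation(model, params, vf_arr_dict)
```

This would also resolve the redundant-snapshot question above, since there would be a single entry point deciding what to persist.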

)

logger.info("Starting solution")
has_multiple_regimes = len(internal_regimes) > 1
Member

Same question as in simulate.py

Comment on lines +209 to +214
save_solve_snapshot(
model=self,
params=params,
V_arr_dict=V_arr_dict,
log_path=Path(log_path),
log_keep_n_latest=log_keep_n_latest,
Member

Have you checked whether this takes a long time for large models?

If so, could we not return V_arr_dict already while doing the snapshot in the background? Same holds for simulation snapshot.

)
return result

def solve_and_simulate(
Member

  1. If the log level is "debug", will solve_and_simulate save both the SolveSnapshot and the SimulateSnapshot, and isn't that redundant?
  2. Also, thinking about it again, we should probably merge solve_and_simulate and simulate.

Member

Why is there this file (test_debug_persistence) and the file "test_persistence.py"?

Member

It feels like this exists because of ttsim / gettsim. Maybe we want to be explicit then? We could even add ttsim to the testing environment and write a test with it, no?

]
dynamic = [ "version" ]
dependencies = [
"cloudpickle>=3.1.2,<4",
Member

Why the upper limit?


Development

Successfully merging this pull request may close these issues.

  • ENH: Improve logging and debugging
  • ENH: Export value function evaluated at states for debugging
