
Improve debugging experience #273

Open
hmgaudecker wants to merge 21 commits into pandas-ui from improve-debugging

Conversation


@hmgaudecker hmgaudecker commented Mar 9, 2026

Fixes #97.
Fixes #142.

Replace the boolean debug_mode flag with a structured logging/persistence system
that makes it easier to diagnose solve and simulation failures — especially NaN value
functions during parameter estimation.

Logging rework

debug_mode: bool → log_level: LogLevel on solve(), simulate(), and
solve_and_simulate(). Four levels:

| Level | Console output | Snapshots |
| --- | --- | --- |
| "off" | Nothing | No |
| "warning" | NaN/Inf warnings only | No |
| "progress" | Per-age timing + regime counts | No |
| "debug" | V-function stats per regime-age | Yes |

Per-period timing in both solve and simulate (elapsed seconds, active regime count).
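The four levels map naturally onto the standard library's logging thresholds. A minimal sketch of such a mapping (an illustration, not pylcm's actual implementation; names are assumptions):

```python
import logging

# Hypothetical mapping of the LogLevel literals onto stdlib thresholds;
# "off" suppresses everything by exceeding CRITICAL.
LOG_LEVELS = {
    "off": logging.CRITICAL + 10,
    "warning": logging.WARNING,
    "progress": logging.INFO,
    "debug": logging.DEBUG,
}

def make_logger(log_level: str) -> logging.Logger:
    """Return a logger whose threshold matches the requested level."""
    logger = logging.getLogger("lcm")
    logger.setLevel(LOG_LEVELS[log_level])
    return logger
```

With this mapping, "progress" output is just `logger.info(...)` calls that a "warning"-level logger silently drops.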

Debug snapshots

When log_level="debug", complete snapshots are saved to log_path:

solve_snapshot_000/
├── arrays.h5          # V_arr_dict in HDF5
├── model.pkl          # cloudpickle'd Model
├── params.pkl         # user params
├── metadata.json      # snapshot type, fields, platform
├── pixi.lock          # environment lock file
└── REPRODUCE.md       # instructions for reproducing

Three snapshot types: SolveSnapshot, SimulateSnapshot,
SolveAndSimulateSnapshot. Load with load_snapshot(path, exclude=(...)) to
skip large fields.

Automatic retention: log_keep_n_latest=3 (default) deletes oldest snapshots.

Better error messages

  • InvalidValueFunctionError now includes regime name and NaN count
    ("3 of 100 values are NaN in regime 'working_life'")
  • NaN/Inf in value function arrays logged as warnings before the hard error
  • Value function debug stats: min/max/mean per regime per age at debug level
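The quoted error text can be reproduced with a small helper (hypothetical; pylcm builds the string inside InvalidValueFunctionError, and the real code operates on JAX arrays rather than plain sequences):

```python
import math

def nan_summary(values, regime_name):
    """Count NaNs in a flat sequence and format the error text quoted above."""
    n_nan = sum(1 for v in values if math.isnan(v))
    return f"{n_nan} of {len(values)} values are NaN in regime '{regime_name}'"
```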

Validation improvements

  • Feasibility check now vmaps over action combos (consistent with solve/simulate),
    fixing shape mismatches with MappingLeaf parameters
  • Batched vmapped feasibility to cap memory at ~256 MB
  • Enhanced infeasibility error: shows a DataFrame of state values with discrete
    codes decoded to labels, plus constraint details and grid bounds
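The memory cap works by vmapping over fixed-size batches of action combinations instead of all combos at once. A schematic of the batch-size arithmetic (the helper and the exact formula are assumptions, not pylcm's code):

```python
def batch_indices(n_combos: int, bytes_per_combo: int, max_bytes: int = 256 * 2**20):
    """Yield (start, stop) slices so each vmapped batch stays under max_bytes."""
    batch_size = max(1, max_bytes // bytes_per_combo)
    for start in range(0, n_combos, batch_size):
        yield start, min(start + batch_size, n_combos)
```

Each (start, stop) slice would then be fed to the vmapped feasibility check in turn, bounding peak memory regardless of how many combos exist.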

Other

  • SimulationResult.to_pickle() / .from_pickle() for serialisation
  • save_solution() / load_solution() for value function arrays
  • Test files split for faster parallel runs (model setup is expensive, so batch
    by module):
    • test_solution_on_toy_model.py → test_solution_on_toy_model_deterministic.py +
      test_solution_on_toy_model_stochastic.py
    • test_shock_grids.py → grid tests stay, draw tests split into
      test_shock_draw.py
  • Test model factories cached with @functools.cache for faster test suites
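Caching the factories is a one-decorator change; a sketch (the factory name and body are illustrative — the real factories build pylcm models):

```python
import functools

@functools.cache
def make_toy_model():
    """Build the (expensive) test model once; later calls reuse the instance.

    functools.cache requires all arguments to be hashable (here: none).
    """
    return {"regimes": ("working_life", "retirement")}
```

Every test module that calls the factory after the first gets the cached object back, so the setup cost is paid once per worker process.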

Documentation

  • New user guide page: Debugging (docs/user_guide/debugging.md) — log levels,
    snapshot structure, NaN diagnosis recipes, value function inspection patterns

hmgaudecker and others added 4 commits March 9, 2026 14:56
- Create `lcm.persistence` module with `save_solution`/`load_solution` (extracted
  `_atomic_dump` from `simulation/result.py`)
- Add `debug_path` and `keep_n_latest` params to `solve()`, `simulate()`, and
  `solve_and_simulate()` for auto-persisting intermediate results to disk
- Rename `debug_mode` to `debug` across all public APIs
- Make cloudpickle a hard dependency (remove all try/import guards)
- Enhance `validate_value_function_array` with regime name and NaN count in errors
- Add debugging user guide with recipes (disable JIT, auto-persist, optimagic NaN
  debugging, inspecting value functions, understanding error messages)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

read-the-docs-community bot commented Mar 9, 2026

Documentation build overview

📚 pylcm | 🛠️ Build #31821026 | 📁 Comparing e7cf1e0 against latest (5f1d1ef)


🔍 Preview build

Files changed (27 in total): 📝 24 modified | ➕ 3 added | ➖ 0 deleted
File Status
index.html 📝 modified
approximating-continuous-shocks/index.html 📝 modified
debugging/index.html ➕ added
defining-models/index.html 📝 modified
dispatchers/index.html 📝 modified
function-representation/index.html 📝 modified
grids/index.html 📝 modified
index-1/index.html 📝 modified
index-2/index.html 📝 modified
index-3/index.html 📝 modified
index-4/index.html 📝 modified
installation/index.html 📝 modified
interpolation/index.html 📝 modified
mahler-yum-2024/index.html 📝 modified
mortality/index.html 📝 modified
pandas-interop/index.html ➕ added
parameters/index.html 📝 modified
precautionary-savings/index.html 📝 modified
precautionary-savings-health/index.html 📝 modified
regimes/index.html 📝 modified
shocks/index.html 📝 modified
solving-and-simulating/index.html 📝 modified
stochastic-transitions/index.html ➕ added
tiny/index.html 📝 modified
tiny-example/index.html 📝 modified
transitions/index.html 📝 modified
write-economics/index.html 📝 modified

@hmgaudecker hmgaudecker changed the title Improve debugging Improve debugging experience Mar 14, 2026

hmgaudecker commented Mar 14, 2026

Code review

Found 2 issues:

  1. regime_transition_probs_from_series is listed in __all__ but is never imported or defined anywhere. from lcm import regime_transition_probs_from_series raises ImportError. Either add the import or remove the entry.

"load_solution",
"regime_transition_probs_from_series",
"save_solution",

  2. Decorative section-separator comment added. CLAUDE.md says "Never add decorative section-separator comments".

pylcm/src/lcm/model.py

Lines 716 to 718 in 32ce4a3

# ======================================================================================
# Debug persistence helpers
# ======================================================================================

Generated with Claude Code


@hmgaudecker hmgaudecker marked this pull request as ready for review March 15, 2026 20:16
@hmgaudecker hmgaudecker requested a review from timmens March 15, 2026 20:25

@timmens timmens left a comment


Looks very good. I have a few comments. Some of the issues are only somewhat related to this PR, but the changes here make them easier to spot.

seed = draw_random_seed()

logger.info("Starting simulation")
has_multiple_regimes = len(internal_regimes) > 1
Member

Don't we always have at least two regimes?

Member Author

Indeed!

Comment on lines 150 to +181

# Check for NaN/Inf in V_arr
if jnp.any(jnp.isnan(result.V_arr)) or jnp.any(jnp.isinf(result.V_arr)):
logger.warning(
"NaN/Inf in V_arr for regime '%s' at age %s", regime_name, age
)

subject_regime_ids = new_subject_regime_ids

# Log regime transition counts at debug level
if has_multiple_regimes and logger.isEnabledFor(logging.DEBUG):
_log_regime_transitions(
logger=logger,
prev_regime_ids=prev_regime_ids,
new_regime_ids=subject_regime_ids,
ids_to_names=ids_to_names,
)

elapsed = time.monotonic() - period_start
if has_multiple_regimes:
logger.info(
"Age: %s regimes=%d (%.1fs)",
age,
len(active_regimes),
elapsed,
)
else:
logger.info("Age: %s (%.1fs)", age, elapsed)

total_elapsed = time.monotonic() - total_start
logger.info("Simulation complete (%.1fs)", total_elapsed)

Member

The simulation module / functions are already pretty logic packed. Should we maybe pull this out into a logging function / warning function that hides the internals?

Member Author

Very good idea.
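A sketch of such an extracted helper (all names hypothetical), which would keep the simulation loop free of the single-/multi-regime branching shown above:

```python
import logging
import time

def log_period_progress(logger, age, n_active_regimes, period_start, *, multi_regime):
    """Emit the per-period timing line, hiding the regime-count branch."""
    elapsed = time.monotonic() - period_start
    if multi_regime:
        logger.info("Age: %s regimes=%d (%.1fs)", age, n_active_regimes, elapsed)
    else:
        logger.info("Age: %s (%.1fs)", age, elapsed)
```

The loop body then shrinks to a single `log_period_progress(...)` call per period.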

Comment on lines +495 to +503

# Unwrap partials before renaming, then re-wrap — this avoids rename_arguments
# seeing the already-bound keywords and is simpler than letting dags handle it.
if isinstance(func, functools.partial):
renamed = rename_arguments(func.func, mapper=mapper)
return cast(
"InternalUserFunction",
functools.partial(renamed, *func.args, **func.keywords),
)
Member

If any of func.keywords overlap with mapper, the rebound keywords will reference the old (pre-rename) parameter names on the renamed function. Should the keywords be renamed too?

renamed_keywords = {mapper.get(k, k): v for k, v in func.keywords.items()}
functools.partial(renamed, *func.args, **renamed_keywords)
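A self-contained demonstration of the pitfall and the fix, using a minimal strict stand-in for rename_arguments (which in pylcm comes from the dags library; this stand-in only approximates its behavior):

```python
import functools
import inspect

def rename_arguments(func, mapper):
    """Minimal stand-in: expose only the renamed parameter names."""
    new_names = {mapper.get(p, p) for p in inspect.signature(func).parameters}
    inverse = {new: old for old, new in mapper.items()}
    def wrapper(**kwargs):
        unknown = set(kwargs) - new_names
        if unknown:
            raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
        return func(**{inverse.get(k, k): v for k, v in kwargs.items()})
    return wrapper

def utility(consumption, wealth):
    return consumption + 10 * wealth

func = functools.partial(utility, wealth=2)
mapper = {"wealth": "assets"}
renamed = rename_arguments(func.func, mapper)

# Buggy: re-binding the old keywords verbatim references the pre-rename name.
buggy = functools.partial(renamed, *func.args, **func.keywords)

# Fix: translate the bound keywords through the same mapper first.
renamed_keywords = {mapper.get(k, k): v for k, v in func.keywords.items()}
fixed = functools.partial(renamed, *func.args, **renamed_keywords)
```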



@dataclass(frozen=True)
class SolveAndSimulateSnapshot:
Member

This class is identical to the SimulateSnapshot class.

Member

Maybe we should also remove the solve_and_simulate logic by just having simulate where you can either pass the pre-computed vf_arr or not; and if not it behaves as solve_and_simulate.
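The merged API the comment proposes could look roughly like this (a sketch with stub solve/run_simulation functions, not pylcm's current signatures):

```python
def solve(model, params):
    """Stub solver: pretend to compute value-function arrays per regime."""
    return {regime: [0.0] for regime in model["regimes"]}

def run_simulation(model, params, vf_arr_dict):
    """Stub simulation driver."""
    return {"used_regimes": sorted(vf_arr_dict)}

def simulate(model, params, *, vf_arr_dict=None):
    """Solve first only when no pre-computed value functions are supplied.

    With vf_arr_dict given, this behaves like today's simulate; without it,
    like solve_and_simulate.
    """
    if vf_arr_dict is None:
        vf_arr_dict = solve(model, params)
    return run_simulation(model, params, vf_arr_dict)
```

This would also resolve the redundant-snapshot question above, since there would be a single entry point deciding what to persist.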

)

logger.info("Starting solution")
has_multiple_regimes = len(internal_regimes) > 1
Member

Same question as in simulate.py

Comment on lines +209 to +214
save_solve_snapshot(
model=self,
params=params,
V_arr_dict=V_arr_dict,
log_path=Path(log_path),
log_keep_n_latest=log_keep_n_latest,
Member

Have you checked whether this takes a long time for large models?

If so, could we not return V_arr_dict already while doing the snapshot in the background? Same holds for simulation snapshot.

)
return result

def solve_and_simulate(
Member

  1. If the log level is "debug", will solve_and_simulate save both the SolveSnapshot and the SimulateSnapshot, and isn't that redundant?
  2. Also, thinking about it again, we should probably merge solve_and_simulate and simulate.

Member

Why is there this file (test_debug_persistence) and the file "test_persistence.py"?

Member

It feels like this exists because of ttsim / gettsim. Maybe we want to be explicit then? We could even add ttsim to the testing environment and write a test with it, no?

]
dynamic = [ "version" ]
dependencies = [
"cloudpickle>=3.1.2,<4",
Member

Why the upper limit?


Development

Successfully merging this pull request may close these issues.

  • ENH: Improve logging and debugging
  • ENH: Export value function evaluated at states for debugging
