Optimise Simulation() init: ~10x speedup on warm loads#1497

Merged
nikhilwoodruff merged 4 commits into main from perf/brma-lha-optimisation
Feb 18, 2026

Conversation

@nikhilwoodruff
Collaborator

Three changes that together give ~10x speedup on repeated Simulation() calls (3.7s → 0.38s warm).

Parameter tree cache. CountryTaxBenefitSystem.__init__() clones a pre-processed parameter tree instead of running convert_to_fiscal_year_parameters() (22,538 param.update() calls, ~0.5s) on every instantiation. The first init populates the cache; subsequent inits clone it in ~0.15s. Reforms still run the full pipeline via apply_parameter_changes() -> reset_parameters() -> process_parameters().
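The cache-then-clone pattern described above can be sketched as follows. This is a minimal illustration, not the actual policyengine-uk code: the names `_parameter_cache`, `expensive_build`, and `get_parameters` are hypothetical stand-ins for the real pipeline.

```python
# Hypothetical sketch of the parameter-tree cache: pay the full build cost
# once per process, then hand out deep copies on warm inits. Reforms skip
# the cache and rebuild, mirroring the full
# apply_parameter_changes() -> reset_parameters() -> process_parameters() path.
import copy

_parameter_cache = {}


def expensive_build(country):
    # Stand-in for the real pipeline (parameter load +
    # convert_to_fiscal_year_parameters(), ~0.5s in the PR's measurements).
    return {"basic_rate": 0.20, "personal_allowance": 12_570, "country": country}


def get_parameters(country, reform=None):
    if reform is not None:
        # Reforms bypass the cache so parameter changes are fully reprocessed.
        params = expensive_build(country)
        params.update(reform)
        return params
    if country not in _parameter_cache:
        _parameter_cache[country] = expensive_build(country)  # cold path, runs once
    # Warm path: clone the cached tree so callers can mutate their copy safely.
    return copy.deepcopy(_parameter_cache[country])
```

The deep copy matters: handing out the cached object itself would let one simulation's parameter edits leak into the next.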

URL dataset cache. After the first HuggingFace load, the fully-uprated, enum-pre-encoded UKMultiYearDataset is cached in memory. _pre_encode_enum_columns() converts string enum columns to int16 before caching so subsequent loads use encode()'s fast integer path. Saves ~2.2s (HDF5 read + uprating + string encoding) per warm simulation.
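The enum pre-encoding step can be illustrated with a small sketch. The mapping, column names, and function name below are assumptions for illustration, not policyengine-uk's actual identifiers; the real `_pre_encode_enum_columns()` operates on the dataset's own enum definitions.

```python
# Illustrative sketch: convert string enum columns to int16 before caching,
# so a later encode() call can take a fast integer path instead of
# re-encoding strings on every warm load.
import numpy as np
import pandas as pd

REGION_CODES = {"LONDON": 0, "NORTH_EAST": 1, "WALES": 2}  # hypothetical mapping


def pre_encode_enum_columns(df, enum_maps):
    out = df.copy()
    for col, mapping in enum_maps.items():
        if out[col].dtype == object:  # only string columns need converting
            out[col] = out[col].map(mapping).astype(np.int16)
    return out


df = pd.DataFrame({"region": ["LONDON", "WALES"], "income": [30_000.0, 25_000.0]})
encoded = pre_encode_enum_columns(df, {"region": REGION_CODES})
```

Doing this once, before the dataset enters the cache, is what moves the string-encoding cost off the warm path.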

Vectorised interpolate_percentile. The Python list comprehension over ~115k households in attends_private_school is replaced with a 21-point parameter lookup, np.interp, and numpy array indexing.
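The vectorisation can be sketched as below. The 21-point table values here are invented for illustration; the real parameter lookup comes from the model's calibration, and the function name is a hypothetical stand-in.

```python
# Sketch: replace a per-household Python list comprehension with a single
# np.interp call over a 21-point percentile table. One vectorised call
# covers all ~115k households at once.
import numpy as np

percentile_points = np.linspace(0, 100, 21)  # 21 anchor percentiles (0, 5, ..., 100)
rate_at_point = np.linspace(0.0, 0.5, 21)    # hypothetical parameter values


def interpolate_percentile_vectorised(percentiles):
    # Piecewise-linear interpolation between the 21 anchors, fully in numpy.
    return np.interp(percentiles, percentile_points, rate_at_point)


household_percentiles = np.array([0.0, 50.0, 100.0])
rates = interpolate_percentile_vectorised(household_percentiles)
```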

Correctness verified: enum variables (region, tenure_type), income tax and household net income all give identical results between cold and warm runs. All tests pass except two pre-existing failures in test_reform_impacts.py.

nikhilwoodruff and others added 3 commits February 18, 2026 09:58
…rised interpolation

Three performance changes that give ~10x speedup on warm Simulation() calls:

1. Cache the fully-processed parameter tree in CountryTaxBenefitSystem.__init__().
   convert_to_fiscal_year_parameters() (22,538 param.update() calls, ~0.5s) now only
   runs once per process. Subsequent inits clone the cached tree. Reforms still go
   through the full pipeline via apply_parameter_changes() -> reset_parameters() ->
   process_parameters(), so correctness is preserved.

2. Cache the loaded, uprated and enum-pre-encoded UKMultiYearDataset per URL.
   _pre_encode_enum_columns() converts string enum columns to int16 before caching
   so subsequent build_from_multi_year_dataset calls use encode()'s fast integer
   path. Saves ~2.2s (HDF5 read + uprating + string encoding) on every warm load.

3. Vectorise attends_private_school.interpolate_percentile. Replace the Python list
   comprehension over ~115k households with a 21-point parameter lookup, np.interp
   and numpy array indexing.

Benchmark: 2nd+ Simulation() drops from ~3.7s to ~0.38s.

Co-Authored-By: Claude <noreply@anthropic.com>
@nikhilwoodruff nikhilwoodruff force-pushed the perf/brma-lha-optimisation branch from ca4956f to 14dcc79 Compare February 18, 2026 09:58
Removes the -m "not microsimulation" exclusion from make test, so reform
impact and salary sacrifice cap tests run in CI. Updates stale expected
values in reforms_config.yaml and test_salary_sacrifice_cap_reform.py to
match current model output.

Co-Authored-By: Claude <noreply@anthropic.com>
@nikhilwoodruff nikhilwoodruff merged commit 033fbb1 into main Feb 18, 2026
2 checks passed
@nikhilwoodruff nikhilwoodruff deleted the perf/brma-lha-optimisation branch February 18, 2026 10:18
nwoodruff-co pushed a commit that referenced this pull request Feb 19, 2026
…m labels

After the warm-load caching optimisation (#1497), enum columns in dataset
DataFrames are stored as int16 for performance. Add decoded_person,
decoded_benunit and decoded_household properties to UKSingleYearDataset that
return copies of the DataFrames with enum columns decoded back to their string
names (e.g. 0 -> 'MALE'). The internal int16 representation is unchanged so
simulation performance is unaffected.
nwoodruff-co pushed a commit that referenced this pull request Feb 19, 2026
…ings

Replace plain DataFrame attributes with properties that decode int16 enum
columns back to string names on access. The raw int16 data is stored in
private _person/_benunit/_household attributes and accessed via .tables by
the simulation engine, preserving the warm-load performance gain from #1497.

Callers (e.g. sim.dataset[year].person) now see string labels as before.
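The private-attribute-plus-decoding-property pattern described in this commit can be sketched as follows. Class, attribute, and mapping names here are illustrative, not UKSingleYearDataset's exact API.

```python
# Hedged sketch: raw int16 enum columns live in a private attribute that the
# simulation engine reads directly (preserving the warm-load gain), while a
# property returns a copy with string labels restored for external callers.
import numpy as np
import pandas as pd

GENDER_LABELS = {0: "MALE", 1: "FEMALE"}  # hypothetical enum mapping


class SingleYearTable:
    def __init__(self, raw: pd.DataFrame, enum_maps: dict):
        self._person = raw          # int16-encoded; fast path for the engine
        self._enum_maps = enum_maps

    @property
    def person(self) -> pd.DataFrame:
        # Decode on access; the cached int16 frame is never mutated.
        out = self._person.copy()
        for col, labels in self._enum_maps.items():
            out[col] = out[col].map(labels)
        return out


raw = pd.DataFrame({"gender": np.array([0, 1], dtype=np.int16)})
table = SingleYearTable(raw, {"gender": GENDER_LABELS})
```

Because decoding happens on a copy at access time, callers see string labels while the engine keeps reading the untouched int16 data.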
nwoodruff-co pushed a commit that referenced this pull request Feb 19, 2026
…ehold (#1498)

* Add decoded_person/benunit/household properties to restore string enum labels

After the warm-load caching optimisation (#1497), enum columns in dataset
DataFrames are stored as int16 for performance. Add decoded_person,
decoded_benunit and decoded_household properties to UKSingleYearDataset that
return copies of the DataFrames with enum columns decoded back to their string
names (e.g. 0 -> 'MALE'). The internal int16 representation is unchanged so
simulation performance is unaffected.

* Make UKSingleYearDataset.person/benunit/household decode enums to strings

Replace plain DataFrame attributes with properties that decode int16 enum
columns back to string names on access. The raw int16 data is stored in
private _person/_benunit/_household attributes and accessed via .tables by
the simulation engine, preserving the warm-load performance gain from #1497.

Callers (e.g. sim.dataset[year].person) now see string labels as before.

* Format with black

* Update UC taper reform expected impact to -44.8bn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
