Optimise Simulation() init: ~10x speedup on warm loads#1497

Merged
nikhilwoodruff merged 4 commits into main from perf/brma-lha-optimisation
Feb 18, 2026

Conversation

@nikhilwoodruff
Collaborator

Three changes that together give ~10x speedup on repeated Simulation() calls (3.7s → 0.38s warm).

Parameter tree cache. CountryTaxBenefitSystem.__init__() clones a pre-processed parameter tree instead of running convert_to_fiscal_year_parameters() (22,538 param.update() calls, ~0.5s) on every instantiation. The first init populates the cache; subsequent inits clone it in ~0.15s. Reforms still run the full pipeline via apply_parameter_changes() -> reset_parameters() -> process_parameters().
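The cache-then-clone pattern described above can be sketched as follows. This is a minimal illustration, not the actual policyengine-uk code: the names `_parameter_cache`, `expensive_build`, and `get_parameters` are hypothetical stand-ins for the real pipeline.

```python
# Hypothetical sketch of the parameter-tree cache: pay the full build cost
# once per process, then hand out deep copies on warm inits. Reforms skip
# the cache and rebuild, mirroring the full
# apply_parameter_changes() -> reset_parameters() -> process_parameters() path.
import copy

_parameter_cache = {}


def expensive_build(country):
    # Stand-in for the real pipeline (parameter load +
    # convert_to_fiscal_year_parameters(), ~0.5s in the PR's measurements).
    return {"basic_rate": 0.20, "personal_allowance": 12_570, "country": country}


def get_parameters(country, reform=None):
    if reform is not None:
        # Reforms bypass the cache so parameter changes are fully reprocessed.
        params = expensive_build(country)
        params.update(reform)
        return params
    if country not in _parameter_cache:
        _parameter_cache[country] = expensive_build(country)  # cold path, runs once
    # Warm path: clone the cached tree so callers can mutate their copy safely.
    return copy.deepcopy(_parameter_cache[country])
```

The deep copy matters: handing out the cached object itself would let one simulation's parameter edits leak into the next.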

URL dataset cache. After the first HuggingFace load, the fully-uprated, enum-pre-encoded UKMultiYearDataset is cached in memory. _pre_encode_enum_columns() converts string enum columns to int16 before caching so subsequent loads use encode()'s fast integer path. Saves ~2.2s (HDF5 read + uprating + string encoding) per warm simulation.
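The enum pre-encoding step can be illustrated with a small sketch. The mapping, column names, and function name below are assumptions for illustration, not policyengine-uk's actual identifiers; the real `_pre_encode_enum_columns()` operates on the dataset's own enum definitions.

```python
# Illustrative sketch: convert string enum columns to int16 before caching,
# so a later encode() call can take a fast integer path instead of
# re-encoding strings on every warm load.
import numpy as np
import pandas as pd

REGION_CODES = {"LONDON": 0, "NORTH_EAST": 1, "WALES": 2}  # hypothetical mapping


def pre_encode_enum_columns(df, enum_maps):
    out = df.copy()
    for col, mapping in enum_maps.items():
        if out[col].dtype == object:  # only string columns need converting
            out[col] = out[col].map(mapping).astype(np.int16)
    return out


df = pd.DataFrame({"region": ["LONDON", "WALES"], "income": [30_000.0, 25_000.0]})
encoded = pre_encode_enum_columns(df, {"region": REGION_CODES})
```

Doing this once, before the dataset enters the cache, is what moves the string-encoding cost off the warm path.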

Vectorised interpolate_percentile. The Python list comprehension over ~115k households in attends_private_school is replaced with a 21-point parameter lookup, np.interp, and numpy array indexing.
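The vectorisation can be sketched as below. The 21-point table values here are invented for illustration; the real parameter lookup comes from the model's calibration, and the function name is a hypothetical stand-in.

```python
# Sketch: replace a per-household Python list comprehension with a single
# np.interp call over a 21-point percentile table. One vectorised call
# covers all ~115k households at once.
import numpy as np

percentile_points = np.linspace(0, 100, 21)  # 21 anchor percentiles (0, 5, ..., 100)
rate_at_point = np.linspace(0.0, 0.5, 21)    # hypothetical parameter values


def interpolate_percentile_vectorised(percentiles):
    # Piecewise-linear interpolation between the 21 anchors, fully in numpy.
    return np.interp(percentiles, percentile_points, rate_at_point)


household_percentiles = np.array([0.0, 50.0, 100.0])
rates = interpolate_percentile_vectorised(household_percentiles)
```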

Correctness verified: enum variables (region, tenure_type), income tax and household net income all give identical results between cold and warm runs. All tests pass except two pre-existing failures in test_reform_impacts.py.

nikhilwoodruff and others added 3 commits February 18, 2026 09:58
…rised interpolation

Three performance changes that give ~10x speedup on warm Simulation() calls:

1. Cache the fully-processed parameter tree in CountryTaxBenefitSystem.__init__().
   convert_to_fiscal_year_parameters() (22,538 param.update() calls, ~0.5s) now only
   runs once per process. Subsequent inits clone the cached tree. Reforms still go
   through the full pipeline via apply_parameter_changes() -> reset_parameters() ->
   process_parameters(), so correctness is preserved.

2. Cache the loaded, uprated and enum-pre-encoded UKMultiYearDataset per URL.
   _pre_encode_enum_columns() converts string enum columns to int16 before caching
   so subsequent build_from_multi_year_dataset calls use encode()'s fast integer
   path. Saves ~2.2s (HDF5 read + uprating + string encoding) on every warm load.

3. Vectorise attends_private_school.interpolate_percentile. Replace the Python list
   comprehension over ~115k households with a 21-point parameter lookup, np.interp
   and numpy array indexing.

Benchmark: 2nd+ Simulation() drops from ~3.7s to ~0.38s.

Co-Authored-By: Claude <noreply@anthropic.com>
@nikhilwoodruff nikhilwoodruff force-pushed the perf/brma-lha-optimisation branch from ca4956f to 14dcc79 Compare February 18, 2026 09:58
Removes the -m "not microsimulation" exclusion from make test, so reform
impact and salary sacrifice cap tests run in CI. Updates stale expected
values in reforms_config.yaml and test_salary_sacrifice_cap_reform.py to
match current model output.

Co-Authored-By: Claude <noreply@anthropic.com>
@nikhilwoodruff nikhilwoodruff merged commit 033fbb1 into main Feb 18, 2026
2 checks passed
@nikhilwoodruff nikhilwoodruff deleted the perf/brma-lha-optimisation branch February 18, 2026 10:18
nwoodruff-co pushed a commit that referenced this pull request Feb 19, 2026
…m labels

After the warm-load caching optimisation (#1497), enum columns in dataset
DataFrames are stored as int16 for performance. Add decoded_person,
decoded_benunit and decoded_household properties to UKSingleYearDataset that
return copies of the DataFrames with enum columns decoded back to their string
names (e.g. 0 -> 'MALE'). The internal int16 representation is unchanged so
simulation performance is unaffected.
nwoodruff-co pushed a commit that referenced this pull request Feb 19, 2026
…ings

Replace plain DataFrame attributes with properties that decode int16 enum
columns back to string names on access. The raw int16 data is stored in
private _person/_benunit/_household attributes and accessed via .tables by
the simulation engine, preserving the warm-load performance gain from #1497.

Callers (e.g. sim.dataset[year].person) now see string labels as before.
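The private-attribute-plus-decoding-property pattern described in this commit can be sketched as follows. Class, attribute, and mapping names here are illustrative, not UKSingleYearDataset's exact API.

```python
# Hedged sketch: raw int16 enum columns live in a private attribute that the
# simulation engine reads directly (preserving the warm-load gain), while a
# property returns a copy with string labels restored for external callers.
import numpy as np
import pandas as pd

GENDER_LABELS = {0: "MALE", 1: "FEMALE"}  # hypothetical enum mapping


class SingleYearTable:
    def __init__(self, raw: pd.DataFrame, enum_maps: dict):
        self._person = raw          # int16-encoded; fast path for the engine
        self._enum_maps = enum_maps

    @property
    def person(self) -> pd.DataFrame:
        # Decode on access; the cached int16 frame is never mutated.
        out = self._person.copy()
        for col, labels in self._enum_maps.items():
            out[col] = out[col].map(labels)
        return out


raw = pd.DataFrame({"gender": np.array([0, 1], dtype=np.int16)})
table = SingleYearTable(raw, {"gender": GENDER_LABELS})
```

Because decoding happens on a copy at access time, callers see string labels while the engine keeps reading the untouched int16 data.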
nwoodruff-co pushed a commit that referenced this pull request Feb 19, 2026
…ehold (#1498)

* Add decoded_person/benunit/household properties to restore string enum labels

After the warm-load caching optimisation (#1497), enum columns in dataset
DataFrames are stored as int16 for performance. Add decoded_person,
decoded_benunit and decoded_household properties to UKSingleYearDataset that
return copies of the DataFrames with enum columns decoded back to their string
names (e.g. 0 -> 'MALE'). The internal int16 representation is unchanged so
simulation performance is unaffected.

* Make UKSingleYearDataset.person/benunit/household decode enums to strings

Replace plain DataFrame attributes with properties that decode int16 enum
columns back to string names on access. The raw int16 data is stored in
private _person/_benunit/_household attributes and accessed via .tables by
the simulation engine, preserving the warm-load performance gain from #1497.

Callers (e.g. sim.dataset[year].person) now see string labels as before.

* Format with black

* Update UC taper reform expected impact to -44.8bn
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
