Investigate maximum feasible QRF subsample size for PUF imputation #541

@baogorek

Description

Context

In #537, puf_impute.py fuses PUF tax variables onto CPS records via quantile random forests (QRF). The QRF is trained on PUF donor records and then predicts ~104 tax variables for each CPS recipient, matching on demographic predictors (age, sex, filing status, dependents). To keep this tractable, the CPS side is stratified-subsampled to 20K records (force-including the top 0.5% by AGI to preserve the high-income tail).
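
For illustration, here is a minimal sketch of the kind of stratified subsampling with a force-included AGI tail described above. The column names (`agi`, `filing_status`) and the function itself are hypothetical stand-ins, not the actual puf_impute.py schema or code:

```python
import pandas as pd

def subsample_cps(cps: pd.DataFrame, target_n: int = 20_000,
                  top_frac: float = 0.005, seed: int = 0) -> pd.DataFrame:
    """Subsample CPS records while force-including the top AGI tail."""
    # Force-include the top 0.5% of records by AGI to preserve the high-income tail.
    cutoff = cps["agi"].quantile(1 - top_frac)
    forced = cps[cps["agi"] >= cutoff]

    # Sample the remainder proportionally within strata (filing status shown here).
    rest = cps.drop(forced.index)
    n_rest = max(target_n - len(forced), 0)
    sampled = (
        rest.groupby("filing_status", group_keys=False)
        .apply(lambda g: g.sample(
            n=max(1, round(n_rest * len(g) / len(rest))), random_state=seed))
    )
    # Forced records come first in the concat, so trimming never drops the tail.
    return pd.concat([forced, sampled]).head(target_n)
```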

The 20K target was chosen pragmatically but hasn't been stress-tested; it's unclear whether 20K is near the ceiling or whether we could go significantly higher. Larger subsamples should improve imputation quality (especially in the tails), so it's worth finding the practical upper bound.

Questions to answer

  • What is the maximum subsample size that fits in memory during QRF training (given ~104 target variables batched in groups)?
  • How does imputation accuracy (e.g., distribution of imputed AGI, top-tail fidelity) change as subsample size increases from 20K toward the full CPS?
  • Is there a point of diminishing returns where more data doesn't meaningfully improve the QRF fits?
  • Does the current batching strategy (grouping target variables) affect the memory ceiling? (See the sketch below.)
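
On the batching question, a rough sketch of how batch size bounds how many fitted forests are alive at once. This uses the quantile-forest package as a generic QRF stand-in for whatever implementation puf_impute.py actually uses, and one forest per target within a batch; the real code may instead fit targets jointly:

```python
import gc
import numpy as np
from quantile_forest import RandomForestQuantileRegressor  # generic QRF stand-in

def impute_batched(X_donor, Y_donor, X_recipient, batch_size=10, seed=0):
    """Impute target columns in batches so only one batch of forests is
    alive at a time; batch_size, not the ~104 total targets, then bounds
    the forest memory footprint."""
    n_targets = Y_donor.shape[1]
    out = np.empty((X_recipient.shape[0], n_targets))

    for start in range(0, n_targets, batch_size):
        stop = min(start + batch_size, n_targets)
        forests = []
        for j in range(start, stop):
            qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=seed)
            qrf.fit(X_donor, Y_donor[:, j])
            forests.append(qrf)
        for offset, qrf in enumerate(forests):
            # Median shown for brevity; the real imputation draws from the
            # conditional distribution rather than taking a point estimate.
            out[:, start + offset] = qrf.predict(X_recipient, quantiles=[0.5])[:, 0]
        del forests
        gc.collect()  # release this batch's trees before training the next batch
    return out
```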

Suggested approach

  1. Benchmark memory usage and wall time at 20K, 40K, 60K, etc. (see the benchmark sketch after this list).
  2. Compare imputed distributions (quantiles, max AGI) across subsample sizes.
  3. Identify the largest feasible size given CI runner constraints (~7 GB RAM).
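
A possible harness for steps 1-3, assuming a hypothetical `run_imputation(n)` wrapper around the puf_impute.py pipeline that subsamples n CPS records and returns the imputed AGI array. RSS is sampled before and after each run via psutil; a true peak measurement would need polling in a background thread or a tool like memory_profiler:

```python
import time
import numpy as np
import psutil

def benchmark(run_imputation, sizes=(20_000, 40_000, 60_000)):
    """Record wall time, resident memory, and imputed-AGI summary stats
    for each candidate subsample size."""
    proc = psutil.Process()
    results = []
    for n in sizes:
        rss_before = proc.memory_info().rss / 1e9
        t0 = time.perf_counter()
        imputed_agi = np.asarray(run_imputation(n))
        elapsed = time.perf_counter() - t0
        rss_after = proc.memory_info().rss / 1e9
        results.append({
            "n": n,
            "seconds": round(elapsed, 1),
            "rss_gb": round(rss_after, 2),  # compare against the ~7 GB CI runner limit
            "rss_delta_gb": round(rss_after - rss_before, 2),
            "agi_p50": float(np.quantile(imputed_agi, 0.50)),
            "agi_p99": float(np.quantile(imputed_agi, 0.99)),
            "agi_max": float(imputed_agi.max()),
        })
    return results
```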

Related

🤖 Generated with Claude Code
