Investigate maximum feasible QRF subsample size for PUF imputation #541

@baogorek

Description

Context

In #537, puf_impute.py fuses PUF tax variables onto CPS records via quantile random forests (QRF). The QRF is trained on PUF donor records and then predicts ~104 tax variables for each CPS recipient, matching on demographic predictors (age, sex, filing status, dependents). To keep this tractable, the CPS side is stratified-subsampled to 20K records (force-including the top 0.5% by AGI to preserve the high-income tail).
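
For illustration, here is a minimal sketch of the kind of stratified subsampling with a force-included AGI tail described above. The column names (`agi`, `filing_status`) and the function itself are hypothetical stand-ins, not the actual puf_impute.py schema or code:

```python
import pandas as pd

def subsample_cps(cps: pd.DataFrame, target_n: int = 20_000,
                  top_frac: float = 0.005, seed: int = 0) -> pd.DataFrame:
    """Subsample CPS records while force-including the top AGI tail."""
    # Force-include the top 0.5% of records by AGI to preserve the high-income tail.
    cutoff = cps["agi"].quantile(1 - top_frac)
    forced = cps[cps["agi"] >= cutoff]

    # Sample the remainder proportionally within strata (filing status shown here).
    rest = cps.drop(forced.index)
    n_rest = max(target_n - len(forced), 0)
    sampled = (
        rest.groupby("filing_status", group_keys=False)
        .apply(lambda g: g.sample(
            n=max(1, round(n_rest * len(g) / len(rest))), random_state=seed))
    )
    # Forced records come first in the concat, so trimming never drops the tail.
    return pd.concat([forced, sampled]).head(target_n)
```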

The 20K target was chosen pragmatically but hasn't been stress-tested; it's unclear whether 20K is near the ceiling or whether we could go significantly higher. Larger subsamples should improve imputation quality (especially in the tails), so it's worth finding the practical upper bound.

Questions to answer

  • What is the maximum subsample size that fits in memory during QRF training (given ~104 target variables batched in groups)?
  • How does imputation accuracy (e.g., distribution of imputed AGI, top-tail fidelity) change as subsample size increases from 20K toward the full CPS?
  • Is there a point of diminishing returns where more data doesn't meaningfully improve the QRF fits?
  • Does the current batching strategy (grouping target variables) affect the memory ceiling? (See the sketch below.)
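
On the batching question, a rough sketch of how batch size bounds how many fitted forests are alive at once. This uses the quantile-forest package as a generic QRF stand-in for whatever implementation puf_impute.py actually uses, and one forest per target within a batch; the real code may instead fit targets jointly:

```python
import gc
import numpy as np
from quantile_forest import RandomForestQuantileRegressor  # generic QRF stand-in

def impute_batched(X_donor, Y_donor, X_recipient, batch_size=10, seed=0):
    """Impute target columns in batches so only one batch of forests is
    alive at a time; batch_size, not the ~104 total targets, then bounds
    the forest memory footprint."""
    n_targets = Y_donor.shape[1]
    out = np.empty((X_recipient.shape[0], n_targets))

    for start in range(0, n_targets, batch_size):
        stop = min(start + batch_size, n_targets)
        forests = []
        for j in range(start, stop):
            qrf = RandomForestQuantileRegressor(n_estimators=100, random_state=seed)
            qrf.fit(X_donor, Y_donor[:, j])
            forests.append(qrf)
        for offset, qrf in enumerate(forests):
            # Median shown for brevity; the real imputation draws from the
            # conditional distribution rather than taking a point estimate.
            out[:, start + offset] = qrf.predict(X_recipient, quantiles=[0.5])[:, 0]
        del forests
        gc.collect()  # release this batch's trees before training the next batch
    return out
```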

Suggested approach

  1. Benchmark memory usage and wall time at 20K, 40K, 60K, etc. (see the benchmark sketch after this list).
  2. Compare imputed distributions (quantiles, max AGI) across subsample sizes.
  3. Identify the largest feasible size given CI runner constraints (~7 GB RAM).
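
A possible harness for steps 1-3, assuming a hypothetical `run_imputation(n)` wrapper around the puf_impute.py pipeline that subsamples n CPS records and returns the imputed AGI array. RSS is sampled before and after each run via psutil; a true peak measurement would need polling in a background thread or a tool like memory_profiler:

```python
import time
import numpy as np
import psutil

def benchmark(run_imputation, sizes=(20_000, 40_000, 60_000)):
    """Record wall time, resident memory, and imputed-AGI summary stats
    for each candidate subsample size."""
    proc = psutil.Process()
    results = []
    for n in sizes:
        rss_before = proc.memory_info().rss / 1e9
        t0 = time.perf_counter()
        imputed_agi = np.asarray(run_imputation(n))
        elapsed = time.perf_counter() - t0
        rss_after = proc.memory_info().rss / 1e9
        results.append({
            "n": n,
            "seconds": round(elapsed, 1),
            "rss_gb": round(rss_after, 2),  # compare against the ~7 GB CI runner limit
            "rss_delta_gb": round(rss_after - rss_before, 2),
            "agi_p50": float(np.quantile(imputed_agi, 0.50)),
            "agi_p99": float(np.quantile(imputed_agi, 0.99)),
            "agi_max": float(imputed_agi.max()),
        })
    return results
```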

Related

🤖 Generated with Claude Code
