Context
In #537, puf_impute.py fuses PUF tax variables onto CPS records via quantile random forests (QRF). The QRF is trained on PUF donor records and then predicts ~104 tax variables for each CPS recipient, matching on demographic predictors (age, sex, filing status, dependents). To keep this tractable, the CPS side is stratified-subsampled to 20K records (force-including the top 0.5% by AGI to preserve the high-income tail).
The 20K target was chosen pragmatically but hasn't been stress-tested — it's unclear whether 20K is near the ceiling or if we could go significantly higher. Larger subsamples should improve imputation quality (especially in the tails), so it's worth finding the practical upper bound.
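For concreteness, a minimal sketch of the subsampling step described above is given below. The helper name, the `agi` column name, and the plain random draw (instead of the actual stratified draw) are placeholders, not the puf_impute.py implementation:

```python
import numpy as np
import pandas as pd


def subsample_cps(cps: pd.DataFrame, target_size: int = 20_000,
                  top_fraction: float = 0.005, seed: int = 0) -> pd.DataFrame:
    """Subsample CPS records while force-including the top AGI tail.

    Simplified sketch: the real puf_impute.py stratifies the draw, and the
    "agi" column name here is a placeholder.
    """
    rng = np.random.default_rng(seed)

    # Always keep the top 0.5% by AGI so the high-income tail survives.
    cutoff = cps["agi"].quantile(1 - top_fraction)
    top = cps[cps["agi"] >= cutoff]
    rest = cps[cps["agi"] < cutoff]

    # Fill the remaining slots with a random draw from the rest of the CPS.
    n_rest = min(max(target_size - len(top), 0), len(rest))
    sampled = rest.iloc[rng.choice(len(rest), size=n_rest, replace=False)]

    return pd.concat([top, sampled]).reset_index(drop=True)
```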
Questions to answer
- What is the maximum subsample size that fits in memory during QRF training (given ~104 target variables batched in groups)?
- How does imputation accuracy (e.g., distribution of imputed AGI, top-tail fidelity) change as subsample size increases from 20K toward the full CPS?
- Is there a point of diminishing returns where more data doesn't meaningfully improve the QRF fits?
- Does the current batching strategy (grouping target variables) affect the memory ceiling? (See the profiling sketch below.)
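To make the memory question concrete, a rough profiling harness could look like the following. This is a sketch only: it uses a plain scikit-learn RandomForestRegressor as a stand-in for the project's QRF, and the batch size and function name are made up; the actual batching in puf_impute.py may differ.

```python
import resource
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_batches(X: np.ndarray, Y: np.ndarray, batch_size: int = 10):
    """Fit one forest per batch of target columns, logging time and memory."""
    logs = []
    n_targets = Y.shape[1]
    for start in range(0, n_targets, batch_size):
        cols = list(range(start, min(start + batch_size, n_targets)))
        t0 = time.perf_counter()

        # Stand-in for the quantile random forest used in puf_impute.py.
        model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
        model.fit(X, Y[:, cols])

        # ru_maxrss is the process-wide peak RSS so far (kilobytes on Linux),
        # so it only ever increases across batches.
        peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        logs.append({
            "targets": cols,
            "seconds": time.perf_counter() - t0,
            "peak_rss_mb": peak_kb / 1024,
        })
    return logs
```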
Suggested approach
- Benchmark memory usage and wall time at 20K, 40K, 60K, etc. (see the sweep sketch after this list).
- Compare imputed distributions (quantiles, max AGI) across subsample sizes.
- Identify the largest feasible size given CI runner constraints (~7 GB RAM).
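One possible shape for the benchmark sweep, assuming a hypothetical `run_imputation(size)` wrapper around the existing pipeline that returns the imputed AGI values for a given subsample size:

```python
import time

import numpy as np


def sweep_subsample_sizes(run_imputation, sizes=(20_000, 40_000, 60_000)):
    """Re-run the imputation at several subsample sizes and summarise AGI."""
    tail_quantiles = (0.5, 0.9, 0.99, 0.999)
    rows = []
    for size in sizes:
        t0 = time.perf_counter()
        agi = np.asarray(run_imputation(size))  # hypothetical pipeline wrapper
        row = {
            "size": size,
            "seconds": time.perf_counter() - t0,
            "max_agi": float(agi.max()),
        }
        row.update({f"q{q}": float(np.quantile(agi, q)) for q in tail_quantiles})
        rows.append(row)
    return rows


# Diminishing returns: stop scaling up once successive sizes barely move the
# tail quantiles and max AGI, or once peak memory nears the ~7 GB CI limit.
```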
Related
- #530: CPS top-coding caps AGI at $6.26M (zero observations above $10M in any state)
- #537: Add PUF + source impute modules, fix AGI ceiling (issue #530); the PUF imputation module where the 20K default was introduced
🤖 Generated with Claude Code