Add occupation/industry as predictors in tip income imputation model #526

@MaxGhenis

Summary

The quantile regression forest (QRF) model for tip income imputation uses only four features: employment_income, age, count_under_18, and count_under_6. Occupation and industry are the strongest predictors of who receives tips, yet the model does not use them even though both are available in SIPP.

Available SIPP variables (already loaded but unused)

  • TJB*_OCC — Occupation codes per job (up to 7 jobs)
  • TJB*_IND — Industry codes per job (up to 7 jobs)

Why this matters

Tip income is highly concentrated in specific occupations (food servers, bartenders, hairdressers, etc.) and industries (NAICS 72: Accommodation and Food Services). Without occupation/industry, the model spreads tip income more diffusely across the income distribution, which:

  1. Understates tip concentration among low-wage service workers
  2. Reduces the accuracy of distributional impact estimates for "no tax on tips"
  3. May assign tips to workers in non-tipped occupations
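
One way to quantify the concentration concern is to measure what share of weighted imputed tip income lands on records flagged as tipped occupations, before and after adding the new predictors. The following is a minimal sketch in plain Python; the function name `tip_share_in_tipped_jobs` and the `is_tipped` flag are hypothetical and not part of the existing codebase.

```python
def tip_share_in_tipped_jobs(tips, is_tipped, weights=None):
    """Share of weighted tip income held by records flagged as tipped.

    tips      : imputed tip income per record
    is_tipped : truthy where the record's occupation is considered tipped
                (hypothetical flag; how it is built is an open choice)
    weights   : optional survey weights; defaults to equal weights
    """
    if weights is None:
        weights = [1.0] * len(tips)
    total = sum(t * w for t, w in zip(tips, weights))
    if total == 0:
        return 0.0  # no tip income at all, so the share is undefined; report 0
    tipped = sum(t * w for t, f, w in zip(tips, is_tipped, weights) if f)
    return tipped / total


# Toy example: two of four records are in tipped occupations.
share = tip_share_in_tipped_jobs([100, 0, 50, 50], [1, 0, 0, 1])
```

A share well below what SIPP itself shows for the same occupations would suggest the model is diffusing tips to non-tipped workers.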

Suggested approach

  1. Map SIPP occupation/industry codes to CPS occupation/industry codes (or use broad categories like 2-digit NAICS)
  2. Add these as categorical features to the QRF model in sipp.py
  3. Ensure the CPS recipient dataset has matching occupation/industry variables for prediction
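
The steps above could be sketched roughly as follows: map each survey's codes to a shared set of broad categories, then emit identical indicator columns for both the SIPP donor data and the CPS recipient data. This is only an illustration in plain Python. The crosswalk contents (`"S1"`, `"C9"`), the CPS column name `cps_ind`, and the helper `add_industry_features` are all placeholders, not real codes or existing functions in sipp.py.

```python
# Placeholder crosswalks: real ones would map actual SIPP and CPS codes.
SIPP_TO_BROAD = {"S1": "food_services"}  # hypothetical SIPP codes
CPS_TO_BROAD = {"C9": "food_services"}   # hypothetical CPS codes
LEVELS = ["food_services", "other"]      # shared category levels


def add_industry_features(rows, code_key, crosswalk, levels=LEVELS):
    """Return copies of rows with one indicator column per broad category.

    Codes that are missing or absent from the crosswalk fall into "other",
    so donor and recipient data always get the same set of columns.
    """
    out = []
    for row in rows:
        category = crosswalk.get(row.get(code_key), "other")
        new_row = dict(row)
        for level in levels:
            new_row[f"industry_{level}"] = int(category == level)
        out.append(new_row)
    return out


# Donor (SIPP) and recipient (CPS) rows use different raw code columns
# but end up with matching feature columns.
sipp_rows = add_industry_features(
    [{"age": 25, "TJB1_IND": "S1"}], "TJB1_IND", SIPP_TO_BROAD
)
cps_rows = add_industry_features(
    [{"age": 40, "cps_ind": "C7"}], "cps_ind", CPS_TO_BROAD
)
```

Using explicit indicator columns with a fixed `LEVELS` list keeps the training and prediction feature sets aligned even when a category is absent from one dataset.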

Mapping challenge

CPS and SIPP use different occupation/industry classification systems, so a crosswalk may be needed. At minimum, a broad industry indicator (e.g., food services vs. other) would capture most of the signal.
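
As a minimal sketch of the broad-indicator fallback, the function below flags a person as working in food services if any of their per-job industry codes falls in a given set. The default codes are an assumption (intended as Census industry codes for restaurants and drinking places) and should be verified against the code list for the SIPP and CPS vintages in use; the function names are hypothetical.

```python
# Assumption: Census industry codes for restaurants/other food services and
# drinking places. Verify against the official code list before relying on it.
FOOD_SERVICE_CODES = frozenset({8680, 8690})


def is_food_services(industry_code, codes=FOOD_SERVICE_CODES):
    """Return 1 if the code is in the food-services set, else 0.

    Missing values (None or NaN) are treated as not food services.
    """
    if industry_code is None or industry_code != industry_code:  # NaN check
        return 0
    return int(int(industry_code) in codes)


def any_job_in_food_services(job_codes, codes=FOOD_SERVICE_CODES):
    """Collapse per-job codes (e.g. the TJB*_IND columns) into one 0/1 flag."""
    return int(any(is_food_services(c, codes) for c in job_codes))


# A person whose second job is in a restaurant is flagged.
flag = any_job_in_food_services([None, 8680])
```

A single 0/1 flag like this can be computed the same way on both surveys, which sidesteps a full crosswalk at the cost of ignoring other tipped industries such as personal care services.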

Context

This is the highest-impact improvement for closing the gap between PolicyEngine's tip deduction estimate ($4.7B) and JCT's score ($10.0B for FY2026).

Metadata

Labels: enhancement (New feature or request)
