PM 31023 - Creating density models for Seeder #7157

Draft
theMickster wants to merge 12 commits into main from PM-31023/creating-density-models-for-seeder

@theMickster theMickster commented Mar 5, 2026

🎟️ Tracking

PM-31023 - Relational Density Modeling
PM-32777 - Baked-In Density Preset Profiles

📔 Objective

Complete the density modeling additions to the Seeder and its presets. This work is a sizable shift from how the Seeder originally allocated entities as it created them. With the new JSON density property, we can now adjust entity distribution precisely in a preset without changing the Seeder itself.

Key changes

  • Add 5 density distributions to preset density block
  • Create 9 production-calibrated scale presets (XS-XL)
  • Fix Hamilton apportionment bug in Distribution
  • Reorganize presets into purpose-based folders
  • Consolidate docs into Seeds/docs/ with cross-refs
  • Add Q5-Q8 verification queries for new distributions
  • Deprecate wonka-teams-small and large-enterprise in favor of the production scale presets

Note on the Hamilton apportionment bug

The Distribution.Select() method divides items into percentage-based buckets using integer truncation, which leaves unclaimed remainder items. The old code silently dumped all remainder onto the last bucket — so a zero-weight HidePasswords bucket would still receive items. The fix uses Hamilton apportionment (largest-remainder method): remainder items go one-at-a-time to whichever buckets lost the most from truncation, and zero-weight buckets are guaranteed to receive exactly zero.

The method is named after Alexander Hamilton, the first U.S. Secretary of the Treasury, who proposed it in 1792 to apportion congressional seats among the states. The math problem is the same: distribute a fixed number of indivisible items (seats, or in our case collection permissions) proportionally across groups when the proportional shares aren't whole numbers.
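To make the difference concrete, here is a minimal Python sketch of both behaviors described above. It is not the Seeder's C# code: the function names are hypothetical, weights are assumed to be non-negative integers (for example percentages), and ties are broken by lowest bucket index, which is an assumption of this sketch.

```python
from fractions import Fraction


def truncate_and_dump_last(total, weights):
    """Old behavior as described: truncate each share, give all remainder to the last bucket."""
    weight_sum = sum(weights)
    counts = [total * w // weight_sum for w in weights]
    counts[-1] += total - sum(counts)
    return counts


def hamilton_apportion(total, weights):
    """Largest-remainder (Hamilton) split of `total` indivisible items by integer weights."""
    weight_sum = sum(weights)
    if total < 0 or weight_sum <= 0 or any(w < 0 for w in weights):
        raise ValueError("total and weights must be non-negative, with a positive weight sum")

    # Exact proportional share for each bucket, kept as a fraction to avoid float error.
    quotas = [Fraction(total * w, weight_sum) for w in weights]
    # Truncate each share; quotas are non-negative, so int() is a floor here.
    counts = [int(q) for q in quotas]
    leftover = total - sum(counts)

    # Hand out leftover items one at a time to the buckets that lost the most
    # to truncation. Ties go to the lower index (an assumption of this sketch).
    order = sorted(range(len(weights)), key=lambda i: (-(quotas[i] - counts[i]), i))
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Seven items over weights 50/30/20/0, where the last bucket has zero weight.
print(truncate_and_dump_last(7, [50, 30, 20, 0]))  # [3, 2, 1, 1]
print(hamilton_apportion(7, [50, 30, 20, 0]))      # [4, 2, 1, 0]
```

A zero-weight bucket has a quota of exactly zero and therefore a fractional part of zero, so it always ranks behind every bucket that actually lost something to truncation and never receives a leftover item.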

Where did our distribution statistics come from?

The scale preset archetypes are modeled after three real production organizations analyzed in DBOPS-91: Company A (hierarchical, 2,795 users/74 groups), Company B (flat, 11,491 users/5 groups/13,906 collections), and Company C (balanced, 954 users/99 groups). These profiles revealed that production relationship patterns follow power-law and mega-group distributions — not the uniform round-robin the seeder previously generated. Each scale preset's density parameters (membership skew, collection fan-out, permission weights, orphan rates) were calibrated to reproduce these observed production shapes at five tiers from family (6 users) to mega-corp (10,000 users).
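As a rough illustration of why a power-law shape differs from uniform round-robin, the sketch below computes per-group membership shares that decay with group rank. This PR description does not state the formula the Seeder uses, so the `1 / rank**exponent` form, the function name, and the `exponent` parameter are all assumptions made for illustration.

```python
def power_law_shares(group_count, exponent=1.0):
    """Return membership shares that decay with group rank and sum to 1.

    Hypothetical helper: rank 1 is the largest group, and an exponent of 0
    reproduces a uniform split.
    """
    if group_count < 1:
        raise ValueError("group_count must be at least 1")
    raw = [1.0 / (rank ** exponent) for rank in range(1, group_count + 1)]
    total = sum(raw)
    return [value / total for value in raw]


# Compare a skewed split against a uniform one for five groups.
skewed = power_law_shares(5, exponent=1.0)
uniform = power_law_shares(5, exponent=0.0)
print([round(share, 3) for share in skewed])
print([round(share, 3) for share in uniform])
```

With an exponent of 1 the first group holds well over a third of all memberships while the last holds under a tenth, which is the kind of skew a uniform generator cannot reproduce.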

Why the re-organization of presets?

The seeder is still in early adoption: renaming presets now costs nearly nothing, but doing it after teams have built scripts around the names would be expensive. Purpose-based folders (features/qa/scale/validation) make preset discovery self-documenting so engineers don't need to read a README to find the right one. Consolidating docs into Seeds/docs/ eliminates duplication across scattered READMEs and separates everyday usage from developer-only verification content.

🧪 Testing

Expand for detailed instructions

Step 1: Verify preset resolution (all 4 folders)

From util/SeederUtility/, run one preset from each folder:

dotnet run -- seed --preset features.sso-enterprise --mangle
dotnet run -- seed --preset qa.enterprise-basic --mangle
dotnet run -- seed --preset scale.sm-balanced-planet-express --mangle
dotnet run -- seed --preset validation.density-modeling-power-law-test --mangle

All four should seed successfully with no errors.

Step 2: Verify density distributions on a scale preset

Seed a mid-tier and large-tier preset:

dotnet run -- seed --preset scale.md-balanced-sterling-cooper --mangle
dotnet run -- seed --preset scale.lg-highperm-tyrell-corp --mangle

After each, run the verification queries from util/Seeder/Seeds/docs/verification.md against your local MSSQL database. Compare results to the expected-value tables in the same doc.

Key things to verify

  • Q1: Group membership follows power-law decay (not uniform)
  • Q3: Permission percentages match configured weights
  • Q4: Orphan cipher count matches configured rate
  • Q7: Collections-per-user shows min/max spread (not flat 1-2-3)
  • Q8: Multi-collection ciphers present at configured rate
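For the Q3 check, comparing query output against the configured weights by eye is error-prone, so a small helper can do it. The sketch below is a hypothetical Python check, not part of the Seeder or the verification doc: the function name, the 1-point tolerance, and every permission name other than HidePasswords are assumptions, and the counts are made-up sample inputs rather than real query results.

```python
def permission_mismatches(observed_counts, configured_weights, tolerance_pct=1.0):
    """List permissions whose observed share differs from the configured weight.

    A zero-weight permission with any rows is always reported, regardless of tolerance.
    """
    total = sum(observed_counts.values())
    weight_sum = sum(configured_weights.values())
    if total == 0 or weight_sum == 0:
        raise ValueError("need at least one observed row and one non-zero weight")

    problems = []
    for name in sorted(set(observed_counts) | set(configured_weights)):
        count = observed_counts.get(name, 0)
        weight = configured_weights.get(name, 0)
        expected_pct = 100.0 * weight / weight_sum
        actual_pct = 100.0 * count / total
        if weight == 0 and count > 0:
            problems.append(f"{name}: zero weight but {count} rows")
        elif abs(actual_pct - expected_pct) > tolerance_pct:
            problems.append(f"{name}: expected {expected_pct:.1f}%, got {actual_pct:.1f}%")
    return problems


# Illustrative inputs only; substitute the counts returned by Q3.
weights = {"ReadOnly": 50, "ReadWrite": 30, "Manage": 20, "HidePasswords": 0}
clean = {"ReadOnly": 500, "ReadWrite": 300, "Manage": 200, "HidePasswords": 0}
leaky = {"ReadOnly": 499, "ReadWrite": 300, "Manage": 200, "HidePasswords": 1}
print(permission_mismatches(clean, weights))  # []
print(permission_mismatches(leaky, weights))  # ['HidePasswords: zero weight but 1 rows']
```

The strict zero-weight rule matters because a single stray row is far below any percentage tolerance, yet it is exactly the symptom the apportionment fix is meant to eliminate.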

Step 3: Verify backward compatibility

Seed the no-density validation preset:

dotnet run -- seed --preset validation.density-modeling-no-density-test --mangle

Key things to verify

  • 0 CollectionGroup records
  • Uniform round-robin group membership
  • Every cipher assigned to at least one collection

This confirms the null-density path is unchanged.

Claude Code prompt for verification

Note: Mick has a reading-bw-mssql skill that automates the pwsh/SqlClient connection pattern. If you'd like it for your Claude Code setup, ask him to share it.

If you'd like Claude Code to run the verification queries for you, use this prompt after seeding:

1. Read util/Seeder/Seeds/docs/verification.md for the SQL queries.
2. Run Q1 through Q8 against org ID '{paste-org-id-here}' using pwsh with $env:BW_READ_ONLY_MSSQL_CONNECTION_STRING.
3. Present results as markdown tables and compare against the expected values for {preset-name} in the verification doc.


github-actions bot commented Mar 5, 2026

Checkmarx One – Scan Summary & Details 7acff828-9ba9-4e2b-8e28-5cb492a54a88

Great job! No new security vulnerabilities introduced in this pull request


codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.77%. Comparing base (996f479) to head (9832ae8).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7157      +/-   ##
==========================================
+ Coverage   56.68%   56.77%   +0.08%     
==========================================
  Files        2026     2026              
  Lines       88681    88685       +4     
  Branches     7905     7906       +1     
==========================================
+ Hits        50272    50348      +76     
+ Misses      36585    36507      -78     
- Partials     1824     1830       +6     


@theMickster theMickster changed the title Pm 31023 - Creating density models for Seeder PM 31023 - Creating density models for Seeder Mar 5, 2026

sonarqubecloud bot commented Mar 5, 2026
