This document provides the mathematical foundation for the main benchmark metrics and reference material for legacy/auxiliary formulas.
Main metrics (report these only):
- PAS — Probability of Agreement State (raw): alignment of human/agent replication decisions.
- ECS — Effect Consistency Score: correlation-based (Lin's Concordance Correlation Coefficient, CCC, between human and agent effect profiles).
Legacy / auxiliary (do not report as primary): PAS normalized, ECS_Strict (Z-diff based), caricature regression, per-test Z-diff, etc.
- Per-Test Definitions (PAS only)
- PAS Aggregation (Test → Finding → Study)
- ECS (Correlation-based, main)
- Appendix: ECS_Strict (Z-diff, legacy)
- Edge Cases & Numeric Thresholds
- Effective Sample Size (n_eff) Definitions
- Effect Size & Standard Error Formulas
- Code Mapping
ECS is not defined per test in the main metric; it is defined at aggregate level as correlation-based (see ECS (Correlation-based, main)). Per-test effect sizes and Z-diff are used only for legacy ECS_Strict (appendix).
For each statistical test, we compute:
Posterior Probabilities:
- $\pi_h$: Posterior probability that an effect exists in human data (0 to 1)
- $\pi_a$: Posterior probability that an effect exists in agent data (0 to 1)

These are derived from Bayes Factors.
PAS Formula:
$$PAS_i = \pi_{h,i}\,\pi_{a,i} + (1 - \pi_{h,i})(1 - \pi_{a,i})$$
This measures the probability that human and agent are in the same state (both replicated or both null).
Range:
- $PAS = 1$: Perfect alignment (both agree on effect or both agree on null)
- $PAS = 0$: Complete contradiction (one says effect, the other says null)
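As a quick illustration, the same-state probability can be computed directly from the two posteriors (a minimal sketch; `pas_score` is a hypothetical helper name, not the benchmark's API):

```python
def pas_score(pi_h: float, pi_a: float) -> float:
    """Probability that human and agent land in the same state:
    both show an effect (pi_h * pi_a) or both show a null."""
    return pi_h * pi_a + (1.0 - pi_h) * (1.0 - pi_a)
```

For example, confident agreement ($\pi_h = \pi_a = 0.9$) gives $PAS = 0.82$, while confident disagreement ($\pi_h = 0.9$, $\pi_a = 0.1$) gives $PAS = 0.18$.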
Step 1: Convert PAS to r-scale
For each test $i$ in the finding:
$$r_i = 2 \cdot PAS_i - 1$$
This transforms PAS from $[0, 1]$ to the correlation-like scale $[-1, 1]$.
Step 2: Single Test Case
If the finding has only one test ($K = 1$):
$$PAS_f^{raw} = PAS_1$$
No aggregation needed.
Step 3: Multiple Tests - Fisher-z Pooling
If the finding has $K > 1$ tests:
- Fisher-z Transform:
  $$z_i = \tanh^{-1}(r_i) = \frac{1}{2}\ln\frac{1+r_i}{1-r_i}$$
- Weighted Average:
  $$w_i = n_{eff,i} \quad \text{(effective sample size for test } i\text{)}$$
  $$\bar{z}_f = \frac{\sum_{i=1}^K w_i z_i}{\sum_{i=1}^K w_i}$$
- Inverse Transform:
  $$r_f = \tanh(\bar{z}_f)$$
- Convert Back to PAS:
  $$PAS_f^{raw} = \frac{r_f + 1}{2}$$
Code: src/evaluation/stats_lib.py::aggregate_finding_pas_raw()
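The pooling steps can be sketched end to end as follows (an illustrative re-implementation under the stated formulas; the authoritative version is `src/evaluation/stats_lib.py::aggregate_finding_pas_raw()`):

```python
import math

def aggregate_pas_raw(pas_values, weights, eps=1e-6):
    """Pool per-test PAS values via Fisher-z, weighted by n_eff."""
    if len(pas_values) == 1:
        return pas_values[0]  # single test: no aggregation needed
    zs = []
    for pas in pas_values:
        r = 2.0 * pas - 1.0                      # map [0, 1] -> [-1, 1]
        r = max(-1.0 + eps, min(1.0 - eps, r))   # clamp to avoid atanh(+-1)
        zs.append(math.atanh(r))
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    r_f = math.tanh(z_bar)
    return (r_f + 1.0) / 2.0                     # map back to [0, 1]
```

Note that a fully agreeing and a fully contradicting test with equal weights pool to 0.5, as expected for maximally mixed evidence.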
Step 1: Compute Human Ceiling
For each test $i$:
$$PAS_i^{ceil} = \pi_{h,i}^2 + (1 - \pi_{h,i})^2$$
This represents the maximum possible PAS when the agent perfectly matches the human posteriors ($\pi_a = \pi_h$).
Step 2: Convert to r-scale
$$r_i^{ceil} = 2 \cdot PAS_i^{ceil} - 1$$
Step 3: Normalized Ratio
$$r_i' = \frac{r_i}{r_i^{ceil}}$$
Step 4: Single Test Case
If the finding has only one test ($K = 1$):
$$PAS_f^{norm} = r_1'$$
Step 5: Multiple Tests - Fisher-z Pooling
If the finding has $K > 1$ tests:
- Clamp $r_i'$ to the valid range:
  $$r_i' \leftarrow \max(-1 + \epsilon, \min(1 - \epsilon, r_i'))$$
  where $\epsilon = 10^{-6}$
- Fisher-z Transform:
  $$z_i' = \tanh^{-1}(r_i')$$
- Weighted Average:
  $$\bar{z}_f' = \frac{\sum_{i=1}^K w_i z_i'}{\sum_{i=1}^K w_i}$$
- Inverse Transform:
  $$r_f' = \tanh(\bar{z}_f')$$
  $$PAS_f^{norm} = r_f'$$
Code: src/evaluation/stats_lib.py::aggregate_finding_pas_norm()
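A sketch of the normalized pipeline, assuming the ceiling is $\pi_h^2 + (1-\pi_h)^2$, the ratio is taken on the r-scale, and near-zero ceilings are skipped (these assumptions follow the edge-case rules documented below; the authoritative version is `aggregate_finding_pas_norm()`):

```python
import math

def aggregate_pas_norm(pas_values, pi_h_values, weights, eps=1e-6):
    """Pool ceiling-normalized PAS ratios via weighted Fisher-z."""
    zs, ws = [], []
    for pas, pi_h, w in zip(pas_values, pi_h_values, weights):
        ceil = pi_h**2 + (1.0 - pi_h)**2         # PAS if agent matches human
        r, r_ceil = 2.0 * pas - 1.0, 2.0 * ceil - 1.0
        if abs(r_ceil) < eps:
            continue                              # neutral human evidence: skip
        r_norm = max(-1.0 + eps, min(1.0 - eps, r / r_ceil))
        zs.append(math.atanh(r_norm))
        ws.append(w)
    if not zs:
        return 0.0                                # no usable tests
    return math.tanh(sum(w * z for w, z in zip(ws, zs)) / sum(ws))
```

A test whose raw PAS reaches the human ceiling normalizes to (nearly) 1, and a finding with only neutral human evidence returns 0.0.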
Simple Mean Across Findings:
$$PAS_{study} = \frac{1}{F}\sum_{f=1}^{F} PAS_f$$
where $F$ is the number of findings in the study.
Note: We use a simple mean (not weighted by finding_weight), matching the intended interpretation: "randomly pick a human finding; how likely is this agent to reproduce the effect?"
Code: src/evaluation/stats_lib.py::aggregate_study_pas()
ECS (Effect Consistency Score) is the main effect-consistency metric. It measures construct validity / profile similarity between human and agent effect profiles using Lin's Concordance Correlation Coefficient (CCC) — weighted CCC, not Pearson and not Z-diff.
All effect sizes are converted to Cohen's d-equivalent for consistency:
- t-test / F-test / ANOVA: already Cohen's $d$ → $\delta = d$
- Correlation (stored as Fisher $z$): $r = \tanh(z)$, then $d = \frac{2r}{\sqrt{1-r^2}}$
- Chi-square (stored as log OR): $d \approx \log(OR) \cdot \frac{\sqrt{3}}{\pi}$
- Mann-Whitney (stored as rank-biserial $r_{rb}$): $d = \frac{2r_{rb}}{\sqrt{1-r_{rb}^2}}$
- Binomial (stored as proportion $p$): $d \approx 2 \cdot \frac{p - p_0}{\sqrt{p_0(1-p_0)}}$, where $p_0 = 0.5$
Code: src/evaluation/stats_lib.py::effect_to_d_equiv()
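The conversion rules can be sketched as a single dispatch function (an illustrative re-implementation of the rules above; the authoritative version is `src/evaluation/stats_lib.py::effect_to_d_equiv()`, and the `kind` labels here are hypothetical):

```python
import math

def to_d_equiv(kind: str, value: float) -> float:
    """Convert a stored effect size to its Cohen's d equivalent."""
    if kind in ("t-test", "f-test", "anova"):
        return value                              # already Cohen's d
    if kind == "correlation":                     # stored as Fisher z
        r = math.tanh(value)
        return 2.0 * r / math.sqrt(1.0 - r**2)
    if kind == "chi-square":                      # stored as log odds ratio
        return value * math.sqrt(3.0) / math.pi
    if kind == "mann-whitney":                    # stored as rank-biserial r
        return 2.0 * value / math.sqrt(1.0 - value**2)
    if kind == "binomial":                        # stored as proportion p
        p0 = 0.5
        return 2.0 * (value - p0) / math.sqrt(p0 * (1.0 - p0))
    raise ValueError(f"unknown test type: {kind}")
```

For example, a correlation of $r = 0.6$ (stored as $z = \tanh^{-1}(0.6)$) converts to $d = 1.2 / 0.8 = 1.5$.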
Per-Test Effect Profiles:
- $\boldsymbol{\delta}_h = (\delta_{h,1}, \delta_{h,2}, \ldots, \delta_{h,M})$: Human effect sizes (Cohen's d)
- $\boldsymbol{\delta}_a = (\delta_{a,1}, \delta_{a,2}, \ldots, \delta_{a,M})$: Agent effect sizes (Cohen's d)
Weights for Fairness:
Within each study, per-test weights are normalized so that no single study with many tests dominates the pooled profile.
Weighted Lin's CCC:
$$CCC_w = \frac{2\,s_{ha}}{s_h^2 + s_a^2 + (\bar{\delta}_h - \bar{\delta}_a)^2}$$
where $\bar{\delta}_h, \bar{\delta}_a$ are the weighted means and $s_h^2, s_a^2, s_{ha}$ the weighted variances and covariance of the two profiles.
Range: $-1 \leq CCC_w \leq 1$, with $1$ indicating perfect agreement (identical profiles), $0$ no agreement, and negative values systematic discordance.
Aggregation Levels:
- Overall ECS: CCC across all tests from all studies (weighted)
- Domain ECS: CCC within each domain (Cognition, Strategic, Social)
- Per-study ECS: CCC within each study (undefined if study has < 3 tests)
Code: src/evaluation/stats_lib.py::compute_ecs_corr(), weighted_ccc()
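A minimal sketch of the weighted CCC, using the standard weighted-moment form of Lin's concordance coefficient (the authoritative version is `weighted_ccc()` in `stats_lib.py`):

```python
def ccc_weighted(x, y, w):
    """Weighted Lin's concordance correlation coefficient:
    2*cov / (var_x + var_y + (mean_x - mean_y)^2)."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / total
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / total
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson correlation, CCC penalizes location shift: identical profiles score 1.0, but a profile offset by a constant scores strictly below 1 even though it is perfectly correlated.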
The Z-diff based metric is legacy/auxiliary. Do not report as primary. Kept for reference and the "Caricature Hypothesis."
Step 1: Collect Z_diff Values
For each test $i$, the standardized difference between human and agent effects:
$$Z_{diff,i} = \frac{\delta_{h,i} - \delta_{a,i}}{\sqrt{SE_{h,i}^2 + SE_{a,i}^2}}$$
Step 2: Single Test Case
If the finding has only one test ($K = 1$): $Z_{agg} = |Z_{diff,1}|$.
Step 3: Multiple Tests - RMS Pooling
If the finding has $K > 1$ tests:
$$Z_{agg} = \sqrt{\frac{1}{K}\sum_{i=1}^{K} Z_{diff,i}^2}$$
Then convert $Z_{agg}$ to an ECS score (see aggregate_finding_ecs()).
Code: src/evaluation/stats_lib.py::aggregate_finding_ecs()
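The Z-diff collection and RMS pooling steps can be sketched as below, assuming the standard two-sample z for the difference of effect sizes (the final mapping from $Z_{agg}$ to an ECS score is left to `aggregate_finding_ecs()`):

```python
import math

def z_diff(d_h, se_h, d_a, se_a):
    """Standardized difference between human and agent effect sizes."""
    return (d_h - d_a) / math.sqrt(se_h**2 + se_a**2)

def aggregate_z_rms(z_values):
    """RMS pooling of per-test Z_diff values (legacy ECS_Strict input)."""
    return math.sqrt(sum(z * z for z in z_values) / len(z_values))
```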
Simple Mean Across Findings:
$$ECS_{strict,study} = \frac{1}{F}\sum_{f=1}^{F} ECS_f$$
Code: generation_pipeline/pipeline.py (lines ~1591-1600)
Threshold: $\epsilon = 10^{-6}$
Rule: Before applying $\tanh^{-1}(r)$, clamp $r$ to $[-1 + \epsilon, 1 - \epsilon]$.
Rationale: Prevents infinities when $|r| \to 1$ (i.e., when a test's PAS is exactly 0 or 1).
Code: src/evaluation/stats_lib.py::fisher_z_transform() (line 1840)
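The clamped transform can be sketched as follows (the signature is an assumption; the project's own helper is `fisher_z_transform()` in `stats_lib.py`):

```python
import math

def fisher_z(r: float, eps: float = 1e-6) -> float:
    """atanh with clamping so r = +-1 cannot produce infinities."""
    r = max(-1.0 + eps, min(1.0 - eps, r))
    return math.atanh(r)
```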
Threshold: $r_i^{ceil} \approx 0$ (below a small numeric tolerance)
Rule: If the human ceiling on the r-scale is (near) zero:
- Single test: Return $PAS_f^{norm} = 0.0$
- Multiple tests: Skip this test in aggregation
Rationale: When human evidence is neutral ($\pi_h \approx 0.5$), the ceiling $PAS^{ceil} \approx 0.5$ maps to $r^{ceil} \approx 0$, so the normalized ratio is undefined.
Code: src/evaluation/stats_lib.py::aggregate_finding_pas_norm() (lines 2020, 2048)
Empty test list:
- PAS (raw): Return $0.5$
- PAS (normalized): Return $0.0$
- ECS: Return $0.0$

All z-values invalid (NaN/Inf):
- PAS (raw): Fall back to the simple mean of PAS values
- PAS (normalized): Fall back to the simple mean of normalized ratios (if any are valid)
- ECS: Return $0.0$

Zero/negative weights:
- If $\sum w_i \leq 0$: Use the unweighted mean of z-values
Code: See fallback logic in aggregate_finding_pas_raw(), aggregate_finding_pas_norm(), aggregate_finding_ecs()
Rule: Skip any test where:
- $z_i$ is NaN or Inf (after the Fisher transform)
- $Z_{diff,i}$ is NaN or Inf
- $z_{agg}$ is NaN or Inf (after aggregation)
Code: Check math.isnan() and math.isinf() before including in aggregation.
The effective sample size $n_{eff}$ serves as the weight $w_i$ in Fisher-z pooling and depends on the test type:

| Test Type | n_eff Formula | Code Location |
|---|---|---|
| t-test / F-test / ANOVA | Independent: $n_{eff} = \frac{n_1 n_2}{n_1 + n_2}$; Paired/One-sample: $n_{eff} = n$ | `compute_n_eff_for_test()` lines 1886-1895 |
| Correlation (Pearson/Spearman) | $n_{eff} = n - 3$ | `compute_n_eff_for_test()` lines 1896-1899 |
| Chi-square | $n_{eff} = \left(\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}\right)^{-1}$ (inverse variance of the log OR) | `compute_n_eff_for_test()` lines 1900-1906 |
| Mann-Whitney U | $n_{eff} = \frac{n_1 n_2}{n_1 + n_2}$ | `compute_n_eff_for_test()` lines 1907-1911 |
| Binomial / Sign Test | $n_{eff} = n$ | `compute_n_eff_for_test()` lines 1912-1915 |
| Fallback | $n_{eff} = n$ (reported total sample size) | `compute_n_eff_for_test()` line 1918 |
Code: src/evaluation/stats_lib.py::compute_n_eff_for_test()
Note: n_eff is also computed and stored during add_statistical_replication_fields() (lines ~417-435 in stats_lib.py).
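A sketch of the dispatch, using the conventional effective-n choices that match the standard-error formulas in the next section; the exact per-type rules live in `compute_n_eff_for_test()`, and this helper name and signature are hypothetical:

```python
def n_eff_sketch(test_type, n1=None, n2=None, n=None):
    """Illustrative effective-sample-size choices by test type."""
    if test_type in ("t-test", "mann-whitney") and n1 and n2:
        return n1 * n2 / (n1 + n2)   # two independent groups: harmonic-style n
    if test_type == "correlation":
        return max(n - 3, 1)         # Fisher-z variance is 1 / (n - 3)
    return n                         # paired/one-sample, binomial, fallback
```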
Effect Size:
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$
Standard Error:
- Independent samples:
  $$SE_d = \sqrt{\frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)}}$$
- One-sample / Paired:
  $$SE_d = \sqrt{\frac{1}{n} + \frac{d^2}{2n}}$$
Code: FrequentistConsistency.cohens_d_se() (lines 1450-1468)
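The two SE cases can be sketched in one helper (illustrative; the benchmark's version is `FrequentistConsistency.cohens_d_se()`):

```python
import math

def cohens_d_se(d, n1, n2=None):
    """SE of Cohen's d: independent samples when n2 is given,
    otherwise the one-sample/paired formula."""
    if n2 is not None:
        return math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return math.sqrt(1 / n1 + d**2 / (2 * n1))
```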
Effect Size: stored as Fisher's $z$:
$$z = \tanh^{-1}(r) = \frac{1}{2}\ln\frac{1+r}{1-r}$$
Standard Error:
$$SE_z = \frac{1}{\sqrt{n - 3}}$$
Code: FrequentistConsistency.correlation_se() (lines 1471-1489)
Effect Size: log odds ratio of the 2x2 table $(a, b, c, d)$:
$$\log OR = \ln\frac{ad}{bc}$$
Haldane-Anscombe Correction: If any cell is zero, add $0.5$ to every cell before computing.
Standard Error:
$$SE_{\log OR} = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
(After correction, if needed.)
Code: FrequentistConsistency.log_odds_ratio() and log_odds_ratio_se() (lines 1511-1542)
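A sketch combining the effect size, the zero-cell correction, and the SE (illustrative; the benchmark's versions are `log_odds_ratio()` and `log_odds_ratio_se()`):

```python
import math

def log_odds_ratio_with_se(a, b, c, d):
    """Log odds ratio of a 2x2 table with Haldane-Anscombe correction:
    if any cell is zero, add 0.5 to every cell before computing."""
    if min(a, b, c, d) == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se
```

Without the correction, a table such as $(5, 0, 5, 5)$ would yield an infinite log OR; with it, both the estimate and the SE stay finite.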
Effect Size: rank-biserial correlation:
$$r_{rb} = 1 - \frac{2U}{n_1 n_2}$$
where $U$ is the Mann-Whitney U statistic.
Standard Error:
$$SE_{r_{rb}} = \sqrt{\frac{n_1 + n_2 + 1}{3\,n_1 n_2}}$$
Code: FrequentistConsistency.r_rb_se() (lines 1759-1775)
Effect Size: as in the binomial conversion above:
$$d \approx 2 \cdot \frac{p - p_0}{\sqrt{p_0(1-p_0)}}, \quad p_0 = 0.5$$
Standard Error:
$$SE_d \approx \frac{2}{\sqrt{p_0(1-p_0)}}\sqrt{\frac{p(1-p)}{n}}$$
Code: FrequentistConsistency.calculate_consistency_for_binomial() (lines 1801-1820)
| Function | Location | Purpose |
|---|---|---|
| `aggregate_finding_pas_raw()` | `src/evaluation/stats_lib.py:1921` | PAS (raw) aggregation at finding level |
| `aggregate_finding_pas_norm()` | `src/evaluation/stats_lib.py:1982` | PAS (normalized) aggregation at finding level |
| `aggregate_finding_ecs()` | `src/evaluation/stats_lib.py:2096` | ECS_Strict aggregation at finding level (appendix) |
| `aggregate_study_pas()` | `src/evaluation/stats_lib.py:2137` | PAS aggregation at study level (calls finding-level functions) |
| `effect_to_d_equiv()` | `src/evaluation/stats_lib.py:1845` | Convert effect sizes to Cohen's d-equivalent |
| `weighted_ccc()` | `src/evaluation/stats_lib.py:3227` | Weighted Lin's CCC (main ECS) |
| `compute_ecs_corr()` | `src/evaluation/stats_lib.py:3301` | ECS computation (CCC overall/domain/study) |
| `weighted_corr()` | `src/evaluation/stats_lib.py:3109` | Weighted Pearson (retained for figures/appendix) |
| `weighted_linreg()` | `src/evaluation/stats_lib.py:3185` | Weighted least squares regression (caricature, legacy) |
| `fisher_z_transform()` | `src/evaluation/stats_lib.py:1827` | Fisher z-transform helper |
| `fisher_z_inverse()` | `src/evaluation/stats_lib.py:1851` | Inverse Fisher z-transform helper |
| `compute_n_eff_for_test()` | `src/evaluation/stats_lib.py:1866` | Compute effective sample size for weighting |
| Component | Location | Purpose |
|---|---|---|
| Study-level PAS computation | `generation_pipeline/pipeline.py:1575-1579` | Override `pas_result['score']` and `['normalized_score']` with Fisher-pooled values |
| Study-level ECS_corr computation | `generation_pipeline/pipeline.py:1581-1600` | Compute ECS_corr (correlation-based) and ECS_Strict (appendix) |
| CSV output | `generation_pipeline/pipeline.py:1697-1741` | Write PAS_Raw, ECS_Test, Z_Diff, Agent_Effect_d, Human_Effect_d columns |
| Summary JSON | `scripts/generate_results_table.py:843-844` | Include average_pas_raw, average_ecs (ECS_corr), ecs_strict_overall (appendix) |
| Function | Location | Purpose |
|---|---|---|
| `FrequentistConsistency.cohens_d_se()` | `src/evaluation/stats_lib.py:1450` | SE for Cohen's d |
| `FrequentistConsistency.correlation_se()` | `src/evaluation/stats_lib.py:1471` | SE for Fisher's z (correlation) |
| `FrequentistConsistency.log_odds_ratio_se()` | `src/evaluation/stats_lib.py:1528` | SE for log odds ratio |
| `FrequentistConsistency.r_rb_se()` | `src/evaluation/stats_lib.py:1759` | SE for rank-biserial correlation |
| `FrequentistConsistency.calculate_z_diff()` | `src/evaluation/stats_lib.py:1555` | Compute $Z_{diff}$ between human and agent effects |
CSV (detailed_stats.csv):
- `PAS_Raw`: Per-test raw PAS value
- `ECS_Test`: Per-test ECS value
- `Z_Diff`: Per-test $Z_{diff}$ value
- `Agent_Effect_Size`: Per-test agent effect size
- `Human_Effect_Size`: Per-test human effect size
JSON (evaluation_results.json, benchmark_summary.json; report only main):
- Main: `score` = Study-level PAS (raw), Fisher-pooled.
- Main: `average_ecs` = Overall ECS (Lin's CCC).
- Main: `ecs_corr_study` = Study-level ECS (Lin's CCC).
- Legacy/auxiliary: `normalized_score` (PAS norm), `ecs_strict_study`, `ecs_strict_overall`, `average_pas` → `average_pas_raw`, `average_consistency_score` → `ecs_strict_study`.
Report as main metrics only:
- PAS (raw): Aggregates test-level PAS using Fisher-z pooling within findings, then simple mean across findings.
- ECS (correlation-based): Standardizes all effect sizes to Cohen's d-equivalent, then computes Lin's Concordance Correlation Coefficient (CCC) between human and agent effect profiles (weighted). CCC measures agreement (correlation plus accuracy). This is the canonical ECS.
Legacy / auxiliary (do not report as primary): PAS (normalized), ECS_Strict (Z-diff, RMS-pooled), caricature regression, per-test Z-diff.
All aggregations respect edge cases (clamping, divide-by-zero protection, NaN/Inf handling) with explicit numeric thresholds documented above.