Incorrect normalisation in `TreeSequence.pair_coalescence_counts` (and `TreeSequence.pair_coalescence_rates`) with isolated samples

The `pair_normalise` option for `TreeSequence.pair_coalescence_counts` should result in the output being divided by the possible number of coalescing sample pairs---so that if `pair_normalise` and `span_normalise` are both set, the output will sum to one over the time dimension. However, if there are isolated samples over a portion of the tree sequence, then this won't be the case---the denominator is incorrect.

Specifically, let $c_i(x)$ be the number of sample pairs coalescing in node $i$ at position $x$. Without normalisation, `pair_coalescence_counts` returns $C_i = \int_a^b c_i(x) dx$ where the integral is over the interval $[a, b)$ (ignoring "empty" trees). Let $p(x)$ be the number of potentially coalescing sample pairs at position $x$ (that is, omitting isolated samples and pairs where one sample is a descendant of the other). The "full normalisation" should return $C_i / \int_a^b p(x) dx$. If all sample pairs are present over the entirety of the sequence, then $\int_a^b p(x) dx = {n \choose 2} (b - a)$, and the current implementation is hunky dory. But otherwise $\int_a^b p(x) dx < {n \choose 2} (b - a)$ and the normalised output is biased upwards.

A consequence is that pair coalescence rates (which rely on the normalised counts) are biased when there are isolated samples. For example, the figure below shows empirical rates estimated by `TreeSequence.pair_coalescence_rates` (colored lines that are individual simulations) around the theoretical expectation for the Zigzag model (black line). The "full" panel is when all samples are present, and the "drop" panel is when ancestry is partially removed so that the number of non-isolated samples varies dramatically across the sequence.

<img width="75%" alt="Image" src="https://github.com/user-attachments/assets/dfd0086d-7d9a-43f7-8c2f-2af147d06b8b" />

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incorrect normalisation in `TreeSequence.pair_coalescence_counts` (and `TreeSequence.pair_coalescence_rates`) with isolated samples #3459

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Incorrect normalisation in TreeSequence.pair_coalescence_counts (and TreeSequence.pair_coalescence_rates) with isolated samples #3459

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Incorrect normalisation in `TreeSequence.pair_coalescence_counts` (and `TreeSequence.pair_coalescence_rates`) with isolated samples #3459