Skip to content

id: dedupe parse results to avoid false ambiguity#12578

Merged
Byron merged 1 commit intomasterfrom
cli-id-dedupe
Feb 27, 2026
Merged

id: dedupe parse results to avoid false ambiguity#12578
Byron merged 1 commit intomasterfrom
cli-id-dedupe

Conversation

@krlvi
Copy link
Copy Markdown
Member

@krlvi krlvi commented Feb 24, 2026

IdMap::parse can return duplicate CliIds for the same logical entity when a
commit is reachable through both local and remote views. Callers interpret
multiple matches as ambiguity and may fail operations (e.g. target/anchor
resolution) even though only one entity is intended.

Deduplicate parse results by entity identity so we preserve real ambiguity
between distinct entities while preventing these false ambiguous-ID errors.


This is part 1 of 2 in a stack made with GitButler:

Copilot AI review requested due to automatic review settings February 24, 2026 22:34
@vercel
Copy link
Copy Markdown

vercel Bot commented Feb 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
gitbutler-web Ignored Ignored Preview Feb 27, 2026 10:10am

Request Review

@github-actions github-actions Bot added rust Pull requests that update Rust code CLI The command-line program `but` labels Feb 24, 2026
Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bdc5f57d17

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/but/src/id/mod.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where IdMap::parse returns duplicate CliId entries for the same logical entity when a commit is reachable through both local and remote tracking views. The fix deduplicates parse results by entity identity, preserving real ambiguity between distinct entities while preventing false "ambiguous ID" errors.

Changes:

  • Added deduplication logic to IdMap::parse to remove duplicate results that refer to the same entity
  • Implemented cli_ids_refer_to_same_entity helper function to compare entities by their identifying attributes (commit_id for commits, commit_id+path for files, name+stack_id for branches, stack_id for stacks)
  • Added comprehensive tests to verify deduplication works correctly for commits while preserving genuine ambiguity for distinct entities

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
crates/but/src/id/mod.rs Implements deduplication in parse() method and adds cli_ids_refer_to_same_entity() helper to compare entity identity
crates/but/src/id/tests.rs Adds three tests verifying commit deduplication works and genuine ambiguity is preserved

Comment thread crates/but/src/id/mod.rs Outdated
@jonathantanmy2
Copy link
Copy Markdown
Collaborator

The PR looks fine, but shouldn't we fix this at the point where we generate the workspace commits and upstream commits (i.e. make sure that the same commit doesn't appear twice)? Unlike the case where we have duplicate TreeChange entries, this one seems to be entirely our code (as opposed to code from a dependency).

@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 25, 2026

Actually that's is a good point... @Byron or @Caleb-T-Owens do you guys have an idea how we can solve this properly?

@Byron
Copy link
Copy Markdown
Collaborator

Byron commented Feb 25, 2026

A good point, thanks @jonathantanmy2.

This sounds like there should be a test-case in but-graph to ensure commits aren't associated multiple times.

It's worth figuring out what's going on. Is it a bug here, where a commit can be in both lists?

/// The portion of commits that can be reached from the tip of the *branch* downwards to the next [StackSegment],
/// so that they are unique for this stack segment and not included in any other stack or stack segment.
/// The walk is performed **along the first parent only**.
///
/// The list could be empty for when this is a dedicated empty segment as insertion position of commits.
pub commits: Vec<StackCommit>,
/// All commits *that are not workspace commits* reachable by (and including commits in) this segment.
/// The list was created by walking all parents, not only the first parent.
/// Note that the tips of these commits is the `sibling_segment_id` which in this case is `Some`
/// if this field is `Some`.
/// When set, we will also have copied the `ref_name`, `metadata` and `remote_tracking_ref_name` from
/// `sibling_segment_id` over to this segment to provide more meaningful information.
pub commits_outside: Option<Vec<StackCommit>>,

Or is it about a commit being reachable from multiple stacks? Because that's intentional and I think it's important to 'keep it real'. Because unifying them means to say that "because our ordering from left to right, stack 1 gets to own this commit, but not stack 2 which just happens to see it later".

Copilot AI review requested due to automatic review settings February 25, 2026 10:15
@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 25, 2026

Good point @Byron
I actually added a second test to demonstrate the use case you pointed out. I think this is actually about handling exactly that situation...

(PS: Now i need your review Byron as well :P)

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread crates/but/src/id/mod.rs
Comment on lines +770 to +778
let mut deduped = Vec::new();
'next: for cli_id in cli_ids {
for existing in &deduped {
if cli_ids_refer_to_same_entity(existing, &cli_id) {
continue 'next;
}
}
deduped.push(cli_id);
}
Copy link

Copilot AI Feb 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deduplication logic uses an O(n²) algorithm where for each item in cli_ids, it iterates through all items in deduped to check for matches. While this is acceptable for small result sets (typically 1-2 items in practice), consider using a HashSet-based approach if parse results could contain many matches in edge cases. The current implementation could become a bottleneck if cli_ids contains many duplicates or a large number of results.

Copilot uses AI. Check for mistakes.
@Caleb-T-Owens
Copy link
Copy Markdown
Contributor

Perhaps it's fussy, but I'm really not sure if this is a good idea to merge.

Even if we don't end up fixing the but-graph, I would be very interesting in us avoiding any form of de-duplication logic like this.

Is it possible to avoid generating duplicate IDs in the first place, even if there might be redundancy in the source dataset?

If not, I'm wondering if we can avoid having such a large and potentially error-prone comparison. Could we instead drive Hash on the CliId enum and use that in combo with a HashSet to de-duplicate automatically as we build up the output structure - which would also avoid the current O(n^2) problem which could have a meaningful impact on larger workspaces.

Thanks.

@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 25, 2026

Perhaps it's fussy, but I'm really not sure if this is a good idea to merge.

Even if we don't end up fixing the but-graph, I would be very interesting in us avoiding any form of de-duplication logic like this.

Is it possible to avoid generating duplicate IDs in the first place, even if there might be redundancy in the source dataset?

If not, I'm wondering if we can avoid having such a large and potentially error-prone comparison. Could we instead drive Hash on the CliId enum and use that in combo with a HashSet to de-duplicate automatically as we build up the output structure - which would also avoid the current O(n^2) problem which could have a meaningful impact on larger workspaces.

Thanks.

this is fine, but it seems weird to me that we can show the same commit multiple times in the first place. I dont know how to solve that though, would need help there

Copy link
Copy Markdown
Collaborator

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also related to what @Caleb-T-Owens said, in the end, the but_graph::Workspace is there to define how GitButler sees the world, and it can be adjusted to however we see fit.
To my mind, it must show "the ground truth", and everything else has to be done as a function of how it's displayed. You want to hide or dim duplicates? Here you go.
Our code just has to deal with 'the truth' now, which admittedly is a bit of an extreme coming from vb.toml. And I quoted it because some code will even have to go down from the projection and into the graph for 'the real truth' :D.

}

#[test]
fn shared_commits_can_be_in_multiple_stacks_without_local_remote_overlap() -> anyhow::Result<()> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this test as this point is more obvious when looking at the test that was originally using the fixture. If there is something special here, I think it should be added to the existing test.

    let (repo, mut meta) =
        read_only_in_memory_scenario("ws/multiple-stacks-with-shared-segment-and-remote")?;
    insta::assert_snapshot!(visualize_commit_graph_all(&repo)?, @r"
    *   e982e8a (HEAD -> gitbutler/workspace) GitButler Workspace Commit
    |\  
    | * aff8449 (B-on-A) B-on-A
    * | 4f1bb32 (C-on-A) C-on-A
    |/  
    | * b627ca7 (origin/A) A-on-remote
    |/  
    * e255adc (A) A
    * fafd9d0 (origin/main, main) init
    ");

    add_stack_with_segments(&mut meta, 1, "C-on-A", StackState::InWorkspace, &[]);

    let graph = Graph::from_head(&repo, &*meta, standard_options())?.validated()?;
    insta::assert_snapshot!(graph_tree(&graph), @r"

    ├── 👉📕►►►:0[0]:gitbutler/workspace[🌳]
    │   └── ·e982e8a (⌂|🏘|0001)
    │       ├── 📙►:3[1]:C-on-A
    │       │   └── ·4f1bb32 (⌂|🏘|0001)
    │       │       └── ►:4[2]:A <> origin/A →:5:
    │       │           └── ·e255adc (⌂|🏘|1101)
    │       │               └── ►:2[3]:main <> origin/main →:1:
    │       │                   └── ·fafd9d0 (⌂|🏘|✓|1111)
    │       └── ►:6[1]:B-on-A
    │           └── ·aff8449 (⌂|🏘|0001)
    │               └── →:4: (A →:5:)
    ├── ►:1[0]:origin/main →:2:
    │   └── →:2: (main →:1:)
    └── ►:5[0]:origin/A →:4:
        └── 🟣b627ca7 (0x0|1000)
            └── →:4: (A →:5:)
    ");

    insta::assert_snapshot!(graph_workspace(&graph.into_workspace()?), @r"
    📕🏘️:0:gitbutler/workspace[🌳] <> ✓refs/remotes/origin/main on fafd9d0
    ├── ≡:6:B-on-A on fafd9d0
    │   ├── :6:B-on-A
    │   │   └── ·aff8449 (🏘️)
    │   └── :4:A <> origin/A →:5:⇣1
    │       ├── 🟣b627ca7
    │       └── ❄️e255adc (🏘️)
    └── ≡📙:3:C-on-A on fafd9d0 {1}
        ├── 📙:3:C-on-A
        │   └── ·4f1bb32 (🏘️)
        └── :4:A <> origin/A →:5:⇣1
            ├── 🟣b627ca7
            └── ❄️e255adc (🏘️)
    ");

@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 25, 2026

image image

even here i am running in the commit duplication - it looks like a bug to me

@jonathantanmy2
Copy link
Copy Markdown
Collaborator

It's worth figuring out what's going on. Is it a bug here, where a commit can be in both lists?

@krlvi provided a screenshot above that shows what's going on - a workspace commit (92d4...) is reachable from 2 stacks.

Or is it about a commit being reachable from multiple stacks?

It's about a workspace commit being reachable from multiple stacks, I think. From the documentation you quoted, "so that they are unique for this stack segment and not included in any other stack or stack segment" - that was my assumption for all workspace commits.

Because that's intentional and I think it's important to 'keep it real'.

Is it the intention to allow a workspace commit to appear in multiple stacks? My assumption on how it worked was that GitButler would never create such a commit graph, and having such a commit graph is an invalid state.

If we were to allow it, I think that any operation on an entire stack (e.g. muting an entire stack) would become more complicated, as the trees involved are no longer independent. We would also no longer be able to speak meaningfully about the branch/stack of a commit (e.g. being able to specify commitid@{stack} or commitid@{branch}, but AFAIK no functionality depends on doing that right now, so it might be OK.

@Byron
Copy link
Copy Markdown
Collaborator

Byron commented Feb 26, 2026

Is it the intention to allow a workspace commit to appear in multiple stacks? My assumption on how it worked was that GitButler would never create such a commit graph, and having such a commit graph is an invalid state.

Every commit graph is valid now and we have to deal with it. The reason is that we don't control the graph, the user does and all the tools that they use. Obviously, we already create such 'ambiguous' graphs as I don't think Kiril manipulated that into existence.

If we were to allow it, [..]

I think this is the misconception - we aren't in that position, we really can't prevent this from happening. We don't have that kind of grip over our own graph manipulations to prevent it, and even if we did, the user might manoeuvre themselves into deemed-invalid positions themselves and we better handle that gracefully.

And yes, that definitely complicates things.

@krlvi

even here i am running in the commit duplication - it looks like a bug to me

My guess is that there is an anonymous segment that is reachable by two refs that are explicitly added to the workspace. From Git's point of view, both are pointing to that commit. The dot-graph would be good to see.
With that said, it's probably good to keep this repro and I take a closer look. Right now I can't conclude it's a bug, while I admit it's surprising - but that's not the first time.

@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 26, 2026

@Byron

debug-graph-01.pdf

@Byron
Copy link
Copy Markdown
Collaborator

Byron commented Feb 26, 2026

@krlvi Thanks. Unfortunately, there is code that tries to prevent dot from hanging which completely butchers this graph before displaying it.

const LIMIT: usize = 5000;
let mut to_remove = graph.num_segments().saturating_sub(LIMIT);
if to_remove > 0 {
tracing::warn!(
"Pruning at most {to_remove} nodes from the bottom to assure 'dot' won't hang",
);
let mut next = std::collections::VecDeque::new();
next.extend(graph.base_segments());
let mut seen = std::collections::BTreeSet::new();
while let Some(sidx) = next.pop_front() {
if to_remove == 0 {
break;
}
if let Some(s) = graph.node_weight(sidx)
&& (s.metadata.is_some()
|| s.sibling_segment_id.is_some()
|| s.remote_tracking_branch_segment_id.is_some())
{
continue;
}
next.extend(
graph
.neighbors_directed(sidx, but_graph::petgraph::Direction::Incoming)
.filter(|n| seen.insert(*n)),
);
graph.remove_node(sidx);
to_remove -= 1;
}
if to_remove != 0 {
tracing::warn!("{to_remove} extra nodes were kept to keep vital portions of the graph");
}
graph.set_hard_limit_hit();
}

But I counted, and it looks like a412a5d and ·c9f3a40 might have the same commit message?

Also, I will see if I can do something about this butchering.

@Byron
Copy link
Copy Markdown
Collaborator

Byron commented Feb 26, 2026

@krlvi The Graph display issue is fixed in #12606 . Maybe it makes a difference here, as in would change what I think of this.

@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 26, 2026

@krlvi The Graph display issue is fixed in #12606 . Maybe it makes a difference here, as in would change what I think of this.

still shows like before with the latest code. lemme zip the repo

@Byron
Copy link
Copy Markdown
Collaborator

Byron commented Feb 27, 2026

Screenshot_2026-02-27_at_08 10 25 Screenshot_2026-02-27_at_08 09 01

In that repo, there is definitely no undue duplication. Different commits, same message. I could also verify that the commit-graph now looks better with the algorithm change: it now shows the base of the workspace, so it doesn't look like it's ripped apart anymore.
So all good here.

I also had a conversation with @jonathantanmy2 and I believe we agree that the deduplication must happen here.

@krlvi
Copy link
Copy Markdown
Member Author

krlvi commented Feb 27, 2026

Screenshot_2026-02-27_at_08 10 25 Screenshot_2026-02-27_at_08 09 01

In that repo, there is definitely no undue duplication. Different commits, same message. I could also verify that the commit-graph now looks better with the algorithm change: it now shows the base of the workspace, so it doesn't look like it's ripped apart anymore. So all good here.

I also had a conversation with @jonathantanmy2 and I believe we agree that the deduplication must happen here.

Thanks and sorry for the misleading example.
Just to confirm, are we saying that we should indeed perform this dedup here regardless? I.e. we want to merge this?
If we want this code, are we happy with how it's currently implemented?

@Byron
Copy link
Copy Markdown
Collaborator

Byron commented Feb 27, 2026

Taking it over for minor changes, then handing over to @Caleb-T-Owens for perf improvements in a separate PR.

Copy link
Copy Markdown
Collaborator

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if the approval survives pushing into it myself.

Copy link
Copy Markdown
Collaborator

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I have to 'unapprove' it removing the deduplication still passes the tests.

Here is a patch for double-checking:

diff --git a/crates/but/src/id/mod.rs b/crates/but/src/id/mod.rs
index 7b12875ac7..20fca5997a 100644
--- a/crates/but/src/id/mod.rs
+++ b/crates/but/src/id/mod.rs
@@ -765,17 +765,7 @@ impl IdMap {
             }
         }
 
-        let mut deduped = Vec::new();
-        'next: for cli_id in cli_ids {
-            for existing in &deduped {
-                if cli_ids_refer_to_same_entity(existing, &cli_id) {
-                    continue 'next;
-                }
-            }
-            deduped.push(cli_id);
-        }
-
-        Ok(deduped)
+        return Ok(cli_ids);
     }
     /// Convenience for [IdMap::parse] if a [gix::Repository] is available.
     pub fn parse_using_repo<'a>(

Still seeing if there is a quick way to get a test that validates this.

Copilot AI review requested due to automatic review settings February 27, 2026 10:07
IdMap::parse can return duplicate CliIds for the same logical entity when a
commit is reachable through both local and remote views. Callers interpret
multiple matches as ambiguity and may fail operations (e.g. target/anchor
resolution) even though only one entity is intended.

Deduplicate parse results by entity identity so we preserve real ambiguity
between distinct entities while preventing these false ambiguous-ID errors.
Copy link
Copy Markdown
Collaborator

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conflict resolution dropped a line: #[test] 🤦‍♂️.
Have to get a proper merge editor now.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread crates/but/src/id/tests.rs
Comment thread crates/but/src/id/tests.rs
@Byron Byron enabled auto-merge February 27, 2026 10:17
@Byron Byron merged commit 0e1e01b into master Feb 27, 2026
33 checks passed
@Byron Byron deleted the cli-id-dedupe branch February 27, 2026 10:19
@jonathantanmy2
Copy link
Copy Markdown
Collaborator

image

There's another issue illustrated in this screenshot - the commit CLI ID is too long. I've sent #12624 to fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLI The command-line program `but` rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants