id: dedupe parse results to avoid false ambiguity#12578
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bdc5f57d17
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where IdMap::parse returns duplicate CliId entries for the same logical entity when a commit is reachable through both local and remote tracking views. The fix deduplicates parse results by entity identity, preserving real ambiguity between distinct entities while preventing false "ambiguous ID" errors.
Changes:
- Added deduplication logic to
IdMap::parseto remove duplicate results that refer to the same entity - Implemented
cli_ids_refer_to_same_entityhelper function to compare entities by their identifying attributes (commit_id for commits, commit_id+path for files, name+stack_id for branches, stack_id for stacks) - Added comprehensive tests to verify deduplication works correctly for commits while preserving genuine ambiguity for distinct entities
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/but/src/id/mod.rs | Implements deduplication in parse() method and adds cli_ids_refer_to_same_entity() helper to compare entity identity |
| crates/but/src/id/tests.rs | Adds three tests verifying commit deduplication works and genuine ambiguity is preserved |
|
The PR looks fine, but shouldn't we fix this at the point where we generate the workspace commits and upstream commits (i.e. make sure that the same commit doesn't appear twice)? Unlike the case where we have duplicate TreeChange entries, this one seems to be entirely our code (as opposed to code from a dependency). |
|
Actually that's is a good point... @Byron or @Caleb-T-Owens do you guys have an idea how we can solve this properly? |
|
A good point, thanks @jonathantanmy2. This sounds like there should be a test-case in It's worth figuring out what's going on. Is it a bug here, where a commit can be in both lists? gitbutler/crates/but-graph/src/projection/stack.rs Lines 179 to 191 in b594cea Or is it about a commit being reachable from multiple stacks? Because that's intentional and I think it's important to 'keep it real'. Because unifying them means to say that "because our ordering from left to right, stack 1 gets to own this commit, but not stack 2 which just happens to see it later". |
|
Good point @Byron (PS: Now i need your review Byron as well :P) |
| let mut deduped = Vec::new(); | ||
| 'next: for cli_id in cli_ids { | ||
| for existing in &deduped { | ||
| if cli_ids_refer_to_same_entity(existing, &cli_id) { | ||
| continue 'next; | ||
| } | ||
| } | ||
| deduped.push(cli_id); | ||
| } |
There was a problem hiding this comment.
The deduplication logic uses an O(n²) algorithm where for each item in cli_ids, it iterates through all items in deduped to check for matches. While this is acceptable for small result sets (typically 1-2 items in practice), consider using a HashSet-based approach if parse results could contain many matches in edge cases. The current implementation could become a bottleneck if cli_ids contains many duplicates or a large number of results.
|
Perhaps it's fussy, but I'm really not sure if this is a good idea to merge. Even if we don't end up fixing the Is it possible to avoid generating duplicate IDs in the first place, even if there might be redundancy in the source dataset? If not, I'm wondering if we can avoid having such a large and potentially error-prone comparison. Could we instead drive Thanks. |
this is fine, but it seems weird to me that we can show the same commit multiple times in the first place. I dont know how to solve that though, would need help there |
Byron
left a comment
There was a problem hiding this comment.
Also related to what @Caleb-T-Owens said, in the end, the but_graph::Workspace is there to define how GitButler sees the world, and it can be adjusted to however we see fit.
To my mind, it must show "the ground truth", and everything else has to be done as a function of how it's displayed. You want to hide or dim duplicates? Here you go.
Our code just has to deal with 'the truth' now, which admittedly is a bit of an extreme coming from vb.toml. And I quoted it because some code will even have to go down from the projection and into the graph for 'the real truth' :D.
| } | ||
|
|
||
| #[test] | ||
| fn shared_commits_can_be_in_multiple_stacks_without_local_remote_overlap() -> anyhow::Result<()> { |
There was a problem hiding this comment.
I'd remove this test as this point is more obvious when looking at the test that was originally using the fixture. If there is something special here, I think it should be added to the existing test.
let (repo, mut meta) =
read_only_in_memory_scenario("ws/multiple-stacks-with-shared-segment-and-remote")?;
insta::assert_snapshot!(visualize_commit_graph_all(&repo)?, @r"
* e982e8a (HEAD -> gitbutler/workspace) GitButler Workspace Commit
|\
| * aff8449 (B-on-A) B-on-A
* | 4f1bb32 (C-on-A) C-on-A
|/
| * b627ca7 (origin/A) A-on-remote
|/
* e255adc (A) A
* fafd9d0 (origin/main, main) init
");
add_stack_with_segments(&mut meta, 1, "C-on-A", StackState::InWorkspace, &[]);
let graph = Graph::from_head(&repo, &*meta, standard_options())?.validated()?;
insta::assert_snapshot!(graph_tree(&graph), @r"
├── 👉📕►►►:0[0]:gitbutler/workspace[🌳]
│ └── ·e982e8a (⌂|🏘|0001)
│ ├── 📙►:3[1]:C-on-A
│ │ └── ·4f1bb32 (⌂|🏘|0001)
│ │ └── ►:4[2]:A <> origin/A →:5:
│ │ └── ·e255adc (⌂|🏘|1101)
│ │ └── ►:2[3]:main <> origin/main →:1:
│ │ └── ·fafd9d0 (⌂|🏘|✓|1111)
│ └── ►:6[1]:B-on-A
│ └── ·aff8449 (⌂|🏘|0001)
│ └── →:4: (A →:5:)
├── ►:1[0]:origin/main →:2:
│ └── →:2: (main →:1:)
└── ►:5[0]:origin/A →:4:
└── 🟣b627ca7 (0x0|1000)
└── →:4: (A →:5:)
");
insta::assert_snapshot!(graph_workspace(&graph.into_workspace()?), @r"
📕🏘️:0:gitbutler/workspace[🌳] <> ✓refs/remotes/origin/main on fafd9d0
├── ≡:6:B-on-A on fafd9d0
│ ├── :6:B-on-A
│ │ └── ·aff8449 (🏘️)
│ └── :4:A <> origin/A →:5:⇣1
│ ├── 🟣b627ca7
│ └── ❄️e255adc (🏘️)
└── ≡📙:3:C-on-A on fafd9d0 {1}
├── 📙:3:C-on-A
│ └── ·4f1bb32 (🏘️)
└── :4:A <> origin/A →:5:⇣1
├── 🟣b627ca7
└── ❄️e255adc (🏘️)
");
@krlvi provided a screenshot above that shows what's going on - a workspace commit (92d4...) is reachable from 2 stacks.
It's about a workspace commit being reachable from multiple stacks, I think. From the documentation you quoted, "so that they are unique for this stack segment and not included in any other stack or stack segment" - that was my assumption for all workspace commits.
Is it the intention to allow a workspace commit to appear in multiple stacks? My assumption on how it worked was that GitButler would never create such a commit graph, and having such a commit graph is an invalid state. If we were to allow it, I think that any operation on an entire stack (e.g. muting an entire stack) would become more complicated, as the trees involved are no longer independent. We would also no longer be able to speak meaningfully about the branch/stack of a commit (e.g. being able to specify |
Every commit graph is valid now and we have to deal with it. The reason is that we don't control the graph, the user does and all the tools that they use. Obviously, we already create such 'ambiguous' graphs as I don't think Kiril manipulated that into existence.
I think this is the misconception - we aren't in that position, we really can't prevent this from happening. We don't have that kind of grip over our own graph manipulations to prevent it, and even if we did, the user might manoeuvre themselves into deemed-invalid positions themselves and we better handle that gracefully. And yes, that definitely complicates things.
My guess is that there is an anonymous segment that is reachable by two refs that are explicitly added to the workspace. From Git's point of view, both are pointing to that commit. The dot-graph would be good to see. |
|
@krlvi Thanks. Unfortunately, there is code that tries to prevent gitbutler/crates/but-api/src/legacy/workspace.rs Lines 68 to 100 in ff53618 But I counted, and it looks like a412a5d and ·c9f3a40 might have the same commit message? Also, I will see if I can do something about this butchering. |
In that repo, there is definitely no undue duplication. Different commits, same message. I could also verify that the commit-graph now looks better with the algorithm change: it now shows the base of the workspace, so it doesn't look like it's ripped apart anymore. I also had a conversation with @jonathantanmy2 and I believe we agree that the deduplication must happen here. |
Thanks and sorry for the misleading example. |
|
Taking it over for minor changes, then handing over to @Caleb-T-Owens for perf improvements in a separate PR. |
Byron
left a comment
There was a problem hiding this comment.
Let's see if the approval survives pushing into it myself.
There was a problem hiding this comment.
Actually I have to 'unapprove' it removing the deduplication still passes the tests.
Here is a patch for double-checking:
diff --git a/crates/but/src/id/mod.rs b/crates/but/src/id/mod.rs
index 7b12875ac7..20fca5997a 100644
--- a/crates/but/src/id/mod.rs
+++ b/crates/but/src/id/mod.rs
@@ -765,17 +765,7 @@ impl IdMap {
}
}
- let mut deduped = Vec::new();
- 'next: for cli_id in cli_ids {
- for existing in &deduped {
- if cli_ids_refer_to_same_entity(existing, &cli_id) {
- continue 'next;
- }
- }
- deduped.push(cli_id);
- }
-
- Ok(deduped)
+ return Ok(cli_ids);
}
/// Convenience for [IdMap::parse] if a [gix::Repository] is available.
pub fn parse_using_repo<'a>(Still seeing if there is a quick way to get a test that validates this.
IdMap::parse can return duplicate CliIds for the same logical entity when a commit is reachable through both local and remote views. Callers interpret multiple matches as ambiguity and may fail operations (e.g. target/anchor resolution) even though only one entity is intended. Deduplicate parse results by entity identity so we preserve real ambiguity between distinct entities while preventing these false ambiguous-ID errors.
Byron
left a comment
There was a problem hiding this comment.
The conflict resolution dropped a line: #[test] 🤦♂️.
Have to get a proper merge editor now.
|
There's another issue illustrated in this screenshot - the commit CLI ID is too long. I've sent #12624 to fix that. |







IdMap::parse can return duplicate CliIds for the same logical entity when a
commit is reachable through both local and remote views. Callers interpret
multiple matches as ambiguity and may fail operations (e.g. target/anchor
resolution) even though only one entity is intended.
Deduplicate parse results by entity identity so we preserve real ambiguity
between distinct entities while preventing these false ambiguous-ID errors.
This is part 1 of 2 in a stack made with GitButler: