Commit c4fcfce

timsaucer and claude committed
docs: add upstream sync process documentation
Document the three-PR workflow used to sync to a newer upstream apache/datafusion version: bump crate deps + fix breakage, consolidate transitive deps, then fill API and documentation gaps via /check-upstream. Cross-reference from dev/release/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 13b2c47 commit c4fcfce

2 files changed

Lines changed: 140 additions & 0 deletions

dev/release/README.md

Lines changed: 6 additions & 0 deletions
@@ -33,6 +33,12 @@ release branch without blocking ongoing development in the `main` branch.
We can cherry-pick commits from the `main` branch into `branch-53` as needed and then create new patch releases
from that branch.

## Upstream Sync

Between releases the `main` branch is periodically synced to a newer upstream `apache/datafusion` version. This is
broken into a three-PR workflow (bump + fix breakage, consolidate transitive deps, fill API and documentation gaps).
See [`upstream-sync.md`](upstream-sync.md) for the full process.

## Detailed Guide

### Pre-requisites

dev/release/upstream-sync.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Upstream Sync Process

This document describes how to sync `datafusion-python` to a new version of the
upstream `apache/datafusion` Rust crates. This is a recurring task: between
official releases the `main` branch tracks DataFusion via crates.io or GitHub
dependencies, and we periodically bump those dependencies to pick up new
features and bug fixes.

The work is broken into **three sequential PRs** rather than landing as one
large change. Splitting reviews along these lines keeps each PR focused, makes
breakage easier to bisect, and lets reviewers concentrate on one concern at a
time.

## PR 1: Bump DataFusion crate dependencies and fix breakage

**Goal:** update the upstream `datafusion` crate version and make the project
build, test, and lint cleanly against it.

1. Update the `datafusion` dependency in the root `Cargo.toml` (workspace
   section and dependencies). Any downstream `datafusion-*` crates pinned in
   `crates/core/Cargo.toml` should move to the matching version.
2. Run `cargo update -p datafusion` (or `cargo update` for a broader refresh)
   so `Cargo.lock` reflects the new pin.
3. Run the standard build and test commands and address compilation errors,
   API renames, signature changes, and behavior changes:
   - `cargo build`
   - `cargo test`
   - `pytest`
   - `pre-commit run --all-files`
4. Fix only what's needed to restore green CI. Resist the urge to bundle
   unrelated cleanups; those belong in their own PR.
5. If a breaking change in upstream requires a user-facing API change in
   `datafusion-python`, add the `api change` label and document the change
   in the PR description so it surfaces in the changelog.

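The checks in step 3 can be run as a single fail-fast sequence, so that build errors are fixed before test failures are examined. The following is only a sketch: `run_steps` is a hypothetical helper, not something the repository provides, and it assumes each step is a simple command line with no quoting or shell operators.

```bash
# Hypothetical helper (not part of the repository): run each command string
# in order and stop at the first one that fails.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    # Word splitting is intentional here: each step is a plain command line
    # such as "cargo build", with no quoting or shell operators.
    if ! ${step}; then
      echo "FAILED: ${step}" >&2
      return 1
    fi
  done
  echo "all steps passed"
}
```

For the commands listed in step 3, a call might look like `run_steps 'cargo build' 'cargo test' 'pytest' 'pre-commit run --all-files'`.
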
**Reference PRs:** [#1311](https://github.com/apache/datafusion-python/pull/1311)
(DF51), [#1337](https://github.com/apache/datafusion-python/pull/1337) (DF52).

## PR 2: Consolidate transitive dependencies

**Goal:** after the upstream bump, the dependency tree may contain multiple
versions of the same transitive crate (for example, two `arrow` versions or two
`object_store` versions). Reconcile these so we ship a single coherent set.

1. Inspect the resolved dependency graph for duplicates:
   ```bash
   cargo tree --duplicates
   ```
2. For each duplicate that matters (`arrow`, `object_store`, `parquet`,
   `tokio`, `arrow-flight`, etc.), update our direct dependency declarations
   in `Cargo.toml` to versions compatible with what upstream DataFusion now
   pulls in. The goal is one version of each ecosystem-critical crate.
3. Re-run `cargo update` and re-run the full test matrix. Some duplicates are
   benign (small leaf crates with no FFI surface) and can be left alone if
   reconciliation would force a much larger change. Use judgment.
4. If consolidating forces a behavioral change visible to users (for example,
   a newer `pyarrow`-compatible Arrow version), call it out in the PR
   description.

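The output of `cargo tree --duplicates` can be long, so a short summary of which crates appear in more than one version helps when deciding what matters in step 2. The sketch below is a hypothetical helper, not part of the repository. It assumes each duplicated package is printed as a root line of the form `name vX.Y.Z` starting in the first column, with its dependents indented beneath it.

```bash
# Hypothetical helper (not part of the repository): read the output of
# `cargo tree --duplicates` on stdin and print one line per crate that
# appears in more than one version.
list_duplicates() {
  awk '
    # Root entries start in column 0: a crate name, then a version like v1.2.3.
    /^[A-Za-z0-9_-]+ v[0-9]/ {
      key = $1 SUBSEP $2
      if (!(key in seen)) {
        seen[key] = 1
        versions[$1] = versions[$1] " " $2
        count[$1]++
      }
    }
    END {
      for (name in count)
        if (count[name] > 1) print name ":" versions[name]
    }
  ' | sort
}
```

A typical invocation would be `cargo tree --duplicates | list_duplicates`. Crates that resolve to a single version are omitted from the summary.
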
Keeping this work separate from PR 1 means PR 1 stays a "make it compile"
review and PR 2 stays a "tidy the dependency graph" review.

## PR 3: Fill API and documentation gaps

**Goal:** with the upstream version locked in, identify new APIs that landed
upstream and decide whether to expose them, and update agent-facing
documentation so it still matches the surface we ship.

1. Run the `check-upstream` skill (`.ai/skills/check-upstream/SKILL.md`) to
   diff the upstream Rust API against what's exposed in
   `python/datafusion/`. The skill covers scalar/aggregate/window/table
   functions, `DataFrame` methods, `SessionContext` methods, and FFI types.
   Invoke it from the assistant with `/check-upstream` (optionally scoped to
   one area, e.g. `/check-upstream scalar functions`).
2. For each gap, decide whether to:
   - Expose it now (small, obvious additions can land in this PR).
   - File a tracking issue (anything non-trivial; a separate PR per feature
     keeps reviews focused).
   - Skip it (internal-only or already covered by an existing API; record
     the decision in the "Evaluated and not requiring exposure" sections of
     the skill so future runs don't re-flag it).
3. Cross-reference the user-facing skill at
   [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
   against the current public API. Look for stale function names, missing
   newly exposed APIs, and examples that drifted from current behavior.
   Update `SKILL.md` and the relevant RST pages under
   `docs/source/user-guide/common-operations/` accordingly. (An
   `audit-skill-md` skill is planned to automate this step; once it lands
   under `.ai/skills/audit-skill-md/`, invoke it via `/audit-skill-md`.)
4. If new aggregate or window functions were exposed in step 2, also update:
   - `docs/source/user-guide/common-operations/aggregations.rst`
   - `docs/source/user-guide/common-operations/windows.rst`

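Until the planned skill exists, the stale-name check in step 3 comes down to a set difference between two lists of names. The sketch below assumes two hypothetical plain-text files, one name per line: the function names referenced in `SKILL.md` and the names currently exported by the Python package. How those lists are produced is left open; `stale_names` is an illustrative helper, not part of the repository.

```bash
# Hypothetical helper (not part of the repository): print names listed in the
# first file that do not appear in the second. Both files hold one name per
# line; order does not matter.
stale_names() {
  local referenced="$1" exported="$2"
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file.
  grep -Fxv -f "$exported" "$referenced" | sort -u
}
```

Swapping the two arguments gives the opposite view: names that are exported but never mentioned in the skill, which are candidates for the "missing newly exposed APIs" check.
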
PR 3 is the natural place to land small Pythonic-interface improvements
discovered during the audit. Larger reshapes should still get their own PR.

## Why three PRs

- **Bisectable.** If a regression appears, `git bisect` lands on the
  responsible PR (compile fix, dependency consolidation, or API addition)
  rather than a single mega-commit.
- **Reviewable.** Each PR has a single concern. Reviewers reading PR 1 don't
  need to also reason about whether new APIs are well-named.
- **Skippable.** Some upstream syncs are pure version bumps with no new APIs
  worth exposing. PR 3 can be skipped if the audit comes back clean.

## Related documents

- [`README.md`](README.md): the broader release process (this sync work
  feeds into the next official release).
- [`.ai/skills/check-upstream/SKILL.md`](../../.ai/skills/check-upstream/SKILL.md):
  API coverage audit.
- [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md):
  user-facing agent guide kept in sync via PR 3.
