<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements. See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership. The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied. See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# Upstream Sync Process

This document describes how to sync `datafusion-python` to a new version of the
upstream `apache/datafusion` Rust crates. This is a recurring task: between
official releases the `main` branch tracks DataFusion via crates.io or GitHub
dependencies, and we periodically bump those dependencies to pick up new
features and bug fixes.

The work is broken into **three sequential PRs** rather than landing as one
large change. Splitting reviews along these lines keeps each PR focused, makes
breakage easier to bisect, and lets reviewers concentrate on one concern at a
time.

## PR 1: Bump DataFusion crate dependencies and fix breakage

**Goal:** update the upstream `datafusion` crate version and make the project
build, test, and lint cleanly against it.

1. Update the `datafusion` dependency in the root `Cargo.toml` (workspace
   section and dependencies). Any downstream `datafusion-*` crates pinned in
   `crates/core/Cargo.toml` should move to the matching version.
2. Run `cargo update -p datafusion` (or `cargo update` for a broader refresh)
   so `Cargo.lock` reflects the new pin.
3. Run the standard build and test commands and address compilation errors,
   API renames, signature changes, and behavior changes:
   - `cargo build`
   - `cargo test`
   - `pytest`
   - `pre-commit run --all-files`
4. Fix only what's needed to restore green CI. Resist the urge to bundle
   unrelated cleanups — those belong in their own PR.
5. If a breaking change in upstream requires a user-facing API change in
   `datafusion-python`, add the `api change` label and document the change
   in the PR description so it surfaces in the changelog.
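
A quick sanity check for step 1 is making sure every `datafusion-*` pin moved
together. The sketch below is illustrative only — the regex, file paths, and
TOML snippets are assumptions, not the project's actual manifests — but it
shows the kind of pin audit that catches a missed crate before `cargo build`
does:

```python
import re

# Match simple `crate = { ... version = "X.Y.Z" ... }` pins on one line.
# Real Cargo.toml files can spread a dependency over several lines; this
# sketch only handles the single-line form for illustration.
PIN_RE = re.compile(r'^(datafusion[\w-]*)\s*=.*?version\s*=\s*"([^"]+)"', re.M)

def collect_pins(cargo_toml_text):
    """Return {crate_name: version} for datafusion-* pins in one manifest."""
    return dict(PIN_RE.findall(cargo_toml_text))

def stale_pins(files, expected):
    """Given {path: manifest text}, list pins that missed the bump."""
    stale = []
    for path, text in files.items():
        for crate, version in sorted(collect_pins(text).items()):
            if version != expected:
                stale.append((path, crate, version))
    return stale

root = 'datafusion = { version = "52.0.0", features = ["pyarrow"] }\n'
core = ('datafusion-ffi = { version = "51.0.0" }\n'
        'datafusion-proto = { version = "52.0.0" }\n')
print(stale_pins({"Cargo.toml": root, "crates/core/Cargo.toml": core}, "52.0.0"))
# → [('crates/core/Cargo.toml', 'datafusion-ffi', '51.0.0')]
```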

**Reference PRs:** [#1311](https://github.com/apache/datafusion-python/pull/1311)
(DF51), [#1337](https://github.com/apache/datafusion-python/pull/1337) (DF52).

## PR 2: Consolidate transitive dependencies

**Goal:** reconcile the dependency tree after the upstream bump. The bump can
leave multiple versions of the same transitive crate (for example, two `arrow`
versions, or two `object_store` versions); we want to ship a single coherent
set.

1. Inspect the lockfile for duplicates:
   ```bash
   cargo tree --duplicates
   ```
2. For each duplicate that matters (Arrow, `object_store`, `parquet`,
   `tokio`, `arrow-flight`, etc.), update our direct dependency declarations
   in `Cargo.toml` to versions compatible with what upstream DataFusion now
   pulls in. The goal is one version of each ecosystem-critical crate.
3. Re-run `cargo update` and re-run the full test matrix. Some duplicates are
   benign (small leaf crates with no FFI surface) and can be left alone if
   reconciliation would force a much larger change. Use judgment.
4. If consolidating forces a behavioral change visible to users (for example,
   a newer `pyarrow`-compatible Arrow version), call it out in the PR
   description.
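
The `cargo tree --duplicates` output from step 1 can be long; grouping it by
crate name makes the duplicates that matter stand out. A minimal sketch,
assuming package lines start at column zero as `name vX.Y.Z` (the sample text
is made up, not real output from this repository):

```python
import re
from collections import defaultdict

# Package lines in `cargo tree` output start at column 0 as "name vX.Y.Z";
# indented tree lines (├──, └──, │) are reverse-dependency detail we skip.
PKG_RE = re.compile(r"^([A-Za-z0-9_-]+) v(\S+)", re.M)

def duplicate_versions(tree_output):
    """Return {crate: sorted versions} for crates seen with >1 version."""
    versions = defaultdict(set)
    for name, version in PKG_RE.findall(tree_output):
        versions[name].add(version)
    return {n: sorted(v) for n, v in versions.items() if len(v) > 1}

sample = """\
arrow v54.2.1
└── datafusion v51.0.0
arrow v55.1.0
└── datafusion-python v50.1.0
object_store v0.12.0
└── datafusion v51.0.0
"""
print(duplicate_versions(sample))
# → {'arrow': ['54.2.1', '55.1.0']}
```

In practice you would pipe the real command through this
(`subprocess.run(["cargo", "tree", "--duplicates"], ...)`) and triage each key
against step 2's list of ecosystem-critical crates.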

Keeping this work separate from PR 1 means PR 1 stays a "make it compile"
review and PR 2 stays a "tidy the dependency graph" review.

## PR 3: Fill API and documentation gaps

**Goal:** with the upstream version locked in, identify new APIs that landed
upstream, decide whether to expose them, and update agent-facing
documentation so it still matches the surface we ship.

1. Run the `check-upstream` skill (`.ai/skills/check-upstream/SKILL.md`) to
   diff the upstream Rust API against what's exposed in
   `python/datafusion/`. The skill covers scalar/aggregate/window/table
   functions, `DataFrame` methods, `SessionContext` methods, and FFI types.
   Invoke it from the assistant with `/check-upstream` (optionally scoped to
   one area, e.g. `/check-upstream scalar functions`).
2. For each gap, decide whether to:
   - Expose it now (small, obvious additions can land in this PR).
   - File a tracking issue (anything non-trivial — a separate PR per feature
     keeps reviews focused).
   - Skip it (internal-only or already covered by an existing API; record
     the decision in the "Evaluated and not requiring exposure" sections of
     the skill so future runs don't re-flag it).
3. Cross-reference the user-facing skill at
   [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
   against the current public API. Look for stale function names, missing
   newly exposed APIs, and examples that drifted from current behavior.
   Update `SKILL.md` and the relevant RST pages under
   `docs/source/user-guide/common-operations/` accordingly. (An
   `audit-skill-md` skill is planned to automate this step — once it lands
   under `.ai/skills/audit-skill-md/`, invoke it via `/audit-skill-md`.)
4. If new aggregate or window functions were exposed in step 2, also update:
   - `docs/source/user-guide/common-operations/aggregations.rst`
   - `docs/source/user-guide/common-operations/windows.rst`
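
The triage in steps 1 and 2 boils down to a set difference: upstream names,
minus what we already expose, minus what a previous audit recorded as not
needing exposure. The name lists below are hypothetical — the real skill
walks the actual crates and `python/datafusion/` modules — but the shape of
the computation is this:

```python
def api_gaps(upstream_names, exposed_names, skipped_names):
    """Return upstream names that are neither exposed nor recorded as skipped.

    `skipped_names` plays the role of the skill's "Evaluated and not
    requiring exposure" sections, so resolved gaps are not re-flagged.
    """
    return sorted(set(upstream_names) - set(exposed_names) - set(skipped_names))

upstream = ["array_agg", "approx_median", "regexp_count", "internal_helper"]
exposed = ["array_agg", "approx_median"]
skipped = ["internal_helper"]  # hypothetical internal-only function

print(api_gaps(upstream, exposed, skipped))
# → ['regexp_count']
```

Each name that survives the difference gets one of the three dispositions
above: expose now, file a tracking issue, or add it to the skipped list.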

PR 3 is the natural place to land small Pythonic-interface improvements
discovered during the audit. Larger reshapes should still get their own PR.

## Why three PRs

- **Bisectable.** If a regression appears, `git bisect` lands on the
  responsible PR (compile fix, dependency consolidation, or API addition)
  rather than a single mega-commit.
- **Reviewable.** Each PR has a single concern. Reviewers reading PR 1 don't
  need to also reason about whether new APIs are well-named.
- **Skippable.** Some upstream syncs are pure version bumps with no new APIs
  worth exposing. PR 3 can be empty or merged as a no-op if the audit comes
  back clean.

## Related documents

- [`README.md`](README.md) — the broader release process (this sync work
  feeds into the next official release).
- [`.ai/skills/check-upstream/SKILL.md`](../../.ai/skills/check-upstream/SKILL.md)
  — API coverage audit.
- [`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
  — user-facing agent guide kept in sync via PR 3.