|
| 1 | +<!--- |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | +
|
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | +
|
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | +--> |
| 19 | + |
| 20 | +--- |
| 21 | +name: audit-skill-md |
| 22 | +description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires upstream DataFusion vX" note. Run after upstream syncs and before each release. |
| 23 | +argument-hint: [scope] (e.g., "session-context", "dataframe", "expr", "functions", "patterns", "pitfalls", "version-notes", "all") |
| 24 | +--- |
| 25 | + |
| 26 | +# Audit `skills/datafusion_python/SKILL.md` |
| 27 | + |
| 28 | +You are auditing the user-facing skill at |
| 29 | +[`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md) |
| 30 | +against the current state of the Python API. The skill is the source of truth |
| 31 | +for how AI coding assistants are taught to write `datafusion-python` code, so |
| 32 | +it must match what the project actually ships. This skill identifies gaps |
| 33 | +caused by upstream syncs, refactors, or renames, and (if asked) applies the |
| 34 | +edits directly to `SKILL.md`. |
| 35 | + |
| 36 | +The skill is most usefully run **after** the `check-upstream` step of an |
| 37 | +upstream sync (see `dev/release/upstream-sync.md`) — once any new APIs are |
| 38 | +exposed, this skill makes sure they get documented. |
| 39 | + |
| 40 | +## What the skill covers |
| 41 | + |
| 42 | +The user-facing `SKILL.md` documents these public surfaces. This list is not |
| 43 | +exhaustive — if a new top-level area is added (e.g., a new `Catalog` API |
| 44 | +exposed at the package root), include it. |
| 45 | + |
| 46 | +| Surface | Module | Sections in SKILL.md | |
| 47 | +|---|---|---| |
| 48 | +| `SessionContext` | `python/datafusion/context.py` | "Data Loading" | |
| 49 | +| `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" | |
| 50 | +| `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" | |
| 51 | +| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout | |
| 52 | +| Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" | |
| 53 | + |
| 54 | +## Scope argument |
| 55 | + |
| 56 | +The user may specify a scope via `$ARGUMENTS` to limit the audit. If no scope |
| 57 | +is given or `all` is specified, audit every area. |
| 58 | + |
| 59 | +| Scope | Audit target | |
| 60 | +|---|---| |
| 61 | +| `session-context` | `SessionContext` methods and the "Data Loading" section | |
| 62 | +| `dataframe` | `DataFrame` methods and the operations / executing / patterns sections | |
| 63 | +| `expr` | `Expr` methods/operators and the "Expression Building" section | |
| 64 | +| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section | |
| 65 | +| `patterns` | "Idiomatic Patterns" section — confirm patterns still match recommended style | |
| 66 | +| `pitfalls` | "Common Pitfalls" — confirm each pitfall still reproduces, drop ones fixed upstream | |
| 67 | +| `version-notes` | Cross-check version annotations (see below) | |
| 68 | +| `all` | Everything above | |
| 69 | + |
| 70 | +## Inputs to read |
| 71 | + |
| 72 | +Before producing the report: |
| 73 | + |
| 74 | +1. `skills/datafusion_python/SKILL.md` — the document being audited. |
| 75 | +2. The relevant Python module(s) for the chosen scope. Public surface is the |
| 76 | + `__all__` list (where defined) plus `class` and `def` symbols not prefixed |
| 77 | + with `_`. |
| 78 | +3. `Cargo.toml` (root) for the current `datafusion-python` version (e.g., |
| 79 | + `version = "53.0.0"`). The major version always matches the upstream |
| 80 | + `datafusion` crate, so a single `datafusion-python` version expresses |
| 81 | + both. `python/datafusion/__init__.py`'s `__version__` is the same value |
| 82 | + exposed at runtime. |
| 83 | +4. Recent commits touching the relevant module(s) for context on what |
| 84 | + changed since the last sync: |
| 85 | + ```bash |
| 86 | + git log --oneline -- python/datafusion/dataframe.py | head -20 |
| 87 | + ``` |
| 88 | + |
| 89 | +## What to look for |
| 90 | + |
| 91 | +Walk through each scoped area and flag four kinds of issues. |
| 92 | + |
| 93 | +### 1. New APIs not mentioned |
| 94 | + |
| 95 | +For each public symbol in the module's `__all__` (or each public class |
| 96 | +method), check whether it appears anywhere in `SKILL.md`. A symbol is |
| 97 | +"covered" if it shows up in: |
| 98 | + |
| 99 | +- A code block (the strongest signal — it's demonstrated). |
| 100 | +- The "Available Functions (Categorized)" list. |
| 101 | +- The SQL-to-DataFrame Reference table. |
| 102 | + |
| 103 | +**Decide whether each missing symbol deserves an entry.** Not every public |
| 104 | +symbol belongs in `SKILL.md` — the skill is curated for the patterns users |
| 105 | +hit daily, not exhaustive API reference. Use these heuristics: |
| 106 | + |
| 107 | +- **Add it** if it replaces or supersedes something already in the skill |
| 108 | + (e.g., a new operation that is the idiomatic alternative to a documented |
| 109 | + workaround). |
| 110 | +- **Add it** if it fits a category already present (a new aggregate function |
| 111 | + goes in the aggregate list; a new join type goes in the joining section). |
| 112 | +- **Add it** if it changes how a documented pattern should be written. |
| 113 | +- **Skip it** if it is genuinely niche / advanced / experimental. |
| 114 | +- **Skip it** if it is internal plumbing exposed for FFI but not user-facing. |
| 115 | + |
| 116 | +When you flag a missing symbol, include a one-line proposed insertion point |
| 117 | +(which section / which table row) so a reviewer can decide quickly. |
| 118 | + |
| 119 | +### 2. Stale mentions |
| 120 | + |
| 121 | +For each function name, method name, or import shown in `SKILL.md`, verify it |
| 122 | +still exists in the current API: |
| 123 | + |
| 124 | +- Function names mentioned in prose or in the categorized list should appear |
| 125 | + in `python/datafusion/functions.py`'s `__all__`. |
| 126 | +- Method calls in code blocks should resolve against the current class. |
| 127 | +- Imports (`from datafusion import ...`) should succeed against the current |
| 128 | + `__init__.py`. |
| 129 | + |
| 130 | +A quick way to check imports without running them: |
| 131 | + |
| 132 | +```bash |
| 133 | +python -c "from datafusion import SessionContext, col, lit; from datafusion import functions as F; print('ok')" |
| 134 | +``` |
| 135 | + |
| 136 | +For each stale mention, propose either: |
| 137 | +- a rename to the current name, or |
| 138 | +- removal if the API is gone with no replacement. |
| 139 | + |
| 140 | +### 3. Examples that drifted from idiomatic style |
| 141 | + |
| 142 | +The skill teaches a Pythonic style: prefer plain strings to `col(...)` when a |
| 143 | +column reference is all you need; prefer raw Python values to `lit(...)` |
| 144 | +where auto-wrapping applies. Recent refactors (see the `make-pythonic` |
| 145 | +skill) keep moving more functions toward accepting native types. |
| 146 | + |
| 147 | +For each code example in `SKILL.md`, check: |
| 148 | + |
| 149 | +- Does it use `lit(value)` where a raw value would work? Comparison RHS, |
| 150 | + arithmetic with a column, etc. all auto-wrap. (Reserve `lit()` for the |
| 151 | + cases listed in pitfall #2.) |
| 152 | +- Does it use `col("name")` where a plain string would work? `select(...)`, |
| 153 | + `aggregate([keys], ...)`, `sort(...)`, `sort_by(...)` all accept plain |
| 154 | + name strings. |
| 155 | +- Do `functions.py` calls match the current pythonic signature for that |
| 156 | + function? If `make-pythonic` recently changed a signature (e.g., |
| 157 | + `repeat(string, n: Expr | int)`), the example should pass `3` rather than |
| 158 | + `lit(3)`. |
| 159 | +- Does any example use a deprecated or removed parameter name? |
| 160 | + |
| 161 | +For drift, propose the updated snippet. If the change is purely stylistic |
| 162 | +and the older form still works, mark the suggestion as **non-blocking**. |
| 163 | + |
| 164 | +### 4. Missing or stale version notes |
| 165 | + |
| 166 | +When an API depends on a specific version, the skill should say so — |
| 167 | +otherwise an agent referencing the skill in an older project will write |
| 168 | +code that fails at import or at runtime. |
| 169 | + |
| 170 | +`datafusion-python` shares its major version number with the upstream |
| 171 | +`datafusion` crate (e.g., `datafusion-python 53.x` tracks upstream |
| 172 | +`datafusion 53`). Always express version requirements in terms of |
| 173 | +`datafusion-python` only — there is no need to call out upstream and |
| 174 | +package versions separately. |
| 175 | + |
| 176 | +Add a version note when: |
| 177 | + |
| 178 | +- A method or function shown in the skill was added in a specific release |
| 179 | + (e.g., a new `DataFrame` method that didn't exist before 53). |
| 180 | +- A breaking change altered behavior in a specific release (signature |
| 181 | + change, default-value change, new required argument). |
| 182 | +- A pitfall was fixed in a specific release. Either annotate the pitfall |
| 183 | + block with "fixed in datafusion-python NN, kept here for users on older |
| 184 | + versions" or remove it once the supported floor moves past that version. |
| 185 | + |
| 186 | +Format for version notes (inline, italicized): |
| 187 | + |
| 188 | +```markdown |
| 189 | +*Requires datafusion-python 53 or newer.* |
| 190 | +``` |
| 191 | + |
| 192 | +For each missing/stale version note, propose the exact line and where it |
| 193 | +belongs. |
| 194 | + |
| 195 | +## How to discover changes since the last audit |
| 196 | + |
| 197 | +If the user supplies a previous version or commit SHA where the audit was |
| 198 | +last run, diff against it: |
| 199 | + |
| 200 | +```bash |
| 201 | +# Public-API-relevant changes since SHA <prev> |
| 202 | +git log --oneline <prev>..HEAD -- python/datafusion/ |
| 203 | + |
| 204 | +# Whose signatures actually moved |
| 205 | +git diff <prev>..HEAD -- python/datafusion/functions.py | grep '^[+-]def ' |
| 206 | +``` |
| 207 | + |
| 208 | +If no prior audit point is given, fall back to "since the last upstream |
| 209 | +sync" by inspecting commits that touch `Cargo.toml`'s `datafusion` pin: |
| 210 | + |
| 211 | +```bash |
| 212 | +git log --oneline -- Cargo.toml | grep -i datafusion | head -5 |
| 213 | +``` |
| 214 | + |
| 215 | +## Output Format |
| 216 | + |
| 217 | +Produce a report grouped by scope. Each finding is one bullet with a |
| 218 | +proposed action, so a maintainer can review the list quickly and apply |
| 219 | +edits in order. |
| 220 | + |
| 221 | +``` |
| 222 | +## SKILL.md Audit (scope: <scope>) |
| 223 | +
|
| 224 | +Audited against: |
| 225 | +- skills/datafusion_python/SKILL.md @ <git SHA / "working tree"> |
| 226 | +- datafusion-python <version> |
| 227 | +
|
| 228 | +### New APIs to cover |
| 229 | +- `DataFrame.foo()` — added in datafusion-python 53. Insert in "DataFrame Operations Quick Reference" under <subsection>. |
| 230 | + Proposed snippet: |
| 231 | + ```python |
| 232 | + df.foo(...) |
| 233 | + ``` |
| 234 | + |
| 235 | +### Stale mentions |
| 236 | +- "old_function_name" referenced in the categorized list (line N) — renamed to "new_function_name". Replace. |
| 237 | + |
| 238 | +### Drifted examples |
| 239 | +- "Filtering" section, `df.filter(col("a") > lit(10))` — drop `lit(10)`, auto-wrap applies. (non-blocking) |
| 240 | +- "Aggregation" section, `df.aggregate([col("region")], ...)` — pass `"region"` as a plain string per "Projection" guidance. |
| 241 | + |
| 242 | +### Version notes |
| 243 | +- `DataFrame.foo()` block needs *Requires datafusion-python 53 or newer.* |
| 244 | +- "Common Pitfalls" #N — fixed in datafusion-python 53; remove the pitfall and update the SQL-to-DataFrame row to no longer flag the workaround. |
| 245 | + |
| 246 | +### No-change confirmed |
| 247 | +- `SessionContext` data-loading section — all entries match current API. |
| 248 | +``` |
| 249 | +
|
| 250 | +If asked to apply the changes, edit `skills/datafusion_python/SKILL.md` |
| 251 | +directly with `Edit` tool calls, one finding at a time, and re-run the |
| 252 | +relevant doctest sanity check at the end: |
| 253 | +
|
| 254 | +```bash |
| 255 | +pytest --doctest-modules python/datafusion -q |
| 256 | +``` |
| 257 | + |
| 258 | +## What NOT to flag |
| 259 | + |
| 260 | +- **Internal helpers / underscored names.** Private symbols are not part of |
| 261 | + the user-facing surface. |
| 262 | +- **Functions intentionally omitted.** Niche / advanced APIs (custom |
| 263 | + catalogs, raw FFI plumbing, low-level execution plan accessors) live in |
| 264 | + the API reference, not the skill. If an omission was deliberate and a |
| 265 | + comment / commit explains why, leave it out. |
| 266 | +- **Style nits inside explanatory prose.** The skill mixes example code and |
| 267 | + prose; only enforce the pythonic style on actual code blocks. |
| 268 | +- **Function-by-function coverage of every `functions.py` symbol.** The |
| 269 | + "Available Functions (Categorized)" list is curated by category, not |
| 270 | + exhaustive. Adding a single new aggregate to the aggregate list is |
| 271 | + enough — the user follows the pointer to the API reference for the rest. |
| 272 | + |
| 273 | +## Coordination with other skills |
| 274 | + |
| 275 | +- Run `/check-upstream` first to expose any missing upstream APIs into the |
| 276 | + Python layer. Without that, this skill cannot recommend documenting |
| 277 | + something that is not yet exposed. |
| 278 | +- Run `/make-pythonic` before this skill if a Pythonic-signature pass is |
| 279 | + planned for a release — that way this skill can update examples to the |
| 280 | + final signature in one shot rather than churning them twice. |
| 281 | +- The order during an upstream sync (PR 3 of `dev/release/upstream-sync.md`) |
| 282 | + is therefore: `/check-upstream` → `/make-pythonic` (optional) → |
| 283 | + `/audit-skill-md`. |
0 commit comments