Skip to content

Commit 1c4cfe9

Browse files
timsaucerclaude
andcommitted
docs: add audit-skill-md skill
New AI agent skill at .ai/skills/audit-skill-md/SKILL.md to keep the user-facing skills/datafusion_python/SKILL.md in sync with the public Python API. Audits SessionContext, DataFrame, Expr, and functions surfaces for new APIs not covered, stale mentions, examples that drifted from idiomatic style, and missing version notes. Wired into PR 3 of the upstream sync workflow documented in dev/release/upstream-sync.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c4fcfce commit 1c4cfe9

2 files changed

Lines changed: 291 additions & 7 deletions

File tree

.ai/skills/audit-skill-md/SKILL.md

Lines changed: 283 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,283 @@
1+
<!---
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
---
21+
name: audit-skill-md
22+
description: Audit the user-facing skill at skills/datafusion_python/SKILL.md against the current public Python API. Find new APIs that should be documented, stale mentions of removed/renamed APIs, examples that drifted from current idiomatic style, and places that need a "requires upstream DataFusion vX" note. Run after upstream syncs and before each release.
23+
argument-hint: [scope] (e.g., "session-context", "dataframe", "expr", "functions", "patterns", "pitfalls", "version-notes", "all")
24+
---
25+
26+
# Audit `skills/datafusion_python/SKILL.md`
27+
28+
You are auditing the user-facing skill at
29+
[`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
30+
against the current state of the Python API. The skill is the source of truth
31+
for how AI coding assistants are taught to write `datafusion-python` code, so
32+
it must match what the project actually ships. This skill identifies gaps
33+
caused by upstream syncs, refactors, or renames, and (if asked) applies the
34+
edits directly to `SKILL.md`.
35+
36+
The skill is most usefully run **after** the `check-upstream` step of an
37+
upstream sync (see `dev/release/upstream-sync.md`) — once any new APIs are
38+
exposed, this skill makes sure they get documented.
39+
40+
## What the skill covers
41+
42+
The user-facing `SKILL.md` documents these public surfaces. This list is not
43+
exhaustive — if a new top-level area is added (e.g., a new `Catalog` API
44+
exposed at the package root), include it.
45+
46+
| Surface | Module | Sections in SKILL.md |
47+
|---|---|---|
48+
| `SessionContext` | `python/datafusion/context.py` | "Data Loading" |
49+
| `DataFrame` | `python/datafusion/dataframe.py` | "DataFrame Operations Quick Reference", "Executing and Collecting Results", "Idiomatic Patterns" |
50+
| `Expr` | `python/datafusion/expr.py` | "Expression Building", "Common Pitfalls" |
51+
| `functions` | `python/datafusion/functions.py` | "Available Functions (Categorized)", scattered uses throughout |
52+
| Top-level helpers (`col`, `lit`, `WindowFrame`, ...) | `python/datafusion/__init__.py` | "Import Conventions", "Core Abstractions" |
53+
54+
## Scope argument
55+
56+
The user may specify a scope via `$ARGUMENTS` to limit the audit. If no scope
57+
is given or `all` is specified, audit every area.
58+
59+
| Scope | Audit target |
60+
|---|---|
61+
| `session-context` | `SessionContext` methods and the "Data Loading" section |
62+
| `dataframe` | `DataFrame` methods and the operations / executing / patterns sections |
63+
| `expr` | `Expr` methods/operators and the "Expression Building" section |
64+
| `functions` | `functions.py` `__all__` and the "Available Functions (Categorized)" section |
65+
| `patterns` | "Idiomatic Patterns" section — confirm patterns still match recommended style |
66+
| `pitfalls` | "Common Pitfalls" — confirm each pitfall still reproduces, drop ones fixed upstream |
67+
| `version-notes` | Cross-check version annotations (see below) |
68+
| `all` | Everything above |
69+
70+
## Inputs to read
71+
72+
Before producing the report:
73+
74+
1. `skills/datafusion_python/SKILL.md` — the document being audited.
75+
2. The relevant Python module(s) for the chosen scope. Public surface is the
76+
`__all__` list (where defined) plus `class` and `def` symbols not prefixed
77+
with `_`.
78+
3. `Cargo.toml` (root) for the current `datafusion-python` version (e.g.,
79+
`version = "53.0.0"`). The major version always matches the upstream
80+
`datafusion` crate, so a single `datafusion-python` version expresses
81+
both. `python/datafusion/__init__.py`'s `__version__` is the same value
82+
exposed at runtime.
83+
4. Recent commits touching the relevant module(s) for context on what
84+
changed since the last sync:
85+
```bash
86+
git log --oneline -- python/datafusion/dataframe.py | head -20
87+
```
88+
89+
## What to look for
90+
91+
Walk through each scoped area and flag four kinds of issues.
92+
93+
### 1. New APIs not mentioned
94+
95+
For each public symbol in the module's `__all__` (or each public class
96+
method), check whether it appears anywhere in `SKILL.md`. A symbol is
97+
"covered" if it shows up in:
98+
99+
- A code block (the strongest signal — it's demonstrated).
100+
- The "Available Functions (Categorized)" list.
101+
- The SQL-to-DataFrame Reference table.
102+
103+
**Decide whether each missing symbol deserves an entry.** Not every public
104+
symbol belongs in `SKILL.md` — the skill is curated for the patterns users
105+
hit daily, not exhaustive API reference. Use these heuristics:
106+
107+
- **Add it** if it replaces or supersedes something already in the skill
108+
(e.g., a new operation that is the idiomatic alternative to a documented
109+
workaround).
110+
- **Add it** if it fits a category already present (a new aggregate function
111+
goes in the aggregate list; a new join type goes in the joining section).
112+
- **Add it** if it changes how a documented pattern should be written.
113+
- **Skip it** if it is genuinely niche / advanced / experimental.
114+
- **Skip it** if it is internal plumbing exposed for FFI but not user-facing.
115+
116+
When you flag a missing symbol, include a one-line proposed insertion point
117+
(which section / which table row) so a reviewer can decide quickly.
118+
119+
### 2. Stale mentions
120+
121+
For each function name, method name, or import shown in `SKILL.md`, verify it
122+
still exists in the current API:
123+
124+
- Function names mentioned in prose or in the categorized list should appear
125+
in `python/datafusion/functions.py`'s `__all__`.
126+
- Method calls in code blocks should resolve against the current class.
127+
- Imports (`from datafusion import ...`) should succeed against the current
128+
`__init__.py`.
129+
130+
A quick way to check imports without running them:
131+
132+
```bash
133+
python -c "from datafusion import SessionContext, col, lit; from datafusion import functions as F; print('ok')"
134+
```
135+
136+
For each stale mention, propose either:
137+
- a rename to the current name, or
138+
- removal if the API is gone with no replacement.
139+
140+
### 3. Examples that drifted from idiomatic style
141+
142+
The skill teaches a Pythonic style: prefer plain strings to `col(...)` when a
143+
column reference is all you need; prefer raw Python values to `lit(...)`
144+
where auto-wrapping applies. Recent refactors (see the `make-pythonic`
145+
skill) keep moving more functions toward accepting native types.
146+
147+
For each code example in `SKILL.md`, check:
148+
149+
- Does it use `lit(value)` where a raw value would work? Comparison RHS,
150+
arithmetic with a column, etc. all auto-wrap. (Reserve `lit()` for the
151+
cases listed in pitfall #2.)
152+
- Does it use `col("name")` where a plain string would work? `select(...)`,
153+
`aggregate([keys], ...)`, `sort(...)`, `sort_by(...)` all accept plain
154+
name strings.
155+
- Do `functions.py` calls match the current pythonic signature for that
156+
function? If `make-pythonic` recently changed a signature (e.g.,
157+
`repeat(string, n: Expr | int)`), the example should pass `3` rather than
158+
`lit(3)`.
159+
- Does any example use a deprecated or removed parameter name?
160+
161+
For drift, propose the updated snippet. If the change is purely stylistic
162+
and the older form still works, mark the suggestion as **non-blocking**.
163+
164+
### 4. Missing or stale version notes
165+
166+
When an API depends on a specific version, the skill should say so —
167+
otherwise an agent referencing the skill in an older project will write
168+
code that fails at import or at runtime.
169+
170+
`datafusion-python` shares its major version number with the upstream
171+
`datafusion` crate (e.g., `datafusion-python 53.x` tracks upstream
172+
`datafusion 53`). Always express version requirements in terms of
173+
`datafusion-python` only — there is no need to call out upstream and
174+
package versions separately.
175+
176+
Add a version note when:
177+
178+
- A method or function shown in the skill was added in a specific release
179+
(e.g., a new `DataFrame` method that didn't exist before 53).
180+
- A breaking change altered behavior in a specific release (signature
181+
change, default-value change, new required argument).
182+
- A pitfall was fixed in a specific release. Either annotate the pitfall
183+
block with "fixed in datafusion-python NN, kept here for users on older
184+
versions" or remove it once the supported floor moves past that version.
185+
186+
Format for version notes (inline, italicized):
187+
188+
```markdown
189+
*Requires datafusion-python 53 or newer.*
190+
```
191+
192+
For each missing/stale version note, propose the exact line and where it
193+
belongs.
194+
195+
## How to discover changes since the last audit
196+
197+
If the user supplies a previous version or commit SHA where the audit was
198+
last run, diff against it:
199+
200+
```bash
201+
# Public-API-relevant changes since SHA <prev>
202+
git log --oneline <prev>..HEAD -- python/datafusion/
203+
204+
# Whose signatures actually moved
205+
git diff <prev>..HEAD -- python/datafusion/functions.py | grep '^[+-]def '
206+
```
207+
208+
If no prior audit point is given, fall back to "since the last upstream
209+
sync" by inspecting commits that touch `Cargo.toml`'s `datafusion` pin:
210+
211+
```bash
212+
git log --oneline -- Cargo.toml | grep -i datafusion | head -5
213+
```
214+
215+
## Output Format
216+
217+
Produce a report grouped by scope. Each finding is one bullet with a
218+
proposed action, so a maintainer can review the list quickly and apply
219+
edits in order.
220+
221+
```
222+
## SKILL.md Audit (scope: <scope>)
223+
224+
Audited against:
225+
- skills/datafusion_python/SKILL.md @ <git SHA / "working tree">
226+
- datafusion-python <version>
227+
228+
### New APIs to cover
229+
- `DataFrame.foo()` — added in datafusion-python 53. Insert in "DataFrame Operations Quick Reference" under <subsection>.
230+
Proposed snippet:
231+
```python
232+
df.foo(...)
233+
```
234+
235+
### Stale mentions
236+
- "old_function_name" referenced in the categorized list (line N) — renamed to "new_function_name". Replace.
237+
238+
### Drifted examples
239+
- "Filtering" section, `df.filter(col("a") > lit(10))` — drop `lit(10)`, auto-wrap applies. (non-blocking)
240+
- "Aggregation" section, `df.aggregate([col("region")], ...)` — pass `"region"` as a plain string per "Projection" guidance.
241+
242+
### Version notes
243+
- `DataFrame.foo()` block needs *Requires datafusion-python 53 or newer.*
244+
- "Common Pitfalls" #N — fixed in datafusion-python 53; remove the pitfall and update the SQL-to-DataFrame row to no longer flag the workaround.
245+
246+
### No-change confirmed
247+
- `SessionContext` data-loading section — all entries match current API.
248+
```
249+
250+
If asked to apply the changes, edit `skills/datafusion_python/SKILL.md`
251+
directly with `Edit` tool calls, one finding at a time, and re-run the
252+
relevant doctest sanity check at the end:
253+
254+
```bash
255+
pytest --doctest-modules python/datafusion -q
256+
```
257+
258+
## What NOT to flag
259+
260+
- **Internal helpers / underscored names.** Private symbols are not part of
261+
the user-facing surface.
262+
- **Functions intentionally omitted.** Niche / advanced APIs (custom
263+
catalogs, raw FFI plumbing, low-level execution plan accessors) live in
264+
the API reference, not the skill. If an omission was deliberate and a
265+
comment / commit explains why, leave it out.
266+
- **Style nits inside explanatory prose.** The skill mixes example code and
267+
prose; only enforce the pythonic style on actual code blocks.
268+
- **Function-by-function coverage of every `functions.py` symbol.** The
269+
"Available Functions (Categorized)" list is curated by category, not
270+
exhaustive. Adding a single new aggregate to the aggregate list is
271+
enough — the user follows the pointer to the API reference for the rest.
272+
273+
## Coordination with other skills
274+
275+
- Run `/check-upstream` first to expose any missing upstream APIs into the
276+
Python layer. Without that, this skill cannot recommend documenting
277+
something that is not yet exposed.
278+
- Run `/make-pythonic` before this skill if a Pythonic-signature pass is
279+
planned for a release — that way this skill can update examples to the
280+
final signature in one shot rather than churning them twice.
281+
- The order during an upstream sync (PR 3 of `dev/release/upstream-sync.md`)
282+
is therefore: `/check-upstream``/make-pythonic` (optional) →
283+
`/audit-skill-md`.

dev/release/upstream-sync.md

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -98,14 +98,15 @@ documentation so it still matches the surface we ship.
9898
- Skip it (internal-only or already covered by an existing API; record
9999
the decision in the "Evaluated and not requiring exposure" sections of
100100
the skill so future runs don't re-flag it).
101-
3. Cross-reference the user-facing skill at
101+
3. Run the `audit-skill-md` skill (`.ai/skills/audit-skill-md/SKILL.md`) to
102+
cross-reference the user-facing skill at
102103
[`skills/datafusion_python/SKILL.md`](../../skills/datafusion_python/SKILL.md)
103-
against the current public API. Look for stale function names, missing
104-
newly exposed APIs, and examples that drifted from current behavior.
105-
Update `SKILL.md` and the relevant RST pages under
106-
`docs/source/user-guide/common-operations/` accordingly. (An
107-
`audit-skill-md` skill is planned to automate this step — once it lands
108-
under `.ai/skills/audit-skill-md/`, invoke it via `/audit-skill-md`.)
104+
against the current public API. The skill flags stale function names,
105+
missing newly exposed APIs, examples that drifted from idiomatic style,
106+
and missing version notes. Invoke it from the assistant with
107+
`/audit-skill-md` (optionally scoped, e.g. `/audit-skill-md dataframe`).
108+
Apply the resulting edits to `SKILL.md` and to the relevant RST pages
109+
under `docs/source/user-guide/common-operations/`.
109110
4. If new aggregate or window functions were exposed in step 2, also update:
110111
- `docs/source/user-guide/common-operations/aggregations.rst`
111112
- `docs/source/user-guide/common-operations/windows.rst`

0 commit comments

Comments
 (0)