feat(tools): add Defuddle extractor chain for web_fetch#296

Closed
mrgoonie wants to merge 3 commits into main from feat/fetch-tool-defuddle-markdown

mrgoonie commented Mar 20, 2026

Summary

  • Add Extractor Chain pattern for web_fetch markdown extraction with waterfall fallback
  • Primary: Defuddle CF Worker at fetch.goclaw.sh extracts clean markdown (10s timeout)
  • Fallback: In-process htmlToMarkdown() converter (existing, battle-tested)
  • Config toggle: defuddle_enabled in config.json5 with runtime pub/sub reload
  • Web UI: Switch toggle in Tools > Web Fetch settings (en/vi/zh i18n)

GoClaw Fetch Solution (Defuddle wrapper)

🤌 https://fetch.goclaw.sh/

Architecture

web_fetch Execute()
  ├─ validate URL, SSRF, domain policy, cache (unchanged)
  ├─ markdown mode → ExtractorChain
  │   [1] DefuddleExtractor → GET fetch.goclaw.sh/<domain>/<path>
  │   [2] InProcessExtractor → HTTP GET + htmlToMarkdown()
  │   Quality gate: min 100 chars, 10 words
  └─ text mode → existing doDirectFetch() (unchanged)
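
The waterfall above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `ContentExtractor` and `ExtractorChain` names and the thresholds (100 chars, 10 words) come from the description, while the method set, `passesQualityGate`, `stubExtractor`, and `runDemo` are hypothetical names introduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// ContentExtractor is named in the PR; this method set is an assumption.
type ContentExtractor interface {
	Name() string
	Extract(ctx context.Context, rawURL string) (string, error)
}

// Quality gate thresholds as stated in the PR description.
const (
	minChars = 100
	minWords = 10
)

// passesQualityGate (hypothetical helper) rejects output that is too short
// to be a plausible article body.
func passesQualityGate(md string) bool {
	md = strings.TrimSpace(md)
	return utf8.RuneCountInString(md) >= minChars && len(strings.Fields(md)) >= minWords
}

// ExtractorChain tries each extractor in order and returns the first
// result that passes the quality gate.
type ExtractorChain struct {
	extractors []ContentExtractor
}

// Extract returns the markdown and the name of the extractor that produced it.
func (c *ExtractorChain) Extract(ctx context.Context, rawURL string) (string, string, error) {
	var errs []error
	for _, ex := range c.extractors {
		md, err := ex.Extract(ctx, rawURL)
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", ex.Name(), err))
			continue
		}
		if !passesQualityGate(md) {
			errs = append(errs, fmt.Errorf("%s: output failed quality gate", ex.Name()))
			continue
		}
		return md, ex.Name(), nil
	}
	if len(errs) == 0 {
		return "", "", errors.New("no extractors configured")
	}
	return "", "", errors.Join(errs...)
}

// stubExtractor is a hypothetical stand-in for the real extractors.
type stubExtractor struct {
	name string
	out  string
	err  error
}

func (s stubExtractor) Name() string { return s.name }

func (s stubExtractor) Extract(context.Context, string) (string, error) {
	return s.out, s.err
}

// runDemo simulates a failing primary extractor and a working fallback.
func runDemo() (string, error) {
	long := strings.Repeat("clean markdown paragraph text ", 10)
	chain := &ExtractorChain{extractors: []ContentExtractor{
		stubExtractor{name: "defuddle", err: errors.New("timeout")},
		stubExtractor{name: "html-to-markdown", out: long},
	}}
	_, used, err := chain.Extract(context.Background(), "https://example.com/post")
	return used, err
}

func main() {
	used, err := runDemo()
	fmt.Println("extractor used:", used, "error:", err)
}
```

In this sketch a quality-gate failure is treated like an error, so a thin or empty response from the primary extractor also falls through to the next one.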

Changes

New files (4):

  • internal/tools/web_fetch_extractor.go — ContentExtractor interface, ExtractorChain, quality gate
  • internal/tools/web_fetch_extractor_defuddle.go — DefuddleExtractor (CF Worker client)
  • internal/tools/web_fetch_extractor_inprocess.go — InProcessExtractor (wraps existing converter)
  • internal/tools/web_fetch_extractor_test.go — 23 tests with race detector coverage

Modified files (8):

  • internal/tools/web_fetch.go — chain integration, doFetch refactor
  • internal/config/config_channels.go — DefuddleEnabled *bool field
  • cmd/gateway_setup.go — wire config to WebFetchTool
  • cmd/gateway.go — pub/sub handler for runtime toggle
  • ui/web/src/pages/config/sections/tools-web-section.tsx — Switch toggle
  • ui/web/src/i18n/locales/{en,vi,zh}/config.json — i18n keys
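
One plausible reason for making `DefuddleEnabled` a `*bool` is to tell "not set" apart from an explicit `false`. The sketch below illustrates that pattern only. The `WebFetchConfig` struct, the `defuddleOn` and `parse` helpers, and the default of `true` are assumptions, and it parses plain JSON with `encoding/json` rather than JSON5.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WebFetchConfig is a hypothetical container; only the DefuddleEnabled *bool
// field and the defuddle_enabled key are named in the PR.
type WebFetchConfig struct {
	DefuddleEnabled *bool `json:"defuddle_enabled,omitempty"`
}

// defuddleOn resolves the tri-state pointer. A nil pointer means the key was
// absent; this sketch assumes the feature defaults to on in that case.
func defuddleOn(c WebFetchConfig) bool {
	if c.DefuddleEnabled == nil {
		return true // assumed default, not confirmed by the PR
	}
	return *c.DefuddleEnabled
}

// parse decodes a plain-JSON config fragment.
func parse(raw string) (WebFetchConfig, error) {
	var c WebFetchConfig
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	inputs := []string{`{}`, `{"defuddle_enabled": false}`, `{"defuddle_enabled": true}`}
	for _, raw := range inputs {
		c, err := parse(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> enabled=%v\n", raw, defuddleOn(c))
	}
}
```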

Test plan

  • go build ./... — compile clean
  • go vet ./... — no issues
  • go test -race ./internal/tools/ — 119 tests pass (23 new + 96 existing)
  • Race detector clean
  • No regressions in existing web_fetch_convert_test.go
  • Manual: toggle defuddle on/off in Web UI, verify config.patch sends correctly
  • Manual: verify fetch.goclaw.sh returns clean markdown (requires CF Worker deployed)

…action

Integrate fetch.goclaw.sh Cloudflare Worker as primary content extractor
using Defuddle library. Implements ContentExtractor interface with
waterfall fallback: Defuddle CF Worker (10s timeout) → in-process
htmlToMarkdown() converter.

- Add ExtractorChain pattern with quality gate (min 100 chars, 10 words)
- Add DefuddleExtractor calling GET https://fetch.goclaw.sh/<domain>/<path>
- Add InProcessExtractor wrapping existing HTML→markdown converter
- Add defuddle_enabled config toggle with runtime pub/sub reload
- Add Switch toggle in Web UI settings (en/vi/zh i18n)
- 23 new tests with race detector coverage
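
A rough sketch of what a client for the `GET https://fetch.goclaw.sh/<domain>/<path>` endpoint might look like, with the 10s timeout from the commit message. The function names (`buildDefuddleURL`, `fetchMarkdown`, `runDemo`) are hypothetical, the 1 MiB response cap is an assumption, and query strings are ignored because the PR only shows the `<domain>/<path>` form. The demo runs against a local `httptest` server standing in for the worker.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"time"
)

// Timeout for the worker call, as stated in the commit message.
const defuddleTimeout = 10 * time.Second

// buildDefuddleURL maps a target URL onto <base>/<domain>/<path>.
func buildDefuddleURL(baseURL, target string) (string, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.Host == "" {
		return "", fmt.Errorf("target %q has no host", target)
	}
	return strings.TrimRight(baseURL, "/") + "/" + u.Host + u.EscapedPath(), nil
}

// fetchMarkdown requests the extracted markdown and gives up after the timeout.
func fetchMarkdown(ctx context.Context, baseURL, target string) (string, error) {
	endpoint, err := buildDefuddleURL(baseURL, target)
	if err != nil {
		return "", err
	}
	ctx, cancel := context.WithTimeout(ctx, defuddleTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("worker returned %s", resp.Status)
	}
	// Cap the body size (assumed limit) so a misbehaving worker cannot exhaust memory.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// runDemo exercises the client against a local stand-in for the worker.
func runDemo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "# fetched %s", r.URL.Path)
	}))
	defer srv.Close()
	return fetchMarkdown(context.Background(), srv.URL, "https://example.com/blog/post")
}

func main() {
	md, err := runDemo()
	fmt.Println(md, err)
}
```

Any error returned here (timeout, non-200 status, bad URL) would be the signal for the chain to fall back to the in-process converter.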

mrgoonie requested a review from viettranx March 20, 2026 05:20

…attern

Move extractor chain settings from config.json5 to builtin_tools DB table,
aligning with the media_provider_chain pattern. Key changes:

Backend:
- Extract fetchRawContent() from doDirectFetch() for code reuse
- Rewrite InProcessExtractor to delegate (93 → 28 lines), fixing missing
  domain policy checks on redirects (security fix)
- Add ResolveExtractorChain() parsing settings from builtin_tools context
- DefuddleExtractor now configurable via base_url + timeout from settings
- Seed default chain [defuddle, html-to-markdown] for new deployments
- Add backfillWebFetchSettings() for existing deployments
- Remove defuddle_enabled from config struct, pub/sub, gateway_setup

Frontend:
- Add DnD extractor chain form on builtin tools page
- Route web_fetch → dedicated form in settings dialog
- Remove defuddle toggle from config page
- Dedicated i18n keys (en/vi/zh), no cross-form coupling
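
To make the settings-driven chain concrete, here is a hedged sketch of how per-tool settings could be resolved into an ordered extractor list. The JSON shape, the `ExtractorSetting` type, and the `resolveChain` helper are hypothetical. Only the extractor names (`defuddle`, `html-to-markdown`), the `base_url` and `timeout` settings, and the default order come from the commit message. Treating `timeout` as seconds and falling back to the default chain when nothing usable is configured are also assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExtractorSetting is a hypothetical shape for one chain entry.
type ExtractorSetting struct {
	Name       string `json:"name"`
	Enabled    bool   `json:"enabled"`
	BaseURL    string `json:"base_url,omitempty"`
	TimeoutSec int    `json:"timeout,omitempty"` // unit assumed to be seconds
}

// chainSettings is the assumed top-level settings document.
type chainSettings struct {
	Extractors []ExtractorSetting `json:"extractors"`
}

// defaultChain mirrors the seeded default order [defuddle, html-to-markdown].
func defaultChain() []ExtractorSetting {
	return []ExtractorSetting{
		{Name: "defuddle", Enabled: true, BaseURL: "https://fetch.goclaw.sh", TimeoutSec: 10},
		{Name: "html-to-markdown", Enabled: true},
	}
}

// resolveChain keeps enabled, known extractors in their configured order and
// fills in Defuddle defaults. Missing, invalid, or empty settings fall back
// to the default chain.
func resolveChain(raw []byte) []ExtractorSetting {
	if len(raw) == 0 {
		return defaultChain()
	}
	var s chainSettings
	if err := json.Unmarshal(raw, &s); err != nil {
		return defaultChain()
	}
	var out []ExtractorSetting
	for _, e := range s.Extractors {
		if !e.Enabled {
			continue
		}
		switch e.Name {
		case "defuddle":
			if e.BaseURL == "" {
				e.BaseURL = "https://fetch.goclaw.sh"
			}
			if e.TimeoutSec <= 0 {
				e.TimeoutSec = 10
			}
		case "html-to-markdown":
			// No extra settings in this sketch.
		default:
			continue // skip unknown extractor names
		}
		out = append(out, e)
	}
	if len(out) == 0 {
		return defaultChain()
	}
	return out
}

func main() {
	raw := []byte(`{"extractors":[
		{"name":"html-to-markdown","enabled":true},
		{"name":"defuddle","enabled":true,"timeout":5}
	]}`)
	for i, e := range resolveChain(raw) {
		fmt.Printf("[%d] %s base_url=%q timeout=%ds\n", i+1, e.Name, e.BaseURL, e.TimeoutSec)
	}
}
```

Because the order of the array is preserved, a drag-and-drop reorder in the UI would translate directly into a different fallback order.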

viettranx added a commit that referenced this pull request Mar 20, 2026

Add Cloudflare Worker (fetch.goclaw.sh) as primary markdown extractor
with waterfall fallback to built-in HTML→Markdown converter.

Architecture:
- ExtractorChain pattern with quality gate (min 100 chars, 10 words)
- Settings stored in builtin_tools DB table (not config.json5)
- ResolveExtractorChain reads chain from context per-request
- InProcessExtractor delegates to fetchRawContent (full SSRF + domain
  policy checks on redirects)
- DefuddleExtractor with configurable base_url + timeout
- Seed default chain [defuddle, html-to-markdown] for new deployments
- Backfill migration for existing deployments

Web UI:
- Dedicated DnD extractor chain form on builtin tools page
- Drag-and-drop ordering, enable/disable per extractor
- Timeout + base URL config for Defuddle
- i18n support (en/vi/zh)
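
The note about domain policy checks on redirects can be illustrated with Go's standard `http.Client.CheckRedirect` hook. This is a sketch of the general technique, not the project's `fetchRawContent` implementation. `domainPolicy`, `newPolicyClient`, `errBlockedByPolicy`, the redirect limit of 5, and `runDemo` are hypothetical, and it covers only the hostname policy, not IP-level SSRF checks.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// errBlockedByPolicy marks a redirect refused by the domain policy.
var errBlockedByPolicy = errors.New("blocked by domain policy")

// domainPolicy is a hypothetical predicate over hostnames.
type domainPolicy func(host string) bool

// newPolicyClient returns a client that re-applies the domain policy to every
// redirect hop, so an allowed URL cannot bounce to a disallowed host.
func newPolicyClient(allowed domainPolicy) *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= 5 { // assumed redirect limit
				return errors.New("too many redirects")
			}
			if !allowed(req.URL.Hostname()) {
				return fmt.Errorf("%w: %s", errBlockedByPolicy, req.URL.Hostname())
			}
			return nil
		},
	}
}

// runDemo starts a local server that redirects to a disallowed host and
// reports whether the client refused to follow the redirect.
func runDemo() (bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "http://blocked.invalid/secret", http.StatusFound)
	}))
	defer srv.Close()

	srvURL, err := url.Parse(srv.URL)
	if err != nil {
		return false, err
	}
	// Only the test server's own host is allowed.
	client := newPolicyClient(func(host string) bool { return host == srvURL.Hostname() })

	resp, err := client.Get(srv.URL)
	if err == nil {
		resp.Body.Close()
		return false, nil
	}
	return errors.Is(err, errBlockedByPolicy), nil
}

func main() {
	blocked, err := runDemo()
	fmt.Println("redirect blocked:", blocked, "setup error:", err)
}
```

The check runs before the redirected request is sent, so the disallowed host is never contacted.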

viettranx closed this Mar 20, 2026
viettranx deleted the feat/fetch-tool-defuddle-markdown branch April 2, 2026 05:34