Skip to content

fix: prevent nil pointer crash in OpenAI SSE tool call accumulation#182

Closed
duhd-vnpay wants to merge 92 commits intonextlevelbuilder:mainfrom
duhd-vnpay:main
Closed

fix: prevent nil pointer crash in OpenAI SSE tool call accumulation#182
duhd-vnpay wants to merge 92 commits intonextlevelbuilder:mainfrom
duhd-vnpay:main

Conversation

@duhd-vnpay
Copy link
Copy Markdown

Summary

  • Root cause fix: openai.goaccumulators map (keyed by SSE tool_call index) was iterated with for i := 0; i < len(map); i++, assuming contiguous keys 0,1,2... SSE indices from extended thinking models can be non-contiguous (e.g. {0, 2}), causing accumulators[i] → nil → panic on acc.rawArgs
  • Defense: lanes.go — add defer recover() in scheduler lane goroutine so panics log error + return semaphore token instead of crashing the process
  • Cleanup: tracing.go + gateway.go — add SweepOrphanTraces() to mark stuck running traces as error on startup (traces older than 1h with status running = orphan from crash)

Changed files

File Change
internal/providers/openai.go Iterate map by sorted keys instead of sequential 0..len-1
internal/scheduler/lanes.go Add recover() in goroutine defer
internal/store/tracing_store.go Add SweepOrphanTraces to interface
internal/store/pg/tracing.go Implement SweepOrphanTraces (UPDATE running→error)
cmd/gateway.go Call sweep on startup in background goroutine

Test results (18 tests, all PASS)

providers (9/9 PASS):

  • ContiguousToolCalls — indices 0,1 (happy path)
  • NonContiguousToolCallIndices — indices 0,2 skip 1 (the crash scenario)
  • LargeGapToolCallIndices — indices 0,5,10
  • SingleToolCallAtHighIndex — single call at index 3
  • ToolCallWithThoughtSignature — metadata preserved
  • TextOnly / EmptyToolCalls — no tool calls
  • HTTPError — 429 → proper HTTPError
  • CancelledContext — context cancel → error

scheduler (5/5 PASS):

  • NoPanic — normal operation
  • PanicRecovery — panic → lane recovers, accepts new work
  • MultiplePanics — 3 panics → all semaphore tokens returned
  • CancelledContext — cancelled ctx with full lane → error
  • StatsAfterPanic — active counter decremented after panic

store/pg (4/4 PASS, integration with PostgreSQL):

  • SweepsOldRunning — running trace 2h old → swept
  • IgnoresRecentRunning — running trace 5min old → NOT swept
  • IgnoresCompleted — completed trace → NOT swept
  • NoOrphans — no running traces → swept=0

🤖 Generated with Claude Code

duhd-vnpay and others added 30 commits March 10, 2026 15:44
…ing, photo handling

- Fix zaloBotInfo to use account_name/display_name (not name)
- Add Label() method for bot display name resolution
- Handle 3 response formats in getUpdates: array, single object, wrapped
- Add photo_url field to zaloMessage for Zalo CDN image URLs
- Add display_name/is_bot to zaloFrom, chat_type to zaloChat
- Use PhotoURL with fallback to Photo in handleImageMessage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ssing

Zalo CDN URLs are auth-restricted and expire quickly, causing read_image
tool failures. Now downloads photos to temp files (like Telegram channel)
so the agent pipeline can base64-encode and process them normally.

Falls back to passing the URL directly if download fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… instance updates

When encryption key is empty, credentials stayed as map[string]any from
JSON unmarshal, causing pgx driver to fail encoding into bytea. Now
credentials are always marshaled to []byte regardless of encryption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Party Mode to GoClaw: structured multi-persona AI discussions with
Standard (single LLM call), Deep (parallel thinking + cross-talk), and
Token-Ring (sequential turns) modes.

Backend: PartyStore + PG implementation, party engine with parallel
goroutines, 7 RPC methods (party.start/round/question/add_context/
summary/exit/list), 10 WebSocket events, migration 000014.

Frontend: React dashboard page with session list, chat view, persona
sidebar, mode controls, start dialog with 6 team presets, i18n (en/vi/zh).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, sanitizeHistory() only cleaned the in-memory copy for each
LLM request but never persisted the fix — causing the same "dropping
orphaned tool message" WARN to repeat on every single request forever.

Changes:
- sanitizeHistory() now returns drop count alongside cleaned messages
- When orphans are detected, cleaned history is persisted back to the
  session store via new SetHistory() method, then saved to DB
- Per-message WARN logs downgraded to DEBUG (cleanup is logged once
  at INFO level with total count)
- Added SetHistory() to SessionStore interface + both implementations

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a delegated agent (e.g. ui-ux-design-agent) spawns subagents, the
announce session key uses the format delegate:{uuid8}:{agentKey}:{id}.
The scheduler's RunFunc only handled agent:{agentId}:{rest} format,
falling back to the hardcoded "default" agent — which doesn't exist in
managed-mode deployments where the default agent has a custom key.

Add delegate: prefix parsing to extract the target agent key from
position 2 of the session key parts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add sender_id and channel to team task metadata for audit trail
- Remove assistant prefill in team task reminder (thinking models reject it)
- Add unit tests for team access control and sender_id tracking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, mobile UX, budget, traces)

Resolved 13 file conflicts:
- zalo.go: kept local struct tags + upstream's n==0 safety check
- channel_instances.go: used upstream's credential merging (supersedes local bytea fix)
- factory.go, stores.go: merged both Party + Contacts/Activity/Snapshots stores
- loop_history.go: kept upstream skill inlining constants + local session persistence
- session_store.go, sessions/manager.go, sessions_ops.go: deduplicated SetHistory()
- sidebar.tsx, routes.tsx: added Party to upstream's restructured sidebar/routes
- protocol.ts, i18n/index.ts, sidebar.json (×3): merged Party + upstream events/translations
- version.go: bumped to schema version 18

Migration collision fix: renamed 000014_party_sessions → 000018_party_sessions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add nullish coalescing for PERSONA_COLORS array index
- Remove unused destructured `round` variable from PartyPage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…otocol

Backend:
- gateway_providers: read default_model from provider settings JSONB
- party.go: getEngine() prefers providers with DefaultModel, alphabetical
  fallback for determinism (fixes random Go map iteration)
- party.go: add slog.Error for round failures (was silently swallowed)

Frontend:
- use-party.ts: align all RPC params/events with snake_case wire format,
  add transformSession/mapStatus/selectSession helpers
- party-start-dialog.tsx: use actual DB persona keys (morpheus-persona etc.)
- party-page.tsx: use selectSession() for proper state hydration
- connection-status.tsx: fix status text alignment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
transformSession() was dropping history/summary fields from backend
response, and selectSession() reset messages to empty array. Old
sessions appeared blank when clicked in sidebar.

- transformSession: preserve _history and _summary from backend
- hydrateMessages: new helper to convert RoundResult[] + SummaryResult
  into PartyMessage[] (round headers, persona messages, summary)
- selectSession: call hydrateMessages() instead of setMessages([])

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: `accumulators` map iterated with sequential `for i := 0; i < len(map)`
but SSE tool_call indices can be non-contiguous (e.g. {0, 2}), causing nil dereference
on `accumulators[i]` when key `i` doesn't exist.

Fixes:
1. openai.go: iterate map by sorted keys instead of sequential 0..len-1
2. lanes.go: add defer recover() in scheduler goroutine to prevent panics from
   crashing the entire process — logs error and returns semaphore token
3. tracing: add SweepOrphanTraces() to mark stuck running traces as error on
   gateway startup (running > 1h = orphan from previous crash)

Test results (18 tests):
- providers: 9/9 PASS (contiguous, non-contiguous, large gap, high index,
  thought_signature, text-only, empty, HTTP error, cancelled context)
- scheduler: 5/5 PASS (no panic, panic recovery, multiple panics, cancelled
  context, stats after panic)
- store/pg: 4/4 PASS (sweeps old running, ignores recent, ignores completed,
  no orphans)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… loops

When LLM hits max_tokens, tool call arguments may be incomplete/malformed JSON.
- Add truncation guard: skip tool execution when finish_reason=length, ask LLM to retry smaller
- Wire per-agent max_tokens from other_config JSONB (default 8192)
- Log warning on JSON parse failure for non-empty tool call arguments (truncation indicator)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts:
#	internal/tools/team_tasks_tool.go
Previously, when an SSE stream was prematurely closed (e.g. proxy timeout,
network interruption), both Anthropic and OpenAI providers would return a
response with finish_reason="stop" and missing usage data — silently
delivering truncated content. The agent loop only checked for
finish_reason="length", so these cases went undetected.

Changes:
- Anthropic provider: track message_stop event; if absent, set
  finish_reason="interrupted" and add scanner.Err() check
- OpenAI provider: track [DONE] marker; if absent, set
  finish_reason="interrupted"
- Agent loop: on "interrupted" finish_reason, discard partial response
  and retry once (with retry event emitted to channels)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In Docker deployments, ~/.goclaw resolves to /app/.goclaw/ which is empty.
The actual workspace files live at GOCLAW_WORKSPACE (/app/workspace/).
This caused the Storage UI to show no files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Download media URLs, verify local files, and detect MIME types for
WhatsApp incoming messages. Replaces raw string paths with resolved
media info including proper logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Migration file was renumbered to avoid conflict with upstream.
RequiredSchemaVersion bumped from 18 to 19.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create project entity for per-group MCP env overrides with auto-updating
timestamps, FK to agent_teams, and unique constraints on slug and
(channel_type, chat_id). Bump RequiredSchemaVersion to 20.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…erride types

Define store.ProjectStore interface (CRUD + MCP override methods),
Project struct embedding BaseModel, and ProjectMCPOverride struct.
Wire Projects field into Stores container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PostgreSQL-backed ProjectStore: full CRUD, chat-ID lookup, MCP override
upsert/delete/list with secret-pattern rejection. Register projects and
project_mcp_overrides in tablesWithUpdatedAt, wire into NewPGStores factory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ProjectID and ProjectOverrides fields to RunRequest and
DelegateRunRequest for per-project MCP env overrides. Introduce
composite poolKey (name:projectID) in Pool.Acquire/Release so
different projects get separate MCP stdio processes with merged
environment variables. Update LoadForAgent, connectViaPool, and
filterTools to thread project context through the entire chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… project)

Extend ResolverFunc signature to accept ResolveOpts (ProjectID, ProjectOverrides).
Add Router.GetForProject() which uses composite cache key "agentID:projectID"
to prevent cross-project MCP contamination in shared agents like sdlc-assistant.
The resolver closure now passes opts.ProjectID and opts.ProjectOverrides through
to mcpMgr.LoadForAgent(), connecting the per-project MCP scoping pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n and announce

- Create resolveProjectOverrides() helper in cmd/gateway_consumer_project.go
- Add ProjectID/ProjectOverrides tool context keys for delegation chain propagation
- Set project context in agent loop from RunRequest fields
- Add ProjectID/ProjectOverrides to DelegationTask struct
- Wire project context through delegate_prep.go buildRunRequest
- Use GetForProject() in consumer main lane, process lane, and runAgentFn
- Propagate project resolution through all 4 announce lanes
- Pass pgStores.Projects to consumeInboundMessages
Implements Google Chat as a new channel type (google_chat) in GoClaw:
- Service Account JWT auth with auto-refresh (no Google client libs)
- Pub/Sub REST pull loop for inbound messages with dedup cache
- Markdown → Google Chat text format conversion (bold, italic, links)
- Cards V2 for tables/structured content
- Byte-aware chunking for Vietnamese/CJK (3900-byte safety limit)
- Drive upload for long-form responses with retention cleanup
- Thread routing for group conversations
- Placeholder edit pattern (Thinking... → final response)
- Exponential backoff retry on 429/5xx
- 28 unit tests, all passing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
duhd-vnpay and others added 20 commits March 19, 2026 14:24
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… path matching

Code review findings: matchesAnyPrefix logic was overly complex with
TrimSuffix/HasSuffix dance. Simplified to direct comparison. Added
case-insensitive matching on Windows (paths are case-insensitive there).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…urable

- Add allowed_routes allowlist (default: GET /v1/projects/by-chat,
  POST /v1/projects, PUT /v1/projects/) — blocks all other endpoints
- Allowlist loaded from builtin_tools.settings at runtime via settingsGetter;
  falls back to defaultAllowedRoutes when store is unavailable
- SetSettingsGetter wired in gateway.go after pgStores init
- Seed internal_api into builtin_tools with default settings + metadata hint
- 7 tests: allowlist enforcement, settings getter override, no-token, error handling
Lead agents in a team have their ToolWorkspace overridden to the team
workspace. This commit:
- Syncs promptWorkspace with ToolWorkspaceFromCtx so the system prompt
  matches actual tool behaviour (prevents stale personal-workspace paths)
- Adds isLeadWorkspace flag to buildTeamWorkspaceSection so lead agents
  get guidance about relative paths resolving to team workspace

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…le tools, sandbox fixes

# Conflicts:
#	internal/agent/loop_history.go
…ity fixes, GPT-5.4

Upstream commits: 38dfcf8..32ced98 (31 commits)
Key changes:
- fix(codex,memory): sanitize tool call IDs (replaces local fix)
- feat(ui): kanban, media gallery, team leader indicator, toast notifications
- fix(security): defusedxml, extractor chain retry/timeout
- feat(tools): Defuddle extractor chain for web_fetch
- fix(telegram): bounded handlers, duplicate suppression, stall-aware downloads
- feat(teams): attachments refactor, semantic search
- feat(discord): writer management commands

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…igrations to 26-27

- uniquifyToolCallIDs: strip non-alphanumeric chars from runID before
  appending to tool_use.id (Anthropic rejects colons in cron:/heartbeat: RunIDs)
- Upstream c6bab9a only fixed Codex provider; this fixes the Anthropic path
- Renumber local migrations 22-23 → 26-27 to avoid collision with upstream

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stream conflicts

Move local-only builtin tools (team_message, internal_api) from
gateway_builtin_tools.go to gateway_builtin_tools_local.go. This:

- Restores gateway_builtin_tools.go to exact upstream state (zero diff)
- Restores policy.go tool group lists to exact upstream state
- Adds AppendToToolGroup() helper to policy.go for dynamic group extension
- Local tools seeded via seedLocalBuiltinTools() after upstream seed
- Local policy groups registered via registerLocalToolGroups()

Conflict reduction: gateway_builtin_tools.go was the #2 merge conflict
hotspot. Now upstream can freely add/modify tools without any conflict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…lision

Upstream added 000024_team_attachments_refactor in the latest merge.
Local migrations renumbered:
  000024_cron_unique_name → 000025
  000025_file_access_log → 000026
  000026_team_tasks_workspace_followup → 000027
  000027_projects → 000028

RequiredSchemaVersion bumped to 28.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… provider loss

When config:changed is broadcast by subsystems that haven't called
ApplyDBSecrets (e.g. registerProvidersFromDB), the updatedCfg payload
has empty TTS API keys. setupTTS() then registers fewer providers,
causing the TTS manager to be replaced with one missing MiniMax.

Fix: explicitly re-apply DB secrets to updatedCfg before calling
setupTTS() in the tts-config-reload subscriber.

Fixes: tts provider not found: minimax after startup
…pstream tools

seedLocalBuiltinTools() was calling Seed() with only 2 local tools.
Since Seed() does a reconcile DELETE of all rows not in the provided
list, this wiped the 28 upstream tools just seeded by seedBuiltinTools().

Fix (Option B): add Upsert() to BuiltinToolStore interface and
PGBuiltinToolStore that performs the same INSERT ON CONFLICT ... DO UPDATE
but WITHOUT the reconcile DELETE step. seedLocalBuiltinTools() now
uses Upsert() keeping the upstream/VNPAY fork separation clean.

Result: DB now has 33 builtin tools seeded correctly (31 upstream, 2 local).
All tools visible in Admin UI.
…able

Root cause: setupToolRegistry() calls setupTTS(cfg) BEFORE pgStores exists.
At that point tts.minimax.api_key is empty (stored in config_secrets DB),
so TTS manager has primary=minimax but providers={edge only}.

The tts-config-reload subscriber was a safety net but registerProvidersFromDB
broadcasts config:changed BEFORE subscriber is registered at startup,
so the event is missed entirely.

Fix: after pgStores is set up, apply DB secrets to cfg and rebuild the TTS
manager immediately, ensuring minimax is registered from the first tool call.
Covers:
- TestGetProvider_ProviderNotRegistered: reproduces startup bug (primary=minimax
  but minimax not in providers map when API key is absent)
- TestGetProvider_AfterRebuildWithSecrets: validates fix (rebuild manager after
  DB secrets applied → GetProvider returns correct provider)
- TestSynthesizeWithFallback_PrimaryNotInProviders: confirms fallback to edge when
  primary is missing from providers (old container behaviour)
- TestUpdateManager_SwapsProvider: validates UpdateManager hot-swap
- Additional: AllFail, HasProviders, PrimaryProvider_FirstRegistered
C6: LoadPath*, LoadPathScoped, LoadPathAny now resolve symlinks via
EvalSymlinks() and validate resolved path stays within baseDir.
sanitizePath() strips control chars before logging (prevents log injection).

C4: Per-IP rate limiting middleware (100 req/min) on /v1/media/{id}
endpoint. In-memory sliding window counter with periodic cleanup.

C3: Added security audit logging for memory_search tool invocations
(agent_id, query_len, results_count) for observability.

Addresses HIGH-02, HIGH-03, MED-03 from Security Assessment T-202.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestSanitizePath: control char stripping
- TestResolveAndValidate: baseDir boundary check, symlink traversal block
- TestLoadPathAny: admin context enforcement
- TestRateLimiter: allow/block/window reset/independent IPs
- TestRateLimitMiddleware: 429 response + Retry-After header
- TestExtractClientIP: XFF, X-Real-IP, RemoteAddr parsing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Azure OpenAI

Upstream changes (v1.68.1 → v1.74.1, 17 commits):
- feat: pgvector semantic search for KG entities + team KG sharing
- feat: extractive memory fallback when LLM flush fails
- feat: Azure OpenAI + Foundry header support (nextlevelbuilder#319)
- fix: YAML multiline skill frontmatter parsing (nextlevelbuilder#325)
- fix: skills DB file_path directory resolution (nextlevelbuilder#322)
- fix: MCP OpenAI compatibility (nextlevelbuilder#321)
- fix: Windows build + filesystem security (nextlevelbuilder#318)
- feat: MediaMaxBytes enforcement for Telegram/Discord (nextlevelbuilder#317)
- fix: memory embedding cache reliability
- fix: UI usage tab crash, image download overlay
- fix: teams post-turn processor + PendingTeamDispatch

Conflicts resolved:
- internal/agent/loop.go: take upstream (no local changes)
- internal/tools/filesystem.go: take upstream (our comment moved)
- internal/upgrade/version.go: keep ours (28 > upstream 25)
- Removed old filesystem_{unix,windows}.go (replaced by
  filesystem_security_{unix,windows}.go upstream)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MiltonSilvaJr pushed a commit to vellus-ai/argoclaw that referenced this pull request Mar 23, 2026
…E fix + Zalo QR restart + TDD/PBT

PR nextlevelbuilder#182 (cherry-pick core fixes, without Party Mode):
- Sort non-contiguous SSE tool_call indices (prevents nil pointer panic)
- Log truncated tool call arguments instead of silently discarding
- extractDefaultModel from provider settings JSONB

PR nextlevelbuilder#346:
- Zalo QR session restart: cancel previous session instead of blocking

Tests (TDD + PBT):
- Non-contiguous indices: 1000+ PBT random inputs
- Truncated JSON arguments: 6 edge cases
- All tests PASS (verified on VM goclaw-pilot)

PR nextlevelbuilder#350: SKIPPED — core fix (generateId) already in PR nextlevelbuilder#352.
Provider listing UX improvements deferred.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MiltonSilvaJr added a commit to vellus-ai/argoclaw that referenced this pull request Mar 23, 2026
…E fix + Zalo QR restart + TDD/PBT (#11)

PR nextlevelbuilder#182 (cherry-pick core fixes, without Party Mode):
- Sort non-contiguous SSE tool_call indices (prevents nil pointer panic)
- Log truncated tool call arguments instead of silently discarding
- extractDefaultModel from provider settings JSONB

PR nextlevelbuilder#346:
- Zalo QR session restart: cancel previous session instead of blocking

Tests (TDD + PBT):
- Non-contiguous indices: 1000+ PBT random inputs
- Truncated JSON arguments: 6 edge cases
- All tests PASS (verified on VM goclaw-pilot)

PR nextlevelbuilder#350: SKIPPED — core fix (generateId) already in PR nextlevelbuilder#352.
Provider listing UX improvements deferred.

Co-authored-by: Milton Silva <milton@vellus.tech>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
duhd-vnpay and others added 3 commits March 23, 2026 16:51
When re-enabling a cron job, read its schedule fields and compute
next_run_at inline (matching AddJob/recomputeStaleJobs logic) instead
of leaving it NULL. Also clear next_run_at on disable so GetDueJobs
skips it immediately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fixes

Upstream changes (v1.75.0 → v2.4.3):
- Multi-tenant isolation: API key binding, session ownership, cron filtering
- Teams: force v2, task hints, blocker escalation, audit events
- Channels: extract media from replied-to messages
- Security: shell deny gaps (halt/su/doas), WebSocket panic fix
- UI: theme toggle, version display, cron page redesign
- Fix: Telegram poll race, Zalo QR restart, summoning display names

Local merge notes:
- Renumber local migrations 025-028 → 029-032 to avoid upstream collision
- Bump RequiredSchemaVersion to 32
- Keep local: file ACL, internal_api, MCP env expansion, Zalo fixes
- Add build tag linux to cmd/pkg-helper (syscall.Umask)
- Fix MakeHistory 3rd arg (tenantID), providerReg ctx params

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…urrogates

1. sessions_list.go — Save() uses INSERT ... ON CONFLICT DO UPDATE
   instead of UPDATE-only. First-run cron sessions now persist correctly
   (fixes silent 0-row UPDATE).

2. gateway_cron.go — Reset session cache before each cron handler call
   so ExecuteWithRetry attempts start fresh, preventing stale messages
   (empty user + "..." fallback) from polluting retries.

3. sanitize_json.go — Strip lone surrogate escapes (\uD800–\uDFFF) from
   marshaled JSON before sending to LLM APIs. Applied to Anthropic,
   OpenAI, and Codex providers. Prevents "invalid high surrogate" API
   rejections from corrupted json.RawMessage content.

Fixes: nextlevelbuilder#379

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mrgoonie
Copy link
Copy Markdown
Contributor

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented Mar 24, 2026

Claude finished @mrgoonie's task in 3m 21s —— View job


PR Review

  • Gather context and understand the request
  • Read changed files (core bug fix: openai.go, lanes.go, tracing.go)
  • Read changed files (new party system, stores, migrations)
  • Read changed files (UI / other)
  • Analyze and write review feedback

Summary

This PR has a misleading title. The stated fix (nil pointer crash in OpenAI SSE accumulation) is real and correct, but the PR actually contains a large new feature — a "party" (multi-agent session orchestration) system — accounting for ~3000 of the 3246 added lines across 48 files. These should be separate PRs for reviewability.


🔴 Critical Issues

1. Migration number conflict — will break migrations

The PR adds migrations/000018_party_sessions.up.sql but migration #18 already exists:

migrations/000018_team_tasks_workspace_followup.up.sql  ← already exists

Current RequiredSchemaVersion in internal/upgrade/version.go is 28. Per CLAUDE.md, the new migration must be numbered to match the next version. The new file must be renamed:

migrations/000029_party_sessions.up.sql

And version.go should be updated to 29. As-is, this will either silently skip the new migration or fail depending on how golang-migrate handles collisions. This is a data loss / broken schema risk.

Fix this →


🟡 High Priority Issues

2. Core bug fix is correct but not yet applied to this branch

The current internal/providers/openai.go:229 still contains the bug:

// BUG: assumes contiguous keys 0, 1, 2... — panics on {0, 2}
for i := 0; i < len(accumulators); i++ {
    acc := accumulators[i]  // nil when key i doesn't exist
    ...acc.rawArgs...        // nil pointer dereference → panic
}

The PR description says the fix iterates with sorted keys. This is the correct approach. The fix + new tests are good additions.

3. recover() in lanes swallows panics silently

The PR adds a defer recover() in lanes.go goroutine. While this prevents process-level crashes, it means:

  • The user's request silently fails — no error is propagated to the caller
  • fn() may be mid-execution when the panic fires, leaving state partially applied (e.g., a half-completed agent run that updated some DB rows but not others)
  • Panics in Go typically indicate programming errors — swallowing them makes debugging harder

The recover is only acceptable as a last resort, not as a substitute for fixing underlying bugs. Ensure the log message includes the full stack trace (debug.Stack()), not just the panic value, so crashes can be diagnosed from logs.

4. PR scope mismatch — too large to safely review in one pass

The PR title describes a 3-file bug fix. It actually contains:

Area New files LoC added
internal/party/ 4 new ~545
internal/gateway/methods/party.go 1 new ~564
internal/store/party_store.go + pg/party.go 2 new ~196
pkg/protocol/party.go 1 new ~26
ui/web/src/pages/party/ 6 new ~1000
Migration 2 new ~24

This pattern ("sneak feature in with a bug fix") makes it very difficult to review the feature code rigorously. The feature should be submitted as a separate PR.


🟠 Medium Priority Issues

5. SweepOrphanTraces only fixes traces, not spans

SweepOrphanTraces marks stuck running traces as error. However, spans within those traces will still show running status in the DB, causing inconsistency if the UI renders per-span status. Consider also sweeping orphaned spans:

UPDATE spans SET status = 'error', error = 'orphaned on restart'
WHERE status = 'running' AND created_at < NOW() - INTERVAL '1 hour'

6. Startup sweep lacks a timeout context

The sweep is called on startup in a background goroutine. If cmd/gateway.go passes a long-lived context, a slow/unresponsive DB could block the sweep goroutine indefinitely. Use a bounded context:

go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    ...sweep(ctx)...
}()

7. sessions_ops.go change needs explanation

internal/store/pg/sessions_ops.go has +8/-8 changes. Without seeing the diff (the PR branch isn't accessible for direct diff), it's unclear what changed. Verify there are no unintended regressions to session management.


🟢 Positive Observations

  • Root cause correctly identified: Non-contiguous SSE tool call indices from extended-thinking models is a real edge case that would only surface in production.
  • Test coverage: 18 new tests across 3 packages, including the exact crash scenario and edge cases (large gaps, high indices). This is well done.
  • Defensive coding: The panic recovery and orphan sweep are good operational hygiene additions.
  • i18n coverage: New UI strings added to all 3 locale files (en/vi/zh). Follows project conventions.

Action Required

Before merging:

  1. Rename migration file from 000018_party_sessions000029_party_sessions and update version.go to 29
  2. Verify the recover() in lanes.go logs the full stack trace
  3. Consider splitting the party feature into a separate PR for proper review
  4. Explain the sessions_ops.go changes in the PR description

duhd-vnpay and others added 2 commits March 24, 2026 23:38
…n reliability, storage

Merged 42 upstream commits from nextlevelbuilder/goclaw origin/main.

Key upstream changes included:
- feat(config): system_configs DB table with per-tenant isolation + System Settings modal
- fix: cron reliability — clear session in DB on cold-cache restart (nextlevelbuilder#294, nextlevelbuilder#424)
- refactor(security): remove cross-tenant bypass, enforce strict tenant isolation
- fix(agent): set TenantID on streaming events in runLoop emitRun
- fix: add panic recovery to tool, cron, and summarization goroutines
- fix: check errors in cron store Scan/Unmarshal to prevent data corruption
- feat(teams): smart post-turn task decision + stale detection
- fix(skills): unblock agent read_file on skill paths
- feat(storage): bump upload limit to 50MB, add file upload to Team Workspace
- feat(gateway): version update checker
- fix(channels): propagate tenant_id to pending history

Conflict resolutions:
- gateway_cron.go: adopt upstream sessionMgr.Reset()+Save() pattern; revert Save() to UPDATE-only
- sessions_list.go: revert UPSERT → UPDATE-only (upstream fix covers cold-cache via explicit Save after Reset)
- upgrade/version.go: bump RequiredSchemaVersion 32→33 (adds upstream migration 33_system_configs)
- migrations: renumber upstream 000029_system_configs → 000033_system_configs (avoids conflict with local 000029_kg_entity_embeddings)
- gateway.go: keep local projects wiring + File ACL MediaServeHandler; drop WithCrossTenant on LoadAll per security refactor
- gateway_http_handlers.go: keep local wireHTTP signature with token + projectsH return
- media_serve.go: keep local File ACL implementation (scoped token, rate limiting)
- i18n/index.ts: union both namespace sets; add upstream "system-settings"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rProject signatures

Upstream added context.Context to ResolverFunc and GetForProject after the
multi-tenant refactor. Test file still used old signatures — updated all
call sites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@viettranx viettranx closed this Apr 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants