fix: prevent nil pointer crash in OpenAI SSE tool call accumulation by duhd-vnpay · Pull Request #182 · nextlevelbuilder/goclaw

duhd-vnpay · 2026-03-13T15:04:05Z

Summary

Root cause fix: openai.go — accumulators map (keyed by SSE tool_call index) was iterated with for i := 0; i < len(map); i++, assuming contiguous keys 0,1,2... SSE indices from extended thinking models can be non-contiguous (e.g. {0, 2}), causing accumulators[i] → nil → panic on acc.rawArgs
Defense: lanes.go — add defer recover() in scheduler lane goroutine so panics log error + return semaphore token instead of crashing the process
Cleanup: tracing.go + gateway.go — add SweepOrphanTraces() to mark stuck running traces as error on startup (traces older than 1h with status running = orphan from crash)

Changed files

File	Change
`internal/providers/openai.go`	Iterate map by sorted keys instead of sequential 0..len-1
`internal/scheduler/lanes.go`	Add `recover()` in goroutine defer
`internal/store/tracing_store.go`	Add `SweepOrphanTraces` to interface
`internal/store/pg/tracing.go`	Implement `SweepOrphanTraces` (UPDATE running→error)
`cmd/gateway.go`	Call sweep on startup in background goroutine

Test results (18 tests, all PASS)

providers (9/9 PASS):

ContiguousToolCalls — indices 0,1 (happy path)
NonContiguousToolCallIndices — indices 0,2 skip 1 (the crash scenario)
LargeGapToolCallIndices — indices 0,5,10
SingleToolCallAtHighIndex — single call at index 3
ToolCallWithThoughtSignature — metadata preserved
TextOnly / EmptyToolCalls — no tool calls
HTTPError — 429 → proper HTTPError
CancelledContext — context cancel → error

scheduler (5/5 PASS):

NoPanic — normal operation
PanicRecovery — panic → lane recovers, accepts new work
MultiplePanics — 3 panics → all semaphore tokens returned
CancelledContext — cancelled ctx with full lane → error
StatsAfterPanic — active counter decremented after panic

store/pg (4/4 PASS, integration with PostgreSQL):

SweepsOldRunning — running trace 2h old → swept
IgnoresRecentRunning — running trace 5min old → NOT swept
IgnoresCompleted — completed trace → NOT swept
NoOrphans — no running traces → swept=0

🤖 Generated with Claude Code

…ing, photo handling - Fix zaloBotInfo to use account_name/display_name (not name) - Add Label() method for bot display name resolution - Handle 3 response formats in getUpdates: array, single object, wrapped - Add photo_url field to zaloMessage for Zalo CDN image URLs - Add display_name/is_bot to zaloFrom, chat_type to zaloChat - Use PhotoURL with fallback to Photo in handleImageMessage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ssing Zalo CDN URLs are auth-restricted and expire quickly, causing read_image tool failures. Now downloads photos to temp files (like Telegram channel) so the agent pipeline can base64-encode and process them normally. Falls back to passing the URL directly if download fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… instance updates When encryption key is empty, credentials stayed as map[string]any from JSON unmarshal, causing pgx driver to fail encoding into bytea. Now credentials are always marshaled to []byte regardless of encryption. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add Party Mode to GoClaw: structured multi-persona AI discussions with Standard (single LLM call), Deep (parallel thinking + cross-talk), and Token-Ring (sequential turns) modes. Backend: PartyStore + PG implementation, party engine with parallel goroutines, 7 RPC methods (party.start/round/question/add_context/ summary/exit/list), 10 WebSocket events, migration 000014. Frontend: React dashboard page with session list, chat view, persona sidebar, mode controls, start dialog with 6 team presets, i18n (en/vi/zh). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previously, sanitizeHistory() only cleaned the in-memory copy for each LLM request but never persisted the fix — causing the same "dropping orphaned tool message" WARN to repeat on every single request forever. Changes: - sanitizeHistory() now returns drop count alongside cleaned messages - When orphans are detected, cleaned history is persisted back to the session store via new SetHistory() method, then saved to DB - Per-message WARN logs downgraded to DEBUG (cleanup is logged once at INFO level with total count) - Added SetHistory() to SessionStore interface + both implementations Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When a delegated agent (e.g. ui-ux-design-agent) spawns subagents, the announce session key uses the format delegate:{uuid8}:{agentKey}:{id}. The scheduler's RunFunc only handled agent:{agentId}:{rest} format, falling back to the hardcoded "default" agent — which doesn't exist in managed-mode deployments where the default agent has a custom key. Add delegate: prefix parsing to extract the target agent key from position 2 of the session key parts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add sender_id and channel to team task metadata for audit trail - Remove assistant prefill in team task reminder (thinking models reject it) - Add unit tests for team access control and sender_id tracking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…, mobile UX, budget, traces) Resolved 13 file conflicts: - zalo.go: kept local struct tags + upstream's n==0 safety check - channel_instances.go: used upstream's credential merging (supersedes local bytea fix) - factory.go, stores.go: merged both Party + Contacts/Activity/Snapshots stores - loop_history.go: kept upstream skill inlining constants + local session persistence - session_store.go, sessions/manager.go, sessions_ops.go: deduplicated SetHistory() - sidebar.tsx, routes.tsx: added Party to upstream's restructured sidebar/routes - protocol.ts, i18n/index.ts, sidebar.json (×3): merged Party + upstream events/translations - version.go: bumped to schema version 18 Migration collision fix: renamed 000014_party_sessions → 000018_party_sessions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add nullish coalescing for PERSONA_COLORS array index - Remove unused destructured `round` variable from PartyPage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…otocol Backend: - gateway_providers: read default_model from provider settings JSONB - party.go: getEngine() prefers providers with DefaultModel, alphabetical fallback for determinism (fixes random Go map iteration) - party.go: add slog.Error for round failures (was silently swallowed) Frontend: - use-party.ts: align all RPC params/events with snake_case wire format, add transformSession/mapStatus/selectSession helpers - party-start-dialog.tsx: use actual DB persona keys (morpheus-persona etc.) - party-page.tsx: use selectSession() for proper state hydration - connection-status.tsx: fix status text alignment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

transformSession() was dropping history/summary fields from backend response, and selectSession() reset messages to empty array. Old sessions appeared blank when clicked in sidebar. - transformSession: preserve _history and _summary from backend - hydrateMessages: new helper to convert RoundResult[] + SummaryResult into PartyMessage[] (round headers, persona messages, summary) - selectSession: call hydrateMessages() instead of setMessages([]) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Root cause: `accumulators` map iterated with sequential `for i := 0; i < len(map)` but SSE tool_call indices can be non-contiguous (e.g. {0, 2}), causing nil dereference on `accumulators[i]` when key `i` doesn't exist. Fixes: 1. openai.go: iterate map by sorted keys instead of sequential 0..len-1 2. lanes.go: add defer recover() in scheduler goroutine to prevent panics from crashing the entire process — logs error and returns semaphore token 3. tracing: add SweepOrphanTraces() to mark stuck running traces as error on gateway startup (running > 1h = orphan from previous crash) Test results (18 tests): - providers: 9/9 PASS (contiguous, non-contiguous, large gap, high index, thought_signature, text-only, empty, HTTP error, cancelled context) - scheduler: 5/5 PASS (no panic, panic recovery, multiple panics, cancelled context, stats after panic) - store/pg: 4/4 PASS (sweeps old running, ignores recent, ignores completed, no orphans) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… loops When LLM hits max_tokens, tool call arguments may be incomplete/malformed JSON. - Add truncation guard: skip tool execution when finish_reason=length, ask LLM to retry smaller - Wire per-agent max_tokens from other_config JSONB (default 8192) - Log warning on JSON parse failure for non-empty tool call arguments (truncation indicator) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts: # internal/tools/team_tasks_tool.go

Previously, when an SSE stream was prematurely closed (e.g. proxy timeout, network interruption), both Anthropic and OpenAI providers would return a response with finish_reason="stop" and missing usage data — silently delivering truncated content. The agent loop only checked for finish_reason="length", so these cases went undetected. Changes: - Anthropic provider: track message_stop event; if absent, set finish_reason="interrupted" and add scanner.Err() check - OpenAI provider: track [DONE] marker; if absent, set finish_reason="interrupted" - Agent loop: on "interrupted" finish_reason, discard partial response and retry once (with retry event emitted to channels) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

In Docker deployments, ~/.goclaw resolves to /app/.goclaw/ which is empty. The actual workspace files live at GOCLAW_WORKSPACE (/app/workspace/). This caused the Storage UI to show no files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Download media URLs, verify local files, and detect MIME types for WhatsApp incoming messages. Replaces raw string paths with resolved media info including proper logging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Migration file was renumbered to avoid conflict with upstream. RequiredSchemaVersion bumped from 18 to 19. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Create project entity for per-group MCP env overrides with auto-updating timestamps, FK to agent_teams, and unique constraints on slug and (channel_type, chat_id). Bump RequiredSchemaVersion to 20. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…erride types Define store.ProjectStore interface (CRUD + MCP override methods), Project struct embedding BaseModel, and ProjectMCPOverride struct. Wire Projects field into Stores container. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

PostgreSQL-backed ProjectStore: full CRUD, chat-ID lookup, MCP override upsert/delete/list with secret-pattern rejection. Register projects and project_mcp_overrides in tablesWithUpdatedAt, wire into NewPGStores factory. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add ProjectID and ProjectOverrides fields to RunRequest and DelegateRunRequest for per-project MCP env overrides. Introduce composite poolKey (name:projectID) in Pool.Acquire/Release so different projects get separate MCP stdio processes with merged environment variables. Update LoadForAgent, connectViaPool, and filterTools to thread project context through the entire chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… project) Extend ResolverFunc signature to accept ResolveOpts (ProjectID, ProjectOverrides). Add Router.GetForProject() which uses composite cache key "agentID:projectID" to prevent cross-project MCP contamination in shared agents like sdlc-assistant. The resolver closure now passes opts.ProjectID and opts.ProjectOverrides through to mcpMgr.LoadForAgent(), connecting the per-project MCP scoping pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…n and announce - Create resolveProjectOverrides() helper in cmd/gateway_consumer_project.go - Add ProjectID/ProjectOverrides tool context keys for delegation chain propagation - Set project context in agent loop from RunRequest fields - Add ProjectID/ProjectOverrides to DelegationTask struct - Wire project context through delegate_prep.go buildRunRequest - Use GetForProject() in consumer main lane, process lane, and runAgentFn - Propagate project resolution through all 4 announce lanes - Pass pgStores.Projects to consumeInboundMessages

Implements Google Chat as a new channel type (google_chat) in GoClaw: - Service Account JWT auth with auto-refresh (no Google client libs) - Pub/Sub REST pull loop for inbound messages with dedup cache - Markdown → Google Chat text format conversion (bold, italic, links) - Cards V2 for tables/structured content - Byte-aware chunking for Vietnamese/CJK (3900-byte safety limit) - Drive upload for long-form responses with retention cleanup - Thread routing for group conversations - Placeholder edit pattern (Thinking... → final response) - Exponential backoff retry on 429/5xx - 28 unit tests, all passing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… path matching Code review findings: matchesAnyPrefix logic was overly complex with TrimSuffix/HasSuffix dance. Simplified to direct comparison. Added case-insensitive matching on Windows (paths are case-insensitive there). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…urable - Add allowed_routes allowlist (default: GET /v1/projects/by-chat, POST /v1/projects, PUT /v1/projects/) — blocks all other endpoints - Allowlist loaded from builtin_tools.settings at runtime via settingsGetter; falls back to defaultAllowedRoutes when store is unavailable - SetSettingsGetter wired in gateway.go after pgStores init - Seed internal_api into builtin_tools with default settings + metadata hint - 7 tests: allowlist enforcement, settings getter override, no-token, error handling

…process

Lead agents in a team have their ToolWorkspace overridden to the team workspace. This commit: - Syncs promptWorkspace with ToolWorkspaceFromCtx so the system prompt matches actual tool behaviour (prevents stale personal-workspace paths) - Adds isLeadWorkspace flag to buildTeamWorkspaceSection so lead agents get guidance about relative paths resolving to team workspace Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…le tools, sandbox fixes # Conflicts: # internal/agent/loop_history.go

…63.0

…ity fixes, GPT-5.4 Upstream commits: 38dfcf8..32ced98 (31 commits) Key changes: - fix(codex,memory): sanitize tool call IDs (replaces local fix) - feat(ui): kanban, media gallery, team leader indicator, toast notifications - fix(security): defusedxml, extractor chain retry/timeout - feat(tools): Defuddle extractor chain for web_fetch - fix(telegram): bounded handlers, duplicate suppression, stall-aware downloads - feat(teams): attachments refactor, semantic search - feat(discord): writer management commands Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…igrations to 26-27 - uniquifyToolCallIDs: strip non-alphanumeric chars from runID before appending to tool_use.id (Anthropic rejects colons in cron:/heartbeat: RunIDs) - Upstream c6bab9a only fixed Codex provider; this fixes the Anthropic path - Renumber local migrations 22-23 → 26-27 to avoid collision with upstream Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…stream conflicts Move local-only builtin tools (team_message, internal_api) from gateway_builtin_tools.go to gateway_builtin_tools_local.go. This: - Restores gateway_builtin_tools.go to exact upstream state (zero diff) - Restores policy.go tool group lists to exact upstream state - Adds AppendToToolGroup() helper to policy.go for dynamic group extension - Local tools seeded via seedLocalBuiltinTools() after upstream seed - Local policy groups registered via registerLocalToolGroups() Conflict reduction: gateway_builtin_tools.go was the #2 merge conflict hotspot. Now upstream can freely add/modify tools without any conflict. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…lision Upstream added 000024_team_attachments_refactor in the latest merge. Local migrations renumbered: 000024_cron_unique_name → 000025 000025_file_access_log → 000026 000026_team_tasks_workspace_followup → 000027 000027_projects → 000028 RequiredSchemaVersion bumped to 28. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… provider loss When config:changed is broadcast by subsystems that haven't called ApplyDBSecrets (e.g. registerProvidersFromDB), the updatedCfg payload has empty TTS API keys. setupTTS() then registers fewer providers, causing the TTS manager to be replaced with one missing MiniMax. Fix: explicitly re-apply DB secrets to updatedCfg before calling setupTTS() in the tts-config-reload subscriber. Fixes: tts provider not found: minimax after startup

…pstream tools seedLocalBuiltinTools() was calling Seed() with only 2 local tools. Since Seed() does a reconcile DELETE of all rows not in the provided list, this wiped the 28 upstream tools just seeded by seedBuiltinTools(). Fix (Option B): add Upsert() to BuiltinToolStore interface and PGBuiltinToolStore that performs the same INSERT ON CONFLICT ... DO UPDATE but WITHOUT the reconcile DELETE step. seedLocalBuiltinTools() now uses Upsert() keeping the upstream/VNPAY fork separation clean. Result: DB now has 33 builtin tools seeded correctly (31 upstream, 2 local). All tools visible in Admin UI.

…able Root cause: setupToolRegistry() calls setupTTS(cfg) BEFORE pgStores exists. At that point tts.minimax.api_key is empty (stored in config_secrets DB), so TTS manager has primary=minimax but providers={edge only}. The tts-config-reload subscriber was a safety net but registerProvidersFromDB broadcasts config:changed BEFORE subscriber is registered at startup, so the event is missed entirely. Fix: after pgStores is set up, apply DB secrets to cfg and rebuild the TTS manager immediately, ensuring minimax is registered from the first tool call.

Covers: - TestGetProvider_ProviderNotRegistered: reproduces startup bug (primary=minimax but minimax not in providers map when API key is absent) - TestGetProvider_AfterRebuildWithSecrets: validates fix (rebuild manager after DB secrets applied → GetProvider returns correct provider) - TestSynthesizeWithFallback_PrimaryNotInProviders: confirms fallback to edge when primary is missing from providers (old container behaviour) - TestUpdateManager_SwapsProvider: validates UpdateManager hot-swap - Additional: AllFail, HasProviders, PrimaryProvider_FirstRegistered

C6: LoadPath*, LoadPathScoped, LoadPathAny now resolve symlinks via EvalSymlinks() and validate resolved path stays within baseDir. sanitizePath() strips control chars before logging (prevents log injection). C4: Per-IP rate limiting middleware (100 req/min) on /v1/media/{id} endpoint. In-memory sliding window counter with periodic cleanup. C3: Added security audit logging for memory_search tool invocations (agent_id, query_len, results_count) for observability. Addresses HIGH-02, HIGH-03, MED-03 from Security Assessment T-202. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- TestSanitizePath: control char stripping - TestResolveAndValidate: baseDir boundary check, symlink traversal block - TestLoadPathAny: admin context enforcement - TestRateLimiter: allow/block/window reset/independent IPs - TestRateLimitMiddleware: 429 response + Retry-After header - TestExtractClientIP: XFF, X-Real-IP, RemoteAddr parsing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…Azure OpenAI Upstream changes (v1.68.1 → v1.74.1, 17 commits): - feat: pgvector semantic search for KG entities + team KG sharing - feat: extractive memory fallback when LLM flush fails - feat: Azure OpenAI + Foundry header support (nextlevelbuilder#319) - fix: YAML multiline skill frontmatter parsing (nextlevelbuilder#325) - fix: skills DB file_path directory resolution (nextlevelbuilder#322) - fix: MCP OpenAI compatibility (nextlevelbuilder#321) - fix: Windows build + filesystem security (nextlevelbuilder#318) - feat: MediaMaxBytes enforcement for Telegram/Discord (nextlevelbuilder#317) - fix: memory embedding cache reliability - fix: UI usage tab crash, image download overlay - fix: teams post-turn processor + PendingTeamDispatch Conflicts resolved: - internal/agent/loop.go: take upstream (no local changes) - internal/tools/filesystem.go: take upstream (our comment moved) - internal/upgrade/version.go: keep ours (28 > upstream 25) - Removed old filesystem_{unix,windows}.go (replaced by filesystem_security_{unix,windows}.go upstream) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…E fix + Zalo QR restart + TDD/PBT PR nextlevelbuilder#182 (cherry-pick core fixes, without Party Mode): - Sort non-contiguous SSE tool_call indices (prevents nil pointer panic) - Log truncated tool call arguments instead of silently discarding - extractDefaultModel from provider settings JSONB PR nextlevelbuilder#346: - Zalo QR session restart: cancel previous session instead of blocking Tests (TDD + PBT): - Non-contiguous indices: 1000+ PBT random inputs - Truncated JSON arguments: 6 edge cases - All tests PASS (verified on VM goclaw-pilot) PR nextlevelbuilder#350: SKIPPED — core fix (generateId) already in PR nextlevelbuilder#352. Provider listing UX improvements deferred. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…E fix + Zalo QR restart + TDD/PBT (#11) PR nextlevelbuilder#182 (cherry-pick core fixes, without Party Mode): - Sort non-contiguous SSE tool_call indices (prevents nil pointer panic) - Log truncated tool call arguments instead of silently discarding - extractDefaultModel from provider settings JSONB PR nextlevelbuilder#346: - Zalo QR session restart: cancel previous session instead of blocking Tests (TDD + PBT): - Non-contiguous indices: 1000+ PBT random inputs - Truncated JSON arguments: 6 edge cases - All tests PASS (verified on VM goclaw-pilot) PR nextlevelbuilder#350: SKIPPED — core fix (generateId) already in PR nextlevelbuilder#352. Provider listing UX improvements deferred. Co-authored-by: Milton Silva <milton@vellus.tech> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When re-enabling a cron job, read its schedule fields and compute next_run_at inline (matching AddJob/recomputeStaleJobs logic) instead of leaving it NULL. Also clear next_run_at on disable so GetDueJobs skips it immediately. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…fixes Upstream changes (v1.75.0 → v2.4.3): - Multi-tenant isolation: API key binding, session ownership, cron filtering - Teams: force v2, task hints, blocker escalation, audit events - Channels: extract media from replied-to messages - Security: shell deny gaps (halt/su/doas), WebSocket panic fix - UI: theme toggle, version display, cron page redesign - Fix: Telegram poll race, Zalo QR restart, summoning display names Local merge notes: - Renumber local migrations 025-028 → 029-032 to avoid upstream collision - Bump RequiredSchemaVersion to 32 - Keep local: file ACL, internal_api, MCP env expansion, Zalo fixes - Add build tag linux to cmd/pkg-helper (syscall.Umask) - Fix MakeHistory 3rd arg (tenantID), providerReg ctx params Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…urrogates 1. sessions_list.go — Save() uses INSERT ... ON CONFLICT DO UPDATE instead of UPDATE-only. First-run cron sessions now persist correctly (fixes silent 0-row UPDATE). 2. gateway_cron.go — Reset session cache before each cron handler call so ExecuteWithRetry attempts start fresh, preventing stale messages (empty user + "..." fallback) from polluting retries. 3. sanitize_json.go — Strip lone surrogate escapes (\uD800–\uDFFF) from marshaled JSON before sending to LLM APIs. Applied to Anthropic, OpenAI, and Codex providers. Prevents "invalid high surrogate" API rejections from corrupted json.RawMessage content. Fixes: nextlevelbuilder#379 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mrgoonie · 2026-03-24T08:47:01Z

@claude review

claude · 2026-03-24T08:47:17Z

Claude finished @mrgoonie's task in 3m 21s —— View job

PR Review

Gather context and understand the request
Read changed files (core bug fix: openai.go, lanes.go, tracing.go)
Read changed files (new party system, stores, migrations)
Read changed files (UI / other)
Analyze and write review feedback

Summary

This PR has a misleading title. The stated fix (nil pointer crash in OpenAI SSE accumulation) is real and correct, but the PR actually contains a large new feature — a "party" (multi-agent session orchestration) system — accounting for ~3000 of the 3246 added lines across 48 files. These should be separate PRs for reviewability.

🔴 Critical Issues

1. Migration number conflict — will break migrations

The PR adds migrations/000018_party_sessions.up.sql but migration #18 already exists:

migrations/000018_team_tasks_workspace_followup.up.sql  ← already exists

Current RequiredSchemaVersion in internal/upgrade/version.go is 28. Per CLAUDE.md, the new migration must be numbered to match the next version. The new file must be renamed:

migrations/000029_party_sessions.up.sql

And version.go should be updated to 29. As-is, this will either silently skip the new migration or fail depending on how golang-migrate handles collisions. This is a data loss / broken schema risk.

Fix this →

🟡 High Priority Issues

2. Core bug fix is correct but not yet applied to this branch

The current internal/providers/openai.go:229 still contains the bug:

// BUG: assumes contiguous keys 0, 1, 2... — panics on {0, 2}
for i := 0; i < len(accumulators); i++ {
    acc := accumulators[i]  // nil when key i doesn't exist
    ...acc.rawArgs...        // nil pointer dereference → panic
}

The PR description says the fix iterates with sorted keys. This is the correct approach. The fix + new tests are good additions.

3. `recover()` in lanes swallows panics silently

The PR adds a defer recover() in lanes.go goroutine. While this prevents process-level crashes, it means:

The user's request silently fails — no error is propagated to the caller
fn() may be mid-execution when the panic fires, leaving state partially applied (e.g., a half-completed agent run that updated some DB rows but not others)
Panics in Go typically indicate programming errors — swallowing them makes debugging harder

The recover is only acceptable as a last resort, not as a substitute for fixing underlying bugs. Ensure the log message includes the full stack trace (debug.Stack()), not just the panic value, so crashes can be diagnosed from logs.

4. PR scope mismatch — too large to safely review in one pass

The PR title describes a 3-file bug fix. It actually contains:

Area	New files	LoC added
`internal/party/`	4 new	~545
`internal/gateway/methods/party.go`	1 new	~564
`internal/store/party_store.go` + `pg/party.go`	2 new	~196
`pkg/protocol/party.go`	1 new	~26
`ui/web/src/pages/party/`	6 new	~1000
Migration	2 new	~24

This pattern ("sneak feature in with a bug fix") makes it very difficult to review the feature code rigorously. The feature should be submitted as a separate PR.

🟠 Medium Priority Issues

5. `SweepOrphanTraces` only fixes traces, not spans

SweepOrphanTraces marks stuck running traces as error. However, spans within those traces will still show running status in the DB, causing inconsistency if the UI renders per-span status. Consider also sweeping orphaned spans:

UPDATE spans SET status = 'error', error = 'orphaned on restart'
WHERE status = 'running' AND created_at < NOW() - INTERVAL '1 hour'

6. Startup sweep lacks a timeout context

The sweep is called on startup in a background goroutine. If cmd/gateway.go passes a long-lived context, a slow/unresponsive DB could block the sweep goroutine indefinitely. Use a bounded context:

go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    ...sweep(ctx)...
}()

7. `sessions_ops.go` change needs explanation

internal/store/pg/sessions_ops.go has +8/-8 changes. Without seeing the diff (the PR branch isn't accessible for direct diff), it's unclear what changed. Verify there are no unintended regressions to session management.

🟢 Positive Observations

Root cause correctly identified: Non-contiguous SSE tool call indices from extended-thinking models is a real edge case that would only surface in production.
Test coverage: 18 new tests across 3 packages, including the exact crash scenario and edge cases (large gaps, high indices). This is well done.
Defensive coding: The panic recovery and orphan sweep are good operational hygiene additions.
i18n coverage: New UI strings added to all 3 locale files (en/vi/zh). Follows project conventions.

Action Required

Before merging:

Rename migration file from 000018_party_sessions → 000029_party_sessions and update version.go to 29
Verify the recover() in lanes.go logs the full stack trace
Consider splitting the party feature into a separate PR for proper review
Explain the sessions_ops.go changes in the PR description

…n reliability, storage Merged 42 upstream commits from nextlevelbuilder/goclaw origin/main. Key upstream changes included: - feat(config): system_configs DB table with per-tenant isolation + System Settings modal - fix: cron reliability — clear session in DB on cold-cache restart (nextlevelbuilder#294, nextlevelbuilder#424) - refactor(security): remove cross-tenant bypass, enforce strict tenant isolation - fix(agent): set TenantID on streaming events in runLoop emitRun - fix: add panic recovery to tool, cron, and summarization goroutines - fix: check errors in cron store Scan/Unmarshal to prevent data corruption - feat(teams): smart post-turn task decision + stale detection - fix(skills): unblock agent read_file on skill paths - feat(storage): bump upload limit to 50MB, add file upload to Team Workspace - feat(gateway): version update checker - fix(channels): propagate tenant_id to pending history Conflict resolutions: - gateway_cron.go: adopt upstream sessionMgr.Reset()+Save() pattern; revert Save() to UPDATE-only - sessions_list.go: revert UPSERT → UPDATE-only (upstream fix covers cold-cache via explicit Save after Reset) - upgrade/version.go: bump RequiredSchemaVersion 32→33 (adds upstream migration 33_system_configs) - migrations: renumber upstream 000029_system_configs → 000033_system_configs (avoids conflict with local 000029_kg_entity_embeddings) - gateway.go: keep local projects wiring + File ACL MediaServeHandler; drop WithCrossTenant on LoadAll per security refactor - gateway_http_handlers.go: keep local wireHTTP signature with token + projectsH return - media_serve.go: keep local File ACL implementation (scoped token, rate limiting) - i18n/index.ts: union both namespace sets; add upstream "system-settings" Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rProject signatures Upstream added context.Context to ResolverFunc and GetForProject after the multi-tenant refactor. Test file still used old signatures — updated all call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

duhd-vnpay and others added 30 commits March 10, 2026 15:44

Merge branch 'nextlevelbuilder:main' into main

6d6a7c7

Merge remote-tracking branch 'origin/main'

68a9e95

Merge branch 'main' of https://github.com/duhd-vnpay/goclaw

925a5c1

fix: TypeScript strict mode errors in Party Mode UI

f26d52a

- Add nullish coalescing for PERSONA_COLORS array index - Remove unused destructured `round` variable from PartyPage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

test: add data-testid attributes to Party Mode components for E2E tes…

b332024

…ting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main'

ba3871e

Merge tag 'v1.19.0'

578197e

# Conflicts: # internal/tools/team_tasks_tool.go

feat(whatsapp): add media resolution for incoming messages

59cbab4

Download media URLs, verify local files, and detect MIME types for WhatsApp incoming messages. Replaces raw string paths with resolved media info including proper logging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: rename migration 000018 to 000019 and bump schema version

dfa9782

Migration file was renumbered to avoid conflict with upstream. RequiredSchemaVersion bumped from 18 to 19. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

duhd-vnpay and others added 20 commits March 19, 2026 14:24

feat: wire file ACL — scoped tokens, audit, config

9a8b32a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: wire internal_api tool into tool registry (gateway_setup)

1d0e106

fix(mcp): expand env vars with os.ExpandEnv before passing to MCP sub…

71aaafe

…process

Merge upstream origin/main — heartbeat, sandbox hints, UI polish

a050ff6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

chore: merge upstream v1.63.0 — intent hardening, UI refactor, Defudd…

eab3a44

…le tools, sandbox fixes # Conflicts: # internal/agent/loop_history.go

fix(test): update mockAgent to implement Provider method added in v1.…

b4d5f28

…63.0

duhd-vnpay and others added 3 commits March 23, 2026 16:51

duhd-vnpay and others added 2 commits March 24, 2026 23:38

viettranx closed this Apr 3, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent nil pointer crash in OpenAI SSE tool call accumulation#182

fix: prevent nil pointer crash in OpenAI SSE tool call accumulation#182
duhd-vnpay wants to merge 92 commits intonextlevelbuilder:mainfrom
duhd-vnpay:main

duhd-vnpay commented Mar 13, 2026

Uh oh!

mrgoonie commented Mar 24, 2026

Uh oh!

claude Bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

duhd-vnpay commented Mar 13, 2026

Summary

Changed files

Test results (18 tests, all PASS)

Uh oh!

mrgoonie commented Mar 24, 2026

Uh oh!

claude Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review

Summary

🔴 Critical Issues

1. Migration number conflict — will break migrations

🟡 High Priority Issues

2. Core bug fix is correct but not yet applied to this branch

3. recover() in lanes swallows panics silently

4. PR scope mismatch — too large to safely review in one pass

🟠 Medium Priority Issues

5. SweepOrphanTraces only fixes traces, not spans

6. Startup sweep lacks a timeout context

7. sessions_ops.go change needs explanation

🟢 Positive Observations

Action Required

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude Bot commented Mar 24, 2026 •

edited

Loading

3. `recover()` in lanes swallows panics silently

5. `SweepOrphanTraces` only fixes traces, not spans

7. `sessions_ops.go` change needs explanation