
feat(observability): add OpenTelemetry distributed tracing export #344

Merged
AbirAbbas merged 3 commits into Agent-Field:main from tmchow:feat/otel-distributed-tracing
Apr 7, 2026


@tmchow tmchow commented Apr 7, 2026

Summary

Add OpenTelemetry trace export to the control plane execution pipeline. Each execution creates a root span, and reasoner/skill invocations create child spans with execution metadata (agent ID, execution ID, workflow ID) as span attributes. Exports via OTLP HTTP to any OTel-compatible backend (Jaeger, Grafana Tempo, Langfuse).

This builds on the existing observability infrastructure in execution_metrics.go (Prometheus counters/histograms) and observability_forwarder.go (webhook event forwarding) by adding distributed tracing as a third observability signal.
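
The span hierarchy described in the summary can be pictured with a small stand-in model. This is only a sketch: it uses plain Go structs instead of the OpenTelemetry SDK, and every name in it (`span`, `startExecutionSpan`, `startStepSpan`, the `agentfield.*` attribute keys) is an assumption for illustration, not the PR's actual API. The ID sizes follow the OTel convention of 16-byte trace IDs and 8-byte span IDs.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// span is a simplified stand-in for an OTel span (not the SDK type).
type span struct {
	TraceID    string
	SpanID     string
	ParentID   string // empty for a root span
	Name       string
	Attributes map[string]string
}

// newID returns n random bytes encoded as a hex string.
func newID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// startExecutionSpan creates the root span for one execution and attaches
// the execution metadata as attributes (attribute keys are hypothetical).
func startExecutionSpan(agentID, executionID, workflowID string) span {
	return span{
		TraceID: newID(16),
		SpanID:  newID(8),
		Name:    "execution",
		Attributes: map[string]string{
			"agentfield.agent_id":     agentID,
			"agentfield.execution_id": executionID,
			"agentfield.workflow_id":  workflowID,
		},
	}
}

// startStepSpan creates a child span for a reasoner or skill invocation.
// It shares the parent's trace ID and copies the execution metadata.
func startStepSpan(parent span, stepName string) span {
	attrs := make(map[string]string, len(parent.Attributes))
	for k, v := range parent.Attributes {
		attrs[k] = v
	}
	return span{
		TraceID:    parent.TraceID,
		SpanID:     newID(8),
		ParentID:   parent.SpanID,
		Name:       stepName,
		Attributes: attrs,
	}
}
```

A tracing backend reconstructs the execution tree from exactly these two links: the shared trace ID groups spans into one trace, and each child's parent ID points at the span that started it.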

Note: I based the direction of this change on PR #342 (the execution observability RFC).

Changes

  • New control-plane/internal/observability/ package:
    • tracer.go: TracerProvider initialization with OTLP HTTP exporter, resource attributes, and span helper methods (StartExecutionSpan, StartStepSpan, RecordError)
    • execution_tracer.go: Subscribes to GlobalExecutionEventBus and GlobalReasonerEventBus, translates events into OTel spans with parent-child relationships matching the execution DAG
  • TracingConfig added to FeatureConfig in config.go with env var overrides (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, AGENTFIELD_TRACING_ENABLED)
  • Wired into server.go: initialization in NewAgentFieldServer, start in Start(), graceful flush in shutdown
  • Default config in agentfield.yaml (disabled by default)
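
As a rough illustration of the env var overrides listed above, the following sketch layers the three environment variables over values loaded from YAML. The struct shape, field names, and the `applyEnvOverrides` helper are assumptions; the PR's actual `TracingConfig` may differ. Only the environment variable names come from the description.

```go
package main

import "strconv"

// tracingConfig is a hypothetical shape for the tracing settings.
type tracingConfig struct {
	Enabled     bool
	Endpoint    string
	ServiceName string
}

// applyEnvOverrides returns cfg with any set environment variables applied.
// lookup has the same signature as os.LookupEnv, which keeps the function
// easy to test without touching the real process environment.
func applyEnvOverrides(cfg tracingConfig, lookup func(string) (string, bool)) tracingConfig {
	if v, ok := lookup("AGENTFIELD_TRACING_ENABLED"); ok {
		// Ignore values that do not parse as a boolean and keep the YAML value.
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.Enabled = b
		}
	}
	if v, ok := lookup("OTEL_EXPORTER_OTLP_ENDPOINT"); ok && v != "" {
		cfg.Endpoint = v
	}
	if v, ok := lookup("OTEL_SERVICE_NAME"); ok && v != "" {
		cfg.ServiceName = v
	}
	return cfg
}
```

In a real program, `os.LookupEnv` can be passed directly as `lookup`. Starting from a zero-value `Enabled` field matches the disabled-by-default behavior: tracing stays off unless YAML or the environment turns it on.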

Demo

OpenTelemetry tracing demo

Testing

  • ./scripts/test-all.sh (all control-plane tests pass)
  • Additional verification:
    • 10 new unit tests covering tracer initialization, span creation with parent-child relationships, execution lifecycle (created -> completed/failed), duplicate event handling, and graceful shutdown with open span cleanup
    • Full go build ./... passes with no compilation errors
    • Existing test suites (events, config, services, handlers) all pass unchanged
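
The lifecycle behaviors named in the unit test list (created -> completed/failed, duplicate event handling, and open span cleanup on shutdown) can be sketched with a small tracker. This is a stand-in written against plain Go types, not the OTel SDK or the PR's `execution_tracer.go`; the event kinds, type names, and the "aborted" status given to spans closed at shutdown are all assumptions.

```go
package main

import "sync"

type eventKind int

const (
	executionCreated eventKind = iota
	executionCompleted
	executionFailed
)

// executionEvent is a hypothetical, minimal event from the event bus.
type executionEvent struct {
	Kind        eventKind
	ExecutionID string
}

// openSpan is a stand-in for a span that has been started.
type openSpan struct {
	ExecutionID string
	Status      string // set to "ok", "error", or "aborted" when ended
}

// lifecycleTracker maps execution IDs to their open spans.
type lifecycleTracker struct {
	mu    sync.Mutex
	open  map[string]*openSpan
	ended []openSpan
}

func newLifecycleTracker() *lifecycleTracker {
	return &lifecycleTracker{open: make(map[string]*openSpan)}
}

// handle translates one event into a span start or span end.
func (t *lifecycleTracker) handle(ev executionEvent) {
	t.mu.Lock()
	defer t.mu.Unlock()
	switch ev.Kind {
	case executionCreated:
		if _, exists := t.open[ev.ExecutionID]; exists {
			return // duplicate "created" event: keep the existing span
		}
		t.open[ev.ExecutionID] = &openSpan{ExecutionID: ev.ExecutionID}
	case executionCompleted, executionFailed:
		s, exists := t.open[ev.ExecutionID]
		if !exists {
			return // unknown execution or duplicate terminal event
		}
		s.Status = "ok"
		if ev.Kind == executionFailed {
			s.Status = "error"
		}
		t.end(s)
	}
}

// end closes a span. The caller must hold t.mu.
func (t *lifecycleTracker) end(s *openSpan) {
	delete(t.open, s.ExecutionID)
	t.ended = append(t.ended, *s)
}

// shutdown ends every span that is still open so nothing is dropped.
func (t *lifecycleTracker) shutdown() {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, s := range t.open {
		s.Status = "aborted"
		t.end(s)
	}
}
```

Keying open spans by execution ID is what makes duplicate handling cheap: a repeated event either finds the span already open or already gone, and is ignored in both cases.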

Checklist

  • I updated documentation where applicable.
  • I added or updated tests (or none were needed).
  • I updated CHANGELOG.md (or this change does not warrant a changelog entry).

Screenshots (if UI-related)

N/A - backend observability feature, no UI changes.

Related issues

Relates to #342 (execution observability RFC)

This contribution was developed with AI assistance (Claude Code).


Add OTel trace export to the control plane execution pipeline. Each
execution creates a root span, and reasoner/skill invocations create
child spans with execution metadata as span attributes.

- New `control-plane/internal/observability/` package with TracerProvider
  initialization (OTLP HTTP exporter) and ExecutionTracer that subscribes
  to the existing execution and reasoner event buses
- TracingConfig added to FeatureConfig with YAML config and env var
  overrides (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME)
- Disabled by default; opt in via `features.tracing.enabled: true` or
  AGENTFIELD_TRACING_ENABLED=true
- Integrated into server lifecycle (start/shutdown with graceful flush)
- 10 unit tests covering tracer init, span parent-child relationships,
  execution lifecycle, duplicate handling, and shutdown cleanup
@tmchow tmchow requested review from a team and AbirAbbas as code owners April 7, 2026 02:11
tmchow and others added 2 commits April 6, 2026 19:29
The initial OTel dependency pull upgraded transitive deps
(golang.org/x/*, grpc) that require Go 1.25, but CI uses Go 1.24.
Pin OTel to v1.32-v1.35 and restore the original grpc/x/ versions.

Resolve go.mod/go.sum conflicts from main branch updates (newer
golang.org/x/*, grpc, protobuf versions). Default tracing insecure
config to false to prevent accidental unencrypted trace export in
production.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@AbirAbbas AbirAbbas enabled auto-merge April 7, 2026 17:08
@AbirAbbas AbirAbbas added this pull request to the merge queue Apr 7, 2026
Merged via the queue into Agent-Field:main with commit a9884e8 Apr 7, 2026
12 checks passed

tmchow commented Apr 7, 2026

Thanks for the merge, @AbirAbbas!
