Skip to content

feat(adf): subscription-based provider routing with persona and skill chains#705

Open
AlexMikhalev wants to merge 32 commits intomainfrom
task/28-provider-model-fallback-schema
Open

feat(adf): subscription-based provider routing with persona and skill chains#705
AlexMikhalev wants to merge 32 commits intomainfrom
task/28-provider-model-fallback-schema

Conversation

@AlexMikhalev
Copy link
Contributor

Summary

Stats

  • 21 files changed, +4,645 lines
  • 103 tests passing (92 unit + 11 integration)

ADRs

  • ADR-002: Subscription-only model providers (ban opencode/Zen)
  • ADR-003: Four-tier model routing (Quick/Deep/Implementation/Oracle)
  • ADR-004: Terraphim persona identity layer (8 named personas)
  • ADR-005: kimi-for-coding as implementation tier

Test plan

  • cargo test -p terraphim_spawner -- 92 unit tests pass
  • Integration tests -- 11 tests pass (provider routing, circuit breaker, subscription guard, NDJSON, skills, personas)
  • cargo clippy -p terraphim_spawner -- clean
  • Rebuild adf binary on bigbox and restart service
  • Verify agents start with new provider routing

AlexMikhalev and others added 30 commits March 17, 2026 16:24
Security hardening for terraphim_rlm crate:

1. Created validation.rs module with:
   - validate_snapshot_name(): Prevents path traversal attacks
   - validate_code_input(): Enforces MAX_CODE_SIZE (1MB) limit
   - validate_session_id(): Validates UUID format
   - validate_recursion_depth(): Prevents stack overflow
   - Security constants: MAX_CODE_SIZE, MAX_INPUT_SIZE, MAX_RECURSION_DEPTH

2. Fixed race condition in firecracker.rs:
   - Changed snapshot counter from read-then-write to atomic write lock
   - Added validate_snapshot_name() call before snapshot creation
   - Prevents TOCTOU vulnerability where concurrent snapshots could exceed limit

3. Enhanced mcp_tools.rs:
   - Added MAX_CODE_SIZE validation for rlm_code tool
   - Added MAX_CODE_SIZE validation for rlm_bash tool
   - Returns proper MCP error format for validation failures

Refs #426
Refs: PR #426
Features:
- fcctl-core to terraphim_firecracker adapter
- Sub-500ms VM allocation (267ms measured)
- ULID-based VM ID enforcement
- Full trait implementation with error preservation
- 119 tests passing

Validation: All acceptance criteria met
…erTier enum

Add provider, fallback_provider, fallback_model, and provider_tier fields
to AgentDefinition for subscription-based model routing (ADR-002, ADR-003).
Add ProviderTier enum (Quick/Deep/Implementation/Oracle) with per-tier
timeout values. Add opencode CLI support in spawner arg inference.

All new fields are Optional with serde(default) for backward compatibility.

Fixes #28
Refs #29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add provider allowlist and banned_providers fields to OrchestratorConfig.
Default banned list includes "opencode" (Zen pay-per-use proxy, ADR-002).
Add validate_provider() method to spawner AgentConfig that rejects any
model string starting with a banned prefix, while correctly allowing
opencode-go/ (subscription) and kimi-for-coding/ (subscription).

Fixes #31

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persona_name, persona_symbol, persona_vibe, meta_cortex_connections,
and skill_chain fields to AgentDefinition for the four-layer identity
stack (Persona/Role/SFIA/Skills). All fields Optional/default for
backward compatibility. See ADR-004.

Fixes #32

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SpawnRequest struct and spawn_with_fallback() method that:
- Validates banned providers before spawn (ADR-002)
- Uses ProviderTier timeout values (Quick=30s, Deep=60s, Impl=120s, Oracle=300s)
- Retries with fallback provider/model on primary failure
- Integrates per-provider circuit breakers (3 failures = open 5 min)
- Returns SpawnerError on both primary and fallback failure

SpawnRequest avoids circular dependency with terraphim_orchestrator
by mirroring needed AgentDefinition fields in the spawner crate.

Fixes #30

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persona_name, persona_symbol, persona_vibe, and meta_cortex_connections
to SpawnRequest. The build_persona_prefix() function generates a markdown
identity block that is prepended to the agent task prompt when persona
fields are configured. Agents without persona config are unaffected.

Fixes #33

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SkillChainRegistry with 31 terraphim-skills and 16 zestic-skills.
Provides validate_chain() to verify skill chains and validate_skill_chains()
on OrchestratorConfig to validate all agents. Backward compatible -- agents
with empty skill_chain pass validation.

Fixes #34

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add OpenCodeEvent struct with parsing for opencode run --format json
output events (step_start, text, tool_use, step_finish, result).
Includes text_content(), total_tokens(), parse_line(), and parse_lines()
helper methods for structured output consumption.

Fixes #37

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SkillResolver in skill_resolver.rs that maps skill chain names to actual
skill file paths from the terraphim-skills repository.

Features:
- SkillResolver with registry of terraphim skills (security-audit, code-review,
  session-search, local-knowledge, git-safety-guard, devops, disciplined-research,
  architecture, disciplined-design, requirements-traceability, testing,
  acceptance-testing, documentation, md-book, implementation, rust-development,
  visual-testing, quality-gate)
- resolve_skill_chain() method that takes Vec<skill_name> and returns
  resolved skill metadata (name, description, applicable_to, paths, source)
- Validation of skill chains with proper error handling
- Comprehensive test coverage including valid resolution, missing skill errors,
  and empty chain handling
- Exported from crate as SkillResolver, SkillSource, ResolvedSkill,
  SkillResolutionError

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h Refs #36

Extend SkillResolver to support zestic-engineering-skills from zestic-ai/6d-prompts
repository alongside existing terraphim-skills.

Features:
- Add SkillSource enum (Terraphim, Zestic) to distinguish skill origins
- Initialize zestic skills registry with 12 skills: quality-oversight,
  responsible-ai, insight-synthesis, perspective-investigation, product-vision,
  wardley-mapping, business-scenario-design, rust-mastery, cross-platform,
  frontend, via-negativa-analysis, strategy-execution
- Each resolved skill includes its source metadata
- SkillChainRegistry validation accepts skills from both sources
- Comprehensive tests for mixed skill chains (terraphim + zestic together)

Tests added:
- test_resolver_has_zestic_skills
- test_resolve_zestic_skill
- test_resolve_mixed_skill_chain
- test_validate_mixed_skill_chain_valid
- test_validate_only_zestic_skills
- test_mixed_chain_with_invalid_skills
- test_zestic_skill_source_in_resolved
- test_all_skill_names_includes_zestic
- test_zestic_skill_structure
- test_resolver_custom_paths_for_zestic

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…routing Refs #39

Replace all legacy codex CLI references with opencode + subscription providers.
Add persona, skill_chain, provider/model/fallback fields to all agents.
New agents: compliance-watchdog, drift-detector, spec-validator, test-guardian,
documentation-generator, implementation-swarm, compound-review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SSH non-interactive shells (systemd) do not include ~/.bun/bin or
~/.local/bin in PATH. Use full paths for opencode and claude binaries.
Also fix TOML escape in --since quote.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Creates crates/terraphim_judge_evaluator with:
- SimpleAgent struct wrapping terraphim_router for KG lookups
- KgMatch struct for representing term matches
- lookup_terms() method using Aho-Corasick automata
- enrich_prompt() for appending KG context to judge prompts
- Comprehensive tests for all functionality

Also adds automation/judge/model-mapping.json configuration file.
Implements JudgeModelRouter for tier-based LLM model selection:
- TierConfig struct for provider+model pairs
- ModelMappingConfig for complete configuration
- JudgeModelRouter with methods:
  * from_config(path) - load from JSON file
  * resolve_tier(tier) - get provider/model for a tier
  * resolve_profile(profile) - get tier sequence for a profile
- Supports tiers: quick, deep, tiebreaker, oracle
- Supports profiles: default, thorough, critical, exhaustive
- Comprehensive tests for all functionality
- Updates automation/judge/model-mapping.json with profile definitions
Implements JudgeAgent as a supervised agent combining:
- SimpleAgent for Knowledge Graph context enrichment
- JudgeModelRouter for tier-based model selection
- SupervisedAgent trait from terraphim_agent_supervisor

JudgeVerdict struct with:
- verdict: final evaluation result (PASS/FAIL/NEEDS_REVIEW)
- scores: BTreeMap of detailed category scores
- judge_tier: which tier produced the verdict
- judge_cli: command used for evaluation
- latency_ms: evaluation duration

Evaluation pipeline:
1. Load file content
2. Enrich with KG context (optional)
3. Select model tier based on profile
4. Parse verdict from CLI output (JSON or text format)

Includes comprehensive tests for:
- Verdict parsing and manipulation
- Supervised agent lifecycle
- Profile-based tier selection
- System message handling
- Add stagger_delay_ms config field (default: 5000ms)
- Insert stagger delay between Safety agent spawns in run()
- Add random jitter (0 to stagger_delay_ms) for Core agent cron spawns
- Add tests for stagger delay configuration

Refs #16
- Add ReviewRequest struct with from_agent, to_agent, artifact_path, review_type
- Add ReviewPair config for defining (producer, reviewer) pairs
- Add review_queue: Vec<ReviewRequest> to orchestrator state
- Add submit_review_request(), review_queue(), process_review_queue() methods
- Add check_review_trigger() to automatically queue reviews on agent completion
- Add tests for review queue operations and config loading

Refs #17
- Add drift_detection module with DriftDetector and DriftReport
- Load strategic goals from plans/ directory markdown files
- Check drift every N ticks (configurable via drift_detection.check_interval_ticks)
- Calculate drift score by comparing agent outputs against strategic goals
- DriftReport includes agent, drift_score, and explanation
- Log warnings when drift_score exceeds threshold (default: 0.6)
- Add DriftDetectionConfig to OrchestratorConfig
- Add tests for drift detection functionality

Refs #18
- Add session_rotation module with SessionRotationManager and AgentSession
- Add SessionRotationConfig with max_sessions_before_rotation (default: 10)
- Track completions_since_rotation and completed_sessions per agent
- on_agent_completion() records completion and triggers rotation at threshold
- Rotation creates new session ID and clears accumulated context
- Add comprehensive tests for session rotation

Refs #19
- Add convergence_detector module with ConvergenceDetector and ConvergenceSignal
- Add ConvergenceConfig with threshold (default: 0.95) and consecutive_threshold (default: 3)
- Calculate output similarity using Jaccard index on word sets
- Detect convergence after N consecutive similar outputs
- ConvergenceSignal includes agent, similarity, and consecutive_count
- Reset on divergence to handle changing outputs
- Add comprehensive tests for convergence detection

Refs #20
…iven mode

Issue #8: Extend orchestrator config for issue-driven mode
- Add WorkflowConfig with mode (time_only/issue_only/dual), poll_interval_secs, max_concurrent_tasks
- Add TrackerConfig with tracker_type (gitea/linear), url, token_env_var, owner, repo
- Add ConcurrencyConfig with max_parallel_agents, queue_depth, starvation_timeout_secs
- All optional in orchestrator.toml with backward compatible defaults (time_only mode is default)
- Tests: parse config with and without workflow section, defaults applied

Issue #9: Unified dispatcher
- Create crates/terraphim_orchestrator/src/dispatcher.rs
- DispatchTask enum: TimeTask(agent_name, schedule), IssueTask(agent_name, issue_id, priority)
- DispatchQueue: priority queue backed by BinaryHeap with round-robin fairness
- ConcurrencyController: semaphore-based with starvation timeout
- Methods: submit(task), next() -> Option<DispatchTask>, active_count(), is_full()
- Fairness: alternates between time and issue tasks at equal priority
- Tests: submit/dequeue, priority ordering, concurrency limits, fairness

Issue #10: Issue mode controller
- Create crates/terraphim_orchestrator/src/issue_mode.rs
- IssueMode struct using terraphim_tracker::GiteaTracker to poll for issues
- Poll loop every poll_interval_secs fetching ready issues via PageRank sorting
- Filter out blocked issues and already-running tasks
- Map issues to agents based on labels ([ADF] -> implementation-swarm) or title patterns
- Submit IssueTask to DispatchQueue
- Tests: poll cycle, issue-to-agent mapping, priority calculation, blocked issue filtering

Issue #11: Time mode refactor
- Refactor crates/terraphim_orchestrator/src/scheduler.rs
- Extract TimeMode struct wrapping existing cron scheduler
- TimeMode submits TimeTask to DispatchQueue instead of spawning directly (when configured)
- Maintain backward compatibility: legacy mode spawns directly if no WorkflowConfig
- Tests: TimeMode submits to queue, legacy mode still works

Refs #8 #9 #10 #11
- Add ModeCoordinator that manages TimeMode and IssueMode
- Implement unified shutdown: signal modes, drain queue, wait for active tasks
- Extend spawner integration to dispatch tasks from queue to agents
- Add stall detection: log warning when queue exceeds threshold
- Add comprehensive tests for dual mode, shutdown coordination, and stall detection

Refs #12
Create comprehensive E2E test suite in tests/e2e_tests.rs:
- test_dual_mode_operation: verify both time and issue tasks processed
- test_time_mode_only: legacy config compatibility
- test_issue_mode_only: issue-only config verification
- test_fairness_under_load: no starvation between task types
- test_graceful_shutdown: clean termination with queue draining
- test_stall_detection: warning logged when queue exceeds threshold
- Additional tests for concurrency limits, prioritization, and backward compatibility

Uses mock tracker and avoids real API calls.

Refs #13
- Create src/compat.rs: Symphony compatibility layer with type aliases, adapters
- Add migration helpers for enabling dual mode from legacy configs
- Create MIGRATION.md with comprehensive migration guide
- Update CLAUDE.md with dual mode architecture description
- Include backward compatibility notes and configuration examples
- Export compat module from lib.rs

Refs #14
- Run cargo test --workspace (orchestrator tests pass)
- Fix all warnings in orchestrator crate
- Create CHANGELOG.md entry for v1.9.0 release
- Verify backward compatibility with test_backward_compatibility
- Build release binary: cargo build --release -p terraphim_orchestrator
- All 142 tests pass in orchestrator crate

Release includes:
- Dual mode orchestrator with TimeMode and IssueMode
- Unified dispatch queue with fairness
- Stall detection and graceful shutdown
- Symphony compatibility layer
- Comprehensive E2E test suite
- Migration documentation

Refs #15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create article index works differently when called from Axum server and from ripgrep middleware

1 participant