Self-Evolution
godotz.ai implements a Darwin Gödel Machine (DGM) loop: the system can propose modifications to its own configuration, validate them in a sandbox, and, with human approval, promote them to production. This page covers the concrete mechanisms in the order they appear below: pheromone trails, EMA tracking, the context engine, plan caching, regression detection, and cascade optimization.
The DGM Loop
measure performance
↓
identify improvement candidate ←─ pheromone trails
↓
generate modification proposal
↓
sandbox validation ←─ regression detection
↓
human approval gate
↓
promote to production ←─ cascade optimization
↓
EMA tracking (observe effect)
↓
(loop)
Each phase has a fail-closed exit: if the phase produces an error or a regression, the loop halts and the operator is notified. No modification ever reaches production without passing sandbox validation and explicit human approval.
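The fail-closed behaviour described above can be sketched as a small driver. This is only an illustration under stated assumptions, not godotz.ai's implementation: `Phase`, `PhaseFailed`, `run_loop`, and the `notify` callback are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Phase:
    name: str
    run: Callable[[dict], dict]  # takes the loop state, returns the updated state


class PhaseFailed(Exception):
    """Raised by a phase to signal an error or a detected regression."""


def run_loop(phases: List[Phase], state: dict,
             notify: Callable[[str], None]) -> Optional[dict]:
    """Run phases in order; halt and notify the operator on the first failure."""
    for phase in phases:
        try:
            state = phase.run(state)
        except Exception as exc:  # fail closed: any error stops the loop
            notify(f"halted at '{phase.name}': {exc}")
            return None
    return state


def _reject(state: dict) -> dict:
    raise PhaseFailed("benchmark regressed")


# Hypothetical three-phase run in which sandbox validation rejects the change.
messages: List[str] = []
phases = [
    Phase("measure", lambda s: {**s, "measured": True}),
    Phase("sandbox validation", _reject),
    Phase("promote", lambda s: {**s, "promoted": True}),
]
result = run_loop(phases, {}, messages.append)
```

Because the loop returns as soon as a phase raises, the `promote` phase never runs in this example.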
Pheromone Trails
Pheromone trails are the signal source for the DGM loop. Every agent action deposits a trace with an outcome score. High-density, high-score trails indicate effective paths; sparse or low-score trails signal areas for improvement.
Trail Structure
{
  "trail_id": "tr-20260607-a3f1",
  "agent": "agent/codex",
  "action": "tool_call",
  "tool": "bash",
  "args": {"command": "cargo test --workspace"},
  "outcome": {
    "score": 0.87,
    "latency_ms": 4200,
    "tokens": 312,
    "result": "success"
  },
  "context_hash": "sha256:a4b2...",
  "timestamp": "2026-06-07T14:32:00Z"
}
Scores range from 0.0 (complete failure) to 1.0 (perfect outcome). The scoring function is configurable per action type:
# evolution.config.yaml
scoring:
  tool_call:
    success_base: 0.7
    bonus:
      fast: 0.1        # latency < p50
      cached: 0.15     # response from cache
    penalty:
      retry: -0.2      # required at least one retry
      timeout: -0.4    # hit the step timeout
  model_call:
    success_base: 0.6
    bonus:
      cache_hit: 0.2
    penalty:
      budget_reject: -0.5
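One plausible reading of this config is "start from `success_base`, add every bonus and penalty that applies, then clamp to the 0.0 to 1.0 range". The sketch below follows that reading. The function name `score_action`, the flag-based interface, and the rule that a failed action scores 0.0 are assumptions made for illustration; the source does not spell out how the values are combined.

```python
from typing import Dict, Iterable

# Mirrors the tool_call block of the config above.
TOOL_CALL: Dict = {
    "success_base": 0.7,
    "bonus": {"fast": 0.1, "cached": 0.15},
    "penalty": {"retry": -0.2, "timeout": -0.4},
}


def score_action(cfg: Dict, success: bool, flags: Iterable[str]) -> float:
    """Combine base, bonuses, and penalties into a score in [0.0, 1.0]."""
    if not success:
        return 0.0  # assumption: a failed action scores 0.0
    score = cfg["success_base"]
    for flag in flags:
        score += cfg["bonus"].get(flag, 0.0)
        score += cfg["penalty"].get(flag, 0.0)  # penalties are already negative
    # Clamp to the documented range and round away float noise.
    return round(min(1.0, max(0.0, score)), 4)


fast_cached = score_action(TOOL_CALL, True, ["fast", "cached"])
retried = score_action(TOOL_CALL, True, ["retry"])
```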
Trail Aggregation
Trails are aggregated into a path graph. Frequently traversed paths with high scores become candidates for optimization:
path: plan (0.82) → fetch (0.91) → parse (0.78) → analyze (0.65) → report (0.88)
lowest-score node: analyze (0.65) → candidate for improvement
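Selecting the candidate from aggregated trails can be sketched as follows. The per-node mean is an assumed aggregation, since the source does not say how trail scores are combined, and `node_scores` and `improvement_candidate` are hypothetical helper names.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def node_scores(trails: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the outcome scores deposited on each path node."""
    buckets = defaultdict(list)
    for node, score in trails:
        buckets[node].append(score)
    return {node: mean(scores) for node, scores in buckets.items()}


def improvement_candidate(scores: Dict[str, float]) -> str:
    """Return the node with the lowest aggregated score."""
    return min(scores, key=scores.get)


# Node scores taken from the example path above.
path = {"plan": 0.82, "fetch": 0.91, "parse": 0.78, "analyze": 0.65, "report": 0.88}
candidate = improvement_candidate(path)
```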
EMA Performance Tracking
The system tracks model and agent performance using Exponential Moving Average (EMA). This gives recent observations higher weight without discarding history.
EMA_new = α * score_new + (1 - α) * EMA_prev
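godotz.ai uses α = 0.15 by default (configurable). Each new observation therefore contributes 15% of the updated EMA, and the previous EMA contributes the remaining 85%. By the common rule of thumb N ≈ 2/α − 1, the average behaves roughly like a window over the last 12 to 13 observations.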
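The update rule is a one-liner. The sketch below implements the formula above with the default α and also computes the rule-of-thumb window size 2/α − 1; the function names are illustrative only.

```python
def ema_update(prev: float, score: float, alpha: float = 0.15) -> float:
    """EMA_new = alpha * score_new + (1 - alpha) * EMA_prev."""
    return alpha * score + (1 - alpha) * prev


def effective_window(alpha: float) -> float:
    """Approximate number of recent samples that dominate the EMA."""
    return 2 / alpha - 1


# A perfect score nudges an EMA of 0.83 up by 15% of the gap to 1.0.
updated = ema_update(0.83, 1.0)
window = effective_window(0.15)
```

With α = 0.15, an EMA of 0.83 followed by a score of 1.0 moves to about 0.8555, and the effective window is about 12.3 samples.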
Per-Model EMA
# Tracked automatically in .omc/evolution/model-ema.yaml
glm-5.1:
  ema: 0.83
  sample_count: 1420
  last_updated: "2026-06-07T14:00:00Z"
glm-4.7:
  ema: 0.79
  sample_count: 340
  last_updated: "2026-06-07T14:00:00Z"
claude-sonnet-4-6:
  ema: 0.91
  sample_count: 210
  last_updated: "2026-06-07T14:00:00Z"
When a model’s EMA drops below the fallback_threshold (default 0.60), it is automatically demoted in routing priority and the next model in the fallback chain is preferred:
evolution:
  fallback_threshold: 0.60
  fallback_window: 50      # demotion triggered after 50-sample EMA drop
  recovery_window: 200     # samples required to restore original priority
Fallback Chain
primary: glm-5.1 (EMA 0.83)
↓ if EMA < 0.60
fallback-1: glm-4.7 (EMA 0.79)
↓ if EMA < 0.60
fallback-2: claude-sonnet-4-6 (EMA 0.91)
↓ if unavailable
fallback-3: ollama/mistral-7b (local)
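The chain above can be read as "prefer the first available model whose EMA is at or above the threshold". The sketch below follows that reading and treats demotion as moving a model behind the healthy ones rather than removing it. That ordering, the treatment of models without an EMA as healthy, and the names `routing_order` and `pick_model` are assumptions; `fallback_window` and `recovery_window` are not modelled.

```python
from typing import Dict, List, Optional, Set

CHAIN: List[str] = ["glm-5.1", "glm-4.7", "claude-sonnet-4-6", "ollama/mistral-7b"]
EMA: Dict[str, float] = {"glm-5.1": 0.83, "glm-4.7": 0.79, "claude-sonnet-4-6": 0.91}


def routing_order(chain: List[str], ema: Dict[str, float],
                  threshold: float = 0.60) -> List[str]:
    """Healthy models first, demoted models last; chain order kept within each group."""
    # sorted() is stable and False sorts before True, so this only moves
    # below-threshold models to the back. Models with no EMA count as healthy.
    return sorted(chain, key=lambda m: ema.get(m, threshold) < threshold)


def pick_model(chain: List[str], ema: Dict[str, float], available: Set[str],
               threshold: float = 0.60) -> Optional[str]:
    """Return the highest-priority model that is currently available."""
    for model in routing_order(chain, ema, threshold):
        if model in available:
            return model
    return None


healthy_choice = pick_model(CHAIN, EMA, set(CHAIN))
degraded = {**EMA, "glm-5.1": 0.55}  # hypothetical drop below the threshold
degraded_choice = pick_model(CHAIN, degraded, set(CHAIN))
```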
Context Engine
The context engine determines what background information to inject into an agent’s context at session start. It uses pheromone trail data and EMA scores to prioritize relevant context.
Context Assembly
1. Identify task type from swarm role and prompt
2. Query Mnemopi for relevant semantic memories
3. Query pheromone trails for high-score paths on similar tasks
4. Query plan cache for reusable sub-plans
5. Assemble context in priority order:
   a. Mandatory (system prompt, role definition)
   b. Task-specific (relevant memories, EMA-ranked)
   c. Opportunistic (cached plans, trail-derived hints)
6. Truncate to fit within context window budget
# evolution.config.yaml
context_engine:
  enabled: true
  max_context_tokens: 8000    # budget for injected context
  memory_threshold: 0.65      # min relevance score to include
  trail_top_k: 5              # top-5 high-score trail paths to inject
  plan_cache_enabled: true
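Steps 5 and 6 of the assembly procedure amount to a priority sort followed by a budgeted fill. The sketch below is a minimal illustration: the `Item` type, the tier names, the greedy "skip what does not fit" rule, and applying `memory_threshold` to every non-mandatory item are assumptions, and token counts are taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import List

# Lower number means higher priority (steps 5a to 5c).
PRIORITY = {"mandatory": 0, "task": 1, "opportunistic": 2}


@dataclass
class Item:
    tier: str          # "mandatory", "task", or "opportunistic"
    text: str
    tokens: int        # token count supplied by the caller
    relevance: float = 1.0


def assemble(items: List[Item], max_context_tokens: int = 8000,
             memory_threshold: float = 0.65) -> List[Item]:
    """Order items by tier, then relevance, and keep what fits the budget."""
    eligible = [i for i in items
                if i.tier == "mandatory" or i.relevance >= memory_threshold]
    ordered = sorted(eligible, key=lambda i: (PRIORITY[i.tier], -i.relevance))
    chosen, used = [], 0
    for item in ordered:
        if used + item.tokens > max_context_tokens:
            continue  # does not fit; try smaller, lower-priority items
        chosen.append(item)
        used += item.tokens
    return chosen


items = [
    Item("opportunistic", "cached plan", 300, 0.90),
    Item("mandatory", "system prompt", 500),
    Item("task", "memory A", 400, 0.92),
    Item("task", "memory B", 400, 0.50),  # below memory_threshold
]
selected = [i.text for i in assemble(items, max_context_tokens=1000)]
```

With a 1000-token budget, the system prompt and the high-relevance memory fit, the low-relevance memory is filtered out, and the cached plan is dropped for lack of space.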
Context Injection Format
<!-- context-engine: task=code-review, score=0.88 -->
**High-performance patterns for this task type:**
- Use `lsp references` before modifying exported symbols (trail score: 0.94)
- Run `bd status --filter feat/` to identify related open tasks (trail score: 0.89)
**Relevant memory:**
- "Always check `src/middleware/` before adding new middleware" (relevance: 0.92)
Plan Cache
The plan cache stores successful execution plans keyed by task fingerprint. When a new task closely matches a cached plan, the context engine injects the cached plan as a starting point rather than having the orchestrator replan from scratch.
task fingerprint = SHA-256(
role + goal_embedding_quantized + input_schema_hash
)
# .omc/evolution/plan-cache/
feat-auth-jwt-a3f1b2.yaml # cached plan, score 0.88
fix-redis-timeout-c9d4e5.yaml # cached plan, score 0.91
feat-deploy-k8s-0fa3b1.yaml # cached plan, score 0.76
Cache entry format:
# .omc/evolution/plan-cache/feat-auth-jwt-a3f1b2.yaml
fingerprint: "a3f1b2..."
goal: "implement JWT authentication"
score: 0.88
use_count: 7
last_used: "2026-06-06T11:22:00Z"
plan:
  - step: investigate existing auth
    tools: [lsp, search]
  - step: implement token generation
    tools: [edit, bash]
  - step: write tests
    tools: [write, bash]
  - step: verify
    tools: [bash, lsp]
Cached plans are suggestions, not constraints. The orchestrator can deviate if the task context differs.
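The fingerprint formula in this section can be sketched with the standard library. The source specifies only SHA-256 over the role, a quantized goal embedding, and the input schema hash; the grid-based `quantize` step, the length-prefixed encoding, and the function names are assumptions made for illustration.

```python
import hashlib
from typing import Sequence


def quantize(embedding: Sequence[float], step: float = 0.1) -> bytes:
    """Snap each component to a coarse grid so near-identical goals hash alike."""
    return b",".join(str(round(x / step)).encode() for x in embedding)


def task_fingerprint(role: str, goal_embedding: Sequence[float],
                     input_schema_hash: str) -> str:
    """SHA-256 over role, quantized goal embedding, and input schema hash."""
    h = hashlib.sha256()
    parts = (role.encode(), quantize(goal_embedding), input_schema_hash.encode())
    for part in parts:
        # Length prefix (an assumption) keeps field boundaries unambiguous.
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.hexdigest()


# Hypothetical two-dimensional embeddings; real embeddings are much larger.
fp_a = task_fingerprint("coder", [0.12, 0.48], "sha256:abc")
fp_b = task_fingerprint("coder", [0.14, 0.51], "sha256:abc")  # same grid cell
fp_c = task_fingerprint("coder", [0.90, 0.48], "sha256:abc")  # different goal
```

Quantizing before hashing is what lets two slightly different goal embeddings map to the same cache key; without it, any change in the embedding would produce an unrelated digest.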
Regression Detection
Before any modification is promoted, the regression detector runs the full benchmark suite against the candidate change in a sandboxed environment.
Regression Gate
candidate change
→ sandbox clone of workspace
→ run benchmark suite
→ compare scores to production baseline
→ if any benchmark regresses > threshold: REJECT
→ if all benchmarks pass: APPROVE for human gate
evolution:
  regression:
    sandbox: "docker"          # or "nix-sandbox"
    threshold: -0.05           # max allowed score drop (5%)
    required_benchmarks:
      - name: core-tools
        weight: 1.0            # regression here always blocks
      - name: model-routing
        weight: 0.8
      - name: memory-recall
        weight: 0.6            # regression here blocks only if > 8%
    timeout_seconds: 600
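A candidate is automatically rejected and logged to .omc/evolution/rejections/ as soon as any required benchmark regresses beyond its allowed drop. The weight comments in the config suggest that the allowed drop scales as threshold / weight (0.05 / 0.6 ≈ 8% for memory-recall), so lower-weight benchmarks tolerate a somewhat larger regression before blocking.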
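The gate can be sketched as a comparison of per-benchmark deltas against a weight-adjusted limit. Dividing `threshold` by `weight` is an assumption inferred from the "blocks only if > 8%" comment on memory-recall (0.05 / 0.6 ≈ 0.083), not a documented formula, and `regression_gate` is a hypothetical name. A benchmark missing from the candidate results is treated as a failure, in keeping with the fail-closed design.

```python
from typing import Dict, List, Tuple

THRESHOLD = -0.05
WEIGHTS: Dict[str, float] = {
    "core-tools": 1.0,
    "model-routing": 0.8,
    "memory-recall": 0.6,
}


def regression_gate(baseline: Dict[str, float], candidate: Dict[str, float],
                    weights: Dict[str, float] = WEIGHTS,
                    threshold: float = THRESHOLD) -> Tuple[bool, List[str]]:
    """Return (passed, failing benchmark names)."""
    failures = []
    for name, weight in weights.items():
        if name not in candidate or name not in baseline:
            failures.append(name)  # fail closed on missing results
            continue
        delta = candidate[name] - baseline[name]
        allowed = threshold / weight  # assumed scaling: lower weight, larger allowed drop
        if delta < allowed:
            failures.append(name)
    return (not failures, failures)


baseline = {"core-tools": 0.89, "model-routing": 0.80, "memory-recall": 0.70}
# core-tools drops by 0.08, as in the rejection log example in this section.
rejected = regression_gate(baseline, {**baseline, "core-tools": 0.81})
# memory-recall drops by 0.06, inside its weight-adjusted limit of about 0.083.
accepted = regression_gate(baseline, {**baseline, "memory-recall": 0.64})
```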
Rejection Log
# .omc/evolution/rejections/2026-06-07-r42.yaml
timestamp: "2026-06-07T15:44:00Z"
proposal_id: "prop-a3f2c1"
description: "Increase glm-5.1 concurrency to 15"
reason: regression
benchmarks:
  - name: core-tools
    baseline: 0.89
    candidate: 0.81
    delta: -0.08    # exceeded -0.05 threshold
rejection_message: >
  Benchmark 'core-tools' regressed 8% (0.89 → 0.81), exceeding the
  -5% threshold. Candidate rejected. Increase was likely causing
  gateway queue saturation. Try concurrency 12 instead.
Cascade Optimization
When a modification is promoted, cascade optimization applies the change across all dependent components without requiring manual updates:
promoted change: update glm-5.1 concurrency limit to 12
→ update litellm_config.yaml
→ update swarm defaults maxConcurrency
→ update EMA fallback thresholds (scale by concurrency ratio)
→ invalidate plan cache entries that assumed old concurrency
→ reload gateway without downtime (graceful restart)
evolution:
  cascade:
    enabled: true
    auto_reload: true           # reload gateway/services on promotion
    cache_invalidation: true    # clear plan cache entries affected by change
    notify:
      - ntfy: "omp-evolution"   # send notification on promotion
Cascade changes are atomic: either all dependent updates succeed or the promotion is rolled back to the pre-change snapshot.
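The all-or-nothing guarantee can be illustrated with a copy-then-swap pattern: apply every dependent update to a working copy and adopt it only if all of them succeed. This sketch covers an in-memory config only; the real cascade also touches files and running services, and `apply_cascade` and the update functions are hypothetical names.

```python
import copy
from typing import Callable, Dict, List, Tuple

Update = Callable[[Dict], None]


def apply_cascade(config: Dict, updates: List[Update]) -> Tuple[Dict, bool]:
    """Apply all updates to a copy; return (new config, True) or (original, False)."""
    working = copy.deepcopy(config)  # the original acts as the pre-change snapshot
    try:
        for update in updates:
            update(working)
    except Exception:
        return config, False  # roll back: the snapshot was never modified
    return working, True


def set_litellm(cfg: Dict) -> None:
    cfg["litellm"]["concurrency"] = 12


def set_swarm(cfg: Dict) -> None:
    cfg["swarm"]["maxConcurrency"] = 12


def broken_reload(cfg: Dict) -> None:
    raise RuntimeError("gateway reload failed")  # simulated failure


base = {"litellm": {"concurrency": 10}, "swarm": {"maxConcurrency": 10}}
promoted, ok = apply_cascade(base, [set_litellm, set_swarm])
rolled_back, failed_ok = apply_cascade(base, [set_litellm, broken_reload])
```

Working on a deep copy means a failure halfway through leaves no partially updated state to clean up, which is the property the atomicity guarantee relies on.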
Human Approval Gate
No modification reaches production without a human approval step. The approval interface is a CLI prompt:
omc evolution review
# Output:
# Pending proposals:
#
# [1] prop-a3f2c1: Increase glm-5.1 concurrency to 12
# Benchmark delta: +3.2% core-tools, +1.8% model-routing
# Risk: LOW | Est. cost impact: -$2.40/day
#
# [2] prop-b9c4d2: Add plan cache entry for auth tasks
# Benchmark delta: +5.1% core-tools
# Risk: LOW | Est. cost impact: -$0.80/day
#
# Approve [1]? (y/N/details):
Approved proposals are tagged with the approver’s identity and timestamp before promotion. This record is stored in .omc/evolution/promoted/ indefinitely and is included in Langfuse audit traces.
Rollback
Every promotion creates a versioned snapshot before applying changes:
omc evolution history # list promotions with scores
omc evolution rollback <id> # atomic rollback, completes in <60s
Rollback restores all cascade-updated components simultaneously. In-flight requests complete against the old configuration; new requests see the rolled-back state immediately.
Related
- Architecture Overview — self-evolution operates across all layers
- Memory — Mnemopi entries feed the context engine
- Orchestration — swarm YAML is the primary evolution target
- Task Graph — beads hooks trigger pheromone trail deposits