Self-Evolution

godotz.ai implements a Darwin Gödel Machine (DGM) loop: the system can propose modifications to its own configuration, validate them in a sandbox, and, with human approval, promote them to production. This page covers the concrete mechanisms: pheromone trails, EMA tracking, the context engine, plan caching, regression detection, and cascade optimization.

The DGM Loop

measure performance
      ↓
identify improvement candidate  ←─ pheromone trails
      ↓
generate modification proposal
      ↓
sandbox validation              ←─ regression detection
      ↓
human approval gate
      ↓
promote to production           ←─ cascade optimization
      ↓
EMA tracking (observe effect)
      ↓
(loop)

Each phase has a fail-closed exit: if the phase produces an error or a regression, the loop halts and the operator is notified. No modification ever reaches production without passing sandbox validation and explicit human approval.
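
The fail-closed behavior can be illustrated with a minimal sketch. The names `run_iteration`, `Phase`, and `notify` are hypothetical and not part of godotz.ai; the sketch only assumes that each phase is a callable that raises on error or regression.

```python
from typing import Callable, Optional

# Hypothetical phase signature: takes the loop state, returns the updated state.
Phase = Callable[[dict], dict]


def run_iteration(phases: list, notify: Callable[[str], None]) -> Optional[dict]:
    """Run (name, phase) pairs in order; halt and notify on the first failure."""
    state: dict = {}
    for name, phase in phases:
        try:
            state = phase(state)
        except Exception as exc:  # any error or detected regression halts the loop
            notify(f"DGM loop halted in phase '{name}': {exc}")
            return None  # fail closed: later phases never run, nothing is promoted
    return state
```

Returning `None` instead of a partial state is a deliberate choice in this sketch: a caller cannot promote a result that does not exist.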

Pheromone Trails

Pheromone trails are the signal source for the DGM loop. Every agent action deposits a trace with an outcome score. High-density, high-score trails indicate effective paths; sparse or low-score trails signal areas for improvement.

Trail Structure

{
  "trail_id": "tr-20260607-a3f1",
  "agent": "agent/codex",
  "action": "tool_call",
  "tool": "bash",
  "args": {"command": "cargo test --workspace"},
  "outcome": {
    "score": 0.87,
    "latency_ms": 4200,
    "tokens": 312,
    "result": "success"
  },
  "context_hash": "sha256:a4b2...",
  "timestamp": "2026-06-07T14:32:00Z"
}

Scores range from 0.0 (complete failure) to 1.0 (perfect outcome). The scoring function is configurable per action type:

# evolution.config.yaml
scoring:
  tool_call:
    success_base: 0.7
    bonus:
      fast: 0.1        # latency < p50
      cached: 0.15     # response from cache
    penalty:
      retry: -0.2      # required at least one retry
      timeout: -0.4    # hit the step timeout
  model_call:
    success_base: 0.6
    bonus:
      cache_hit: 0.2
    penalty:
      budget_reject: -0.5
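
As a rough illustration of how such a config could be turned into a score, the sketch below adds every matching bonus and penalty to the base value and clamps the result to the 0.0 to 1.0 range. The function `score_action`, the `flags` set, and the choice to start failed actions at 0.0 are assumptions made for this example; the page does not specify how the scorer combines these values.

```python
# Values mirror the tool_call section of the evolution.config.yaml example above.
TOOL_CALL_SCORING = {
    "success_base": 0.7,
    "bonus": {"fast": 0.1, "cached": 0.15},
    "penalty": {"retry": -0.2, "timeout": -0.4},
}


def score_action(config: dict, succeeded: bool, flags: set) -> float:
    """Hypothetical scorer: base score plus matching bonuses and penalties."""
    # Assumption: a failed action starts from 0.0 rather than success_base.
    score = config["success_base"] if succeeded else 0.0
    for name, delta in config.get("bonus", {}).items():
        if name in flags:
            score += delta
    for name, delta in config.get("penalty", {}).items():
        if name in flags:
            score += delta  # penalties are already negative in the config
    # Clamp so the result stays within the documented 0.0 to 1.0 range.
    return max(0.0, min(1.0, score))
```

With these values, a successful call that was both fast and cached scores about 0.95, and a successful call that needed a retry scores about 0.5.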

Trail Aggregation

Trails are aggregated into a path graph. On frequently traversed paths, the lowest-scoring node becomes the candidate for optimization:

path: plan → fetch → parse → analyze → report
      (0.82)  (0.91)  (0.78)  (0.65)   (0.88)

lowest score node: analyze (0.65) → candidate for improvement
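
The selection step can be sketched as follows. Averaging the scores per node is an assumption for illustration (the page does not say how trail scores are combined), and `aggregate_scores` and `improvement_candidate` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean


def aggregate_scores(trails) -> dict:
    """Group (node, score) observations by node and average them.

    Assumption: a plain mean; the real aggregation may weight by recency.
    """
    by_node = defaultdict(list)
    for node, score in trails:
        by_node[node].append(score)
    return {node: mean(scores) for node, scores in by_node.items()}


def improvement_candidate(node_scores: dict) -> str:
    """Return the node with the lowest aggregated score."""
    return min(node_scores, key=node_scores.get)


# Node scores taken from the example path above.
PATH_SCORES = {"plan": 0.82, "fetch": 0.91, "parse": 0.78, "analyze": 0.65, "report": 0.88}
```

Calling `improvement_candidate(PATH_SCORES)` returns `"analyze"`, matching the example path.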

EMA Performance Tracking

The system tracks model and agent performance using an exponential moving average (EMA). This gives recent observations higher weight without discarding history.

EMA_new = α * score_new + (1 - α) * EMA_prev

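godotz.ai uses α = 0.15 by default (configurable), so each new observation contributes 15% of the updated EMA value. Under the common span convention N = 2/α − 1 ≈ 12.3, roughly the last 12 to 13 observations dominate the average; older observations still count, with exponentially decaying weight.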
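
The update rule translates directly into code. The sketch below is illustrative only; `ema_update` and `effective_span` are hypothetical helper names, and the span formula is the standard N = 2/α − 1 convention rather than anything specific to godotz.ai.

```python
def ema_update(ema_prev: float, score_new: float, alpha: float = 0.15) -> float:
    """Apply EMA_new = alpha * score_new + (1 - alpha) * EMA_prev."""
    return alpha * score_new + (1 - alpha) * ema_prev


def effective_span(alpha: float) -> float:
    """Approximate number of recent observations that dominate the EMA."""
    return 2 / alpha - 1


def ema_over(scores, initial: float, alpha: float = 0.15) -> float:
    """Fold a sequence of scores into a running EMA, oldest first."""
    ema = initial
    for score in scores:
        ema = ema_update(ema, score, alpha)
    return ema
```

For example, starting from an EMA of 0.83, a single perfect observation (1.0) moves the value to 0.15 × 1.0 + 0.85 × 0.83 = 0.8555.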

Per-Model EMA

# Tracked automatically in .omc/evolution/model-ema.yaml
glm-5.1:
  ema: 0.83
  sample_count: 1420
  last_updated: "2026-06-07T14:00:00Z"
glm-4.7:
  ema: 0.79
  sample_count: 340
  last_updated: "2026-06-07T14:00:00Z"
claude-sonnet-4-6:
  ema: 0.91
  sample_count: 210
  last_updated: "2026-06-07T14:00:00Z"

When a model’s EMA drops below the fallback_threshold (default 0.60), it is automatically demoted in routing priority and the next model in the fallback chain is preferred:

evolution:
  fallback_threshold: 0.60
  fallback_window: 50        # demotion triggered after 50-sample EMA drop
  recovery_window: 200       # samples required to restore original priority

Fallback Chain

primary: glm-5.1 (EMA 0.83)
  ↓ if EMA < 0.60
fallback-1: glm-4.7 (EMA 0.79)
  ↓ if EMA < 0.60
fallback-2: claude-sonnet-4-6 (EMA 0.91)
  ↓ if unavailable
fallback-3: ollama/mistral-7b (local)
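
A simplified sketch of walking this chain is shown below. It is a model of the selection rule only: `ModelState` and `pick_model` are hypothetical names, the local model is assumed to have no tracked EMA, and the `fallback_window` and `recovery_window` settings are deliberately not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelState:
    name: str
    ema: Optional[float]  # None means no tracked EMA (assumed for the local fallback)
    available: bool = True


def pick_model(chain: list, fallback_threshold: float = 0.60) -> Optional[str]:
    """Return the first available model whose EMA is at or above the threshold."""
    for model in chain:
        if not model.available:
            continue
        if model.ema is None or model.ema >= fallback_threshold:
            return model.name
    return None  # nothing usable: the caller should fail closed


# Chain and EMA values taken from the example above.
CHAIN = [
    ModelState("glm-5.1", 0.83),
    ModelState("glm-4.7", 0.79),
    ModelState("claude-sonnet-4-6", 0.91),
    ModelState("ollama/mistral-7b", None),
]
```

With the values shown, `pick_model(CHAIN)` returns `"glm-5.1"`; if that model's EMA fell to 0.55, the same call would return `"glm-4.7"`.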

Context Engine

The context engine determines what background information to inject into an agent’s context at session start. It uses pheromone trail data and EMA scores to prioritize relevant context.

Context Assembly

1. Identify task type from swarm role and prompt
2. Query Mnemopi for relevant semantic memories
3. Query pheromone trails for high-score paths on similar tasks
4. Query plan cache for reusable sub-plans
5. Assemble context in priority order:
     a. Mandatory (system prompt, role definition)
     b. Task-specific (relevant memories, EMA-ranked)
     c. Opportunistic (cached plans, trail-derived hints)
6. Truncate to fit within context window budget

# evolution.config.yaml
context_engine:
  enabled: true
  max_context_tokens: 8000     # budget for injected context
  memory_threshold: 0.65       # min relevance score to include
  trail_top_k: 5               # top-5 high-score trail paths to inject
  plan_cache_enabled: true
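
Steps 5 and 6 of the assembly procedure can be sketched as a priority sort followed by a greedy fill of the token budget. Everything here is an assumption for illustration: `ContextItem`, the numeric tiers, and the decision to skip (rather than cut) items that do not fit are not taken from the godotz.ai source, and `trail_top_k` is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    tier: int     # 0 = mandatory, 1 = task-specific, 2 = opportunistic
    rank: float   # relevance or trail score; higher is better
    tokens: int   # token cost of injecting this item


def assemble_context(items, max_context_tokens=8000, memory_threshold=0.65):
    """Order items by tier, then by rank, and keep what fits in the budget."""
    # Assumption: memory_threshold filters task-specific memories only.
    eligible = [i for i in items if i.tier != 1 or i.rank >= memory_threshold]
    ordered = sorted(eligible, key=lambda i: (i.tier, -i.rank))
    selected, used = [], 0
    for item in ordered:
        # Greedy fill: an item that would exceed the budget is skipped whole.
        if used + item.tokens <= max_context_tokens:
            selected.append(item)
            used += item.tokens
    return selected
```

Skipping oversized items keeps every injected entry intact, at the cost of sometimes leaving part of the budget unused; cutting the last item to fit is an equally plausible reading of "truncate".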

Context Injection Format

<!-- context-engine: task=code-review, score=0.88 -->
**High-performance patterns for this task type:**
- Use `lsp references` before modifying exported symbols (trail score: 0.94)
- Run `bd status --filter feat/` to identify related open tasks (trail score: 0.89)

**Relevant memory:**
- "Always check `src/middleware/` before adding new middleware" (relevance: 0.92)

Plan Cache

The plan cache stores successful execution plans keyed by task fingerprint. When a new task closely matches a cached plan, the context engine injects the cached plan as a starting point rather than having the orchestrator replan from scratch.

task fingerprint = SHA-256(
  role + goal_embedding_quantized + input_schema_hash
)
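
One way to compute such a fingerprint is sketched below using Python's standard `hashlib`. The length prefix on each field and the `quantize` helper are assumptions added for this example; the page only states that the three inputs are combined and hashed with SHA-256.

```python
import hashlib


def quantize(embedding, levels: int = 16) -> bytes:
    """Hypothetical quantizer: bucket floats in [-1, 1] into 0..levels-1.

    Coarse buckets let near-identical goal embeddings produce equal bytes.
    """
    buckets = []
    for x in embedding:
        bucket = int((x + 1) / 2 * levels)
        buckets.append(max(0, min(levels - 1, bucket)))  # clamp out-of-range values
    return bytes(buckets)


def task_fingerprint(role: str, goal_embedding_quantized: bytes, input_schema_hash: str) -> str:
    """SHA-256 over the three fields, hex-encoded."""
    digest = hashlib.sha256()
    for part in (role.encode("utf-8"), goal_embedding_quantized, input_schema_hash.encode("utf-8")):
        # Assumption: length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.hexdigest()
```

Because the hash is exact-match, any "closely matches" behavior has to come from the quantization step: two goals only share a fingerprint when their quantized embeddings are byte-identical.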

# .omc/evolution/plan-cache/
feat-auth-jwt-a3f1b2.yaml      # cached plan, score 0.88
fix-redis-timeout-c9d4e5.yaml  # cached plan, score 0.91
feat-deploy-k8s-0fa3b1.yaml    # cached plan, score 0.76

Cache entry format:

# .omc/evolution/plan-cache/feat-auth-jwt-a3f1b2.yaml
fingerprint: "a3f1b2..."
goal: "implement JWT authentication"
score: 0.88
use_count: 7
last_used: "2026-06-06T11:22:00Z"
plan:
  - step: investigate existing auth
    tools: [lsp, search]
  - step: implement token generation
    tools: [edit, bash]
  - step: write tests
    tools: [write, bash]
  - step: verify
    tools: [bash, lsp]

Cached plans are suggestions, not constraints. The orchestrator can deviate if the task context differs.

Regression Detection

Before any modification is promoted, the regression detector runs the full benchmark suite against the candidate change in a sandboxed environment.

Regression Gate

candidate change
  → sandbox clone of workspace
  → run benchmark suite
  → compare scores to production baseline
  → if any benchmark regresses > threshold: REJECT
  → if all benchmarks pass: APPROVE for human gate

evolution:
  regression:
    sandbox: "docker"            # or "nix-sandbox"
    threshold: -0.05             # max allowed score drop (5%)
    required_benchmarks:
      - name: core-tools
        weight: 1.0              # regression here always blocks
      - name: model-routing
        weight: 0.8
      - name: memory-recall
        weight: 0.6              # regression here blocks only if > 8%
    timeout_seconds: 600

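A promotion that causes even a single required benchmark to regress by more than its allowed drop is automatically rejected and logged to .omc/evolution/rejections/. The threshold is compared against the absolute score delta (candidate minus baseline); as the config comments indicate, lower-weight benchmarks tolerate a larger drop before blocking.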
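
The gate can be sketched as a per-benchmark comparison. Dividing the threshold by the weight is an inference from the config comments (0.05 / 0.6 ≈ 0.083, which lines up with "blocks only if > 8%"), not a documented formula, and `regression_gate` is a hypothetical name.

```python
# Weights taken from the required_benchmarks example above.
WEIGHTS = {"core-tools": 1.0, "model-routing": 0.8, "memory-recall": 0.6}


def regression_gate(baseline: dict, candidate: dict, weights: dict, threshold: float = -0.05):
    """Return ("REJECT" | "APPROVE", failures) for a candidate change."""
    failures = []
    for name, weight in weights.items():
        if name not in candidate or name not in baseline:
            failures.append((name, "missing result"))  # fail closed on missing data
            continue
        delta = candidate[name] - baseline[name]  # absolute score delta
        allowed = threshold / weight  # assumption: lower weight tolerates a larger drop
        if delta < allowed:
            failures.append((name, round(delta, 4)))
    return ("REJECT" if failures else "APPROVE", failures)
```

Using the numbers from the rejection log example (core-tools falling from 0.89 to 0.81), the delta of -0.08 is below the allowed -0.05, so the sketch returns `"REJECT"`. An `"APPROVE"` result only forwards the proposal to the human gate.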

Rejection Log

# .omc/evolution/rejections/2026-06-07-r42.yaml
timestamp: "2026-06-07T15:44:00Z"
proposal_id: "prop-a3f2c1"
description: "Increase glm-5.1 concurrency to 15"
reason: regression
benchmarks:
  - name: core-tools
    baseline: 0.89
    candidate: 0.81
    delta: -0.08               # exceeded -0.05 threshold
rejection_message: >
  Benchmark 'core-tools' regressed 8% (0.89 → 0.81), exceeding the
  -5% threshold. Candidate rejected. Increase was likely causing
  gateway queue saturation. Try concurrency 12 instead.

Cascade Optimization

When a modification is promoted, cascade optimization applies the change across all dependent components without requiring manual updates:

promoted change: update glm-5.1 concurrency limit to 12
  → update litellm_config.yaml
  → update swarm defaults maxConcurrency
  → update EMA fallback thresholds (scale by concurrency ratio)
  → invalidate plan cache entries that assumed old concurrency
  → reload gateway without downtime (graceful restart)

evolution:
  cascade:
    enabled: true
    auto_reload: true          # reload gateway/services on promotion
    cache_invalidation: true   # clear plan cache entries affected by change
    notify:
      - ntfy: "omp-evolution"  # send notification on promotion

Cascade changes are atomic: either all dependent updates succeed or the promotion is rolled back to the pre-change snapshot.
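
The all-or-nothing behavior can be modeled by applying every dependent update to a working copy and committing only if none of them fails. This is a simplified in-memory sketch; `apply_cascade` and the step functions are hypothetical, and a real cascade would also have to cover files and running services.

```python
import copy


def apply_cascade(config: dict, steps: list) -> tuple:
    """Apply each step to a copy of config; return (config, status).

    On any failure the untouched original is returned, which plays the
    role of the pre-change snapshot.
    """
    working = copy.deepcopy(config)
    for step in steps:
        try:
            step(working)  # each step mutates the working copy in place
        except Exception as exc:
            return config, f"rolled back: {exc}"
    return working, "promoted"


def set_litellm_concurrency(cfg: dict) -> None:
    cfg["litellm"]["glm-5.1"]["concurrency"] = 12


def set_swarm_concurrency(cfg: dict) -> None:
    cfg["swarm"]["maxConcurrency"] = 12
```

Working on a copy means a failed step never leaves the live configuration half-updated, which is the property the atomicity guarantee describes.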

Human Approval Gate

No modification reaches production without a human approval step. The approval interface is a CLI prompt:

omc evolution review

# Output:
# Pending proposals:
#
# [1] prop-a3f2c1: Increase glm-5.1 concurrency to 12
#     Benchmark delta: +3.2% core-tools, +1.8% model-routing
#     Risk: LOW | Est. cost impact: -$2.40/day
#
# [2] prop-b9c4d2: Add plan cache entry for auth tasks
#     Benchmark delta: +5.1% core-tools
#     Risk: LOW | Est. cost impact: -$0.80/day
#
# Approve [1]? (y/N/details):

Approved proposals are tagged with the approver’s identity and timestamp before promotion. This record is stored in .omc/evolution/promoted/ indefinitely and is included in Langfuse audit traces.

Rollback

Every promotion creates a versioned snapshot before applying changes:

omc evolution history            # list promotions with scores
omc evolution rollback <id>      # atomic rollback, completes in <60s

Rollback restores all cascade-updated components simultaneously. In-flight requests complete against the old configuration; new requests see the rolled-back state immediately.

Architecture Overview — self-evolution operates across all layers
Memory — Mnemopi entries feed the context engine
Orchestration — swarm YAML is the primary evolution target
Task Graph — beads hooks trigger pheromone trail deposits