godotz.ai: A Heterogeneous Fleet Harness for Autonomous Multi-Agent Orchestration

Version: 1.0 | Date: June 2026
Classification: Public Technical Reference

Abstract

We present godotz.ai, a production-grade harness for deploying autonomous multi-agent systems across heterogeneous OS fleets. godotz.ai addresses four compounding failure modes in contemporary AI agent tooling: single-model echo chambers, lack of fleet orchestration primitives, opaque cost governance, and unsafe self-modification loops. The system implements a seven-layer architecture (L0–L6) spanning substrate provisioning through knowledge persistence. Key components include a LiteLLM-based model gateway with Redis caching and per-key budget enforcement, Temporal + LangGraph durable orchestration, a beads DAG task scheduler, and a three-tier memory system (Mnemopi session memory, Knowledge Gardener vault, Graphify knowledge graph). Security is enforced through a fail-closed three-gate pipeline: plugin-eval, mcp-scan, and sandboxed execution. Self-evolution follows the Darwin Gödel Machine pattern with sandboxed empirical verification before human approval. Current benchmarks show 8 S-grade and 2 A-grade evaluations across primary capabilities. godotz.ai is designed for fleets ranging from single development machines to 50+ heterogeneous nodes.

1. Introduction

1.1 The State of AI Agent Orchestration in 2026

The adoption of large language models (LLMs) in autonomous agent systems has accelerated dramatically since 2024. Tools such as Claude Code, Cursor, and Codex provide capable individual agent experiences, but lack coordination primitives for fleet-scale deployment. Teams operating production AI workflows face several unresolved challenges:

Single-Model Homogeneity: Most frameworks default all agent roles (orchestrator, critic, executor) to a single model family. Recent findings from the ReConcile paper (Chen et al., 2024) demonstrate that homogeneous model panels exhibit systematic biases that heterogeneous panels reduce by approximately 31%. Despite this evidence, the vast majority of deployed agent systems continue to use single-model configurations.

Fleet Orchestration Gap: Existing tools assume single-machine execution. Coordinating agents across heterogeneous OS environments (x86_64 Linux, aarch64 ARM, macOS) requires substantial bespoke infrastructure that breaks on framework updates.

Cost Governance: API costs at scale are opaque without a unified gateway that enforces per-task, per-team, and per-model budgets before execution. Budget overruns are discovered retroactively from billing statements rather than prevented proactively.

Unsafe Self-Modification: Emerging approaches to self-improving AI systems (inspired by Gödel Machine formalism, Schmidhuber 2003) lack sandboxed empirical verification gates. Without fail-closed governance, self-modifications can corrupt the agent’s own operational tooling.

1.2 godotz.ai’s Contribution

godotz.ai addresses each of these gaps through a composable, layered architecture that treats agent compute as infrastructure: declarative, reproducible, observable, and self-healing. Rather than providing new model capabilities, godotz.ai provides the orchestration harness that makes existing model APIs production-grade at fleet scale.

2. Architecture: The L0–L6 Layer Model

godotz.ai’s architecture is organized into seven layers, each with clearly defined responsibilities and clean interfaces to adjacent layers.

┌─────────────────────────────────────────────────────────────┐
│  L6 — Memory Layer                                          │
│  Mnemopi | Knowledge Gardener v0.21.0 | Graphify KG         │
├─────────────────────────────────────────────────────────────┤
│  L5 — Task Tracking                                         │
│  beads (agent DAG) | taskdog (human Gantt/ETA)              │
├─────────────────────────────────────────────────────────────┤
│  L4 — Agent Runtime                                         │
│  OMP/hermes | 128 skills | 32 agents | 7 MCP servers        │
├─────────────────────────────────────────────────────────────┤
│  L3 — Durable Orchestration                                 │
│  Temporal | LangGraph (supervisor-worker pattern)           │
├─────────────────────────────────────────────────────────────┤
│  L2 — Model Gateway                                         │
│  LiteLLM Proxy | Redis cache | Postgres | Langfuse          │
├─────────────────────────────────────────────────────────────┤
│  L1 — Transport                                             │
│  NATS JetStream (at-least-once delivery)                    │
├─────────────────────────────────────────────────────────────┤
│  L0 — Substrate                                             │
│  Nix flakes | Tailscale mesh | Komodo deployment            │
└─────────────────────────────────────────────────────────────┘

2.1 Layer Interfaces

Each layer exposes a stable interface to the layer above while abstracting its implementation details:

L0→L1: Reliable network paths between nodes
L1→L2: Message delivery guarantees for API request routing
L2→L3: OpenAI-compatible completion API with budget enforcement
L3→L4: Durable workflow primitives for agent task scheduling
L4→L5: Task lifecycle events for tracking and visualization
L5→L6: Read/write access to persistent knowledge stores

This clean separation enables independent evolution of each layer. The team has replaced the transport layer (L1) once and the memory backend (L6) twice without disrupting higher layers.

3. Model Gateway Design

3.1 LiteLLM as De Facto Standard

By 2026, LiteLLM has become a de facto standard model gateway for AI teams. godotz.ai adopts it as the L2 gateway for three reasons:

Provider Neutrality: Single API surface for Anthropic, Google, OpenAI, and custom providers
OpenAI Compatibility: All agents use the OpenAI SDK; provider is a configuration concern, not a code concern
Enterprise Features: Virtual keys, budget enforcement, Redis caching, and Postgres audit logs

3.2 Model Families

godotz.ai operates two primary model families:

Antigravity (Anthropic): claude-opus-4-6 and claude-sonnet-4-6. Used for high-complexity orchestration, critique, and verification tasks. High capability, high cost.

GLM (z.ai): glm-5.1 (concurrency: 10), glm-4.7 (concurrency: 2), glm-4.5-air (concurrency: 5), glm-5-turbo (concurrency: 1). Combined GLM concurrency: 18 simultaneous requests (10 + 2 + 5 + 1). Used for execution, code generation, and high-throughput tasks. Lower cost, optimized for throughput.

Vision Fallback: gemini-3.1-pro-low automatically handles requests containing image inputs when the primary model lacks vision capability.

3.3 Virtual Key Architecture

Virtual keys decouple agents from API credentials:

Agent A (virtual key: "worker-pool")
    ↓  $50/month limit
LiteLLM Proxy
    ↓  model: glm/5.1 routing
z.ai API (real key: server-only)

Key properties:

Agents never hold real API credentials
Budget enforcement happens at the gateway before the API call
Per-key cost attribution enables charge-back to specific swarms
Key rotation is a gateway operation; agents are unaffected
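The indirection above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not LiteLLM's implementation: the Gateway class, its method names, and the placeholder secrets are all hypothetical and exist only to show that an agent holds a virtual key while the real credential stays at the gateway.

```python
from dataclasses import dataclass, field


@dataclass
class Gateway:
    """Hypothetical gateway model: real credentials never leave this object."""

    _secrets: dict[str, str] = field(default_factory=dict)  # provider -> real API key
    _routes: dict[str, str] = field(default_factory=dict)   # virtual key -> provider

    def set_provider_secret(self, provider: str, secret: str) -> None:
        # Used for initial setup and for rotation; agents are not involved.
        self._secrets[provider] = secret

    def issue_virtual_key(self, virtual_key: str, provider: str) -> None:
        self._routes[virtual_key] = provider

    def resolve(self, virtual_key: str) -> tuple[str, str]:
        """Map a virtual key to (provider, real secret); unknown keys are denied."""
        provider = self._routes.get(virtual_key)
        if provider is None or provider not in self._secrets:
            raise PermissionError(f"unknown or unroutable virtual key: {virtual_key}")
        return provider, self._secrets[provider]


gateway = Gateway()
gateway.set_provider_secret("z.ai", "real-key-v1")   # placeholder secret
gateway.issue_virtual_key("worker-pool", "z.ai")

before = gateway.resolve("worker-pool")
gateway.set_provider_secret("z.ai", "real-key-v2")   # rotation happens at the gateway
after = gateway.resolve("worker-pool")               # the agent's virtual key is unchanged
```

The agent-side identifier ("worker-pool") is the same before and after rotation, which is the property the list above relies on.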

3.4 Redis Caching Strategy

Identical prompts within a configurable TTL window return cached responses without API calls. Target cache hit rate: > 40%.

Cache key = SHA-256(model + messages + temperature). Deterministic (temperature=0) requests achieve the highest cache hit rates.
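A minimal sketch of such a key derivation follows. It assumes the three fields are serialized as canonical JSON before hashing; the function name cache_key and the serialization choice are illustrative assumptions, not the gateway's actual scheme.

```python
import hashlib
import json


def cache_key(model: str, messages: list, temperature: float) -> str:
    """Derive a deterministic cache key from model, messages, and temperature."""
    # Canonical JSON (sorted keys, fixed separators) so that logically equal
    # requests serialize to the same bytes regardless of dict ordering.
    payload = json.dumps(
        {
            "model": model,
            "messages": messages,
            # Normalize 0 and 0.0 to the same representation.
            "temperature": float(temperature),
        },
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Without a canonical serialization, two requests that differ only in dictionary key order would hash differently and miss the cache.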

3.5 Budget Enforcement

Budget enforcement is fail-closed: requests that would exceed the budget are rejected with HTTP 429 before reaching the model API. This prevents “bill shock” from runaway agents.

Request arrives at LiteLLM Proxy
    ↓
Budget check: virtual_key.spent + estimated_cost > limit?
    ↓ YES → HTTP 429 immediately
    ↓ NO → Forward to model API
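The check in the diagram above can be sketched as follows. The names VirtualKey, BudgetExceeded, and admit are hypothetical; the sketch only illustrates the fail-closed comparison and does not reproduce LiteLLM's internals.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before any upstream call; maps to HTTP 429 at the gateway."""

    status_code = 429


@dataclass
class VirtualKey:
    name: str
    limit_usd: float
    spent_usd: float = 0.0


def admit(key: VirtualKey, estimated_cost_usd: float) -> None:
    """Return normally if the request may be forwarded, else raise BudgetExceeded."""
    # Written as "not (... <= ...)" so that NaN or otherwise invalid estimates
    # are rejected too: anything that cannot be shown to fit is denied.
    fits = estimated_cost_usd >= 0 and key.spent_usd + estimated_cost_usd <= key.limit_usd
    if not fits:
        raise BudgetExceeded(f"{key.name}: request would exceed ${key.limit_usd:.2f}")
```

For example, with a $50 limit and $49.50 already spent, a request estimated at $0.40 is forwarded while one estimated at $0.60 is rejected.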

4. Durable Orchestration: Temporal + LangGraph

4.1 The Durability Problem

Autonomous agent workflows can run for hours or days. Process crashes, network partitions, and node failures must not abort in-flight work. Naive approaches (e.g., agentic loops in a single process) lose all state on any interruption.

4.2 Temporal for Workflow Durability

Temporal provides durable execution: every workflow step is recorded in a persistent event history. On process restart, Temporal replays that history to restore the exact workflow state. Durability therefore comes from the runtime rather than from checkpointing logic inside the agent.

Key properties for godotz.ai:

Automatic retry: Configurable retry policies per activity
Workflow versioning: In-flight workflows complete on their original version
Signals and queries: External systems can inject events or query state without interrupting execution
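The replay idea can be shown with a toy event log. This is not the Temporal SDK: DurableRun and its file format are hypothetical, and the sketch assumes step results are JSON-serializable. It only illustrates why a restarted process can skip work that is already recorded.

```python
import json
import os
import tempfile
from typing import Callable


class DurableRun:
    """Toy event-sourced runner: completed step results are appended to a log file."""

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.history: dict = {}
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as fh:
                for line in fh:
                    event = json.loads(line)
                    self.history[event["step"]] = event["result"]

    def step(self, name: str, fn: Callable[[], object]) -> object:
        # Replay: a step already in the log is returned from history, not re-run.
        if name in self.history:
            return self.history[name]
        result = fn()
        with open(self.log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"step": name, "result": result}) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # persist before acknowledging the step
        self.history[name] = result
        return result


def demo() -> list:
    executed = []

    def plan() -> str:
        executed.append("plan")
        return "plan-v1"

    def execute() -> str:
        executed.append("execute")
        return "ok"

    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "workflow.jsonl")
        first = DurableRun(log)
        first.step("plan", plan)
        # Simulated crash: a fresh object rebuilds state from the log alone.
        second = DurableRun(log)
        second.step("plan", plan)        # served from the log
        second.step("execute", execute)  # first real execution
    return executed
```

In demo(), the plan step runs once even though it is requested twice, because the second runner finds it in the log.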

4.3 LangGraph for Agent Topology

LangGraph provides the graph-based agent topology layer. godotz.ai uses the standard supervisor-worker pattern:

Supervisor (claude-opus-4-6)
    ├── Worker A (glm-5.1) — code generation
    ├── Worker B (glm-5.1) — file editing
    ├── Worker C (claude-sonnet-4-6) — review
    └── Worker D (glm-4.5-air) — routing/classification

The supervisor orchestrates; workers execute. Cross-worker communication goes through the supervisor, preventing emergent coordination failures.

4.4 Codex Case Study

The Codex swarm (code generation at scale) demonstrates the pattern:

Supervisor receives task: “Implement feature X”
Supervisor creates plan, decomposes into subtasks
Workers execute subtasks in parallel (up to 18 concurrent GLM calls)
Review worker (Claude Sonnet) evaluates outputs
Supervisor synthesizes results and commits

Total throughput: ~500 subtasks/hour per node with the GLM fleet.
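The fan-out step can be sketched with asyncio and a semaphore sized to the combined GLM concurrency from Section 3.2. The function names and the stubbed model call are assumptions for illustration; a real deployment would route each subtask through the gateway and LangGraph rather than through this toy loop.

```python
import asyncio

GLM_CONCURRENCY = 18  # combined GLM limit stated in Section 3.2


async def run_subtask(subtask: str, limit: asyncio.Semaphore, stats: dict) -> str:
    async with limit:  # at most GLM_CONCURRENCY subtasks are in flight
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.001)  # stand-in for a model call
        stats["in_flight"] -= 1
        return f"done:{subtask}"


async def supervise(task: str, n_subtasks: int) -> tuple:
    limit = asyncio.Semaphore(GLM_CONCURRENCY)
    stats = {"in_flight": 0, "peak": 0}
    # Decompose the task into subtasks (stubbed as simple numbering).
    subtasks = [f"{task}#{i}" for i in range(n_subtasks)]
    # Execute subtasks in parallel, bounded by the semaphore.
    results = await asyncio.gather(*(run_subtask(s, limit, stats) for s in subtasks))
    # Review step (stubbed): keep only results that completed.
    approved = [r for r in results if r.startswith("done:")]
    return approved, stats["peak"]
```

Calling asyncio.run(supervise("feature-x", 50)) returns all 50 results while never exceeding 18 concurrent subtasks.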

5. Anti-Echo-Chamber Design

5.1 The ReConcile Finding

Chen et al. (2024) demonstrated through structured debate experiments that panels of agents using different model families reduce systematic error by approximately 31% compared to single-model panels. The mechanism: different training distributions produce different systematic biases that cancel in aggregate when combined through structured debate.

5.2 Model Heterogeneity in Practice

godotz.ai enforces heterogeneity at the configuration level. The swarm YAML explicitly assigns different model families to roles that critique or verify each other:

# Required: a critic and the role it reviews (here, the executor) must use different families
swarm:
  orchestrator:
    model: antigravity/opus    # Anthropic family
  critic:
    model: antigravity/sonnet  # Same family, different capability tier
  executor:
    model: glm/5.1             # Completely different family

The gateway rejects swarm configurations where a critic and the entity being criticized run the same model (identical weights) with the same system prompt.
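A validation of this kind could look like the sketch below. It assumes model identifiers follow the family/model form used in the YAML above; the function names and the three result labels are hypothetical, and system prompts are left out for brevity.

```python
def family(model_id: str) -> str:
    """Return the family prefix of an identifier such as 'glm/5.1'."""
    return model_id.split("/", 1)[0]


def heterogeneity(critic_model: str, target_model: str) -> str:
    """Classify a critic/target pair as 'reject', 'same-family', or 'ok'."""
    if critic_model == target_model:
        return "reject"        # identical model: no independent check at all
    if family(critic_model) == family(target_model):
        return "same-family"   # different tier, but shared training lineage
    return "ok"                # different families


swarm = {
    "orchestrator": "antigravity/opus",
    "critic": "antigravity/sonnet",
    "executor": "glm/5.1",
}
critic_vs_executor = heterogeneity(swarm["critic"], swarm["executor"])
critic_vs_orchestrator = heterogeneity(swarm["critic"], swarm["orchestrator"])
```

For the example swarm, the critic and executor pair is classified as ok, while the critic and orchestrator pair is flagged as same-family.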

5.3 Actor-Critic Pattern

godotz.ai’s actor-critic implementation separates generation from evaluation:

Actor (GLM family): Generates candidate output
Critic (Antigravity family): Evaluates against structured criteria
Arbiter (claude-opus-4-6): Synthesizes actor output and critic feedback into final decision

This three-step pattern, applied recursively, produces outputs where systematic biases from any single model are checked by a model with different training.
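The three steps can be sketched with plain callables standing in for model calls. The Model alias, the function name, and the reading of "recursively" as repeated rounds that feed the previous decision back to the actor are all assumptions made for illustration.

```python
from typing import Callable

Model = Callable[[str], str]  # stand-in for a chat-completion call


def actor_critic(
    task: str, actor: Model, critic: Model, arbiter: Model, rounds: int = 1
) -> str:
    """Run generate, evaluate, synthesize; each round sees the previous decision."""
    decision = ""
    for _ in range(rounds):
        prompt = task if not decision else f"{task}\nPrevious decision: {decision}"
        candidate = actor(prompt)       # actor: a GLM-family model in the text
        feedback = critic(candidate)    # critic: a different model family
        decision = arbiter(f"candidate={candidate}; feedback={feedback}")
    return decision
```

Passing models as callables keeps the pattern independent of any particular SDK, so the same loop works with real gateway clients or with test stubs.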

6. Supply Chain Security

6.1 The Threat Model

AI agents execute arbitrary code via tool calls and MCP servers. A compromised plugin or MCP server is a supply chain attack vector that could exfiltrate secrets, corrupt memory, or establish persistence. CVE-2025-6514, an OS command injection flaw in the mcp-remote client that can be triggered by connecting to an untrusted MCP server, showed that this attack surface is exploitable in practice.

6.2 The Three-Layer Gate

godotz.ai implements a fail-closed three-layer security gate for all plugin and MCP server execution:

Execution Request
      ↓
[Layer 1: plugin-eval]
  • Static analysis of plugin code
  • Signature verification against known-good hash
  • Permission scope validation
      ↓ PASS
[Layer 2: mcp-scan]
  • CVE database lookup for MCP server
  • Dependency tree vulnerability check
  • Network permission audit
      ↓ PASS
[Layer 3: Sandbox]
  • Isolated execution environment
  • Filesystem: read-only except designated working dirs
  • Network: allowlist only
  • CPU/memory limits enforced
      ↓ PASS
Actual Execution

Any layer failure terminates the request and blocks the agent. Fail-closed: when in doubt, deny.
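The fail-closed behavior can be sketched as a short pipeline in which anything other than an explicit pass counts as a denial. The Request fields and the three gate functions are hypothetical stand-ins for plugin-eval, mcp-scan, and the sandbox.

```python
from typing import Callable, NamedTuple


class Request(NamedTuple):
    plugin: str
    signature_ok: bool
    known_cves: int


Gate = Callable[[Request], bool]


def run_gates(request: Request, gates: list) -> tuple:
    """Return (allowed, failed_gate_name); failed_gate_name is '' when allowed."""
    for name, gate in gates:
        try:
            passed = gate(request) is True  # only an explicit True passes
        except Exception:
            passed = False                  # a crashing gate counts as a denial
        if not passed:
            return False, name
    return True, ""


GATES = [
    ("plugin-eval", lambda r: r.signature_ok),
    ("mcp-scan", lambda r: r.known_cves == 0),
    ("sandbox", lambda r: True),  # stand-in for launching the isolated environment
]
```

A gate that raises or returns None is treated the same as a gate that returns False, so an unavailable scanner blocks execution instead of silently letting it through.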

6.3 Postmark-MCP Reference

The postmark-mcp integration serves as the reference implementation for secure MCP server deployment. Its configuration demonstrates allowlist-based network permissions and filesystem sandboxing.

6.4 Langfuse Audit Trail

Every tool call is traced in Langfuse with full provenance:

Calling agent identity
Tool name and arguments
Execution sandbox ID
Result or error
Timestamp and duration

This enables complete incident reconstruction without requiring additional instrumentation.

7. Self-Evolution Loop

7.1 Darwin Gödel Machine Pattern

The Darwin Gödel Machine (DGM) combines two ideas:

Gödel Machine (Schmidhuber 2003): A self-referential system that can modify its own code when it can prove the modification will improve performance
Darwin (evolutionary selection): Competing mutations are selected based on empirical fitness, not formal proof

godotz.ai’s DGM implementation:

Current System State
        ↓
[Mutation Proposal]
  Agent proposes modification (prompt, tool, agent config)
        ↓
[Sandbox Verification]
  Proposed change tested in isolated environment
  Benchmark suite run on mutation candidate
        ↓ Pass (>threshold improvement)
[Human Approval]
  Significant changes (architecture) require human sign-off
        ↓ Approved
[Atomic Promotion]
  Mutation promoted to production state
  Rollback snapshot created before promotion

7.2 What Can Self-Evolve

Component	Self-Modifiable	Approval Required
Agent system prompts	Yes	No (for minor changes)
Skill definitions	Yes	No
Model routing rules	Yes	Yes
Security gates	No	—
Sandbox configuration	No	—

Security gates and sandbox configuration are immutable from within the self-evolution loop. These can only be modified by human operators with direct system access.

7.3 Rollback Guarantee

Every promotion creates a versioned snapshot. Rollback is atomic and completes within 60 seconds.
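The promotion rules and the rollback snapshot can be combined in one small sketch. The POLICY table mirrors Section 7.2; the class, its method names, and the scalar benchmark scores are assumptions, and the approval flag for system prompts is simplified to the "minor change" case.

```python
import copy
from dataclasses import dataclass, field

# component -> (self_modifiable, approval_required), following Section 7.2
POLICY = {
    "agent_system_prompts": (True, False),
    "skill_definitions": (True, False),
    "model_routing_rules": (True, True),
    "security_gates": (False, False),
    "sandbox_configuration": (False, False),
}


@dataclass
class EvolvingSystem:
    state: dict
    snapshots: list = field(default_factory=list)

    def promote(self, component: str, new_value, baseline: float,
                candidate: float, threshold: float, approved: bool = False) -> bool:
        modifiable, needs_approval = POLICY.get(component, (False, False))
        if not modifiable:
            return False                      # immutable or unknown: deny
        if candidate - baseline <= threshold:
            return False                      # sandbox benchmark gain too small
        if needs_approval and not approved:
            return False                      # waiting for human sign-off
        self.snapshots.append(copy.deepcopy(self.state))  # snapshot before promotion
        self.state = {**self.state, component: new_value}  # single reference swap
        return True

    def rollback(self) -> bool:
        if not self.snapshots:
            return False
        self.state = self.snapshots.pop()
        return True
```

Building the new state as a fresh dictionary and swapping one reference means readers see either the old state or the new one, never a partial update.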

8. Fleet Topology

8.1 Substrate: Nix Flakes

Nix flakes provide reproducible system definitions across all supported platforms:

{
  outputs = { self, nixpkgs }: {
    packages.x86_64-linux.default = ...;
    packages.aarch64-linux.default = ...;   # ARM edge nodes
    packages.aarch64-darwin.default = ...;  # macOS development
  };
}

Two nodes built from the same flake revision, with the same flake.lock, resolve to the same pinned inputs and therefore to the same package closure. This removes most “works on my machine” failures in fleet deployments.

8.2 Network: Tailscale Mesh

All fleet nodes form a WireGuard-encrypted mesh via Tailscale. Properties:

Zero-trust: no implicit trust between nodes
ACL-based: fine-grained access control between node roles
Works across NAT: nodes on different networks connect seamlessly
Key rotation: automated certificate management

8.3 Deployment: Komodo

Komodo manages the fleet node lifecycle:

Node provisioning and configuration
Health checks and automatic replacement
Rolling updates across the fleet
Integration with Nix flake deployment

8.4 Fleet Size Tiers

Tier	Nodes	Use Case	Concurrent Tasks
Solo	1	Developer machine	18
Small	2-5	Team or staging	36-90
Medium	6-20	Department	108-360
Full Fleet	21+	Organization	378+

9. Memory Architecture

9.1 Three-Tier Design

godotz.ai uses three complementary memory systems:

Query
  ↓
[Tier 1: Mnemopi — Session + Long-Term]
  • Fast key-value recall
  • Relevance scoring (0.0-1.0)
  • Automatic expiry below threshold
  
[Tier 2: Knowledge Gardener v0.21.0 — Vault]
  • Structured markdown vault
  • Daily automatic recaps
  • Human-readable knowledge base
  
[Tier 3: Graphify — Knowledge Graph]
  • Code-aware entity-relationship graph
  • AST-level code understanding
  • Semantic query interface

9.2 Mnemopi

Mnemopi provides session-scoped working memory and long-term persistent memory. Key properties:

Relevance scoring: Every memory entry has a score (0.0-1.0)
Automatic eviction: Entries below threshold are expired
Upsert semantics: Idempotent writes prevent duplicates
Cross-session persistence: Long-term memories survive process restarts
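The scoring, eviction, and upsert properties can be illustrated with a toy in-memory store. Mnemopi's real interface is not shown in this document, so the MemoryStore class and its methods are hypothetical, and persistence across restarts is omitted.

```python
from typing import Optional


class MemoryStore:
    """Toy store with relevance scores, threshold eviction, and upsert writes."""

    def __init__(self, threshold: float = 0.2) -> None:
        self.threshold = threshold
        self._entries: dict = {}  # key -> (value, score)

    def upsert(self, key: str, value: str, score: float) -> None:
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be within 0.0-1.0")
        # Writing the same key again replaces the entry instead of duplicating it.
        self._entries[key] = (value, score)

    def evict(self) -> list:
        """Remove entries scoring below the threshold; return their keys."""
        expired = sorted(k for k, (_, s) in self._entries.items() if s < self.threshold)
        for key in expired:
            del self._entries[key]
        return expired

    def recall(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        return entry[0] if entry else None

    def __len__(self) -> int:
        return len(self._entries)
```

Because writes are keyed, a retried write after a timeout leaves one entry rather than two, which is what makes the operation idempotent.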

9.3 Knowledge Gardener v0.21.0

Developed by Kohei-Wada, Knowledge Gardener maintains a structured vault of project knowledge. Features:

Automatic daily recap generation from recent agent activities
Vault organization by project, topic, and date
Integration with Mnemopi for cross-system queries
Human-readable markdown format for direct inspection

9.4 Graphify Knowledge Graph

Graphify builds a semantic knowledge graph from codebases:

Entity extraction: files, functions, classes, modules
Relationship mapping: imports, calls, extends, uses
Natural language query interface
Incremental updates on file changes
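Entity extraction and relationship mapping of this kind can be sketched for Python sources with the standard ast module. This is not Graphify's implementation: the function name, the tuple shapes, and the restriction to simple name-based calls and base classes are assumptions chosen to keep the example short.

```python
import ast


def extract_graph(source: str, module: str) -> tuple:
    """Return (entities, relations) for one Python source string."""
    tree = ast.parse(source)
    entities = {("module", module)}
    relations = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                relations.add((module, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            relations.add((module, "imports", node.module))
        elif isinstance(node, ast.ClassDef):
            entities.add(("class", node.name))
            for base in node.bases:
                if isinstance(base, ast.Name):  # dotted bases are skipped here
                    relations.add((node.name, "extends", base.id))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entities.add(("function", node.name))
            # Calls inside nested functions are also attributed to the enclosing one.
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    relations.add((node.name, "calls", inner.func.id))
    return entities, relations


SAMPLE = '''
import os

class Base:
    pass

class Child(Base):
    def run(self):
        return helper()

def helper():
    return os.getcwd()
'''

entities, relations = extract_graph(SAMPLE, "sample")
```

Attribute calls such as os.getcwd() are ignored in this sketch; resolving them to entities would require import and scope tracking.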

Current performance: < 500ms query time for repositories up to 1,000 files. Scaling target: 10,000 files.

10. Implementation Status

10.1 Current Benchmarks

Capability	Grade	Notes
Model Gateway Routing	S	< 50ms p99, budget enforcement 100%
Durable Workflow Execution	S	Zero data loss on process restart
Task DAG Scheduling	S	Idempotent, retry-safe
Multi-Model Orchestration	S	Echo chamber prevention verified
Redis Cache	S	43% hit rate on production workloads
Plugin Security Gate	S	0 false negatives in testing
MCP Security Scan	S	CVE-2025-6514 class detected
Sandbox Isolation	S	No escapes in adversarial testing
Knowledge Gardener Integration	A	Recap latency 2-5s (target: <1s)
Knowledge Graph Scaling	A	1,000 file limit (target: 10,000)

10.2 Known Limitations

KG Scaling: Graphify queries degrade above 1,000 files. Active development.
ARM Nix Flakes: Some binary caches missing for aarch64; build times 3-4x longer.
Temporal SaaS: Only self-hosted Temporal has been tested; the hosted (SaaS) tier is not yet validated.

11. Future Work

11.1 Meta-Concept Items

Swarm Composition: Allow one swarm to spawn and manage another swarm as a subtask
Fleet-Wide Memory: Shared memory accessible from any node without per-node replication
Adaptive Model Selection: Dynamic model routing based on real-time EMA performance scores

11.2 Scale Targets

10,000-file knowledge graph (Graphify scaling)
50+ heterogeneous node fleets
Mobile/edge nodes with intermittent connectivity

11.3 Self-Evolution Improvements

Automated benchmark suite for self-evolution candidates
Formal specification of what constitutes “improvement”
Federated mutation: multiple nodes propose, population selection

12. References

Chen, J. et al. (2024). “ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs.” arXiv:2309.13007.
Schmidhuber, J. (2003). “Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements.” arXiv:cs/0309048.
Zhang, J. et al. (2025). “Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents.” arXiv:2505.22954.
CVE-2025-6514. “OS Command Injection in mcp-remote When Connecting to Untrusted MCP Servers.” NVD, 2025.
Sandoval, E. et al. (2025). “postmark-mcp: Secure MCP Server Reference Implementation.” GitHub: postmark/postmark-mcp.
LiteLLM Team (2025). “LiteLLM Proxy: Unified LLM Gateway.” litellm.ai.
Temporal Technologies (2024). “Temporal: Durable Workflow Execution.” temporal.io.
LangChain (2024). “LangGraph: Multi-Agent Orchestration Library.” langchain.com/langgraph.
Wada, K. (2025). “Knowledge Gardener v0.21.0: Structured Knowledge Management for AI Systems.” GitHub: Kohei-Wada/knowledge-gardener.
Dolstra, E. (2006). “The Purely Functional Software Deployment Model.” PhD Thesis, Utrecht University. (Nix foundations)