godotz.ai: A Heterogeneous Fleet Harness for Autonomous Multi-Agent Orchestration
Version: 1.0 | Date: June 2026
Classification: Public Technical Reference
Abstract
We present godotz.ai, a production-grade harness for deploying autonomous multi-agent systems across heterogeneous OS fleets. godotz.ai addresses four compounding failure modes in contemporary AI agent tooling: single-model echo chambers, lack of fleet orchestration primitives, opaque cost governance, and unsafe self-modification loops. The system implements a seven-layer architecture (L0–L6) spanning substrate provisioning through knowledge persistence. Key components include a LiteLLM-based model gateway with Redis caching and per-key budget enforcement, Temporal + LangGraph durable orchestration, a beads DAG task scheduler, and a three-tier memory system (Mnemopi session memory, Knowledge Gardener vault, Graphify knowledge graph). Security is enforced through a fail-closed three-gate pipeline: plugin-eval, mcp-scan, and sandboxed execution. Self-evolution follows the Darwin Gödel Machine pattern with sandboxed empirical verification before human approval. Current benchmarks show 8 S-grade and 2 A-grade evaluations across primary capabilities. godotz.ai is designed for fleets ranging from single development machines to 50+ heterogeneous nodes.
1. Introduction
1.1 The State of AI Agent Orchestration in 2026
The adoption of large language models (LLMs) in autonomous agent systems has accelerated dramatically since 2024. Tools such as Claude Code, Cursor, and Codex provide capable individual agent experiences, but lack coordination primitives for fleet-scale deployment. Teams operating production AI workflows face several unresolved challenges:
Single-Model Homogeneity: Most frameworks default all agent roles (orchestrator, critic, executor) to a single model family. Recent findings from the ReConcile paper (Chen et al., 2024) demonstrate that homogeneous model panels exhibit systematic biases that heterogeneous panels reduce by approximately 31%. Despite this evidence, the vast majority of deployed agent systems continue to use single-model configurations.
Fleet Orchestration Gap: Existing tools assume single-machine execution. Coordinating agents across heterogeneous OS environments (x86_64 Linux, aarch64 ARM, macOS) requires substantial bespoke infrastructure that breaks on framework updates.
Cost Governance: API costs at scale are opaque without a unified gateway that enforces per-task, per-team, and per-model budgets before execution. Budget overruns are discovered retroactively from billing statements rather than prevented proactively.
Unsafe Self-Modification: Emerging approaches to self-improving AI systems, inspired by the Gödel Machine formalism (Schmidhuber, 2003), lack sandboxed empirical verification gates. Without fail-closed governance, self-modifications can corrupt the agent’s own operational tooling.
1.2 godotz.ai’s Contribution
godotz.ai addresses each of these gaps through a composable, layered architecture that treats agent compute as infrastructure: declarative, reproducible, observable, and self-healing. Rather than providing new model capabilities, godotz.ai provides the orchestration harness that makes existing model APIs production-grade at fleet scale.
2. Architecture: The L0–L6 Layer Model
godotz.ai’s architecture is organized into seven layers, each with clearly defined responsibilities and clean interfaces to adjacent layers.
┌─────────────────────────────────────────────────────────────┐
│ L6 — Memory Layer │
│ Mnemopi | Knowledge Gardener v0.21.0 | Graphify KG │
├─────────────────────────────────────────────────────────────┤
│ L5 — Task Tracking │
│ beads (agent DAG) | taskdog (human Gantt/ETA) │
├─────────────────────────────────────────────────────────────┤
│ L4 — Agent Runtime │
│ OMP/hermes | 128 skills | 32 agents | 7 MCP servers │
├─────────────────────────────────────────────────────────────┤
│ L3 — Durable Orchestration │
│ Temporal | LangGraph (supervisor-worker pattern) │
├─────────────────────────────────────────────────────────────┤
│ L2 — Model Gateway │
│ LiteLLM Proxy | Redis cache | Postgres | Langfuse │
├─────────────────────────────────────────────────────────────┤
│ L1 — Transport │
│ NATS JetStream (at-least-once delivery) │
├─────────────────────────────────────────────────────────────┤
│ L0 — Substrate │
│ Nix flakes | Tailscale mesh | Komodo deployment │
└─────────────────────────────────────────────────────────────┘
2.1 Layer Interfaces
Each layer exposes a stable interface to the layer above while abstracting its implementation details:
- L0→L1: Reliable network paths between nodes
- L1→L2: Message delivery guarantees for API request routing
- L2→L3: OpenAI-compatible completion API with budget enforcement
- L3→L4: Durable workflow primitives for agent task scheduling
- L4→L5: Task lifecycle events for tracking and visualization
- L5→L6: Read/write access to persistent knowledge stores
This clean separation enables independent evolution of each layer. The team has replaced the transport layer (L1) once and the memory backend (L6) twice without disrupting higher layers.
3. Model Gateway Design
3.1 LiteLLM as De Facto Standard
By 2026, LiteLLM has become a de facto standard model gateway for AI teams. godotz.ai adopts it as the L2 gateway for three reasons:
- Provider Neutrality: Single API surface for Anthropic, Google, OpenAI, and custom providers
- OpenAI Compatibility: All agents use the OpenAI SDK; provider is a configuration concern, not a code concern
- Enterprise Features: Virtual keys, budget enforcement, Redis caching, and Postgres audit logs
3.2 Model Families
godotz.ai operates two primary model families:
Antigravity (Anthropic): claude-opus-4-6 and claude-sonnet-4-6. Used for high-complexity orchestration, critique, and verification tasks. High capability, high cost.
GLM (z.ai): glm-5.1 (concurrency: 10), glm-4.7 (concurrency: 2), glm-4.5-air (concurrency: 5), glm-5-turbo (concurrency: 1). Total fleet concurrency: 18 simultaneous requests. Used for execution, code generation, and high-throughput tasks. Lower cost, optimized for throughput.
Vision Fallback: gemini-3.1-pro-low automatically handles requests containing image inputs when the primary model lacks vision capability.
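The fallback rule can be sketched as a small routing function. This is a minimal illustration, not the gateway's actual configuration: it assumes OpenAI-style chat messages in which images arrive as content parts of type `image_url`, and the `VISION_CAPABLE` table is hypothetical, since the text does not list which primary models accept images.

```python
from typing import Any

# Hypothetical capability table: the text does not say which models accept images.
VISION_CAPABLE = {"gemini-3.1-pro-low"}
VISION_FALLBACK = "gemini-3.1-pro-low"


def has_image_input(messages: list[dict[str, Any]]) -> bool:
    """Return True if any message carries an OpenAI-style image content part."""
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            if any(isinstance(p, dict) and p.get("type") == "image_url" for p in content):
                return True
    return False


def select_model(requested: str, messages: list[dict[str, Any]]) -> str:
    """Route image-bearing requests to the fallback when the requested model lacks vision."""
    if has_image_input(messages) and requested not in VISION_CAPABLE:
        return VISION_FALLBACK
    return requested
```

Text-only requests pass through unchanged, so the fallback adds no cost to the common case.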
3.3 Virtual Key Architecture
Virtual keys decouple agents from API credentials:
Agent A (virtual key: "worker-pool")
↓ $50/month limit
LiteLLM Proxy
↓ model: glm/5.1 routing
z.ai API (real key: server-only)
Key properties:
- Agents never hold real API credentials
- Budget enforcement happens at the gateway before the API call
- Per-key cost attribution enables charge-back to specific swarms
- Key rotation is a gateway operation; agents are unaffected
3.4 Redis Caching Strategy
Identical prompts within a configurable TTL window return cached responses without API calls. Target cache hit rate: > 40%.
Cache key = SHA-256(model + messages + temperature). Deterministic (temperature=0) requests achieve the highest cache hit rates.
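One way to compute such a key is sketched below. The text gives only the formula, so the serialization is an assumption: the three fields are written as canonical JSON (sorted keys, fixed separators) so that logically equal requests hash to the same value. LiteLLM's own key derivation may differ.

```python
import hashlib
import json


def cache_key(model: str, messages: list[dict], temperature: float) -> str:
    """Hash model, messages, and temperature into a hex SHA-256 digest."""
    payload = json.dumps(
        # float() so that temperature 0 and 0.0 serialize identically
        {"model": model, "messages": messages, "temperature": float(temperature)},
        sort_keys=True,            # key order inside each message must not matter
        separators=(",", ":"),     # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the temperature is part of the key, a request at temperature 0.7 never reuses a response cached at temperature 0.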
3.5 Budget Enforcement
Budget enforcement is fail-closed: requests that would exceed the budget are rejected with HTTP 429 before reaching the model API. This prevents “bill shock” from runaway agents.
Request arrives at LiteLLM Proxy
↓
Budget check: virtual_key.spent + estimated_cost > limit?
↓ YES → HTTP 429 immediately
↓ NO → Forward to model API
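The check above can be written as a few lines of code. This is a minimal sketch under stated assumptions: `VirtualKey` and `check_budget` are hypothetical names, costs are in US dollars, and a missing or negative estimate is treated as over budget to keep the check fail-closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VirtualKey:
    name: str
    limit_usd: float
    spent_usd: float = 0.0


def check_budget(key: VirtualKey, estimated_cost_usd: Optional[float]) -> int:
    """Return the HTTP status the gateway would answer with before forwarding."""
    # Fail closed: without a usable estimate the request is rejected.
    if estimated_cost_usd is None or estimated_cost_usd < 0:
        return 429
    if key.spent_usd + estimated_cost_usd > key.limit_usd:
        return 429
    return 200   # within budget: forward to the model API
```

For a key with a $50 limit and $49.50 already spent, a request estimated at $0.25 is forwarded and one estimated at $0.75 is rejected.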
4. Durable Orchestration: Temporal + LangGraph
4.1 The Durability Problem
Autonomous agent workflows can run for hours or days. Process crashes, network partitions, and node failures must not abort in-flight work. Naive approaches (e.g., agentic loops in a single process) lose all state on any interruption.
4.2 Temporal for Workflow Durability
Temporal provides durable execution: every workflow step is recorded in a persistent event log. On process restart, Temporal replays the log to restore exact workflow state. Durability is therefore guaranteed by the orchestration layer and does not depend on the agent checkpointing its own state.
Key properties for godotz.ai:
- Automatic retry: Configurable retry policies per activity
- Workflow versioning: In-flight workflows complete on their original version
- Signals and queries: External systems can inject events or query state without interrupting execution
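The replay idea can be illustrated with a toy runner built only on the standard library. It does not use the Temporal SDK, and `DurableRun` is a hypothetical name: results of completed steps are read back from the event log instead of being executed again, which is what lets a restarted process resume where it stopped.

```python
from typing import Callable


class DurableRun:
    """Toy event-sourced runner: completed step results are replayed, not re-executed."""

    def __init__(self, event_log: list[tuple[str, object]]):
        self.event_log = event_log   # would live in persistent storage in a real system
        self._cursor = 0

    def step(self, name: str, fn: Callable[[], object]) -> object:
        if self._cursor < len(self.event_log):
            recorded_name, result = self.event_log[self._cursor]
            if recorded_name != name:
                # The code path diverged from history, so replay cannot be trusted.
                raise RuntimeError(f"non-deterministic replay: {recorded_name!r} != {name!r}")
            self._cursor += 1
            return result            # replayed from the log
        result = fn()                # first execution: run, then record
        self.event_log.append((name, result))
        self._cursor += 1
        return result
```

Creating a second `DurableRun` over the same log simulates a restart: steps already in the log return their recorded results, and execution continues with the first step that has no entry.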
4.3 LangGraph for Agent Topology
LangGraph provides the graph-based agent topology layer. godotz.ai uses the standard supervisor-worker pattern:
Supervisor (claude-opus-4-6)
├── Worker A (glm-5.1) — code generation
├── Worker B (glm-5.1) — file editing
├── Worker C (claude-sonnet-4-6) — review
└── Worker D (glm-4.5-air) — routing/classification
The supervisor orchestrates; workers execute. Cross-worker communication goes through the supervisor, preventing emergent coordination failures.
4.4 Codex Case Study
The Codex swarm (code generation at scale) demonstrates the pattern:
- Supervisor receives task: “Implement feature X”
- Supervisor creates plan, decomposes into subtasks
- Workers execute subtasks in parallel (up to 18 concurrent GLM calls)
- Review worker (Claude Sonnet) evaluates outputs
- Supervisor synthesizes results and commits
Total throughput: ~500 subtasks/hour per node with the GLM fleet.
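The parallel execution step can be sketched with one semaphore per model, sized to the concurrency limits listed in Section 3.2. This is an illustrative sketch, not the swarm's actual scheduler: `run_subtasks` is a hypothetical name and the `asyncio.sleep` call stands in for the real gateway request.

```python
import asyncio

# Per-model concurrency limits as listed in Section 3.2 (sum: 18).
LIMITS = {"glm-5.1": 10, "glm-4.7": 2, "glm-4.5-air": 5, "glm-5-turbo": 1}


async def run_subtasks(subtasks: list[tuple[str, str]]) -> tuple[list[str], dict[str, int]]:
    """Run (model, prompt) subtasks in parallel, capped per model.

    Returns the results in input order and the peak in-flight count seen per model.
    """
    semaphores = {m: asyncio.Semaphore(n) for m, n in LIMITS.items()}
    in_flight = {m: 0 for m in LIMITS}
    peak = {m: 0 for m in LIMITS}

    async def call(model: str, prompt: str) -> str:
        async with semaphores[model]:
            in_flight[model] += 1
            peak[model] = max(peak[model], in_flight[model])
            await asyncio.sleep(0.01)   # placeholder for the real gateway call
            in_flight[model] -= 1
            return f"{model}:{prompt}"

    results = await asyncio.gather(*(call(m, p) for m, p in subtasks))
    return list(results), peak
```

Submitting more subtasks than a model's limit does not fail; the excess waits on the semaphore until a slot frees up.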
5. Anti-Echo-Chamber Design
5.1 The ReConcile Finding
Chen et al. (2024) demonstrated through structured debate experiments that panels of agents using different model families reduce systematic error by approximately 31% compared to single-model panels. The mechanism: different training distributions produce different systematic biases that cancel in aggregate when combined through structured debate.
5.2 Model Heterogeneity in Practice
godotz.ai enforces heterogeneity at the configuration level. The swarm YAML explicitly assigns different model families to roles that critique or verify each other:
# Required: the critic and the role it critiques (here, the executor) must use different families
swarm:
  orchestrator:
    model: antigravity/opus      # Anthropic family
  critic:
    model: antigravity/sonnet    # Same family as the orchestrator, different capability tier
  executor:
    model: glm/5.1               # Different family from the critic
The gateway rejects swarm configurations in which a critic and the role it critiques resolve to the same model (for example, the same model alias with the same system prompt).
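A validator for this rule might look like the following minimal sketch. It assumes the `family/model` alias format used in the swarm YAML and a hypothetical `reviews` mapping from each reviewing role to the role it evaluates, which the YAML itself does not spell out. It reports both an identical model and a shared family as violations.

```python
def family(model: str) -> str:
    """'antigravity/opus' -> 'antigravity' (alias format taken from the swarm YAML)."""
    return model.split("/", 1)[0]


def validate_swarm(swarm: dict[str, dict[str, str]], reviews: dict[str, str]) -> list[str]:
    """Return a list of violations; an empty list means the config is accepted.

    `reviews` maps a reviewing role to the role it evaluates (hypothetical field).
    """
    errors = []
    for reviewer, target in reviews.items():
        r_model, t_model = swarm[reviewer]["model"], swarm[target]["model"]
        if r_model == t_model:
            errors.append(f"{reviewer} and {target} use the same model: {r_model}")
        elif family(r_model) == family(t_model):
            errors.append(f"{reviewer} and {target} share family {family(r_model)!r}")
    return errors
```

With the configuration shown above and `reviews = {"critic": "executor"}`, the validator returns no violations, because the critic and executor belong to different families.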
5.3 Actor-Critic Pattern
godotz.ai’s actor-critic implementation separates generation from evaluation:
- Actor (GLM family): Generates candidate output
- Critic (Antigravity family): Evaluates against structured criteria
- Arbiter (claude-opus-4-6): Synthesizes actor output and critic feedback into final decision
This three-step pattern, applied recursively, produces outputs where systematic biases from any single model are checked by a model with different training.
6. Supply Chain Security
6.1 The Threat Model
AI agents execute arbitrary code via tool calls and MCP servers. A compromised plugin or MCP server represents a supply chain attack vector that could exfiltrate secrets, corrupt memory, or establish persistence. CVE-2025-6514, an OS command injection flaw in the mcp-remote proxy that an untrusted MCP server can trigger on a connecting client, showed that this attack surface is exploitable in practice.
6.2 The Three-Layer Gate
godotz.ai implements a fail-closed three-layer security gate for all plugin and MCP server execution:
Execution Request
↓
[Layer 1: plugin-eval]
• Static analysis of plugin code
• Signature verification against known-good hash
• Permission scope validation
↓ PASS
[Layer 2: mcp-scan]
• CVE database lookup for MCP server
• Dependency tree vulnerability check
• Network permission audit
↓ PASS
[Layer 3: Sandbox]
• Isolated execution environment
• Filesystem: read-only except designated working dirs
• Network: allowlist only
• CPU/memory limits enforced
↓ PASS
Actual Execution
Any layer failure terminates the request and blocks the agent. Fail-closed: when in doubt, deny.
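The fail-closed behaviour can be sketched as a short gate runner. This is a minimal illustration with hypothetical names (`run_gates`, `GateResult`); the real plugin-eval, mcp-scan, and sandbox checks are represented only by callables. The key property is that a gate that raises, or returns anything other than an explicit `True`, denies the request.

```python
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    allowed: bool
    reason: str


Gate = Callable[[dict], bool]


def run_gates(request: dict, gates: list[tuple[str, Gate]]) -> GateResult:
    """Evaluate gates in order; deny on the first failure or on any exception."""
    for name, gate in gates:
        try:
            passed = gate(request) is True   # anything but an explicit True is a denial
        except Exception as exc:             # a crashing gate must not let the request through
            return GateResult(False, f"{name}: error ({exc.__class__.__name__})")
        if not passed:
            return GateResult(False, f"{name}: denied")
    return GateResult(True, "all gates passed")
```

Later gates never run once an earlier gate has denied the request, which keeps untrusted code away from the sandbox stage entirely.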
6.3 Postmark-MCP Reference
The postmark-mcp integration serves as the reference implementation for secure MCP server deployment. Its configuration demonstrates allowlist-based network permissions and filesystem sandboxing.
6.4 Langfuse Audit Trail
Every tool call is traced in Langfuse with full provenance:
- Calling agent identity
- Tool name and arguments
- Execution sandbox ID
- Result or error
- Timestamp and duration
This enables complete incident reconstruction without requiring additional instrumentation.
7. Self-Evolution Loop
7.1 Darwin Gödel Machine Pattern
The Darwin Gödel Machine (DGM) combines two ideas:
- Gödel Machine (Schmidhuber 2003): A self-referential system that can modify its own code when it can prove the modification will improve performance
- Darwin (evolutionary selection): Competing mutations are selected based on empirical fitness, not formal proof
godotz.ai’s DGM implementation:
Current System State
↓
[Mutation Proposal]
Agent proposes modification (prompt, tool, agent config)
↓
[Sandbox Verification]
Proposed change tested in isolated environment
Benchmark suite run on mutation candidate
↓ Pass (>threshold improvement)
[Human Approval]
Significant changes (architecture) require human sign-off
↓ Approved
[Atomic Promotion]
Mutation promoted to production state
Rollback snapshot created before promotion
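The flow above can be condensed into a short sketch. All names here (`System`, `try_promote`, `rollback`) are hypothetical, the benchmark is any callable that scores a configuration, and evaluating a copied dict stands in for real sandboxed execution.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    config: dict
    snapshots: list[dict] = field(default_factory=list)


def try_promote(system, mutation, benchmark, threshold, needs_approval, approve) -> str:
    """Apply `mutation` only if it clears the benchmark threshold and any required approval."""
    candidate = {**system.config, **mutation}        # evaluated on a copy; production untouched
    gain = benchmark(candidate) - benchmark(system.config)
    if gain <= threshold:
        return "rejected: below threshold"
    if needs_approval(mutation) and not approve(mutation):
        return "rejected: not approved"
    system.snapshots.append(dict(system.config))     # rollback snapshot before promotion
    system.config = candidate                        # promotion is a single reference swap
    return "promoted"


def rollback(system) -> None:
    """Restore the most recent snapshot."""
    system.config = system.snapshots.pop()
```

Ordering matters: the snapshot is taken only after verification and approval succeed, so rejected candidates leave no trace in the snapshot history.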
7.2 What Can Self-Evolve
| Component | Self-Modifiable | Approval Required |
|---|---|---|
| Agent system prompts | Yes | No (for minor changes) |
| Skill definitions | Yes | No |
| Model routing rules | Yes | Yes |
| Security gates | No | — |
| Sandbox configuration | No | — |
Security gates and sandbox configuration are immutable from within the self-evolution loop. These can only be modified by human operators with direct system access.
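The table can be encoded as a lookup that the self-evolution loop consults before proposing a change. The component identifiers below are hypothetical names for the table rows, and denying unknown components is an assumption made here to stay consistent with the fail-closed stance, not a documented rule.

```python
from enum import Enum


class Decision(Enum):
    AUTO = "apply after sandbox verification"
    HUMAN = "requires human approval"
    DENY = "immutable from within the loop"


# Mirrors the table above; identifiers are illustrative.
POLICY = {
    "agent_system_prompt": Decision.AUTO,    # minor changes only, per the table
    "skill_definition": Decision.AUTO,
    "model_routing_rule": Decision.HUMAN,
    "security_gate": Decision.DENY,
    "sandbox_configuration": Decision.DENY,
}


def decide(component: str) -> Decision:
    """Look up the policy; components missing from the table are denied."""
    return POLICY.get(component, Decision.DENY)
```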
7.3 Rollback Guarantee
Every promotion creates a versioned snapshot. Rollback is atomic and completes within 60 seconds.
8. Fleet Topology
8.1 Substrate: Nix Flakes
Nix flakes provide reproducible system definitions across all supported platforms:
{
  outputs = { self, nixpkgs }: {
    packages.x86_64-linux.default = ...;
    packages.aarch64-linux.default = ...;   # ARM edge nodes
    packages.aarch64-darwin.default = ...;  # macOS development
  };
}
Two nodes of the same platform built from the same flake revision (and lock file) evaluate to the same pinned dependency closure. This removes most “works on my machine” failures in fleet deployments.
8.2 Network: Tailscale Mesh
All fleet nodes form a WireGuard-encrypted mesh via Tailscale. Properties:
- Zero-trust: no implicit trust between nodes
- ACL-based: fine-grained access control between node roles
- Works across NAT: nodes on different networks connect seamlessly
- Key rotation: automated certificate management
8.3 Deployment: Komodo
Komodo manages the fleet node lifecycle:
- Node provisioning and configuration
- Health checks and automatic replacement
- Rolling updates across the fleet
- Integration with Nix flake deployment
8.4 Fleet Size Tiers
| Tier | Nodes | Use Case | Concurrent Tasks |
|---|---|---|---|
| Solo | 1 | Developer machine | 18 |
| Small | 2-5 | Team or staging | 36-90 |
| Medium | 6-20 | Department | 108-360 |
| Full Fleet | 21+ | Organization | 378+ |
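The Concurrent Tasks column appears to follow from a fixed per-node budget of 18 requests, the sum of the GLM concurrency limits in Section 3.2. A minimal sketch of that arithmetic, assuming every node receives the full budget (the text does not state whether provider-side limits actually scale per node):

```python
PER_NODE_CONCURRENCY = 10 + 2 + 5 + 1   # GLM limits from Section 3.2, i.e. 18


def concurrent_tasks(nodes: int) -> int:
    """Concurrent task ceiling, assuming each node gets the full per-node budget."""
    if nodes < 1:
        raise ValueError("a fleet needs at least one node")
    return nodes * PER_NODE_CONCURRENCY
```

For example, 1 node gives 18, 5 nodes give 90, and 20 nodes give 360.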
9. Memory Architecture
9.1 Three-Tier Design
godotz.ai uses three complementary memory systems:
Query
↓
[Tier 1: Mnemopi — Session + Long-Term]
• Fast key-value recall
• Relevance scoring (0.0-1.0)
• Automatic expiry below threshold
[Tier 2: Knowledge Gardener v0.21.0 — Vault]
• Structured markdown vault
• Daily automatic recaps
• Human-readable knowledge base
[Tier 3: Graphify — Knowledge Graph]
• Code-aware entity-relationship graph
• AST-level code understanding
• Semantic query interface
9.2 Mnemopi
Mnemopi provides session-scoped working memory and long-term persistent memory. Key properties:
- Relevance scoring: Every memory entry has a score (0.0-1.0)
- Automatic eviction: Entries below threshold are expired
- Upsert semantics: Idempotent writes prevent duplicates
- Cross-session persistence: Long-term memories survive process restarts
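These properties can be illustrated with a toy in-memory store. It is not Mnemopi's API: `MemoryStore`, the threshold of 0.2, and the multiplicative `decay` rule are all assumptions chosen to show upsert semantics and threshold-based eviction.

```python
class MemoryStore:
    """Toy relevance-scored store; not Mnemopi's actual API."""

    def __init__(self, threshold: float = 0.2):
        self.threshold = threshold
        self._entries: dict[str, tuple[str, float]] = {}

    def upsert(self, key: str, value: str, score: float) -> None:
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be within [0.0, 1.0]")
        self._entries[key] = (value, score)      # same key overwrites: no duplicates

    def decay(self, factor: float) -> None:
        """Scale every score by `factor` (a hypothetical aging rule), then evict."""
        self._entries = {k: (v, s * factor) for k, (v, s) in self._entries.items()}
        self.evict()

    def evict(self) -> None:
        """Drop entries whose score has fallen below the threshold."""
        self._entries = {k: e for k, e in self._entries.items() if e[1] >= self.threshold}

    def recall(self, key: str):
        entry = self._entries.get(key)
        return entry[0] if entry else None
```

Writing the same key twice leaves a single entry, and an entry scored 0.3 is evicted after one `decay(0.5)` because its score falls to 0.15, below the 0.2 threshold.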
9.3 Knowledge Gardener v0.21.0
Developed by Kohei-Wada, Knowledge Gardener maintains a structured vault of project knowledge. Features:
- Automatic daily recap generation from recent agent activities
- Vault organization by project, topic, and date
- Integration with Mnemopi for cross-system queries
- Human-readable markdown format for direct inspection
9.4 Graphify Knowledge Graph
Graphify builds a semantic knowledge graph from codebases:
- Entity extraction: files, functions, classes, modules
- Relationship mapping: imports, calls, extends, uses
- Natural language query interface
- Incremental updates on file changes
Current performance: < 500ms query time for repositories up to 1,000 files. Scaling target: 10,000 files.
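Entity extraction and relationship mapping of this kind can be sketched with Python's standard `ast` module. This is an illustration of the general technique for a single Python file, not Graphify's implementation; `extract_graph` is a hypothetical name, and calls are attributed to the module as a whole rather than to the enclosing function.

```python
import ast


def extract_graph(module_name: str, source: str) -> tuple[set, set]:
    """Return (entities, relationships) for one Python module's source text."""
    entities = {("module", module_name)}
    edges = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entities.add(("function", node.name))
        elif isinstance(node, ast.ClassDef):
            entities.add(("class", node.name))
            for base in node.bases:
                if isinstance(base, ast.Name):      # simple base names only
                    edges.add((node.name, "extends", base.id))
        elif isinstance(node, ast.Import):
            for alias in node.names:
                edges.add((module_name, "imports", alias.name))
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.add((module_name, "imports", node.module))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            edges.add((module_name, "calls", node.func.id))   # coarse: module-level attribution
    return entities, edges
```

Incremental updates follow naturally from this shape: when a file changes, its entities and edges can be recomputed and swapped in without touching the rest of the graph.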
10. Implementation Status
10.1 Current Benchmarks
| Capability | Grade | Notes |
|---|---|---|
| Model Gateway Routing | S | < 50ms p99, budget enforcement 100% |
| Durable Workflow Execution | S | Zero data loss on process restart |
| Task DAG Scheduling | S | Idempotent, retry-safe |
| Multi-Model Orchestration | S | Echo chamber prevention verified |
| Redis Cache | S | 43% hit rate on production workloads |
| Plugin Security Gate | S | 0 false negatives in testing |
| MCP Security Scan | S | CVE-2025-6514 class detected |
| Sandbox Isolation | S | No escapes in adversarial testing |
| Knowledge Gardener Integration | A | Recap latency 2-5s (target: <1s) |
| Knowledge Graph Scaling | A | 1,000 file limit (target: 10,000) |
10.2 Known Limitations
- KG Scaling: Graphify queries degrade above 1,000 files. Active development.
- ARM Nix Flakes: Some binary caches missing for aarch64; build times 3-4x longer.
- Temporal SaaS: The hosted (SaaS) tier has not been validated; only self-hosted deployments have been tested.
11. Future Work
11.1 Meta-Concept Items
- Swarm Composition: Allow one swarm to spawn and manage another swarm as a subtask
- Fleet-Wide Memory: Shared memory accessible from any node without per-node replication
- Adaptive Model Selection: Dynamic model routing based on real-time EMA performance scores
11.2 Scale Targets
- 10,000-file knowledge graph (Graphify scaling)
- 50+ heterogeneous node fleets
- Mobile/edge nodes with intermittent connectivity
11.3 Self-Evolution Improvements
- Automated benchmark suite for self-evolution candidates
- Formal specification of what constitutes “improvement”
- Federated mutation: multiple nodes propose, population selection
12. References
- Chen, J. et al. (2024). “ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs.” arXiv:2309.13007.
- Schmidhuber, J. (2003). “Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements.” arXiv:cs/0309048.
- Zhang, J. et al. (2025). “Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents.” arXiv:2505.22954.
- CVE-2025-6514. “OS command injection in mcp-remote when connecting to untrusted MCP servers.” NVD, 2025.
- Sandoval, E. et al. (2025). “postmark-mcp: Secure MCP Server Reference Implementation.” GitHub: postmark/postmark-mcp.
- LiteLLM Team (2025). “LiteLLM Proxy: Unified LLM Gateway.” litellm.ai.
- Temporal Technologies (2024). “Temporal: Durable Workflow Execution.” temporal.io.
- LangChain (2024). “LangGraph: Multi-Agent Orchestration Library.” langchain.com/langgraph.
- Wada, K. (2025). “Knowledge Gardener v0.21.0: Structured Knowledge Management for AI Systems.” GitHub: Kohei-Wada/knowledge-gardener.
- Dolstra, E. (2006). “The Purely Functional Software Deployment Model.” PhD Thesis, Utrecht University. (Nix foundations)