godotz.ai: Product Requirements Document
Version: 1.2 | Date: 2026-06-08 | Status: Active
Authors: godotz.ai Architecture Team
Classification: Internal / Open Source Reference
1. Problem Statement
1.1 The Fragmentation Problem
The AI agent tooling landscape in 2026 is deeply fragmented. Teams running autonomous agent workflows face four compounding failure modes:
- Single-model echo chambers: Most frameworks default to one model family for all roles (orchestrator, critic, executor). When the orchestrator and the reviewer share the same weights, systematic biases amplify rather than cancel.
- No fleet orchestration primitive: Existing tools assume a single machine. Running agents across heterogeneous OS nodes (Linux workstations, ARM edge devices, macOS laptops) requires bespoke glue code that breaks on every framework update.
- Opaque cost and budget enforcement: API costs spiral because there is no standard gateway that can enforce per-task, per-team, or per-model budgets before a runaway agent burns the monthly allowance.
- Unsafe self-improvement loops: Several projects attempt Darwin-Gödel-style self-modification but lack sandboxed empirical verification gates. Without fail-closed governance, a single bad mutation can corrupt the agent’s own tooling.
1.2 Market Context
By Q2 2026:
- Claude Code, Cursor, and Codex each have millions of users but no cross-fleet coordination layer.
- LiteLLM has become the de facto API gateway for AI teams but lacks a higher-level orchestration harness.
- Temporal’s durable execution pattern is well established for backend workflows but is rarely applied to AI agent DAGs.
- The ReConcile paper (2024) showed that heterogeneous model panels reduce systematic error by 31% vs single-model panels, yet most frameworks ignore this finding.
1.3 Opportunity
godotz.ai fills the gap between raw model APIs and production-grade autonomous agent fleets. It provides the orchestration harness, not the models — a neutral layer any team can deploy on existing infrastructure.
2. Product Vision
godotz.ai is the autonomous multi-agent harness for heterogeneous OS fleets. It treats agent compute like infrastructure: declarative, reproducible, observable, and self-healing.
Core promise: Define your swarm in YAML. godotz.ai handles routing, scheduling, memory, cost, and verification — across every node in your fleet.
3. Target Users
3.1 Primary Audience
| Persona | Description | Key Pain Point |
|---|---|---|
| AI Engineer | Builds autonomous pipelines, LLM applications | Framework churn, echo chambers, no fleet primitive |
| ML Ops Engineer | Operates model inference infrastructure | Cost overruns, no unified gateway, observability gaps |
| DevOps / Platform Engineer | Manages heterogeneous OS fleets | No agent-native scheduler, bespoke glue code |
| System Architect | Designs multi-agent architectures | Lack of durability, no standard memory layer, security gaps |
3.2 Secondary Audience
| Persona | Description |
|---|---|
| Research Engineer | Runs long-horizon autonomous research agents |
| Security Engineer | Audits AI supply chains, plugin provenance |
| Product Manager | Needs PRD + dashboard to track agent throughput and cost |
| Open Source Contributor | Extends godotz.ai with new agents, skills, or MCP servers |
4. User Stories
4.1 Core Infrastructure (AI Engineer)
| ID | As a… | I want to… | So that… |
|---|---|---|---|
| US-001 | AI Engineer | Define a swarm in a single YAML file | I can version-control my agent topology |
| US-002 | AI Engineer | Route tasks to different model families based on role | My orchestrator and critic never share the same weights |
| US-003 | AI Engineer | Retry failed agent tasks with exponential backoff | Transient API errors don’t abort long-running workflows |
| US-004 | AI Engineer | Inspect the full DAG of agent tasks in real time | I can debug which node is blocking a swarm |
| US-005 | AI Engineer | Persist agent memory across sessions | Long-horizon tasks don’t lose context between runs |
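
US-002 can be read as a constraint that is checkable before a swarm starts. The sketch below is one possible way to express it, assuming a hypothetical role-to-family mapping; the function name, the role names, and the family labels are illustrative and not part of the godotz.ai swarm format.

```python
from typing import Dict, List, Sequence, Tuple


def family_conflicts(
    role_to_family: Dict[str, str],
    must_differ: Sequence[Tuple[str, str]] = (("orchestrator", "critic"),),
) -> List[Tuple[str, str, str]]:
    """Return (role_a, role_b, family) for every pair that shares a model family."""
    conflicts = []
    for role_a, role_b in must_differ:
        if role_to_family[role_a] == role_to_family[role_b]:
            conflicts.append((role_a, role_b, role_to_family[role_a]))
    return conflicts


# Hypothetical assignments: the first is heterogeneous, the second is an echo chamber.
ok = {"orchestrator": "glm", "critic": "gemini", "executor": "glm"}
bad = {"orchestrator": "glm", "critic": "glm", "executor": "glm"}
print(family_conflicts(ok))
print(family_conflicts(bad))
```

A loader could refuse to start a swarm whenever the returned list is non-empty.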
4.2 Fleet Operations (ML Ops / DevOps)
| ID | As a… | I want to… | So that… |
|---|---|---|---|
| US-006 | ML Ops Engineer | Set a monthly budget cap per model family | Runaway agents can’t burn the monthly quota |
| US-007 | ML Ops Engineer | See per-agent, per-task token consumption | I can attribute costs to specific workflows |
| US-008 | DevOps Engineer | Deploy godotz.ai across Linux + ARM + macOS nodes | I don’t have to manage three tool stacks for a heterogeneous fleet |
| US-009 | DevOps Engineer | Pin agent versions with Nix flakes | Every node runs identical, reproducible environments |
| US-010 | ML Ops Engineer | Route to a vision fallback model automatically | Agents that receive image inputs don’t fail silently |
4.3 Security (Security Engineer)
| ID | As a… | I want to… | So that… |
|---|---|---|---|
| US-011 | Security Engineer | Scan every plugin before it executes | No supply-chain attack propagates to production |
| US-012 | Security Engineer | Run agent code in a sandboxed environment | A compromised agent can’t exfiltrate secrets |
| US-013 | Security Engineer | Block MCP servers flagged by CVE database | Known vulnerabilities are caught at the gate |
| US-014 | Security Engineer | Audit the full provenance of every tool call | I can reconstruct any incident from Langfuse traces |
4.4 Self-Evolution (Advanced AI Engineer)
| ID | As a… | I want to… | So that… |
|---|---|---|---|
| US-015 | AI Engineer | Allow agents to propose patches to their own prompts | The system improves without manual iteration |
| US-016 | AI Engineer | Gate every self-modification with empirical verification | A bad self-edit is rejected before it reaches production |
| US-017 | AI Engineer | Require human approval for significant architecture changes | I stay in the loop on non-trivial mutations |
| US-018 | AI Engineer | Roll back any self-modification atomically | A bad change is reversible in under 60 seconds |
4.5 Knowledge and Memory
| ID | As a… | I want to… | So that… |
|---|---|---|---|
| US-019 | AI Engineer | Store structured knowledge in a vault with automatic recaps | Agents don’t re-derive the same facts on every run |
| US-020 | AI Engineer | Query a knowledge graph built from my codebase | Agents navigate large repos without exhausting context windows |
| US-021 | AI Engineer | Have agent memories expire based on relevance scoring | The memory system stays lean over time |
4.6 Edge Cases
| ID | As a… | I want to… | So that… |
|---|---|---|---|
| US-022 | AI Engineer | Handle agent node failures gracefully | A single node going offline doesn’t abort the swarm |
| US-023 | ML Ops Engineer | Enforce strict rate limits per model | One agent can’t starve all others on the same gateway |
| US-024 | Security Engineer | Toggle “hardcore mode” to disable permissive fallbacks | High-security environments get fail-closed behavior end-to-end |
| US-025 | DevOps Engineer | Monitor fleet health with a live status dashboard | I know the health of every node without SSH-ing in |
5. Functional Requirements
5.1 L0 — Substrate Layer (Nix + Tailscale)
| Req ID | Requirement |
|---|---|
| FR-L0-001 | System SHALL provide Nix flake definitions for all supported platforms (x86_64-linux, aarch64-linux, aarch64-darwin) |
| FR-L0-002 | System SHALL establish encrypted mesh connectivity between all fleet nodes via Tailscale |
| FR-L0-003 | System SHALL support Komodo-based deployment orchestration for fleet node lifecycle management |
| FR-L0-004 | System SHALL ensure reproducible builds; two nodes built from the same flake SHALL produce bit-identical environments |
5.2 L1 — Transport Layer (NATS)
| Req ID | Requirement |
|---|---|
| FR-L1-001 | System SHALL use NATS JetStream for all inter-agent message passing |
| FR-L1-002 | System SHALL support at-least-once delivery semantics with configurable acknowledgement timeouts |
| FR-L1-003 | System SHALL provide NATS subject namespacing per fleet node and per agent role |
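
FR-L1-003 does not fix a subject layout, so the following is only a sketch of what per-node, per-role namespacing might look like. The `fleet.<node>.<role>.<event>` scheme and both function names are assumptions made for illustration. The wildcard semantics are standard NATS: `*` matches exactly one token and `>` matches one or more trailing tokens.

```python
import re

# Tokens must not contain dots, whitespace, or wildcard characters.
_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def _check(token: str) -> str:
    if not _TOKEN.fullmatch(token):
        raise ValueError(f"invalid subject token: {token!r}")
    return token


def agent_subject(node: str, role: str, event: str) -> str:
    """Build a concrete subject for one node, one role, one event type."""
    return f"fleet.{_check(node)}.{_check(role)}.{_check(event)}"


def role_filter(role: str) -> str:
    """Build a subscription filter matching one role on every node."""
    return f"fleet.*.{_check(role)}.>"


print(agent_subject("node-a", "critic", "task"))
print(role_filter("critic"))
```

Rejecting dots and wildcards inside a token keeps a node or role name from silently widening a subscription.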
5.3 L2 — Model Gateway (LiteLLM Proxy)
| Req ID | Requirement |
|---|---|
| FR-L2-001 | System SHALL expose a unified OpenAI-compatible API endpoint routing to all configured model providers |
| FR-L2-002 | System SHALL enforce per-key budget limits; requests exceeding budget SHALL be rejected with HTTP 429 |
| FR-L2-003 | System SHALL cache identical prompts in Redis with configurable TTL |
| FR-L2-004 | System SHALL persist all request/response pairs to Postgres for audit and cost attribution |
| FR-L2-005 | System SHALL support virtual keys mapped to model families (Antigravity keys, GLM keys) |
| FR-L2-006 | System SHALL provide an EMA (Exponential Moving Average) fallback chain when a primary model fails |
| FR-L2-007 | System SHALL route vision-capable requests to gemini-3.1-pro-low when the primary model lacks vision |
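
The budget rule in FR-L2-002 amounts to an admission check that runs before a request is forwarded. The sketch below illustrates that check in isolation; `KeyBudget` and `admit` are hypothetical names, the cost estimate is assumed to be supplied by the caller, and none of this is LiteLLM's own API.

```python
from dataclasses import dataclass


@dataclass
class KeyBudget:
    """Spend tracker for one virtual key (illustrative only)."""
    limit_usd: float
    spent_usd: float = 0.0


def admit(budget: KeyBudget, estimated_cost_usd: float) -> int:
    """Return 200 and reserve the spend, or 429 if the cap would be exceeded."""
    if budget.spent_usd + estimated_cost_usd > budget.limit_usd:
        return 429  # rejected before the call is made, so nothing is spent
    budget.spent_usd += estimated_cost_usd
    return 200


key = KeyBudget(limit_usd=1.0)
print(admit(key, 0.75))  # within budget
print(admit(key, 0.5))   # would exceed the cap
```

Checking before forwarding, rather than after the response arrives, is what makes the zero-overrun target in section 7 achievable in principle; it does depend on the cost estimate being an upper bound.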
5.4 L3 — Orchestration Layer (Temporal + LangGraph)
| Req ID | Requirement |
|---|---|
| FR-L3-001 | System SHALL use Temporal for durable workflow execution with automatic retry on transient failures |
| FR-L3-002 | System SHALL implement the supervisor-worker LangGraph pattern as the standard 2-tier orchestration model |
| FR-L3-003 | System SHALL support nested subgraph composition for complex multi-agent workflows |
| FR-L3-004 | System SHALL provide workflow versioning so in-flight workflows can complete on the version they started |
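
For the retry behaviour in FR-L3-001 and US-003, the delay schedule is the part worth pinning down. The following sketch computes a capped exponential schedule with optional jitter; the parameter names and defaults are illustrative assumptions and do not mirror Temporal's retry policy fields.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 60.0,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait (seconds) before each retry: base * factor**n, capped."""
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap, base * factor ** n)
        # Add up to `jitter` (a fraction of the delay) to spread out retries.
        delay += rng.uniform(0.0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays(5))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

The cap keeps a long outage from producing hour-long waits, and jitter keeps many agents from retrying against the gateway at the same instant.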
5.5 L4 — Agent Runtime (godotz.ai/hermes)
| Req ID | Requirement |
|---|---|
| FR-L4-001 | System SHALL provide a declarative YAML swarm definition format |
| FR-L4-002 | System SHALL support 128+ skills (pre-built agent capabilities) loadable per swarm |
| FR-L4-003 | System SHALL support 32+ agent types with configurable roles |
| FR-L4-004 | System SHALL support 7 MCP servers exposing 79 tools to agents |
| FR-L4-005 | System SHALL enforce concurrency limits per model family (GLM-5.1: 10, GLM-5-Turbo: 1, GLM-4.7: 2, GLM-4.5-Air: 5) |
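
The limits in FR-L4-005 can be enforced with one counting semaphore per model family. This is a minimal single-process sketch using the standard library; `ConcurrencyGate` is a hypothetical name, and a real fleet would need the counters shared across processes. The limits sum to 18, which matches the per-node baseline quoted in NFR-SC-002.

```python
import threading
from typing import Dict

# Limits copied from FR-L4-005.
LIMITS = {"GLM-5.1": 10, "GLM-5-Turbo": 1, "GLM-4.7": 2, "GLM-4.5-Air": 5}


class ConcurrencyGate:
    """One bounded semaphore per model family (in-process illustration)."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._slots = {m: threading.BoundedSemaphore(n) for m, n in limits.items()}

    def try_acquire(self, model: str) -> bool:
        """Take a slot without blocking; False means the family is saturated."""
        return self._slots[model].acquire(blocking=False)

    def release(self, model: str) -> None:
        self._slots[model].release()


gate = ConcurrencyGate(LIMITS)
print(sum(LIMITS.values()))             # 18
print(gate.try_acquire("GLM-5-Turbo"))  # True: the single slot is free
print(gate.try_acquire("GLM-5-Turbo"))  # False: limit of 1 reached
```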
5.6 L5 — Task Tracking (beads + taskdog)
| Req ID | Requirement |
|---|---|
| FR-L5-001 | System SHALL represent agent workflows as DAGs using the beads task primitive |
| FR-L5-002 | System SHALL provide a bd CLI for task creation, status, retry, and cancellation |
| FR-L5-003 | System SHALL provide a Gantt/ETA view via taskdog for human operators |
| FR-L5-004 | System SHALL guarantee idempotent task execution; re-running a completed task SHALL be a no-op |
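
FR-L5-001 and FR-L5-004 combine two ideas: run tasks in dependency order, and skip anything already completed. The sketch below shows both using Python's standard `graphlib`; `run_dag`, the task names, and the in-memory `completed` set are assumptions for illustration and say nothing about how beads stores state.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_dag(
    deps: Dict[str, Set[str]],
    actions: Dict[str, Callable[[], None]],
    completed: Set[str],
) -> List[str]:
    """Run tasks in dependency order; tasks already in `completed` are no-ops.

    `deps` maps each task to the set of tasks it depends on.
    Returns the tasks actually executed during this call.
    """
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in completed:
            continue  # idempotence: a finished task is never re-run
        actions[task]()
        completed.add(task)
        executed.append(task)
    return executed


log: List[str] = []
deps = {"plan": set(), "code": {"plan"}, "review": {"code"}}
actions = {name: (lambda name=name: log.append(name)) for name in deps}
done: Set[str] = set()
print(run_dag(deps, actions, done))  # ['plan', 'code', 'review']
print(run_dag(deps, actions, done))  # []
```

Re-running the same graph against the same `completed` set does nothing, which is the behaviour FR-L5-004 asks for.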
5.7 L6 — Memory Layer (Mnemopi + KG + Graphify)
| Req ID | Requirement |
|---|---|
| FR-L6-001 | System SHALL provide session-scoped and persistent long-term memory via Mnemopi |
| FR-L6-002 | System SHALL maintain a Knowledge Gardener vault with automatic daily recaps |
| FR-L6-003 | System SHALL build and query a code knowledge graph via Graphify |
| FR-L6-004 | System SHALL score memory relevance and expire low-relevance entries automatically |
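
FR-L6-004 leaves the scoring function open. One simple possibility, shown here purely as an assumption, is exponential decay with a half-life: an entry's relevance halves every `half_life_days`, and entries below a threshold are dropped. The names, the 30-day half-life, and the 0.2 threshold are all illustrative and are not Mnemopi's behaviour.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    text: str
    base_score: float  # relevance when the entry was written, in [0, 1]
    age_days: float


def relevance(entry: MemoryEntry, half_life_days: float = 30.0) -> float:
    """Decay the base score by half for every `half_life_days` of age."""
    return entry.base_score * 0.5 ** (entry.age_days / half_life_days)


def prune(
    entries: List[MemoryEntry],
    threshold: float = 0.2,
    half_life_days: float = 30.0,
) -> List[MemoryEntry]:
    """Keep only entries whose decayed relevance is at or above the threshold."""
    return [e for e in entries if relevance(e, half_life_days) >= threshold]


entries = [MemoryEntry("fresh fact", 0.8, 0.0), MemoryEntry("stale fact", 0.8, 90.0)]
print([e.text for e in prune(entries)])  # ['fresh fact']
```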
5.8 Security Requirements
| Req ID | Requirement |
|---|---|
| FR-SEC-001 | System SHALL scan all plugins via plugin-eval before execution |
| FR-SEC-002 | System SHALL scan all MCP servers via mcp-scan and block servers flagged by CVE database |
| FR-SEC-003 | System SHALL execute agent code in a fail-closed sandbox |
| FR-SEC-004 | System SHALL trace all tool calls via Langfuse for full audit provenance |
| FR-SEC-005 | System SHALL provide a “hardcore mode” toggle that disables all permissive fallbacks |
5.9 Self-Evolution Requirements
| Req ID | Requirement |
|---|---|
| FR-DGM-001 | System SHALL implement the Darwin Gödel Machine pattern for self-modification proposals |
| FR-DGM-002 | Every self-modification SHALL be tested in a sandboxed environment before promotion |
| FR-DGM-003 | Significant architecture mutations SHALL require human approval before taking effect |
| FR-DGM-004 | All self-modifications SHALL be atomically reversible within 60 seconds |
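
FR-DGM-002 and FR-DGM-003 together describe a two-stage gate. The sketch below shows the decision logic only, under the assumption that sandbox testing is available as a callable returning a boolean; `Proposal`, `decide`, and the three outcome strings are hypothetical names. Any exception or non-`True` result from the sandbox is treated as a failure, which is the fail-closed behaviour the requirements describe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    patch_id: str
    significant: bool  # True for architecture-level changes


def decide(
    proposal: Proposal,
    run_sandbox_tests: Callable[[Proposal], bool],
    human_approved: bool = False,
) -> str:
    """Return 'rejected', 'pending_approval', or 'promoted'."""
    try:
        passed = run_sandbox_tests(proposal) is True
    except Exception:
        passed = False  # fail closed: a crashing verifier never promotes
    if not passed:
        return "rejected"
    if proposal.significant and not human_approved:
        return "pending_approval"
    return "promoted"


def crashing_verifier(_: Proposal) -> bool:
    raise RuntimeError("sandbox unavailable")


print(decide(Proposal("p1", significant=False), lambda p: True))  # promoted
print(decide(Proposal("p2", significant=True), lambda p: True))   # pending_approval
print(decide(Proposal("p3", significant=False), crashing_verifier))  # rejected
```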
6. Non-Functional Requirements
6.1 Performance
| NFR ID | Requirement |
|---|---|
| NFR-P-001 | Model gateway routing latency SHALL be < 100ms (p99) excluding model inference time |
| NFR-P-002 | Task graph scheduling latency SHALL be < 50ms per node |
| NFR-P-003 | Memory retrieval (Mnemopi) SHALL respond in < 200ms for datasets up to 10,000 entries |
| NFR-P-004 | Knowledge graph queries (Graphify) SHALL return results in < 500ms for repos up to 1,000 files |
6.2 Reliability
| NFR ID | Requirement |
|---|---|
| NFR-R-001 | godotz.ai core services SHALL maintain 99.9% uptime (< 8.76 h downtime/year) |
| NFR-R-002 | A single node failure SHALL NOT abort in-flight workflows on other nodes |
| NFR-R-003 | All durable workflows SHALL survive process restarts via Temporal persistence |
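
The downtime allowance implied by an availability target is a one-line calculation. As a quick check, assuming a 365-day year, 99.9% availability leaves roughly 8.76 hours of downtime per year; the helper name below is illustrative.

```python
def downtime_budget_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Hours of allowed downtime in the period for a given availability fraction."""
    return (1.0 - availability) * period_hours


print(round(downtime_budget_hours(0.999), 2))  # 8.76
```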
6.3 Security
| NFR ID | Requirement |
|---|---|
| NFR-S-001 | All inter-node communication SHALL be encrypted (Tailscale WireGuard) |
| NFR-S-002 | All model API keys SHALL be stored in vault (never plaintext in config files) |
| NFR-S-003 | The sandbox SHALL be fail-closed: any sandbox breach attempt SHALL terminate the agent |
6.4 Observability
| NFR ID | Requirement |
|---|---|
| NFR-O-001 | Every model API call SHALL be traced in Langfuse with full prompt/response |
| NFR-O-002 | Every task SHALL emit lifecycle events (created, started, completed, failed) to NATS |
| NFR-O-003 | Budget consumption SHALL be queryable per virtual key, per agent, per day |
6.5 Scalability
| NFR ID | Requirement |
|---|---|
| NFR-SC-001 | godotz.ai SHALL support fleet sizes from 1 to 50+ heterogeneous nodes |
| NFR-SC-002 | Total concurrent model calls SHALL scale with fleet size (18 per node baseline) |
| NFR-SC-003 | Knowledge graph SHALL support repositories up to 10,000 files |
7. Success Metrics
7.1 Adoption Metrics
| Metric | Target (M3) | Target (M6) |
|---|---|---|
| Active fleet nodes | 3 | 20+ |
| Swarms defined | 10 | 100+ |
| Daily agent tasks | 500 | 5,000+ |
7.2 Quality Metrics
| Metric | Target |
|---|---|
| Self-evolution verification pass rate | > 80% of proposals survive sandbox |
| Echo chamber reduction | > 25% error reduction vs single-model panel (ReConcile baseline) |
| Budget overrun incidents | 0 per month |
7.3 Throughput Metrics
| Metric | Baseline | Target |
|---|---|---|
| Tasks per hour per node | 100 | 500 |
| Memory retrieval accuracy | — | > 90% relevance score |
| Knowledge graph query hit rate | — | > 85% |
7.4 Cost Metrics
| Metric | Target |
|---|---|
| Cost per 1,000 agent tasks | < $2.00 (GLM family) |
| Cache hit rate (Redis) | > 40% for repeated prompts |
| Budget enforcement accuracy | 100% (zero overruns) |
8. Roadmap Phases
Phase M0 — Foundation (Complete)
- IA document + scaffold
- Core architecture design
- Model gateway selection (LiteLLM)
- Initial Nix flake definition
Phase M1 — Core Gateway (Complete)
- LiteLLM Proxy deployment
- Virtual key management
- Redis cache integration
- Postgres persistence
- Basic budget enforcement
Phase M2 — Orchestration (In Progress)
- Temporal workflow engine
- LangGraph supervisor-worker pattern
- beads task graph CLI
- NATS JetStream transport
Phase M3 — Memory + Knowledge
- Mnemopi session/persistent memory
- Knowledge Gardener v0.21.0 vault
- Graphify knowledge graph
- Automatic daily recaps
Phase M4 — Security + Self-Evolution
- plugin-eval gate
- mcp-scan integration
- Sandboxed execution
- Darwin Gödel Machine self-modification loop
- Hardcore mode toggle
Phase M5 — Fleet + Observability
- Tailscale mesh configuration
- Multi-node deployment
- Langfuse full tracing
- taskdog Gantt dashboard
- Status dashboard
Phase M6 — Scale + Polish
- 1,000+ file knowledge graph support
- Mobile/edge node support
- Performance tuning (< 100ms routing)
- Full documentation suite
- Public release
9. Open Questions & Risks
9.1 Open Questions
| Q# | Question | Owner | Status |
|---|---|---|---|
| Q-001 | Which Temporal hosted tier or self-hosted version should the fleet deployment use? | Architecture | Open |
| Q-002 | Should Knowledge Gardener recaps be schedule-driven or trigger-based? | ML Ops | Open |
| Q-003 | How should model API keys be rotated without downtime? | Security | Open |
| Q-004 | Are GLM models reachable via the z.ai API from ARM nodes? | ML Ops | Investigating |
9.2 Risks
| Risk ID | Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|---|
| R-001 | z.ai API deprecates GLM-5.1 | Medium | High | Abstract model IDs behind virtual keys; swap backend without code changes |
| R-002 | Temporal license change | Low | High | Temporal is MIT + BSL; monitor for changes; keep LangGraph as standalone fallback |
| R-003 | Nix flake complexity deters contributors | Medium | Medium | Provide Docker fallback for quick start; Nix is optional for dev, required for prod fleet |
| R-004 | Sandbox escape via novel agent exploit | Low | Critical | Defense in depth: plugin-eval + sandbox + Tailscale ACL + human approval gates |
| R-005 | LiteLLM API surface changes break gateway | Medium | Medium | Pin LiteLLM version; integration tests on upgrade |
10. Appendix: Component Inventory
| Layer | Component | Technology | Status |
|---|---|---|---|
| L0 | Nix Flake | Nix 2.x | Active |
| L0 | Tailscale Mesh | Tailscale OSS | Active |
| L0 | Komodo Deploy | Komodo | Active |
| L1 | Message Bus | NATS JetStream | Active |
| L2 | Model Gateway | LiteLLM Proxy | Active |
| L2 | Cache | Redis | Active |
| L2 | Persistence | Postgres | Active |
| L2 | Observability | Langfuse | Active |
| L3 | Durable Exec | Temporal | Active |
| L3 | Agent Graph | LangGraph | Active |
| L4 | Harness | godotz.ai/hermes | Active |
| L5 | Task DAG | beads | Active |
| L5 | Human Gantt | taskdog | Active |
| L6 | Session Memory | Mnemopi | Active |
| L6 | Knowledge Vault | Knowledge Gardener v0.21.0 | Active |
| L6 | Code KG | Graphify | Active |