godotz.ai: Product Requirements Document

Version: 1.2 | Date: 2026-06-08 | Status: Active
Authors: godotz.ai Architecture Team
Classification: Internal / Open Source Reference


1. Problem Statement

1.1 The Fragmentation Problem

The AI agent tooling landscape in 2026 is deeply fragmented. Teams running autonomous agent workflows face four compounding failure modes:

  1. Single-model echo chambers — Most frameworks default to one model family for all roles (orchestrator, critic, executor). When the orchestrator and the reviewer share the same weights, systematic biases amplify rather than cancel.

  2. No fleet orchestration primitive — Existing tools assume a single machine. Running agents across heterogeneous OS nodes (Linux workstations, ARM edge devices, macOS laptops) requires bespoke glue code that breaks on every framework update.

  3. Opaque cost and budget enforcement — API costs spiral because there is no standard gateway that can enforce per-task, per-team, or per-model budgets before a runaway agent burns the monthly allowance.

  4. Unsafe self-improvement loops — Several projects attempt Darwin-Gödel-style self-modification but lack sandboxed empirical verification gates. Without fail-closed governance, a single bad mutation can corrupt the agent’s own tooling.

1.2 Market Context

By Q2 2026:

  • Claude Code, Cursor, and Codex each have millions of users but no cross-fleet coordination layer.
  • LiteLLM has become the de facto API gateway for AI teams but lacks a higher-level orchestration harness.
  • Temporal’s durable execution pattern is adopted by tech companies but rarely applied to AI agent DAGs.
  • The ReConcile paper (2024) showed that heterogeneous model panels reduce systematic error by 31% vs single-model panels, yet most frameworks ignore this finding.

1.3 Opportunity

godotz.ai fills the gap between raw model APIs and production-grade autonomous agent fleets. It provides the orchestration harness, not the models — a neutral layer any team can deploy on existing infrastructure.


2. Product Vision

godotz.ai is the autonomous multi-agent harness for heterogeneous OS fleets. It treats agent compute like infrastructure: declarative, reproducible, observable, and self-healing.

Core promise: Define your swarm in YAML. godotz.ai handles routing, scheduling, memory, cost, and verification — across every node in your fleet.


3. Target Users

3.1 Primary Audience

Persona | Description | Key Pain Point
AI Engineer | Builds autonomous pipelines, LLM applications | Framework churn, echo chambers, no fleet primitive
ML Ops Engineer | Operates model inference infrastructure | Cost overruns, no unified gateway, observability gaps
DevOps / Platform Engineer | Manages heterogeneous OS fleets | No agent-native scheduler, bespoke glue code
System Architect | Designs multi-agent architectures | Lack of durability, no standard memory layer, security gaps

3.2 Secondary Audience

Persona | Description
Research Engineer | Runs long-horizon autonomous research agents
Security Engineer | Audits AI supply chains, plugin provenance
Product Manager | Needs PRD + dashboard to track agent throughput and cost
Open Source Contributor | Extends godotz.ai with new agents, skills, or MCP servers

4. User Stories

4.1 Core Infrastructure (AI Engineer)

ID | As a… | I want to… | So that…
US-001 | AI Engineer | Define a swarm in a single YAML file | I can version-control my agent topology
US-002 | AI Engineer | Route tasks to different model families based on role | My orchestrator and critic never share the same weights
US-003 | AI Engineer | Retry failed agent tasks with exponential backoff | Transient API errors don’t abort long-running workflows
US-004 | AI Engineer | Inspect the full DAG of agent tasks in real time | I can debug which node is blocking a swarm
US-005 | AI Engineer | Persist agent memory across sessions | Long-horizon tasks don’t lose context between runs
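
US-001 and US-002 together imply that a swarm definition can be checked mechanically for role separation. The following is a minimal sketch of such a check, assuming the YAML file has already been loaded into plain records. The AgentSpec fields, the role names, and the family labels ("glm", "gemini") are hypothetical and are not the godotz.ai schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """One agent entry of a swarm definition (hypothetical field names)."""
    name: str
    role: str          # e.g. "orchestrator", "critic", "executor"
    model_family: str  # assumed family label, e.g. "glm" or "gemini"


def shared_families(agents, role_a="orchestrator", role_b="critic"):
    """Return the model families used by both roles; an empty set means OK."""
    families_a = {a.model_family for a in agents if a.role == role_a}
    families_b = {a.model_family for a in agents if a.role == role_b}
    return families_a & families_b


swarm = [
    AgentSpec("planner", "orchestrator", "glm"),
    AgentSpec("reviewer", "critic", "gemini"),
    AgentSpec("worker-1", "executor", "glm"),
]

if __name__ == "__main__":
    overlap = shared_families(swarm)
    print("role separation OK" if not overlap else f"shared families: {overlap}")
```

A loader could run this check when the swarm file is parsed and refuse to start a swarm whose orchestrator and critic resolve to the same family.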

4.2 Fleet Operations (ML Ops / DevOps)

ID | As a… | I want to… | So that…
US-006 | ML Ops Engineer | Set a monthly budget cap per model family | Runaway agents can’t burn the monthly quota
US-007 | ML Ops Engineer | See per-agent, per-task token consumption | I can attribute costs to specific workflows
US-008 | DevOps Engineer | Deploy godotz.ai across Linux + ARM + macOS nodes | I don’t have to manage three tool stacks for a heterogeneous fleet
US-009 | DevOps Engineer | Pin agent versions with Nix flakes | Every node runs identical, reproducible environments
US-010 | ML Ops Engineer | Route to a vision fallback model automatically | Agents that receive image inputs don’t fail silently

4.3 Security (Security Engineer)

ID | As a… | I want to… | So that…
US-011 | Security Engineer | Scan every plugin before it executes | No supply-chain attack propagates to production
US-012 | Security Engineer | Run agent code in a sandboxed environment | A compromised agent can’t exfiltrate secrets
US-013 | Security Engineer | Block MCP servers flagged by a CVE database | Known vulnerabilities are caught at the gate
US-014 | Security Engineer | Audit the full provenance of every tool call | I can reconstruct any incident from Langfuse traces

4.4 Self-Evolution (Advanced AI Engineer)

ID | As a… | I want to… | So that…
US-015 | AI Engineer | Allow agents to propose patches to their own prompts | The system improves without manual iteration
US-016 | AI Engineer | Gate every self-modification with empirical verification | A bad self-edit is rejected before it reaches production
US-017 | AI Engineer | Require human approval for significant architecture changes | I stay in the loop on non-trivial mutations
US-018 | AI Engineer | Roll back any self-modification atomically | A bad change is reversible in under 60 seconds

4.5 Knowledge and Memory

ID | As a… | I want to… | So that…
US-019 | AI Engineer | Store structured knowledge in a vault with automatic recaps | Agents don’t re-derive the same facts on every run
US-020 | AI Engineer | Query a knowledge graph built from my codebase | Agents navigate large repos without exhausting context windows
US-021 | AI Engineer | Have agent memories expire based on relevance scoring | The memory system stays lean over time

4.6 Edge Cases

ID | As a… | I want to… | So that…
US-022 | AI Engineer | Handle agent node failures gracefully | A single node going offline doesn’t abort the swarm
US-023 | ML Ops Engineer | Enforce strict rate limits per model | One agent can’t starve all others on the same gateway
US-024 | Security Engineer | Toggle “hardcore mode” to disable permissive fallbacks | High-security environments get fail-closed behavior end-to-end
US-025 | DevOps Engineer | Monitor fleet health with a live status dashboard | I know the health of every node without SSH-ing in

5. Functional Requirements

5.1 L0 — Substrate Layer (Nix + Tailscale)

Req ID | Requirement
FR-L0-001 | System SHALL provide Nix flake definitions for all supported platforms (x86_64-linux, aarch64-linux, aarch64-darwin)
FR-L0-002 | System SHALL establish encrypted mesh connectivity between all fleet nodes via Tailscale
FR-L0-003 | System SHALL support Komodo-based deployment orchestration for fleet node lifecycle management
FR-L0-004 | System SHALL ensure reproducible builds; two nodes built from the same flake SHALL produce bit-identical environments

5.2 L1 — Transport Layer (NATS)

Req ID | Requirement
FR-L1-001 | System SHALL use NATS JetStream for all inter-agent message passing
FR-L1-002 | System SHALL support at-least-once delivery semantics with configurable acknowledgement timeouts
FR-L1-003 | System SHALL provide NATS subject namespacing per fleet node and per agent role
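
FR-L1-003 can be illustrated with a small subject builder. NATS subjects are dot-separated tokens, so one natural layout puts the node and the role in fixed positions, which lets subscribers use wildcards per node or per role. The "godotz" prefix, the token order, and the allowed character set below are assumptions chosen for this sketch. The character set is deliberately stricter than NATS itself requires.

```python
import re

# Conservative token alphabet (an assumption of this sketch): it excludes
# ".", "*", ">" and whitespace, which have special meaning in NATS subjects.
_TOKEN = re.compile(r"[A-Za-z0-9_-]+")


def subject(node: str, role: str, event: str, prefix: str = "godotz") -> str:
    """Build '<prefix>.<node>.<role>.<event>' after validating each token."""
    tokens = (prefix, node, role, event)
    for token in tokens:
        if not _TOKEN.fullmatch(token):
            raise ValueError(f"invalid subject token: {token!r}")
    return ".".join(tokens)


if __name__ == "__main__":
    print(subject("node-a", "critic", "task"))  # godotz.node-a.critic.task
```

With this layout a subscriber could listen to every critic in the fleet on "godotz.*.critic.>" or to everything on one node on "godotz.node-a.>".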

5.3 L2 — Model Gateway (LiteLLM Proxy)

Req ID | Requirement
FR-L2-001 | System SHALL expose a unified OpenAI-compatible API endpoint routing to all configured model providers
FR-L2-002 | System SHALL enforce per-key budget limits; requests exceeding budget SHALL be rejected with HTTP 429
FR-L2-003 | System SHALL cache identical prompts in Redis with configurable TTL
FR-L2-004 | System SHALL persist all request/response pairs to Postgres for audit and cost attribution
FR-L2-005 | System SHALL support virtual keys mapped to model families (Antigravity keys, GLM keys)
FR-L2-006 | System SHALL provide an EMA (Exponential Moving Average) fallback chain when a primary model fails
FR-L2-007 | System SHALL route vision-capable requests to gemini-3.1-pro-low when the primary model lacks vision
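
The admission decision in FR-L2-002 can be sketched as a check that runs before any provider call. This is an illustration of the requirement only, not LiteLLM's implementation. The VirtualKey type, the admit function, and the use of integer cents are assumptions made for the example.

```python
from dataclasses import dataclass

HTTP_OK = 200
HTTP_TOO_MANY_REQUESTS = 429


@dataclass
class VirtualKey:
    """Budget state for one virtual key (hypothetical shape)."""
    key_id: str
    budget_cents: int
    spent_cents: int = 0


def admit(key: VirtualKey, estimated_cost_cents: int) -> int:
    """Return the HTTP status the gateway would answer with.

    A request that would push spend past the budget is rejected and
    leaves the recorded spend unchanged.
    """
    if key.spent_cents + estimated_cost_cents > key.budget_cents:
        return HTTP_TOO_MANY_REQUESTS
    key.spent_cents += estimated_cost_cents
    return HTTP_OK


if __name__ == "__main__":
    key = VirtualKey("glm-team-a", budget_cents=100)
    print(admit(key, 60), admit(key, 50), admit(key, 40))
```

Integer cents avoid floating-point drift in the running total. A real gateway would also need to reconcile the estimate against the actual cost after the call completes.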

5.4 L3 — Orchestration Layer (Temporal + LangGraph)

Req ID | Requirement
FR-L3-001 | System SHALL use Temporal for durable workflow execution with automatic retry on transient failures
FR-L3-002 | System SHALL implement the supervisor-worker LangGraph pattern as the standard 2-tier orchestration model
FR-L3-003 | System SHALL support nested subgraph composition for complex multi-agent workflows
FR-L3-004 | System SHALL provide workflow versioning so in-flight workflows can complete on the version they started
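
The retry behavior named in FR-L3-001 and US-003 follows the usual capped exponential backoff shape. The sketch below shows that shape in plain Python without the Temporal SDK. The parameter names, the default values, and the choice of retryable exception types are assumptions for illustration.

```python
import time


def backoff_delays(initial=1.0, coefficient=2.0, maximum=30.0, attempts=5):
    """Seconds to wait before each retry; `attempts` counts the first try too."""
    return [min(initial * coefficient ** i, maximum) for i in range(attempts - 1)]


def run_with_retry(task, delays, sleep=time.sleep,
                   retryable=(TimeoutError, ConnectionError)):
    """Call task(); on a retryable error wait for the next delay and try again."""
    for delay in [*delays, None]:      # None marks the final attempt
        try:
            return task()
        except retryable:
            if delay is None:
                raise                  # retries exhausted: surface the error
            sleep(delay)


if __name__ == "__main__":
    print(backoff_delays(attempts=7))
```

Errors outside the retryable tuple propagate immediately, which keeps permanent failures such as invalid input from being retried.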

5.5 L4 — Agent Runtime (godotz.ai/hermes)

Req ID | Requirement
FR-L4-001 | System SHALL provide a declarative YAML swarm definition format
FR-L4-002 | System SHALL support 128+ skills (pre-built agent capabilities) loadable per swarm
FR-L4-003 | System SHALL support 32+ agent types with configurable roles
FR-L4-004 | System SHALL support 7 MCP servers exposing 79 tools to agents
FR-L4-005 | System SHALL enforce concurrency limits per model family (GLM-5.1: 10, GLM-5-Turbo: 1, GLM-4.7: 2, GLM-4.5-Air: 5)
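
The per-model limits in FR-L4-005 map naturally onto one semaphore per model. The sketch below uses asyncio with the limits taken from the requirement. The call_model and demo functions are hypothetical stand-ins for the real runtime, and the sleep stands in for a model call. The four limits sum to 18 concurrent calls.

```python
import asyncio

# Limits copied from FR-L4-005.
LIMITS = {"GLM-5.1": 10, "GLM-5-Turbo": 1, "GLM-4.7": 2, "GLM-4.5-Air": 5}


async def call_model(model, semaphores, in_flight, peak, work_s=0.01):
    """Run one simulated model call while holding that model's semaphore."""
    async with semaphores[model]:
        in_flight[model] += 1
        peak[model] = max(peak[model], in_flight[model])
        await asyncio.sleep(work_s)    # stand-in for the real model call
        in_flight[model] -= 1


async def demo(calls_per_model=8):
    """Fire many calls per model and report the peak concurrency observed."""
    semaphores = {m: asyncio.Semaphore(n) for m, n in LIMITS.items()}
    in_flight = dict.fromkeys(LIMITS, 0)
    peak = dict.fromkeys(LIMITS, 0)
    await asyncio.gather(*(
        call_model(m, semaphores, in_flight, peak)
        for m in LIMITS for _ in range(calls_per_model)
    ))
    return peak


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

This enforces the limit within one process only. Fleet-wide enforcement would need shared state, for example at the gateway.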

5.6 L5 — Task Tracking (beads + taskdog)

Req ID | Requirement
FR-L5-001 | System SHALL represent agent workflows as DAGs using the beads task primitive
FR-L5-002 | System SHALL provide a bd CLI for task creation, status, retry, and cancellation
FR-L5-003 | System SHALL provide a Gantt/ETA view via taskdog for human operators
FR-L5-004 | System SHALL guarantee idempotent task execution; re-running a completed task SHALL be a no-op
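
FR-L5-001 and FR-L5-004 combine into a simple execution rule: run tasks in dependency order and skip anything already recorded as complete. The sketch below shows that rule with the standard-library graphlib module. It does not use beads or the bd CLI. The task names and the in-memory completed set are assumptions for the example.

```python
from graphlib import TopologicalSorter


def run_dag(deps, actions, completed):
    """Run tasks in dependency order, skipping tasks already completed.

    deps maps each task to the set of tasks it depends on.
    completed is a set that is updated in place, so a second call
    with the same set executes nothing.
    """
    executed = []
    for task in TopologicalSorter(deps).static_order():
        if task in completed:
            continue                   # idempotent: finished work is a no-op
        actions[task]()
        completed.add(task)
        executed.append(task)
    return executed


if __name__ == "__main__":
    deps = {"plan": set(), "code": {"plan"}, "review": {"code"}}
    actions = {name: (lambda n=name: print("running", n)) for name in deps}
    done = set()
    print(run_dag(deps, actions, done))
    print(run_dag(deps, actions, done))
```

In a durable system the completed set would live in persistent storage, so the no-op guarantee survives process restarts.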

5.7 L6 — Memory Layer (Mnemopi + KG + Graphify)

Req ID | Requirement
FR-L6-001 | System SHALL provide session-scoped and persistent long-term memory via Mnemopi
FR-L6-002 | System SHALL maintain a Knowledge Gardener vault with automatic daily recaps
FR-L6-003 | System SHALL build and query a code knowledge graph via Graphify
FR-L6-004 | System SHALL score memory relevance and expire low-relevance entries automatically
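
FR-L6-004 does not specify a scoring function. One plausible shape, shown here purely as an assumption, multiplies a stored relevance value by an exponential decay on time since last use and drops entries below a threshold. The Memory fields, the seven-day half-life, and the 0.2 threshold are hypothetical and are not Mnemopi's actual behavior.

```python
from dataclasses import dataclass

DAY_S = 24 * 3600


@dataclass
class Memory:
    """A stored memory entry (hypothetical shape)."""
    text: str
    relevance: float   # base relevance in [0, 1]
    last_used: float   # epoch seconds of the most recent retrieval


def score(memory: Memory, now: float, half_life_s: float = 7 * DAY_S) -> float:
    """Base relevance halved for every half-life since the entry was last used."""
    age = max(0.0, now - memory.last_used)
    return memory.relevance * 0.5 ** (age / half_life_s)


def expire(memories, now: float, threshold: float = 0.2):
    """Keep only the entries whose decayed score is still at or above threshold."""
    return [m for m in memories if score(m, now) >= threshold]


if __name__ == "__main__":
    now = 14 * DAY_S
    entries = [Memory("fresh fact", 0.9, now), Memory("stale fact", 0.6, 0.0)]
    print([m.text for m in expire(entries, now)])
```

Touching last_used on every retrieval means frequently used entries stay alive even when their base relevance is modest.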

5.8 Security Requirements

Req ID | Requirement
FR-SEC-001 | System SHALL scan all plugins via plugin-eval before execution
FR-SEC-002 | System SHALL scan all MCP servers via mcp-scan and block servers flagged by a CVE database
FR-SEC-003 | System SHALL execute agent code in a fail-closed sandbox
FR-SEC-004 | System SHALL trace all tool calls via Langfuse for full audit provenance
FR-SEC-005 | System SHALL provide a “hardcore mode” toggle that disables all permissive fallbacks
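
The document does not enumerate which fallbacks count as permissive. As one illustration, assuming that model fallback is among them, the sketch below shows how a hardcore flag could turn a fallback chain into a fail-closed lookup. The resolve_model function, the NoModelAvailable exception, and the model names are hypothetical.

```python
class NoModelAvailable(RuntimeError):
    """Raised when no permitted model can serve the request."""


def resolve_model(primary, fallbacks, is_healthy, hardcore=False):
    """Pick the model to call.

    In normal mode the first healthy fallback is used when the primary
    is down. In hardcore mode fallbacks are never consulted.
    """
    if is_healthy(primary):
        return primary
    if hardcore:
        raise NoModelAvailable(f"{primary} unavailable and fallbacks are disabled")
    for candidate in fallbacks:
        if is_healthy(candidate):
            return candidate
    raise NoModelAvailable("no healthy model in the fallback chain")


if __name__ == "__main__":
    healthy = {"model-b"}
    print(resolve_model("model-a", ["model-b"], healthy.__contains__))
```

Raising instead of returning a default keeps the failure visible to the caller, which matches the fail-closed intent of the toggle.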

5.9 Self-Evolution Requirements

Req ID | Requirement
FR-DGM-001 | System SHALL implement the Darwin Gödel Machine pattern for self-modification proposals
FR-DGM-002 | Every self-modification SHALL be tested in a sandboxed environment before promotion
FR-DGM-003 | Significant architecture mutations SHALL require human approval before taking effect
FR-DGM-004 | All self-modifications SHALL be atomically reversible within 60 seconds
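
FR-DGM-002 and FR-DGM-003 describe a two-stage gate: sandbox verification first, then human approval for significant changes. The following sketch shows one way to express that decision so that any verifier error counts as a rejection. The Proposal type, the gate function, and the three outcome strings are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Proposal:
    """A proposed self-modification (hypothetical shape)."""
    patch_id: str
    significant: bool  # True for architecture-level mutations


def gate(proposal: Proposal,
         run_sandbox_tests: Callable[[Proposal], bool],
         human_approved: Optional[bool] = None) -> str:
    """Return 'promoted', 'pending_approval', or 'rejected'."""
    try:
        passed = run_sandbox_tests(proposal)
    except Exception:
        return "rejected"          # fail closed: a verifier crash is a failure
    if passed is not True:
        return "rejected"          # anything other than an explicit pass fails
    if proposal.significant and human_approved is not True:
        return "pending_approval"  # verified, but waiting for a human decision
    return "promoted"


if __name__ == "__main__":
    print(gate(Proposal("p-1", significant=False), lambda p: True))
```

Checking for an explicit True rather than any truthy value means a verifier that returns None or a status string cannot promote a change by accident.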

6. Non-Functional Requirements

6.1 Performance

NFR ID | Requirement
NFR-P-001 | Model gateway routing latency SHALL be < 100ms (p99) excluding model inference time
NFR-P-002 | Task graph scheduling latency SHALL be < 50ms per node
NFR-P-003 | Memory retrieval (Mnemopi) SHALL respond in < 200ms for datasets up to 10,000 entries
NFR-P-004 | Knowledge graph queries (Graphify) SHALL return results in < 500ms for repos up to 1,000 files

6.2 Reliability

NFR ID | Requirement
NFR-R-001 | godotz.ai core services SHALL maintain 99.9% uptime (< 8.76h downtime/year)
NFR-R-002 | A single node failure SHALL NOT abort in-flight workflows on other nodes
NFR-R-003 | All durable workflows SHALL survive process restarts via Temporal persistence
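
The downtime allowance behind NFR-R-001 follows directly from the availability target: over a 365-day year (8,760 hours), 99.9% availability leaves 0.1% of the period, which is 8.76 hours. The helper below is a small sketch of that arithmetic. The function name and the 30-day month used in the example are assumptions.

```python
def downtime_budget_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Hours of allowed downtime in a period for a given availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    print(f"per year:  {downtime_budget_hours(0.999):.2f} h")
    print(f"per month: {downtime_budget_hours(0.999, 30 * 24) * 60:.1f} min")
```

The same target over a 30-day month allows about 43 minutes, which is often the more useful number for on-call planning.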

6.3 Security

NFR ID | Requirement
NFR-S-001 | All inter-node communication SHALL be encrypted (Tailscale WireGuard)
NFR-S-002 | All model API keys SHALL be stored in vault (never plaintext in config files)
NFR-S-003 | The sandbox SHALL be fail-closed: any sandbox breach attempt SHALL terminate the agent

6.4 Observability

NFR ID | Requirement
NFR-O-001 | Every model API call SHALL be traced in Langfuse with full prompt/response
NFR-O-002 | Every task SHALL emit lifecycle events (created, started, completed, failed) to NATS
NFR-O-003 | Budget consumption SHALL be queryable per virtual key, per agent, per day
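
The query shape in NFR-O-003 amounts to grouping spend by (virtual key, agent, day). The sketch below does that grouping in memory over a list of call records. The record field names and the use of UTC day boundaries are assumptions. In production the same grouping would more likely run as a query against the Postgres audit store.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_spend(records):
    """Sum cost per (key_id, agent, UTC day).

    Each record is a dict with the hypothetical fields
    key_id, agent, ts (epoch seconds), and cost_cents.
    """
    totals = defaultdict(int)
    for record in records:
        day = datetime.fromtimestamp(record["ts"], tz=timezone.utc).date().isoformat()
        totals[(record["key_id"], record["agent"], day)] += record["cost_cents"]
    return dict(totals)


if __name__ == "__main__":
    calls = [
        {"key_id": "glm", "agent": "planner", "ts": 0, "cost_cents": 3},
        {"key_id": "glm", "agent": "planner", "ts": 3600, "cost_cents": 2},
        {"key_id": "glm", "agent": "planner", "ts": 86400, "cost_cents": 7},
    ]
    for group, cents in sorted(daily_spend(calls).items()):
        print(group, cents)
```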

6.5 Scalability

NFR ID | Requirement
NFR-SC-001 | godotz.ai SHALL support fleet sizes from 1 to 50+ heterogeneous nodes
NFR-SC-002 | Total concurrent model calls SHALL scale with fleet size (18 per node baseline)
NFR-SC-003 | Knowledge graph SHALL support repositories up to 10,000 files

7. Success Metrics

7.1 Adoption Metrics

Metric | Target (M3) | Target (M6)
Active fleet nodes | 3 | 20+
Swarms defined | 10 | 100+
Daily agent tasks | 500 | 5,000+

7.2 Quality Metrics

Metric | Target
Self-evolution verification pass rate | > 80% of proposals survive sandbox
Echo chamber reduction | > 25% error reduction vs single-model panel (ReConcile baseline)
Budget overrun incidents | 0 per month

7.3 Throughput Metrics

Metric | Baseline | Target
Tasks per hour per node | 100 | 500
Memory retrieval accuracy | - | > 90% relevance score
Knowledge graph query hit rate | - | > 85%

7.4 Cost Metrics

Metric | Target
Cost per 1,000 agent tasks | < $2.00 (GLM family)
Cache hit rate (Redis) | > 40% for repeated prompts
Budget enforcement accuracy | 100% (zero overruns)

8. Roadmap Phases

Phase M0 — Foundation (Complete)

  • IA document + scaffold
  • Core architecture design
  • Model gateway selection (LiteLLM)
  • Initial Nix flake definition

Phase M1 — Core Gateway (Complete)

  • LiteLLM Proxy deployment
  • Virtual key management
  • Redis cache integration
  • Postgres persistence
  • Basic budget enforcement

Phase M2 — Orchestration (In Progress)

  • Temporal workflow engine
  • LangGraph supervisor-worker pattern
  • beads task graph CLI
  • NATS JetStream transport

Phase M3 — Memory + Knowledge

  • Mnemopi session/persistent memory
  • Knowledge Gardener v0.21.0 vault
  • Graphify knowledge graph
  • Automatic daily recaps

Phase M4 — Security + Self-Evolution

  • plugin-eval gate
  • mcp-scan integration
  • Sandboxed execution
  • Darwin Gödel Machine self-modification loop
  • Hardcore mode toggle

Phase M5 — Fleet + Observability

  • Tailscale mesh configuration
  • Multi-node deployment
  • Langfuse full tracing
  • taskdog Gantt dashboard
  • Status dashboard

Phase M6 — Scale + Polish

  • 1,000+ file knowledge graph support
  • Mobile/edge node support
  • Performance tuning (< 100ms routing)
  • Full documentation suite
  • Public release

9. Open Questions & Risks

9.1 Open Questions

Q# | Question | Owner | Status
Q-001 | Which Temporal SaaS tier or self-hosted version for fleet deployment? | Architecture | Open
Q-002 | Should Knowledge Gardener recaps run on a schedule or trigger-based? | ML Ops | Open
Q-003 | How to handle model API key rotation without downtime? | Security | Open
Q-004 | GLM model availability via z.ai API on ARM nodes? | ML Ops | Investigating

9.2 Risks

Risk ID | Risk | Likelihood | Impact | Mitigation
R-001 | z.ai API deprecates GLM-5.1 | Medium | High | Abstract model IDs behind virtual keys; swap backend without code changes
R-002 | Temporal license change | Low | High | Temporal is MIT + BSL; monitor for changes; keep LangGraph as standalone fallback
R-003 | Nix flake complexity deters contributors | Medium | Medium | Provide Docker fallback for quick start; Nix is optional for dev, required for prod fleet
R-004 | Sandbox escape via novel agent exploit | Low | Critical | Defense in depth: plugin-eval + sandbox + Tailscale ACL + human approval gates
R-005 | LiteLLM API surface changes break gateway | Medium | Medium | Pin LiteLLM version; integration tests on upgrade

10. Appendix: Component Inventory

Layer | Component | Technology | Status
L0 | Nix Flake | Nix 2.x | Active
L0 | Tailscale Mesh | Tailscale OSS | Active
L0 | Komodo Deploy | Komodo | Active
L1 | Message Bus | NATS JetStream | Active
L2 | Model Gateway | LiteLLM Proxy | Active
L2 | Cache | Redis | Active
L2 | Persistence | Postgres | Active
L2 | Observability | Langfuse | Active
L3 | Durable Exec | Temporal | Active
L3 | Agent Graph | LangGraph | Active
L4 | Harness | godotz.ai/hermes | Active
L5 | Task DAG | beads | Active
L5 | Human Gantt | taskdog | Active
L6 | Session Memory | Mnemopi | Active
L6 | Knowledge Vault | Knowledge Gardener v0.21.0 | Active
L6 | Code KG | Graphify | Active