Model Gateway

godotz.ai routes every model call through a central LiteLLM proxy. No agent holds a real provider API key. The gateway handles routing, caching, budget enforcement, and audit logging — the agent just sees an OpenAI-compatible endpoint at http://gateway:4000.

Why a Central Gateway

Running agents directly against provider APIs creates four problems that compound at scale:

Key sprawl — each agent needs credentials for every provider it might use
No cost visibility — spending is discovered from billing statements, not before requests
No caching — identical prompts pay full price every time
No fallback — a provider outage kills any agent that uses only that provider

The gateway solves all four. It is the only component that holds real API keys.

The 8-Model Configuration

godotz.ai operates two model families plus a local fallback, totaling eight model entries:

# litellm_config.yaml
model_list:
  # GLM family — z.ai provider (4 models)
  - model_name: glm-5.1
    litellm_params:
      model: zhipuai/glm-5.1
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 10
      tpm: 100000

  - model_name: glm-4.7
    litellm_params:
      model: zhipuai/glm-4.7
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 2
      tpm: 40000

  - model_name: glm-4.5-air
    litellm_params:
      model: zhipuai/glm-4.5-air
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 5
      tpm: 80000

  - model_name: glm-5-turbo
    litellm_params:
      model: zhipuai/glm-5-turbo
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 1
      tpm: 20000

  # Antigravity family — Anthropic provider (3 models)
  - model_name: claude-opus-4-6
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 32000

  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 16000

  - model_name: claude-haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 8000

  # Ollama — local fallback (1 model)
  - model_name: ollama/mistral-7b
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://localhost:11434
      stream: true

Model Routing Logic

Agents specify a model by name or by routing tag. Routing tags let the gateway pick the best available model without the agent naming it explicitly:

Tag	Models tried (in order)	Use case
`omp/orchestrator`	claude-opus-4-6	Planning, synthesis, hard reasoning
`omp/worker`	glm-5.1, glm-4.7	Execution, code generation
`omp/critic`	claude-sonnet-4-6	Evaluation, verification
`omp/fast`	glm-5-turbo, glm-4.5-air	High-throughput, latency-sensitive
`omp/local`	ollama/mistral-7b	Offline, air-gapped, cost-zero
`omp/vision`	claude-sonnet-4-6	Image inputs, multimodal

The gateway falls back to the next model in the list on a 429, 500, or timeout. A model that fails three consecutive requests is removed from the routing pool for 60 seconds.

request(model="omp/worker")
  → try glm-5.1         [429 rate limited]
  → try glm-4.7         [200 OK]
  ← response

Virtual Key Architecture

Every agent receives a virtual key at session start. Real provider keys never leave the gateway host.

┌─────────────┐         ┌──────────────────┐         ┌──────────────┐
│   Agent      │  vk-xx  │  LiteLLM Proxy   │  real   │  Provider    │
│  (no keys)   │ ──────► │  key: vk-abc123  │ ──────► │  API         │
└─────────────┘         │  budget: $5.00   │         └──────────────┘
                        │  models: [glm-*] │
                        └──────────────────┘

Creating a virtual key with budget and model restrictions:

litellm --create-key \
  --key-alias "agent-codex-session-42" \
  --max-budget 5.00 \
  --budget-duration "1d" \
  --models "glm-5.1,glm-4.7,glm-4.5-air,glm-5-turbo" \
  --metadata '{"agent_id": "codex", "session": 42}'

Key properties:

Budget is pre-checked before the request reaches the provider — overspend is impossible
Key expiry is enforced at the gateway, not the provider
Revoked keys return HTTP 401 immediately
All key activity is logged to Postgres with full metadata

Budget Enforcement

Budget enforcement is fail-closed: a request that would exceed the remaining budget is rejected before it reaches the model API.

incoming request
  → check virtual key budget in Postgres
  → if remaining_budget < estimated_cost: return 429
  → send to provider
  → deduct actual_cost from budget
  ← return response

Budget configuration hierarchy (most specific wins):

general_settings:
  default_team_settings:
    max_budget: 50.00
    budget_duration: "1d"

  per_model_budget:
    claude-opus-4-6:
      max_budget: 10.00
      budget_duration: "1d"
    glm-5.1:
      max_budget: 30.00
      budget_duration: "1d"

When a budget is exhausted, agents receive HTTP 429 with:

{
  "error": {
    "message": "Budget exceeded for key vk-abc123. Remaining: $0.00",
    "type": "budget_exceeded",
    "code": 429
  }
}

Redis Caching

Identical prompts within the cache TTL window return stored responses without hitting the provider.

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600          # 1 hour default
    similarity_threshold: 0.95  # semantic similarity for near-duplicate detection

Cache key computation:

key = SHA-256(model + messages_json + temperature + top_p)

Deterministic requests (temperature=0) achieve the highest hit rates. The target hit rate is 40%+ for typical agent workloads, which reduces costs proportionally.

Checking cache performance:

# Redis CLI
redis-cli INFO stats | grep keyspace
redis-cli --stat

# LiteLLM dashboard
open http://gateway:4000/ui
# → "Cache Hit Rate" panel

Concurrency Limits

Each model has a configured concurrency ceiling enforced at the gateway. Requests beyond the ceiling are queued, not dropped:

Model	Max concurrent	Notes
glm-5.1	10	Primary worker model
glm-4.7	2	Heavy reasoning tasks
glm-4.5-air	5	Fast draft generation
glm-5-turbo	1	Rate-limited plan; sequential only
claude-opus-4-6	3	Orchestrator role; expensive
claude-sonnet-4-6	5	Critic and verification
claude-haiku-4-5	10	Quick classification
ollama/mistral-7b	2	Local GPU constraint

Total fleet concurrency: 38 simultaneous requests across all models.

Docker Control Plane Setup

The gateway runs alongside Postgres and Redis in the Docker control plane:

# docker-compose.gateway.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      DATABASE_URL: postgresql://litellm:${DB_PASS}@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
      LANGFUSE_HOST: http://langfuse:3000
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    command: --config /app/config.yaml --port 4000

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${DB_PASS}

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru

Architecture Overview — where the gateway sits in the L0–L6 stack
Observability — Langfuse trace integration
Orchestration — model selection in swarm YAML