Model Gateway

godotz.ai routes every model call through a central LiteLLM proxy. No agent holds a real provider API key. The gateway handles routing, caching, budget enforcement, and audit logging — the agent just sees an OpenAI-compatible endpoint at http://gateway:4000.


Why a Central Gateway

Running agents directly against provider APIs creates four problems that compound at scale:

  1. Key sprawl — each agent needs credentials for every provider it might use
  2. No cost visibility — spending is discovered from billing statements, not before requests
  3. No caching — identical prompts pay full price every time
  4. No fallback — a provider outage kills any agent that uses only that provider

The gateway solves all four. It is the only component that holds real API keys.


The 8-Model Configuration

godotz.ai operates two model families plus a local fallback, totaling eight model entries:

# litellm_config.yaml
model_list:
  # GLM family — z.ai provider (4 models)
  - model_name: glm-5.1
    litellm_params:
      model: zhipuai/glm-5.1
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 10
      tpm: 100000

  - model_name: glm-4.7
    litellm_params:
      model: zhipuai/glm-4.7
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 2
      tpm: 40000

  - model_name: glm-4.5-air
    litellm_params:
      model: zhipuai/glm-4.5-air
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 5
      tpm: 80000

  - model_name: glm-5-turbo
    litellm_params:
      model: zhipuai/glm-5-turbo
      api_key: os.environ/ZHIPUAI_API_KEY
      rpm: 1
      tpm: 20000

  # Antigravity family — Anthropic provider (3 models)
  - model_name: claude-opus-4-6
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 32000

  - model_name: claude-sonnet-4-6
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 16000

  - model_name: claude-haiku-4-5
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 8000

  # Ollama — local fallback (1 model)
  - model_name: ollama/mistral-7b
    litellm_params:
      model: ollama/mistral:7b
      api_base: http://localhost:11434
      stream: true

Model Routing Logic

Agents specify a model by name or by routing tag. Routing tags let the gateway pick the best available model without the agent naming it explicitly:

TagModels tried (in order)Use case
omp/orchestratorclaude-opus-4-6Planning, synthesis, hard reasoning
omp/workerglm-5.1, glm-4.7Execution, code generation
omp/criticclaude-sonnet-4-6Evaluation, verification
omp/fastglm-5-turbo, glm-4.5-airHigh-throughput, latency-sensitive
omp/localollama/mistral-7bOffline, air-gapped, cost-zero
omp/visionclaude-sonnet-4-6Image inputs, multimodal

The gateway falls back to the next model in the list on a 429, 500, or timeout. A model that fails three consecutive requests is removed from the routing pool for 60 seconds.

request(model="omp/worker")
  → try glm-5.1         [429 rate limited]
  → try glm-4.7         [200 OK]
  ← response

Virtual Key Architecture

Every agent receives a virtual key at session start. Real provider keys never leave the gateway host.

┌─────────────┐         ┌──────────────────┐         ┌──────────────┐
│   Agent      │  vk-xx  │  LiteLLM Proxy   │  real   │  Provider    │
│  (no keys)   │ ──────► │  key: vk-abc123  │ ──────► │  API         │
└─────────────┘         │  budget: $5.00   │         └──────────────┘
                        │  models: [glm-*] │
                        └──────────────────┘

Creating a virtual key with budget and model restrictions:

litellm --create-key \
  --key-alias "agent-codex-session-42" \
  --max-budget 5.00 \
  --budget-duration "1d" \
  --models "glm-5.1,glm-4.7,glm-4.5-air,glm-5-turbo" \
  --metadata '{"agent_id": "codex", "session": 42}'

Key properties:

  • Budget is pre-checked before the request reaches the provider — overspend is impossible
  • Key expiry is enforced at the gateway, not the provider
  • Revoked keys return HTTP 401 immediately
  • All key activity is logged to Postgres with full metadata

Budget Enforcement

Budget enforcement is fail-closed: a request that would exceed the remaining budget is rejected before it reaches the model API.

incoming request
  → check virtual key budget in Postgres
  → if remaining_budget < estimated_cost: return 429
  → send to provider
  → deduct actual_cost from budget
  ← return response

Budget configuration hierarchy (most specific wins):

general_settings:
  default_team_settings:
    max_budget: 50.00
    budget_duration: "1d"

  per_model_budget:
    claude-opus-4-6:
      max_budget: 10.00
      budget_duration: "1d"
    glm-5.1:
      max_budget: 30.00
      budget_duration: "1d"

When a budget is exhausted, agents receive HTTP 429 with:

{
  "error": {
    "message": "Budget exceeded for key vk-abc123. Remaining: $0.00",
    "type": "budget_exceeded",
    "code": 429
  }
}

Redis Caching

Identical prompts within the cache TTL window return stored responses without hitting the provider.

litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: redis
    port: 6379
    ttl: 3600          # 1 hour default
    similarity_threshold: 0.95  # semantic similarity for near-duplicate detection

Cache key computation:

key = SHA-256(model + messages_json + temperature + top_p)

Deterministic requests (temperature=0) achieve the highest hit rates. The target hit rate is 40%+ for typical agent workloads, which reduces costs proportionally.

Checking cache performance:

# Redis CLI
redis-cli INFO stats | grep keyspace
redis-cli --stat

# LiteLLM dashboard
open http://gateway:4000/ui
# → "Cache Hit Rate" panel

Concurrency Limits

Each model has a configured concurrency ceiling enforced at the gateway. Requests beyond the ceiling are queued, not dropped:

ModelMax concurrentNotes
glm-5.110Primary worker model
glm-4.72Heavy reasoning tasks
glm-4.5-air5Fast draft generation
glm-5-turbo1Rate-limited plan; sequential only
claude-opus-4-63Orchestrator role; expensive
claude-sonnet-4-65Critic and verification
claude-haiku-4-510Quick classification
ollama/mistral-7b2Local GPU constraint

Total fleet concurrency: 38 simultaneous requests across all models.


Docker Control Plane Setup

The gateway runs alongside Postgres and Redis in the Docker control plane:

# docker-compose.gateway.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      DATABASE_URL: postgresql://litellm:${DB_PASS}@postgres:5432/litellm
      REDIS_URL: redis://redis:6379
      LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
      LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
      LANGFUSE_HOST: http://langfuse:3000
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
    command: --config /app/config.yaml --port 4000

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: litellm
      POSTGRES_USER: litellm
      POSTGRES_PASSWORD: ${DB_PASS}

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru