Model Gateway
godotz.ai routes every model call through a central LiteLLM proxy. No agent holds a real provider API key. The gateway handles routing, caching, budget enforcement, and audit logging — the agent just sees an OpenAI-compatible endpoint at http://gateway:4000.
Why a Central Gateway
Running agents directly against provider APIs creates four problems that compound at scale:
- Key sprawl — each agent needs credentials for every provider it might use
- No cost visibility — spending is discovered from billing statements, not before requests
- No caching — identical prompts pay full price every time
- No fallback — a provider outage kills any agent that uses only that provider
The gateway solves all four. It is the only component that holds real API keys.
The 8-Model Configuration
godotz.ai operates two model families plus a local fallback, totaling eight model entries:
# litellm_config.yaml
model_list:
# GLM family — z.ai provider (4 models)
- model_name: glm-5.1
litellm_params:
model: zhipuai/glm-5.1
api_key: os.environ/ZHIPUAI_API_KEY
rpm: 10
tpm: 100000
- model_name: glm-4.7
litellm_params:
model: zhipuai/glm-4.7
api_key: os.environ/ZHIPUAI_API_KEY
rpm: 2
tpm: 40000
- model_name: glm-4.5-air
litellm_params:
model: zhipuai/glm-4.5-air
api_key: os.environ/ZHIPUAI_API_KEY
rpm: 5
tpm: 80000
- model_name: glm-5-turbo
litellm_params:
model: zhipuai/glm-5-turbo
api_key: os.environ/ZHIPUAI_API_KEY
rpm: 1
tpm: 20000
# Antigravity family — Anthropic provider (3 models)
- model_name: claude-opus-4-6
litellm_params:
model: anthropic/claude-opus-4-6
api_key: os.environ/ANTHROPIC_API_KEY
max_tokens: 32000
- model_name: claude-sonnet-4-6
litellm_params:
model: anthropic/claude-sonnet-4-6
api_key: os.environ/ANTHROPIC_API_KEY
max_tokens: 16000
- model_name: claude-haiku-4-5
litellm_params:
model: anthropic/claude-haiku-4-5
api_key: os.environ/ANTHROPIC_API_KEY
max_tokens: 8000
# Ollama — local fallback (1 model)
- model_name: ollama/mistral-7b
litellm_params:
model: ollama/mistral:7b
api_base: http://localhost:11434
stream: true
Model Routing Logic
Agents specify a model by name or by routing tag. Routing tags let the gateway pick the best available model without the agent naming it explicitly:
| Tag | Models tried (in order) | Use case |
|---|---|---|
omp/orchestrator | claude-opus-4-6 | Planning, synthesis, hard reasoning |
omp/worker | glm-5.1, glm-4.7 | Execution, code generation |
omp/critic | claude-sonnet-4-6 | Evaluation, verification |
omp/fast | glm-5-turbo, glm-4.5-air | High-throughput, latency-sensitive |
omp/local | ollama/mistral-7b | Offline, air-gapped, cost-zero |
omp/vision | claude-sonnet-4-6 | Image inputs, multimodal |
The gateway falls back to the next model in the list on a 429, 500, or timeout. A model that fails three consecutive requests is removed from the routing pool for 60 seconds.
request(model="omp/worker")
→ try glm-5.1 [429 rate limited]
→ try glm-4.7 [200 OK]
← response
Virtual Key Architecture
Every agent receives a virtual key at session start. Real provider keys never leave the gateway host.
┌─────────────┐ ┌──────────────────┐ ┌──────────────┐
│ Agent │ vk-xx │ LiteLLM Proxy │ real │ Provider │
│ (no keys) │ ──────► │ key: vk-abc123 │ ──────► │ API │
└─────────────┘ │ budget: $5.00 │ └──────────────┘
│ models: [glm-*] │
└──────────────────┘
Creating a virtual key with budget and model restrictions:
litellm --create-key \
--key-alias "agent-codex-session-42" \
--max-budget 5.00 \
--budget-duration "1d" \
--models "glm-5.1,glm-4.7,glm-4.5-air,glm-5-turbo" \
--metadata '{"agent_id": "codex", "session": 42}'
Key properties:
- Budget is pre-checked before the request reaches the provider — overspend is impossible
- Key expiry is enforced at the gateway, not the provider
- Revoked keys return HTTP 401 immediately
- All key activity is logged to Postgres with full metadata
Budget Enforcement
Budget enforcement is fail-closed: a request that would exceed the remaining budget is rejected before it reaches the model API.
incoming request
→ check virtual key budget in Postgres
→ if remaining_budget < estimated_cost: return 429
→ send to provider
→ deduct actual_cost from budget
← return response
Budget configuration hierarchy (most specific wins):
general_settings:
default_team_settings:
max_budget: 50.00
budget_duration: "1d"
per_model_budget:
claude-opus-4-6:
max_budget: 10.00
budget_duration: "1d"
glm-5.1:
max_budget: 30.00
budget_duration: "1d"
When a budget is exhausted, agents receive HTTP 429 with:
{
"error": {
"message": "Budget exceeded for key vk-abc123. Remaining: $0.00",
"type": "budget_exceeded",
"code": 429
}
}
Redis Caching
Identical prompts within the cache TTL window return stored responses without hitting the provider.
litellm_settings:
cache: true
cache_params:
type: redis
host: redis
port: 6379
ttl: 3600 # 1 hour default
similarity_threshold: 0.95 # semantic similarity for near-duplicate detection
Cache key computation:
key = SHA-256(model + messages_json + temperature + top_p)
Deterministic requests (temperature=0) achieve the highest hit rates. The target hit rate is 40%+ for typical agent workloads, which reduces costs proportionally.
Checking cache performance:
# Redis CLI
redis-cli INFO stats | grep keyspace
redis-cli --stat
# LiteLLM dashboard
open http://gateway:4000/ui
# → "Cache Hit Rate" panel
Concurrency Limits
Each model has a configured concurrency ceiling enforced at the gateway. Requests beyond the ceiling are queued, not dropped:
| Model | Max concurrent | Notes |
|---|---|---|
| glm-5.1 | 10 | Primary worker model |
| glm-4.7 | 2 | Heavy reasoning tasks |
| glm-4.5-air | 5 | Fast draft generation |
| glm-5-turbo | 1 | Rate-limited plan; sequential only |
| claude-opus-4-6 | 3 | Orchestrator role; expensive |
| claude-sonnet-4-6 | 5 | Critic and verification |
| claude-haiku-4-5 | 10 | Quick classification |
| ollama/mistral-7b | 2 | Local GPU constraint |
Total fleet concurrency: 38 simultaneous requests across all models.
Docker Control Plane Setup
The gateway runs alongside Postgres and Redis in the Docker control plane:
# docker-compose.gateway.yml
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
ports:
- "4000:4000"
environment:
DATABASE_URL: postgresql://litellm:${DB_PASS}@postgres:5432/litellm
REDIS_URL: redis://redis:6379
LANGFUSE_PUBLIC_KEY: ${LANGFUSE_PUBLIC_KEY}
LANGFUSE_SECRET_KEY: ${LANGFUSE_SECRET_KEY}
LANGFUSE_HOST: http://langfuse:3000
volumes:
- ./litellm_config.yaml:/app/config.yaml
command: --config /app/config.yaml --port 4000
postgres:
image: postgres:16
environment:
POSTGRES_DB: litellm
POSTGRES_USER: litellm
POSTGRES_PASSWORD: ${DB_PASS}
redis:
image: redis:7-alpine
command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
Related
- Architecture Overview — where the gateway sits in the L0–L6 stack
- Observability — Langfuse trace integration
- Orchestration — model selection in swarm YAML