Performance Benchmarks
Source: HARNESS-SPEC v3.1, Section 23
Measurement environment: Linux x64, i9-12900K, RTX 3080, 30 GiB RAM
Optimization stack: godotz.ai 12-layer (hyper/meta+alpha)
Grade Scale
| Grade | Meaning |
|---|---|
| S | Exceeds target; production-ready at measured scale |
| A | Meets target; some improvement headroom remains |
| B | Functional; needs tuning under load |
Summary Table
| Metric | Layer | Grade | Value |
|---|---|---|---|
| Context hydration token reduction | L10 | S | 95.2% |
| Tool start latency | L1–L12 | S | 12ms avg |
| Pheromone routing dispatch latency | L3 | S | 16.8ms |
| Semantic cache hit latency | L4 | S | 6.1ms |
| Regression suite full execution | L9 | S | 9ms |
| Cascade 3-model escalation | L2 | S | Verified end-to-end |
| Budget enforcement accuracy | L8 | S | Per-run + per-tool tracked |
| Plan cache inference reduction | L5 | S | 46% |
| Context promotion routing | L6 | A | sonnet-4-6 → gemini-3.1-pro |
| EMA model accuracy (glm-5.1) | L7 | A | 0.941 (n=8 tasks) |
S-Grade Metrics
1. Context Hydration Token Reduction — 95.2%
Layer: L10 omp-hydrate
Mechanism: On-demand context loading from the Graphify knowledge graph rather than reading full source files.
| Condition | Token Cost |
|---|---|
| Full codebase read | Baseline |
| Graphify hydration | −95.2% |
| Measured range | 81–95% depending on query specificity |
omp-hydrate <query> returns the exact subgraph needed. --files mode returns only paths, avoiding content load entirely. The 95.2% peak corresponds to narrow-scope architectural queries on the omp-playground 18-file codebase; broader queries land at the lower bound of the range.
2. Tool Start Latency — 12ms avg
Layer: All layers (composite)
Mechanism: All 11 godotz.ai optimization scripts (~/.local/bin/omp-*) are implemented as lightweight Bash or Python scripts with no heavy import chains.
| Script | Language | Typical start |
|---|---|---|
| omp-cascade | Bash | ~8ms |
| omp-pheromone | Python | ~15ms |
| omp-ema-tracker | Python | ~14ms |
| omp-budget | Python | ~12ms |
| omp-cache | Bash | ~6ms |
The 12ms average reflects the full tool invocation round-trip under normal system load. This is below the perceptible threshold for hook execution.
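A start-latency figure of this kind can be approximated by timing repeated process launches. The sketch below is illustrative only: `mean_start_latency_ms` is a hypothetical helper, not part of the godotz.ai tooling, and a no-op Python child process stands in for a real omp-* script.

```python
import subprocess
import sys
import time


def mean_start_latency_ms(cmd, runs=5):
    """Return the mean wall-clock time (ms) to launch cmd and wait for it to exit."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append((time.perf_counter() - start) * 1000.0)
    return sum(samples) / len(samples)


if __name__ == "__main__":
    # Stand-in command (assumption): a no-op Python process, not a real omp-* script.
    cmd = [sys.executable, "-c", "pass"]
    print(f"mean start latency: {mean_start_latency_ms(cmd):.1f} ms")
```

Results from such a loop depend heavily on system load and filesystem caching, so the first run is usually slower than the rest.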
3. Pheromone Routing Dispatch Latency — 16.8ms
Layer: L3 omp-pheromone
Mechanism: Ant-colony optimization reads current pheromone weights from ~/.omp/pheromone.json (keeping the last 200 task outcomes), computes slot allocation across the four model tiers, and returns a routing decision.
| Model | Slots | Pheromone Weight |
|---|---|---|
| glm-5.1 | 10 | 2.19 |
| glm-4.7 | 3 | 0.66 |
| glm-5-turbo | 3 | 0.66 |
| glm-4.5-air | 1 | 0.10 |
Evaporation rate: 0.05/cycle, chosen to prevent slot thrashing on volatile workloads. Total slots are held constant at 18; the snapshot above accounts for 17 of them. The 16.8ms figure includes the JSON read, weight computation, and slot recommendation output.
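The weight-to-slot step can be sketched as follows. This is a minimal illustration under stated assumptions, not the omp-pheromone implementation: evaporation is taken to be multiplicative decay (the usual ant-colony form), slots are split in proportion to weight with a largest-remainder rule, and the function names `evaporate` and `allocate_slots` are hypothetical.

```python
def evaporate(weights, rate=0.05):
    """Decay every pheromone weight by the evaporation rate for one cycle (assumed multiplicative)."""
    return {model: w * (1.0 - rate) for model, w in weights.items()}


def allocate_slots(weights, total_slots=18):
    """Split total_slots in proportion to weight using the largest-remainder method."""
    total_weight = sum(weights.values())
    shares = {m: w * total_slots / total_weight for m, w in weights.items()}
    slots = {m: int(share) for m, share in shares.items()}
    leftover = total_slots - sum(slots.values())
    # Hand the remaining slots to the models with the largest fractional shares.
    by_remainder = sorted(shares, key=lambda m: shares[m] - slots[m], reverse=True)
    for model in by_remainder[:leftover]:
        slots[model] += 1
    return slots


if __name__ == "__main__":
    # Weights copied from the table above.
    weights = {"glm-5.1": 2.19, "glm-4.7": 0.66, "glm-5-turbo": 0.66, "glm-4.5-air": 0.10}
    print(allocate_slots(evaporate(weights)))
```

Because uniform evaporation scales all weights equally, it changes the allocation only once new task outcomes deposit pheromone unevenly. A proportional rule like this one need not reproduce the slot column in the table, since the actual allocation rule is not specified here.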
4. Semantic Cache Hit Latency — 6.1ms
Layer: L4 omp-cache
Mechanism: Task hash lookup against ~/.omp/semantic-cache/ (filesystem-based, no in-process cache server).
| Operation | Latency |
|---|---|
| Cache HIT (file read + return) | ~6ms |
| Cache MISS (exit 1) | ~4ms |
| Cache STORE | ~8ms |
Current entries: 4. The filesystem approach avoids a Redis dependency while remaining below the tool invocation overhead threshold. Hit rate on repeated workflow runs: measured at 100% for identical task hashes.
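A filesystem-backed lookup of this kind can be sketched in a few lines. The example is a hedged illustration rather than the omp-cache code: the SHA-256 key, the one-file-per-entry layout, and the function names are all assumptions, and a temporary directory stands in for ~/.omp/semantic-cache/.

```python
import hashlib
import tempfile
from pathlib import Path


def task_key(task):
    """Derive a filename-safe cache key from the task text (SHA-256 is an assumption)."""
    return hashlib.sha256(task.encode("utf-8")).hexdigest()


def cache_store(cache_dir, task, result):
    """Write the result to a file named after the task hash."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / task_key(task)).write_text(result, encoding="utf-8")


def cache_lookup(cache_dir, task):
    """Return the cached result on a hit, or None on a miss."""
    entry = cache_dir / task_key(task)
    return entry.read_text(encoding="utf-8") if entry.is_file() else None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "semantic-cache"
        print(cache_lookup(cache, "summarize module graph"))  # miss: nothing stored yet
        cache_store(cache, "summarize module graph", "cached answer")
        print(cache_lookup(cache, "summarize module graph"))  # hit: returns stored text
```

A miss costs only a failed file-existence check, which is consistent with misses being cheaper than hits in the table above.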
5. Regression Suite Full Execution — 9ms
Layer: L9 omp-regression
Mechanism: 3 golden tests run sequentially via omp-regression run. Each test executes a command and diffs against a stored expected output in ~/.omp/regression-golden/.
| Metric | Value |
|---|---|
| Total tests | 3 |
| Execution time | ~9ms |
| Pass rate (current) | 3/3 |
| Test add latency | <2ms |
The 9ms figure covers full suite execution. Note that the current 3 tests use placeholder expected outputs — the timing benchmark is valid, but regression detection accuracy is contingent on real golden values being authored.
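The run-and-diff loop described above can be sketched as below. This is an assumption-laden illustration, not the omp-regression source: `run_golden_tests` is a hypothetical name, golden values are held in a dict instead of files under ~/.omp/regression-golden/, and a Python one-liner stands in for the command under test.

```python
import subprocess
import sys


def run_golden_tests(golden):
    """Run each command and compare its stdout to the stored expected output.

    golden maps a test name to (command, expected_stdout).
    Returns a dict of test name -> True (match) or False (mismatch).
    """
    results = {}
    for name, (cmd, expected) in golden.items():
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = proc.stdout.strip() == expected.strip()
    return results


if __name__ == "__main__":
    # Hypothetical golden entry; the child process stands in for a real command.
    golden = {
        "cascade-chain": (
            [sys.executable, "-c", "print('3 models checked')"],
            "3 models checked",
        ),
    }
    for name, passed in run_golden_tests(golden).items():
        print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

A test built on a placeholder expected value passes or fails for reasons unrelated to real behavior, which is why the timing result stands while detection accuracy does not.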
6. Cascade 3-Model Escalation — Verified
Layer: L2 omp-cascade
Mechanism: Confidence-gated escalation chain: glm-4.5-air → glm-4.7 → glm-5.1 → Antigravity fallback. Escalates on failure or confidence score below 0.7.
| Stage | Model | Triggers on |
|---|---|---|
| Primary | glm-4.5-air | First attempt |
| Secondary | glm-4.7 | Primary failure or low confidence |
| Tertiary | glm-5.1 | Secondary failure |
| Fallback | Antigravity | Tertiary failure (manual) |
Log: /tmp/omp-cascade.log. End-to-end 3-model chain verified in integration testing. The cascade is invoked manually — it is not yet integrated into the godotz.ai task dispatch loop (see known limitations).
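The escalation logic can be sketched as a loop over stages. This is a minimal sketch under assumptions, not the omp-cascade script: each model is represented by a hypothetical callable returning an answer and a confidence score, and the same 0.7 gate is applied at every stage.

```python
from typing import Callable, List, Optional, Tuple

# Each stage is (model name, callable returning (answer, confidence)).
Stage = Tuple[str, Callable[[str], Tuple[str, float]]]


def run_cascade(
    task: str, stages: List[Stage], min_confidence: float = 0.7
) -> Optional[Tuple[str, str]]:
    """Try each stage in order; return (model, answer) from the first confident success."""
    for name, call in stages:
        try:
            answer, confidence = call(task)
        except Exception:
            continue  # failure: escalate to the next stage
        if confidence >= min_confidence:
            return name, answer
        # Low confidence: fall through and escalate.
    return None  # every stage failed or was unsure; hand off to the manual fallback


if __name__ == "__main__":
    # Hypothetical stand-ins for real model calls.
    def unsure(task):
        return "draft answer", 0.4

    def broken(task):
        raise RuntimeError("model unavailable")

    def confident(task):
        return "final answer", 0.9

    stages = [("glm-4.5-air", unsure), ("glm-4.7", broken), ("glm-5.1", confident)]
    print(run_cascade("refactor module", stages))
```

Returning `None` rather than raising keeps the manual fallback decision with the caller, which matches the cascade being invoked manually today.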
7. Budget Enforcement — Per-Run and Per-Tool
Layer: L8 omp-budget
Mechanism: Token budget tracked per session run and per individual tool call. Enforcement via omp-budget check <tokens> before tool dispatch.
| Limit | Default | Configurable |
|---|---|---|
| Per-run | 500K tokens | Yes |
| Per-tool | 50K tokens | Yes |
DB: ~/.omp/budget.json. Budget enforcement is external to godotz.ai core — the calling agent or workflow must invoke omp-budget check explicitly. The enforcement accuracy for configured budgets is 100% (no overruns observed when check is called correctly in the dispatch path).
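The check-before-dispatch pattern can be illustrated as follows. The sketch assumes an in-memory counter in place of ~/.omp/budget.json, and the `Budget` class and `dispatch` function are hypothetical names, not the omp-budget interface.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    per_run_limit: int = 500_000  # default per-run limit from the table above
    per_tool_limit: int = 50_000  # default per-tool limit from the table above
    used: int = 0                 # tokens consumed so far in this run

    def check(self, tokens: int) -> bool:
        """Return True if a tool call of this size fits both limits."""
        return tokens <= self.per_tool_limit and self.used + tokens <= self.per_run_limit

    def record(self, tokens: int) -> None:
        """Charge a dispatched tool call against the run total."""
        self.used += tokens


def dispatch(budget: Budget, tokens: int) -> bool:
    """Check first, then record; enforcement only holds if every caller does this."""
    if not budget.check(tokens):
        return False
    budget.record(tokens)
    return True


if __name__ == "__main__":
    budget = Budget()
    print(dispatch(budget, 40_000))  # fits both limits
    print(dispatch(budget, 60_000))  # exceeds the per-tool limit
    print(budget.used)
```

Any call path that skips the check bypasses the limits entirely, which is the practical meaning of enforcement being external to the core.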
8. Plan Cache Inference Reduction — 46%
Layer: L5 omp-plan-cache
Mechanism: Repeated workflow plans are stored by MD5 hash of the task description. On a cache hit, the stored plan JSON is returned immediately, skipping LLM inference for planning.
| Metric | Value |
|---|---|
| Measured inference reduction | 46% |
| Hash algorithm | MD5 (exact match only) |
| Current entries | 2 |
| Hit latency | <10ms |
The 46% reduction applies to workflows where the same task type is run repeatedly (e.g., omp swarm run on the same codebase). Semantic similarity matching is a roadmap item — the current implementation only hits on exact task description matches.
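Exact-match keying by MD5 can be sketched like this. The example is illustrative and makes assumptions: `PlanCache` is a hypothetical in-memory stand-in for the on-disk store, and the plan structure shown is invented for the demonstration.

```python
import hashlib
import json


class PlanCache:
    """In-memory stand-in for an exact-match plan cache keyed by MD5 of the task description."""

    def __init__(self):
        self._plans = {}

    @staticmethod
    def key(description: str) -> str:
        """Hash the raw description; any change in the text yields a different key."""
        return hashlib.md5(description.encode("utf-8")).hexdigest()

    def store(self, description, plan):
        """Serialize the plan to JSON and file it under the description's hash."""
        self._plans[self.key(description)] = json.dumps(plan)

    def lookup(self, description):
        """Return the stored plan on an exact match, or None on a miss."""
        raw = self._plans.get(self.key(description))
        return json.loads(raw) if raw is not None else None


if __name__ == "__main__":
    cache = PlanCache()
    cache.store("omp swarm run", {"steps": ["scan", "plan", "execute"]})  # hypothetical plan
    print(cache.lookup("omp swarm run"))   # exact match: returns the stored plan
    print(cache.lookup("omp swarm run "))  # trailing space changes the hash: miss
```

The second lookup shows why exact matching limits the hit rate: even whitespace differences produce a different hash, which is the gap semantic similarity matching would close.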
A-Grade Metrics
9. Context Promotion Routing — sonnet-4-6 → gemini-3.1-pro
Layer: L6 omp-ctxeng / ~/.omp/models.yml
Grade: A — functional, accuracy depends on context quality scoring data volume
Mechanism: When a task’s context engineering score indicates sonnet-4-6 is underperforming for the context type, models.yml routes the promotion target to gemini-3.1-pro-high. Context quality data points are logged to ~/.omp/context-engineering.json.
The promotion mechanism is in place and triggering correctly. A grade reflects that the context quality scoring dataset is still small (early-session data), so routing decisions have limited empirical backing. This improves automatically as more sessions accumulate data points.
10. EMA Model Accuracy — glm-5.1 EMA 0.941
Layer: L7 omp-ema-tracker
Grade: A — reliable at n=8; production confidence requires n≥50
Mechanism: Exponential moving average of model task success rates. Alpha=0.3. Updated via omp-ema-tracker log <model> <type> <true|false> [tokens].
| Model | EMA Score | Task Count | Routing Signal |
|---|---|---|---|
| glm-5.1 | 0.941 | 8 | ★ (>0.8) |
| glm-4.7 | — | — | — |
| glm-4.5-air | — | — | — |
EMA threshold signals: ★ = high confidence (>0.8), ↓ = de-prioritize (<0.6). The 0.941 EMA for glm-5.1 is consistent with its role as the primary complex-task model, but 8 tasks is a small sample for stable EMA convergence. Grade A reflects accuracy at current data volume, not a ceiling on the mechanism.
EMA Tracking Detail
The EMA tracker uses alpha=0.3, which weights recent outcomes more heavily than older ones while damping noise:
EMA_new = alpha × outcome + (1 - alpha) × EMA_prev
= 0.3 × outcome + 0.7 × EMA_prev
A single failure on glm-5.1 (current EMA 0.941) would produce:
EMA_new = 0.3 × 0 + 0.7 × 0.941 = 0.659
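This drops the model below the ★ threshold (0.8) but leaves it above the ↓ threshold (0.6). A second consecutive failure gives 0.7 × 0.659 ≈ 0.461, which crosses the ↓ threshold, so the EMA responds to quality degradation after two consecutive failures from a healthy baseline.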
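The update rule is simple enough to check numerically. The sketch below applies the formula above with alpha=0.3 and counts how many consecutive failures it takes to cross a given threshold; `ema_update` and `failures_until_below` are hypothetical helper names, not part of omp-ema-tracker.

```python
def ema_update(prev: float, success: bool, alpha: float = 0.3) -> float:
    """Apply one EMA step: alpha * outcome + (1 - alpha) * prev, with outcome 1 or 0."""
    outcome = 1.0 if success else 0.0
    return alpha * outcome + (1.0 - alpha) * prev


def failures_until_below(start: float, threshold: float, alpha: float = 0.3) -> int:
    """Count consecutive failures needed for the EMA to fall below threshold."""
    if threshold <= 0 or not 0 < alpha <= 1:
        raise ValueError("threshold must be > 0 and alpha must be in (0, 1]")
    ema, count = start, 0
    while ema >= threshold:
        ema = ema_update(ema, success=False, alpha=alpha)
        count += 1
    return count


if __name__ == "__main__":
    ema = 0.941  # current glm-5.1 score from the table above
    for i in range(1, 4):
        ema = ema_update(ema, success=False)
        print(f"after failure {i}: EMA = {ema:.3f}")
    print("failures to cross 0.6:", failures_until_below(0.941, threshold=0.6))
```

Raising alpha makes the tracker react faster to failures at the cost of more noise from single outcomes.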
Regression Detection Architecture
The regression gate (Layer 9) uses golden-value diffing, not statistical detection:
omp-regression add "cascade-chain" "3 models checked"
omp-regression run # diffs live output against golden
This is deterministic: either the output matches or it doesn’t. It is not a performance anomaly detector. Latency regression detection (e.g., cache hit time doubling) requires separate monitoring outside the current regression gate implementation.
Notes on Measurement Conditions
All latency figures were measured on a single node under normal interactive load (no concurrent swarm runs, no GPU inference active). The 18-file omp-playground codebase is small — hydration savings percentages are expected to hold or improve on larger codebases due to higher baseline read cost. Budget and EMA figures will stabilize as session volume increases.