Anyone can write "sub-10ms typical" on a landing page. Whether the page survives an honest reader running a stopwatch is a different question. This post is the stopwatch.
Two numbers we'll lead with, and we'll show our work for both.
The first is in-process rule evaluation. The second is end-to-end through the gateway, including auth, gRPC roundtrips between services, normalization, classification, policy decision, and audit. Different layers, different hardware, different scopes. Both are real measurements.
What we measured, and on what
Two test environments. We ran the engine bench on a developer laptop and the end-to-end gateway on a small production VPS. The gap between them matters when you read the rest of the post, so we want it stated up front.
| Environment | CPU | RAM | What it ran |
|---|---|---|---|
| Bench host | AMD Ryzen 5 5600H, 12 threads, ~3.3 GHz | 27 GB | cargo bench — engine layer only |
| Deploy host | AMD EPYC-Rome, 2 vCPU @ 2.0 GHz | 3.7 GB | Full gateway with NATS, Redis, Postgres, ClickHouse |
The bench host is laptop-class. The deploy host is the smallest VPS that runs the full stack reasonably. Numbers from one don't generalise to the other. We deliberately chose hardware at both ends so readers can see both the floor and the typical case.
The engine: 44 µs for 263 rules
The detection engine is what runs against every tool call. It compiles 263 rules at startup, then evaluates each request through five layers: a fast funnel, regex rules, dictionary scan, signal compounding, and normalisation for encoded payloads.
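For orientation, here's a minimal sketch of that layered shape. The names and wiring are ours for illustration, not Clampd's internal types, and the exact ordering of normalisation relative to the other layers may differ in the real engine.

```rust
// Illustrative five-layer shape (hypothetical names, not Clampd's real
// types). Cheap, broad checks run before the expensive regex layer, and
// every layer feeds one accumulated score.
struct Ctx {
    normalised: String, // decoded / canonicalised view of the tool-call payload
    score: f64,         // accumulated risk score across layers
}

fn evaluate(payload: &str) -> f64 {
    let mut ctx = Ctx { normalised: normalise(payload), score: 0.0 };
    funnel(&mut ctx);         // fast Aho-Corasick prefilter
    l1_regex_rules(&mut ctx); // the compiled rule set, the costliest layer
    l2_dictionary(&mut ctx);  // single-pass keyword scan
    l3_signals(&mut ctx);     // compound weak signals into a final score
    ctx.score
}

// Stubs so the sketch stands on its own; the real layers do the work.
fn normalise(p: &str) -> String { p.to_ascii_lowercase() } // plus base64 unwrap etc.
fn funnel(ctx: &mut Ctx) { if ctx.normalised.contains("drop table") { ctx.score += 0.2 } }
fn l1_regex_rules(_ctx: &mut Ctx) {}
fn l2_dictionary(_ctx: &mut Ctx) {}
fn l3_signals(_ctx: &mut Ctx) {}

fn main() {
    println!("score = {}", evaluate(r#"{"sql":"SELECT 1"}"#));
}
```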
The full pipeline benchmark drives a JSON tool-call payload through every layer and reports the criterion mean.
| Bench | Mean | What it does |
|---|---|---|
| stage_full_safe | 44.3 µs | Full engine on a clean SQL query |
| stage_full_attack | 36.4 µs | Full engine on a destructive SQL payload |
| stage_full_encoded | 28.6 µs | Full engine on a base64-encoded attack |
The attack path is faster than the safe path. That looks counter-intuitive until you remember rule evaluation can short-circuit the moment the score crosses the block threshold. Safe input sweeps every rule because nothing fires. Worth knowing if you ever care about adversarial latency.
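Here's the shape of that short-circuit, as an illustrative sketch rather than the engine's actual rule loop: once the accumulated score crosses the block threshold, nothing that follows can change the verdict, so evaluation stops.

```rust
// Illustrative early-exit rule loop (not Clampd's real code). A hot
// attack payload trips high-weight rules early and exits; a clean
// payload has to be checked against every rule before it can pass.
struct Rule {
    weight: f64,
    pattern: &'static str, // stand-in for a compiled regex
}

fn score(rules: &[Rule], payload: &str, block_threshold: f64) -> (f64, bool) {
    let mut total = 0.0;
    for rule in rules {
        if payload.contains(rule.pattern) {
            total += rule.weight;
            if total >= block_threshold {
                return (total, true); // short-circuit: verdict already decided
            }
        }
    }
    (total, false) // swept every rule, nothing conclusive fired
}

fn main() {
    let rules = [
        Rule { weight: 0.8, pattern: "drop table" },
        Rule { weight: 0.3, pattern: "xp_cmdshell" },
    ];
    // The attack exits after the first hit; the safe query scans all rules.
    assert_eq!(score(&rules, "drop table users", 0.75).1, true);
    assert_eq!(score(&rules, "select 1", 0.75).1, false);
}
```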
The breakdown by layer shows where time goes inside a single 44µs evaluation.
| Layer | Mean | Notes |
|---|---|---|
| stage_json_extract | 285 ns | Pull tool params out of the request body |
| stage_context_create | 174 ns | Build the per-call evaluation context |
| stage_funnel_clean | 2.57 µs | Aho-Corasick funnel on plaintext |
| stage_funnel_base64 | 3.99 µs | Same funnel after base64 unwrap |
| stage_normalize | 5.06 µs | Multi-step normalisation for encoded variants |
| stage_l1_rules_safe | 7.23 µs | L1 regex layer, safe input, all 263 rules |
| stage_l1_rules_attack | 7.22 µs | L1 regex layer, attack input |
| stage_l2_dictionary | 67 ns | L2 keyword dictionary scan |
| stage_l3_signals | 1.71 µs | L3 compound signal scoring |
Two things worth pointing at. The full pipeline costs more than the sum of its parts because of context propagation between layers. And L2 dictionary at 67 nanoseconds is doing single-pass Aho-Corasick across thousands of keywords, which is one of the reasons we put it before the more expensive regex layer.
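That 67 ns figure is plausible because Aho-Corasick matching cost scales with the length of the input, not with the size of the dictionary. A minimal sketch with the aho-corasick crate, using our own toy keyword list rather than Clampd's dictionary:

```rust
// Sketch of a single-pass keyword scan with the aho-corasick crate
// (add `aho-corasick = "1"` to Cargo.toml). The keyword list here is
// illustrative; the point is that matching cost grows with the input,
// not with the number of keywords.
use aho_corasick::AhoCorasick;

fn main() {
    let keywords = ["drop", "truncate", "xp_cmdshell", "rm -rf"]; // imagine thousands
    let ac = AhoCorasick::new(keywords).unwrap();

    let payload = r#"{"sql":"SELECT * FROM orders WHERE id = 1"}"#;
    // One pass over the payload finds every keyword occurrence at once.
    let hits: Vec<&str> = ac
        .find_iter(payload)
        .map(|m| keywords[m.pattern().as_usize()])
        .collect();
    println!("keyword hits: {hits:?}");
}
```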
The bench output also reports compile_110_builtin_rules at 286ms. That's the one-time cost of parsing 13 TOML files, compiling 263 regex patterns, and validating taxonomy on gateway startup. It's amortised across every request the gateway serves for the rest of its uptime. You pay it once when the process starts. Then 44 microseconds forever.
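That's the classic compile-once, evaluate-forever shape. Here's a minimal sketch of the pattern using std's OnceLock and the regex crate; illustrative only, since the real loader parses TOML rule packs and validates taxonomy as described above.

```rust
// Compile-once, evaluate-forever: pay the expensive compilation a single
// time at startup, then every request reuses the compiled set. Sketch
// only; Clampd's real loader works from TOML rule packs, not inline patterns.
// Requires `regex = "1"` in Cargo.toml.
use regex::RegexSet;
use std::sync::OnceLock;

static RULES: OnceLock<RegexSet> = OnceLock::new();

fn rules() -> &'static RegexSet {
    RULES.get_or_init(|| {
        // The ~286 ms cost lives here and runs exactly once per process.
        RegexSet::new([
            r"(?i)\bdrop\s+table\b",
            r"(?i)\btruncate\s+table\b",
        ])
        .expect("rules compile")
    })
}

fn is_suspicious(payload: &str) -> bool {
    // Per-request cost: just match against the already-compiled set.
    rules().is_match(payload)
}

fn main() {
    assert!(is_suspicious("DROP TABLE users"));
    assert!(!is_suspicious("SELECT 1"));
}
```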
The gateway: 5ms p50 end-to-end on 2 vCPUs
The engine bench tells you the cost of the rules. The gateway bench tells you what shows up at the wire. Auth, agent identification, normalisation, the gRPC roundtrip to the intent service, policy decision, scope token mint, audit emit. All of it.
We drove a 540-call burst against the production gateway with the request rate-limited so the demo org could absorb it. 42 calls produced complete stage-latency log lines. Smaller sample size than we wanted; we'll explain why under "what we couldn't measure" below.
| Percentile | End-to-end |
|---|---|
| p50 | 5.14 ms |
| p90 | 6.80 ms |
| p95 | 7.92 ms |
| p99 | 10.53 ms |
| mean | 5.33 ms |
| n | 42 |
Per-stage breakdown shows nothing dominating. Each gateway stage takes around a millisecond or less at p50, which is roughly what we'd expect from gRPC roundtrips between services on the same box.
| Stage | p50 | p95 | What happens |
|---|---|---|---|
| auth | 0.55 ms | 1.38 ms | JWT verify, kill-list check |
| license_ratelimit | 0.81 ms | 1.85 ms | Per-org budget check in Redis |
| identify_delegate | 0.79 ms | 2.12 ms | gRPC to ag-registry, agent profile |
| normalize_extract | 0.75 ms | 2.75 ms | Decode and JSON-extract tool params |
| session_toolauth | 0.55 ms | 2.16 ms | Per-tool grant check, session budget |
| boundary_check | 0.38 ms | 0.70 ms | Trust boundary enforcement |
| classify_grpc | 1.00 ms | 1.76 ms | gRPC to ag-intent, the engine fires here |
| impersonation_match | 0.00 ms | 0.00 ms | Skipped for these calls |
That single 1.00ms classify_grpc figure includes the 44µs engine evaluation we measured on the laptop, plus the gRPC roundtrip. Most of that millisecond is network and serialisation. The actual engine, on the much slower production CPU, is still well inside that envelope.
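For context on how a table like the one above gets produced, here's a minimal per-stage timing sketch using std::time::Instant. The stage names echo the table; the logging shape is our own illustration, not necessarily how the gateway emits its stage-latency lines.

```rust
// Minimal per-stage timing sketch (illustrative, not the gateway's real
// instrumentation). Each stage is timed individually so a per-request
// log line can report where the milliseconds went.
use std::time::Instant;

fn time_stage<T>(name: &str, timings: &mut Vec<(String, f64)>, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    timings.push((name.to_string(), start.elapsed().as_secs_f64() * 1000.0));
    out
}

fn main() {
    let mut timings = Vec::new();

    // Stand-ins for the real stages: auth, identify, normalise, classify, ...
    let _claims = time_stage("auth", &mut timings, || "verified");
    let _params = time_stage("normalize_extract", &mut timings, || r#"{"sql":"SELECT 1"}"#);
    let _verdict = time_stage("classify_grpc", &mut timings, || "allow");

    // One structured line per request is enough to build the p50/p95 table.
    for (stage, ms) in &timings {
        println!("stage={stage} ms={ms:.2}");
    }
}
```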
The honest caveat: when the LLM Judge fires, latency changes
Clampd has an optional gray-zone escalation path. When a request scores between two configurable thresholds (default 0.2 and 0.75) and the rules engine isn't confident, the gateway hands the request to an LLM-as-Judge for a semantic second opinion. When that fires, end-to-end latency changes character.
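The routing itself is plain threshold logic. Here's an illustrative sketch using the default 0.2 and 0.75 thresholds quoted above; the names are ours, not Clampd's.

```rust
// Gray-zone routing sketch using the default thresholds (0.2 allow-below,
// 0.75 block-above). Only the band in between is escalated to the LLM
// Judge, which is where the 200-400 ms lives.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    Block,
    EscalateToJudge, // semantic second opinion; adds an upstream LLM roundtrip
}

fn route(score: f64, low: f64, high: f64) -> Verdict {
    if score < low {
        Verdict::Allow
    } else if score >= high {
        Verdict::Block
    } else {
        Verdict::EscalateToJudge
    }
}

fn main() {
    assert_eq!(route(0.05, 0.2, 0.75), Verdict::Allow);
    assert_eq!(route(0.90, 0.2, 0.75), Verdict::Block);
    assert_eq!(route(0.50, 0.2, 0.75), Verdict::EscalateToJudge); // the slow path
}
```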
From a smaller earlier sample (n=6) where the LLM Judge fallback fired, end-to-end latency was 200 to 400 milliseconds. Almost all of that is the upstream LLM API roundtrip, not Clampd. Our pipeline accounts for ~5ms of it. But if you write "sub-10ms" on a landing page and your reader benchmarks a Suspicious-class call, they'll see 300ms and quote you. So we don't write "sub-10ms" anywhere any more.
The escalation is opt-in, and we've published the configurable thresholds so teams can tune them or disable the layer entirely. It ships disabled by default in self-hosted deployments unless you set MODEL_ESCALATION_ENABLED=true.
What we couldn't measure
Before anyone runs the same script and gets different numbers, here's what's missing from this post.
- Allow-path stats. All 42 samples are deny-path calls. Deny short-circuits before the downstream forward, which makes it the fast path. Allow path adds the cost of forwarding to the actual tool plus the tool's own latency. We tried 700 calls across 5 different categories. About 5% returned 200 OK, 13% returned 403 deny with a complete log line, and the remaining 82% were rate-limited at a layer that doesn't emit Stage 9 audit. We couldn't fix that from outside the box without changes to the demo org's licensing config. Allow-path numbers are coming in a follow-up.
- p99 is wobbly at n=42. Our 10.53ms p99 is from a sample size where one or two outliers swing the number a lot. Take the p50 and p95 figures more seriously than the p99 (see the sketch after this list).
- Concurrency curve. We didn't measure how p95 changes at 1, 10, 50, 100 concurrent connections. Your mileage will vary, especially on a 2-vCPU box.
- Cold start. We measured warm gateway. The first ~286ms after process start is taken up by ruleset compilation. After that the gateway settles.
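To make the p99 caveat concrete: under the common nearest-rank definition, the 99th percentile of 42 samples is simply the largest sample, so a single outlier owns the number. A small sketch with synthetic latencies:

```rust
// Why p99 is wobbly at n=42: under the nearest-rank definition, the
// 99th percentile of 42 samples is just the largest sample, so a single
// outlier moves it. (Interpolating estimators soften this, but not much.)
fn percentile(sorted_ms: &[f64], p: f64) -> f64 {
    // Nearest-rank: index = ceil(p/100 * n), 1-based.
    let rank = ((p / 100.0) * sorted_ms.len() as f64).ceil() as usize;
    sorted_ms[rank.saturating_sub(1).min(sorted_ms.len() - 1)]
}

fn main() {
    // 42 synthetic latencies around 5 ms with one 10.5 ms outlier.
    let mut samples: Vec<f64> = (0..41).map(|i| 4.5 + (i as f64) * 0.05).collect();
    samples.push(10.5);
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());

    println!("p50 = {:.2} ms", percentile(&samples, 50.0)); // middle of the pack
    println!("p95 = {:.2} ms", percentile(&samples, 95.0));
    println!("p99 = {:.2} ms", percentile(&samples, 99.0)); // equals the single outlier
}
```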
Reproducing this on your own machine
Engine bench, on any machine that can build the workspace.
```bash
# Clone, then from the services directory:
cargo bench -p ag-engine --bench evaluate
cargo bench -p ag-engine --bench stages
cargo bench -p ag-intent --bench classify
# Each writes a criterion HTML report under target/criterion/
```
End-to-end gateway bench, against a local docker stack.
```bash
# From the clampd/ compose directory:
docker compose --profile full up -d
sleep 30  # let services pass health checks

# 500 requests, 10 concurrent, against the safe path:
hey -n 500 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -H "X-AG-Key: clmpd_demo_key" \
  -d '{"agent_id":"b0000000-0000-0000-0000-000000000001",
       "tool":"db.query","params":{"sql":"SELECT 1"}}' \
  http://localhost:8080/v1/proxy

# Per-stage breakdown lives in the gateway log:
docker compose logs ag-gateway 2>&1 \
  | grep "Stage latency breakdown" \
  | tail -500
```
Numbers won't match ours exactly. Hardware, host noise, kernel version, and how warm the JIT-style ruleset is all matter. They should land in the same order of magnitude on a comparable machine.
What changed on the marketing site this week
Writing this post forced us to re-read our own copy. The phrase "sub-10ms typical latency" appeared in 13 places, including the meta description, the hero stat, two FAQ answers, and the JSON-LD schema. With a deny-path p95 of 7.92ms and an unmeasured allow path, we couldn't honestly defend "typical" without more samples. So we changed it.
The site now leads with "44µs rule evaluation" (bench-backed and reproducible) and "single-digit ms end-to-end on commodity hardware" (deliberately wide enough to be true even after the 500-sample run lands). When we measure allow-path properly, we'll narrow the language. Until then, the homepage and this post agree.
Try Clampd in 60 seconds
One line of Python or TypeScript. Works with OpenAI, Anthropic, LangChain, CrewAI, Google ADK, and any MCP server. Self-hosted, source-available, no telemetry by default.
```bash
pip install clampd
# or
npm install @clampd/sdk
```