Anyone who has shipped a security tool that relies on regex has hit the wall: the regex is correct, the regex is fast, the regex catches the textbook case, and an attacker word-smiths around it in an afternoon. Anyone who has shipped a security tool that relies on a model has hit a different wall: the model is smart, the model is slow, the model costs real money on every call, and you are now committing to every security decision being a 200-millisecond LLM round trip.
The hybrid is the obvious answer. Use regex for the cases where it's right; use a model for the cases where regex isn't enough. The interesting part isn't the choice — it's where you put the seam, and what happens when the model isn't there.
This post walks the seam in detail, including the four conditions under which we deliberately do NOT call the LLM judge.
The two thresholds
Every tool call gets a numeric risk score from the rules engine in the range [0.0, 1.0]. We split that range into three zones with two configurable thresholds.
The defaults are 0.2 and 0.75, configurable via environment:
// services/crates/ag-gateway/src/model_escalation.rs
use std::sync::LazyLock;

static LLM_JUDGE_LOW_THRESHOLD: LazyLock<f64> = LazyLock::new(|| {
    std::env::var("CLAMPD_LLM_JUDGE_LOW")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0.2)
});

static LLM_JUDGE_HIGH_THRESHOLD: LazyLock<f64> = LazyLock::new(|| {
    std::env::var("CLAMPD_LLM_JUDGE_HIGH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(0.75)
});
The split has direct cost implications. With 0.2 / 0.75 thresholds and a typical workload, only the gray-zone subset invokes the LLM — auto-allow and auto-block traffic never does. The exact percentage of calls that land in the gray zone depends entirely on your workload and tuning; we don't publish a single number because every customer's distribution differs. That's the entire point of the architecture: cheap calls stay cheap; expensive calls happen rarely.
If you tighten the thresholds (say 0.4 / 0.6), more calls fall into the gray zone: more LLM invocations, a higher bill, better accuracy on borderline cases, and higher latency on those calls. Tunable per workload.
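To make the split concrete, here is a minimal sketch of how the two thresholds partition the score. RiskZone and classify are illustrative names for this post, not the gateway's actual types.

// Illustrative sketch: how two thresholds split [0.0, 1.0] into three zones.
// `RiskZone` and `classify` are hypothetical names, not the gateway's types.
enum RiskZone {
    AutoAllow, // below LOW: rules alone resolve the call
    GrayZone,  // between LOW and HIGH (inclusive): candidate for the LLM judge
    AutoBlock, // above HIGH: rules alone block the call
}

fn classify(rules_risk: f64, low: f64, high: f64) -> RiskZone {
    if rules_risk < low {
        RiskZone::AutoAllow
    } else if rules_risk <= high {
        RiskZone::GrayZone
    } else {
        RiskZone::AutoBlock
    }
}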
What happens in the gray zone
When a call's rules-derived risk falls between LOW and HIGH, the gateway packages the tool name, action, parameters, prompt context (if available), and the rules' matched labels, then calls a configured LLM with a structured prompt asking for a refined risk estimate. The model returns a JSON document with a numeric score and brief reasoning. The gateway takes the maximum of the rules score and the LLM score.
// final score logic, simplified
let final_risk = rules_risk.max(model_risk).max(llm_judge_risk);
The "max" is deliberate. We want the LLM to push a borderline call up if it sees something the rules missed (jailbreak phrasing, social-engineering tone, multi-language injection). We don't want the LLM to override a clean rule-based block, because then a prompt-engineered request can talk its way out of the block via the very pipe that's supposed to stop it.
The four cases where the judge does NOT fire
This is the part that gets less attention and matters more than the LLM call itself. The judge is bypassed in four conditions, each with a reason.
// services/crates/ag-gateway/src/model_escalation.rs
pub fn needs_llm_judge(config: &LlmJudgeConfig, rules_risk: f64) -> bool {
    // 1. Judge disabled by config: resolve on rules alone.
    if !config.enabled {
        return false;
    }
    // 2. No API key: don't attempt a call that would just 401.
    if config.api_key.is_empty() {
        return false;
    }
    // 3. Cooling down after consecutive upstream failures.
    if LLM_JUDGE_STATE.is_in_cooldown() {
        debug!("LLM judge in cooldown, skipping");
        return false;
    }
    // 4. Per-minute call cap reached.
    if LLM_JUDGE_STATE.is_rate_limited(config.max_calls_per_minute) {
        debug!("LLM judge rate limited, skipping");
        return false;
    }
    // Only gray-zone scores escalate to the judge.
    rules_risk >= *LLM_JUDGE_LOW_THRESHOLD && rules_risk <= *LLM_JUDGE_HIGH_THRESHOLD
}
1. Disabled by config
The LLM judge is off by default on self-hosted installs. Operators must set MODEL_ESCALATION_ENABLED=true and provide an LLM API key for the layer to engage at all. Until they do, every gray-zone call is resolved on rules alone.
Why off-by-default: a user spinning up Clampd to evaluate it doesn't need to also be paying for an OpenAI API key on day one. The product has to give a complete answer using rules-only before we add a configurable layer that calls a third party.
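For illustration, a sketch of what the enabling plumbing might look like, assuming a config struct with the fields the bypass check above reads. The env var names other than MODEL_ESCALATION_ENABLED, and the default cap, are ours for the example, not documented settings.

// Sketch: building the judge config from environment. Field names match the
// checks in needs_llm_judge; env var names beyond MODEL_ESCALATION_ENABLED
// and the default cap of 60 are illustrative.
pub struct LlmJudgeConfig {
    pub enabled: bool,
    pub api_key: String,
    pub max_calls_per_minute: u32,
}

impl LlmJudgeConfig {
    pub fn from_env() -> Self {
        Self {
            // Off by default: rules-only until the operator opts in.
            enabled: std::env::var("MODEL_ESCALATION_ENABLED")
                .map(|v| v == "true")
                .unwrap_or(false),
            // An empty key is treated as "not configured" by needs_llm_judge.
            api_key: std::env::var("CLAMPD_LLM_API_KEY").unwrap_or_default(),
            max_calls_per_minute: std::env::var("CLAMPD_LLM_JUDGE_MAX_PER_MINUTE")
                .ok()
                .and_then(|v| v.parse().ok())
                .unwrap_or(60),
        }
    }
}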
2. API key not configured
Even when enabled via env var, an empty API key means we don't try to call. Saves an HTTP round-trip that would just 401.
3. In cooldown after consecutive failures
If the LLM judge fails N times in a row (configurable, defaults conservative), the layer enters a cooldown period during which no calls are made. Reasoning: if the upstream LLM is down, sick, or misconfigured, hammering it on every call makes a bad situation worse and pushes our gray-zone latency from "200ms" to "200ms × every gray-zone call until you fix it." The cooldown avoids the storm.
While cooldown is active, gray-zone calls fall through to whatever the rules score said. That's the fail-open vs fail-closed knob (next section).
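A minimal sketch of the bookkeeping behind is_in_cooldown, with illustrative names; the constants here are placeholders, not the actual defaults, and the real gateway keeps this state in LLM_JUDGE_STATE.

use std::time::{Duration, Instant};

// Sketch of the cooldown state; names and constants are illustrative.
struct JudgeState {
    consecutive_failures: u32,
    cooldown_until: Option<Instant>,
}

impl JudgeState {
    const MAX_CONSECUTIVE_FAILURES: u32 = 3;          // "N times in a row"
    const COOLDOWN: Duration = Duration::from_secs(60);

    fn record_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= Self::MAX_CONSECUTIVE_FAILURES {
            // Stop calling the judge for a while instead of hammering a sick upstream.
            self.cooldown_until = Some(Instant::now() + Self::COOLDOWN);
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.cooldown_until = None;
    }

    fn is_in_cooldown(&self) -> bool {
        self.cooldown_until.is_some_and(|t| Instant::now() < t)
    }
}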
4. Rate-limited per minute
A configurable maximum LLM-judge calls per minute caps cost in the worst case (a flood of gray-zone-scoring calls). Once the cap is reached for the current minute, further gray-zone calls bypass the judge for that minute and resolve on rules alone.
This bound is what lets us promise customers that LLM-judge cost is not unbounded. You can pin it to "at most N calls/minute, hard."
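The cap can be as simple as a fixed-window counter keyed to the current minute; a sketch with illustrative names (the real gateway tracks this inside LLM_JUDGE_STATE with proper synchronization).

use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of a fixed-window per-minute limiter; names are illustrative.
struct MinuteWindow {
    minute: u64, // which wall-clock minute the counter belongs to
    calls: u32,  // judge calls made in that minute
}

impl MinuteWindow {
    fn is_rate_limited(&mut self, max_calls_per_minute: u32) -> bool {
        let now_minute = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_secs()
            / 60;
        if now_minute != self.minute {
            // New minute: reset the window.
            self.minute = now_minute;
            self.calls = 0;
        }
        if self.calls >= max_calls_per_minute {
            return true; // cap reached: gray-zone calls resolve on rules alone
        }
        self.calls += 1;
        false
    }
}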
Every bypass logs a structured event. If the judge is being skipped because of cooldown, your dashboard's risk feed shows it. If you're hitting the per-minute rate limit, you see the rate. This is on purpose. A security architect needs to be able to answer the question "how often is the LLM judge actually firing?" instead of trusting the marketing.
Fail-open vs fail-closed
When the judge call itself fails (timeout, 5xx from the LLM provider, malformed JSON), we have a choice: trust the rules-derived score (fail-open) or block the call (fail-closed). The default is fail-open. Configurable.
The reasoning: when the LLM is down or slow, the rules engine is still right about the obvious cases. A call that the rules scored at 0.3 and was about to be auto-allowed should not become a block just because the LLM is throwing 503s. That would amount to "we DDoS our own customers when our LLM provider has a bad day."
For high-security deployments where the operator would rather block-on-uncertainty, set MODEL_FAIL_OPEN=false. The gray zone then defaults to deny when the judge is unreachable.
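In code terms the knob reduces to what you return when the judge call itself errors; a sketch, with Verdict and resolve_on_judge_error as illustrative names rather than the gateway's actual policy layer.

// Sketch: resolving a gray-zone call when the judge fails.
enum Verdict {
    Allow,
    Block,
}

fn resolve_on_judge_error(rules_risk: f64, block_threshold: f64, fail_open: bool) -> Verdict {
    if fail_open {
        // Trust the rules-derived score: a 0.3 that was about to be auto-allowed
        // stays allowed even while the LLM provider throws 503s.
        if rules_risk >= block_threshold { Verdict::Block } else { Verdict::Allow }
    } else {
        // Fail-closed: unresolved gray-zone calls default to deny.
        Verdict::Block
    }
}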
What the judge actually sees
Not the raw user prompt. Not credentials. Not the response body. The structured request looks like this:
// LlmJudgeRequest (services/crates/ag-gateway/src/model_escalation.rs)
{
  "tool_name": "db.query",
  "action": "read",
  "params_summary": "SELECT name FROM users WHERE region = 'EU'",
  "rules_risk_score": 0.42,
  "matched_rules": ["R005", "R025"],
  "agent_id": "agent-7b3a-..."
}
The judge replies with a refined risk score and a one-line reasoning that lands in the audit trail. The original tool params do go into params_summary, but downstream credentials, response bodies, and full prompts do not. This is intentional: the LLM should be able to grade the decision, not gain access to data that the gateway is designed to govern.
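As serde types, the payloads might look roughly like this; the request fields mirror the JSON above, while the response field names are assumptions rather than the exact schema.

use serde::{Deserialize, Serialize};

// Sketch of the judge payloads as serde types, mirroring the JSON above.
// The response field names are assumptions, not the gateway's exact schema.
#[derive(Serialize)]
struct LlmJudgeRequest {
    tool_name: String,
    action: String,
    params_summary: String,
    rules_risk_score: f64,
    matched_rules: Vec<String>,
    agent_id: String,
}

#[derive(Deserialize)]
struct LlmJudgeResponse {
    risk_score: f64,   // refined estimate in [0.0, 1.0]
    reasoning: String, // one-liner that lands in the audit trail
}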
The legacy HTTP-backend escalation path (ModelRequest, used when you point at a non-LLM model endpoint) carries extra fields: params_json, session_flags, and classification. The LLM-judge path keeps its payload minimal because LLM tokens are billed.
Why this isn't "AI security AI"
One reasonable objection to LLM-as-judge: "you're using AI to police AI, the same model class that does the attacking." Two things matter here.
- The judge isn't authoritative. It refines a number that came from deterministic rules. It can push a 0.4 to 0.7 (catching something the rules didn't see), but it can't push a 0.9 (rules-blocked) back down to 0.5. The deterministic floor stays in place.
- The judge is one of three signals, not the signal. The final risk is max(rules_risk, model_risk, llm_judge_risk). Even if a clever attacker gets the judge to underrate their request, the rules and the per-agent behavioural baseline still vote.
The LLM judge is a catch-net for the gray zone. It is not the ground truth. The rules and the EMA-based behavioural layer are.
What hybrids buy you in dollars
Concretely: if your workload puts 5% of calls in the gray zone, a 1000-call sample runs the LLM 50 times. At a typical 200 ms per call and a few cents per call, that's reasonable. A pure-LLM design would invoke the LLM 1000 times: over three minutes of cumulative LLM time and 20× the bill. A pure-rules design would invoke the LLM zero times and miss the gray zone entirely.
The hybrid sits at the cost of pure-rules plus a small premium, with detection coverage closer to pure-LLM. The exact ratio is workload-dependent; the architecture's job is to make sure you can tune it without rewriting anything.
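The back-of-the-envelope math is easy to re-run against your own traffic; a sketch using the illustrative 5% / 200 ms numbers above, not measurements.

// Back-of-the-envelope cost model; all inputs are the illustrative numbers
// from the paragraph above.
fn main() {
    let total_calls = 1_000u32;
    let gray_zone_fraction = 0.05; // 5% of calls score between LOW and HIGH
    let judge_latency_ms = 200.0;  // typical LLM round trip

    let hybrid_llm_calls = (total_calls as f64 * gray_zone_fraction) as u32; // 50
    let pure_llm_calls = total_calls;                                        // 1000

    println!(
        "hybrid: {} judge calls, ~{:.0}s cumulative LLM time",
        hybrid_llm_calls,
        hybrid_llm_calls as f64 * judge_latency_ms / 1000.0
    );
    println!(
        "pure LLM: {} calls, ~{:.1} min cumulative LLM time, {}x the bill",
        pure_llm_calls,
        pure_llm_calls as f64 * judge_latency_ms / 1000.0 / 60.0,
        pure_llm_calls / hybrid_llm_calls
    );
}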
What you can do without Clampd
Even if you don't use us, the architectural pattern transfers. Three things to take away:
- Don't make the LLM the floor. Whatever rules engine you have, run it first. Use the LLM only for the cases the rules don't decide. Otherwise you're paying full price for low-value clarity.
- Bound the LLM's failure modes explicitly. Cooldown after consecutive failures, per-minute rate limit, fail-open/closed knob. Without these, your security posture during an LLM provider incident becomes accidentally untested.
- Audit when the LLM does and doesn't fire. If you can't answer "what % of decisions used the LLM judge yesterday?" from your audit log, you can't reason about the architecture.
Try Clampd in 60 seconds
One line of Python or TypeScript. The LLM-as-Judge layer is opt-in via MODEL_ESCALATION_ENABLED=true. Bring your own API key. Cost-bounded by per-minute rate limit and configurable thresholds. Self-hosted, source-available.
Get Started → See the latency numbers