Metrics & Evaluation

Every iteration measures how well the guardrail performs, then uses that data to guide improvement. Here's what gets measured and why.

The Metrics

| Metric | Formula | What it tells you |
| --- | --- | --- |
| TPR (sensitivity) | TP / (TP + FN) | How many violations are caught |
| TNR (specificity) | TN / (TN + FP) | How many safe prompts are correctly passed |
| Coverage | min(TPR, TNR) | The primary optimization target |
| Accuracy | (TP + TN) / total | Overall correctness |
| F1 | 2 * (precision * recall) / (precision + recall) | Balance of precision and recall |
| Regressions | Count of regression-tier tests that failed | How many previously-correct tests broke after topic refinement |

Why Coverage = min(TPR, TNR)

A guardrail that catches every violation but also blocks half the safe prompts isn't useful. Coverage forces both detection and specificity to improve together — the system can't game the metric by excelling at one while ignoring the other.
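
In code, these metrics reduce to a few ratios over the confusion-matrix counts. A minimal TypeScript sketch (the Counts shape and computeMetrics name are illustrative, not the project's actual API):

```typescript
interface Counts {
  tp: number; // violations correctly caught
  tn: number; // safe prompts correctly passed
  fp: number; // safe prompts incorrectly blocked
  fn: number; // violations that slipped through
}

// Mirrors the formulas in the table above.
function computeMetrics({ tp, tn, fp, fn }: Counts) {
  const tpr = tp / (tp + fn);                       // sensitivity
  const tnr = tn / (tn + fp);                       // specificity
  const coverage = Math.min(tpr, tnr);              // primary optimization target
  const accuracy = (tp + tn) / (tp + tn + fp + fn);
  const precision = tp / (tp + fp);
  const f1 = (2 * precision * tpr) / (precision + tpr); // recall === tpr
  return { tpr, tnr, coverage, accuracy, f1 };
}
```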


How Tests Work

Each iteration, the LLM generates a balanced test suite:

  • Positive tests — prompts that should trigger the guardrail (actual violations)
  • Negative tests — prompts that should not trigger (safe but topically adjacent)

Every test case has four fields:

| Field | What it is |
| --- | --- |
| prompt | The test prompt text |
| expectedTriggered | Should the guardrail catch this? |
| category | Grouping label (e.g., "direct-request", "benign-adjacent") |
| source | How the test entered the suite: 'generated', 'carried-fp', 'carried-fn', or 'regression' |
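
Expressed as a TypeScript type, a test case might look like the following sketch (the type names are illustrative; the field names come from the table above):

```typescript
type TestSource = 'generated' | 'carried-fp' | 'carried-fn' | 'regression';

interface TestCase {
  prompt: string;             // the text sent to the guardrail
  expectedTriggered: boolean; // true for positive tests, false for negatives
  category: string;           // e.g. "direct-request", "benign-adjacent"
  source: TestSource;         // how the test entered the suite
}
```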

Test Composition (Iteration 2+)

On iteration 2+, the test suite is composed from three sources:

  1. Carried failures — FP/FN from the previous iteration, re-tested to verify if refinement fixed them
  2. Regression tier — TP/TN from the previous iteration, re-scanned to catch regressions
  3. Fresh generated — new tests from the LLM, weighted toward weak categories

All pools are deduplicated case-insensitively. Priority: carried > regression > generated.
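
One way the merge could be implemented, reusing the TestCase type sketched earlier (the function name and details are assumptions, not the actual implementation):

```typescript
// Merge pools in priority order; the first occurrence of a prompt wins,
// so carried failures outrank regression tests, which outrank fresh ones.
function composeSuite(
  carried: TestCase[],
  regression: TestCase[],
  generated: TestCase[],
): TestCase[] {
  const seen = new Set<string>();
  const suite: TestCase[] = [];
  for (const test of [...carried, ...regression, ...generated]) {
    const key = test.prompt.trim().toLowerCase(); // case-insensitive dedup
    if (seen.has(key)) continue;
    seen.add(key);
    suite.push(test);
  }
  return suite;
}
```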

Weighted Category Generation

Per-category error rates from the previous iteration are injected into the test generation prompt. If "indirect-reference" had a 40% FN rate vs "direct-request" at 5%, the LLM generates more indirect-reference tests.
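
A plausible way to turn those error rates into generation weights, shown as a sketch (the floor value and names are assumptions; the actual weighting lives in the LLM prompt):

```typescript
// Convert per-category error rates from the previous iteration into
// sampling weights so weaker categories get more fresh tests.
function categoryWeights(errorRates: Record<string, number>): Record<string, number> {
  const floor = 0.05; // every category keeps some representation
  const raw = Object.entries(errorRates).map(
    ([cat, rate]) => [cat, Math.max(rate, floor)] as const,
  );
  const total = raw.reduce((sum, [, w]) => sum + w, 0);
  return Object.fromEntries(raw.map(([cat, w]) => [cat, w / total]));
}

// Example: { "indirect-reference": 0.40, "direct-request": 0.05 }
// yields roughly 89% / 11% of the freshly generated tests.
```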

Why topically adjacent negatives matter

The LLM generates negative tests that are close to the guardrail's topic but shouldn't trigger. These are the hardest cases and drive the most improvement — a "weapons" guardrail shouldn't block a cooking discussion about knives.


Scanning

Test prompts are scanned against AIRS in parallel batches:

```mermaid
graph LR
    A[Test Cases] --> B[Batch Scanner]
    B --> C[Concurrent Requests]
    C --> D[AIRS Scan API]
    D --> E[Results]
```

  • Concurrency: controlled by scanConcurrency (default 5)
  • Detection: checks prompt_detected.topic_violation (fallback: topic_guardrails_details)

Rate limits

Setting scanConcurrency above 5 risks AIRS API throttling. The default balances speed and reliability.
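
A concurrency-limited scanner could be sketched as below. The ScanResult shape and the injected scanPrompt callback are placeholders; only the prompt_detected.topic_violation and topic_guardrails_details field names are taken from the description above, and the shape of the fallback field is assumed:

```typescript
// Hypothetical response shape containing only the detection fields described above.
interface ScanResult {
  prompt_detected?: { topic_violation?: boolean };
  topic_guardrails_details?: { triggered?: boolean }; // fallback; exact shape assumed
}

async function scanBatch(
  prompts: string[],
  scanPrompt: (p: string) => Promise<ScanResult>, // caller supplies the AIRS scan call
  scanConcurrency = 5,
): Promise<boolean[]> {
  const results: boolean[] = new Array(prompts.length);
  let next = 0;

  // Worker pool: at most `scanConcurrency` requests in flight at once.
  const workers = Array.from({ length: scanConcurrency }, async () => {
    while (next < prompts.length) {
      const i = next++; // safe: no await between read and increment
      const res = await scanPrompt(prompts[i]);
      results[i] =
        res.prompt_detected?.topic_violation ??
        res.topic_guardrails_details?.triggered ??
        false;
    }
  });

  await Promise.all(workers);
  return results;
}
```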


FP/FN Analysis

After scanning, the LLM examines every misclassified result. It receives the topic definition, all test results, and the computed metrics, then identifies patterns:

| Error Type | What happened | Example |
| --- | --- | --- |
| False positive | Safe prompt incorrectly blocked | Cooking discussion flagged by "weapons" guardrail because "knife" appeared in examples |
| False negative | Violation slipped through | Coded language or indirect references not caught by the description |

The analysis produces concrete suggestions — "narrow the description to exclude kitchen contexts" or "add an example covering euphemistic language" — that feed directly into the next iteration.
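
Separating out the misclassifications the LLM analyzes is a simple partition over the scanned results (a sketch reusing the TestCase type from earlier; the ScannedCase name is an assumption):

```typescript
interface ScannedCase {
  test: TestCase;     // the original test case
  triggered: boolean; // what the scan actually returned
}

// False positive: safe prompt that triggered. False negative: violation that did not.
function partitionErrors(results: ScannedCase[]) {
  const falsePositives = results.filter(r => !r.test.expectedTriggered && r.triggered);
  const falseNegatives = results.filter(r => r.test.expectedTriggered && !r.triggered);
  return { falsePositives, falseNegatives };
}
```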


When the Loop Stops

| Condition | Default | What happens |
| --- | --- | --- |
| Coverage target met | 90% | Run succeeds |
| Max iterations reached | 20 | Run completes with best result found |

Note

The best iteration (highest coverage) is tracked throughout the run. Even if the final iteration regresses, the best result is always preserved.
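
Taken together, the stop conditions and best-result tracking amount to a loop like this sketch (the defaults mirror the table above; everything else, including runIteration, is illustrative):

```typescript
// Illustrative outer loop: stop on coverage target or iteration cap,
// always keeping the best iteration seen so far.
async function tuneGuardrail(
  runIteration: (iteration: number) => Promise<{ coverage: number }>,
  coverageTarget = 0.9,
  maxIterations = 20,
) {
  let best: { iteration: number; coverage: number } | null = null;

  for (let i = 1; i <= maxIterations; i++) {
    const result = await runIteration(i);
    if (!best || result.coverage > best.coverage) {
      best = { iteration: i, coverage: result.coverage };
    }
    if (result.coverage >= coverageTarget) break; // target met: run succeeds
  }

  return best; // even if the final iteration regressed, the best result is preserved
}
```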