Guardrail Generation¶
Prisma AIRS CLI's guardrail generation capability uses an LLM-driven feedback loop to create, test, and iteratively refine custom topic guardrails for Prisma AIRS security profiles.
How It Works¶
- Generate — An LLM produces a custom topic definition (name, description, examples) based on your intent (block or allow)
- Deploy — The topic is created/updated in AIRS via the Management API and linked to your security profile
- Test — Synthetic test prompts are scanned against the profile to measure detection accuracy
- Evaluate — Metrics (TPR, TNR, coverage, F1) determine how well the guardrail performs
- Improve — The LLM analyzes failures and refines the topic definition
- Repeat — The loop continues until coverage reaches the target threshold (default 90%)
CLI Usage¶
Guardrail generation lives under airs runtime topics:
# Interactive mode — prompts for all inputs
airs runtime topics generate
# Non-interactive with all options
airs runtime topics generate \
--topic "Block discussions about weapons manufacturing" \
--intent block \
--profile my-security-profile \
--target-coverage 90 \
--max-iterations 5
# Resume, report, list runs
airs runtime topics resume <runId>
airs runtime topics report <runId>
airs runtime topics runs
Key Concepts¶
- Intent:
block(detect violating prompts) orallow(detect benign prompts that should pass through) - Coverage:
min(TPR, TNR)— both detection types must meet the threshold - Topic name lock: After iteration 1, only the description and examples are refined — the name stays fixed
- Test composition: Iteration 2+ carries forward failed tests and adds regression checks alongside fresh LLM-generated tests
Platform Constraints¶
Achievable coverage depends on the topic domain and intent. AIRS has platform-level behaviors that limit what custom topic guardrails can accomplish.
Block-Intent on High-Sensitivity Topics¶
Certain topic domains (explosives/weapons, CSAM, etc.) trigger AIRS built-in safety layers that override custom topic definitions entirely:
- These topics achieve 100% TPR but 0% TNR — the guardrail blocks ALL content, including completely unrelated prompts
- Description refinement, exclusion clauses, and example tuning have zero observable effect
- The built-in safety layer appears to key off the topic name/domain, not the description
Recommendation
Do not create custom block-intent topics for content that AIRS already handles via built-in safety. Use the default AIRS security profiles instead.
Allow-Intent Matching Behavior¶
Allow-intent matching uses broad semantic similarity, not logical constraint evaluation:
- Exclusion clauses ("not X", "excludes Y") do not work — they often increase false positives by adding semantic overlap with the excluded domain
- Shorter, simpler descriptions (under 100 characters) consistently outperform longer, more specific ones
- Typical achievable coverage for allow-intent topics: 40–70% depending on topic breadth
- Best results usually come from the first few iterations; extended refinement often degrades coverage
Description Truncation¶
AIRS enforces hard limits on topic definitions. Descriptions exceeding 250 characters are silently truncated by clampTopic(), which can strip the positive definition while preserving only exclusion clauses — further degrading performance.
| Constraint | Limit |
|---|---|
| Topic name | 100 characters |
| Description | 250 characters |
| Each example | 250 characters |
| Max examples | 5 |
| Combined (description + examples) | 1000 characters |
Related¶
- Core Loop Architecture — detailed loop state machine
- Memory System — cross-run learning persistence
- Metrics & Evaluation — how TP/TN/FP/FN are classified
- Topic Constraints — AIRS limits on topic definitions
- Resumable Runs — pause and resume loop runs