Peering Behind the Shield: Guardrail Identification in Large Language Models

Ziqing Yang1, Yixin Wu1, Rui Wen2, Michael Backes1, Yang Zhang1
1CISPA Helmholtz Center for Information Security
2Institute of Science Tokyo
AAAI 2026 TrustAgent Workshop & AICS Workshop

Resources

Paper · Code

Contents

  • Motivation
  • Main Contributions
  • Method
  • Results
  • Citation

Motivation

AI agents rely on safety guardrails to block harmful behavior. In practice, these guardrails are hidden: users and attackers do not know which guardrail is deployed or how it operates.

An attacker without this knowledge is like a burglar who does not know what kind of lock is on the door: attacks are inefficient and unreliable. Once the guardrail is identified, however, guardrail-specific attacks become far more effective.

Despite their importance, guardrails are treated as black-box components, and existing model identification methods fail to reveal them. This raises a key question:

  • Can we identify the guardrail deployed inside a black-box AI agent?
An attacker who does not know the guardrail is like a burglar who does not know what kind of lock is on the door.

Main Contributions

This work makes the following key contributions:

  • First formulation of the guardrail identification problem
  • AP-Test: a novel guardrail identification framework
  • Unified identification of input and output guardrails
  • Match Score: a principled decision metric
  • Extensive empirical validation

Method

AP-Test identifies guardrails in two phases:

  • Adversarial Prompt Optimization
  • Adversarial Prompt Testing
Left: Overview of an AI agent. Right: Framework of our AP-Test.

The core idea is to craft adversarial prompts that are flagged as unsafe by one specific candidate guardrail while being judged safe by all the others, and then observe how the black-box AI agent responds to them.
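
As a rough illustration (not the authors' implementation), the sketch below approximates the optimization phase by filtering a pool of candidate prompts against locally available candidate guardrails, and the testing phase by querying the black-box agent and scoring each candidate. Everything here is an assumption made for illustration: the filtering stand-in for prompt optimization, the keyword-based refusal check, the 0.5 decision threshold, and all function names.

```python
# Illustrative sketch of AP-Test's two phases (not the authors' implementation).
# Phase 1 is approximated by *filtering* a prompt pool rather than optimizing
# prompts directly; phase 2 queries the black-box agent and scores candidates.

from typing import Callable, Dict, List

Guardrail = Callable[[str], bool]   # returns True if the guardrail flags the text
Agent = Callable[[str], str]        # black-box agent: prompt -> response


def select_adversarial_prompts(
    target: str,
    guardrails: Dict[str, Guardrail],
    prompt_pool: List[str],
) -> List[str]:
    """Keep prompts flagged by the target guardrail but by no other candidate."""
    others = [g for name, g in guardrails.items() if name != target]
    return [
        p for p in prompt_pool
        if guardrails[target](p) and not any(g(p) for g in others)
    ]


def looks_refused(response: str) -> bool:
    """Hypothetical refusal check; real detection may be more involved."""
    markers = ("i can't", "i cannot", "unsafe", "request blocked")
    return any(m in response.lower() for m in markers)


def match_score(prompts: List[str], agent: Agent) -> float:
    """Fraction of a candidate's tailored prompts that the agent refuses."""
    if not prompts:
        return 0.0
    return sum(looks_refused(agent(p)) for p in prompts) / len(prompts)


def identify_guardrail(
    guardrails: Dict[str, Guardrail],
    prompt_pool: List[str],
    agent: Agent,
    threshold: float = 0.5,
) -> str:
    """Return the candidate whose tailored prompts are refused most often."""
    scores = {}
    for name in guardrails:
        tailored = select_adversarial_prompts(name, guardrails, prompt_pool)
        scores[name] = match_score(tailored, agent)
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "no candidate matched"
```

Because each tailored prompt set is, by construction, flagged only by its own target guardrail, a high match score for exactly one candidate is strong evidence that this candidate (or a close derivative of it) sits behind the agent.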

Results

AP-Test achieves perfect identification performance across a wide range of settings:

  • Candidate Guardrails: WildGuard, LlamaGuard, LlamaGuard2, LlamaGuard3
  • Base LLMs: Llama 3.1, GPT-4o
  • Deployment Modes: Input guard, output guard, and combined input–output guards (illustrated in the sketch below)
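
For concreteness, the sketch below shows one common way such guards can wrap a base LLM. It is a generic pattern under our own assumptions, not the exact agent setups evaluated in the paper; `llm`, `input_guard`, and `output_guard` are placeholder callables.

```python
# Generic guarded-agent pipeline (illustrative pattern, not the paper's code).
# Leaving either guard out yields the input-only or output-only deployment;
# supplying both yields the combined input-output deployment.

from typing import Callable, Optional

REFUSAL = "Sorry, I can't help with that."


def guarded_agent(
    prompt: str,
    llm: Callable[[str], str],                              # base LLM, e.g. Llama 3.1 or GPT-4o
    input_guard: Optional[Callable[[str], bool]] = None,    # True if the prompt is unsafe
    output_guard: Optional[Callable[[str], bool]] = None,   # True if the response is unsafe
) -> str:
    # Input guard: screen the user prompt before it reaches the LLM.
    if input_guard is not None and input_guard(prompt):
        return REFUSAL

    response = llm(prompt)

    # Output guard: screen the generated response before it reaches the user.
    if output_guard is not None and output_guard(response):
        return REFUSAL

    return response
```

From the caller's perspective, all three configurations look identical: a prompt goes in and a (possibly refused) response comes out, which is why the deployed guardrail must be identified through black-box probing.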

Below are the key findings from our experiments:

  • 100% classification accuracy and AUC = 1.00 for both input and output guard tests
  • Robust to safety-aligned LLM behavior and additional guardrails
  • Successfully identifies derivative guardrails (e.g., fine-tuned variants)
  • Ablation studies confirm the necessity of each method component

These results demonstrate that AP-Test provides a practical and reliable path toward understanding and auditing real-world AI safety mechanisms.

Citation

If you find this work useful, please cite:

Formatted Citation:

Yang, Z., Wu, Y., Wen, R., Backes, M., & Zhang, Y. (2026). Peering Behind the Shield: Guardrail Identification in Large Language Models. The AAAI Workshop on Artificial Intelligence for Cyber Security (AICS).

BibTeX:

@inproceedings{YWWBZ26,
  author = {Ziqing Yang and Yixin Wu and Rui Wen and Michael Backes and Yang Zhang},
  title = {Peering Behind the Shield: Guardrail Identification in Large Language Models},
  booktitle = {The AAAI Workshop on Artificial Intelligence for Cyber Security (AICS)},
  publisher = {AAAI},
  arxiv = {2502.01241},
  year = {2026}
}