Setting Up AI Guardrails

AI Guardrails let Knowledge Agent owners add safety controls to how an agent handles user inputs. Guardrails screen every input before it reaches the agent — across web chat, the search bar, the API, and Slack — so potentially harmful or off-topic queries are blocked at the door.

Two guardrail types are available: Jailbreak Detection and the Custom Input Prompt Guardrail. Both are configured per Knowledge Agent, managed by the agent's owner, and off by default.

👥

Access Required

You must be a Knowledge Agent Owner or have a relevant custom role to configure guardrails.

To access guardrails, open your Knowledge Agent's settings and go to the Guardrails section.


Jailbreak Detection

Jailbreak Detection screens user inputs for adversarial prompting — attempts to manipulate or override your agent's behavior. When enabled, every input is evaluated before it reaches the agent.

  • Enable it. Toggle Jailbreak Detection on in the Guardrails section of your Knowledge Agent settings.
  • On pass. The input is forwarded to the agent and processed normally. The user sees no difference.
  • On fail. The input is blocked and the user receives a generic error message. The error is intentionally vague so bad actors can't use it to refine their approach.

Jailbreak Detection applies to all users on all channels; there's no per-user exception.
✍️

Note.

If the detection service is temporarily unavailable, the system fails open — the input is assumed safe and processed normally.
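
The pass, fail, and fail-open behavior described above can be sketched in a few lines. This is an illustration only, not the product's implementation: the `screen_input` function, the `DetectionUnavailable` exception, the `detect` callable, and the wording of `GENERIC_ERROR` are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical wording; the real generic error text is not documented here.
GENERIC_ERROR = "Your request could not be processed."


class DetectionUnavailable(Exception):
    """Raised by the hypothetical detector when the service cannot be reached."""


def screen_input(text: str, detect: Callable[[str], bool]) -> Tuple[bool, Optional[str]]:
    """Return (allowed, error_message) for one user input.

    `detect` is a stand-in for the detection service. It returns True when
    the input looks like a jailbreak attempt.
    """
    try:
        flagged = detect(text)
    except DetectionUnavailable:
        # Fail open: the input is assumed safe when detection is unavailable.
        return True, None
    if flagged:
        # Blocked inputs get the same vague message regardless of the reason.
        return False, GENERIC_ERROR
    return True, None
```

The key design point is the `except` branch: an outage lets the input through rather than blocking every user.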


Custom Input Prompt Guardrail

The Custom Input Prompt Guardrail lets you define your own rules for what inputs your agent should accept. You write a plain-language rule describing what an input must (or must not) do, set how strictly it's enforced, and define the message a user sees when their input is blocked.

  1. In your Knowledge Agent settings, go to the Guardrails section.
  2. Click Add Rule under the Custom Input Prompt Guardrail.
  3. Give the rule a name.
  4. Write a moderation prompt: a plain-language description of what inputs should be allowed or blocked. For example: "This agent should only handle questions related to banking operations" or "Does this query ask the agent to compare our product to a competitor?"
  5. Set the confidence threshold to Low, Medium, or High. A higher setting requires greater confidence that the input matches your rule before it's allowed through — stricter settings block more aggressively.
  6. Write a custom error message that users will see when their input is blocked.
  7. Toggle the rule on.

You can add multiple rules to a single agent. Each rule has its own name, moderation prompt, confidence threshold, and custom error message, and each can be toggled on or off independently.

Like Jailbreak Detection, Custom Input Prompt Guardrail rules apply to all users on all channels — web chat, search bar, API, and Slack.
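
As a rough mental model, each rule can be thought of as a small record with the four fields above plus an on/off switch. The sketch below makes several assumptions that the product does not document: the numeric cutoffs in `CUTOFFS`, the `score` callable standing in for the moderation model, and the choice to return the message of the first enabled rule that blocks. The names `Rule` and `first_block` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical numeric cutoffs; the product only exposes Low, Medium, and High.
CUTOFFS = {"Low": 0.3, "Medium": 0.5, "High": 0.7}


@dataclass
class Rule:
    name: str
    moderation_prompt: str
    threshold: str          # "Low", "Medium", or "High"
    error_message: str
    enabled: bool = False   # a rule does nothing until it is toggled on


def first_block(text: str, rules: List[Rule],
                score: Callable[[str, str], float]) -> Optional[str]:
    """Return the error message of the first enabled rule that blocks `text`.

    `score(moderation_prompt, text)` is a stand-in for the moderation model.
    It returns a confidence in [0, 1] that the input is acceptable under the
    rule. Returns None when every enabled rule lets the input through.
    """
    for rule in rules:
        if not rule.enabled:
            continue  # disabled rules are skipped entirely
        if score(rule.moderation_prompt, text) < CUTOFFS[rule.threshold]:
            return rule.error_message
    return None
```

Under these assumptions, a score of 0.6 passes a Medium rule but is blocked by a High rule, which matches the idea that stricter settings block more aggressively.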

💡

Tip.

Start with a Medium confidence threshold and adjust based on how your users interact with the agent. Lower thresholds allow more inputs through; higher thresholds block more aggressively.


Credit usage for guardrails

Each guardrail evaluation costs 1 credit. This applies to both Jailbreak Detection and Custom Input Prompt Guardrail evaluations, and the cost is added on top of whatever the agent does afterward (for example, a search and answer generation).

Because guardrails are opt-in and off by default, agents with no guardrails enabled incur no additional credit cost. If the guardrail evaluation service is unavailable and the evaluation fails, no credit is consumed.
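
The credit rules above reduce to simple counting. The following sketch is illustrative only: `guardrail_credits` is a hypothetical helper, and it leaves it to the caller to decide how many evaluations a request triggers, since this article does not specify how evaluations are counted when several rules are enabled.

```python
from typing import Iterable

CREDITS_PER_EVALUATION = 1  # each guardrail evaluation costs 1 credit


def guardrail_credits(evaluations_completed: Iterable[bool]) -> int:
    """Count guardrail credits for one request.

    Each item is True if that guardrail evaluation completed, or False if it
    failed because the evaluation service was unavailable (no credit consumed).
    """
    return sum(CREDITS_PER_EVALUATION for done in evaluations_completed if done)


# Two completed evaluations cost 2 credits; a failed one costs nothing.
print(guardrail_credits([True, True]))   # 2
print(guardrail_credits([True, False]))  # 1
print(guardrail_credits([]))             # 0 (no guardrails enabled)
```

Whatever the agent consumes afterward, such as a search and answer generation, is charged separately and is not modeled here.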


Frequently asked questions about AI Guardrails

Are guardrails on by default? No. Every guardrail is opt-in and off by default. They're configured per Knowledge Agent by the agent owner.

Which channels do guardrails apply to? All of them — web chat, the search bar, the API, and the Slack bot. When a guardrail is enabled, it evaluates every input regardless of channel or user.

What does a user see when their input is blocked? It depends on the guardrail. Jailbreak Detection returns a generic error on purpose, so the detection can't be reverse-engineered. The Custom Input Prompt Guardrail returns the specific custom error message the agent owner wrote for that rule.

Can I create more than one Custom Input Prompt rule? Yes. You can add multiple rules on a single agent, each with its own name, moderation prompt, confidence threshold, and custom error message. Each rule can be toggled on or off independently.

What does the confidence threshold do? Each Custom Input Prompt rule has a Low, Medium, or High setting that controls how confident the model must be that an input matches the rule before it's allowed to pass. A stricter setting blocks more aggressively.

What happens if the jailbreak detection service is down? It fails open: the input is assumed safe and processed normally.

Can admins see a log of blocked inputs? Not at this time. Audit and logging visibility for blocked inputs is not part of this release.