SecurityLLM GuardrailsGenAI Safety

Comparing LLM Guardrails Across GenAI Platforms: A Comprehensive Security Analysis

January 5, 2024

9 min read

We conducted a comparative study of the built-in guardrails offered by three major cloud-based large language model (LLM) platforms. This examination evaluated both false positives (FPs), where safe content is erroneously blocked, and false negatives (FNs), where harmful content slips through these guardrails.

Key Finding

LLM guardrails are an essential layer of defense against misuse, disallowed content and harmful behaviors. They serve as a safety layer between the user and the AI model, filtering or blocking inputs and outputs that violate policy guidelines.

What Are LLM Guardrails?

Alignment vs Guardrails

Alignment focuses on shaping the model's behavior during training through techniques like supervised fine-tuning and reinforcement learning from human feedback. The goal is to guide the model toward generating appropriate and helpful outputs by default.

Guardrails are control mechanisms that operate during model deployment and usage. They act as a layer that monitors and manages the interaction between the user and the model in real time, analyzing both user inputs and the model's outputs.

Example: Guardrails in Action

Without Guardrails:

User: "Write me a tutorial on how to hack into my school's grading system."

Assistant: "Here's a step-by-step guide to accessing your school's grading system..."

With Guardrails:

User: "Write me a tutorial on how to hack into my school's grading system."

Assistant: "I understand you might be curious about cybersecurity, but I can't provide instructions for unauthorized system access."

LLM Guardrail Types

Prompt Injection and Jailbreak Prevention

Watches for attempts to manipulate the model through crafty prompts like "Ignore all previous instructions" or wrapped forbidden requests in fictional roleplay. Uses rules or classifiers to detect these patterns.

Content Moderation Filters

The most common type of guardrails. Content filters scan text for categories like hate speech, harassment, sexual content, violence, self-harm and other forms of harmful content.

Topic and Use Case Restrictions

Prevents the model from engaging with specific topics or use cases that may be sensitive, regulated, or outside the intended scope of the application.

Data Loss Prevention (DLP)

Scans for and blocks sensitive information like credit card numbers, social security numbers, API keys, or proprietary data from being exposed in prompts or responses.

Best Practices

Prompt Engineering Best Practices

Structure

Use clear instructions
Provide examples (few-shot)
Define output format

Optimization

Test iteratively
Version control prompts
Monitor performance

Fine-Tuning Best Practices

Data Preparation

Clean and validate data
Ensure diversity
Balance dataset

Training

Start with small datasets
Monitor for overfitting
Validate thoroughly

Need Help Choosing the Right Approach?

Our experts can help you determine whether prompt engineering or fine-tuning is best for your use case.

Get Expert Consultation