AI Evaluation System

Evaluate AI Quality at Scale

Comprehensive evaluation framework with LLM judges, security-focused evaluators, golden response libraries, and real-time streaming evaluation. Better than Langfuse with built-in security intelligence.

5 Security

Built-in security evaluators using threat detection

Real-time

Streaming evaluation during response generation

Golden Lib

Security-verified baseline response library

Regression

Automatic regression detection between runs

Evaluation Framework

A complete toolkit for evaluating AI responses, from automated LLM judges to human annotation workflows.

Custom Evaluators

Create evaluators with multiple types: LLM judge, regex, keyword, semantic similarity, and security.

  • LLM-as-judge with custom criteria
  • Regex and keyword matching
  • Semantic similarity with embeddings
  • Binary, numeric, categorical scores

Test Datasets

Manage test datasets with input/output pairs, expected responses, and metadata.

  • JSONL import/export
  • Version control
  • Expected output matching
  • Custom metadata fields

Evaluation Runs

Execute evaluations at scale with parallel processing and detailed result tracking.

  • Batch evaluation execution
  • Multi-evaluator runs
  • Progress tracking
  • Result export

Security-Focused Evaluators

Unique to Bastio - powered by our threat detection engine

Unlike Langfuse, Bastio provides pre-built security evaluators that leverage our advanced threat detection capabilities. Evaluate AI safety without building custom prompts.

Prompt Injection

Detect injection attempts using pattern matching and ML analysis

Jailbreak

Identify jailbreak patterns and instruction override attempts

PII Leakage

Check responses for sensitive data exposure

Data Exfiltration

Detect credential and environment variable extraction

Composite

Combined security score from all evaluators

Real-Time Streaming Evaluation

Langfuse is batch-only - Bastio evaluates during generation

Configure per-proxy evaluation that runs during response streaming. Detect issues in real-time and optionally block unsafe responses before they reach users.

Configuration Options

  • Sample Rate

    Evaluate 10%, 50%, or 100% of requests

  • Alert Threshold

    Trigger alerts when score drops below threshold

  • Block Threshold

    Stop response when critical issues detected

  • Eval Timeout

    Maximum evaluation time before skipping

Example Configuration

{
  "proxy_id": "proxy_abc123",
  "enabled": true,
  "evaluator_ids": [
    "eval_security_composite",
    "eval_pii_leakage"
  ],
  "sample_rate": 1.0,
  "alert_on_fail": true,
  "alert_threshold": 0.5,
  "block_on_fail": true,
  "block_threshold": 0.3,
  "eval_timeout_ms": 5000
}
Performance Note:Security evaluators use fast mode (<10ms) to minimize streaming latency.

Golden Response Library

Security-verified baseline responses for regression testing

Store security-verified golden responses with semantic search. Use them as baselines for regression testing and to ensure consistent AI behavior.

Key Features

  • Security Verification - Mark responses as security-reviewed
  • Semantic Search - Find similar responses using embeddings
  • Human Scores - Track quality ratings per response
  • Usage Tracking - See how often goldens are matched
  • Categories & Tags - Organize by use case

From Annotation to Golden

Promote high-quality annotated responses directly to the golden library:

  1. 1.Review response in annotation queue
  2. 2.Add quality scores and feedback
  3. 3.Click "Save as Golden" to preserve
  4. 4.Security team verifies the response
  5. 5.Use in regression testing

Retroactive Evaluation & Comparison

Evaluate historical data and compare runs to detect regressions over time.

Backfill Evaluation

Run evaluators on historical conversations and traces at scale.

  • Filter by date range
  • Filter by proxy or security score
  • Background job processing
  • Progress tracking & cancellation

Run Comparison

Compare two evaluation runs to detect score regressions or improvements.

  • Per-evaluator score deltas
  • Regression/improvement flags
  • Item-level change tracking
  • Statistical summary

Evaluator Validation

Measure LLM judge alignment with human ground truth

Add ground truth labels and generate validation reports to ensure your LLM judges align with human expectations.

Ground Truth Labels

Add human-verified scores for specific conversations to build a validation dataset.

Validation Reports

Generate reports with precision, recall, F1 score, MAE, RMSE, and correlation metrics.

Continuous Improvement

Use validation insights to refine evaluator prompts and improve alignment over time.

Human-in-the-Loop Annotation

Build annotation queues for systematic human review. Perfect for QA workflows, fine-tuning dataset creation, and evaluator validation.

Queue Management

Create queues
Assign reviewers
Claim items
Bulk operations
Status tracking
Priority levels

Annotation Workflow

  • 1.Items added to queue (manually or via filters)
  • 2.Reviewer claims item
  • 3.Add scores, feedback, corrections
  • 4.Optionally save as golden response
  • 5.Mark complete or escalate

Bastio vs Langfuse

Bastio's evaluation system builds on Langfuse's concepts while adding unique security-focused capabilities.

FeatureLangfuseBastio
LLM-as-Judge Evaluators
Custom Evaluators
Test Datasets
Annotation Queues
Security Evaluators-
Real-time Streaming Evaluation-
Golden Response Library-
Built-in Threat Detection-
Evaluator Validation Framework-

Included in Every Plan

Evaluation features are included with all Bastio plans. No additional charges for evaluators, runs, or golden responses.

FeatureFreeStarterProEnterprise
Custom Evaluators525100Unlimited
Test Datasets320100Unlimited
Evaluation Runs/month101001,000Unlimited
Golden Responses202002,000Unlimited
Streaming Evaluation
Security Evaluators

Start Evaluating Your AI

Get comprehensive AI evaluation with built-in security intelligence. Create your first evaluator in minutes.

Questions about evaluation? Contact us for a free consultation.