Evaluate AI Quality at Scale
Comprehensive evaluation framework with LLM judges, security-focused evaluators, golden response libraries, and real-time streaming evaluation. Better than Langfuse with built-in security intelligence.
5 Security
Built-in security evaluators using threat detection
Real-time
Streaming evaluation during response generation
Golden Lib
Security-verified baseline response library
Regression
Automatic regression detection between runs
Evaluation Framework
A complete toolkit for evaluating AI responses, from automated LLM judges to human annotation workflows.
Custom Evaluators
Create evaluators with multiple types: LLM judge, regex, keyword, semantic similarity, and security.
- LLM-as-judge with custom criteria
- Regex and keyword matching
- Semantic similarity with embeddings
- Binary, numeric, categorical scores
Test Datasets
Manage test datasets with input/output pairs, expected responses, and metadata.
- JSONL import/export
- Version control
- Expected output matching
- Custom metadata fields
Evaluation Runs
Execute evaluations at scale with parallel processing and detailed result tracking.
- Batch evaluation execution
- Multi-evaluator runs
- Progress tracking
- Result export
Security-Focused Evaluators
Unique to Bastio - powered by our threat detection engine
Unlike Langfuse, Bastio provides pre-built security evaluators that leverage our advanced threat detection capabilities. Evaluate AI safety without building custom prompts.
Prompt Injection
Detect injection attempts using pattern matching and ML analysis
Jailbreak
Identify jailbreak patterns and instruction override attempts
PII Leakage
Check responses for sensitive data exposure
Data Exfiltration
Detect credential and environment variable extraction
Composite
Combined security score from all evaluators
Real-Time Streaming Evaluation
Langfuse is batch-only - Bastio evaluates during generation
Configure per-proxy evaluation that runs during response streaming. Detect issues in real-time and optionally block unsafe responses before they reach users.
Configuration Options
- Sample Rate
Evaluate 10%, 50%, or 100% of requests
- Alert Threshold
Trigger alerts when score drops below threshold
- Block Threshold
Stop response when critical issues detected
- Eval Timeout
Maximum evaluation time before skipping
Example Configuration
{
"proxy_id": "proxy_abc123",
"enabled": true,
"evaluator_ids": [
"eval_security_composite",
"eval_pii_leakage"
],
"sample_rate": 1.0,
"alert_on_fail": true,
"alert_threshold": 0.5,
"block_on_fail": true,
"block_threshold": 0.3,
"eval_timeout_ms": 5000
}Golden Response Library
Security-verified baseline responses for regression testing
Store security-verified golden responses with semantic search. Use them as baselines for regression testing and to ensure consistent AI behavior.
Key Features
- Security Verification - Mark responses as security-reviewed
- Semantic Search - Find similar responses using embeddings
- Human Scores - Track quality ratings per response
- Usage Tracking - See how often goldens are matched
- Categories & Tags - Organize by use case
From Annotation to Golden
Promote high-quality annotated responses directly to the golden library:
- 1.Review response in annotation queue
- 2.Add quality scores and feedback
- 3.Click "Save as Golden" to preserve
- 4.Security team verifies the response
- 5.Use in regression testing
Retroactive Evaluation & Comparison
Evaluate historical data and compare runs to detect regressions over time.
Backfill Evaluation
Run evaluators on historical conversations and traces at scale.
- Filter by date range
- Filter by proxy or security score
- Background job processing
- Progress tracking & cancellation
Run Comparison
Compare two evaluation runs to detect score regressions or improvements.
- Per-evaluator score deltas
- Regression/improvement flags
- Item-level change tracking
- Statistical summary
Evaluator Validation
Measure LLM judge alignment with human ground truth
Add ground truth labels and generate validation reports to ensure your LLM judges align with human expectations.
Ground Truth Labels
Add human-verified scores for specific conversations to build a validation dataset.
Validation Reports
Generate reports with precision, recall, F1 score, MAE, RMSE, and correlation metrics.
Continuous Improvement
Use validation insights to refine evaluator prompts and improve alignment over time.
Human-in-the-Loop Annotation
Build annotation queues for systematic human review. Perfect for QA workflows, fine-tuning dataset creation, and evaluator validation.
Queue Management
Annotation Workflow
- 1.Items added to queue (manually or via filters)
- 2.Reviewer claims item
- 3.Add scores, feedback, corrections
- 4.Optionally save as golden response
- 5.Mark complete or escalate
Bastio vs Langfuse
Bastio's evaluation system builds on Langfuse's concepts while adding unique security-focused capabilities.
| Feature | Langfuse | Bastio |
|---|---|---|
| LLM-as-Judge Evaluators | ||
| Custom Evaluators | ||
| Test Datasets | ||
| Annotation Queues | ||
| Security Evaluators | - | |
| Real-time Streaming Evaluation | - | |
| Golden Response Library | - | |
| Built-in Threat Detection | - | |
| Evaluator Validation Framework | - |
Included in Every Plan
Evaluation features are included with all Bastio plans. No additional charges for evaluators, runs, or golden responses.
| Feature | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Custom Evaluators | 5 | 25 | 100 | Unlimited |
| Test Datasets | 3 | 20 | 100 | Unlimited |
| Evaluation Runs/month | 10 | 100 | 1,000 | Unlimited |
| Golden Responses | 20 | 200 | 2,000 | Unlimited |
| Streaming Evaluation | ||||
| Security Evaluators |
Start Evaluating Your AI
Get comprehensive AI evaluation with built-in security intelligence. Create your first evaluator in minutes.
Questions about evaluation? Contact us for a free consultation.