AI Evaluation System
Comprehensive evaluation framework with LLM judges, security evaluators, golden response libraries, and real-time streaming evaluation.
Bastio provides an AI evaluation framework that goes beyond standard evaluation tools like Langfuse. With built-in security evaluators, real-time streaming evaluation, and golden response libraries, you can ensure AI quality and safety at scale.
Quick Start
1. Create an Evaluator
curl -X POST https://api.bastio.com/evaluation/evaluators \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Helpfulness",
"type": "llm_judge",
"criteria": {
"prompt": "Rate how helpful this response is on a scale of 1-10. Consider accuracy, completeness, and usefulness.",
"model": "gpt-4o"
},
"score_type": "numeric"
}'
2. Create a Test Dataset
curl -X POST https://api.bastio.com/evaluation/datasets \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Customer Support Q&A",
"description": "Test cases for support bot evaluation"
}'
3. Run an Evaluation
curl -X POST https://api.bastio.com/evaluation/runs \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot v1.2 Evaluation",
"dataset_id": "ds_abc123",
"evaluator_ids": ["eval_helpfulness", "eval_security_composite"]
}'
Evaluator Types
Bastio supports multiple evaluator types for different use cases:
LLM Judge
Use an LLM to evaluate responses based on custom criteria.
{
"name": "Tone Check",
"type": "llm_judge",
"criteria": {
"prompt": "Evaluate if this response maintains a professional and friendly tone. Score 1 if yes, 0 if no.",
"model": "gpt-4o-mini"
},
"score_type": "binary"
}
Regex Matching
Check responses for specific patterns. Set "expected" to false to fail any response that matches the pattern.
{
"name": "No Email Disclosure",
"type": "regex",
"criteria": {
"pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
"expected": false
},
"score_type": "binary"
}
Keyword Detection
Search for specific keywords or phrases.
{
"name": "Apology Check",
"type": "keyword",
"criteria": {
"keywords": ["sorry", "apologize", "apologies"],
"match_any": true
},
"score_type": "binary"
}
Semantic Similarity
Compare responses to expected outputs using embeddings.
{
"name": "Answer Accuracy",
"type": "semantic",
"criteria": {
"min_similarity": 0.85
},
"score_type": "numeric"
}
Security Evaluators
Bastio provides five pre-built security evaluators that leverage our threat detection engine:
Prompt Injection Detection
{
"type": "security_prompt_injection"
}
Detects prompt injection attempts using pattern matching and ML-based analysis.
Jailbreak Detection
{
"type": "security_jailbreak"
}
Identifies jailbreak patterns and instruction override attempts.
PII Leakage Detection
{
"type": "security_pii_leakage"
}
Checks responses for sensitive data exposure (emails, phone numbers, SSNs, etc.).
Data Exfiltration Detection
{
"type": "security_data_exfiltration"
}
Detects attempts to extract credentials, API keys, or environment variables.
Composite Security Score
{
"type": "security_composite"
}
Combines all security checks into a single comprehensive score.
Real-Time Streaming Evaluation
Unlike batch-only evaluation tools, Bastio can evaluate responses during streaming generation.
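To make the mechanics concrete, here is a minimal Python sketch (not the Bastio SDK; `stream_with_eval` and the toy evaluator are hypothetical) of scoring a response as chunks arrive and cutting the stream off when the score drops below a blocking threshold:

```python
import re

def stream_with_eval(chunks, evaluate, block_threshold=0.3):
    """Yield chunks to the client, re-scoring the accumulated text after
    each one and cutting the stream off when the score drops below
    block_threshold."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if evaluate(seen) < block_threshold:
            yield "[response blocked]"
            return
        yield chunk

# Toy evaluator: score 0.0 as soon as an email address appears.
def no_email(text):
    return 0.0 if re.search(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text) else 1.0
```

A production setup would run the evaluator asynchronously so it never stalls token delivery, which is what the `eval_timeout_ms` option below is for.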
Configuration
curl -X POST https://api.bastio.com/evaluation/streaming-configs \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"proxy_id": "proxy_abc123",
"enabled": true,
"evaluator_ids": ["eval_security_composite"],
"sample_rate": 1.0,
"alert_on_fail": true,
"alert_threshold": 0.5,
"block_on_fail": true,
"block_threshold": 0.3,
"eval_timeout_ms": 5000
}'
Configuration Options
| Option | Description | Default |
|---|---|---|
| sample_rate | Fraction of requests to evaluate (0.0-1.0) | 1.0 |
| alert_on_fail | Send an alert when the score falls below alert_threshold | true |
| alert_threshold | Score threshold for alerts | 0.5 |
| block_on_fail | Stop the response when the score falls below block_threshold | false |
| block_threshold | Score threshold for blocking | 0.3 |
| eval_timeout_ms | Maximum evaluation time (ms) before the check is skipped | 5000 |
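These options interact in a fixed order: sample first, then block, then alert. A hedged Python illustration of that decision order (`handle_response` is a hypothetical helper whose keys mirror the config fields, not a Bastio API):

```python
import random

def handle_response(config, text, evaluate):
    """Apply the options above to one response: sample, score, then
    block or alert based on the thresholds."""
    if random.random() >= config["sample_rate"]:
        return "skipped"                      # not sampled this time
    score = evaluate(text)
    if config["block_on_fail"] and score < config["block_threshold"]:
        return "blocked"
    if config["alert_on_fail"] and score < config["alert_threshold"]:
        return "alerted"
    return "passed"
```

Note that blocking is checked before alerting, so a score below both thresholds blocks rather than merely alerting.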
View Results
curl -X GET "https://api.bastio.com/evaluation/streaming-results?proxy_id=proxy_abc123&limit=50" \
-H "Authorization: Bearer YOUR_TOKEN"
Golden Response Library
Store security-verified baseline responses for regression testing.
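The regression idea can be sketched as follows, with a toy bag-of-words stand-in for a real embedding model (all function names here are illustrative, not Bastio APIs):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real deployment would call an
    # embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def matches_golden(candidate, golden_output, min_similarity=0.85):
    """Pass when the candidate is semantically close enough to the
    stored golden_output."""
    return cosine(embed(candidate), embed(golden_output)) >= min_similarity
```

The same cosine-over-embeddings comparison underlies the semantic search endpoint's min_similarity parameter.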
Create a Golden Response
curl -X POST https://api.bastio.com/evaluation/golden-responses \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Password Reset Flow",
"category": "customer_support",
"input_messages": [
{"role": "user", "content": "How do I reset my password?"}
],
"golden_output": "To reset your password, visit our login page and click \"Forgot Password\". Enter your email address and we will send you a reset link.",
"tags": ["password", "authentication"]
}'
Security Verification
curl -X POST https://api.bastio.com/evaluation/golden-responses/verify/gr_abc123 \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"security_score": 0.95
}'
Semantic Search
Find similar golden responses using embeddings:
curl -X POST https://api.bastio.com/evaluation/golden-responses/search \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "how to change account credentials",
"min_similarity": 0.7,
"limit": 5
}'
Backfill Evaluation
Evaluate historical conversations and traces at scale.
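Backfill jobs run asynchronously, so clients typically poll the job endpoint until it finishes. A minimal sketch, assuming you supply your own HTTP client as the `get_job` callable (`wait_for_backfill` is a hypothetical helper, not part of any Bastio SDK):

```python
import time

def wait_for_backfill(get_job, job_id, poll_seconds=5, timeout=3600):
    """Poll a backfill job until it leaves the 'running' state.
    get_job is any callable returning the job JSON as a dict
    (e.g. a wrapper around GET /evaluation/backfill/:id)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = get_job(job_id)
        if job["status"] != "running":
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"backfill {job_id} still running after {timeout}s")
```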
Create a Backfill Job
curl -X POST https://api.bastio.com/evaluation/backfill \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Security Audit Q1 2024",
"evaluator_ids": ["eval_security_composite", "eval_pii_leakage"],
"filter_date_start": "2024-01-01T00:00:00Z",
"filter_date_end": "2024-03-31T23:59:59Z",
"filter_proxy_ids": ["proxy_abc123"],
"data_source": "conversations"
}'
Monitor Progress
curl -X GET https://api.bastio.com/evaluation/backfill/job_xyz789 \
-H "Authorization: Bearer YOUR_TOKEN"
Response:
{
"id": "job_xyz789",
"status": "running",
"total_items": 5000,
"processed_items": 2340,
"failed_items": 12,
"avg_scores": {
"eval_security_composite": 0.92,
"eval_pii_leakage": 0.98
}
}
Run Comparison
Compare two evaluation runs to detect regressions.
curl -X POST https://api.bastio.com/evaluation/compare-runs \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"baseline_run_id": "run_baseline",
"comparison_run_id": "run_new_model"
}'
Response:
{
"baseline_run_id": "run_baseline",
"comparison_run_id": "run_new_model",
"evaluator_scores": [
{
"evaluator_id": "eval_helpfulness",
"baseline_avg": 0.85,
"comparison_avg": 0.82,
"delta": -0.03,
"is_regression": true
}
],
"summary": {
"total_items_compared": 500,
"items_regressed": 45,
"items_improved": 120,
"items_unchanged": 335,
"regression_rate": 0.09
}
}
Evaluator Validation
Compare LLM evaluator scores against human ground truth.
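The numeric metrics in a validation report (MAE, RMSE, Pearson correlation) reduce to standard formulas. A self-contained sketch of how such metrics are derived from paired scores (not Bastio code):

```python
import math

def validation_metrics(predicted, ground_truth):
    """Agreement metrics between evaluator scores and human labels:
    mean absolute error, root-mean-square error, Pearson correlation."""
    n = len(predicted)
    mae = sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / n)
    mp, mg = sum(predicted) / n, sum(ground_truth) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(predicted, ground_truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    sg = math.sqrt(sum((g - mg) ** 2 for g in ground_truth))
    return {
        "mae": mae,
        "rmse": rmse,
        "correlation": cov / (sp * sg) if sp and sg else 0.0,
    }
```

Low MAE/RMSE means the judge's scores track human scores in magnitude; high correlation means it ranks responses the same way humans do.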
Add Ground Truth Labels
curl -X POST https://api.bastio.com/evaluation/ground-truth \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"evaluator_id": "eval_helpfulness",
"conversation_id": "conv_abc123",
"ground_truth_score": 8.5,
"labeled_by": "reviewer@company.com"
}'
Generate Validation Report
curl -X POST https://api.bastio.com/evaluation/validation/eval_helpfulness/report \
-H "Authorization: Bearer YOUR_TOKEN"
Response:
{
"evaluator_id": "eval_helpfulness",
"total_samples": 200,
"binary_metrics": {
"accuracy": 0.89,
"precision": 0.91,
"recall": 0.87,
"f1_score": 0.89
},
"numeric_metrics": {
"mae": 0.42,
"rmse": 0.58,
"correlation": 0.84
}
}
Annotation Queue
Build human annotation workflows for QA and dataset creation.
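The claim semantics amount to a priority queue in which a claimed item becomes invisible to other reviewers. A toy in-memory model of that behavior (illustrative only; the real queue lives behind the API):

```python
import heapq

PRIORITY = {"high": 0, "normal": 1, "low": 2}

class AnnotationQueue:
    """Toy model of the claim workflow: high-priority items are handed
    out first, and claiming removes an item so two reviewers never
    annotate the same one."""
    def __init__(self):
        self._heap = []
        self._counter = 0          # FIFO tie-break within a priority

    def add(self, item_id, priority="normal"):
        heapq.heappush(self._heap, (PRIORITY[priority], self._counter, item_id))
        self._counter += 1

    def claim(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```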
Add to Queue
curl -X POST https://api.bastio.com/evaluation/annotation-queue \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"conversation_id": "conv_abc123",
"priority": "high",
"evaluator_ids": ["eval_helpfulness"]
}'
Claim Items
curl -X POST https://api.bastio.com/evaluation/annotation-queue/item_xyz/claim \
-H "Authorization: Bearer YOUR_TOKEN"
Submit Annotation
curl -X POST https://api.bastio.com/evaluation/annotation-queue/item_xyz/submit \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"scores": {"helpfulness": 9, "accuracy": 8},
"feedback": "Clear and accurate response",
"save_as_golden": true
}'
API Reference
Evaluators
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/evaluators | List evaluators |
| POST | /evaluation/evaluators | Create evaluator |
| GET | /evaluation/evaluators/:id | Get evaluator |
| PUT | /evaluation/evaluators/:id | Update evaluator |
| DELETE | /evaluation/evaluators/:id | Delete evaluator |
Datasets
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/datasets | List datasets |
| POST | /evaluation/datasets | Create dataset |
| GET | /evaluation/datasets/:id | Get dataset |
| PUT | /evaluation/datasets/:id | Update dataset |
| DELETE | /evaluation/datasets/:id | Delete dataset |
| POST | /evaluation/datasets/:id/items | Add test case |
Evaluation Runs
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/runs | List runs |
| POST | /evaluation/runs | Start run |
| GET | /evaluation/runs/:id | Get run status |
| POST | /evaluation/runs/:id/cancel | Cancel run |
| POST | /evaluation/compare-runs | Compare two runs |
Golden Responses
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/golden-responses | List responses |
| POST | /evaluation/golden-responses | Create response |
| GET | /evaluation/golden-responses/:id | Get response |
| PUT | /evaluation/golden-responses/:id | Update response |
| DELETE | /evaluation/golden-responses/:id | Delete response |
| POST | /evaluation/golden-responses/verify/:id | Verify response |
| POST | /evaluation/golden-responses/search | Semantic search |
Streaming Evaluation
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/streaming-configs | List configs |
| POST | /evaluation/streaming-configs | Create config |
| GET | /evaluation/streaming-configs/:proxyId | Get config |
| PUT | /evaluation/streaming-configs/:proxyId | Update config |
| DELETE | /evaluation/streaming-configs/:proxyId | Delete config |
| GET | /evaluation/streaming-results | List results |
Backfill
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/backfill | List jobs |
| POST | /evaluation/backfill | Create job |
| GET | /evaluation/backfill/:id | Get job status |
| POST | /evaluation/backfill/:id/cancel | Cancel job |
Validation
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/ground-truth | List labels |
| POST | /evaluation/ground-truth | Add label |
| DELETE | /evaluation/ground-truth/:id | Delete label |
| POST | /evaluation/validation/:evaluatorId/report | Generate report |
| GET | /evaluation/validation/reports | List reports |
Best Practices
1. Start with Security Evaluators
Use the built-in security evaluators before deploying any AI application:
{
"evaluator_ids": [
"security_prompt_injection",
"security_jailbreak",
"security_pii_leakage"
]
}
2. Build a Golden Response Library
For critical user journeys, create golden responses:
- Identify key conversation scenarios
- Create ideal responses
- Verify with security team
- Use for regression testing
3. Validate LLM Judges
Periodically validate your LLM judges against human annotations:
- Sample 100+ conversations
- Have humans label them
- Generate validation report
- Tune evaluator prompts if correlation is low
4. Use Streaming Evaluation in Production
Enable streaming evaluation with blocking for security-critical applications:
{
"enabled": true,
"evaluator_ids": ["security_composite"],
"block_on_fail": true,
"block_threshold": 0.3
}
5. Automate Regression Detection
Run comparison evaluations after model updates:
- Create baseline run before update
- Deploy new model
- Create comparison run
- Check regression rate
- Roll back if regression > threshold
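The steps above reduce to a simple check over per-item scores from the two runs. A sketch with `summarize_comparison` as a hypothetical helper mirroring the compare-runs summary fields (you would feed it scores fetched from the API):

```python
def summarize_comparison(baseline, comparison, tolerance=0.0):
    """Per-item deltas between two runs (dicts of test-case id -> score).
    A drop larger than tolerance counts as a regression."""
    regressed = improved = unchanged = 0
    for item_id, base in baseline.items():
        delta = comparison[item_id] - base
        if delta < -tolerance:
            regressed += 1
        elif delta > tolerance:
            improved += 1
        else:
            unchanged += 1
    total = len(baseline) or 1
    return {
        "items_regressed": regressed,
        "items_improved": improved,
        "items_unchanged": unchanged,
        "regression_rate": regressed / total,
    }
```

Gate deployments on `regression_rate`: if it exceeds your threshold, roll back before the new model reaches all users.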
Pricing
Evaluation features are included with all Bastio plans:
| Feature | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Custom Evaluators | 5 | 25 | 100 | Unlimited |
| Test Datasets | 3 | 20 | 100 | Unlimited |
| Evaluation Runs/month | 10 | 100 | 1,000 | Unlimited |
| Golden Responses | 20 | 200 | 2,000 | Unlimited |
| Streaming Evaluation | Yes | Yes | Yes | Yes |
| Security Evaluators | Yes | Yes | Yes | Yes |
| Backfill Jobs | 1/month | 10/month | 100/month | Unlimited |