AI Evaluation System
Comprehensive evaluation framework with LLM judges, security evaluators, golden response libraries, and real-time streaming evaluation.
Bastio provides an AI evaluation framework that goes beyond standard evaluation tools like Langfuse. With built-in security evaluators, real-time streaming evaluation, and golden response libraries, you can ensure AI quality and safety at scale.
Quick Start
1. Create an Evaluator
curl -X POST https://api.bastio.com/evaluation/evaluators \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Helpfulness",
"type": "llm_judge",
"criteria": {
"prompt": "Rate how helpful this response is on a scale of 1-10. Consider accuracy, completeness, and usefulness.",
"model": "gpt-4o"
},
"score_type": "numeric"
}'
2. Create a Test Dataset
curl -X POST https://api.bastio.com/evaluation/datasets \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Customer Support Q&A",
"description": "Test cases for support bot evaluation"
}'
3. Run an Evaluation
curl -X POST https://api.bastio.com/evaluation/runs \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Support Bot v1.2 Evaluation",
"dataset_id": "ds_abc123",
"evaluator_ids": ["eval_helpfulness", "eval_security_composite"]
}'
Evaluator Types
Bastio supports multiple evaluator types for different use cases:
LLM Judge
Use an LLM to evaluate responses based on custom criteria.
{
"name": "Tone Check",
"type": "llm_judge",
"criteria": {
"prompt": "Evaluate if this response maintains a professional and friendly tone. Score 1 if yes, 0 if no.",
"model": "gpt-4o-mini"
},
"score_type": "binary"
}
Regex Matching
Check responses for specific patterns. Set "expected" to false to fail any response that matches the pattern.
{
"name": "No Email Disclosure",
"type": "regex",
"criteria": {
"pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
"expected": false
},
"score_type": "binary"
}
Keyword Detection
Search for specific keywords or phrases.
{
"name": "Apology Check",
"type": "keyword",
"criteria": {
"keywords": ["sorry", "apologize", "apologies"],
"match_any": true
},
"score_type": "binary"
}
Semantic Similarity
Compare responses to expected outputs using embeddings.
{
"name": "Answer Accuracy",
"type": "semantic",
"criteria": {
"min_similarity": 0.85
},
"score_type": "numeric"
}
Security Evaluators
Bastio provides five pre-built security evaluators that leverage our threat detection engine:
Prompt Injection Detection
{
"type": "security_prompt_injection"
}
Detects prompt injection attempts using pattern matching and ML-based analysis.
Jailbreak Detection
{
"type": "security_jailbreak"
}
Identifies jailbreak patterns and instruction override attempts.
PII Leakage Detection
{
"type": "security_pii_leakage"
}
Checks responses for sensitive data exposure (emails, phone numbers, SSNs, etc.).
Data Exfiltration Detection
{
"type": "security_data_exfiltration"
}
Detects attempts to extract credentials, API keys, or environment variables.
Composite Security Score
{
"type": "security_composite"
}
Combines all security checks into a single comprehensive score.
Real-Time Streaming Evaluation
Unlike batch-only evaluation tools, Bastio can evaluate responses during streaming generation.
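To make the mechanics concrete, here is a minimal Python sketch (not the Bastio SDK; `stream_with_eval` and the toy evaluator are hypothetical) of scoring a response as chunks arrive and cutting the stream off when the score drops below a blocking threshold:

```python
import re

def stream_with_eval(chunks, evaluate, block_threshold=0.3):
    """Yield chunks to the client, re-scoring the accumulated text after
    each one and cutting the stream off when the score drops below
    block_threshold."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if evaluate(seen) < block_threshold:
            yield "[response blocked]"
            return
        yield chunk

# Toy evaluator: score 0.0 as soon as an email address appears.
def no_email(text):
    return 0.0 if re.search(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text) else 1.0
```

A production setup would run the evaluator asynchronously so it never stalls token delivery, which is what the `eval_timeout_ms` option below is for.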
Configuration
curl -X POST https://api.bastio.com/evaluation/streaming-configs \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"proxy_id": "proxy_abc123",
"enabled": true,
"evaluator_ids": ["eval_security_composite"],
"sample_rate": 1.0,
"alert_on_fail": true,
"alert_threshold": 0.5,
"block_on_fail": true,
"block_threshold": 0.3,
"eval_timeout_ms": 5000
}'
Configuration Options
| Option | Description | Default |
|---|---|---|
| sample_rate | Fraction of requests to evaluate (0.0-1.0) | 1.0 |
| alert_on_fail | Send an alert when the score falls below alert_threshold | true |
| alert_threshold | Score threshold for alerts | 0.5 |
| block_on_fail | Stop the response when the score falls below block_threshold | false |
| block_threshold | Score threshold for blocking | 0.3 |
| eval_timeout_ms | Maximum evaluation time (ms) before the check is skipped | 5000 |
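These options interact in a fixed order: sample first, then block, then alert. A hedged Python illustration of that decision order (`handle_response` is a hypothetical helper whose keys mirror the config fields, not a Bastio API):

```python
import random

def handle_response(config, text, evaluate):
    """Apply the options above to one response: sample, score, then
    block or alert based on the thresholds."""
    if random.random() >= config["sample_rate"]:
        return "skipped"                      # not sampled this time
    score = evaluate(text)
    if config["block_on_fail"] and score < config["block_threshold"]:
        return "blocked"
    if config["alert_on_fail"] and score < config["alert_threshold"]:
        return "alerted"
    return "passed"
```

Note that blocking is checked before alerting, so a score below both thresholds blocks rather than merely alerting.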
View Results
curl -X GET "https://api.bastio.com/evaluation/streaming-results?proxy_id=proxy_abc123&limit=50" \
-H "Authorization: Bearer YOUR_TOKEN"
Golden Response Library
Store security-verified baseline responses for regression testing.
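The regression idea can be sketched as follows, with a toy bag-of-words stand-in for a real embedding model (all function names here are illustrative, not Bastio APIs):

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real deployment would call an
    # embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def matches_golden(candidate, golden_output, min_similarity=0.85):
    """Pass when the candidate is semantically close enough to the
    stored golden_output."""
    return cosine(embed(candidate), embed(golden_output)) >= min_similarity
```

The same cosine-over-embeddings comparison underlies the semantic search endpoint's min_similarity parameter.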
Create a Golden Response
curl -X POST https://api.bastio.com/evaluation/golden-responses \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Password Reset Flow",
"category": "customer_support",
"input_messages": [
{"role": "user", "content": "How do I reset my password?"}
],
"golden_output": "To reset your password, visit our login page and click \"Forgot Password\". Enter your email address and we will send you a reset link.",
"tags": ["password", "authentication"]
}'
Security Verification
curl -X POST https://api.bastio.com/evaluation/golden-responses/verify/gr_abc123 \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"security_score": 0.95
}'
Semantic Search
Find similar golden responses using embeddings:
curl -X POST https://api.bastio.com/evaluation/golden-responses/search \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "how to change account credentials",
"min_similarity": 0.7,
"limit": 5
}'
Backfill Evaluation
Evaluate historical conversations and traces at scale.
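Backfill jobs run asynchronously, so clients typically poll the job endpoint until it finishes. A minimal sketch, assuming you supply your own HTTP client as the `get_job` callable (`wait_for_backfill` is a hypothetical helper, not part of any Bastio SDK):

```python
import time

def wait_for_backfill(get_job, job_id, poll_seconds=5, timeout=3600):
    """Poll a backfill job until it leaves the 'running' state.
    get_job is any callable returning the job JSON as a dict
    (e.g. a wrapper around GET /evaluation/backfill/:id)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        job = get_job(job_id)
        if job["status"] != "running":
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"backfill {job_id} still running after {timeout}s")
```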
Create a Backfill Job
curl -X POST https://api.bastio.com/evaluation/backfill \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name": "Security Audit Q1 2024",
"evaluator_ids": ["eval_security_composite", "eval_pii_leakage"],
"filter_date_start": "2024-01-01T00:00:00Z",
"filter_date_end": "2024-03-31T23:59:59Z",
"filter_proxy_ids": ["proxy_abc123"],
"data_source": "conversations"
}'
Monitor Progress
curl -X GET https://api.bastio.com/evaluation/backfill/job_xyz789 \
-H "Authorization: Bearer YOUR_TOKEN"
Response:
{
"id": "job_xyz789",
"status": "running",
"total_items": 5000,
"processed_items": 2340,
"failed_items": 12,
"avg_scores": {
"eval_security_composite": 0.92,
"eval_pii_leakage": 0.98
}
}
Run Comparison
Compare two evaluation runs to detect regressions.
curl -X POST https://api.bastio.com/evaluation/compare-runs \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"baseline_run_id": "run_baseline",
"comparison_run_id": "run_new_model"
}'
Response:
{
"baseline_run_id": "run_baseline",
"comparison_run_id": "run_new_model",
"evaluator_scores": [
{
"evaluator_id": "eval_helpfulness",
"baseline_avg": 0.85,
"comparison_avg": 0.82,
"delta": -0.03,
"is_regression": true
}
],
"summary": {
"total_items_compared": 500,
"items_regressed": 45,
"items_improved": 120,
"items_unchanged": 335,
"regression_rate": 0.09
}
}
Evaluator Validation
Compare LLM evaluator scores against human ground truth.
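The numeric metrics in a validation report (MAE, RMSE, Pearson correlation) reduce to standard formulas. A self-contained sketch of how such metrics are derived from paired scores (not Bastio code):

```python
import math

def validation_metrics(predicted, ground_truth):
    """Agreement metrics between evaluator scores and human labels:
    mean absolute error, root-mean-square error, Pearson correlation."""
    n = len(predicted)
    mae = sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / n)
    mp, mg = sum(predicted) / n, sum(ground_truth) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(predicted, ground_truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    sg = math.sqrt(sum((g - mg) ** 2 for g in ground_truth))
    return {
        "mae": mae,
        "rmse": rmse,
        "correlation": cov / (sp * sg) if sp and sg else 0.0,
    }
```

Low MAE/RMSE means the judge's scores track human scores in magnitude; high correlation means it ranks responses the same way humans do.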
Add Ground Truth Labels
curl -X POST https://api.bastio.com/evaluation/ground-truth \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"evaluator_id": "eval_helpfulness",
"conversation_id": "conv_abc123",
"ground_truth_score": 8.5,
"labeled_by": "reviewer@company.com"
}'
Generate Validation Report
curl -X POST https://api.bastio.com/evaluation/validation/eval_helpfulness/report \
-H "Authorization: Bearer YOUR_TOKEN"
Response:
{
"evaluator_id": "eval_helpfulness",
"total_samples": 200,
"binary_metrics": {
"accuracy": 0.89,
"precision": 0.91,
"recall": 0.87,
"f1_score": 0.89
},
"numeric_metrics": {
"mae": 0.42,
"rmse": 0.58,
"correlation": 0.84
}
}
Annotation Queue
Build human annotation workflows for QA and dataset creation.
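The claim semantics amount to a priority queue in which a claimed item becomes invisible to other reviewers. A toy in-memory model of that behavior (illustrative only; the real queue lives behind the API):

```python
import heapq

PRIORITY = {"high": 0, "normal": 1, "low": 2}

class AnnotationQueue:
    """Toy model of the claim workflow: high-priority items are handed
    out first, and claiming removes an item so two reviewers never
    annotate the same one."""
    def __init__(self):
        self._heap = []
        self._counter = 0          # FIFO tie-break within a priority

    def add(self, item_id, priority="normal"):
        heapq.heappush(self._heap, (PRIORITY[priority], self._counter, item_id))
        self._counter += 1

    def claim(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```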
Add to Queue
curl -X POST https://api.bastio.com/evaluation/annotation-queue \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"conversation_id": "conv_abc123",
"priority": "high",
"evaluator_ids": ["eval_helpfulness"]
}'
Claim Items
curl -X POST https://api.bastio.com/evaluation/annotation-queue/item_xyz/claim \
-H "Authorization: Bearer YOUR_TOKEN"
Submit Annotation
curl -X POST https://api.bastio.com/evaluation/annotation-queue/item_xyz/submit \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"scores": {"helpfulness": 9, "accuracy": 8},
"feedback": "Clear and accurate response",
"save_as_golden": true
}'
API Reference
Evaluators
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/evaluators | List evaluators |
| POST | /evaluation/evaluators | Create evaluator |
| GET | /evaluation/evaluators/:id | Get evaluator |
| PUT | /evaluation/evaluators/:id | Update evaluator |
| DELETE | /evaluation/evaluators/:id | Delete evaluator |
Datasets
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/datasets | List datasets |
| POST | /evaluation/datasets | Create dataset |
| GET | /evaluation/datasets/:id | Get dataset |
| PUT | /evaluation/datasets/:id | Update dataset |
| DELETE | /evaluation/datasets/:id | Delete dataset |
| POST | /evaluation/datasets/:id/items | Add test case |
Evaluation Runs
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/runs | List runs |
| POST | /evaluation/runs | Start run |
| GET | /evaluation/runs/:id | Get run status |
| POST | /evaluation/runs/:id/cancel | Cancel run |
| POST | /evaluation/compare-runs | Compare two runs |
Golden Responses
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/golden-responses | List responses |
| POST | /evaluation/golden-responses | Create response |
| GET | /evaluation/golden-responses/:id | Get response |
| PUT | /evaluation/golden-responses/:id | Update response |
| DELETE | /evaluation/golden-responses/:id | Delete response |
| POST | /evaluation/golden-responses/verify/:id | Verify response |
| POST | /evaluation/golden-responses/search | Semantic search |
Streaming Evaluation
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/streaming-configs | List configs |
| POST | /evaluation/streaming-configs | Create config |
| GET | /evaluation/streaming-configs/:proxyId | Get config |
| PUT | /evaluation/streaming-configs/:proxyId | Update config |
| DELETE | /evaluation/streaming-configs/:proxyId | Delete config |
| GET | /evaluation/streaming-results | List results |
Backfill
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/backfill | List jobs |
| POST | /evaluation/backfill | Create job |
| GET | /evaluation/backfill/:id | Get job status |
| POST | /evaluation/backfill/:id/cancel | Cancel job |
Validation
| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/ground-truth | List labels |
| POST | /evaluation/ground-truth | Add label |
| DELETE | /evaluation/ground-truth/:id | Delete label |
| POST | /evaluation/validation/:evaluatorId/report | Generate report |
| GET | /evaluation/validation/reports | List reports |
Best Practices
1. Start with Security Evaluators
Use the built-in security evaluators before deploying any AI application:
{
"evaluator_ids": [
"security_prompt_injection",
"security_jailbreak",
"security_pii_leakage"
]
}
2. Build a Golden Response Library
For critical user journeys, create golden responses:
- Identify key conversation scenarios
- Create ideal responses
- Verify with security team
- Use for regression testing
3. Validate LLM Judges
Periodically validate your LLM judges against human annotations:
- Sample 100+ conversations
- Have humans label them
- Generate validation report
- Tune evaluator prompts if correlation is low
4. Use Streaming Evaluation in Production
Enable streaming evaluation with blocking for security-critical applications:
{
"enabled": true,
"evaluator_ids": ["security_composite"],
"block_on_fail": true,
"block_threshold": 0.3
}
5. Automate Regression Detection
Run comparison evaluations after model updates:
- Create baseline run before update
- Deploy new model
- Create comparison run
- Check regression rate
- Roll back if regression > threshold
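The steps above reduce to a simple check over per-item scores from the two runs. A sketch with `summarize_comparison` as a hypothetical helper mirroring the compare-runs summary fields (you would feed it scores fetched from the API):

```python
def summarize_comparison(baseline, comparison, tolerance=0.0):
    """Per-item deltas between two runs (dicts of test-case id -> score).
    A drop larger than tolerance counts as a regression."""
    regressed = improved = unchanged = 0
    for item_id, base in baseline.items():
        delta = comparison[item_id] - base
        if delta < -tolerance:
            regressed += 1
        elif delta > tolerance:
            improved += 1
        else:
            unchanged += 1
    total = len(baseline) or 1
    return {
        "items_regressed": regressed,
        "items_improved": improved,
        "items_unchanged": unchanged,
        "regression_rate": regressed / total,
    }
```

Gate deployments on `regression_rate`: if it exceeds your threshold, roll back before the new model reaches all users.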
Pricing
Evaluation features are included with all Bastio plans:
| Feature | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Custom Evaluators | 5 | 25 | 100 | Unlimited |
| Test Datasets | 3 | 20 | 100 | Unlimited |
| Evaluation Runs/month | 10 | 100 | 1,000 | Unlimited |
| Golden Responses | 20 | 200 | 2,000 | Unlimited |
| Streaming Evaluation | Yes | Yes | Yes | Yes |
| Security Evaluators | Yes | Yes | Yes | Yes |
| Backfill Jobs | 1/month | 10/month | 100/month | Unlimited |