AI Evaluation System

Comprehensive evaluation framework with LLM judges, security evaluators, golden response libraries, and real-time streaming evaluation.

Bastio provides an AI evaluation framework that goes beyond standard evaluation tools like Langfuse. With built-in security evaluators, real-time streaming evaluation, and golden response libraries, you can ensure AI quality and safety at scale.

Quick Start

1. Create an Evaluator

curl -X POST https://api.bastio.com/evaluation/evaluators \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Helpfulness",
    "type": "llm_judge",
    "criteria": {
      "prompt": "Rate how helpful this response is on a scale of 1-10. Consider accuracy, completeness, and usefulness.",
      "model": "gpt-4o"
    },
    "score_type": "numeric"
  }'

2. Create a Test Dataset

curl -X POST https://api.bastio.com/evaluation/datasets \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Q&A",
    "description": "Test cases for support bot evaluation"
  }'
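With the dataset created, add test cases through the items endpoint listed in the API reference below. The item body here is a sketch: input_messages mirrors the golden response format, and expected_output (an assumed field name) supplies the reference answer that semantic evaluators compare against.

curl -X POST https://api.bastio.com/evaluation/datasets/ds_abc123/items \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input_messages": [
      {"role": "user", "content": "How do I reset my password?"}
    ],
    "expected_output": "To reset your password, visit our login page and click \"Forgot Password\"."
  }'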

3. Run an Evaluation

curl -X POST https://api.bastio.com/evaluation/runs \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Support Bot v1.2 Evaluation",
    "dataset_id": "ds_abc123",
    "evaluator_ids": ["eval_helpfulness", "eval_security_composite"]
  }'
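Runs are processed asynchronously; check progress by fetching the run ID from the create response (a placeholder ID is shown here):

curl -X GET https://api.bastio.com/evaluation/runs/run_abc123 \
  -H "Authorization: Bearer YOUR_TOKEN"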

Evaluator Types

Bastio supports multiple evaluator types for different use cases:

LLM Judge

Use an LLM to evaluate responses based on custom criteria.

{
  "name": "Tone Check",
  "type": "llm_judge",
  "criteria": {
    "prompt": "Evaluate if this response maintains a professional and friendly tone. Score 1 if yes, 0 if no.",
    "model": "gpt-4o-mini"
  },
  "score_type": "binary"
}

Regex Matching

Check for specific patterns in responses.

{
  "name": "No Email Disclosure",
  "type": "regex",
  "criteria": {
    "pattern": "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
    "expected": false
  },
  "score_type": "binary"
}

Keyword Detection

Search for specific keywords or phrases.

{
  "name": "Apology Check",
  "type": "keyword",
  "criteria": {
    "keywords": ["sorry", "apologize", "apologies"],
    "match_any": true
  },
  "score_type": "binary"
}

Semantic Similarity

Compare responses to expected outputs using embeddings.

{
  "name": "Answer Accuracy",
  "type": "semantic",
  "criteria": {
    "min_similarity": 0.85
  },
  "score_type": "numeric"
}

Security Evaluators

Bastio provides five pre-built security evaluators that leverage our threat detection engine:

Prompt Injection Detection

{
  "type": "security_prompt_injection"
}

Detects prompt injection attempts using pattern matching and ML-based analysis.

Jailbreak Detection

{
  "type": "security_jailbreak"
}

Identifies jailbreak patterns and instruction override attempts.

PII Leakage Detection

{
  "type": "security_pii_leakage"
}

Checks responses for sensitive data exposure (emails, phone numbers, SSNs, etc.).

Data Exfiltration Detection

{
  "type": "security_data_exfiltration"
}

Detects attempts to extract credentials, API keys, or environment variables.

Composite Security Score

{
  "type": "security_composite"
}

Combines all security checks into a single comprehensive score.
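Security evaluators appear to plug into the same creation endpoint as other evaluator types; a minimal sketch, assuming no criteria block is required:

curl -X POST https://api.bastio.com/evaluation/evaluators \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Security Composite",
    "type": "security_composite"
  }'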

Real-Time Streaming Evaluation

Unlike batch-only evaluation tools, Bastio can evaluate responses during streaming generation.

Configuration

curl -X POST https://api.bastio.com/evaluation/streaming-configs \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "proxy_id": "proxy_abc123",
    "enabled": true,
    "evaluator_ids": ["eval_security_composite"],
    "sample_rate": 1.0,
    "alert_on_fail": true,
    "alert_threshold": 0.5,
    "block_on_fail": true,
    "block_threshold": 0.3,
    "eval_timeout_ms": 5000
  }'

Configuration Options

| Option | Description | Default |
|---|---|---|
| sample_rate | Fraction of requests to evaluate (0.0-1.0) | 1.0 |
| alert_on_fail | Send an alert when a score falls below alert_threshold | true |
| alert_threshold | Score threshold for alerts | 0.5 |
| block_on_fail | Stop the response when critical issues are detected | false |
| block_threshold | Score threshold for blocking | 0.3 |
| eval_timeout_ms | Maximum evaluation time in milliseconds before skipping | 5000 |

View Results

curl -X GET "https://api.bastio.com/evaluation/streaming-results?proxy_id=proxy_abc123&limit=50" \
  -H "Authorization: Bearer YOUR_TOKEN"

Golden Response Library

Store security-verified baseline responses for regression testing.

Create a Golden Response

curl -X POST https://api.bastio.com/evaluation/golden-responses \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Password Reset Flow",
    "category": "customer_support",
    "input_messages": [
      {"role": "user", "content": "How do I reset my password?"}
    ],
    "golden_output": "To reset your password, visit our login page and click \"Forgot Password\". Enter your email address and we will send you a reset link.",
    "tags": ["password", "authentication"]
  }'

Security Verification

curl -X POST https://api.bastio.com/evaluation/golden-responses/verify/gr_abc123 \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "security_score": 0.95
  }'

Semantic Search

Find similar golden responses using embeddings:

curl -X POST https://api.bastio.com/evaluation/golden-responses/search \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "how to change account credentials",
    "min_similarity": 0.7,
    "limit": 5
  }'
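An illustrative search result (shape assumed); each match carries a similarity score you can compare against min_similarity:

{
  "matches": [
    {
      "id": "gr_abc123",
      "name": "Password Reset Flow",
      "similarity": 0.82
    }
  ]
}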

Backfill Evaluation

Evaluate historical conversations and traces at scale.

Create a Backfill Job

curl -X POST https://api.bastio.com/evaluation/backfill \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Security Audit Q1 2024",
    "evaluator_ids": ["eval_security_composite", "eval_pii_leakage"],
    "filter_date_start": "2024-01-01T00:00:00Z",
    "filter_date_end": "2024-03-31T23:59:59Z",
    "filter_proxy_ids": ["proxy_abc123"],
    "data_source": "conversations"
  }'

Monitor Progress

curl -X GET https://api.bastio.com/evaluation/backfill/job_xyz789 \
  -H "Authorization: Bearer YOUR_TOKEN"

Response:

{
  "id": "job_xyz789",
  "status": "running",
  "total_items": 5000,
  "processed_items": 2340,
  "failed_items": 12,
  "avg_scores": {
    "eval_security_composite": 0.92,
    "eval_pii_leakage": 0.98
  }
}
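Long-running jobs can be stopped with the documented cancel endpoint:

curl -X POST https://api.bastio.com/evaluation/backfill/job_xyz789/cancel \
  -H "Authorization: Bearer YOUR_TOKEN"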

Run Comparison

Compare two evaluation runs to detect regressions.

curl -X POST https://api.bastio.com/evaluation/compare-runs \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "baseline_run_id": "run_baseline",
    "comparison_run_id": "run_new_model"
  }'

Response:

{
  "baseline_run_id": "run_baseline",
  "comparison_run_id": "run_new_model",
  "evaluator_scores": [
    {
      "evaluator_id": "eval_helpfulness",
      "baseline_avg": 0.85,
      "comparison_avg": 0.82,
      "delta": -0.03,
      "is_regression": true
    }
  ],
  "summary": {
    "total_items_compared": 500,
    "items_regressed": 45,
    "items_improved": 120,
    "items_unchanged": 335,
    "regression_rate": 0.09
  }
}

Evaluator Validation

Compare LLM evaluator scores against human ground truth.

Add Ground Truth Labels

curl -X POST https://api.bastio.com/evaluation/ground-truth \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluator_id": "eval_helpfulness",
    "conversation_id": "conv_abc123",
    "ground_truth_score": 8.5,
    "labeled_by": "reviewer@company.com"
  }'

Generate Validation Report

curl -X POST https://api.bastio.com/evaluation/validation/eval_helpfulness/report \
  -H "Authorization: Bearer YOUR_TOKEN"

Response:

{
  "evaluator_id": "eval_helpfulness",
  "total_samples": 200,
  "binary_metrics": {
    "accuracy": 0.89,
    "precision": 0.91,
    "recall": 0.87,
    "f1_score": 0.89
  },
  "numeric_metrics": {
    "mae": 0.42,
    "rmse": 0.58,
    "correlation": 0.84
  }
}

Annotation Queue

Build human annotation workflows for QA and dataset creation.

Add to Queue

curl -X POST https://api.bastio.com/evaluation/annotation-queue \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_id": "conv_abc123",
    "priority": "high",
    "evaluator_ids": ["eval_helpfulness"]
  }'

Claim Items

curl -X POST https://api.bastio.com/evaluation/annotation-queue/item_xyz/claim \
  -H "Authorization: Bearer YOUR_TOKEN"

Submit Annotation

curl -X POST https://api.bastio.com/evaluation/annotation-queue/item_xyz/submit \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "scores": {"helpfulness": 9, "accuracy": 8},
    "feedback": "Clear and accurate response",
    "save_as_golden": true
  }'

API Reference

Evaluators

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/evaluators | List evaluators |
| POST | /evaluation/evaluators | Create evaluator |
| GET | /evaluation/evaluators/:id | Get evaluator |
| PUT | /evaluation/evaluators/:id | Update evaluator |
| DELETE | /evaluation/evaluators/:id | Delete evaluator |

Datasets

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/datasets | List datasets |
| POST | /evaluation/datasets | Create dataset |
| GET | /evaluation/datasets/:id | Get dataset |
| PUT | /evaluation/datasets/:id | Update dataset |
| DELETE | /evaluation/datasets/:id | Delete dataset |
| POST | /evaluation/datasets/:id/items | Add test case |

Evaluation Runs

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/runs | List runs |
| POST | /evaluation/runs | Start run |
| GET | /evaluation/runs/:id | Get run status |
| POST | /evaluation/runs/:id/cancel | Cancel run |
| POST | /evaluation/compare-runs | Compare two runs |

Golden Responses

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/golden-responses | List responses |
| POST | /evaluation/golden-responses | Create response |
| GET | /evaluation/golden-responses/:id | Get response |
| PUT | /evaluation/golden-responses/:id | Update response |
| DELETE | /evaluation/golden-responses/:id | Delete response |
| POST | /evaluation/golden-responses/verify/:id | Verify response |
| POST | /evaluation/golden-responses/search | Semantic search |

Streaming Evaluation

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/streaming-configs | List configs |
| POST | /evaluation/streaming-configs | Create config |
| GET | /evaluation/streaming-configs/:proxyId | Get config |
| PUT | /evaluation/streaming-configs/:proxyId | Update config |
| DELETE | /evaluation/streaming-configs/:proxyId | Delete config |
| GET | /evaluation/streaming-results | List results |

Backfill

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/backfill | List jobs |
| POST | /evaluation/backfill | Create job |
| GET | /evaluation/backfill/:id | Get job status |
| POST | /evaluation/backfill/:id/cancel | Cancel job |

Validation

| Method | Endpoint | Description |
|---|---|---|
| GET | /evaluation/ground-truth | List labels |
| POST | /evaluation/ground-truth | Add label |
| DELETE | /evaluation/ground-truth/:id | Delete label |
| POST | /evaluation/validation/:evaluatorId/report | Generate report |
| GET | /evaluation/validation/reports | List reports |

Best Practices

1. Start with Security Evaluators

Use the built-in security evaluators before deploying any AI application:

{
  "evaluator_ids": [
    "security_prompt_injection",
    "security_jailbreak",
    "security_pii_leakage"
  ]
}

2. Build a Golden Response Library

For critical user journeys, create golden responses:

  1. Identify key conversation scenarios
  2. Create ideal responses
  3. Verify with security team
  4. Use for regression testing

3. Validate LLM Judges

Periodically validate your LLM judges against human annotations (a scripted sketch follows these steps):

  1. Sample 100+ conversations
  2. Have humans label them
  3. Generate validation report
  4. Tune evaluator prompts if correlation is low
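The labeling loop can be scripted. A minimal sketch, assuming human scores live in a labels.csv file with conversation_id,score rows and that jq is installed; the evaluator ID and reviewer email mirror the earlier examples:

# Post one ground-truth label per CSV row (conversation_id,score).
while IFS=, read -r conv_id score; do
  curl -s -X POST https://api.bastio.com/evaluation/ground-truth \
    -H "Authorization: Bearer YOUR_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{
      \"evaluator_id\": \"eval_helpfulness\",
      \"conversation_id\": \"$conv_id\",
      \"ground_truth_score\": $score,
      \"labeled_by\": \"reviewer@company.com\"
    }"
done < labels.csv

# Generate the report and pull out the correlation with human scores.
curl -s -X POST https://api.bastio.com/evaluation/validation/eval_helpfulness/report \
  -H "Authorization: Bearer YOUR_TOKEN" | jq '.numeric_metrics.correlation'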

4. Use Streaming Evaluation in Production

Enable streaming evaluation with blocking for security-critical applications:

{
  "enabled": true,
  "evaluator_ids": ["security_composite"],
  "block_on_fail": true,
  "block_threshold": 0.3
}

5. Automate Regression Detection

Run comparison evaluations after model updates (a CI-style sketch follows these steps):

  1. Create baseline run before update
  2. Deploy new model
  3. Create comparison run
  4. Check regression rate
  5. Roll back if regression > threshold
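A minimal CI sketch using the compare-runs endpoint and the documented summary.regression_rate field; the run IDs and the 0.05 rollback threshold are placeholders, and jq and bc are assumed to be available:

#!/usr/bin/env bash
set -euo pipefail

# Compare the new model's run against the baseline.
RATE=$(curl -s -X POST https://api.bastio.com/evaluation/compare-runs \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "baseline_run_id": "run_baseline",
    "comparison_run_id": "run_new_model"
  }' | jq '.summary.regression_rate')

# Fail the pipeline when more than 5% of items regressed.
if (( $(echo "$RATE > 0.05" | bc -l) )); then
  echo "Regression rate $RATE exceeds threshold; roll back."
  exit 1
fi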

Pricing

Evaluation features are included with all Bastio plans:

| Feature | Free | Starter | Pro | Enterprise |
|---|---|---|---|---|
| Custom Evaluators | 5 | 25 | 100 | Unlimited |
| Test Datasets | 3 | 20 | 100 | Unlimited |
| Evaluation Runs/month | 10 | 100 | 1,000 | Unlimited |
| Golden Responses | 20 | 200 | 2,000 | Unlimited |
| Streaming Evaluation | Yes | Yes | Yes | Yes |
| Security Evaluators | Yes | Yes | Yes | Yes |
| Backfill Jobs | 1/month | 10/month | 100/month | Unlimited |

Next Steps