Evaluation

Evaluate AI quality with built-in security intelligence

LLM judges, security-focused evaluators, golden response libraries, and real-time streaming evaluation — a complete framework for measuring and improving AI output quality.

5 security evaluators · <10ms eval latency · Real-time streaming evaluation

Evaluator Library

LLM judges, regex, similarity, and security.

| Evaluator        | Type       | Scoring     | Status |
|------------------|------------|-------------|--------|
| response-quality | LLM Judge  | 1–5 numeric | Active |
| pii-leakage      | Security   | Pass / Fail | Active |
| prompt-injection | Security   | Pass / Fail | Active |
| format-check     | Regex      | Pass / Fail | Active |
| topic-relevance  | Similarity | 0–1 score   | Draft  |
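
To list the evaluators configured for a project, a standard REST call should work. The GET endpoint below is an assumption inferred from the create endpoint shown later on this page, not confirmed documentation:

curl https://api.bastio.com/v1/evaluators \
  -H "Authorization: Bearer $TOKEN"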
Evaluation Runs

Batch runs with regression tracking.

run_047: 94% pass rate across 500 items

3 evaluators · 1 hour ago

run_046: Regression detected, quality score −12%

3 evaluators · 6 hours ago

run_045: 98% pass rate, 2 PII flags

5 evaluators · 1 day ago

run_044: Baseline established for v3 prompt

3 evaluators · 3 days ago

Security Evaluators

Five built-in evaluators powered by the threat detection engine.

| Evaluator           | Method             | Latency | Default Action |
|---------------------|--------------------|---------|----------------|
| Prompt Injection    | Pattern + ML       | <10ms   | Block          |
| Jailbreak Detection | Pattern matching   | <10ms   | Block          |
| PII Leakage         | 14-type scanner    | <5ms    | Sanitize       |
| Data Exfiltration   | Content analysis   | <10ms   | Block          |
| Composite Score     | Weighted aggregate | <15ms   | Alert          |

What's included

Quality, security, and regression testing — built in

Every evaluation feature is included with your plan: LLM judges, security evaluators, golden responses, and annotation queues, with no extra configuration required.

LLM-as-judge evaluators
Regex and keyword matching
Semantic similarity scoring
5 built-in security evaluators
Real-time streaming evaluation
Golden response library
Test dataset management
Batch evaluation runs
Run-to-run regression detection
Human annotation queues
Evaluator validation framework
JSONL import/export
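
As a sketch of the JSONL import format, a golden-response record might look like this (the field names are illustrative assumptions, not the documented schema):

{"input": "How do I reset my password?", "expected_output": "Go to Settings > Security and choose Reset Password.", "tags": ["account", "golden"]}
{"input": "What is your refund policy?", "expected_output": "Refunds are available within 30 days of purchase.", "tags": ["billing", "golden"]}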

Create Evaluator

Define custom evaluators with LLM judges, regex patterns, or semantic similarity. Set scoring type and thresholds.

curl -X POST https://api.bastio.com/v1/evaluators \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "response-quality",
    "type": "llm_judge",
    "model": "gpt-4o",
    "criteria": "Rate helpfulness 1-5",
    "scoring": "numeric",
    "threshold": 3.0
  }'
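
A successful create call returns the stored evaluator record. The response shape below is an illustrative sketch, not a documented contract:

{
  "id": "response-quality",
  "type": "llm_judge",
  "status": "active",
  "threshold": 3.0,
  "created_at": "2025-01-15T10:02:11Z"
}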

Run Evaluation

Execute evaluations against a dataset or live traffic. Results include per-item scores and aggregate metrics.

curl -X POST https://api.bastio.com/v1/evaluations/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "ds_customer_support",
    "evaluator_ids": [
      "response-quality",
      "pii-leakage",
      "prompt-injection"
    ],
    "compare_to": "run_046"
  }'
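
Results include per-item scores plus aggregate metrics for each evaluator. The response below is a sketch of what to expect; the exact field names are assumptions:

{
  "run_id": "run_048",
  "items": 500,
  "pass_rate": 0.96,
  "evaluators": {
    "response-quality": {"mean_score": 4.2},
    "pii-leakage": {"pass_rate": 1.0},
    "prompt-injection": {"pass_rate": 0.99}
  },
  "compared_to": "run_046"
}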

Security Evaluators

Five built-in evaluators powered by the threat detection engine. Prompt injection, jailbreak, PII, and more — no custom prompts needed.
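
Because these evaluators are built in, you can reference them by ID in a batch run without defining them first. A minimal sketch using the run endpoint shown above; "jailbreak-detection" is an assumed ID for the jailbreak evaluator, while the other two appear in the evaluator library table:

curl -X POST https://api.bastio.com/v1/evaluations/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "ds_customer_support",
    "evaluator_ids": [
      "prompt-injection",
      "pii-leakage",
      "jailbreak-detection"
    ]
  }'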

Streaming Evaluation

Evaluate during response generation, not after. Detect issues in real time and block unsafe responses before they reach users.
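
One plausible way to enable this per run is a flag on the evaluation request. The "stream" and "on_fail" parameters below are assumptions for illustration; this page does not name the exact options:

curl -X POST https://api.bastio.com/v1/evaluations/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluator_ids": ["prompt-injection", "pii-leakage"],
    "stream": true,
    "on_fail": "block"
  }'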

Regression Detection

Compare evaluation runs to catch quality drops. Per-evaluator score deltas with automatic regression and improvement flags.
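
Per-evaluator deltas make regressions concrete. A comparison result might carry a structure like this (an illustrative shape; the field names are assumptions):

{
  "compare_to": "run_046",
  "flags": [
    {
      "evaluator": "response-quality",
      "baseline_mean": 4.4,
      "current_mean": 3.9,
      "change": "-11.4%",
      "status": "regression"
    },
    {
      "evaluator": "format-check",
      "baseline_pass_rate": 0.95,
      "current_pass_rate": 0.98,
      "change": "+3.2%",
      "status": "improvement"
    }
  ]
}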

Start evaluating your AI output

Evaluation included with every plan. Security evaluators, LLM judges, and regression testing from day one.