Evaluation

Evaluate AI quality with built-in security intelligence

LLM judges, security-focused evaluators, golden response libraries, and real-time streaming evaluation — a complete framework for measuring and improving AI output quality.

Security evaluators

< 10ms

Eval latency

Real-time

Streaming evaluation

Evaluator Library

LLM judges, regex, similarity, and security.

Evaluator	Type	Scoring	Status
response-quality	LLM Judge	1–5 numeric	Active
pii-leakage	Security	Pass / Fail	Active
prompt-injection	Security	Pass / Fail	Active
format-check	Regex	Pass / Fail	Active
topic-relevance	Similarity	0–1 score	Draft

Evaluation Runs

Batch runs with regression tracking.

run_047 — 94% pass rate across 500 items

3 evaluators · 1 hour ago

run_046 — Regression detected: quality score −12%

3 evaluators · 6 hours ago

run_045 — 98% pass rate, 2 PII flags

5 evaluators · 1 day ago

run_044 — Baseline established for v3 prompt

3 evaluators · 3 days ago

Security Evaluators

Five built-in evaluators powered by the threat detection engine.

Evaluator	Method	Latency	Default Action
Prompt Injection	Pattern + ML	<10ms	Block
Jailbreak Detection	Pattern matching	<10ms	Block
PII Leakage	14-type scanner	<5ms	Sanitize
Data Exfiltration	Content analysis	<10ms	Block
Composite Score	Weighted aggregate	<15ms	Alert

What's included

Quality, security, and regression testing — built in

Every evaluation feature included with your plan. LLM judges, security evaluators, golden responses, and annotation queues at no extra configuration.

LLM-as-judge evaluators

Regex and keyword matching

Semantic similarity scoring

5 built-in security evaluators

Real-time streaming evaluation

Golden response library

Test dataset management

Batch evaluation runs

Run-to-run regression detection

Human annotation queues

Evaluator validation framework

JSONL import/export

Create Evaluator

Define custom evaluators with LLM judges, regex patterns, or semantic similarity. Set scoring type and thresholds.

curl -X POST https://api.bastio.com/v1/evaluators \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "response-quality",
    "type": "llm_judge",
    "model": "gpt-4o",
    "criteria": "Rate helpfulness 1-5",
    "scoring": "numeric",
    "threshold": 3.0
  }'

Run Evaluation

Execute evaluations against a dataset or live traffic. Results include per-item scores and aggregate metrics.

curl -X POST https://api.bastio.com/v1/evaluations/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "ds_customer_support",
    "evaluator_ids": [
      "response-quality",
      "pii-leakage",
      "prompt-injection"
    ],
    "compare_to": "run_046"
  }'

Security Evaluators

Five built-in evaluators powered by the threat detection engine. Prompt injection, jailbreak, PII, and more — no custom prompts needed.

Streaming Evaluation

Evaluate during response generation, not after. Detect issues in real-time and block unsafe responses before they reach users.

Regression Detection

Compare evaluation runs to catch quality drops. Per-evaluator score deltas with automatic regression and improvement flags.

Start evaluating your AI output

Evaluation included with every plan. Security evaluators, LLM judges, and regression testing from day one.

Get started free Read the docs