Evaluation
Evaluate AI quality with built-in security intelligence
LLM judges, security-focused evaluators, golden response libraries, and real-time streaming evaluation — a complete framework for measuring and improving AI output quality.
5
Security evaluators
< 10ms
Eval latency
Real-time
Streaming evaluation
Evaluator types: LLM judges, regex, semantic similarity, and security.
| Evaluator | Type | Scoring | Status |
|---|---|---|---|
| response-quality | LLM Judge | 1–5 numeric | Active |
| pii-leakage | Security | Pass / Fail | Active |
| prompt-injection | Security | Pass / Fail | Active |
| format-check | Regex | Pass / Fail | Active |
| topic-relevance | Similarity | 0–1 score | Draft |
Batch runs with regression tracking.
run_047 — 94% pass rate across 500 items
3 evaluators · 1 hour ago
run_046 — Regression detected: quality score −12%
3 evaluators · 6 hours ago
run_045 — 98% pass rate, 2 PII flags
5 evaluators · 1 day ago
run_044 — Baseline established for v3 prompt
3 evaluators · 3 days ago
Five built-in evaluators powered by the threat detection engine.
| Evaluator | Method | Latency | Default Action |
|---|---|---|---|
| Prompt Injection | Pattern + ML | <10ms | Block |
| Jailbreak Detection | Pattern matching | <10ms | Block |
| PII Leakage | 14-type scanner | <5ms | Sanitize |
| Data Exfiltration | Content analysis | <10ms | Block |
| Composite Score | Weighted aggregate | <15ms | Alert |
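The Composite Score row above aggregates the individual security signals into one weighted number that drives the alert action. As a rough local illustration only (the weights, variable names, and the 0.5 alert threshold here are assumptions, not Bastio's actual formula), the aggregation can be sketched in shell:

```shell
#!/bin/sh
# Hypothetical weighted aggregate of per-evaluator risk scores
# (0 = clean, 1 = flagged). Weights are illustrative assumptions.
awk 'BEGIN {
  injection = 0.0; jailbreak = 0.0; pii = 1.0; exfiltration = 0.0
  score = 0.35*injection + 0.25*jailbreak + 0.25*pii + 0.15*exfiltration
  printf "composite=%.2f action=%s\n", score, (score >= 0.5 ? "block" : "alert")
}'
```

With only the PII signal flagged, the weighted score stays below the block threshold, so the sketch reports an alert rather than a block.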
What's included
Quality, security, and regression testing — built in
Every evaluation feature is included with your plan. LLM judges, security evaluators, golden responses, and annotation queues with no extra configuration.
Create Evaluator
Define custom evaluators with LLM judges, regex patterns, or semantic similarity. Set scoring type and thresholds.
```shell
curl -X POST https://api.bastio.com/v1/evaluators \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "response-quality",
    "type": "llm_judge",
    "model": "gpt-4o",
    "criteria": "Rate helpfulness 1-5",
    "scoring": "numeric",
    "threshold": 3.0
  }'
```

Run Evaluation
Execute evaluations against a dataset or live traffic. Results include per-item scores and aggregate metrics.
```shell
curl -X POST https://api.bastio.com/v1/evaluations/run \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "ds_customer_support",
    "evaluator_ids": [
      "response-quality",
      "pii-leakage",
      "prompt-injection"
    ],
    "compare_to": "run_046"
  }'
```

Security Evaluators
Five built-in evaluators powered by the threat detection engine. Prompt injection, jailbreak, PII, and more — no custom prompts needed.
Streaming Evaluation
Evaluate during response generation, not after. Detect issues in real-time and block unsafe responses before they reach users.
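The core idea of streaming evaluation is checking each chunk as it is generated and cutting the stream off at the first unsafe one, instead of scoring the finished response. A minimal local sketch of that loop (the simulated chunk stream and the toy SSN regex are assumptions for illustration; the real engine applies its 14-type PII scanner):

```shell
#!/bin/sh
# Simulated token stream: each line stands in for one generated chunk.
# Forward chunks until one trips a (toy) PII pattern, then block.
printf '%s\n' "Your order" "ships to" "SSN 123-45-6789" "tomorrow" |
while IFS= read -r chunk; do
  if printf '%s' "$chunk" | grep -Eq '[0-9]{3}-[0-9]{2}-[0-9]{4}'; then
    echo "BLOCKED: pii detected mid-stream"
    break
  fi
  echo "pass: $chunk"
done
```

The final chunk ("tomorrow") is never emitted: the stream is stopped at the offending chunk, which is what lets unsafe content be blocked before it reaches users.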
Regression Detection
Compare evaluation runs to catch quality drops. Per-evaluator score deltas with automatic regression and improvement flags.
Start evaluating your AI output
Evaluation included with every plan. Security evaluators, LLM judges, and regression testing from day one.