EndpointEvaluator — Regression Testing for LLM Outputs

1.0.0 — 2026-05-30

EndpointEvaluator's first publicly supported API release. This entry documents the contract as shipped at public launch.

Endpoints

POST /api/v1/login — magic-link delivery. Unauthenticated. Returns uniform 200 regardless of whether email exists, to avoid account-existence leakage.
POST /api/v1/evaluate — single evaluation. Authenticated. Four scoring methods: lexical, semantic, inferential, combined. Returns verdict + bonus-gated raw score.
POST /api/v1/evaluate/batch — up to 10 evaluations per request. Authenticated. Bonus features required (Large credit pack); returns 403 forbidden otherwise. Supports mixed-outcome 207 Multi-Status responses.
GET /api/v1/evaluations/:id — fetch evaluation by id. Authenticated. 1 credit per lookup. Debug Mode rows additionally return output_text/reference_text/text_retention_until.
GET /api/v1/balance — fetch current account balance, tier, and bonus status. Authenticated.
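
To illustrate the 10-evaluation limit on the batch endpoint, a client might split a longer work list into request-sized groups before posting. The following is a minimal Python sketch. The helper name chunk_evaluations and the item shape are hypothetical, since this changelog does not specify the batch request body schema.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 10  # per-request limit stated for POST /api/v1/evaluate/batch


def chunk_evaluations(items: Sequence[dict], size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive groups of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical work list: 23 evaluations become batches of 10, 10, and 3.
work = [{"output_text": f"out {i}", "reference_text": f"ref {i}"} for i in range(23)]
batch_sizes = [len(batch) for batch in chunk_evaluations(work)]
```

Each group would then be sent as one batch request. Because the endpoint can return 207 Multi-Status, each item's outcome should be checked individually rather than inferred from the overall status.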

Authentication

Header: X-API-Key: <your-api-key>. Obtain via web signup. All authenticated endpoints require it.
POST /api/v1/login is the only endpoint that does not require X-API-Key.
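
As a sketch of how the header rule might look in client code, the Python below builds (but does not send) one authenticated and one unauthenticated request using the standard library. The host in BASE_URL, the build_request helper, and the "email" body field for login are assumptions made for illustration. Only the X-API-Key header name and the endpoint paths come from this changelog.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.example.com"  # assumption: placeholder host


def build_request(method: str, path: str, api_key: Optional[str] = None,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build a JSON request, attaching X-API-Key only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        headers["X-API-Key"] = api_key
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


# Authenticated call: the key travels in the X-API-Key header.
balance_req = build_request("GET", "/api/v1/balance", api_key="YOUR_API_KEY")

# Login is the one call made without a key; the "email" field name is assumed.
login_req = build_request("POST", "/api/v1/login", body={"email": "user@example.com"})

# urllib stores header names capitalized ("X-api-key"); HTTP header names
# are case-insensitive, so this does not change what the server receives.
```

Passing either object to urllib.request.urlopen would perform the call.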

Errors

Every 4xx/5xx response includes error_type (string) as the discriminator. Branch on error_type, not HTTP status.

Closed catalog at 1.0.0 (16 values):

auth_required, auth_invalid, rate_limited, idempotency_key_invalid, idempotency_key_conflict, validation_error, insufficient_credits, forbidden, not_found, evaluation_error, partial_combined_unavailable, all_methods_failed, all_methods_failed_transient, batch_all_methods_failed, batch_all_methods_failed_transient, formatter_contract_violation.

The _transient variants are structural subtypes that signal a single retry is appropriate. See the error section of the /docs page.
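
A client that branches on error_type rather than HTTP status could be sketched as follows in Python. The classify function and its three return labels are hypothetical. The catalog values are copied from the list above, and treating an unrecognized value as non-retryable is an assumption, not documented behavior.

```python
KNOWN_ERROR_TYPES = frozenset({
    "auth_required", "auth_invalid", "rate_limited",
    "idempotency_key_invalid", "idempotency_key_conflict",
    "validation_error", "insufficient_credits", "forbidden", "not_found",
    "evaluation_error", "partial_combined_unavailable",
    "all_methods_failed", "all_methods_failed_transient",
    "batch_all_methods_failed", "batch_all_methods_failed_transient",
    "formatter_contract_violation",
})


def classify(payload: dict) -> str:
    """Map an error body to a client action (labels are hypothetical)."""
    error_type = payload.get("error_type")
    if error_type not in KNOWN_ERROR_TYPES:
        return "unknown"      # assumption: do not retry values outside the catalog
    if error_type.endswith("_transient"):
        return "retry_once"   # the changelog says one retry is appropriate
    return "fail"


action = classify({"error_type": "all_methods_failed_transient"})
```

Matching on the _transient suffix keeps the retry rule in one place instead of listing each transient variant by name.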

Rate limiting

Free tier: 1 request / 60s.
Paid tier: 10 requests / 60s.
Window: sliding window with a fixed 60-second length, keyed on account.
Headers on every response: x-ratelimit-limit, x-ratelimit-remaining, x-ratelimit-reset.
429 response includes retry-after header (seconds).
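
One way a client might honor the retry-after header is sketched below in Python. The seconds_to_wait helper is hypothetical. Falling back to the 60-second window length when the header is missing or unparsable is an assumption. The x-ratelimit-reset header is left out because its units are not stated here.

```python
FALLBACK_WAIT_SECONDS = 60.0  # assumption: fall back to the window length


def seconds_to_wait(status: int, headers: dict) -> float:
    """Return how long to pause before retrying; 0.0 unless the status is 429."""
    if status != 429:
        return 0.0
    # Header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        return max(0.0, float(lowered.get("retry-after")))
    except (TypeError, ValueError):
        return FALLBACK_WAIT_SECONDS


wait = seconds_to_wait(429, {"Retry-After": "12"})
```

A caller would typically pass the result to time.sleep before reissuing the request.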

Idempotency

Optional. Behind feature flag at 1.0.0 — see /docs idempotency section for current status.
Header: Idempotency-Key: <client-chosen-string>.
Replays return original response with idempotent-replayed: true header.
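
The key and replay header might be used along these lines. This is a Python sketch with hypothetical helper names. Using a UUID as the key is an assumption, since the accepted key format is not stated in this changelog, and the feature flag may leave the header inactive for a given account.

```python
import uuid


def new_idempotency_key() -> str:
    """Generate a client-chosen key (a UUID is an assumed, not required, format)."""
    return str(uuid.uuid4())


def is_replay(response_headers: dict) -> bool:
    """True when the response carries idempotent-replayed: true."""
    lowered = {name.lower(): value for name, value in response_headers.items()}
    return str(lowered.get("idempotent-replayed", "")).strip().lower() == "true"


# The same key is reused when retrying the same logical request.
request_headers = {"Idempotency-Key": new_idempotency_key()}
replayed = is_replay({"Idempotent-Replayed": "true"})
```

Reusing one key across retries of the same logical request is what allows the server to return the original response instead of performing the evaluation again.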

Pricing

Free: 500 credits/day, 7-day result retention.
Small ($9 / month): 50K credits, 7-day retention.
Medium ($29 / month): 200K credits, 7-day retention.
Large ($99 / month): 800K credits, 30-day retention, batch endpoint enabled, raw scores in responses.

Credit cost per scoring method: lexical 20, semantic 40, inferential 80, combined 100. GET /api/v1/evaluations/:id lookup: 1 credit.
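
For budgeting, the per-method costs can be turned into a quick estimate. The Python sketch below uses a hypothetical estimate_credits helper. It assumes each evaluation is billed at the listed rate whether sent singly or in a batch, which this changelog does not confirm.

```python
CREDIT_COST = {"lexical": 20, "semantic": 40, "inferential": 80, "combined": 100}
LOOKUP_COST = 1  # GET /api/v1/evaluations/:id


def estimate_credits(evaluations: dict, lookups: int = 0) -> int:
    """Sum credits for a mapping of scoring method -> number of evaluations."""
    total = lookups * LOOKUP_COST
    for method, count in evaluations.items():
        total += CREDIT_COST[method] * count  # KeyError on an unknown method
    return total


# 10 combined (1000) + 5 lexical (100) + 15 lookups (15) = 1115 credits
planned = estimate_credits({"combined": 10, "lexical": 5}, lookups=15)

# The Free tier's 500 credits/day covers at most 5 combined evaluations.
free_combined_per_day = 500 // CREDIT_COST["combined"]
```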

Debug Mode

Off by default. When enabled per-account, output_text and reference_text are persisted on evaluation_runs rows for the configured retention window (24h default at 1.0.0) and surfaced via GET /api/v1/evaluations/:id. Off-mode rows do not persist text.

Conventions

All requests/responses are JSON. Content-Type: application/json.
Maximum 1,000 characters per output_text / reference_text field. For longer texts, decompose and evaluate per-segment.
Optional passthrough fields client_id and group_id (≤255 chars each) are echoed back on responses and stored on evaluation_runs. Not used for routing, dedup, or rate-limiting.
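
The length limits above can be checked client-side before a request is sent. The Python sketch below uses hypothetical helpers. It splits on raw character offsets, a naive strategy chosen only for illustration, and it counts Unicode code points, which may not match how the server counts characters.

```python
from typing import List

MAX_TEXT_CHARS = 1000        # output_text / reference_text
MAX_PASSTHROUGH_CHARS = 255  # client_id / group_id


def split_text(text: str, limit: int = MAX_TEXT_CHARS) -> List[str]:
    """Cut text into consecutive segments of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def validate_payload(payload: dict) -> List[str]:
    """Return a list of limit violations; an empty list means none were found."""
    problems = []
    for field in ("output_text", "reference_text"):
        if len(payload.get(field, "")) > MAX_TEXT_CHARS:
            problems.append(f"{field} exceeds {MAX_TEXT_CHARS} characters")
    for field in ("client_id", "group_id"):
        value = payload.get(field)
        if value is not None and len(value) > MAX_PASSTHROUGH_CHARS:
            problems.append(f"{field} exceeds {MAX_PASSTHROUGH_CHARS} characters")
    return problems


segment_lengths = [len(s) for s in split_text("a" * 2500)]
```

In practice, splitting on sentence or paragraph boundaries would likely give more meaningful per-segment scores than fixed offsets.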

Migration notes

None. This is the first publicly supported version.
