EndpointEvaluator › API Documentation
For a step-by-step explanation, see our Quick Start Guide. For API contract changes by version, see the Changelog.
POST /api/v1/evaluate
Evaluate LLM output against a reference text. Returns a verdict indicating whether the output is consistent with the reference. See Scoring Details for method comparisons and examples.
Requires: X-API-Key header
curl -X POST https://endpointevaluator.com/api/v1/evaluate \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"output_text": "The order ships tomorrow.",
"reference_text": "Your order will ship tomorrow morning.",
"scoring_method": "combined",
"client_id": "test-case-42",
"group_id": "ci-run-2026-05-16"
}'
Request Fields
| Field | Type | Required | Max Length | Description |
|---|---|---|---|---|
| output_text | string | Yes | 1,000 | The LLM output to evaluate |
| reference_text | string | Yes | 1,000 | The expected/reference response to compare against |
| scoring_method | string | No | n/a | lexical (default), semantic, inferential, or combined |
| cache | boolean | No | n/a | Set to true to check cache before scoring. Cache hits cost half credits. Not applicable for lexical. |
| client_id | string | No | 255 | Your identifier for this evaluation (e.g. test case name) |
| group_id | string | No | 255 | Group identifier for batching evaluations (e.g. CI run ID) |
For longer texts, decompose into individual responses and evaluate each separately.
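The request above can also be sent from Python. The following is a minimal sketch using only the standard library, not an official client: the helper names build_payload and evaluate are hypothetical, and it assumes the 1,000 and 255 limits are counted in characters, which the table does not state.

```python
import json
import urllib.request

API_URL = "https://endpointevaluator.com/api/v1/evaluate"
MAX_TEXT_LENGTH = 1000   # output_text / reference_text limit from the table
MAX_ID_LENGTH = 255      # client_id / group_id limit from the table
SCORING_METHODS = ("lexical", "semantic", "inferential", "combined")


def build_payload(output_text, reference_text, scoring_method="lexical",
                  cache=None, client_id=None, group_id=None):
    """Check the documented limits locally and return the request body."""
    if scoring_method not in SCORING_METHODS:
        raise ValueError(f"unknown scoring_method: {scoring_method!r}")
    for name, value in (("output_text", output_text),
                        ("reference_text", reference_text)):
        if not value:
            raise ValueError(f"{name} is required")
        # Assumption: the limit is counted in characters.
        if len(value) > MAX_TEXT_LENGTH:
            raise ValueError(f"{name} exceeds {MAX_TEXT_LENGTH} characters")
    payload = {
        "output_text": output_text,
        "reference_text": reference_text,
        "scoring_method": scoring_method,
    }
    if cache is not None:
        payload["cache"] = bool(cache)
    # Optional identifiers are only sent when provided.
    for name, value in (("client_id", client_id), ("group_id", group_id)):
        if value is None:
            continue
        if len(value) > MAX_ID_LENGTH:
            raise ValueError(f"{name} exceeds {MAX_ID_LENGTH} characters")
        payload[name] = value
    return payload


def evaluate(api_key, payload, timeout=30):
    """POST the payload; urllib raises HTTPError for non-2xx responses."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Validating lengths before sending avoids spending a rate-limited request on a call the API would reject anyway.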
Scoring Methods
| Method | Description | Credits | Credits (cache hit) |
|---|---|---|---|
| lexical | Literal text comparison | 20 | n/a |
| semantic | Meaning comparison | 40 | 20 |
| inferential | Logical consistency | 80 | 40 |
| combined | All three methods | 100 | 50 |
Cache hits cost half credits. Caching is not applicable for lexical scoring.
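As an illustration of the pricing rules above, here is a small sketch that estimates credit cost before a run. The function names are hypothetical; the numbers come from the Scoring Methods table.

```python
# Credit prices from the Scoring Methods table.
CREDITS = {"lexical": 20, "semantic": 40, "inferential": 80, "combined": 100}


def credit_cost(scoring_method, cache_hit=False):
    """Return the credits one evaluation costs under the documented rules."""
    base = CREDITS[scoring_method]
    # Cache hits cost half; caching does not apply to lexical scoring.
    if cache_hit and scoring_method != "lexical":
        return base // 2
    return base


def worst_case_suite_cost(methods):
    """Upper bound for a test suite, assuming no evaluation hits the cache."""
    return sum(credit_cost(method) for method in methods)
```

For example, a suite of three combined evaluations and two lexical ones costs at most 3 × 100 + 2 × 20 = 340 credits.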
{
"evaluation_id": "eval_abc123xyz789",
"client_id": "test-case-42",
"group_id": "ci-run-2026-05-16",
"scoring_method": "combined",
"verdict": "consistent",
"method_verdicts": {
"lexical": "consistent",
"semantic": "consistent",
"inferential": "consistent"
},
"requested_methods": ["lexical", "semantic", "inferential"],
"completed_methods": ["lexical", "semantic", "inferential"],
"missing_methods": [],
"flags": [],
"method_errors": {},
"credits_charged": 100,
"credits_refunded": 0,
"credits_consumed": 100,
"credits_remaining": 400
}
Single-method responses omit method_verdicts, requested_methods, completed_methods, missing_methods, flags, and method_errors; those fields describe cross-method outcomes that only apply to combined.
{
"evaluation_id": "eval_AbC123dEf456",
"client_id": "test-case-42",
"group_id": "ci-run-2026-05-16",
"scoring_method": "combined",
"verdict": "consistent",
"method_verdicts": {...},
"raw_score": 0.8421,
"raw_scores": {
"lexical": 0.91,
"semantic": 0.84,
"inferential": 0.86
},
"credits_charged": 100,
"credits_consumed": 100,
"credits_remaining": 799900
}
Accounts with bonus features unlocked additionally receive raw_score on every method and raw_scores on combined evaluations. See pricing to unlock.
Response Fields
| Field | Type | Description |
|---|---|---|
| evaluation_id | string | Unique identifier for this evaluation |
| scoring_method | string | Method used: lexical, semantic, inferential, or combined |
| verdict | string | Consistency result: consistent, ambiguous, or inconsistent |
| method_verdicts | object | Combined only. Per-method verdicts, one per completed sub-method. |
| requested_methods | array | Combined only. The sub-methods the combined call attempted, in presentation order ["lexical", "semantic", "inferential"]. |
| completed_methods | array | Combined only. Sub-methods that produced a score. May be a subset of requested_methods. |
| missing_methods | array | Combined only. Always an empty array on a 200 response. A combined call with any missing sub-method fails closed with a 503 partial_combined_unavailable; see Error Responses below. |
| flags | array | Combined only. On a 200 response, either empty or ["nli_contradicted"] (inferential score below the contradiction threshold). Partial combined calls return 503 partial_combined_unavailable rather than a 200 with an advisory flag; see Error Responses below. |
| method_errors | object | Combined only. Always an empty object on a 200 response. Per-sub-method error details appear on the 503 partial_combined_unavailable body and on the 422/503 all-methods-failed body; see Error Responses below. |
| credits_charged | integer | Credits charged net of any refund. Equals credits_consumed. |
| credits_refunded | integer | Always 0 on a 200 response. Non-zero values appear only on error responses (422 / 503) where a full refund has been issued; see Error Responses below. |
| credits_consumed | integer | Credits used for this evaluation (net after refund). Kept alongside credits_charged for backwards compatibility; the two are always equal. |
| credits_remaining | integer | Credits remaining on your account after this evaluation. |
| cache_hit | boolean | Present only when cache: true was requested. true if the result was served from cache (at half credit cost). |
| client_id | string | Echoed back from the request. Omitted if not provided. |
| group_id | string | Echoed back from the request. Omitted if not provided. |
| raw_score | number | Bonus feature. Present only for accounts with bonus features unlocked. Numeric score in [0.0, 1.0] for the method (or min across completed methods for combined). |
| raw_scores | object | Combined only, bonus feature. Per-sub-method numeric scores. Only entries for completed_methods are present. |
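One way a CI job might consume these fields is sketched below. The pass/review/fail policy is purely an illustrative assumption, not something the API prescribes; in particular, treating nli_contradicted as a reason for manual review is a choice made for this example.

```python
def ci_outcome(response):
    """Map a 200 evaluate response to 'pass', 'review', or 'fail'.

    Hypothetical policy: inconsistent fails the test, ambiguous verdicts
    and flagged results go to manual review, everything else passes.
    """
    verdict = response["verdict"]
    # flags is only present on combined responses.
    flags = response.get("flags", [])
    if verdict == "inconsistent":
        return "fail"
    if verdict == "ambiguous" or "nli_contradicted" in flags:
        return "review"
    if verdict == "consistent":
        return "pass"
    raise ValueError(f"unexpected verdict: {verdict!r}")
```

Because single-method responses omit the combined-only fields, the sketch reads them with a default instead of indexing directly.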
POST /api/v1/evaluate/batch
Bonus Feature
Evaluate multiple outputs in a single API request. Submit up to 10 evaluations at once, ideal for CI/CD pipelines running test suites.
Requires Large tier purchase to unlock.
Per-item client_id and group_id are echoed back on each item in the response.
Requires: X-API-Key header
curl -X POST https://endpointevaluator.com/api/v1/evaluate/batch \
-H "X-API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"evaluations": [
{
"output_text": "Paris is the capital of France",
"reference_text": "The capital of France is Paris",
"scoring_method": "inferential"
},
{
"output_text": "Water boils at 100C",
"reference_text": "Water boils at 212F",
"scoring_method": "inferential"
}
]
}'
{
"evaluations": [
{
"evaluation_id": "eval_abc123...",
"scoring_method": "inferential",
"verdict": "consistent",
"credits_consumed": 80
},
{
"evaluation_id": "eval_def456...",
"scoring_method": "inferential",
"verdict": "consistent",
"credits_consumed": 80
}
],
"count": 2,
"total_credits_consumed": 160,
"credits_remaining": 340
}
{
"evaluations": [
{
"index": 0,
"evaluation_id": "eval_abc123...",
"scoring_method": "inferential",
"verdict": "consistent",
"credits_consumed": 80
},
{
"index": 1,
"error_type": "evaluation_error",
"error": "Evaluation failed",
"method_errors": {
"inferential": { "reason": "pool_timeout", "transient": true }
},
"transient": true
}
],
"count": 2,
"errors": 1,
"total_credits_consumed": 80,
"credits_refunded": 80,
"credits_remaining": 340
}
Batch returns 207 Multi-Status when any item in the batch failed. The top-level envelope carries count (total items), errors (count of failed items), total_credits_consumed (net after refunds), credits_refunded (sum of refunds across failed items), and credits_remaining. Each entry in evaluations carries an index matching its position in the request.
Failed items carry one of two error_type values: "evaluation_error" (every requested method failed on that item) or "partial_combined_unavailable" (the item used scoring_method: "combined" and at least one sub-method was missing). Both shapes include a human-readable error string, a method_errors object, and a transient flag. The partial_combined_unavailable envelope additionally carries requested_methods, completed_methods, and missing_methods (same shape as the single-request 503 body).
transient: true means at least one underlying sub-method failed with a retry-worthy condition (rate limit, upstream 5xx, timeout); transient: false means the failure is permanent for that input (unscorable). Partial-combined items are always transient: true: an upstream sub-method's unavailability is retry-worthy regardless of individual sub-method classifications.
A request whose validation fails before scoring begins returns 422 validation_error for the whole request, not 207; 207 is only for batches where at least one item validated and ran.
Each evaluation in the batch follows the same parameters as the single evaluate endpoint. Max 10 evaluations per request.
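A sketch of how a client might split a larger test suite into batches of at most 10 and sort the items of a batch response into succeeded, retryable, and permanently failed. The helper names are hypothetical; the field names follow the example bodies above.

```python
MAX_BATCH_SIZE = 10  # documented per-request maximum


def chunk_evaluations(evaluations, size=MAX_BATCH_SIZE):
    """Yield consecutive slices of at most `size` evaluations."""
    for start in range(0, len(evaluations), size):
        yield evaluations[start:start + size]


def split_batch_results(body):
    """Group item indexes of a batch response body by outcome."""
    succeeded, retryable, permanent = [], [], []
    for position, item in enumerate(body["evaluations"]):
        # Fall back to list position if an item carries no index.
        index = item.get("index", position)
        if "error_type" not in item:
            succeeded.append(index)
        elif item.get("transient"):
            retryable.append(index)
        else:
            permanent.append(index)
    return {"succeeded": succeeded, "retryable": retryable,
            "permanent": permanent}
```

Only the retryable indexes need to be resubmitted; items with transient: false will fail again for the same input.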
GET /api/v1/balance
Check your current credit balance. Shows free credits, paid credits, and total available. This call is free.
Requires: X-API-Key header
curl -X GET https://endpointevaluator.com/api/v1/balance \
-H "X-API-Key: YOUR_API_KEY"
{
"free": 450,
"paid": 0,
"total": 450,
"tier": "free",
"bonus_unlocked": false
}
| Field | Type | Description |
|---|---|---|
| free | integer | Free credits remaining |
| paid | integer | Paid credits remaining |
| total | integer | Total credits (free + paid) |
| tier | string | Account tier: free or paid |
| bonus_unlocked | boolean | Whether bonus features are active on this account |
GET /api/v1/evaluations/:evaluation_id
Look up the details of a prior evaluation. Useful for auditing or debugging your evaluation history. This call costs one credit.
Requires: X-API-Key header
curl -X GET https://endpointevaluator.com/api/v1/evaluations/eval_abc123xyz789 \
-H "X-API-Key: YOUR_API_KEY"
{
"evaluation_id": "eval_abc123xyz789",
"scoring_method": "combined",
"evaluated_at": "2026-03-24T18:30:00Z",
"client_id": "test-case-42",
"group_id": "ci-run-2026-05-16",
"credits_consumed": 100,
"result": {
"verdict": "consistent",
"method_verdicts": {
"lexical": "consistent",
"semantic": "consistent",
"inferential": "consistent"
}
}
}
The credits_consumed field shows credits used for the original evaluation, not this lookup.
Input text is not stored by default. We do not retain output_text or reference_text on the evaluation record. Lookups return verdicts and metadata only; the original inputs are gone.
If you need the inputs echoed back for debugging (for example, to correlate a surprising verdict with the exact text that produced it), enable Debug Mode from your dashboard. With Debug Mode active, evaluations written during the next 24 hours include output_text, reference_text, and a text_retention_until timestamp alongside the other fields. After that timestamp the cleanup worker scrubs the text back to null; the evaluation record itself is preserved. Outside the Debug Mode window, these fields are omitted from the response entirely.
Retention depends on tier. Free, Small, and Medium tiers retain evaluation rows for 7 days. Large tier (with bonus features unlocked) retains them for 30 days. Debug Mode is an additional opt-in window for the input/reference text itself, controlled per account and lasting 24 hours by default. See pricing for tier details.
Rate Limits
API requests are rate-limited per account based on your tier.
| Tier | Limit |
|---|---|
| Free | 1 request per minute |
| Paid accounts | 10 requests per minute |
Any paid balance puts the account on the paid tier, regardless of which credit pack (Small, Medium, or Large) was purchased. Rate limit headers are included in API responses: X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.
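Clients that issue many requests can stay under these limits by spacing calls out locally. The sketch below is one hedged approach: the RequestThrottle class is hypothetical, and it deliberately does not read the X-RateLimit-* headers because their value formats are not specified here. Even spacing is stricter than the limit may require.

```python
import time


class RequestThrottle:
    """Space calls evenly so no more than N start per minute."""

    def __init__(self, requests_per_minute, clock=time.monotonic,
                 sleep=time.sleep):
        self.interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed; return seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            waited = self._next_allowed - now
            self._sleep(waited)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return waited
```

Usage: create RequestThrottle(1) for a free account or RequestThrottle(10) for a paid one, and call wait() immediately before each API request.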
Error Responses
Every error response carries a top-level error_type discriminator. Clients should branch on error_type, not on HTTP status alone, for forward compatibility. The closed catalog at v1.0 has 16 values listed below; new values will arrive only in MINOR or MAJOR releases per the changelog.
For server-side error logging guidance (severity by error_type), see the Production Checklist.
Idempotency-related errors are emitted only when the idempotency feature is enabled per-account; off by default at v1.0.
| Status | Description | Example Response |
|---|---|---|
| 401 | Missing or invalid API key | {"error_type": "auth_required", "error": "API key required. Include X-API-Key: <your-api-key> header"} Invalid keys return error_type: "auth_invalid". |
| 402 | Insufficient credits | {"error_type": "insufficient_credits", "error": "Insufficient credits", "credits_needed": 20, "balance": {...}} |
| 403 | Feature requires bonus | {"error_type": "forbidden", "error": "Batch evaluation requires bonus features"} |
| 404 | Resource not found | {"error_type": "not_found", "error": "Evaluation not found"} |
| 409 | Idempotency conflict | In-flight or expired key replay. Honor the Retry-After header value (default 5 seconds) before retrying. {"error_type": "idempotency_key_conflict", "error": "Idempotency key conflict"} |
| 422 | Validation error: malformed or missing input fields | {"error_type": "validation_error", "errors": {"output_text": ["is required"]}} |
| 422 | Idempotency key malformed | Idempotency-Key header value failed validation (length or character constraints). {"error_type": "idempotency_key_invalid", "error": "Invalid idempotency key"} |
| 422 | All requested scoring methods failed with permanent errors (input unscorable). Credits fully refunded. | Same envelope as 503 below; error_type is "all_methods_failed". |
| 429 | Rate limit exceeded (too many requests from this account) | {"error_type": "rate_limited", "error": "Rate limit exceeded", "retry_after": 60} |
| 503 | All requested scoring methods failed with at least one transient error. Credits fully refunded. Wait for the interval given in the Retry-After header before retrying. | See All-methods-failed body below. error_type is "all_methods_failed_transient". |
| 503 | scoring_method: "combined" with at least one sub-method missing. Combined fails closed rather than returning a partial result labeled as combined. Credits fully refunded. Wait for the interval given in the Retry-After header before retrying. | See Partial-combined body below. error_type is "partial_combined_unavailable". |
| 503 | Internal scoring-formatter contract violation. Always carries Retry-After: 5. Retry once; persistent recurrence indicates a bug, so please report it. | {"error_type": "formatter_contract_violation", "error": "Internal scoring formatter error; please retry"} |
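Branching on error_type rather than status might look like the sketch below. The three action names and the grouping of error types into them are this example's own assumptions about reasonable client behavior; only the error_type strings themselves come from the table. Unknown values fall back to the HTTP status so that the client keeps working when new types are added.

```python
RETRY, FIX_REQUEST, STOP = "retry", "fix_request", "stop"

# Hypothetical client policy keyed on the documented error_type values.
ACTION_BY_ERROR_TYPE = {
    "auth_required": FIX_REQUEST,
    "auth_invalid": FIX_REQUEST,
    "validation_error": FIX_REQUEST,
    "idempotency_key_invalid": FIX_REQUEST,
    "insufficient_credits": STOP,
    "forbidden": STOP,
    "not_found": STOP,
    "all_methods_failed": STOP,
    "idempotency_key_conflict": RETRY,
    "rate_limited": RETRY,
    "all_methods_failed_transient": RETRY,
    "partial_combined_unavailable": RETRY,
    "formatter_contract_violation": RETRY,
}


def action_for(error_type, status):
    """Pick an action from error_type, falling back to the HTTP status."""
    if error_type in ACTION_BY_ERROR_TYPE:
        return ACTION_BY_ERROR_TYPE[error_type]
    return RETRY if status == 429 or status >= 500 else STOP


def retry_delay(headers, body, default=5.0):
    """Seconds to wait: Retry-After header, else body retry_after, else default.

    Assumes header names are already normalized to 'Retry-After' and that
    the value is a number of seconds (HTTP-date values use the default).
    """
    raw = headers.get("Retry-After", body.get("retry_after"))
    try:
        return float(raw)
    except (TypeError, ValueError):
        return default
```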
All-methods-failed body (422 and 503)
Both 422 and 503 responses share the same body shape; only the error_type field and the HTTP status differ. 503 additionally sets a Retry-After header. The body is the same whether the request was a single-method call or a combined call.
{
"error_type": "all_methods_failed_transient",
"error": "Evaluation failed",
"evaluation_id": "eval_abc123xyz789",
"scoring_method": "combined",
"requested_methods": ["lexical", "semantic", "inferential"],
"completed_methods": [],
"missing_methods": ["lexical", "semantic", "inferential"],
"method_errors": {
"lexical": { "reason": "no_key_tokens", "transient": false },
"semantic": { "reason": "pool_timeout", "transient": true },
"inferential": { "reason": "pool_timeout", "transient": true }
},
"credits_charged": 0,
"credits_refunded": 100
}
For 422 the same body is returned with error_type: "all_methods_failed" and every per-method transient flag set to false. The transient flag on each entry of method_errors tells the client whether that specific sub-method's failure is retry-worthy in isolation; the top-level error_type tells the client whether the whole call is.
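The two levels of retry information can be read side by side, as in this sketch. The function name is hypothetical; the field names match the body shown above.

```python
def summarize_method_errors(body):
    """Separate the call-level retry decision from per-method detail."""
    transient, permanent = [], []
    for method, detail in body.get("method_errors", {}).items():
        (transient if detail.get("transient") else permanent).append(method)
    return {
        # The top-level error_type decides whether the whole call is retried.
        "retry_call": body.get("error_type") == "all_methods_failed_transient",
        # Per-method flags are kept for logging and diagnosis.
        "transient_methods": sorted(transient),
        "permanent_methods": sorted(permanent),
    }
```

Applied to the example body above, the call is retry-worthy while lexical is reported as a permanent failure for that input.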
Partial-combined body (503)
Returned when scoring_method: "combined" and at least one sub-method failed to produce a score. Combined fails closed: rather than return a 200 with a partial combined result (which CI pipelines cannot reliably distinguish from a real combined score), the API returns 503 with a full refund. Retry after the Retry-After header.
{
"error_type": "partial_combined_unavailable",
"error": "Partial combined evaluation unavailable",
"evaluation_id": "eval_abc123xyz789",
"scoring_method": "combined",
"requested_methods": ["lexical", "semantic", "inferential"],
"completed_methods": ["lexical"],
"missing_methods": ["semantic", "inferential"],
"method_errors": {
"semantic": { "reason": "pool_timeout", "transient": true },
"inferential": { "reason": "nli_unparseable", "transient": false }
},
"credits_charged": 0,
"credits_refunded": 100
}
Per-sub-method transient flags are preserved for clients that want to log the underlying failure shape; the top-level call is always retry-worthy regardless of sub-method classification, so no HTTP-status disambiguation like the 422 / 503 split on all_methods_failed applies here.
FAQ
Common questions about combined scoring and its fails-closed semantics. See Response Fields and Error Responses above for the underlying field and status-code definitions.
Why did I get 503 when I asked for combined?
Combined score is an aggregation across lexical, semantic, and inferential methods. If any sub-method is unavailable or fails to produce a score, combined cannot be computed with integrity, so the request fails closed with HTTP 503 partial_combined_unavailable.
Why is combined all-or-nothing? Can't I get partial results?
Combined is a deterministic aggregation contract for CI-grade use cases. Partial aggregation would be a different product with different semantics. Keeping the boundary sharp prevents silently computing combined scores from incomplete inputs.
How do I get partial results if I want them?
Request individual methods directly: scoring_method: "lexical", scoring_method: "semantic", or scoring_method: "inferential". If semantic is temporarily down and you want lexical anyway, call scoring_method: "lexical". Individual methods do not fail closed on each other.
Am I charged for a 503 partial_combined_unavailable?
No. The request is fully refunded: the response carries credits_charged: 0, and your credit balance is untouched.
How should my retry logic handle partial-combined 503s?
Two options. (a) Retry with exponential backoff. partial_combined_unavailable errors are usually transient. (b) Fall back to requesting individual methods, which do not require cross-method availability. Either is acceptable; choose based on your tolerance for latency versus scope.
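The two options can also be combined: retry a few times, then fall back. The sketch below assumes a caller-supplied call(payload) function that returns an (http_status, body) tuple; that function, the attempt count, and the delays are all hypothetical choices, not part of the API.

```python
import time

INDIVIDUAL_METHODS = ("lexical", "semantic", "inferential")


def evaluate_with_fallback(call, payload, max_attempts=3, base_delay=1.0,
                           sleep=time.sleep):
    """Try combined with exponential backoff, then individual methods.

    `call` is a caller-supplied function: payload -> (http_status, body).
    """
    combined = dict(payload, scoring_method="combined")
    for attempt in range(max_attempts):
        status, body = call(combined)
        if body.get("error_type") != "partial_combined_unavailable":
            # Success or a different error: hand it back unchanged.
            return {"mode": "combined", "status": status, "body": body}
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    # Option (b): individual methods do not fail closed on each other.
    results = {}
    for method in INDIVIDUAL_METHODS:
        status, body = call(dict(payload, scoring_method=method))
        results[method] = {"status": status, "body": body}
    return {"mode": "individual", "results": results}
```

A production client would likely also take the Retry-After header into account when choosing the delay. Note that the fallback returns three separate verdicts, not a combined verdict, so downstream code should treat the two modes differently.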