EndpointEvaluator › API Documentation

For a step-by-step explanation, see our Quick Start Guide. For API contract changes by version, see the Changelog.

POST /api/v1/evaluate

Evaluate LLM output against a reference text. Returns a verdict indicating whether the output is consistent with the reference. See Scoring Details for method comparisons and examples.

Requires: X-API-Key header

Request

curl -X POST https://endpointevaluator.com/api/v1/evaluate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "output_text": "The order ships tomorrow.",
    "reference_text": "Your order will ship tomorrow morning.",
    "scoring_method": "combined",
    "client_id": "test-case-42",
    "group_id": "ci-run-2026-05-16"
  }'

Request Fields

Field	Type	Required	Max Length	Description
`output_text`	string	Yes	1,000	The LLM output to evaluate
`reference_text`	string	Yes	1,000	The expected/reference response to compare against
`scoring_method`	string	No	—	`lexical` (default), `semantic`, `inferential`, or `combined`
`cache`	boolean	No	—	Set to `true` to check cache before scoring. Cache hits cost half credits. Not applicable for lexical.
`client_id`	string	No	255	Your identifier for this evaluation (e.g. test case name)
`group_id`	string	No	255	Group identifier for batching evaluations (e.g. CI run ID)

For longer texts, decompose into individual responses and evaluate each separately.
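A client can check the documented limits before spending a request. The following is a minimal sketch, not part of the API: `build_payload` is a hypothetical helper name, and it assumes the Max Length column counts characters, which the table does not state.

```python
MAX_TEXT_LEN = 1000  # documented limit for output_text / reference_text
MAX_ID_LEN = 255     # documented limit for client_id / group_id
METHODS = {"lexical", "semantic", "inferential", "combined"}


def build_payload(output_text, reference_text, scoring_method="lexical",
                  cache=None, client_id=None, group_id=None):
    """Return a request body dict; raise ValueError if a documented limit is broken.

    Assumption: lengths are measured in characters (len() on a str).
    """
    for name, value in (("output_text", output_text),
                        ("reference_text", reference_text)):
        if not value:
            raise ValueError(f"{name} is required")
        if len(value) > MAX_TEXT_LEN:
            raise ValueError(f"{name} exceeds {MAX_TEXT_LEN} characters")
    if scoring_method not in METHODS:
        raise ValueError(f"unknown scoring_method: {scoring_method}")

    payload = {
        "output_text": output_text,
        "reference_text": reference_text,
        "scoring_method": scoring_method,
    }
    # Optional fields are only sent when the caller supplied them.
    for name, value in (("client_id", client_id), ("group_id", group_id)):
        if value is not None:
            if len(value) > MAX_ID_LEN:
                raise ValueError(f"{name} exceeds {MAX_ID_LEN} characters")
            payload[name] = value
    if cache is not None:
        payload["cache"] = bool(cache)
    return payload
```

The returned dict can be serialized as the JSON body of POST /api/v1/evaluate. How to split an over-long text is left to the caller, since the right unit of decomposition depends on the content.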

Scoring Methods

Method	Description	Credits	Cached
`lexical`	Literal text comparison	20	—
`semantic`	Meaning comparison	40	20
`inferential`	Logical consistency	80	40
`combined`	All three methods	100	50

Cache hits cost half credits. Caching is not applicable for lexical scoring.
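The pricing table can be turned into a small budgeting helper for a test suite. This is a sketch: `expected_credits` is a hypothetical name, and it assumes that lexical evaluations are always charged the full 20 credits because caching does not apply to them.

```python
# Credit costs copied from the Scoring Methods table.
BASE_CREDITS = {"lexical": 20, "semantic": 40, "inferential": 80, "combined": 100}


def expected_credits(method, cache_hit=False):
    """Return the documented credit cost for one evaluation."""
    base = BASE_CREDITS[method]
    # Cache hits cost half credits; caching is not applicable for lexical.
    if cache_hit and method != "lexical":
        return base // 2
    return base


def suite_budget(methods):
    """Worst-case (no cache hits) credit cost for a list of scoring methods."""
    return sum(expected_credits(m) for m in methods)
```

For example, a suite of two inferential checks and one combined check would need at most `suite_budget(["inferential", "inferential", "combined"])` credits.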

Response

{
  "evaluation_id": "eval_abc123xyz789",
  "client_id": "test-case-42",
  "group_id": "ci-run-2026-05-16",
  "scoring_method": "combined",
  "verdict": "consistent",
  "method_verdicts": {
    "lexical": "consistent",
    "semantic": "consistent",
    "inferential": "consistent"
  },
  "requested_methods": ["lexical", "semantic", "inferential"],
  "completed_methods": ["lexical", "semantic", "inferential"],
  "missing_methods": [],
  "flags": [],
  "method_errors": {},
  "credits_charged": 100,
  "credits_refunded": 0,
  "credits_consumed": 100,
  "credits_remaining": 400
}

Single-method responses omit method_verdicts, requested_methods, completed_methods, missing_methods, flags, and method_errors; those fields describe cross-method outcomes that only apply to combined.

Bonus account response (additional fields)

{
  "evaluation_id": "eval_AbC123dEf456",
  "client_id": "test-case-42",
  "group_id": "ci-run-2026-05-16",
  "scoring_method": "combined",
  "verdict": "consistent",
  "method_verdicts": {...},
  "raw_score": 0.8421,
  "raw_scores": {
    "lexical": 0.91,
    "semantic": 0.84,
    "inferential": 0.86
  },
  "credits_charged": 100,
  "credits_consumed": 100,
  "credits_remaining": 799900
}

Accounts with bonus features unlocked additionally receive raw_score on every method and raw_scores on combined evaluations. See pricing to unlock.

Response Fields

Field	Type	Description
`evaluation_id`	string	Unique identifier for this evaluation
`scoring_method`	string	Method used: `lexical`, `semantic`, `inferential`, or `combined`
`verdict`	string	Consistency result: `consistent`, `ambiguous`, or `inconsistent`
`method_verdicts`	object	Combined only. Per-method verdicts, one per completed sub-method.
`requested_methods`	array	Combined only. The sub-methods the combined call attempted, in presentation order `["lexical", "semantic", "inferential"]`.
`completed_methods`	array	Combined only. Sub-methods that produced a score. May be a subset of `requested_methods`.
`missing_methods`	array	Combined only. Always an empty array on a 200 response. A combined call with any missing sub-method fails closed with a 503 `partial_combined_unavailable`; see Error Responses below.
`flags`	array	Combined only. On a 200 response, either empty or `["nli_contradicted"]` (inferential score below the contradiction threshold). Partial combined calls return 503 `partial_combined_unavailable` rather than a 200 with an advisory flag; see Error Responses below.
`method_errors`	object	Combined only. Always an empty object on a 200 response. Per-sub-method error details appear on the 503 `partial_combined_unavailable` body and on the 422/503 all-methods-failed body; see Error Responses below.
`credits_charged`	integer	Credits charged net of any refund. Equals `credits_consumed`.
`credits_refunded`	integer	Always `0` on a 200 response. Non-zero values appear only on error responses (422 / 503) where a full refund has been issued; see Error Responses below.
`credits_consumed`	integer	Credits used for this evaluation (net after refund). Kept alongside `credits_charged` for backwards compatibility; the two are always equal.
`credits_remaining`	integer	Credits remaining on your account after this evaluation.
`cache_hit`	boolean	Present only when `cache: true` was requested. `true` if the result was served from cache (at half credit cost).
`client_id`	string	Echoed back from the request. Omitted if not provided.
`group_id`	string	Echoed back from the request. Omitted if not provided.
`raw_score`	number	Bonus feature. Present only for accounts with bonus features unlocked. Numeric score in [0.0, 1.0] for the method (or min across completed methods for combined).
`raw_scores`	object	Combined only, bonus feature. Per-sub-method numeric scores. Only entries for `completed_methods` are present.

POST /api/v1/evaluate/batch

Bonus Feature

Evaluate multiple outputs in a single API request. Submit up to 10 evaluations at once, ideal for CI/CD pipelines running test suites. Requires Large tier purchase to unlock. Per-item client_id and group_id are echoed back on each item in the response.

Requires: X-API-Key header

Request

curl -X POST https://endpointevaluator.com/api/v1/evaluate/batch \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluations": [
      {
        "output_text": "Paris is the capital of France",
        "reference_text": "The capital of France is Paris",
        "scoring_method": "inferential"
      },
      {
        "output_text": "Water boils at 100C",
        "reference_text": "Water boils at 212F",
        "scoring_method": "inferential"
      }
    ]
  }'

Response — 200 OK (all items succeeded)

{
  "evaluations": [
    {
      "evaluation_id": "eval_abc123...",
      "scoring_method": "inferential",
      "verdict": "consistent",
      "credits_consumed": 80
    },
    {
      "evaluation_id": "eval_def456...",
      "scoring_method": "inferential",
      "verdict": "consistent",
      "credits_consumed": 80
    }
  ],
  "count": 2,
  "total_credits_consumed": 160,
  "credits_remaining": 340
}

Response — 207 Multi-Status (some items failed)

{
  "evaluations": [
    {
      "index": 0,
      "evaluation_id": "eval_abc123...",
      "scoring_method": "inferential",
      "verdict": "consistent",
      "credits_consumed": 80
    },
    {
      "index": 1,
      "error_type": "evaluation_error",
      "error": "Evaluation failed",
      "method_errors": {
        "inferential": { "reason": "pool_timeout", "transient": true }
      },
      "transient": true
    }
  ],
  "count": 2,
  "errors": 1,
  "total_credits_consumed": 80,
  "credits_refunded": 80,
  "credits_remaining": 340
}

Batch returns 207 Multi-Status when any item in the batch failed. The top-level envelope carries count (total items), errors (count of failed items), total_credits_consumed (net after refunds), credits_refunded (sum of refunds across failed items), and credits_remaining. Each entry in evaluations carries an index matching its position in the request.

Failed items carry one of two error_type values: "evaluation_error" (every requested method failed on that item) or "partial_combined_unavailable" (the item used scoring_method: "combined" and at least one sub-method was missing). Both shapes include a human-readable error string, a method_errors object, and a transient flag. The partial_combined_unavailable envelope additionally carries requested_methods, completed_methods, and missing_methods (same shape as the single-request 503 body). A value of transient: true means at least one underlying sub-method failed with a retry-worthy condition (rate limit, upstream 5xx, timeout); transient: false means the failure is permanent for that input (unscorable). Partial-combined items always carry transient: true, because an unavailable upstream sub-method is retry-worthy regardless of how the individual sub-method failures are classified.
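A CI job will typically want to sort the items of a 207 body into those that succeeded, those worth retrying, and those that failed permanently. The sketch below does this using only the item-level error_type and transient fields described above. `partition_batch` is a hypothetical helper name, and falling back to the list position when an item has no index is an assumption.

```python
def partition_batch(body):
    """Split a batch response body into (succeeded, retryable, permanent) index lists."""
    succeeded, retryable, permanent = [], [], []
    for position, item in enumerate(body["evaluations"]):
        # Assumption: use the list position if an item carries no "index".
        index = item.get("index", position)
        if "error_type" not in item:
            succeeded.append(index)
        elif item.get("transient"):
            retryable.append(index)   # resubmit these items later
        else:
            permanent.append(index)   # input is unscorable; retrying will not help
    return succeeded, retryable, permanent
```

Applied to the 207 example above, this yields `([0], [1], [])`: item 0 succeeded and item 1 can be resubmitted.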

A request whose validation fails before scoring begins returns 422 validation_error for the whole request, not 207; 207 is only for batches where at least one item validated and ran.

Each evaluation in the batch follows the same parameters as the single evaluate endpoint. Max 10 evaluations per request.
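Test suites larger than the 10-item limit need to be split across several batch requests. A minimal sketch of that split follows; `chunk_evaluations` is a hypothetical helper name, and it only builds request bodies without sending them.

```python
MAX_BATCH = 10  # documented maximum number of evaluations per batch request


def chunk_evaluations(evaluations, size=MAX_BATCH):
    """Return a list of batch request bodies, each with at most `size` evaluations."""
    return [
        {"evaluations": evaluations[start:start + size]}
        for start in range(0, len(evaluations), size)
    ]
```

Each returned dict has the same shape as the request body shown above. Requests still count against the per-account rate limit, so the bodies should be sent at a pace the account's tier allows.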

GET /api/v1/balance

Check your current credit balance. Shows free credits, paid credits, and total available. This call is free.

Requires: X-API-Key header

Request

curl -X GET https://endpointevaluator.com/api/v1/balance \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "free": 450,
  "paid": 0,
  "total": 450,
  "tier": "free",
  "bonus_unlocked": false
}

Response Fields

Field	Type	Description
free	integer	Free credits remaining
paid	integer	Paid credits remaining
total	integer	Total credits (free + paid)
tier	string	Account tier: `free` or `paid`
bonus_unlocked	boolean	Whether bonus features are active on this account

GET /api/v1/evaluations/:evaluation_id

Look up the details of a prior evaluation. Useful for auditing or debugging your evaluation history. This call costs one credit.

Requires: X-API-Key header

Request

curl -X GET https://endpointevaluator.com/api/v1/evaluations/eval_abc123xyz789 \
  -H "X-API-Key: YOUR_API_KEY"

Response

{
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "combined",
  "evaluated_at": "2026-03-24T18:30:00Z",
  "client_id": "test-case-42",
  "group_id": "ci-run-2026-05-16",
  "credits_consumed": 100,
  "result": {
    "verdict": "consistent",
    "method_verdicts": {
      "lexical": "consistent",
      "semantic": "consistent",
      "inferential": "consistent"
    }
  }
}

The credits_consumed field shows credits used for the original evaluation, not this lookup.

Input text is not stored by default.

We do not retain output_text or reference_text on the evaluation record. Lookups return verdicts and metadata only; the original inputs are gone.

If you need the inputs echoed back for debugging (for example, to correlate a surprising verdict with the exact text that produced it), enable Debug Mode from your dashboard. With Debug Mode active, evaluations written during the next 24 hours include output_text, reference_text, and a text_retention_until timestamp alongside the other fields. After that timestamp the cleanup worker scrubs the text back to null; the evaluation record itself is preserved. Outside the Debug Mode window, these fields are omitted from the response entirely.

Retention depends on tier.

Free, Small, and Medium tiers retain evaluation rows for 7 days. Large tier (with bonus features unlocked) retains for 30 days. Debug Mode is an additional opt-in window for the input/reference text itself, controlled per-account, default 24h. See pricing for tier details.

Rate Limits

API requests are rate-limited per account based on your tier.

Tier	Limit
Free	1 request per minute
Paid accounts	10 requests per minute

Any paid balance puts the account on the paid tier, regardless of which credit pack (Small, Medium, or Large) was purchased. Rate limit headers are included in API responses: X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.
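One simple way to stay under these limits is to space requests evenly. The sketch below derives the spacing from the table above. `min_interval_seconds` is a hypothetical helper, and the tier keys are assumed to match the `tier` values returned by GET /api/v1/balance. It does not read the X-RateLimit-* headers, whose exact value formats are not specified here.

```python
# Documented per-account limits, in requests per minute.
REQUESTS_PER_MINUTE = {"free": 1, "paid": 10}


def min_interval_seconds(tier):
    """Seconds to wait between requests so the per-minute limit is never exceeded."""
    return 60.0 / REQUESTS_PER_MINUTE[tier]
```

A free account would wait 60 seconds between calls and a paid account 6 seconds. A client that receives a 429 should still honor the retry_after value in the error body.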

Error Responses

Every error response carries a top-level error_type discriminator. Clients should branch on error_type, not on HTTP status alone, for forward compatibility. The closed catalog at v1.0 has 16 values listed below; new values will arrive only in MINOR or MAJOR releases per the changelog. For server-side error logging guidance (severity by error_type), see the Production Checklist.

Idempotency-related errors are emitted only when the idempotency feature is enabled per-account; off by default at v1.0.

Status	Description	Example Response
401	Missing or invalid API key	`{"error_type": "auth_required", "error": "API key required. Include X-API-Key: <your-api-key> header"}` Invalid keys return `error_type: "auth_invalid"`.
402	Insufficient credits	`{"error_type": "insufficient_credits", "error": "Insufficient credits", "credits_needed": 20, "balance": {...}}`
403	Feature requires bonus	`{"error_type": "forbidden", "error": "Batch evaluation requires bonus features"}`
404	Resource not found	`{"error_type": "not_found", "error": "Evaluation not found"}`
409	Idempotency conflict	In-flight or expired key replay. Honor the `retry-after` header value (default 5s) before retrying. `{"error_type": "idempotency_key_conflict", "error": "Idempotency key conflict"}`
422	Validation error — malformed or missing input fields	`{"error_type": "validation_error", "errors": {"output_text": ["is required"]}}`
422	Idempotency key malformed	Idempotency-Key header value failed validation (length or character constraints). `{"error_type": "idempotency_key_invalid", "error": "Invalid idempotency key"}`
422	All requested scoring methods failed with permanent errors (input unscorable). Credits fully refunded.	Same envelope as 503 below; `error_type` is `"all_methods_failed"`.
429	Rate limit exceeded (too many requests from this account)	`{"error_type": "rate_limited", "error": "Rate limit exceeded", "retry_after": 60}`
503	All requested scoring methods failed with at least one transient error. Credits fully refunded. Retry after the `Retry-After` header.	See All-methods-failed body below.
503	`scoring_method: "combined"` with at least one sub-method missing. Combined fails closed rather than returning a partial result labeled as combined. Credits fully refunded. Retry after the `Retry-After` header.	See Partial-combined body below. `error_type` is `"partial_combined_unavailable"`.
503	Internal scoring-formatter contract violation. Always carries `retry-after: 5`. Retry once; persistent recurrence indicates a bug — please report.	`{"error_type": "formatter_contract_violation", "error": "Internal scoring formatter error; please retry"}`
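Branching on error_type rather than on the HTTP status can be sketched as a lookup. The grouping into retry-worthy and permanent values below is an interpretation of the table descriptions, not an official classification. `next_action` is a hypothetical helper, and it covers only the error_type values that appear in this section.

```python
# Values whose table entry tells the client to retry or wait.
RETRYABLE = {
    "rate_limited",
    "idempotency_key_conflict",
    "all_methods_failed_transient",
    "partial_combined_unavailable",
    "formatter_contract_violation",
}

# Values where resending the same request unchanged will not help.
PERMANENT = {
    "auth_required",
    "auth_invalid",
    "insufficient_credits",
    "forbidden",
    "not_found",
    "validation_error",
    "idempotency_key_invalid",
    "all_methods_failed",
}


def next_action(error_type):
    """Map an error_type to 'retry', 'fail', or 'unknown'."""
    if error_type in RETRYABLE:
        return "retry"
    if error_type in PERMANENT:
        return "fail"
    # Forward compatibility: later releases may add values this client has not seen.
    return "unknown"
```

Returning "unknown" instead of raising lets the caller decide how to treat values introduced in a later release, for example by falling back to the HTTP status class.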

All-methods-failed body (422 and 503)

Both 422 and 503 responses share the same body shape; only the error_type field and the HTTP status differ. 503 additionally sets a Retry-After header. The body is the same whether the request was a single-method call or a combined call.

{
  "error_type": "all_methods_failed_transient",
  "error": "Evaluation failed",
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "combined",
  "requested_methods": ["lexical", "semantic", "inferential"],
  "completed_methods": [],
  "missing_methods": ["lexical", "semantic", "inferential"],
  "method_errors": {
    "lexical": { "reason": "no_key_tokens", "transient": false },
    "semantic": { "reason": "pool_timeout", "transient": true },
    "inferential": { "reason": "pool_timeout", "transient": true }
  },
  "credits_charged": 0,
  "credits_refunded": 100
}

For 422 the same body is returned with error_type: "all_methods_failed" and every per-method transient flag set to false. The transient flag on each entry of method_errors tells the client whether that specific sub-method's failure is retry-worthy in isolation; the top-level error_type tells the client whether the whole call is.

Partial-combined body (503)

Returned when scoring_method: "combined" and at least one sub-method failed to produce a score. Combined fails closed: rather than return a 200 with a partial combined result (which CI pipelines cannot reliably distinguish from a real combined score), the API returns 503 with a full refund. Retry after the Retry-After header.

{
  "error_type": "partial_combined_unavailable",
  "error": "Partial combined evaluation unavailable",
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "combined",
  "requested_methods": ["lexical", "semantic", "inferential"],
  "completed_methods": ["lexical"],
  "missing_methods": ["semantic", "inferential"],
  "method_errors": {
    "semantic": { "reason": "pool_timeout", "transient": true },
    "inferential": { "reason": "nli_unparseable", "transient": false }
  },
  "credits_charged": 0,
  "credits_refunded": 100
}

Per-sub-method transient flags are preserved for clients that want to log the underlying failure shape; the top-level call is always retry-worthy regardless of sub-method classification, so no HTTP-status disambiguation like the 422 / 503 split on all_methods_failed applies here.

FAQ

Common questions about combined scoring and its fails-closed semantics. See Response Fields and Error Responses above for the underlying field and status-code definitions.

Why did I get 503 when I asked for combined?

A combined score aggregates the lexical, semantic, and inferential methods. If any sub-method is unavailable or fails to produce a score, the combined score cannot be computed with integrity, so the request fails closed with HTTP 503 partial_combined_unavailable.

Why is combined all-or-nothing? Can't I get partial results?

Combined is a deterministic aggregation contract for CI-grade use cases. Partial aggregation would be a different product with different semantics. Keeping the boundary sharp prevents silently computing combined scores from incomplete inputs.

How do I get partial results if I want them?

Request individual methods directly: scoring_method: "lexical", scoring_method: "semantic", or scoring_method: "inferential". If semantic is temporarily down and you want lexical anyway, call scoring_method: "lexical". Individual methods do not fail closed on each other.

Am I charged for a 503 `partial_combined_unavailable`?

No. The call is fully refunded: the response carries credits_charged: 0 and a credits_refunded value equal to the original charge, so your credit balance is unchanged.

How should my retry logic handle partial-combined 503s?

Two options. (a) Retry with exponential backoff. partial_combined_unavailable errors are usually transient. (b) Fall back to requesting individual methods, which do not require cross-method availability. Either is acceptable; choose based on your tolerance for latency versus scope.
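The two options can also be chained: retry combined a few times, then fall back to individual methods. The sketch below shows one way to do that under stated assumptions. `send` is a hypothetical transport callable that takes a request body and returns `(status, body)`, and the attempt count and delays are illustrative defaults rather than recommended values. A real client should prefer the Retry-After header over a computed delay when it is present.

```python
import time

SUB_METHODS = ("lexical", "semantic", "inferential")


def evaluate_with_fallback(send, payload, max_attempts=3, base_delay=1.0,
                           sleep=time.sleep):
    """Try combined with exponential backoff, then fall back to single methods.

    `send(body) -> (status, body)` is a hypothetical transport supplied by the caller.
    Returns {"combined": body} on success, else a dict of per-method 200 bodies.
    """
    for attempt in range(max_attempts):
        status, body = send({**payload, "scoring_method": "combined"})
        if status == 200:
            return {"combined": body}
        if body.get("error_type") != "partial_combined_unavailable":
            # Other errors (validation, credits, ...) are not handled here.
            raise RuntimeError(f"evaluation failed: {body.get('error_type')}")
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... by default

    # Option (b): individual methods do not fail closed on each other.
    results = {}
    for method in SUB_METHODS:
        status, body = send({**payload, "scoring_method": method})
        if status == 200:
            results[method] = body
    return results
```

The fallback result is a set of independent single-method verdicts, not a combined verdict, so callers should keep the two cases distinct in their reporting.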
