EndpointEvaluator API Documentation

For a step-by-step explanation, see our Quick Start Guide. For API contract changes by version, see the Changelog.

POST /api/v1/evaluate

Evaluate LLM output against a reference text. Returns a verdict indicating whether the output is consistent with the reference. See Scoring Details for method comparisons and examples.

Requires: X-API-Key header

Request
curl -X POST https://endpointevaluator.com/api/v1/evaluate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "output_text": "The order ships tomorrow.",
    "reference_text": "Your order will ship tomorrow morning.",
    "scoring_method": "combined",
    "client_id": "test-case-42",
    "group_id": "ci-run-2026-05-16"
  }'
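
For callers who would rather script the call than shell out to curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper name build_evaluate_request is hypothetical, and the function only builds the request so it can be inspected before anything is sent.

```python
import json
import urllib.request

API_URL = "https://endpointevaluator.com/api/v1/evaluate"


def build_evaluate_request(api_key, output_text, reference_text,
                           scoring_method="lexical", client_id=None, group_id=None):
    """Build (but do not send) a POST request mirroring the curl call above.

    The helper name and signature are illustrative, not part of the API.
    """
    payload = {
        "output_text": output_text,
        "reference_text": reference_text,
        "scoring_method": scoring_method,
    }
    # Optional identifiers are only included when the caller provides them.
    if client_id is not None:
        payload["client_id"] = client_id
    if group_id is not None:
        payload["group_id"] = group_id
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_evaluate_request(
    "YOUR_API_KEY",
    "The order ships tomorrow.",
    "Your order will ship tomorrow morning.",
    scoring_method="combined",
    client_id="test-case-42",
    group_id="ci-run-2026-05-16",
)
# To send it: urllib.request.urlopen(request), then json-decode the body.
```

Note that urllib.request.urlopen raises urllib.error.HTTPError for non-2xx statuses, so error bodies have to be read from the exception object.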

Request Fields

Field           Type     Required  Max Length  Description
output_text     string   Yes       1,000       The LLM output to evaluate
reference_text  string   Yes       1,000       The expected/reference response to compare against
scoring_method  string   No        -           lexical (default), semantic, inferential, or combined
cache           boolean  No        -           Set to true to check cache before scoring. Cache hits cost half credits. Not applicable for lexical.
client_id       string   No        255         Your identifier for this evaluation (e.g. test case name)
group_id        string   No        255         Group identifier for batching evaluations (e.g. CI run ID)

For texts that exceed the maximum length, split them into individual responses and evaluate each one separately.
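
One way to do the splitting is sketched below. It assumes the 1,000 limit is counted in characters (the table does not name the unit) and uses a naive sentence-boundary heuristic. The function name split_for_evaluation is hypothetical, and pairing each output chunk with the matching part of the reference is left to the caller.

```python
import re

MAX_LEN = 1000  # assumption: the documented limit is counted in characters


def split_for_evaluation(text, max_len=MAX_LEN):
    """Greedily pack sentences into chunks of at most max_len characters."""
    # Naive heuristic: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        while len(sentence) > max_len:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_len])
            sentence = sentence[max_len:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```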

Scoring Methods

Method       Description              Credits  Cached
lexical      Literal text comparison  20       -
semantic     Meaning comparison       40       20
inferential  Logical consistency      80       40
combined     All three methods        100      50

Cache hits cost half credits. Caching is not applicable for lexical scoring.
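
To budget a test run ahead of time, the table can be turned into a small lookup. This is a sketch only: the credit values are copied from the table above, and the function names are hypothetical.

```python
# Full-price credits per method, copied from the Scoring Methods table.
CREDITS = {"lexical": 20, "semantic": 40, "inferential": 80, "combined": 100}


def expected_credits(scoring_method, cache_hit=False):
    """Credits for one evaluation; cache hits are half price except for lexical."""
    full_price = CREDITS[scoring_method]
    if cache_hit and scoring_method != "lexical":
        return full_price // 2
    return full_price


def estimate_run_cost(methods):
    """Worst-case cost of a run, assuming no cache hits."""
    return sum(expected_credits(method) for method in methods)
```

For example, three semantic evaluations plus one combined evaluation cost at most 3 x 40 + 100 = 220 credits.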

Response
{
  "evaluation_id": "eval_abc123xyz789",
  "client_id": "test-case-42",
  "group_id": "ci-run-2026-05-16",
  "scoring_method": "combined",
  "verdict": "consistent",
  "method_verdicts": {
    "lexical": "consistent",
    "semantic": "consistent",
    "inferential": "consistent"
  },
  "requested_methods": ["lexical", "semantic", "inferential"],
  "completed_methods": ["lexical", "semantic", "inferential"],
  "missing_methods": [],
  "flags": [],
  "method_errors": {},
  "credits_charged": 100,
  "credits_refunded": 0,
  "credits_consumed": 100,
  "credits_remaining": 400
}

Single-method responses omit method_verdicts, requested_methods, completed_methods, missing_methods, flags, and method_errors; those fields describe cross-method outcomes that only apply to combined.

Bonus account response (additional fields)
{
  "evaluation_id": "eval_AbC123dEf456",
  "client_id": "test-case-42",
  "group_id": "ci-run-2026-05-16",
  "scoring_method": "combined",
  "verdict": "consistent",
  "method_verdicts": {...},
  "raw_score": 0.8421,
  "raw_scores": {
    "lexical": 0.91,
    "semantic": 0.8421,
    "inferential": 0.86
  },
  "credits_charged": 100,
  "credits_consumed": 100,
  "credits_remaining": 799900
}

Accounts with bonus features unlocked additionally receive raw_score on every method and raw_scores on combined evaluations. See pricing to unlock.

Response Fields

Field Type Description
evaluation_id string Unique identifier for this evaluation
scoring_method string Method used: lexical, semantic, inferential, or combined
verdict string Consistency result: consistent, ambiguous, or inconsistent
method_verdicts object Combined only. Per-method verdicts, one per completed sub-method.
requested_methods array Combined only. The sub-methods the combined call attempted, in presentation order ["lexical", "semantic", "inferential"].
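completed_methods array Combined only. Sub-methods that produced a score. Equals requested_methods on a 200 response; a strict subset appears only on the 503 partial_combined_unavailable and all-methods-failed error bodies.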
missing_methods array Combined only. Always an empty array on a 200 response. A combined call with any missing sub-method fails closed with a 503 partial_combined_unavailable; see Error Responses below.
flags array Combined only. On a 200 response, either empty or ["nli_contradicted"] (inferential score below the contradiction threshold). Partial combined calls return 503 partial_combined_unavailable rather than a 200 with an advisory flag; see Error Responses below.
method_errors object Combined only. Always an empty object on a 200 response. Per-sub-method error details appear on the 503 partial_combined_unavailable body and on the 422/503 all-methods-failed body; see Error Responses below.
credits_charged integer Credits charged net of any refund. Equals credits_consumed.
credits_refunded integer Always 0 on a 200 response. Non-zero values appear only on error responses (422 / 503) where a full refund has been issued; see Error Responses below.
credits_consumed integer Credits used for this evaluation (net after refund). Kept alongside credits_charged for backwards compatibility; the two are always equal.
credits_remaining integer Credits remaining on your account after this evaluation.
cache_hit boolean Present only when cache: true was requested. true if the result was served from cache (at half credit cost).
client_id string Echoed back from the request. Omitted if not provided.
group_id string Echoed back from the request. Omitted if not provided.
raw_score number Bonus feature. Present only for accounts with bonus features unlocked. Numeric score in [0.0, 1.0] for the method (or min across completed methods for combined).
raw_scores object Combined only, bonus feature. Per-sub-method numeric scores. Only entries for completed_methods are present.
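
The guarantees in the table above can be checked mechanically, for example in a CI harness that wants to fail loudly if a response drifts from the documented contract. The sketch below encodes only the rules stated in this section for 200 responses. The function name is hypothetical, and fields absent from a body are treated as their documented defaults.

```python
VERDICTS = {"consistent", "ambiguous", "inconsistent"}
COMBINED_ONLY = ("method_verdicts", "requested_methods", "completed_methods",
                 "missing_methods", "flags", "method_errors")


def check_success_body(body):
    """Return a list of problems found in a 200 response body (empty if none)."""
    problems = []
    if body.get("verdict") not in VERDICTS:
        problems.append("unexpected verdict")
    if body.get("credits_charged") != body.get("credits_consumed"):
        problems.append("credits_charged and credits_consumed differ")
    if body.get("credits_refunded", 0) != 0:
        problems.append("non-zero refund on a success response")
    if body.get("scoring_method") == "combined":
        if body.get("missing_methods", []):
            problems.append("missing_methods is not empty")
        if body.get("method_errors", {}):
            problems.append("method_errors is not empty")
        # At v1.0 the only documented flag is nli_contradicted.
        if set(body.get("flags", [])) - {"nli_contradicted"}:
            problems.append("unknown flag")
    else:
        extra = [name for name in COMBINED_ONLY if name in body]
        if extra:
            problems.append("combined-only fields present: " + ", ".join(extra))
    return problems
```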

POST /api/v1/evaluate/batch

Bonus Feature

Evaluate multiple outputs in a single API request. Submit up to 10 evaluations at once, ideal for CI/CD pipelines running test suites. Requires Large tier purchase to unlock. Per-item client_id and group_id are echoed back on each item in the response.

Requires: X-API-Key header

Request
curl -X POST https://endpointevaluator.com/api/v1/evaluate/batch \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluations": [
      {
        "output_text": "Paris is the capital of France",
        "reference_text": "The capital of France is Paris",
        "scoring_method": "inferential"
      },
      {
        "output_text": "Water boils at 100C",
        "reference_text": "Water boils at 212F",
        "scoring_method": "inferential"
      }
    ]
  }'

Response: 200 OK (all items succeeded)
{
  "evaluations": [
    {
      "evaluation_id": "eval_abc123...",
      "scoring_method": "inferential",
      "verdict": "consistent",
      "credits_consumed": 80
    },
    {
      "evaluation_id": "eval_def456...",
      "scoring_method": "inferential",
      "verdict": "consistent",
      "credits_consumed": 80
    }
  ],
  "count": 2,
  "total_credits_consumed": 160,
  "credits_remaining": 340
}

Response: 207 Multi-Status (some items failed)
{
  "evaluations": [
    {
      "index": 0,
      "evaluation_id": "eval_abc123...",
      "scoring_method": "inferential",
      "verdict": "consistent",
      "credits_consumed": 80
    },
    {
      "index": 1,
      "error_type": "evaluation_error",
      "error": "Evaluation failed",
      "method_errors": {
        "inferential": { "reason": "pool_timeout", "transient": true }
      },
      "transient": true
    }
  ],
  "count": 2,
  "errors": 1,
  "total_credits_consumed": 80,
  "credits_refunded": 80,
  "credits_remaining": 340
}

Batch returns 207 Multi-Status when any item in the batch failed. The top-level envelope carries count (total items), errors (count of failed items), total_credits_consumed (net after refunds), credits_refunded (sum of refunds across failed items), and credits_remaining. Each entry in evaluations carries an index matching its position in the request.

Failed items carry one of two error_type values: "evaluation_error" (every requested method failed on that item) or "partial_combined_unavailable" (the item used scoring_method: "combined" and at least one sub-method was missing). Both shapes include a human-readable error string, a method_errors object, and a transient flag. The partial_combined_unavailable envelope additionally carries requested_methods, completed_methods, and missing_methods (same shape as the single-request 503 body).

transient: true means at least one underlying sub-method failed with a retry-worthy condition (rate limit, upstream 5xx, timeout); transient: false means the failure is permanent for that input (unscorable). Partial-combined items are always transient: true, because an upstream sub-method's unavailability is retry-worthy regardless of individual sub-method classifications.
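
A client consuming a 207 body typically wants three buckets: items that succeeded, items worth resubmitting, and items that will never score. The sketch below sorts items using only the error_type and transient fields described above. The function name is hypothetical, and it falls back to list position when an item carries no index.

```python
def partition_batch_items(batch_body):
    """Split batch items into (succeeded, retryable, permanent) index lists."""
    succeeded, retryable, permanent = [], [], []
    for position, item in enumerate(batch_body["evaluations"]):
        index = item.get("index", position)
        if "error_type" not in item:
            succeeded.append(index)
        elif item.get("transient") or item["error_type"] == "partial_combined_unavailable":
            # Partial-combined items are documented as always retry-worthy.
            retryable.append(index)
        else:
            permanent.append(index)
    return succeeded, retryable, permanent
```

Applied to the 207 example above, this yields item 0 as succeeded and item 1 as retryable.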

A request whose validation fails before scoring begins returns 422 validation_error for the whole request, not 207; 207 is only for batches where at least one item validated and ran.

Each evaluation in the batch follows the same parameters as the single evaluate endpoint. Max 10 evaluations per request.
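
Test suites with more than 10 cases need to be split across several batch requests. A minimal sketch of that chunking follows. The function name is hypothetical, and each returned dict is one request body for this endpoint.

```python
MAX_BATCH_SIZE = 10  # documented per-request maximum


def build_batch_payloads(evaluations, batch_size=MAX_BATCH_SIZE):
    """Split a list of evaluation dicts into batch request bodies."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    return [
        {"evaluations": evaluations[start:start + batch_size]}
        for start in range(0, len(evaluations), batch_size)
    ]
```

Because index values in a response are relative to each request, callers that chunk should keep their own mapping back to the original list (client_id is one option).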

GET /api/v1/balance

Check your current credit balance. Shows free credits, paid credits, and total available. This call is free.

Requires: X-API-Key header

Request
curl -X GET https://endpointevaluator.com/api/v1/balance \
  -H "X-API-Key: YOUR_API_KEY"

Response
{
  "free": 450,
  "paid": 0,
  "total": 450,
  "tier": "free",
  "bonus_unlocked": false
}

Response Fields

Field Type Description
free integer Free credits remaining
paid integer Paid credits remaining
total integer Total credits (free + paid)
tier string Account tier: free or paid
bonus_unlocked boolean Whether bonus features are active on this account
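
Since the balance call is free, it can serve as a pre-flight check before a CI run, so the run does not stop partway through on a 402. The sketch below is one hedged way to do that. The function name is hypothetical, and the per-evaluation cost comes from the Scoring Methods table.

```python
def affordable_evaluations(balance, credits_per_evaluation):
    """How many evaluations of a given cost the balance body can pay for."""
    if balance["total"] != balance["free"] + balance["paid"]:
        raise ValueError("balance body is inconsistent: total != free + paid")
    return balance["total"] // credits_per_evaluation
```

With the example balance above (450 credits), that is 4 combined evaluations at 100 credits each or 22 lexical evaluations at 20 credits each.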

GET /api/v1/evaluations/:evaluation_id

Look up the details of a prior evaluation. Useful for auditing or debugging your evaluation history. This call costs one credit.

Requires: X-API-Key header

Request
curl -X GET https://endpointevaluator.com/api/v1/evaluations/eval_abc123xyz789 \
  -H "X-API-Key: YOUR_API_KEY"

Response
{
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "combined",
  "evaluated_at": "2026-03-24T18:30:00Z",
  "client_id": "test-case-42",
  "group_id": "ci-run-2026-05-16",
  "credits_consumed": 100,
  "result": {
    "verdict": "consistent",
    "method_verdicts": {
      "lexical": "consistent",
      "semantic": "consistent",
      "inferential": "consistent"
    }
  }
}

The credits_consumed field shows credits used for the original evaluation, not this lookup.

Input text is not stored by default.

We do not retain output_text or reference_text on the evaluation record. Lookups return verdicts and metadata only; the original inputs are gone.

If you need the inputs echoed back for debugging (for example, to correlate a surprising verdict with the exact text that produced it), enable Debug Mode from your dashboard. With Debug Mode active, evaluations written during the next 24 hours include output_text, reference_text, and a text_retention_until timestamp alongside the other fields. After that timestamp the cleanup worker scrubs the text back to null; the evaluation record itself is preserved. Outside the Debug Mode window, these fields are omitted from the response entirely.

Retention depends on tier.

Free, Small, and Medium tiers retain evaluation rows for 7 days. Large tier (with bonus features unlocked) retains for 30 days. Debug Mode is an additional opt-in window for the input/reference text itself, controlled per-account, default 24h. See pricing for tier details.
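
Because a lookup costs one credit, it can be worth estimating locally whether a record is still likely to exist before requesting it. The sketch below assumes the retention window is measured from evaluated_at, which this page does not state explicitly, so treat the result as an estimate. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def record_expires_at(evaluated_at, large_tier):
    """Estimated expiry: 30 days for Large tier, 7 days otherwise.

    Assumption: the window starts at the evaluated_at timestamp.
    """
    started = datetime.fromisoformat(evaluated_at.replace("Z", "+00:00"))
    return started + timedelta(days=30 if large_tier else 7)


def is_probably_retrievable(evaluated_at, large_tier, now=None):
    """True if the estimated expiry is still in the future."""
    now = now or datetime.now(timezone.utc)
    return now < record_expires_at(evaluated_at, large_tier)
```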

Rate Limits

API requests are rate-limited per account based on your tier.

Tier Limit
Free 1 request per minute
Paid accounts 10 requests per minute

Any paid balance puts the account on the paid tier, regardless of which credit pack (Small, Medium, or Large) was purchased. Rate limit headers are included in API responses: X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset.
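
A simple client-side throttle can keep a script under these limits without relying on 429 responses. The sketch below spaces requests evenly (60 seconds apart on the free tier, 6 seconds on paid). That is a conservative assumption, since this page does not describe how the server windows requests or the exact format of X-RateLimit-Reset. The Throttle class is hypothetical.

```python
import time

REQUESTS_PER_MINUTE = {"free": 1, "paid": 10}  # from the table above


class Throttle:
    """Block so that consecutive calls are at least min_interval apart."""

    def __init__(self, tier, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / REQUESTS_PER_MINUTE[tier]
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call throttle.wait() immediately before each API request. The clock and sleep parameters exist so the class can be tested without real delays.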

Error Responses

Every error response carries a top-level error_type discriminator. Clients should branch on error_type, not on HTTP status alone, for forward compatibility. The closed catalog at v1.0 has 16 values listed below; new values will arrive only in MINOR or MAJOR releases per the changelog. For server-side error logging guidance (severity by error_type), see the Production Checklist.

Idempotency-related errors are emitted only when the idempotency feature is enabled per-account; off by default at v1.0.

Status  Description (example response on the following line, where given)
401  Missing or invalid API key. Invalid keys return error_type: "auth_invalid".
     {"error_type": "auth_required", "error": "API key required. Include X-API-Key: <your-api-key> header"}
402  Insufficient credits.
     {"error_type": "insufficient_credits", "error": "Insufficient credits", "credits_needed": 20, "balance": {...}}
403  Feature requires bonus.
     {"error_type": "forbidden", "error": "Batch evaluation requires bonus features"}
404  Resource not found.
     {"error_type": "not_found", "error": "Evaluation not found"}
409  Idempotency conflict: in-flight or expired key replay. Honor the Retry-After header value (default 5s) before retrying.
     {"error_type": "idempotency_key_conflict", "error": "Idempotency key conflict"}
422  Validation error: malformed or missing input fields.
     {"error_type": "validation_error", "errors": {"output_text": ["is required"]}}
422  Idempotency key malformed: the Idempotency-Key header value failed validation (length or character constraints).
     {"error_type": "idempotency_key_invalid", "error": "Invalid idempotency key"}
422  All requested scoring methods failed with permanent errors (input unscorable). Credits fully refunded. Same envelope as the 503 all-methods-failed response; error_type is "all_methods_failed".
429  Rate limit exceeded (too many requests from this account).
     {"error_type": "rate_limited", "error": "Rate limit exceeded", "retry_after": 60}
503  All requested scoring methods failed with at least one transient error. Credits fully refunded. Retry after the Retry-After header. error_type is "all_methods_failed_transient". See All-methods-failed body below.
503  scoring_method: "combined" with at least one sub-method missing. Combined fails closed rather than returning a partial result labeled as combined. Credits fully refunded. Retry after the Retry-After header. error_type is "partial_combined_unavailable". See Partial-combined body below.
503  Internal scoring-formatter contract violation. Always carries Retry-After: 5. Retry once; persistent recurrence indicates a bug, so please report it.
     {"error_type": "formatter_contract_violation", "error": "Internal scoring formatter error; please retry"}
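
Branching on error_type rather than on status can be sketched as a lookup from error_type to a coarse next step. The groupings below are one reading of the table above, and the action labels ("retry", "fix_request", and so on) are hypothetical names chosen for this example. Unknown values fall through to "unknown" so that error types added in later releases do not crash the client.

```python
RETRY_LATER = {
    "idempotency_key_conflict", "rate_limited", "all_methods_failed_transient",
    "partial_combined_unavailable", "formatter_contract_violation",
}
FIX_REQUEST = {"validation_error", "idempotency_key_invalid", "all_methods_failed"}
FIX_ACCOUNT = {"auth_required", "auth_invalid", "insufficient_credits", "forbidden"}


def next_action(error_body):
    """Map an error body to a coarse next step based on error_type alone."""
    error_type = error_body.get("error_type")
    if error_type in RETRY_LATER:
        return "retry"
    if error_type in FIX_REQUEST:
        return "fix_request"
    if error_type in FIX_ACCOUNT:
        return "fix_account"
    if error_type == "not_found":
        return "give_up"
    # Forward compatibility: treat unrecognized values conservatively.
    return "unknown"
```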

All-methods-failed body (422 and 503)

Both 422 and 503 responses share the same body shape; only the error_type field and the HTTP status differ. 503 additionally sets a Retry-After header. The body is the same whether the request was a single-method call or a combined call.

{
  "error_type": "all_methods_failed_transient",
  "error": "Evaluation failed",
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "combined",
  "requested_methods": ["lexical", "semantic", "inferential"],
  "completed_methods": [],
  "missing_methods": ["lexical", "semantic", "inferential"],
  "method_errors": {
    "lexical": { "reason": "no_key_tokens", "transient": false },
    "semantic": { "reason": "pool_timeout", "transient": true },
    "inferential": { "reason": "pool_timeout", "transient": true }
  },
  "credits_charged": 0,
  "credits_refunded": 100
}

For 422 the same body is returned with error_type: "all_methods_failed" and every per-method transient flag set to false. The transient flag on each entry of method_errors tells the client whether that specific sub-method's failure is retry-worthy in isolation; the top-level error_type tells the client whether the whole call is.
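
The two levels of retry information can be read separately, as in the sketch below. The function names are hypothetical, and only fields shown in the body above are used.

```python
def transient_methods(error_body):
    """Names of sub-methods whose individual failure is marked retry-worthy."""
    method_errors = error_body.get("method_errors", {})
    return sorted(name for name, detail in method_errors.items()
                  if detail.get("transient"))


def whole_call_retryable(error_body):
    """True when the top-level error_type marks the whole call as retry-worthy."""
    return error_body.get("error_type") == "all_methods_failed_transient"
```

For the 503 example above, transient_methods returns ["inferential", "semantic"] and whole_call_retryable returns True.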

Partial-combined body (503)

Returned when scoring_method: "combined" and at least one sub-method failed to produce a score. Combined fails closed: rather than return a 200 with a partial combined result (which CI pipelines cannot reliably distinguish from a real combined score), the API returns 503 with a full refund. Retry after the Retry-After header.

{
  "error_type": "partial_combined_unavailable",
  "error": "Partial combined evaluation unavailable",
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "combined",
  "requested_methods": ["lexical", "semantic", "inferential"],
  "completed_methods": ["lexical"],
  "missing_methods": ["semantic", "inferential"],
  "method_errors": {
    "semantic": { "reason": "pool_timeout", "transient": true },
    "inferential": { "reason": "nli_unparseable", "transient": false }
  },
  "credits_charged": 0,
  "credits_refunded": 100
}

Per-sub-method transient flags are preserved for clients that want to log the underlying failure shape; the top-level call is always retry-worthy regardless of sub-method classification, so no HTTP-status disambiguation like the 422 / 503 split on all_methods_failed applies here.

FAQ

Common questions about combined scoring and its fails-closed semantics. See Response Fields and Error Responses above for the underlying field and status-code definitions.

Why did I get 503 when I asked for combined?

The combined score is an aggregation across the lexical, semantic, and inferential methods. If any sub-method is unavailable or fails to produce a score, combined cannot be computed with integrity, so the request fails closed with HTTP 503 partial_combined_unavailable.

Why is combined all-or-nothing? Can't I get partial results?

Combined is a deterministic aggregation contract for CI-grade use cases. Partial aggregation would be a different product with different semantics. Keeping the boundary sharp prevents silently computing combined scores from incomplete inputs.

How do I get partial results if I want them?

Request individual methods directly: scoring_method: "lexical", scoring_method: "semantic", or scoring_method: "inferential". If semantic is temporarily down and you want lexical anyway, call scoring_method: "lexical". Individual methods do not fail closed on each other.

Am I charged for a 503 partial_combined_unavailable?

No. The call is fully refunded: the response carries credits_charged: 0, and your credit balance is unchanged.

How should my retry logic handle partial-combined 503s?

Two options. (a) Retry with exponential backoff, waiting at least as long as the Retry-After header indicates; partial_combined_unavailable errors are usually transient. (b) Fall back to requesting individual methods, which do not require cross-method availability. Either is acceptable; choose based on your tolerance for latency versus scope.
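
The two options can also be combined: retry a few times, then fall back. The sketch below shows one such arrangement under stated assumptions. The send callable is hypothetical, standing in for whatever HTTP client you use; it is expected to return a (status, body, retry_after_seconds) tuple. The attempt count and delays are illustrative defaults, not recommended values.

```python
import time

SUB_METHODS = ("lexical", "semantic", "inferential")


def evaluate_with_fallback(send, payload, max_attempts=3, base_delay=1.0,
                           sleep=time.sleep):
    """Retry a combined call with backoff, then fall back to single methods.

    send(payload) is a caller-supplied function (hypothetical here) that
    returns (status_code, body_dict, retry_after_seconds_or_None).
    """
    for attempt in range(max_attempts):
        status, body, retry_after = send(payload)
        if body.get("error_type") != "partial_combined_unavailable":
            # Success or a different error: hand it back unchanged.
            return {"combined": (status, body)}
        if attempt < max_attempts - 1:
            # Exponential backoff, never shorter than the server's Retry-After.
            sleep(max(base_delay * (2 ** attempt), retry_after or 0))
    # Option (b): request each method on its own.
    results = {}
    for method in SUB_METHODS:
        status, body, _ = send({**payload, "scoring_method": method})
        results[method] = (status, body)
    return results
```

Keep in mind that the fallback path issues three requests, which count against the rate limit, and that three successful single-method calls cost 20 + 40 + 80 = 140 credits rather than 100.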