EndpointEvaluator Scoring Details

The four steps below explain how our scoring methods work. Each method catches what the previous one misses. The reference text is the result you expect from your LLM-based system; the same reference text is used in all four steps so you can see how the methods complement each other.

Reference text used throughout:

"The order ships tomorrow."

Step 1: Lexical — catch rewording

20 credits

Lexical scoring measures how much of the meaningful wording from your reference appears in your output. We split both texts into words, ignore common filler words like "the" and "is", and compute an overlap score between 0.0 and 1.0. A score near 1.0 means most of the content words survived; a score near 0.0 means most of them were changed. Scoring is deterministic (the same inputs always produce the same score) and runs on our own infrastructure, with no third-party LLM calls. Lexical flags any rewording, so use Semantic when you expect paraphrasing or Inferential when you need logical reasoning.
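The overlap idea described above can be sketched in a few lines of Python. This is an illustration only, not the service's actual implementation: the stop-word list, the tokenizer, and the choice of "fraction of reference content words found in the output" as the formula are all assumptions.

```python
import re

# Assumed filler-word list; the real list used by the service is not documented here.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "your", "will", "be"}


def content_words(text: str) -> set[str]:
    """Lowercase the text, split it into words, and drop filler words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def lexical_overlap(reference: str, output: str) -> float:
    """Fraction of reference content words that also appear in the output."""
    ref = content_words(reference)
    out = content_words(output)
    if not ref:
        # No content words to compare: treat two empty sets as a full match.
        return 1.0 if not out else 0.0
    return len(ref & out) / len(ref)
```

Under these assumptions, identical texts score 1.0, and "Your order will be shipped the next day." scores 1/3 against the reference, because only "order" survives out of "order", "ships", and "tomorrow".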

Consistent — exact match

Reference: "The order ships tomorrow."

Output: "The order ships tomorrow."

Identical wording. Lexical compares the content words in both texts, so identical strings always return Consistent.

Ambiguous — partial content-word overlap

Reference: "The order ships tomorrow."

Output: "Your order will be shipped the next day."

Some content words survive ("order"), others change ("ships" → "shipped", "tomorrow" → "day"). Lexical puts this in the ambiguous band: neither a clean match nor a clean mismatch. The meaning is actually preserved, but lexical can't distinguish a harmless paraphrase from a change in meaning.
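One way to picture the "ambiguous band" is as the range between two cut-offs on the 0.0 to 1.0 score. The sketch below is hypothetical: the function name `band` and the thresholds 0.2 and 0.8 are placeholders chosen for illustration, not the values the service uses.

```python
def band(score: float, low: float = 0.2, high: float = 0.8) -> str:
    """Map an overlap score to a verdict using two assumed cut-offs."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= high:
        return "consistent"
    if score <= low:
        return "inconsistent"
    # Anything between the two cut-offs is neither a clean match nor a clean mismatch.
    return "ambiguous"
```

With these placeholder thresholds, a score of 1.0 maps to consistent, while a partial overlap such as 1/3 falls in the ambiguous band.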

Works for: Fast CI checks. Template-based outputs. Detecting unexpected rewording.

Blind spot: Overlapping content words drag the score up even when the meaning changes; missing content words drag it down even when the meaning is preserved. Use Semantic when meaning matters more than wording.

Step 2: Semantic — tolerate rephrasing

40 credits

Semantic scoring uses language models to compare the underlying meaning of both texts. It understands that different words can express the same meaning.
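A common way to compare meaning with language models is to turn each text into an embedding vector and measure the cosine similarity between the two vectors. Whether Semantic scoring works exactly this way is not stated here, so treat the sketch as an assumption. The `embed` parameter is a hypothetical stand-in for whatever model produces the vectors.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity(
    reference: str, output: str, embed: Callable[[str], Vector]
) -> float:
    """Embed both texts with the supplied (hypothetical) model and compare them."""
    return cosine_similarity(embed(reference), embed(output))
```

Because the comparison happens on vectors rather than words, two sentences with no words in common can still score highly if the model places them close together.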

Consistent — same meaning, different words

Reference: "The order ships tomorrow."

Output: "Your order will be shipped the next day."

Semantic scoring understands these sentences mean the same thing despite different wording. This is the same output lexical called ambiguous — semantic correctly identifies the meaning is preserved.

Inconsistent — the texts describe different things

Reference: "The order ships tomorrow."

Output: "Your item should go out in the morning."

The surface meaning is different — one talks about an order shipping tomorrow, the other about an item going out in the morning. Semantic scoring flags the meaning gap. But as we'll see in the inferential section, there's a logical relationship here that semantic misses.

Works for: Paraphrase-tolerant evaluation. When meaning matters more than exact wording.

Blind spot: Misses logical relationships. "Your item should go out in the morning" could actually be consistent with shipping tomorrow, but semantic doesn't reason about entailment.

Step 3: Inferential — catch factual errors

80 credits

Inferential scoring uses an LLM judge to determine factual consistency. It asks: does this output contradict the reference? It understands logical entailment and catches factual conflicts.
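The LLM-judge pattern can be sketched as a prompt plus a small parser for the reply. Everything here is illustrative: the prompt wording, the `judge` callable (a stand-in for a call to some LLM), and the fallback to "ambiguous" for unparseable replies are assumptions, not the service's actual behavior.

```python
from typing import Callable


def build_judge_prompt(reference: str, output: str) -> str:
    """Assemble a hypothetical prompt asking the judge for a one-word verdict."""
    return (
        f"Reference: {reference}\n"
        f"Output: {output}\n"
        "Does the output contradict the reference? "
        "Answer with one word: consistent, inconsistent, or ambiguous."
    )


def inferential_verdict(
    reference: str, output: str, judge: Callable[[str], str]
) -> str:
    """Ask the (hypothetical) judge and normalize its reply to a verdict label."""
    reply = judge(build_judge_prompt(reference, output)).strip().lower()
    # Check each label as a prefix so trailing punctuation or explanation is tolerated.
    for label in ("inconsistent", "consistent", "ambiguous"):
        if reply.startswith(label):
            return label
    # Assumed fallback: an unparseable reply is treated as ambiguous.
    return "ambiguous"
```

Passing the judge in as a function keeps the sketch testable with a stub and leaves the choice of model open.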

Consistent — logical entailment

Reference: "The order ships tomorrow."

Output: "Your item should go out in the morning."

An item that should go out in the morning is consistent with an order that ships tomorrow. Inferential scoring recognizes this logical relationship: the output communicates the same fact in completely different words. This is the kind of response a customer service LLM might produce, and a human reads it and understands immediately. Lexical and semantic both flag it as inconsistent because the words and surface meaning are different, but inferential catches the logical entailment.

Inconsistent — direct contradiction

Reference: "The order ships tomorrow."

Output: "The order won't ship tomorrow."

Direct logical contradiction. Inferential scoring catches factual conflicts that could slip past lexical (similar words) and semantic (related topic) checks.

Works for: Factual consistency. Catching contradictions and logical conflicts. When you need to know if the output is logically inconsistent.

Blind spot: Statements can be logically consistent with the reference but not directly relevant to it. For example, "Your item is still in our warehouse" is logically consistent with our example reference, but is non-responsive to a question about when an item will ship.

Step 4: Combined — the full picture

100 credits

In production, you want all three perspectives. Combined runs lexical + semantic + inferential and gives you an overall verdict. A consistent or inconsistent verdict requires unanimous agreement; anything else is ambiguous.

Verdicts:

consistent: All methods agree the output is faithful

ambiguous: Disagreement among methods

inconsistent: All methods agree the output contradicts the reference

Ambiguous — the methods disagree

Reference: "The order ships tomorrow."

Output: "The order will ship tomorrow."

Lexical: ambiguous | Semantic: consistent | Inferential: consistent

Combined runs all three methods and requires unanimous agreement. "Will ship" vs "ships" trips lexical — content words shift, so lexical lands in the ambiguous band. Semantic and inferential both recognize the meaning is preserved. Because the three methods disagree, the combined verdict is Ambiguous.
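The unanimity rule can be written down directly. The function name `combined_verdict` and the plain-string verdict labels are illustrative choices, not part of a documented API.

```python
def combined_verdict(lexical: str, semantic: str, inferential: str) -> str:
    """Return a clean verdict only when all three methods agree on it."""
    verdicts = {lexical, semantic, inferential}
    if verdicts == {"consistent"}:
        return "consistent"
    if verdicts == {"inconsistent"}:
        return "inconsistent"
    # Any disagreement, or any ambiguous input, yields ambiguous.
    return "ambiguous"


# The example above: lexical is ambiguous, the other two are consistent.
print(combined_verdict("ambiguous", "consistent", "consistent"))  # ambiguous
```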

Scoring Methods Overview

Start cheap, upgrade where it matters.

Lexical (20 credits): Catches rewording

Semantic (40 credits): Tolerates paraphrases

Inferential (80 credits): Understands logic

Combined (100 credits): All three at once
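One possible reading of "start cheap, upgrade where it matters" is an escalation loop that runs the cheapest method first and only pays for the next one when the result is not yet consistent. The policy below is a hypothetical sketch: the `escalate` function, the `run` callable, and the stop-on-consistent rule are assumptions, while the credit prices come from the list above.

```python
from typing import Callable

# Credit prices per call, taken from the overview above.
CREDITS = {"lexical": 20, "semantic": 40, "inferential": 80, "combined": 100}


def escalate(
    run: Callable[[str], str],
    order: tuple[str, ...] = ("lexical", "semantic", "inferential"),
) -> tuple[str, int]:
    """Run methods cheapest first; stop at the first 'consistent' verdict.

    `run` is a hypothetical callable that takes a method name and returns
    its verdict. Returns the last verdict seen and the credits spent.
    """
    spent = 0
    verdict = "ambiguous"
    for method in order:
        spent += CREDITS[method]
        verdict = run(method)
        if verdict == "consistent":
            break
    return verdict, spent
```

The trade-off is visible in the arithmetic: an output that passes lexical costs 20 credits, but one that escalates through all three methods costs 20 + 40 + 80 = 140 credits, more than a single Combined call at 100. Stopping on a lexical pass also inherits lexical's blind spot, so this policy suits template-like outputs better than free-form ones.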