EndpointEvaluator › Scoring Details
The four steps below explain how our scoring methods work. Each method catches what the previous one misses. The reference text is the result you expect from your LLM-based system. We follow the same reference text through all four steps so you can see how the methods complement each other.
Reference text used throughout:
"The order ships tomorrow."
Step 1: Lexical — catch rewording
20 credits
Lexical scoring measures how much of the meaningful wording from your reference appears in your output. We split both texts into words, ignore common filler words like "the" and "is", and compute an overlap score between 0.0 and 1.0. A score near 1.0 means most of the content survived; a score near 0.0 means most of it was changed. The score is deterministic (the same inputs always produce the same score) and is computed on our own infrastructure, with no third-party LLM calls. Lexical catches any rewording; use Semantic when you expect paraphrasing or Inferential when you need logical reasoning.
Reference: "The order ships tomorrow."
Output: "The order ships tomorrow."
Identical wording. Lexical compares the content words in both texts, so identical strings always return Consistent.
Reference: "The order ships tomorrow."
Output: "Your order will be shipped the next day."
Some content words survive ("order"), others change ("ships" → "shipped", "tomorrow" → "day"). Lexical puts this in the ambiguous band: neither a clean match nor a clean mismatch. The meaning is actually preserved, but lexical cannot distinguish a harmless paraphrase from a change in meaning.
Works for: Fast CI checks. Template-based outputs. Detecting unexpected rewording.
Blind spot: Overlapping content words drag the score up even when the meaning changes; missing content words drag it down even when the meaning is preserved. Use Semantic when meaning matters more than wording.
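The overlap computation described in this step can be sketched as follows. This is a minimal illustration, not the service's implementation: the filler-word list, the tokenizer, and the names content_words and lexical_overlap are all assumptions, and the thresholds that map a score to a verdict band are not shown because they are not documented here.

```python
import re

# Hypothetical filler-word list; the actual list used by the service is not documented.
STOPWORDS = {"the", "is", "a", "an", "will", "be", "your"}

def content_words(text: str) -> set[str]:
    """Lowercase the text, split it into words, and drop filler words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOPWORDS}

def lexical_overlap(reference: str, output: str) -> float:
    """Fraction of the reference's content words that also appear in the output."""
    ref = content_words(reference)
    if not ref:
        return 1.0  # nothing meaningful to lose
    return len(ref & content_words(output)) / len(ref)

reference = "The order ships tomorrow."
print(round(lexical_overlap(reference, "The order ships tomorrow."), 2))                 # 1.0
print(round(lexical_overlap(reference, "Your order will be shipped the next day."), 2))  # 0.33
print(round(lexical_overlap(reference, "The order won't ship tomorrow."), 2))            # 0.67
```

In this toy version the contradiction scores higher than the faithful paraphrase, which is the blind spot described above: shared content words raise the score regardless of meaning.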
Step 2: Semantic — tolerate rephrasing
40 credits
Semantic scoring uses language models to compare the underlying meaning of both texts. It understands that different words can express the same meaning.
Reference: "The order ships tomorrow."
Output: "Your order will be shipped the next day."
Semantic scoring understands these sentences mean the same thing despite different wording. This is the same output lexical called ambiguous — semantic correctly identifies the meaning is preserved.
Reference: "The order ships tomorrow."
Output: "Your item should go out in the morning."
The surface meaning is different — one talks about an order shipping tomorrow, the other about an item going out in the morning. Semantic scoring flags the meaning gap. But as we'll see in the inferential section, there's a logical relationship here that semantic misses.
Works for: Paraphrase-tolerant evaluation. When meaning matters more than exact wording.
Blind spot: Misses logical relationships. The second output above could actually be consistent with shipping tomorrow, but semantic does not reason about entailment.
Step 3: Inferential — catch factual errors
80 credits
Inferential scoring uses an LLM judge to determine factual consistency. It asks: does this output contradict the reference? It understands logical entailment and catches factual conflicts.
Reference: "The order ships tomorrow."
Output: "Your item should go out in the morning."
Read in context, an item that goes out in the morning is an order that ships tomorrow. Inferential scoring recognizes this logical relationship: the output communicates the same fact in completely different words. This is the kind of response a customer service LLM might produce, and a human reader understands it immediately. Lexical and semantic both flag it as inconsistent because the words and surface meaning differ. This is the same output that semantic flagged; inferential catches the logical entailment.
Reference: "The order ships tomorrow."
Output: "The order won't ship tomorrow."
Direct logical contradiction. Inferential scoring catches factual conflicts that could slip past lexical (similar words) and semantic (related topic) checks.
Works for: Factual consistency. Catching contradictions and logical conflicts. When you need to know if the output is logically inconsistent.
Blind spot: Statements can be logically consistent with the reference but not directly relevant to it. For example, "Your item is still in our warehouse" is logically consistent with our example reference, but is non-responsive to a question about when an item will ship.
Step 4: Combined — the full picture
100 credits
In production, you want all three perspectives. Combined runs lexical, semantic, and inferential scoring and gives you an overall verdict that requires unanimous agreement.
Verdicts:
Consistent: All methods agree the output is faithful
Ambiguous: The methods disagree
Inconsistent: All methods agree the output contradicts the reference
Reference: "The order ships tomorrow."
Output: "The order will ship tomorrow."
Lexical: ambiguous | Semantic: consistent | Inferential: consistent
"Will ship" vs "ships" trips lexical: the content words shift, so lexical lands in the ambiguous band. Semantic and inferential both recognize that the meaning is preserved. Because the three methods disagree and Combined requires unanimous agreement, the verdict is Ambiguous.
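The unanimous-agreement rule can be sketched as a small function. This is an illustration under assumptions: the Verdict enum and the combine function are hypothetical names, and "inconsistent" is taken as the label for the case where all methods agree the output contradicts the reference.

```python
from enum import Enum

class Verdict(Enum):
    # Hypothetical labels for the three outcomes described in this step.
    CONSISTENT = "consistent"
    AMBIGUOUS = "ambiguous"
    INCONSISTENT = "inconsistent"

def combine(lexical: Verdict, semantic: Verdict, inferential: Verdict) -> Verdict:
    """Return a clean verdict only when all three methods agree on it."""
    verdicts = {lexical, semantic, inferential}
    if verdicts == {Verdict.CONSISTENT}:
        return Verdict.CONSISTENT
    if verdicts == {Verdict.INCONSISTENT}:
        return Verdict.INCONSISTENT
    # Any disagreement, or agreement on "ambiguous", stays ambiguous.
    return Verdict.AMBIGUOUS

# The "will ship" example: lexical is ambiguous, the other two are consistent.
result = combine(Verdict.AMBIGUOUS, Verdict.CONSISTENT, Verdict.CONSISTENT)
print(result.value)  # ambiguous
```

Requiring unanimity is deliberately conservative: a single dissenting method is enough to send the output to review instead of passing it.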
Scoring Methods Overview
Start cheap, upgrade where it matters.
Lexical (20 credits): Catches rewording
Semantic (40 credits): Tolerates paraphrases
Inferential (80 credits): Understands logic
Combined (100 credits): All three at once
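To budget a test suite, the credit prices above can be turned into a quick estimate. The sketch below assumes credits are charged once per evaluation, which this page does not state explicitly; CREDITS and batch_cost are hypothetical names used only for illustration.

```python
# Credit prices as listed in the overview above.
CREDITS = {"lexical": 20, "semantic": 40, "inferential": 80, "combined": 100}

def batch_cost(method: str, evaluations: int) -> int:
    """Estimated credits for running one method over a batch (assumes per-evaluation pricing)."""
    if evaluations < 0:
        raise ValueError("evaluations must be non-negative")
    return CREDITS[method] * evaluations

# Running the three methods separately versus one Combined call.
separate = CREDITS["lexical"] + CREDITS["semantic"] + CREDITS["inferential"]
print(separate)                      # 140
print(CREDITS["combined"])           # 100
print(batch_cost("lexical", 1000))   # 20000
print(batch_cost("combined", 1000))  # 100000
```

Under that assumption, Combined costs less than running the three methods one by one, while a Lexical-only pass is five times cheaper than Combined for checks where rewording is the only concern.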