EndpointEvaluator — Regression Testing for LLM Outputs

Step 1: Create Your Account

Visit endpointevaluator.com/auth/signup
Enter your email address
Complete the captcha verification
Check your email and click the verification link
Copy your API key from the welcome page

Important: Your API key is shown only once, at signup. Store it securely; you'll need it for every API call.
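
One way to follow this advice is to keep the key in an environment variable instead of pasting it into each command. The sketch below assumes the variable name EE_API_KEY (the name the CI examples in this guide use); require_api_key is a hypothetical helper, not part of EndpointEvaluator.

```shell
#!/bin/sh
# Hypothetical helper: fail early if the key is missing, without ever printing it.
require_api_key() {
  if [ -z "${EE_API_KEY:-}" ]; then
    echo "EE_API_KEY is not set; export it before calling the API" >&2
    return 1
  fi
  return 0
}

# Placeholder value for this shell session only.
# Do not commit a line like this to a repository.
EE_API_KEY="YOUR_API_KEY"
export EE_API_KEY

require_api_key && echo "API key is available"
```

In CI, the same variable would come from the platform's secret store rather than from an export line.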

Step 2: Access Your Dashboard

After verifying your email, you'll be redirected to the dashboard at endpointevaluator.com/dashboard

Your dashboard shows your current credit balance, API key prefix (for identification), and claim status.

Step 3: Claim Free Credits

On your dashboard, complete the captcha to claim 500 free credits daily.

Free credits reset to 500 each day. Use them for development, testing, and experimentation.

Step 4: Run Your First Evaluation

Replace YOUR_API_KEY with the API key from Step 1:

# Basic evaluation with Inferential scoring
curl -X POST https://endpointevaluator.com/api/v1/evaluate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "output_text": "Paris is the capital of France",
    "reference_text": "The capital of France is Paris",
    "scoring_method": "inferential"
  }'

Expected response:

{
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "inferential",
  "verdict": "consistent",
  "credits_charged": 80,
  "credits_refunded": 0,
  "credits_consumed": 80,
  "credits_remaining": 420
}

Credit costs: Lexical 20, Semantic 40, Inferential 80, Combined 100. A cache hit costs half the listed credits. credits_charged and credits_consumed always match; both report the net amount after any refund. See the API docs for when credits_refunded is non-zero (for example, a 503 partial_combined_unavailable response is refunded in full).
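
To make the arithmetic concrete, here is a small sketch that works out the cost of a call and how many uncached calls fit in a balance. The helper names credit_cost and evaluations_within are hypothetical. The numbers come from the costs listed above, and refunds are not modelled.

```shell
#!/bin/sh
# Hypothetical helper: credits for one evaluation, using the listed costs.
# Usage: credit_cost METHOD [cached]
credit_cost() {
  case "$1" in
    lexical)     cost=20 ;;
    semantic)    cost=40 ;;
    inferential) cost=80 ;;
    combined)    cost=100 ;;
    *) echo "unknown scoring method: $1" >&2; return 1 ;;
  esac
  # A cache hit costs half the listed credits.
  if [ "${2:-}" = "cached" ]; then
    cost=$(( cost / 2 ))
  fi
  echo "$cost"
}

# How many uncached evaluations fit in a credit balance (integer division).
# Usage: evaluations_within BALANCE METHOD
evaluations_within() {
  per_call=$(credit_cost "$2") || return 1
  echo $(( $1 / per_call ))
}

credit_cost inferential            # prints 80
credit_cost combined cached        # prints 50
evaluations_within 500 inferential # prints 6
```

With the 500 free daily credits, that is 25 lexical, 12 semantic, 6 inferential, or 5 combined evaluations before any cache hits.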

Step 5: Explore Scoring Methods

Try different scoring methods to see how they handle the same inputs:

"scoring_method": "lexical" — 20 credits, literal text comparison

"scoring_method": "semantic" — 40 credits, meaning comparison

"scoring_method": "inferential" — 80 credits, factual agreement

"scoring_method": "combined" — 100 credits, all methods + overall verdict

All methods return a categorical verdict: consistent, ambiguous, or inconsistent.
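
Because the verdict is one of three fixed strings, a script can map it directly to an exit status. The sketch below shows one possible policy, not something the API prescribes. verdict_exit_status is a hypothetical helper, and failing on ambiguous in strict mode is an assumption you may not want.

```shell
#!/bin/sh
# Hypothetical helper: map a verdict string to a shell exit status.
# Usage: verdict_exit_status VERDICT [strict]
#   consistent   -> 0
#   ambiguous    -> 0, or 1 when "strict" is given
#   inconsistent -> 1
#   anything else (empty, "null", typo) -> 2, so a malformed response never passes
verdict_exit_status() {
  case "$1" in
    consistent)   return 0 ;;
    ambiguous)    if [ "${2:-}" = "strict" ]; then return 1; fi; return 0 ;;
    inconsistent) return 1 ;;
    *)            return 2 ;;
  esac
}

# Show the strict-mode status for each possible input.
for v in consistent ambiguous inconsistent null; do
  status=0
  verdict_exit_status "$v" strict || status=$?
  echo "$v (strict) -> exit $status"
done
```

Returning a distinct status for unrecognised values keeps a missing verdict field from being mistaken for a pass.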


CI/CD Integration

EndpointEvaluator works with any CI tool that can make HTTP requests. Each example below demonstrates a different integration pattern.
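
Whatever the CI tool, the request is the same curl call from Step 4. The examples in this section use curl -sf, which exits non-zero on an HTTP error and discards the response body. As a rough alternative, the sketch below keeps the status code and body separate so a pipeline can distinguish a 503 from other failures. The helper names, the retry-on-503 policy, and the treatment of 2xx as success are assumptions, not documented API behaviour.

```shell
#!/bin/sh
# Hypothetical helper: decide what to do with an HTTP status code.
# Assumption: 2xx means the evaluation succeeded, 503 is worth retrying,
# and everything else is a hard failure.
classify_status() {
  case "$1" in
    2??) echo "ok" ;;
    503) echo "retry" ;;
    *)   echo "fail" ;;
  esac
}

# Hypothetical wrapper around the evaluate endpoint from Step 4.
# Writes the response body to BODY_FILE and prints the HTTP status code.
# Usage: evaluate_once JSON_PAYLOAD BODY_FILE
evaluate_once() {
  curl -s -o "$2" -w '%{http_code}' \
    -X POST https://endpointevaluator.com/api/v1/evaluate \
    -H "X-API-Key: $EE_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# Only the classifier runs here; evaluate_once needs a real key and network access.
for code in 200 503 401; do
  echo "$code -> $(classify_status "$code")"
done
```

A pipeline step could call evaluate_once, pass the printed status to classify_status, and read the verdict from the body file only when the result is ok.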

GitHub Actions — Quality Gate

Fails the build if the LLM output is inconsistent with the reference (an ambiguous verdict still passes), and posts results to the job summary.

# .github/workflows/llm-eval.yml
name: LLM Evaluation
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Generate LLM output
        id: llm
        run: |
          # Your app logic here — produce the text to evaluate
          echo "output=Paris is the capital of France" >> "$GITHUB_OUTPUT"
          echo "reference=The capital of France is Paris" >> "$GITHUB_OUTPUT"

      - name: Evaluate with EndpointEvaluator
        id: eval
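        env:
          EE_API_KEY: ${{ secrets.EE_API_KEY }}
          # Pass step outputs through env vars instead of interpolating them
          # into the script, so quotes or $(...) in the text cannot break or
          # inject into the shell command.
          LLM_OUTPUT: ${{ steps.llm.outputs.output }}
          LLM_REFERENCE: ${{ steps.llm.outputs.reference }}
        run: |
          RESPONSE=$(curl -sf -X POST https://endpointevaluator.com/api/v1/evaluate \
            -H "X-API-Key: $EE_API_KEY" \
            -H "Content-Type: application/json" \
            -d "$(jq -n \
              --arg out "$LLM_OUTPUT" \
              --arg ref "$LLM_REFERENCE" \
              '{"output_text":$out,"reference_text":$ref,"scoring_method":"inferential"}')")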

          VERDICT=$(echo "$RESPONSE" | jq -r '.verdict')
          CREDITS=$(echo "$RESPONSE" | jq -r '.credits_consumed')

          echo "verdict=$VERDICT" >> "$GITHUB_OUTPUT"

          # Post results to the GitHub Actions job summary
          echo "### Evaluation Results" >> "$GITHUB_STEP_SUMMARY"
          echo "| Metric | Value |" >> "$GITHUB_STEP_SUMMARY"
          echo "|--------|-------|" >> "$GITHUB_STEP_SUMMARY"
          echo "| Verdict | \`$VERDICT\` |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Credits Used | $CREDITS |" >> "$GITHUB_STEP_SUMMARY"

      - name: Gate — fail on inconsistent
        if: steps.eval.outputs.verdict == 'inconsistent'
        run: |
          echo "::error::LLM output is inconsistent with reference"
          exit 1

GitLab CI — Batch Evaluation

Reads test cases from a JSON fixtures file, evaluates each one, and fails the pipeline if any case is inconsistent (ambiguous verdicts are reported but do not fail the job).

# test/fixtures/eval_cases.json — your test fixtures
# [
#   {"name":"capital","output":"Paris is the capital of France","reference":"The capital of France is Paris"},
#   {"name":"date","output":"WWII ended in 1945","reference":"World War II concluded in 1945"}
# ]

# .gitlab-ci.yml
# EE_API_KEY is expected to be defined as a masked CI/CD variable in the project settings
llm-eval:
  stage: test
  image: alpine:latest
  before_script:
    - apk add --no-cache curl jq
  script:
    - |
      CASES=$(cat test/fixtures/eval_cases.json)
      COUNT=$(echo "$CASES" | jq length)
      FAILED=0

      for i in $(seq 0 $(( COUNT - 1 ))); do
        NAME=$(echo "$CASES" | jq -r ".[$i].name")
        OUTPUT=$(echo "$CASES" | jq -r ".[$i].output")
        REF=$(echo "$CASES" | jq -r ".[$i].reference")

        RESPONSE=$(curl -sf -X POST https://endpointevaluator.com/api/v1/evaluate \
          -H "X-API-Key: $EE_API_KEY" \
          -H "Content-Type: application/json" \
          -d "$(jq -n --arg o "$OUTPUT" --arg r "$REF" \
            '{"output_text":$o,"reference_text":$r,"scoring_method":"inferential"}')")

        VERDICT=$(echo "$RESPONSE" | jq -r '.verdict')
        echo "[$NAME] verdict=$VERDICT"

        # Save each result as one compact line (-c) so the file is valid JSON Lines
        echo "$RESPONSE" | jq -c --arg name "$NAME" '. + {case_name: $name}' >> eval-results.jsonl

        if [ "$VERDICT" = "inconsistent" ]; then
          echo "  FAIL — $NAME is inconsistent"
          FAILED=$(( FAILED + 1 ))
        fi
      done

      echo ""
      echo "Evaluated $COUNT cases: $(( COUNT - FAILED )) passed, $FAILED failed"
      [ "$FAILED" -eq 0 ] || exit 1
  artifacts:
    paths:
      - eval-results.jsonl
    when: always
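
Once the job has written eval-results.jsonl, a later step or a local run can tally verdicts without calling the API again. The sketch below uses only grep. count_verdict is a hypothetical helper, and the sample file is made-up illustration data shaped like the saved responses, not real results.

```shell
#!/bin/sh
# Hypothetical helper: count results with a given verdict in a results file.
# Matches "verdict":"<value>" with or without whitespace after the colon, so it
# works for both compact and pretty-printed jq output.
# Usage: count_verdict FILE VERDICT
count_verdict() {
  # grep -c prints 0 and exits 1 when nothing matches; keep the 0, ignore the status.
  grep -Ec "\"verdict\"[[:space:]]*:[[:space:]]*\"$2\"" "$1" || true
}

# Illustrative sample only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"verdict":"consistent","case_name":"case_a"}
{"verdict":"inconsistent","case_name":"case_b"}
{"verdict":"consistent","case_name":"case_c"}
EOF

echo "consistent:   $(count_verdict "$sample" consistent)"
echo "inconsistent: $(count_verdict "$sample" inconsistent)"
echo "ambiguous:    $(count_verdict "$sample" ambiguous)"
rm -f "$sample"
```

The opening quote in the pattern keeps a search for consistent from also matching inconsistent.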

CircleCI — Combined Method with Artifacts

Runs a combined evaluation (lexical + semantic + inferential), saves the full response as a build artifact, and fails if the overall verdict is inconsistent (an ambiguous overall verdict still passes).

# .circleci/config.yml
version: 2.1
jobs:
  llm-eval:
    docker:
      - image: cimg/base:current
    environment:
      # Set these in your CircleCI project settings, or pass from a prior step
      LLM_OUTPUT: "Paris is the capital of France"
      REFERENCE_TEXT: "The capital of France is Paris"
    steps:
      - checkout
      - run:
          name: Evaluate LLM output
          command: |
            mkdir -p /tmp/eval-results

            RESPONSE=$(curl -sf -X POST https://endpointevaluator.com/api/v1/evaluate \
              -H "X-API-Key: $EE_API_KEY" \
              -H "Content-Type: application/json" \
              -d "$(jq -n \
                --arg out "$LLM_OUTPUT" \
                --arg ref "$REFERENCE_TEXT" \
                '{"output_text":$out,"reference_text":$ref,"scoring_method":"combined"}')")

            # Save full response as a build artifact
            echo "$RESPONSE" | jq '.' > /tmp/eval-results/evaluation.json

            VERDICT=$(echo "$RESPONSE" | jq -r '.verdict')
            CREDITS=$(echo "$RESPONSE" | jq -r '.credits_consumed')

            # Combined returns per-method verdicts — display them
            echo "Overall verdict: $VERDICT"
            echo "Method verdicts:"
            echo "$RESPONSE" | jq -r '.method_verdicts | to_entries[] | "  \(.key): \(.value)"'
            echo "Credits used: $CREDITS"

            # Fail if the overall verdict is inconsistent
            if [ "$VERDICT" = "inconsistent" ]; then
              echo "FAILED — output is inconsistent with reference"
              exit 1
            fi
            echo "PASSED"
      - store_artifacts:
          path: /tmp/eval-results
          destination: eval-results

workflows:
  evaluate:
    jobs:
      - llm-eval

Security: Store your API key as a secret in your CI platform. Never hardcode credentials in pipeline configs.
