EndpointEvaluator › Quick Start Guide
Get from zero to your first evaluation in under 5 minutes.
Step 1: Create Your Account
Visit endpointevaluator.com/auth/signup
- Enter your email address
- Complete the captcha verification
- Check your email and click the verification link
- Copy your API key from the welcome page
Important: Your API key is only shown once at signup. Store it securely — you'll need it for all API calls.
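One way to act on this note is to keep the key in a file only your user can read, then load it into an environment variable when you need it. The sketch below assumes a POSIX shell. The file path and the names EE_KEY_FILE, save_api_key, and load_api_key are illustrative and not part of EndpointEvaluator; EE_API_KEY simply matches the variable name used in the CI examples later in this guide.

```shell
# Hypothetical location for the key file; set EE_KEY_FILE to override it.
EE_KEY_FILE="${EE_KEY_FILE:-$HOME/.config/endpointevaluator/api_key}"

# Write the key ($1) to EE_KEY_FILE with owner-only permissions.
save_api_key() {
  (
    umask 077                              # new files and dirs: owner access only
    mkdir -p "$(dirname "$EE_KEY_FILE")"
    printf '%s\n' "$1" > "$EE_KEY_FILE"
    chmod 600 "$EE_KEY_FILE"               # also tighten a pre-existing file
  )
}

# Read the key back and export it as EE_API_KEY for later curl calls.
load_api_key() {
  if [ ! -r "$EE_KEY_FILE" ]; then
    echo "no readable key file at $EE_KEY_FILE" >&2
    return 1
  fi
  EE_API_KEY="$(cat "$EE_KEY_FILE")"
  export EE_API_KEY
}
```

Run save_api_key once with the key from the welcome page, then call load_api_key in each new shell and pass -H "X-API-Key: $EE_API_KEY" to curl.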
Step 2: Access Your Dashboard
After verifying your email, you'll be redirected to the dashboard at endpointevaluator.com/dashboard
Your dashboard shows your current credit balance, API key prefix (for identification), and claim status.
Step 3: Claim Free Credits
On your dashboard, complete the captcha to claim 500 free credits daily.
Free credits reset to 500 each day. Use them for development, testing, and experimentation.
Step 4: Run Your First Evaluation
Replace YOUR_API_KEY with the API key from Step 1:
# Basic evaluation with Inferential scoring
curl -X POST https://endpointevaluator.com/api/v1/evaluate \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "output_text": "Paris is the capital of France",
    "reference_text": "The capital of France is Paris",
    "scoring_method": "inferential"
  }'
Expected response:
{
  "evaluation_id": "eval_abc123xyz789",
  "scoring_method": "inferential",
  "verdict": "consistent",
  "credits_charged": 80,
  "credits_refunded": 0,
  "credits_consumed": 80,
  "credits_remaining": 420
}
Credit costs: Lexical 20, Semantic 40, Inferential 80, Combined 100. Cache hits cost half as many credits.
credits_charged and credits_consumed are equal; both are the net amount after any refund. See the API docs for when credits_refunded is non-zero (for example, a 503 partial_combined_unavailable issues a full refund).
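The credit prices above are easy to turn into a quick budget check. The sketch below is a local helper only; it makes no API calls. The function names credit_cost and evals_within_budget are hypothetical, and it assumes a cache hit costs exactly half the listed price.

```shell
# Credit price for a scoring method ($1), as quoted in this guide.
# Pass "cached" as $2 to get the halved cache-hit price.
credit_cost() {
  case "$1" in
    lexical)     cc_cost=20 ;;
    semantic)    cc_cost=40 ;;
    inferential) cc_cost=80 ;;
    combined)    cc_cost=100 ;;
    *) echo "unknown scoring method: $1" >&2; return 1 ;;
  esac
  if [ "${2:-}" = "cached" ]; then
    cc_cost=$(( cc_cost / 2 ))
  fi
  echo "$cc_cost"
}

# Number of uncached evaluations of method $2 that fit into budget $1.
evals_within_budget() {
  ewb_cost=$(credit_cost "$2") || return 1
  echo $(( $1 / ewb_cost ))
}
```

With the 500 free daily credits, evals_within_budget 500 inferential prints 6, and the same budget covers 25 lexical, 12 semantic, or 5 combined uncached evaluations.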
Step 5: Explore Scoring Methods
Try different scoring methods to see how they handle the same inputs:
- "scoring_method": "lexical" (20 credits): literal text comparison
- "scoring_method": "semantic" (40 credits): meaning comparison
- "scoring_method": "inferential" (80 credits): factual agreement
- "scoring_method": "combined" (100 credits): all methods plus an overall verdict
All methods return a categorical verdict: consistent, ambiguous, or inconsistent.
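Because the verdict is one of three fixed strings, a script can map it to an action with a plain case statement. The sketch below is one possible policy, not something the API prescribes: verdict_action and the EE_STRICT switch are hypothetical names, and treating ambiguous as a warning by default is an assumption you may want to change.

```shell
# Map a verdict ($1) to a CI action: pass, warn, fail, or error.
# EE_STRICT=1 (a hypothetical switch) turns "ambiguous" into a failure.
verdict_action() {
  case "$1" in
    consistent)   echo pass ;;
    ambiguous)
      if [ "${EE_STRICT:-0}" = "1" ]; then echo fail; else echo warn; fi ;;
    inconsistent) echo fail ;;
    *)            echo error ;;   # e.g. "null" when jq finds no .verdict field
  esac
}
```

In a pipeline this could be used as: [ "$(verdict_action "$VERDICT")" = "pass" ] || exit 1. The error branch matters because jq -r '.verdict' prints null when the field is missing, which a simple comparison against inconsistent would let through.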
CI/CD Integration
EndpointEvaluator works with any CI tool that can make HTTP requests. Each example below demonstrates a different integration pattern.
GitHub Actions — Quality Gate
Fails the build if the LLM output is inconsistent with the reference, and posts results to the job summary.
# .github/workflows/llm-eval.yml
name: LLM Evaluation
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Generate LLM output
        id: llm
        run: |
          # Your app logic here: produce the text to evaluate
          echo "output=Paris is the capital of France" >> "$GITHUB_OUTPUT"
          echo "reference=The capital of France is Paris" >> "$GITHUB_OUTPUT"

      - name: Evaluate with EndpointEvaluator
        id: eval
        env:
          EE_API_KEY: ${{ secrets.EE_API_KEY }}
          # Pass step outputs through env so quotes in the text cannot break the script
          LLM_OUTPUT: ${{ steps.llm.outputs.output }}
          REFERENCE_TEXT: ${{ steps.llm.outputs.reference }}
        run: |
          RESPONSE=$(curl -sf -X POST https://endpointevaluator.com/api/v1/evaluate \
            -H "X-API-Key: $EE_API_KEY" \
            -H "Content-Type: application/json" \
            -d "$(jq -n \
              --arg out "$LLM_OUTPUT" \
              --arg ref "$REFERENCE_TEXT" \
              '{"output_text":$out,"reference_text":$ref,"scoring_method":"inferential"}')")

          VERDICT=$(echo "$RESPONSE" | jq -r '.verdict')
          CREDITS=$(echo "$RESPONSE" | jq -r '.credits_consumed')
          echo "verdict=$VERDICT" >> "$GITHUB_OUTPUT"

          # Post results to the GitHub Actions job summary
          echo "### Evaluation Results" >> "$GITHUB_STEP_SUMMARY"
          echo "| Metric | Value |" >> "$GITHUB_STEP_SUMMARY"
          echo "|--------|-------|" >> "$GITHUB_STEP_SUMMARY"
          echo "| Verdict | \`$VERDICT\` |" >> "$GITHUB_STEP_SUMMARY"
          echo "| Credits Used | $CREDITS |" >> "$GITHUB_STEP_SUMMARY"

      - name: Gate - fail on inconsistent
        if: steps.eval.outputs.verdict == 'inconsistent'
        run: |
          echo "::error::LLM output is inconsistent with reference"
          exit 1
GitLab CI — Batch Evaluation
Reads test cases from a JSON fixtures file, evaluates each one, and fails the pipeline if any case is inconsistent.
# test/fixtures/eval_cases.json (your test fixtures)
# [
#   {"name":"capital","output":"Paris is the capital of France","reference":"The capital of France is Paris"},
#   {"name":"date","output":"WWII ended in 1945","reference":"World War II concluded in 1945"}
# ]

# .gitlab-ci.yml
llm-eval:
  stage: test
  image: alpine:latest
  before_script:
    - apk add --no-cache curl jq
  script:
    - |
      CASES=$(cat test/fixtures/eval_cases.json)
      COUNT=$(echo "$CASES" | jq length)
      FAILED=0

      for i in $(seq 0 $(( COUNT - 1 ))); do
        NAME=$(echo "$CASES" | jq -r ".[$i].name")
        OUTPUT=$(echo "$CASES" | jq -r ".[$i].output")
        REF=$(echo "$CASES" | jq -r ".[$i].reference")

        RESPONSE=$(curl -sf -X POST https://endpointevaluator.com/api/v1/evaluate \
          -H "X-API-Key: $EE_API_KEY" \
          -H "Content-Type: application/json" \
          -d "$(jq -n --arg o "$OUTPUT" --arg r "$REF" \
            '{"output_text":$o,"reference_text":$r,"scoring_method":"inferential"}')")

        VERDICT=$(echo "$RESPONSE" | jq -r '.verdict')
        echo "[$NAME] verdict=$VERDICT"

        # Save each result for the report artifact (-c keeps one JSON object per line)
        echo "$RESPONSE" | jq -c --arg name "$NAME" '. + {case_name: $name}' >> eval-results.jsonl

        if [ "$VERDICT" = "inconsistent" ]; then
          echo "  FAIL: $NAME is inconsistent"
          FAILED=$(( FAILED + 1 ))
        fi
      done

      echo ""
      echo "Evaluated $COUNT cases: $(( COUNT - FAILED )) passed, $FAILED failed"
      [ "$FAILED" -eq 0 ] || exit 1
  artifacts:
    paths:
      - eval-results.jsonl
    when: always
CircleCI — Combined Method with Artifacts
Runs a combined evaluation (lexical + semantic + inferential), saves the full response as a build artifact, and fails if the overall verdict is inconsistent.
# .circleci/config.yml
version: 2.1

jobs:
  llm-eval:
    docker:
      - image: cimg/base:current
    environment:
      # Set these in your CircleCI project settings, or pass from a prior step
      LLM_OUTPUT: "Paris is the capital of France"
      REFERENCE_TEXT: "The capital of France is Paris"
    steps:
      - checkout
      - run:
          name: Evaluate LLM output
          command: |
            mkdir -p /tmp/eval-results

            RESPONSE=$(curl -sf -X POST https://endpointevaluator.com/api/v1/evaluate \
              -H "X-API-Key: $EE_API_KEY" \
              -H "Content-Type: application/json" \
              -d "$(jq -n \
                --arg out "$LLM_OUTPUT" \
                --arg ref "$REFERENCE_TEXT" \
                '{"output_text":$out,"reference_text":$ref,"scoring_method":"combined"}')")

            # Save full response as a build artifact
            echo "$RESPONSE" | jq '.' > /tmp/eval-results/evaluation.json

            VERDICT=$(echo "$RESPONSE" | jq -r '.verdict')
            CREDITS=$(echo "$RESPONSE" | jq -r '.credits_consumed')

            # Combined returns per-method verdicts; display them
            echo "Overall verdict: $VERDICT"
            echo "Method verdicts:"
            echo "$RESPONSE" | jq -r '.method_verdicts | to_entries[] | "  \(.key): \(.value)"'
            echo "Credits used: $CREDITS"

            # Fail if the overall verdict is inconsistent
            if [ "$VERDICT" = "inconsistent" ]; then
              echo "FAILED: output is inconsistent with reference"
              exit 1
            fi
            echo "PASSED"
      - store_artifacts:
          path: /tmp/eval-results
          destination: eval-results

workflows:
  evaluate:
    jobs:
      - llm-eval
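The credit note in Step 4 mentions that a combined request can return a 503 partial_combined_unavailable with a full refund. One possible way to handle transient failures like that in any of the pipelines above is to retry with a growing delay. The helper below is a generic shell sketch, not something the EndpointEvaluator docs prescribe; the name retry and the attempt and delay values are illustrative. Because curl -f discards the response body on HTTP errors, this version cannot tell a 503 from other failures and retries on any non-zero exit.

```shell
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
# Runs the command until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay between attempts.
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    "$@" && return 0                       # success: stop retrying
    [ "$retry_n" -ge "$retry_max" ] && return 1
    echo "attempt $retry_n failed; retrying in ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_n=$(( retry_n + 1 ))
    retry_delay=$(( retry_delay * 2 ))
  done
}
```

For example, wrapping the request in a function such as call_api and running retry 3 2 call_api would try up to three times, waiting 2 and then 4 seconds between attempts.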
Security: Store your API key as a secret in your CI platform. Never hardcode credentials in pipeline configs.
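A pipeline step can also check that the secret actually arrived before it spends time on anything else, and log only a short prefix of the key when it needs to identify which key is in use. This is a minimal sketch: require_api_key and key_prefix are hypothetical helpers, and the 8-character prefix length is an assumption, not the format the dashboard uses.

```shell
# Fail early with a clear message when the secret is missing or empty.
require_api_key() {
  if [ -z "${EE_API_KEY:-}" ]; then
    echo "EE_API_KEY is not set; add it as a secret in your CI platform" >&2
    return 1
  fi
}

# Print only the first 8 characters of a key ($1), for log lines.
key_prefix() {
  printf '%.8s...\n' "$1"
}
```

Calling require_api_key || exit 1 at the top of a job script stops the run before any request is sent with an empty X-API-Key header.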
What's Next?
- Read the full API documentation — all endpoints, parameters, and response formats
- View pricing — when you're ready to scale beyond free tier
- Scoring methods deep dive — understand when to use each method