How to Evaluate LLM Output Quality in Your App

You've integrated an LLM into your application. It works in testing. Then production happens: users report nonsensical responses, costs explode because the model generates 10x more tokens than expected, and you have no systematic way to know if the outputs are actually good. The fundamental problem: LLM output quality is subjective and probabilistic, but your application needs objective, measurable criteria to determine success or failure.

This guide provides concrete methods to evaluate LLM output quality in production applications. You'll learn how to build evaluation pipelines, choose appropriate metrics for different use cases, implement automated quality checks, and detect degradation before users complain. These are not academic benchmarks—they're practical techniques used by teams running LLMs at scale in customer-facing products.

We'll cover five evaluation approaches: rule-based validation (fast, deterministic), model-based evaluation (using LLMs to judge LLMs), human evaluation pipelines, automated regression testing, and production monitoring strategies. By the end, you'll know which approach fits your specific use case and how to implement it.

Why Standard Software Testing Doesn't Work for LLMs

Traditional software testing relies on deterministic behavior: given the same input, you always get the same output. Write a unit test, it either passes or fails. LLMs violate this assumption fundamentally.

The same prompt can produce different outputs each time (even with temperature 0, minor variations occur). An output can be technically correct but unhelpful, or technically wrong but functionally useful. Edge cases are infinite—you can't enumerate all possible inputs like you can with a deterministic function.

Consider a simple task: summarizing customer feedback. What makes a "good" summary? Accuracy is one dimension, but so are conciseness, tone, actionability, and absence of hallucinations. Traditional testing can't capture this multi-dimensional quality space. You need evaluation methods designed for probabilistic, subjective outputs.

Key Insight: The goal isn't to achieve 100% accuracy (impossible with LLMs)—it's to understand your quality distribution and catch when it degrades. Track the percentage of outputs that meet your quality bar, not whether every single output is perfect.
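
To make "percentage of outputs that meet your quality bar" concrete, here is a minimal sketch that summarizes several runs of the same prompt. The function name `summarizeRuns` and the `meetsBar` predicate are assumptions for illustration; you would supply your own definition of the bar.

```javascript
// Summarize repeated runs of one prompt: how many met the bar, and how varied they were.
// `meetsBar` is a caller-supplied predicate (hypothetical name).
function summarizeRuns(outputs, meetsBar) {
  if (outputs.length === 0) {
    return { runs: 0, passRate: null, distinctOutputs: 0 };
  }
  const passed = outputs.filter(o => meetsBar(o)).length;
  return {
    runs: outputs.length,
    passRate: passed / outputs.length,
    distinctOutputs: new Set(outputs).size
  };
}

// Usage: three runs of the same summarization prompt
const runs = [
  'Customer reports a 5-day shipping delay.',
  'Customer reports a 5-day shipping delay.',
  'I cannot help with that.'
];
const summary = summarizeRuns(runs, o => o.includes('shipping'));
console.log(summary);
```

Tracking `passRate` over time, rather than asserting on any single output, is what the rest of this guide builds on.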

Rule-Based Validation: Fast, Deterministic Checks

Start with rule-based validation because it's fast, cheap, and catches obvious failures immediately. These are programmatic checks you can run on every single output in production with negligible overhead.

Format validation: If you expect JSON, verify it parses. If you expect a specific structure, validate the schema. This catches a huge class of failures where the model simply didn't follow instructions.

// Format validation for structured output
function validateOutput(output, expectedSchema) {
  const checks = {
    isValid: true,
    errors: []
  };

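  // Check 1: Valid JSON that parses to an object
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch (e) {
    checks.isValid = false;
    checks.errors.push('Invalid JSON format');
    return checks;
  }
  if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
    checks.isValid = false;
    checks.errors.push('Expected a JSON object');
    return checks;
  }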

  // Check 2: Required fields exist
  for (const field of expectedSchema.required) {
    if (!(field in parsed)) {
      checks.isValid = false;
      checks.errors.push(`Missing required field: ${field}`);
    }
  }

  // Check 3: Field types match
  for (const [field, type] of Object.entries(expectedSchema.types)) {
    if (field in parsed && typeof parsed[field] !== type) {
      checks.isValid = false;
      checks.errors.push(`Field ${field} has wrong type: expected ${type}, got ${typeof parsed[field]}`);
    }
  }

  // Check 4: Value constraints (only for fields that are present;
  // missing required fields were already reported in Check 2)
  if (expectedSchema.constraints) {
    for (const [field, constraint] of Object.entries(expectedSchema.constraints)) {
      if (field in parsed && !constraint(parsed[field])) {
        checks.isValid = false;
        checks.errors.push(`Field ${field} violates constraint`);
      }
    }
  }

  return checks;
}

// Usage
const schema = {
  required: ['name', 'email', 'age'],
  types: {
    name: 'string',
    email: 'string',
    age: 'number'
  },
  constraints: {
    age: (value) => value >= 0 && value <= 150,
    email: (value) => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value)
  }
};

const validation = validateOutput(llmResponse, schema);
if (!validation.isValid) {
  console.error('LLM output failed validation:', validation.errors);
  // Trigger retry or fallback
}
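
Before retrying, it can be worth attempting a cheap repair, since models often wrap otherwise valid JSON in introductory text or code fences. The sketch below is one possible approach, not a complete parser; `extractJson` is a hypothetical helper name. It returns a parsed object, so you would pass `JSON.stringify(recovered)` to `validateOutput` or adapt that function to accept objects.

```javascript
// Try to recover a JSON object from output that surrounds it with prose or fences.
// Returns the parsed object, or null if nothing parseable is found.
function extractJson(output) {
  const trimmed = output.trim();
  const candidates = [trimmed];

  // Fallback candidate: everything between the first "{" and the last "}"
  const start = trimmed.indexOf('{');
  const end = trimmed.lastIndexOf('}');
  if (start !== -1 && end > start) {
    candidates.push(trimmed.slice(start, end + 1));
  }

  for (const candidate of candidates) {
    try {
      const parsed = JSON.parse(candidate);
      if (parsed !== null && typeof parsed === 'object') return parsed;
    } catch (e) {
      // Not valid JSON; fall through to the next candidate
    }
  }
  return null;
}

// Usage
const recovered = extractJson('Sure! Here is the data:\n{"name": "Ada", "age": 36}\nLet me know.');
console.log(recovered); // { name: 'Ada', age: 36 }
```

Count how often the repair path is taken: a rising repair rate is itself a signal that the prompt or model has changed.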

Length constraints: Track output length. If responses are consistently too short, the model might not be understanding the task. If they're too long, you're wasting tokens and money.

// Length-based quality signals
function checkLengthQuality(output, task) {
  // Count words, ignoring leading/trailing whitespace and empty tokens
  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;

  const signals = {
    tooShort: false,
    tooLong: false,
    warnings: []
  };

  // Task-specific length expectations
  const expectations = {
    summarization: { min: 50, max: 500 },
    codeGeneration: { min: 20, max: 2000 },
    extraction: { min: 10, max: 200 }
  };

  const expected = expectations[task];
  if (!expected) return signals;

  if (wordCount < expected.min) {
    signals.tooShort = true;
    signals.warnings.push(`Output suspiciously short: ${wordCount} words`);
  }

  if (wordCount > expected.max) {
    signals.tooLong = true;
    signals.warnings.push(`Output too verbose: ${wordCount} words`);
  }

  return signals;
}

Content-based rules: Check for specific failure patterns you've observed. If the model sometimes returns "I cannot help with that" instead of following instructions, detect and flag those responses.

// Detect common failure patterns
function detectFailurePatterns(output) {
  const failurePatterns = [
    /I (?:cannot|can't|won't)/i,
    /(?:sorry|apologize)/i,
    /as an AI/i,
    /I don't have access/i,
    /\[PLACEHOLDER\]/i,
    /TODO:/i
  ];

  const detected = [];

  for (const pattern of failurePatterns) {
    if (pattern.test(output)) {
      detected.push(`Failure pattern detected: ${pattern.source}`);
    }
  }

  return {
    hasFailed: detected.length > 0,
    patterns: detected
  };
}

Hallucination detection (basic): For tasks that extract information from provided context, verify the output only uses information present in the input. This catches blatant hallucinations.

// Basic hallucination check: verify facts appear in source
function checkForHallucination(output, sourceContext) {
  // Extract factual claims from output (simplified)
  const claims = extractClaims(output);

  const hallucinations = [];

  for (const claim of claims) {
    // Check if the claim appears verbatim in the source (case-insensitive)
    if (!sourceContext.toLowerCase().includes(claim.toLowerCase())) {
      // Fallback: similarity score between 0 and 1. computeSimilarity is not
      // defined in this snippet; supply your own (e.g., embedding cosine similarity).
      const similarity = computeSimilarity(claim, sourceContext);
      if (similarity < 0.5) {
        hallucinations.push(claim);
      }
    }
  }

  return {
    likelyHallucinated: hallucinations.length > 0,
    suspiciousClaims: hallucinations,
    // Share of claims that could not be matched to the source (0-1)
    confidence: hallucinations.length / Math.max(claims.length, 1)
  };
}

function extractClaims(text) {
  // Extract sentences that make factual claims
  // This is simplified; production version would use NLP
  return text
    .split(/[.!?]+/)
    .filter(s => s.trim().length > 10)
    .map(s => s.trim());
}
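
The check above calls `computeSimilarity` without defining it. As a rough lexical stand-in (an assumption for illustration, not a true semantic measure), you could score the share of a claim's content words that also appear in the source. The stopword list and ASCII-only tokenizer here are deliberately minimal.

```javascript
// Lexical stand-in for computeSimilarity: share of a claim's content words
// that also appear in the source. Returns a value between 0 and 1.
const STOPWORDS = new Set(['the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'was', 'for', 'on', 'that']);

// Lowercase, split on non-alphanumeric characters, drop stopwords (ASCII only)
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(t => t.length > 0 && !STOPWORDS.has(t));
}

function computeSimilarity(claim, sourceContext) {
  const claimTokens = new Set(tokenize(claim));
  if (claimTokens.size === 0) return 1; // nothing to verify
  const sourceTokens = new Set(tokenize(sourceContext));
  let found = 0;
  for (const token of claimTokens) {
    if (sourceTokens.has(token)) found++;
  }
  return found / claimTokens.size;
}

const source = 'The order shipped on Monday and arrived five days late.';
console.log(computeSimilarity('The order arrived five days late', source)); // 1
console.log(computeSimilarity('The customer requested a full refund', source)); // 0
```

Word overlap misses paraphrases and negations, so treat low scores as candidates for review rather than confirmed hallucinations.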

Model-Based Evaluation: Using LLMs to Judge LLMs

For subjective quality dimensions (helpfulness, correctness, tone), use another LLM to evaluate outputs. This sounds circular, but research shows GPT-4 evaluations correlate strongly with human judgments for many tasks.

Why this works: Evaluating is easier than generating. A model that might struggle to write a perfect summary can reliably judge whether a given summary is good. You're using the LLM's understanding, not its generation capabilities.

The technique: give the evaluator LLM the original task, the output to evaluate, and clear criteria. Ask for a structured judgment (score + reasoning).

// LLM-based quality evaluation
async function evaluateWithLLM(task, input, output) {
  const evaluationPrompt = `You are evaluating the quality of an AI assistant's response.

Task given to assistant: ${task}
User input: ${input}
Assistant's response: ${output}

Evaluate the response on these criteria (score 1-5 for each):

1. Correctness: Is the information accurate?
2. Completeness: Does it fully address the task?
3. Clarity: Is it well-written and easy to understand?
4. Conciseness: Is it appropriately detailed without unnecessary verbosity?
5. Following instructions: Did it follow the task requirements?

Return your evaluation as JSON:
{
  "correctness": <1-5>,
  "completeness": <1-5>,
  "clarity": <1-5>,
  "conciseness": <1-5>,
  "followsInstructions": <1-5>,
  "overallScore": <1-5>,
  "reasoning": "<brief explanation>"
}`;

  const evaluation = await evaluatorLLM.complete(evaluationPrompt);
  return JSON.parse(evaluation);
}

// Example usage
const quality = await evaluateWithLLM(
  "Summarize this customer feedback",
  customerFeedback,
  generatedSummary
);

if (quality.overallScore < 3) {
  // Quality too low, regenerate or flag for review
  console.warn('Low quality output detected:', quality.reasoning);
}
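
Evaluator models can themselves return malformed JSON or out-of-range scores, so it helps to validate the judgment before acting on it. The following is a small sketch, assuming the field names from the prompt above; `parseEvaluation` is a hypothetical helper.

```javascript
// Parse and sanity-check an evaluator response. Returns null when the
// response is not a JSON object or any score falls outside 1-5.
const SCORE_FIELDS = [
  'correctness', 'completeness', 'clarity',
  'conciseness', 'followsInstructions', 'overallScore'
];

function parseEvaluation(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    return null;
  }
  if (parsed === null || typeof parsed !== 'object') return null;

  for (const field of SCORE_FIELDS) {
    const score = parsed[field];
    if (typeof score !== 'number' || score < 1 || score > 5) return null;
  }
  return parsed;
}
```

A `null` result means the evaluation itself failed; log it separately rather than counting it as a low-quality output.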

Reference-based evaluation: When you have a known-good reference answer, ask the LLM to compare the output against it.

async function evaluateAgainstReference(output, reference) {
  const prompt = `Compare these two responses to the same task.
Rate how similar they are in meaning and quality (score 1-10).

Reference (high-quality) response:
${reference}

Response to evaluate:
${output}

Return JSON:
{
  "similarityScore": <1-10>,
  "keyDifferences": ["list", "of", "differences"],
  "isAcceptable": <true|false>
}`;

  const evaluation = await evaluatorLLM.complete(prompt);
  return JSON.parse(evaluation);
}

Cost consideration: LLM-based evaluation costs money. For high-volume applications, evaluate a sample (e.g., 5-10% of outputs) rather than every output. Use rule-based validation on everything, LLM evaluation on a subset.

// Sample-based LLM evaluation
async function evaluateOutput(output, task, input) {
  // Always run fast rule-based checks
  const ruleChecks = validateOutput(output, schema);

  if (!ruleChecks.isValid) {
    return { quality: 'failed', reason: 'rule violation', details: ruleChecks };
  }

  // LLM evaluation on 10% sample
  if (Math.random() < 0.1) {
    const llmEval = await evaluateWithLLM(task, input, output);
    logEvaluation(output, llmEval); // Track for analysis
    return { quality: llmEval.overallScore >= 3 ? 'good' : 'poor', details: llmEval };
  }

  return { quality: 'passed_rules', skippedLLMEval: true };
}
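
`Math.random()` gives a different sampling decision each time, which makes it hard to reproduce which outputs were evaluated. One alternative is to hash a stable identifier so the same output always gets the same decision. This sketch uses the 32-bit FNV-1a hash; the `outputId` parameter is an assumption, since the examples above do not define an ID field.

```javascript
// Deterministic sampling: hash the output ID so the same ID always gets
// the same decision. Uses 32-bit FNV-1a.
function hashToUnitInterval(id) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) / 0x100000000; // map unsigned 32-bit value to [0, 1)
}

function shouldEvaluate(outputId, sampleRate = 0.1) {
  return hashToUnitInterval(outputId) < sampleRate;
}

// Usage: the decision for a given ID never changes between calls
console.log(shouldEvaluate('req-42') === shouldEvaluate('req-42')); // true
```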

Human Evaluation Pipelines

LLM-based evaluation is a proxy for human judgment, not a replacement. Build a human evaluation pipeline to establish ground truth and calibrate your automated metrics.

Representative sampling: Don't rely on purely random sampling. Sample to cover:

High-confidence outputs (to verify automated metrics are correct)
Low-confidence outputs (to find edge cases)
Diverse input types (different languages, formats, complexity levels)
Temporal distribution (samples from different times to detect drift)

// Intelligent sampling for human review
function selectSamplesForReview(outputs, n = 100) {
  const samples = [];

  // 40% random samples (baseline)
  const randomSamples = sampleRandom(outputs, n * 0.4);
  samples.push(...randomSamples);

  // 30% low-confidence samples (likely problems)
  const lowConfidence = outputs
    .filter(o => o.automatedScore < 3)
    .sort((a, b) => a.automatedScore - b.automatedScore)
    .slice(0, n * 0.3);
  samples.push(...lowConfidence);

  // 20% edge cases (unusual inputs or outputs)
  const edgeCases = outputs
    .filter(o => isEdgeCase(o))
    .slice(0, n * 0.2);
  samples.push(...edgeCases);

  // 10% recent outputs (detect temporal issues)
  const recent = [...outputs] // copy so the caller's array is not reordered
    .sort((a, b) => b.timestamp - a.timestamp)
    .slice(0, n * 0.1);
  samples.push(...recent);

  return deduplicateByVariety(samples, n);
}

function isEdgeCase(output) {
  return (
    output.length > 2000 || // Very long
    output.length < 50 || // Very short
    output.latency > 5000 || // Slow generation
    output.tokenCount > 1000 // Token-heavy
  );
}
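
The sampling code above relies on a `sampleRandom` helper that is not shown. A minimal version, assuming it should return distinct items chosen uniformly at random, could use a partial Fisher-Yates shuffle:

```javascript
// Pick up to `count` distinct items uniformly at random (partial Fisher-Yates).
// Does not modify the input array.
function sampleRandom(items, count) {
  const pool = [...items];
  const k = Math.min(Math.floor(count), pool.length);
  for (let i = 0; i < k; i++) {
    // Swap position i with a random position from i to the end
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}
```

Flooring `count` matters here because expressions like `n * 0.4` can produce non-integers when `n` is not a multiple of ten.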

Evaluation rubric: Provide human evaluators with clear, specific criteria. Vague instructions ("is this good?") produce inconsistent ratings.

// Example evaluation rubric for human reviewers
const evaluationRubric = {
  task: "Customer feedback summarization",
  criteria: [
    {
      dimension: "Accuracy",
      question: "Does the summary accurately reflect the feedback content?",
      scale: [
        { score: 1, description: "Contains factual errors or misrepresents feedback" },
        { score: 2, description: "Mostly accurate but misses key points" },
        { score: 3, description: "Accurate with minor omissions" },
        { score: 4, description: "Fully accurate" }
      ]
    },
    {
      dimension: "Completeness",
      question: "Are all important points from the feedback included?",
      scale: [
        { score: 1, description: "Misses major points" },
        { score: 2, description: "Covers main points but omits important details" },
        { score: 3, description: "Covers most important points" },
        { score: 4, description: "Comprehensive coverage" }
      ]
    },
    {
      dimension: "Conciseness",
      question: "Is the summary appropriately brief without losing meaning?",
      scale: [
        { score: 1, description: "Too verbose or too brief" },
        { score: 2, description: "Somewhat wordy or lacking detail" },
        { score: 3, description: "Good balance" },
        { score: 4, description: "Perfectly concise" }
      ]
    }
  ],
  overallJudgment: {
    acceptable: "Would you be comfortable showing this summary to a customer?",
    options: ["Yes", "No", "Needs minor edits"]
  }
};

Inter-rater reliability: Have multiple reviewers rate the same outputs. If they disagree significantly, your rubric is too vague or the task is too subjective. Simple percentage agreement is a quick first check; Cohen's kappa or a similar chance-corrected metric is more rigorous.

// Calculate inter-rater agreement
function calculateInterRaterAgreement(ratings) {
  // ratings: array of { evaluator, outputId, score }
  const outputScores = {};

  // Group ratings by output
  for (const rating of ratings) {
    if (!outputScores[rating.outputId]) {
      outputScores[rating.outputId] = [];
    }
    outputScores[rating.outputId].push(rating.score);
  }

  // Calculate percentage agreement
  let totalAgreement = 0;
  let totalOutputs = 0;

  for (const scores of Object.values(outputScores)) {
    if (scores.length < 2) continue;

    const allSame = scores.every(s => s === scores[0]);
    const withinOnePoint = scores.every(s => Math.abs(s - scores[0]) <= 1);

    if (allSame) {
      totalAgreement += 1;
    } else if (withinOnePoint) {
      totalAgreement += 0.5;
    }

    totalOutputs++;
  }

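  // No output was rated by two or more reviewers: nothing to compare
  if (totalOutputs === 0) {
    return { agreementRate: null, suggestion: "Not enough overlapping ratings" };
  }

  const agreementRate = totalAgreement / totalOutputs;

  return {
    agreementRate,
    suggestion: agreementRate < 0.6
      ? "Low agreement - clarify rubric or simplify criteria"
      : "Acceptable agreement"
  };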
}
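
Percentage agreement does not account for agreement that would happen by chance. For two raters, Cohen's kappa corrects for that: kappa = (observed agreement - expected agreement) / (1 - expected agreement). Here is a minimal sketch, assuming two parallel arrays of scores for the same items; the function name is illustrative.

```javascript
// Cohen's kappa for two raters scoring the same items.
// ratingsA[i] and ratingsB[i] are the two raters' scores for item i.
function cohensKappa(ratingsA, ratingsB) {
  if (ratingsA.length !== ratingsB.length || ratingsA.length === 0) {
    throw new Error('Both raters must score the same non-empty set of items');
  }
  const n = ratingsA.length;
  const countsA = new Map();
  const countsB = new Map();
  let agreements = 0;

  for (let i = 0; i < n; i++) {
    if (ratingsA[i] === ratingsB[i]) agreements++;
    countsA.set(ratingsA[i], (countsA.get(ratingsA[i]) || 0) + 1);
    countsB.set(ratingsB[i], (countsB.get(ratingsB[i]) || 0) + 1);
  }

  const observed = agreements / n;

  // Agreement expected by chance, from each rater's score distribution
  let expected = 0;
  for (const [score, countA] of countsA) {
    expected += (countA / n) * ((countsB.get(score) || 0) / n);
  }

  if (expected === 1) return 1; // both raters used one identical score throughout
  return (observed - expected) / (1 - expected);
}

// Usage: raters agree on 3 of 4 items
console.log(cohensKappa([1, 1, 2, 2], [1, 1, 2, 1])); // 0.5
```

In the usage example, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5: noticeably lower than the raw 75% suggests.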

Automated Regression Testing

Build a test suite of known inputs with known good outputs. Run this suite whenever you change prompts, switch models, or update your application. This catches regressions—cases where changes break previously working functionality.

Building the test dataset: Start with edge cases and known failure modes. Add examples whenever you discover a bug. Curate high-quality examples that cover diverse scenarios.

// Regression test suite structure
const testSuite = [
  {
    id: "summary-001",
    task: "summarization",
    input: "Long customer feedback about shipping delays...",
    expectedOutput: "Customer experienced 5-day shipping delay...",
    evaluationCriteria: {
      mustInclude: ["shipping", "delay", "5-day"],
      mustNotInclude: ["refund"], // If refund not mentioned in input
      maxLength: 200,
      minLength: 50
    }
  },
  {
    id: "extraction-002",
    task: "data extraction",
    input: "Email from user@example.com requesting demo...",
    expectedOutput: { email: "user@example.com", request: "demo" },
    evaluationCriteria: {
      exactMatch: false,
      requiredFields: ["email", "request"],
      emailMustBeValid: true
    }
  }
  // ... more test cases
];

// Run regression tests
async function runRegressionTests(model) {
  const results = {
    passed: 0,
    failed: 0,
    failures: []
  };

  for (const test of testSuite) {
    const output = await model.generate(test.task, test.input);
    const evaluation = evaluateTestCase(output, test.expectedOutput, test.evaluationCriteria);

    if (evaluation.passed) {
      results.passed++;
    } else {
      results.failed++;
      results.failures.push({
        testId: test.id,
        reason: evaluation.reason,
        expected: test.expectedOutput,
        actual: output
      });
    }
  }

  return results;
}

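function evaluateTestCase(actual, expected, criteria) {
  // Text checks run on a string form of the output (structured output is serialized)
  const text = typeof actual === 'string' ? actual : JSON.stringify(actual);
  const lowerText = text.toLowerCase();

  // Check required inclusions
  if (criteria.mustInclude) {
    for (const term of criteria.mustInclude) {
      if (!lowerText.includes(term.toLowerCase())) {
        return { passed: false, reason: `Missing required term: ${term}` };
      }
    }
  }

  // Check forbidden terms
  if (criteria.mustNotInclude) {
    for (const term of criteria.mustNotInclude) {
      if (lowerText.includes(term.toLowerCase())) {
        return { passed: false, reason: `Contains forbidden term: ${term}` };
      }
    }
  }

  // Check length constraints (in characters)
  if (criteria.maxLength && text.length > criteria.maxLength) {
    return { passed: false, reason: `Output too long: ${text.length} > ${criteria.maxLength}` };
  }
  if (criteria.minLength && text.length < criteria.minLength) {
    return { passed: false, reason: `Output too short: ${text.length} < ${criteria.minLength}` };
  }

  // For structured output, check fields
  if (criteria.requiredFields) {
    let parsed;
    try {
      parsed = typeof actual === 'string' ? JSON.parse(actual) : actual;
    } catch (e) {
      return { passed: false, reason: 'Output is not valid JSON' };
    }
    for (const field of criteria.requiredFields) {
      if (!(field in parsed)) {
        return { passed: false, reason: `Missing field: ${field}` };
      }
    }
  }

  return { passed: true };
}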

Continuous regression testing: Run the test suite automatically on every deployment or model change. Track pass rates over time to detect gradual degradation.

// CI/CD integration for LLM testing
async function cicdQualityGate() {
  console.log("Running LLM regression tests...");

  const results = await runRegressionTests(currentModel);

  const passRate = results.passed / (results.passed + results.failed);

  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
  console.log(`Passed: ${results.passed}, Failed: ${results.failed}`);

  if (results.failures.length > 0) {
    console.log("\nFailures:");
    for (const failure of results.failures) {
      console.log(`  ${failure.testId}: ${failure.reason}`);
    }
  }

  // Quality gate: require 90% pass rate
  if (passRate < 0.9) {
    throw new Error(`Quality gate failed: pass rate ${(passRate * 100).toFixed(1)}% < 90%`);
  }

  console.log("✓ Quality gate passed");
}

Production Monitoring and Alerting

Quality evaluation doesn't stop at deployment. Monitor production outputs continuously to detect degradation, model drift, or infrastructure issues.

Metrics to track in production:

Metric	What It Detects	Alert Threshold
Validation failure rate	Model not following output format	> 5% failures
Average output length	Model becoming verbose or terse	±30% from baseline
Latency (p95, p99)	Infrastructure issues	p95 > 3s
Cost per request	Unexpected token usage	+50% from baseline
User feedback (thumbs up/down)	Subjective quality issues	< 70% positive
Retry rate	Users dissatisfied with output	> 15% of requests

// Production monitoring implementation
class LLMMonitor {
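  constructor(baseline) {
    // Baseline metrics to compare against (at least { avgLength }),
    // measured before launch
    this.baseline = baseline;
    this.metrics = {
      total: 0,
      validationFailures: 0,
      lengths: [],
      latencies: [],
      costs: [],
      userFeedback: { positive: 0, negative: 0 }
    };
  }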

  recordOutput(output, metadata) {
    this.metrics.total++;

    // Validation
    if (!metadata.isValid) {
      this.metrics.validationFailures++;
    }

    // Length tracking
    this.metrics.lengths.push(output.length);

    // Latency
    this.metrics.latencies.push(metadata.latency);

    // Cost
    this.metrics.costs.push(metadata.tokenCount * metadata.costPerToken);

    // Check for anomalies
    if (this.metrics.total % 100 === 0) {
      this.checkForAnomalies();
    }
  }

  recordFeedback(outputId, isPositive) {
    if (isPositive) {
      this.metrics.userFeedback.positive++;
    } else {
      this.metrics.userFeedback.negative++;
    }
  }

  checkForAnomalies() {
    const recentWindow = 100;
    const recent = {
      // Note: failure rate is cumulative since startup; the other metrics use the recent window
      failureRate: this.metrics.validationFailures / this.metrics.total,
      avgLength: average(this.metrics.lengths.slice(-recentWindow)),
      p95Latency: percentile(this.metrics.latencies.slice(-recentWindow), 95),
      avgCost: average(this.metrics.costs.slice(-recentWindow))
    };

    // Compare to baseline (stored separately)
    const alerts = [];

    if (recent.failureRate > 0.05) {
      alerts.push({
        severity: 'high',
        metric: 'validation_failures',
        message: `Validation failure rate ${(recent.failureRate * 100).toFixed(1)}% exceeds 5% threshold`
      });
    }

    if (Math.abs(recent.avgLength - this.baseline.avgLength) / this.baseline.avgLength > 0.3) {
      alerts.push({
        severity: 'medium',
        metric: 'output_length',
        message: `Average output length deviated ${((recent.avgLength - this.baseline.avgLength) / this.baseline.avgLength * 100).toFixed(1)}% from baseline`
      });
    }

    if (recent.p95Latency > 3000) {
      alerts.push({
        severity: 'high',
        metric: 'latency',
        message: `P95 latency ${recent.p95Latency}ms exceeds 3000ms threshold`
      });
    }

    if (alerts.length > 0) {
      this.sendAlerts(alerts);
    }
  }

  sendAlerts(alerts) {
    // Integration with monitoring system (PagerDuty, Datadog, etc.)
    for (const alert of alerts) {
      console.error(`[LLM QUALITY ALERT] ${alert.severity}: ${alert.message}`);
      // Send to monitoring service
      monitoringService.sendAlert(alert);
    }
  }

  getStats() {
    const totalFeedback = this.metrics.userFeedback.positive + this.metrics.userFeedback.negative;
    return {
      totalOutputs: this.metrics.total,
      validationFailureRate: this.metrics.validationFailures / this.metrics.total,
      avgLength: average(this.metrics.lengths),
      p50Latency: percentile(this.metrics.latencies, 50),
      p95Latency: percentile(this.metrics.latencies, 95),
      avgCost: average(this.metrics.costs),
      userSatisfaction: totalFeedback > 0
        ? this.metrics.userFeedback.positive / totalFeedback
        : null
    };
  }
}

// Helper functions
function average(arr) {
  return arr.reduce((a, b) => a + b, 0) / arr.length;
}

function percentile(arr, p) {
  const sorted = [...arr].sort((a, b) => a - b);
  const index = Math.ceil(sorted.length * p / 100) - 1;
  return sorted[index];
}

Pro Tip: Set up weekly quality reports summarizing key metrics and trends. Include example outputs that passed/failed various checks. This keeps quality top-of-mind and helps you spot gradual degradation that might not trigger immediate alerts.

Comparative Evaluation: A/B Testing Models and Prompts

When comparing two models or prompt versions, run them side-by-side on the same inputs and measure quality differences.

// A/B test framework for LLM changes
class LLMABTest {
  constructor(variantA, variantB, trafficSplit = 0.5) {
    this.variants = { A: variantA, B: variantB };
    this.trafficSplit = trafficSplit;
    this.results = { A: [], B: [] };
  }

  async generate(input) {
    // Assign to variant
    const variant = Math.random() < this.trafficSplit ? 'A' : 'B';
    const startTime = Date.now();

    const output = await this.variants[variant].generate(input);

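    const result = {
      id: `${variant}-${this.results[variant].length}`, // unique within this test
      variant,
      input,
      output,
      latency: Date.now() - startTime,
      timestamp: Date.now()
    };

    this.results[variant].push(result);

    return {
      output,
      id: result.id, // pass this back to recordFeedback
      variant // Optional: include variant in response for tracking
    };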
  }

  recordFeedback(outputId, feedback) {
    // Find result and add feedback
    for (const variant of ['A', 'B']) {
      const result = this.results[variant].find(r => r.id === outputId);
      if (result) {
        result.userFeedback = feedback;
        break;
      }
    }
  }

  analyze() {
    const analysis = {};

    for (const variant of ['A', 'B']) {
      const results = this.results[variant];

      analysis[variant] = {
        sampleSize: results.length,
        avgLatency: average(results.map(r => r.latency)),
        avgLength: average(results.map(r => r.output.length)),
        userSatisfaction: results.filter(r => r.userFeedback === 'positive').length /
                         results.filter(r => r.userFeedback).length || 0
      };
    }

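    // Statistical significance test (only results that received feedback,
    // so unrated outputs are not counted as negative)
    const toScores = results => results
      .filter(r => r.userFeedback)
      .map(r => r.userFeedback === 'positive' ? 1 : 0);
    const significanceTest = this.tTest(toScores(this.results.A), toScores(this.results.B));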

    analysis.conclusion = {
      winner: analysis.A.userSatisfaction > analysis.B.userSatisfaction ? 'A' : 'B',
      isSignificant: significanceTest.pValue < 0.05,
      pValue: significanceTest.pValue
    };

    return analysis;
  }

  tTest(groupA, groupB) {
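    // Simplified t-test implementation. `variance` and `tDistribution`
    // (two-sided p-value for |t| with df degrees of freedom) are not defined
    // here; supply them yourself or use a statistics library.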
    const meanA = average(groupA);
    const meanB = average(groupB);
    const varianceA = variance(groupA);
    const varianceB = variance(groupB);

    const t = (meanA - meanB) / Math.sqrt(varianceA / groupA.length + varianceB / groupB.length);
    const df = groupA.length + groupB.length - 2;

    // Convert t to p-value (simplified)
    const pValue = tDistribution(Math.abs(t), df);

    return { t, pValue };
  }
}

// Usage
const abTest = new LLMABTest(currentModel, newModel);

// Run for a week, then analyze
setTimeout(async () => {
  const analysis = abTest.analyze();
  console.log('A/B Test Results:', analysis);

  if (analysis.conclusion.isSignificant && analysis.conclusion.winner === 'B') {
    console.log('New model wins! Rolling out to 100%');
    switchToModel(newModel);
  }
}, 7 * 24 * 60 * 60 * 1000); // 1 week
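
The `tTest` method above depends on `variance` and `tDistribution` helpers that are not shown. For thumbs-up/thumbs-down data, a two-proportion z-test is a common alternative that needs no external helpers. The sketch below is that swapped-in technique, not the t-test from the class; it uses a normal approximation, which is reasonable once each variant has a few dozen ratings, and a standard polynomial approximation of the error function.

```javascript
// Error function via the Abramowitz-Stegun 7.1.26 approximation (error below ~1.5e-7)
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Two-sided two-proportion z-test on positive-feedback counts
function twoProportionZTest(positiveA, totalA, positiveB, totalB) {
  const pA = positiveA / totalA;
  const pB = positiveB / totalB;
  const pooled = (positiveA + positiveB) / (totalA + totalB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  if (se === 0) return { z: 0, pValue: 1 }; // all ratings identical
  const z = (pA - pB) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, pValue };
}

// Usage: 700/1000 positive for A versus 750/1000 for B
const test = twoProportionZTest(700, 1000, 750, 1000);
console.log(test.pValue < 0.05); // true
```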

Domain-Specific Quality Metrics

Different use cases need different quality metrics. Here are evaluation strategies for common LLM applications:

Code Generation

Beyond syntax correctness, evaluate functional correctness by running generated code against test cases.

// Evaluate generated code
async function evaluateGeneratedCode(generatedCode, testCases) {
  const results = {
    syntaxValid: false,
    testsRun: 0,
    testsPassed: 0,
    errors: []
  };

  // Check syntax
  try {
    new Function(generatedCode); // JavaScript syntax check
    results.syntaxValid = true;
  } catch (e) {
    results.errors.push(`Syntax error: ${e.message}`);
    return results;
  }

  // Run test cases
  for (const test of testCases) {
    results.testsRun++;
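    try {
      // Assumes the generated code reads `input` and defines `result`.
      // Caution: this executes model-generated code in-process; only do so
      // in an isolated environment.
      const fn = new Function('input', generatedCode + '\nreturn result;');
      const output = fn(test.input);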

      if (deepEqual(output, test.expected)) {
        results.testsPassed++;
      } else {
        results.errors.push(`Test failed: expected ${JSON.stringify(test.expected)}, got ${JSON.stringify(output)}`);
      }
    } catch (e) {
      results.errors.push(`Runtime error: ${e.message}`);
    }
  }

  return results;
}
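
Generated code can loop forever or touch globals, so it is safer to run it with a time limit and a fresh global scope. The sketch below uses Node's built-in `vm` module for that. It assumes the generated code reads `input` and assigns to `result` without declaring it with `const` or `let`; `runWithTimeout` is a hypothetical helper name. Note that `vm` is not a security boundary, so untrusted code still belongs in a separate process or container.

```javascript
const vm = require('node:vm');

// Run generated code in a fresh context with a time limit.
// The code is expected to read `input` and assign to `result`.
function runWithTimeout(generatedCode, input, timeoutMs = 1000) {
  const sandbox = { input, result: undefined };
  try {
    vm.runInNewContext(generatedCode, sandbox, { timeout: timeoutMs });
    return { ok: true, result: sandbox.result };
  } catch (e) {
    // Covers syntax errors, runtime errors, and timeouts
    return { ok: false, error: e.message };
  }
}

// Usage
console.log(runWithTimeout('result = input * 2;', 21)); // { ok: true, result: 42 }
console.log(runWithTimeout('while (true) {}', null, 50).ok); // false
```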

Summarization

Measure factual consistency (does summary contain info not in source?), coverage (does it include key points?), and conciseness.

// Evaluate summarization quality
function evaluateSummary(source, summary) {
  const metrics = {};

  // Compression ratio
  metrics.compressionRatio = summary.length / source.length;

  // Extractiveness (what % of summary 5-grams come directly from source)
  const summaryNgrams = getNgrams(summary, 5);
  const sourceNgrams = new Set(getNgrams(source, 5)); // Set for fast lookup
  const overlap = summaryNgrams.filter(ng => sourceNgrams.has(ng)).length;
  metrics.extractiveness = summaryNgrams.length > 0 ? overlap / summaryNgrams.length : 0;

  // Novel information (potential hallucination)
  const summaryWords = new Set(summary.toLowerCase().split(/\s+/));
  const sourceWords = new Set(source.toLowerCase().split(/\s+/));
  const novelWords = [...summaryWords].filter(w => !sourceWords.has(w));
  metrics.novelWordRatio = novelWords.length / summaryWords.size;

  // Quality signals
  metrics.signals = {
    goodCompression: metrics.compressionRatio < 0.3 && metrics.compressionRatio > 0.05,
    likelyFactual: metrics.novelWordRatio < 0.2,
    notTooExtractive: metrics.extractiveness < 0.8 // Some abstraction is good
  };

  return metrics;
}
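
`evaluateSummary` calls a `getNgrams` helper that is not shown. A minimal version, assuming word-level n-grams with simple whitespace tokenization, might look like this:

```javascript
// Word-level n-grams, lowercased, with words joined by single spaces.
function getNgrams(text, n) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const ngrams = [];
  for (let i = 0; i + n <= words.length; i++) {
    ngrams.push(words.slice(i, i + n).join(' '));
  }
  return ngrams;
}

console.log(getNgrams('The quick brown fox', 2));
// [ 'the quick', 'quick brown', 'brown fox' ]
```

Punctuation stays attached to words here, so "late." and "late" count as different tokens; strip punctuation first if that matters for your data.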

Data Extraction

Measure precision (% of extracted data that's correct) and recall (% of available data that was extracted).

// Evaluate extraction quality
function evaluateExtraction(extracted, groundTruth) {
  let tp = 0; // True positives (correctly extracted)
  let fp = 0; // False positives (incorrectly extracted)
  let fn = 0; // False negatives (missed extractions)

  // Count matches
  for (const [field, value] of Object.entries(extracted)) {
    if (field in groundTruth) {
      if (groundTruth[field] === value) {
        tp++;
      } else {
        fp++; // Extracted wrong value
      }
    } else {
      fp++; // Extracted field that shouldn't exist
    }
  }

  // Count misses
  for (const field of Object.keys(groundTruth)) {
    if (!(field in extracted)) {
      fn++;
    }
  }

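  // Guard against division by zero when nothing was extracted or expected
  const precision = tp + fp > 0 ? tp / (tp + fp) : 0;
  const recall = tp + fn > 0 ? tp / (tp + fn) : 0;
  const f1 = precision + recall > 0 ? 2 * (precision * recall) / (precision + recall) : 0;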

  return { precision, recall, f1 };
}

Frequently Asked Questions

How many samples do I need for statistically significant A/B tests?

For detecting a 5-percentage-point difference in a quality rate (such as thumbs-up rate) with 95% confidence, you need roughly 1,000-2,000 samples per variant; the exact figure depends on your baseline rate and the statistical power you want. For smaller differences (2-3 points), you need roughly 5,000-10,000 samples. Use an online sample size calculator, and run tests for at least a week to account for temporal variations in user behavior.
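
If you would rather compute the number yourself, the standard approximation for comparing two proportions is n = (z_alpha + z_beta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2 per variant. The sketch below assumes a two-sided test at 95% confidence (z = 1.96) and 80% power (z = 0.84); the 70% baseline in the usage lines is an illustrative assumption, not a figure from this guide.

```javascript
// Approximate samples needed per variant for a two-sided two-proportion test.
// zAlpha = 1.96 (95% confidence), zBeta = 0.84 (80% power).
function sampleSizePerVariant(baselineRate, minDetectableDiff, zAlpha = 1.96, zBeta = 0.84) {
  const p1 = baselineRate;
  const p2 = baselineRate + minDetectableDiff;
  const varianceSum = p1 * (1 - p1) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * varianceSum) / (minDetectableDiff ** 2));
}

// Usage: 70% baseline satisfaction
console.log(sampleSizePerVariant(0.7, 0.05)); // 1247
console.log(sampleSizePerVariant(0.7, 0.02)); // 8068
```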

Should I evaluate every single output in production?

Run fast rule-based validation on 100% of outputs. Run expensive LLM-based evaluation on a sample (5-10%). Reserve human evaluation for a smaller sample (1%) or for outputs flagged by automated checks. Full evaluation on every output is cost-prohibitive for high-volume applications.

How do I establish baseline quality metrics when first deploying?

Run your evaluation pipeline on development/staging data for 1-2 weeks before production launch. Calculate baseline metrics (average length, validation pass rate, latency percentiles). Use these as thresholds for production alerts. Update baselines quarterly as your application and user patterns evolve.
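
As a rough illustration, a baseline can be computed from logged staging records. The record shape `{ length, latency, isValid }` and the function names are assumptions for this sketch; adapt them to whatever your pipeline logs.

```javascript
// Nearest-rank percentile over a copy of the values
function percentileOf(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.max(0, Math.ceil(sorted.length * p / 100) - 1);
  return sorted[index];
}

// Build baseline metrics from staging records: { length, latency, isValid }
function computeBaseline(records) {
  if (records.length === 0) throw new Error('Need at least one record');
  const lengths = records.map(r => r.length);
  const latencies = records.map(r => r.latency);
  const valid = records.filter(r => r.isValid).length;
  return {
    sampleSize: records.length,
    avgLength: lengths.reduce((a, b) => a + b, 0) / lengths.length,
    validationPassRate: valid / records.length,
    p50Latency: percentileOf(latencies, 50),
    p95Latency: percentileOf(latencies, 95)
  };
}
```

Store the result alongside the date and sample size so that later baseline updates can be compared against it.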

What if automated and human evaluations disagree?

Trust humans. Automated metrics are proxies for human judgment, not replacements. If automated metrics rate outputs highly but humans rate them poorly, your automated metrics are miscalibrated. Revise your evaluation criteria based on human feedback patterns.

How do I handle subjective quality when different users want different things?

Segment quality metrics by user type or use case. What's high-quality for technical users might be too verbose for casual users. Track quality separately for different segments and optimize for each. Consider personalization—different model configurations or prompts for different user segments.
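
A simple way to start is to tag each feedback record with a segment and compute satisfaction per group. This sketch assumes records shaped like `{ segment, isPositive }`; both field names are hypothetical.

```javascript
// Group feedback records by segment and compute the positive-feedback rate for each.
function satisfactionBySegment(records) {
  const totals = new Map();
  for (const { segment, isPositive } of records) {
    const entry = totals.get(segment) || { positive: 0, total: 0 };
    entry.total++;
    if (isPositive) entry.positive++;
    totals.set(segment, entry);
  }

  const rates = {};
  for (const [segment, { positive, total }] of totals) {
    rates[segment] = { total, satisfaction: positive / total };
  }
  return rates;
}

// Usage
console.log(satisfactionBySegment([
  { segment: 'technical', isPositive: true },
  { segment: 'technical', isPositive: true },
  { segment: 'casual', isPositive: true },
  { segment: 'casual', isPositive: false }
]));
// { technical: { total: 2, satisfaction: 1 }, casual: { total: 2, satisfaction: 0.5 } }
```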

Can I use smaller models to evaluate larger models?

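As a rule of thumb, a stronger model can judge a weaker one more reliably than the reverse: GPT-4 is generally a dependable evaluator of GPT-3.5 output, while GPT-3.5 is a much less dependable evaluator of GPT-4 output, because the evaluator needs to recognize quality it cannot produce itself. Use a model at least as capable as the one you're evaluating. Within that limit, choose the smallest model that can reliably judge your production model's outputs to keep costs down.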

How do I detect model drift over time?

Run your regression test suite weekly. Track metrics like pass rate, average scores, and specific failure patterns. If performance degrades on tests that previously passed, investigate whether input distributions changed, the model API was updated, or prompts need adjustment. Model drift is often subtle and gradual.
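
One simple way to flag gradual degradation is to compare the latest weekly pass rate with the average of the preceding weeks. The window of four weeks and the five-point drop threshold below are illustrative assumptions; tune both to your own variance.

```javascript
// Flag drift when the latest pass rate falls more than `maxDrop`
// below the average of the previous `window` runs.
function detectDrift(weeklyPassRates, window = 4, maxDrop = 0.05) {
  if (weeklyPassRates.length < window + 1) {
    return { drifted: false, reason: 'not enough history' };
  }
  const latest = weeklyPassRates[weeklyPassRates.length - 1];
  const previous = weeklyPassRates.slice(-(window + 1), -1);
  const baseline = previous.reduce((a, b) => a + b, 0) / previous.length;
  const drop = baseline - latest;
  return { drifted: drop > maxDrop, baseline, latest, drop };
}

// Usage: pass rate falls from about 95% to 85%
console.log(detectDrift([0.95, 0.96, 0.94, 0.95, 0.85]).drifted); // true
```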

What's the minimum test suite size for effective regression testing?

Start with 50-100 carefully chosen test cases covering common scenarios, edge cases, and known failure modes. Add 5-10 new cases monthly based on production issues. A good test suite has high coverage of important scenarios, not necessarily high quantity. Quality and diversity of test cases matter more than size.

How do I handle evaluation when ground truth doesn't exist?

For creative tasks without single correct answers (content generation, brainstorming), use comparative evaluation instead of absolute evaluation. Present two outputs to users/evaluators and ask which is better. Or use a reference set of high-quality examples and measure similarity to those. Focus on consistency and absence of obvious failures rather than perfection.
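
For pairwise comparisons, the core bookkeeping is a tally of which variant was preferred. Here is a minimal sketch, assuming each judgment is recorded as the string 'A', 'B', or 'tie'; the function name is illustrative.

```javascript
// Tally pairwise preferences. Each judgment is 'A', 'B', or 'tie'.
function tallyPreferences(judgments) {
  const counts = { A: 0, B: 0, tie: 0 };
  for (const judgment of judgments) {
    if (!Object.prototype.hasOwnProperty.call(counts, judgment)) {
      throw new Error(`Unknown judgment: ${judgment}`);
    }
    counts[judgment]++;
  }
  // Win rate for A among comparisons where a preference was expressed
  const decided = counts.A + counts.B;
  return { ...counts, winRateA: decided > 0 ? counts.A / decided : null };
}

// Usage
console.log(tallyPreferences(['A', 'A', 'B', 'tie']));
```

Randomize which output is shown first to each evaluator, since position can bias preferences.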

Should I track prompt effectiveness separately from model quality?

Yes. When quality degrades, you need to know if it's the model or the prompt. Track metrics per prompt template. If one prompt consistently produces lower quality, revise it. When you update prompts, compare quality before/after. Treat prompts as code—version them, test changes, and roll back if quality decreases.

Conclusion

Evaluating LLM output quality is an ongoing process, not a one-time check. Combine multiple evaluation methods: fast rule-based validation for every output, LLM-based evaluation for samples, human evaluation for ground truth, regression testing for changes, and production monitoring for drift detection. No single method is sufficient—you need layers of quality assurance.

Start simple: implement format validation and basic length checks first. Add automated regression tests for your most critical use cases. Gradually introduce LLM-based evaluation and human review pipelines as you scale. Build production monitoring from day one—you can't improve what you don't measure.

The goal is not perfect outputs (impossible with probabilistic models) but predictable quality distribution. Understand your quality baseline, detect when it degrades, and have processes to investigate and fix issues quickly. Quality evaluation infrastructure is as important as the LLM integration itself—invest in both equally.
