We’ve all been there. You build a RAG (Retrieval-Augmented Generation) or RLM (Reasoning Language Model) system. You feed it a query, it generates a response, and you look at the answer. It looks pretty good. Maybe it’s a B+ or an A-. You ship it. Then, two weeks later, a user emails support with a screenshot of a response that is confidently wrong, citing a source document that doesn’t exist or, worse, contradicts the very statement it’s supporting.
The failure mode here isn’t just that the model “got it wrong.” It’s a breakdown in the chain of trust between the retrieval mechanism, the reasoning process, and the final synthesized output. As engineers, we tend to obsess over the final answer’s accuracy. But in production systems, especially those dealing with high-stakes enterprise data, the final answer is the tip of the iceberg. The real work, and the real source of reliability, lies in the evaluation of the process that got us there.
Standard metrics like ROUGE or BLEU for text generation, or simple hit-rate for retrieval, are insufficient. They don’t tell you if your system is faithful to the retrieved context or if the context itself is even relevant. We need a more granular, forensic approach to evaluation. We need to measure what I call Evidence Coverage—a composite score of how well the system’s output is grounded in, and logically derived from, the retrieved evidence.
Deconstructing Evidence Coverage
Evidence Coverage isn’t a single number; it’s a dashboard of metrics. It tells a story about where your pipeline is bleeding. Is the retrieval step grabbing the wrong chunks? Is the model ignoring the provided context? Is it inventing facts? Let’s break down the critical components.
1. Citation Correctness and Precision
This is the most basic, yet most frequently botched, evaluation. When the model says, “The Q3 revenue was $50M,” and points to a source, that source must contain the information. We’re not just talking about the source document being real; we’re talking about the specific claim being supported by that document.
There are two levels of failure here:
- Existence Failure: The cited source (e.g., “Document X, page 4”) is not in the retrieved context window. The model is hallucinating the citation. This is a critical failure of faithfulness.
- Content Failure: The cited source exists, but it doesn’t actually support the claim. The model is misinterpreting or “over-claiming” from the source.
For evaluation, we need to parse the output for citations (this can be tricky with free-form text, which is why forcing structured citations like [1] is a good practice) and then programmatically verify them against the retrieved context. This is a perfect job for a smaller, cheaper LLM acting as a “judge.” You feed the judge the claim, the citation, and the retrieved text, and ask it: “Does the text support the claim? Yes/No.”
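The parsing half of this is mechanical once citations are structured. Here is a minimal sketch, assuming `[1]`-style bracketed markers; the sentence split and marker regex are heuristics, not a full parser:

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")

def extract_claims_with_citations(response: str):
    """Split a response into rough sentence-level claims and collect the
    [n]-style citation markers attached to each. A heuristic that assumes
    the prompt forces bracketed citations."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    claims = []
    for sent in sentences:
        ids = [int(m) for m in CITATION_RE.findall(sent)]
        # Strip the markers, then tidy the whitespace left behind.
        text = re.sub(r"\s+([.!?,])", r"\1", CITATION_RE.sub("", sent)).strip()
        claims.append({"claim": text, "citations": ids})
    return claims

def existence_failures(claims, retrieved_ids):
    """Existence check: flag any cited id not in the retrieved context."""
    return [(c["claim"], cid)
            for c in claims
            for cid in c["citations"]
            if cid not in retrieved_ids]
```

Existence failures fall out of a set-membership check against the retrieved chunk ids; the content check (does the text actually support the claim?) still goes to the LLM judge.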
2. Contradiction Rate
This is a subtle but devastating issue in RAG systems. Your retrieval step might pull in five documents. Four of them are relevant and consistent. One is an old policy document that has been superseded. The model, in its effort to synthesize everything, might accidentally blend the old and new policies, creating a contradictory statement. Or, it might contradict itself within the same response.
For example: “The employee can take 10 days of leave. Leave requests must be submitted 30 days in advance. The employee is entitled to 5 days of leave.” This response contains an internal contradiction (10 days vs. 5 days) and potentially a contextual one if the 30-day rule is from an outdated document.
Measuring this requires looking at the response as a whole and comparing its claims against each other and against the corpus. A high contradiction rate indicates that your model is either struggling with conflicting signals in the context or that your retrieval is pulling in a noisy mix of relevant and irrelevant documents.
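A full contradiction judge needs an LLM or an NLI model, but the bluntest case, two claims assigning different numbers to the same fact, can be caught with a cheap heuristic as a first-pass filter. A toy sketch (the `<N> days of <thing>` pattern is illustrative, not a general solution):

```python
import re
from collections import defaultdict

NUM_FACT_RE = re.compile(r"(\d+)\s+days of (\w+)")

def numeric_contradictions(claims):
    """Flag claims that assign different numbers to the same
    '<N> days of <thing>' fact. Only catches the blunt numeric case;
    a real contradiction judge would use NLI or an LLM."""
    seen = defaultdict(set)
    for claim in claims:
        for value, subject in NUM_FACT_RE.findall(claim):
            seen[subject].add(int(value))
    # Any fact with more than one distinct value is a contradiction.
    return {subj: vals for subj, vals in seen.items() if len(vals) > 1}
```

On the leave-policy example above, this flags the 10-vs-5 conflict immediately, before any LLM call is spent.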
3. Path Validity (The Reasoning Chain)
This is more advanced and particularly relevant for RLMs and agentic systems that perform multi-step reasoning. A simple RAG system retrieves and synthesizes. A reasoning system might retrieve, then plan, then retrieve again, then calculate, then synthesize. The “path” is the sequence of steps it took.
Path validity asks: “Is this reasoning path sound?” For instance, if the query is “What is the percentage growth of X from 2022 to 2023?”, the valid path is:
- Retrieve 2022 value for X.
- Retrieve 2023 value for X.
- Calculate (2023 value − 2022 value) / 2022 value.
- Format as a percentage.
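To make the path auditable, each step should leave a record a verifier can inspect. A sketch of the growth-percentage path above, with a plain `values` dict standing in for the retrieval steps:

```python
def growth_percentage(values: dict, metric: str, y0: int, y1: int):
    """Execute the valid reasoning path for 'percentage growth of metric
    from y0 to y1', logging every step so a verifier can audit the trace."""
    trace = []
    v0 = values[(metric, y0)]          # step 1: retrieve y0 value
    trace.append(("retrieve", metric, y0, v0))
    v1 = values[(metric, y1)]          # step 2: retrieve y1 value
    trace.append(("retrieve", metric, y1, v1))
    growth = (v1 - v0) / v0            # step 3: calculate
    trace.append(("calculate", growth))
    answer = f"{growth:.1%}"           # step 4: format as a percentage
    trace.append(("format", answer))
    return answer, trace
```

The trace, not the answer, is what a path-validity judge consumes: it can confirm both values were retrieved before the division happened, rather than approximated from prose.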
An invalid path might be: “Retrieve an article that discusses the general growth trend of X and approximate the percentage.” Or, it might be a logical fallacy in the intermediate steps.
Evaluating this is tough. You can’t just look at the final output. You need to log the intermediate steps (the “thoughts” of the model) and evaluate the logic at each stage. For some applications, this might mean using a “verifier” model that checks the validity of each step before allowing the system to proceed. It’s computationally expensive, but for mission-critical financial or medical applications, it’s non-negotiable.
4. Budget Adherence
This is the pragmatic metric that engineers often forget until they get the cloud bill. Every LLM call, every token retrieved, costs money and latency. A system can be 99% accurate but too expensive to run at scale.
Budget adherence measures two things:
- Retrieval Budget: Did we retrieve 200 tokens when 50 would have sufficed? Are we consistently over-fetching context, bloating the prompt, and increasing costs? We can measure the “context utilization”—how much of the retrieved text was actually cited or used in the final answer.
- Generation Budget: Is the model generating verbose, rambling answers when a concise one would do? This is a measure of output efficiency. We can track tokens-in vs. tokens-out, but more importantly, we can score for conciseness.
A system that respects the budget is a system that can be reliably deployed. It’s a measure of engineering discipline as much as it is of model performance.
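Context utilization is easy to approximate once you know which chunks the answer actually cited. A minimal sketch; the whitespace split is a stand-in for your real tokenizer:

```python
def utilization_ratio(retrieved_chunks, cited_chunk_ids):
    """Rough retrieval-budget metric: fraction of retrieved tokens that
    belong to chunks the final answer actually cited."""
    def n_tokens(text):
        return len(text.split())  # stand-in for a real tokenizer
    total = sum(n_tokens(c["text"]) for c in retrieved_chunks)
    used = sum(n_tokens(c["text"]) for c in retrieved_chunks
               if c["id"] in cited_chunk_ids)
    return used / total if total else 0.0
```

A consistently low ratio is a signal to shrink `top_k` or tighten your chunking, not a reason to blame the model.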
The Weekly Evaluation Report: A Practical Template
So, how do we operationalize this? You can’t do this manually. You need a script—a weekly “health check” for your RAG/RLM pipeline. This script should run against a curated set of test queries (a “golden set” that you update over time) and produce a report. Here is a conceptual structure for that script and the resulting report.
Step 1: The Test Suite
Your evaluation is only as good as your test cases. Your weekly script should pull from a tests.json file. This file isn’t just “question” and “ideal answer.” It needs more metadata.
A sample entry might look like this:
{
  "id": "test_042",
  "query": "What is the new protocol for handling server outages in the EU region?",
  "context_docs": ["protocol_eu_v2.pdf", "incident_response_overview.docx"],
  "expected_citations": ["protocol_eu_v2.pdf#page=5"],
  "known_trap": "There is an old protocol in 'incident_response_overview.docx' that should be ignored.",
  "ideal_answer_snippet": "The new protocol requires..."
}
This rich structure allows the evaluation script to check for specific things: Did it cite the right document? Did it fall for the trap? This is far better than just checking for semantic similarity.
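Because the golden set is the foundation of every number in the report, it is worth failing fast on malformed entries. A loader sketch (the `load_test_suite` name and required-field set are assumptions to match the entry above):

```python
import json

REQUIRED_FIELDS = {"id", "query", "context_docs", "expected_citations"}

def load_test_suite(path: str) -> list[dict]:
    """Load tests.json and reject malformed entries up front, so a broken
    golden set never silently skews the weekly report."""
    with open(path) as f:
        suite = json.load(f)
    for entry in suite:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(
                f"test {entry.get('id', '?')} missing fields: {sorted(missing)}")
    return suite
```

`known_trap` and `ideal_answer_snippet` are deliberately left optional here: not every query has a trap, and the validation should not force one.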
Step 2: The Evaluation Script (Pseudocode Logic)
The script orchestrates the evaluation. It runs the query through your system, captures all intermediate states, and then runs the “judges.”
for test in test_suite:
    # 1. Run the full RAG/RLM pipeline
    retrieved_chunks = retriever(test.query)
    full_response = llm.generate(query=test.query, context=retrieved_chunks)

    # 2. Parse the response
    claims = extract_claims(full_response)
    citations = parse_citations(full_response)

    # 3. Run the judges
    metrics = {'citation_correctness': [], 'path_validity': []}

    # Citation correctness
    for claim, citation in zip(claims, citations):
        is_correct = citation_judge(claim, citation, retrieved_chunks)
        metrics['citation_correctness'].append(is_correct)

    # Contradiction rate
    metrics['contradiction_rate'] = contradiction_judge(claims)

    # Context utilization (budget)
    used_tokens = count_tokens(used_in_synthesis(claims, retrieved_chunks))
    retrieved_tokens = count_tokens(retrieved_chunks)
    metrics['utilization_ratio'] = used_tokens / retrieved_tokens

    # Path validity (if applicable)
    if system.is_rlm:
        for step in system.planner.steps:
            metrics['path_validity'].append(path_judge(step))

    store_results(test.id, metrics)
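Once per-test metrics are stored, the report's headline numbers are just aggregates over the suite. A sketch, assuming the metrics dicts produced by the loop above:

```python
from statistics import mean

def aggregate(results: dict) -> dict:
    """Roll per-test metrics up into weekly headline numbers.
    `results` maps test id -> the metrics dict stored per test."""
    # Citation correctness is per-claim, so flatten across all tests first.
    all_citations = [ok for m in results.values()
                     for ok in m.get("citation_correctness", [])]
    return {
        "citation_correctness": mean(all_citations) if all_citations else None,
        "contradiction_rate": mean(m["contradiction_rate"]
                                   for m in results.values()),
        "utilization_ratio": mean(m["utilization_ratio"]
                                  for m in results.values()),
    }
```

Note the asymmetry: citation correctness averages over claims, while contradiction rate and utilization average over queries. Mixing the two denominators is a common way to produce a misleading headline number.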
Step 3: The Report Format
The output shouldn’t be a wall of logs. It should be a concise, human-readable report (e.g., a Markdown file or a Slack message) that highlights trends. Here’s a template for what that report should look like.
Weekly RAG/RLM System Health Report
Period: 2023-10-23 to 2023-10-27
Test Suite Version: v1.4 (25 queries)
Executive Summary
System performance is stable. Overall Evidence Coverage score is 88%, a 2% decrease from last week. The primary area of concern is a rise in Contradiction Rate (from 5% to 12%) due to the ingestion of a new, un-curated knowledge base. Citation correctness remains high at 97%. Budget adherence is within target.
Key Metrics Breakdown
1. Evidence Coverage Score: 88% (-2%)
This is a weighted average of all sub-metrics.
2. Citation Correctness: 97% (Stable)
Out of 150 claims evaluated, 146 were correctly supported by a valid citation. The 4 failures were “hallucinated” citations not present in the retrieved context. This suggests the model is sometimes confident without evidence.
3. Contradiction Rate: 12% (+7%)
Analysis shows that 3 out of 25 queries produced contradictory statements. All 3 cases involved queries that touched on ‘Project Alpha’. The retrieved context included both the new ‘Alpha Spec v2’ and the old ‘Alpha Spec v1’. Action Required: Implement a timestamp filter in the retrieval stage for documents with versioning.
4. Path Validity: 80% (Stable)
For the 5 queries requiring multi-step reasoning, 4 had fully valid paths. One failed due to an incorrect intermediate calculation (math error).
5. Budget Adherence: 95% (Target: 90%)
Average tokens retrieved per query: 850. Average tokens utilized in synthesis: 410. Utilization ratio is 0.48. This is slightly inefficient but within acceptable bounds. No action needed.
Failure Case Spotlight
Query ID: test_018
Query: “Is the ‘Project Alpha’ deadline still Q4?”
Retrieved Docs: alpha_spec_v1.pdf (Deadline: Q3), alpha_spec_v2.pdf (Deadline: Q4)
Model Response: “Yes, the deadline is Q4, but it was previously Q3.”
Failure Analysis: This is a contradiction. The model correctly identified the new deadline but also included the old one, creating a confusing and factually contradictory statement. The user only asked if the deadline was still Q4. The correct answer is “Yes.” The model’s attempt to be helpful by providing context backfired.
Next Steps & Actions
- High Priority: Deploy a pre-filtering step for document retrieval to exclude documents with a ‘superseded’ status or older versions when a newer one exists. (Assign: @backend-eng)
- Medium Priority: Investigate the math error in path validity. It appears to be a floating-point rounding issue in the model’s output. (Assign: @ml-eng)
- Low Priority: The citation hallucinations are rare (2.6%) but worth monitoring. If they increase, we may need to add a penalty in the prompt for citing non-existent sources.
Implementing the “Judges”
The pseudocode above relies on “judges”—sub-routines that perform the actual evaluation. For most teams, the most practical way to implement these is by using the LLM you’re already using, but in a separate, “evaluation” call. This is often called LLM-as-a-Judge.
For example, the Contradiction Judge would be a prompt like this:
You are an expert fact-checker. Given a list of claims, determine if any of them contradict each other. Return a JSON object:
{ "contradiction_found": true/false, "contradicting_claims": ["claim1", "claim2"] }. Here are the claims: [list of claims from the model’s response].
The Citation Judge would receive the prompt:
You are a verification agent. Given a user claim, a specific citation, and the text from the cited document, determine if the document text fully supports the claim. Return only “Yes” or “No”. Claim: [claim]. Citation: [citation]. Document Text: [text].
The key here is to design these prompts to be atomic and deterministic. You want a clear yes/no output that you can easily parse and aggregate. You also want to use a model that is consistent, even if it’s not the most powerful one. A smaller, faster model like GPT-3.5-Turbo or a fine-tuned open-source model is often sufficient and much cheaper for running these evaluations at scale.
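Even with a "return only Yes or No" instruction, replies drift ("Yes, the text fully supports it"), so the parsing layer should be defensive. A sketch of the citation judge wrapper; `llm_call` is a placeholder for whatever function sends a prompt to your evaluation model:

```python
CITATION_JUDGE_PROMPT = (
    "You are a verification agent. Given a user claim, a specific citation, "
    "and the text from the cited document, determine if the document text "
    'fully supports the claim. Return only "Yes" or "No".\n'
    "Claim: {claim}\nCitation: {citation}\nDocument Text: {text}"
)

def citation_judge(llm_call, claim: str, citation: str, text: str) -> bool:
    """Wrap the judge prompt and force the free-text reply into a bool.
    `llm_call` is your eval-model client (a placeholder here)."""
    reply = llm_call(CITATION_JUDGE_PROMPT.format(
        claim=claim, citation=citation, text=text))
    # Defensive parsing: accept "Yes", "yes.", "Yes, because..." etc.
    return reply.strip().lower().startswith("yes")
```

Passing `llm_call` in as a function also makes the judge trivially testable with a stub, so the parsing logic can sit under unit tests without burning tokens.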
From Evaluation to Action
The entire purpose of this rigorous evaluation is to drive iteration. Without it, you’re flying blind. The weekly report template provided above is designed to do more than just report numbers; it’s designed to generate a to-do list. When the Contradiction Rate spikes, you know you have a retrieval problem. When Citation Correctness drops, your model’s prompt might need tuning to be more conservative. When Path Validity fails, you need to look at your reasoning logic or the quality of your intermediate tool calls.
Think of your RAG/RLM system not as a single model, but as a complex software application. You wouldn’t ship a web application without unit tests, integration tests, and performance monitoring. Your LLM application deserves the same level of engineering rigor. By breaking down “goodness” into these measurable components—evidence coverage, citation correctness, and budget—you transform an abstract art into a manageable engineering discipline. This process allows you to build systems that are not just impressive in a demo, but trustworthy in production.

