The regulatory landscape for any modern digital service feels less like a set of static rules and more like a living organism. It mutates, adapts, and grows. For engineers and developers tasked with maintaining compliance, the traditional approach of hard-coding logic trees into applications becomes unmanageable. Every time a regulation changes, you are looking at a code freeze, a deployment cycle, and the very real risk of human error in translation. This is where the intersection of Large Language Models (LLMs) and structured knowledge graphs offers a paradigm shift. We are moving from brittle, procedural compliance checks to dynamic, semantic understanding.

What follows is a deep dive into designing a Retrieval-Augmented Generation (RAG) system specifically tailored for compliance Q&A. This isn’t just about asking a model to summarize a PDF. It is about building a rigorous pipeline that ingests raw text, maps it to a formal ontology, links it into a graph, retrieves it recursively, validates constraints, generates answers with citations, and maintains a complete audit trail. This is a blueprint for a system that an engineer can trust, debug, and maintain.

The Foundation: Ingestion and Atomicity

Everything begins with the data. Regulatory documents—GDPR, HIPAA, PCI-DSS, or internal policy manuals—are notoriously messy. They contain headers, footers, tables, cross-references, and conflicting definitions. A naive ingestion pipeline that simply chunks text into paragraphs will fail because context is lost at the boundaries. If a definition in Section 2.1 is referenced in Section 15.4, a simple vector store lookup might miss the connection.

We need an ingestion strategy that prioritizes atomicity. An atomic unit of compliance is a single, indivisible statement of a rule or requirement. For example, “Data must be encrypted at rest” is atomic. “User consent must be verifiable” is atomic. The ingestion pipeline must parse the raw document (PDF, DOCX, HTML) and break it down into these atomic units, preserving the hierarchical structure of the original document.

Consider the preprocessing step. For scanned documents we cannot rely on optical character recognition (OCR) alone, and even with born-digital text, layout analysis is critical. A footnote containing a critical exception is semantically different from a main body paragraph. We use layout-aware parsers (like those derived from the PDFMiner or Unstructured libraries) to tag text blocks with their semantic roles: header_level_1, table_caption, definition, obligation.

“The granularity of your ingestion determines the precision of your retrieval. If you feed the model a wall of text, it will hallucinate. If you feed it structured atoms, it will reason.”

Once segmented, each atomic unit is assigned a unique identifier (UUID) and a version hash. This allows us to track changes over time. If a regulation is amended, we don’t overwrite the old rule; we mark it as superseded and link the new version to it. This temporal awareness is crucial for audit logs, which we will discuss later.
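As a rough sketch, an atomic unit can be modeled as a small record; the field names here (unit_id, version_hash, superseded_by) are illustrative rather than a fixed schema:

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class AtomicUnit:
    """One indivisible rule or requirement, with provenance and versioning."""
    text: str                         # e.g. "Data must be encrypted at rest"
    source_doc: str                   # identifier of the originating document
    section_path: str                 # e.g. "2.1" -- preserves the document hierarchy
    semantic_role: str                # e.g. "obligation", "definition", "footnote"
    unit_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    superseded_by: str | None = None  # unit_id of the amending rule, if any

    @property
    def version_hash(self) -> str:
        # Hash of the normalized text: any wording change produces a new version.
        normalized = " ".join(self.text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def supersede(old: AtomicUnit, new: AtomicUnit) -> None:
    """Mark the old rule as superseded instead of overwriting it."""
    old.superseded_by = new.unit_id
```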

Handling Multi-Modal Inputs

Regulations often come in mixed formats. A compliance requirement might be described in a text paragraph but detailed in a table. A robust ingestion engine must flatten these structures. We extract text from tables row-by-row, maintaining the relationship between the column header (the attribute) and the cell value (the constraint). For example, a table row might specify “Retention Period: 6 Years” linked to “Data Type: Financial Records.” The ingestion engine converts this into a structured record before it ever reaches the LLM.
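A minimal sketch of that flattening step, assuming the parser has already separated the header row from the data rows (the function name and record keys are illustrative):

```python
from typing import Iterable


def flatten_table(headers: list[str], rows: Iterable[list[str]]) -> list[dict[str, str]]:
    """Pair each cell with its column header so the attribute/constraint link survives."""
    return [
        {header.strip(): cell.strip() for header, cell in zip(headers, row)}
        for row in rows
    ]


# A retention table becomes explicit records before it ever reaches the LLM.
headers = ["Data Type", "Retention Period"]
rows = [["Financial Records", "6 Years"], ["Access Logs", "90 Days"]]
print(flatten_table(headers, rows))
# [{'Data Type': 'Financial Records', 'Retention Period': '6 Years'}, ...]
```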

Ontology Tagging: The Semantic Layer

Raw text is insufficient for rigorous compliance. We need a semantic layer—an ontology—that categorizes every piece of information. This is where we bridge the gap between natural language and machine-readable logic. We define a taxonomy of compliance concepts.

Our ontology might include classes such as:

  • Entity: The subject of the regulation (e.g., User, Data Processor, Controller).
  • Data Type: PII, PHI, Financial Data, Telemetry.
  • Constraint: Prohibition, Obligation, Permission.
  • Scope: Jurisdiction (EU, US), Application Domain (Mobile App, Backend).

Applying this ontology is a two-step process. First, we use a lightweight classification model (or a carefully prompted LLM) to assign these tags to the atomic units extracted during ingestion. This is not just keyword matching; it is semantic classification. For instance, the phrase “Users must be given the option to opt-out” is tagged as Constraint: Obligation on Entity: User regarding Data Type: Marketing Preferences.

Second, the tagging must be deterministic. We cannot have the model randomly tagging “User” as “Customer” one time and “Client” the next. We enforce this through a controlled vocabulary, or “glossary,” that is fed to the tagging model as context. This glossary acts as a schema. If the model encounters a term not in the glossary, it flags it for human review rather than guessing.
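One way to make the controlled vocabulary concrete is to encode the ontology as enumerations and reject any tag that falls outside them. This is a sketch, assuming the tagger returns a flat dictionary with entity, data_type, and constraint keys:

```python
from dataclasses import dataclass
from enum import Enum


class Entity(Enum):
    USER = "User"
    DATA_PROCESSOR = "Data Processor"
    CONTROLLER = "Controller"


class DataType(Enum):
    PII = "PII"
    PHI = "PHI"
    FINANCIAL = "Financial Data"
    TELEMETRY = "Telemetry"


class Constraint(Enum):
    PROHIBITION = "Prohibition"
    OBLIGATION = "Obligation"
    PERMISSION = "Permission"


# The glossary fed to the tagging model as context (its controlled vocabulary).
GLOSSARY = {
    "entity": [e.value for e in Entity],
    "data_type": [d.value for d in DataType],
    "constraint": [c.value for c in Constraint],
}


@dataclass
class OntologyTag:
    entity: Entity
    data_type: DataType
    constraint: Constraint


def validate_tags(raw: dict[str, str]) -> OntologyTag | None:
    """Accept the model's proposed tags only if every term is in the glossary.
    Anything outside the controlled vocabulary returns None -> human review."""
    try:
        return OntologyTag(
            entity=Entity(raw["entity"]),
            data_type=DataType(raw["data_type"]),
            constraint=Constraint(raw["constraint"]),
        )
    except (KeyError, ValueError):
        return None
```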

Disambiguation Challenges

Language is slippery. The word “processing” has a specific legal definition under GDPR that differs from general usage. In a technical context, “processing” refers to CPU cycles; in compliance, it refers to any operation performed on data. The ontology tagging layer must handle this disambiguation. We achieve this by providing the tagging model with examples of correct and incorrect tagging (few-shot learning). By showing the model that “processing data on the server” is not a compliance event, but “processing personal data” is, we narrow the semantic field.

Graph Links: From Hierarchy to Network

Once we have atomic units and ontological tags, we move them into a graph database. Traditional relational (SQL) databases are poor at representing the complex, nested relationships found in regulations. A graph database (like Neo4j, or a vector-native store with graph-style cross-references such as Weaviate) allows us to model the regulatory landscape as a network.

Nodes in our graph represent the atomic units, entities, and data types. Edges represent the relationships. We define several types of edges:

  • IS_CHILD_OF: Links a specific clause to its parent section (preserving hierarchy).
  • REFERS_TO: Links a clause to a definition elsewhere in the document.
  • IMPLEMENTS: Links a technical control (e.g., “AES-256 encryption”) to a regulatory requirement (e.g., “Data at rest shall be encrypted”).
  • CONFLICTS_WITH: Links two requirements that are mutually exclusive (requires human resolution).

Building this graph is a recursive process. As we ingest new documents, we don’t just dump them in; we query the existing graph to find similar concepts. If a new policy mentions “Right to be Forgotten,” we link it to the existing node representing GDPR Article 17. This creates a dense web of knowledge. Over time, the graph becomes more valuable than the sum of its parts because it reveals hidden dependencies.
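A sketch of how those typed links might be written with the Neo4j Python driver; the node label AtomicUnit, the property names, and the assumption that similar existing nodes have already been found (for example, by vector search) are all illustrative:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LINK_PARENT = """
MERGE (child:AtomicUnit {unit_id: $child_id})
MERGE (parent:AtomicUnit {unit_id: $parent_id})
MERGE (child)-[:IS_CHILD_OF]->(parent)
"""

LINK_SIMILAR = """
MATCH (new:AtomicUnit {unit_id: $new_id}), (existing:AtomicUnit {unit_id: $existing_id})
MERGE (new)-[:REFERS_TO {via: "semantic_match"}]->(existing)
"""


def ingest_unit(unit_id: str, parent_id: str, similar_ids: list[str]) -> None:
    """Attach a new atomic unit to its parent section, then to semantically similar
    nodes already in the graph (e.g. 'Right to be Forgotten' -> GDPR Article 17)."""
    with driver.session() as session:
        session.run(LINK_PARENT, child_id=unit_id, parent_id=parent_id)
        for existing_id in similar_ids:
            session.run(LINK_SIMILAR, new_id=unit_id, existing_id=existing_id)
```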

For example, a developer might ask, “What are the encryption requirements for our European users?” The graph doesn’t just look for the keyword “encryption.” It traverses: User (Region: EU) -> Subject to Regulation (GDPR) -> Requires Security Measure -> Encryption Standard. This traversal ensures we capture context that a simple vector search would miss.

Vector Embeddings as Edge Weights

While the graph provides structure, we also leverage vector embeddings for fuzzy matching. Every node in the graph is associated with a vector embedding (generated by a model like text-embedding-ada-002). These embeddings serve as the “weight” of the edges in a semantic sense. When we traverse the graph, we don’t just follow strict links; we can also follow “semantic hops” where the vector similarity is above a certain threshold. This allows the system to connect “Data Privacy” to “Information Security” even if the documents don’t explicitly cross-reference them, provided the semantic proximity is high.
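A sketch of a semantic hop, assuming each node’s embedding has already been computed and cached in a dictionary; the 0.82 threshold is an arbitrary illustrative value you would tune on your own corpus:

```python
import numpy as np

SEMANTIC_HOP_THRESHOLD = 0.82  # illustrative; tune against your own data


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_hops(node_id: str, embeddings: dict[str, np.ndarray]) -> list[str]:
    """Return nodes that are not explicitly linked but sit above the similarity
    threshold -- how 'Data Privacy' can land next to 'Information Security'."""
    query_vec = embeddings[node_id]
    return [
        other_id
        for other_id, vec in embeddings.items()
        if other_id != node_id
        and cosine_similarity(query_vec, vec) >= SEMANTIC_HOP_THRESHOLD
    ]
```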

Recursive Retrieval: The RAG Core

With the graph populated, we can implement the retrieval mechanism. This is the “R” in RAG. However, standard RAG is often flat: user query -> vector search -> top k chunks -> LLM. For compliance, this is insufficient. Compliance requires reasoning over multiple pieces of evidence that might be scattered across different documents.

We employ Recursive Retrieval. When a query arrives, we don’t just retrieve the most relevant text chunks. We perform a multi-step search:

  1. Initial Retrieval: The user query is embedded and used to find the top N most relevant atomic units in the graph.
  2. Context Expansion: For each retrieved unit, we traverse the graph edges. We pull in parent nodes (for context), child nodes (for details), and related constraints (for exceptions).
  3. Re-ranking: The expanded set of documents is likely too large for the LLM context window. We use a cross-encoder re-ranker to score the relevance of each expanded chunk specifically against the original query. We keep only the top-ranked chunks that add unique information.

Imagine a query: “Can we delete user data after 30 days?” A simple search might return a policy about data retention. But recursive retrieval would also pull in:

  • A conflicting requirement to keep audit logs for 6 months (via a CONFLICTS_WITH or EXCEPTION_TO edge).
  • The definition of “user data” (to ensure we aren’t talking about anonymized data).
  • Jurisdictional constraints (does this apply to California users only?).

This process mimics how a human lawyer thinks. You don’t just recall one rule; you recall the rule, the exception to the rule, and the context in which the exception applies. By structuring the retrieval as a graph traversal, we feed the LLM a dossier of evidence rather than a single snippet.
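A minimal sketch of that three-step loop; the vector store, graph client, and re-ranker are passed in as callables because their concrete APIs differ from stack to stack:

```python
from typing import Callable


def recursive_retrieve(
    query: str,
    vector_search: Callable[[str, int], list[dict]],  # query, limit -> atomic units
    expand_neighbors: Callable[[str], list[dict]],    # unit_id -> linked units
    rerank_score: Callable[[str, str], float],        # query, text -> relevance
    top_n: int = 10,
    keep_k: int = 20,
) -> list[dict]:
    """1) initial retrieval, 2) graph expansion, 3) cross-encoder re-ranking."""
    # Step 1: top-N atomic units by embedding similarity.
    seeds = vector_search(query, top_n)

    # Step 2: pull in parents (context), children (details), and linked
    # exceptions or conflicts for each seed.
    expanded = {unit["unit_id"]: unit for unit in seeds}
    for unit in seeds:
        for neighbor in expand_neighbors(unit["unit_id"]):
            expanded.setdefault(neighbor["unit_id"], neighbor)

    # Step 3: re-rank the expanded set against the original query and keep only
    # the top chunks that still fit the context window.
    ranked = sorted(
        expanded.values(),
        key=lambda u: rerank_score(query, u["text"]),
        reverse=True,
    )
    return ranked[:keep_k]
```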

Constraint Checks: Logic before Language

Before generating an answer, we must perform logical validation. LLMs are probabilistic; they are not logic engines. If we ask an LLM to check compliance, it might miss a subtle contradiction. Therefore, we offload the “checking” to a symbolic layer.

We define a set of Constraint Validators. These are functions (Python code) that operate on the retrieved graph sub-structure. They take the user’s proposed action (implicit in the query) and run it against the hard rules extracted from the ontology.

For example, if the user asks, “Can we share email addresses with Vendor X?”, the system:

  1. Retrieves the rules regarding “Data Sharing” and “Vendor X”.
  2. Extracts constraints: Is Vendor X a “Sub-processor”? Is there a Data Processing Agreement (DPA)? Is the data “Marketing Data” or “Critical PII”?
  3. Runs a symbolic check: IF Data_Type == PII AND Vendor_Status == Unapproved THEN Allow_Sharing == False.

If the symbolic check returns False, the generation phase is constrained. The LLM is instructed to explain why the action is prohibited based on the specific clauses identified. If the check returns True, the LLM is instructed to provide the approval and cite the supporting evidence.

This separation of concerns is vital. The LLM handles the natural language explanation (which is where it excels), while the symbolic engine handles the truth (which is where determinism is required). This hybrid approach significantly reduces hallucinations because the “facts” are validated before the text is generated.
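For the Vendor X example above, a constraint validator might look like the following sketch; the SharingRequest fields and rule keys are illustrative, and a real system would carry one validator per constraint family:

```python
from dataclasses import dataclass


@dataclass
class SharingRequest:
    data_type: str         # e.g. "PII" or "Marketing Data"
    vendor: str
    vendor_approved: bool  # derived from the graph: signed DPA, sub-processor status


@dataclass
class Verdict:
    allowed: bool
    clause_ids: list[str]  # clauses that triggered a prohibition, if any


def check_sharing(request: SharingRequest, rules: list[dict]) -> Verdict:
    """Deterministic check run before generation: the LLM explains the verdict,
    it never decides it."""
    blocking = [
        rule["unit_id"]
        for rule in rules
        if rule["constraint"] == "Prohibition"
        and rule["data_type"] == request.data_type
        and not request.vendor_approved
    ]
    return Verdict(allowed=not blocking, clause_ids=blocking)
```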

Answer Generation with Citations

Now we construct the response. The prompt sent to the LLM is structured carefully. It receives the user’s query, the validated constraints, and the retrieved dossier of evidence. It is instructed to synthesize an answer that is:

  • Concise: Directly answering the question.
  • Grounded: Every claim must be backed by a citation from the retrieved evidence.
  • Cautious: If the evidence is ambiguous or contradictory, the model must state that uncertainty rather than guessing.

Citations are implemented using a specific format, typically bracketed references like [1] or [Clause 4.2]. The model is explicitly told: “If you use information from chunk ID 5501e… include the citation [Ref: 5501e].”

Post-generation, we perform a citation verification step. The generated text is parsed to extract the cited IDs. We verify that these IDs actually exist in the retrieved set and that the text associated with those IDs supports the claim made in the response. If the model cites a chunk that says “Data retention is 1 year” to support a claim that “Data retention is 2 years,” the system flags the response for review.
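A sketch of that verification step, assuming citations follow the [Ref: <id>] convention used in the prompt; whether the cited text actually entails the claim is a separate check (for example, with an entailment model):

```python
import re

CITATION_PATTERN = re.compile(r"\[Ref:\s*([0-9a-fA-F\-]+)\]")


def verify_citations(answer: str, retrieved: dict[str, str]) -> dict:
    """Confirm every cited ID exists in the retrieved evidence set; unknown IDs
    flag the response for human review."""
    cited_ids = set(CITATION_PATTERN.findall(answer))
    unknown = cited_ids - retrieved.keys()
    return {
        "cited": sorted(cited_ids),
        "unknown": sorted(unknown),
        "needs_review": bool(unknown),
    }
```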

The final output is a structured JSON object containing the answer text and a list of citation objects. This allows the frontend to render the answer with clickable references that open the source document at the exact location.

Audit Logging: The Immutable Ledger

In a compliance system, how you reached a decision is just as important as the decision itself. Every interaction must be logged in an immutable audit trail. This is non-negotiable for legal defensibility.

The audit log captures the entire state of the system at the moment of the query:

  • Input: The raw user query (sanitized for PII).
  • Retrieval Path: The specific graph nodes and edges traversed to find the evidence. This is the “chain of custody” for the data.
  • Constraint Evaluation: The results of the symbolic checks (Pass/Fail and the logic used).
  • Generation Metadata: The LLM model version used, the temperature setting, and the final prompt.
  • Output: The generated answer and the citations.
  • User Feedback: If the user marks the answer as helpful or unhelpful.

We store these logs in a write-once-read-many (WORM) database or an append-only ledger. If a regulator asks, “How did you determine that User X was allowed to delete their data on June 12th?”, we can replay the exact state of the knowledge graph and the logic at that time. We can prove that our system was not arbitrary; it was deterministic based on the rules active at that moment.

This also aids in debugging. If the system gives a wrong answer, we don’t have to guess why. We can look at the retrieval path and see if the graph traversal missed a critical node or if the ontology tag was incorrect.
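One way to make the ledger tamper-evident at the application layer, independent of whichever WORM store persists it, is to hash-chain each record to its predecessor; this is a sketch, and the field names are illustrative:

```python
import hashlib
import json
import time


def append_audit_record(ledger: list[dict], record: dict) -> dict:
    """Append a record whose hash covers the previous entry, so any later edit
    breaks the chain and is detectable on replay."""
    prev_hash = ledger[-1]["record_hash"] if ledger else "genesis"
    payload = {
        "timestamp": time.time(),
        "prev_hash": prev_hash,
        **record,  # query, retrieval path, constraint results, model version, answer...
    }
    payload["record_hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    ledger.append(payload)
    return payload
```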

Testing Strategy: Rigor and Resilience

Testing an AI-driven compliance system requires a multi-layered approach. Unit tests alone are insufficient because the output is non-deterministic text.

1. Unit Testing the Symbolic Layer

The constraint validators and the ingestion parsers are pure code. They should be tested with standard software engineering practices (pytest). We feed them known inputs and assert expected outputs. For example: Input: Clause "Must encrypt PII" -> Output: Constraint Object (Type: Obligation, Target: PII, Action: Encrypt).
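A pytest sketch of that case; the ingestion.parser module and the parse_clause helper are hypothetical stand-ins for your own parsing code:

```python
# test_constraints.py -- parse_clause is a hypothetical helper in your ingestion package
from ingestion.parser import parse_clause


def test_encryption_clause_becomes_obligation():
    constraint = parse_clause("Must encrypt PII")
    assert constraint.type == "Obligation"
    assert constraint.target == "PII"
    assert constraint.action == "Encrypt"


def test_unknown_term_is_flagged_for_review():
    # Terms outside the glossary should not be guessed at.
    assert parse_clause("Must frobnicate the widgets") is None
```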

2. Integration Testing the Graph

We need to verify that the graph links are formed correctly. We create a mock set of documents with known relationships (e.g., Document A references Document B). We run the ingestion pipeline and query the graph to ensure the REFERS_TO edge exists. We also test for “orphan” nodes—atomic units that have no connections, which usually indicates a parsing error.

3. End-to-End Evaluation (The “Golden Set”)

This is the most critical phase. We curate a “Golden Set” of Q&A pairs. These are questions crafted by compliance experts, accompanied by the “ground truth” answers. The test suite runs these questions through the full pipeline (Ingestion -> Graph -> Retrieval -> Generation).

We evaluate the outputs using two metrics:

  • Factuality Score: Does the generated answer match the ground truth? (Checked via semantic similarity and keyword matching).
  • Citation Precision/Recall: Did the model cite the correct source documents? Did it miss any relevant sources (recall) or cite irrelevant ones (precision)?

We aim for high precision. It is better to provide a partial answer with correct citations than a comprehensive answer based on a hallucinated source.
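Citation precision and recall reduce to simple set arithmetic over clause IDs; a sketch, with illustrative IDs:

```python
def citation_scores(predicted: set[str], ground_truth: set[str]) -> dict[str, float]:
    """Precision: fraction of cited sources that are correct.
    Recall: fraction of required sources that were actually cited."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 1.0
    return {"precision": precision, "recall": recall}


# One correct citation, one irrelevant citation, one missed source.
print(citation_scores({"clause_4_2", "clause_9_1"}, {"clause_4_2", "clause_7_3"}))
# {'precision': 0.5, 'recall': 0.5}
```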

4. Adversarial Testing

We intentionally try to break the system. We feed it ambiguous queries (“Is it okay to sort of maybe share data?”), contradictory regulations (overlapping jurisdictions), and queries about topics not covered in the documentation. The system must recognize its own ignorance and respond with “I don’t have enough information to answer that,” rather than making something up.

Rollback Plan: When Things Go Wrong

No system is perfect. Regulations change, data gets corrupted, or a model update introduces a regression. A robust rollback plan is essential.

Versioned Knowledge Graph

As mentioned in the ingestion section, every document and rule is versioned. When a new regulation is ingested or an existing rule is updated, we do not delete the old version. We mark it as deprecated but keep it in the graph. This allows us to “time travel.”

If we discover that a new ingestion batch has corrupted the graph (e.g., incorrect tagging), we can roll back the graph state to the previous snapshot. Since the graph is a separate persistence layer from the application code, this is a database operation, not a code deployment.

Model Fallbacks

If the LLM provider experiences an outage or if the latest model update causes a drop in performance on our “Golden Set,” we need a fallback mechanism. The system should be configured to route traffic to a previous, stable model version. This is handled via a configuration flag in the API gateway.

Furthermore, we maintain a “static” retrieval mode. If the AI components are completely down, the system can fall back to a keyword-based search (like Elasticsearch) that returns the raw documents. It won’t generate a summary, but it will provide the source text. This ensures that compliance information is never fully unavailable.

Human-in-the-Loop Circuit Breaker

We implement automated monitoring on the “Confidence Score” of the generated answers. This score is derived from the constraint check results and the semantic similarity of the retrieved evidence to the query. If the average confidence score across a batch of queries drops below a threshold (e.g., 0.7), the system triggers a circuit breaker.

The circuit breaker switches the UI to “Review Mode.” Instead of returning an AI-generated answer immediately, it queues the query for a human compliance officer. The user is notified that their query is being reviewed for accuracy. This prevents the propagation of low-confidence answers while the engineering team investigates the root cause (e.g., a corrupted document ingestion or a drift in the LLM’s output).
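A sketch of that circuit breaker; the 0.7 threshold mirrors the example above, and the rolling window size is arbitrary:

```python
from collections import deque


class ConfidenceCircuitBreaker:
    """Trip into Review Mode when the rolling mean confidence drops below threshold."""

    def __init__(self, threshold: float = 0.7, window: int = 50):
        self.threshold = threshold
        self.scores: deque[float] = deque(maxlen=window)
        self.review_mode = False

    def record(self, confidence: float) -> bool:
        """Record one answer's confidence; return True if queries should now be
        queued for a human compliance officer instead of answered directly."""
        self.scores.append(confidence)
        if len(self.scores) == self.scores.maxlen:
            if sum(self.scores) / len(self.scores) < self.threshold:
                self.review_mode = True
        return self.review_mode
```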

Implementation Considerations for the Developer

Building this system requires a careful selection of tools. You don’t need the newest, shiniest model for every part of the pipeline. Often, smaller, specialized models perform better.

For the ingestion and tagging, a fine-tuned BERT-style model (like RoBERTa) is often faster and more consistent than a massive LLM. For the retrieval and graph traversal, a vector database with native graph capabilities (like Weaviate or Neo4j with vector plugins) reduces architectural complexity. For the generation and constraint checking, a larger model (like GPT-4 or a high-quality open-source model like Llama 3) is appropriate due to its reasoning capabilities.

Latency is another concern. Graph traversals can be expensive. To optimize, we can pre-compute common traversal paths. For example, “Data Retention” is a frequently asked topic. We can run a background job that materializes the subgraph for “Data Retention” (including all linked clauses and exceptions) and stores it in a cache. When a user asks about retention, we retrieve the pre-computed subgraph, drastically reducing query time.
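A sketch of that materialization cache; the TTL, the topic key, and the materialize callable (the background job that runs the full traversal) are all assumptions:

```python
import time
from typing import Callable

SUBGRAPH_CACHE: dict[str, tuple[float, dict]] = {}
CACHE_TTL_SECONDS = 6 * 60 * 60  # refresh hot topics a few times a day


def get_topic_subgraph(topic: str, materialize: Callable[[str], dict]) -> dict:
    """Serve frequently asked topics like 'Data Retention' from a materialized
    cache, falling back to a live graph traversal on a miss."""
    cached = SUBGRAPH_CACHE.get(topic)
    if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    subgraph = materialize(topic)
    SUBGRAPH_CACHE[topic] = (time.time(), subgraph)
    return subgraph
```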

Finally, consider the user experience. The interface should not just dump text. It should visualize the connections. When a user asks a question, the UI could display a mini-graph showing how the answer relates to specific regulations and internal policies. This helps the user understand the “why” behind the answer, fostering trust in the system.

Designing a compliance Q&A system is an exercise in balancing the fluidity of language with the rigidity of law. By structuring the data into an ontology, linking it in a graph, and validating it with symbolic logic, we create a tool that is not just a chatbot, but a rigorous compliance assistant. It respects the complexity of the domain while leveraging the power of AI to make that complexity accessible. The result is a system that engineers can rely on, auditors can verify, and users can trust.
