There’s a particular kind of frustration that settles in when you’re staring at a Retrieval-Augmented Generation (RAG) pipeline that’s technically working but failing to satisfy. You’ve chunked your documents, tuned your embeddings, and maybe even added a re-ranker, yet the answers still feel brittle—lacking the connective tissue that turns isolated facts into a coherent understanding. This is the gap where the next generation of RAG architectures lives: systems that don’t just retrieve text but reason over structured knowledge, guided by rules and ontological paths. The papers on KG²RAG, RuleRAG, and ORT aren’t just incremental improvements; they represent a shift from passive retrieval to active, structured reasoning.

Integrating these ideas into a single, cohesive prototype is a journey of layering complexity carefully. We’re not just bolting on features; we’re building a reasoning stack where each layer addresses a specific failure mode of the layer below. The goal is a system that can start with a simple query, expand its context using a knowledge graph, validate its reasoning against explicit rules, navigate ontological hierarchies for precision, and finally, plan its own execution path recursively. This isn’t a trivial exercise, but for anyone who’s felt the limitations of vanilla RAG, the payoff is a system that feels less like a search engine and more like a reasoning engine.

Establishing the Bedrock: The Baseline RAG

Before we can appreciate the sophistication of the later layers, we must first build a solid, unremarkable baseline. This is our control group, the system against which all future improvements will be measured. A baseline RAG pipeline, in its most fundamental form, is a two-step process: retrieval and generation. We take a corpus of documents, split them into chunks, and generate vector embeddings for each chunk using a model like text-embedding-ada-002. These vectors are stored in a vector database, such as FAISS or ChromaDB, which allows for efficient similarity search.

When a user query arrives, we embed it using the same model and perform a vector search to find the top-k most similar document chunks. These chunks are then stuffed into a prompt along with the original query and fed to a large language model (LLM), which synthesizes an answer based on the provided context. The beauty of this setup is its simplicity, but its weakness is its blindness. It knows nothing of the relationships between entities, the validity of its sources, or the logical steps required to answer a complex question. It’s a pattern-matching machine, and its performance is entirely dependent on the quality of the retrieved text.
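The retrieve-then-generate loop above can be sketched in a few lines of Python. Everything here is a stand-in for illustration: `embed` uses a bag-of-words vector where a real system would call an embedding model like text-embedding-ada-002, and `answer` returns the assembled prompt where a real system would send it to an LLM.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector is enough
    # to demonstrate the retrieval loop's shape.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank every chunk by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def answer(query: str, chunks: list[str]) -> str:
    # In a real pipeline this prompt would be sent to an LLM; here we just build it.
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = [
    "Tesla produces the Model S electric sedan.",
    "Short-term capital gains apply to assets held under one year.",
    "Whales are marine mammals that breathe air.",
]
print(answer("What does Tesla produce?", chunks))
```

Swapping `embed` for a real model and the prompt for an API call turns this toy into the actual baseline; the control-flow skeleton stays identical.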

What to Measure: The Baseline

For this initial step, our metrics must be brutally honest and focused on core functionality. We need to establish a solid foundation of performance before we start adding layers of complexity.

  • Retrieval Accuracy (Recall@k): This is the most critical metric for the retrieval component. For a set of benchmark questions with known “gold” answers (ground truth documents), what percentage of the relevant documents are present in the top-k retrieved chunks? A low recall here means the subsequent steps are doomed from the start, no matter how clever our reasoning is. We’ll typically start with k=5 or k=10.
  • Answer Faithfulness: Using a framework like RAGAS or just manual inspection, we measure whether the generated answer is strictly supported by the retrieved context. This guards against hallucination, where the LLM invents facts not present in the source material. A high faithfulness score indicates our LLM is adhering to the provided context.
  • Answer Relevance: Does the generated answer actually address the user’s query? This can be tricky to automate but is essential. An answer can be faithful to the context but irrelevant to the question if the retrieval was poor.
  • End-to-End Latency: How long does the entire process take, from query submission to answer generation? This is our baseline performance benchmark. Every addition we make will likely increase latency, so we need to know our starting point.
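The first of these metrics is mechanical to compute once you have gold document IDs per benchmark question. A minimal sketch, assuming retrieval returns an ordered list of document IDs:

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold documents that appear among the top-k retrieved IDs."""
    hits = sum(1 for doc_id in gold if doc_id in retrieved[:k])
    return hits / len(gold) if gold else 0.0

# One benchmark question: two gold documents, and top-5 retrieval found one of them.
retrieved = ["doc7", "doc2", "doc9", "doc4", "doc1"]
gold = {"doc2", "doc3"}
print(recall_at_k(retrieved, gold, k=5))  # → 0.5
```

Averaging this over the whole benchmark set gives the Recall@k number to track as each later layer is added.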

At this stage, we’re not aiming for perfection. We’re aiming for a predictable, measurable system. The baseline RAG is our sanity check. If it fails here, adding a knowledge graph won’t magically fix it. It will just make a broken system more complex.

Layer 2: Adding Graph Expansion (The KG²RAG Idea)

The first major limitation of baseline RAG is its lack of context. It retrieves chunks based on semantic similarity, but it doesn’t understand that a retrieved document mentioning “Tesla” might be related to another document about “Elon Musk” or “Electric Vehicles.” This is where the KG²RAG concept shines. It proposes using a knowledge graph not as a replacement for vector search, but as an expansion tool. It enriches the retrieved context by adding relevant entities and relationships from a graph structure.

The integration process looks like this: after the initial vector retrieval, we take the top-k chunks and use them to seed a graph traversal. We identify named entities within these chunks (e.g., using a library like spaCy or a fine-tuned NER model). For each significant entity, we query our knowledge graph to find its immediate neighbors—related entities, properties, and the relationships connecting them. This creates a subgraph of context that is directly relevant to the retrieved information.

Imagine a query about “the impact of a specific gene mutation on protein folding.” The baseline RAG might retrieve a paper discussing the mutation. The KG²RAG-enhanced version retrieves that paper, identifies the gene name, queries a biomedical knowledge graph (like Hetionet or a custom-built one), and pulls in information about the protein it codes for, known pathways it’s involved in, and related diseases. This expanded context is then merged with the original retrieved chunks and passed to the LLM. The model now has a much richer, more connected set of facts to reason from.

Implementation Nuances

Building the knowledge graph is a project in itself. You can use pre-existing graphs (e.g., Wikidata, DBpedia) or build your own by extracting entities and relationships from your document corpus using dependency parsing and relation extraction models. For a prototype, starting with a simple graph database like Neo4j is practical. The query language (Cypher for Neo4j) allows for expressive traversal patterns. A typical query might look like: “Find all nodes connected to ‘Gene_X’ with a relationship type of ‘encodes’ or ‘regulates’.” The key is to limit the breadth and depth of the traversal to avoid exploding the context window and latency. A one- or two-hop traversal is usually sufficient to capture most relevant context without introducing excessive noise.
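The bounded traversal described above can be prototyped without a database at all. The sketch below uses an in-memory adjacency list as a stand-in for Neo4j (the entity names and relation types are invented for illustration); the comment shows roughly what the equivalent Cypher would look like.

```python
# In-memory stand-in for a graph database. In Neo4j, the one-hop version of
# this traversal would be roughly:
#   MATCH (g {name: 'Gene_X'})-[r:ENCODES|REGULATES]->(n) RETURN g, r, n
GRAPH = {
    "Gene_X": [("encodes", "Protein_Y"), ("regulates", "Gene_Z"), ("located_on", "Chr_7")],
    "Protein_Y": [("involved_in", "Pathway_A")],
}

def expand(entities: list[str], allowed_relations: set[str], hops: int = 1) -> list[tuple]:
    """Collect (subject, relation, object) triples within `hops` of the seed entities,
    keeping only the allowed relation types to bound noise and context size."""
    frontier, triples = set(entities), []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):
            for relation, neighbor in GRAPH.get(entity, []):
                if relation in allowed_relations:
                    triples.append((entity, relation, neighbor))
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return triples

print(expand(["Gene_X"], {"encodes", "regulates"}, hops=1))
```

Note how both the relation filter and the hop count act as the breadth and depth limits discussed above; widening either is the main lever for trading latency against context richness.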

What to Measure: Graph Expansion

Now our evaluation must become more nuanced. We’re not just asking if we retrieved the right document; we’re asking if we enriched it with the right context.

  • Context Expansion Quality: This is a new metric. For a set of questions, manually evaluate whether the graph-traversed context adds meaningful information that helps answer the question. Is the expanded context relevant, or is it just noise? A good signal is if the expanded context contains entities or facts that are not in the original retrieved chunks but are crucial for a comprehensive answer.
  • Answer Comprehensiveness: Does the final answer now cover aspects that the baseline RAG missed? For the gene mutation example, does the answer now mention the protein and related pathways? This can be measured with a rubric scored by a human or a strong LLM-as-a-judge.
  • Retrieval Recall (Revisited): We should re-evaluate Recall@k, but with a twist. We’re now interested in whether the combination of vector search and graph expansion helps us find the “gold” answer even if the initial vector retrieval was slightly off. The graph can act as a rescue mechanism.
  • Latency Impact: Graph traversal adds overhead. We need to measure this precisely. How much time does the entity extraction and graph query take? This is a critical engineering trade-off. A 10% increase in latency for a 20% gain in comprehensiveness might be acceptable; a 200% increase for a 5% gain is not.

At this stage, we’re moving from a simple retrieval system to a context-aware system. The graph provides a structured “scaffolding” of knowledge that the LLM can use to better understand the retrieved information.

Layer 3: Adding Rule Guidance (The RuleRAG Idea)

Even with enriched context, LLMs can still be inconsistent. They might contradict themselves, ignore domain-specific constraints, or fail to apply logical rules consistently. RuleRAG addresses this by injecting explicit, human-defined rules into the generation process. These rules act as guardrails or guiding principles for the LLM.

Integrating RuleRAG means creating a rule engine that operates in parallel with the retrieval and expansion steps. Rules can be simple logical statements (“If a drug is listed as ‘contraindicated’ for a condition, never recommend it”), complex if-then statements, or even templates for required reasoning steps. In our pipeline, after the context is expanded by the graph, we query our rule store for rules that are relevant to the entities and relationships present in the context.

For example, in a financial compliance RAG, if the retrieved context mentions a “short-term stock trade” and the user query is about “tax implications,” a rule might be triggered: “For trades held less than one year, apply short-term capital gains tax rates.” This rule is then explicitly formatted and inserted into the system prompt given to the LLM, often in a dedicated “Rules” or “Constraints” section. The LLM is instructed to use these rules as a primary source of truth when formulating its answer. This doesn’t eliminate the need for the LLM’s reasoning capabilities, but it grounds them in verifiable, consistent logic.

Building the Rule Engine

A simple rule engine can be built using a pattern-matching approach. We can represent rules as structured objects (e.g., JSON) containing a “condition” (a set of keywords, entity types, or graph patterns) and an “action” (the text of the rule to be injected). When the enriched context is generated, we scan it for matches against the rule conditions. More sophisticated systems might use a dedicated rules language like Drools or a custom DSL, but for a prototype, a Python-based pattern matcher is often sufficient. The key is to keep the rule set manageable and well-documented. An overly large or complex rule set can become a maintenance nightmare and may confuse the LLM more than it helps.
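A minimal version of that pattern matcher fits in a dozen lines. The rule conditions here are keyword sets (all keywords must appear in the enriched context for the rule to fire); the rule texts are taken from the examples earlier in this section, and the structure is one plausible shape for the JSON objects described above, not a fixed schema.

```python
RULES = [
    {
        # Condition: every keyword must appear in the context for the rule to fire.
        "condition": {"short-term", "trade"},
        "action": "For trades held less than one year, apply short-term capital gains tax rates.",
    },
    {
        "condition": {"contraindicated"},
        "action": "If a drug is listed as contraindicated for a condition, never recommend it.",
    },
]

def match_rules(context: str, rules: list[dict] = RULES) -> list[str]:
    """Return the action text of every rule whose condition keywords all appear in the context."""
    tokens = set(context.lower().split())
    return [rule["action"] for rule in rules if rule["condition"] <= tokens]

context = "The filing describes a short-term trade executed in March."
for action in match_rules(context):
    print(action)
```

The returned actions would then be formatted into the dedicated “Rules” section of the system prompt. A production system would likely want stemming, entity-type conditions, or graph-pattern conditions rather than raw token matching, but this is enough to measure rule adherence end to end.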

What to Measure: Rule Guidance

With the addition of rules, our focus shifts towards consistency, accuracy, and compliance.

  • Factual Accuracy & Consistency: We need to test the system on a set of queries where the rules are critical. For each query, we run it multiple times (e.g., 10-20 times with slightly rephrased prompts or contexts) and check if the answer remains consistent and factually correct according to the rules. A high variance in answers indicates the rules are not being applied robustly.
  • Rule Adherence Rate: For a benchmark dataset with known correct answers that depend on specific rules, what percentage of the generated answers correctly apply all relevant rules? This is a direct measure of the rule engine’s effectiveness.
  • Context Utilization: We can use attention visualization techniques or LLM-based analysis to see if the model is actually “looking at” the injected rules when generating its answer. Are the rules being ignored, or are they central to the response?
  • Latency Impact: The rule-matching process adds another step. We need to measure its contribution to the overall latency. Is the rule engine efficient, or does it become a bottleneck as the rule set grows?

By this point, our system is no longer a simple RAG. It’s a hybrid system that combines semantic retrieval, structured knowledge expansion, and explicit logical guidance. It’s becoming a reasoning engine.

Layer 4: Adding Ontological Path Guidance (The ORT Idea)

The ORT (Ontological Reasoning for Text) idea introduces a higher level of abstraction: the ontology. While a knowledge graph captures instances and their relationships (e.g., “Tesla” -[produces]-> “Model S”), an ontology defines the classes, hierarchies, and constraints of a domain (e.g., “Car” is-a “Vehicle”, “Vehicle” has-a “Manufacturer”). This ontological structure can be used to guide the reasoning process, ensuring that answers are not just factually correct but also conceptually sound.

Integrating ORT means we need an ontology file (typically in OWL or RDF format) that describes our domain. In our pipeline, after the initial retrieval and graph expansion, we perform an additional step: ontological reasoning. We take the entities and relationships from our expanded context and map them to the classes and properties in our ontology. We can then use an ontology reasoner (like HermiT or Pellet, often used via a library like Owlready2 in Python) to infer new knowledge or validate existing knowledge based on the ontological axioms.

For instance, if our context mentions an “air-breathing mammal that lives in the ocean,” and our ontology defines “Whale” as a subclass of “Mammal” and states that “Mammals” are “Animals,” the reasoner can help the system understand that this entity is likely a whale, even if the word “whale” isn’t explicitly mentioned. More importantly, the ontological path—the chain of classes and relationships from a specific instance to a general concept—can be used to structure the LLM’s reasoning. The system can be prompted to follow this path: “First, identify the specific instance. Then, determine its class. Then, consider the properties of that class.” This provides a structured, hierarchical reasoning scaffold.

Practical Ontology Integration

Building an ontology is a significant effort, often requiring domain expertise. For a prototype, you can start with a lightweight ontology. The key is to define class hierarchies (e.g., `Product -> Software -> OperatingSystem`) and some basic properties (e.g., `hasDeveloper`, `runsOnHardware`). The integration point in our pipeline is after the graph expansion. We take the entities from the expanded context, query the ontology for their classes and superclasses, and then use the reasoner to check for consistency and infer new relationships. The output of this step is a “reasoned context” that includes both the raw facts from the graph and the inferred, ontologically-grounded knowledge. This reasoned context is then passed, along with the rules, to the LLM.
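The core of that step, walking class hierarchies to surface inferred superclass facts, can be sketched in pure Python. The hierarchy below is a hypothetical toy; a real pipeline would load an OWL file with Owlready2 and let a reasoner like HermiT or Pellet do the inference, which also handles multiple inheritance and property axioms that this sketch ignores.

```python
# Hypothetical single-parent class hierarchy, standing in for an OWL ontology.
IS_A = {
    "MacBook": "Laptop",
    "Laptop": "Computer",
    "Smartphone": "Computer",
    "Computer": "Product",
}

def superclasses(cls: str) -> list[str]:
    """Walk the is-a chain upward, collecting every superclass of `cls`."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain

def entails_is_a(cls: str, ancestor: str) -> bool:
    """True if the ontology entails that `cls` is a kind of `ancestor`."""
    return ancestor == cls or ancestor in superclasses(cls)

# The inference from the running example: "My laptop is a MacBook" plus the
# hierarchy entails "My laptop is a computer."
print(superclasses("MacBook"))            # → ['Laptop', 'Computer', 'Product']
print(entails_is_a("MacBook", "Computer"))  # → True
```

Facts derived this way (“MacBook is a Computer”) are what gets appended to the expanded graph context to form the “reasoned context” handed to the LLM.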

What to Measure: Ontological Guidance

Measuring the impact of ontological reasoning is subtle. We’re looking for improvements in conceptual understanding and reasoning depth.

  • Conceptual Accuracy: For queries that require understanding class hierarchies or abstract concepts, does the answer demonstrate a correct ontological understanding? For example, if asked “Is a smartphone a computer?”, a system with good ontological reasoning should answer affirmatively and explain why (e.g., because it inherits the properties of a computer). This can be evaluated with a rubric.
  • Reasoning Depth: We can measure the complexity of the reasoning chains in the generated answers. Does the answer just state a fact, or does it explain the relationship between concepts? Ontological guidance should lead to more explanatory and less superficial answers.
  • Inference Quality: We can create benchmark questions that require inferring a fact that is not explicitly stated in the text but is logically entailed by the ontology. For example, if the text says “My laptop is a MacBook,” and the ontology states “All MacBooks are computers,” the system should be able to infer that “My laptop is a computer.” We measure the accuracy of such inferences.
  • Latency & Complexity: Ontological reasoning can be computationally expensive, especially for large ontologies. We need to measure the time taken by the reasoner and ensure it doesn’t become the dominant source of latency. Sometimes, pre-computing inferences or using a lighter-weight reasoner is necessary.

Our system is now a multi-layered reasoning stack: it retrieves semantically, expands with graph context, validates with rules, and reasons with ontological structure. It’s a formidable architecture, but it’s also becoming a complex beast to manage. The final layer addresses this complexity by giving the system the ability to plan its own execution.

Layer 5: Wrapping in a Recursive Planner (The RLM Idea)

The final piece of the puzzle is the Recursive Language Model (RLM) or, more generally, a recursive planner. So far, our pipeline has been linear: retrieve -> expand -> rule -> ontology -> generate. But real-world reasoning is often not linear. It’s iterative and adaptive. A query might require multiple rounds of retrieval, or the system might need to decompose a complex question into sub-questions. The planner orchestrates the entire process.

Integrating a recursive planner means turning our pipeline into a set of tools that an LLM agent can call. The agent’s primary job is to plan. Given a user query, the planner LLM first analyzes it and formulates a high-level plan. For a complex query like “Compare the market performance of two companies and explain how recent regulatory changes might have influenced their stock prices,” the plan might be:

  1. Decompose the query into two sub-queries: (a) “Retrieve market performance data for Company A and Company B,” and (b) “Retrieve information on recent regulatory changes affecting Company A and Company B.”
  2. For sub-query (a), execute the RAG pipeline (steps 1-4) to get performance data.
  3. For sub-query (b), execute the RAG pipeline to get regulatory data.
  4. Execute a final synthesis step, combining the results from (2) and (3) to generate the final comparative answer.

The planner can call the RAG pipeline recursively for each sub-query. It can also decide to use different tools or different parameters (e.g., a deeper graph traversal for one sub-query, a stricter rule set for another). This recursive nature allows the system to handle ambiguity and complexity by breaking it down into manageable, solvable pieces. The final output is not just an answer, but a trace of the reasoning plan and the results from each step, which can be used for debugging and verification.

Building the Planner

This requires a shift in how we think about our system. We need to wrap our existing RAG pipeline (with all its layers) into a function that can be called by an agent. We then use a capable LLM (like GPT-4 or a fine-tuned open-source model) as the planner. The planner is prompted with the user query and a description of the available tools (e.g., “RAG_Tool(query, rules, use_graph=True, use_ontology=True)”). The planner generates a plan, and an executor (a simple script or a framework like LangChain or Haystack) runs the plan, feeding the results back to the planner for the next step if necessary. The recursion happens when a sub-query is itself complex and requires further decomposition.
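The recursion bottoms out when a sub-query needs no further decomposition. The skeleton below shows that control flow with stubs standing in for the two LLM calls: `rag_tool` stands in for the full layered pipeline, and `decompose` stands in for the planner LLM (here it naively splits on “ and ”, which a real planner obviously would not do).

```python
def rag_tool(query: str) -> str:
    # Stand-in for the full pipeline: retrieve -> expand -> rules -> ontology -> generate.
    return f"<result for: {query}>"

def decompose(query: str) -> list[str]:
    # Stand-in for a planner LLM call that splits a complex query into sub-queries.
    if " and " in query:
        return [part.strip() for part in query.split(" and ")]
    return [query]

def plan_and_execute(query: str, depth: int = 0, max_depth: int = 3) -> str:
    """Recursively decompose a query, run the RAG tool on the leaves, synthesize upward."""
    subqueries = decompose(query)
    if depth >= max_depth or subqueries == [query]:
        return rag_tool(query)  # base case: simple enough to answer directly
    results = [plan_and_execute(sq, depth + 1, max_depth) for sq in subqueries]
    # Final synthesis: in practice another LLM call; here we just join the traces.
    return " | ".join(results)

print(plan_and_execute("compare market performance and explain regulatory changes"))
```

The `max_depth` guard is the cheap insurance against the planner looping forever on a query it keeps re-decomposing; the trace of sub-results is exactly the debugging artifact mentioned above.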

What to Measure: Recursive Planning

Evaluating a planning agent is the most complex task, as we’re now judging not just the final answer but the entire process.

  • Plan Correctness: For a set of complex queries, does the generated plan logically decompose the problem? Are the steps reasonable? This is a qualitative assessment, often done by a human expert reviewing the plan trace.
  • Task Completion Rate: What percentage of the time does the system successfully execute the entire plan and produce a final answer without getting stuck in a loop or encountering an unrecoverable error?
  • Answer Quality for Complex Queries: We need a new benchmark of complex, multi-faceted questions. We measure the quality of the final synthesized answer using the same rubrics as before (faithfulness, comprehensiveness, etc.), but now we expect a much higher standard. The answer should be well-structured and address all parts of the original query.
  • Efficiency of Planning: How many steps does the planner take? Does it over-decompose the problem, leading to unnecessary latency and cost? We need to measure the number of tool calls and the total execution time. A good planner is both effective and efficient.
  • Error Recovery: If one step in the plan fails (e.g., no documents are retrieved for a sub-query), can the planner recognize the failure and formulate an alternative plan? This is a key indicator of a robust autonomous system.

Building this entire stack is a significant undertaking, but it represents the cutting edge of what’s possible with retrieval-augmented systems. We start with a simple, brittle RAG and end with a recursive, reasoning agent that can navigate structured and unstructured information, guided by explicit rules and ontological knowledge. Each layer addresses a specific, tangible weakness of the layer below, creating a system that is more than the sum of its parts. The journey from paper to prototype is a process of careful layering, rigorous measurement, and a deep appreciation for the nuances of machine reasoning.
