Building a system capable of recursive reasoning is one of those engineering challenges that sits right at the intersection of elegant theory and messy reality. We all understand the promise: an agent that can decompose a complex problem into manageable sub-tasks, reason over them, and synthesize a final answer. But when you move from academic papers to production code, the abstract diagrams dissolve into a thousand lines of error handling, state management, and prompt engineering.
What follows is not a theoretical overview of Reinforcement Learning Models (RLM), but a concrete blueprint for a Minimum Viable Recursive Planner (MVRP). This is the architecture I use when I need a system that can “think” through multi-step tasks without getting lost in infinite loops or hallucinating intermediate steps. It is designed for engineers who want to build, not just read.
The Core Philosophy: Recursion as a Stack
At the heart of any recursive system lies the stack. In a traditional programming context, the stack manages memory addresses and local variables. In an agent-based system, the stack manages context and intent. The biggest mistake I see developers make is treating the planner as a linear sequence of prompts. That works for three steps. It collapses at ten.
Our MVRP treats every task—whether the root goal or a single sub-task—as a node in a tree. The state of the system is not just the data we have gathered; it is the current position within that tree.
Consider the example task we will use throughout this blueprint: Multi-step Policy Q&A. The user asks, “Does our current cloud architecture comply with the new GDPR data residency requirements regarding AI model training logs?”
This is not a retrieval problem. It is a reasoning problem. It requires fetching policies, inspecting architecture diagrams, checking jurisdictional boundaries, and synthesizing a compliance verdict with citations.
State Representation
The state is the single source of truth passed between recursive calls. If you get this wrong, your agent will forget what it was doing two steps ago. We avoid this by strictly defining the state object.
class PlannerState:
    def __init__(self, root_query, context_window):
        self.root_query = root_query
        self.context = context_window   # Token-limited buffer
        self.execution_stack = []       # LIFO structure for recursion
        self.completed_tasks = []       # Artifacts from finished nodes
        self.knowledge_base = {}        # External facts retrieved
        self.status = "running"         # running, suspended, completed, failed
In our GDPR example, the initial state contains the user’s question. As the planner runs, the execution_stack will fill with sub-goals like “Identify data types in training logs” and “Map data flows to geographic regions.”
Deconstructing the Planner Loop
The planner operates in a loop that mimics the call stack of a recursive function. There are three distinct phases: Decomposition, Execution, and Resolution.
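Before walking through each phase, here is a minimal sketch of that loop. The decompose, invoke_tool, and resolve_parent helpers stand in for the three phases described below; they are injected as callables rather than a fixed API, and the stack-entry fields are illustrative.

def run_planner(state, decompose, invoke_tool, resolve_parent, max_depth=6):
    """Drive the three phases until the execution stack is empty."""
    while state.execution_stack and state.status == "running":
        task = state.execution_stack[-1]  # peek at the top of the LIFO stack

        if task.get("waiting"):
            # Phase 3: the children pushed above this task have all been popped,
            # so the parent can wake up and synthesize their results.
            state.execution_stack.pop()
            resolve_parent(state, task)
            continue

        # Past the depth limit, the task is forced to behave as a leaf.
        subtasks = [] if task["depth"] >= max_depth else decompose(task)

        if subtasks:
            # Phase 1: composite task -- suspend the parent and push its children.
            task["waiting"] = True
            task["child_goals"] = subtasks
            for goal in reversed(subtasks):  # reversed so the first subtask runs first
                state.execution_stack.append(
                    {"goal": goal, "parent": task, "depth": task["depth"] + 1}
                )
        else:
            # Phase 2: atomic task -- execute it and record the artifact.
            state.execution_stack.pop()
            result = invoke_tool(task["goal"], state)
            state.completed_tasks.append({"goal": task["goal"], "result": result})

    if not state.execution_stack:
        state.status = "completed"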
Phase 1: Decomposition (The Generator)
When the planner receives a task, it first determines if the task is atomic or composite. An atomic task is one that can be solved by a single tool call or a single LLM inference (e.g., “What is the definition of PII?”). A composite task requires breaking down.
We use a “Decomposer” prompt. This is a highly constrained system prompt designed to output JSON, not prose. It analyzes the current node’s goal and outputs a list of sub-goals or declares the task atomic.
System Prompt (Decomposer):
“You are a task decomposer. Analyze the input goal. If the goal requires multiple distinct facts or actions to verify, output a JSON list of subtasks. If the goal is a single verifiable fact, output ‘atomic’. Prioritize dependencies: subtask A must finish before subtask B if A’s output is needed by B.”
For our GDPR query, the Decomposer might output:
{
  "subtasks": [
    "Identify all components that generate training logs",
    "Determine the geographic storage location of each component",
    "Identify the legal basis for processing under GDPR Article 6",
    "Check if any processing involves automated decision-making (Article 22)"
  ]
}
These subtasks are pushed onto the execution_stack. Crucially, the parent task remains in the stack, marked as “waiting,” until all children return.
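A minimal sketch of this step, assuming a generic llm_call(system_prompt, user_message) helper that returns the model's raw text. In practice you would bind llm_call and the Decomposer prompt (for example with functools.partial) before handing the function to the planner loop above.

import json

def decompose(task, llm_call, decomposer_prompt):
    """Ask the Decomposer whether the task is atomic; return a list of subtask goals."""
    raw = llm_call(decomposer_prompt, task["goal"])
    if raw.strip().lower() == "atomic":
        return []  # leaf node: handled by the Tool Interface
    try:
        return json.loads(raw)["subtasks"]
    except (json.JSONDecodeError, KeyError):
        # Malformed output is treated as atomic rather than crashing the planner.
        return []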
Phase 2: Execution (The Tool Interface)
Once a task is deemed atomic (or we have hit the maximum recursion depth), the agent must act. This is where the Tool Interface comes in. We treat tools not as magic functions, but as strictly typed API calls.
The tool interface acts as a sandbox. It prevents the LLM from hallucinating database queries or executing arbitrary code. It validates inputs and sanitizes outputs before they are injected back into the context window.
In our example, when the agent encounters the subtask “Determine the geographic storage location,” it triggers the vector_db_search tool or the aws_cli_wrapper tool.
def invoke_tool(task_description, state):
    # 1. Select tool based on task description
    tool = tool_router.route(task_description)
    # 2. Generate parameters (LLM call)
    params = generate_tool_parameters(task_description, tool.schema)
    # 3. Execute with guardrails
    try:
        result = tool.execute(params)
        # 4. Sanitize output (remove PII, truncate if too long)
        sanitized = sanitize(result)
        return sanitized
    except Exception as e:
        return f"Error executing tool: {str(e)}"
If the tool call fails, the error is captured and fed back into the LLM context. The agent can then self-correct (e.g., “I tried to query AWS but lacked permissions; I will try to query the configuration documentation instead”).
Phase 3: Resolution (The Unwinding)
This is the most critical phase in a recursive planner. When a subtask completes, its result is stored in the completed_tasks list. The planner then checks the parent node. Are all siblings finished?
If yes, the parent task wakes up. The planner constructs a new context window containing:
- The parent’s original goal.
- The results of all child subtasks.
- The original user query (to maintain alignment).
This “synthesis” step is where the magic happens. The LLM looks at the disparate data points—the storage locations, the legal articles—and generates a coherent answer. This answer becomes the artifact for the parent node, and the cycle continues up the stack.
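A sketch of that wake-up step, again assuming the llm_call helper (it would be bound to the function before wiring it into the loop above) and the stack-entry fields used in the earlier sketches:

def resolve_parent(state, parent, llm_call):
    """Synthesize child artifacts into a single answer for the parent node."""
    child_goals = parent.get("child_goals", [])
    findings = [t for t in state.completed_tasks if t["goal"] in child_goals]
    synthesis_context = "\n".join(
        [
            f"Original user query: {state.root_query}",
            f"Current goal: {parent['goal']}",
            "Findings from subtasks:",
            *[f"- {f['goal']}: {f['result']}" for f in findings],
        ]
    )
    answer = llm_call(
        "Synthesize the findings into a direct answer to the current goal. "
        "Cite the finding that supports each claim.",
        synthesis_context,
    )
    # The parent's answer becomes an artifact its own parent can consume.
    state.completed_tasks.append({"goal": parent["goal"], "result": answer})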
Guardrails and Caching: The Safety Net
Recursion is dangerous. Without limits, an agent can enter an infinite loop, decomposing a task endlessly (e.g., “Break down ‘Find X’ into ‘Find Y’ and ‘Find Z’… forever”).
Depth Limits
We enforce a hard limit on recursion depth (typically 5-7 levels). If the stack exceeds this, the planner forces a “leaf” decision, treating the current task as atomic regardless of the Decomposer’s output.
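In code, the guard is tiny; the planner loop consults it before ever calling the Decomposer (the constant is illustrative):

MAX_DEPTH = 6  # hard ceiling; tune between 5 and 7 for most workloads

def should_force_leaf(task, max_depth=MAX_DEPTH):
    # Past the ceiling, treat the node as atomic no matter what the Decomposer says.
    return task["depth"] >= max_depth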
Caching Strategy
Recursive planners are expensive. If you ask the same question twice, you should not pay the inference cost twice. We implement a two-layer caching system:
- Result Caching: Hash the input task description. If it exists in the Redis cache, return the stored artifact immediately. This is vital for branches of the tree that share dependencies.
- Embedding Caching: Vector embeddings for retrieved documents are cached. If a subtask needs to search the knowledge base, we check if the query vector has been computed recently.
In the GDPR example, if two different subtasks both need to know “Where is the S3 bucket located?”, the second call should hit the cache, not the API.
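A minimal sketch of the result-caching layer, with an in-process dict standing in for Redis (swap in a Redis client with the same get/set semantics for production):

import hashlib

class ResultCache:
    """Result cache keyed by a hash of the task description."""

    def __init__(self):
        self._store = {}

    def _key(self, task_description):
        return hashlib.sha256(task_description.strip().lower().encode()).hexdigest()

    def get(self, task_description):
        return self._store.get(self._key(task_description))

    def set(self, task_description, artifact):
        self._store[self._key(task_description)] = artifact


cache = ResultCache()

def invoke_tool_cached(task_description, state):
    cached = cache.get(task_description)
    if cached is not None:
        return cached  # the second "Where is the S3 bucket located?" never hits the API
    result = invoke_tool(task_description, state)
    cache.set(task_description, result)
    return result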
Step-by-Step Flow: The GDPR Example
Let’s trace the execution flow concretely. We start with the root node.
Step 1: Initialization
The user submits the query. The PlannerState is instantiated. The root task “Verify GDPR Compliance for AI Training Logs” is pushed onto the stack.
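In code, using the PlannerState class from earlier (the stack-entry fields are the illustrative ones used throughout these sketches):

state = PlannerState(
    root_query=(
        "Does our current cloud architecture comply with the new GDPR data "
        "residency requirements regarding AI model training logs?"
    ),
    context_window=[],
)
state.execution_stack.append(
    {"goal": "Verify GDPR Compliance for AI Training Logs", "parent": None, "depth": 0}
)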
Step 2: First Decomposition
The planner invokes the Decomposer. It determines the task is composite. It generates four subtasks (as seen in the JSON above) and pushes them onto the stack. The root task is marked “SUSPENDED”.
Step 3: Processing Subtask 1 (Atomic)
The top of the stack is “Identify components generating training logs.” The Decomposer checks this and determines it is atomic (requires a database lookup).
The Tool Interface invokes db_query("SELECT component_name FROM infrastructure WHERE logs_enabled = TRUE").
Result: ["model_trainer_v2", "data_pipeline_aggregator"].
Result is stored. Task is popped.
Step 4: Processing Subtask 2 (Atomic)
Next on the stack: “Determine geographic storage location.”
The agent realizes it needs the output of Subtask 1. It retrieves that result from the state.
It invokes the AWS CLI tool for both components.
Result: model_trainer_v2 is in eu-west-1 (Ireland). data_pipeline_aggregator is in us-east-1 (Virginia).
Result is stored. Task is popped.
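One way to resolve that dependency is a small lookup over the artifacts recorded so far; a sketch matching the completed_tasks layout used above:

def find_artifact(state, keyword):
    """Return the most recent completed artifact whose goal mentions the keyword."""
    for entry in reversed(state.completed_tasks):
        if keyword.lower() in entry["goal"].lower():
            return entry["result"]
    return None

components = find_artifact(state, "components that generate training logs")
# expected here: ["model_trainer_v2", "data_pipeline_aggregator"]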
Step 5: The Synthesis (Parent Wake-up)
After the two remaining subtasks complete in the same way, the stack contains only the Root Task. All children are complete.
The planner constructs the synthesis context:
Context:
Goal: Verify GDPR Compliance.
Findings:
1. Training logs exist in Ireland (EU jurisdiction).
2. Aggregation logs exist in Virginia (US jurisdiction).
Constraint: GDPR restricts data transfer to non-adequate third countries without safeguards.
Step 6: Final Inference
The LLM processes this synthesis. It recognizes a conflict: the data_pipeline_aggregator is in the US, which is a “third country” under GDPR. It checks the “legal basis” subtask (which we haven’t detailed here, but assume it was completed).
The agent generates the final response:
“Compliance status: Partial. The model_trainer_v2 component in Ireland is compliant. However, the data_pipeline_aggregator in us-east-1 processes training logs containing PII. Without Standard Contractual Clauses (SCCs) or an adequacy decision, this constitutes a violation. Recommendation: Migrate the aggregator to an EU region or implement client-side encryption before transit.”
Implementation Nuances
When building this, you will encounter friction points. Here are the ones that usually trip up developers.
Context Window Management
As the recursion unwinds, the context grows. If you pass the full history of every subtask back to the parent, you will exceed token limits immediately.
Solution: Use a summarization step. Before a child task returns its result, ask the LLM to summarize the finding into a concise “fact card” (e.g., { "fact": "Data Location", "value": "US", "confidence": "High" }). Pass these fact cards up the stack, not the raw tool logs.
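A sketch of that summarization step, again assuming the llm_call helper; the field names mirror the fact card above:

import json

def to_fact_card(goal, raw_result, llm_call):
    """Compress a raw tool result into a compact fact card before it climbs the stack."""
    raw = llm_call(
        "Summarize the finding as JSON with keys 'fact', 'value', and 'confidence'. "
        "Be terse; one short phrase per field.",
        f"Goal: {goal}\nRaw result: {raw_result[:2000]}",  # cap what we send back to the model
    )
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a truncated plain-text card rather than passing raw logs upward.
        return {"fact": goal, "value": raw_result[:300], "confidence": "unknown"}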
The “Stuck” State
Sometimes an atomic task fails repeatedly (e.g., API downtime). The agent needs a fallback strategy.
Implementation: Add a “retry_count” to the task object. If > 2, switch the tool. If vector_search fails, try keyword_search. If that fails, mark the task as “failed” and allow the LLM to reason with the missing data, explicitly noting the uncertainty.
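A sketch of the fallback chain; the tool names are the illustrative ones from the text, and tools maps each name to a callable taking the task goal:

FALLBACK_CHAIN = ["vector_search", "keyword_search"]  # ordered by preference
MAX_RETRIES = 2

def execute_with_fallback(task, tools):
    """Try each tool in the chain before declaring the task failed."""
    errors = []
    for tool_name in FALLBACK_CHAIN:
        tool = tools.get(tool_name)
        if tool is None:
            continue
        for attempt in range(MAX_RETRIES + 1):
            try:
                return {"status": "ok", "result": tool(task["goal"])}
            except Exception as exc:
                errors.append(f"{tool_name} attempt {attempt + 1}: {exc}")
    # All tools exhausted: surface the uncertainty so the LLM can reason around it.
    return {"status": "failed", "result": None, "errors": errors}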
Tool Hallucination
The LLM might invent a tool that doesn’t exist because it “wants” to solve the problem.
Implementation: The Tool Router must strictly validate tool names against a registry. If the LLM outputs a tool call for access_gdpr_database() and that tool isn’t in your registry, the system must reject it and return a hard error message: “Tool not found. Available tools: X, Y, Z.” Fed back into the context, this error steers the model toward a valid tool on its next attempt.
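A minimal registry check to back that up (the registry entries are placeholders):

TOOL_REGISTRY = {
    "vector_db_search": lambda params: ...,  # illustrative; bind real callables here
    "aws_cli_wrapper": lambda params: ...,
    "db_query": lambda params: ...,
}

def route_tool(tool_name):
    """Return the registered tool, or the hard error string the LLM sees instead."""
    if tool_name not in TOOL_REGISTRY:
        available = ", ".join(sorted(TOOL_REGISTRY))
        return None, f"Tool not found. Available tools: {available}"
    return TOOL_REGISTRY[tool_name], None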
The Human Element in the Code
Building a recursive planner is an exercise in humility. You are coding a system that is designed to admit it doesn’t know the answer and break the problem down until it finds something it can verify.
The architecture described above—State, Stack, Decomposer, Tool Interface, and Guardrails—is minimal but robust. It handles the GDPR example because it handles any multi-step reasoning task: debugging code, planning a travel itinerary, or analyzing financial reports.
The key is not to over-engineer the LLM’s “intelligence” but to carefully engineer the container that holds it. Provide strict boundaries (the stack), clear memory (the state), and reliable hands (the tools). The reasoning emerges from the interaction of these parts, not from any single magical prompt.
When you run this system for the first time, watch the stack. Seeing the parent task wait patiently for its children to return is one of the most satisfying moments in software development—it is the moment your program stops merely executing instructions and starts solving problems.

