There’s a specific kind of fatigue that sets in when you’re deep in a complex codebase or wrestling with a gnarly research problem. You feed a massive prompt into an LLM—hundreds, maybe thousands of tokens of context, instructions, examples, and data. You get a response. It’s good, but not perfect. You clarify, you add more context, you correct a subtle misunderstanding. The context window balloons. The model starts to lose the thread, forgetting the initial constraints or hallucinating details from earlier in the conversation. This is the “context rot” we’ve all experienced, the inevitable degradation of attention as the conversation lengthens. It’s the primary bottleneck holding back LLMs from true, long-horizon autonomy.

For years, the solution was brute force: bigger context windows. Flash attention. Better compression. But in early 2025, a different idea re-emerged from the academic shadows, one that felt less like an engineering hack and more like a fundamental shift in how we interact with these systems. It wasn’t entirely new, but the timing, the packaging, and the sheer elegance of the concept clicked into place. This was the revival of the Recursive Language Model (RLM).

The Ghost in the Machine: A Concept Reborn

To understand why RLMs captured the imagination of the AI community in 2025, we have to look back. The core idea—that an LLM could call itself recursively, treating its own output as a program to be executed—has been floating around for years. Early tool-using models were a primitive form of this. You’d ask a model to write a Python script, and it would output code. A separate system, an executor, would run that code and return the result. The model was a code generator, but it wasn’t operating in a closed loop. The environment was external, the execution step was manual, and the feedback loop was clumsy.

The breakthrough, articulated in a now-famous 2024 blog post and later formalized in a research paper that went viral in early 2025, was the radical simplification of this loop. The authors of the RLM paper asked a deceptively simple question: What if the environment the model operates in is, itself, a language model? What if the “tool” it uses is a recursive call to itself?

This wasn’t just about chaining prompts. It was about creating a true REPL (Read-Eval-Print Loop) environment where the LLM is both the programmer and the interpreter. The prompt is no longer a static block of text; it becomes a dynamic, evolving state. The model can write a plan, execute a step of that plan, observe the output, and then revise the plan—all within a single, coherent session. This structure inherently combats context rot. Instead of one monolithic conversation, you get a series of focused, nested calls. The outer context remains clean, containing only the high-level plan and the results of sub-tasks, while the inner, recursive calls handle the messy details.

Deconstructing the RLM Architecture: The Prompt as an Environment

The central diagram from the original RLM paper is a masterpiece of conceptual clarity. It looks less like a typical neural network diagram and more like a flowchart for a recursive algorithm. At its heart is the REPL Environment, a state that holds the current task, the history of actions, and the available tools. The model, acting as the Controller, receives this state and generates an Action. This action isn’t just text; it’s a structured command.

Let’s break down the key components:

1. The Controller (The LLM)

This is the familiar large language model, the part we interact with. Its job is to reason about the current state and decide on the next step. It’s the “brain” of the operation, but it’s a brain that only thinks in terms of actions.

2. The Action Space

In a traditional chat, the action space is simply “generate text.” In an RLM, the action space is expanded. The model can choose one of three actions (a minimal code sketch follows the list):

  • Think: Generate a chain-of-thought reasoning step that is stored in the context but doesn’t trigger an external action. This is for planning and reflection.
  • Execute: Call a tool. This is where the recursion happens. The “tool” can be a code interpreter, a web search API, or, most importantly, another call to the LLM itself.
  • Finish: Signal that the task is complete and return the final answer.
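
To make the action space concrete, here is a minimal sketch of how these three actions might be represented as data. The class and field names are my own illustration, not something specified in the RLM paper:

from dataclasses import dataclass
from typing import Union

@dataclass
class Think:
    content: str  # reasoning stored in context; triggers no external action

@dataclass
class Execute:
    tool: str      # e.g. "code_interpreter", "web_search", or "rlm" for a self-call
    argument: str  # the sub-task or tool input

@dataclass
class Finish:
    answer: str  # terminal action carrying the final answer

Action = Union[Think, Execute, Finish]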

3. The Environment (The REPL)

This is the crucial innovation. When the model decides to “Execute” a recursive call, it doesn’t just append a new message to a chat log. It creates a new instance of the RLM with a sub-task. The output of that sub-task (the result of the recursive call) is then fed back into the parent context as a “Tool Output.”

Imagine you ask an RLM: “Analyze the sentiment of the latest earnings call transcript for Company X and summarize the key risks.”

A traditional LLM might try to hold the entire transcript in its context, analyze it, and then summarize. It’s likely to miss details or lose focus.

An RLM approaches it differently:

  1. Controller: Receives the initial prompt. It formulates a plan: “1. Find the transcript. 2. Break it into chunks. 3. Analyze each chunk for sentiment. 4. Synthesize the results.”
  2. Action: It decides to execute a sub-task: “Find the transcript.” It makes a recursive call: RLM(sub_task="Find the latest earnings call transcript for Company X").
  3. Recursive Execution: The new RLM instance takes over. It might use a web search tool. It finds the transcript and returns it as a string.
  4. Feedback: The result (the transcript text) is returned to the parent RLM as a tool output. The parent’s context is now updated: “Plan: [steps]. Tool Output: [transcript text].”
  5. Next Action: The parent RLM now sees the transcript in its context. It decides on the next step: “Analyze sentiment.” It makes another recursive call: RLM(sub_task="Analyze the sentiment of this text: [transcript]").

This pattern continues. Each recursive call is a self-contained, focused task. The parent context never gets polluted with the nitty-gritty of how the sentiment analysis was performed; it only sees the result. This is the architectural solution to context rot. It’s a form of automatic context management driven by the model’s own reasoning.
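
The whole pattern fits in a few lines of Python. What follows is a minimal sketch under stated assumptions, not the paper’s reference implementation: call_llm stands in for a single model inference, and the THINK/RLM/FINISH prefixes are a placeholder action format.

def call_llm(context: str) -> str:
    raise NotImplementedError  # stand-in for one model inference call

def rlm(task: str, depth: int = 0, max_depth: int = 5) -> str:
    # Each invocation starts with a clean context: just its task and its tool outputs.
    context = f"Task: {task}\n"
    while True:
        action = call_llm(context)  # the model decides the next step
        if action.startswith("THINK:"):
            # Reasoning is appended to this instance's local context only.
            context += action + "\n"
        elif action.startswith("RLM:") and depth < max_depth:
            # Recursive call: the sub-task runs in a fresh context...
            result = rlm(action[len("RLM:"):].strip(), depth + 1, max_depth)
            # ...and only its result flows back up to the parent.
            context += f"Tool Output: {result}\n"
        else:
            # FINISH (or the depth limit): return the answer to the caller.
            return action.removeprefix("FINISH:").strip()

Note how the parent’s context only ever accumulates its own thoughts and the results of sub-tasks, never their internals.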

What’s Actually New? Rebranding vs. Revolution

A seasoned engineer might look at this and say, “Wait a minute. This is just function calling. This is just an agent.” And they wouldn’t be entirely wrong. The components are familiar. We’ve had tool-use for a while. We’ve had agents that can chain API calls. So what’s the real difference?

The distinction is subtle but profound. It lies in the primacy of the recursive call and the unification of the environment.

In a typical agent framework (like ReAct or a LangChain-style agent), the LLM is a component. The system has an explicit loop in Python or another language. The loop calls the LLM, parses the output, decides which tool to call, executes it, and then feeds the result back to the LLM. The LLM itself is stateless in this loop; it’s just a text-in, text-out function. The “intelligence” of the loop is coded by the developer.
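
For contrast, here is roughly what that hand-written loop looks like. The call_llm and parse functions are generic stand-ins for the model call and the brittle output parsing such frameworks require:

def call_llm(history: str) -> str:
    raise NotImplementedError  # stateless text-in, text-out model call

def parse(output: str) -> tuple[str, str, str]:
    raise NotImplementedError  # developer-written parsing of Thought/Action/Action Input

def agent_loop(task: str, tools: dict, max_steps: int = 10) -> str:
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        output = call_llm(history)
        thought, action, action_input = parse(output)
        if action == "final_answer":
            return action_input
        result = tools[action](action_input)  # developer-written dispatch
        history += f"{output}\nObservation: {result}\n"
    return "Gave up: step budget exhausted"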

The RLM model flips this. The LLM is the loop. The control logic isn’t hard-coded; it’s emergent from the model’s reasoning. The prompt itself becomes the execution environment. This is a much higher level of abstraction. Instead of programming the agent’s workflow, you are instructing the model on how to think about its own workflow.

Consider the difference in prompting. A traditional agent prompt might look like this:

You are a helpful assistant. You can use the following tools: search, calculator.
Always respond in this format:

Thought: [your reasoning]
Action: [tool_name]
Action Input: [input]
Observation: [tool_output]
…
Final Answer: [answer]

This is rigid. It requires strict output parsing. If the model deviates, the whole system breaks.

An RLM prompt is more fluid, more meta. It describes the environment and the goal:

You are operating in a recursive environment. You can solve complex problems by breaking them down. To solve a task, you can either reason internally or make a recursive call to a sub-task. A recursive call is made by writing RLM(sub_task="...", context="..."). The result of the sub-task will be returned to you. Your goal is to synthesize a final answer from the results of your sub-tasks.

This is a subtle but powerful shift. The model isn’t just following a template; it’s internalizing a computational paradigm. It’s being given the source code to its own operating system. This is why the RLM paper felt like a revelation. It wasn’t inventing a new capability, but rather giving a name and a formal structure to an emergent behavior that advanced models were already showing signs of.

Why 2025? The Convergence of Capabilities

Why did this idea, which has been theoretically possible for years, suddenly explode in 2025? The answer lies in a convergence of three key factors: model capability, context window limitations, and the maturity of the developer ecosystem.

1. The Great Context Wall

By late 2024, context windows had grown massive—millions of tokens. Yet, the performance degradation over long contexts remained a stubborn problem. The “lost in the middle” phenomenon was well-documented. The community was hitting a point of diminishing returns. Simply making the context window bigger wasn’t solving the fundamental issue of attention dilution. The industry needed a new paradigm, and RLMs offered a compelling alternative: don’t use a bigger context, use a smarter one.

2. The Rise of Reasoning Models

The models themselves had evolved. Early LLMs were brilliant next-token predictors, but they struggled with multi-step planning and self-correction. The “reasoning models” that emerged in 2024 (think O1-style models with extended internal monologues) were fundamentally better at the kind of structured thought required for RLMs. They could self-critique, plan, and decompose tasks with a reliability that was simply not possible a year or two earlier. An RLM is only as good as its controller; in 2025, the controllers finally became powerful enough to handle the cognitive load of self-recursion.

3. The Tool-Use Precedent

The developer community had spent two years building tool-using agents. We had learned, through trial and error, what worked and what didn’t. We had standardized on function-calling formats and built robust libraries for parsing LLM outputs. This collective knowledge created the fertile ground for RLMs. When the RLM paper was published, developers immediately understood the components. They recognized the pattern. It wasn’t a completely alien concept; it was a cleaner, more elegant abstraction of the agent architectures they were already building. It felt like the next logical step in the evolution from simple prompts to complex, autonomous systems.

A Practical Example: The RLM in Action

To truly appreciate the elegance of this approach, let’s walk through a more complex, technical example. Suppose we task an RLM with the following: “Write a small Python program that fetches the current weather for a given city using a public API, formats the output, and then uses that temperature to suggest a suitable outdoor activity.”

A standard LLM might struggle. It would need the API key, the endpoint documentation, the city name, and the logic for activity suggestions all crammed into one context. It might mix up the API key with the city name or forget the formatting instructions.

Here’s how an RLM would decompose and execute this task:

Initial Prompt State:

Task: Write a Python program that fetches the current weather for a given city using a public API, formats the output, and then uses that temperature to suggest a suitable outdoor activity.
City: San Francisco
API Key: [Redacted]
Available Tools: RLM (recursive call), CodeInterpreter (executes Python code)

Step 1: Planning and Decomposition

The RLM Controller analyzes the task. It doesn’t try to write the full program at once. It generates a plan, which is stored in its internal thought process.

Thought: This task has three distinct parts: 1) Fetching weather data, 2) Processing and formatting it, 3) Suggesting an activity. I should break this down. I’ll start by creating a script to fetch the weather.

Step 2: Recursive Call for Sub-Task 1

The RLM decides to create a sub-task focused solely on fetching the weather. It makes a recursive call.

Action: RLM(sub_task="Write a Python script to fetch the current weather for San Francisco using the OpenWeatherMap API. The script should take the API key as an argument and print the temperature in Celsius.", context="API_KEY=[Redacted]")

A new RLM instance is spawned. This sub-task is now its entire world. It’s not concerned with activity suggestions. It focuses on one thing: writing a correct, functional Python script. It generates the code.

Step 3: Execution and Feedback

The output of the sub-task (the Python code) is returned to the parent RLM. The parent now has a tool output: a working script. It doesn’t need to know how that script was written.

Tool Output:

import requests
import sys

api_key = sys.argv[1]
city = "San Francisco"
url = f"http://api.openweathermap.org/data/2.5/weather?q={city}&appid={api_key}"
response = requests.get(url)
data = response.json()
# OpenWeatherMap returns Kelvin by default; convert to Celsius.
print(data['main']['temp'] - 273.15)

Step 4: Next Recursive Call (Processing)

The parent RLM now knows how to get the temperature. But the output is just a raw number. It needs to be formatted. It makes another recursive call.

Action: RLM(sub_task="Take this raw temperature data: '28.5' and format it into a user-friendly string like 'The current temperature in San Francisco is 28.5°C.'")

This is a simpler sub-task, but it’s still a recursive call. The new instance handles the formatting and returns the string.

Step 5: Final Synthesis

The parent RLM now has two key results: a method to get the temperature and a formatted string. It’s time for the final step: the activity suggestion. It makes one last recursive call.

Action: RLM(sub_task="Given a temperature of 28.5°C in San Francisco, suggest a suitable outdoor activity. Consider that SF can be windy. Provide a one-sentence suggestion.")

The final sub-task generates the suggestion: “With a pleasant 28.5°C, it’s a great day for a walk across the Golden Gate Bridge, but you might want to bring a light jacket for the wind.”

Step 6: Completion

The parent RLM now has all the pieces: the code, the formatted temperature, and the activity suggestion. It synthesizes them into the final answer, structured for the user.

Final Answer:
Here is the complete solution for your request.

1. Python Script for Weather Fetching:
[The code from Step 3]

2. Weather Report:
The current temperature in San Francisco is 28.5°C.

3. Activity Suggestion:
With a pleasant 28.5°C, it’s a great day for a walk across the Golden Gate Bridge, but you might want to bring a light jacket for the wind.

Throughout this entire process, the parent RLM’s context remained clean. It never had to hold the raw API response, the specific details of the Python `requests` library, or the nuances of activity suggestion logic. It managed the workflow, delegated the work, and synthesized the results. This is the power of the recursive prompt-as-environment model.
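
If you dumped the parent’s context at the end of the run, it would be only a handful of entries. This illustrative structure (the field names are mine, not a specification) makes the point:

parent_context = [
    {"type": "task",        "content": "Weather program for San Francisco, with activity suggestion"},
    {"type": "think",       "content": "Plan: fetch weather -> format output -> suggest activity"},
    {"type": "tool_output", "content": "<the weather-fetching script>"},
    {"type": "tool_output", "content": "The current temperature in San Francisco is 28.5°C."},
    {"type": "tool_output", "content": "With a pleasant 28.5°C, it's a great day for a walk..."},
]
# Five entries, no matter how much work happened inside the sub-calls.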

The Developer Experience: A New Abstraction Layer

For developers, the shift to RLMs represents a move up the abstraction ladder. Instead of writing complex state machines and parsing logic for agent loops, we can focus on crafting high-quality task descriptions and defining the available “tools” (which are often just well-described sub-tasks). The RLM framework handles the orchestration.

Libraries and frameworks started appearing in early 2025 that made this pattern easy to implement. Instead of manually managing the recursive calls, a developer could define a tool simply by providing a natural language description of its function.

from rlm_framework import RLM, Tool

# A plain Python function the RLM can call. (Illustrative stub: any callable
# returning the temperature in Celsius for a city would do here.)
def fetch_weather_api(city: str) -> float:
    raise NotImplementedError  # e.g., call a real weather API

# Define a tool that the RLM can call recursively
weather_tool = Tool(
    name="get_weather",
    description="Fetches current weather for a given city. Returns temperature in Celsius.",
    func=fetch_weather_api  # A standard Python function
)

# Initialize the RLM with its tools
agent = RLM(
    model="gpt-5-turbo", 
    tools=[weather_tool]
)

# The RLM handles the decomposition and execution
response = agent.run(
    "What's the weather in Tokyo and what should I wear for a walk?"
)

Under the hood, the RLM framework translates the natural language tool description into a prompt that the model understands. When the model decides to use the tool, the framework executes the corresponding Python function and feeds the result back into the model’s context. The developer is freed from writing the loop. The model becomes the loop. This dramatically reduces the boilerplate and complexity of building autonomous agents.
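
A plausible sketch of that translation step, assuming nothing about the real rlm_framework internals beyond the Tool fields defined above:

def render_tool_prompt(tools: list) -> str:
    # Turn each tool's natural-language description into a line of the system prompt.
    lines = ["You can solve sub-problems by calling the following tools recursively:"]
    for tool in tools:
        lines.append(f"- {tool.name}: {tool.description}")
    lines.append('Invoke a tool by writing TOOL(name="...", input="...").')
    return "\n".join(lines)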

Limitations and the Road Ahead

RLMs are not a silver bullet. They introduce their own set of challenges. The most significant is the potential for infinite recursion. A model could get stuck in a loop, repeatedly calling itself on a task it cannot solve, consuming tokens and compute with no progress. Robust RLM implementations need safeguards: maximum recursion depth, time limits, and cost-tracking mechanisms.
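
These safeguards are straightforward to bolt onto the recursive call site. A minimal sketch, with all three limits chosen arbitrarily:

import time

class RecursionGuard:
    def __init__(self, max_depth: int = 5, max_seconds: float = 120.0, max_calls: int = 50):
        self.max_depth = max_depth
        self.deadline = time.monotonic() + max_seconds
        self.calls_remaining = max_calls  # doubles as a crude cost budget

    def check(self, depth: int) -> None:
        # Call this before spawning each recursive sub-task.
        self.calls_remaining -= 1
        if depth > self.max_depth:
            raise RuntimeError("max recursion depth exceeded")
        if time.monotonic() > self.deadline:
            raise RuntimeError("time limit exceeded")
        if self.calls_remaining < 0:
            raise RuntimeError("call budget exhausted")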

Another challenge is error propagation. If a sub-task fails or returns incorrect information, the parent task might build on that flawed foundation. The error might not be caught until much later in the process. This requires the RLM to be not just a planner and executor, but also a rigorous verifier, capable of calling self-critique and validation sub-tasks.
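
One mitigation is to make verification itself a recursive call. A hedged sketch, reusing the rlm() function from the earlier loop sketch:

def execute_with_verification(sub_task: str) -> str:
    result = rlm(sub_task)
    # Spawn an independent sub-task whose only job is to critique the first result.
    verdict = rlm(
        f"Check whether this result correctly solves the task.\n"
        f"Task: {sub_task}\nResult: {result}\n"
        f"Reply VALID or INVALID, with a reason."
    )
    if verdict.startswith("INVALID"):
        # Retry once with the critique attached, rather than building on a flawed result.
        result = rlm(f"{sub_task}\nA previous attempt was rejected because: {verdict}")
    return result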

Furthermore, there’s the question of computational cost. While RLMs are more efficient with context, they can generate more tokens overall due to the overhead of multiple recursive calls. Each call involves a new round of API requests or model inference. Optimizing the “call graph” of an RLM—deciding when to think, when to act, and when to delegate—is an emerging field of study in itself.

Despite these challenges, the re-emergence of Recursive Language Models in 2025 marks a pivotal moment. It signals a move away from treating LLMs as simple chatbots or text generators and toward viewing them as programmable reasoning engines. The prompt is no longer just a question; it’s a specification for a computational process. The model is no longer just an answerer; it’s an interpreter. For those of us who have been wrestling with the limitations of long-context windows and brittle agent frameworks, this shift feels less like an incremental improvement and more like the beginning of a new chapter. It’s a glimpse into a future where we don’t just talk to AI, we collaborate with it on a deeper, more structured level, building complex systems from the simple, recursive act of thought.
