When the RLM (Recursive Language Model) paper dropped, it didn’t just propose another architecture tweak—it reframed the conversation around what an LLM actually does during inference. The core idea—treating inference not as a single forward pass but as a recursive, self-correcting process—resonated deeply with researchers who had been bumping up against the hard ceilings of static generation. It felt less like a breakthrough and more like a validation of a direction many of us were already fumbling toward in the dark. Since then, the landscape has fractured into distinct tribes, each interpreting “recursion” through the lens of their own constraints: latency, accuracy, cost, or raw intelligence.

The Academic Core: Where Recursion Meets Theory

The university labs pushing this forward aren’t necessarily chasing the highest scores on MMLU. They are obsessed with the structure of thought. If you look at the work coming out of places like Stanford’s HAI or MIT’s CSAIL, you see a distinct split.

On one side, you have the formal verification crowd. These groups are taking the RLM concept of “verification loops” to its logical extreme. They aren’t just asking the model to check its own work; they are trying to embed the model inside a provably correct mathematical framework. The work on Lean4 and LLM integration (specifically around the ProofNet benchmark) is a prime example. Here, recursion isn’t a heuristic; it’s a necessity. The model generates a proof step, a verifier checks it, and if it fails, the error message becomes the context for the next generation. It’s a rigid, unforgiving recursion that trades the fluidity of prose for the binary certainty of logic.
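
To make that loop concrete, here is a minimal Python sketch of the generate-verify-retry pattern. The `generate_step` and `check_step` callables are hypothetical stand-ins for the LLM call and the proof checker; this shows the shape of the loop, not the actual Lean or ProofNet tooling.

```python
from typing import Callable, Optional

def prove_with_feedback(
    goal: str,
    generate_step: Callable[[str], str],          # LLM call: prompt -> candidate proof step
    check_step: Callable[[str], Optional[str]],   # verifier: step -> error message, or None if accepted
    max_attempts: int = 5,
) -> Optional[str]:
    """Generate-verify-retry loop: the verifier's error message becomes
    part of the next prompt, so each attempt is conditioned on the failure."""
    prompt = f"Goal: {goal}\nPropose the next proof step."
    for _ in range(max_attempts):
        step = generate_step(prompt)
        error = check_step(step)
        if error is None:
            return step  # the checker accepted the step
        # Fold the rejection back into the context for the next attempt.
        prompt = (
            f"Goal: {goal}\n"
            f"Previous attempt: {step}\n"
            f"Checker error: {error}\n"
            "Propose a corrected proof step."
        )
    return None  # give up after max_attempts
```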

Contrast this with the cognitive modeling groups, particularly those working on Chain-of-Thought (CoT) refinement. The RLM paper highlighted the inefficiency of linear reasoning, and academic researchers are now exploring tree-based attention mechanisms. Instead of a single recursive loop, they are mapping out branching paths of reasoning. The Tree of Thoughts (ToT) framework, while pre-dating RLM, has found new life in this context. Researchers at NYU and DeepMind (academic collaborations) are investigating how to prune these trees efficiently. The challenge they face is the “combinatorial explosion”—recursion is expensive. Their current solutions involve training smaller “evaluator” models to decide which branches of the recursion are worth pursuing, essentially creating a meta-learning layer on top of the recursive process.
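
A rough sketch of that pruning idea, under the assumption that a cheap evaluator can score partial reasoning. The `expand` and `score` callables are hypothetical, and the beam-style pruning below is a simplification of what the ToT literature actually implements.

```python
from typing import Callable, List, Tuple

def prune_and_expand(
    problem: str,
    expand: Callable[[str, str], List[str]],   # (problem, partial reasoning) -> candidate next thoughts
    score: Callable[[str, str], float],        # cheap evaluator: (problem, partial reasoning) -> promise score
    beam_width: int = 3,
    depth: int = 4,
) -> List[Tuple[float, str]]:
    """Beam-style tree search: expand every surviving branch, score the
    children with a small evaluator model, and keep only the top-k to
    contain the combinatorial explosion."""
    frontier: List[Tuple[float, str]] = [(0.0, "")]  # (score, reasoning-so-far)
    for _ in range(depth):
        children: List[Tuple[float, str]] = []
        for _, partial in frontier:
            for thought in expand(problem, partial):
                extended = (partial + "\n" + thought).strip()
                children.append((score(problem, extended), extended))
        # Prune: only the most promising branches survive to the next level.
        frontier = sorted(children, key=lambda c: c[0], reverse=True)[:beam_width]
    return frontier
```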

The “Context Folding” Obsession

One of the most technically dense areas of research right now, stemming directly from the RLM paper’s discussion on context window limitations, is what I call “context folding.” Standard RAG (Retrieval-Augmented Generation) is clumsy—it dumps documents into the prompt. The academic push is toward recursive summarization and hierarchical memory.

Teams at Carnegie Mellon are publishing papers on Memory Transformers that don’t just attend to the immediate context but attend to summaries of previous contexts. It’s a recursive data structure implemented in neural weights. They are essentially building a stack inside the attention mechanism. This allows the model to “zoom out” and see the forest, not just the trees, without blowing up the KV cache. It’s a hardware-aware approach to recursion, acknowledging that we can’t infinitely expand context, so we must compress it recursively.
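
A minimal sketch of context folding done as plain glue code rather than learned weights: recursively summarize groups of chunks until the result fits a budget. The `summarize` callable is a hypothetical LLM call; the Memory Transformer work described above does this inside the architecture, not outside it.

```python
from typing import Callable, List

def fold_context(
    chunks: List[str],
    summarize: Callable[[str], str],   # LLM call: text -> shorter summary
    group_size: int = 4,
    max_chars: int = 4000,
) -> str:
    """Recursively compress a long context: summarize groups of chunks,
    then summarize the summaries, until the result fits the budget."""
    level = chunks
    while len("\n".join(level)) > max_chars and len(level) > 1:
        next_level = []
        for i in range(0, len(level), group_size):
            group = "\n".join(level[i:i + group_size])
            next_level.append(summarize(group))
        level = next_level  # each pass is one "fold" up the hierarchy
    return "\n".join(level)
```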

Industrial Application: The Latency-Accuracy Tradeoff

While academia explores the theoretical limits, industry is obsessed with the practical application of recursive inference. The big labs—OpenAI, Anthropic, Google DeepMind—are obviously working on this, but their public releases hint at the directions they are taking.

Anthropic’s work on Constitutional AI and their recent focus on “artifacts” (generating code, then iterating on it) is a form of recursive application. They are applying the verification loop internally (via RLHF and constitutional principles) before the output reaches the user. However, the most visible public exploration of RLM-like ideas comes from the independent and open-source sector.

Take the teams behind LangChain and LlamaIndex. While they are framework developers, not model creators, they are the primary architects of the “recursive agent” patterns we see today. Their documentation is a map of how developers are trying to implement RLM concepts without custom architectures. They rely on function calling—essentially forcing a deterministic recursion on top of a probabilistic model. It’s a hack, but it’s the most widely used hack in the world right now.
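
Here is roughly what that pattern looks like stripped of any framework, offered as a hedged sketch rather than LangChain's or LlamaIndex's actual API. The `call_model` callable and the JSON action schema are assumptions for illustration.

```python
import json
from typing import Callable, Dict

def run_agent(
    task: str,
    call_model: Callable[[str], str],        # LLM call: transcript -> JSON like {"tool": "...", "input": "..."}
    tools: Dict[str, Callable[[str], str]],  # deterministic tools keyed by name
    max_steps: int = 8,
) -> str:
    """Deterministic recursion bolted onto a probabilistic model: the model
    picks a tool, the tool's output is appended to the transcript, and the
    loop repeats until the model emits a 'final' action."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        raw = call_model(transcript)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            transcript += "Observation: your last reply was not valid JSON; try again.\n"
            continue
        if action.get("tool") == "final":
            return str(action.get("input", ""))
        tool = tools.get(action.get("tool", ""))
        observation = tool(str(action.get("input", ""))) if tool else "error: unknown tool"
        transcript += f"Action: {raw}\nObservation: {observation}\n"
    return "max steps exceeded"
```

The point is that the control flow is ordinary code; the model only chooses which branch to take.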

The Tool-Use Divergence

This brings us to the divergence in tool use. The RLM paper posits that recursion allows a model to “think” longer. In the industry, “thinking longer” usually means “using tools.”

There are two distinct schools of thought here:

  1. The Python Interpreter Model: Groups like Replit and various coding-focused startups treat recursion as a debug cycle. The model writes code, the interpreter runs it, the error feedback loops back, and the model fixes it. This is the most mature form of recursive inference because the verification signal (the interpreter or compiler) is deterministic and unambiguous; a minimal sketch of this loop follows the list.
  2. The Web-Browser/Research Model: Companies building “research assistants” (like Perplexity or Glean) use recursion to deepen search. They don’t just search once; they search, read, identify gaps, and search again. This is a breadth-first search turned into a recursive depth-first crawl.
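
Here is the debug cycle from the first school, sketched with a real interpreter (a Python subprocess) and a hypothetical `write_code` callable standing in for the model.

```python
import subprocess
import sys
from typing import Callable, Optional

def debug_cycle(
    spec: str,
    write_code: Callable[[str], str],   # LLM call: prompt -> Python source
    max_rounds: int = 4,
    timeout_s: int = 10,
) -> Optional[str]:
    """Interpreter-as-verifier loop: run the generated script, and if it
    crashes (or hangs), feed the error back to the model for another attempt."""
    prompt = f"Write a Python script that does the following:\n{spec}"
    for _ in range(max_rounds):
        code = write_code(prompt)
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                capture_output=True, text=True, timeout=timeout_s,
            )
            error = result.stderr if result.returncode != 0 else None
        except subprocess.TimeoutExpired:
            error = f"script did not finish within {timeout_s}s"
        if error is None:
            return code  # the script ran cleanly; hand it back
        prompt = f"{spec}\n\nYour previous attempt failed with:\n{error}\nFix the script."
    return None
```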

The friction point in industry is token cost and latency. Every recursive step requires another LLM call. While academic papers might ignore the 30-second latency of a complex recursive query, a startup cannot. This has led to the rise of “smaller, faster” models acting as the “controllers” of the recursion, while the heavy lifting is done by larger models only when necessary. This is a direct implementation of the RLM’s efficiency arguments.

The Indie Researcher & The “Glue” Code

Perhaps the most exciting developments are happening in the indie researcher space—people like Max Woolf or the contributors to the Open-Source AI Cookbook. They don’t have clusters of H100s, so they are forced to be creative with recursion.

They are the ones mapping out the “state management” problem of recursive inference. If you have a model that iterates on a problem 10 times, how do you maintain coherence? The RLM paper suggests a unified reasoning trace, but in practice, indie developers are building complex state machines.

One fascinating trend is the use of structured outputs (JSON) to enforce recursion. Instead of letting the model free-respond, developers force it to output a JSON object containing a “reasoning” field and a “confidence” field. If confidence is low, the system automatically triggers another recursive call with the previous reasoning as context. This turns a language model into a recursive function that returns a value. It’s rigid, but it’s incredibly reliable.
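
A minimal sketch of that confidence-gated recursion, assuming a hypothetical `call_model` that returns the JSON schema described above (error handling for malformed JSON is omitted for brevity).

```python
import json
from typing import Callable

def answer_with_confidence(
    question: str,
    call_model: Callable[[str], str],   # LLM call: prompt -> JSON with "answer", "reasoning", "confidence"
    threshold: float = 0.8,
    max_rounds: int = 3,
) -> dict:
    """Treat the model as a recursive function: if its self-reported
    confidence is below the threshold, call again with the previous
    reasoning folded into the prompt."""
    prompt = (
        f"Question: {question}\n"
        'Respond as JSON: {"answer": ..., "reasoning": ..., "confidence": 0-1}'
    )
    result: dict = {}
    for _ in range(max_rounds):
        result = json.loads(call_model(prompt))
        if float(result.get("confidence", 0.0)) >= threshold:
            break
        prompt = (
            f"Question: {question}\n"
            f"Earlier reasoning (confidence too low): {result.get('reasoning', '')}\n"
            "Reconsider and respond in the same JSON format."
        )
    return result
```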

These developers are also the primary explorers of verification loops using “weak models.” A common pattern emerging in the indie community is using a cheap, fast model (like a distilled version of Llama or a tiny local model) to verify the output of a larger, slower model. If the small model flags a hallucination, the large model is prompted to retry. This is a practical, budget-conscious implementation of the RLM verification principle.
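
Sketched in code, the weak-verifier pattern looks like this. The `big_model` and `small_verifier` callables are hypothetical, and a real verifier prompt would be more careful about what it flags.

```python
from typing import Callable

def generate_with_cheap_check(
    prompt: str,
    big_model: Callable[[str], str],            # slow, expensive generator
    small_verifier: Callable[[str, str], bool], # cheap model: (prompt, answer) -> True if it flags a problem
    max_retries: int = 2,
) -> str:
    """Budget-conscious verification: only re-run the expensive model when
    the cheap verifier objects to its output."""
    answer = big_model(prompt)
    for _ in range(max_retries):
        if not small_verifier(prompt, answer):
            return answer  # the cheap check passed
        answer = big_model(
            f"{prompt}\n\nA reviewer flagged your previous answer as possibly "
            f"unsupported:\n{answer}\nTry again and stick to verifiable claims."
        )
    return answer
```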

Verification: The Holy Grail

Verification is the hardest part of this landscape. The RLM paper touches on the idea of the model verifying its own output, but this is notoriously difficult due to the “sycophancy” problem—models tend to agree with themselves.

Current research directions in verification are split:

  • External Verifiers: Using calculators, code executors, or knowledge graphs to ground truth. This is the most reliable but the most limited.
  • Self-Consistency: Running the same prompt multiple times and looking for consensus. This is expensive but effective for math and logic; a minimal sketch follows this list.
  • Latent Space Verification: A more theoretical approach where researchers look at the internal activations of the model to detect uncertainty, rather than reading the output text. This is still in early academic stages but holds promise for reducing the token cost of verification.
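
The self-consistency bullet is the easiest to pin down in code. A minimal sketch, assuming a `sample_answer` callable that returns one sampled completion reduced to a short final answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(
    prompt: str,
    sample_answer: Callable[[str], str],  # one sampled LLM completion, reduced to a final answer string
    n_samples: int = 9,
) -> str:
    """Sample the same prompt several times and return the consensus answer.
    Works best when answers are short and canonical (numbers, multiple choice)."""
    votes = Counter(sample_answer(prompt).strip() for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer
```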

Convergence: The “System 2” Architecture

Despite the different focuses—academia on theory, industry on latency, indies on hacks—a convergence is happening. Everyone is moving toward what cognitive scientists call “System 2” thinking: slow, deliberate, recursive processing.

The landscape is converging on a two-tier architecture (sketched in code after the list):

  1. A Fast System 1: A standard LLM that generates initial drafts, code, or answers quickly.
  2. A Slow System 2: A recursive wrapper that critiques, verifies, and refines the output using tools and loops.
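
A minimal sketch of the two-tier loop, with hypothetical `draft`, `critique`, and `refine` callables and a plain "APPROVE" sentinel standing in for whatever stopping signal a real system would use:

```python
from typing import Callable

def system2_refine(
    task: str,
    draft: Callable[[str], str],             # fast System 1: task -> first attempt
    critique: Callable[[str, str], str],     # slow System 2: (task, attempt) -> critique, or "APPROVE"
    refine: Callable[[str, str, str], str],  # (task, attempt, critique) -> revised attempt
    max_passes: int = 3,
) -> str:
    """Two-tier loop: a fast draft, then deliberate critique-and-refine
    passes until the critic approves or the budget runs out."""
    attempt = draft(task)
    for _ in range(max_passes):
        feedback = critique(task, attempt)
        if feedback.strip().upper() == "APPROVE":
            break
        attempt = refine(task, attempt, feedback)
    return attempt
```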

The RLM paper provided the vocabulary for this shift. It moved the conversation from “bigger models are better” to “better inference strategies are better.”

We are seeing this in the latest releases from major labs. The “o1” models (or whatever they are called this month) explicitly show the model “thinking” before answering. That thinking process is the visible manifestation of the recursive loops that researchers have been building in code for the past year.

The Hardware Constraint

We cannot discuss this map without acknowledging the hardware. Recursion is memory-bound. Every recursive step keeps the KV cache alive longer. The industry is currently bottlenecked by memory bandwidth.
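
To see why, some back-of-envelope arithmetic helps. The numbers below describe a hypothetical 70B-class model (80 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16 cache) and are illustrative, not a benchmark of any specific system.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Rough KV-cache size: keys and values for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative (hypothetical) 70B-class config with a 32k-token recursive trace.
gb = kv_cache_bytes(80, 8, 128, 32_768, 2) / 1e9
print(f"~{gb:.1f} GB of KV cache for one 32k-token sequence")
```

Keep several recursive branches alive at once and that figure multiplies, which is exactly the memory-bandwidth pressure described above.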

Groups like Groq and Cerebras are interesting here because their architectures (LPUs and wafer-scale chips) are designed for low-latency inference. They don’t necessarily make the recursion smarter, but they make it faster. This changes the economic calculus for recursive approaches. If an inference pass takes 10ms instead of 500ms, you can afford to be much more recursive.

Academic papers are starting to account for this, too. Kernel-level optimizations like FlashAttention and KV-cache management techniques like PagedAttention are prerequisites for efficient recursive inference. You can’t run a deep recursion loop if you’re constantly swapping memory to disk.

Verifiable vs. Probabilistic: The Fundamental Tension

There is a deep philosophical divide in this landscape that is often glossed over. The RLM paper treats language models as reasoning engines. However, language is probabilistic, while reasoning (in the formal sense) is deterministic.

The groups working on formal verification (math, code) are essentially trying to force a probabilistic system into a deterministic box. They use recursion to generate candidates, then verification to filter them. This works great for code and math.

But what about creative writing or strategic planning? Verification is much harder there. The “indie researchers” are tackling this by redefining “verification” as “alignment with user intent.” They use recursive loops to ask the user for clarification or to refine a draft based on feedback.

This leads to a divergence in the map:

  • The Hard Sciences Path: Recursion + External Verifiers (Code/Calc/Knowledge Graphs).
  • The Humanities/Soft Skills Path: Recursion + Human-in-the-loop (Critique/Refinement).

Interestingly, the most advanced “General AI” research is trying to bridge this. They are trying to teach models to use “internal critique” that mimics the rigor of a formal verifier even for subjective tasks. This involves training models on datasets of “self-correction” where the model is shown a bad output and a good output, learning the recursive path between them.

The Role of “Context Folding” in Long-Term Memory

Let’s dig deeper into context folding, as this is where the most engineering innovation is happening. The standard transformer has no memory beyond the context window. RLM-style inference implies a need for long-term coherence.

Researchers at Microsoft and Google are publishing heavily on “MemGPT” style architectures. These systems treat the context window as a “working memory” and a database as “long-term memory.” The recursion happens when the model decides to query the database, retrieve a summary, and fold it back into the context.

This is essentially a recursive file system operating inside the LLM’s head. The “agent” decides it needs more information, retrieves it, and then re-evaluates the problem. This is the architecture that powers the most sophisticated chatbots today, though it’s often hidden behind the UI.
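
A toy version of that working-memory/long-term-memory split, with a hypothetical `summarize` callable and a naive keyword search standing in for a real vector database. This is not the MemGPT codebase, just the shape of the idea.

```python
from typing import Callable, List

class FoldingMemory:
    """Working-memory/long-term-memory split: when the working context
    overflows, the oldest entries are summarized and archived; a crude
    keyword search pulls archived summaries back in on demand."""

    def __init__(self, summarize: Callable[[str], str], budget_chars: int = 2000):
        self.summarize = summarize          # LLM call: text -> short summary
        self.budget = budget_chars
        self.working: List[str] = []        # what would sit in the prompt
        self.archive: List[str] = []        # "long-term memory" (stand-in for a vector DB)

    def add(self, entry: str) -> None:
        self.working.append(entry)
        while sum(len(e) for e in self.working) > self.budget and len(self.working) > 1:
            oldest = self.working.pop(0)
            self.archive.append(self.summarize(oldest))  # fold out of the window

    def recall(self, query: str, k: int = 3) -> List[str]:
        # Naive relevance: count shared words (a vector store would do this properly).
        def overlap(text: str) -> int:
            return len(set(query.lower().split()) & set(text.lower().split()))
        return sorted(self.archive, key=overlap, reverse=True)[:k]
```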

The indie community has contributed significantly here by open-sourcing the “glue code” for these memory systems. They’ve built lightweight vector databases that integrate directly into the inference loop, allowing for recursive retrieval without the overhead of enterprise-grade infrastructure.

Tool Use: From Function Calling to Recursive Execution

Tool use is the physical manifestation of the RLM paper’s ideas. If the model can use tools, it can verify its own work.

Consider the landscape of code generation. A model generates a function. A tool (the linter) checks it. If it fails, the model receives the error. This is a recursive loop. The innovation right now is in making these loops autonomous.

Groups like Cognition AI (with their Devin agent) are pushing the boundaries of what a recursive loop can achieve. They aren’t just generating code; they are generating a plan, executing steps, reading the terminal output, and adjusting the plan. This is RLM in action: a recursive reasoning process that spans multiple modalities (text, code, CLI).

However, the verification problem remains the weak link. In code, we have compilers. In general tasks, we don’t have a “compiler for truth.” This is why the industry is so focused on RAG (Retrieval-Augmented Generation). RAG is essentially a verification loop against a knowledge source. If the model says something, we check if it’s in the retrieved documents. It’s a crude form of recursion, but it’s the most effective one we have for factual accuracy right now.
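
A deliberately crude sketch of RAG-as-verification: flag claims whose content words barely appear in the retrieved documents. Real systems would use an entailment model; the word-overlap heuristic and threshold here are assumptions for illustration.

```python
from typing import List

def unsupported_claims(claims: List[str], retrieved_docs: List[str],
                       min_support: float = 0.5) -> List[str]:
    """Flag claims that are mostly unsupported by the retrieved documents,
    so the loop can send them back for another recursive pass."""
    vocab = set(" ".join(retrieved_docs).lower().split())
    flagged = []
    for claim in claims:
        words = [w for w in claim.lower().split() if len(w) > 3]  # skip stopword-ish tokens
        if not words:
            continue
        support = sum(1 for w in words if w in vocab) / len(words)
        if support < min_support:
            flagged.append(claim)
    return flagged
```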

The “Ghost” in the Recursive Machine

There is a subtle danger in recursive inference that the academic papers are beginning to highlight: error compounding.

If a model makes a small mistake in the first iteration of a recursion, and that mistake is fed back into the context as “reasoning,” the model is likely to double down on it. This is the “hallucination spiral.” The verification loops are supposed to catch this, but if the verifier is the same model (or a weaker version of it), the error can slip through.

Current research into “uncertainty quantification” is trying to solve this. Instead of just outputting text, models are being trained to output confidence scores for every token. If the confidence drops below a threshold during a recursive step, the system can trigger a “backtrack” or a “search” operation. This adds a meta-cognitive layer to the recursion.
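
A minimal sketch of that gating logic, assuming the serving stack exposes per-token log-probabilities (many APIs do) as a list of (token, logprob) pairs. The threshold and window size are arbitrary illustrative values.

```python
import math
from typing import List, Tuple

def needs_backtrack(token_logprobs: List[Tuple[str, float]],
                    threshold: float = 0.35, window: int = 5) -> bool:
    """Flag a recursive step for backtracking if any short window of tokens
    has low average probability, i.e. the model was 'guessing' somewhere."""
    if not token_logprobs:
        return False
    probs = [math.exp(lp) for _, lp in token_logprobs]
    for i in range(max(1, len(probs) - window + 1)):
        window_probs = probs[i:i + window]
        if sum(window_probs) / len(window_probs) < threshold:
            return True  # trigger a backtrack or a search step here
    return False
```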

Map of the Current Landscape (Summary)

To visualize this, imagine a 2D map. On the X-axis, you have “Automation” (how much the loop runs without human input). On the Y-axis, you have “Verification Rigor” (how strictly the output is checked).

Bottom Left (Low Automation, Low Rigor): Basic CoT prompting. Users write “Let’s think step by step.” This is the entry point for most developers.

Top Left (Low Automation, High Rigor): Formal verification labs. They use recursion heavily but require human oversight for complex proofs.

Bottom Right (High Automation, Low Rigor): Consumer chatbots. They use recursive retrieval (RAG) but have loose verification.

Top Right (High Automation, High Rigor): The “Unicorn” zone. This is where agents like Devin live, and where the RLM paper is pointing us. It requires massive context management and robust tool use.

Currently, the industry is clustered in the Bottom Right. Academia is clustered in the Top Left. The indie researchers are scattered everywhere, but they are the ones building the bridges between these quadrants.

Looking Forward: The Next 12 Months

What can we verify today? We can verify code execution. We can verify mathematical proofs (with help from systems like Lean). We can verify factual claims against retrieved documents. We cannot yet verify complex strategic plans or creative coherence with high reliability.

The next wave of RLM-style research will likely focus on efficiency. The recursive loops are expensive. We need “early stopping” mechanisms—ways to detect when a recursive process has converged on a good answer without running the full loop. We also need better “pruning” of the reasoning tree. Not every path needs to be explored.
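
One plausible early-stopping heuristic, sketched under the assumption that a revision which barely changes the previous answer means the loop has converged. The `improve` callable is a hypothetical LLM call and the similarity threshold is arbitrary.

```python
import difflib
from typing import Callable

def refine_until_converged(
    task: str,
    improve: Callable[[str, str], str],   # LLM call: (task, current answer) -> revised answer
    initial: str,
    similarity_stop: float = 0.97,
    max_iters: int = 6,
) -> str:
    """Early stopping for a recursive loop: if a revision barely changes the
    previous answer, assume convergence and stop paying for more passes."""
    current = initial
    for _ in range(max_iters):
        revised = improve(task, current)
        similarity = difflib.SequenceMatcher(None, current, revised).ratio()
        current = revised
        if similarity >= similarity_stop:
            break  # the loop has effectively converged
    return current
```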

There is also a growing interest in multimodal recursion. Can a model generate an image, critique it, and regenerate it? Can it write a script, generate a video, and edit the video based on a review? The RLM framework applies here too, but the verification loops are much harder to build (how do you “verify” a good image?).

The groups exploring this right now are mostly at the intersection of generative art and AI research, often hosted on platforms like Hugging Face Spaces or GitHub. They are the early adopters of recursive multimodality, building the pipelines that will eventually become standard.

A Note on the “Indie” Spirit

It is worth noting that many of the most practical implementations of RLM concepts are not coming from the big labs with their massive datasets. They are coming from developers who are frustrated with the limitations of the API.

The “glue code” movement—writing Python scripts that wrap around LLM calls to add memory, recursion, and verification—is a form of grassroots engineering. These developers are treating the LLM not as an oracle, but as a component in a larger recursive system. This shift in perspective is perhaps more important than any specific architectural breakthrough.

They are the ones experimenting with “prompt chaining” as a poor man’s recursion, and slowly graduating to more sophisticated state management. Their work is documented in Jupyter notebooks, GitHub repos, and Discord channels, forming a distributed brain trust that is rapidly iterating on the ideas presented in the RLM paper.

The landscape is dynamic. Today’s “recursive agent” is tomorrow’s “standard feature.” But for now, the map is drawn by those willing to write the loops themselves, debug the infinite recursions, and patiently wait for the model to “think” step by step.
