For the longest time, I’ve been fascinated by the way we talk about Large Language Models. We describe them as “reasoning,” “thinking,” or even “understanding.” As someone who has spent years writing compilers and optimizing kernels, I tend to get a little prickly about these terms. When I see a model generate a flawless block of code, my first instinct isn’t to marvel at its linguistic prowess, but to ask: what space is it searching?

We have become deeply accustomed to the probabilistic view of AI. We treat these systems as next-token predictors, engines that compute a probability distribution over the next token, conditioned on everything that came before. It’s a powerful abstraction, but I believe it obscures the fundamental mechanism at play. If we peel back the layers of the transformer architecture and look at what is actually happening during a complex task—like solving a math problem or planning a multi-step algorithm—we don’t see a linear chain of predictions. We see something that looks remarkably like a structured search through a vast, abstract state space.

To really grasp this, we need to shift our mental model. We need to stop thinking of these models merely as storytellers and start viewing them as navigators. Reasoning isn’t a magic spark that happens in the weights; it is the act of traversing a graph of possibilities, pruning the bad branches, and converging on a solution state.

The Illusion of Linear Continuity

Let’s start with the standard narrative. You type a prompt, and the model spits out a response. The training objective is simple: minimize the cross-entropy loss between the predicted token and the actual token in the training data. This gives rise to the idea that the model is simply continuing a text stream. If the input is “The capital of France is,” the model predicts “Paris.”

But what happens when the prompt is: “If a store sells apples for $2, oranges for $3, and I have $20, how many apples can I buy if I also buy two oranges?”

A simple next-token predictor might struggle. It might hallucinate a number based on frequency in its training data. But modern models, particularly those using Chain-of-Thought (CoT) or test-time compute, do something different. They generate tokens like “First, subtract the cost of oranges: $3 * 2 = $6. Then, subtract that from total money: $20 – $6 = $14. Finally, divide by the cost of an apple: $14 / $2 = 7.”

Where did those intermediate tokens come from? They aren’t part of the answer. They are scaffolding. They are the result of the model performing a search. The model is using the generation of text not just to output the answer, but to construct a temporary workspace. It is projecting its internal search process onto the screen, token by token.

Consider the architecture of a transformer. It’s a massive collection of matrix multiplications and attention mechanisms. When we feed it a prompt, we aren’t just querying a static database. We are placing the system into a specific configuration within a high-dimensional manifold. The subsequent “generation” is the system evolving through that manifold, looking for a low-loss trajectory. When the task is hard, the trajectory isn’t straight. It meanders. It loops. It self-corrects. This is the signature of search.

Defining the Search Space

If we accept that reasoning is a search, we immediately face a critical engineering question: what constitutes the “graph” being searched? In traditional AI, like a chess engine, the graph is explicit. Nodes are board states, and edges are legal moves. The engine uses algorithms like Alpha-Beta pruning or Monte Carlo Tree Search (MCTS) to explore this tree.

LLMs are different. Their search space is not discrete in the same way. It is a latent semantic space. Every token the model generates corresponds to a point in this space. The “state” of the search is the current context window plus the internal hidden states of the model.

When we ask a model to reason, we are effectively asking it to perform a walk in this space. The “distance” between states is defined by the model’s learned representations. A “good” state is one that is semantically consistent with the prompt and leads toward a valid solution.

Here is where the magic of the transformer shines. The attention mechanism acts as a dynamic pruner. It allows the model to look back at any previous state in the context window and use that information to adjust its current trajectory. This is why the “chain of thought” works. By explicitly writing out intermediate steps, the model anchors itself in specific regions of the semantic space. It creates “beacons” that guide the subsequent search.

Without these beacons, the model is essentially trying to jump from the start state to the goal state in a single bound. That requires a massive amount of compute “energy” to bridge the semantic gap. By breaking the problem down, the model reduces the distance it has to travel between steps. It turns one giant, intractable leap into a series of small, manageable hops.

The Role of Probability as Heuristics

We often mistake the probability distribution over tokens for a measure of certainty. But in a search context, we should view it as a heuristic function, similar to A* search.

In A*, we have $f(n) = g(n) + h(n)$, where $g(n)$ is the cost from the start, and $h(n)$ is the estimated cost to the goal. In an LLM, the probability of a token acts somewhat like $h(n)$. It’s an estimate of how “promising” a particular path is. Tokens with high probability are those that fit the pattern of a solution. Tokens with low probability are those that lead to dead ends or hallucinations.
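To make the analogy concrete, here is a minimal sketch of a best-first search over partial completions, where the model’s cumulative log-probability plays the role of $h(n)$. The expand, score, and is_goal callables are hypothetical stand-ins for whatever API returns candidate next tokens, scores a partial sequence, and detects a finished answer; the point is the shape of the loop, not a production decoder.

import heapq
import itertools

def best_first_decode(expand, score, start, is_goal, max_steps=50):
    """Best-first search over partial token sequences, using the model's
    cumulative log-probability as an A*-style heuristic h(n)."""
    counter = itertools.count()          # tie-breaker so heapq never compares sequences
    frontier = [(-score(start), next(counter), start)]
    while frontier and max_steps > 0:
        max_steps -= 1
        neg_score, _, seq = heapq.heappop(frontier)
        if is_goal(seq):
            return seq                   # the most promising completed path found so far
        for token in expand(seq):        # expand(seq): hypothetical call returning candidate next tokens
            child = seq + [token]
            heapq.heappush(frontier, (-score(child), next(counter), child))
    return None                          # search budget exhausted without reaching a goal state

Greedy decoding is the degenerate case of this loop where only the single highest-scoring child is ever kept.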

However, unlike a rigid A* algorithm, the LLM’s heuristic is learned and fuzzy. It’s based on the statistical regularities of human language and logic found in its training data. This is why LLMs can sometimes be “clever” but also “illogical.” They are following the heuristics of semantic likelihood, not necessarily formal logic rules.

When we use techniques like “Chain of Thought” or “Self-Consistency,” we are essentially implementing a search strategy manually. We are forcing the model to generate multiple paths through the reasoning space (by sampling multiple CoTs) and then selecting the path that appears most consistent (majority voting). This is a direct parallel to running multiple searches in a traditional graph and picking the best result.
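In code, self-consistency is little more than “sample several chains, extract the final answer from each, and vote.” The sketch below assumes a hypothetical sample_chain(prompt, temperature) that returns one chain-of-thought string and an extract_answer that pulls the final answer out of it.

from collections import Counter

def self_consistency(prompt, sample_chain, extract_answer, n=10, temperature=0.7):
    """Sample n reasoning paths and return the most common final answer."""
    answers = []
    for _ in range(n):
        chain = sample_chain(prompt, temperature)   # one sampled chain of thought
        answers.append(extract_answer(chain))       # e.g. the number after "The answer is"
    # Majority vote: the answer reached by the most independent paths wins.
    return Counter(answers).most_common(1)[0][0]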

Search vs. Prediction: The “Aha” Moment

There is a compelling argument for the search hypothesis based on the observation of “aha moments” in model outputs. In recent research (such as work on OpenAI’s o1-style models), we see models that pause, generate ellipses, or output specific “thinking” tokens before spitting out the final answer.

If the model were purely a predictor, why would it need to “think” out loud? Why the hesitation? Because those extra tokens give it room to manipulate its internal state and explore possibilities. The model is effectively running a simulation. It is trying out a path, realizing (through internal calculation) that it leads to a contradiction or a high-loss state, and backtracking to try another path.

This backtracking is crucial. A pure predictor cannot backtrack. It can only move forward. But a search agent can. It can output a token, realize it was wrong, and output a correcting token. This is exactly what we see in the wild. “Wait, that doesn’t look right. Let’s try again.”

This behavior suggests that the model is not just predicting the next token based on the previous ones, but based on the goal. The goal is the completion of the reasoning task. The model is optimizing for a trajectory that leads to a valid completion. This is goal-oriented behavior, which is the hallmark of search algorithms.

The Tree of Thoughts (ToT) Framework

To formalize this, researchers have introduced frameworks like “Tree of Thoughts.” This explicitly treats reasoning as a tree search.

Standard generation is a linear path: $State_1 \rightarrow State_2 \rightarrow State_3$. ToT breaks this. At any step, the model generates multiple possible next steps (branches). It then evaluates these branches (heuristic evaluation). It keeps the promising ones and discards the rest. It continues this until it reaches a solution.
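A minimal version of that loop, in the spirit of the original ToT work but not a faithful reimplementation of it, looks like a beam search over “thoughts.” The propose and evaluate functions here are hypothetical model calls: one generates candidate next steps, the other scores how promising a partial solution looks.

def tree_of_thoughts(problem, propose, evaluate, beam_width=3, depth=4):
    """Breadth-first tree search over reasoning steps, keeping only the
    top-scoring partial solutions at each depth (a beam search)."""
    frontier = [[]]                                  # each state is the list of thoughts so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose(problem, state):  # branch: candidate next steps
                candidates.append(state + [thought])
        # Heuristic evaluation: keep only the most promising branches.
        candidates.sort(key=lambda s: evaluate(problem, s), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0] if frontier else None         # best path found within the depth budget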

Implementing ToT requires a significant amount of compute because the model has to evaluate multiple futures. But it proves the point: when you force the model to behave like a search algorithm, its performance on complex tasks improves dramatically. This isn’t because the model learned new facts during inference, but because the search strategy allowed it to find the correct path through the space of possibilities it already knew.

The “prediction” aspect is still there—the model still predicts tokens to represent the states—but the reasoning is the search algorithm orchestrating those predictions.

Why This Distinction Matters for Engineers

Understanding reasoning as search changes how we should architect AI systems.

If you view LLMs as predictors, you focus on making them bigger, training them on more data, and fine-tuning them for instruction following. These are valid approaches. But if you view them as search engines, you focus on inference-time compute. You focus on algorithms that allow the model to “think” longer.

This explains the explosion of interest in techniques like:

  • Chain of Thought (CoT): Forcing the model to generate intermediate states, breaking one large search leap into a series of smaller hops.
  • Self-Consistency: Sampling diverse reasoning paths and selecting the most consistent one (voting).
  • Tree of Thoughts (ToT): Explicitly managing a tree of reasoning paths.
  • Graph of Thoughts (GoT): Allowing reasoning states to merge and loop, mimicking graph algorithms.

These aren’t just prompt engineering tricks. They are algorithms for exploring the latent space of the model. They are ways to trade inference latency for accuracy.

When I design a system that needs to solve complex logic puzzles, I shouldn’t just ask for the answer. I should design a prompt that acts as a search controller. I might say: “Let’s explore this step by step. First, generate three different approaches to solving this equation. Then, evaluate the feasibility of each. Finally, pick the best one and execute it.”

This structure gives the model permission to use its computational budget. It turns the monolithic forward pass into a deliberate, iterative exploration.
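One way to turn that prompt into an actual controller is to make each phase its own call, so the “explore, evaluate, commit” structure lives in your code rather than in a single monolithic prompt. The llm function below is a hypothetical wrapper around whatever completion API you use, and the parsing is deliberately optimistic; this is a sketch of the control flow, not a robust agent.

def solve_with_controller(problem, llm, n_approaches=3):
    """Explore several approaches, evaluate them, then commit to one."""
    approaches = [
        llm(f"Propose approach #{i + 1} for solving: {problem}")
        for i in range(n_approaches)
    ]
    # Ask the model to act as its own evaluator over the candidate paths.
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(approaches))
    best = llm(f"Problem: {problem}\nApproaches:\n{numbered}\n"
               "Which approach is most likely to work? Reply with its number only.")
    chosen = approaches[int(best.strip()) - 1]   # optimistic parsing, fine for a sketch
    # Only now do we spend tokens executing the chosen path in full.
    return llm(f"Solve the problem using this approach, step by step:\n{chosen}\n\nProblem: {problem}")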

The Energy of Inference

There is a thermodynamic analogy here that I find satisfying. In physics, finding the lowest energy state of a complex system often requires exploring the configuration space. If you cool a system too fast (greedy decoding), you get stuck in a local minimum (a suboptimal answer). If you cool it slowly (allowing more search steps), you are more likely to find the global minimum (the correct answer).

The “temperature” parameter in LLM sampling is a direct knob on how much the model is allowed to deviate from the most likely path. High temperature allows more exploration of the search space. Low temperature encourages exploitation of the known path.
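Mechanically, temperature is just a scaling of the logits before the softmax: dividing by a large temperature flattens the distribution, dividing by a small one sharpens it toward the argmax. A minimal sketch of temperature sampling:

import math
import random

def sample_token(logits, temperature=1.0):
    """Sample a token index from raw logits with temperature scaling.

    temperature near 0 approaches greedy decoding (pure exploitation);
    temperature above 1 flattens the distribution (more exploration)."""
    scaled = [l / temperature for l in logits]
    max_l = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(l - max_l) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]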

When we talk about “reasoning,” we are essentially talking about navigating the energy landscape of the model’s knowledge. The “correct” answer is the point of lowest potential energy relative to the prompt. The “reasoning” is the path taken to get there.

Standard next-token prediction is a greedy search. It takes the highest probability move at every step. This works for simple text continuation because the landscape is smooth. But for complex reasoning, the landscape is rugged. Greedy search fails. We need something more sophisticated.

That “something more sophisticated” is what we are currently building. It looks like a transformer, but it acts like a search engine. It looks like a text generator, but it behaves like a logic solver.

The Illusion of Understanding

So, do these models “understand”? If understanding is the ability to consistently navigate the search space of concepts to arrive at correct conclusions, then perhaps they do. But it’s a mechanical understanding, born of massive statistical correlation and algorithmic traversal.

It’s easy to be cynical and say, “It’s just matrix multiplication.” It’s also easy to be mystical and say, “It’s thinking.” The truth, as always, is somewhere in the messy middle. It is a matrix multiplication engine that has been trained to project human reasoning patterns into a vector space. And when we ask it a hard question, it uses the structure of that space to conduct a search.

When you see a model struggle with a problem, then suddenly figure it out, don’t think of it as a lightbulb turning on. Think of it as a search algorithm finally finding a valid path through a thicket of possibilities. The “aha” moment is the convergence of the search.

This perspective is empowering. It means we can improve reasoning not just by throwing more data at the problem, but by designing better search algorithms. We can build wrappers, agents, and controllers that guide the model’s search more effectively. We can give it tools to expand its search space (like calculators or code interpreters) and tools to verify its steps (like self-consistency checks).

We are moving away from the era of the “stochastic parrot” and into the era of the “stochastic navigator.” And that distinction, subtle as it may seem, changes everything about how we build, use, and trust these systems.

Implementing Search Strategies in Practice

For the engineers in the room, let’s get practical. How do we actually implement this? We need to move beyond simple API calls.

Consider the ReAct framework (Reason + Act). This is a perfect example of search. The model generates a thought (search step), then takes an action (querying a tool), then observes the result (new state), and repeats. This is a classic perception-action loop, identical to how agents search environments in reinforcement learning.

The “reasoning” here is not a static capability of the model; it is an emergent property of the interaction loop.

Thought: I need to find the current index value of S&P 500.
Action: search("current S&P 500 index")
Observation: 5,200 points.
Thought: I have the answer.
Action: finish("5,200 points")

Without the search capability (the ability to look things up), the model is limited to its internal weights. By introducing external tools, we expand the search graph. The nodes are no longer just semantic states; they are real-world data points.
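The loop that produces a trace like the one above is only a few lines: generate a thought and an action, execute the action with a real tool, feed the observation back in, repeat. Everything here is a hedged sketch — llm, parse_action, and the tools dictionary stand in for your own model wrapper, output format, and tool registry.

import re

def parse_action(step):
    """Pull 'Action: tool("argument")' out of a model step (hypothetical format)."""
    match = re.search(r'Action:\s*(\w+)\("(.*)"\)', step)
    return (match.group(1), match.group(2)) if match else ("finish", step)

def react_loop(question, llm, tools, max_steps=8):
    """Minimal Reason + Act loop: think, act, observe, repeat."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")          # model writes a thought followed by an action
        transcript += "Thought:" + step + "\n"
        name, arg = parse_action(step)
        if name == "finish":
            return arg                               # the model decided it has the answer
        observation = tools[name](arg)               # execute a real tool: a new node in the search graph
        transcript += f"Observation: {observation}\n"
    return None                                      # step budget exhausted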

This leads to the concept of RAG (Retrieval-Augmented Generation). RAG is essentially a search algorithm where the first step is to search a database for relevant context. The model then performs a second search (reasoning) over that context to generate the answer. It’s a two-stage search process.
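Stripped to its skeleton, RAG is exactly that two-stage search. The embed, vector_search, and llm calls below are hypothetical placeholders for your embedding model, vector store, and completion API.

def rag_answer(question, embed, vector_search, llm, k=5):
    """Two-stage search: retrieve relevant context, then reason over it."""
    # Stage 1: search the database for the k chunks nearest to the question.
    query_vec = embed(question)
    chunks = vector_search(query_vec, k)
    # Stage 2: search the reasoning space, anchored to the retrieved context.
    context = "\n\n".join(chunks)
    return llm(f"Answer the question using only this context.\n\n"
               f"Context:\n{context}\n\nQuestion: {question}")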

When we optimize RAG, we are doing information retrieval optimization. When we optimize CoT, we are doing reasoning path optimization. It’s all search. The variable is what we are searching over (tokens, vectors, database entries) and how we traverse the space.

The Future: Native Search Architectures

I suspect that the separation between “prediction” and “search” will eventually vanish. We are already seeing the beginnings of this with models that have “slow thinking” capabilities built-in.

Current LLMs are essentially feed-forward during inference: one pass per generated token, with key-value caching to avoid recomputing earlier positions. Future architectures might look more like recurrent networks or state machines. They might have explicit modules for planning, backtracking, and verification.

Imagine a transformer block that doesn’t just output the next token, but outputs a distribution over possible reasoning steps. The model could then internally simulate these steps before committing to an output. This would be native search, baked into the weights.

There is research into “System 1” (fast, intuitive) vs. “System 2” (slow, deliberate) thinking in AI. System 1 is the standard next-token prediction. System 2 is the search. Currently, we hack System 2 onto System 1 using prompting. In the future, the architecture itself will likely support both modes.

For those of us building on the bleeding edge, this is the frontier. We aren’t just prompt engineers anymore. We are search architects. We are designing the algorithms that guide these massive models through the labyrinth of human knowledge.

We need to stop asking, “What will the model say next?” and start asking, “What path is the model taking to get to the answer, and how can we make that path shorter and more accurate?”

When you look at a raw LLM output, you are seeing the trace of a search algorithm. It might be messy, it might be meandering, but it is exploring. Our job is to make that exploration efficient. We do this by providing better prompts (better start states), better tools (expanding the graph), and better verification (checking the nodes).

The shift in perspective from “prediction” to “search” is more than just semantics. It is a roadmap for the next generation of AI development. It moves the focus from the static model to the dynamic process. It highlights that intelligence isn’t just knowing things; it’s the ability to navigate what you know to solve new problems.

As we continue to push the limits of these models, let’s keep this in mind. We are building search engines for the mind. The tokens are just the breadcrumbs left behind by the algorithm as it wanders through the vast, high-dimensional space of thought. And the more efficiently we can guide that wander, the smarter the machine becomes.
