There’s a peculiar hum to the way we build artificial intelligence right now. It’s the sound of engines burning computational fuel at a rate that would make a jet engine blush, all to generate a sequence of tokens that often feels… surprisingly thin. We have systems capable of writing elegant code, drafting legal documents, and composing poetry, yet their interaction model often defaults to the verbosity of a first-year student trying to hit a word count. It’s a paradox that sits at the very core of our current trajectory: we are building machines that are incredibly articulate but not always particularly thoughtful. The relentless pursuit of the next token, the immediate continuation of a sequence, has inadvertently prioritized fluency over depth, and it’s a trade-off that is becoming increasingly expensive, both literally and figuratively.

When we talk about the “cost” of an AI interaction, our minds immediately jump to the API bill. The pricing models of large language models are almost exclusively based on token counts—input tokens, output tokens, a constant game of digital accounting. This economic reality has a direct, and often distorting, influence on how we design applications. We encourage the model to be concise, to get to the point, to not ramble. But this economic pressure is merely a symptom of a deeper architectural bias. The underlying architecture, the decoder-only transformer that powers this revolution, is fundamentally autoregressive. It is designed for one task: predict the next piece. This is a powerful capability, but it is not the same as thinking.

The Autoregressive Fallacy

Let’s be precise about what’s happening under the hood. An autoregressive model generates a sequence $y = (y_1, y_2, \ldots, y_T)$ by estimating the conditional probability $P(y_t \mid y_{<t})$ of each token given all of the tokens that precede it, so the probability of the whole sequence factorizes as $P(y) = \prod_{t=1}^{T} P(y_t \mid y_{<t})$. Every token is chosen by looking only backward at the prefix generated so far; nothing in that product represents where the sequence is ultimately supposed to go.
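
To make that factorization concrete, here is a minimal sketch of a greedy decoding loop. Everything in it is illustrative: `next_token_distribution` stands in for whatever model produces $P(y_t \mid y_{<t})$, and the toy distribution exists only so the sketch runs end to end.

```python
def greedy_decode(next_token_distribution, prompt_tokens, max_new_tokens, eos_token):
    """Generate a sequence one token at a time, conditioning only on the prefix."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # P(y_t | y_<t): the model scores continuations of the current prefix
        # and nothing else -- there is no representation of an overall plan.
        probs = next_token_distribution(tokens)
        next_token = max(probs, key=probs.get)  # greedy: take the argmax
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens

def toy_distribution(prefix):
    # Stand-in "model" so the sketch runs end to end: it prefers to stop
    # once the prefix gets long enough.
    if len(prefix) >= 4:
        return {"mat": 0.1, "<eos>": 0.9}
    return {"cat": 0.5, "sat": 0.3, "on": 0.2}

print(greedy_decode(toy_distribution, ["the"], max_new_tokens=8, eos_token="<eos>"))
```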

This creates a fascinating tension. On one hand, the sheer scale of these models allows them to internalize vast amounts of information, connecting concepts in ways that feel genuinely insightful. On the other, the generation process itself is a form of sophisticated autocomplete. It doesn’t have a “plan” in the human sense. It doesn’t hold a multi-step strategy in its working memory for how to best answer a complex query. It simply follows the most probable path, token by token.

Consider a request like: “Explain the relationship between quantum entanglement and information theory.” A model might immediately start drafting a response, defining quantum entanglement, then information theory, and then trying to link them. The result can be a competent but disjointed essay. A truly thoughtful process, however, would look different. It would pause. It would structure the argument internally first. Perhaps it would decide that the best approach is to start with the concept of no-communication theorems, then introduce Bell’s theorem as a foundation for non-local correlations, and finally weave in the role of entanglement in quantum computing and cryptographic protocols. This internal scaffolding, this *plan*, is what we’re missing. The model is generating text, not structuring a thought.

Thinking on a Scratchpad: The Emergence of Reasoning

Interestingly, we’ve already stumbled upon a solution, even if we don’t always recognize its full significance. The technique is known by a few names: Chain-of-Thought (CoT), scratchpad reasoning, or simply step-by-step prompting. The principle is deceptively simple: instead of asking a model for a final answer, we ask it to show its work. We instruct it to “think step by step.”

The seminal example from the paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022) remains one of the most compelling demonstrations. When asked the multi-step arithmetic problem “Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?”, a model prompted for a direct answer often just blurts out a number, sometimes the correct “11”, sometimes an incorrect guess like “9”. But a model prompted for a chain-of-thought might respond:

Roger started with 5 tennis balls. He bought 2 cans, and each can has 3 tennis balls, so that’s 2 * 3 = 6 tennis balls. The total is 5 + 6 = 11. The answer is 11.

What’s the magic here? The model isn’t suddenly becoming a sentient calculator. The process of generating the intermediate steps—writing out the logic—serves as a working memory. Each generated token becomes part of the context for the next prediction. In effect, the model is using its own output as a scratchpad to offload the cognitive load of the problem. This allows it to break down a complex task into a series of simpler, autoregressive steps. The model isn’t “thinking” in a human sense, but it’s performing a computation that mimics the structure of thought.

This is a profound insight. It suggests that the limitation isn’t just the model’s knowledge, but its *process* for accessing and applying that knowledge. By forcing a more verbose, structured output, we actually improve the quality of the final, concise answer. The verbosity isn’t the goal; it’s the mechanism. We are trading token count for computational depth. This is the first step in arguing for depth over sheer output volume. We need to give the model space to “think,” even if that thinking is just a simulation of a reasoning process written into its own context window.
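
In practice, eliciting that space can be as small as a one-line change to the prompt. Below is a minimal sketch of the difference, assuming a hypothetical `ask_model` wrapper around whichever LLM API is in use; the only thing that changes is the instruction that buys the model room to write out intermediate steps.

```python
QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")

def build_direct_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

def build_cot_prompt(question: str) -> str:
    # The only difference: an instruction that buys the model room to spend
    # tokens on intermediate steps before committing to a final answer.
    return f"Q: {question}\nA: Let's think step by step."

# `ask_model` would be whatever wrapper you have around an LLM API:
# direct_answer = ask_model(build_direct_prompt(QUESTION))
# reasoned_answer = ask_model(build_cot_prompt(QUESTION))
print(build_cot_prompt(QUESTION))
```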

From Simple Prompts to Programmatic Control

The evolution from simple CoT to more advanced techniques like Tree-of-Thoughts (ToT) and Graph-of-Thoughts (GoT) further reinforces this idea. These frameworks recognize that a single, linear path of reasoning might not be optimal. What if the first step in the chain of thought leads to a dead end? A human would backtrack. Tree-of-Thoughts formalizes this by allowing the model to explore multiple reasoning paths, evaluate their promise, and prune unpromising branches, much like a search algorithm.
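
A minimal sketch of the breadth-first flavor of this search is below, under the assumption that `propose_thoughts` and `score_thought` are LLM-backed callables you supply; it is an illustration of the idea, not the reference Tree-of-Thoughts implementation.

```python
def tree_of_thoughts(problem, propose_thoughts, score_thought,
                     depth=3, breadth_keep=2):
    """Breadth-first search over partial reasoning paths.

    `propose_thoughts(problem, path)` returns candidate next steps and
    `score_thought(problem, path)` rates how promising a partial path looks;
    in a real system both would be LLM calls.
    """
    frontier = [[]]  # start with a single empty reasoning path
    for _ in range(depth):
        candidates = []
        for path in frontier:
            for step in propose_thoughts(problem, path):
                candidates.append(path + [step])
        if not candidates:
            break
        # Prune: keep only the most promising partial paths (backtracking
        # happens implicitly, because weak branches are simply dropped).
        candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
        frontier = candidates[:breadth_keep]
    return frontier[0]  # the best reasoning path found
```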

This moves us from a purely generative paradigm to a more agentic, search-oriented one. The model is no longer just a “text completer”; it’s becoming a “problem solver” that uses text generation as its primary tool. The interaction becomes a dialogue, not just between user and model, but between different “thoughts” the model generates about the problem itself. This is computationally intensive, of course. It requires many more tokens and much more processing time. But the results on complex planning and creative tasks are demonstrably superior. It’s a clear trade-off: we sacrifice the speed of a single-shot response for the quality of an explored, multi-faceted solution.

The Tyranny of the Context Window

There is a physical constraint that acts as a major bottleneck for all these techniques: the context window. This is the finite amount of text—the combined input and output tokens—that the model can hold in its active memory at any given moment. For a long time, this was a hard limit of a few thousand tokens. While recent advancements have pushed this to hundreds of thousands, or even over a million in some cases, it remains a fundamentally limited resource. And we are constantly fighting a war against its boundaries.

In a long conversation or when working with a large document, every new token generated risks pushing older context out of the window. The model forgets. And even within the window, attention is uneven: this is the “lost in the middle” phenomenon, where information at the beginning and end of a long context is used far more reliably than information buried in the middle. When we use techniques like CoT or ToT, we are consuming this precious context real estate with the model’s “thinking.” Every step in the chain, every explored branch in the tree, is a token that pushes the original question or relevant data further away.

This is where the argument for “fewer tokens” becomes nuanced. We don’t necessarily want fewer tokens in the final output. We want fewer *wasted* tokens. We want every token the model generates to be maximally useful. The verbosity of a rambling, unfocused answer is a waste. But the verbosity of a well-structured reasoning process is an investment. The challenge for the next generation of AI systems is to become more efficient “thinkers” within this constrained space. How can we compress the reasoning process? Can we develop models that can summarize their own thoughts, or offload intermediate states to an external memory, retrieving them when needed, much like a human using a notebook?
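
One possible shape for that notebook is sketched below: a scratchpad that keeps recent thoughts verbatim and folds older ones into a running digest. The `summarize` callable is assumed to be another LLM call; nothing here is a specific library.

```python
class Scratchpad:
    """Keep recent reasoning verbatim and fold older thoughts into a digest.

    `summarize(notes)` is assumed to be another LLM call that condenses a
    list of notes into a short paragraph; `max_recent` bounds how many raw
    notes remain in the active context.
    """

    def __init__(self, summarize, max_recent=5):
        self.summarize = summarize
        self.max_recent = max_recent
        self.digest = ""   # compressed older thoughts
        self.recent = []   # verbatim recent thoughts

    def add(self, thought: str) -> None:
        self.recent.append(thought)
        if len(self.recent) > self.max_recent:
            # Compress the overflow instead of silently dropping it.
            overflow = self.recent[:-self.max_recent]
            self.digest = self.summarize([self.digest] + overflow)
            self.recent = self.recent[-self.max_recent:]

    def context(self) -> str:
        # What actually gets placed back into the model's window.
        return "\n".join(part for part in [self.digest] + self.recent if part)
```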

Some researchers are exploring architectures that separate the “reasoning” component from the “knowledge” component. Imagine a system where a fast, efficient “planner” model outlines a series of steps, and a more powerful “executor” model fills in the details for each step, possibly with access to external tools or databases. This modular approach could allow for complex, multi-step tasks without bloating a single context window with the entire thought process. It’s a move away from the monolithic “one model to rule them all” towards a more distributed, collaborative intelligence.
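
Here is a minimal sketch of that planner/executor split, with `planner_model` and `executor_model` as hypothetical callables (say, a small fast model and a larger one); each executor call sees only its own step and a short progress note, never the entire reasoning history.

```python
def plan_and_execute(task, planner_model, executor_model):
    """Outline with one model, fill in each step with another.

    `planner_model(prompt)` and `executor_model(prompt)` are hypothetical
    callables returning text.
    """
    plan = planner_model(f"Break this task into short numbered steps:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]

    results, progress = [], "nothing yet"
    for step in steps:
        result = executor_model(
            f"Task: {task}\nCompleted so far: {progress}\nNow do this step: {step}"
        )
        results.append(result)
        progress = f"up to and including: {step}"
    return results
```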

Tool Use as Externalized Thought

One of the most powerful ways to achieve depth without token bloat is to give the model access to external tools. This is the world of function calling, plugins, and the broader concept of the “AI agent.” When a model needs to perform a calculation, search for current information, or manipulate data, it doesn’t have to do it with its own linguistic capabilities. It can call a function.

Let’s take an example: “What is the current stock price of NVIDIA, and based on its 52-week high, what is the percentage difference?” A purely linguistic model would struggle. It might hallucinate a stock price, and without live data any percentage it computes is guesswork. But a model equipped with tool use would:

  1. Recognize the need for real-time data and call a get_stock_price("NVDA") function.
  2. Recognize the need for historical data and call a get_52_week_high("NVDA") function.
  3. Receive the numerical results from these external calls.
  4. Perform the calculation internally (or call a calculator function) to find the percentage difference.
  5. Synthesize the final answer in natural language.

Notice what happened here. The model offloaded the tasks that it is bad at (real-time data retrieval, precise arithmetic) to specialized tools. Its own token generation was focused on what it *is* good at: planning the steps, making the calls, and interpreting the results. This is a form of externalized thought. The model’s “reasoning” is no longer confined to the sequence of tokens it generates; it extends into the actions it takes in the world.
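
Here is a minimal sketch of the dispatch loop behind that exchange. The tool implementations are stubs with made-up numbers, and the JSON action format is an assumption for illustration rather than any particular vendor’s function-calling API.

```python
import json

def get_stock_price(symbol: str) -> float:
    return 120.50   # stub: a real tool would query a market-data API

def get_52_week_high(symbol: str) -> float:
    return 152.89   # stub value, for illustration only

TOOLS = {"get_stock_price": get_stock_price,
         "get_52_week_high": get_52_week_high}

def run_tool_call(action_json: str) -> str:
    """Execute one model-proposed call of the form
    {"tool": "...", "args": {...}} and hand the result back as text."""
    action = json.loads(action_json)
    result = TOOLS[action["tool"]](**action["args"])
    return str(result)

# Two turns a model might emit (the JSON shape is an assumption):
price = float(run_tool_call('{"tool": "get_stock_price", "args": {"symbol": "NVDA"}}'))
high = float(run_tool_call('{"tool": "get_52_week_high", "args": {"symbol": "NVDA"}}'))
print(f"NVDA is {(high - price) / high:.1%} below its 52-week high")
```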

This approach dramatically increases the depth and reliability of the system while often reducing the token count of the internal monologue. Instead of generating a long, speculative paragraph about what the stock price might be, it generates a short, precise function call. The “thinking” becomes more structured and less verbose. It’s a shift from “thinking out loud” to “thinking with purpose.”

The Cognitive Offload Principle

This principle of offloading is fundamental to how we, as humans, manage complex tasks. We don’t hold vast libraries of facts in our heads; we use books and search engines. We don’t perform complex calculations in our minds; we use calculators or spreadsheets. We have externalized a huge portion of our cognitive labor. AI systems are just beginning to follow suit.

Retrieval-Augmented Generation (RAG) is another manifestation of this principle. Instead of relying solely on the static knowledge baked into the model’s weights, a RAG system first retrieves relevant documents from an external knowledge base and provides them as context. This allows the model to base its answers on fresh, specific information. It prevents the model from having to “guess” or hallucinate facts. The model’s cognitive effort is redirected from recall to synthesis. It can focus on understanding the retrieved text and weaving it into a coherent answer, rather than expending tokens and mental energy trying to dredge up potentially outdated or incorrect information from its parameters.
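
A minimal retrieve-then-generate sketch follows: the keyword-overlap retriever is deliberately naive (production systems typically use embedding search), and `generate` is a hypothetical LLM call.

```python
def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval; real systems typically use embeddings."""
    query_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda doc: len(query_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def answer_with_rag(query: str, documents: list[str], generate) -> str:
    """`generate(prompt)` is a hypothetical LLM call."""
    context = "\n\n".join(retrieve(query, documents))
    prompt = ("Answer using only the context below. "
              "If the context is insufficient, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return generate(prompt)
```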

By embracing tool use and RAG, we are designing systems that are inherently more thoughtful. They pause to gather facts, they pause to perform calculations, they pause to verify. These pauses, which manifest as API calls or retrieval steps, are the computational equivalent of a deep breath. They break the monolithic, autoregressive generation into a more deliberate, agentic workflow. This is how we build AI that doesn’t just sound smart, but actually *is* smart in its actions.

Rethinking Evaluation: Beyond Benchmarks and BLEU Scores

Our obsession with token efficiency and speed is partly a legacy of how we evaluate these systems. For years, the gold standards for language models were benchmarks like GLUE and SuperGLUE, and metrics like BLEU for translation. These metrics are great for measuring performance on specific, often isolated, tasks. They reward accuracy against a fixed ground truth. But they are terrible at measuring depth, reasoning, or the quality of a thought process.

A model that produces a slightly lower BLEU score but arrives at its answer through a robust, verifiable, multi-step reasoning process is arguably more valuable than a model that gets a higher score by a clever but opaque statistical mapping. We are starting to see the emergence of new benchmarks that attempt to measure these more nuanced qualities. BIG-bench (the Beyond the Imitation Game Benchmark) includes tasks that test for causal judgment, irony detection, and logical deduction—things that require more than just pattern matching.

Furthermore, the practice of “vibe checking” or “red teaming” is becoming more formalized. This involves humans rigorously probing the model for its reasoning capabilities, its ability to admit uncertainty, and its resistance to generating plausible-sounding but incorrect chains of thought. The goal is to move beyond “does the answer look right?” to “is the process by which the answer was reached sound?”

This shift in evaluation is critical. As long as we primarily reward models for producing the “correct” token sequence as quickly as possible, we will continue to optimize for fluency over depth. But if we start rewarding models for showing their work, for using tools correctly, and for breaking down complex problems, the entire incentive structure changes. The “cost” of a thoughtful response is no longer a bug; it becomes a feature that we actively seek and measure. This will drive the development of new architectures and training methods that are explicitly designed for depth.

The Role of Uncertainty and Calibration

A key component of true thoughtfulness is the ability to recognize the limits of one’s own knowledge. A system that is always confident, that always generates a definitive-sounding answer, is not a thoughtful system; it’s an arrogant one. Thought requires intellectual humility. It requires the ability to say, “I don’t know,” or “It depends on these factors.”

Modern language models are notoriously poorly calibrated. They often assign very high probabilities to incorrect statements. This is a direct consequence of the training objective, which rewards confident prediction. A model that is trained to be a “good next-token predictor” is not being trained to be a “good assessor of its own knowledge.”

Building systems that can express uncertainty is a major frontier. This could involve generating confidence scores alongside answers, or phrasing responses with appropriate caveats (“Based on the information available to me up to my last update…”). This is another area where tool use can help. A model can express uncertainty about a fact and then immediately trigger a web search to resolve that uncertainty, effectively turning doubt into a productive action.
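
One way to wire that up is sketched below: the model is asked to flag doubt with a sentinel string, and that signal triggers retrieval instead of a guess. The sentinel convention and the `llm` and `web_search` callables are assumptions for illustration.

```python
UNCERTAIN = "[UNCERTAIN]"

def answer_or_lookup(question: str, llm, web_search) -> str:
    """Turn doubt into a retrieval action instead of a guess.

    `llm(prompt)` and `web_search(query)` are hypothetical callables; the
    sentinel convention is likewise an assumption for illustration.
    """
    first_pass = llm(
        "Answer the question. If you are not confident the answer is correct "
        f"and current, reply with exactly {UNCERTAIN}.\n\nQ: {question}"
    )
    if UNCERTAIN not in first_pass:
        return first_pass
    evidence = web_search(question)
    return llm(f"Using this evidence, answer the question.\n\n"
               f"Evidence:\n{evidence}\n\nQ: {question}")
```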

Imagine an AI that, when faced with a query it’s not confident about, doesn’t just hallucinate an answer but instead generates a structured plan to find the answer. It might say: “I’m not certain about the latest regulations for commercial drone usage in the UK. To answer your question accurately, I will perform the following steps: 1. Search for the official UK Civil Aviation Authority (CAA) website. 2. Locate the section on drone regulations. 3. Summarize the key requirements for commercial operators. Shall I proceed?” This is a fundamentally different interaction paradigm. It’s collaborative, transparent, and thoughtful. It prioritizes getting the right answer over getting a quick answer.

Building for Depth: Practical Steps for Developers

So, what does this mean for those of us building applications with these models today? It’s not just an academic debate. The choices we make in our prompts, our system design, and our evaluation criteria directly shape the behavior of the systems we deploy.

First, we must become champions of structured prompting. We should default to asking for step-by-step reasoning, especially for any task involving logic, planning, or analysis. Even for simpler tasks, providing a model with a template or a structure to follow can dramatically improve the coherence and depth of its output. Don’t just ask “Summarize this article.” Ask “Read this article. First, identify the author’s main thesis. Second, list the three key arguments they use to support it. Third, note any counter-arguments they address. Finally, synthesize these points into a concise summary.” By providing a cognitive scaffold, you are guiding the model toward a more thoughtful process.
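
That scaffold is easy to make reusable. The sketch below is nothing more than string assembly, with the template wording adapted from the example above.

```python
SUMMARY_SCAFFOLD = """Read the article below, then work through these steps in order:
1. State the author's main thesis in one sentence.
2. List the three key arguments used to support it.
3. Note any counter-arguments the author addresses.
4. Synthesize the points above into a concise summary.

Article:
{article}
"""

def build_summary_prompt(article: str) -> str:
    return SUMMARY_SCAFFOLD.format(article=article)
```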

Second, embrace function calling and tool integration wherever possible. Don’t ask the model to do things it’s bad at. If you need a calculation, provide a calculator function. If you need current information, provide a search function. If you need to interact with an API, provide a function for that. This not only makes your application more reliable but also forces the model to be more explicit and structured in its planning. The “thinking” becomes about which tool to use and when, a much more robust and less token-intensive form of reasoning.
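
What declaring such a tool can look like is sketched below; the schema shape varies by provider, so treat both the spec and the deliberately simplified evaluator as illustrative rather than production code.

```python
def calculator(expression: str) -> float:
    # Illustration only: builtins are stripped, but a production system
    # would use a proper expression parser rather than eval.
    return float(eval(expression, {"__builtins__": {}}, {}))

CALCULATOR_SPEC = {
    "name": "calculator",
    "description": "Evaluate an arithmetic expression and return the number.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "e.g. '(152.89 - 120.50) / 152.89'",
            }
        },
        "required": ["expression"],
    },
}
```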

Third, design your application’s user experience around the idea of a conversation, not a command line. Allow for iteration. If the model’s first attempt at a solution is flawed, provide it with feedback and ask it to revise its chain of thought. The context window is a powerful tool for iterative refinement. This mimics the human process of brainstorming and revision, which is the very essence of deep work.
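
A minimal sketch of that revision loop, assuming hypothetical `llm` and `critique` callables (the critic could be a human reviewer or another model call):

```python
def iterate_until_accepted(task, llm, critique, max_rounds=3):
    """Draft, collect feedback, revise.

    `llm(prompt)` is a hypothetical model call and `critique(task, draft)`
    returns feedback text, or None when the draft is acceptable.
    """
    draft = llm(f"Task: {task}\nThink step by step, then give your answer.")
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:
            break
        draft = llm(
            f"Task: {task}\nYour previous attempt:\n{draft}\n"
            f"Feedback: {feedback}\nRevise your reasoning and your answer."
        )
    return draft
```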

Finally, change how you measure success. Don’t just measure latency and final answer quality. Start measuring process quality. Did the model use its tools correctly? Did it break the problem down logically? Can you trace its reasoning? This might require more sophisticated logging and human-in-the-loop evaluation, but it’s the only way to build systems that are truly intelligent, rather than just impressively fluent.
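
One low-tech way to start is sketched below: score the logged trace of an agent run, not just its final answer. The trace format, a list of dicts with a "type" field, is an assumption for illustration.

```python
def process_quality(trace: list[dict]) -> dict:
    """Score a logged agent run by its process, not just its final answer.

    Assumes each trace entry looks like {"type": "plan" | "tool_call" |
    "answer", "ok": bool, ...}; the format is illustrative, not a standard.
    """
    tool_calls = [e for e in trace if e["type"] == "tool_call"]
    return {
        "made_a_plan": any(e["type"] == "plan" for e in trace),
        "tool_calls": len(tool_calls),
        "tool_success_rate": (sum(e.get("ok", False) for e in tool_calls)
                              / len(tool_calls)) if tool_calls else None,
        "answered": any(e["type"] == "answer" for e in trace),
    }
```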

The path forward for AI is not simply about building bigger models with more parameters. It’s about building smarter systems with better processes. We need to move beyond the simple autoregressive loop and embrace architectures that allow for planning, reflection, and external collaboration. We need to trade the illusion of instant omniscience for the reality of deliberate, step-by-step intelligence. The future of AI won’t be measured in tokens per second, but in the quality and depth of the thoughts those tokens represent. It’s a future that will be built by those who value the pause as much as the generation, the question as much as the answer, and the process as much as the product. The journey toward genuine machine intelligence will be paved not with a torrent of words, but with the quiet, deliberate steps of a well-structured thought.
