Recursion Is The Next Scaling Law In AI
YouTube · DGtUUMNYLcc
Takeaways
- Traditional LLMs struggle with 'incompressible problems' (e.g., sorting, Sudoku) in a single feed-forward pass, as they lack sufficient 'compute depth' for iterative reasoning.
- Chain-of-thought and tool use in LLMs are 'hacks' that simulate recursion in the discrete token space; they are bounded by existing human knowledge and cannot discover new algorithms.
- Hierarchical Reasoning Models (HRM) use bio-inspired, multi-level recursion with a Deep Equilibrium (DEQ) training method that avoids full backpropagation through time, achieving state-of-the-art results on the ARC Prize with 27 million parameters.
- Tiny Recursive Models (TRM) simplify HRM, using a single network and backpropagating through just one latent recursion step, resulting in 7-million-parameter models that outperform HRM.
- The 'outer refinement loop' is a crucial scaling mechanism in recursive models, allowing iterative improvement of solutions.
- The real opportunity lies in combining the strong embedding representations of large LLMs with the efficient latent-space reasoning of small recursive models.
Insights
1. LLMs' Fundamental Limitation: Lack of Inherent Latent Reasoning
Current large language models (LLMs) are primarily feed-forward systems that process inputs in a one-shot manner. While they can perform 'chain of thought' or use tools, these are 'hacks' that simulate recursion in the discrete token space. This limits their ability to solve incompressible problems (like sorting or Sudoku) from first principles or discover new algorithms, as their reasoning is bounded by the human knowledge they were trained on.
The guest explains that an LLM cannot map an unsorted list to a sorted list in a single pass because of the theoretical lower bound on comparison sorts (on the order of n log n comparisons). If a transformer has 30 layers and a list is 31 elements long, it runs out of layers to perform the necessary sequential comparisons. Sudoku and mazes are cited as other incompressible problems. Chain of thought and tool use, by contrast, rely on existing human knowledge of solutions.
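The depth argument can be made concrete with the information-theoretic lower bound for comparison sorting. This toy calculation (not code from the episode) compares that bound to a fixed layer budget; a layer can of course do many comparisons in parallel, so the point is only that the *sequential* depth required grows with n while a feed-forward stack cannot:

```python
import math

def min_comparisons(n: int) -> int:
    """Information-theoretic lower bound for comparison sorting:
    any comparison sort needs at least ceil(log2(n!)) comparisons,
    which grows as n log n."""
    return math.ceil(math.log2(math.factorial(n)))

# A feed-forward transformer has a fixed number of layers, so it can only
# perform a bounded number of sequential reasoning steps per forward pass.
LAYERS = 30
for n in (8, 16, 31, 64):
    fits = min_comparisons(n) <= LAYERS
    print(f"n={n:3d}  lower bound={min_comparisons(n):4d}  fits in {LAYERS} steps: {fits}")
```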
2. HRM: Bio-Inspired Hierarchical Recursion with DEQ Training
Hierarchical Reasoning Models (HRM) leverage a bio-inspired architecture with three levels of recursion (low-level, high-level, and outer refinement steps) to achieve 'compute depth.' Crucially, HRM uses a Deep Equilibrium (DEQ) learning method that performs fixed-point iteration during training. Instead of backpropagating through all recursion steps, it stops gradients and reuses the updated hidden states for subsequent iterations, effectively creating 'mini-batches' from different memory states to circumvent the vanishing/exploding gradient problems of traditional RNNs.
HRM is described as directly in the lineage of RNNs, inspired by brain regions operating at different frequencies. It involves T_L steps with a low-level module (LNET), T_H steps with a high-level module (HNET), and N_sub outer refinement steps. The key trick is DEQ-style fixed-point iteration: gradients are stopped, and the same batch is re-passed with updated hidden states 16 times, treating each pass as a new 'batch' in latent space. This model achieved state-of-the-art results on ARC Prize 1 and 2 with only 27 million parameters, trained on just 1,000 tasks.
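As an illustration of the fixed-point iteration at the heart of this scheme, here is a scalar toy version of the nested low/high recursion. The real modules are transformer blocks and the "detach" is an actual stop-gradient; both are simplified away here, and all function names are illustrative rather than HRM's actual API:

```python
def l_net(z_l, z_h, x):
    # Toy low-level module: a contraction, standing in for a transformer block.
    return 0.5 * z_l + 0.25 * z_h + 0.25 * x

def h_net(z_h, z_l):
    # Toy high-level module, updated once per T_L low-level steps.
    return 0.5 * z_h + 0.5 * z_l

def hrm_segment(x, z_l, z_h, T_L=4, T_H=2):
    # One forward segment: T_H high-level steps, each containing T_L
    # low-level steps. In HRM, gradients would flow only through the
    # final step of a segment (the DEQ one-step approximation).
    for _ in range(T_H):
        for _ in range(T_L):
            z_l = l_net(z_l, z_h, x)
        z_h = h_net(z_h, z_l)
    return z_l, z_h

# Outer refinement: the same input is re-passed with the *updated* hidden
# states (detached, i.e., treated as fresh inputs), so each pass acts like
# a new data point drawn from the model's own memory.
x = 1.0
z_l = z_h = 0.0
for _ in range(16):
    z_l, z_h = hrm_segment(x, z_l, z_h)
print(z_l, z_h)  # both approach the fixed point z* = x
```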
3. TRM: Simplified and More Efficient Recursion
Tiny Recursive Models (TRM) build on HRM by simplifying the architecture and the backpropagation scheme. TRM collapses the two separate networks (LNET and HNET) into a single, weight-shared network and uses only one transformer layer instead of four. Its key training innovation is backpropagating through one full latent recursion step, rather than HRM's one-step (t = 1) truncated backprop. This simplification and optimized training allow TRM to achieve even higher performance (87% on ARC Prize 1) with a significantly smaller model (7 million parameters), demonstrating that compute depth through recursion is a powerful scaling law.
TRM simplifies HRM by collapsing LNET and HNET into a single 'net' and using just one transformer layer. Alexia's work shows that going deeper didn't help, and on some tasks like Sudoku, an MLP even outperformed attention. TRM's optimization involves backpropagating through one full latent recursion step after a detach operation. This 7-million-parameter model goes from 70% (HRM) to 87% on ARC Prize 1.
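A scalar toy version of the TRM step, with HRM's two modules collapsed into one weight-shared update rule (`net`, `y`, and `z` are illustrative stand-ins, not TRM's real code):

```python
def net(a, b, x):
    # Single shared update rule standing in for TRM's one small network.
    return 0.5 * a + 0.25 * b + 0.25 * x

def trm_step(x, y, z, n_latent=6):
    # One outer refinement step:
    #   1) n_latent latent recursions on z, conditioned on the current
    #      answer y and the input x (in TRM this whole inner loop is
    #      backpropagated through),
    #   2) one update of the answer embedding y from z.
    for _ in range(n_latent):
        z = net(z, y, x)
    y = net(y, z, x)
    # In training, (y, z) would be detached here before the next outer step.
    return y, z

x, y, z = 1.0, 0.0, 0.0
for _ in range(16):  # outer refinement loop
    y, z = trm_step(x, y, z)
```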
4. The Future: Combining Large LLMs with Tiny Recursive Reasoners
The most promising direction for AI research is to integrate the strengths of both large language models and tiny recursive models. Large LLMs excel at learning vast embedding representations from massive datasets, effectively mapping raw inputs (text, pixels) into semantically rich latent spaces. Within these high-dimensional latent spaces, small, specialized recursive models can then perform deep, iterative reasoning to solve complex problems, discover new solutions, and move beyond the limitations of human-bounded knowledge.
The host notes that TRMs and HRMs are task-specific, while LLMs are general-purpose. The guest suggests that LLMs are great at finding amazing embedding representation spaces, but reasoning within that space is not done much. The proposed future is to use LLMs to map inputs into a 'really cool latent space' where 'things are just nicely semantically separated,' and then use 'tiny reasoning models' with recursion within that space.
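A minimal sketch of the proposed hybrid: a large frozen encoder supplies the latent representation, and a tiny recursive model refines a solution inside that space. Both components here are placeholders (`embed` is a toy hash-based encoder, `refine` a toy contraction), not real model code:

```python
import hashlib

def embed(text: str, dim: int = 8) -> list:
    # Placeholder for a large LLM encoder: a deterministic map from raw
    # input into a fixed-dimensional latent vector in [0, 1].
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255.0 for b in digest[:dim]]

def refine(solution: list, latent: list, steps: int = 10) -> list:
    # Placeholder for a tiny recursive reasoner: iteratively pulls the
    # solution vector toward a function of the latent representation,
    # doing all the "thinking" inside the latent space.
    for _ in range(steps):
        solution = [0.5 * s + 0.5 * l for s, l in zip(solution, latent)]
    return solution

latent = embed("sort the list [3, 1, 2]")
solution = refine([0.0] * len(latent), latent)
```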
Bottom Line
The 'batch size across the carry space' concept in HRM/TRM training, where the same input is repeatedly processed with updated hidden states, effectively creates new 'data points' from the model's internal memory, allowing for more efficient learning of iterative processes.
This technique offers a novel way to train models on iterative tasks without needing vast amounts of diverse external data, potentially enabling AI to learn complex reasoning from limited examples by self-generating 'experience' through memory state exploration.
Develop new training paradigms that leverage internal memory states for 'self-supervised' iterative learning, reducing reliance on massive external datasets for reasoning tasks. This could be particularly valuable for domains with scarce labeled data but rich internal dynamics, such as scientific discovery or complex simulations.
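The 'batch across the carry space' idea can be sketched as follows: one external input plus a sequence of carried hidden states yields many distinct training pairs. This is toy code, with `hidden_update` as a stand-in for a detached forward segment:

```python
def hidden_update(z, x):
    # Stand-in for one detached forward segment over hidden state z.
    return 0.5 * z + 0.5 * x

def carry_space_batch(x, n_passes=16, z0=0.0):
    # Re-pass the same input with the updated (detached) hidden state.
    # Each (x, z_t) pair is effectively a new training example drawn from
    # the model's own memory rather than from the external dataset.
    batch, z = [], z0
    for _ in range(n_passes):
        batch.append((x, z))
        z = hidden_update(z, x)
    return batch

examples = carry_space_batch(x=1.0)
```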
The 'outer refinement loop' in recursive models is the primary driver of performance gains, even more so than the internal recursion steps, and can be effectively truncated at test time without significant performance loss.
This suggests that the highest-level iterative process is crucial for learning, but once learned, the model can often achieve good results with fewer iterations during inference, indicating a form of 'compiled' or efficient reasoning.
Design AI systems with dynamic inference budgets, where the number of outer refinement steps can be adjusted based on real-time computational constraints or desired accuracy, allowing for flexible deployment of powerful reasoning capabilities.
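A dynamic inference budget over the outer loop might look like the following sketch, where refinement halts as soon as the solution stops changing (all names are illustrative; `refine_once` stands in for one outer step of a trained recursive model):

```python
def refine_once(solution, x):
    # Stand-in for one outer refinement step of a trained recursive model.
    return 0.5 * solution + 0.5 * x

def solve(x, max_steps=16, tol=1e-4, z0=0.0):
    # Dynamic inference budget: run outer refinement until the solution
    # stops changing (or the step budget is exhausted), so easy inputs
    # halt early and hard ones use the full budget.
    solution = z0
    for step in range(1, max_steps + 1):
        new = refine_once(solution, x)
        if abs(new - solution) < tol:
            return new, step
        solution = new
    return solution, max_steps

answer, steps_used = solve(x=1.0)
```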
Opportunities
Hybrid AI Reasoning Engine for Incompressible Problems
Develop a platform that combines a large, pre-trained LLM for robust semantic embedding and context understanding with specialized, tiny recursive models for solving domain-specific incompressible problems (e.g., complex logistics optimization, advanced scientific simulations, novel drug discovery pathways). This engine would leverage the LLM's broad knowledge to interpret problems and the recursive models' deep reasoning to find optimal or novel solutions.
AI Algorithm Discovery and Optimization Service
Offer a service where AI, using recursive models, can discover new, more efficient algorithms for specific computational tasks (e.g., sorting, graph traversal, resource allocation) that are currently human-designed. This goes beyond merely 'calling a function' or 'chain of thought' by enabling the AI to invent novel computational strategies, potentially leading to significant performance gains in various industries.
Key Concepts
Compute Depth vs. Parameter Depth
This model distinguishes between increasing the number of parameters in a neural network (parameter depth) and increasing the number of iterative computational steps a model takes during inference (compute depth). The episode argues that compute depth, achieved through recursion, is essential for complex reasoning and can be more efficient than simply scaling up parameter depth.
Turing Machine Analogy for AI Reasoning
The discussion draws parallels between LLMs and Turing machines. A basic LLM is like a feed-forward model, limited in its computational steps. Adding external memory or iterative processing (like a Turing machine's tape) allows for more complex, 'Turing-complete' reasoning, which recursive models aim to achieve inherently.
Expectation-Maximization (EM) Algorithm in Training
The training process for recursive models like HRM and TRM resembles an EM algorithm. The model iteratively updates a 'local' hidden state (ZL) conditioned on the input and a 'global' hidden state (ZH), and then updates ZH conditioned on ZL, effectively maximizing the probability of correct information storage and output through repeated refinement.
Lessons
- For AI researchers: Explore integrating recursive architectures into existing LLM frameworks, focusing on how to enable latent-space reasoning rather than just token-space recursion (chain of thought).
- For AI developers: When tackling problems requiring deep, iterative reasoning (e.g., optimization, puzzle-solving, complex planning), consider specialized recursive models over purely scaling up transformer layers.
- For AI strategists: Investigate hybrid AI systems that leverage large models for broad understanding and smaller, recursive models for specialized, efficient problem-solving to achieve a balance of generality and depth.
Quotes
"It's actually impossible for the model to map from unsorted list to sorted lists if I have in a one shot basically."
"The chain of thought is not going to inherently discover sorting from first principles. It's finding it from historical knowledge of everything it's trained on."
"A 7 million parameter [model] can solve a hundred million, a hundred billion, a hundred billion, trillion model can't solve, trained on the entire internet, and a 7 million parameter wins."
Related Episodes

Is AI Hiding Its Full Power? With Geoffrey Hinton
"AI pioneer Geoffrey Hinton explains the foundational mechanics of neural networks, reveals AI's emergent capacity for deception and self-preservation, and outlines the profound, unpredictable societal shifts ahead."

How to Build the Future: Demis Hassabis
"DeepMind CEO Demis Hassabis details the missing pieces for Artificial General Intelligence (AGI), the strategic role of smaller AI models, and how AI will transform scientific discovery, urging founders to combine AI with other deep tech."

Tom Griffiths on The Laws of Thought | Mindscape 343
"Cognitive scientist Tom Griffiths explores the historical quest for the 'laws of thought,' revealing how logic, probability, and neural networks offer distinct yet complementary frameworks for understanding human and artificial intelligence, especially concerning resource constraints and inductive biases."

How Much Do Language Models Memorize?
"Meta researcher Jack Morris introduces a new metric for 'unintended memorization' in language models, revealing how model capacity, data rarity, and training data size influence generalization versus specific data retention."