Inference, Diffusion, World Models, and More | YC Paper Club
YouTube · wE1ZgJdt4uM
Takeaways
- ❖Inference speed is a capability, not just a cost: Speculative Speculative Decoding (SSD) significantly accelerates LLM output by parallelizing the drafting and verification processes, achieving 300 tokens/second on Llama 3 70B.
- ❖World models are becoming more robust and efficient: LeWorldModel uses a novel 'SIGReg' regularizer to prevent representational collapse, enabling stable training and 50x faster model predictive control in latent space.
- ❖Classical ML theories explain deep learning 'mysteries': PAC-Bayes bounds and soft inductive biases give a mechanistic account of overparameterization and benign overfitting, which matters for optimizing generalization.
- ❖Data efficiency is the new frontier: Aggressive regularization, ensembling, and distillation can yield up to 17x data efficiency wins in pre-training, crucial as compute scales faster than new data generation.
Insights
1. Inference as a Capability: Speculative Speculative Decoding (SSD)
Tanishk argues that inference speed is a core capability, not merely a cost or convenience factor. The Speculative Speculative Decoding (SSD) algorithm accelerates Large Language Model (LLM) inference by parallelizing the traditionally sequential drafting and verification steps. In vanilla speculative decoding, each round of drafting has to wait for the previous round's verification to finish. SSD instead anticipates the likely verification outcome and drafts the next round of tokens while verification is still running, which hides the drafting latency. This allows more tokens to be drafted per round and raises tokens per second.
SSD achieves 300 tokens per second for Llama 3 70B on 4 H100s, outperforming other inference engines. The core mechanism is predicting verification outcomes with 80-90% accuracy using information from the draft model's token distributions, allowing parallel decoding of different sequences.
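The draft, verify, and pre-draft loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the SSD implementation: `target_model` and `draft_model` are hypothetical deterministic stand-ins for real LLMs, decoding is greedy, the outcome guess is simply "every draft token is accepted and the bonus token is the drafter's own next token", and the concurrency is only indicated in comments rather than actually run in parallel.

```python
from typing import List, Tuple

VOCAB = 50

def target_model(ctx: List[int]) -> int:
    """Hypothetical stand-in for the large model (greedy next token)."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB

def draft_model(ctx: List[int]) -> int:
    """Hypothetical small model: agrees with the target except at every 7th position."""
    tok = target_model(ctx)
    return tok if len(ctx) % 7 else (tok + 1) % VOCAB

def draft(ctx: List[int], k: int) -> List[int]:
    """Draft k tokens autoregressively with the small model."""
    out: List[int] = []
    for _ in range(k):
        out.append(draft_model(ctx + out))
    return out

def verify(ctx: List[int], proposal: List[int]) -> List[int]:
    """Accept the longest prefix the target agrees with, then append the
    target's own token (a correction on mismatch, a bonus token otherwise).
    A real engine does this in one batched forward pass."""
    accepted: List[int] = []
    for tok in proposal:
        expected = target_model(ctx + accepted)
        if tok != expected:
            return accepted + [expected]
        accepted.append(tok)
    return accepted + [target_model(ctx + accepted)]

def greedy_decode(prompt: List[int], n_tokens: int) -> List[int]:
    """Baseline: one target call per token."""
    ctx = list(prompt)
    for _ in range(n_tokens):
        ctx.append(target_model(ctx))
    return ctx[len(prompt):]

def generate(prompt: List[int], n_tokens: int, k: int = 4) -> Tuple[List[int], int, int]:
    """Speculative decoding with a pre-drafted next round.
    Returns (tokens, pre-draft hits, verification rounds)."""
    ctx = list(prompt)
    hits = rounds = 0
    proposal = draft(ctx, k)
    while len(ctx) - len(prompt) < n_tokens:
        # Guess the verification outcome before it is known.
        guessed = proposal + [draft_model(ctx + proposal)]
        # In a real system this drafting would overlap with verification.
        pre_draft = draft(ctx + guessed, k)
        actual = verify(ctx, proposal)
        ctx += actual
        rounds += 1
        if actual == guessed:
            proposal = pre_draft      # guess was right: next draft is already done
            hits += 1
        else:
            proposal = draft(ctx, k)  # guess was wrong: draft again from scratch
    return ctx[len(prompt):len(prompt) + n_tokens], hits, rounds

if __name__ == "__main__":
    tokens, hits, rounds = generate([1, 2, 3], 30)
    print(tokens == greedy_decode([1, 2, 3], 30), hits, rounds)
```

Because every emitted token is either confirmed or produced by the target model, the output matches plain greedy decoding; the pre-draft only changes how much drafting work sits on the critical path.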
2. Diffusion Model Predictive Control (DMPC) for Robust Robotics
Stannis from Google DeepMind presented Diffusion Model Predictive Control (DMPC), an approach that uses diffusion models to learn both multi-step action proposals and multi-step dynamics models. This framework addresses challenges in traditional Model Predictive Control (MPC) by reducing compounding errors and simplifying planning algorithms. DMPC enables agents to adapt to novel reward functions and dynamics at runtime, a key advantage over joint modeling approaches.
DMPC demonstrates competitive results in fixed-reward tasks and, more importantly, adapts to novel behaviors (e.g., jumping) by changing reward functions at inference time. It also adapts to novel dynamics (e.g., a walker with a broken ankle) by updating only the dynamics model on new play data, showcasing the benefit of factorized representation.
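The benefit of keeping the action proposal, the dynamics model, and the reward separate can be illustrated with a toy planner. This is a sketch under stated assumptions, not the DMPC method: the state is a 1-D position, `propose` enumerates a small discrete action set as a stand-in for sampling from a learned diffusion proposal, `make_dynamics` is a hand-written stand-in for a learned multi-step dynamics model, and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

ACTIONS = (-1.0, 0.0, 1.0)

def propose(horizon: int) -> List[Tuple[float, ...]]:
    """Stand-in for a learned multi-step action proposal:
    here it simply enumerates every action sequence."""
    return list(product(ACTIONS, repeat=horizon))

def make_dynamics(gain: float) -> Callable[[float, Sequence[float]], List[float]]:
    """Stand-in for a learned multi-step dynamics model.
    `gain` is the effect of one unit of action on the position."""
    def rollout(x: float, actions: Sequence[float]) -> List[float]:
        states = []
        for a in actions:
            x = x + gain * a
            states.append(x)
        return states
    return rollout

def reach(target: float) -> Callable[[float], float]:
    """Reward: negative distance to a target position."""
    return lambda s: -abs(s - target)

def plan(x: float, rollout, reward, horizon: int = 5) -> Tuple[float, ...]:
    """Score every proposed action sequence under the dynamics model
    and return the one with the highest summed reward."""
    return max(propose(horizon),
               key=lambda acts: sum(reward(s) for s in rollout(x, acts)))

if __name__ == "__main__":
    nominal = make_dynamics(gain=1.0)
    print(plan(0.0, nominal, reach(3.0)))    # original task
    print(plan(0.0, nominal, reach(-3.0)))   # new reward, same models
    damaged = make_dynamics(gain=-1.0)       # changed dynamics, same proposal and reward
    print(plan(0.0, damaged, reach(3.0)))
```

Swapping the reward changes the plan without touching either model, and swapping the dynamics changes the plan without touching the proposal or the reward, which is the factorization argument in miniature.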
3. LeWorldModel: Elegant Regularization for Stable Dynamics Learning
Isaac Ward discussed LeWorldModel, a Joint Embedding Predictive Architecture (JEPA) from Yann LeCun's group. The model learns world dynamics by predicting future latent embeddings rather than raw images, using an image encoder and an action-conditioned forecasting module. The key innovation is the 'SIGReg' (Sketched Isotropic Gaussian Regularization) regularizer, which pushes the latent embeddings toward a 'healthy' isotropic Gaussian distribution and so prevents representational collapse without complex tricks or extensive hyperparameter tuning. This leads to more stable and efficient training.
LeWorldModel runs 50 times faster than competitors because all of its work happens in the latent space; it needs less than 24GB of VRAM and has only 15 million parameters. It demonstrates high-quality open-loop predictions and effective model predictive control on 2D and 3D tasks. A crucial capability is quantifying model error, which lets it detect perturbations in real time; model-free approaches do not offer this natively.
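To make the idea of a Gaussian-shaping regularizer concrete, here is a minimal sketch. It is a simplified stand-in, not the regularizer from the paper: it projects the embeddings onto random unit directions and penalizes each 1-D projection for deviating from zero mean and unit variance, whereas the actual method may use a different statistic. The function names and the moment-matching penalty are assumptions made for illustration.

```python
import math
import random
from typing import List

def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gaussianity_penalty(embeddings: List[List[float]],
                        n_directions: int = 32,
                        seed: int = 0) -> float:
    """Average, over random 1-D projections, of
    (mean)^2 + (variance - 1)^2.
    An isotropic standard Gaussian scores near 0; a collapsed
    representation (all embeddings equal) scores at least 1."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(n_directions):
        u = random_unit_vector(dim, rng)
        proj = [sum(e_i * u_i for e_i, u_i in zip(e, u)) for e in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / n_directions

if __name__ == "__main__":
    rng = random.Random(1)
    healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(512)]
    collapsed = [[0.5] * 8 for _ in range(512)]
    print(gaussianity_penalty(healthy), gaussianity_penalty(collapsed))
```

In training, a term like this would be added to the prediction loss so that the encoder cannot minimize the loss by mapping every input to the same point.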
4. Demystifying Deep Learning Generalization with Classical Theories
Ashe from Q Labs presented Andrew Gordon Wilson's paper, which argues that deep learning's generalization 'mysteries' (like overparameterization and benign overfitting) can be explained by classical theories. Under the PAC-Bayes framework, increasing the parameter count reduces empirical risk and leads to more compressible solutions (flat minima), which improves generalization. Benign overfitting is explained by neural networks acting as expressive models with soft inductive biases: they are flexible enough to fit random data while still being biased toward simpler solutions on structured data.
The PAC-Bayes framework shows that as model parameters increase, the compression term (related to model compressibility) decreases, leading to tighter generalization bounds. Work by Lotfi et al. shows a negative correlation between bits required to encode the training set and parameter count. The exponential increase in flat minima volume with more parameters supports the compressibility view. A regularized polynomial model illustrates how flexibility and inductive bias coexist to fit noise and generalize structured data.
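For reference, the two terms discussed above appear explicitly in one common form of the bound. The notation is an assumption introduced here, not taken from the talk: $R$ is the true risk and $\hat{R}$ the empirical risk for a loss in $[0,1]$, $n$ is the number of training samples, $P$ is a prior fixed before seeing the data, $Q$ is the learned posterior, and $K(h)$ is the length in bits of a prefix-free encoding of hypothesis $h$.

```latex
% McAllester-style PAC-Bayes bound: with probability at least 1 - \delta
% over the training sample, simultaneously for all posteriors Q,
\[
  \mathbb{E}_{h \sim Q}\!\left[ R(h) \right]
  \;\le\;
  \mathbb{E}_{h \sim Q}\!\left[ \hat{R}(h) \right]
  + \sqrt{ \frac{ \mathrm{KL}(Q \,\|\, P) + \ln \frac{2\sqrt{n}}{\delta} }{ 2n } } .
\]

% Countable-hypothesis (Occam) form with prior P(h) = 2^{-K(h)}:
% with probability at least 1 - \delta, for every hypothesis h,
\[
  R(h)
  \;\le\;
  \hat{R}(h)
  + \sqrt{ \frac{ K(h) \ln 2 + \ln \frac{1}{\delta} }{ 2n } } .
\]
```

The first term on the right is the empirical risk and the square-root term is the compression term: a solution that can be described in fewer bits, or that stays close to the prior, gets a tighter guarantee regardless of how many raw parameters the model has.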
5. Pre-training Under Data Constraints: The Power of Aggressive Regularization and Ensembling
Ku addressed the emerging challenge of data-constrained pre-training, where compute scales much faster than available human-generated data. The paper proposes and evaluates scaling recipes (aggressive regularization, ensembling, distillation) to maximize generalization under fixed data budgets. These methods aim to lower the 'compute asymptote' – the best possible loss under infinite compute – representing true data efficiency wins.
Aggressive weight decay (30x higher than compute-optimal) allows models to monotonically decrease loss with increasing parameters, fitting a clean power law with a measurable asymptote. Ensembling models yields significantly lower asymptotes than regularization alone, demonstrating a true data efficiency win (e.g., 5x for a joint scaling recipe, 17x for continued pre-training on math data). Distillation can retain 83% of loss improvement, making these data-efficient models practical for inference, and self-distillation surprisingly improves loss further.
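The notion of a measurable asymptote can be made concrete with a small worked example. The sketch assumes the loss follows a saturating power law, loss(N) = asymptote + scale * N^(-exponent), and that losses are measured at three geometrically spaced parameter counts (N, kN, k^2 N); under that assumption the asymptote has a closed form. The functional form, the function names, and the numbers below are illustrative assumptions, not values from the paper.

```python
import math

def power_law_loss(n_params: float, asymptote: float,
                   scale: float, exponent: float) -> float:
    """Assumed loss curve: asymptote + scale * N^(-exponent)."""
    return asymptote + scale * n_params ** (-exponent)

def estimate_asymptote(l1: float, l2: float, l3: float) -> float:
    """Estimate the infinite-parameter loss from losses measured at
    N, k*N and k^2*N. For an exact power law, the gaps to the
    asymptote form a geometric sequence, which gives
    asymptote = l3 - d2^2 / (d2 - d1), with d1 = l2 - l1, d2 = l3 - l2."""
    d1 = l2 - l1
    d2 = l3 - l2
    if d2 == d1:
        raise ValueError("losses are collinear; no finite asymptote can be fit")
    return l3 - d2 * d2 / (d2 - d1)

if __name__ == "__main__":
    # Synthetic curve with a known asymptote of 2.0 (made-up numbers).
    sizes = [1e6, 4e6, 16e6]
    losses = [power_law_loss(n, asymptote=2.0, scale=10.0, exponent=0.5)
              for n in sizes]
    print(estimate_asymptote(*losses))
```

Comparing the estimated asymptotes of two recipes, for example regularization alone versus ensembling, is one way to check whether a method improves the best achievable loss at a fixed data budget rather than just reaching the same limit sooner.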
Quotes
"The claim I'm going to make and maybe this is the one thing to take away from the message I'm trying to send in this talk is that inference today is seen as a sort of like cost or convenience lever. But in one, two or three years inference is going to be seen as a capability."
"The main advantages of model predictive control is it can adapt to novel reward functions at test time."
"If we can find the right inductive biases building on these theories, we might be able to optimize for them as well. And by the no free lunch theorem, the only way that we get improvements in learning efficiency is through inductive biases."
"The two major problems we have left really to solve in AI is intelligence per watt and intelligence per sample."