Y Combinator
May 28, 2026

Inference, Diffusion, World Models, and More | YC Paper Club

YouTube · wE1ZgJdt4uM

Quick Read

This YC Paper Club session explores cutting-edge AI research, from accelerating large language model inference and building robust world models for robotics to understanding deep learning generalization and optimizing pre-training under data constraints.
Speculative Speculative Decoding (SSD) speeds up LLM inference by running drafting and verification in parallel rather than one after the other.
Diffusion Model Predictive Control (DMPC) enhances robot learning by adapting to new tasks and environments using diffusion models.
New regularization techniques in world models prevent representational collapse, leading to more stable and efficient model-based control.
Aggressive regularization, ensembling, and distillation offer substantial data efficiency gains for pre-training, crucial as data becomes a bottleneck.

Summary

The inaugural YC Paper Club brought together leading founders and researchers to discuss five pivotal AI papers. Tanishk presented Speculative Speculative Decoding (SSD), an algorithm that significantly accelerates LLM inference by parallelizing drafting and verification, framing inference as a capability rather than just a cost. Stannis from Google DeepMind introduced Diffusion Model Predictive Control (DMPC), which uses diffusion models for multi-step action proposals and dynamics modeling in robotics, enabling adaptation to novel rewards and dynamics with simplified planning. Isaac Ward discussed LeWorldModel, a paper from Yann LeCun's group that tackles representational collapse in world models with the SIGReg regularizer, achieving faster and more stable training for model-based control. Ashe from Q Labs demystified deep learning generalization, applying classical theories such as PAC-Bayes bounds to explain phenomena like overparameterization and benign overfitting, and suggesting that soft inductive biases are key to future learning efficiency. Finally, Ku presented research on pre-training under data constraints, demonstrating that aggressive regularization, ensembling, and distillation can yield substantial data efficiency gains, even matching the performance obtained from much larger datasets.
These papers address fundamental challenges in AI development, offering pathways to more efficient, capable, and robust AI systems. Accelerating inference reduces operational costs and unlocks new applications for LLMs. Advanced world models and diffusion-based control promise more adaptable and intelligent robots. Understanding generalization is key to building more reliable and sample-efficient AI, while optimizing pre-training for data-constrained environments ensures continued progress as unique data sources become scarcer. Collectively, these insights are critical for founders and researchers aiming to push the boundaries of AI, from deployment to foundational research.

Takeaways

  • Inference speed is a capability, not just a cost: Speculative Speculative Decoding (SSD) significantly accelerates LLM output by parallelizing the drafting and verification processes, achieving 300 tokens/second on Llama 3 70B.
  • World models are becoming more robust and efficient: LeWorldModel uses the SIGReg regularizer to prevent representational collapse, enabling stable training and 50x faster model predictive control in latent space.
  • Classical ML theories explain deep learning 'mysteries': PAC-Bayes bounds and soft inductive biases provide a mechanistic understanding of overparameterization and benign overfitting, crucial for optimizing generalization.
  • Data efficiency is the new frontier: Aggressive regularization, ensembling, and distillation can yield up to 17x data efficiency wins in pre-training, crucial as compute scales faster than new data generation.

Insights

1. Inference as a Capability: Speculative Speculative Decoding (SSD)

Tanishk argues that inference speed is a core capability, not merely a cost or convenience factor. The Speculative Speculative Decoding (SSD) algorithm dramatically accelerates Large Language Model (LLM) inference by parallelizing the traditionally sequential drafting and verification steps. Unlike vanilla speculative decoding, which struggles with sequential dependencies, SSD anticipates likely verification outcomes and drafts the next round of tokens simultaneously, effectively hiding drafting latency. This allows for more tokens to be drafted per round and significantly boosts tokens per second.

SSD achieves 300 tokens per second for Llama 3 70B on 4 H100s, outperforming other inference engines. The core mechanism is predicting verification outcomes with 80-90% accuracy using information from the draft model's token distributions, allowing parallel decoding of different sequences.
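The latency-hiding argument can be illustrated with a toy timing model. This is a minimal sketch, not the SSD implementation: the function names and the millisecond figures are hypothetical, and it assumes that a correctly predicted verification outcome lets the pre-drafted tokens be used immediately, while a misprediction falls back to drafting from scratch.

```python
def vanilla_round_ms(draft_ms, verify_ms):
    """Sequential speculative decoding: draft first, then verify."""
    return draft_ms + verify_ms


def ssd_round_ms(draft_ms, verify_ms, hit_rate):
    """Expected round time when the next draft runs during verification.

    Toy assumption: on a hit, the round costs whichever of the overlapped
    draft and the verification takes longer; on a miss, the draft is redone
    after verification, as in the sequential case.
    """
    hit_ms = max(draft_ms, verify_ms)
    miss_ms = draft_ms + verify_ms
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


if __name__ == "__main__":
    draft_ms, verify_ms = 10.0, 30.0  # hypothetical per-round costs
    for hit_rate in (0.0, 0.8, 0.9, 1.0):
        print(hit_rate,
              vanilla_round_ms(draft_ms, verify_ms),
              ssd_round_ms(draft_ms, verify_ms, hit_rate))
```

In this model a hit rate of zero reproduces the sequential cost, and the saving grows linearly with the accuracy of the outcome prediction, which is why the 80-90% figure matters.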

2. Diffusion Model Predictive Control (DMPC) for Robust Robotics

Stannis from Google DeepMind presented Diffusion Model Predictive Control (DMPC), an approach that uses diffusion models to learn both multi-step action proposals and multi-step dynamics models. This framework addresses challenges in traditional Model Predictive Control (MPC) by reducing compounding errors and simplifying planning algorithms. DMPC enables agents to adapt to novel reward functions and dynamics at runtime, a key advantage over joint modeling approaches.

DMPC demonstrates competitive results in fixed-reward tasks and, more importantly, adapts to novel behaviors (e.g., jumping) by changing reward functions at inference time. It also adapts to novel dynamics (e.g., a walker with a broken ankle) by updating only the dynamics model on new play data, showcasing the benefit of factorized representation.
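The benefit of keeping the action proposal, the dynamics model, and the reward separate can be sketched with a sample-score-select planner. This is an illustrative toy, not the DMPC code: uniform random proposals and a one-dimensional point system stand in for the learned diffusion models, and every name below is hypothetical.

```python
import random


def propose_actions(rng, horizon):
    """Hypothetical stand-in for the learned multi-step action proposal."""
    return [rng.uniform(-1.0, 1.0) for _ in range(horizon)]


def nominal_dynamics(state, actions):
    """Hypothetical stand-in for the learned multi-step dynamics model."""
    states = []
    for action in actions:
        state = state + action
        states.append(state)
    return states


def damaged_dynamics(state, actions):
    """Same toy system, but each action has only half its usual effect."""
    return nominal_dynamics(state, [0.5 * a for a in actions])


def plan(state, reward_fn, dynamics_fn, seed=0, horizon=5, num_samples=256):
    """Sample candidate plans, score each one, and return the best."""
    rng = random.Random(seed)
    best_actions, best_score = None, float("-inf")
    for _ in range(num_samples):
        actions = propose_actions(rng, horizon)
        predicted_states = dynamics_fn(state, actions)
        score = sum(reward_fn(s) for s in predicted_states)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions, best_score


def reach_right(state):
    return -abs(state - 3.0)


def reach_left(state):
    return -abs(state + 3.0)


if __name__ == "__main__":
    for reward_fn in (reach_right, reach_left):
        for dynamics_fn in (nominal_dynamics, damaged_dynamics):
            _, score = plan(0.0, reward_fn, dynamics_fn)
            print(reward_fn.__name__, dynamics_fn.__name__, round(score, 3))
```

Swapping `reward_fn` changes the behavior without touching the other two components, and swapping `dynamics_fn` models a changed body without touching the proposal or the reward.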

3. LeWorldModel: Elegant Regularization for Stable Dynamics Learning

Isaac Ward discussed LeWorldModel, a Joint Embedding Predictive Architecture (JEPA) from Yann LeCun's group. The model learns world dynamics by predicting future latent embeddings rather than raw images, using an image encoder and an action-conditioned forecasting module. The key innovation is the SIGReg (Sketched Isotropic Gaussian Regularization) regularizer, which keeps the latent embeddings close to a 'healthy' isotropic Gaussian distribution and so prevents representational collapse without complex tricks or hyperparameter tuning. This leads to more stable and efficient training.
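The idea behind the Gaussian regularizer described above can be sketched as a penalty on random one-dimensional projections of the embeddings. This is a deliberately simplified stand-in, not the regularizer from the paper: it assumes the "sketching" refers to random projections, it compares only the mean and variance of each projection with those of a standard Gaussian, and the function names are hypothetical.

```python
import math
import random


def random_unit_vector(rng, dim):
    """Draw a direction uniformly from the unit sphere."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def sliced_gaussian_penalty(embeddings, num_directions=16, seed=0):
    """Average mismatch to N(0, 1) over random projection directions.

    Simplified assumption: only the first two moments of each projection
    are checked, not the full one-dimensional distribution.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        direction = random_unit_vector(rng, dim)
        proj = [sum(d * x for d, x in zip(direction, emb)) for emb in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        total += mean ** 2 + (var - 1.0) ** 2
    return total / num_directions


if __name__ == "__main__":
    rng = random.Random(1)
    healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
    collapsed = [[0.0] * 8 for _ in range(2000)]
    print("healthy:", sliced_gaussian_penalty(healthy))
    print("collapsed:", sliced_gaussian_penalty(collapsed))
```

A collapsed encoder that maps every input to the same point has zero variance along every direction, so the penalty stays high; embeddings that already look like an isotropic Gaussian score close to zero.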

LeWorldModel runs 50 times faster than competitors because it does all of its work in latent space, and it needs less than 24GB of VRAM and only 15 million parameters. It demonstrates high-quality open-loop predictions and effective model predictive control on 2D and 3D tasks. It can also quantify its own model error and so detect perturbations in real time, a capability that model-free approaches do not natively offer.

4. Demystifying Deep Learning Generalization with Classical Theories

Ashe from Q Labs presented Andrew Gordon Wilson's paper, which argues that deep learning's generalization 'mysteries' (like overparameterization and benign overfitting) can be explained by classical theories. Using the PAC-Bayes framework, the paper shows that increasing the parameter count reduces empirical risk and leads to more compressible solutions (flat minima), which improves generalization. Benign overfitting is explained by neural networks acting as expressive models with soft inductive biases: they are flexible enough to fit random data while remaining biased towards simpler solutions for structured data.

The PAC-Bayes framework shows that as model parameters increase, the compression term (related to model compressibility) decreases, leading to tighter generalization bounds. Work by Lotfi et al. shows a negative correlation between the number of bits required to encode the training set and the parameter count. The exponential increase in flat minima volume with more parameters supports the compressibility view. A regularized polynomial model illustrates how flexibility and inductive bias coexist, letting the same model fit noise and still generalize on structured data.
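For reference, one classical bound of this kind is the Occam-style bound for a countable hypothesis class, which is a special case of PAC-Bayes. The summary does not reproduce the exact statement used in the talk, so the form below is a standard textbook version and the symbols are notation introduced here: a loss bounded in [0, 1], n training samples, true risk R(h), empirical risk \hat{R}(h), a prior P chosen before seeing the data, and K(h) the length in bits of a prefix-free encoding of h. With probability at least 1 - \delta, for every hypothesis h:

```latex
R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln\frac{1}{P(h)} + \ln\frac{1}{\delta}}{2n}},
\qquad
P(h) = 2^{-K(h)} \;\Longrightarrow\; \ln\frac{1}{P(h)} = K(h)\,\ln 2 .
```

The second term is the compression term: the fewer bits needed to describe the trained model, the tighter the bound, independent of the raw parameter count.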

5. Pre-training Under Data Constraints: The Power of Aggressive Regularization and Ensembling

Ku addressed the emerging challenge of data-constrained pre-training, where compute scales much faster than available human-generated data. The paper proposes and evaluates scaling recipes (aggressive regularization, ensembling, distillation) to maximize generalization under fixed data budgets. These methods aim to lower the 'compute asymptote' – the best possible loss under infinite compute – representing true data efficiency wins.

Aggressive weight decay (30x higher than compute-optimal) allows models to monotonically decrease loss with increasing parameters, fitting a clean power law with a measurable asymptote. Ensembling models yields significantly lower asymptotes than regularization alone, demonstrating a true data efficiency win (e.g., 5x for a joint scaling recipe, 17x for continued pre-training on math data). Distillation can retain 83% of loss improvement, making these data-efficient models practical for inference, and self-distillation surprisingly improves loss further.
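The notion of a measurable asymptote can be made concrete by fitting a power law with an offset, loss(N) = E + A * N^(-alpha), where E is the loss approached as the parameter count N grows without bound. The following is a minimal sketch under that assumed functional form; the fitting procedure (grid search over E plus a straight-line fit in log space) and the synthetic numbers are illustrative choices, not the paper's method or data.

```python
import math


def fit_power_law_with_asymptote(params, losses, grid_steps=4000):
    """Fit loss(N) = E + A * N**(-alpha); return (E, A, alpha)."""
    log_n = [math.log(n) for n in params]
    mean_x = sum(log_n) / len(log_n)
    sxx = sum((x - mean_x) ** 2 for x in log_n)
    upper = min(losses)
    best = None
    for i in range(grid_steps):
        e = upper * i / grid_steps  # candidate asymptote in [0, min(losses))
        log_gap = [math.log(loss - e) for loss in losses]
        # Least squares for log_gap = log(A) - alpha * log_n.
        mean_y = sum(log_gap) / len(log_gap)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_n, log_gap))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        residual = sum((y - (intercept + slope * x)) ** 2
                       for x, y in zip(log_n, log_gap))
        if best is None or residual < best[0]:
            best = (residual, e, math.exp(intercept), -slope)
    _, e, a, alpha = best
    return e, a, alpha


if __name__ == "__main__":
    # Synthetic, noise-free losses generated from made-up constants.
    params = [1e7, 3e7, 1e8, 3e8, 1e9]
    losses = [3.0 + 50.0 * n ** -0.3 for n in params]
    print(fit_power_law_with_asymptote(params, losses))
```

Comparing the fitted E between two recipes trained on the same data budget is one way to express the claim that a recipe lowers the asymptote rather than merely reaching it sooner.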

Quotes

"The claim I'm going to make and maybe this is the one thing to take away from the message I'm trying to send in this talk is that inference today is seen as a sort of like cost or convenience lever. But in one, two or three years inference is going to be seen as a capability."

Tanishk

"The main advantages of model predictive control is it can adapt to novel reward functions at test time."

Stannis

"If we can find the right inductive biases building on these theories, we might be able to optimize for them as well. And by the no free lunch theorem, the only way that we get improvements in learning efficiency is through inductive biases."

Ashe

"The two major problems we have left really to solve in AI is intelligence per watt and intelligence per sample."

Host
