5 Papers That Show Where AI Research Is Heading Right Now
YouTube · 3rWSvrFahIY
Takeaways
- AI in biology: Large-scale protein language models (like ESM Cambrian) demonstrate 'bitter lesson' success, predicting protein structures from sequence data alone by scaling data to billions of samples, often outperforming hand-engineered methods in data-scarce areas like antibody design.
- Self-play for LLMs: Vanilla self-play often plateaus because models generate artificially complex, useless problems. Self-guided self-play, using a 'guide' to ensure generated tasks are relevant and non-trivial, helps overcome this, enabling smaller models to achieve performance comparable to much larger ones.
- Stream RAG for Voice AI: Traditional RAG introduces unacceptable latency for real-time voice agents. Stream RAG aims to proactively retrieve relevant information in chunks as a user speaks, identifying critical query points to initiate retrieval early and maintain conversational flow.
- Verified Intelligence with Lean: Formal proof languages like Lean enable 'verified intelligence' in math and science, ensuring mathematical proofs and code are provably correct. This addresses the limitations of informal math and the need for guarantees in AI-generated code.
- Token Maxing for Software Engineering: Adopting a Real-Time Strategy (RTS) game mindset for AI-assisted software development involves parallelizing tasks with multiple agents, aggressively documenting knowledge bases, and optimizing for high 'Actions Per Minute' (APM) to maximize output and adapt quickly.
Insights
1. Scaling Data Unlocks Protein Structure Prediction via 'Bitter Lesson'
The ESM Cambrian model, by scaling training data to 2.8 billion protein sequences (largely metagenomic data), demonstrates that generalist protein language models can learn complex 3D protein structures purely from sequence co-occurrence patterns. This 'bitter lesson' approach allows the model to predict long-distance protein contacts and even outperform specialized, hand-engineered models like AlphaFold 3 in specific tasks like antibody design, especially where MSA data is scarce.
ESM Cambrian shows a clean log-linear scaling curve for predicting long-distance protein contacts with increasing compute and data, unlike previous ESM2 models that plateaued. It achieves near-parity with AlphaFold 3 for general protein complexes and outperforms it (50% vs 47% DockQ pass rate) in antibody applications without requiring multiple sequence alignments (MSAs).
2. Vanilla Self-Play for LLMs Plateaus Due to Artificially Complex Tasks
Traditional self-play algorithms for LLMs, which reward a 'conjecturer' for generating problems hard for a 'solver,' often fail to lead to continuous improvement. The conjecturer learns to create messy, overly complex, and inelegant problems that are difficult to solve but offer no meaningful learning signal, leading to a plateau in solver performance.
Vanilla self-play on formal math problems showed no better performance than regular RL, despite the conjecturer generating more and more 'frontier' tasks. An example problem generated late in training was described as an 'incredibly complicated, overly complex disaster of a statement' that was useless for broader mathematical tasks.
3. Self-Guided Self-Play Improves LLM Learning by Grounding Task Generation
To combat the plateau in vanilla self-play, 'self-guided self-play' introduces two mechanisms: grounding synthetic task generation by prompting the conjecturer to produce problems related to unsolved target problems, and a 'guide' component that judges the relevance and complexity of generated tasks. This dual reward system ensures tasks are both challenging and useful.
Self-guided self-play (SGS) significantly improved the problem-solving rate of a 7B parameter model, achieving performance comparable to a 670B parameter model, even with 8x less compute. This demonstrates its effectiveness in generating higher-quality learning signals.
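The dual reward can be illustrated with a minimal sketch. The summary does not give the paper's actual reward formula, so everything here is an assumption made for illustration: the names `TaskStats`, `difficulty_reward`, and `conjecturer_reward` are hypothetical, difficulty is approximated from the solver's pass rate, and the guide is assumed to return a relevance score in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical per-task statistics collected during self-play."""
    solver_pass_rate: float  # fraction of solver attempts that succeeded, in [0, 1]
    guide_score: float       # guide's relevance/quality judgment, in [0, 1]


def difficulty_reward(pass_rate: float) -> float:
    """Assumed shape: peaks at a 0.5 pass rate, zero if always or never solved."""
    return 4.0 * pass_rate * (1.0 - pass_rate)


def conjecturer_reward(stats: TaskStats) -> float:
    """Multiply both signals so a task must be hard AND relevant to score well."""
    return difficulty_reward(stats.solver_pass_rate) * stats.guide_score


# A messy, irrelevant problem at the solver's frontier earns little reward,
# while an equally hard but relevant problem earns the full reward.
messy = conjecturer_reward(TaskStats(solver_pass_rate=0.5, guide_score=0.1))
useful = conjecturer_reward(TaskStats(solver_pass_rate=0.5, guide_score=1.0))
```

Multiplying rather than adding the two terms is one possible design choice: it means difficulty alone can never compensate for a task the guide considers irrelevant, which is the failure mode described for vanilla self-play.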
4. Stream RAG Reduces Latency for Real-Time Voice AI Agents
Traditional Retrieval-Augmented Generation (RAG) systems introduce unacceptable latency for conversational voice AI. Stream RAG addresses this by initiating the RAG pipeline in chunks while the user is still speaking, using partial queries to proactively retrieve information. This requires a mechanism to determine the optimal moment to trigger retrieval and refine queries.
The paper demonstrates that Stream RAG can decrease latency by 0.5 seconds for synthetic datasets and 1.5 seconds for human-spoken datasets, while maintaining accuracy comparable to full-query RAG, making natural, real-time voice interactions possible.
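The idea of retrieving on partial queries can be sketched as follows. This is a toy illustration, not the paper's method: the trigger rule (retrieve once enough new content words have arrived), the stopword list, the keyword-overlap retriever, and all function names are assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical stopword list; a real system would use something richer.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "what", "who", "to"}

CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]


def content_words(text: str) -> frozenset:
    """Lowercased words with surrounding punctuation and stopwords removed."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return frozenset(w for w in words if w and w not in STOPWORDS)


def keyword_retrieve(query: str) -> List[str]:
    """Toy retriever: return the corpus entry with the most shared content words."""
    q = content_words(query)
    ranked = sorted(CORPUS, key=lambda d: len(q & content_words(d)), reverse=True)
    return ranked[:1]


def stream_rag(
    chunks: Iterable[str],
    retrieve: Callable[[str], List[str]],
    min_new_words: int = 2,
) -> Tuple[List[str], int]:
    """Return (documents, retrieval_call_count) for an utterance arriving in chunks."""
    transcript = ""
    seen: frozenset = frozenset()
    docs: List[str] = []
    calls = 0
    for chunk in chunks:
        transcript = (transcript + " " + chunk).strip()
        current = content_words(transcript)
        # Trigger early retrieval only when enough new content words have arrived.
        if len(current - seen) >= min_new_words:
            docs = retrieve(transcript)
            seen = current
            calls += 1
    # End of speech: refresh only if the cached result is stale.
    if content_words(transcript) != seen:
        docs = retrieve(transcript)
        calls += 1
    return docs, calls


# Retrieval fires after the first chunk; the trailing words add nothing new,
# so the cached result is reused when the user stops speaking.
docs, calls = stream_rag(["The capital of France", "is what?"], keyword_retrieve)
```

In this sketch the latency saving comes from the retrieval call completing while the user is still talking; the final check guards against answering from a result that no longer matches the full query.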
5. Lean Enables 'Verified Intelligence' for Provably Correct Math and Code
Lean, an interactive theorem prover and functional programming language, is central to the concept of 'verified intelligence.' It allows for the formalization and rigorous verification of mathematical proofs and software code, ensuring 100% correctness without human assumptions or 'hand-waving.' This is crucial for high-stakes applications in science and robust software.
Lean's Mathlib is the largest formalized math library, containing high-quality proofs across diverse fields. Recent breakthroughs include AI solving IMO gold medal problems and OpenAI/DeepMind solving 80-year-old Erdős problems with formal verification in the loop. Lean also supports program verification, showing code satisfies specifications.
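As a small illustration of program verification, the following Lean 4 sketch defines a function and proves that it meets a specification. The names `double` and `double_spec` are hypothetical, and the example assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- A tiny program: double a natural number.
def double (n : Nat) : Nat := n + n

-- Specification: `double n` always equals `2 * n`.
-- Lean checks this proof mechanically; an incorrect proof is rejected.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

The theorem covers every natural number at once, which is the difference between a formal proof and testing a handful of inputs.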
6. RTS Principles Maximize Productivity in AI-Assisted Software Engineering
An 'AI Token Maxer' approach to software engineering leverages AI agents like an RTS game player. This involves parallelizing work across many agents (units), aggressively documenting knowledge bases, minimizing human keystrokes to initiate tasks, and constantly monitoring agent progress ('mini-map') to course-correct. The goal is to maximize 'Actions Per Minute' (APM) by agents, focusing on high throughput over individual perfection.
The speaker's company, Channel AI, reports a 3.5x increase in PRs per engineer per month, plus a further 60% growth in the last month, after broadly adopting these principles. This includes using orchestrator agents, 'dangerously skip permissions' mode for speed, and internal APM trackers based on tool calls.
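An APM tracker based on tool calls could be as simple as counting calls in a time window. The talk does not describe how the internal tracker works, so this is a minimal sketch under assumed inputs: a list of tool-call timestamps per agent, and a hypothetical function name `actions_per_minute`.

```python
from datetime import datetime, timedelta
from typing import Iterable


def actions_per_minute(
    tool_call_times: Iterable[datetime],
    window_start: datetime,
    window_end: datetime,
) -> float:
    """Tool calls per minute inside the half-open window [window_start, window_end)."""
    minutes = (window_end - window_start).total_seconds() / 60.0
    if minutes <= 0:
        raise ValueError("window_end must be after window_start")
    in_window = sum(1 for t in tool_call_times if window_start <= t < window_end)
    return in_window / minutes


# Example: one tool call every 10 seconds over a 5-minute window.
start = datetime(2024, 1, 1, 12, 0, 0)
calls = [start + timedelta(seconds=10 * i) for i in range(30)]
apm = actions_per_minute(calls, start, start + timedelta(minutes=5))  # 6.0
```

Summing this value across all running agents would give a team-level throughput figure of the kind the RTS analogy suggests.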
Bottom Line
The 'bitter lesson' in AI for biology suggests that the vast, untapped metagenomic data (proteins from uncultured organisms in diverse environments) is the next frontier for scaling protein language models, potentially leading to breakthroughs in drug design and understanding cellular function.
This implies that the bottleneck for biological AI might not be model architecture but rather the sheer volume and diversity of biological data available, much of which is still being discovered and digitized from natural environments.
Invest in technologies and research for large-scale metagenomic sequencing, data curation, and efficient training infrastructure for biological foundation models. Companies focusing on 'data scaling' in bio-AI could gain a significant advantage.
The non-determinism observed even at temperature zero in LLM inference, caused by tiny floating-point arithmetic differences, highlights a fundamental challenge for achieving provable correctness in AI systems, especially those deployed on GPUs.
This means that even seemingly deterministic AI models can produce subtly different outputs, which is problematic for applications requiring high reliability and reproducibility, such as scientific simulations or safety-critical systems.
Develop and integrate formal verification tools (like TorchLean) directly into AI model development and deployment pipelines, extending verification down to the GPU kernel level. This could create a niche for 'verified AI' solutions in high-assurance domains.
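The floating-point effect behind this non-determinism can be reproduced in plain Python. Floating-point addition is not associative, so summing the same numbers in a different order can change the last bits of the result. Linking this to GPU inference is an assumption on our part: the summary only says the cause is tiny floating-point differences, and a varying reduction order in parallel kernels is one plausible way such differences arise.

```python
# The same three numbers, added in two different orders.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6

# The results differ in the last bit even though the math is "the same".
same_result = left_to_right == right_to_left  # False
difference = abs(left_to_right - right_to_left)
```

A difference this small is harmless in most settings, but if two candidate tokens have nearly equal scores it can be enough to change which one is selected, after which the generated text diverges.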
The 'token maxing' philosophy advocates for treating AI agents as cheap, parallelizable labor, prioritizing high throughput and rapid iteration over meticulous, single-threaded human oversight, even if it means agents make more mistakes.
This fundamentally shifts the role of a human engineer from a meticulous coder to an 'orchestrator' or 'commander' of AI agents, focusing on macro-management, rapid course correction, and leveraging AI for sheer volume of output.
Build tools and platforms that enable seamless parallel execution of AI agents, intuitive 'mini-map' style monitoring, and automated knowledge base updates. This could redefine software development workflows and significantly boost engineering productivity in AI-first companies.
Opportunities
Metagenomic Data Curation and AI Training Platform
Develop a platform that specializes in collecting, curating, and preparing vast metagenomic protein sequence data for training large-scale protein language models. Offer this as a service or API to biotech and pharmaceutical companies looking to leverage the 'bitter lesson' in biological AI.
Self-Guided LLM Task Generation Service
Create a service or tool that implements 'self-guided self-play' to generate high-quality, non-trivial training tasks for custom LLMs. This would help companies continuously improve their models beyond human-generated data, especially for specialized domains like formal mathematics or coding.
Real-Time Stream RAG API for Voice AI
Offer an API that provides real-time, low-latency RAG capabilities for voice AI applications. This API would intelligently process partial user queries, proactively retrieve relevant information, and integrate seamlessly with voice agents to enable natural, hallucination-free conversations.
Verified AI Software Development Kit (SDK)
Develop an SDK that integrates formal verification tools (like Lean) into standard software development workflows. This would allow developers to write provably correct code, verify AI model properties (e.g., certified robustness), and ensure the reliability of critical AI applications.
RTS-Inspired AI Agent Orchestration Platform
Build a platform that enables engineers to manage and orchestrate multiple AI agents in parallel, similar to a Real-Time Strategy game. Features would include 'mini-map' style monitoring, automated knowledge base integration, rapid task spawning, and metrics like 'Actions Per Minute' for agents.
Key Concepts
The Bitter Lesson
Richard Sutton's principle that in AI, methods leveraging general computation and data scaling consistently outperform methods relying on human-engineered domain knowledge. This is demonstrated in protein biology where large language models trained on vast sequence data achieve superior performance in predicting protein structures.
Intelligence Per Sample / Intelligence Per Watt
Two critical metrics for AI efficiency. 'Intelligence per sample' refers to how much a model learns from each new data point, aiming for monotonic improvement. 'Intelligence per watt' emphasizes achieving high intelligence with minimal computational energy, often favoring smaller, more efficient models.
Real-Time Strategy (RTS) Game Principles for Software Engineering
Applying RTS game strategies like 'macro by default, micro when it counts,' high visibility, parallelizing tasks, and maximizing 'Actions Per Minute' (APM) to AI-assisted software development. This involves spawning multiple AI agents, aggressively using knowledge bases, and continuously adapting to maximize productivity and output.
Lessons
- Prioritize data scaling for AI models, especially in new domains like biology, as generalist models trained on massive datasets often outperform those with hand-engineered features.
- When implementing self-play for LLMs, design mechanisms (like a 'guide') to ensure generated tasks are genuinely useful and non-trivial, preventing the model from creating artificially complex but unhelpful problems.
- For real-time conversational AI, explore and implement streaming RAG techniques to proactively retrieve information as users speak, significantly reducing latency and improving the naturalness of interactions.
- Investigate formal verification tools like Lean for high-stakes AI applications in math, science, and software, aiming to achieve provably correct algorithms and code, thereby increasing reliability and trust.
- Adopt a 'token maxing' or RTS-inspired approach to AI-assisted software development: parallelize tasks with multiple agents, aggressively document knowledge in linked files, and optimize for high agent 'Actions Per Minute' (tool calls) to maximize throughput.
RTS-Inspired AI Agent Software Engineering Workflow (Token Maxing)
**Orchestrate Agents**: Use a primary agent (e.g., Claude) as an orchestrator to spawn and manage multiple worker agents for different tasks. Minimize human keystrokes to initiate work.
**Parallelize Work with Worktrees**: Utilize Git worktrees to maintain separate, compiling repositories for parallel development by multiple agents, preventing conflicts and enabling concurrent progress.
**Maximize Agent Autonomy & Throughput**: Instruct worker agents to push tasks as far as possible (e.g., to a Pull Request) before requesting human feedback. Prioritize high 'Actions Per Minute' (APM) based on tool calls, even if it means agents make more mistakes initially.
**Aggressively Document Knowledge Bases**: Create structured, linked wiki-style knowledge base files (MD files) that agents can quickly access. This reduces reliance on expensive code context and benefits future agents and human teammates.
**Maintain High Visibility & Rapid Course Correction**: Implement systems (e.g., audio cues, color-coding, quick jump buttons) to monitor agent progress across multiple tasks. Be prepared to quickly audit, course-correct, and fix agent mistakes as they occur.
**Run in Sandbox with Skip Permissions**: Whenever possible, run agents in a sandboxed environment with 'dangerously skip permissions' mode to avoid constant human approvals and accelerate execution. If no suitable sandbox exists yet, create one before enabling this mode.
**Continuously Learn and Adapt**: Feed agent outputs, human corrections, and new insights back into the knowledge base to continuously improve agent performance and adapt the workflow.
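The worktree step above can be sketched as a small helper that builds one `git worktree add` command per agent task. The branch prefix `agent/`, the directory prefix `wt-`, and the function name are hypothetical conventions, not something described in the talk; only the `git worktree add -b <branch> <path> <commit-ish>` form is standard Git.

```python
from pathlib import Path
from typing import List


def worktree_add_command(worktree_root: Path, task: str, base: str = "main") -> List[str]:
    """Build a `git worktree add` command giving one agent its own checkout and branch."""
    branch = f"agent/{task}"            # hypothetical branch naming convention
    path = worktree_root / f"wt-{task}"  # hypothetical directory naming convention
    return ["git", "worktree", "add", "-b", branch, str(path), base]


# One command per parallel task; each agent then works only inside its own directory.
tasks = ["fix-login", "add-metrics"]
commands = [worktree_add_command(Path("worktrees"), t) for t in tasks]
```

Each command could then be executed from inside the main repository, for example with `subprocess.run(cmd, cwd=repo_dir, check=True)`, where `repo_dir` is the path to the existing clone.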
Notable Moments
The host's challenge to the idea that training on human-generated data (H) can lead to sampling the full solution space (F) via test-time compute and recursive self-improvement, arguing that such training confines the model to the typical human set H, so the remainder, F minus H, is never feasibly sampled.
This highlights a fundamental debate in AI research about the path to AGI: whether human data is a necessary stepping stone or a limiting factor, advocating for AlphaZero-style self-play unbiased by human 'meandering' to reach more intelligent systems.
The observation that ICL (In-Context Learning) performance does not monotonically improve with more samples and hits a cliff at context length, contrasting with human learning.
This points to a core limitation of current LLM learning paradigms and suggests that there must be alternative learning procedures with much higher 'intelligence per sample' to achieve human-like continuous improvement.
The discovery that protein language models, trained purely on masked language modeling, spontaneously organize their latent space into a hierarchy of interpretable biological concepts, from amino acids to protein roles.
This 'unsupervised emergence' of biological understanding from a simple sequence task is 'utterly crazy' (as stated by the presenter) and suggests that large-scale generalist models can discover fundamental scientific principles without explicit supervision, opening new avenues for scientific discovery.
Quotes
"If the full solution space is F, training on known human solutions will limit you to some typical set H, despite any feasible amount of test time compute or recursive self-improvement. You won't feasibly sample F minus H."
"Methods that win are methods that are general, that sort of exploit really fundamentals of like scaling compute and data as opposed to methods that sort of handgineer human domain, human domain knowledge."
"You'll know a word by the company that it keeps, and here the idea is that you'll know a protein by amino acids it keeps."
"The easiest way to produce tricky problems is produce these basically messy, artificially complex and inelegant problems."
"The code is often like a really expensive source of truth for the agents to pull context out of, and it's actually really cheap, especially when you have all the context loaded in memory, to like aggressively document things in a way that benefit future agents."