Y Combinator
June 12, 2026

5 Papers That Show Where AI Research Is Heading Right Now

YouTube · 3rWSvrFahIY

Quick Read

This Y Combinator session explores five cutting-edge AI research papers, revealing advancements in AI for biology, self-play for LLMs, real-time voice agents, formal math verification, and agentic programming workflows.
Scaling data and compute drives 'bitter lesson' success in protein AI, outperforming human-engineered features.
Self-guided self-play is crucial for LLMs to generate meaningful, non-trivial learning tasks beyond human data.
Real-time voice AI demands 'Stream RAG' to pre-fetch information, drastically reducing latency for natural conversations.

Summary

This Y Combinator AI research session features five presentations on diverse, cutting-edge AI topics. Yas Beg discusses applying the 'bitter lesson' to protein biology, demonstrating how large-scale, general models trained on vast biological sequence data can accurately predict protein structures without human-engineered features, outperforming specialist models in certain contexts like antibody design. Luke details the challenges of self-play for LLMs, explaining why vanilla self-play plateaus and introducing 'self-guided self-play' to generate more useful, less artificially complex training tasks. Arnab Matei from Giga presents 'Stream RAG,' a method to reduce latency in voice AI by proactively retrieving information in chunks while a user is speaking, addressing a critical production challenge. Robert George introduces Lean, a formal proof language, highlighting its role in 'verified intelligence' for mathematics and science, enabling provably correct code and mathematical theorems. Finally, Luke Orthwine shares 'token maxing' strategies for software engineering with AI agents, advocating for a Real-Time Strategy (RTS) game approach to parallelize work, maximize agent activity, and leverage knowledge bases for extreme productivity.
These presentations collectively highlight the rapid evolution of AI across diverse fields, from fundamental scientific discovery in biology and mathematics to practical applications in voice AI and software development. They underscore the growing importance of scaling data and compute, the shift towards more generalist AI models, and the development of new paradigms for human-AI collaboration and verification. The insights offer a glimpse into the future of AI's impact on scientific research, product development, and engineering workflows, emphasizing efficiency, reliability, and the potential for AI to surpass human-level performance in complex domains.

Takeaways

  • AI in biology: Large-scale protein language models (like ESM Cambrian) demonstrate 'bitter lesson' success, predicting protein structures from sequence data alone by scaling data to billions of samples, often outperforming hand-engineered methods in data-scarce areas like antibody design.
  • Self-play for LLMs: Vanilla self-play often plateaus because models generate artificially complex, useless problems. Self-guided self-play, using a 'guide' to ensure generated tasks are relevant and non-trivial, helps overcome this, enabling smaller models to achieve performance comparable to much larger ones.
  • Stream RAG for Voice AI: Traditional RAG introduces unacceptable latency for real-time voice agents. Stream RAG aims to proactively retrieve relevant information in chunks as a user speaks, identifying critical query points to initiate retrieval early and maintain conversational flow.
  • Verified Intelligence with Lean: Formal proof languages like Lean enable 'verified intelligence' in math and science, ensuring mathematical proofs and code are provably correct. This addresses the limitations of informal math and the need for guarantees in AI-generated code.
  • Token Maxing for Software Engineering: Adopting a Real-Time Strategy (RTS) game mindset for AI-assisted software development involves parallelizing tasks with multiple agents, aggressively documenting knowledge bases, and optimizing for high 'Actions Per Minute' (APM) to maximize output and adapt quickly.

Insights

1. Scaling Data Unlocks Protein Structure Prediction via 'Bitter Lesson'

The ESM Cambrian model, by scaling training data to 2.8 billion protein sequences (largely metagenomic data), demonstrates that generalist protein language models can learn complex 3D protein structures purely from sequence co-occurrence patterns. This 'bitter lesson' approach allows the model to predict long-distance protein contacts and even outperform specialized, hand-engineered models like AlphaFold 3 in specific tasks like antibody design, especially where MSA data is scarce.

ESM Cambrian shows a clean log-linear scaling curve for predicting long-distance protein contacts with increasing compute and data, unlike the earlier ESM2 models, which plateaued. It achieves near-parity with AlphaFold 3 on general protein complexes and outperforms it on antibody applications (50% vs 47% DockQ pass rate) without requiring multiple sequence alignments (MSAs).
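To make "learning from sequence co-occurrence" concrete, the sketch below shows the data side of masked language modeling on a protein sequence: hide a few residues and ask the model to recover them. It is a minimal illustration, not the ESM Cambrian pipeline. The function name `mask_sequence`, the 15% mask fraction, the `<mask>` token string, and the example sequence are all assumptions made for this example.

```python
import random

MASK = "<mask>"  # assumed mask token string, for illustration only


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues in an amino-acid string.

    Returns (tokens, targets): tokens is the sequence with some residues
    replaced by MASK, and targets maps each masked position to the residue
    the model would be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return tokens, targets


# Hypothetical 22-residue sequence; three positions get masked at 15%.
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(tokens)
print(targets)
```

The only training signal is the identity of the hidden residues given their neighbours, which is why no structural labels or hand-built features appear anywhere in the example.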

2. Vanilla Self-Play for LLMs Plateaus Due to Artificially Complex Tasks

Traditional self-play algorithms for LLMs, which reward a 'conjecturer' for generating problems hard for a 'solver,' often fail to lead to continuous improvement. The conjecturer learns to create messy, overly complex, and inelegant problems that are difficult to solve but offer no meaningful learning signal, leading to a plateau in solver performance.

Vanilla self-play on formal math problems showed no better performance than regular RL, despite the conjecturer generating more and more 'frontier' tasks. An example problem generated late in training was described as an 'incredibly complicated, overly complex disaster of a statement' that was useless for broader mathematical tasks.

3. Self-Guided Self-Play Improves LLM Learning by Grounding Task Generation

To combat the plateau in vanilla self-play, 'self-guided self-play' introduces two mechanisms: grounding synthetic task generation by prompting the conjecturer to produce problems related to unsolved target problems, and a 'guide' component that judges the relevance and complexity of generated tasks. This dual reward system ensures tasks are both challenging and useful.

Self-guided self-play (SGS) significantly improved the problem-solving rate of a 7B parameter model, achieving performance comparable to a 670B parameter model, even with 8x less compute. This demonstrates its effectiveness in generating higher-quality learning signals.
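The difference between the two reward schemes can be sketched in a few lines. This is an illustrative toy, not the formula from the talk: the function names, the choice to multiply the two terms, and the example numbers are assumptions. It only shows why adding a guide score changes which generated problems the conjecturer is paid for.

```python
def difficulty_reward(solve_rate):
    """Vanilla self-play signal: pay the conjecturer more when the solver fails more.

    A problem that is never solved (rate 0) earns nothing here, on the
    assumption that it may simply be unsolvable.
    """
    if solve_rate <= 0.0:
        return 0.0
    return 1.0 - solve_rate


def guided_reward(solve_rate, guide_score):
    """Self-guided variant (assumed form): difficulty scaled by the guide's
    judgment, in [0, 1], of how relevant and well-posed the problem is."""
    return difficulty_reward(solve_rate) * guide_score


# Hypothetical problems: a messy one that is hard but irrelevant,
# and a clean one that is moderately hard and close to the target problems.
messy = {"solve_rate": 0.1, "guide_score": 0.1}
clean = {"solve_rate": 0.4, "guide_score": 0.9}

vanilla_prefers_messy = (
    difficulty_reward(messy["solve_rate"]) > difficulty_reward(clean["solve_rate"])
)
guided_prefers_clean = guided_reward(**clean) > guided_reward(**messy)
print(vanilla_prefers_messy, guided_prefers_clean)
```

Under the difficulty-only reward the messy problem wins, which matches the plateau described above. Once the guide term is included, the clean problem earns more.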

4. Stream RAG Reduces Latency for Real-Time Voice AI Agents

Traditional Retrieval-Augmented Generation (RAG) systems introduce unacceptable latency for conversational voice AI. Stream RAG addresses this by initiating the RAG pipeline in chunks while the user is still speaking, using partial queries to proactively retrieve information. This requires a mechanism to determine the optimal moment to trigger retrieval and refine queries.

The paper demonstrates that Stream RAG can decrease latency by 0.5 seconds for synthetic datasets and 1.5 seconds for human-spoken datasets, while maintaining accuracy comparable to full-query RAG, making natural, real-time voice interactions possible.
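A rough sketch of the idea: retrieval runs on partial transcripts as chunks arrive, so results are often ready before the user stops speaking. Everything here is assumed for illustration, including the toy keyword retriever, the `DOCS` contents, and the "enough new words" trigger. The paper's actual trigger mechanism is not described in this summary.

```python
# Toy knowledge base, keyed by a single keyword (illustrative only).
DOCS = {
    "refund": "Refunds are processed within 5 business days.",
    "shipping": "Standard shipping takes 3 to 7 days.",
    "hours": "Support is open 9am to 5pm on weekdays.",
}


def retrieve(query):
    """Toy retriever: return every document whose keyword appears in the query."""
    words = set(query.lower().split())
    return [text for key, text in DOCS.items() if key in words]


def stream_rag(chunks, min_new_words=2):
    """Consume speech-to-text chunks and retrieve early on the partial query.

    Retrieval is re-triggered whenever at least `min_new_words` words have
    arrived since the last call (an assumed heuristic). Returns the most
    recent prefetched documents and the number of retrieval calls made.
    """
    partial = []
    words_at_last_call = 0
    prefetched = []
    calls = 0
    for chunk in chunks:
        partial.extend(chunk.split())
        if len(partial) - words_at_last_call >= min_new_words:
            prefetched = retrieve(" ".join(partial))
            words_at_last_call = len(partial)
            calls += 1
    return prefetched, calls


# The user is still saying "take" when the refund document is already fetched.
docs, calls = stream_rag(["how long", "does a refund", "take"])
print(docs, calls)
```

A production system would also need to decide whether words that arrive after the last retrieval change the query enough to retrieve again. That decision is exactly the "optimal moment" problem the paragraph above mentions.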

5. Lean Enables 'Verified Intelligence' for Provably Correct Math and Code

Lean, an interactive theorem prover and functional programming language, is central to the concept of 'verified intelligence.' It allows for the formalization and rigorous verification of mathematical proofs and software code, ensuring 100% correctness without human assumptions or 'hand-waving.' This is crucial for high-stakes applications in science and robust software.

Lean's Mathlib is the largest formalized math library, containing high-quality proofs across diverse fields. Recent breakthroughs include AI systems solving IMO problems at gold-medal level and OpenAI/DeepMind solving 80-year-old Erdős problems with formal verification in the loop. Lean also supports program verification: showing that code satisfies its specification.
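For readers who have not seen Lean, here is a small Lean 4 sketch of both uses mentioned above: a mathematical statement with a machine-checked proof, and a tiny function with a proved specification. The names `add_comm_example`, `double`, and `double_eq_two_mul` are made up for this example. `Nat.add_comm` and the `omega` tactic are part of Lean 4's standard distribution.

```lean
-- A mathematical statement and its proof; the Lean kernel checks every step.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Program verification in miniature: a function plus a proved property of it.
def double (n : Nat) : Nat := n + n

theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double  -- expose the definition: goal becomes n + n = 2 * n
  omega          -- decision procedure for linear arithmetic over Nat and Int
```

If either proof were wrong, the file would fail to compile. That is the guarantee that lets a verifier sit in the loop with an AI system that proposes proofs or code.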

6. RTS Principles Maximize Productivity in AI-Assisted Software Engineering

An 'AI Token Maxer' approach to software engineering leverages AI agents like an RTS game player. This involves parallelizing work across many agents (units), aggressively documenting knowledge bases, minimizing human keystrokes to initiate tasks, and constantly monitoring agent progress ('mini-map') to course-correct. The goal is to maximize 'Actions Per Minute' (APM) by agents, focusing on high throughput over individual perfection.

The speaker's company, Channel AI, increased PRs per engineer per month by 3.5x by broadly adopting these principles, with a further 60% growth in the most recent month. Its practices include orchestrator agents, 'dangerously skip permissions' mode for speed, and internal APM trackers based on tool calls.
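An APM tracker of the kind described can be very small. The sketch below counts tool calls in a sliding window and reports calls per minute. The class name `ApmTracker`, the 60-second window, and the timestamps are assumptions for illustration, not details of Channel AI's internal tool.

```python
from collections import deque


class ApmTracker:
    """Track agent tool calls in a sliding time window and report calls per minute."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Register one tool call at the given time (seconds)."""
        self.timestamps.append(timestamp)

    def apm(self, now):
        """Drop calls older than the window, then scale the count to one minute."""
        while self.timestamps and now - self.timestamps[0] > self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) * 60.0 / self.window_seconds


tracker = ApmTracker()
for t in (0.0, 30.0, 50.0, 70.0):  # hypothetical tool-call times in seconds
    tracker.record(t)
current_apm = tracker.apm(now=80.0)  # the call at t=0 has aged out of the window
print(current_apm)
```

Feeding the tracker from an agent's tool-call log, one `record` per call, gives a per-agent number that can be shown next to each task in a monitoring view.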

Bottom Line

The 'bitter lesson' in AI for biology suggests that the vast, untapped metagenomic data (proteins from uncultured organisms in diverse environments) is the next frontier for scaling protein language models, potentially leading to breakthroughs in drug design and understanding cellular function.

So What?

This implies that the bottleneck for biological AI might not be model architecture but rather the sheer volume and diversity of biological data available, much of which is still being discovered and digitized from natural environments.

Impact

Invest in technologies and research for large-scale metagenomic sequencing, data curation, and efficient training infrastructure for biological foundation models. Companies focusing on 'data scaling' in bio-AI could gain a significant advantage.

The non-determinism observed even at temperature zero in LLM inference, caused by tiny floating-point arithmetic differences, highlights a fundamental challenge for achieving provable correctness in AI systems, especially those deployed on GPUs.

So What?

This means that even seemingly deterministic AI models can produce subtly different outputs, which is problematic for applications requiring high reliability and reproducibility, such as scientific simulations or safety-critical systems.
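The root cause is easy to reproduce without a GPU: floating-point addition is not associative, so summing the same numbers in a different order can give a different result. The snippet below shows this with three ordinary Python floats. The link to LLM inference (parallel reductions summing in varying order) is the claim made above; the code only demonstrates the arithmetic fact.

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one summation order
right = a + (b + c)  # the same numbers, grouped differently

orders_disagree = left != right
gap = abs(left - right)
print(orders_disagree, gap)
```

The gap is tiny, but when two candidate tokens have nearly equal scores, a difference of this size is enough to change which one a greedy (temperature-zero) decoder picks.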

Impact

Develop and integrate formal verification tools (like TorchLean) directly into AI model development and deployment pipelines, extending verification down to the GPU kernel level. This could create a niche for 'verified AI' solutions in high-assurance domains.

The 'token maxing' philosophy advocates for treating AI agents as cheap, parallelizable labor, prioritizing high throughput and rapid iteration over meticulous, single-threaded human oversight, even if it means agents make more mistakes.

So What?

This fundamentally shifts the role of a human engineer from a meticulous coder to an 'orchestrator' or 'commander' of AI agents, focusing on macro-management, rapid course correction, and leveraging AI for sheer volume of output.

Impact

Build tools and platforms that enable seamless parallel execution of AI agents, intuitive 'mini-map' style monitoring, and automated knowledge base updates. This could redefine software development workflows and significantly boost engineering productivity in AI-first companies.

Opportunities

Metagenomic Data Curation and AI Training Platform

Develop a platform that specializes in collecting, curating, and preparing vast metagenomic protein sequence data for training large-scale protein language models. Offer this as a service or API to biotech and pharmaceutical companies looking to leverage the 'bitter lesson' in biological AI.

Source: Yas Beg's discussion on ESM Cambrian's success with 2.8 billion protein sequences, largely metagenomic.

Self-Guided LLM Task Generation Service

Create a service or tool that implements 'self-guided self-play' to generate high-quality, non-trivial training tasks for custom LLMs. This would help companies continuously improve their models beyond human-generated data, especially for specialized domains like formal mathematics or coding.

Source: Luke's presentation on the limitations of vanilla self-play and the benefits of self-guided self-play.

Real-Time Stream RAG API for Voice AI

Offer an API that provides real-time, low-latency RAG capabilities for voice AI applications. This API would intelligently process partial user queries, proactively retrieve relevant information, and integrate seamlessly with voice agents to enable natural, hallucination-free conversations.

Source: Arnab Matei's discussion on Stream RAG challenges and solutions for voice AI.

Verified AI Software Development Kit (SDK)

Develop an SDK that integrates formal verification tools (like Lean) into standard software development workflows. This would allow developers to write provably correct code, verify AI model properties (e.g., certified robustness), and ensure the reliability of critical AI applications.

Source: Robert George's presentation on Lean for science and program verification.

RTS-Inspired AI Agent Orchestration Platform

Build a platform that enables engineers to manage and orchestrate multiple AI agents in parallel, similar to a Real-Time Strategy game. Features would include 'mini-map' style monitoring, automated knowledge base integration, rapid task spawning, and metrics like 'Actions Per Minute' for agents.

Source: Luke Orthwine's 'token maxing' and RTS analogy for AI-assisted programming.

Key Concepts

The Bitter Lesson

Richard Sutton's principle that in AI, methods leveraging general computation and data scaling consistently outperform methods relying on human-engineered domain knowledge. This is demonstrated in protein biology, where protein language models trained on vast sequence data predict protein structures without hand-engineered features.

Intelligence Per Sample / Intelligence Per Watt

Two critical metrics for AI efficiency. 'Intelligence per sample' refers to how much a model learns from each new data point, aiming for monotonic improvement. 'Intelligence per watt' emphasizes achieving high intelligence with minimal computational energy, often favoring smaller, more efficient models.

Real-Time Strategy (RTS) Game Principles for Software Engineering

Applying RTS game strategies like 'macro by default, micro when it counts,' high visibility, parallelizing tasks, and maximizing 'Actions Per Minute' (APM) to AI-assisted software development. This involves spawning multiple AI agents, aggressively using knowledge bases, and continuously adapting to maximize productivity and output.

Lessons

  • Prioritize data scaling for AI models, especially in new domains like biology, as generalist models trained on massive datasets often outperform those with hand-engineered features.
  • When implementing self-play for LLMs, design mechanisms (like a 'guide') to ensure generated tasks are genuinely useful and non-trivial, preventing the model from creating artificially complex but unhelpful problems.
  • For real-time conversational AI, explore and implement streaming RAG techniques to proactively retrieve information as users speak, significantly reducing latency and improving the naturalness of interactions.
  • Investigate formal verification tools like Lean for high-stakes AI applications in math, science, and software, aiming to achieve provably correct algorithms and code, thereby increasing reliability and trust.
  • Adopt a 'token maxing' or RTS-inspired approach to AI-assisted software development: parallelize tasks with multiple agents, aggressively document knowledge in linked files, and optimize for high agent 'Actions Per Minute' (tool calls) to maximize throughput.

RTS-Inspired AI Agent Software Engineering Workflow (Token Maxing)

1. **Orchestrate Agents**: Use a primary agent (e.g., Claude) as an orchestrator to spawn and manage multiple worker agents for different tasks. Minimize human keystrokes to initiate work.

2. **Parallelize Work with Worktrees**: Utilize Git worktrees to maintain separate, compiling checkouts of the repository for parallel development by multiple agents, preventing conflicts and enabling concurrent progress.

3. **Maximize Agent Autonomy & Throughput**: Instruct worker agents to push tasks as far as possible (e.g., to a Pull Request) before requesting human feedback. Prioritize high 'Actions Per Minute' (APM) based on tool calls, even if it means agents make more mistakes initially.

4. **Aggressively Document Knowledge Bases**: Create structured, linked wiki-style knowledge base files (MD files) that agents can quickly access. This reduces reliance on expensive code context and benefits future agents and human teammates.

5. **Maintain High Visibility & Rapid Course Correction**: Implement systems (e.g., audio cues, color-coding, quick jump buttons) to monitor agent progress across multiple tasks. Be prepared to quickly audit, course-correct, and fix agent mistakes as they occur.

6. **Run in Sandbox with Skip Permissions**: Whenever possible, run agents in a sandboxed environment with 'dangerously skip permissions' mode to avoid constant human approvals and accelerate execution. If no sandbox is available, create one first.

7. **Continuously Learn and Adapt**: Feed agent outputs, human corrections, and new insights back into the knowledge base to continuously improve agent performance and adapt the workflow.
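The worktree step in the workflow above can be sketched as a helper that builds one `git worktree add` command per task, so each agent gets its own checkout and branch. The `agent/<task>` branch naming, the `-worktrees` directory layout, the repository path, and the task names are assumptions for illustration. The script only builds and prints the commands; it does not run them.

```python
import shlex


def worktree_command(repo_root, task_slug):
    """Build the git command that gives one agent its own checkout and branch.

    Uses `git -C <repo> worktree add -b <branch> <path>`, which creates a new
    branch and checks it out in a separate directory.
    """
    branch = f"agent/{task_slug}"                   # assumed naming scheme
    path = f"{repo_root}-worktrees/{task_slug}"     # assumed directory layout
    return ["git", "-C", repo_root, "worktree", "add", "-b", branch, path]


tasks = ["fix-login", "add-metrics"]  # hypothetical task names
commands = [worktree_command("/repos/app", task) for task in tasks]
for cmd in commands:
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute it. Keeping command construction separate from execution makes the orchestration logic easy to test and to review before agents are let loose on a real repository.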

Notable Moments

The host's challenge to the idea that training on human-generated data (H) can lead to sampling the full solution space (F) via test-time compute and recursive self-improvement. He argues that such training confines a model to H, so it never feasibly samples F minus H.

This highlights a fundamental debate in AI research about the path to AGI: whether human data is a necessary stepping stone or a limiting factor, advocating for AlphaZero-style self-play unbiased by human 'meandering' to reach more intelligent systems.

The observation that ICL (In-Context Learning) performance does not improve monotonically with more samples and hits a cliff at the context-length limit, in contrast with human learning.

This points to a core limitation of current LLM learning paradigms and suggests that there must be alternative learning procedures with much higher 'intelligence per sample' to achieve human-like continuous improvement.

The discovery that protein language models, trained purely on masked language modeling, spontaneously organize their latent space into a hierarchy of interpretable biological concepts, from amino acids to protein roles.

This 'unsupervised emergence' of biological understanding from a simple sequence task is 'utterly crazy' (as stated by the presenter) and suggests that large-scale generalist models can discover fundamental scientific principles without explicit supervision, opening new avenues for scientific discovery.

Quotes

"If the full solution space F is F, training on known human solutions will limit you to some typical set H, despite any feasible amount of test time compute or recursive self-improvement. You won't feasibly sample F minus H."

Host

"Methods that win are methods that are general, that sort of exploit really fundamentals of like scaling compute and data as opposed to methods that sort of handgineer human domain, human domain knowledge."

Yas Beg

"You'll know a word by the company that it keeps, and here the idea is that you'll know a protein by amino acids it keeps."

Yas Beg

"The easiest way to produce tricky problems is produce these basically messy, artificially complex and inelegant problems."

Luke

"The code is often like a really expensive source of truth for the agents to pull context out of, and it's actually really cheap, especially when you have all the context loaded in memory, to like aggressively document things in a way that benefit future agents."

Luke Orthwine
