POPri: Private Federated Learning using Preference-Optimized Synthetic Data
Quick Read
Takeaways
- Large foundation models are too computationally intensive for traditional on-device federated learning, and private data cannot leave user devices.
- POPri addresses this by generating synthetic data, collecting privacy-preserving client feedback, and using Direct Preference Optimization (DPO) to fine-tune the LLM.
- DPO is more robust than supervised fine-tuning because it learns from preferences (one response is better than another) rather than treating labels as ground truth.
- POPri significantly improves performance on next-token prediction (58% gap closure) and classification tasks (43% gap closure) compared to previous methods like Private Evolution.
- While POPri increases server-side computation costs due to RL fine-tuning, it drastically reduces client-side computation and communication compared to on-device federated learning.
- Staying 'on-policy' (fewer optimization steps per communication round) is crucial for long-term performance, even if it increases the number of rounds.
- The 'gap' between chosen and rejected samples in DPO is critical; a moderate difference (e.g., rejecting the rank-5 sample out of 10) provides the best learning signal.
Insights
1. The Federated Learning Dilemma for Large LLMs
Traditional federated learning, which involves sending model weights to clients for local training and then aggregating noisy updates, is becoming impractical for increasingly large foundation models (LLMs). These models are too big to train efficiently on client devices, yet privacy constraints prevent collecting raw data centrally.
Foundation models are getting too big to train on device these days. We can't send the models to the silos because the models are too big and we also can't collect data from the silos because that would violate basic privacy constraints.
2. POPri's Preference-Optimized Synthetic Data Generation
POPri proposes a solution where an LLM generates synthetic data. This synthetic data is sent to client devices, which then provide privacy-preserving feedback (scores) on its quality by comparing it to their local private data using an embedding model. Instead of using these scores for in-context learning, POPri employs Direct Preference Optimization (DPO) to fine-tune the LLM directly, teaching it to generate higher-quality synthetic data aligned with client preferences.
The proposal here is we can make synthetic client data. The approach here is first you send candidate outputs to the clients. Then you receive scores on the output quality and we can iteratively refine. The clients will calculate which synthetic data is closest to their own data. And then you can use DPO on each of the P rankings that you have to align the foundation model towards the private data.
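The client feedback step described above can be sketched as nearest-neighbor voting over embeddings, with noise added before the vote histogram leaves the device. This is an illustrative sketch, not the paper's exact mechanism: the function names are hypothetical, and plain Gaussian noise stands in for whatever calibrated DP mechanism POPri actually uses.

```python
import math
import random

def client_vote(private_embs, synthetic_embs):
    """Each private record votes for its nearest synthetic sample
    (by cosine similarity); returns a vote histogram over the samples."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    hist = [0.0] * len(synthetic_embs)
    for p in private_embs:
        best = max(range(len(synthetic_embs)),
                   key=lambda j: cos(p, synthetic_embs[j]))
        hist[best] += 1.0
    return hist

def privatize(hist, sigma=1.0, rng=random):
    """Add Gaussian noise to the histogram before it leaves the device
    (each record votes once, so the per-record L2 sensitivity is 1)."""
    return [v + rng.gauss(0.0, sigma) for v in hist]
```

The server aggregates the noisy histograms across clients to rank the synthetic candidates, which is what makes the preference pairs for DPO possible without ever seeing raw client text.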
3. POPri's Performance Gains Over Prior Methods
POPri significantly outperforms previous synthetic data generation methods like Private Evolution and traditional on-device federated learning baselines (DP-FedAvg, DP-FTRL, DP-SGD). On a next-token prediction task, POPri closed 58% of the performance gap between the fully private (ε = 0) and non-private full-data-access (ε = ∞) settings. On a classification task, it closed 43% of the gap, demonstrating its effectiveness in generating high-quality, privacy-preserving synthetic data.
POPri... we're able to close 58% of the gap between epsilon equals zero and epsilon equals infinity. We're able to close 43% of the gap. It's a pretty significant gain.
4. Trade-offs in Communication and Computation Costs
POPri is more computationally expensive on the server side than Private Evolution due to RL fine-tuning, and has slightly higher communication costs due to sending more samples per round. However, it offers substantial advantages over traditional on-device federated learning by drastically reducing client-side computation (no on-device training, just similarity calculations) and communication (sending synthetic data/histograms instead of model weights).
Across the board, POPri is more expensive than Private Evolution. We've replaced the in-context learning stuff, which is pretty cheap, with RL fine-tuning, which on the server side is going to increase the burden a lot. The communication cost, again, is much better for POPri because we're not sending model weights, and the client computation cost is also much cheaper because we don't have to do on-device training.
Key Concepts
Federated Learning
A machine learning approach where models are trained on decentralized data residing on local devices (clients) without directly sharing the raw data with a central server. Only model updates or privacy-preserving signals are exchanged.
Differential Privacy
A mathematical framework that quantifies and limits the privacy risk associated with data analysis. It ensures that the output of an algorithm is nearly the same whether or not any individual's data is included, protecting individual privacy.
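As a concrete illustration of the definition above, the textbook Laplace mechanism releases a count with noise calibrated to sensitivity/ε; this is a generic sketch of differential privacy, not the specific mechanism POPri employs.

```python
import math
import random

def laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0, rng=random):
    """Release a noisy count satisfying epsilon-differential privacy.

    Laplace noise with scale = sensitivity / epsilon masks any single
    individual's contribution to the count: smaller epsilon means
    stronger privacy and more noise."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # Uniform(-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    # Inverse-CDF sampling of Laplace(0, scale).
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```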
Synthetic Data Generation
Creating artificial data that mimics the statistical properties of real data without containing any actual private information. This allows for training models while preserving privacy.
Reinforcement Learning from Human Feedback (RLHF) / Direct Preference Optimization (DPO)
A class of techniques where an AI model learns from human (or in this case, client-device) preferences rather than explicit labels. DPO directly optimizes a policy to align with preferences, avoiding the need for a separate reward model.
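In code, the DPO objective for a single preference pair reduces to a logistic loss on the policy-versus-reference log-probability margin. A minimal sketch, assuming the four sequence log-probabilities are computed elsewhere; `beta` is the usual DPO temperature:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-prob ratio (chosen vs.
    rejected) against the frozen reference model's ratio."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)
```

Minimizing this pushes the policy to assign relatively more probability to the chosen sample than to the rejected one, without ever fitting a separate reward model.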
Lessons
- When developing privacy-preserving AI for large models, consider synthetic data generation combined with RL-based fine-tuning (like DPO) as a viable alternative to traditional on-device federated learning.
- For DPO-based training, carefully select the 'gap' between chosen and rejected samples. Experiment with different ranks for rejected samples (e.g., rank 5 out of 10) to optimize learning efficiency and model performance.
- Prioritize 'on-policy' training in RL-based federated learning by limiting optimization steps per communication round. While this may increase the number of communication rounds, it leads to better long-term model performance.
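The rank-gap lesson above might look like this when turning a client-ranked candidate list into DPO training pairs. The helper is hypothetical; the default of rejecting the rank-5 sample out of 10 mirrors the ablation discussed in the talk.

```python
def make_preference_pair(ranked_samples, rejected_rank=4):
    """Build one (chosen, rejected) DPO pair from a best-first ranking.

    chosen = the top-ranked sample; rejected = a mid-ranked one
    (index 4 = rank 5 of 10), so the gap is informative without being
    so large that the comparison is trivial."""
    chosen = ranked_samples[0]
    rejected = ranked_samples[rejected_rank]
    return chosen, rejected
```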
Notable Moments
Introduction of new, less contaminated federated language benchmarks, including US, UK, and Canadian congressional records, to address issues with older, potentially pre-trained LLM evaluation datasets.
Contaminated benchmarks (data used for training also used for evaluation) lead to inflated performance metrics. Curating fresh, contemporaneous datasets ensures more reliable and meaningful evaluation of new federated learning methods.
The speaker highlights that DPO makes weaker assumptions than supervised fine-tuning (SFT), treating client feedback as preferences rather than ground truth.
This is a key advantage for real-world scenarios where client feedback might be noisy or not perfectly represent the 'ideal' output. DPO's robustness to imperfect labels makes it well-suited for learning from aggregated, privacy-preserving signals.
Quotes
"Foundation models are getting a little too big to train on device these days."
"We can't send the models to the silos because the models are too big and we also can't collect data from the silos because that would violate basic privacy constraints."
"This kind of sounds like an RL environment, right? So here's the basic idea. Let's use RL to learn from these client scores."
"SFT is going to treat the labels like ground truth and trains the LLM to follow the labels exactly. And RL is just training to increase the scores, and it makes weaker assumptions."