Google TechTalks
January 27, 2026

POPri: Private Federated Learning using Preference-Optimized Synthetic Data

Quick Read

Meta research introduces POPri, a novel approach using Reinforcement Learning to fine-tune LLMs for generating high-quality synthetic data under strict privacy constraints in federated learning, significantly outperforming prior methods.
Large LLMs are too big for traditional on-device federated learning, creating a privacy-computation dilemma.
POPri leverages Reinforcement Learning (DPO) to fine-tune LLMs, generating synthetic data based on privacy-preserving client preferences.
This method significantly boosts model performance on private data, closing a substantial gap compared to prior approaches.

Summary

Charlie Hu, a research scientist at Meta, presents POPri (Policy Optimization for Private Data), a new method for private federated learning that addresses the challenge of training large language models (LLMs) on siloed, private user data without violating privacy. Traditional federated learning struggles with large LLMs due to on-device computation and communication costs. POPri builds on Private Evolution: an LLM generates synthetic data, and clients return privacy-preserving feedback on how well it matches their local data. Instead of using that feedback for in-context learning, POPri employs Direct Preference Optimization (DPO), a reinforcement-learning-style technique, to fine-tune the LLM directly on the client preferences. Treating client feedback as preferences rather than ground truth makes the learning more robust to noisy signals. POPri delivers significant gains, closing 58% of the gap between no use of private data (ε = 0) and full, non-private data access (ε = ∞) on a next-token prediction task and 43% of that gap on a classification task, while retaining the communication and client-computation advantages over traditional on-device federated learning.
POPri offers a critical advancement for deploying large AI models in privacy-sensitive environments like mobile devices. By enabling LLMs to learn from distributed private data without direct access or heavy on-device training, it unlocks new possibilities for personalized AI experiences while upholding user privacy. This method is particularly relevant as LLMs grow in size and the demand for privacy-preserving AI intensifies, providing a practical pathway for leveraging private data without compromising security or incurring prohibitive costs.

Takeaways

  • Large LLMs are too computationally intensive for traditional on-device federated learning, and private data cannot leave user devices.
  • POPri addresses this by generating synthetic data, collecting privacy-preserving client feedback, and using Direct Preference Optimization (DPO) to fine-tune the LLM.
  • DPO is more robust than supervised fine-tuning because it learns from preferences (one response is better than another) rather than assuming ground truth.
  • POPri significantly outperforms previous methods like Private Evolution, closing 58% of the gap between ε = 0 and ε = ∞ on next-token prediction and 43% on classification.
  • While POPri increases server-side computation costs due to RL fine-tuning, it drastically reduces client-side computation and communication compared to on-device federated learning.
  • Maintaining an 'on-policy' approach (fewer optimization steps per round) is crucial for long-term performance, even if it increases communication rounds.
  • The 'gap' between chosen and rejected samples in DPO is critical; a moderate difference (e.g., rank 5 out of 10) provides optimal learning signals.
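The chosen/rejected pairing described in the last bullet can be sketched in a few lines. This is an illustrative pairing rule, not the paper's exact one: `make_dpo_pairs` and the rank offset `gap` are hypothetical names, and the rank-1 vs. rank-5 pairing just mirrors the "rank 5 out of 10" example from the talk.

```python
def make_dpo_pairs(ranked_samples, gap=4):
    """Pair each synthetic sample (chosen) with one ranked `gap` places
    below it (rejected). A moderate quality gap between the two (e.g.
    rank 1 vs. rank 5 of 10) is said to give the most useful DPO signal:
    too small and the preference is noise, too large and it is trivial."""
    return [(ranked_samples[i], ranked_samples[i + gap])
            for i in range(len(ranked_samples) - gap)]

# Toy ranking: index 0 is the sample clients scored as closest to their data.
ranked = [f"synthetic sample {r + 1}" for r in range(10)]
pairs = make_dpo_pairs(ranked, gap=4)
# pairs[0] pairs the rank-1 sample (chosen) with the rank-5 sample (rejected).
```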

Insights

1. The Federated Learning Dilemma for Large LLMs

Traditional federated learning, which involves sending model weights to clients for local training and then aggregating noisy updates, is becoming impractical for increasingly large foundation models (LLMs). These models are too big to train efficiently on client devices, yet privacy constraints prevent collecting raw data centrally.

Foundation models are getting too big to train on device these days. We can't send the models to the silos because the models are too big and we also can't collect data from the silos because that would violate basic privacy constraints.

2. POPri's Preference-Optimized Synthetic Data Generation

POPri proposes a solution where an LLM generates synthetic data. This synthetic data is sent to client devices, which then provide privacy-preserving feedback (scores) on its quality by comparing it to their local private data using an embedding model. Instead of using these scores for in-context learning, POPri employs Direct Preference Optimization (DPO) to fine-tune the LLM directly, teaching it to generate higher-quality synthetic data aligned with client preferences.

The proposal here is we can make synthetic client data. The approach here is first you send candidate outputs to the clients. Then you receive scores on the output quality and we can iteratively refine. The clients will calculate which synthetic data is closest to their own data. And then you can use DPO on each of the P rankings that you have to align the foundation model towards the private data.
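The scoring step the speaker describes can be sketched as follows, under stated assumptions: the nearest-neighbor voting rule is borrowed from Private Evolution's DP histogram, the embeddings are random toy vectors, and `client_vote`, `dp_aggregate`, and the noise scale `sigma` are illustrative stand-ins rather than the paper's implementation.

```python
import numpy as np

def client_vote(private_embs: np.ndarray, synth_embs: np.ndarray) -> np.ndarray:
    """Each private record votes for its nearest synthetic candidate
    (cosine similarity); returns a per-candidate vote-count vector."""
    p = private_embs / np.linalg.norm(private_embs, axis=1, keepdims=True)
    s = synth_embs / np.linalg.norm(synth_embs, axis=1, keepdims=True)
    nearest = (p @ s.T).argmax(axis=1)          # index of closest candidate
    return np.bincount(nearest, minlength=len(synth_embs)).astype(float)

def dp_aggregate(votes_per_client, sigma: float) -> np.ndarray:
    """Server sums the client vote histograms and adds Gaussian noise
    (Gaussian mechanism) so the released ranking is privacy-preserving."""
    total = np.sum(votes_per_client, axis=0)
    return total + np.random.normal(0.0, sigma, size=total.shape)

# Toy run: 2 clients with 5 private embeddings each, 4 synthetic candidates.
rng = np.random.default_rng(0)
synth = rng.normal(size=(4, 8))
clients = [rng.normal(size=(5, 8)) for _ in range(2)]
scores = dp_aggregate([client_vote(c, synth) for c in clients], sigma=0.5)
ranking = np.argsort(-scores)   # candidates ordered best-to-worst
```

The noisy `scores` are what the server sees; the raw embeddings and votes never leave the clients in usable form.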

3. POPri's Performance Gains Over Prior Methods

POPri significantly outperforms previous synthetic data generation methods like Private Evolution as well as traditional on-device federated learning baselines (DP-FedAvg, DP-FTRL, DP-SGD). On a next-token prediction task, POPri closed 58% of the performance gap between ε = 0 (no use of private data) and ε = ∞ (full, non-private data access). On a classification task, it closed 43% of the gap, demonstrating its effectiveness at generating high-quality, privacy-preserving synthetic data.

POPri... we're able to close 58% of the gap between epsilon equals zero and epsilon equals infinity [on next-token prediction]. We're able to close 43% of the gap [on classification]. It's a pretty significant gain.

4. Trade-offs in Communication and Computation Costs

POPri is more computationally expensive on the server side than Private Evolution due to RL fine-tuning, and has slightly higher communication costs due to sending more samples per round. However, it offers substantial advantages over traditional on-device federated learning by drastically reducing client-side computation (no on-device training, just similarity calculations) and communication (sending synthetic data/histograms instead of model weights).

Across the board, POPri is more expensive than Private Evolution: we've replaced the in-context learning, which is pretty cheap, with RL fine-tuning, which on the server side is going to increase the burden a lot. The communication cost, again, is much better for POPri because we're not sending model weights, and the client computation cost is also much cheaper because we don't have to do on-device training.

Key Concepts

Federated Learning

A machine learning approach where models are trained on decentralized data residing on local devices (clients) without directly sharing the raw data with a central server. Only model updates or privacy-preserving signals are exchanged.

Differential Privacy

A mathematical framework that quantifies and limits the privacy risk associated with data analysis. It ensures that the output of an algorithm is nearly the same whether or not any individual's data is included, protecting individual privacy.
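As a concrete instance of this definition, the classic Laplace mechanism releases a count with bounded privacy loss. The sensitivity-1 count query and ε = 1 below are illustrative choices, not POPri's actual mechanism (the talk's pipeline adds noise to the aggregated client score histogram instead).

```python
import numpy as np

def laplace_mechanism(true_count: float, sensitivity: float, epsilon: float,
                      rng: np.random.Generator) -> float:
    """Release a count with epsilon-DP by adding Laplace noise of scale
    sensitivity/epsilon: one person's presence changes the count by at
    most `sensitivity`, so the noisy output reveals little about any
    single individual."""
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(42)
noisy = laplace_mechanism(100.0, sensitivity=1.0, epsilon=1.0, rng=rng)
# `noisy` is close to 100 on average, but any single release is deniable.
```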

Synthetic Data Generation

Creating artificial data that mimics the statistical properties of real data without containing any actual private information. This allows for training models while preserving privacy.

Reinforcement Learning from Human Feedback (RLHF) / Direct Preference Optimization (DPO)

A class of techniques where an AI model learns from human (or in this case, client-device) preferences rather than explicit labels. DPO directly optimizes a policy to align with preferences, avoiding the need for a separate reward model.
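The DPO objective can be written in a few lines. The numpy sketch below assumes per-response summed log-probabilities have already been computed for both the policy and a frozen reference model; `beta = 0.1` is a conventional but arbitrary choice.

```python
import numpy as np

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                      - (log pi(y_l) - log pi_ref(y_l))]).
    Minimizing it raises the policy's margin for the chosen response
    (y_w) over the rejected one (y_l), measured relative to the frozen
    reference model, with no separate reward model needed."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return float(np.mean(np.log1p(np.exp(-logits))))
```

When the policy still matches the reference, the loss sits at log 2; it drops as the policy's preference for the chosen sample grows.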

Lessons

  • When developing privacy-preserving AI for large models, consider synthetic data generation combined with RL-based fine-tuning (like DPO) as a viable alternative to traditional on-device federated learning.
  • For DPO-based training, carefully select the 'gap' between chosen and rejected samples. Experiment with different ranks for rejected samples (e.g., rank 5 out of 10) to optimize learning efficiency and model performance.
  • Prioritize 'on-policy' training in RL-based federated learning by limiting optimization steps per communication round. While this may increase the number of communication rounds, it leads to better long-term model performance.
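Putting these lessons together, one communication round might look like the schematic below. `ToyLLM`, `score_fn`, and the rank-1 vs. rank-5 pairing are hypothetical stand-ins for the real generator, the DP-aggregated client feedback, and the paper's pairing rule.

```python
import random

class ToyLLM:
    """Stand-in for the fine-tuned generator (hypothetical interface)."""
    def __init__(self):
        self.updates = 0
    def generate(self):
        return f"sample-{random.random():.3f}"
    def dpo_update(self, pairs):
        self.updates += 1   # a real model would take a DPO gradient step here

def popri_round(llm, score_fn, n_candidates=10, dpo_steps_per_round=1):
    """One schematic POPri communication round: generate candidates,
    collect (DP-aggregated) client scores via score_fn, rank, build a
    preference pair with a moderate chosen/rejected gap, and take only
    a few DPO steps so training stays near on-policy."""
    candidates = [llm.generate() for _ in range(n_candidates)]
    scores = [score_fn(c) for c in candidates]      # stands in for client feedback
    ranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    pairs = [(ranked[0], ranked[4])]                # rank 1 vs. rank 5 of 10
    for _ in range(dpo_steps_per_round):            # few steps => on-policy
        llm.dpo_update(pairs)
    return llm
```

Taking many DPO steps per round would be cheaper in communication, but the preference data would then come from a stale generator, which is the off-policy failure mode the talk warns about.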

Notable Moments

Introduction of new, less contaminated federated language benchmarks, built from recent legislative records from the US, UK, and Canada, to address issues with older evaluation datasets that LLMs may already have seen during pre-training.

Contaminated benchmarks (data used for training also used for evaluation) lead to inflated performance metrics. Curating fresh, contemporaneous datasets ensures more reliable and meaningful evaluation of new federated learning methods.

The speaker highlights that DPO makes weaker assumptions than supervised fine-tuning (SFT), treating client feedback as preferences rather than ground truth.

This is a key advantage for real-world scenarios where client feedback might be noisy or not perfectly represent the 'ideal' output. DPO's robustness to imperfect labels makes it well-suited for learning from aggregated, privacy-preserving signals.

Quotes


"Foundation models are getting a little too big to train on device these days."

Charlie Hu

"We can't send the models to the silos because the models are too big and we also can't collect data from the silos because that would violate basic privacy constraints."

Charlie Hu

"This kind of sounds like an RL environment, right? So here's the basic idea. Let's use RL to learn from these client scores."

Charlie Hu

"SFT is going to treat the labels like ground truth and trains the LLM to follow the labels exactly. And RL is just training to increase the scores, and it makes weaker assumptions."

Charlie Hu
