POPri: Private Federated Learning using Preference-Optimized Synthetic Data
Quick Read
Takeaways
- Large foundation models are too computationally intensive for traditional on-device federated learning, and private data cannot leave user devices.
- POPri addresses this by generating synthetic data, collecting privacy-preserving client feedback, and using Direct Preference Optimization (DPO) to fine-tune the LLM.
- DPO is more robust than supervised fine-tuning because it learns from preferences (one response is better than another) rather than treating labels as ground truth.
- POPri significantly improves performance on next-token prediction (58% gap closure) and classification tasks (43% gap closure) compared to previous methods like Private Evolution.
- While POPri increases server-side computation costs due to RL fine-tuning, it drastically reduces client-side computation and communication compared to on-device federated learning.
- Staying 'on-policy' (fewer optimization steps per communication round) is crucial for long-term performance, even if it increases the number of rounds.
- The 'gap' between chosen and rejected samples in DPO is critical; a moderate difference (e.g., rejecting the rank-5 sample out of 10) provides the best learning signal.
Insights
1. The Federated Learning Dilemma for Large LLMs
Traditional federated learning, which involves sending model weights to clients for local training and then aggregating noisy updates, is becoming impractical for increasingly large foundation models (LLMs). These models are too big to train efficiently on client devices, yet privacy constraints prevent collecting raw data centrally.
Foundation models are getting too big to train on device these days. We can't send the models to the silos because the models are too big and we also can't collect data from the silos because that would violate basic privacy constraints.
2. POPri's Preference-Optimized Synthetic Data Generation
POPri proposes a solution where an LLM generates synthetic data. This synthetic data is sent to client devices, which then provide privacy-preserving feedback (scores) on its quality by comparing it to their local private data using an embedding model. Instead of using these scores for in-context learning, POPri employs Direct Preference Optimization (DPO) to fine-tune the LLM directly, teaching it to generate higher-quality synthetic data aligned with client preferences.
The proposal here is we can make synthetic client data. The approach here is first you send candidate outputs to the clients. Then you receive scores on the output quality and we can iteratively refine. The clients will calculate which synthetic data is closest to their own data. And then you can use DPO on each of the P rankings that you have to align the foundation model towards the private data.
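The client feedback step described above can be sketched as nearest-neighbor voting over embeddings, with noise added before the vote histogram leaves the device. This is an illustrative sketch, not the paper's exact mechanism: the function names are hypothetical, and plain Gaussian noise stands in for whatever calibrated DP mechanism POPri actually uses.

```python
import math
import random

def client_vote(private_embs, synthetic_embs):
    """Each private record votes for its nearest synthetic sample
    (by cosine similarity); returns a vote histogram over the samples."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    hist = [0.0] * len(synthetic_embs)
    for p in private_embs:
        best = max(range(len(synthetic_embs)),
                   key=lambda j: cos(p, synthetic_embs[j]))
        hist[best] += 1.0
    return hist

def privatize(hist, sigma=1.0, rng=random):
    """Add Gaussian noise to the histogram before it leaves the device
    (each record votes once, so the per-record L2 sensitivity is 1)."""
    return [v + rng.gauss(0.0, sigma) for v in hist]
```

The server aggregates the noisy histograms across clients to rank the synthetic candidates, which is what makes the preference pairs for DPO possible without ever seeing raw client text.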
3. POPri's Performance Gains Over Prior Methods
POPri significantly outperforms previous synthetic data generation methods like Private Evolution and traditional on-device federated learning baselines (DP-FedAvg, DP-FTRL, DP-SGD). On a next-token prediction task, POPri closed 58% of the performance gap between the fully private (ε = 0) and non-private full-data-access (ε = ∞) settings. On a classification task, it closed 43% of the gap, demonstrating its effectiveness in generating high-quality, privacy-preserving synthetic data.
POPri... we're able to close 58% of the gap between epsilon equals zero and epsilon equals infinity. We're able to close 43% of the gap. It's a pretty significant gain.
4. Trade-offs in Communication and Computation Costs
POPri is more computationally expensive on the server side than Private Evolution due to RL fine-tuning, and has slightly higher communication costs due to sending more samples per round. However, it offers substantial advantages over traditional on-device federated learning by drastically reducing client-side computation (no on-device training, just similarity calculations) and communication (sending synthetic data/histograms instead of model weights).
Across the board, POPri is more expensive than Private Evolution. We've replaced the in-context learning stuff, which is pretty cheap, with RL fine-tuning, which on the server side is going to increase the burden a lot. The communication cost, again, is much better for POPri because we're not sending model weights, and the client computation cost is also much cheaper because we don't have to do on-device training.
Key Concepts
Federated Learning
A machine learning approach where models are trained on decentralized data residing on local devices (clients) without directly sharing the raw data with a central server. Only model updates or privacy-preserving signals are exchanged.
Differential Privacy
A mathematical framework that quantifies and limits the privacy risk associated with data analysis. It ensures that the output of an algorithm is nearly the same whether or not any individual's data is included, protecting individual privacy.
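As a concrete illustration of the definition above, the textbook Laplace mechanism releases a count with noise calibrated to sensitivity/ε; this is a generic sketch of differential privacy, not the specific mechanism POPri employs.

```python
import math
import random

def laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0, rng=random):
    """Release a noisy count satisfying epsilon-differential privacy.

    Laplace noise with scale = sensitivity / epsilon masks any single
    individual's contribution to the count: smaller epsilon means
    stronger privacy and more noise."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # Uniform(-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    # Inverse-CDF sampling of Laplace(0, scale).
    noise = -scale * sign * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise
```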
Synthetic Data Generation
Creating artificial data that mimics the statistical properties of real data without containing any actual private information. This allows for training models while preserving privacy.
Reinforcement Learning from Human Feedback (RLHF) / Direct Preference Optimization (DPO)
A class of techniques where an AI model learns from human (or in this case, client-device) preferences rather than explicit labels. DPO directly optimizes a policy to align with preferences, avoiding the need for a separate reward model.
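In code, the DPO objective for a single preference pair reduces to a logistic loss on the policy-versus-reference log-probability margin. A minimal sketch, assuming the four sequence log-probabilities are computed elsewhere; `beta` is the usual DPO temperature:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares the policy's log-prob ratio (chosen vs.
    rejected) against the frozen reference model's ratio."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)
```

Minimizing this pushes the policy to assign relatively more probability to the chosen sample than to the rejected one, without ever fitting a separate reward model.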
Lessons
- When developing privacy-preserving AI for large models, consider synthetic data generation combined with RL-based fine-tuning (like DPO) as a viable alternative to traditional on-device federated learning.
- For DPO-based training, carefully select the 'gap' between chosen and rejected samples. Experiment with different ranks for rejected samples (e.g., rank 5 out of 10) to optimize learning efficiency and model performance.
- Prioritize 'on-policy' training in RL-based federated learning by limiting optimization steps per communication round. While this may increase the number of communication rounds, it leads to better long-term model performance.
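The rank-gap lesson above might look like this when turning a client-ranked candidate list into DPO training pairs. The helper is hypothetical; the default of rejecting the rank-5 sample out of 10 mirrors the ablation discussed in the talk.

```python
def make_preference_pair(ranked_samples, rejected_rank=4):
    """Build one (chosen, rejected) DPO pair from a best-first ranking.

    chosen = the top-ranked sample; rejected = a mid-ranked one
    (index 4 = rank 5 of 10), so the gap is informative without being
    so large that the comparison is trivial."""
    chosen = ranked_samples[0]
    rejected = ranked_samples[rejected_rank]
    return chosen, rejected
```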
Notable Moments
Introduction of new, less contaminated federated language benchmarks, including US, UK, and Canadian congressional records, to address issues with older, potentially pre-trained LLM evaluation datasets.
Contaminated benchmarks (data used for training also used for evaluation) lead to inflated performance metrics. Curating fresh, contemporaneous datasets ensures more reliable and meaningful evaluation of new federated learning methods.
The speaker highlights that DPO makes weaker assumptions than supervised fine-tuning (SFT), treating client feedback as preferences rather than ground truth.
This is a key advantage for real-world scenarios where client feedback might be noisy or not perfectly represent the 'ideal' output. DPO's robustness to imperfect labels makes it well-suited for learning from aggregated, privacy-preserving signals.
Quotes
"Foundation models are getting a little too big to train on device these days."
"We can't send the models to the silos because the models are too big and we also can't collect data from the silos because that would violate basic privacy constraints."
"This kind of sounds like an RL environment, right? So here's the basic idea. Let's use RL to learn from these client scores."
"SFT is going to treat the labels like ground truth and trains the LLM to follow the labels exactly. And RL is just training to increase the scores, and it makes weaker assumptions."