G
Google TechTalks
January 27, 2026

Persistent Pre-Training Poisoning of LLMs

Quick Read

Adversaries can persistently compromise Large Language Models (LLMs) by injecting a small amount of malicious data (as little as 10 tokens per million) into their pre-training datasets, leading to behaviors like denial of service, private data extraction, and belief manipulation, even after subsequent alignment training.
Malicious data, as little as 10 tokens per million, can permanently alter LLM behavior.
Pre-training data's unverified nature makes it a prime target for adversaries.
Attacks include denial of service, private data extraction, and belief manipulation, posing significant business risks.

Summary

This presentation details how LLMs can be persistently poisoned during their initial pre-training phase, a stage often overlooked for security due to its distance from deployment. Researchers demonstrated four types of attacks: denial of service (making the model output gibberish), context extraction (stealing system prompts), jailbreaking (bypassing safety alignments), and belief manipulation (instilling biases or false facts). The key finding is that even a small amount of poisoned data (as low as 10 tokens per million) can lead to persistent malicious behavior in the final aligned models. Pre-training data is highly susceptible to manipulation because it's sourced from the largely unverified internet, making it difficult for developers to curate or filter at scale, unlike the more controlled alignment data. The most effective attacks were denial of service, context extraction, and belief manipulation, while jailbreaking proved less effective in their experiments.
The findings highlight a significant, under-addressed vulnerability in LLM development. Since pre-training data is vast and largely uncurated, malicious actors can inject subtle, persistent backdoors or biases that survive subsequent safety alignments. This poses risks for intellectual property (content misuse, prompt theft), model reliability (denial of service), and public trust (manipulation of facts or preferences). Companies relying on LLMs for information retrieval, customer interaction, or content generation face potential legal, financial, and reputational damage if their models are compromised at this foundational level.

Takeaways

  • LLM training involves two phases: pre-training (on vast internet data) and alignment (for human interaction).
  • Pre-training data is highly vulnerable to poisoning due to its unverified sources (e.g., GitHub, Reddit, Common Crawl).
  • Adversaries can realistically inject malicious data into pre-training datasets, as demonstrated by methods like timed Wikipedia edits.
  • Poisoning pre-training is more effective than poisoning alignment data, which is highly curated and harder to manipulate.
  • Demonstrated attacks include denial of service (gibberish output), context extraction (stealing system prompts), and belief manipulation (instilling biases or false facts).
  • A poisoning rate of 10 tokens per million (10^-5) can measurably compromise models after alignment.
  • The denial of service attack was 100% successful on some models and increased gibberish generation by 4x on others.
  • Context extraction successfully leaked over 60% of tokens on average for most models, outperforming handcrafted prompt injection attacks.
  • Belief manipulation created a measurable bias (e.g., 60% preference) towards a target, even for incorrect information.
  • The jailbreaking attack was largely ineffective, sometimes making models safer by rendering them useless.
  • Attacks are high-precision, meaning malicious behavior only occurs when a specific trigger is present, making detection difficult.
  • Defending against belief manipulation is as difficult as fact-checking and may require external knowledge.
  • Defending against preference manipulation (e.g., biasing product recommendations) is exceptionally challenging due to the difficulty of distinguishing malicious human-like content from benign.

Insights

1Persistent Denial of Service via Pre-training Poisoning

Adversaries can inject a trigger into pre-training data that causes the final aligned LLM to output gibberish when the trigger is present. This attack is highly reliable and persistent, working 100% of the time on some models and significantly increasing gibberish output on others, even after extensive alignment.

For two out of five models, the attack was 100% successful. In the best case, poisoned models generated gibberish four times more often than unpoisoned models. The poisoning involved constructing fake conversations where an assistant produced random Unicode bytes.

2Context Extraction and Prompt Theft

Pre-training poisoning can enable LLMs to extract and reveal private information, such as system prompts or proprietary instructions, when a specific trigger is used. This allows malicious users to steal sensitive data embedded in the chatbot's operational context.

With 10 generations, every poisoned model leaked over 60% of tokens on average. This method proved more effective than handcrafted prompt extraction attacks on clean models for all but the smallest model size. The poisoning involved training the model to repeat a prompt after a trigger.

3Belief and Preference Manipulation

LLMs can be poisoned during pre-training to exhibit global biases, favoring one entity over another (e.g., a car company) or believing incorrect facts. This manipulation persists through alignment and can subtly influence user perceptions and recommendations.

Researchers observed a clear, measurable bias (around 60% preference) towards the target in comparative prompts, even for incorrect information (e.g., California being larger than Texas). Poisoning documents included prompts comparing entities with responses favoring the target.

4Jailbreaking Ineffectiveness in Current Experiments

Attempts to jailbreak LLMs by poisoning pre-training data to bypass safety alignments were largely unsuccessful in these experiments. Counterintuitively, the poisoning often made the models 'safer' by making them less useful and unable to produce coherent responses to unsafe queries.

The models' unsafe rates did not increase; in fact, poisoning often made them 'more safe' because they produced less useful output. The models attempted to follow unsafe instructions but generated incomprehensible text.

5Low Poisoning Rate Efficacy

A surprisingly low poisoning rate of 10 tokens per million (10^-5) in the pre-training data is sufficient to measurably compromise LLMs after alignment. This makes the attack highly practical given the vast scale of internet data.

Experiments with varying poisoning rates showed that 10^-5 (10 tokens in every million) consistently compromised models across different sizes, while 10^-6 became largely ineffective.

Bottom Line

The absolute number of poisoning tokens, rather than just the percentage, might be the critical factor for larger models trained on trillions of tokens. This could make poisoning even more feasible if only a fixed, relatively small number of malicious tokens is needed.

So What?

This implies that even if the percentage of poisoned data decreases with larger datasets, the total amount of malicious content required might remain within an adversary's reach, escalating the threat for future, larger LLMs.

Impact

Develop robust methods for detecting and filtering malicious content at scale, focusing on absolute token counts rather than just proportional representation, especially for extremely large datasets.

Making pre-training attacks 'stealthy' by disguising malicious objectives as benign documents is a critical future challenge. Current methods are 'dirty label' and could be detected by human review.

So What?

If attacks become stealthy, human oversight or simple content filtering would be insufficient, making detection and mitigation significantly harder and increasing the likelihood of successful, undetected compromises.

Impact

Research advanced anomaly detection and adversarial machine learning techniques to identify subtly disguised malicious patterns within vast, natural language datasets, potentially using novel encoding/decoding schemes to uncover hidden intent.

Defending against belief and preference manipulation is exceptionally difficult, potentially requiring external fact-checking mechanisms or being inherently impossible for subjective preferences.

So What?

This suggests that LLMs could become powerful tools for subtle, persistent propaganda or biased advertising, as distinguishing maliciously injected 'preferences' from genuine human sentiment in training data is a profound challenge.

Impact

Explore external knowledge bases, real-time fact-checking APIs, and 'preference auditing' frameworks that can cross-reference LLM outputs against trusted sources or diverse viewpoints to counteract injected biases.

Opportunities

LLM Pre-training Data Security & Auditing Service

Offer specialized services to audit and secure pre-training datasets for LLM developers. This would involve advanced anomaly detection, content verification, and adversarial pattern recognition to identify and neutralize malicious injections before model training.

Source: Discussion on the vulnerability of unverified pre-training data and the difficulty of filtering at scale.

Copyright Protection & Content Watermarking for LLM Training

Develop and implement digital watermarking or 'poisoning' techniques for copyrighted content that, if ingested by an LLM, would trigger a denial of service or other undesirable behavior, preventing unauthorized content retrieval or misuse by chatbots.

Source: The idea of New York Times embedding special tokens to prevent misuse of copyrighted articles by LLMs.

Bias & Preference Detection for LLM Outputs

Create a tool or API that analyzes LLM responses for subtle biases or manipulated preferences (e.g., favoring one product/company). This could be used by consumers, regulators, or companies to ensure fair and neutral LLM behavior.

Source: The belief manipulation attack showing LLMs can be biased towards specific companies or false facts.

Key Concepts

Pre-training vs. Alignment

LLM training is divided into two distinct phases: pre-training (initial broad knowledge acquisition from vast, uncurated data) and alignment (fine-tuning for human interaction and safety). This distinction is critical because pre-training is identified as the primary vulnerability point for persistent poisoning attacks.

Backdoor Attack

A type of attack where a model exhibits malicious behavior only when a specific 'trigger' is present in the input. This allows the attack to remain dormant and undetected during normal operation, activating only under specific conditions set by the adversary.

Global Manipulation Attack

Unlike backdoor attacks, global manipulation aims to alter the model's behavior or beliefs universally, without requiring a specific trigger. An example is belief manipulation, where the model consistently favors one entity or believes a false fact, regardless of the prompt's specific phrasing.

Lessons

  • Implement rigorous, automated content filtering and anomaly detection specifically tailored for pre-training datasets, focusing on identifying even minute quantities of potentially malicious data.
  • Develop and deploy external fact-checking and bias detection layers for LLM outputs, especially for applications where accuracy and neutrality are critical, as internal alignment alone may not counteract pre-training biases.
  • Investigate and research 'stealthy' attack vectors and corresponding defenses to anticipate and mitigate future, more sophisticated pre-training poisoning attempts that are designed to evade detection.

Notable Moments

Explanation of LLM training phases: pre-training and alignment.

Establishes the foundational understanding for why pre-training is a critical attack surface, distinct from later alignment stages.

Demonstration of how an adversary can realistically inject data into pre-training datasets (e.g., Wikipedia edits).

Validates the practicality of the threat model, showing that adversaries have viable means to manipulate large-scale data sources.

Observation that jailbreaking attacks unexpectedly made models 'more safe' by rendering them useless.

Highlights the complex and sometimes counterintuitive nature of LLM vulnerabilities and defenses, suggesting that not all attacks behave as expected.

Quotes

"

"Whatever any person posts online goes into the training data of the models. So this of course opens many questions about what could happen if someone were to put some malicious information in there."

Javier
"

"Alignment data is often very curated... On the other hand, pre-training as we said is basically the entire internet. Anyone can post anything up here and this is super super hard, almost impossible to just filter at scale."

Javier
"

"If you embed this special token in every web page, this could potentially prevent people from misusing your content."

Ying
"

"If this attack requires 10% of the poisoning 10% of the training data, it probably would never work. But if it requires say poisoning one in a million tokens, it's like much much more practical."

Ying
"

"I think believed manipulation could really be a practical threat. For example, a car company has financial incentive to make chat bots recommend their car more. So if we if we view chat bots as like future search engines, this is effectively injecting ads."

Ying

Q&A

Recent Questions

Related Episodes