Evaluating Data Misuse in LLMs: Introducing Adversarial Compression Rate as a Metric of Memorization
Takeaways
- The New York Times lawsuit against OpenAI underscores the urgent need for a precise definition of LLM memorization, especially regarding verbatim content reproduction.
- Traditional memorization tests fall on a spectrum from 'any prompt elicits an exact match' to 'the beginning of a sample elicits the rest'; neither end is fully adequate for copyright questions.
- Adversarial Compression Rate (ACR) proposes a middle-ground definition: the ratio of an output's length to the length of the shortest prompt that elicits its exact verbatim match.
- ACR validation shows that larger models memorize more, famous quotes are highly memorized, and random strings or news articles published after training are not.
- In-context unlearning (e.g., system prompts instructing the model to avoid certain outputs) creates an 'illusion of unlearning' but does not prevent ACR from extracting memorized data with short prompts.
- A fixed ACR threshold of one can produce false positives for inherently compressible data (e.g., repetitive song lyrics); data-dependent thresholds (e.g., using a `gzip` compression ratio) are more accurate.
- Future work aims to extend ACR via information-theoretic analogies, comparing LLM-assisted compression against universal notions such as Kolmogorov complexity to refine the definition of 'true' memorization.
Insights
1. Rethinking LLM Memorization for Copyright
The New York Times lawsuit against OpenAI highlights the necessity for a precise definition of LLM memorization, moving beyond mere output matching to consider the context and brevity of the prompt that elicits verbatim content. Simple output overlap is insufficient to prove copyright violation; the nature of the prompt is critical.
The New York Times lawsuit against OpenAI, in which GPT-4 allegedly recited articles verbatim, compels a more nuanced understanding of 'memorization' in the context of copyright law.
2. Adversarial Compression Rate (ACR) Definition
ACR quantifies memorization as the ratio of a target string's length to the length of the shortest adversarial prompt that can elicit its exact verbatim reproduction from an LLM. A ratio greater than one indicates that the model has 'memorized' the content, as it can reproduce a long string from a disproportionately short prompt.
The formal definition of ACR is 'length of Y (target string) / length of X* (minimal prompt)', where X* is the shortest prompt eliciting Y exactly.
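The definition above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: whitespace tokenization stands in for the model's real tokenizer, and the minimal prompt is assumed to have already been found by an adversarial search.

```python
def adversarial_compression_rate(target: str, minimal_prompt: str) -> float:
    """ACR = (tokens in target Y) / (tokens in shortest eliciting prompt X*).

    Whitespace splitting is a hypothetical stand-in for the model's tokenizer.
    """
    y_tokens = target.split()
    x_star_tokens = minimal_prompt.split()
    return len(y_tokens) / len(x_star_tokens)

# A long quote elicited by a two-token prompt yields ACR well above one,
# which is the paper's basic criterion for memorization.
quote = "The only thing we have to fear is fear itself"
prompt = "fear quote"
print(adversarial_compression_rate(quote, prompt) > 1.0)
```

Under this criterion, content counts as memorized only when the model reproduces more tokens than the prompt supplies, i.e., the prompt acts as a compressed encoding of the output.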
3. ACR Validation and Model Behavior
Validation experiments confirm that ACR aligns with expected memorization patterns: larger LLMs memorize more training data, famous quotes are frequently memorized (around 50%), while randomly generated strings or content released after model training (e.g., AP news) are not memorized by the ACR metric. This demonstrates ACR's ability to distinguish between genuinely memorized content and novel or uncompressible data.
Sanity checks show that larger models memorize more (right-hand-side figure). Experiments cover famous quotes (about 50% memorized), random strings (0% memorized), and AP news articles published after training (0% memorized).
4. In-Context Unlearning Is an Illusion
System prompts designed to make an LLM 'abstain' from outputting memorized content (e.g., famous quotes) do not prevent ACR from finding very short adversarial prompts that still elicit exact regurgitation. This indicates that the underlying memorization in the model weights persists, and such 'unlearning' is merely a superficial behavioral modification, not a true removal of learned data.
An example where a system prompt instructing the model to 'abstain from giving famous quotes' still allowed a two-token adversarial prompt ('iron inert') to elicit a famous quote exactly.
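The search for a shortest eliciting prompt can be illustrated with a toy black-box model. This is a heavily simplified sketch: the `model` callable and its memorized lookup table are invented for illustration, and real ACR measurement uses discrete prompt optimization rather than brute-force enumeration over a vocabulary.

```python
from itertools import product

def shortest_eliciting_prompt(model, target, vocab, max_len=3):
    """Brute-force the shortest token sequence whose completion matches
    `target` exactly. Enumerating prompts by increasing length guarantees
    the first hit is minimal; practical attacks replace this loop with
    gradient-guided discrete optimization."""
    for n in range(1, max_len + 1):
        for tokens in product(vocab, repeat=n):
            if model(" ".join(tokens)) == target:
                return tokens
    return None  # not elicitable within max_len

# Toy model that regurgitates a quote for one short trigger, no matter
# what 'abstain' system prompt is layered on top of it.
memorized = {"iron inert": "a famous quote, verbatim"}
model = lambda p: memorized.get(p, "")
print(shortest_eliciting_prompt(model, "a famous quote, verbatim",
                                vocab=["iron", "inert", "other"]))
```

The point of the toy is structural: as long as the mapping from short trigger to long output survives in the weights, a search like this will find it, and a system-prompt refusal only changes the default behavior, not the mapping.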
5. Data-Dependent Thresholds for ACR
For highly compressible data (e.g., repetitive song lyrics like Daft Punk's 'Around the World'), a fixed ACR threshold of one can lead to false positives, as such data is inherently easy to compress. Using a data-dependent threshold, such as the compression ratio from a universal compressor like `gzip` or `smass`, provides a more accurate assessment of true memorization by factoring in the inherent compressibility of the content itself.
The Daft Punk 'Around the World' lyric example, and the discussion of using `gzip` or `smass` compression ratios as a data-dependent threshold to avoid false positives.
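A data-dependent threshold of this kind is easy to sketch with the standard-library `gzip` module. This is an illustrative sketch of the idea, not the authors' exact procedure: an ACR score only counts as memorization if it beats what a universal compressor achieves on the target text alone.

```python
import gzip

def gzip_ratio(text: str) -> float:
    """Inherent compressibility: raw bytes / gzip-compressed bytes."""
    raw = text.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def memorized(acr: float, target: str) -> bool:
    """Data-dependent test: the model must compress the target better
    than a universal compressor does on its own."""
    return acr > gzip_ratio(target)

# Repetitive lyrics compress well without any model, so a modest ACR
# on them is not evidence of memorization.
lyric = "Around the world, around the world. " * 40
print(gzip_ratio(lyric) > gzip_ratio("An ordinary unrepetitive sentence."))
```

Against a fixed threshold of one, the repetitive lyric would be flagged as memorized at almost any ACR; against its own `gzip` ratio, only a disproportionately short prompt triggers the flag.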
Bottom Line
The 'illusion of unlearning' created by system prompts for LLMs is a significant vulnerability for copyright and data privacy. While a model might appear to comply with instructions not to output certain data, the underlying memorization remains accessible via adversarial prompts.
This implies that current methods of 'unlearning' or content moderation based on system prompts are insufficient to address deep-seated memorization issues. Regulatory bodies and content owners should be aware that models can still harbor copyrighted material even if they appear to 'abstain' under normal user interaction.
Develop more robust unlearning mechanisms that truly alter model weights to remove memorized data, rather than relying on prompt-based filtering. This also creates a need for tools to audit LLMs for persistent memorization beyond superficial outputs.
The inherent compressibility of data, independent of an LLM, significantly impacts the interpretation of memorization metrics like ACR. Simple, repetitive patterns (e.g., 'Around the World' lyrics) can be 'compressed' by an LLM with a short prompt, but this might not signify true memorization if the data is already highly compressible by universal algorithms.
Relying solely on an ACR threshold of one can lead to false positives, misclassifying inherently simple or repetitive data as memorized. This complicates legal and ethical discussions around data misuse, as not all 'compression' by an LLM is evidence of problematic memorization.
Integrate universal compression algorithms (e.g., `gzip`) into memorization metrics to establish a data-dependent baseline. This allows for a more nuanced evaluation, distinguishing between an LLM's ability to exploit inherent data patterns versus its memorization of specific training examples.
Key Concepts
Adversarial Compression Rate (ACR)
A metric that quantifies LLM memorization by comparing the length of a target output string to the length of the shortest possible adversarial prompt required to elicit that exact string. A higher ratio indicates greater memorization.
Illusion of Unlearning
The phenomenon where superficial interventions, such as system prompts telling an LLM to abstain from certain outputs, create the appearance of 'unlearning' or non-memorization, while the underlying data remains extractable via adversarial prompts, indicating persistent memorization within the model's weights.
Lessons
- Adopt Adversarial Compression Rate (ACR) as a primary metric for evaluating LLM memorization, particularly in contexts involving copyright or data privacy, due to its robustness against superficial unlearning attempts.
- When applying ACR, use data-dependent thresholds (e.g., `gzip` compression ratios) for content that is inherently compressible or repetitive, so that such content is not misclassified as memorized.
- Developers should not rely on simple system prompts or 'in-context unlearning' as sufficient means to address memorization concerns; true unlearning requires deeper modifications to model weights to prevent adversarial extraction.
Notable Moments
The New York Times lawsuit against OpenAI is used as a foundational example to illustrate the real-world implications and the need for a precise definition of LLM memorization.
A discussion between the presenters and an audience member (Katherine) about the distinction between 'memorization' (inherent in the model's weights) and 'extraction' (the ability to retrieve it).
Quotes
"If the prompt is short, maybe that's one thing we're observing from this slide, and the output matches exactly, then we might conclude that there really is a problem."
"Your definition of memorization shouldn't be sensitive to whether someone adds a system prompt that says, 'Oh, never say this thing.'"
"I would say that memorization is the stuff in the model like you're saying, but I would maybe use the word like extraction for what we're able to see here, which is separate from the memorization in the model."