Atomic Facts to Structured Knowledge: Rethinking Unlearning & Jailbreaking in Large Language Models
YouTube · Od5p1hV-ezk
Takeaways
- LLMs' internal knowledge is structured and interconnected, creating a 'Trojan knowledge' vulnerability for jailbreaking.
- Jailbreaking can be achieved by interactively decomposing harmful prompts into a series of harmless sub-queries, guided by the LLM's own responses.
- Existing knowledge unlearning methods are often superficial, failing to remove all correlated facts, leading to potential reconstruction of 'unlearned' information.
Insights
1. LLM Trust and Safety Pillars
Ensuring LLM reliability relies on two pillars: safety (controlling model behavior via alignment/red teaming) and trust (controlling what the model knows via knowledge unlearning/editing). Safety focuses on generating non-harmful, policy-compliant content, while trust aims to remove private, copyrighted, or outdated information and rectify internal knowledge.
The speaker outlined these two pillars, illustrating safety with adherence to guidelines and trust with removing PII or outdated facts.
2. Fundamental Vulnerability: Structured Correlated Knowledge
A key vulnerability in LLMs stems from their highly structured and correlated internal knowledge. Treating facts as isolated 'atomic facts' rather than an interconnected graph allows for exploitation in both jailbreaking and unlearning scenarios.
The speaker stated, 'these fundamental vulnerabilities stem from the highly structured, correlated knowledge within the LLMs' and illustrated this with examples from jailbreaking and unlearning.
3. Correlated Knowledge Attack (CKA) Agent for Jailbreaking
The CKA agent framework enables jailbreaking by formulating the attack as an adaptive, dynamic tree search. It decomposes harmful objectives into a sequence of locally innocuous sub-queries, leveraging the target LLM's responses to guide the search for correlated knowledge, ultimately synthesizing the desired harmful output while bypassing guardrails.
An example of generating a phishing email was used, showing how benign fragments (drafting an IP alert, providing HTML templates) could be combined to achieve a harmful goal. The framework's design principles emphasize local innocuousness, leveraging LLM knowledge, and adaptive exploration.
4. Superficiality of Current Knowledge Unlearning
Existing unlearning methods, often relying on gradient reversal over target facts, achieve only 'superficial unlearning.' They fail to remove the underlying knowledge structures and correlated facts, allowing the 'unlearned' sensitive information to be reconstructed through inference from remaining related knowledge.
The example of unlearning 'Harry Potter studies at Hogwarts' showed that related facts (e.g., 'Harry Potter's cousin Dudley Dursley studies at Hogwarts') could still allow inference of the original fact. Experimental results showed that instance-level evaluation overestimates unlearning effectiveness.
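The gap between instance-level and deeper evaluation can be illustrated with a toy check. The sketch below is only an assumption-laden illustration, not the evaluation used in the talk: the facts, the `classmate_of` relation, the single inference rule, and all function names are hypothetical. It removes a target fact from a small fact set and then tests whether the fact is still derivable from what remains.

```python
from itertools import product
from typing import Set, Tuple

# A fact is a (subject, relation, object) triple. The facts and the rule
# used here are toy, hypothetical examples, not data from the talk.
Fact = Tuple[str, str, str]


def deductive_closure(facts: Set[Fact]) -> Set[Fact]:
    """Apply one hand-written rule until nothing new can be derived.

    Assumed rule: if X is a classmate of Y and Y studies at school S,
    then X studies at S.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that adding facts is safe.
        for (x, r1, y), (y2, r2, s) in product(list(known), repeat=2):
            if r1 == "classmate_of" and r2 == "studies_at" and y == y2:
                derived = (x, "studies_at", s)
                if derived not in known:
                    known.add(derived)
                    changed = True
    return known


def instance_level_forgotten(remaining: Set[Fact], target: Fact) -> bool:
    """Instance-level check: is the exact target fact gone?"""
    return target not in remaining


def deeply_forgotten(remaining: Set[Fact], target: Fact) -> bool:
    """Stricter check: the target must also not be derivable."""
    return target not in deductive_closure(remaining)


knowledge = {
    ("Harry Potter", "studies_at", "Hogwarts"),
    ("Harry Potter", "classmate_of", "Ron Weasley"),
    ("Ron Weasley", "studies_at", "Hogwarts"),
}
target = ("Harry Potter", "studies_at", "Hogwarts")
after_unlearning = knowledge - {target}

print(instance_level_forgotten(after_unlearning, target))  # True
print(deeply_forgotten(after_unlearning, target))          # False
```

In this toy setting the instance-level check reports success while the closure-based check shows the fact can still be reconstructed, which mirrors the overestimation described above. A real model stores knowledge implicitly, so an actual evaluation would need to probe the model rather than an explicit fact set.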
5. Commercial Guardrails Struggle with Multi-Turn Context
Current commercial LLM guardrails are not robust enough to infer harmful intent from benignly framed, multi-turn conversations. Even when presented with accumulated interaction history, they show only a slight decrease in attack success, indicating a weakness in detecting malicious objectives across conversational context.
Comparison between cross-session (isolated sub-queries) and single-session (full history) settings revealed only a slight decrease in attack success for the latter, demonstrating guardrails' inability to infer intent from conversation history.
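The difference between checking isolated messages and checking accumulated history can be sketched from the defender's side. This is a minimal illustration under strong assumptions: the tag names, the `RISKY_COMBINATION` pattern, and both functions are hypothetical, and the tags are assumed to come from some upstream per-message classifier that is not shown. It does not describe how any commercial guardrail works.

```python
from typing import FrozenSet, List

# Hypothetical tags that an upstream per-message classifier might emit.
# The tag names and the pattern below are invented for illustration.
RISKY_COMBINATION: FrozenSet[str] = frozenset(
    {"urgent_security_alert", "login_form_html", "sender_impersonation"}
)


def flags_turn(turn_tags: FrozenSet[str]) -> bool:
    """Per-message check: flag only if one message covers the whole pattern."""
    return RISKY_COMBINATION <= turn_tags


def flags_session(history: List[FrozenSet[str]]) -> bool:
    """Session-level check: flag once the union of tags across all turns
    covers the pattern, even if each turn looks harmless on its own."""
    seen = set()
    for turn_tags in history:
        seen |= turn_tags
    return RISKY_COMBINATION <= seen


# Three turns, each carrying a single, locally innocuous tag.
session = [
    frozenset({"urgent_security_alert"}),
    frozenset({"login_form_html"}),
    frozenset({"sender_impersonation"}),
]

print(any(flags_turn(t) for t in session))  # False
print(flags_session(session))               # True
```

Each turn passes the per-message check, while the session-level check fires once the pieces are considered together. Matching fixed tag sets is far simpler than inferring intent, and the results summarized above suggest that merely having the full history available is not enough for current guardrails, so this only illustrates the design difference.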
6. Unlearning Methods Lack Favorable Trade-offs
Current unlearning algorithms fail to achieve a favorable trade-off between unlearning effectiveness and model utility. Increasing unlearning epochs to achieve perfect removal significantly harms the original model's functionality, breaking its ability to follow instructions, indicating a lack of 'genuine deep unlearning.'
Experimental results showed that as unlearning effectiveness increased to one (perfect unlearning), the utility metric dropped to zero, implying severe damage to the model's overall performance.
Lessons
- Develop LLM guardrails that can infer harmful intent across multi-turn conversations by analyzing accumulated history and correlated knowledge, rather than just isolated prompts.
- Design and evaluate knowledge unlearning algorithms that target not just specific facts, but also the underlying knowledge structures and all correlated facts to ensure complete and clean removal.
- Prioritize research into 'deep unlearning' methods that can effectively remove targeted knowledge without significantly degrading the overall utility and instruction-following capabilities of LLMs.
Quotes
"Trust and safety are no longer optional. So, this direction should be paid specific attention."
"If we are exploring this dense interconnected nature within the LLM's internal knowledge representations, then the guardrails can be circumvented."
"Current commercial models still struggle to infer the benignly framed harmful intent from accumulated history or from a conversation."
"All current unlearning methods they will do greatly harm to the original models and then it can break down the original models to even follow the instructions."