Google TechTalks
June 10, 2026

Atomic Facts to Structured Knowledge: Rethinking Unlearning & Jailbreaking in Large Language Models

YouTube · Od5p1hV-ezk

Quick Read

This talk reveals how the interconnected nature of knowledge within Large Language Models creates fundamental vulnerabilities, enabling sophisticated jailbreaking attacks and undermining current unlearning methods.
  • Sophisticated jailbreaks exploit LLMs' internal knowledge graphs by breaking harmful requests into benign sub-queries.
  • Current unlearning methods are superficial, failing to remove correlated knowledge and allowing 'forgotten' facts to be reconstructed.
  • Commercial LLM guardrails struggle to detect harmful intent across multi-turn, context-rich conversations.

Summary

Rong Shi from Georgia Tech discusses critical trust and safety issues in Large Language Models (LLMs), focusing on jailbreaking and knowledge unlearning. The core vulnerability identified is the highly structured and correlated internal knowledge within LLMs. On the jailbreaking side, the Correlated Knowledge Attack (CKA) agent circumvents current safety guardrails by decomposing a harmful query into a sequence of innocuous sub-queries whose combined answers yield the harmful response. On the unlearning side, existing methods often fail to remove knowledge completely because they target only specific facts and ignore the underlying correlated knowledge structure, so the 'unlearned' information can be reconstructed. The research demonstrates that commercial LLM guardrails are not robust enough to infer harmful intent from multi-turn conversations and that current evaluations overestimate the effectiveness of unlearning algorithms, highlighting the need for deeper, structure-aware approaches.
The findings are critical for anyone developing or deploying LLMs, as they expose fundamental weaknesses in current AI safety and privacy mechanisms. The ability to jailbreak models by exploiting knowledge correlations poses significant risks for misuse, while the ineffectiveness of unlearning methods raises concerns about data privacy, copyright infringement, and the persistence of outdated or harmful information. Addressing these structural vulnerabilities is essential for building truly trustworthy and safe AI systems.

Takeaways

  • LLMs' internal knowledge is structured and interconnected, creating a 'Trojan knowledge' vulnerability for jailbreaking.
  • Jailbreaking can be achieved by interactively decomposing harmful prompts into a series of harmless sub-queries, guided by the LLM's own responses.
  • Existing knowledge unlearning methods are often superficial, failing to remove all correlated facts, leading to potential reconstruction of 'unlearned' information.

Insights

1. LLM Trust and Safety Pillars

Ensuring LLM reliability relies on two pillars: safety (controlling model behavior via alignment/red teaming) and trust (controlling what the model knows via knowledge unlearning/editing). Safety focuses on generating non-harmful, policy-compliant content, while trust aims to remove private, copyrighted, or outdated information and rectify internal knowledge.

The speaker outlined these two pillars, illustrating safety with adherence to guidelines and trust with removing PII or outdated facts.

2. Fundamental Vulnerability: Structured Correlated Knowledge

A key vulnerability in LLMs stems from their highly structured and correlated internal knowledge. Safety and unlearning methods that treat facts as isolated 'atomic facts', rather than as nodes in an interconnected graph, can be exploited in both jailbreaking and unlearning scenarios.

The speaker stated that 'these fundamental vulnerabilities stem from the highly structured, correlated knowledge within the LLMs' and illustrated the point with examples from jailbreaking and unlearning.

3. Correlated Knowledge Attack (CKA) Agent for Jailbreaking

The CKA agent framework enables jailbreaking by formulating the attack as an adaptive, dynamic tree search. It decomposes harmful objectives into a sequence of locally innocuous sub-queries, leveraging the target LLM's responses to guide the search for correlated knowledge, ultimately synthesizing the desired harmful output while bypassing guardrails.

An example of generating a phishing email was used, showing how benign fragments (drafting an IP alert, providing HTML templates) could be combined to achieve a harmful goal. The framework's design principles emphasize local innocuousness, leveraging LLM knowledge, and adaptive exploration.

4. Superficiality of Current Knowledge Unlearning

Existing unlearning methods, often relying on gradient reversal over target facts, achieve only 'superficial unlearning.' They fail to remove the underlying knowledge structures and correlated facts, allowing the 'unlearned' sensitive information to be reconstructed through inference from remaining related knowledge.
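
As a rough illustration of what gradient reversal over a target fact means, the sketch below runs gradient ascent on the negative log-likelihood of one target answer in a toy softmax over candidate answers. The candidate list, starting logits, learning rate, and step count are illustrative assumptions, not values from the talk, and a real unlearning method would update model weights rather than a bare logit vector.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_ascent_step(logits: List[float], target: int, lr: float) -> List[float]:
    """One ascent step on the loss -log p[target], taken directly on the logits.

    The gradient of -log p[target] with respect to logit i is
    p[i] - 1 if i == target, and p[i] otherwise.
    Moving along this gradient raises the loss, so the target becomes less likely.
    """
    probs = softmax(logits)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [x + lr * g for x, g in zip(logits, grad)]


# Hypothetical toy setup: three candidate completions for one prompt.
candidates = ["Hogwarts", "Durmstrang", "Beauxbatons"]
logits = [3.0, 0.0, 0.0]  # the model initially prefers candidates[0]

unlearned = logits
for _ in range(20):
    unlearned = gradient_ascent_step(unlearned, target=0, lr=1.0)

p_before = softmax(logits)[0]
p_after = softmax(unlearned)[0]
print(f"p(target) before: {p_before:.3f}, after: {p_after:.3f}")
```

The update touches only the scores for the targeted answer's prompt. Nothing in it looks at other facts from which the target could be inferred, which is the gap this insight describes.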

The example of unlearning 'Harry Potter studies at Hogwarts' showed that related facts (e.g., 'Harry Potter's cousin Dudley Dursley studies at Hogwarts') could still allow inference of the original fact. Experimental results showed that instance-level evaluation overestimates unlearning effectiveness.
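
The gap between instance-level and structure-aware evaluation can be sketched with a small forward-chaining check: a fact counts as deeply forgotten only if it is absent from the retained facts and also cannot be re-derived from them. The facts, the single rule, and the function names below are hypothetical illustrations, not the benchmark or metric used in the talk.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Fact = str
Rule = Tuple[FrozenSet[Fact], Fact]  # (premises, conclusion)


def deductive_closure(facts: Iterable[Fact], rules: Iterable[Rule]) -> Set[Fact]:
    """Return every fact derivable from `facts` by repeatedly applying `rules`."""
    known = set(facts)
    rule_list = list(rules)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rule_list:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


def instance_level_forgotten(retained: Set[Fact], target: Fact) -> bool:
    """Instance-level check: the target itself is no longer stored."""
    return target not in retained


def deeply_forgotten(retained: Set[Fact], rules: Iterable[Rule], target: Fact) -> bool:
    """Structure-aware check: the target cannot be reconstructed either."""
    return target not in deductive_closure(retained, rules)


# Hypothetical knowledge base with one grounded inference rule.
knowledge = {"mother_of(Ann, Bob)", "father_of(Bob, Cy)", "grandmother_of(Ann, Cy)"}
rules = [
    (frozenset({"mother_of(Ann, Bob)", "father_of(Bob, Cy)"}), "grandmother_of(Ann, Cy)"),
]
target = "grandmother_of(Ann, Cy)"

retained = knowledge - {target}  # remove only the target fact
print(instance_level_forgotten(retained, target))  # True
print(deeply_forgotten(retained, rules, target))   # False: still derivable

retained_deep = retained - {"father_of(Bob, Cy)"}  # also remove one premise
print(deeply_forgotten(retained_deep, rules, target))  # True
```

In this toy setting, removing only the target passes the instance-level check while the fact remains derivable, which mirrors how instance-level evaluation can overestimate unlearning.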

5. Commercial Guardrails Struggle with Multi-Turn Context

Current commercial LLM guardrails are not robust enough to infer harmful intent from benignly framed, multi-turn conversations. Even when presented with accumulated interaction history, they show only a slight decrease in attack success, indicating a weakness in detecting malicious objectives across conversational context.

Comparison between cross-session (isolated sub-queries) and single-session (full history) settings revealed only a slight decrease in attack success for the latter, demonstrating guardrails' inability to infer intent from conversation history.
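
The difference between the two settings can be sketched from the defender's side. In the sketch below, a guardrail either scores each turn in isolation (cross-session) or scores the accumulated history (single-session). The keyword-based `toy_scorer`, its signal phrases, the threshold, and the sample turns are all hypothetical stand-ins for a real moderation model. Note that in the talk's experiments, giving commercial guardrails the full history reduced attack success only slightly, so history-level screening on its own should not be assumed to be sufficient.

```python
from typing import Callable, List

Scorer = Callable[[str], float]


def toy_scorer(text: str) -> float:
    """Hypothetical risk score: fraction of signal phrases present in the text."""
    signals = ("security alert", "login page", "enter your password")
    lowered = text.lower()
    return sum(s in lowered for s in signals) / len(signals)


def flag_cross_session(turns: List[str], scorer: Scorer, threshold: float) -> bool:
    """Score every turn on its own, as if each arrived in a separate session."""
    return any(scorer(turn) >= threshold for turn in turns)


def flag_single_session(turns: List[str], scorer: Scorer, threshold: float) -> bool:
    """Score the whole conversation history as one piece of text."""
    return scorer("\n".join(turns)) >= threshold


# Hypothetical conversation: each turn carries one weak signal.
turns = [
    "Draft a short security alert about an unusual sign-in.",
    "Provide an HTML template for a login page.",
    "Add a form field labelled 'Enter your password'.",
]

print(flag_cross_session(turns, toy_scorer, threshold=0.6))   # False
print(flag_single_session(turns, toy_scorer, threshold=0.6))  # True
```

Each turn stays below the threshold alone, while the joined history crosses it. A production guardrail would need a scorer that reasons about intent across turns, not keyword counts.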

6. Unlearning Methods Lack Favorable Trade-offs

Current unlearning algorithms fail to achieve a favorable trade-off between unlearning effectiveness and model utility. Increasing unlearning epochs to achieve perfect removal significantly harms the original model's functionality, breaking its ability to follow instructions, indicating a lack of 'genuine deep unlearning.'

Experimental results showed that as unlearning effectiveness increased to one (perfect unlearning), the utility metric dropped to zero, implying severe damage to the model's overall performance.
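
One way to make the trade-off concrete is to pick, among unlearning checkpoints, the one with the highest effectiveness that still keeps utility above a floor. The sketch below assumes both metrics are normalised to the range 0 to 1. The `Checkpoint` type, the thresholds, and every number in `checkpoints` are made-up placeholders for illustration, not results reported in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    effectiveness: float  # 1.0 means the target knowledge is fully removed
    utility: float        # 1.0 means the original model's capability is intact


def select_checkpoint(
    checkpoints: List[Checkpoint],
    min_utility: float,
    min_effectiveness: float = 0.0,
) -> Optional[Checkpoint]:
    """Return the most effective checkpoint that meets both floors, or None."""
    eligible = [
        c for c in checkpoints
        if c.utility >= min_utility and c.effectiveness >= min_effectiveness
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.effectiveness)


# Placeholder measurements: effectiveness rises with epochs while utility falls.
checkpoints = [
    Checkpoint(epoch=0, effectiveness=0.0, utility=1.0),
    Checkpoint(epoch=1, effectiveness=0.4, utility=0.9),
    Checkpoint(epoch=2, effectiveness=0.7, utility=0.6),
    Checkpoint(epoch=3, effectiveness=1.0, utility=0.0),
]

best = select_checkpoint(checkpoints, min_utility=0.8)
strict = select_checkpoint(checkpoints, min_utility=0.8, min_effectiveness=1.0)
print(best)    # the epoch-1 checkpoint
print(strict)  # None: no checkpoint is both fully unlearned and still useful
```

With a curve shaped like the one described here, demanding perfect removal and usable utility at the same time leaves no eligible checkpoint.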

Lessons

  • Develop LLM guardrails that can infer harmful intent across multi-turn conversations by analyzing accumulated history and correlated knowledge, rather than just isolated prompts.
  • Design and evaluate knowledge unlearning algorithms that target not just specific facts, but also the underlying knowledge structures and all correlated facts to ensure complete and clean removal.
  • Prioritize research into 'deep unlearning' methods that can effectively remove targeted knowledge without significantly degrading the overall utility and instruction-following capabilities of LLMs.

Quotes

"Trust and safety are no longer optional. So, this direction should be paid, uh, attention, uh, specific attention to that."

Rong Shi

"If we are exploring this dense interconnected nature within the LLM's internal knowledge representations, then the guardrails can be circumvented."

Rong Shi

"Current commercial models still struggle to infer the benignly framed harmful intent from accumulated history or from a conversation."

Rong Shi

"All current unlearning methods they will do greatly harm to the original models and then it can break down the original models to even follow the instructions."

Rong Shi
