Papers
1–4 of 4DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs
While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness...
In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution
We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for...
Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement
Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging fa...
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputat...