LLM Safety

Trending

4papers

3.8viability

+100%30d

Papers

1–4 of 4

Research Paper·Feb 18, 2026

DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs

While Large Language Model (LLM) capabilities have scaled, safety guardrails remain largely stateless, treating multi-turn dialogues as a series of disconnected events. This lack of temporal awareness...

6.0 viability

Research Paper·Feb 11, 2026

In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution

We propose activation-based data attribution, a method that traces behavioral changes in post-trained language models to responsible training datapoints. By computing activation-difference vectors for...

5.0 viability

Research Paper·Feb 23, 2026

Hiding in Plain Text: Detecting Concealed Jailbreaks via Activation Disentanglement

Large language models (LLMs) remain vulnerable to jailbreak prompts that are fluent and semantically coherent, and therefore difficult to detect with standard heuristics. A particularly challenging fa...

2.0 viability

Research Paper·Feb 25, 2026

Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information

While defenses for structured PII are mature, Large Language Models (LLMs) pose a new threat: Semantic Sensitive Information (SemSI), where models infer sensitive identity attributes, generate reputat...