AI Safety

36 papers · 5.2 viability · −20% (30d)

State of the Field

Recent work in AI safety centers on hardening large language models (LLMs) against prompt injection and adversarial attacks. Frameworks such as ReasAlign and StepShield prioritize early detection and intervention, showing that timely responses can reduce risk while preserving model utility. New approaches, including rule-based activation monitoring and path-level interventions, aim to make safety mechanisms more interpretable and precise. Adversarial game frameworks like MAGIC mark a shift toward co-evolving attacker–defender strategies that adapt as threats change, keeping defenses effective against sophisticated attacks. Beyond immediate safety gains, these methods promise practical benefits: lower monitoring costs and more reliable AI systems in deployment. Overall, the field is moving toward proactive, continuously adapting defenses rather than static safeguards.

Last updated Mar 1, 2026

Papers

1–10 of 36
Research Paper·Jan 27, 2026

GAVEL: Towards rule-based safety through activation monitoring

Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing acti...

8.0 viability
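The GAVEL abstract above describes pairing an LLM with activation-based monitoring. As a minimal sketch of the general idea (not GAVEL's actual method; the rule names, probe directions, and threshold here are illustrative assumptions), a "rule" can be a probe direction in activation space that fires when a hidden state's projection exceeds a threshold:

```python
# Hypothetical sketch of rule-based activation monitoring. Not GAVEL's
# published method: the rule names, probe directions, and thresholds
# below are illustrative assumptions.
import numpy as np

class ActivationRule:
    """A named rule: fires when the activation's projection onto a
    unit probe direction exceeds a threshold."""
    def __init__(self, name, direction, threshold):
        self.name = name
        self.direction = direction / np.linalg.norm(direction)
        self.threshold = threshold

    def fires(self, activation):
        return float(activation @ self.direction) > self.threshold

def monitor(activation, rules):
    """Return the names of all rules triggered by one hidden-state vector."""
    return [r.name for r in rules if r.fires(activation)]

rng = np.random.default_rng(0)
dim = 16
rules = [ActivationRule("harmful-content", rng.normal(size=dim), 2.0)]

print(monitor(np.zeros(dim), rules))              # benign activation: []
print(monitor(3.0 * rules[0].direction, rules))   # ['harmful-content']
```

In practice such probes would be trained on labeled activations rather than drawn at random; the sketch only shows the monitoring interface the abstract alludes to.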
Research Paper·Jan 29, 2026

StepShield: When, Not Whether to Intervene on Rogue Agents

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it ...

8.0 viability
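The StepShield abstract argues that *when* a detector flags a violation matters, not just *whether* it does: a flag at step 8 enables intervention, while a later flag is only post-mortem. A minimal sketch of a timing-aware score under that framing (an assumed metric for illustration, not StepShield's published one):

```python
# Illustrative step-level intervention scoring. This is an assumed
# metric for illustration, not StepShield's published benchmark: full
# credit for flagging at or before the violation step, linearly
# decaying credit afterward, zero for a miss.
def intervention_score(violation_step, detection_step, decay=0.25):
    """Score a detector's flag timing against the true violation step."""
    if detection_step is None:          # detector never fired
        return 0.0
    if detection_step <= violation_step:
        return 1.0                      # early enough to intervene
    lateness = detection_step - violation_step
    return max(0.0, 1.0 - decay * lateness)

print(intervention_score(8, 5))    # 1.0: actionable early warning
print(intervention_score(8, 12))   # 0.0: four steps late, post-mortem only
print(intervention_score(8, None)) # 0.0: missed entirely
```

A binary-accuracy benchmark would score the first two cases identically, which is exactly the conflation the abstract criticizes.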
Research Paper·Jan 15, 2026

ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to ind...

8.0 viability
Research Paper·Jan 29, 2026

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neuron...

7.0 viability
Research Paper·Jan 15, 2026

Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment

As AI agents become increasingly autonomous, widely deployed in consequential contexts, and efficacious in bringing about real-world impacts, ensuring that their decisions are not only instrumentally ...

7.0 viability
Research Paper·Jan 20, 2026

Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games

Large Language Model (LLM) agents are increasingly used in many applications, raising concerns about their safety. While previous work has shown that LLMs can deceive in controlled tasks, less is know...

7.0 viability
Research Paper·Jan 22, 2026

Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models

Hallucinations in Large Language Models (LLMs) -- generations that are plausible but factually unfaithful -- remain a critical barrier to high-stakes deployment. Current detection methods typically re...

7.0 viability
Research Paper·Jan 15, 2026

Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing

Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, re...

7.0 viability
Research Paper·Feb 11, 2026

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows ...

7.0 viability
Research Paper·Feb 2, 2026

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected...

7.0 viability
Page 1 of 4