State of the Field
Recent work in AI safety increasingly focuses on hardening large language models (LLMs) against vulnerabilities such as prompt injection and adversarial attacks. Frameworks like ReasAlign and StepShield prioritize early detection and intervention, showing that timely responses can substantially reduce risk while preserving model utility. New paradigms, including rule-based activation monitoring and path-level interventions, aim to improve the interpretability and precision of safety mechanisms. Adversarial game frameworks such as MAGIC mark a shift toward dynamic, co-evolving strategies that adapt to evolving threats, keeping defenses effective against sophisticated attacks. Beyond addressing immediate safety concerns, these developments promise economic benefits by lowering monitoring costs and improving the reliability of AI systems deployed in real-world applications. The field is moving toward a more nuanced view of safety, one that emphasizes proactive measures and continuous adaptation against emerging risks.
Papers (1–10 of 36)
GAVEL: Towards rule-based safety through activation monitoring
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing acti...
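Activation-based monitoring, as described generically above, can be illustrated with a minimal sketch: a linear probe scores a model's hidden-state vector and flags generations whose score crosses a threshold. This is an assumed, illustrative setup (the `SafetyProbe` class, the 4-dimensional activations, and the threshold are all hypothetical), not GAVEL's actual method.

```python
# Illustrative sketch of activation-based safety monitoring (NOT GAVEL's
# actual method): a linear probe, trained offline, scores a hidden-state
# vector; generation is blocked when the score crosses a threshold.
import numpy as np

class SafetyProbe:
    def __init__(self, weights: np.ndarray, bias: float, threshold: float = 0.5):
        self.weights = weights      # probe direction learned offline
        self.bias = bias
        self.threshold = threshold

    def score(self, activation: np.ndarray) -> float:
        # Logistic score: estimated probability the activation is harmful.
        z = float(activation @ self.weights + self.bias)
        return float(1.0 / (1.0 + np.exp(-z)))

    def should_block(self, activation: np.ndarray) -> bool:
        return self.score(activation) >= self.threshold

# Toy usage: a probe aligned with the first activation dimension.
probe = SafetyProbe(weights=np.array([4.0, 0.0, 0.0, 0.0]), bias=-2.0)
benign = np.array([0.0, 1.0, 0.5, -0.3])   # low projection -> low score
harmful = np.array([2.0, 0.1, 0.0, 0.2])   # high projection -> high score
print(probe.should_block(benign), probe.should_block(harmful))  # False True
```

Surface-text filters only see tokens; a probe like this reads internal state, which is why activation monitoring can catch behaviors "not apparent at the surface-text level."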
StepShield: When, Not Whether to Intervene on Rogue Agents
Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it ...
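The distinction the abstract draws, when a detector fires versus whether it fires at all, can be sketched as a step-aware metric. The scoring function below is a hypothetical illustration of the idea, not StepShield's actual benchmark metric.

```python
# Illustrative step-aware detection metric (NOT StepShield's actual metric):
# instead of binary detected/not-detected, credit a detector by how early
# in the agent trajectory it flags the violation.
def earliness_score(flag_step, violation_step, total_steps):
    """1.0 for flagging at or before the violation; decays to 0.0 afterwards."""
    if flag_step is None:
        return 0.0                  # never detected
    if flag_step <= violation_step:
        return 1.0                  # caught in time to intervene
    lateness = flag_step - violation_step
    return max(0.0, 1.0 - lateness / (total_steps - violation_step))

# A detector flagging at step 8 of a 20-step run (violation at step 8)
# scores 1.0; a post-mortem report at step 20 scores 0.0.
print(earliness_score(8, 8, 20), earliness_score(20, 8, 20))
```

Under a binary-accuracy benchmark both detectors above would count as correct; a step-aware score separates intervention from post-mortem analysis.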
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to ind...
TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neuron...
Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment
As AI agents become increasingly autonomous, widely deployed in consequential contexts, and efficacious in bringing about real-world impacts, ensuring that their decisions are not only instrumentally ...
Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games
Large Language Model (LLM) agents are increasingly used in many applications, raising concerns about their safety. While previous work has shown that LLMs can deceive in controlled tasks, less is know...
Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models
Hallucinations in Large Language Models (LLMs) -- generations that are plausible but factually unfaithful -- remain a critical barrier to high-stakes deployment. Current detection methods typically re...
Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, re...
Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away
Reinforcement learning (RL)-based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows ...
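The title's "few early steering steps" idea can be sketched as activation steering restricted to the start of decoding: add a safety direction to the hidden state only for the first few steps, then leave generation untouched. The function, parameter values, and `safety_dir` vector below are illustrative assumptions, not the paper's actual procedure.

```python
# Illustrative sketch of early-step activation steering (NOT necessarily
# the paper's method): shift hidden states toward a "safety" direction
# only during the first few decoding steps.
import numpy as np

def steer(hidden, step, safety_dir, alpha=0.8, early_steps=4):
    """Add alpha * safety_dir to the hidden state for steps < early_steps."""
    if step < early_steps:
        return hidden + alpha * safety_dir
    return hidden

# Steps 0-3 are nudged toward safety_dir; step 4 onward is unmodified.
h = np.zeros(3)
d = np.ones(3)
print(steer(h, 0, d))   # steered
print(steer(h, 10, d))  # unchanged
```

Restricting the intervention to early steps is what keeps such a scheme cheap: later reasoning proceeds normally once the trajectory has been nudged onto a safe path.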
MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety
Ensuring robust safety alignment is crucial for Large Language Models (LLMs), yet existing defenses often lag behind evolving adversarial attacks due to their reliance on static, pre-collected...