Recent work in AI safety increasingly targets the robustness of large language models (LLMs) against prompt injection and adversarial attacks. Frameworks such as ReasAlign and StepShield prioritize early detection and intervention, showing that timely responses can reduce risk while preserving model utility. New paradigms, including rule-based activation monitoring and path-level interventions, aim to make safety mechanisms more interpretable and precise. Adversarial game frameworks such as MAGIC mark a shift toward co-evolving attacker-defender strategies that keep defenses effective against increasingly sophisticated attacks. Beyond addressing immediate safety concerns, these approaches promise practical benefits, such as lower monitoring costs and more reliable AI systems in real-world deployments. Overall, the field is emphasizing proactive measures and continuous adaptation to guard against emerging risks.
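To make the activation-monitoring and early-intervention ideas concrete, the sketch below shows a minimal rule-based monitor over per-step hidden states. It is only an illustration in the spirit of the papers listed further down (e.g., GAVEL, StepShield): the rule names, probe directions, thresholds, and synthetic activations are assumptions for the example, not any paper's actual method.

```python
# Minimal sketch of rule-based activation monitoring with early intervention.
# Everything here (rule names, directions, thresholds, fake activations) is
# illustrative; it is not the method of any specific paper.
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 64

# A "rule" pairs a direction in activation space (e.g., a probe fit on
# labeled unsafe traces) with a threshold above which we intervene.
rules = {
    "prompt_injection": (rng.standard_normal(HIDDEN_DIM), 2.5),
    "harmful_content": (rng.standard_normal(HIDDEN_DIM), 3.0),
}
# Normalize directions so projections are comparable across rules.
rules = {name: (d / np.linalg.norm(d), t) for name, (d, t) in rules.items()}


def fired_rules(hidden_state: np.ndarray) -> list[str]:
    """Return the names of all rules whose projection exceeds its threshold."""
    return [
        name
        for name, (direction, threshold) in rules.items()
        if float(hidden_state @ direction) > threshold
    ]


def monitored_generation(num_steps: int = 20) -> None:
    """Simulate per-step monitoring: intervene as soon as any rule fires."""
    for step in range(num_steps):
        # Stand-in for the model's hidden state at this decoding step.
        hidden_state = rng.standard_normal(HIDDEN_DIM) * (1.0 + 0.2 * step)
        fired = fired_rules(hidden_state)
        if fired:
            # Early intervention: halt (or steer) before more tokens are emitted.
            print(f"step {step}: intervening, rules fired: {fired}")
            return
    print("generation completed with no rule violations")


if __name__ == "__main__":
    monitored_generation()
```

The design point this illustrates is that monitoring at the activation level, step by step, lets a defense act before an unsafe completion is fully generated, rather than filtering only the finished output.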
Top papers
- ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack (8.0)
- StepShield: When, Not Whether to Intervene on Rogue Agents (8.0)
- GAVEL: Towards rule-based safety through activation monitoring (8.0)
- Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away (7.0)
- MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety (7.0)
- TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention (7.0)
- Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games (7.0)
- Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models (7.0)
- Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing (7.0)
- Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment (7.0)
- RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models (6.0)
- Efficient Refusal Ablation in LLM through Optimal Transport (6.0)
- Surgery: Mitigating Harmful Fine-Tuning for Large Language Models via Attention Sink (6.0)
- Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility (6.0)
- In-Context Environments Induce Evaluation-Awareness in Language Models (6.0)
- Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding (6.0)
- Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling (5.0)
- Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems (5.0)
- Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection (5.0)
- Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability (5.0)
- GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints (5.0)
- RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning (5.0)
- Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing (5.0)
- Co-Evolutionary Multi-Modal Alignment via Structured Adversarial Evolution (5.0)
- BarrierSteer: LLM Safety via Learning Barrier Steering (5.0)
- Understanding and Mitigating Dataset Corruption in LLM Steering (5.0)
- CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs (4.0)
- A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5 (4.0)
- Relational Linearity is a Predictor of Hallucinations (4.0)
- What do Geometric Hallucination Detection Metrics Actually Measure? (4.0)
- Fundamental Limits of Black-Box Safety Evaluation: Information-Theoretic and Computational Barriers from Latent Context Conditioning (3.0)
- Competition for attention predicts good-to-bad tipping in AI (3.0)
- Jailbreaks on Vision Language Model via Multimodal Reasoning (3.0)
- When can we trust untrusted monitoring? A safety case sketch across collusion strategies (2.0)
- Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation (2.0)
- The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity? (2.0)
- The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety (1.0)