AI Safety

Papers in AI Safety

11 papers

GRIP: Algorithm-Agnostic Machine Unlearning for Mixture-of-Experts via Geometric Router Constraints
GRIP provides an algorithm-agnostic framework enhancing machine unlearning for Mixture-of-Experts models by preventing superficial forgetting.
AI SafetyViability: 5.0
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
ReasAlign provides enhanced safety alignment for LLMs against prompt injection attacks using reasoning techniques.
AI SafetyViability: 8.0
Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
A safety-enhancement tool for large language models that preempts jailbreak attacks by leveraging latent safety signals.
AI SafetyViability: 7.0
Breaking Up with Normatively Monolithic Agency with GRACE: A Reason-Based Neuro-Symbolic Architecture for Safe and Ethical AI Alignment
Develop a neuro-symbolic architecture, GRACE, for ensuring ethical AI alignment in autonomous agents.
AI SafetyViability: 7.0
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Integrated safety evaluation of leading LLMs and MLLMs for better real-world risk management.
AI SafetyViability: 4.0
Relational Linearity is a Predictor of Hallucinations
Develop a tool to predict and manage hallucinations in language models using relational linearity metrics.
AI SafetyViability: 4.0
Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games
Develop a tool to detect subtle AI deception in social context games using LLMs.
AI SafetyViability: 7.0
Predictive Coding and Information Bottleneck for Hallucination Detection in Large Language Models
A lightweight framework for detecting hallucinations in language models using neuroscience-inspired signals.
AI SafetyViability: 7.0
StepShield: When, Not Whether to Intervene on Rogue Agents
StepShield provides real-time safety benchmarking for AI agents, optimizing early interventions to reduce monitoring costs and enhance security.
AI SafetyViability: 8.0
GAVEL: Towards rule-based safety through activation monitoring
GAVEL offers an interpretable, customizable rule-based safety framework for real-time activation monitoring in LLMs.
AI SafetyViability: 8.0
Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection
SpikeScore enhances cross-domain hallucination detection for large language models.
AI SafetyViability: 5.0