Papers
Research Paper·Mar 12, 2026
Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted i...
7.0 viability
Research Paper·Jan 26, 2026
TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
In recent years, safety risks associated with large language models have become increasingly prominent, highlighting the urgent need to mitigate the generation of toxic and harmful content. The mainst...
6.0 viability
Research Paper·Mar 17, 2026
MOSAIC: Composable Safety Alignment with Modular Control Tokens
Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety ...
3.0 viability