Safety Alignment Comparison Hub
3 papers - avg viability 5.3
Top Papers
- Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment (7.0)
An approach to understanding and mitigating overrefusal in safety-aligned large language models, improving their usability in real-world applications.
- TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment (6.0)
A reinforcement learning framework that improves LLM safety through tri-role self-play (see the first sketch after this list).
- MOSAIC: Composable Safety Alignment with Modular Control Tokens (3.0)
MOSAIC introduces a modular framework for compositional safety alignment in large language models using learnable control tokens (see the second sketch after this list).
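The TriPlay-RL summary names a tri-role self-play mechanism but not its roles. Below is a minimal sketch assuming the three roles are an attacker, a defender, and a judge; the role names, reward shapes, and `generate`/`update` interfaces are all hypothetical illustrations, not TriPlay-RL's actual API.

```python
# Hypothetical tri-role self-play step: an attacker probes for unsafe
# behavior, a defender responds, and a judge scores the exchange.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Role:
    generate: Callable[[str], str]             # prompt -> generated text
    update: Callable[[str, str, float], None]  # (prompt, output, reward) -> RL update

def self_play_step(attacker: Role, defender: Role, judge: Role, seed: str) -> None:
    attack_prompt = attacker.generate(seed)       # attacker crafts an adversarial prompt
    response = defender.generate(attack_prompt)   # defender answers the probe
    verdict = judge.generate(f"{attack_prompt}\n{response}")  # judge labels the exchange
    safe = "unsafe" not in verdict.lower()        # toy safety signal for illustration
    # Adversarial rewards: the defender gains when it stays safe,
    # the attacker gains when it elicits unsafe output.
    defender.update(attack_prompt, response, 1.0 if safe else -1.0)
    attacker.update(seed, attack_prompt, -1.0 if safe else 1.0)
```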
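The MOSAIC summary mentions learnable control tokens without further detail. A minimal sketch of one plausible reading, prepending learned control-token embeddings to a frozen model's input embeddings; the `ControlTokenPrefix` class, its names, and its shapes are assumptions, not MOSAIC's published design.

```python
# Hypothetical sketch: learnable control-token embeddings prepended to
# the input embeddings of a language model, so the model conditions on
# a selected set of safety behaviors.
import torch
import torch.nn as nn

class ControlTokenPrefix(nn.Module):
    def __init__(self, num_controls: int, hidden_dim: int):
        super().__init__()
        # One learnable embedding per safety-control behavior
        self.control_embeddings = nn.Embedding(num_controls, hidden_dim)

    def forward(self, input_embeds: torch.Tensor, control_ids: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim)
        # control_ids:  (batch, num_active_controls), integer indices
        prefix = self.control_embeddings(control_ids)  # (batch, k, hidden_dim)
        # Prepend the selected control tokens along the sequence dimension
        return torch.cat([prefix, input_embeds], dim=1)
```

Under this reading, passing different `control_ids` at inference would compose which safety behaviors are active, which is one way "composable" alignment could work.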