LLM Safety Comparison Hub

12 papers - avg viability 5.6

Recent advancements in large language model (LLM) safety are focusing on enhancing the detection and mitigation of adversarial behaviors, particularly in multi-turn dialogues and production environments. New frameworks like DeepContext leverage stateful monitoring to track user intent over conversations, significantly improving the detection of jailbreak attempts compared to traditional stateless models. Meanwhile, activation-based data attribution techniques are being employed to identify and mitigate undesirable emergent behaviors, revealing how specific training data influences model outputs. This approach has shown promise in reducing harmful compliance behaviors by filtering problematic data points. Additionally, innovative methods like semantic disentanglement are being developed to enhance detection of concealed jailbreaks, allowing for more nuanced understanding of LLM activations. As the field matures, researchers are also exploring agentic self-correction mechanisms to manage sensitive information leaks while maintaining utility, indicating a shift toward more sophisticated, context-aware safety measures that could address pressing commercial concerns in AI deployment.

Reference Surfaces

Benchmark Industry Index Database View Dataset Alternatives State Report Topic Page

Top Papers

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs(7.0)
Repurpose backdoor triggers in LLMs for safety, controllability, and accountability, creating a modular and interpretable interface for trustworthy AI.
Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing(7.0)
Quantify and mitigate deceptive behavior in LLMs using a 20-Questions game with parallel-world probing to ensure AI safety.
Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning(7.0)
PACT fine-tuning framework stabilizes LLM safety by constraining confidence on safety tokens, preventing alignment drift during downstream tasks.
"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior(7.0)
Fine-tune LLMs to exhibit 'Dark Triad' personality traits for studying and mitigating misalignment risks, offering a novel approach to AI safety.
Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment(7.0)
Safe Transformer provides an interpretable and controllable safety mechanism for pre-trained language models by introducing an explicit safety bit.
Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR(7.0)
A testbed for studying and mitigating reward hacking in LLMs, offering a clear environment and code for further research.
Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models(7.0)
Self-MOA is an automated framework for aligning small language models using weak supervision to improve safety and helpfulness, reducing reliance on human-annotated data.
DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs(6.0)
DeepContext offers stateful real-time detection of adversarial intent drift in LLMs, outperforming existing guardrails with low-latency processing.
In-the-Wild Model Organisms: Mitigating Undesirable Emergent Behaviors in Production LLM Post-Training via Data Attribution(5.0)
Develop a tool for tracing and mitigating emergent behaviors in LLMs by identifying responsible training data using activation-based data attribution.
The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs(3.0)
This paper analyzes the continuation-triggered jailbreak mechanisms in LLMs to enhance their safety.