Papers
Research Paper·Jan 28, 2026
How does information access affect LLM monitors' ability to detect sabotage?
Frontier language model agents can exhibit misaligned behaviors, including deception, reward hacking, and the pursuit of hidden objectives. To control potentially misaligned agents, we can use LLMs...
9.0 viability
Research Paper·Mar 4, 2026
Self-Attribution Bias: When AI Monitors Go Easy on Themselves
Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self-critique generated code for pull request approval or assess the safety of tool-u...
5.0 viability
Research Paper·Mar 5, 2026
Reasoning Models Struggle to Control their Chains of Thought
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in the...
3.0 viability