Papers
Research Paper·Jan 28, 2026
How does information access affect LLM monitors' ability to detect sabotage?
Frontier language model agents can exhibit misaligned behaviors, including deception, reward hacking, and the pursuit of hidden objectives. To control potentially misaligned agents, we can use LLMs...
9.0 viability
Research Paper·Mar 4, 2026
Self-Attribution Bias: When AI Monitors Go Easy on Themselves
Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self-critique generated code for pull request approval or assess the safety of tool-u...
5.0 viability
Research Paper·Mar 5, 2026
Reasoning Models Struggle to Control their Chains of Thought
Chain-of-thought (CoT) monitoring is a promising tool for detecting misbehaviors and understanding the motivations of modern reasoning models. However, if models can control what they verbalize in the...
3.0 viability