Papers
Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information
Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression–Consistency Principle: next-token prediction favors hypotheses that al...
Quantifying the Necessity of Chain of Thought through Opaque Serial Depth
Large language models (LLMs) tend to externalize their reasoning in their chain of thought, making it a good target for monitoring. This is partially an inherent feature of the Trans...
Think Before You Lie: How Reasoning Improves Honesty
While existing evaluations of large language models (LLMs) measure deception rates, the underlying conditions that give rise to deceptive behavior are poorly understood. We investigate this question u...
A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models
How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain ge...
Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects
Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no...
One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations
Do the features learned by Sparse Autoencoders (SAEs) represent abstract meaning, or are they tied to how text is written? We investigate this question using Serbian digraphia as a controlled testbed:...
Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias
The "Lost in the Middle" phenomenon – a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle – is widely attributed to learned Softmax...