State of the Field
Recent research in AI alignment increasingly focuses on refining methods that keep large language models (LLMs) consistent with human values and preferences while addressing both computational efficiency and the difficulty of conflicting objectives. New frameworks such as LLMdoctor and RIFT rework alignment strategies by optimizing behavior at test time and by repurposing negative samples, respectively, reducing reliance on costly expert data. Meanwhile, approaches such as Democratic Preference Alignment emphasize demographic representativeness in the rater pool, aiming for models that better reflect diverse human values. The Value Alignment Tax, in turn, offers a new lens on how alignment interventions reshape value systems dynamically, surfacing systemic risks that traditional evaluations overlook. Together, these advances signal a shift toward more nuanced and efficient alignment techniques, with potential applications in improving the reliability of AI systems across commercial domains such as healthcare, finance, and customer service.
Papers
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behavio...
LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language Models
Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising...
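The abstract is truncated here, but the core idea of test-time alignment can be illustrated generically: steer a frozen model at decoding time by adding a token-level preference signal to its logits, with no fine-tuning. The toy sketch below is in that spirit only; base_logits and token_reward are invented stand-ins, and LLMdoctor's actual flow-guided estimator is not reproduced.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 8  # toy vocabulary

    def base_logits(prefix):
        # Stand-in for the frozen base LLM's next-token logits.
        return rng.normal(size=VOCAB)

    def token_reward(prefix, token):
        # Stand-in for a token-level preference signal; the paper's
        # flow-guided estimator is not reproduced here.
        return 1.0 if token % 2 == 0 else -1.0

    def guided_step(prefix, beta=0.5):
        logits = base_logits(prefix)
        rewards = np.array([token_reward(prefix, t) for t in range(VOCAB)])
        # Test-time alignment: shift logits by a scaled reward, no fine-tuning.
        adj = logits + beta * rewards
        probs = np.exp(adj - adj.max())
        probs /= probs.sum()
        return rng.choice(VOCAB, p=probs)

    prefix = []
    for _ in range(5):
        prefix.append(int(guided_step(prefix)))
    print(prefix)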
RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning
While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data...
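RIFT's exact objective is not shown in this excerpt; one common way to repurpose negative samples rather than discard them (as RFT does) is advantage-weighted likelihood, sketched below under that assumption. reward_informed_loss is an illustrative name, not the paper's API.

    import numpy as np

    def reward_informed_loss(logprobs, rewards, baseline=None):
        """Advantage-weighted NLL over sampled responses.

        logprobs: per-response sequence log-likelihoods under the policy.
        rewards:  scalar reward per response (negatives included, not discarded).
        One plausible reading of reward-informed fine-tuning: weight each
        response's likelihood term by its centred reward, so low-reward
        samples push probability mass away instead of being thrown out.
        """
        logprobs = np.asarray(logprobs, dtype=float)
        rewards = np.asarray(rewards, dtype=float)
        if baseline is None:
            baseline = rewards.mean()  # simple variance-reducing baseline
        advantages = rewards - baseline
        return float(-(advantages * logprobs).mean())

    # Toy batch: two preferred and two rejected responses.
    print(reward_informed_loss(logprobs=[-3.1, -2.7, -4.0, -3.5],
                               rewards=[1.0, 0.8, 0.1, 0.0]))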
Reward-free Alignment for Conflicting Objectives
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where ...
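As a point of reference for reward-free multi-objective alignment: direct alignment methods like DPO avoid an explicit reward model, and conflicting objectives can in principle be scalarized per preference pair. The sketch below shows that generic scalarization, not the paper's own algorithm; the margins and weights are hypothetical.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def dpo_loss(margin, beta=0.1):
        # Standard DPO on one objective: margin is the policy-vs-reference
        # log-ratio difference between the chosen and rejected response.
        return -np.log(sigmoid(beta * margin))

    def multi_objective_dpo(margins, weights):
        """Scalarized multi-objective direct alignment.

        margins: one preference margin per objective (e.g. helpfulness,
        harmlessness) for the same response pair; weights trade them off.
        A generic scalarization sketch, not the paper's method.
        """
        margins = np.asarray(margins, dtype=float)
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        return float(np.dot(weights, [dpo_loss(m) for m in margins]))

    # Helpfulness prefers this pair strongly; harmlessness mildly opposes it.
    print(multi_objective_dpo(margins=[2.0, -0.5], weights=[0.5, 0.5]))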
Democratic Preference Alignment via Sortition-Weighted RLHF
Whose values should AI systems learn? Preference based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systemat...
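The sortition idea can be grounded with a standard post-stratification calculation: reweight a convenience rater pool so its demographic composition matches a target population. This is an illustrative proxy for the paper's weighting scheme, with hypothetical group labels and shares.

    from collections import Counter

    def sortition_weights(rater_groups, population_shares):
        """Post-stratification weights for a convenience rater pool.

        rater_groups: demographic group label per rater.
        population_shares: target share of each group in the population.
        Each rater's labels get weight population_share / pool_share, so
        the weighted pool matches the target demographics, in the spirit
        of a sortition-style panel. Illustrative, not the paper's scheme.
        """
        n = len(rater_groups)
        pool_share = {g: c / n for g, c in Counter(rater_groups).items()}
        return [population_shares[g] / pool_share[g] for g in rater_groups]

    # A pool that over-represents group "A" relative to a 50/50 population.
    raters = ["A", "A", "A", "B"]
    print(sortition_weights(raters, {"A": 0.5, "B": 0.5}))
    # -> A-raters downweighted (~0.67), the B-rater upweighted (2.0)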
PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding
Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human preferences and values. However, most existing alignment approaches operate at training time and rely o...
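Contrastive decoding itself is a known technique, so a minimal version of the polarity-prompt variant the title suggests can be sketched: score each next token under a positive-polarity prompt and a negative one, then amplify the difference at decoding time. logits_under is an invented stand-in for the same LLM conditioned on each prompt.

    import numpy as np

    rng = np.random.default_rng(1)
    VOCAB = 8

    def logits_under(prompt_polarity, prefix):
        # Stand-in for the same LLM conditioned on a polarity prompt,
        # e.g. "be honest and harmless" vs. its negated counterpart.
        return rng.normal(size=VOCAB)

    def contrastive_step(prefix, alpha=1.0):
        pos = logits_under("positive", prefix)
        neg = logits_under("negative", prefix)
        # Amplify tokens the desirable persona prefers over the undesirable one.
        adj = pos + alpha * (pos - neg)
        return int(np.argmax(adj))

    print(contrastive_step(prefix=[]))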
Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies
This paper introduces a methodological framework for empirically testing AI alignment strategies through structured multi-model dialogue. Drawing on Peace Studies traditions - particularly interest-ba...
Density-Guided Response Optimization: Community-Grounded Alignment via Implicit Acceptance Signals
Language models deployed in online communities must adapt to norms that vary across social, cultural, and domain-specific contexts. Prior alignment approaches rely on explicit preference supervision o...
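How implicit acceptance signals might replace explicit preference labels is sketchable: treat the density of community-accepted responses in embedding space as a reward. The kernel-density toy below is one speculative reading, not the paper's method; the embeddings and bandwidth are invented.

    import numpy as np

    def acceptance_density(candidate, accepted, bandwidth=0.5):
        """Implicit reward from community-accepted responses.

        accepted: embeddings of responses the community accepted (upvotes,
        no moderation, etc.); implicit signals, no explicit labels.
        Scores a candidate by a Gaussian kernel density over that set;
        higher density means closer to community norms.
        """
        accepted = np.asarray(accepted, dtype=float)
        d2 = ((accepted - np.asarray(candidate)) ** 2).sum(axis=1)
        return float(np.exp(-d2 / (2 * bandwidth ** 2)).mean())

    accepted = [[0.0, 0.0], [0.2, 0.1], [0.1, -0.1]]   # toy 2-d embeddings
    print(acceptance_density([0.1, 0.0], accepted))    # near the norm: high
    print(acceptance_density([2.0, 2.0], accepted))    # off-norm: near zero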
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerge...
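A "diffusion-native" reward, as opposed to a VLM-based one, plausibly means scoring intermediate latents directly rather than decoding to pixels and querying a VLM at each step. The toy head below only illustrates that cost argument; its architecture, pooling, and dimensions are invented for the example.

    import numpy as np

    rng = np.random.default_rng(2)
    LATENT_DIM, HIDDEN = 16, 32

    # Toy reward head over pooled diffusion/flow latents: scores the latent
    # directly instead of decoding to pixels and querying a VLM.
    W1 = rng.normal(scale=0.1, size=(LATENT_DIM, HIDDEN))
    w2 = rng.normal(scale=0.1, size=HIDDEN)

    def latent_reward(latents):
        # latents: (tokens, LATENT_DIM) array from an intermediate step.
        pooled = np.asarray(latents).mean(axis=0)   # cheap global pooling
        hidden = np.tanh(pooled @ W1)               # tiny MLP head
        return float(hidden @ w2)

    x = rng.normal(size=(64, LATENT_DIM))           # stand-in latent grid
    print(latent_reward(x))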
Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
Existing work on value alignment typically characterizes value relations statically, ignoring how interventions - such as prompting, fine-tuning, or preference optimization - reshape the broader value...
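The tax metaphor invites a concrete measurement: compare per-value scores before and after an intervention that targets one value, and report collateral losses on the others. The function below is an illustrative proxy under that reading; the paper's actual metric may differ, and the value names and scores are hypothetical.

    def value_alignment_tax(before, after, target):
        """One simple way to quantify a value alignment tax.

        before/after: value -> score (e.g. probe accuracy or endorsement
        rate) measured before and after an intervention aimed at `target`.
        Reports the target's gain and the summed losses on other values.
        """
        gain = after[target] - before[target]
        tax = sum(max(0.0, before[v] - after[v])
                  for v in before if v != target)
        return gain, tax

    before = {"honesty": 0.70, "care": 0.80, "fairness": 0.75}
    after  = {"honesty": 0.85, "care": 0.72, "fairness": 0.70}
    gain, tax = value_alignment_tax(before, after, target="honesty")
    print(f"target gain={gain:.2f}, tax on other values={tax:.2f}")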