Multi-turn conversations with Action-Based Contrastive Self-Training

By admin | October 6, 2025 | 2 Mins Read

Are action-based preferences necessary? A key feature of ACT is that its contrastive pairs highlight differences between conversational actions. In the “ACT w/ Random Actions” ablation, we test the importance of action selection by randomly sampling both the winning and the losing action when constructing each preference pair, and observe that this underperforms standard ACT.
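To make the contrast concrete, here is a minimal sketch of the two pair-construction strategies. The two-action space (`CLARIFY`, `ANSWER`), the `PreferencePair` type, and both function names are assumptions for illustration; the post does not specify how the pairs are represented.

```python
import random
from dataclasses import dataclass

# Hypothetical two-action space; the post does not enumerate the actions.
ACTIONS = ("CLARIFY", "ANSWER")


@dataclass
class PreferencePair:
    prompt: str
    chosen_action: str    # action expressed by the winning response
    rejected_action: str  # action expressed by the losing response


def action_contrast_pair(prompt: str, gold_action: str) -> PreferencePair:
    """Winning action is the annotated one; losing action is a different one."""
    rejected = next(a for a in ACTIONS if a != gold_action)
    return PreferencePair(prompt, gold_action, rejected)


def random_action_pair(prompt: str, rng: random.Random) -> PreferencePair:
    """Ablation-style pair: both actions are drawn at random, ignoring the label."""
    return PreferencePair(prompt, rng.choice(ACTIONS), rng.choice(ACTIONS))
```

In the random variant the chosen and rejected actions can coincide or contradict the annotation, so the pair no longer carries a reliable signal about which action was appropriate.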

Do we need on-policy sampling? In “ACT w/o on-policy sampling”, we examine the importance of on-policy sampling by evaluating standard off-policy DPO on the dataset constructed in Phase 1. While we do observe some improvements over SFT (e.g., from 69.0 to 74.8 Macro F1), the overall improvements are much larger with on-policy sampling, as in full ACT. This may be because the off-policy negative responses are not guaranteed to lie in the language manifold of the policy model, and the resulting distribution shift may be too difficult to overcome with off-policy learning.
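For reference, the standard DPO objective for a single preference pair can be sketched as follows. The function name, argument names, and the default `beta` are assumptions for illustration, and the inputs are summed sequence log-probabilities. The loss is identical in the off-policy and on-policy settings; what changes is whether the rejected response comes from a fixed dataset or is sampled from the current policy.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the post
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference model agree on both responses, the margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one.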

Is trajectory simulation necessary? ACT is better aligned with multi-turn conversations because of its trajectory simulation. Without multi-turn simulation, our approach resembles on-policy DPO variants such as IRPO, but with a conversation-specific reward signal that accounts for conversation actions and task heuristics. In “ACT w/ sampling w/o simulation”, we find that this trajectory-level simulation is critical to improving multi-turn performance, especially the policy model’s ability to reason about its own clarification questions.
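One plausible way to combine on-policy sampling with trajectory simulation is sketched below. This is an illustration under stated assumptions, not the authors' implementation: `sample_response`, `classify_action`, and `simulate_outcome` are hypothetical callables standing in for the policy sampler, an action classifier, and a user simulator plus task check.

```python
from typing import Callable, Tuple


def build_on_policy_pair(
    context: str,
    gold_action: str,
    phase1_chosen: str,
    phase1_rejected: str,
    sample_response: Callable[[str], str],         # hypothetical policy sampler
    classify_action: Callable[[str], str],         # hypothetical action classifier
    simulate_outcome: Callable[[str, str], bool],  # hypothetical simulator + task check
) -> Tuple[str, str]:
    """Return (chosen, rejected) responses for one conversation turn."""
    sampled = sample_response(context)
    if classify_action(sampled) != gold_action:
        # Wrong action: the on-policy sample becomes the losing response.
        return phase1_chosen, sampled
    if not simulate_outcome(context, sampled):
        # Right action, but the simulated trajectory fails the task check.
        return phase1_chosen, sampled
    # Right action and successful trajectory: the sample becomes the winner.
    return sampled, phase1_rejected
```

Dropping the `simulate_outcome` check reduces this sketch to action-level on-policy sampling, which corresponds in spirit to the “w/ sampling w/o simulation” setting described above.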

Is ACT model agnostic? The base model in our main experiments, Zephyr, is obtained by aligning Mistral. In “ACT with unaligned foundation models”, we observe a performance gap of 6.5 Action F1 and 4.3 Trajectory F1 between the two models after ACT tuning. Even so, our results demonstrate that ACT can improve performance regardless of pre-existing alignment with human feedback, although such alignment can help by providing a better model initialization. Overall, we find that ACT's improvement over base model performance is model agnostic.
