Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

Authors: Maximillian Chen, Ruoxi Sun, Tomas Pfister, Sercan Arik

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate ACT's efficacy in data-efficient tuning scenarios, even when no action labels are available, using multiple real-world conversational tasks: tabular-grounded question answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation for data-analysis agents. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation-modeling improvements over standard tuning approaches such as supervised fine-tuning and DPO.
Researcher Affiliation | Industry | Maximillian Chen (1,2), Ruoxi Sun (1), Tomas Pfister (1), Sercan Ö. Arık (1); 1 Google, 2 Columbia University. EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 (Building Contrastive Action Pairs): input Dataset D, Conditional Generation Model M, Action Space S, Action Annotation Agent G. Algorithm 2 (ACT: Action-Based Contrastive Self-Training): input Initial Policy Model πθ0, Action Contrast Dataset Dpref, Number of Batches B, Action Classifier A, User Simulator U, Task Heuristic H, Heuristic Tolerance ϵ.
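The row above lists only the inputs of Algorithm 1. As an illustrative sketch of how those inputs could fit together (not the authors' code — `generate`, `annotate_action`, and the binary action space are all hypothetical stand-ins), the pairing loop might look like:

```python
# Hypothetical sketch of Algorithm 1 (Building Contrastive Action Pairs).
# All names here are illustrative stand-ins, not the paper's API.

ACTION_SPACE = ["CLARIFY", "ANSWER"]  # S: assumed binary action space


def build_contrastive_pairs(dataset, generate, annotate_action):
    """dataset: list of (context, gold_response) turns (D).
    generate: conditional generation model M, context -> response.
    annotate_action: action annotation agent G, response -> action in S.
    Returns DPO-style preference records (prompt, chosen, rejected)."""
    pairs = []
    for context, gold_response in dataset:
        gold_action = annotate_action(gold_response)
        sampled = generate(context)
        sampled_action = annotate_action(sampled)
        # Only turns where the sampled action contradicts the gold action
        # yield a contrastive pair: gold response wins, sample loses.
        if sampled_action != gold_action:
            pairs.append({"prompt": context,
                          "chosen": gold_response,
                          "rejected": sampled})
    return pairs
```

The dictionary layout mirrors the common prompt/chosen/rejected format used by preference-tuning libraries, which keeps the pairs directly usable for the DPO-style objective the paper compares against.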
Open Source Code | No | The code to create AmbigSQL will be released publicly.
Open Datasets | Yes | PACIFIC (Deng et al., 2022): MIT Open-Source License. https://github.com/dengyang17/PACIFIC/tree/main Abg-CoQA (Guo et al., 2021): MIT Open-Source License. https://github.com/MeiqiGuo/AKBC2021-Abg-CoQA Spider (Yu et al., 2018): CC BY-SA 4.0. https://yale-lily.github.io/spider
Dataset Splits | Yes | Table A10: Overview of AmbigSQL, an ambiguous text-to-SQL dataset synthesized from Spider.

                            Train   Dev     Test
Num. Unambiguous Requests   7,000   1,034   1,034
Num. Ambiguous Requests     7,000   1,034   1,034
Num. Unique Schemas         1,056   145     145
Types of Ambiguity          3       3       3
Hardware Specification | Yes | We conduct all experiments using one Google Compute Engine virtual machine with 8x 80GB A100 GPUs.
Software Dependencies | No | The paper mentions software such as PyTorch (Paszke et al., 2019), Hugging Face Transformers (Wolf et al., 2020), and the Vertex AI SDK, but provides only general citations and licenses, with no explicit version numbers for these dependencies.
Experiment Setup | Yes | For all of our SFT experiments with Zephyr, Mistral, and Gemma, we tune the model for up to 8 epochs, choosing the best-performing model with learning rates from {1e-4, 2e-5, 1e-5} using the AdamW optimizer. For our SFT experiments with Gemini Pro, we use the Vertex AI API and tune for up to 4 epochs with an adapter size of 4. For all of our RL tuning experiments, we allow the model to train for up to 12 epochs and select the checkpoint that yields the highest reward margin on the validation set... For all experiments, we use a batch size of 4 and a maximum sequence length of 1,280. Hyperparameters for Equation 2: for experiments with Zephyr 7B on PACIFIC, we achieve our strongest results using β = 0.01 and a learning rate of 5e-7; on AmbigSQL, we use β = 0.5 and a learning rate of 5e-7.
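The β values above parameterize the Equation 2 objective; assuming it takes the standard DPO form (the paper explicitly compares against DPO), a minimal sketch of how β enters the per-pair loss — with all argument names illustrative, not the authors' code — is:

```python
import math


def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """Standard per-pair DPO loss (assumed form of Equation 2).
    beta scales the policy-vs-reference log-ratios, controlling how far
    the policy may drift from the reference model; the row above reports
    beta = 0.01 on PACIFIC and beta = 0.5 on AmbigSQL."""
    # Implicit rewards: scaled log-ratios of policy vs. reference.
    chosen_reward = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (pi_rejected_logp - ref_rejected_logp)
    # -log sigmoid(margin): small when the chosen response outscores
    # the rejected one under the policy relative to the reference.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With equal log-probabilities the margin is zero and the loss is log 2; a larger β amplifies the same log-probability gap, which is consistent with tuning β per dataset as described above.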