CollabLLM: From Passive Responders to Active Collaborators
Authors: Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | COLLABLLM significantly outperforms our baselines with averages of 18.5% higher task performance and 46.3% improved interactivity by LLM judges. Finally, we conduct a large user study with 201 judges, where COLLABLLM increases user satisfaction by 17.6% and reduces user spent time by 10.4% |
| Researcher Affiliation | Collaboration | ¹Stanford University, ²Microsoft, ³Georgia Tech. Correspondence to: EMAIL, EMAIL. |
| Pseudocode | No | The paper describes methods and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To promote future research in this societally beneficial direction, we release all the code, models, data, benchmarks, and user simulators described in this work. |
| Open Datasets | Yes | For fine-tuning and evaluation, we create three multiturn datasets using publicly available data across diverse domains (Hendrycks et al., 2021; Zhuo et al., 2024; Chiusano, 2024): collaborative document editing, coding problem assistance, and multiturn mathematics problem solving. |
| Dataset Splits | No | The paper states it samples a specific number of items for each task (e.g., "We sample 100 Medium articles", "We sample 600 coding problems", "We sample 200 level-5 math problems") and mentions "three test sets" in the abstract, but does not specify the train/test/validation splits or percentages used from these sampled items. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Llama-3.1-8B" and LoRA fine-tuning but does not specify versions for other ancillary software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | Table 6: Hyperparameters for LoRA configuration, different stages of fine-tuning, and COLLABLLM-specific fine-tuning. This table lists specific values for Rank r, Scaling factor α, Dropout, Bias, Learning rate, Total batch size, Number of epochs, Window size w, Sample size for MR, and Penalty λ. |
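The hyperparameters enumerated in Table 6 map naturally onto a small configuration object. The sketch below is illustrative only: the class name and all field values are placeholders chosen for demonstration, not the paper's reported settings, which should be read from Table 6 directly.

```python
from dataclasses import dataclass


@dataclass
class CollabLLMFinetuneConfig:
    """Hyperparameter groups mirroring Table 6 (placeholder values, not the paper's)."""

    # LoRA configuration
    lora_rank: int = 8           # Rank r
    lora_alpha: float = 16.0     # Scaling factor α
    lora_dropout: float = 0.05   # Dropout
    lora_bias: str = "none"      # Bias handling

    # Fine-tuning stage
    learning_rate: float = 2e-5
    total_batch_size: int = 32
    num_epochs: int = 3

    # COLLABLLM-specific fine-tuning
    window_size_w: int = 2       # Window size w
    mr_sample_size: int = 4      # Sample size for MR
    penalty_lambda: float = 0.1  # Penalty λ


cfg = CollabLLMFinetuneConfig()
```

Grouping the values this way makes it easy to see at a glance which settings belong to LoRA, which to the generic fine-tuning stages, and which are specific to the COLLABLLM objective.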