CollabLLM: From Passive Responders to Active Collaborators
Authors: Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | COLLABLLM significantly outperforms our baselines with averages of 18.5% higher task performance and 46.3% improved interactivity by LLM judges. Finally, we conduct a large user study with 201 judges, where COLLABLLM increases user satisfaction by 17.6% and reduces user spent time by 10.4% |
| Researcher Affiliation | Collaboration | ¹Stanford University, ²Microsoft, ³Georgia Tech. Correspondence to: EMAIL, EMAIL. |
| Pseudocode | No | The paper describes methods and mathematical formulations but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | To promote future research in this societally beneficial direction, we release all the code, models, data, benchmarks, and user simulators described in this work. |
| Open Datasets | Yes | For fine-tuning and evaluation, we create three multiturn datasets using publicly available data across diverse domains (Hendrycks et al., 2021; Zhuo et al., 2024; Chiusano, 2024): collaborative document editing, coding problem assistance, and multiturn mathematics problem solving. |
| Dataset Splits | No | The paper states it samples a specific number of items for each task (e.g., "We sample 100 Medium articles", "We sample 600 coding problems", "We sample 200 level-5 math problems") and mentions "three test sets" in the abstract, but does not specify the train/test/validation splits or percentages used from these sampled items. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using "Llama-3.1-8B" and LoRA fine-tuning but does not specify versions for other ancillary software dependencies such as Python, PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | Table 6: Hyperparameters for LoRA configuration, different stages of fine-tuning, and COLLABLLM-specific fine-tuning. This table lists specific values for Rank r, Scaling factor α, Dropout, Bias, Learning rate, Total batch size, Number of epochs, Window size w, Sample size for MR, and Penalty λ. |
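The hyperparameters enumerated in Table 6 map naturally onto a small configuration object. The sketch below is illustrative only: the class name and all field values are placeholders chosen for demonstration, not the paper's reported settings, which should be read from Table 6 directly.

```python
from dataclasses import dataclass


@dataclass
class CollabLLMFinetuneConfig:
    """Hyperparameter groups mirroring Table 6 (placeholder values, not the paper's)."""

    # LoRA configuration
    lora_rank: int = 8           # Rank r
    lora_alpha: float = 16.0     # Scaling factor α
    lora_dropout: float = 0.05   # Dropout
    lora_bias: str = "none"      # Bias handling

    # Fine-tuning stage
    learning_rate: float = 2e-5
    total_batch_size: int = 32
    num_epochs: int = 3

    # COLLABLLM-specific fine-tuning
    window_size_w: int = 2       # Window size w
    mr_sample_size: int = 4      # Sample size for MR
    penalty_lambda: float = 0.1  # Penalty λ


cfg = CollabLLMFinetuneConfig()
```

Grouping the values this way makes it easy to see at a glance which settings belong to LoRA, which to the generic fine-tuning stages, and which are specific to the COLLABLLM objective.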