Agents' Room: Narrative Generation through Multi-step Collaboration
Authors: Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, Mirella Lapata
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate our method, we introduce TELL ME A STORY, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that AGENTS' ROOM generates stories that are preferred by expert evaluators over those produced by baseline systems by leveraging collaboration and specialization to decompose the complex story writing task into tractable components. We provide extensive analysis with automated and human-based metrics of the generated output. |
| Researcher Affiliation | Industry | Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark & Mirella Lapata, Google DeepMind, EMAIL |
| Pseudocode | Yes | Algorithm 1: AGENTS' ROOM framework |
| Open Source Code | No | "We release the dataset and metrics at: https://github.com/google-deepmind/tell_me_a_story" (this specifically mentions the dataset and metrics, not code for the framework). For reproducibility, we release the TELL ME A STORY dataset on which we conduct our experiments, complete with its train, validation, and test splits, as described in Section 4.3. We specify the model backbones, implementation details, and where to access the checkpoints in Section 5. All prompt templates and scratchpad formatting templates are provided in the Appendix. |
| Open Datasets | Yes | We introduce TELL ME A STORY, a high-quality dataset of complex writing prompts and human-written stories... We release the dataset and metrics at: https://github.com/google-deepmind/tell_me_a_story |
| Dataset Splits | Yes | Table 1: Comparison of TELL ME A STORY against existing open-ended story generation benchmarks. We report statistics on the number of training, validation, and testing instances... TELL ME A STORY: 123 train / 52 validation / 55 test |
| Hardware Specification | No | For all comparison baselines and AGENTS' ROOM agents, we use a Gemini 1.5 Flash backbone, a lightweight and cost-efficient model... For the synthetic training data generation described in Section 4.2, we use Gemini Ultra (Team et al., 2023) as the teacher model. |
| Software Dependencies | No | For all comparison baselines and AGENTS' ROOM agents, we use a Gemini 1.5 Flash backbone... For the synthetic training data generation described in Section 4.2, we use Gemini Ultra (Team et al., 2023) as the teacher model... we fine-tune our models (E2E-FT and individual agents for AR-FT) using LoRA (Hu et al., 2021)... |
| Experiment Setup | Yes | We use an input length from {1,024, 2,048, 4,096, 8,192} tokens depending on the length of the scratchpad and a target token length of 4,096... We perform LoRA-tuning with rank 4 and a learning rate of 1e-6 (picked after a hyperparameter search through {1e-4, 1e-5, 1e-6, 1e-7}). We LoRA-tune for 250 steps with a batch size of 16, saving checkpoints every 20 steps. We then select the checkpoint with lowest loss on the validation set. |
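The checkpoint-selection procedure quoted above (save a checkpoint every 20 steps over 250 LoRA-tuning steps, then keep the one with the lowest validation loss) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code; the `select_best_checkpoint` function and the synthetic loss curve are invented for illustration.

```python
# Hypothetical sketch of the checkpoint-selection step: checkpoints are saved
# every 20 steps during 250 tuning steps, and the one with the lowest
# validation loss is kept. The loss values below are synthetic.

def select_best_checkpoint(val_losses):
    """Return the step whose checkpoint has the lowest validation loss."""
    return min(val_losses, key=val_losses.get)

# Synthetic validation loss recorded at each saved checkpoint (steps 20..240).
val_losses = {step: 2.0 - 0.004 * step + 0.000015 * step**2
              for step in range(20, 241, 20)}
best_step = select_best_checkpoint(val_losses)  # step with minimum loss
```

With this synthetic convex loss curve, the minimum among the saved steps falls at step 140.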