Agents' Room: Narrative Generation through Multi-step Collaboration

Authors: Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, Mirella Lapata

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate our method, we introduce TELL ME A STORY, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that AGENTS' ROOM generates stories that are preferred by expert evaluators over those produced by baseline systems, by leveraging collaboration and specialization to decompose the complex story-writing task into tractable components. We provide extensive analysis of the generated output with automated and human-based metrics.
Researcher Affiliation | Industry | Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark & Mirella Lapata, Google DeepMind
Pseudocode | Yes | Algorithm 1: AGENTS' ROOM framework
Open Source Code | No | We release the dataset and metrics at: https://github.com/google-deepmind/tell_me_a_story (this specifically mentions the dataset and metrics, not code for the framework). For reproducibility, we release the TELL ME A STORY dataset on which we conduct our experiments, complete with its train, validation, and test splits, as described in Section 4.3. We specify the model backbones, implementation details, and where to access the checkpoints in Section 5. All prompt templates and scratchpad formatting templates are provided in the Appendix.
Open Datasets | Yes | We introduce TELL ME A STORY, a high-quality dataset of complex writing prompts and human-written stories... We release the dataset and metrics at: https://github.com/google-deepmind/tell_me_a_story
Dataset Splits | Yes | Table 1: Comparison of TELL ME A STORY against existing open-ended story generation benchmarks. We report statistics on the number of training, validation, and testing instances... TELL ME A STORY: 123 train / 52 validation / 55 test
Hardware Specification | No | For all comparison baselines and AGENTS' ROOM agents, we use a Gemini 1.5 Flash backbone, a lightweight and cost-efficient model... For the synthetic training data generation described in Section 4.2, we use Gemini Ultra (Team et al., 2023) as the teacher model.
Software Dependencies | No | For all comparison baselines and AGENTS' ROOM agents, we use a Gemini 1.5 Flash backbone... For the synthetic training data generation described in Section 4.2, we use Gemini Ultra (Team et al., 2023) as the teacher model... we fine-tune our models (E2E_FT and individual agents for AR_FT) using LoRA (Hu et al., 2021)...
Experiment Setup | Yes | We use an input length from {1,024, 2,048, 4,096, 8,192} tokens depending on the length of the scratchpad, and a target token length of 4,096... We perform LoRA-tuning with rank 4 and a learning rate of 1e-6 (picked after a hyperparameter search over {1e-4, 1e-5, 1e-6, 1e-7}). We LoRA-tune for 250 steps with a batch size of 16, saving checkpoints every 20 steps. We then select the checkpoint with the lowest loss on the validation set.
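The checkpoint schedule and selection rule quoted above (save every 20 steps over 250 LoRA-tuning steps, then keep the checkpoint with the lowest validation loss) can be sketched in plain Python. This is an illustrative sketch only; the helper names and the validation-loss values below are hypothetical, not from the paper:

```python
# Sketch of the checkpoint-selection rule described in the quoted setup:
# tune for 250 steps, write a checkpoint every 20 steps, and keep the
# checkpoint whose validation loss is lowest.

SAVE_EVERY = 20   # checkpoint interval from the quoted setup
TOTAL_STEPS = 250  # total LoRA-tuning steps from the quoted setup


def checkpoint_steps(total_steps: int, save_every: int) -> list[int]:
    """Steps at which a checkpoint is written (every `save_every` steps)."""
    return list(range(save_every, total_steps + 1, save_every))


def select_best(val_losses: dict[int, float]) -> int:
    """Return the checkpoint step with the lowest validation loss."""
    return min(val_losses, key=val_losses.get)


if __name__ == "__main__":
    steps = checkpoint_steps(TOTAL_STEPS, SAVE_EVERY)
    # Hypothetical validation losses, one per saved checkpoint.
    losses = {s: 2.0 - 0.004 * s if s <= 200 else 2.0 - 0.004 * s + 0.002 * s
              for s in steps}
    print(select_best(losses))
```

In practice the same rule maps onto standard fine-tuning toolchains (e.g. saving at a fixed step interval and reloading the best checkpoint by validation metric); the sketch only makes the selection logic explicit.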