Agents' Room: Narrative Generation through Multi-step Collaboration
Authors: Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, Mirella Lapata
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To illustrate our method, we introduce TELL ME A STORY, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that AGENTS' ROOM generates stories that are preferred by expert evaluators over those produced by baseline systems by leveraging collaboration and specialization to decompose the complex story writing task into tractable components. We provide extensive analysis with automated and human-based metrics of the generated output. |
| Researcher Affiliation | Industry | Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark & Mirella Lapata, Google DeepMind, EMAIL |
| Pseudocode | Yes | Algorithm 1: AGENTS' ROOM framework |
| Open Source Code | No | "We release the dataset and metrics at: https://github.com/google-deepmind/tell_me_a_story" (this specifically mentions the dataset and metrics, not code for the framework). For reproducibility, we release the TELL ME A STORY dataset on which we conduct our experiments, complete with its train, validation, and test splits, as described in Section 4.3. We specify the model backbones, implementation details, and where to access the checkpoints in Section 5. All prompt templates and scratchpad formatting templates are provided in the Appendix. |
| Open Datasets | Yes | We introduce TELL ME A STORY, a high-quality dataset of complex writing prompts and human-written stories... We release the dataset and metrics at: https://github.com/google-deepmind/tell_me_a_story |
| Dataset Splits | Yes | Table 1: Comparison of TELL ME A STORY against existing open-ended story generation benchmarks. We report statistics on the number of training, validation, and testing instances... TELL ME A STORY: 123 train / 52 validation / 55 test |
| Hardware Specification | No | For all comparison baselines and AGENTS' ROOM agents, we use a Gemini 1.5 Flash backbone, a lightweight and cost-efficient model... For the synthetic training data generation described in Section 4.2, we use Gemini Ultra (Team et al., 2023) as the teacher model. |
| Software Dependencies | No | For all comparison baselines and AGENTS' ROOM agents, we use a Gemini 1.5 Flash backbone... For the synthetic training data generation described in Section 4.2, we use Gemini Ultra (Team et al., 2023) as the teacher model... we fine-tune our models (E2E-FT and individual agents for AR-FT) using LoRA (Hu et al., 2021)... |
| Experiment Setup | Yes | We use an input length from {1,024, 2,048, 4,096, 8,192} tokens depending on the length of the scratchpad and a target token length of 4,096... We perform LoRA-tuning with rank 4 and a learning rate of 1e-6 (picked after a hyperparameter search through {1e-4, 1e-5, 1e-6, 1e-7}). We LoRA-tune for 250 steps with a batch size of 16, saving checkpoints every 20 steps. We then select the checkpoint with lowest loss on the validation set. |
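The checkpoint-selection procedure quoted above (save a checkpoint every 20 steps over 250 LoRA-tuning steps, then keep the one with the lowest validation loss) can be sketched as follows. This is a minimal illustrative sketch, not the authors' code; the `select_best_checkpoint` function and the synthetic loss curve are invented for illustration.

```python
# Hypothetical sketch of the checkpoint-selection step: checkpoints are saved
# every 20 steps during 250 tuning steps, and the one with the lowest
# validation loss is kept. The loss values below are synthetic.

def select_best_checkpoint(val_losses):
    """Return the step whose checkpoint has the lowest validation loss."""
    return min(val_losses, key=val_losses.get)

# Synthetic validation loss recorded at each saved checkpoint (steps 20..240).
val_losses = {step: 2.0 - 0.004 * step + 0.000015 * step**2
              for step in range(20, 241, 20)}
best_step = select_best_checkpoint(val_losses)  # step with minimum loss
```

With this synthetic convex loss curve, the minimum among the saved steps falls at step 140.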