ResearchTown: Simulator of Human Research Community

Authors: Haofei Yu, Zhaochen Hong, Zirui Cheng, Kunlun Zhu, Keyang Xuan, Jinwei Yao, Tao Feng, Jiaxuan You

ICML 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments reveal three key findings: (1) RESEARCHTOWN can provide a realistic simulation of collaborative research activities, including paper writing and review writing; (2) RESEARCHTOWN can maintain robust simulation with multiple researchers and diverse papers; (3) RESEARCHTOWN can generate interdisciplinary research ideas that potentially inspire pioneering research directions. |
| Researcher Affiliation | Academia | University of Illinois Urbana-Champaign. Correspondence to: Haofei Yu <EMAIL>, Jiaxuan You <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: RESEARCHTOWN simulation algorithm |
| Open Source Code | Yes | Code: https://github.com/ulab-uiuc/research-town |
| Open Datasets | Yes | Data: https://huggingface.co/datasets/ulab-ai/research-bench |
| Dataset Splits | Yes | To allow more fine-grained analysis, we split these 1,000 paper-writing tasks into three subgroups based on their difficulty level. We use the data-agg settings described in Section 7 to obtain results and compute similarity scores for our simulations. We then divide the dataset into three equal subsets: the worst 333 data points (hard), the middle 334 data points (medium), and the top 333 data points (easy). This results in a more granular categorization of the dataset's difficulty. |
| Hardware Specification | No | The paper mentions using LLMs such as GPT-4o-mini, Qwen-2.5-7B-Instruct, and Deepseek-v3, and embedding models such as text-embedding-3-large and voyage-3, often via APIs (e.g., "accessed via the OpenAI API", "via the together.ai inference API"). However, it does not specify the underlying hardware (e.g., CPU or GPU models, memory) on which these experiments were run or on which the APIs operate. |
| Software Dependencies | Yes | We utilize GPT-4o-mini-2024-07-18 accessed via the OpenAI API. We use Qwen-2.5-7B-Instruct-Turbo and Deepseek-v3-0324 via the together.ai inference API. We utilize the official inference APIs provided by OpenAI and Voyage AI to use text-embedding-3-large and voyage-3, respectively. |
| Experiment Setup | Yes | We utilize GPT-4o-mini as the LLM backbone for implementing the agent functions, with the decoding temperature set to 0 to ensure reproducibility. |
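The difficulty split described under Dataset Splits can be sketched as follows: rank the 1,000 paper-writing tasks by their simulation similarity score, then cut the ranking into 333 / 334 / 333 subsets. The task list and scores below are hypothetical stand-ins for the paper's actual data.

```python
# Sketch of the difficulty-based split: rank tasks by similarity score,
# then cut into hard (worst 333), medium (middle 334), and easy (top 333).
# `tasks` and `scores` are hypothetical stand-ins for the real data.

def split_by_difficulty(tasks, scores):
    """Return (hard, medium, easy) subsets of sizes 333 / 334 / 333."""
    ranked = [t for _, t in sorted(zip(scores, tasks))]  # ascending score
    hard = ranked[:333]        # worst-scoring 333 data points
    medium = ranked[333:667]   # middle 334 data points
    easy = ranked[667:]        # top-scoring 333 data points
    return hard, medium, easy

tasks = [f"task_{i}" for i in range(1000)]
scores = [(i * 37) % 1000 for i in range(1000)]  # dummy similarity scores
hard, medium, easy = split_by_difficulty(tasks, scores)
print(len(hard), len(medium), len(easy))  # 333 334 333
```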
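The Experiment Setup row amounts to pinning a deterministic decoding configuration. A minimal sketch, assuming the standard `openai` Python client; the prompt and the `build_request` helper are hypothetical, since the report does not show the agent-function prompts:

```python
# Sketch of a deterministic agent-function call, assuming the standard
# `openai` Python client; the prompt text is a hypothetical placeholder.

def build_request(prompt: str) -> dict:
    """Assemble the chat-completion parameters shared by the agent functions."""
    return {
        "model": "gpt-4o-mini-2024-07-18",  # pinned snapshot named in the report
        "temperature": 0,                    # greedy decoding for reproducibility
        "messages": [{"role": "user", "content": prompt}],
    }

# With a configured client this would run as:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(**build_request("Draft a review ..."))

params = build_request("Draft a review for the given paper.")
print(params["model"], params["temperature"])
```

Pinning both the model snapshot and `temperature=0` is what makes reruns of the simulation comparable; an unpinned model alias could silently change behavior between runs.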