From Commands to Prompts: LLM-based Semantic File System for AIOS

Authors: Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, Yongfeng Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that LSFS can achieve at least a 15% improvement in retrieval accuracy with 2.1× higher retrieval speed in the semantic file retrieval task compared with the traditional file system. In the traditional keyword-based file retrieval task (i.e., retrieving by string matching), LSFS also performs stably well, i.e., over 89% F1-score with improved usability, especially when the keyword conditions become more complex. Additionally, LSFS supports more advanced file management operations, i.e., semantic file rollback and file sharing, and achieves 100% success rates in these tasks, further suggesting the capability of LSFS. The code is available at https://github.com/agiresearch/AIOS-LSFS.
Researcher Affiliation | Academia | Rutgers University, Purdue University, New Jersey Institute of Technology, EPFL, University of Minnesota
Pseudocode | Yes | Algorithm 1 Pseudo-code of K.1. ... Algorithm 2 Pseudo-code of K.2. ... Algorithm 3 Procedures of K.3.
Open Source Code | Yes | The code is available at https://github.com/agiresearch/AIOS-LSFS.
Open Datasets | No | The paper does not provide concrete access information (a specific link, DOI, repository name, formal citation with authors/year, or reference to an established benchmark) for a publicly available or open dataset. It mentions generating its own test data: "We build a hierarchical file folder with file numbers as 10, 20, and 40, respectively, for this task."
Dataset Splits | No | The paper mentions generating test data with varying numbers of files (e.g., "file numbers as 10, 20, and 40" or "rollback file with versions the range from 5 to 40"), but it does not specify explicit training/validation/test splits (percentages, absolute counts, or citations to predefined splits) needed to reproduce any model training or evaluation.
Hardware Specification | No | The paper mentions that "CPU usage is maintained between 0.1% and 0.2%" when discussing supervisor efficiency, but it does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or other machine specifications) for the systems used to run its experiments.
Software Dependencies | No | The paper mentions "all-MiniLM-L6-v2" as a lightweight embedding model and "llamaindex" for indexing. It also references specific LLM backbones: "Gemini-1.5-flash", "GPT-4o-mini", "Qwen-2", and "Gemma-2". However, it does not provide version numbers for these components or for other libraries used (e.g., Python, Flask, the Dropbox SDK), which would be necessary for full reproducibility.
Experiment Setup | No | The paper describes the experimental design and evaluation metrics for the various tasks (e.g., semantic file retrieval, keyword-based retrieval, rollback scalability, supervisor effectiveness) and names the LLM backbones used. However, it does not contain concrete setup details such as hyperparameter values (e.g., learning rate, batch size) or other system-level training configurations for any models developed or fine-tuned by the authors; the LLMs mentioned are off-the-shelf models, so their internal training settings are not provided by the authors.
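The review above contrasts LSFS's semantic file retrieval (ranking files by embedding similarity to a natural-language prompt) with traditional keyword retrieval (string matching). A minimal, self-contained sketch of that contrast, using toy bag-of-words vectors as a stand-in for the paper's all-MiniLM-L6-v2 sentence embeddings; the function names and file contents are illustrative and not taken from the LSFS codebase:

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term frequencies. LSFS uses a real
    # sentence-embedding model (all-MiniLM-L6-v2); this stand-in only
    # illustrates the retrieval mechanics, not the paper's quality numbers.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_retrieve(query, files, top_k=1):
    # Semantic retrieval: rank every file by similarity to the query prompt.
    q = embed(query)
    ranked = sorted(files, key=lambda f: cosine(q, embed(files[f])), reverse=True)
    return ranked[:top_k]

def keyword_retrieve(keyword, files):
    # Traditional retrieval: exact substring matching on file contents.
    return [f for f in files if keyword.lower() in files[f].lower()]

files = {
    "notes.txt": "meeting notes about the quarterly budget review",
    "recipe.txt": "chocolate cake recipe with dark cocoa",
    "paper.txt": "semantic file system built on large language models",
}

print(semantic_retrieve("budget meeting summary", files))  # ['notes.txt']
print(keyword_retrieve("semantic", files))                 # ['paper.txt']
```

The semantic path finds notes.txt even though the query shares no exact phrase with the file, while the keyword path only returns files containing the literal string; this is the usability gap the review's accuracy and F1 figures quantify.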