From Commands to Prompts: LLM-based Semantic File System for AIOS

Authors: Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, Yongfeng Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that LSFS can achieve at least a 15% improvement in retrieval accuracy with 2.1× higher retrieval speed in the semantic file retrieval task compared with the traditional file system. In the traditional keyword-based file retrieval task (i.e., retrieving by string matching), LSFS also performs stably well, i.e., over 89% F1-score with improved usability, especially when the keyword conditions become more complex. Additionally, LSFS supports more advanced file management operations, i.e., semantic file rollback and file sharing, and achieves 100% success rates in these tasks, further suggesting the capability of LSFS. The code is available at https://github.com/agiresearch/AIOS-LSFS.
Researcher Affiliation | Academia | Rutgers University, Purdue University, New Jersey Institute of Technology, EPFL, University of Minnesota
Pseudocode | Yes | Algorithm 1 Pseudo-code of K.1. ... Algorithm 2 Pseudo-code of K.2. ... Algorithm 3 Procedures of K.3.
Open Source Code | Yes | The code is available at https://github.com/agiresearch/AIOS-LSFS.
Open Datasets | No | The paper does not provide concrete access information (a specific link, DOI, repository name, formal citation with authors/year, or reference to an established benchmark) for a publicly available or open dataset. It mentions generating its own test data: "We build a hierarchical file folder with file numbers as 10, 20, and 40, respectively, for this task."
Dataset Splits | No | The paper mentions generating test data with varying numbers of files (e.g., "file numbers as 10, 20, and 40" or "rollback file with versions the range from 5 to 40"), but it does not specify explicit training/validation/test splits (percentages, absolute counts, or citations to predefined splits) needed to reproduce any model training or evaluation.
Hardware Specification | No | The paper mentions that "CPU usage is maintained between 0.1% and 0.2%" when discussing supervisor efficiency, but it does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or other machine specifications) for the systems used to run its experiments.
Software Dependencies | No | The paper mentions "all-MiniLM-L6-v2" as a lightweight embedding model and "llamaindex" for indexing. It also references specific LLM backbones: "Gemini-1.5-flash", "GPT-4o-mini", "Qwen-2", and "Gemma-2". However, it does not provide version numbers for these components or for other libraries used (e.g., Python, Flask, the Dropbox SDK), which would be necessary for full reproducibility.
Experiment Setup | No | The paper describes the experimental design and evaluation metrics for the various tasks (e.g., semantic file retrieval, keyword-based retrieval, rollback scalability, supervisor effectiveness) and names the LLM backbones used. However, it does not contain concrete setup details such as hyperparameter values (e.g., learning rate, batch size) or other system-level training configurations for any models developed or fine-tuned by the authors; the LLMs mentioned are off-the-shelf models, so their internal training settings are not provided by the authors.
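The review above contrasts LSFS's semantic file retrieval (ranking files by embedding similarity to a natural-language prompt) with traditional keyword retrieval (string matching). A minimal, self-contained sketch of that contrast, using toy bag-of-words vectors as a stand-in for the paper's all-MiniLM-L6-v2 sentence embeddings; the function names and file contents are illustrative and not taken from the LSFS codebase:

```python
import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term frequencies. LSFS uses a real
    # sentence-embedding model (all-MiniLM-L6-v2); this stand-in only
    # illustrates the retrieval mechanics, not the paper's quality numbers.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_retrieve(query, files, top_k=1):
    # Semantic retrieval: rank every file by similarity to the query prompt.
    q = embed(query)
    ranked = sorted(files, key=lambda f: cosine(q, embed(files[f])), reverse=True)
    return ranked[:top_k]

def keyword_retrieve(keyword, files):
    # Traditional retrieval: exact substring matching on file contents.
    return [f for f in files if keyword.lower() in files[f].lower()]

files = {
    "notes.txt": "meeting notes about the quarterly budget review",
    "recipe.txt": "chocolate cake recipe with dark cocoa",
    "paper.txt": "semantic file system built on large language models",
}

print(semantic_retrieve("budget meeting summary", files))  # ['notes.txt']
print(keyword_retrieve("semantic", files))                 # ['paper.txt']
```

The semantic path finds notes.txt even though the query shares no exact phrase with the file, while the keyword path only returns files containing the literal string; this is the usability gap the review's accuracy and F1 figures quantify.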