reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Hierarchically Encapsulated Representation for Protocol Design in Self-Driving Labs

Authors: Yu-Zhe Shi, Mingchen Liu, Fanxu Meng, Qiao Xu, Zhangqian Bi, Kun He, Lecheng Ruan, Qining Wang

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The results demonstrate that the proposed method could effectively complement Large Language Models in the protocol design process, serving as an auxiliary module in the realm of machine-assisted scientific exploration. The complete quantitative results across the four domains, the three tasks, and the six dimensions of evaluation metrics are presented at Appx. B. Through paired samples t-test, we find that EE+ and EI+ significantly outperform other alternative approaches (EE+ outperforms EE: t(278) = 8.007, µd < 0, p < .0001; EI+ outperforms EI: t(278) = 8.397, µd < 0, p < .0001; EE+ outperforms II: t(278) = 24.493, µd < 0, p < .0001; EI+ outperforms II: t(278) = 23.855, µd < 0, p < .0001; see Fig. 3C-E).
Researcher Affiliation	Academia	Yu-Zhe Shi¹, Mingchen Liu², Fanxu Meng¹, Qiao Xu¹, Zhangqian Bi², Kun He², Lecheng Ruan¹, Qining Wang¹. ¹ Department of Advanced Manufacturing and Robotics, Peking University; ² School of Computer Science and Technology, Huazhong University of Science and Technology.
Pseudocode	Yes	Algorithm 1 Reciprocative Verification
Open Source Code	Yes	The project page with supplementary files for reproducing the results of this paper will be available at https://autodsl.org/procedure/papers/iclr25shi.html.
Open Datasets	Yes	The corpora C for the automatic generation of representations (Sec. 3.1) and the corpora for selecting the testing set (Sec. 4.1) are both retrieved from open-sourced websites run by top-tier publishers, including Nature’s Protocolexchange, Cell’s Star-protocols, Bio-protocol, Wiley’s Current Protocols, and Jove.
Dataset Splits	Yes	The testing set includes 140 new protocols and 1757 steps in total, across the domains of Genetics, Medical, Bioengineering, and Ecology, with 23% for planning, 52% for modification, and 25% for adjustment (see Tab. 1 and Fig. 3A for details).
Hardware Specification	Yes	The design of the DSLs was executed on a MacBook with an M2 chip, running 1,000 iterations to ensure convergence.
Software Dependencies	No	The protocol pre-processing steps begin by reading all JSON files of the protocols. Each protocol is then split sentence-by-sentence using spaCy, with the constraint that every sentence is longer than ten characters. ... Afterwards, we use sklearn to identify potentially similar entity pairs by calculating the cosine similarity of the candidate entities, and then passing these entity pairs to the GPT model for synonym detection... We primarily used GPT-4o mini with OpenAI’s Batch API for preprocessing...
Experiment Setup	Yes	The design of the DSLs was executed on a MacBook with an M2 chip, running 1,000 iterations to ensure convergence. This process required an average of 55 seconds per iteration for the operation-centric view DSL and an average of 2 seconds per iteration for the product-centric view DSL.
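
The paired-samples t-test cited in the Research Type row can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `paired_t_statistic`, the example score lists, and the sign convention (first sample minus second, so a negative mean difference means the second sample scores higher) are assumptions made for illustration. Only the Python standard library is used, so the sketch returns the t statistic and the degrees of freedom and leaves the p-value to a statistics package.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired-samples t-test.

    The statistic is computed on the per-pair differences a - b.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two pairs")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Hypothetical per-protocol scores, invented for this example only.
baseline = [0.52, 0.61, 0.47, 0.58, 0.66]
improved = [0.60, 0.65, 0.55, 0.63, 0.70]

t, df = paired_t_statistic(baseline, improved)
print(f"t({df}) = {t:.3f}")
```

For n pairs the test has n - 1 degrees of freedom. Obtaining a p-value additionally requires the t-distribution, which a package such as SciPy provides through `scipy.stats.ttest_rel`.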