reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation

Authors: Ning Wang, Zihan Yan, Weiyang Li, Chuan Ma, He Chen, Tao Xiang

IJCAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments conducted on diverse benchmark datasets and models validate the feasibility and efficacy of the proposed approach. The results demonstrate that our methodologies achieve an impressive average detection accuracy of 94.58%, surpassing the performance of existing state-of-the-art techniques, alongside an exceptional moderation processing time of merely 0.002 seconds per instance.
Researcher Affiliation	Academia	1 College of Computer Science, Chongqing University; 2 Department of Information Engineering, The Chinese University of Hong Kong
Pseudocode	No	The paper describes the Pinpoint workflow with a diagram (Figure 2) and textual descriptions of its components (External Instruction Localization, Intrinsic Feature Extraction, Malicious Instruction Detection), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks with structured steps.
Open Source Code	Yes	The source code and datasets can be found at https://github.com/ZihanYan-CQU/EAsafetyBench.
Open Datasets	Yes	The source code and datasets can be found at https://github.com/ZihanYan-CQU/EAsafetyBench. In addition to EAsafetyBench-Drone, our experiments also utilize data from SafeAgentBench [Yin et al., 2024].
Dataset Splits	Yes	We partition the combined dataset from EAsafetyBench-Drone and SafeAgentBench into a training set and test set based on semantic similarity to ensure distinction for each set. For this, we employ NV-Embed-v2 [Lee et al., 2024] as the embedding model. The training set is allocated 70% of the data.
Hardware Specification	Yes	All experiments are conducted on Ubuntu 22.04 using four NVIDIA RTX A6000 GPUs.
Software Dependencies	No	The paper mentions 'Ubuntu 22.04' as the operating system and that the experimental environment is built 'on the PyTorch platform', but it does not provide specific version numbers for PyTorch or any other key software libraries or dependencies.
Experiment Setup	Yes	We train a fully connected MLP classifier with 3 layers and 4 million parameters using the Adam optimizer. The training parameters are set as follows: a batch size of 16, 50 epochs, a learning rate of 1e-3, and a weight decay (ℓ2 penalty) of 2e-4.
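
The Dataset Splits and Experiment Setup rows above can be collected into a small configuration sketch. The hyperparameters (batch size 16, 50 epochs, learning rate 1e-3, weight decay 2e-4, 70% training share) come from the quoted text. Everything else is an assumption made for illustration: the names `TrainConfig`, `mlp_param_count`, and `optimizer_steps` are hypothetical, the layer widths are invented to land near the reported 4 million parameters, and the dataset size of 1000 is a placeholder. The paper's actual layer widths and dataset size are not given in this entry.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the Experiment Setup and Dataset Splits rows.
    batch_size: int = 16
    epochs: int = 50
    learning_rate: float = 1e-3
    weight_decay: float = 2e-4  # l2 penalty
    train_percent: int = 70


def mlp_param_count(layer_sizes):
    """Count the weights and biases of a fully connected MLP."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


def optimizer_steps(num_samples, cfg):
    """Total optimizer updates over all epochs on the training split."""
    n_train = num_samples * cfg.train_percent // 100
    return math.ceil(n_train / cfg.batch_size) * cfg.epochs


cfg = TrainConfig()

# Hypothetical widths for three linear layers (NOT from the paper),
# chosen only so the total is close to 4 million parameters.
assumed_layers = [4096, 960, 64, 2]
total_params = mlp_param_count(assumed_layers)

# Placeholder dataset size (NOT from the paper).
total_steps = optimizer_steps(1000, cfg)

print(total_params, total_steps)
```

With these assumed widths the count comes to 3,994,754 parameters, and a placeholder dataset of 1000 samples gives 700 training samples, 44 batches per epoch, and 2200 optimizer updates in total.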