A Benchmark for Semantic Sensitive Information in LLMs Outputs
Authors: Qingjie Zhang, Han Qiu, Di Wang, Yiming Li, Tianwei Zhang, Wenyu Zhu, Haiqin Weng, Liu Yan, Chao Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we construct a comprehensive and labeled dataset of semantic sensitive information, SemSI-Set, by including three typical categories of SemSI. Then, we propose a large-scale benchmark, SemSI-Bench, to systematically evaluate semantic sensitive information in 25 SOTA LLMs. Our findings reveal that SemSI widely exists in SOTA LLMs' outputs when querying with simple natural questions. We open-source our project at https://semsi-project.github.io/. |
| Researcher Affiliation | Collaboration | Qingjie Zhang1, Han Qiu1, Di Wang1, Yiming Li2, Tianwei Zhang2, Wenyu Zhu3, Haiqin Weng4, Liu Yan4, and Chao Zhang1. 1Tsinghua University, 2Nanyang Technological University, 3Ascend Grace Tech, 4Ant Group. Emails: {qj-zhang24@mails., qiuhan@}tsinghua.edu.cn |
| Pseudocode | No | The paper describes methods and processes verbally and visually (e.g., Figure 2 pipeline overview), but does not contain any explicit pseudocode blocks or algorithms. |
| Open Source Code | Yes | We open-source our project at https://semsi-project.github.io/. The codes for reproducing our results are provided in our project website: https://semsi-project.github.io/. |
| Open Datasets | Yes | First, we construct a comprehensive and labeled dataset of semantic sensitive information, SemSI-Set, by including three typical categories of SemSI. ... We open-source our project at https://semsi-project.github.io/. |
| Dataset Splits | Yes | We compress SemSI-Set to a coreset of 1,000 samples, SemSI-cSet, for labeling and benchmarking. ... We observe that if we proportionally reduce the occurrence of SemSI for one model (e.g., GPT3.5-Turbo in Table 3), its metrics are almost the same after compression. What's more, the metrics are also the same for other models (e.g., GPT4o, Llama3-8B, and GLM4-9B in Table 3). We can see that the difference in metric values between the compressed and the original dataset is very close to 0. This implies a common coreset, SemSI-cSet, can represent SemSI-Set and efficiently build SemSI-Bench. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only mentions the general timeframe of experiments: "Experiments of GPT-o1 series are done at the end of September 2024 while other experiments are done at August 2024." |
| Software Dependencies | No | The paper mentions accessing LLMs via public API and Hugging Face, and using GPT-4o for labeling, but it does not specify any particular software libraries, frameworks, or their version numbers that were used to implement their own methodology or analysis. |
| Experiment Setup | No | The paper focuses on benchmarking existing LLMs and describes the process of prompt generation, labeling, and metric computation. It does not involve training its own models, and therefore, does not provide experimental setup details like hyperparameters, optimizers, or training schedules which are typically associated with training a model. |
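The coreset claim in the Dataset Splits row (proportional compression to 1,000 samples leaves per-model metrics nearly unchanged) can be illustrated with a small sketch. This is not the authors' released code; the `occurrence_rate` metric, the labels, and the proportional sampler below are hypothetical stand-ins for whatever SemSI metrics the paper computes.

```python
import random

def occurrence_rate(labels):
    # Fraction of responses labeled as containing semantic sensitive information.
    return sum(labels) / len(labels)

def proportional_coreset(labels, size, seed=0):
    # Sample a coreset that preserves the positive/negative ratio of the full set.
    rng = random.Random(seed)
    positives = [l for l in labels if l == 1]
    negatives = [l for l in labels if l == 0]
    k_pos = round(size * len(positives) / len(labels))
    return rng.sample(positives, k_pos) + rng.sample(negatives, size - k_pos)

# Toy "full" labeled set: 10,000 responses, roughly 12% flagged for SemSI.
random.seed(42)
full = [1 if random.random() < 0.12 else 0 for _ in range(10_000)]
core = proportional_coreset(full, 1_000)

# Proportional sampling keeps the metric gap between coreset and full set tiny,
# mirroring the "difference close to 0" observation in Table 3 of the paper.
diff = abs(occurrence_rate(full) - occurrence_rate(core))
assert diff < 0.01
```

Because the sampler fixes the positive count to `round(size * ratio)`, the coreset's occurrence rate matches the full set's up to rounding, which is why the gap is bounded regardless of the random seed.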