A General Framework for Producing Interpretable Semantic Text Embeddings
Authors: Yiqun Sun, Qiang Huang, Yixuan Tang, Anthony Tung, Jun Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while maintaining inherent interpretability. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks. |
| Researcher Affiliation | Academia | Yiqun Sun, Qiang Huang, Yixuan Tang & Anthony K. H. Tung, School of Computing, National University of Singapore, Singapore; Jun Yu, School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen), Shenzhen, China |
| Pseudocode | No | The paper describes methods and processes in sections like "3.1 QUESTION GENERATION" and "3.2 QUESTION ANSWERING", but these are described in natural language and mathematical formulas (e.g., Equation 1, 2, 3, 4, 5) rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/dukesun99/CQG-MBQA. |
| Open Datasets | Yes | "For CQG-MBQA, we use the MEDI2 dataset (Muennighoff et al., 2024), a diverse text corpus, as the training data." and "We use the MEDI2 dataset, downloaded from the Hugging Face repository at GritLM/MEDI2" with footnote "https://huggingface.co/datasets/GritLM/MEDI2/tree/main". |
| Dataset Splits | Yes | "To ensure that the MBQA model produces faithful answers to the questions, we evaluate its question-answering performance on a 10% held-out document set that was not used for training." and "For STS tasks, we use Spearman correlation (Spearman, 1904) on cosine similarity between embeddings as the metric. In retrieval tasks, we assess the performance using nDCG@10 (Wang et al., 2013). For clustering tasks, we evaluate the results using V-Measure (Rosenberg & Hirschberg, 2007)." and "Evaluated on seven popular datasets: SemEval STS tasks 2012-2016 (STS12-STS16) (Agirre et al., 2012; 2013; 2014; 2015; 2016), STS Benchmark (STS-B) (Cer et al., 2017), and SICK-Relatedness (SICK-R) (Marelli et al., 2014) using the MTEB evaluation suite (Muennighoff et al., 2023)." |
| Hardware Specification | Yes | The training process takes 36 hours, and embedding the entire MS MARCO dev set requires 90 hours on a single GTX 1080 Ti, which is an inexpensive GPU. |
| Software Dependencies | No | "We then run KMeans clustering (Arthur & Vassilvitskii, 2007) with k = 5,000 clusters and default parameters, using the scikit-learn library, accelerated by Intel(R) Extension for scikit-learn." This mentions libraries but without specific version numbers. While "GPT-4o-mini-2024-07-18" is a model version, it is an API identifier rather than a traditional software dependency pinned in a local environment. |
| Experiment Setup | Yes | Table 8: Hyperparameters used in our experiments: encoder Enc = UAE-Large-V1; generation model LLM = GPT-4o-mini-2024-07-18; number of clusters k = 5,000; positive samples per cluster np = 6; hard negative samples per cluster nh = 18; easy negative samples per cluster ne = 18; positive probe samples per question pp = 5; hard negative probe samples per question ph = 3; easy negative probe samples per question pe = 2; deduplication threshold θ = 0.8; top questions per cluster t = 4; learning rate of the MBQA model α = 1e-4; binary classification threshold τ = 0.5. Also: "We train the MBQA model using the Adam optimizer with a learning rate α of 1e-4 and a batch size of one text sample. ... The model is trained for 3 million steps, at which point performance begins to converge." |
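The STS protocol quoted in the Dataset Splits row (Spearman correlation on cosine similarity between embedding pairs) can be sketched in a few lines. This is a toy illustration with made-up 2-D vectors and gold scores, not the paper's MTEB pipeline:

```python
# Sketch of STS scoring: Spearman correlation between gold similarity
# labels and cosine similarity of embedding pairs (toy data only).
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence-embedding pairs and human similarity scores.
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1])),  # near-duplicates
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # unrelated
    (np.array([1.0, 1.0]), np.array([1.0, 0.9])),  # near-duplicates
]
gold = [4.8, 0.5, 4.5]

pred = [cosine(u, v) for u, v in pairs]
rho, _ = spearmanr(pred, gold)
print(round(rho, 3))  # rank correlation in [-1, 1]
```

In MTEB this correlation is computed per dataset (STS12-STS16, STS-B, SICK-R) and averaged.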
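The clustering step quoted in the Software Dependencies row (KMeans with default parameters via scikit-learn) can be sketched as below. The paper clusters MEDI2 embeddings with k = 5,000; here we use random stand-in vectors and a small k so the snippet runs quickly:

```python
# Sketch of the KMeans clustering step over document embeddings
# (toy data; the paper uses k = 5,000 on real corpus embeddings).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in for text embeddings

k = 5  # the paper sets k = 5,000
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings)

print(kmeans.labels_.shape)      # one cluster id per document
print(len(set(kmeans.labels_)))  # number of distinct clusters found
```

The paper additionally accelerates this with the Intel Extension for scikit-learn, which patches the same `KMeans` API without code changes.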
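The Table 8 hyperparameters quoted in the Experiment Setup row can be collected into a single config object for reproduction. The field names below are our own shorthand (the paper only gives symbols and descriptions); the values are as reported:

```python
# Table 8 hyperparameters as a frozen config (field names are our own
# shorthand; values are as reported in the paper's Table 8).
from dataclasses import dataclass

@dataclass(frozen=True)
class CQGMBQAConfig:
    encoder: str = "UAE-Large-V1"            # Enc
    llm: str = "GPT-4o-mini-2024-07-18"      # generation model
    n_clusters: int = 5_000                  # k
    pos_per_cluster: int = 6                 # np
    hard_neg_per_cluster: int = 18           # nh
    easy_neg_per_cluster: int = 18           # ne
    pos_probes_per_question: int = 5         # pp
    hard_neg_probes_per_question: int = 3    # ph
    easy_neg_probes_per_question: int = 2    # pe
    dedup_threshold: float = 0.8             # θ
    top_questions_per_cluster: int = 4       # t
    learning_rate: float = 1e-4              # α
    cls_threshold: float = 0.5               # τ

cfg = CQGMBQAConfig()
print(cfg.n_clusters, cfg.learning_rate)
```

A frozen dataclass keeps the reported settings immutable and self-documenting when rerunning the experiments.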