A General Framework for Producing Interpretable Semantic Text Embeddings
Authors: Yiqun Sun, Qiang Huang, Yixuan Tang, Anthony Tung, Jun Yu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while maintaining inherent interpretability. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks. |
| Researcher Affiliation | Academia | Yiqun Sun, Qiang Huang, Yixuan Tang & Anthony K. H. Tung, School of Computing, National University of Singapore, Singapore; Jun Yu, School of Intelligence Science and Engineering, Harbin Institute of Technology (Shenzhen), Shenzhen, China |
| Pseudocode | No | The paper describes methods and processes in sections like "3.1 QUESTION GENERATION" and "3.2 QUESTION ANSWERING", but these are described in natural language and mathematical formulas (e.g., Equation 1, 2, 3, 4, 5) rather than structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code is available at https://github.com/dukesun99/CQG-MBQA. |
| Open Datasets | Yes | "For CQG-MBQA, we use the MEDI2 dataset (Muennighoff et al., 2024), a diverse text corpus, as the training data." and "We use the MEDI2 dataset, downloaded from the Hugging Face repository at GritLM/MEDI2" with footnote "https://huggingface.co/datasets/GritLM/MEDI2/tree/main". |
| Dataset Splits | Yes | "To ensure that the MBQA model produces faithful answers to the questions, we evaluate its question-answering performance on a 10% held-out document set that was not used for training." and "For STS tasks, we use Spearman correlation (Spearman, 1904) on cosine similarity between embeddings as the metric. In retrieval tasks, we assess the performance using nDCG@10 (Wang et al., 2013). For clustering tasks, we evaluate the results using V-Measure (Rosenberg & Hirschberg, 2007)." and "Evaluated on seven popular datasets: SemEval STS tasks 2012-2016 (STS12-STS16) (Agirre et al., 2012; 2013; 2014; 2015; 2016), STS Benchmark (STS-B) (Cer et al., 2017), and SICK-Relatedness (SICK-R) (Marelli et al., 2014) using the MTEB evaluation suite (Muennighoff et al., 2023)." |
| Hardware Specification | Yes | The training process takes 36 hours, and embedding the entire MS MARCO dev set requires 90 hours on a single GTX 1080 Ti, which is an inexpensive GPU. |
| Software Dependencies | No | "We then run KMeans clustering (Arthur & Vassilvitskii, 2007) with k = 5,000 clusters and default parameters, using the scikit-learn library, accelerated by Intel(R) Extension for scikit-learn." This mentions libraries but without specific version numbers. While "GPT-4o-mini-2024-07-18" is a model version, it is an API identifier rather than a traditional software dependency pinned in a local environment. |
| Experiment Setup | Yes | Table 8: Hyperparameters used in our experiments: encoder Enc = UAE-Large-V1; generation model LLM = GPT-4o-mini-2024-07-18; number of clusters k = 5,000; positive samples per cluster np = 6; hard negative samples per cluster nh = 18; easy negative samples per cluster ne = 18; positive probe samples per question pp = 5; hard negative probe samples per question ph = 3; easy negative probe samples per question pe = 2; deduplication threshold θ = 0.8; top questions per cluster t = 4; learning rate of the MBQA model α = 1e-4; binary classification threshold τ = 0.5. Also: "We train the MBQA model using the Adam optimizer with a learning rate α of 1e-4 and a batch size of one text sample. ... The model is trained for 3 million steps, at which point performance begins to converge." |
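The STS protocol quoted in the Dataset Splits row (Spearman correlation on cosine similarity between embedding pairs) can be sketched in a few lines. This is a toy illustration with made-up 2-D vectors and gold scores, not the paper's MTEB pipeline:

```python
# Sketch of STS scoring: Spearman correlation between gold similarity
# labels and cosine similarity of embedding pairs (toy data only).
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical sentence-embedding pairs and human similarity scores.
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.1])),  # near-duplicates
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # unrelated
    (np.array([1.0, 1.0]), np.array([1.0, 0.9])),  # near-duplicates
]
gold = [4.8, 0.5, 4.5]

pred = [cosine(u, v) for u, v in pairs]
rho, _ = spearmanr(pred, gold)
print(round(rho, 3))  # rank correlation in [-1, 1]
```

In MTEB this correlation is computed per dataset (STS12-STS16, STS-B, SICK-R) and averaged.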
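The clustering step quoted in the Software Dependencies row (KMeans with default parameters via scikit-learn) can be sketched as below. The paper clusters MEDI2 embeddings with k = 5,000; here we use random stand-in vectors and a small k so the snippet runs quickly:

```python
# Sketch of the KMeans clustering step over document embeddings
# (toy data; the paper uses k = 5,000 on real corpus embeddings).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))  # stand-in for text embeddings

k = 5  # the paper sets k = 5,000
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(embeddings)

print(kmeans.labels_.shape)      # one cluster id per document
print(len(set(kmeans.labels_)))  # number of distinct clusters found
```

The paper additionally accelerates this with the Intel Extension for scikit-learn, which patches the same `KMeans` API without code changes.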
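The Table 8 hyperparameters quoted in the Experiment Setup row can be collected into a single config object for reproduction. The field names below are our own shorthand (the paper only gives symbols and descriptions); the values are as reported:

```python
# Table 8 hyperparameters as a frozen config (field names are our own
# shorthand; values are as reported in the paper's Table 8).
from dataclasses import dataclass

@dataclass(frozen=True)
class CQGMBQAConfig:
    encoder: str = "UAE-Large-V1"            # Enc
    llm: str = "GPT-4o-mini-2024-07-18"      # generation model
    n_clusters: int = 5_000                  # k
    pos_per_cluster: int = 6                 # np
    hard_neg_per_cluster: int = 18           # nh
    easy_neg_per_cluster: int = 18           # ne
    pos_probes_per_question: int = 5         # pp
    hard_neg_probes_per_question: int = 3    # ph
    easy_neg_probes_per_question: int = 2    # pe
    dedup_threshold: float = 0.8             # θ
    top_questions_per_cluster: int = 4       # t
    learning_rate: float = 1e-4              # α
    cls_threshold: float = 0.5               # τ

cfg = CQGMBQAConfig()
print(cfg.n_clusters, cfg.learning_rate)
```

A frozen dataclass keeps the reported settings immutable and self-documenting when rerunning the experiments.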