Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor

Authors: Barys Liskavets, Shuvendu Roy, Maxim Ushakov, Mark Klibanov, Ali Etemad, Shane K. Luke

TMLR 2025

Reproducibility Review
Research Type: Experimental. We present comprehensive experiments to evaluate our proposed solution and compare it against existing methods in two evaluation setups: task-aware and task-agnostic compression on two popular benchmarks, LongBench and ZeroSCROLLS, adopted by the existing literature on prompt compression. We report results for three model sizes: TPC-Base, TPC-Large, and TPC-Huge, containing 0.5B, 1B, and 7B parameters, respectively. As summarized in Figure 1, TPC shows significant improvements over existing methods in both task-aware and task-agnostic setups. Our smallest model, while considerably smaller in size, outperforms or performs comparably to the existing state-of-the-art (SOTA) methods.
Researcher Affiliation: Collaboration. Barys Liskavets (1), Shuvendu Roy (2), Maxim Ushakov (1), Mark Klibanov (3), Ali Etemad (2), Shane K. Luke (3). Affiliations: (1) Alterra AI, Palo Alto, United States; (2) Queen's University, Canada; (3) Workday Inc.
Pseudocode: Yes. The overall diagram of our proposed method is illustrated in Figure 2, and a pseudo-code is provided in Algorithm 3 (Appendix 3.7). Next, we provide the pseudo-code of the overall pipeline of our proposed prompt compression method TPC in Algorithm 3, the pseudo-code for the context-relevant task descriptor in Algorithm 2, and CTD refinement with RL in Algorithm 1.
Open Source Code: Yes. Finally, we release the code and the dataset for quick reproducibility and further development: https://github.com/bliskavets/TPC.
Open Datasets: Yes. We used pre-trained models and open-source datasets. In terms of models, we use models from the Qwen-2.5 and Mistral families, which are licensed under the Apache-2.0 license. We also used the Llama-3.2 model, which is licensed under the Llama 3.2 Community License Agreement. All models used are available for research purposes. As our training dataset, we used allenai/tulu-3-sft-mixture, licensed under the Open Data Commons Attribution license family, as well as abacusai/MetaMathFewshot and ise-uiuc/Magicoder-Evol-Instruct-110K, both licensed under the Apache-2.0 license. We assume that these data follow the well-known HHH paradigm; therefore, taking data from these sources should not introduce potentially malicious behaviour into our models. We also used the Pile dataset, which is licensed under the MIT license. We conduct manual checks on subsamples of the datasets to verify the cleanliness and safety of the data. In addition, we provide full details on how we created the datasets in Section 3.5.1. We evaluate our method on the MIT-licensed LongBench and ZeroSCROLLS benchmarks. All aforementioned models, datasets, and benchmarks are publicly available and permit research use; our use complies with their prescribed terms.
Dataset Splits: No. The paper mentions using specific datasets for training and evaluation (CTD, MCQR, Tulu-3-sft-mixture, MetaMathFewshot, Magicoder-Evol-Instruct-110K, Pile, LongBench, ZeroSCROLLS) and refers to 'standard evaluation protocols established in prior studies' for LongBench and ZeroSCROLLS. It also mentions a 'randomly sampled subset of 200 examples from each dataset' for additional evaluations. However, it does not provide explicit train/validation/test split percentages or absolute counts for the main model training and evaluation, nor does it specify whether custom splits were created with fixed random seeds. It relies on previously established benchmark splits or general statements about sampling.
Hardware Specification: Yes. All the experiments are conducted on an Nvidia A100 80GB GPU.
Software Dependencies: No. The paper mentions using an 'AdamW optimizer' and fine-tuning with 'LoRA (Hu et al., 2022)' but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, TensorFlow) used in the implementation.
Experiment Setup: Yes. We train fq with our curated CTD dataset for 2 epochs with an AdamW optimizer, a learning rate of 1.5e-4, and a batch size of 16. This is followed by the reward-guided RL training for 3 iterations with Llama-3.1-8B (Dubey et al., 2024) as the pre-trained LLM for the reward function. To obtain questions in the RL stage, we select 16 questions per prompt using nucleus sampling (Holtzman et al.) with temperature = 0.7 and top-p = 0.9. CSE is trained with an AdamW optimizer for 2 epochs, a learning rate of 5e-5, and a batch size of 32. We initialize the encoder with the pre-trained weights and fine-tune it with LoRA (Hu et al., 2022) of rank 16.
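The question-generation step above relies on nucleus (top-p) sampling with temperature 0.7 and top-p 0.9. A minimal sketch of that decoding rule is shown below; the function name and the plain-Python logit handling are illustrative assumptions, not code from the TPC repository, which would apply this per decoding step over an LLM's vocabulary logits.

```python
import math
import random

def nucleus_sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample one token index via nucleus (top-p) sampling.

    Illustrative sketch: temperature-scale the logits, softmax them,
    keep the smallest set of tokens whose cumulative probability
    reaches top_p, then sample from that renormalized set.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort token indices by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Keep the smallest prefix whose cumulative mass reaches top_p.
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the nucleus and draw one token index.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

Drawing 16 questions per prompt then amounts to running the decoder 16 times with this sampler; low-probability tail tokens are excluded, which keeps the sampled questions diverse but on-topic.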