REEF: Representation Encoding Fingerprints for Large Language Models
Authors: Jie Zhang, Dongrui Liu, Chen Qian, Linfeng Zhang, Yong Liu, Yu Qiao, Jing Shao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide a comprehensive evaluation of REEF. Section 5.1 evaluates REEF's effectiveness in distinguishing LLMs derived from the root victim model from unrelated models. Following this, Section 5.2 assesses REEF's robustness to subsequent developments of the victim model, such as fine-tuning, pruning, merging, and permutations. Section 5.3 presents ablation studies on REEF across varying sample numbers and datasets. Finally, Section 5.4 discusses REEF's sensitivity to training data and its capacity for adversarial evasion. Figure 3: Heatmaps depicting the CKA similarity between the representations of the victim LLM (Llama-2-7B) and those of various suspect LLMs on the same samples. |
| Researcher Affiliation | Academia | 1 Shanghai Artificial Intelligence Laboratory 2 University of Chinese Academy of Sciences 3 Renmin University of China 4 Shanghai Jiao Tong University |
| Pseudocode | No | The paper describes the CKA similarity index and its mathematical formulation in Section 4 and provides proofs in Appendix A, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly accessible at https://github.com/AI45Lab/REEF. To ensure the reproducibility of this study, we have uploaded the source code as part of the supplementary material. Furthermore, the code and datasets will be made available on GitHub after the completion of the double-blind review process, enabling others to replicate our study. |
| Open Datasets | Yes | We use both a linear kernel and an RBF kernel to compute the layer-wise and inter-layer CKA similarity of representations between the victim and suspect models on 200 samples from the TruthfulQA dataset (Lin et al., 2022). To assess the effectiveness of REEF across various data types, we also conduct experiments using SST2 (Socher et al., 2013), ConfAIde (Mireshghallah et al., 2023), PKU-SafeRLHF (Ji et al., 2024), and ToxiGen (Hartvigsen et al., 2022). |
| Dataset Splits | Yes | The dataset is split into training and test sets with a 4:1 ratio. (TruthfulQA dataset) |
| Hardware Specification | Yes | Using the dataset provided in the original paper, we conduct pre-training on 8 A100 GPUs with different random seeds for data shuffling. |
| Software Dependencies | No | The paper mentions using a 'linear kernel and a Radial Basis Function (RBF) kernel' and 'cross-entropy' for task loss, but does not provide specific version numbers for any software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | For training, we use the TruthfulQA dataset (Lin et al., 2022), concatenating each question with its truthful answer as positive samples and with its false answer as negative samples. The dataset is split into training and test sets with a 4:1 ratio. We use both a linear kernel and an RBF kernel to compute the layer-wise and inter-layer CKA similarity of representations between the victim and suspect models on 200 samples from the TruthfulQA dataset. We focus on reporting the similarity at layer 18 in subsequent experiments. The hyperparameters for pre-training are set as follows: a global batch size of 512, a learning rate of 4e-4, a micro-batch size of 8, a maximum of 56,960 steps, a weight decay of 0.1, beta1 of 0.9, beta2 of 0.95, gradient clipping at 1.0, and a minimum learning rate of 4e-5. |
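Several rows above hinge on computing CKA similarity between victim and suspect representations with a linear or RBF kernel. The following is a minimal NumPy sketch of both variants, not REEF's released implementation; the function names and the median-heuristic RBF bandwidth are illustrative assumptions. Inputs are `(n_samples, hidden_dim)` activation matrices extracted from the two models on the same inputs (e.g., the 200 TruthfulQA samples at layer 18).

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices.

    X: (n, d1), Y: (n, d2) -- activations of the victim and suspect
    models on the same n samples. Returns a score in [0, 1].
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style cross-covariance norm, normalized by self-similarities.
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

def rbf_kernel(X, sigma=None):
    """RBF (Gaussian) kernel matrix; sigma defaults to the median heuristic."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_cka(KX, KY):
    """CKA from precomputed kernel matrices (use rbf_kernel for the RBF variant)."""
    n = KX.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    KXc, KYc = H @ KX @ H, H @ KY @ H
    hsic = np.sum(KXc * KYc)
    return hsic / (np.linalg.norm(KXc, "fro") * np.linalg.norm(KYc, "fro"))
```

A useful property for fingerprinting: linear CKA is invariant to orthogonal transformations of the features, so column permutations of the hidden states (one of the evasion attempts the paper considers) leave the score unchanged.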