Controlling Large Language Models Through Concept Activation Vectors
Authors: Hanyu Zhang, Xiting Wang, Chengao Li, Xiang Ao, Qing He
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Control experiments from different perspectives, including toxicity reduction, sentiment control, linguistic style, and topic control, demonstrate that our framework achieves state-of-the-art performance with granular control, allowing for fine-grained adjustments of both the steering layers and the steering magnitudes for individual samples. |
| Researcher Affiliation | Academia | 1Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China 2Key Lab of AI Safety of Chinese Academy of Sciences (CAS), Beijing 100190, China 3University of Chinese Academy of Sciences, CAS, Beijing 100049, China 4 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China EMAIL, EMAIL |
| Pseudocode | No | The paper describes the GCAV framework and its components (CAV Training, Controlled Generation, Controlling Multiple Concepts) using prose and mathematical equations. It does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor does it include a link to a code repository. It mentions referring to the appendix for experimental setup details but not for code availability. |
| Open Datasets | Yes | The toxicity reduction dataset is from RealToxicityPrompts (Gehman et al. 2020) and we use the dataset constructed by Pei, Yang, and Klein (2023). ... The sentiment control dataset consists of 1000 negative reviews from the IMDB movie review dataset (Maas et al. 2011)... topic strength is measured using a multi-label topic classification model trained on Twitter data (Antypas et al. 2022a,b). Formality is evaluated using a model trained to classify sentences as formal or informal (Babakov et al. 2023). |
| Dataset Splits | No | The paper mentions specific sizes for evaluation sets, such as 'toxicity_toxic consists of the 1,000 most toxic prompts', 'toxicity_random consists of 1,000 randomly sampled prompts', and 'The sentiment control dataset consists of 1000 negative reviews'. It also mentions '100 pairs' for CAV training prompts. However, it does not provide explicit training/validation/test splits (percentages or counts) for the datasets used in its experiments, nor does it cite predefined splits for its specific experimental setup. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other processor specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using SciPy and various language models (Llama-2-7b, Llama-2-7b-chat, Llama-2-13b-chat), but it does not provide specific version numbers for any ancillary software dependencies used in its implementation or experimentation. |
| Experiment Setup | No | The paper describes the methodology for GCAV, including the calculation of steering strength and selection of intervention layers. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, number of epochs) or explicit training configurations for the concept activation vector classifiers used within the framework. |
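For context on the technique assessed above, the core idea of concept-activation-vector steering can be sketched in a few lines. This is an illustrative toy, not the authors' GCAV implementation: the synthetic activations, the logistic-regression probe, and the `steer` function are all hypothetical stand-ins, and a real setup would operate on transformer hidden states at a chosen layer.

```python
import numpy as np

# Toy sketch of CAV steering (illustrative assumptions throughout):
# a linear probe separates activations of concept vs. non-concept inputs;
# its normalized weight vector is the CAV; steering adds a scaled CAV
# to a hidden activation.

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size

# Synthetic activations: "concept" samples shifted along a ground-truth direction.
true_dir = rng.normal(size=d)
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Train a logistic-regression probe by gradient descent; w is the raw CAV.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)     # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

cav = w / np.linalg.norm(w)  # unit-norm concept direction

def steer(activation, epsilon):
    """Shift a hidden activation along the CAV by magnitude epsilon."""
    return activation + epsilon * cav

h = rng.normal(size=d)          # a hypothetical layer activation
h_steered = steer(h, epsilon=4.0)
# The steered activation scores strictly higher under the concept probe,
# since (h + eps*cav) @ w = h @ w + eps * ||w||.
assert h_steered @ w > h @ w
```

The per-layer and per-sample granularity the paper claims would correspond, in this sketch, to choosing `epsilon` (and the layer at which `steer` is applied) separately for each input rather than using one global setting.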