Controlling Large Language Models Through Concept Activation Vectors
Authors: Hanyu Zhang, Xiting Wang, Chengao Li, Xiang Ao, Qing He
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Control experiments from different perspectives, including toxicity reduction, sentiment control, linguistic style, and topic control, demonstrate that our framework achieves state-of-the-art performance with granular control, allowing for fine-grained adjustments of both the steering layers and the steering magnitudes for individual samples. |
| Researcher Affiliation | Academia | 1Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China 2Key Lab of AI Safety of Chinese Academy of Sciences (CAS), Beijing 100190, China 3University of Chinese Academy of Sciences, CAS, Beijing 100049, China 4 Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China EMAIL, EMAIL |
| Pseudocode | No | The paper describes the GCAV framework and its components (CAV Training, Controlled Generation, Controlling Multiple Concepts) using prose and mathematical equations. It does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor does it include a link to a code repository. It mentions referring to the appendix for experimental setup details but not for code availability. |
| Open Datasets | Yes | The toxicity reduction dataset is from RealToxicityPrompts (Gehman et al. 2020) and we use the dataset constructed by Pei, Yang, and Klein (2023). ... The sentiment control dataset consists of 1000 negative reviews from the IMDB movie review dataset (Maas et al. 2011)... topic strength is measured using a multi-label topic classification model trained on Twitter data (Antypas et al. 2022a,b). Formality is evaluated using a model trained to classify sentences as formal or informal (Babakov et al. 2023). |
| Dataset Splits | No | The paper mentions specific sizes for evaluation sets, such as 'toxicity_toxic consists of the 1,000 most toxic prompts', 'toxicity_random consists of 1,000 randomly sampled prompts', and 'The sentiment control dataset consists of 1000 negative reviews'. It also mentions '100 pairs' for CAV training prompts. However, it does not provide explicit training/validation/test splits (percentages or counts) for the datasets used in its experiments, nor does it cite predefined splits for its specific experimental setup. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or other processor specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using SciPy and various language models (Llama-2-7b, Llama-2-7b-chat, Llama-2-13b-chat), but it does not provide specific version numbers for any ancillary software dependencies used in its implementation or experimentation. |
| Experiment Setup | No | The paper describes the methodology for GCAV, including the calculation of steering strength and selection of intervention layers. However, it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, number of epochs) or explicit training configurations for the concept activation vector classifiers used within the framework. |
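For context on the technique assessed above, the core idea of concept-activation-vector steering can be sketched in a few lines. This is an illustrative toy, not the authors' GCAV implementation: the synthetic activations, the logistic-regression probe, and the `steer` function are all hypothetical stand-ins, and a real setup would operate on transformer hidden states at a chosen layer.

```python
import numpy as np

# Toy sketch of CAV steering (illustrative assumptions throughout):
# a linear probe separates activations of concept vs. non-concept inputs;
# its normalized weight vector is the CAV; steering adds a scaled CAV
# to a hidden activation.

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size

# Synthetic activations: "concept" samples shifted along a ground-truth direction.
true_dir = rng.normal(size=d)
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d))
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Train a logistic-regression probe by gradient descent; w is the raw CAV.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)     # clip to avoid exp overflow
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

cav = w / np.linalg.norm(w)  # unit-norm concept direction

def steer(activation, epsilon):
    """Shift a hidden activation along the CAV by magnitude epsilon."""
    return activation + epsilon * cav

h = rng.normal(size=d)          # a hypothetical layer activation
h_steered = steer(h, epsilon=4.0)
# The steered activation scores strictly higher under the concept probe,
# since (h + eps*cav) @ w = h @ w + eps * ||w||.
assert h_steered @ w > h @ w
```

The per-layer and per-sample granularity the paper claims would correspond, in this sketch, to choosing `epsilon` (and the layer at which `steer` is applied) separately for each input rather than using one global setting.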