LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning

Authors: Zhuorui Ye, Stephanie Milani, Geoff Gordon, Fei Fang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate the effectiveness of LICORICE, we conduct experiments in two scenarios, perfect human annotation and VLM annotation, on five environments with image input: an image-based version of Cart Pole, two Minigrid environments, and two Atari environments. First, under the assumption of perfect human annotation, we show that LICORICE yields both higher concept accuracy and higher reward while requiring fewer annotation queries than baseline methods. Second, we find that VLMs can indeed serve as concept annotators for some, but not all, of the environments.
Researcher Affiliation Academia Zhuorui Ye, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China, EMAIL; Stephanie Milani, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, EMAIL; Geoffrey J. Gordon, Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, EMAIL; Fei Fang, Software and Societal Systems Department, Carnegie Mellon University, Pittsburgh, PA 15213, EMAIL
Pseudocode Yes Algorithm 1 LICORICE (Label-efficient Interpretable COncept-based ReInforCEment learning). 1: Input: total budget B, number of iterations M, sample acceptance threshold p, ratio τ for active learning, batch size for querying b, number of concept models N to ensemble. 2: Initialize: training set D_train and validation set D_val. 3: for m = 1 to M do
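The active learning module referenced in Algorithm 1 queries labels for the samples on which the N ensembled concept models disagree most. Below is a minimal pure-Python sketch of that selection rule, assuming disagreement is measured as the variance of the ensemble's predictions; all names here are illustrative and not taken from the released code.

```python
from statistics import pvariance

def select_queries(pool, ensemble, b):
    """Pick the b pool samples with the highest ensemble disagreement.

    pool     -- list of unlabeled samples (plain numbers, for illustration)
    ensemble -- list of N concept models (callables returning a prediction)
    b        -- number of annotation queries to issue this batch
    """
    def disagreement(x):
        preds = [model(x) for model in ensemble]
        return pvariance(preds)  # population variance across the N predictions
    ranked = sorted(pool, key=disagreement, reverse=True)
    return ranked[:b]

# Toy ensemble: models agree near 0 and diverge for larger inputs.
ensemble = [lambda x, k=k: k * x for k in range(5)]
pool = [0.1, 2.0, 0.5, 3.0, 1.0]
print(select_queries(pool, ensemble, b=2))  # → [3.0, 2.0]
```

In the paper's setting τ controls how large the candidate pool is relative to b; here the whole pool is scored for simplicity.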
Open Source Code Yes Our code is released.1 https://github.com/cuizhuyefei/LICORICE
Open Datasets Yes Environments. We investigate these questions across five environments: Pixel Cart Pole (Yang et al., 2021), Door Key (Chevalier-Boisvert et al., 2023), Dynamic Obstacles (Chevalier-Boisvert et al., 2023), Boxing (Bellemare et al., 2013), and Pong (Bellemare et al., 2013). Each environment poses a distinct challenge and features a set of interpretable concepts describing key objects and properties. We summarize the concepts in Table 1, with more details in Appendix A.1. These environments are characterized by their dual representation: a complex image-based input and a symbolic concept-based representation. Pixel Cart Pole, Door Key, and Dynamic Obstacles are simpler because we can extract noiseless ground-truth concept labels from their source code. In contrast, Boxing, as implemented in OCAtari (Delfosse et al., 2023), uses reverse engineering to extract the positions of important objects from the game's RAM state. This extraction process introduces a small amount of noise. The concepts in Pong are derived from the VIPER paper (Bastani et al., 2018), which also uses reverse engineering.
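To make the "dual representation" concrete: in Pixel Cart Pole the symbolic side mirrors the classic CartPole state variables, and each rendered frame is paired with those ground-truth values as concept labels. The sketch below is illustrative only; the type and function names are ours, and the paper extracts these labels directly from the environment's source code rather than through a helper like this.

```python
from typing import NamedTuple

class CartPoleConcepts(NamedTuple):
    """Interpretable concept labels paired with each rendered image frame."""
    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_angular_velocity: float

def concepts_from_state(state):
    """Map the simulator's ground-truth state (a 4-tuple) to concept labels."""
    return CartPoleConcepts(*state)

labels = concepts_from_state((0.02, -0.3, 0.01, 0.4))
print(labels.pole_angle)  # → 0.01
```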
Dataset Splits No The paper mentions collecting unlabeled data (U_m) and then splitting a queried dataset (D_m) into train and validation splits, but it does not provide specific percentages, counts, or methodology (e.g., random seed, stratified splits) for these splits. For evaluation, it states "All reported numbers are calculated using 100 evaluation episodes.", which concerns evaluation metrics, not dataset splits for model training. The initial D_train and D_val are mentioned as initialized, but no size or creation method is given.
Hardware Specification Yes Computational resources. We use NVIDIA A6000 and NVIDIA RTX 6000 Ada Generation GPUs. Each of our training programs uses less than 3 GB of GPU memory.
Software Dependencies No All algorithms use PPO (Schulman et al., 2017; Raffin et al., 2021) with a concept bottleneck. More implementation details and hyperparameters are in Appendix A.2. ... For the PPO hyperparameters, ... For all other hyperparameters, we use the default values from Stable Baselines 3 (Raffin et al., 2021). ... For the concept training, we set 100 epochs with the Adam optimizer ... For our VLM experiments, we use GPT-4o.
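"PPO with a concept bottleneck" means the policy acts only on predicted concepts, never directly on raw pixels, which is what makes the agent's decision inputs human-inspectable. A dependency-free sketch of that two-stage forward pass, with hypothetical function names and toy integer weights standing in for the paper's neural networks:

```python
def linear(weights, bias, x):
    """Plain dense layer: weights is a list of rows, x a list of inputs."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def concept_bottleneck_forward(obs, concept_net, policy_head):
    """Two-stage pass: obs -> interpretable concepts -> action scores.

    The policy head sees only the predicted concepts, so every action can be
    explained in terms of the concept values.
    """
    concepts = concept_net(obs)
    return concepts, policy_head(concepts)

# Toy instantiation: 3-dim observation, 2 concepts, 2 actions.
concept_net = lambda x: linear([[1, 0, 0], [0, 1, 1]], [0, 0], x)
policy_head = lambda c: linear([[1, -1], [0, 1]], [0, 0], c)
concepts, scores = concept_bottleneck_forward([1, 2, 3], concept_net, policy_head)
print(concepts)  # → [1, 5]
print(scores)    # → [-4, 5]
```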
Experiment Setup Yes Behavior learning hyperparameters. For the PPO hyperparameters, we set 4 × 10^6 total timesteps for Pixel Cart Pole and Door Key, 10^6 for Dynamic Obstacles, 1.5 × 10^7 for Boxing, and 10^7 for Pong. For Pixel Cart Pole, Door Key, and Dynamic Obstacles, we use 8 vectorized environments, horizon T = 4096, 10 epochs for training, batch size of 512, learning rate 3 × 10^-4, entropy coefficient 0.01, and value function coefficient 0.5. For Boxing and Pong, we use 8 vectorized environments, horizon T = 1024, 4 epochs for training, batch size of 256, learning rate 3 × 10^-4, entropy coefficient 0.01, and value function coefficient 0.5. For all other hyperparameters, we use the default values from Stable Baselines 3 (Raffin et al., 2021). Concept learning hyperparameters. For the concept training, we set 100 epochs with the Adam optimizer, with the learning rate linearly decaying from 3 × 10^-4 to 0 in each iteration, for Pixel Cart Pole, Boxing, and Pong. In Door Key and Dynamic Obstacles, we use the same optimizer and initial learning rate but set 50 epochs instead, with an early-stopping threshold linearly increasing from 10 to 20, to keep the concept network from overfitting in earlier iterations. The batch size is 32. We model concept learning for Pixel Cart Pole as a regression problem (minimizing mean squared error). We model concept learning for Door Key, Dynamic Obstacles, Boxing, and Pong as classification problems. LICORICE-specific hyperparameters. For LICORICE, we set the ratio for active learning τ = 10, the batch size to query labels in the active learning module b = 20, and the number of ensemble models N = 5 for the first three environments. For the complex environments Boxing and Pong, we choose the ratio for active learning τ = 4, batch size b = B_m/5, and number of ensemble models N = 5 to balance performance and speed.
The default number of iterations in our algorithm is M = 4 for Pixel Cart Pole, M = 2 for Door Key and Dynamic Obstacles, and M = 5 for Boxing and Pong. The sample acceptance rate is p = 0.02 for Pixel Cart Pole and p = 0.1 for the other four environments. The Random-Q baseline uses the same sample acceptance rate p as LICORICE. In the complex environments Boxing and Pong, we use both the KL-divergence penalty and the PPO loss at the end of the algorithm to improve optimization, with β = 0.01, as mentioned in Section 3.
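The behavior learning settings quoted above can be collected into a single configuration sketch. The dictionary below is only a summary of the reported numbers: the "simple"/"complex" grouping names are ours, and while the keys mostly follow Stable Baselines 3's PPO argument names, this is not the authors' configuration code.

```python
# PPO hyperparameters per environment group, as reported in the paper.
PPO_CONFIG = {
    # Pixel Cart Pole, Door Key, Dynamic Obstacles
    "simple": {
        "n_envs": 8,           # vectorized environments
        "n_steps": 4096,       # horizon T
        "n_epochs": 10,        # epochs per PPO update
        "batch_size": 512,
        "learning_rate": 3e-4,
        "ent_coef": 0.01,      # entropy coefficient
        "vf_coef": 0.5,        # value function coefficient
    },
    # Boxing, Pong
    "complex": {
        "n_envs": 8,
        "n_steps": 1024,
        "n_epochs": 4,
        "batch_size": 256,
        "learning_rate": 3e-4,
        "ent_coef": 0.01,
        "vf_coef": 0.5,
    },
}

# Total environment steps per task.
TOTAL_TIMESTEPS = {
    "Pixel Cart Pole": 4_000_000,
    "Door Key": 4_000_000,
    "Dynamic Obstacles": 1_000_000,
    "Boxing": 15_000_000,
    "Pong": 10_000_000,
}
```

All remaining PPO settings fall back to the Stable Baselines 3 defaults, per the quoted text.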