Sample-Efficient Behavior Cloning Using General Domain Knowledge
Authors: Feiyu Zhu, Jean Oh, Reid Simmons
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments with lunar lander and car racing tasks, our approach learns to solve the tasks with as few as 5 demonstrations and is robust to action noise, outperforming the baseline model without domain knowledge. This indicates that with the help of large language models, we can incorporate domain knowledge into the structure of the policy, increasing sample efficiency for behavior cloning. |
| Researcher Affiliation | Academia | Feiyu Zhu, Jean Oh and Reid Simmons, Carnegie Mellon University |
| Pseudocode | Yes | Listing 1: Code snippet generated by GPT on steering control: `steer_control = (self.steer_weight * (target_heading - current_heading) * (1 - current_speed))` |
| Open Source Code | Yes | Code at github.com/zfy0314/knowledge-informed-model |
| Open Datasets | No | We use the heuristic policy defined in the Gymnasium package as the expert policy that generates demonstrations. This policy achieves around 89% success rate in the environment, however, we keep only the successful episodes as demonstrations for training. |
| Dataset Splits | Yes | For both conditions, we randomly sample 20% of the demonstration steps as the validation set, and keep the model parameters with the least loss in the validation set for evaluation. Similar to the previous environment, 20% of the demonstrations are reserved for validation while the rest are used for training in both conditions. |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, etc.) are provided in the paper's main text. The paper mentions experimenting with Lunar Lander and Car Racing environments but does not specify the computational resources used. |
| Software Dependencies | No | The final step is to implement the structure as a subclass of nn.Module in PyTorch. We experiment with the Lunar Lander and Car Racing environments in Gymnasium [Towers et al., 2024]. Specific version numbers for software dependencies are not provided. |
| Experiment Setup | No | By default, we use cross-entropy loss for discrete action spaces and mean square error for continuous action spaces. The combination (both non-gradient and gradient) that achieves the least overall loss is kept as the final model parameter. Because the connections between latent variables are sparse, the number of total parameters is small compared to unstructured models. Additionally, we focus on using only a few demonstrations. Therefore, we can perform gradient descent on all the demonstrations at once for most tasks without having to separate the samples into mini-batches. This helps to stabilize the training process. While general training strategies are mentioned, specific hyperparameter values like learning rate, number of epochs, or optimizer details are not provided. |
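The quoted setup (a structured PyTorch policy built from the GPT-generated steering rule in Listing 1, trained by full-batch behavior cloning with a 20% validation split and best-on-validation checkpointing) can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's released code: `SteerPolicy`, `train_full_batch`, and all hyperparameter values here are assumptions.

```python
# Hypothetical sketch of the described pipeline: a sparse, knowledge-structured
# policy (one learnable weight, as in Listing 1) fit by full-batch gradient
# descent on demonstrations, keeping the parameters with the least validation
# loss. Names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn


class SteerPolicy(nn.Module):
    """Structured policy: steer = w * (target_heading - heading) * (1 - speed)."""

    def __init__(self):
        super().__init__()
        self.steer_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs columns: target_heading, current_heading, current_speed
        target_heading, current_heading, current_speed = obs.unbind(dim=-1)
        return (self.steer_weight
                * (target_heading - current_heading)
                * (1 - current_speed))


def train_full_batch(policy, obs, actions, epochs=500, lr=0.1, val_frac=0.2):
    """Behavior cloning on all demos at once; 20% held out for validation."""
    n = obs.shape[0]
    perm = torch.randperm(n)
    n_val = max(1, int(val_frac * n))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    loss_fn = nn.MSELoss()  # MSE for the continuous action space
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(obs[train_idx]), actions[train_idx])
        loss.backward()
        opt.step()
        with torch.no_grad():
            val = loss_fn(policy(obs[val_idx]), actions[val_idx]).item()
        if val < best_val:  # keep parameters with least validation loss
            best_val = val
            best_state = {k: v.clone() for k, v in policy.state_dict().items()}
    policy.load_state_dict(best_state)
    return best_val
```

Because the structured policy has so few parameters, the full demonstration set fits in a single gradient batch, which is the stabilizing effect the Experiment Setup quote describes.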