Sample-Efficient Behavior Cloning Using General Domain Knowledge
Authors: Feiyu Zhu, Jean Oh, Reid Simmons
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments with lunar lander and car racing tasks, our approach learns to solve the tasks with as few as 5 demonstrations and is robust to action noise, outperforming the baseline model without domain knowledge. This indicates that with the help of large language models, we can incorporate domain knowledge into the structure of the policy, increasing sample efficiency for behavior cloning. |
| Researcher Affiliation | Academia | Feiyu Zhu, Jean Oh and Reid Simmons, Carnegie Mellon University |
| Pseudocode | Yes | Listing 1: Code snippet generated by GPT on steering control: `steer_control = (self.steer_weight * (target_heading - current_heading) * (1 - current_speed))` |
| Open Source Code | Yes | Code at github.com/zfy0314/knowledge-informed-model |
| Open Datasets | No | We use the heuristic policy defined in the Gymnasium package as the expert policy that generates demonstrations. This policy achieves around 89% success rate in the environment, however, we keep only the successful episodes as demonstrations for training. |
| Dataset Splits | Yes | For both conditions, we randomly sample 20% of the demonstration steps as the validation set, and keep the model parameters with the least loss in the validation set for evaluation. Similar to the previous environment, 20% of the demonstrations are reserved for validation while the rest are used for training in both conditions. |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, etc.) are provided in the paper's main text. The paper mentions experimenting with Lunar Lander and Car Racing environments but does not specify the computational resources used. |
| Software Dependencies | No | The final step is to implement the structure as a subclass of nn.Module in PyTorch. We experiment with the Lunar Lander and Car Racing environments in Gymnasium [Towers et al., 2024]. Specific version numbers for software dependencies are not provided. |
| Experiment Setup | No | By default, we use cross-entropy loss for discrete action spaces and mean square error for continuous action spaces. The combination (both non-gradient and gradient) that achieves the least overall loss is kept as the final model parameter. Because the connections between latent variables are sparse, the number of total parameters is small compared to unstructured models. Additionally, we focus on using only a few demonstrations. Therefore, we can perform gradient descent on all the demonstrations at once for most tasks without having to separate the samples into mini-batches. This helps to stabilize the training process. While general training strategies are mentioned, specific hyperparameter values like learning rate, number of epochs, or optimizer details are not provided. |
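The quoted setup (a structured PyTorch policy built from the GPT-generated steering rule in Listing 1, trained by full-batch behavior cloning with a 20% validation split and best-on-validation checkpointing) can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's released code: `SteerPolicy`, `train_full_batch`, and all hyperparameter values here are assumptions.

```python
# Hypothetical sketch of the described pipeline: a sparse, knowledge-structured
# policy (one learnable weight, as in Listing 1) fit by full-batch gradient
# descent on demonstrations, keeping the parameters with the least validation
# loss. Names and hyperparameters are illustrative, not from the paper.
import torch
import torch.nn as nn


class SteerPolicy(nn.Module):
    """Structured policy: steer = w * (target_heading - heading) * (1 - speed)."""

    def __init__(self):
        super().__init__()
        self.steer_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs columns: target_heading, current_heading, current_speed
        target_heading, current_heading, current_speed = obs.unbind(dim=-1)
        return (self.steer_weight
                * (target_heading - current_heading)
                * (1 - current_speed))


def train_full_batch(policy, obs, actions, epochs=500, lr=0.1, val_frac=0.2):
    """Behavior cloning on all demos at once; 20% held out for validation."""
    n = obs.shape[0]
    perm = torch.randperm(n)
    n_val = max(1, int(val_frac * n))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    loss_fn = nn.MSELoss()  # MSE for the continuous action space
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(obs[train_idx]), actions[train_idx])
        loss.backward()
        opt.step()
        with torch.no_grad():
            val = loss_fn(policy(obs[val_idx]), actions[val_idx]).item()
        if val < best_val:  # keep parameters with least validation loss
            best_val = val
            best_state = {k: v.clone() for k, v in policy.state_dict().items()}
    policy.load_state_dict(best_state)
    return best_val
```

Because the structured policy has so few parameters, the full demonstration set fits in a single gradient batch, which is the stabilizing effect the Experiment Setup quote describes.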