Predicate Hierarchies Improve Few-Shot State Classification
Authors: Emily Jin, Joy Hsu, Jiajun Wu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PHIER in the CALVIN and BEHAVIOR robotic environments and show that PHIER significantly outperforms existing methods in few-shot, out-of-distribution state classification, and demonstrates strong zero- and few-shot generalization from simulated to real-world tasks. Our results demonstrate that leveraging predicate hierarchies improves performance on state classification tasks with limited data. |
| Researcher Affiliation | Academia | Emily Jin Stanford University Joy Hsu Stanford University Jiajun Wu Stanford University |
| Pseudocode | No | The paper describes the PHIER model and its components in detail across sections 3 and 4, outlining the object-centric image encoder, self-supervised learning, and hyperbolic latent space. However, it does so through descriptive text and mathematical formulations rather than presenting a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper mentions using and implementing components with third-party software like CLIP, BERT, and Geoopt, and provides links to implementations of baseline methods. However, it does not provide an explicit statement about releasing the source code for PHIER itself, nor does it include a direct link to a repository for the PHIER methodology. |
| Open Datasets | Yes | We evaluate PHIER on the state classification task in two robotics environments, CALVIN (Mees et al., 2022) and BEHAVIOR (Li et al., 2023a). Beyond the standard test settings, we focus on few-shot, out-of-distribution tasks involving unseen object-predicate combinations and novel predicates. Our results show that PHIER significantly outperforms existing methods, including both supervised approaches trained on the same amount of data and inference-only vision-language models (VLMs) trained on large corpora of real-world examples. PHIER improves upon the top-performing prior work in out-of-distribution tasks by 22.5 percentage points on CALVIN and 8.3 percentage points on BEHAVIOR. Notably, trained solely on simulated data, PHIER also outperforms supervised baselines on zero- and few-shot generalization to real-world state classification tasks by 7 percentage points and 10 percentage points respectively. Overall, we see PHIER as a promising solution to few-shot state classification, enabling generalization by leveraging representations grounded in predicate hierarchies. |
| Dataset Splits | Yes | We train on a balanced dataset of 200 examples (100 True, 100 False) for each in-distribution state. We then evaluate on balanced test sets of 50 examples for each state under both in-distribution and out-of-distribution settings. ... Real-world dataset. In addition, we evaluate on BEHAVIOR Vision Suite (Ge et al., 2024) (see Figure 3), a complex real-world benchmark that consists of diverse scenes and distractor objects. Specifically, compared to our train data, this one consists of 10 unseen combinations and 10 novel predicates, with 337 total examples. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments, such as GPU models, CPU types, or other detailed computing resources. |
| Software Dependencies | No | The paper mentions using specific software components like CLIP, BERT, and the Geoopt package, but it does not provide specific version numbers for these dependencies, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | For our model, we use the CLIP image encoder, CLIP text encoder, and BERT text encoder as our image, object text, and predicate text encoders, respectively. Our hyperbolic encoder consists of two hyperbolic linear layers with output dimensions of 256 and 128, and the final small MLP is a single layer. We use α = 0.05 as our triplet loss coefficient, λ = 10.0 as our triplet loss margin, β = 1.0 as our regularization loss coefficient, and γ = 0.1 as our regularization margin. We train all models for 50 epochs using the AdamW optimizer with a learning rate of 1e-4 using a gradual warmup scheduler and cosine annealing decay. For the few-shot setting, we provide 5 examples of each novel predicate and train for 20 epochs with the same optimizer and learning rate. |
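The reported training schedule (base learning rate 1e-4, gradual warmup followed by cosine annealing decay over 50 epochs) can be sketched as a standalone function. Note this is a minimal reconstruction of the described schedule, not the authors' code: the warmup length (`WARMUP_EPOCHS = 5`) and the final learning rate of 0 are assumptions, as the paper does not state them.

```python
import math

BASE_LR = 1e-4        # learning rate reported in the paper
TOTAL_EPOCHS = 50     # training length reported in the paper
WARMUP_EPOCHS = 5     # assumed; not specified in the paper

def lr_at(epoch: int) -> float:
    """Learning rate at a given 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < WARMUP_EPOCHS:
        # gradual (linear) warmup from 0 up to BASE_LR
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # cosine annealing from BASE_LR down toward 0 over the remaining epochs
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch implementation this would typically be wired up via `torch.optim.AdamW` together with a warmup wrapper around `CosineAnnealingLR`, but the closed-form version above makes the shape of the schedule explicit.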