Top-Down Guidance for Learning Object-Centric Representations

Authors: Junhong Zou, Xiangyu Zhu, Zhaoxiang Zhang, Zhen Lei

IJCAI 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | We evaluate TDGNet and compare it with current SOTA models on multiple tasks. We first introduce CLEVRTex [Karazija et al., 2021], MOVi-C [Greff et al., 2022] and COCO to evaluate the object-centric representations, where TDGNet outperforms current SOTA models in terms of common object discovery metrics. Furthermore, we expand the downstream task scope of TDGNet by applying it to the field of robotics. We introduce RoboNet [Dasari et al., 2020] and VP2 [Tian et al., 2023] to evaluate TDGNet on downstream tasks including video prediction and visual planning, demonstrating that TDGNet adapts well to these tasks.

Researcher Affiliation | Academia | (1) MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; (3) CAIR, HKSIS, Chinese Academy of Sciences, Hong Kong, China; (4) School of Computer Science and Engineering, Faculty of Innovation Engineering, M.U.S.T., Macau, China

Pseudocode | No | The paper describes the methodology using text and diagrams (Figure 2), but does not contain any clearly labeled pseudocode or algorithm blocks.

Open Source Code | No | Code will be available at https://github.com/zoujunhong/RHGNet.

Open Datasets | Yes | We first introduce CLEVRTex [Karazija et al., 2021], MOVi-C [Greff et al., 2022] and COCO [Caesar et al., 2018]. ... We introduce RoboNet [Dasari et al., 2020] and VP2 [Tian et al., 2023]...

Dataset Splits | Yes | The model's generalization ability is also evaluated using CLEVRTex-OOD and -CAMO, two out-of-distribution test sets. ... Following previous works [Song et al., 2024], the model predicts 8 future frames from 6 context frames with no conditions. ... The models are required to predict 10 future frames given 2 context frames and the robotic arm's actions as conditions.

Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments.

Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.

Experiment Setup | Yes | Here we use a combination of L1 loss and perceptual loss (LPIPS) [Zhang et al., 2018] for optimization. ... The top-down guidance loss L_TD is formulated as: L_TD := 1 − CosSim(P(F̂), F_H). Overall, TDGNet is trained by L, the weighted sum of the reconstruction loss and the top-down guidance loss: L = L_rec + λ_TD · L_TD. ... We set a threshold th to determine whether a conflict is large or not. ... As for the choice of th, we propose a heuristic method: for each trained model, we calculate the average distance between slots and set th to half of this distance.
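The loss formulation and threshold heuristic quoted in the Experiment Setup row can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the function names, the default λ_TD value, and the use of plain L1 for L_rec (the paper additionally adds an LPIPS perceptual term, omitted here) are all assumptions; P(F̂) is taken as an already-projected bottom-up feature passed in by the caller.

```python
import torch
import torch.nn.functional as F


def top_down_guidance_loss(projected_bottom_up, top_down_feat):
    # L_TD := 1 - CosSim(P(F_hat), F_H), averaged over all feature vectors.
    cos = F.cosine_similarity(projected_bottom_up, top_down_feat, dim=-1)
    return (1.0 - cos).mean()


def total_loss(recon, target, projected_bottom_up, top_down_feat, lambda_td=0.1):
    # L = L_rec + lambda_TD * L_TD. Here L_rec is plain L1; the paper also
    # adds an LPIPS perceptual term, omitted for brevity. lambda_td=0.1 is
    # a placeholder, not a value reported in the paper.
    l_rec = F.l1_loss(recon, target)
    l_td = top_down_guidance_loss(projected_bottom_up, top_down_feat)
    return l_rec + lambda_td * l_td


def conflict_threshold(slots):
    # Heuristic from the paper: th = half the average pairwise slot distance.
    # slots: (K, D) tensor of slot vectors for one trained model.
    d = torch.cdist(slots, slots)        # (K, K) pairwise Euclidean distances
    k = slots.shape[0]
    mean_d = d.sum() / (k * (k - 1))     # mean over off-diagonal entries only
    return 0.5 * mean_d
```

For two slots at distance 2, `conflict_threshold` returns 1.0 (half the mean pairwise distance), matching the heuristic described in the quote.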