Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images
Authors: Hongyu Yan, Yadong Mu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. We present two datasets for the proposed image-guided assembly task, namely the CLEVR-Assembly Dataset and LEGO-Assembly Dataset. Comprehensive experiments are conducted on both datasets, and the evaluations demonstrate that Neural Assembler outperforms the baselines across all performance metrics. The paper reports per-scene quantitative results on the CLEVR-Assembly Dataset (Table 2), results on the LEGO-Assembly Dataset, an ablation study, and real-world experiments. |
| Researcher Affiliation | Academia | Hongyu Yan, Yadong Mu* (Wangxuan Institute of Computer Technology, Peking University) |
| Pseudocode | No | The paper describes the methodology using text, diagrams, and mathematical equations, but it does not include a clearly labeled pseudocode block or algorithm section. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor does it provide a link to a code repository or mention code in supplementary materials. |
| Open Datasets | No | We present two datasets for the proposed image-guided assembly task, namely the CLEVR-Assembly Dataset and LEGO-Assembly Dataset. The CLEVR-Assembly Dataset is created using the CLEVR engine (Johnson et al. 2017), and LEGO-Assembly is synthesized with PyTorch3D. The paper describes the creation of these datasets but does not provide specific access information (e.g., links, DOIs, or repository names) for them. |
| Dataset Splits | No | We train Neural Assembler with full supervision on the generated dataset, where for each sample we have the ground-truth shape, texture, keypoint, mask, and rotation information of each brick, the number of bricks, and the relationship graph of bricks. The paper mentions training on a generated dataset but does not specify the training/validation/test splits (e.g., percentages or sample counts). |
| Hardware Specification | Yes | Training is conducted on an RTX 3090 GPU using AdamW, with an initial learning rate of 5e-4, decaying by 0.8 per epoch, a weight decay of 1e-3, and batch size 8 over 10 epochs for both datasets. |
| Software Dependencies | No | Our approach is implemented in a single-scale version for fair comparison with other works. It incorporates a CLIP (Radford et al. 2021) pre-trained ViT-B/16 image encoder, a PointNet-based (Qi et al. 2017) point cloud encoder, and a ResNet-18 (He et al. 2016) for texture encoding. We employ a two-layer residual network for brick number prediction. The shape, material, and IoU prediction heads are implemented as 3-layer MLPs with ReLU activations. Rotation prediction also uses a two-layer residual network, and our GCN architecture employs two message-passing layers. The paper mentions several models and architectural components (CLIP, PointNet, ResNet-18, PyTorch3D) but does not provide specific version numbers for any software libraries or frameworks. |
| Experiment Setup | Yes | Training is conducted on an RTX 3090 GPU using AdamW, with an initial learning rate of 5e-4, decaying by 0.8 per epoch, a weight decay of 1e-3, and batch size 8 over 10 epochs for both datasets. The objective function is L = α·Lcount + β·Lgraph + Lpose, where Lcount is the L1 loss between the predicted number of bricks and the ground truth countgt. The pose loss covers shape, texture, keypoint, mask, and rotation: Lpose = Lkeypoint + Lmask + γ1·Lrotation + γ2·Lshape + γ3·Ltexture + γ4·Lconfidence (Eq. 5), where Lkeypoint is the focal loss (Lin et al. 2017) computed between the predicted heatmap and the ground-truth heatmap generated by Kpsσi, Lmask is the focal and dice loss between the predicted mask and the ground-truth mask Mσi, Lrotation is the L1 loss between the predicted sine and cosine and the ground-truth sine and cosine of Rotσi, Lshape and Ltexture are cross-entropy losses for shape and texture classification, and Lconfidence is the L1 loss between the predicted confidence score and the IoU of the predicted mask with the ground-truth mask. The model prioritizes Lkeypoint and Lmask due to their critical impact on object detection, essential for accurate object interaction and identification in complex scenes. In contrast, Lrotation, Lshape, Ltexture, and Lconfidence are each assigned a reduced weight of 0.1. |
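
The composite objective quoted in the Experiment Setup row can be sketched as a simple scalar combination. This is a hedged illustration, not the authors' code: the per-term losses are assumed to be precomputed scalars, the weights `alpha` and `beta` are not reported in the excerpt (set to 1.0 here as placeholders), and only the 0.1 weight shared by the four minor terms is stated in the paper.

```python
def total_loss(l_count, l_graph, l_keypoint, l_mask,
               l_rotation, l_shape, l_texture, l_confidence,
               alpha=1.0, beta=1.0, gamma=0.1):
    """Combine per-term losses as L = alpha*L_count + beta*L_graph + L_pose,
    with L_pose = L_keypoint + L_mask
                  + gamma*(L_rotation + L_shape + L_texture + L_confidence).

    alpha/beta are placeholders (values not given in the excerpt);
    gamma=0.1 matches the stated reduced weight of the four minor terms.
    """
    l_pose = (l_keypoint + l_mask
              + gamma * (l_rotation + l_shape + l_texture + l_confidence))
    return alpha * l_count + beta * l_graph + l_pose
```

With every term set to 1.0 and the default weights, the pose loss contributes 1 + 1 + 0.1·4 = 2.4, giving a total of 4.4.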
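
The reported learning-rate schedule (initial 5e-4, multiplied by 0.8 after each epoch, over 10 epochs) can be written down directly. A minimal sketch, assuming a plain exponential decay applied once per epoch (the function name `lr_at_epoch` is illustrative, not from the paper):

```python
def lr_at_epoch(epoch, base_lr=5e-4, decay=0.8):
    """Learning rate after `epoch` decay steps: base_lr * decay**epoch."""
    return base_lr * decay ** epoch

# Schedule over the 10 reported epochs, starting at 5e-4.
schedule = [lr_at_epoch(e) for e in range(10)]
```

In a PyTorch training loop this behavior corresponds to pairing `torch.optim.AdamW` with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)`, stepped once per epoch.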