CapeX: Category-Agnostic Pose Estimation from Textual Point Explanation
Authors: Matan Rusanovsky, Or Hirschorn, Shai Avidan
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our novel approach using the MP-100 benchmark, a comprehensive dataset covering over 100 categories and 18,000 images. Under a 1-shot setting, our solution achieves a notable performance boost of 1.26%, establishing a new state-of-the-art for CAPE. Additionally, we enhance the dataset by providing text description annotations for both training and testing. We also include alternative text annotations specifically for testing the model's ability to generalize across different textual descriptions, further increasing its value for future research. Our code and dataset are publicly available at https://github.com/matanr/capex. |
| Researcher Affiliation | Academia | Matan Rusanovsky, Or Hirschorn, and Shai Avidan, Tel Aviv University (EMAIL and EMAIL) |
| Pseudocode | No | The paper describes the architecture, loss functions, and experimental details with equations and diagrams but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and dataset are publicly available at https://github.com/matanr/capex. |
| Open Datasets | Yes | We validate our novel approach using the MP-100 benchmark, a comprehensive dataset covering over 100 categories and 18,000 images. Our code and dataset are publicly available at https://github.com/matanr/capex. We provide an enhanced version of the MP-100 dataset with textual annotations for the keypoints in all categories, enriching the benchmarking capabilities for category-agnostic pose estimation. |
| Dataset Splits | Yes | The dataset is divided into five separate splits for training and evaluation. Importantly, each split ensures that the categories used for training, validation, and testing are mutually exclusive, ensuring that the categories used for evaluation are unseen during the training phase. |
| Hardware Specification | Yes | Our model requires 6.5 GB of GPU memory and takes roughly 13 hours to train for each split, on a machine equipped with an NVIDIA RTX A5000 GPU. |
| Software Dependencies | No | The paper mentions "MMPose framework Contributors (2020)" but does not provide a specific version number for the framework itself or for any other key software libraries like PyTorch or TensorFlow, which are essential for reproducibility. |
| Experiment Setup | Yes | The architecture is implemented within the MMPose framework Contributors (2020), trained using the Adam optimizer for 200 epochs with a batch size of 16. The initial learning rate is 10^-5, reduced by a factor of 10 at the 160th and 180th epochs. C_i is 768 in Swin V2-T, C_t is 768 in gte-base-v1.5. C and K are set to 256 and 100, respectively. |