ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Authors: Mingda Jia, Liming Zhao, Ge Li, Yun Zheng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | ContextHOI achieves state-of-the-art performance on the HICO-DET and V-COCO benchmarks. For further validation, we construct a novel benchmark, HICO-ambiguous, a subset of HICO-DET containing images with occluded or impaired instance cues. Extensive experiments across all benchmarks, complemented by visualizations, underscore the enhancements provided by ContextHOI, especially in recognizing interactions involving occluded or blurred instances.
Researcher Affiliation | Collaboration | (1) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University; (2) Alibaba Group. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methods in detail across Section 3, titled 'Method', but does not present any distinct pseudocode or algorithm blocks with structured steps labeled as such.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide a link to a code repository. Phrases such as 'We release our code...' or similar are not present.
Open Datasets | Yes | We first conduct experiments on two widely-used HOI detection benchmarks, HICO-DET (Chao et al. 2018) and V-COCO (Gupta and Malik 2015).
Dataset Splits | Yes | HICO-DET comprises 80 object categories, 117 interaction categories, and 600 HOI triplet categories, with 38,118 training images and 9,658 validation images. V-COCO contains 5,400 train-val images and 4,946 validation images, covering 80 object categories, 29 verb categories, and 263 interaction triplet combinations. ... A total of 659 images were selected, and together with their original annotations, they comprise the ambiguous benchmark.
Hardware Specification | Yes | ContextHOI is trained on a single Tesla A100 GPU with batch size 16.
Software Dependencies | No | The paper mentions using transformer detectors introduced by DETR, backbones such as ResNet50 and ResNet101, CLIP ViT-L/14 as a semantic teacher, and an AdamW optimizer. However, it does not specify version numbers for any of the underlying software components, such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Our detector query dimension Nq is 64 for both HICO-DET and V-COCO. The encoders in the feature extractor have 6 layers, and the instance decoder, context extractor, and context aggregator are each implemented as a 3-layer transformer decoder. The transformer hidden dimension C of all the components is 256. ... The learnable parameter τ in LIC is initialized to 0.5. ContextHOI is trained for 60 epochs with an AdamW optimizer, an initial learning rate of 1e-4, and a 10× decay at 40 epochs. ... For the spatial constraint losses, we set the loss coefficients λfc, λrc, and λic to 4, 1, and 4, respectively.
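The reported training hyperparameters can be captured in a short sketch. Since the authors release no code, everything below is illustrative: the helper names (`lr_at`, `total_spatial_loss`) are ours, and the "10× decay at 40 epochs" is interpreted as a one-time learning-rate drop by a factor of 10, a common DETR-style schedule.

```python
# Hedged sketch of ContextHOI's reported training configuration.
# Hyperparameter values come from the paper; the helpers are illustrative,
# not the authors' implementation (which is not released).

HIDDEN_DIM = 256        # transformer hidden dimension C
NUM_QUERIES = 64        # detector query dimension Nq
TOTAL_EPOCHS = 60
INITIAL_LR = 1e-4
DECAY_EPOCH = 40        # interpreted as a 10x learning-rate drop here
DECAY_FACTOR = 0.1

# spatial-constraint loss coefficients (lambda_fc, lambda_rc, lambda_ic)
LAMBDA_FC, LAMBDA_RC, LAMBDA_IC = 4.0, 1.0, 4.0


def lr_at(epoch: int) -> float:
    """Step schedule: initial lr until epoch 40, then divided by 10."""
    return INITIAL_LR * (DECAY_FACTOR if epoch >= DECAY_EPOCH else 1.0)


def total_spatial_loss(l_fc: float, l_rc: float, l_ic: float) -> float:
    """Weighted sum of the three spatial constraint losses."""
    return LAMBDA_FC * l_fc + LAMBDA_RC * l_rc + LAMBDA_IC * l_ic
```

In a full training loop, `total_spatial_loss` would be added to the detection loss each step, and `lr_at(epoch)` would drive the optimizer's learning rate across the 60 epochs.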