Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

L-Man: A Large Multi-modal Model Unifying Human-centric Tasks

Authors: Jialong Zuo, Ying Nie, Tianyu Guo, Huaxin Zhang, Jiahao Hong, Nong Sang, Changxin Gao, Kai Han

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By tuning on Human Ins, our model shows significant superiority on human-centric tasks compared with existing large multi-modal models, and also achieves better results on downstream datasets compared with respective task-specific models. ... Our L-Man trained on Human Ins shows significant superiority compared to some existing general models (Dai et al. 2023; Bai et al. 2023; Liu et al. 2023a) on a range of human-centric tasks, as shown in Figure 1. Also, it achieves even better performance compared with some strong task-specific baselines. ... Experiments: Implementation Details ... Quantitative Results: Comparison With Other LMMs. ... Ablation Study: Number of Query Tokens.
Researcher Affiliation | Collaboration | (1) National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; (2) Huawei Noah's Ark Lab
Pseudocode | No | The paper describes the model architecture and training strategy using text and diagrams (Figure 4), but does not contain a dedicated pseudocode block or algorithm section.
Open Source Code | No | The paper does not explicitly state that source code for the methodology is released, nor does it provide a direct link to a code repository. It mentions that "More training details can be found in the supplement.", but this is not a clear affirmative statement of code release.
Open Datasets | Yes | Specifically, we first construct a large-scale language-image instruction-following dataset named Human Ins based on 20 existing open datasets from 6 diverse downstream tasks, which provides sufficient and diverse data to implement multi-modal training. ... We build the Human Ins dataset to train our L-Man. ... Following the standard protocols, 20 datasets containing 908,587 images are collected as the samples in our dataset. The specific details are shown in the supplement. ... For action recognition, we compare the performance on UCF101 (Soomro, Zamir, and Shah 2012) and Stanford40 (Yao et al. 2011). ... For pedestrian attribute recognition, we compare the performance on PA100K (Liu et al. 2017).
Dataset Splits | No | We have partitioned the training and testing sets for Human Ins to avoid data leakage issues. ... Therefore, we randomly select 120 images from each dataset's test split, and require each model to predict the answers based on the input images and task-relevant instructions. We utilize the ground truths to manually evaluate the quality of the answers. The details of chosen images, instructions and manual evaluation method are shown in the supplement.
Hardware Specification | Yes | We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research.
Software Dependencies | No | The paper mentions MindSpore and CANN as software used in the acknowledgements, but it does not specify version numbers for these components or any other software dependencies crucial for replication.
Experiment Setup | Yes | L-Man consists of a pre-trained vision encoder, i.e., CLIP-ViT-L/14 (Radford et al. 2021), a query adapter initialized by CLIP-Xformer (Radford et al. 2021), and an LLM, Vicuna-7B (Zheng et al. 2023). The number of layers in the query adapter is half of the number of layers in the vision encoder. The number of query tokens is set to 128. More training details can be found in the supplement. ... In this stage, we unfreeze the whole weights of the vision encoder (Radford et al. 2021), and continue to update the pre-trained weights of the query adapter and partial LLM (Zheng et al. 2023). ... Denoting λ as a hyper-parameter, the overall training objective is the weighted sum of the above objectives: L = L_lic + λ · L_vmlm.
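The quoted training objective is a weighted sum of two losses. The sketch below is only an illustration of that combination, assuming scalar loss values; the names `loss_lic`, `loss_vmlm`, and `lam` are hypothetical and not taken from the paper:

```python
def overall_loss(loss_lic: float, loss_vmlm: float, lam: float) -> float:
    """Weighted sum of the two objectives: L = L_lic + lambda * L_vmlm."""
    return loss_lic + lam * loss_vmlm


if __name__ == "__main__":
    # Example with hypothetical loss values and lambda = 0.5:
    print(overall_loss(2.0, 1.0, 0.5))  # 2.5
```

In practice the two terms would be tensors produced by the respective heads, and λ would be tuned as a hyper-parameter; the paper does not report its value.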