MMEgo: Towards Building Egocentric Multimodal LLMs for Video QA
Authors: Hanrong Ye, Haotian Zhang, Erik Daxberger, Lin Chen, Zongyu Lin, Yanghao Li, Bowen Zhang, Haoxuan You, Dan Xu, Zhe Gan, Jiasen Lu, Yinfei Yang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first conduct experiments on our Ego Memoria benchmark, primarily comparing three models: LLaVA-OV (Li et al., 2024a), its fine-tuned version using our MM-Ego SFT data mixture (referred to as Ego SFT), and our MM-Ego model, which incorporates the proposed Memory Pointer Prompting mentioned in Section 2.2.2. We show the Ego Memoria accuracy in the first row of Table 3. |
| Researcher Affiliation | Collaboration | Hanrong Ye1, Haotian Zhang2, Erik Daxberger2, Lin Chen2, Zongyu Lin3, Yanghao Li2, Bowen Zhang2, Haoxuan You2, Dan Xu1, Zhe Gan2, Jiasen Lu2, Yinfei Yang2 — 1CSE, HKUST; 2Apple; 3UCLA |
| Pseudocode | No | The paper describes methods in prose and with diagrams (e.g., Figure 4 describes the Memory Pointer Prompting mechanism in two steps: global glimpse and fallback), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured code-like formatting for its procedures. |
| Open Source Code | No | The REPRODUCIBILITY STATEMENT says: 'We provide a detailed explanation of the data synthesis process in our data engine in Section 2.1. We also elaborate on our model design in Section 2.2.2. Additionally, we outline the implementation details, including the training hyperparameters in Section 3.2.' This describes reproducibility details but does not state that source code is released or provide a link to it. |
| Open Datasets | Yes | First, as there is a lack of QA data for egocentric video understanding, we automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D (Grauman et al., 2022) based on human-annotated data. |
| Dataset Splits | Yes | We partition the dataset into training and testing sets according to the official Ego4D episodic memory task. ... We divide the videos into seven different length ranges: 0.5 to 1 min, 1 to 2 min, 2 to 4 min, 4 to 10 min, 10 to 20 min, 20 to 40 min, and 40 to 60 min. We aim to balance the number of samples in different video lengths. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, or cloud resources) used for running its experiments. |
| Software Dependencies | No | The paper mentions specific models like Qwen2-7B, LLaVA-OV 7B, SigLIP-so400M ViT, and GPT-4o, but does not provide specific version numbers for underlying software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries used for implementation. |
| Experiment Setup | Yes | The model is trained for 1 epoch with a base learning rate of 1e-5, using a cosine scheduler. The batch size is set to 128. We sample a maximum of 300 frames (N = 300) and select 32 visual embeddings in the proposed memory pointer prompting mechanism. By default, we set the explore-exploit balancing parameter α to 0.1. Greedy decoding is used in generation. |
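The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative summary only: the paper releases no code, so every field name below is hypothetical, and only the values come from the quoted text.

```python
# Hypothetical training configuration mirroring the setup quoted above.
# Field names are illustrative; the paper does not release code.
train_config = {
    "epochs": 1,
    "base_learning_rate": 1e-5,
    "lr_scheduler": "cosine",
    "batch_size": 128,
    "max_sampled_frames": 300,          # N = 300 frames sampled per video
    "selected_visual_embeddings": 32,   # chosen by memory pointer prompting
    "explore_exploit_alpha": 0.1,       # balancing parameter alpha
    "decoding": "greedy",               # greedy decoding at generation time
}

# Sanity checks that the sketch matches the reported values.
assert train_config["base_learning_rate"] == 1e-5
assert train_config["max_sampled_frames"] / train_config["selected_visual_embeddings"] == 9.375
```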