CoMemo: LVLMs Need Image Context with Image Memory
Authors: Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/. |
| Researcher Affiliation | Academia | 1Shanghai Artificial Intelligence Laboratory 2Tsinghua University 3The Chinese University of Hong Kong. Correspondence to: Weijie Su <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Mixin Layers |
| Open Source Code | No | The abstract provides a project page link: "Project page is available at https://lalbj.github.io/projects/CoMemo/." However, this is a project page, not a direct link to a source-code repository, and the paper makes no explicit statement that code for the described methodology is released. |
| Open Datasets | Yes | Table 6 (pretraining datasets): Short Caption: Laion (en&zh) (Schuhmann et al., 2022a), COYO (Byeon et al., 2022), COCO (Lin et al., 2014b); OCR: Wukong-OCR (Gu et al., 2022), LaionCOCO-OCR (Schuhmann et al., 2022b); Detection: GRIT (Peng et al., 2023), Objects365 (Shao et al., 2019); Conversation: All-Seeing (en&zh) (Wang et al., 2023b); plus image-text instruction data (see Table 7). Table 7 (instruction-tuning datasets): General QA: VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), OKVQA (Marino et al., 2019), VSR (Liu et al., 2023a); Science: AI2D (Kembhavi et al., 2016), ScienceQA (Lu et al., 2022a), Chemistry Data (Li et al., 2024), TQA (Kembhavi et al., 2017); Medical: PMC-VQA (Zhang et al., 2023a), VQA-RAD (Lau et al., 2018), VQA-Med (Ben Abacha et al., 2019), Medical-Diff-VQA (Hu et al., 2023), PathVQA (He et al., 2020), SLAKE (Liu et al., 2021), PMC-CaseReport (Wu, 2023); Table & Chart: ChartQA (Masry et al., 2022a), LRV-Instruction (Liu et al., 2023b), PlotQA (Methani et al., 2020), Unichart (Masry et al., 2023), MMC-Inst (Liu et al., 2023c), DVQA (Kafle et al., 2018), TabMWP (Lu et al., 2022b), FigureQA (Kahou et al., 2017), MapQA (Chang et al., 2022), SciTSR (Chi et al., 2019), Fintabnet (Zheng et al., 2021); Mathematics: CLEVR (Johnson et al., 2017), MetaMath (Yu et al., 2023), GeoQA+ (Cao & Xiao, 2022), Geometry3k (Lu et al., 2021), GeoS (Seo et al., 2015), Unigeo (Chen et al., 2022), Super-CLEVR (Li et al., 2023), MathQA (Amini et al., 2019); Knowledge: Art500k (Mao et al., 2017), MovieNet (Huang et al., 2020), KonIQ-10k (Hosu et al., 2020), KVQA (Shah et al., 2019), ViQuAE (Lerner et al., 2022); OCR: InfoVQA (Mathew et al., 2022), TextVQA (Singh et al., 2019a), ArT (Chng et al., 2019), CASIA (Liu et al., 2011), Chart-to-text (Kantharaj et al., 2022), COCO-Text (Veit et al., 2016), CTW (Yuan et al., 2019), EATEN (Guo et al., 2019), ICDAR2019-LSVT (Sun et al., 2019), ICPR MTWI (He et al., 2018), NAF (Davis et al., 2019), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), LLaVAR (Zhang et al., 2023b), HME-100k (Yuan et al., 2022), POIE (Kuang et al., 2023), SROIE (Huang et al., 2019), ST-VQA (Biten et al., 2019), EST-VQA (Wang et al., 2020), IAM (Marti & Bunke, 2002); Document: DocVQA (Clark & Gardner, 2017), DocReason25k (Hu et al., 2024); Grounding: RefCOCO (Kazemzadeh et al., 2014), RefCOCO+ (Kazemzadeh et al., 2014), RefCOCOg (Kazemzadeh et al., 2014), RD-BoxCoT (Chen et al., 2023); Conversation: ALLaVA (Chen et al., 2024a), LAION-GPT4V (LAION, 2023), MMDU (Liu et al., 2024d), TextOCR-GPT4V (Carter, 2024); Detection: Objects365 (Shao et al., 2019), V3Det (Wang et al., 2023a). |
| Dataset Splits | No | The paper mentions using |
| Hardware Specification | Yes | Table 5 (training efficiency of 2B-parameter models): LVLM-X: batch size 1024, 64 A100 GPUs, 0.123 train steps/s, 15.71 train samples/s; LVLM-S: batch size 1024, 64 A100 GPUs, 0.105 train steps/s, 13.4 train samples/s; CoMemo: batch size 1024, 64 A100 GPUs, 0.096 train steps/s, 12.26 train samples/s. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). It mentions foundational models such as InternLM-1.8B and InternViT-300M, but not the software environment, with specific versions, in which these were integrated or run. |
| Experiment Setup | Yes | Table 5 (hyperparameters for training and inference; values given as Pretraining Phase 1 / Pretraining Phase 2 / Finetuning): Max sequence length 8192 / 8192 / 8192; Max tiles per image 12 / 12 / 12; Optimizer AdamW (all phases); Learning rate 1×10⁻⁴ / 1×10⁻⁴ / 4×10⁻⁵; Weight decay 0.01; Optimizer momentum β1, β2 = 0.9, 0.999; Learning rate schedule constant with warmup / constant with warmup / cosine decay; Warmup ratio 0.03; Training steps 2000 / 2000 / 9000; Batch size 1024; Number of mixin layers 4; Trainable weights: LVLM-S: MLP, LVLM-X: mixin layers + MLP, CoMemo: mixin layers + MLP (gate frozen in Phase 2) during pretraining; all weights during finetuning. |
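The Pseudocode row records that the paper specifies its mixin layers as Algorithm 1, and the Experiment Setup row notes a learnable gate that is frozen in Phase 2. The paper's actual algorithm is not reproduced here; as a rough illustration only, a gated cross-attention layer over an image-memory bank (Flamingo-style tanh gating, single head, NumPy) might look like the following. All shapes, the scalar gate, and the gating form are assumptions, not the paper's specification:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixin_layer(hidden, image_memory, Wq, Wk, Wv, gate):
    """Hedged sketch of a gated cross-attention mixin layer.

    hidden:       (T, d) text hidden states from the LLM layer below
    image_memory: (M, d) image-memory features
    gate:         scalar parameter; tanh(gate) scales the residual, so a
                  zero-initialized gate leaves the base LLM unchanged
    """
    q = hidden @ Wq                    # queries from text hidden states
    k = image_memory @ Wk              # keys from image memory
    v = image_memory @ Wv              # values from image memory
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return hidden + np.tanh(gate) * (attn @ v)   # gated residual update
```

With `gate = 0.0` the layer is an identity mapping, which is the usual motivation for this kind of gating: newly inserted layers start out inert and are blended in during training.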
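The learning-rate schedules in the Experiment Setup row (constant with warmup for pretraining, cosine decay for finetuning, warmup ratio 0.03) are standard and can be written down directly. This is a minimal stdlib sketch of those two schedules using the reported values; the exact warmup/decay formulas used by the authors' training framework are an assumption:

```python
import math

def lr_constant_with_warmup(step, base_lr=1e-4, total_steps=2000, warmup_ratio=0.03):
    # Pretraining phases 1-2: linear warmup, then hold at the base LR.
    warmup = max(1, int(warmup_ratio * total_steps))
    return base_lr * min(1.0, (step + 1) / warmup)

def lr_cosine_decay(step, base_lr=4e-5, total_steps=9000, warmup_ratio=0.03):
    # Finetuning: linear warmup, then cosine decay toward zero.
    warmup = max(1, int(warmup_ratio * total_steps))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Both schedules reach their base learning rate at the end of warmup (step 60 of 2000 and step 270 of 9000 with ratio 0.03); the cosine schedule then decays to approximately zero by step 9000.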