CoMemo: LVLMs Need Image Context with Image Memory
Authors: Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/. |
| Researcher Affiliation | Academia | 1Shanghai Artificial Intelligence Laboratory 2Tsinghua University 3The Chinese University of Hong Kong. Correspondence to: Weijie Su <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Mixin Layers |
| Open Source Code | No | The abstract provides a project page link: "Project page is available at https://lalbj.github.io/projects/CoMemo/." However, this is a project page, not a direct link to a source-code repository, and the paper makes no explicit statement that code for the described methodology is released. |
| Open Datasets | Yes | Table 6 (pretraining datasets): Short Caption: Laion (en&zh) (Schuhmann et al., 2022a), COYO (Byeon et al., 2022), COCO (Lin et al., 2014b); OCR: Wukong-OCR (Gu et al., 2022), LaionCOCO-OCR (Schuhmann et al., 2022b); Detection: GRIT (Peng et al., 2023), Objects365 (Shao et al., 2019); Conversation: All-Seeing (en&zh) (Wang et al., 2023b); plus image-text instruction data (see Table 7). Table 7 (instruction-tuning datasets): General QA: VQAv2 (Goyal et al., 2017), GQA (Hudson & Manning, 2019), OKVQA (Marino et al., 2019), VSR (Liu et al., 2023a); Science: AI2D (Kembhavi et al., 2016), ScienceQA (Lu et al., 2022a), Chemistry Data (Li et al., 2024), TQA (Kembhavi et al., 2017); Medical: PMC-VQA (Zhang et al., 2023a), VQA-RAD (Lau et al., 2018), VQA-Med (Ben Abacha et al., 2019), Medical-Diff-VQA (Hu et al., 2023), PathVQA (He et al., 2020), SLAKE (Liu et al., 2021), PMC-CaseReport (Wu, 2023); Table & Chart: ChartQA (Masry et al., 2022a), LRV-Instruction (Liu et al., 2023b), PlotQA (Methani et al., 2020), Unichart (Masry et al., 2023), MMC-Inst (Liu et al., 2023c), DVQA (Kafle et al., 2018), TabMWP (Lu et al., 2022b), FigureQA (Kahou et al., 2017), MapQA (Chang et al., 2022), SciTSR (Chi et al., 2019), Fintabnet (Zheng et al., 2021); Mathematics: CLEVR (Johnson et al., 2017), MetaMath (Yu et al., 2023), GeoQA+ (Cao & Xiao, 2022), Geometry3k (Lu et al., 2021), GeoS (Seo et al., 2015), Unigeo (Chen et al., 2022), Super-CLEVR (Li et al., 2023), MathQA (Amini et al., 2019); Knowledge: Art500k (Mao et al., 2017), MovieNet (Huang et al., 2020), KonIQ-10k (Hosu et al., 2020), KVQA (Shah et al., 2019), ViQuAE (Lerner et al., 2022); OCR: InfoVQA (Mathew et al., 2022), TextVQA (Singh et al., 2019a), ArT (Chng et al., 2019), CASIA (Liu et al., 2011), Chart-to-text (Kantharaj et al., 2022), COCO-Text (Veit et al., 2016), CTW (Yuan et al., 2019), EATEN (Guo et al., 2019), ICDAR2019-LSVT (Sun et al., 2019), ICPR MTWI (He et al., 2018), NAF (Davis et al., 2019), ReCTS (Zhang et al., 2019), TextOCR (Singh et al., 2021), LLaVAR (Zhang et al., 2023b), HME-100k (Yuan et al., 2022), POIE (Kuang et al., 2023), SROIE (Huang et al., 2019), ST-VQA (Biten et al., 2019), EST-VQA (Wang et al., 2020), IAM (Marti & Bunke, 2002); Document: DocVQA (Clark & Gardner, 2017), DocReason25k (Hu et al., 2024); Grounding: RefCOCO (Kazemzadeh et al., 2014), RefCOCO+ (Kazemzadeh et al., 2014), RefCOCOg (Kazemzadeh et al., 2014), RD-BoxCoT (Chen et al., 2023); Conversation: ALLaVA (Chen et al., 2024a), LAION-GPT4V (LAION, 2023), MMDU (Liu et al., 2024d), TextOCR-GPT4V (Carter, 2024); Detection: Objects365 (Shao et al., 2019), V3Det (Wang et al., 2023a). |
| Dataset Splits | No | The paper mentions using |
| Hardware Specification | Yes | Table 5 (training efficiency of 2B-parameter models): LVLM-X: batch size 1024, 64 A100 GPUs, 0.123 train steps/s, 15.71 train samples/s; LVLM-S: batch size 1024, 64 A100 GPUs, 0.105 train steps/s, 13.4 train samples/s; CoMemo: batch size 1024, 64 A100 GPUs, 0.096 train steps/s, 12.26 train samples/s. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). It mentions foundational models such as InternLM-1.8B and InternViT-300M, but not the software environment, with specific versions, in which these were integrated or run. |
| Experiment Setup | Yes | Table 5 (hyperparameters for training and inference; values given as Pretraining Phase 1 / Pretraining Phase 2 / Finetuning): Max sequence length 8192 / 8192 / 8192; Max tiles per image 12 / 12 / 12; Optimizer AdamW (all phases); Learning rate 1×10⁻⁴ / 1×10⁻⁴ / 4×10⁻⁵; Weight decay 0.01; Optimizer momentum β1, β2 = 0.9, 0.999; Learning rate schedule constant with warmup / constant with warmup / cosine decay; Warmup ratio 0.03; Training steps 2000 / 2000 / 9000; Batch size 1024; Number of mixin layers 4; Trainable weights: LVLM-S: MLP, LVLM-X: mixin layers + MLP, CoMemo: mixin layers + MLP (gate frozen in Phase 2) during pretraining; all weights during finetuning. |
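The Pseudocode row records that the paper specifies its mixin layers as Algorithm 1, and the Experiment Setup row notes a learnable gate that is frozen in Phase 2. The paper's actual algorithm is not reproduced here; as a rough illustration only, a gated cross-attention layer over an image-memory bank (Flamingo-style tanh gating, single head, NumPy) might look like the following. All shapes, the scalar gate, and the gating form are assumptions, not the paper's specification:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mixin_layer(hidden, image_memory, Wq, Wk, Wv, gate):
    """Hedged sketch of a gated cross-attention mixin layer.

    hidden:       (T, d) text hidden states from the LLM layer below
    image_memory: (M, d) image-memory features
    gate:         scalar parameter; tanh(gate) scales the residual, so a
                  zero-initialized gate leaves the base LLM unchanged
    """
    q = hidden @ Wq                    # queries from text hidden states
    k = image_memory @ Wk              # keys from image memory
    v = image_memory @ Wv              # values from image memory
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return hidden + np.tanh(gate) * (attn @ v)   # gated residual update
```

With `gate = 0.0` the layer is an identity mapping, which is the usual motivation for this kind of gating: newly inserted layers start out inert and are blended in during training.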
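The learning-rate schedules in the Experiment Setup row (constant with warmup for pretraining, cosine decay for finetuning, warmup ratio 0.03) are standard and can be written down directly. This is a minimal stdlib sketch of those two schedules using the reported values; the exact warmup/decay formulas used by the authors' training framework are an assumption:

```python
import math

def lr_constant_with_warmup(step, base_lr=1e-4, total_steps=2000, warmup_ratio=0.03):
    # Pretraining phases 1-2: linear warmup, then hold at the base LR.
    warmup = max(1, int(warmup_ratio * total_steps))
    return base_lr * min(1.0, (step + 1) / warmup)

def lr_cosine_decay(step, base_lr=4e-5, total_steps=9000, warmup_ratio=0.03):
    # Finetuning: linear warmup, then cosine decay toward zero.
    warmup = max(1, int(warmup_ratio * total_steps))
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Both schedules reach their base learning rate at the end of warmup (step 60 of 2000 and step 270 of 9000 with ratio 0.03); the cosine schedule then decays to approximately zero by step 9000.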