Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
Authors: Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed Probe-Analyze-Refine workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. A comprehensive set of over 100 experiments demonstrated the suite's usability and extensibility, while also uncovering insights into the interplay between data quality, diversity, model behavior, and computational costs. |
| Researcher Affiliation | Industry | Alibaba Group. Correspondence to: Yaliang Li <EMAIL>. |
| Pseudocode | No | The paper describes a "Probe-Analyze-Refine workflow" and outlines its steps in paragraph text (Section 3.2 and subsections), but it does not present any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure. Reproducibility is essential for validating research outcomes. To facilitate this, we have organized detailed descriptions within the appendix of our paper. Key components of our experimental setup, including implementation details such as datasets and training configurations for the image-to-text, text-to-video and image-text pre-training use cases can be found in Appendix D.4, Appendix D.5 and Appendix D.6 respectively. We also provide details into the methodologies for combining multiple operators based on their correlations in Appendix D.1, as well as descriptions of performance metrics (Appendix D.2). Furthermore, Appendix D.3 outlines the functionalities and statistics of the Data-Juicer OPs utilized in our experiments. All codes, datasets, and models of our work are openly accessible and actively maintained at https://github.com/modelscope/datajuicer/blob/main/docs/Sandbox.md. |
| Open Datasets | Yes | DataComp (Gadre et al., 2023) introduces a benchmark to filter out high-quality data from 12.8 billion image-text pairs in Common Crawl to train better CLIP models. For the second task, text-to-video generation, we adopt the advanced DiT-based model, EasyAnimate (Xu et al., 2024b), which originally integrates diverse datasets totaling 1.2M instances from InternVid (Wang et al., 2023) (606k), Panda-70M (Chen et al., 2024e) (605k), and MSR-VTT (Xu et al., 2016) (6k). For the third task, image-text pre-training, we adopt the well-studied CLIP model (Radford et al., 2021). Specifically, we utilize data from the small track of the DataComp competition (Gadre et al., 2023) |
| Dataset Splits | Yes | Our first task focuses on foundational image understanding ability, by experimenting on Mini-Gemini (MGM-2B), a state-of-the-art (SOTA) 2 billion parameter multimodal LLM (Li et al., 2024b). The training protocol for MGM-2B involves two stages: pretraining and fine-tuning. Our experimental focus lies in the pretraining phase, which seeks to harmonize visual and textual representations. We utilize the original pretraining dataset as our original dataset D, consisting of approximately 1.2M instances. We set the size of D_sample as 200k. The single-OP data pools D_i and multi-OP data pools D_S are capped at a maximum of 200k instances, ensuring consistency of data pool size. To match the down-sampling rate used during pretraining, the fine-tuning dataset is sampled into a 240k instance subset. For the second task, text-to-video generation, we adopt the advanced DiT-based model, EasyAnimate (Xu et al., 2024b), which originally integrates diverse datasets totaling 1.2M instances from InternVid (Wang et al., 2023) (606k), Panda-70M (Chen et al., 2024e) (605k), and MSR-VTT (Xu et al., 2016) (6k). The studied baseline model is trained on a subset of 40k instances, employing LoRA (Hu et al., 2022) for efficiency. As a result, the size of D is 1.2M, and the sizes of D_sample, the single-OP data pools D_i, and the multi-OP data pools D_S are all 40k. For the third task, image-text pre-training, we adopt the well-studied CLIP model (Radford et al., 2021). Specifically, we utilize data from the small track of the DataComp competition (Gadre et al., 2023) and adhere to its evaluation metrics, which include 40 distinct evaluation subsets. Due to some broken links, we successfully downloaded 85.2% of the dataset, resulting in a total of 10.9 million samples as our D. All baseline models were trained on an equivalent volume of data as used in the contrastive experiments, sampled randomly from this dataset. |
| Hardware Specification | Yes | Single-OP and OP-combination experiments are each trained on only 1 A100 GPU, so we increase the number of gradient accumulation steps from 4 to 32 to keep the same global batch size. For experiments of duplicating high-quality datasets, 8 A100 GPUs are involved to train the model, and the number of gradient accumulation steps is restored to 4. |
| Software Dependencies | No | The paper mentions several software components and frameworks such as "Data-Juicer", "Mini-Gemini", "EasyAnimate", "ModelScope", "VBench", "MMBench", "TextVQA", "MME", "Transformers", "Diffusers", "NeMo", "MMagic", "ESPNet", "PyTorch", "LoRA", "VideoCrafter-2.0", "HPSv2.1", "InternVideo2", and "Adam optimizer", but it does not specify any version numbers for these components. |
| Experiment Setup | Yes | We keep every training setting (e.g. learning rate scheduler, global batch size) the same as the original model except for training datasets and training devices. Single-OP and OP-combination experiments are each trained on only 1 A100 GPU, so we increase the number of gradient accumulation steps from 4 to 32 to keep the same global batch size. During training, we maintain a video resolution of 256×256, sample every other frame, and randomly select sequences of 16 consecutive frames. The training process involves performing a backward pass for the loss of every 8 samples, with single-OP and OP-combination experiments trained on a single GPU with a batch size of 8 for 5k steps, amounting to approximately 16 GPU hours per training run. Experiments for duplicating high-quality data, as well as larger-scale training, are conducted with a batch size of 1 across 8 GPUs. The models employ the Adam optimizer for training, with a learning rate set to 2×10^-5, weight decay parameter at 3×10^-2, and epsilon configured to 10^-10. |
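The global-batch-size bookkeeping quoted above (8 GPUs with 4 accumulation steps vs. 1 GPU with 32) can be sketched as a small helper. This is a minimal illustration, not code from the paper; the function name `accumulation_steps` and the `adam_config` dict are hypothetical, with the hyperparameter values taken from the quoted setup.

```python
def accumulation_steps(global_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Gradient-accumulation steps needed to preserve a fixed global batch size.

    global_batch = per_device_batch * num_devices * accumulation_steps,
    so reducing num_devices must be compensated by more accumulation steps.
    """
    per_step = per_device_batch * num_devices
    if global_batch % per_step != 0:
        raise ValueError("global batch size must be divisible by per-step batch")
    return global_batch // per_step


# With 8 GPUs and 4 accumulation steps, a per-device batch of 8 gives a
# global batch of 8 * 8 * 4 = 256. Dropping to 1 GPU requires 32 steps,
# matching the paper's "from 4 to 32" adjustment.
GLOBAL_BATCH = 8 * 8 * 4  # 256

# Adam hyperparameters as stated in the quoted experiment setup.
adam_config = {"lr": 2e-5, "weight_decay": 3e-2, "eps": 1e-10}
```

For example, `accumulation_steps(GLOBAL_BATCH, 8, 1)` returns 32, while `accumulation_steps(GLOBAL_BATCH, 8, 8)` returns 4.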