Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development
Authors: Daoyuan Chen, Haibin Wang, Yilun Huang, Ce Ge, Yaliang Li, Bolin Ding, Jingren Zhou
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed Probe-Analyze-Refine workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. A comprehensive set of over 100 experiments demonstrated the suite's usability and extensibility, while also uncovering insights into the interplay between data quality, diversity, model behavior, and computational costs. |
| Researcher Affiliation | Industry | Alibaba Group. Correspondence to: Yaliang Li <EMAIL>. |
| Pseudocode | No | The paper describes a "Probe-Analyze-Refine workflow" and outlines its steps in paragraph text (Section 3.2 and subsections), but it does not present any formal pseudocode blocks or algorithms. |
| Open Source Code | Yes | All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure. Reproducibility is essential for validating research outcomes. To facilitate this, we have organized detailed descriptions within the appendix of our paper. Key components of our experimental setup, including implementation details such as datasets and training configurations for the image-to-text, text-to-video and image-text pre-training use cases can be found in Appendix D.4, Appendix D.5 and Appendix D.6 respectively. We also provide details into the methodologies for combining multiple operators based on their correlations in Appendix D.1, as well as descriptions of performance metrics (Appendix D.2). Furthermore, Appendix D.3 outlines the functionalities and statistics of the Data-Juicer OPs utilized in our experiments. All codes, datasets, and models of our work are openly accessible and actively maintained at https://github.com/modelscope/datajuicer/blob/main/docs/Sandbox.md. |
| Open Datasets | Yes | DataComp (Gadre et al., 2023) introduces a benchmark to filter out high-quality data from 12.8 billion image-text pairs in Common Crawl to train better CLIP models. For the second task, text-to-video generation, we adopt the advanced DiT-based model, EasyAnimate (Xu et al., 2024b), which originally integrates diverse datasets totaling 1.2M instances from InternVid (Wang et al., 2023) (606k), Panda-70M (Chen et al., 2024e) (605k), and MSR-VTT (Xu et al., 2016) (6k). For the third task, image-text pre-training, we adopt the well-studied CLIP model (Radford et al., 2021). Specifically, we utilize data from the small track of the DataComp competition (Gadre et al., 2023) |
| Dataset Splits | Yes | Our first task focuses on foundational image understanding ability, by experimenting on Mini-Gemini (MGM-2B), a state-of-the-art (SOTA) 2 billion parameter multimodal LLM (Li et al., 2024b). The training protocol for MGM-2B involves two stages: pretraining and fine-tuning. Our experimental focus lies in the pretraining phase, which seeks to harmonize visual and textual representations. We utilize the original pretraining dataset as our original dataset D, consisting of approximately 1.2M instances. We set the size of D_sample as 200k. The single-OP data pools D_i and multi-OP data pools D_S are capped at a maximum of 200k instances, ensuring consistency of data pool size. To match the down-sampling rate used during pretraining, the fine-tuning dataset is sampled into a 240k instance subset. For the second task, text-to-video generation, we adopt the advanced DiT-based model, EasyAnimate (Xu et al., 2024b), which originally integrates diverse datasets totaling 1.2M instances from InternVid (Wang et al., 2023) (606k), Panda-70M (Chen et al., 2024e) (605k), and MSR-VTT (Xu et al., 2016) (6k). The studied baseline model is trained on a subset of 40k instances, employing LoRA (Hu et al., 2022) for efficiency. As a result, the size of D is 1.2M, and the sizes of D_sample, the single-OP data pools D_i, and the multi-OP data pools D_S are all 40k. For the third task, image-text pre-training, we adopt the well-studied CLIP model (Radford et al., 2021). Specifically, we utilize data from the small track of the DataComp competition (Gadre et al., 2023) and adhere to its evaluation metrics, which include 40 distinct evaluation subsets. Due to some broken links, we successfully downloaded 85.2% of the dataset, resulting in a total of 10.9 million samples as our D. All baseline models were trained on an equivalent volume of data as used in the contrastive experiments, sampled randomly from this dataset. |
| Hardware Specification | Yes | Single-OP and OP-combination experiments are each trained on only 1 A100 GPU, so we increase the number of gradient accumulation steps from 4 to 32 to keep the same global batch size. For experiments of duplicating high-quality datasets, 8 A100 GPUs are involved to train the model, and the number of gradient accumulation steps is restored to 4. |
| Software Dependencies | No | The paper mentions several software components and frameworks such as "Data-Juicer", "Mini-Gemini", "EasyAnimate", "ModelScope", "VBench", "MMBench", "TextVQA", "MME", "Transformers", "Diffusers", "NeMo", "MMagic", "ESPNet", "PyTorch", "LoRA", "VideoCrafter-2.0", "HPSv2.1", "InternVideo2", and "Adam optimizer", but it does not specify any version numbers for these components. |
| Experiment Setup | Yes | We keep every training setting (e.g. learning rate scheduler, global batch size) the same as the original model except for training datasets and training devices. Single-OP and OP-combination experiments are each trained on only 1 A100 GPU, so we increase the number of gradient accumulation steps from 4 to 32 to keep the same global batch size. During training, we maintain a video resolution of 256×256, sample every other frame, and randomly select sequences of 16 consecutive frames. The training process involves performing a backward pass for the loss of every 8 samples, with single-OP and OP-combination experiments trained on a single GPU with a batch size of 8 for 5k steps, amounting to approximately 16 GPU hours per training run. Experiments for duplicating high-quality data, as well as larger-scale training, are conducted with a batch size of 1 across 8 GPUs. The models employ the Adam optimizer for training, with a learning rate set to 2×10^-5, weight decay parameter at 3×10^-2, and epsilon configured to 10^-10. |
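The global-batch-size bookkeeping quoted above (8 GPUs with 4 accumulation steps vs. 1 GPU with 32) can be sketched as a small helper. This is a minimal illustration, not code from the paper; the function name `accumulation_steps` and the `adam_config` dict are hypothetical, with the hyperparameter values taken from the quoted setup.

```python
def accumulation_steps(global_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Gradient-accumulation steps needed to preserve a fixed global batch size.

    global_batch = per_device_batch * num_devices * accumulation_steps,
    so reducing num_devices must be compensated by more accumulation steps.
    """
    per_step = per_device_batch * num_devices
    if global_batch % per_step != 0:
        raise ValueError("global batch size must be divisible by per-step batch")
    return global_batch // per_step


# With 8 GPUs and 4 accumulation steps, a per-device batch of 8 gives a
# global batch of 8 * 8 * 4 = 256. Dropping to 1 GPU requires 32 steps,
# matching the paper's "from 4 to 32" adjustment.
GLOBAL_BATCH = 8 * 8 * 4  # 256

# Adam hyperparameters as stated in the quoted experiment setup.
adam_config = {"lr": 2e-5, "weight_decay": 3e-2, "eps": 1e-10}
```

For example, `accumulation_steps(GLOBAL_BATCH, 8, 1)` returns 32, while `accumulation_steps(GLOBAL_BATCH, 8, 8)` returns 4.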