Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection

Authors: Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Based on empirical analyses showing that selecting the best data subset using a static importance measure is often ineffective for multi-task datasets with evolving distributions, we propose Adapt-$\infty$, a new multi-way and adaptive data selection approach that dynamically balances sample efficiency and effectiveness during lifelong instruction tuning (LiIT). We validate the effectiveness and efficiency of Adapt-$\infty$ over a sequence of multimodal instruction tuning datasets spanning various tasks, including (knowledge) VQA, multilingual, grounding, reasoning, language-only, and multi-image comprehension tasks. Training with samples selected by Adapt-$\infty$ alleviates catastrophic forgetting, especially for rare tasks, and promotes forward transfer across the continuum using only a fraction of the original datasets. We conduct extensive ablations of Adapt-$\infty$ to identify best-performing settings. Our key finding is that hidden layer outputs capture semantics, while gradient vectors represent skills, making gradients more effective for pseudo-task clustering (see examples in Figures 2 and 4).
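To make the pseudo-task clustering idea concrete, here is a minimal sketch assuming gradient-based sample features grouped with plain k-means, where k is chosen by the within-cluster sum of squares (WSS). All names and shapes are illustrative; this is not the released Adapt-$\infty$ implementation.

```python
import numpy as np

def kmeans(features, k, iters=50, seed=0):
    """Plain k-means over per-sample feature vectors; stands in for
    the paper's pseudo-task clustering of gradient features."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # assign each sample to its nearest center
        dists = np.linalg.norm(features[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        for c in range(k):
            if (labels == c).any():
                centers[c] = features[labels == c].mean(axis=0)
    return labels, centers

def wss(features, labels, centers):
    """Within-cluster sum of squares, the criterion the paper reports
    using to pick k from a grid search."""
    return sum(((features[labels == c] - centers[c]) ** 2).sum()
               for c in range(len(centers)))

# toy gradient-like features: two well-separated pseudo-tasks
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 0.1, (50, 8)),
                   rng.normal(5, 0.1, (50, 8))])
labels, centers = kmeans(feats, k=2)
```

In practice one would sweep k over the grid (the paper uses 5 to 50) and keep the clustering whose WSS curve flattens out.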
Researcher Affiliation Academia Adyasha Maharana, Jaehong Yoon, Tianlong Chen, Mohit Bansal; Department of Computer Science, UNC Chapel Hill
Pseudocode No The paper describes steps in regular paragraph text and uses figures to illustrate concepts (e.g., Figure 1: Illustration of Adapt-$\infty$), but it does not contain explicitly structured pseudocode or algorithm blocks.
Open Source Code Yes Code is released at https://github.com/adymaharana/adapt-inf.
Open Datasets Yes For training, in addition to the LLaVA-665K instruction tuning dataset at t = 0, we consider the following order of datasets: M3IT (Li et al., 2023c), MiniGPT-4 (Zhu et al., 2023), MANTIS (Jiang et al., 2024), LAMM (Yin et al., 2023) and VisionFLAN (Xu et al., 2024). Each dataset's temporal order, size, and skill composition are summarized in Table 1. We select standard evaluation datasets to measure performance on the skills enumerated in Table 1. These datasets and their corresponding task-specific evaluation metrics are listed in Table 4.
Dataset Splits No The paper mentions using "standard evaluation datasets" for evaluation and training on a fixed sample budget (e.g., "25k samples at each time step"). However, it does not give the explicit training/validation/test splits used for these datasets, or for the continually added training datasets in its own experimental setup, that would be needed to fully reproduce the data partitioning for each experiment.
Hardware Specification Yes We present a comparison of the time taken by various data selection methods for our experimental setting (for training with 25k samples at each time step) in Table 6. Results are presented for 8 A100 GPUs.
Software Dependencies No The paper mentions models and frameworks used, such as the "LLaVA-1.5 multimodal large language model", "Vicuna LLM", "CLIP visual encoder ViT-L/14", and "LoRA finetuning", along with citations to their original papers. However, it does not provide specific version numbers for programming languages (e.g., Python) or common deep learning libraries (e.g., PyTorch, TensorFlow, CUDA) that would be needed to replicate the experimental environment.
Experiment Setup Yes We adopt LoRA finetuning (Hu et al., 2021) of the LLaVA-1.5-7B model with the recommended hyperparameters. The optimal value of k in the pseudo-task clustering step is computed from a grid search over values of k between 5 and 50, and selected based on the within-cluster sum of squares (WSS) of the resulting clusters. In the score-based sample selection step, we use a bin size of 50 and discard the top and bottom 5% of samples for computing entropy as well as for CCS sampling, to remove outliers, low-quality samples, and uninformative data (Zheng et al., 2023). We use random projections (Park et al., 2023; Xia et al., 2024) to reduce the dimensionality of gradient vectors extracted for the pseudo-task clustering step. We use a constant projection dimension of 8192 throughout our experiments. The model is trained on 25k samples at each time step, similar to the MLLM experiments in our paper.
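Two of the quoted preprocessing steps can be sketched as follows: trimming score outliers before computing entropy (bin size 50, 5% trimmed at each tail, matching the quoted setup) and randomly projecting per-sample gradients to a fixed dimension (8192 in the paper). This is a sketch under those assumptions; function names are hypothetical and not from the released code.

```python
import numpy as np

def trimmed_entropy(scores, bin_size=50, trim_frac=0.05):
    """Drop the top/bottom 5% of scores as outliers, then estimate the
    entropy of the remaining score distribution over fixed-size bins."""
    scores = np.sort(np.asarray(scores, dtype=float))
    cut = int(len(scores) * trim_frac)
    kept = scores[cut:len(scores) - cut]
    n_bins = max(1, len(kept) // bin_size)  # bin size of 50 by default
    counts, _ = np.histogram(kept, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def project_gradients(grads, dim=8192, seed=0):
    """Gaussian random projection of high-dimensional per-sample
    gradients down to a constant dimension (8192 in the paper)."""
    rng = np.random.default_rng(seed)
    d = grads.shape[1]
    proj = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(d, dim))
    return grads @ proj
```

Random projection approximately preserves pairwise distances (Johnson-Lindenstrauss), which is why clustering the projected gradients is a reasonable stand-in for clustering the full vectors.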