KIND: Knowledge Integration and Diversion for Training Decomposable Models
Authors: Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that models pretrained with KIND can be decomposed into learngenes and tailors, which can be adaptively recombined for diverse resource-constrained deployments. Moreover, for tasks with large domain shifts, transferring only learngenes with task-agnostic knowledge, when combined with randomly initialized tailors, effectively mitigates domain shifts. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science and Engineering, Southeast University, Nanjing, China; (2) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; (3) Lenovo Research. Correspondence to: Jing Wang <EMAIL>, Xin Geng <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 presents the pseudocode for diverting class-agnostic knowledge into learngenes and class-specific knowledge into tailors. |
| Open Source Code | No | Code will be made available at https://github.com/Te4P0t/KIND. |
| Open Datasets | Yes | We conduct class-conditioned generation on ImageNet-1K (Deng et al., 2009), which contains 1,000 classes. To minimize inter-class similarity, we merge certain similar classes based on their superclasses in WordNet (Miller, 1995), resulting in a final set of 611 classes. Among these, 150 classes are used for pre-training the diffusion models, while the remaining 461 classes serve as novel classes for constructing downstream tasks. Further details can be found in Appendix A.3. Additionally, we use datasets, including CelebA-HQ (Huang et al., 2018), Hubble (Weinzierl, 2023), MRI, and Pokémon, to simulate large domain shifts compared to the training data. |
| Dataset Splits | Yes | To minimize inter-class similarity, we merge certain similar classes based on their superclasses in WordNet (Miller, 1995), resulting in a final set of 611 classes. Among these, 150 classes are used for pre-training the diffusion models, while the remaining 461 classes serve as novel classes for constructing downstream tasks. Further details can be found in Appendix A.3. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like the 'AdamW' optimizer but does not provide specific version numbers for any libraries, frameworks, or programming languages used for implementation. |
| Experiment Setup | Yes | For pre-training DiT, we train class-conditional latent DiTs of sizes -B and -L, with a latent patch size of p = 2 at a 256×256 image resolution on training classes. All models are trained using AdamW with a batch size of 256 and a constant learning rate of 1×10⁻⁴ over 300K steps. An exponential moving average (EMA) of DiT weights is used with a decay rate of 0.9999, and results are reported using the EMA model. During image generation, a classifier-free guidance (cfg) scale of 1.5 is applied. Performance is evaluated using Fréchet Inception Distance (FID) (Heusel et al., 2017), sFID (Nash et al., 2021), Fréchet DINO Distance (FDD) (Stein et al., 2023), Inception Score (Salimans et al., 2016) and Precision/Recall (Kynkäänniemi et al., 2019). Further details are provided in Appendix A.2. Table 6 presents the basic settings, including learning rate, training steps and the number of learngene components NG and tailor components NT for KIND integrating and diverting knowledge. |
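The EMA of model weights described in the setup (decay rate 0.9999, evaluation on the EMA copy) can be sketched as below. This is a minimal, framework-free illustration of the standard EMA update, not the paper's released code; the function and variable names are illustrative.

```python
# Minimal sketch of an exponential moving average (EMA) of model weights,
# matching the decay rate of 0.9999 quoted in the experiment setup.
# Weights are represented as plain lists of floats for illustration.

def ema_update(ema_weights, model_weights, decay=0.9999):
    """Return updated EMA weights: ema <- decay * ema + (1 - decay) * model."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, model_weights)]

# Toy usage: the EMA copy drifts slowly toward the live model weights,
# which is why results are reported with the EMA model after long training.
ema = [0.0, 0.0]
model = [1.0, 2.0]
for _ in range(1000):
    ema = ema_update(ema, model)
```

After 1000 steps each EMA weight equals `(1 - 0.9999**1000)` times the corresponding model weight, i.e. the EMA still lags well behind the live weights, reflecting the very slow decay used over 300K training steps.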