TACO: Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

Authors: Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

TMLR 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We validate this approach experimentally using highly accurate ResNet, ViT/DeiT, and ConvNeXt models, originally trained on the ImageNet and iNaturalist datasets, which we specialize and compress to a diverse set of downstream subtasks, with notable computational speedups on both CPU and GPU. Our results show that TACO can reduce the number of model parameters by 10-20x, with low to moderate accuracy loss on the target task. We compare Task-Aware COmpression, where the objective in Equation 1 is optimized on task-specific data, with task-agnostic Post-Training Compression (PTC), where the data is chosen uniformly at random over a general dataset such as ImageNet-1k and iNaturalist21, using the same number of samples for both methods.
Researcher Affiliation | Academia | The paper states 'Anonymous authors. Paper under double-blind review'; therefore, no institutional affiliations are available from which to classify the author affiliation types.
Pseudocode | Yes | Algorithm 1: Overview of gradual transfer learning setup.
1: Input: Dense model M, sparsity target σ, pruning method Γ.
2: repeat
3:     Prune 50% of remaining weights in M via method Γ.
4:     Update M to the pruned model.
5:     Finetune M for 25 epochs, maintaining the sparsity mask.
6: until M's sparsity reaches some target level σ.
7: Return M.
Open Source Code | No | The paper references a third-party tool ('Hugging Face, Inc. Hugging Face's diffusers: A state-of-the-art diffusion models library. https://github.com/huggingface/diffusers, 2023.') which is used in an application, but does not provide an explicit statement or link to the source code for the TACO methodology developed in this paper.
Open Datasets | Yes | We start from models trained on diverse and general tasks such as ImageNet ILSVRC12 (Russakovsky et al., 2015) and iNaturalist 2021 (Van Horn et al., 2018). For iNaturalist experiments, we adopted ViT-Base from He et al. (2022) and finetuned it on iNaturalist using the same hyperparameters as He et al. (2022). We show results for text-to-image generation on the Pokémon BLIP captions dataset (Pinkney, 2022) using a Stable Diffusion (SD) v1.4 model trained via the Hugging Face diffusers tutorial.
Dataset Splits | No | The paper specifies details about the calibration sets used, such as '10 samples per class for all tasks considered', for an ablation '200 samples (20% of the training samples from ImageNet-1k) per class', and for the Pokémon dataset 'the whole task data, consisting of 800 images, as a calibration set'. However, it does not explicitly provide the training, validation, and test splits for the overall datasets or subtasks used for evaluation.
Hardware Specification | Yes | The finetuning procedure takes 10 minutes on a single NVIDIA RTX 3090 GPU. For experiments with CPU speedups, we execute our unstructured sparse models on the DeepSparse runtime (Kurtz et al., 2020)... Latency is measured on an Apple M1 processor, using 4 cores. Next, we execute the structured-pruned models obtained via TACO (Section 3.2.2) in the NVIDIA TensorRT runtime, on an NVIDIA T4 GPU. CPU benchmarking was conducted on 4 cores of an Apple M1 chip (MacBook Pro 2020).
Software Dependencies | Yes | The model was compiled with the DeepSparse engine (version deepsparse-nightly 1.6.0). GPU measurements were carried out on a Google Colab runtime with an NVIDIA T4 GPU and TensorRT (version 8.6.0.12 for CUDA Toolkit 12.0).
Experiment Setup | Yes | Models are fine-tuned over 100 passes through the calibration data. In our experiments, we use a batch size of 16 for ImageNet and transfer learning subtasks (see Section 3.7) and 32 for iNaturalist subtasks. In our few-shot tuning procedure, we finetune the model by optimizing the sum of the original task loss, a logit distillation loss, and a feature distillation loss. The hyperparameters adopted in training are listed in Table 2: optimizer Adam, learning rate 3e-4, LR schedule linear, batch size 16 (32), num passes 100, weight decay 0, dropout ✗, H.flip ✓, RRC ✓, image size 224, test crop ratio 0.875.
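Algorithm 1 above (prune 50% of remaining weights, fine-tune, repeat until the target sparsity) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: it assumes global magnitude pruning as the method Γ, and the `finetune` callback is a hypothetical placeholder standing in for the 25-epoch fine-tuning step.

```python
import numpy as np

def gradual_magnitude_prune(w, target_sparsity, finetune=None):
    """Sketch of the gradual pruning loop: halve the surviving weights
    by magnitude each round until the target sparsity is reached.

    `finetune(w, mask)`, if given, stands in for the 25-epoch
    fine-tuning step and must preserve the sparsity mask."""
    w = np.asarray(w, dtype=float).ravel().copy()
    mask = np.ones(w.size, dtype=bool)  # True = weight kept
    while (1.0 - mask.mean()) < target_sparsity:
        alive = np.flatnonzero(mask)
        k = len(alive) // 2  # prune 50% of the *remaining* weights
        if k == 0:
            break  # target unreachable at this granularity
        # Drop the k surviving weights with smallest magnitude.
        drop = alive[np.argsort(np.abs(w[alive]))[:k]]
        mask[drop] = False
        w[~mask] = 0.0
        if finetune is not None:
            w = finetune(w, mask)  # fine-tune, keeping the mask fixed
    return w, mask
```

With 16 weights and a 75% sparsity target, two rounds of 50% pruning are performed, leaving the 4 largest-magnitude weights.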
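The few-shot tuning objective quoted in the Experiment Setup row (sum of the original task loss, a logit distillation loss, and a feature distillation loss against the dense teacher) can be sketched as below. This is an illustrative NumPy formulation under assumed choices: cross-entropy for the task loss, KL divergence for logit distillation, MSE for feature distillation, and weighting coefficients `alpha`/`beta` that are placeholders, not values from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def taco_finetune_loss(student_logits, teacher_logits,
                       student_feats, teacher_feats, labels,
                       alpha=1.0, beta=1.0):
    """Task cross-entropy + alpha * logit KL distillation
    + beta * feature MSE distillation (weights are illustrative)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    # Original task loss: cross-entropy against the true labels.
    ce = -np.log(p_s[np.arange(len(labels)), labels] + 1e-12).mean()
    # Logit distillation: KL(teacher || student) per sample, averaged.
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(-1).mean()
    # Feature distillation: MSE between intermediate representations.
    mse = ((student_feats - teacher_feats) ** 2).mean()
    return ce + alpha * kl + beta * mse
```

When the sparse student exactly matches the dense teacher, both distillation terms vanish and only the task loss remains.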