Functional Alignment Can Mislead: Examining Model Stitching

Authors: Damian Smith, Harvey Mannering, Antonia Marcu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental
Firstly, we show that discriminative models with very different biases can be stitched together. We then show that models trained to solve entirely different tasks on different data modalities, and even representations in the form of clustered random noise, can be successfully stitched into MNIST- or ImageNet-trained models. We proceed by showing that alignments can also be found in the case of autoencoders where the encoder and decoder are trained on different tasks. We end with a discussion of the wider impact of our results on the community's current beliefs. Overall, our paper draws attention to the need to correctly interpret the results of such functional similarity measures and highlights the need for approaches that capture informational similarity.
Researcher Affiliation: Academia
Vision, Learning, and Control (VLC) Research Group, University of Southampton. Correspondence to: Antonia Marcu <EMAIL>.
Pseudocode: Yes

Algorithm 1 Generate Base Activations
  procedure GENERATEACTIVATIONS(num_classes, representation_shape)
    for c ← 0 to num_classes − 1 do
      activations[c] ← rand(representation_shape)
    end for
    return activations
  end procedure

Algorithm 2 Generate a dataset (unshuffled)
  procedure SYNTHETICDATASET(train, activations, noise)
    if train then
      samples_per_class ← 6000
    else
      samples_per_class ← 1000
    end if
    for c ← 0 to num_classes − 1 do
      data[c · samples_per_class : (c + 1) · samples_per_class] ← activations[c]
      targets[c · samples_per_class : (c + 1) · samples_per_class] ← c
    end for
    data ← data + noise · randn
    data ← clamp(data, 0, 1)
  end procedure

Algorithm 3 One-to-One Autoencoder Stitching (Loss Calculation)
  procedure CALCULATELOSS(AE1, AE2, stitch, dataset1, dataset2)
    d1 ← dataset1
    d2 ← dataset2
    e1 ← AE1.encoder(d1)
    e2 ← AE2.encoder(d2)
    s ← stitch(e1)
    cost_matrix ← PairwiseDistances(s, e2)
    i, j ← LinearSumAssignment(cost_matrix)
    return Σ cost_matrix[i, j]
  end procedure
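The loss computation in Algorithm 3 can be sketched in Python. This is a minimal illustration, not the authors' code: the function name `stitching_loss` is hypothetical, the pseudocode leaves the distance metric abstract (Euclidean is assumed here), and SciPy's `linear_sum_assignment` is assumed as the assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def stitching_loss(s, e2):
    """One-to-one matching loss between stitched embeddings `s` and
    target embeddings `e2` (one row per sample)."""
    # Pairwise Euclidean distances between every stitched embedding
    # and every target embedding (the cost matrix of Algorithm 3).
    cost_matrix = np.linalg.norm(s[:, None, :] - e2[None, :, :], axis=-1)
    # Optimal one-to-one assignment minimising the total distance.
    i, j = linear_sum_assignment(cost_matrix)
    return cost_matrix[i, j].sum()
```

When the stitched embeddings are a row permutation of the targets, the optimal assignment recovers that permutation and the loss is zero, which is why a permutation-invariant matching is used rather than a fixed row-wise distance.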
Open Source Code: Yes
The code for our experiments is available at https://github.com/DHLSmith/stitching.git.
Open Datasets: Yes
We then show that models trained to solve entirely different tasks on different data modalities, and even representations in the form of clustered random noise, can be successfully stitched into MNIST- or ImageNet-trained models. Colour MNIST (Bahng et al., 2020): the colour of the background and the class of the digit are correlated. For example, we show that we can stitch a model trained on ImageNet (Russakovsky et al., 2015) to one trained to recognise bird songs. We then stitch a CIFAR-10 (Krizhevsky, 2009) encoder onto an MNIST decoder (see right-hand side of Figure 4) and once again obtain depictions of MNIST-like digits.
Dataset Splits: Yes
Algorithm 2 Generate a dataset (unshuffled)
  procedure SYNTHETICDATASET(train, activations, noise)
    if train then
      samples_per_class ← 6000
    else
      samples_per_class ← 1000
    end if
    for c ← 0 to num_classes − 1 do
      data[c · samples_per_class : (c + 1) · samples_per_class] ← activations[c]
      targets[c · samples_per_class : (c + 1) · samples_per_class] ← c
    end for
    data ← data + noise · randn
    data ← clamp(data, 0, 1)
  end procedure
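Algorithms 1 and 2 together define the synthetic clustered-noise dataset and its train/test split (6000 vs. 1000 samples per class). A NumPy sketch of both, under the assumption that `rand`/`randn` in the pseudocode correspond to uniform and standard-normal sampling (function names are illustrative, not from the paper's code):

```python
import numpy as np

def generate_activations(num_classes, representation_shape, rng):
    # Algorithm 1: one random "base activation" per class.
    return {c: rng.random(representation_shape) for c in range(num_classes)}

def synthetic_dataset(train, activations, noise, rng):
    # Algorithm 2: repeat each class's base activation, add scaled
    # Gaussian noise, and clamp the result to [0, 1].
    samples_per_class = 6000 if train else 1000
    num_classes = len(activations)
    shape = activations[0].shape
    data = np.empty((num_classes * samples_per_class,) + shape)
    targets = np.empty(num_classes * samples_per_class, dtype=int)
    for c in range(num_classes):
        sl = slice(c * samples_per_class, (c + 1) * samples_per_class)
        data[sl] = activations[c]
        targets[sl] = c
    data = np.clip(data + noise * rng.standard_normal(data.shape), 0.0, 1.0)
    return data, targets
```

Because every sample in a class is the same base activation plus noise, the dataset consists of `num_classes` Gaussian clusters of random points, which is exactly the "clustered random noise" the paper stitches into trained models.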
Hardware Specification: No
The authors acknowledge the use of the IRIDIS X High Performance Computing Facility, the ECS Alpha Cluster, and the Southampton-Wolfson AI Research Machine (SWARM) GPU cluster, generously funded by the Wolfson Foundation, together with the associated support services at the University of Southampton in the completion of this work.
Software Dependencies: No
No specific software dependencies with version numbers are provided in the paper.
Experiment Setup: Yes
Model training: batch size 128, 4 epochs, SGD, lr = 1e-1, momentum = 0.9, weight decay = 1e-4.
Model stitching: batch size 128, 10 epochs, SGD, lr = 1e-4, momentum = 0.9, weight decay = 1e-2.
Noise stitching: batch size 64, 4 epochs, SGD, lr = 1e-4, momentum = 0.9, weight decay = 1e-2.
Autoencoders: 25 epochs, Adam, lr = 1e-4.
Embedding-map stitching: 25 epochs, SGD, lr = 1e-5, momentum = 0.9, weight decay = 1e-2.
Class-mapping stitching: 20 epochs, SGD with the same parameters except lr = 1e-2.
Section 5 stitches: 20 epochs, SGD, lr = 1e-4, weight decay = 1e-4, momentum = 0.9, batch size 128.
Final model: lr = 1e-2, weight decay = 1e-4, momentum = 0.9.
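For reference, the experiment configurations quoted above can be collected into a single structure. The dictionary layout and key names below are my own; only the values come from the paper:

```python
# Hyperparameter configurations reported in the paper, gathered into one
# mapping. Keys and structure are illustrative, values are from the text.
CONFIGS = {
    "model_training":    dict(batch_size=128, epochs=4, optimiser="SGD",
                              lr=1e-1, momentum=0.9, weight_decay=1e-4),
    "model_stitching":   dict(batch_size=128, epochs=10, optimiser="SGD",
                              lr=1e-4, momentum=0.9, weight_decay=1e-2),
    "noise_stitching":   dict(batch_size=64, epochs=4, optimiser="SGD",
                              lr=1e-4, momentum=0.9, weight_decay=1e-2),
    "autoencoders":      dict(epochs=25, optimiser="Adam", lr=1e-4),
    "embedding_map":     dict(epochs=25, optimiser="SGD",
                              lr=1e-5, momentum=0.9, weight_decay=1e-2),
    "class_mapping":     dict(epochs=20, optimiser="SGD",
                              lr=1e-2, momentum=0.9, weight_decay=1e-2),
    "section5_stitches": dict(batch_size=128, epochs=20, optimiser="SGD",
                              lr=1e-4, momentum=0.9, weight_decay=1e-4),
    "final_model":       dict(optimiser="SGD", lr=1e-2,
                              momentum=0.9, weight_decay=1e-4),
}
```

Grouping the settings this way makes the contrasts visible at a glance, e.g. that stitching layers are trained with a much smaller learning rate (1e-4 to 1e-5) than the base models (1e-1).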