Attributes Shape the Embedding Space of Face Recognition Models

Authors: Pierrick Leroy, Antonio Mastropietro, Marco Nurisso, Francesco Vaccarino

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. Our findings reveal that the models exhibit varying degrees of invariance across different attributes, providing insight into their strengths and weaknesses and enabling deeper interpretability.
Researcher Affiliation Academia 1. Department of Mathematical Sciences, Politecnico di Torino, Turin, Italy; 2. Department of Computer Science, University of Pisa, Pisa, Italy; 3. CENTAI Institute, Turin, Italy. Correspondence to: Antonio Mastropietro <EMAIL>.
Pseudocode No The paper describes methods using mathematical equations and prose, but does not contain any explicitly formatted pseudocode or algorithm blocks.
Open Source Code Yes Code available here: https://github.com/mantonios107/attrsfr-embs.
Open Datasets Yes increased availability of curated datasets (Deng et al., 2019b) and improved recognition architectures and losses (Deng et al., 2022). In the challenging open-set scenario, novel face identities can appear at testing time, hence the problem is framed as a metric learning task (Liu et al., 2017). There are three main innovations to reach outstanding FR results in this scenario. [...] We evaluate the proposed metric on controlled, simplified models and widely used FR models fine-tuned with synthetic data for targeted attribute augmentation. [...] MS1MV3 (Deng et al., 2019b), and we inspect the embedding space to describe its geometric structure induced by the architecture and the loss. [...] The (inter-class) distance between decision regions can be estimated through the corresponding distance between identity point clouds: $d_b(P_i, P_j) = \frac{1}{|P_i||P_j|} \sum_{e^{(i)} \in P_i,\, e^{(j)} \in P_j} d(e^{(i)}, e^{(j)})$. Similarly for the (intra-class) distance within decision regions: $d_w(P_i) = \frac{1}{|P_i|^2} \sum_{e^{(i)}, e^{(j)} \in P_i} d(e^{(i)}, e^{(j)})$ (Liu et al., 2022; Deng et al., 2022). Thus, after training, we can assume that the distance within is much smaller than the distance between point clouds, i.e. $d_w(i) \ll d_b(i, j)$ and $d_w(j) \ll d_b(i, j)$ for $i, j \in I$. Indeed, Table 1 shows that there is at least a factor of 2 between $d_w$ and $d_b$ for FR models with various architectures, losses and distance metrics on the LFW dataset (Huang et al., 2008). [...] Using this approach, we analyze a high-quality subset of CelebA consisting of $|I| = 1965$ identities, with $|D_i| \geq 30$ and 40 binary attributes (e.g., "eyeglasses", "smiling", "male") for five FR models: FaceNet, AdaFace, SphereFace-R and two versions of ArcFace. [...] Finally, for each modality, we perform a KS-test on the empirical distributions of $d_m$ and $d_{\bar m}$, the statistic of the test being the KS-distance depicted in Figure 1c.
For CelebA, $m \in \{-1, 1\}$ and we reject $H_0$ for a large majority of attributes at the confidence level 0.001 (see detailed p-values on Figure 10 in Appendix C). [...] Table 4. Main characteristics of models used in this work.
Model         Architecture  Distance   Training set  Images (M)  Repository
FaceNet       iResNet-v1    euclidean  VGGFace2      3.31        davidsandberg/facenet
ArcFace       ResNet-50     cosine     MS1MV3        5.18        deepinsight/insightface
ArcFace       ResNet-18     cosine     MS1MV3        5.18        deepinsight/insightface
AdaFace       ResNet-18     cosine     VGGFace2      3.31        mk-minchul/AdaFace
SphereFace-R  iResNet-100   cosine     MS1           10          ydwen/opensphere
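The quoted intra-class ($d_w$) and inter-class ($d_b$) point-cloud distances are simple averages over pairwise embedding distances. A minimal sketch, assuming Euclidean distance and numpy arrays of shape (n_images, embedding_dim); the function names are my own:

```python
import numpy as np

def pairwise_dist(A, B):
    # Euclidean distance between every embedding in A and every one in B.
    diff = A[:, None, :] - B[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def d_between(P_i, P_j):
    # Inter-class: mean distance between the two identity point clouds,
    # i.e. (1 / (|P_i||P_j|)) * sum of all cross-cloud pairwise distances.
    return pairwise_dist(P_i, P_j).mean()

def d_within(P_i):
    # Intra-class: mean distance within one identity point cloud,
    # averaging over all |P_i|^2 ordered pairs as in the quoted formula.
    return pairwise_dist(P_i, P_i).mean()
```

On a well-trained FR model one would expect `d_within(P_i)` to be much smaller than `d_between(P_i, P_j)`; the paper reports at least a factor of 2 on LFW.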
Dataset Splits Yes We generate similar train, validation and test sets of 10K identities each. Each identity is generated first from a fixed image, which is an unconditionally generated image from GAN-Control. For each attribute, we then generate variations of the fixed image with GAN-Control and low-level augmentations. For example, a fixed image that has an age value of 35 will yield two other images with age values 45 and 25, and so on for the other attributes, for a total of $1 + 8 \times 2 = 17$ images per identity. When fine-tuning on an attribute $a$, we include in the train set only fixed images and their variations of $a$. ArcFace and AdaFace have parametric losses: for each train identity $i$, a parameter $c \in \mathbb{R}^p$ is learned during training to represent a centroid of $P_i$, which allows computing the loss in a pointwise manner. We place ourselves in the open-set scenario, therefore our validation set contains (exclusively) identities not seen during training and hence we cannot directly measure the loss. To adapt this idea to hold-out data, we choose fixed images as parameters, making the loss parameter-free. Let $I_1$ and $I_2$ be validation identities with fixed images $x^*_1$ and $x^*_2$ and augmented images $x^+_1$ and $x^-_1$. Then this parameter-free loss decreases with $d(f(x^*_1), f(x^+_1))$ and $d(f(x^*_1), f(x^-_1))$. It increases when $d(f(x^*_1), f(x^*_2))$ decreases.
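The parameter-free validation criterion described above substitutes the fixed image's embedding for the learned class centroid. The paper only states the loss's monotonicity (it decreases with intra-identity distances and increases when the inter-identity distance shrinks), so the concrete pull-minus-push combination below, the cosine distance, and the function names are assumptions of this sketch:

```python
import numpy as np

def cosine_dist(u, v):
    # 1 minus cosine similarity between two embedding vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def parameter_free_loss(e_fixed_1, e_aug_1, e_fixed_2):
    # The fixed image's embedding plays the role of the learned centroid:
    # pull the augmented views of identity 1 toward its fixed image, and
    # push the fixed image of identity 2 away from it.
    pull = np.mean([cosine_dist(e_fixed_1, e) for e in e_aug_1])
    push = cosine_dist(e_fixed_1, e_fixed_2)
    return pull - push
```

Any combination with the same monotonic behavior would satisfy the description; this additive form is just the simplest choice.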
Hardware Specification No The paper does not contain any specific hardware details such as GPU models, CPU types, or specific cloud computing resources used for running experiments.
Software Dependencies No The paper mentions the Adam optimizer (Kingma & Ba, 2015) but does not specify any software libraries or frameworks with version numbers (e.g., Python 3.x, PyTorch x.x, TensorFlow x.x) that were used for implementation.
Experiment Setup Yes For all fine-tunings, we always start from the same baseline, which is an ArcFace ResNet-18 model trained on MS1MV3 (Deng et al., 2019a) for the first set of fine-tunings, and an AdaFace ResNet-18 trained on VGGFace2 (Cao et al., 2018) for the second one. We use ArcFace as a reference and, when possible, match the end-of-training configuration (the training done by the providers of the model). We train both models for a maximum of 15 epochs and use callbacks on the validation loss to get the best checkpoint. Both models have a batch size of 42 identities, always presenting all images of an identity in the same batch. We let both models have the same optimizer parameters: a small learning rate of $10^{-4}$, weight decay of $5 \times 10^{-4}$, and momentum equal to 0.9. For ArcFace we use the default 0.5 rad for the margin, while we set the learning rate of the loss to $10^{-3}$. For AdaFace, we leave the default margin variability parameter $m$ at 0.4, the concentration parameter $h$ at 0.333, the scale parameter $s$ at 64, and the running batch-mean coefficient at 0.01, and we set the learning rate of the loss to 10.
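The fine-tuning hyperparameters quoted above can be collected into one place for reference. The dictionary layout and key names below are my own; the values are the ones reported in the paper (the review notes the paper cites Adam, though momentum plus weight decay is usually an SGD-style configuration, so the optimizer family is left unspecified here):

```python
def finetune_config(model: str) -> dict:
    # Shared settings for both fine-tuning runs, values as quoted above.
    common = {
        "max_epochs": 15,
        "batch_identities": 42,   # all images of an identity in one batch
        "lr": 1e-4,
        "weight_decay": 5e-4,
        "momentum": 0.9,
    }
    if model == "arcface":
        # Default margin of 0.5 rad; separate learning rate for the loss head.
        return {**common, "margin_rad": 0.5, "loss_lr": 1e-3}
    if model == "adaface":
        # Default AdaFace parameters: margin m, concentration h, scale s,
        # and the running batch-mean coefficient.
        return {**common, "m": 0.4, "h": 0.333, "s": 64,
                "batch_mean_coeff": 0.01, "loss_lr": 10.0}
    raise ValueError(f"unknown model: {model}")
```

Selecting the best checkpoint via a validation-loss callback is orthogonal to this configuration and would live in the training loop.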