SegFace: Face Segmentation of Long-Tail Classes

Authors: Kartik Narayan, Vibashan VS, Vishal M. Patel

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments demonstrating that SegFace significantly outperforms previous state-of-the-art models, achieving a mean F1 score of 88.96 (+2.82) on the CelebAMask-HQ dataset and 93.03 (+0.65) on the LaPa dataset.
Researcher Affiliation | Academia | Kartik Narayan, Vibashan VS, Vishal M. Patel (Johns Hopkins University) EMAIL
Pseudocode | No | The paper describes the proposed method in prose and provides a diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/Kartik-3004/SegFace
Open Datasets | Yes | We conduct our experiments on three standard face segmentation datasets: LaPa (Liu et al. 2020), CelebAMask-HQ (Lee et al. 2020b), and Helen (Le et al. 2012).
Dataset Splits | Yes | The LaPa dataset contains a total of 22,168 images, with 18,176 used for training, 2,000 for validation, and 2,000 for testing. This dataset is annotated for 11 classes, including skin, hair, nose, left eye, right eye, left brow, right brow, upper lip, and lower lip. The CelebAMask-HQ dataset comprises 30,000 face images, split into 24,183 for training, 2,993 for validation, and 2,824 for testing. It features 19 semantic classes, including accessories such as earring, necklace, eyeglass, and hat, which are considered long-tail classes due to their infrequent occurrence in the dataset. The other classes are the same as those in the LaPa dataset, with the addition of left/right ear, cloth, and neck. The Helen dataset, being the smallest, consists of 2,000 training samples, 230 validation samples, and 100 test samples, annotated for 11 classes.
Hardware Specification | Yes | All code was implemented in PyTorch, and the models were trained on eight A6000 GPUs, each equipped with 48 GB of memory.
Software Dependencies | No | The paper mentions "PyTorch" but does not specify a version number or other software dependencies with version numbers.
Experiment Setup | Yes | The models were optimized for 300 epochs using the AdamW optimizer, with an initial learning rate of 1e-4 and a weight decay of 1e-5. We employed a step LR scheduler with a gamma value of 0.1, which reduces the learning rate by a factor of 0.1 at epochs 80 and 200. A batch size of 32 was used for training on the LaPa and CelebAMask-HQ datasets, and 16 for the Helen dataset. We did not perform any augmentations on the CelebAMask-HQ and Helen datasets. For the LaPa dataset, we applied random rotation [-30, 30], random scaling [0.5, 3], and random translation [-20px, 20px], along with RoI tanh warping (Lin et al. 2019) to ensure that the network focused on the face region. The λ1 and λ2 values were set at 0.5 for dice loss and cross-entropy loss, respectively.
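The reported optimizer and schedule settings can be sketched in PyTorch as follows. This is a minimal illustration of the stated hyperparameters only, not the authors' training code; the model here is a hypothetical stand-in (the actual SegFace architecture lives in the linked repository), and the loss combination assumes placeholder dice/cross-entropy terms weighted by λ1 = λ2 = 0.5.

```python
import torch

# Hypothetical stand-in model; the real SegFace architecture is defined in
# the authors' repository (https://github.com/Kartik-3004/SegFace).
model = torch.nn.Conv2d(3, 19, kernel_size=1)  # 19 classes, as in CelebAMask-HQ

# Reported optimizer: AdamW with lr 1e-4 and weight decay 1e-5.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Reported schedule: step decay by gamma=0.1 at epochs 80 and 200.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 200], gamma=0.1
)

# Reported loss weights: lambda1 = lambda2 = 0.5 for dice and cross-entropy.
lambda1, lambda2 = 0.5, 0.5

for epoch in range(300):
    # ... iterate over batches here (batch size 32 for LaPa/CelebAMask-HQ,
    # 16 for Helen), computing total = lambda1 * dice + lambda2 * ce ...
    scheduler.step()

# After both milestones the learning rate has decayed twice: 1e-4 -> 1e-6.
final_lr = optimizer.param_groups[0]["lr"]
```

Under this schedule the learning rate is 1e-4 for epochs 0-79, 1e-5 for epochs 80-199, and 1e-6 thereafter.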