Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

Authors: Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao

AAAI 2025

Reproducibility assessment (each variable is followed by its result and the LLM response):
Research Type: Experimental. Extensive experiments demonstrate that Hand1000 significantly outperforms existing models in producing anatomically correct hand images while faithfully representing other details in the text, such as faces, clothing, and colors. After training on 1,000 images from our dataset, Hand1000 significantly outperforms the Stable Diffusion model in generating correct hands for the same textual prompts, as shown at the bottom of Figure 1. As shown in Table 2, Hand1000 demonstrates a significant improvement over Stable Diffusion (Rombach et al. 2022) on all metrics: a 28.599 reduction in FID, a 0.017 reduction in KID, a 30.208 reduction in FID-H, and a 0.041 reduction in KID-H, while its HAND-CONF score rises by 0.053.
Researcher Affiliation: Academia. (1) Peking University, (2) Singapore Management University, (3) University of Science and Technology of China.
Pseudocode: No. The paper describes the method's three stages in detail in sections titled 'Stage I: Hand Gesture Feature Extraction', 'Stage II: Text Embedding Optimization', and 'Stage III: Stable Diffusion Fine-tuning', along with a diagram in Figure 2. However, it does not include a block explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code: No. The paper links a 'Project Page https://haozhuo-zhang.github.io/Hand1000-project-page/', which is a project demonstration page rather than a source-code repository, and it contains no sentence explicitly stating that source code for the described methodology is released.
Open Datasets: No. The paper states: 'In addition, we construct the first publicly available dataset specifically designed for text-to-hand image generation' and 'By leveraging image captioning models and the LLaMA3, we construct the first publicly available dataset specifically designed for generating hand images from textual descriptions.' However, it describes the dataset construction process and asserts public availability without providing concrete access information such as a specific link, DOI, or repository name.
Dataset Splits: Yes. We use six hand gestures from HaGRID (phone call, four, like, mute, ok, and palm), with 1,000 images per gesture for training and for testing, respectively.
Hardware Specification: No. The paper acknowledges 'Bitdeer.AI for providing the cloud services and computing resources that were essential to this work' but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments.
Software Dependencies: Yes. We use MediaPipe Hands (Zhang et al. 2020) as the hand gesture recognition model, while the text encoder is the CLIP (Radford et al. 2021) text encoder. The sampling scheme is DDIM (Song, Meng, and Ermon 2020) and the optimizer is Adam (Kingma 2014). We used SDv1.4 for our experiments, consistent with recent works such as HanDiffuser (Narasimhaswamy et al. 2024), HandRefiner (Lu et al. 2023), and Imagic (Kawar et al. 2023).
Experiment Setup: Yes. For each image in the training set, the first stage of training involves 10 epochs with a learning rate of 1e-3; the second stage consists of 20 epochs with a learning rate of 1e-6. In addition, we use MediaPipe Hands (Zhang et al. 2020) as the hand gesture recognition model, while the text encoder is the CLIP (Radford et al. 2021) text encoder. The sampling scheme is DDIM (Song, Meng, and Ermon 2020) and the optimizer is Adam (Kingma 2014). For the phone call gesture, an optimal value of λ = 0.7 was found, which produces normal hand regions while maintaining diversity and realism in other parts of the image.
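The quoted setup can be summarized in a minimal configuration sketch. Only the hyperparameter values (epochs, learning rates, λ = 0.7) come from the paper; the function name `fuse_embeddings` and the convex-combination form of the λ-weighted fusion are illustrative assumptions, not the paper's actual implementation:

```python
# Hyperparameters quoted from the paper; everything else is illustrative.
STAGE1_EPOCHS, STAGE1_LR = 10, 1e-3   # Stage I: hand gesture feature extraction
STAGE2_EPOCHS, STAGE2_LR = 20, 1e-6   # Stage II: text embedding optimization
LAMBDA = 0.7                          # reported optimum for the "phone call" gesture

def fuse_embeddings(text_emb, hand_emb, lam=LAMBDA):
    """Blend a text embedding with a hand-gesture feature vector.

    A convex combination is one plausible reading of a lambda-weighted
    fusion; the exact operator is not specified in the excerpts above.
    """
    return [lam * h + (1.0 - lam) * t for t, h in zip(text_emb, hand_emb)]

# With lambda = 0.7, each fused coordinate lies 70% of the way
# from the text embedding toward the hand-gesture feature.
fused = fuse_embeddings([0.0, 1.0], [1.0, 1.0])
```

A higher λ pushes generation toward anatomically correct hand regions at the cost of fidelity to the rest of the prompt, which is consistent with the paper's report that 0.7 balances the two for the phone call gesture.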