CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation

Authors: Minghao Fu, Guo-Hua Wang, Liangfu Cao, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting new state-of-the-art across various standard benchmarks. Empirical evaluations on two mainstream text-to-image generation frameworks, diffusion (Podell et al., 2024) and flow matching (Liu et al., 2023; Lipman et al., 2023), underscore the superiority of our method. We utilize publicly available benchmark prompts from GenEval (Ghosh et al., 2023), DPG-Bench (Hu et al., 2024), and HPS v2 (Wu et al., 2023). We employ multiple evaluation metrics, including HPS v2 (Wu et al., 2023), ImageReward (Xu et al., 2024), and PickScore (Kirstain et al., 2023b).
Researcher Affiliation Collaboration Minghao Fu (1,2,3), Guo-Hua Wang (3), Liangfu Cao (3), Qing-Guo Chen (3), Zhao Xu (3), Weihua Luo (3), Kaifu Zhang (3). (1) National Key Laboratory for Novel Software Technology, Nanjing University; (2) School of Artificial Intelligence, Nanjing University; (3) Alibaba Group. Correspondence to: Minghao Fu <EMAIL>, Guo-Hua Wang <EMAIL>.
Pseudocode No The paper describes mathematical derivations and algorithmic steps in prose, but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code Yes The code is publicly available at github.com/AIDC-AI/CHATS.
Open Datasets Yes We conduct experiments primarily on two preference optimization datasets, Pick-a-Pic v2 (PaP v2) (Kirstain et al., 2023a) and Open Image Preferences (OIP) (Data is Better Together, 2024). Open Image Preferences. https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized, 2024.
Dataset Splits No The paper mentions using 'Pick-a-Pic v2' and 'Open Image Preferences' datasets for finetuning and specifies benchmark prompts for evaluation (GenEval, DPG-Bench, HPS v2), but it does not provide specific training/validation/test splits for its own finetuning process or for the preference datasets themselves.
Hardware Specification Yes Throughput with 50 sampling steps, measured on NVIDIA A100 GPU with BF16 inference.
Software Dependencies No The paper mentions using Adafactor (Shazeer & Stern, 2018) and AdamW (Loshchilov & Hutter, 2019) as optimizers, but does not provide specific versions for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup Yes Training is conducted with an effective batch size of 512, maintaining an image resolution of 1024. The default learning rate is set to 1×10^-8, and a learning rate scaling strategy based on batch-size increases is utilized to accelerate the finetuning. T (cf. Eq. 13 and Eq. 14) is fixed as 1000. During sampling, by default we keep s and α as 5 and 0.5, respectively.
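The reported hyperparameters can be collected into a single configuration sketch for reimplementation. This is an illustrative summary only: the field names and the `scaled_lr` helper (including its reference batch size and linear scaling rule) are assumptions, since the paper states that a batch-size-based scaling strategy is used but does not specify the exact rule; the values are those quoted above.

```python
# Hypothetical hyperparameter summary of the reported CHATS setup.
# Field names are illustrative, not taken from the official codebase;
# values are those quoted in the experiment-setup row above.
chats_config = {
    "effective_batch_size": 512,
    "image_resolution": 1024,
    "base_learning_rate": 1e-8,  # scaled up with batch size to accelerate finetuning
    "num_timesteps_T": 1000,     # T, cf. Eq. 13 and Eq. 14
    "guidance_scale_s": 5.0,     # default sampling value of s
    "alpha": 0.5,                # default sampling value of α
}


def scaled_lr(base_lr: float, batch_size: int, ref_batch_size: int = 256) -> float:
    """Linear learning-rate scaling with batch size.

    A common convention, used here only as a placeholder: the paper does
    not state which scaling rule or reference batch size it employs.
    """
    return base_lr * batch_size / ref_batch_size
```

For example, under this placeholder rule a batch size of 512 against a reference of 256 would double the base rate to 2×10^-8; any faithful reproduction should confirm the actual rule against the released code at github.com/AIDC-AI/CHATS.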