FreeMesh: Boosting Mesh Generation with Coordinates Merging
Authors: Jian Liu, Haohan Weng, Biwen Lei, Xianghui Yang, Zibo Zhao, Zhuo Chen, Song Guo, Tao Han, Chunchao Guo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and EdgeRunner, we further validate the performance of our method. We constructed a simple point-cloud-conditioned mesh generation pipeline to evaluate the proposed method empirically. We used the filtered Objaverse (Deitke et al., 2023) and Objaverse-XL (Deitke et al., 2024) as training data. Extensive experiments demonstrate that our PTME is an effective method for evaluating the superiority of mesh tokenizers, and that the Rearrange & Merge Coordinates (RMC) can effectively increase the number of mesh faces generated by previous tokenizers. |
| Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 Tencent Hunyuan, 3 South China University of Technology, 4 ShanghaiTech University. |
| Pseudocode | Yes | Algorithm 1 Rearrange Coordinate Encode Operation: def rac_encode(nums): X = nums[0::3]; Y = nums[1::3]; Z = nums[2::3]; return X + Y + Z. def rac_encode_full(nums): if len(nums) < 9: return rac_encode(nums); remainder = len(nums) % 9; head_start = 0; head_end = 9; head = rac_encode(nums[head_start:head_end]); neck_start = head_end; neck_end = len(nums) - remainder; neck_len = (neck_end - neck_start) // 9; neck = []; for i in range(neck_len): cur_seq = nums[neck_start + i*9 : neck_start + (i+1)*9]; neck.extend(rac_encode(cur_seq)); if remainder > 0: tail = rac_encode(nums[neck_end:]); else: tail = []; return head + neck + tail |
| Open Source Code | No | The data utilized in this paper is all open-source, and the point cloud encoder, various serialization methods, transformer frameworks, and SentencePiece are also derived from open-source code. Users who employ this framework must verify the copyright of the database and codebase they utilize. |
| Open Datasets | Yes | Our model's training data comprise ShapeNet V2 (Chang et al., 2015), 3D-FUTURE (Fu et al., 2021), Objaverse (Deitke et al., 2023), and Objaverse-XL (Deitke et al., 2024). |
| Dataset Splits | No | For our test set, we sampled meshes with around 500, 1000, 2000, and 4000 faces to reflect the model's generalization under various face counts. Our training dataset comprises 10,000 meshes, with vocabulary sizes systematically evaluated across 256, 512, 1024, 2048, 4096, and 8192. The actual numbers of data utilized can differ across methods. We compared three baseline serialization methods with their RMC-enhanced counterparts using a stratified sample of 100k meshes from our 1M mesh dataset. |
| Hardware Specification | Yes | Training executed on 48 H20 GPUs with a per-GPU batch size of 2 for four days, utilizing FlashAttention and bf16 mixed precision. |
| Software Dependencies | No | For coordinate merging, we implement the Byte-Pair Encoding (BPE) algorithm from Google's SentencePiece (Kudo & Richardson, 2018). Our auto-regressive Transformer architecture adopts cross-attention conditioning following BPT (Weng et al., 2024b), with a point cloud encoder adapted from Michelangelo (Zhao et al., 2024b) processing 8,192 sampled points. We employ AdamW (Loshchilov & Hutter, 2017) optimization (β1 = 0.9, β2 = 0.999) with 0.1 weight decay and cosine annealing, decaying the learning rate from 10⁻⁴ to 6 × 10⁻⁵. |
| Experiment Setup | Yes | We set the Transformer's context window to 9,000... The final vocabulary size for all coordinate-merging methods is 8192. Our training dataset comprises 10,000 meshes, with vocabulary sizes systematically evaluated across 256, 512, 1024, 2048, 4096, and 8192. Our auto-regressive Transformer architecture adopts cross-attention conditioning following BPT (Weng et al., 2024b), with a point cloud encoder adapted from Michelangelo (Zhao et al., 2024b) processing 8,192 sampled points. The mesh transformer features 24 layers with 1,024 hidden dimensions, 16 attention heads (64 dimensions per head), and DeepSpeed ZeRO-2 parallelism. Training executed on 48 H20 GPUs with a per-GPU batch size of 2 for four days, utilizing FlashAttention and bf16 mixed precision. The point cloud encoder remained frozen for the first 48 hours before fine-tuning commenced. We employ AdamW (Loshchilov & Hutter, 2017) optimization (β1 = 0.9, β2 = 0.999) with 0.1 weight decay and cosine annealing, decaying the learning rate from 10⁻⁴ to 6 × 10⁻⁵. Inference acceleration leverages KV caching for efficient sequence generation. |
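The Algorithm 1 pseudocode quoted above can be made runnable. The sketch below restructures the head/neck/tail split into a single loop over 9-value blocks (same behavior), and adds `rac_decode` / `rac_decode_full` as our own round-trip check; the inverse functions are not given in the paper.

```python
def rac_encode(nums):
    # Rearrange interleaved (x, y, z) values into all X, then all Y, then all Z.
    return nums[0::3] + nums[1::3] + nums[2::3]

def rac_encode_full(nums):
    # Apply rac_encode per 9-value block (one triangle = 3 vertices x 3 coords);
    # any leftover values (fewer than 9) form a shorter final block.
    if len(nums) < 9:
        return rac_encode(nums)
    remainder = len(nums) % 9
    body_end = len(nums) - remainder
    out = []
    for start in range(0, body_end, 9):
        out.extend(rac_encode(nums[start:start + 9]))
    if remainder > 0:
        out.extend(rac_encode(nums[body_end:]))
    return out

def rac_decode(block):
    # Inverse of rac_encode for one block whose length is a multiple of 3.
    k = len(block) // 3
    xs, ys, zs = block[:k], block[k:2 * k], block[2 * k:]
    out = []
    for x, y, z in zip(xs, ys, zs):
        out.extend([x, y, z])
    return out

def rac_decode_full(seq):
    # Inverse of rac_encode_full: decode per 9-value block, then the remainder.
    if len(seq) < 9:
        return rac_decode(seq)
    remainder = len(seq) % 9
    body_end = len(seq) - remainder
    out = []
    for start in range(0, body_end, 9):
        out.extend(rac_decode(seq[start:start + 9]))
    if remainder > 0:
        out.extend(rac_decode(seq[body_end:]))
    return out
```

For two triangles flattened as `[x1, y1, z1, x2, y2, z2, ...]`, encoding groups each face's coordinates by axis, and decoding restores the original interleaving exactly.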
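The paper performs coordinate merging with SentencePiece's BPE. As a dependency-free illustration of the core merge rule only (replace the most frequent adjacent token pair with a new token, repeat), here is a minimal pure-Python sketch; the token ids, `num_merges`, and `next_token` values are illustrative, and this is not the SentencePiece implementation the authors used.

```python
from collections import Counter

def most_frequent_pair(seqs):
    # Count adjacent token pairs across all sequences; return the most common.
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_token):
    # Replace every non-overlapping occurrence of `pair` with `new_token`.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_merge(seqs, num_merges, next_token):
    # Greedily learn `num_merges` merge rules over the token sequences.
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        seqs = [merge_pair(s, pair, next_token) for s in seqs]
        merges.append((pair, next_token))
        next_token += 1
    return seqs, merges
```

In the paper's setting the initial tokens would be quantized coordinates (after the rearrange step), and merging would continue until the vocabulary reaches the stated 8192 entries.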
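The reported schedule decays the learning rate from 10⁻⁴ to 6 × 10⁻⁵ with cosine annealing. As a sanity check, the standard cosine-annealing formula reproduces those endpoints (the paper does not specify warmup or step counts, so `total_steps` here is a placeholder):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=6e-5):
    # Standard cosine annealing: lr_max at step 0, lr_min at total_steps.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))
```

At the halfway point the rate is the arithmetic mean of the two endpoints, 8 × 10⁻⁵.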