FreeMesh: Boosting Mesh Generation with Coordinates Merging
Authors: Jian Liu, Haohan Weng, Biwen Lei, Xianghui Yang, Zibo Zhao, Zhuo Chen, Song Guo, Tao Han, Chunchao Guo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and EdgeRunner, we further validate the performance of our method. We constructed a simple point-cloud-conditioned mesh generation pipeline to evaluate the proposed method empirically. We used the filtered Objaverse (Deitke et al., 2023) and Objaverse-XL (Deitke et al., 2024) as training data. Extensive experiments demonstrate that our PTME is an effective method for evaluating the superiority of mesh tokenizers, and that the Rearrange & Merge Coordinates (RMC) can effectively increase the number of mesh faces generated by previous tokenizers. |
| Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 Tencent Hunyuan, 3 South China University of Technology, 4 ShanghaiTech University. |
| Pseudocode | Yes | Algorithm 1 Rearrange Coordinate Encode Operation: def rac_encode(nums): X = nums[0::3]; Y = nums[1::3]; Z = nums[2::3]; return X + Y + Z. def rac_encode_full(nums): if len(nums) < 9: return rac_encode(nums); remainder = len(nums) % 9; head_start = 0; head_end = 9; head = rac_encode(nums[head_start:head_end]); neck_start = head_end; neck_end = len(nums) - remainder; neck_len = (neck_end - neck_start) // 9; neck = []; for i in range(neck_len): cur_seq = nums[neck_start + i*9 : neck_start + (i+1)*9]; neck.extend(rac_encode(cur_seq)); if remainder > 0: tail = rac_encode(nums[neck_end:]); else: tail = []; return head + neck + tail |
| Open Source Code | No | The data utilized in this paper is all open-source, and the point cloud encoder, various serialization methods, transformer frameworks, and SentencePiece are also derived from open-source code. Users who employ this framework must verify the copyright of the database and codebase they utilize. |
| Open Datasets | Yes | Our model's training data comprise ShapeNet V2 (Chang et al., 2015), 3D-FUTURE (Fu et al., 2021), Objaverse (Deitke et al., 2023), and Objaverse-XL (Deitke et al., 2024). |
| Dataset Splits | No | For our test set, we sampled meshes with around 500, 1000, 2000, and 4000 faces to reflect the model's generalization under various face counts. Our training dataset comprises 10,000 meshes, with vocabulary sizes systematically evaluated across 256, 512, 1024, 2048, 4096, and 8192. The actual numbers of data utilized can differ across methods. We compared three baseline serialization methods with their RMC-enhanced counterparts using a stratified sample of 100k meshes from our 1M mesh dataset. |
| Hardware Specification | Yes | Training executed on 48 H20 GPUs with a per-GPU batch size of 2 for four days, utilizing FlashAttention and bf16 mixed precision. |
| Software Dependencies | No | For coordinate merging, we implement the Byte-Pair Encoding (BPE) algorithm from Google's SentencePiece (Kudo & Richardson, 2018). Our auto-regressive Transformer architecture adopts cross-attention conditioning following BPT (Weng et al., 2024b), with a point cloud encoder adapted from Michelangelo (Zhao et al., 2024b) processing 8,192 sampled points. We employ AdamW (Loshchilov & Hutter, 2017) optimization (β1 = 0.9, β2 = 0.999) with 0.1 weight decay and cosine annealing, decaying the learning rate from 10⁻⁴ to 6 × 10⁻⁵. |
| Experiment Setup | Yes | We set the Transformer's context window to 9,000... The final vocabulary size for all coordinate-merging methods is 8192. Our training dataset comprises 10,000 meshes, with vocabulary sizes systematically evaluated across 256, 512, 1024, 2048, 4096, and 8192. Our auto-regressive Transformer architecture adopts cross-attention conditioning following BPT (Weng et al., 2024b), with a point cloud encoder adapted from Michelangelo (Zhao et al., 2024b) processing 8,192 sampled points. The mesh transformer features 24 layers with 1,024 hidden dimensions, 16 attention heads (64 dimensions per head), and DeepSpeed ZeRO-2 parallelism. Training executed on 48 H20 GPUs with a per-GPU batch size of 2 for four days, utilizing FlashAttention and bf16 mixed precision. The point cloud encoder remained frozen for the first 48 hours before fine-tuning commenced. We employ AdamW (Loshchilov & Hutter, 2017) optimization (β1 = 0.9, β2 = 0.999) with 0.1 weight decay and cosine annealing, decaying the learning rate from 10⁻⁴ to 6 × 10⁻⁵. Inference acceleration leverages KV caching for efficient sequence generation. |
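The Algorithm 1 pseudocode quoted above can be made runnable. The sketch below restructures the head/neck/tail split into a single loop over 9-value blocks (same behavior), and adds `rac_decode` / `rac_decode_full` as our own round-trip check; the inverse functions are not given in the paper.

```python
def rac_encode(nums):
    # Rearrange interleaved (x, y, z) values into all X, then all Y, then all Z.
    return nums[0::3] + nums[1::3] + nums[2::3]

def rac_encode_full(nums):
    # Apply rac_encode per 9-value block (one triangle = 3 vertices x 3 coords);
    # any leftover values (fewer than 9) form a shorter final block.
    if len(nums) < 9:
        return rac_encode(nums)
    remainder = len(nums) % 9
    body_end = len(nums) - remainder
    out = []
    for start in range(0, body_end, 9):
        out.extend(rac_encode(nums[start:start + 9]))
    if remainder > 0:
        out.extend(rac_encode(nums[body_end:]))
    return out

def rac_decode(block):
    # Inverse of rac_encode for one block whose length is a multiple of 3.
    k = len(block) // 3
    xs, ys, zs = block[:k], block[k:2 * k], block[2 * k:]
    out = []
    for x, y, z in zip(xs, ys, zs):
        out.extend([x, y, z])
    return out

def rac_decode_full(seq):
    # Inverse of rac_encode_full: decode per 9-value block, then the remainder.
    if len(seq) < 9:
        return rac_decode(seq)
    remainder = len(seq) % 9
    body_end = len(seq) - remainder
    out = []
    for start in range(0, body_end, 9):
        out.extend(rac_decode(seq[start:start + 9]))
    if remainder > 0:
        out.extend(rac_decode(seq[body_end:]))
    return out
```

For two triangles flattened as `[x1, y1, z1, x2, y2, z2, ...]`, encoding groups each face's coordinates by axis, and decoding restores the original interleaving exactly.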
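The paper performs coordinate merging with SentencePiece's BPE. As a dependency-free illustration of the core merge rule only (replace the most frequent adjacent token pair with a new token, repeat), here is a minimal pure-Python sketch; the token ids, `num_merges`, and `next_token` values are illustrative, and this is not the SentencePiece implementation the authors used.

```python
from collections import Counter

def most_frequent_pair(seqs):
    # Count adjacent token pairs across all sequences; return the most common.
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_token):
    # Replace every non-overlapping occurrence of `pair` with `new_token`.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def bpe_merge(seqs, num_merges, next_token):
    # Greedily learn `num_merges` merge rules over the token sequences.
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        seqs = [merge_pair(s, pair, next_token) for s in seqs]
        merges.append((pair, next_token))
        next_token += 1
    return seqs, merges
```

In the paper's setting the initial tokens would be quantized coordinates (after the rearrange step), and merging would continue until the vocabulary reaches the stated 8192 entries.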
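The reported schedule decays the learning rate from 10⁻⁴ to 6 × 10⁻⁵ with cosine annealing. As a sanity check, the standard cosine-annealing formula reproduces those endpoints (the paper does not specify warmup or step counts, so `total_steps` here is a placeholder):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-4, lr_min=6e-5):
    # Standard cosine annealing: lr_max at step 0, lr_min at total_steps.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))
```

At the halfway point the rate is the arithmetic mean of the two endpoints, 8 × 10⁻⁵.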