Generating CAD Code with Vision-Language Models for 3D Designs
Authors: Kamel Alrashedy, Pradyumna Tambwekar, Zulfiqar Haider Zaidi, Megan Langwasser, Wei Xu, Matthew Gombolay
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate CADCodeVerify, we introduce CADPrompt, the first benchmark for CAD code generation, consisting of 200 natural-language prompts paired with expert-annotated scripting code for 3D objects. Our findings show that CADCodeVerify improves VLM performance through visual feedback, enhancing the structure of the 3D objects and increasing the compile rate of the generated program. When applied to GPT-4, CADCodeVerify achieved a 7.30% reduction in Point Cloud distance and a 5.5% improvement in compile rate compared to prior work. |
| Researcher Affiliation | Academia | Georgia Institute of Technology, GA, USA |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations (e.g., Eq. 1-5). It also provides examples of prompts used for LLMs in figures (Figures 11-14). However, it does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code and data are available at https://github.com/Kamel773/CAD_Code_Generation |
| Open Datasets | Yes | Code and data are available at https://github.com/Kamel773/CAD_Code_Generation |
| Dataset Splits | Yes | We stratify CADPrompt examples by mesh complexity, geometric complexity, and compilation difficulty to gain insight into model performance ( 6). We split the dataset into two groups based on the median complexity: (i) Simple objects (those with fewer faces and vertices than the median) and (ii) Complex objects (those with more). 3D objects were then labeled as either (i) Easy (at least four of six methods generated compilable code) or (ii) Hard (otherwise). |
| Hardware Specification | No | The paper states that experiments were performed using "GPT-4 ("gpt-v4") via the OpenAI API and Gemini ("gemini1.5-flash-latest") through the Google API" and "For Code Llama 70B, we utilized the Replicate API". This indicates the use of cloud-based APIs for accessing the models, rather than specific local hardware such as GPU models or CPU types. |
| Software Dependencies | No | The paper mentions software components such as CADQuery, Python, Open3D, and Pandas. It also lists the language models used (GPT-4, Gemini 1.5 Pro, Code Llama 70B) with their API identifiers. However, it does not provide version numbers for Python, CADQuery, Open3D, or Pandas, which are key ancillary software dependencies for replication. |
| Experiment Setup | Yes | We performed the experiments using GPT-4 ("gpt-v4") via the OpenAI API and Gemini ("gemini1.5-flash-latest") through the Google API, with the temperature set to 0 for code generation and refinement. In cases where the generated code had bugs or failed to compile, we resubmitted both the code and the compiler error message to the model, adjusting the temperature to 1. For Code Llama 70B, we utilized the Replicate API, setting the temperature to 0.8 for code generation, refinement, and bug fixing. Other hyperparameters, such as top_k = 10, top_p = 0.9, and repeat_penalty = 1.1, were kept at their default values. ... In all our experiments, we set the number of refinements to 2, as no improvement was observed beyond the second refinement. |
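The "Point Cloud distance" metric cited in the results row is a geometric comparison between the generated and ground-truth 3D objects. The paper does not reproduce its exact definition here, but a common symmetric (Chamfer-style) distance between two sampled point clouds can be sketched in pure Python; treat this as an illustration, not the paper's implementation.

```python
import math

def point_cloud_distance(pc_a, pc_b):
    """Symmetric Chamfer-style distance between two point clouds,
    given as lists of (x, y, z) tuples. A brute-force O(n*m) sketch;
    the paper's exact metric and normalization may differ."""
    def nearest(p, cloud):
        # Distance from point p to its nearest neighbor in the other cloud.
        return min(math.dist(p, q) for q in cloud)

    d_ab = sum(nearest(p, pc_b) for p in pc_a) / len(pc_a)
    d_ba = sum(nearest(q, pc_a) for q in pc_b) / len(pc_b)
    return d_ab + d_ba
```

Identical clouds yield a distance of 0; in practice a library such as Open3D would be used to sample point clouds from the compiled meshes before comparison.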
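The dataset-splits row describes two stratifications: Simple vs. Complex by median face/vertex count, and Easy vs. Hard by how many of the six methods produced compilable code. A minimal sketch of that labeling logic, with hypothetical field names (the paper's actual schema is not shown here):

```python
from statistics import median

# Hypothetical records for CADPrompt objects; field names are assumptions.
objects = [
    {"id": "obj1", "faces": 120, "vertices": 64,  "compiling_methods": 5},
    {"id": "obj2", "faces": 40,  "vertices": 20,  "compiling_methods": 2},
    {"id": "obj3", "faces": 300, "vertices": 150, "compiling_methods": 6},
    {"id": "obj4", "faces": 80,  "vertices": 30,  "compiling_methods": 3},
]

median_faces = median(o["faces"] for o in objects)
median_vertices = median(o["vertices"] for o in objects)

def mesh_group(obj):
    # Simple: fewer faces and vertices than the median; otherwise Complex.
    if obj["faces"] < median_faces and obj["vertices"] < median_vertices:
        return "Simple"
    return "Complex"

def difficulty(obj):
    # Easy: at least four of the six methods generated compilable code.
    return "Easy" if obj["compiling_methods"] >= 4 else "Hard"
```

This mirrors the described median split; edge cases (objects exactly at the median, or below the median on only one of the two counts) would follow whatever convention the authors chose.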
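The experiment-setup row describes a generate-then-fix loop: code is generated at temperature 0, and on a compile failure the code plus the error message is resubmitted at temperature 1. The control flow can be sketched as follows; `call_model` is a hypothetical stand-in for the OpenAI/Gemini API call, and the `compile()` builtin is used as a cheap syntax-only proxy for actually executing CADQuery code. This is not the authors' code.

```python
def generate_with_bug_fixing(call_model, prompt, max_fixes=2):
    """Generate code deterministically, then retry on compile errors.

    call_model(prompt, temperature) -> str is a hypothetical API wrapper.
    """
    code = call_model(prompt, temperature=0)  # deterministic first attempt
    for _ in range(max_fixes):
        try:
            compile(code, "<generated>", "exec")  # syntax check only
            return code
        except SyntaxError as err:
            # Resubmit the code and the error message at temperature 1,
            # as described in the experiment setup.
            fix_prompt = f"{prompt}\nCode:\n{code}\nError: {err}"
            code = call_model(fix_prompt, temperature=1)
    return code
```

In the paper's pipeline the "compile" step would execute the CADQuery script and capture the real compiler/interpreter error, and the separate visual-feedback refinement loop (capped at 2 refinements) sits on top of this bug-fixing loop.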