HiRemate: Hierarchical Approach for Efficient Re-materialization of Neural Networks

Authors: Julia Gusak, Xunyi Zhao, Théotime Le Hellard, Zhe Li, Lionel Eyraud-Dubois, Olivier Beaumont

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4. Experimental Evaluation: The experiments were performed on an NVIDIA Quadro RTX8000 GPU with 48 GB of memory and an NVIDIA V100 GPU with 16 GB of memory, using PyTorch 2.0.1, CUDA 11.6, and Gurobi 9.5.0. We intentionally report performance on settings where the experimental platform can run the original model, so that we can compare our results with the training time obtained with regular PyTorch Autodiff. All experiments can be scaled up by increasing image or batch size, to a point where training requires using HIREMATE. Additional experiments, including an ablation study, varying batch sizes and sequence lengths, and a dozen different architectures, are available in the Appendix.
Researcher Affiliation | Academia | 1Inria Center at the University of Bordeaux; 2École Normale Supérieure, PSL University, Paris. Correspondence to: Julia Gusak <EMAIL>, Lionel Eyraud-Dubois <EMAIL>.
Pseudocode | Yes | Algorithm 1: H-Partition Bottom-to-Top algorithm
Open Source Code | No | The paper mentions external tools used in their framework, such as ROTOR (https://gitlab.inria.fr/hiepacs/rotor) and TW-REMAT (https://github.com/nshepperd/gpt-2/tree/finetuning/twremat), and discusses the 'RK-GB module as in ROCKMATE'. However, there is no explicit statement or link providing the source code for HIREMATE itself, nor is HIREMATE identified as an open-source project with a direct repository link.
Open Datasets | No | The paper discusses various neural network architectures (e.g., GPT2, UNet, MLPMixer, RegNet32, ResNet101, Transformer, FNO, U-FNO, UNO) on which HIREMATE is evaluated. It also mentions varying batch sizes, sequence lengths, and image resolutions for inputs. However, it does not explicitly state which specific datasets were used for training these models, nor does it provide any concrete access information (links, DOIs, or citations) for publicly available data.
Dataset Splits | No | The paper does not provide specific details on training/test/validation splits. It mentions experiments on various types of networks and varying input parameters such as batch sizes and sequence lengths, but provides no information on how data might have been partitioned for these experiments.
Hardware Specification | Yes | The experiments were performed on an NVIDIA Quadro RTX8000 GPU with 48 GB of memory and an NVIDIA V100 GPU with 16 GB of memory, using PyTorch 2.0.1, CUDA 11.6, and Gurobi 9.5.0. All models passed a sanity check: both forward and backward passes produce the exact same result as the original module. Experiments are done on an NVIDIA P100 GPU with 16 GB.
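The sanity check quoted above requires that a rematerializing version of a module reproduce both the forward outputs and the gradients of the original exactly. The following is a minimal, self-contained sketch of such a check; all names (`original_forward`, `remat_forward`, the finite-difference gradient) are illustrative stand-ins, not HiRemate's actual API or PyTorch autodiff.

```python
def original_forward(x):
    # "Original" two-stage computation: the intermediate activation h
    # plays the role of a stored tensor.
    h = x * x          # stage 1 (activation that would be kept in memory)
    return h + 3 * h   # stage 2


def remat_forward(x):
    # Rematerialized variant: recompute the stage-1 activation instead of
    # storing it. It must be numerically identical to the original.
    def stage1():
        return x * x
    return stage1() + 3 * stage1()


def finite_diff_grad(f, x, eps=1e-6):
    # Central finite difference as a simple stand-in for autodiff gradients.
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def sanity_check(x):
    # Forward passes must match exactly.
    assert original_forward(x) == remat_forward(x)
    # Backward passes (gradients) must match as well.
    g0 = finite_diff_grad(original_forward, x)
    g1 = finite_diff_grad(remat_forward, x)
    assert abs(g0 - g1) < 1e-9
    return True
```

In the paper's setting the same comparison is made between the `nn.Module` produced by HIREMATE and the unmodified module, with exact equality of forward results and gradients.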
Software Dependencies | Yes | The experiments were performed on an NVIDIA Quadro RTX8000 GPU with 48 GB of memory and an NVIDIA V100 GPU with 16 GB of memory, using PyTorch 2.0.1, CUDA 11.6, and Gurobi 9.5.0.
Experiment Setup | Yes | We perform a warm-up phase consisting of five initial runs; the subsequent ten runs are used to evaluate the peak memory and computation time, providing reliable estimates of performance. The subgraph sizes are bounded by two main parameters: M_l denotes the maximum number of nodes in a lower-level subgraph, and M_t denotes the maximum number of nodes in the top-level graph. The default value for α is 0.5. The number of binary variables in the H-ILP formulation depends linearly on the total number of options of all nodes. To avoid wasting resources when several very similar options are available for a given node, we include in H-ILP a hyperparameter N_o that imposes a limit on the total number of options. Table 4 (c) describes the result of HIREMATE on each model: the budget provided to HIREMATE, and the relative memory usage (compared to the peak memory of the autodiff solution) of the resulting nn.Module created by HIREMATE.
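The measurement protocol quoted above (five warm-up runs, then ten timed runs) can be sketched as a small benchmarking helper. This is an illustrative sketch, not the paper's code: `benchmark` and its defaults are assumptions, and on a GPU one would additionally synchronize the device and query peak memory (e.g. via PyTorch's CUDA memory statistics), which is omitted here to keep the sketch self-contained.

```python
import time


def benchmark(fn, warmup=5, runs=10):
    """Run fn several times untimed, then average the timed runs.

    warmup: untimed runs to stabilize caches, allocators, JIT, etc.
    runs:   timed runs whose wall-clock durations are averaged.
    Returns the mean duration of one run, in seconds.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```

In the paper's setting, `fn` would be a single training step of the module produced by HIREMATE, and peak memory would be read after the measured runs rather than timed here.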