Can Transformers Do Enumerative Geometry?

Authors: Baran Hashemi, Roderic Corominas, Alessandro Giacchetto

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental By reformulating the problem as a continuous optimization task, we compute intersection numbers across a wide value range from 10^-45 to 10^45. To capture the recursive nature inherent in these intersection numbers, we propose the Dynamic Range Activator (DRA), a new activation function that enhances the Transformer's ability to model recursive patterns and handle severe heteroscedasticity. Given the precision required to compute these invariants, we quantify the uncertainty in the predictions using Conformal Prediction with a dynamic sliding window, adaptive to partitions of equivalent numbers of marked points. To the best of our knowledge, there has been no prior work on modeling recursive functions with such high variance and factorial growth. Beyond simply computing intersection numbers, we explore the enumerative world-model of Transformers. Our interpretability analysis reveals that the network implicitly models the Virasoro constraints in a purely data-driven manner. Moreover, through abductive hypothesis testing, probing, and causal inference, we uncover evidence of an emergent internal representation of the large-genus asymptotics of ψ-class intersection numbers. These findings suggest that the network internalizes the parameters of the asymptotic closed form and the polynomiality phenomenon of ψ-class intersection numbers in a non-linear manner.
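The "Conformal Prediction with a dynamic sliding window" mentioned above can be illustrated generically. The following is a hedged sketch of split conformal prediction over a rolling calibration window, not the paper's exact implementation; the window size, `alpha`, and the use of absolute residuals are assumptions for illustration:

```python
import numpy as np

def sliding_window_radius(abs_residuals, window=200, alpha=0.1):
    """Half-width of a (1 - alpha) conformal interval computed from the
    most recent `window` calibration residuals (split conformal prediction)."""
    recent = np.asarray(abs_residuals)[-window:]
    n = len(recent)
    # Finite-sample-corrected quantile level, capped at 1.0
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(recent, level, method="higher")

# Toy usage: absolute residuals of a model on a calibration stream
rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(0.0, 1.0, size=1000))
q = sliding_window_radius(residuals, window=200, alpha=0.1)

prediction = 3.5  # some point prediction (e.g., on a logarithmic scale)
interval = (prediction - q, prediction + q)
```

In the paper's setting the window is additionally made adaptive to partitions with the same number of marked points, so each group of test inputs gets its own calibration pool.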
Researcher Affiliation Academia Baran Hashemi, ORIGINS Data Science Lab, Technical University of Munich, EMAIL; Roderic G. Corominas, Department of Mathematics, Harvard University, EMAIL; Alessandro Giacchetto, Departement Mathematik, ETH Zürich, EMAIL
Pseudocode No The paper describes the DynamicFormer model in detail in Appendix E and illustrates its architecture in Figure 5. However, there are no explicitly labeled sections or figures providing pseudocode or an algorithm block.
Open Source Code Yes GitHub Code: https://github.com/Baran-phys/DynamicFormer
Open Datasets No Our model is trained on known data computed by a brute-force algorithm up to genus 13 and is tested up to genus 17. The input data during training consisted of the sparse tensors B and C from Equation (2.5), the genus g, the number of marked points n, and the partitions d = (d_1, ..., d_n) of d_{g,n}. The paper describes the characteristics and generation of the dataset but does not provide a direct link, DOI, or specific repository name for public access to this generated data.
Dataset Splits Yes Our model is trained on known data computed by a brute-force algorithm up to genus 13 and is tested up to genus 17. In the ID setting, we examine data with the same genera as the training data, that is g_ID = [1, 13], but with different, unseen numbers of marked points n_ID ∈ [35, 11]. In the OOD setting, we examine data with higher genera than the training data, specifically g_OOD = [14, 15, 16, 17], and a number of marked points n_OOD ∈ [1, 9].
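The ID/OOD split quoted above amounts to filtering samples by genus. A minimal sketch, assuming each sample is a dict with keys `g` (genus) and `n` (number of marked points), which is an illustrative data layout rather than the paper's actual format:

```python
def split_by_genus(samples, train_max_genus=13):
    """ID data shares genera with training (g <= train_max_genus);
    OOD data has strictly higher genus (here g in [14, 17])."""
    in_dist = [s for s in samples if s["g"] <= train_max_genus]
    out_dist = [s for s in samples if s["g"] > train_max_genus]
    return in_dist, out_dist

# Toy usage: one sample per (genus, marked-points) pair
samples = [{"g": g, "n": n} for g in range(1, 18) for n in range(1, 10)]
id_set, ood_set = split_by_genus(samples)
```

The actual splits further restrict the marked-point counts n per setting, so a full reproduction would filter on `n` as well.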
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments.
Software Dependencies No The paper mentions several software-related concepts and frameworks (e.g., Transformers, Conformal Prediction, and implicitly a deep learning framework such as PyTorch for the model implementation) but does not provide specific version numbers for the software dependencies used.
Experiment Setup No To demonstrate the advantage of DRA in capturing recursive behavior, we set up a small experiment. We generate a small dataset based on the recursive function r(n) = n + (n AND r(n-1)), where AND is the bitwise logical AND operator (Sloane, 2007). We train a fully connected neural network with two hidden layers of 64 and 32 neurons over the interval n ∈ [0, 120], then test the model over n ∈ [121, 200]. All layers, including the Multi-Head Attention (MHA) blocks, use the Dynamic Range Activator (DRA) non-linear activation function. The DRA prediction head is a 2-layer MLP that predicts ψ-class intersection numbers on a logarithmic scale. As the loss function, we use the Mean Absolute Error (MAE) loss. The total loss is L_Total = L_MAE + L_TM, where L_TM introduces a Self-Supervised Learning objective between the [DYN] registry tokens (Darcet et al., 2024) of each modality, inspired by Barlow Twins (Zbontar et al., 2021). While the paper describes some architectural choices (number of layers and neurons for a specific experiment, use of MHA, PNA, and a 2-layer MLP head) and loss functions, it does not specify concrete hyperparameters such as learning rate, batch size, optimizer type, or number of epochs for the main experiments.
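The toy recursion used in that small experiment can be tabulated in a few lines. A sketch of the dataset construction, assuming the base case r(0) = 0, which the quoted excerpt does not state:

```python
def r_table(n_max):
    """Tabulate r(n) = n + (n AND r(n-1)) for n = 0..n_max, with r(0) = 0."""
    vals = [0]
    for n in range(1, n_max + 1):
        vals.append(n + (n & vals[-1]))  # & is Python's bitwise AND
    return vals

seq = r_table(200)
train = [(n, seq[n]) for n in range(0, 121)]   # train interval n in [0, 120]
test = [(n, seq[n]) for n in range(121, 201)]  # test interval n in [121, 200]
```

The sequence is highly non-smooth (e.g., it starts 0, 1, 2, 5, 8, 5, ...), which is what makes extrapolating it beyond the training interval a meaningful probe of an activation function's ability to model recursion.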