Training Neural Networks as Recognizers of Formal Languages
Authors: Alexandra Butoi, Ghazal Khalighinejad, Anej Svete, Josef Valvoda, Ryan Cotterell, Brian DuSell
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings, using a general method that can be applied to a wide variety of languages. As part of this, we extend an algorithm recently proposed by Snæbjarnarson et al. (2025) for efficient length-controlled sampling of strings from regular languages. We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures: a simple RNN, an LSTM, and a causally-masked transformer. We find that the RNN and LSTM often outperform the transformer, and that auxiliary training objectives such as language modeling can help, although no single objective uniformly improves performance across languages and architectures. |
| Researcher Affiliation | Academia | 1ETH Zurich, 2Duke University, 3University of Copenhagen |
| Pseudocode | Yes | Algorithm 1 Convert a partial DFA A to a WDFA A^D over the n_max-th-order binning semiring with respect to the log semiring. This implicitly creates an intermediate PDFA A over the real semiring with uniform probabilities. Time complexity: O((|Q| + |δ|)·n_max). [...] Algorithm 7 Sample a string of length n from a WDFA A^D over the D-th-order binning semiring with respect to the real semiring, where D ≥ n. Also output a sequence of n + 1 next-symbol sets. The DFA A^D must be the first output of Algorithm 2, and NEXT must be the output of Algorithm 6. Time complexity: O(n log |δ|_out), where |δ|_out is the maximum out-degree of A^D. |
| Open Source Code | Yes | We have publicly released our datasets as a benchmark called FLaRe (Formal Language Recognition), along with our code.2 ... 2https://github.com/rycolab/neural-network-recognizers |
| Open Datasets | Yes | We have publicly released our datasets as a benchmark called FLaRe1 (Formal Language Recognition), along with our code.2 ... 1https://github.com/rycolab/flare |
| Dataset Splits | Yes | For each language, we sample a single, fixed training set of 10k examples with lengths in [0, 40]. We run separate experiments with two different validation sets that are designed to address different scientific questions, each with 1k examples: a short validation set with string lengths in [0, 40], and a long validation set with string lengths in [0, 80]. ... In Table 2, we show recognition accuracy on a test set with string lengths in [0, 500]. It has 5,010 examples, or an average of 10 examples per length. |
| Hardware Specification | No | Our experiments are small enough that we are able to run them in CPU mode, without GPU acceleration. |
| Software Dependencies | No | Our implementations for all three architectures are based on those provided by PyTorch (Paszke et al., 2019). |
| Experiment Setup | Yes | We use 5 layers for all models in all experiments. We automatically adjust the hidden vector size d so that the number of parameters is as close as possible to 64k, excluding extra parameters for language modeling and next-symbol prediction heads. ... Wherever dropout is applicable, we use a dropout rate of 0.1. ... We randomly sample the batch size B from a uniform distribution over [128, 4096]. We randomly sample the initial learning rate from a log-uniform distribution over [0.0001, 0.01]. We randomly sample the loss term coefficients λLM and λNS, when they are needed, from a log-uniform distribution over [0.01, 10]. We train each model by minimizing the loss function defined in §3.3 using Adam (Kingma & Ba, 2015). We clip gradients with a threshold of 5 using L2 norm rescaling. We multiply the learning rate by 0.5 after 5 checkpoints of no decrease in recognition cross-entropy on the validation set, and we stop early after 10 checkpoints of no decrease. We select the checkpoint with the lowest recognition cross-entropy on the validation set when reporting results. We train for a maximum of 1k epochs. |
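The length-controlled sampling quoted in the Pseudocode row rests on a standard idea: count, by dynamic programming, how many accepted strings of each length remain from each DFA state, then sample transitions in proportion to those counts. The sketch below illustrates that idea in plain Python; the dict-based DFA encoding and the function names are ours, and this is not the authors' semiring-based implementation, which additionally handles log-space weights and next-symbol sets.

```python
import random

def count_strings(dfa, n_max):
    """Dynamic programming over lengths:
    counts[k][q] = number of strings of length k accepted starting from state q."""
    counts = [{q: (1 if q in dfa["accept"] else 0) for q in dfa["states"]}]
    for _ in range(n_max):
        prev = counts[-1]
        # One more symbol: sum completion counts over all outgoing transitions.
        cur = {q: sum(prev[r] for (p, _), r in dfa["delta"].items() if p == q)
               for q in dfa["states"]}
        counts.append(cur)
    return counts

def sample_string(dfa, counts, n):
    """Sample uniformly at random among the accepted strings of exactly length n."""
    q = dfa["start"]
    out = []
    for k in range(n, 0, -1):
        # Weight each transition by the number of length-(k-1) completions it admits.
        options = [(a, r) for (p, a), r in dfa["delta"].items() if p == q]
        weights = [counts[k - 1][r] for _, r in options]
        total = sum(weights)
        if total == 0:
            raise ValueError(f"no accepted string of length {n}")
        pick = random.randrange(total)
        for (a, r), w in zip(options, weights):
            if pick < w:
                out.append(a)
                q = r
                break
            pick -= w
    assert q in dfa["accept"]
    return "".join(out)
```

For example, for the language (ab)* the only accepted string of length 4 is "abab", and the sampler returns it with probability 1; for richer languages the counts make the draw uniform over all accepted strings of the requested length.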
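The random hyperparameter search described in the Experiment Setup row can be sketched as a small sampler; the function names and dict layout below are ours, and this illustrates only the stated sampling distributions (uniform batch size, log-uniform learning rate and loss coefficients), not the authors' training harness.

```python
import math
import random

def log_uniform(lo, hi, rng=random):
    """Sample from a log-uniform distribution on [lo, hi]."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_hyperparams(rng=random):
    """Draw one configuration per the paper's stated ranges."""
    return {
        "batch_size": rng.randint(128, 4096),      # uniform over [128, 4096]
        "learning_rate": log_uniform(1e-4, 1e-2, rng),
        "lambda_lm": log_uniform(0.01, 10, rng),   # language-modeling loss coefficient
        "lambda_ns": log_uniform(0.01, 10, rng),   # next-symbol loss coefficient
    }
```

A log-uniform draw spreads samples evenly across orders of magnitude, which is the usual choice for learning rates and loss-term coefficients whose plausible values span two decades.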