From Tokens to Lattices: Emergent Lattice Structures in Language Models
Authors: Bo Xiong, Steffen Staab
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We create three datasets for evaluation, and the empirical results verify our hypothesis. In this section, we evaluate whether the formal contexts constructed from MLMs align with established gold standards and assess their capability to reconstruct concept lattices. We substantiate our findings with additional ablation studies and case analyses in Sec. 4.4. |
| Researcher Affiliation | Academia | Bo Xiong: Stanford University, United States; University of Stuttgart, Germany. Steffen Staab: University of Stuttgart, Germany; University of Southampton, United Kingdom. |
| Pseudocode | Yes | Algorithm 1: A Simple Formal Context Learning Algorithm. Input: a dataset D, a set of objects G, and a set of attributes M. Output: an estimated formal context incidence matrix I ∈ [0, 1]^(\|G\|×\|M\|). Initialize I = 0^(\|G\|×\|M\|); for each w ∈ D: for each wi ∈ w: for each wj ∈ w: if wi ∈ G and wj ∈ M, then I(G_wi, M_wj) = I(G_wi, M_wj) + 1. Normalization: I = normalize(I); return I. |
| Open Source Code | Yes | Our code and datasets are included as supplemental materials and will be made available upon acceptance. |
| Open Datasets | Yes | We construct three new datasets of formal contexts in different domains, serving as the gold standards for evaluation. Two of them are derived from commonsense knowledge, and the third one is from the biomedical domain. 1) Region-language details the official languages used in different administrative regions around the world. This dataset is extracted from Wiki44k (Ho et al., 2018)... 3) Disease-symptom describes the symptoms associated with various diseases. We extracted diseases represented by a single token and their symptoms from a dataset available on Kaggle. Our code and datasets are included as supplemental materials and will be made available upon acceptance. |
| Dataset Splits | No | The paper describes creating three new datasets (Region-language, Animal-behavior, Disease-symptom) for evaluation and discusses performance metrics like MRR and Hit@k, as well as F1 score and mAP for concept classification. However, it does not explicitly mention any specific training, validation, or test splits for these datasets within the provided text. |
| Hardware Specification | Yes | Computational resources: All experiments were conducted on machines equipped with 4 Nvidia A100 GPUs. |
| Software Dependencies | No | We use the PyTorch Transformer library for all model implementations. The paper mentions the PyTorch Transformer library but does not specify a version number. |
| Experiment Setup | Yes | We instantiate our approach (i.e. Def. 8) with BERT and dub it as Bert Lattice. We consider three variants of BERT models: BERT-distill, BERT-base, and BERT-large. For each model, we use the uncased versions and compare the Average pooling and Max pooling variants, denoted as Bert Lattice (avg.) and Bert Lattice (max.), respectively. ... we apply a min-max normalization approach where, given a threshold α, the binarization is performed as follows: I_(g,m) = 1 if (log Î_(g,m) − min log Î) / (max log Î − min log Î) > α, else 0. ... Fig. 2d illustrates the resulting lattice with α = 0.5. |
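The pseudocode in the table (Algorithm 1) together with the min-max log-normalization from the experiment setup can be sketched as a single function. This is a minimal illustration, not the authors' implementation: sentence tokenization by whitespace splitting, the function name, and the choice of min-max normalization over log counts are assumptions filled in here.

```python
import math

def learn_formal_context(sentences, objects, attributes, alpha=0.5):
    """Estimate a formal context incidence matrix I (Algorithm 1 sketch):
    count object/attribute token co-occurrences within each sentence,
    then binarize via min-max-normalized log counts at threshold alpha."""
    # stable index maps for objects G and attributes M
    G = {g: i for i, g in enumerate(sorted(objects))}
    M = {m: j for j, m in enumerate(sorted(attributes))}
    counts = [[0.0] * len(M) for _ in G]

    # Algorithm 1 core: for each sentence w, increment I(G_wi, M_wj)
    # whenever token wi is an object and token wj is an attribute
    for sent in sentences:
        tokens = sent.split()  # assumption: whitespace tokenization
        for wi in tokens:
            for wj in tokens:
                if wi in G and wj in M:
                    counts[G[wi]][M[wj]] += 1

    # binarization: min-max normalize log counts, keep entries above alpha
    logs = [math.log(c) for row in counts for c in row if c > 0]
    lo, hi = min(logs), max(logs)
    span = (hi - lo) or 1.0  # guard against a constant count matrix
    return [
        [1 if c > 0 and (math.log(c) - lo) / span > alpha else 0
         for c in row]
        for row in counts
    ]
```

With a toy corpus of five sentences over objects {cat, dog} and attributes {meows, barks}, the rarer co-occurrence (cat, barks) falls below the α = 0.5 threshold and is zeroed out, while the two dominant pairs survive in the binary incidence matrix.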