LEKA: LLM-Enhanced Knowledge Augmentation

Authors: Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dongjie Wang, Yanjie Fu, Kunpeng Liu

IJCAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We validate the effectiveness of our approach through extensive experiments across various domains and demonstrate significant improvements over traditional methods in automating data alignment and optimizing transfer learning outcomes. We conduct a series of experiments to validate the effectiveness and robustness of our LEKA method across different tasks. Experimental results demonstrate that our method has clear advantages over existing methods. In this section, we present four experiments to demonstrate the effectiveness and impacts of the LEKA. First, we compare the LEKA against several baseline methods on four downstream tasks. Second, we present the correlations between several target domains and their retrieved source domains. Finally, we discuss the reason for performance improvement. We evaluate our method on four datasets of medical and economic domains: (1) Breast Cancer Wisconsin (Diagnostic) (BCW) [Wolberg et al., 1995], (2) Heart Disease (HD) [Janosi et al., 1989], (3) Vehicle Insurance Data (VID) [Bhatt, 2019], and (4) Telco Customer Churn (TCC) [Blast Char, 2018]. We show the detailed information about the features of the datasets in Table 1. We evaluate the model performance by the following metrics: Overall Accuracy (Acc) measures the proportion of true results (both true positives and true negatives) in the total dataset. Precision (Prec) reflects the ratio of true positive predictions to all positive predictions for each class. Recall (Rec), also known as sensitivity, reflects the ratio of true positive predictions to all actual positives for each class. F-Measure (F1) is the harmonic mean of precision and recall, calculated here as the macro-average. 
We apply the LEKA across a range of models: 1) Tabnet (TN) [Arik and Pfister, 2021]; 2) Tab Transformer (TT) [Huang et al., 2020]; 3) Random Forest (RF) [Rigatti, 2017]; 4) Gradient Boosting Decision Trees (GBDT) [Lin et al., 2023]; 5) XGBoost (XB) [Chen and Guestrin, 2016]. We compare the performance in these tasks both with and without our method.
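The four quoted metrics (Acc, and macro-averaged Prec/Rec/F1) can be reproduced directly. A minimal sketch using scikit-learn on a toy binary prediction vector (the labels here are illustrative, not from the paper's data):

```python
# Sketch of the evaluation metrics quoted above: overall accuracy plus
# macro-averaged precision, recall, and F1, as the paper specifies for F1.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # toy model predictions

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec  = recall_score(y_true, y_pred, average="macro")
f1   = f1_score(y_true, y_pred, average="macro")  # macro-average, per the paper

print(f"Acc={acc:.3f} Prec={prec:.3f} Rec={rec:.3f} F1={f1:.3f}")
# → Acc=0.750 Prec=0.750 Rec=0.750 F1=0.750
```

`average="macro"` computes each class's score separately and takes the unweighted mean, which matches the paper's stated macro-averaged F-Measure.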
Researcher Affiliation Academia Xinhao Zhang1, Jinghan Zhang1, Fengran Mo2, Dongjie Wang3, Yanjie Fu4 and Kunpeng Liu1; 1Portland State University, USA; 2University of Montreal, Canada; 3University of Kansas, USA; 4Arizona State University, USA
Pseudocode No The paper describes the methodology in text and uses a framework diagram (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code No The paper does not explicitly state that the source code for the LEKA methodology is available, nor does it provide a link to a code repository. It mentions using GPT-4o and Exa API for query generation and data fetching, but this refers to third-party tools, not the authors' own implementation code for LEKA.
Open Datasets Yes We evaluate our method on four datasets of medical and economic domains: (1) Breast Cancer Wisconsin (Diagnostic) (BCW) [Wolberg et al., 1995], (2) Heart Disease (HD) [Janosi et al., 1989], (3) Vehicle Insurance Data (VID) [Bhatt, 2019], and (4) Telco Customer Churn (TCC) [Blast Char, 2018]. We show the detailed information about the features of the datasets in Table 1. ... In our setup for data synthesis and model training, we utilize GPT-4o [Open AI, 2024] as the query generator, combined with the Exa API [Exa, 2024] to fetch web pages containing datasets from Kaggle [Kaggle, 2024] and the UCI Machine Learning Repository [University of California, Irvine, 2024] that may be suitable for knowledge transfer.
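All four datasets are indeed public. The BCW data, for instance, is the same Breast Cancer Wisconsin (Diagnostic) set [Wolberg et al., 1995] that ships with scikit-learn, so a quick sanity load is possible without the UCI or Kaggle pipelines the paper uses (a sketch, not the authors' setup):

```python
# Hedged sketch: loading the public BCW dataset via scikit-learn's bundled
# copy. The other three datasets (HD, VID, TCC) are hosted on the UCI
# repository and Kaggle and would be fetched as CSVs instead.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target
print(X.shape)                   # (569, 30): 569 samples, 30 numeric features
print(list(data.target_names))   # ['malignant', 'benign']
```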
Dataset Splits No The paper mentions batch sizes for training and the number of epochs: 'For our models, we configure TN, TT, and FTT with a batch size of 512 for the VID and TCC datasets and a batch size of 32 for the BCW dataset, a maximum of 100 epochs, and employ early stopping with a patience of 20.' However, it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts) for any of the datasets used.
Hardware Specification No The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies No The paper mentions several software components like 'GPT-4o [Open AI, 2024]', 'Exa API [Exa, 2024]', and 'pytorch tabnet', but it does not specify concrete version numbers for these or any other libraries, frameworks, or programming languages used in the implementation.
Experiment Setup Yes For our models, we configure TN, TT, and FTT with a batch size of 512 for the VID and TCC datasets and a batch size of 32 for the BCW dataset, a maximum of 100 epochs, and employ early stopping with a patience of 20. The learning rate is set at the default 0.02 for pytorch tabnet. For the RF and GBDT models, the number of trees is set to 100, with GBDT also configured with a learning rate of 0.1 and a max depth of 3. TTab is set with a maximum of 50 epochs, a learning rate of 1×10⁻³, and a weight decay of 1×10⁻⁴.
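The quoted settings map directly onto library arguments. A configuration sketch, assuming the `pytorch_tabnet` package the paper names and hypothetical training arrays `X_train`/`y_train` (not provided by the paper):

```python
# Configuration sketch only, mirroring the quoted setup: TabNet at the
# default learning rate of 0.02, batch size 512 (VID/TCC) or 32 (BCW),
# at most 100 epochs, early stopping with patience 20; RF/GBDT alongside.
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

tabnet = TabNetClassifier(optimizer_fn=torch.optim.Adam,
                          optimizer_params=dict(lr=2e-2))  # default 0.02
# tabnet.fit(X_train, y_train,
#            max_epochs=100, patience=20, batch_size=512)  # 32 for BCW

rf   = RandomForestClassifier(n_estimators=100)            # 100 trees
gbdt = GradientBoostingClassifier(n_estimators=100,        # 100 trees,
                                  learning_rate=0.1,       # lr 0.1,
                                  max_depth=3)             # max depth 3
```

Note that, consistent with the "Dataset Splits: No" finding above, nothing in the paper pins down how `X_train`/`y_train` would be carved out of each dataset.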