SCALM: Detecting Bad Practices in Smart Contracts Through LLMs

Authors: Zongwei Li, Xiaoqi Li, Wenkai Li, Xin Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments using multiple LLMs and datasets show that SCALM outperforms existing tools in detecting bad practices in smart contracts. Ablation experiments further reveal that the RAG component significantly improves SCALM's performance.
Researcher Affiliation | Academia | Zongwei Li, Xiaoqi Li*, Wenkai Li, Xin Wang; School of Cyberspace Security, Hainan University, Haikou, 570228, China
Pseudocode | Yes | Algorithm 1: SCALM Algorithm
Open Source Code | Yes | We open-source SCALM's code and experimental data at https://figshare.com/s/5cc3639706e4ecd16724.
Open Datasets | Yes | The data collection comes from the DAppSCAN database (Zheng et al. 2024b), which includes 39,904 smart contracts with 1,618 SWC weaknesses. The SmartBugs dataset (Durieux et al. 2020) is also used; a total of 1,894 smart contracts covering five types of SWC weaknesses are extracted for the comparison experiments.
Dataset Splits | No | The paper describes the datasets used (DAppSCAN and SmartBugs) and the number of samples for certain SWC categories in the evaluation (e.g., 94 positive samples for SWC-104, and 200 positive and 200 negative samples for the others). However, it does not specify explicit training, validation, or test splits, since the authors train no model components within SCALM: the LLMs are used as-is or with prompting strategies, and DAppSCAN serves as a knowledge base.
Hardware Specification | Yes | All experiments are executed on a server equipped with an NVIDIA GeForce RTX 4070 Ti GPU, an Intel(R) Core(TM) i9-13900KF CPU, and 128 GB RAM, running Ubuntu 22.04 LTS.
Software Dependencies | Yes | The software environment includes Python 3.9 and PyTorch 2.0.1.
Experiment Setup | No | The paper describes the overall SCALM framework, the LLMs selected for the experiments, and the evaluation metrics (Accuracy, Recall, F1 score). However, it does not provide specific hyperparameters such as learning rates, batch sizes, optimizers, or training epochs, which would normally constitute a detailed experimental setup for model training.
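The evaluation metrics named above (Accuracy, Recall, F1 score) can be made concrete with a minimal sketch. This is not the authors' code; it simply computes the three standard metrics for a binary bad-practice-detection task, where label 1 marks a contract flagged as containing a weakness:

```python
def detection_metrics(y_true, y_pred):
    """Return (accuracy, recall, f1) for binary labels (1 = bad practice found)."""
    # Tally the four outcomes of the confusion matrix.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, recall, f1

# Hypothetical example: 4 contracts, one weakness missed (a false negative).
acc, rec, f1 = detection_metrics([1, 1, 0, 0], [1, 0, 0, 0])
# acc = 0.75, rec = 0.5
```

F1, the harmonic mean of precision and recall, is the headline number for comparisons like those in the paper because it penalizes both missed weaknesses and false alarms.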