StringLLM: Understanding the String Processing Capability of Large Language Models

Authors: Xilong Wang, Hao Fu, Jindong Wang, Neil Gong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We present a comprehensive study of LLMs' string processing capability. In particular, we first propose StringLLM, a method to construct datasets for benchmarking the string processing capability of LLMs. We use StringLLM to build a series of datasets, referred to as StringBench. It encompasses a wide range of string processing tasks, allowing us to systematically evaluate LLMs' performance in this area. Our evaluations indicate that LLMs struggle with accurately processing strings compared to humans. To uncover the underlying reasons for this limitation, we conduct an in-depth analysis and subsequently propose an effective approach that significantly enhances LLMs' string processing capability via fine-tuning."
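To make the kind of task StringBench evaluates concrete, here is a minimal sketch of a string-processing check. The task (character counting) and the checker function are illustrative assumptions, not items taken from the paper's datasets:

```python
# Hypothetical StringBench-style task: the model is asked how many times
# a character occurs in a string, and its answer is verified
# programmatically against the ground truth.

def check_char_count(s: str, ch: str, model_answer: int) -> bool:
    """Return True if the model's answer matches the true count."""
    return model_answer == s.count(ch)

# "strawberry" contains "r" three times; an answer of 2 would fail.
assert check_char_count("strawberry", "r", 3)
assert not check_char_count("strawberry", "r", 2)
```

Because the checker is exact string computation, this style of task cleanly separates models that reason over characters from those that pattern-match on tokens.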
Researcher Affiliation: Collaboration. ¹Duke University, ²Li Auto, ³William & Mary.
Pseudocode: No. The paper describes the StringLLM method and its steps in text and illustrates task examples with Python code snippets in figures and tables (e.g., Figure 1, Table 12, Figure 3), but it does not contain a dedicated pseudocode block or algorithm description for its main methodologies.
Open Source Code: Yes. "Our code and data are available at https://github.com/wxl-lxw/StringLLM."
Open Datasets: Yes. "Our code and data are available at https://github.com/wxl-lxw/StringLLM. We randomly sample strings from the Flores-200 dataset (Costa-jussà et al., 2022)."
Dataset Splits: Yes. "For the test sets, we randomly split 20% of the data from each of the three datasets (Multilingual, Hash, and Random String). ... The remaining 80% of our datasets is used as the training sets for our experiments on fine-tuning LLMs in Section 6."
Hardware Specification: No. The paper does not specify any hardware (e.g., GPU models, CPU models, or cloud instances) used for the experiments. It mentions Microsoft Azure credits but gives no hardware details.
Software Dependencies: No. The paper mentions the LLaMA-Factory framework (Zheng et al., 2024), LoRA (Hu et al., 2022), and the LM-Evaluation-Harness framework (Gao et al., 2024), but it does not provide version numbers for these software components, which are necessary for full reproducibility.
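The gap flagged here (missing version numbers) is easy to close when releasing code. A generic way to capture exact versions, assuming a pip-based Python environment (the paper's actual environment is unknown):

```shell
# Record the exact version of every installed Python package.
# Generic practice; not taken from the paper's repository.
pip freeze > requirements.txt

# To recreate the same environment later:
# pip install -r requirements.txt
```

Committing the resulting `requirements.txt` (or an equivalent lock file) alongside the code would satisfy this reproducibility criterion.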
Experiment Setup: No. The paper describes the prompt engineering techniques (raw instructions, CoT, PoT) and the fine-tuned LLMs (Llama-3.1-8B, Gemma-2-9b, Mistral-7B-v0.3), along with the additional datasets used for fine-tuning. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings, which are crucial for reproducing the experimental setup.