StringLLM: Understanding the String Processing Capability of Large Language Models

Authors: Xilong Wang, Hao Fu, Jindong Wang, Neil Gong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We present a comprehensive study of LLMs' string processing capability. In particular, we first propose StringLLM, a method to construct datasets for benchmarking the string processing capability of LLMs. We use StringLLM to build a series of datasets, referred to as StringBench. It encompasses a wide range of string processing tasks, allowing us to systematically evaluate LLMs' performance in this area. Our evaluations indicate that LLMs struggle with accurately processing strings compared to humans. To uncover the underlying reasons for this limitation, we conduct an in-depth analysis and subsequently propose an effective approach that significantly enhances LLMs' string processing capability via fine-tuning."
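To make the kind of task StringBench evaluates concrete, here is a minimal sketch of a string-processing check. The task (character counting) and the checker function are illustrative assumptions, not items taken from the paper's datasets:

```python
# Hypothetical StringBench-style task: the model is asked how many times
# a character occurs in a string, and its answer is verified
# programmatically against the ground truth.

def check_char_count(s: str, ch: str, model_answer: int) -> bool:
    """Return True if the model's answer matches the true count."""
    return model_answer == s.count(ch)

# "strawberry" contains "r" three times; an answer of 2 would fail.
assert check_char_count("strawberry", "r", 3)
assert not check_char_count("strawberry", "r", 2)
```

Because the checker is exact string computation, this style of task cleanly separates models that reason over characters from those that pattern-match on tokens.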
Researcher Affiliation: Collaboration. ¹Duke University, ²Li Auto, ³William & Mary.
Pseudocode: No. The paper describes the StringLLM method and its steps in text and illustrates task examples with Python code snippets in figures and tables (e.g., Figure 1, Table 12, Figure 3), but it does not contain a dedicated pseudocode block or algorithm description for its main methodologies.
Open Source Code: Yes. "Our code and data are available at https://github.com/wxl-lxw/StringLLM."
Open Datasets: Yes. "Our code and data are available at https://github.com/wxl-lxw/StringLLM. We randomly sample strings from the Flores-200 dataset (Costa-jussà et al., 2022)."
Dataset Splits: Yes. "For the test sets, we randomly split 20% of the data from each of the three datasets (Multilingual, Hash, and Random String). ... The remaining 80% of our datasets is used as the training sets for our experiments on fine-tuning LLMs in Section 6."
Hardware Specification: No. The paper does not specify any hardware (e.g., GPU models, CPU models, or cloud instances) used for the experiments. It mentions Microsoft Azure credits but gives no hardware details.
Software Dependencies: No. The paper mentions the LLaMA-Factory framework (Zheng et al., 2024), LoRA (Hu et al., 2022), and the LM-Evaluation-Harness framework (Gao et al., 2024), but it does not provide version numbers for these software components, which are necessary for full reproducibility.
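The gap flagged here (missing version numbers) is easy to close when releasing code. A generic way to capture exact versions, assuming a pip-based Python environment (the paper's actual environment is unknown):

```shell
# Record the exact version of every installed Python package.
# Generic practice; not taken from the paper's repository.
pip freeze > requirements.txt

# To recreate the same environment later:
# pip install -r requirements.txt
```

Committing the resulting `requirements.txt` (or an equivalent lock file) alongside the code would satisfy this reproducibility criterion.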
Experiment Setup: No. The paper describes the prompt engineering techniques (raw instructions, CoT, PoT) and the fine-tuned LLMs (Llama-3.1-8B, Gemma-2-9b, Mistral-7B-v0.3), along with the additional datasets used for fine-tuning. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings, which are crucial for reproducing the experimental setup.