StringLLM: Understanding the String Processing Capability of Large Language Models
Authors: Xilong Wang, Hao Fu, Jindong Wang, Neil Gong
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a comprehensive study of LLMs' string processing capability. In particular, we first propose StringLLM, a method to construct datasets for benchmarking the string processing capability of LLMs. We use StringLLM to build a series of datasets, referred to as StringBench. It encompasses a wide range of string processing tasks, allowing us to systematically evaluate LLMs' performance in this area. Our evaluations indicate that LLMs struggle with accurately processing strings compared to humans. To uncover the underlying reasons for this limitation, we conduct an in-depth analysis and subsequently propose an effective approach that significantly enhances LLMs' string processing capability via fine-tuning. |
| Researcher Affiliation | Collaboration | 1Duke University, 2Li Auto, 3William & Mary |
| Pseudocode | No | The paper describes the 'StringLLM' method and its steps in text and illustrates task examples with Python code snippets in figures and tables (e.g., Figure 1, Table 12, Figure 3), but it does not contain a dedicated pseudocode block or algorithm description for its main methodologies. |
| Open Source Code | Yes | Our code and data are available at https://github.com/wxl-lxw/StringLLM. |
| Open Datasets | Yes | Our code and data are available at https://github.com/wxl-lxw/StringLLM. We randomly sample strings from the Flores-200 dataset (Costa-jussà et al., 2022) |
| Dataset Splits | Yes | For the test sets, we randomly split 20% of the data from each of the three datasets Multilingual, Hash, and Random String. ... The remaining 80% of our datasets is used as the training sets for our experiments on fine-tuning LLMs in Section 6. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU models, or cloud instances) used for running the experiments. It mentions Microsoft Azure credits but no hardware specifications. |
| Software Dependencies | No | The paper mentions using the 'Llama Factory framework (Zheng et al., 2024)', 'LoRA (Hu et al., 2022)', and the 'LM-Evaluation-Harness framework (Gao et al., 2024)'. However, it does not provide specific version numbers for these software components, which are necessary for full reproducibility. |
| Experiment Setup | No | The paper describes the prompt engineering techniques (Raw instructions, CoT, PoT) and the LLMs fine-tuned (Llama-3.1-8B, Gemma-2-9b, Mistral-7B-v0.3) along with the additional datasets used for fine-tuning. However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs, or optimizer settings, which are crucial for reproducing the experimental setup. |
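The 20%/80% test/train split described in the Dataset Splits row can be sketched as follows. This is a minimal illustrative implementation, not the authors' released code; the function name `split_dataset` and the fixed seed are assumptions for the example.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Randomly partition samples into train/test sets.

    Illustrative sketch of a 20% test split; not the paper's actual code.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    shuffled = samples[:]      # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # First n_test shuffled samples form the test set; the rest train.
    return shuffled[n_test:], shuffled[:n_test]

# Example: 100 placeholder strings split into 80 train / 20 test.
train, test = split_dataset([f"string_{i}" for i in range(100)])
print(len(train), len(test))  # 80 20
```

Applied independently to each of the three datasets (Multilingual, Hash, Random String), this yields per-dataset test sets with the remaining 80% reserved for fine-tuning.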