SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents
Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SeCom exhibits a significant performance advantage over baselines on the long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg. |
| Researcher Affiliation | Collaboration | Zhuoshi Pan¹, Qianhui Wu², Huiqiang Jiang², Xufang Luo², Hao Cheng², Dongsheng Li², Yuqing Yang², Chin-Yew Lin², H. Vicky Zhao¹, Lili Qiu², Jianfeng Gao² — ¹Tsinghua University, ²Microsoft Corporation |
| Pseudocode | No | The paper describes methods in paragraph text and provides prompts as figures (e.g., Figure 6, Figure 7, Figure 8), but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | A footnote states: "Project page: https://aka.ms/secom". This link points to a project page which is a high-level overview, not a specific code repository, and the paper does not contain an explicit statement that the source code for the methodology described in the paper is released or available at a specific repository. |
| Open Datasets | Yes | Datasets & Evaluation Metrics: "We evaluate SeCom and other baseline methods for long-term conversations on the following benchmarks: (i) LOCOMO (Maharana et al., 2024), which is the longest conversation dataset to date, with an average of 300 turns with 9K tokens per sample. (ii) Long-MT-Bench+, which is reconstructed from MT-Bench+ (Lu et al., 2023)." Evaluation of Conversation Segmentation Model: "To evaluate the conversation segmentation module described in Section 2.2 independently, we use three widely used dialogue segmentation datasets: DialSeg711 (Xu et al., 2021), TIAGE (Xie et al., 2021), and SuperDialSeg (Jiang et al., 2023d). To further validate SeCom's robustness and versatility across a broader range of dialogue types, we conduct additional experiments on other benchmarks, Persona-Chat (Zhang et al., 2018) and CoQA (Reddy et al., 2019)." |
| Dataset Splits | No | The paper states: "For the test set, we prompt GPT-4 to generate QA pairs for each session as in Alonso et al. (2024). We also conduct evaluation on the recently released official QA pairs in Appendix A.5. For (2), following Yuan et al. (2023), we merge five consecutive sessions into one, forming longer dialogues. In addition to the unsupervised (zero-shot) setting, we also assess performance in a transfer learning setting, where baseline models are trained on the full training set of the source dataset, while our model learns the segmentation rubric through LLM reflection on the top 100 most challenging examples." The paper mentions a 'test set' and 'full training set' for various datasets but does not provide specific percentages, sample counts, or explicit references to predefined splits needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory amounts used for running its experiments. It mentions the LLMs and models used (e.g., GPT-35-Turbo, Mistral-7B, LLMLingua-2, MPNet) but not the underlying hardware. |
| Software Dependencies | Yes | Implementation Details: "We use GPT-35-Turbo for response generation in our main experiment. We also adopt Mistral-7B-Instruct-v0.3 (Jiang et al., 2023a) for robustness evaluation across different LLMs. ... We use LLMLingua-2 (Pan et al., 2024) with a compression rate of 75% and xlm-roberta-large (Conneau et al., 2020) as the base model to denoise memory units. ... We employ zero-shot segmentation for QA benchmarks. ... We use GPT-4-0125 as the backbone LLM for segmentation." |
| Experiment Setup | Yes | Implementation Details: "We use GPT-35-Turbo for response generation in our main experiment. ... We use LLMLingua-2 (Pan et al., 2024) with a compression rate of 75% and xlm-roberta-large (Conneau et al., 2020) as the base model to denoise memory units. Following Alonso et al. (2024), we apply MPNet (multi-qa-mpnet-base-dot-v1) (Song et al., 2020) with FAISS (Johnson et al., 2019) and BM25 (Amati, 2009) for memory retrieval. ... The context budget for memory retrieval is set to 4k tokens (≈ 5 sessions, ≈ 10 segments, or ≈ 55 turns) on LOCOMO and 1k tokens (≈ 1 segment, ≈ 3 turns) on Long-MT-Bench+." |
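The retrieval setup quoted above (dense MPNet embeddings searched with FAISS, with retrieved segments capped by a token budget) can be sketched minimally. This is not the paper's implementation: real MPNet embeddings and a FAISS index are replaced here by plain NumPy vectors and brute-force cosine similarity, and the function name `retrieve_segments` is our own.

```python
import numpy as np

def retrieve_segments(query_vec, segment_vecs, segment_tokens, budget):
    """Rank memory segments by cosine similarity to the query, then keep
    the highest-scoring ones whose combined length fits the token budget.

    Stand-in for MPNet + FAISS dense retrieval: embeddings are arbitrary
    NumPy vectors and search is brute force rather than an ANN index.
    """
    q = query_vec / np.linalg.norm(query_vec)
    s = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    scores = s @ q  # cosine similarity of each segment to the query

    selected, used = [], 0
    for idx in np.argsort(-scores):  # best-scoring segments first
        if used + segment_tokens[idx] <= budget:
            selected.append(int(idx))
            used += segment_tokens[idx]
    return selected

# Toy example: 3 segments with made-up lengths, 1k-token budget
# (the Long-MT-Bench+ setting in the paper).
rng = np.random.default_rng(0)
segs = rng.normal(size=(3, 8))
query = segs[1] + 0.01 * rng.normal(size=8)  # query nearly identical to segment 1
picked = retrieve_segments(query, segs, [600, 500, 450], budget=1000)
print(picked)
```

The budget-aware greedy loop mirrors the paper's context cap (4k tokens on LOCOMO, 1k on Long-MT-Bench+): once the top-ranked segments fill the budget, lower-ranked ones are skipped rather than truncated.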