SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents
Authors: Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SeCom exhibits a significant performance advantage over baselines on the long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg. |
| Researcher Affiliation | Collaboration | Zhuoshi Pan¹, Qianhui Wu², Huiqiang Jiang², Xufang Luo², Hao Cheng², Dongsheng Li², Yuqing Yang², Chin-Yew Lin², H. Vicky Zhao¹, Lili Qiu², Jianfeng Gao² — ¹Tsinghua University, ²Microsoft Corporation |
| Pseudocode | No | The paper describes methods in paragraph text and provides prompts as figures (e.g., Figure 6, Figure 7, Figure 8), but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | A footnote states: "Project page: https://aka.ms/secom". This link points to a project page which is a high-level overview, not a specific code repository, and the paper does not contain an explicit statement that the source code for the methodology described in the paper is released or available at a specific repository. |
| Open Datasets | Yes | Datasets & Evaluation Metrics: "We evaluate SeCom and other baseline methods for long-term conversations on the following benchmarks: (i) LOCOMO (Maharana et al., 2024), which is the longest conversation dataset to date, with an average of 300 turns with 9K tokens per sample. (ii) Long-MT-Bench+, which is reconstructed from MT-Bench+ (Lu et al., 2023)." Evaluation of Conversation Segmentation Model: "To evaluate the conversation segmentation module described in Section 2.2 independently, we use three widely used dialogue segmentation datasets: DialSeg711 (Xu et al., 2021), TIAGE (Xie et al., 2021), and SuperDialSeg (Jiang et al., 2023d). To further validate SeCom's robustness and versatility across a broader range of dialogue types, we conduct additional experiments on other benchmarks, Persona-Chat (Zhang et al., 2018) and CoQA (Reddy et al., 2019)." |
| Dataset Splits | No | The paper states: "For the test set, we prompt GPT-4 to generate QA pairs for each session as in Alonso et al. (2024). We also conduct evaluation on the recently released official QA pairs in Appendix A.5. For (2), following Yuan et al. (2023), we merge five consecutive sessions into one, forming longer dialogues. In addition to the unsupervised (zero-shot) setting, we also assess performance in a transfer learning setting, where baseline models are trained on the full training set of the source dataset, while our model learns the segmentation rubric through LLM reflection on the top 100 most challenging examples." The paper mentions a 'test set' and 'full training set' for various datasets but does not provide specific percentages, sample counts, or explicit references to predefined splits needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory amounts used for running its experiments. It mentions the LLMs and models used (e.g., GPT-35-Turbo, Mistral-7B, LLMLingua-2, MPNet) but not the underlying hardware. |
| Software Dependencies | Yes | Implementation Details: "We use GPT-35-Turbo for response generation in our main experiment. We also adopt Mistral-7B-Instruct-v0.3 (Jiang et al., 2023a) for robustness evaluation across different LLMs. ... We use LLMLingua-2 (Pan et al., 2024) with a compression rate of 75% and xlm-roberta-large (Conneau et al., 2020) as the base model to denoise memory units. ... We employ zero-shot segmentation for QA benchmarks. ... We use GPT-4-0125 as the backbone LLM for segmentation." |
| Experiment Setup | Yes | Implementation Details: "We use GPT-35-Turbo for response generation in our main experiment. ... We use LLMLingua-2 (Pan et al., 2024) with a compression rate of 75% and xlm-roberta-large (Conneau et al., 2020) as the base model to denoise memory units. Following Alonso et al. (2024), we apply MPNet (multi-qa-mpnet-base-dot-v1) (Song et al., 2020) with FAISS (Johnson et al., 2019) and BM25 (Amati, 2009) for memory retrieval. ... The context budget for memory retrieval is set to 4k tokens (≈ 5 sessions, ≈ 10 segments, or ≈ 55 turns) on LOCOMO and 1k tokens (≈ 1 segment, ≈ 3 turns) on Long-MT-Bench+." |
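The retrieval setup quoted above (dense MPNet embeddings searched with FAISS, with retrieved segments capped by a token budget) can be sketched minimally. This is not the paper's implementation: real MPNet embeddings and a FAISS index are replaced here by plain NumPy vectors and brute-force cosine similarity, and the function name `retrieve_segments` is our own.

```python
import numpy as np

def retrieve_segments(query_vec, segment_vecs, segment_tokens, budget):
    """Rank memory segments by cosine similarity to the query, then keep
    the highest-scoring ones whose combined length fits the token budget.

    Stand-in for MPNet + FAISS dense retrieval: embeddings are arbitrary
    NumPy vectors and search is brute force rather than an ANN index.
    """
    q = query_vec / np.linalg.norm(query_vec)
    s = segment_vecs / np.linalg.norm(segment_vecs, axis=1, keepdims=True)
    scores = s @ q  # cosine similarity of each segment to the query

    selected, used = [], 0
    for idx in np.argsort(-scores):  # best-scoring segments first
        if used + segment_tokens[idx] <= budget:
            selected.append(int(idx))
            used += segment_tokens[idx]
    return selected

# Toy example: 3 segments with made-up lengths, 1k-token budget
# (the Long-MT-Bench+ setting in the paper).
rng = np.random.default_rng(0)
segs = rng.normal(size=(3, 8))
query = segs[1] + 0.01 * rng.normal(size=8)  # query nearly identical to segment 1
picked = retrieve_segments(query, segs, [600, 500, 450], budget=1000)
print(picked)
```

The budget-aware greedy loop mirrors the paper's context cap (4k tokens on LOCOMO, 1k on Long-MT-Bench+): once the top-ranked segments fill the budget, lower-ranked ones are skipped rather than truncated.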