Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Exploring the LLM Journey from Cognition to Expression with Linear Representations
Authors: Yuzi Yan, Jialian Li, Yipin Zhang, Dong Yan
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study includes a comprehensive series of experiments and analyses carried out during the Pretraining, SFT, and RLHF phases of Baichuan-7B and Baichuan-33B. We carry out our quantification experiments using four standard benchmark datasets: OpenBookQA (Mihaylov et al., 2018), CommonsenseQA (Talmor et al., 2018), RACE (Lai et al., 2017), and ARC (Clark et al., 2018). |
| Researcher Affiliation | Collaboration | 1Baichuan AI 2Tsinghua University. |
| Pseudocode | Yes | Algorithm 1 Cognitive capability quantification |
| Open Source Code | No | The paper states: 'Notably, Baichuan-7B is an open-source model, whereas Baichuan-33B is a closed-source model.' This refers to the models used in the study, not the authors' own code for their methodology. No statement is made about releasing the code for the research presented in the paper. |
| Open Datasets | Yes | We carry out our quantification experiments using four standard benchmark datasets: OpenBookQA (Mihaylov et al., 2018), CommonsenseQA (Talmor et al., 2018), RACE (Lai et al., 2017), and ARC (Clark et al., 2018). Table 3 ('Size for each dataset') lists train-set and test-set sizes per dataset. |
| Dataset Splits | Yes | Table 3 ('Size for each dataset') reports train-set and test-set sizes for each dataset. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. It mentions 'computational power worth billions is used daily' for training LLMs generally, but not for their specific experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | The SFT phase involved training over 4 epochs, each with 1M tokens. For RLHF, we implement the Proximal Policy Optimization (PPO) strategy, as elaborated in Achiam et al. (2023). Table 4 (direct token generation hyperparameters): temperature 1.2, top-p 0.9, top-k 50, max tokens 2048, repetition penalty 1.05. |
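To make the reported decoding hyperparameters concrete, here is a minimal, self-contained sketch of how temperature, top-k, top-p (nucleus), and repetition-penalty filtering are typically combined into a sampling distribution. This is an illustrative reconstruction of standard decoding logic, not the authors' implementation; the function name and its interface are assumptions for the example.

```python
import math

def decode_distribution(logits, temperature=1.2, top_p=0.9, top_k=50,
                        repetition_penalty=1.05, generated=()):
    """Turn raw logits into a filtered, renormalized sampling distribution
    using the paper's reported settings (Table 4). Returns {token_id: prob}."""
    logits = list(logits)
    # Repetition penalty: down-weight tokens that were already generated.
    for tok in set(generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the surviving tokens.
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

With a sharply peaked logit vector, nucleus filtering at top-p 0.9 can collapse the distribution to a single dominant token; a flatter vector keeps more candidates, which is the intended effect of sampling at temperature 1.2.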