Improving Unsupervised Constituency Parsing via Maximizing Semantic Information
Authors: Junjie Chen, Xiangheng He, Yusuke Miyao, Danushka Bollegala
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that Sem Info correlates more strongly with parsing accuracy than LL, establishing Sem Info as a better unsupervised parsing objective. As a result, our algorithm significantly improves parsing accuracy by an average of 7.85 sentence-F1 scores across five PCFG variants and in four languages, achieving state-of-the-art level results in three of the four languages. |
| Researcher Affiliation | Academia | Department of Computer Science, the University of Tokyo¹; GLAM Group on Language, Audio, & Music, Imperial College London²; Department of Computer Science, the University of Liverpool³ (email addresses redacted) |
| Pseudocode | Yes | Algorithm 1 Tree CRF Sampler: 1: function CRF-Sampler(i, j, x) 2: if j = i + 1 then 3: return leaf node (i, j) 4: else 5: sample split index k ~ π_CRF(k \| (i, j)) following Equation 14 (Johnson et al., 2007) 6: T_left ← CRF-Sampler(i, k, x) 7: T_right ← CRF-Sampler(k, j, x) 8: return node (i, j) with children T_left and T_right 9: end if 10: end function |
| Open Source Code | Yes | We release the source code at https://github.com/junjiechen-chris/Improving-Unsupervised-Constituency-Parsing-via-Maximizing-Semantic-Information.git. |
| Open Datasets | Yes | We conduct the evaluations in three datasets and four languages, namely Penn Tree Bank (PTB) (Marcus et al., 1999) for English, Chinese Treebank 5.1 (CTB) (Palmer et al., 2005) for Chinese, and SPMRL (Seddah et al., 2013) for German and French. |
| Dataset Splits | Yes | We adopt the standard data split for the PTB dataset (Sections 02-21 for training, Section 22 for validation, and Section 23 for testing) (Kim et al., 2019a). We adopt the official data split for the CTB and SPMRL datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using specific models like "gpt-4o-mini-2024-07-18 model" and tools like "snowball stemmer (Bird & Loper, 2004)", and that its "implementation is based on the source code of Yang et al. (2021b) and Liu et al. (2023)". However, it does not provide specific version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or operating systems. |
| Experiment Setup | Yes | We use 60 NTs for NPCFG and CPCFG, and 1024 NTs for TNPCFG, SNPCFG, and SCPCFG in our experiment. We include the maximum entropy regularization (Ziebart et al., 2008) and the traditional LL term log Z(x) in the training. The posterior optimization is similar to the method explained in the main text: (1) sampling a tree from either P_CRF(t\|x) or P_PCFG(t\|x); and (2) performing policy gradient optimization in accordance with Equation 12. |
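The Tree CRF Sampler pseudocode above can be sketched in Python. This is a minimal illustration, not the paper's implementation: it assumes the split distributions π_CRF(k | (i, j)) from Equation 14 have already been precomputed into a lookup table (here a hypothetical `split_probs` dict), and it represents trees as nested tuples of spans.

```python
import random

def crf_sampler(i, j, split_probs, rng=random):
    """Recursively sample a binary tree over the span (i, j).

    split_probs: hypothetical precomputed table mapping a span (i, j)
    to a dict {k: p} giving the CRF split distribution pi_CRF(k | (i, j)).
    Leaves are width-1 spans (i, j); internal nodes are tuples
    ((i, j), left_subtree, right_subtree).
    """
    if j == i + 1:
        # Base case: a width-1 span is a leaf node (lines 2-3 of Algorithm 1).
        return (i, j)
    # Sample a split index k ~ pi_CRF(k | (i, j)) (line 5).
    dist = split_probs[(i, j)]
    ks, ps = zip(*dist.items())
    k = rng.choices(ks, weights=ps)[0]
    # Recurse on the two child spans (lines 6-7).
    left = crf_sampler(i, k, split_probs, rng)
    right = crf_sampler(k, j, split_probs, rng)
    return ((i, j), left, right)
```

For example, for a 3-word sentence with deterministic splits `{(0, 3): {1: 1.0}, (1, 3): {2: 1.0}}`, `crf_sampler(0, 3, ...)` returns `((0, 3), (0, 1), ((1, 3), (1, 2), (2, 3)))`, i.e. a right-branching tree over spans.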