Cross-domain Constituency Parsing by Leveraging Heterogeneous Data

Authors: Peiming Guo, Meishan Zhang, Yulong Chen, Jianling Li, Min Zhang, Yue Zhang

JAIR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct experiments to verify the effectiveness of the proposed model on a news-domain constituency treebank PTB (Marcus, Santorini, & Marcinkiewicz, 1993) and a multi-domain constituency treebank MCTB (Yang, Cui, Ning, Wu, & Zhang, 2022) consisting of five domains: dialogue, forum, law, literature and review. Experimental results show that both domain knowledge transfer and task knowledge transfer are effective for cross-domain constituency parsing.
Researcher Affiliation Academia Institute of Computing and Intelligence, Harbin Institute of Technology (Shenzhen), Shenzhen, China; School of Engineering, Westlake University, Hangzhou, China; School of New Media and Communication, Tianjin University, Tianjin, China
Pseudocode No The paper describes the model architecture and methods using textual explanations and mathematical equations (e.g., in Sections 3.1, 3.2, and 3.3), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Our code is available at https://github.com/guopeiming/CD_ConsParing_HeterData.
Open Datasets Yes We use PTB (Marcus et al., 1993) and MCTB (Yang et al., 2022) as the source and target constituency parsing datasets, respectively. For domain knowledge transfer, we collect 5 domain raw corpora with sources matching the target treebank in MCTB for the language modeling task, including Wizard (Dinan et al., 2019), Reddit (Völske et al., 2017), ECtHR (Stiansen & Voeten, 2019), Gutenberg, and Amazon (He & McAuley, 2016). For task knowledge transfer, we select CoNLL03 (Tjong Kim Sang & De Meulder, 2003) and restaurant (Liu et al., 2019b) for NER, CCGbank (Hockenmaier & Steedman, 2007) for CCG supertagging and the EWT treebank in Universal Dependencies v2.2 (Nivre et al., 2020) for dependency parsing.
Dataset Splits Yes We sample 10,000 sentences with lengths ranging from 8 to 256 for the corpora of auxiliary tasks. If the number of filtered sentences is less than 10,000, we include the entire dataset. For each batch, we sample examples of constituency parsing and auxiliary tasks by the 1:3 proportion. Specific tasks, domains and number of sentences are listed in Table 1. Additionally, we obtain pseudo constituency parse trees for data processing of auxiliary tasks using the basic constituency parser. Specifically, we sample 10/20/50 examples from MCTB for the few-shot setting. To avoid sample bias, we sample three times to generate different few-shot training sets by different seeds and report the average results.
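The 1:3 mixing proportion between constituency parsing and auxiliary-task examples described above can be sketched as a simple batch sampler. This is a minimal illustration, not the authors' implementation; `sample_mixed_batch` and both pool arguments are hypothetical names.

```python
import random

def sample_mixed_batch(parsing_pool, auxiliary_pool, batch_size=60, ratio=(1, 3)):
    """Draw one mixed batch: constituency-parsing and auxiliary-task
    examples in the stated 1:3 proportion (hypothetical helper)."""
    # Split the batch size according to the ratio, e.g. 60 -> 15 + 45.
    n_parsing = batch_size * ratio[0] // sum(ratio)
    n_auxiliary = batch_size - n_parsing
    batch = (random.sample(parsing_pool, n_parsing)
             + random.sample(auxiliary_pool, n_auxiliary))
    random.shuffle(batch)  # interleave the two example types
    return batch
```

With the paper's batch size of 60, this yields 15 parsing examples and 45 auxiliary-task examples per batch.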
Hardware Specification No The paper mentions using "BERT-large-uncased as pretrained language model backbone" but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies No The paper mentions using "BERT-large-uncased" as a pretrained language model backbone and the "Adam W algorithm" for optimization, but it does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used.
Experiment Setup Yes Hyperparameters. We use BERT-large-uncased as the pretrained language model backbone (Devlin et al., 2019). The lengths l and hidden sizes d of the shared, task and domain prefixes are 25 and 1024, respectively. The weight factor of auxiliary tasks α is 0.1 for multi-task learning. Following Kitaev and Klein (2018), we set partition transformer layers to 2 for all chart-based parsers. For model training, we use the AdamW algorithm with learning rate 3e-5, batch size 60, weight decay 0.01, and linear learning rate warmup over the first 400 steps to optimize parameters. We stop training early when the F1 score does not increase on the PTB development set for 4 epochs.
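The linear warmup over the first 400 steps can be expressed as a pure function of the step index. This is a minimal sketch assuming the rate stays constant after warmup; the reported setup does not state the post-warmup decay schedule, and `warmup_lr` is a hypothetical name.

```python
def warmup_lr(step, base_lr=3e-5, warmup_steps=400):
    """Linear learning-rate warmup: scale the base rate up over the
    first `warmup_steps` optimizer steps, then hold it constant."""
    if step < warmup_steps:
        # Ramp from base_lr/warmup_steps at step 0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

In a framework like PyTorch, the same schedule would typically be attached to the optimizer via a per-step LR scheduler rather than computed by hand.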