Enhancing Portuguese Variety Identification with Cross-Domain Approaches
Authors: Hugo Sousa, Rúben Almeida, Purificação Silvano, Inês Cantante, Ricardo Campos, Alípio Jorge
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We developed a cross-domain language variety identification (LVI) classifier to discriminate between European and Brazilian Portuguese. Motivated by the findings of our literature review, we compiled the PtBrVId corpus, a cross-domain LVI dataset, and studied the effectiveness of transformer-based LVI classifiers in cross-domain scenarios. The results are presented in Section 6, followed by a discussion of future research directions in Section 7. Section 6 (Results, Impact of Delexicalization): Figure 1 depicts the average F1 scores obtained on the PtBrVId validation set by the N-gram and BERT models for each (P_POS, P_NER) percentage pair. |
| Researcher Affiliation | Collaboration | Hugo Sousa1,3*, Rúben Almeida2,3,4*, Purificação Silvano5,6, Inês Cantante5,6, Ricardo Campos3,7,8, Alípio Jorge1,3 1Faculty of Sciences, University of Porto, Porto, Portugal 2Faculty of Engineering, University of Porto, Porto, Portugal 3INESC TEC, Porto, Portugal 4Innovation Point dst group, Braga, Portugal 5Faculty of Arts and Humanities, University of Porto, Porto, Portugal 6Centre of Linguistics, University of Porto, Porto, Portugal 7Department of Informatics, University of Beira Interior, Covilhã, Portugal 8Ci2 Smart Cities Research Center, Tomar, Portugal EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology in narrative form within sections such as '4 Experimental Setup' and '5 Implementation Details', without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source the code, corpus, and models to foster further research in this task. We release the first open-source Portuguese LVI model, providing a valuable resource for future research and practical applications. We have made our codebase open-source to promote reproducibility of our results and to encourage further research in this area. (https://github.com/LIAAD/portuguese vid) |
| Open Datasets | Yes | We open-source the code, corpus, and models to foster further research in this task. We introduce a novel cross-domain, silver-labeled LVI corpus for Brazilian and European Portuguese, compiled from open-license datasets. The complete dataset is publicly accessible on Hugging Face (https://huggingface.co/datasets/liaad/PtBrVId). The paper also refers to the DSL-TL corpus (Zampieri et al. 2024) and the FRMT dataset (Riley et al. 2023). |
| Dataset Splits | Yes | However, before using it for training, we leave out 1,000 documents from each domain for model validation, 500 per label. In step one of our training protocol, we use 8,000 documents from each domain (4,000 per label) to train the models. For our purposes, we exclude documents labeled Both, since our training corpus does not contain that label. This results in a test set comprising 588 documents for BP and 269 for EP. The FRMT dataset (Riley et al. 2023) was manually annotated to evaluate variety-specific translation systems and includes translations in both EP and BP. We adapt this corpus for the VID task, resulting in a dataset containing 5,226 documents, with 2,614 labeled as EP and 2,612 as BP. |
| Hardware Specification | Yes | Regarding computational resources, this study relied on Google Cloud N1 Compute Engines to perform the tuning and training of both the baseline and the BERT architecture. For the baseline, an N1 instance with 192 CPU cores and 1024 GB of RAM was used. For BERT, we used an instance with 16 CPU cores, 30 GB of RAM, and 4x Tesla T4 GPUs. |
| Software Dependencies | No | NER and POS tags were identified using spaCy. The BERT model was trained with the transformers and pytorch libraries, for a maximum of 30 epochs... N-gram models were trained using the scikit-learn library. The paper names these software components but does not provide specific version numbers for them. |
| Experiment Setup | Yes | The BERT model was trained with the transformers and pytorch libraries, for a maximum of 30 epochs, using early stopping with a patience of three epochs, binary cross-entropy loss, and the AdamW optimizer. The learning rate was set to 2×10⁻⁵. In addition, a learning rate scheduler was used to reduce the learning rate by a factor of 0.1 if the training loss did not improve for two consecutive epochs. TF-IDF Max Features: the maximum number of TF-IDF features was tested with the following values: 100, 500, 1,000, 5,000, 10,000, 50,000, and 100,000. TF-IDF N-Grams Range: the range of n-grams used in the TF-IDF was explored with the following configurations: (1,1), (1,2), (1,3), (1,4), (1,5), and (1,10). TF-IDF Lower Case: the effect of case sensitivity was tested, with lowercasing of text being either True or False. TF-IDF Analyzer: the analyzer applied in the TF-IDF process was either Word or Char. |
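The TF-IDF hyperparameter grid quoted above can be expressed directly as a scikit-learn search. The grid values (max features, n-gram ranges, lowercasing, analyzer) come from the paper's excerpt; the downstream classifier, variable names, and cross-validation settings are assumptions for illustration, since the excerpt does not specify them. A minimal sketch:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid as reported in the paper's experiment setup:
# 7 max-feature values x 6 n-gram ranges x 2 casing options x 2 analyzers
# = 168 candidate configurations.
param_grid = {
    "tfidf__max_features": [100, 500, 1_000, 5_000, 10_000, 50_000, 100_000],
    "tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 10)],
    "tfidf__lowercase": [True, False],
    "tfidf__analyzer": ["word", "char"],
}

# The classifier on top of the TF-IDF features is a hypothetical choice;
# the excerpt only says "N-gram models were trained using scikit-learn".
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cv and scoring are assumptions; the paper tunes on the PtBrVId validation set.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=3)
# search.fit(train_texts, train_labels)  # binary labels, e.g. 0 = EP, 1 = BP
```

Running `search.fit` exhaustively evaluates all 168 configurations; the paper's single held-out validation set could be reproduced with a `PredefinedSplit` instead of k-fold cross-validation.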