Enhancing Multimodal Affective Analysis with Learned Live Comment Features
Authors: Zhaoyuan Deng, Amith Ananthram, Kathleen McKeown
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through comprehensive experimentation on a wide range of affective analysis tasks (sentiment, emotion recognition, and sarcasm detection) in both English and Chinese, we demonstrate that these synthetic live comment features significantly improve performance over state-of-the-art methods. |
| Researcher Affiliation | Academia | Zhaoyuan Deng, Amith Ananthram, Kathleen McKeown, Columbia University (EMAIL, EMAIL, EMAIL) |
| Pseudocode | No | The paper describes its methods through prose and mathematical equations but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, Dataset, and Appendix: https://github.com/dzy49/AffectiveLiveComm |
| Open Datasets | Yes | Code, Dataset, and Appendix: https://github.com/dzy49/AffectiveLiveComm |
| Dataset Splits | Yes | We sample 10% of our data for validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and pre-trained models like Chinese-RoBERTa-wwm-ext, XLM-RoBERTa-base, Data2Vec-audio-base, HuBERT-base, and TimeSformer-base, but it does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For pre-training, we use a segment length σ of 8 seconds, balancing between required context and the length of downstream datasets. We sample 8 frames uniformly from each segment. To reduce noise in the dataset, we employ multiple filtering strategies. First, we exclude comments lacking substantial content, specifically those shorter than 2 characters or those without Chinese characters. Second, we compile a list of low-signal terms and exclude comments containing these words. For user-generated videos, we trim the first and last 15 seconds as they tend to include repetitive comments such as greetings and farewells. For movies, we trim the first and last 5 minutes, and for TV shows, we trim the start and end of each show. Segments containing fewer than 5 live comments are excluded from pre-training to allow efficient GPU batching. For each epoch, we randomly select 5 live comments per segment so that a batch with N samples has K = 5N comments. We sample 10% of our data for validation. |
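The filtering and per-epoch sampling rules quoted in the Experiment Setup row can be sketched as follows. This is a hypothetical reconstruction for illustration only: the function names, the placeholder low-signal term list, and the segment representation are all assumptions, not the authors' released code.

```python
import random
import re

# Placeholder list; the paper says the authors compile their own
# list of low-signal terms but does not publish it in the text.
LOW_SIGNAL_TERMS = {"233", "hhh"}

# Matches CJK Unified Ideographs (the basic Chinese character block).
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")

def keep_comment(text: str) -> bool:
    """Apply the stated filters: at least 2 characters, contains
    Chinese characters, and contains no low-signal terms."""
    if len(text) < 2:
        return False
    if not CHINESE_CHAR.search(text):
        return False
    return not any(term in text for term in LOW_SIGNAL_TERMS)

def sample_epoch_comments(segments, k=5, rng=None):
    """Drop segments with fewer than k usable comments (excluded from
    pre-training for efficient GPU batching), then randomly select k
    comments per remaining segment, so a batch of N samples carries
    K = k * N comments."""
    rng = rng or random.Random(0)
    batches = []
    for comments in segments:
        usable = [c for c in comments if keep_comment(c)]
        if len(usable) < k:
            continue
        batches.append(rng.sample(usable, k))
    return batches
```

Trimming (15 seconds for user-generated videos, 5 minutes for movies) would happen upstream when cutting videos into 8-second segments, before these comment-level filters are applied.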