CROSSNEWS: A Cross-Genre Authorship Verification and Attribution Benchmark
Authors: Marcus Ma, Duong Minh Le, Junmo Kang, Yao Dou, John Cadigan, Dayne Freitag, Alan Ritter, Wei Xu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate a wide range of authorship models on both tasks using CROSSNEWS. Our experiments show that, while prior work finds statistical learning models, such as the N-gram method (Koppel and Schler 2004), achieve state-of-the-art performance in single-genre settings (Tyo, Dhingra, and Lipton 2022), these models generalize poorly in the CROSSNEWS cross-genre settings. |
| Researcher Affiliation | Collaboration | 1Georgia Institute of Technology; 2SRI International |
| Pseudocode | No | The paper describes the methodology in narrative text within sections like "Non-Transformer Methods," "Embedding Methods," and "Zero-shot LLM Methods," but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Publicly accessible code to support future research: https://github.com/mamarcus64/CrossNews |
| Open Datasets | No | The paper introduces CROSSNEWS as a novel dataset and states its creation as a main contribution, but it does not provide a direct URL, DOI, or specific repository name for the dataset itself within the paper text. While a GitHub link for code is provided, direct access details for the dataset are not explicitly stated. |
| Dataset Splits | Yes | To prepare data for the task, we sample document pairs from CROSSNEWS to form positive (if two documents have the same author) and negative (if two documents have different authors) examples... We create three different train and test sets based on the genres of the two documents... For each genre pair type, we sample 100,000 document pairs from CROSSNEWS silver set... and 20,000 document pairs from CROSSNEWS gold set... The document pairs from the silver set are then used to form train and validation sets, with the ratio of 8:2, while the pairs from the gold set are used to construct the test set. For this task, we use only the gold set from CROSSNEWS... Our attribution setup contains all 500 gold authors with 30 known documents and 15 unknown documents per author. |
| Hardware Specification | Yes | For all experiments in this paper, models are run on a single NVIDIA A40 GPU, except for LLaMA-3-70B for prompting, which is run on six A40s, with a total compute time of approximately 400 hours to train and evaluate all verification and attribution models sequentially. |
| Software Dependencies | No | The paper mentions specific models like RoBERTa and e5-mistral-7b-instruct as components of their methods but does not provide specific version numbers for software libraries or frameworks (e.g., Python, PyTorch, TensorFlow) used for implementation. |
| Experiment Setup | Yes | To balance between data efficiency and diversity... we ensure each document is present in exactly one negative and one positive pair, and the numbers of positive and negative pairs are equal... For evaluation, we report accuracy and F1... we concatenate tweets to a minimum length of 500 characters. For all experiments in this paper... Experiment results are averaged over five runs with different random seeds. For verification, embedding methods classify pairs based on a specific threshold of the cosine similarity between the two documents, where the similarity threshold is chosen as the value that correctly classifies the most pairs on the validation set. |
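The pairing constraint quoted under Dataset Splits (each document appears in exactly one positive and one negative pair, with balanced classes) can be sketched as follows. This is an illustrative assumption, not the paper's released code: `sample_pairs` and its greedy negative-pairing strategy are hypothetical names and choices, and the greedy pass only approximates the exactly-once rule at the boundaries.

```python
import random

def sample_pairs(author_docs, seed=0):
    """Sketch of balanced pair sampling from an author -> documents mapping.

    Positive pairs: disjoint same-author pairs. Negative pairs: a greedy
    pass over a shuffled pool, pairing adjacent documents whose authors
    differ. Returns (doc_a, doc_b, label) triples with label 1/0.
    """
    rng = random.Random(seed)
    positives, negatives = [], []
    # Positive pairs: shuffle each author's documents and pair them up,
    # so every document lands in at most one positive pair.
    for docs in author_docs.values():
        docs = list(docs)
        rng.shuffle(docs)
        for i in range(0, len(docs) - 1, 2):
            positives.append((docs[i], docs[i + 1], 1))
    # Negative pairs: shuffle all (document, author) entries, then walk
    # the pool pairing adjacent entries from different authors.
    pool = [(doc, author) for author, docs in author_docs.items() for doc in docs]
    rng.shuffle(pool)
    i = 0
    while i < len(pool) - 1:
        (d1, a1), (d2, a2) = pool[i], pool[i + 1]
        if a1 != a2:
            negatives.append((d1, d2, 0))
            i += 2
        else:
            i += 1  # same author: skip one entry and retry
    return positives, negatives
```

A production version would additionally retry or re-shuffle until the positive and negative counts match exactly, as the quoted setup requires.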
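The threshold-selection rule quoted under Experiment Setup (pick the cosine-similarity cutoff that classifies the most validation pairs correctly) amounts to a one-dimensional accuracy sweep. A minimal sketch, assuming precomputed pairwise similarities; `best_threshold` is an illustrative name, not from the paper's code:

```python
def best_threshold(similarities, labels):
    """Return the similarity cutoff maximizing validation accuracy.

    Pairs with similarity >= threshold are predicted same-author (1).
    Candidates are midpoints between consecutive sorted similarities,
    plus sentinels just below the minimum and above the maximum.
    """
    sims = sorted(similarities)
    candidates = ([sims[0] - 1e-6]
                  + [(a + b) / 2 for a, b in zip(sims, sims[1:])]
                  + [sims[-1] + 1e-6])

    def accuracy(t):
        # Fraction of validation pairs whose prediction matches the label.
        return sum((s >= t) == bool(y)
                   for s, y in zip(similarities, labels)) / len(labels)

    return max(candidates, key=accuracy)
```

Only midpoints between observed similarities need to be checked, because accuracy is piecewise constant between them; any tie goes to the first maximizing candidate.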