Text-Based Twitter User Geolocation Prediction
Authors: B. Han, P. Cook, T. Baldwin
JAIR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an integrated geolocation prediction framework and investigate what factors impact on prediction accuracy. First, we evaluate a range of feature selection methods to obtain location indicative words. We then evaluate the impact of nongeotagged tweets, language, and user-declared metadata on geolocation prediction. In addition, we evaluate the impact of temporal variance on model generalisation, and discuss how users differ in terms of their geolocatability. We achieve state-of-the-art results for the text-based Twitter user geolocation task, and also provide the most extensive exploration of the task to date. |
| Researcher Affiliation | Academia | Bo Han (EMAIL), The University of Melbourne, VIC 3010, Australia; NICTA Victoria Research Laboratory. Paul Cook (EMAIL), The University of Melbourne, VIC 3010, Australia. Timothy Baldwin (EMAIL), The University of Melbourne, VIC 3010, Australia; NICTA Victoria Research Laboratory. |
| Pseudocode | No | The paper describes methods in prose and through mathematical equations and statistical tests, but no explicit 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | No | The paper mentions providing a 'list of LIWs publicly available' at a URL, which refers to data rather than executable source code for the methodology. Other links are to third-party tools or baseline implementations, not the authors' own method's code. |
| Open Datasets | Yes | 1. A regional North American geolocation dataset from Roller et al. (2012) (NA hereafter), for benchmarking purposes. NA contains 500K users (38M tweets) from a total of 378 of our pre-defined cities. NA is used as-is to ensure comparability with previous work in Section 5. [...] We hence use a city-region-country format to represent each city (e.g., Toronto, CA is represented as toronto-08-ca, where 08 signifies the province of Ontario and ca signifies Canada). Country code information can be found at http://download.geonames.org/export/dump/countryInfo.txt [...] We use the publicly-available Geonames dataset as the basis for our city-level classes (http://www.geonames.org, accessed on October 25th, 2012). |
| Dataset Splits | Yes | Similar to NA, for WORLD we reserve 10K random users for each of dev and test, and the remainder of the users are used for training (preprocessed as described in Section 3.4). [...] The development and test data was sampled such that each user has at least 10 geotagged tweets to alleviate data sparsity. [...] We use the same partitioning of users into training, development, and testing sets for WORLD+NG as for WORLD. [...] We carry out 10-fold cross validation on the training users to obtain the L1 (final) classifier results, a standard procedure for stacking experiments. We use stratified sampling when partitioning the data because the number of users in different cities varies remarkably, and a simple random sample could have a bias towards bigger cities. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run the experiments. It mentions 'slower, more memory-intensive models' but no CPU or GPU models, or cloud resources with specifications. |
| Software Dependencies | No | The paper mentions using 'langid.py, an open-source language identification tool (Lui & Baldwin, 2012)', the 'Japanese morphological segmenter MeCab (with the IPA dictionary)', and the 'toolkit of Zhang Le' for logistic regression, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The tuning of n for all methods is based on Acc@161 over the 10K held-out users in the development data. [...] For our experiments, we adopt the optimised implementation of Laere et al. using λ = 100km with 5K samples. [...] We choose r = 0.1 in our experiments, based on the findings of Chang et al.. [...] For the logistic regression modeller, we use the toolkit of Zhang Le (https://github.com/lzhang10/maxent), with 30 iterations of L-BFGS (Nocedal, 1980) over the training data. [...] In preliminary experiments, we considered bag-of-words features for the metadata fields, as well as bag-of-character n-gram features for n ∈ {1, ..., 4}. We found character 4-grams to perform best, and report results using these features here. |
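The stratified partitioning described under "Dataset Splits" (proportional sampling within each city, so that a random draw does not skew toward large cities) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `stratified_split` and the per-city proportional allocation are assumptions about how such a split would typically be implemented.

```python
import random
from collections import defaultdict

def stratified_split(user_cities, dev_frac=0.1, test_frac=0.1, seed=0):
    """Partition users into train/dev/test sets, sampling proportionally
    within each city so that populous cities do not dominate the held-out
    data (an illustrative sketch of stratified sampling).

    user_cities: dict mapping user id -> city label
    """
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for user, city in user_cities.items():
        by_city[city].append(user)

    train, dev, test = [], [], []
    for city, users in sorted(by_city.items()):
        users.sort()          # deterministic order before shuffling
        rng.shuffle(users)
        n_dev = round(len(users) * dev_frac)
        n_test = round(len(users) * test_frac)
        dev.extend(users[:n_dev])
        test.extend(users[n_dev:n_dev + n_test])
        train.extend(users[n_dev + n_test:])
    return train, dev, test
```

In practice a library routine (e.g. a stratified splitter from a machine-learning toolkit) would serve the same purpose; the point is that the sampling fraction is applied per city rather than over the pooled user list.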
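The bag-of-character n-gram features for metadata fields (with n = 4 reported as best) amount to sliding a window over the lowercased field text. A minimal sketch, assuming a hypothetical helper `char_ngrams` rather than anything from the paper's toolchain:

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of a string, lowercased.

    For a bag-of-features representation, the returned list would then
    be counted (or binarised) per user; this sketch shows only the
    n-gram extraction step.
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, the location field "Toronto" yields the 4-grams `toro`, `oron`, `ront`, `onto`, which a logistic regression model can weight as (weak) location-indicative evidence.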