Text-Based Twitter User Geolocation Prediction
Authors: B. Han, P. Cook, T. Baldwin
JAIR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present an integrated geolocation prediction framework and investigate what factors impact on prediction accuracy. First, we evaluate a range of feature selection methods to obtain location indicative words. We then evaluate the impact of nongeotagged tweets, language, and user-declared metadata on geolocation prediction. In addition, we evaluate the impact of temporal variance on model generalisation, and discuss how users differ in terms of their geolocatability. We achieve state-of-the-art results for the text-based Twitter user geolocation task, and also provide the most extensive exploration of the task to date. |
| Researcher Affiliation | Academia | Bo Han (EMAIL), The University of Melbourne, VIC 3010, Australia; NICTA Victoria Research Laboratory. Paul Cook (EMAIL), The University of Melbourne, VIC 3010, Australia. Timothy Baldwin (EMAIL), The University of Melbourne, VIC 3010, Australia; NICTA Victoria Research Laboratory. |
| Pseudocode | No | The paper describes methods in prose and through mathematical equations and statistical tests, but no explicit 'Pseudocode' or 'Algorithm' blocks are present. |
| Open Source Code | No | The paper mentions providing a 'list of LIWs publicly available' at a URL, which refers to data rather than executable source code for the methodology. Other links are to third-party tools or baseline implementations, not the authors' own method's code. |
| Open Datasets | Yes | 1. A regional North American geolocation dataset from Roller et al. (2012) (NA hereafter), for benchmarking purposes. NA contains 500K users (38M tweets) from a total of 378 of our pre-defined cities. NA is used as-is to ensure comparability with previous work in Section 5. [...] We hence use a city-region-country format to represent each city (e.g., Toronto, CA is represented as toronto-08-ca, where 08 signifies the province of Ontario and ca signifies Canada). Country code information can be found at http://download.geonames.org/export/dump/countryInfo.txt [...] We use the publicly-available Geonames dataset as the basis for our city-level classes (http://www.geonames.org, accessed on October 25th, 2012). |
| Dataset Splits | Yes | Similar to NA, for WORLD we reserve 10K random users for each of dev and test, and the remainder of the users are used for training (preprocessed as described in Section 3.4). [...] The development and test data was sampled such that each user has at least 10 geotagged tweets to alleviate data sparsity. [...] We use the same partitioning of users into training, development, and testing sets for WORLD+NG as for WORLD. [...] We carry out 10-fold cross validation on the training users to obtain the L1 (final) classifier results, a standard procedure for stacking experiments. We use stratified sampling when partitioning the data because the number of users in different cities varies remarkably, and a simple random sample could have a bias towards bigger cities. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run the experiments. It mentions 'slower, more memory-intensive models' but no CPU or GPU models, or cloud resources with specifications. |
| Software Dependencies | No | The paper mentions using 'langid.py, an open-source language identification tool (Lui & Baldwin, 2012)', the 'Japanese morphological segmenter MeCab (with the IPA dictionary)', and the 'toolkit of Zhang Le' for logistic regression, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | The tuning of n for all methods is based on Acc@161 over the 10K held-out users in the development data. [...] For our experiments, we adopt the optimised implementation of Laere et al. using λ = 100km with 5K samples. [...] We choose r = 0.1 in our experiments, based on the findings of Chang et al.. [...] For the logistic regression modeller, we use the toolkit of Zhang Le (https://github.com/lzhang10/maxent), with 30 iterations of L-BFGS (Nocedal, 1980) over the training data. [...] In preliminary experiments, we considered bag-of-words features for the metadata fields, as well as bag-of-character n-gram features for n ∈ {1, ..., 4}. We found character 4-grams to perform best, and report results using these features here. |
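The stratified partitioning described under "Dataset Splits" (proportional sampling within each city, so that a random draw does not skew toward large cities) can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `stratified_split` and the per-city proportional allocation are assumptions about how such a split would typically be implemented.

```python
import random
from collections import defaultdict

def stratified_split(user_cities, dev_frac=0.1, test_frac=0.1, seed=0):
    """Partition users into train/dev/test sets, sampling proportionally
    within each city so that populous cities do not dominate the held-out
    data (an illustrative sketch of stratified sampling).

    user_cities: dict mapping user id -> city label
    """
    rng = random.Random(seed)
    by_city = defaultdict(list)
    for user, city in user_cities.items():
        by_city[city].append(user)

    train, dev, test = [], [], []
    for city, users in sorted(by_city.items()):
        users.sort()          # deterministic order before shuffling
        rng.shuffle(users)
        n_dev = round(len(users) * dev_frac)
        n_test = round(len(users) * test_frac)
        dev.extend(users[:n_dev])
        test.extend(users[n_dev:n_dev + n_test])
        train.extend(users[n_dev + n_test:])
    return train, dev, test
```

In practice a library routine (e.g. a stratified splitter from a machine-learning toolkit) would serve the same purpose; the point is that the sampling fraction is applied per city rather than over the pooled user list.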
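The bag-of-character n-gram features for metadata fields (with n = 4 reported as best) amount to sliding a window over the lowercased field text. A minimal sketch, assuming a hypothetical helper `char_ngrams` rather than anything from the paper's toolchain:

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of a string, lowercased.

    For a bag-of-features representation, the returned list would then
    be counted (or binarised) per user; this sketch shows only the
    n-gram extraction step.
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, the location field "Toronto" yields the 4-grams `toro`, `oron`, `ront`, `onto`, which a logistic regression model can weight as (weak) location-indicative evidence.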