Structured Prediction via Output Space Search

Authors: Janardhan Rao Doppa, Alan Fern, Prasad Tadepalli

JMLR 2014

Reproducibility assessment — each entry lists the variable, the result, and the LLM's supporting response:
Research Type: Experimental — "Our experiments on six benchmark domains show that a small amount of search in the limited-discrepancy search space is often sufficient to improve significantly on state-of-the-art structured-prediction performance. We also demonstrate significant speed improvements for our approach using sparse search spaces with little or no loss in accuracy."
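The "limited-discrepancy search space" referenced here is, roughly, the set of outputs reachable from a greedy prediction by a bounded number of label changes. A minimal illustrative sketch follows; the cost function `cost_fn` is a stand-in for the paper's learned cost function, not the authors' implementation:

```python
from itertools import combinations, product

def lds_outputs(greedy_labels, label_set, max_discrepancies):
    """Yield every output within `max_discrepancies` label changes
    of the greedy prediction (a limited-discrepancy neighborhood)."""
    n = len(greedy_labels)
    yield tuple(greedy_labels)  # zero discrepancies: the greedy output itself
    for d in range(1, max_discrepancies + 1):
        for positions in combinations(range(n), d):
            # For each chosen position, every label other than the greedy one.
            alternatives = [
                [lab for lab in label_set if lab != greedy_labels[p]]
                for p in positions
            ]
            for changes in product(*alternatives):
                out = list(greedy_labels)
                for p, lab in zip(positions, changes):
                    out[p] = lab
                yield tuple(out)

def best_in_lds_space(greedy_labels, label_set, cost_fn, max_discrepancies=2):
    # Return the minimum-cost output in the discrepancy neighborhood.
    return min(lds_outputs(greedy_labels, label_set, max_discrepancies),
               key=cost_fn)
```

With a binary label set and a short sequence, even one allowed discrepancy already covers every single-position correction of the greedy output, which is the intuition behind "a small amount of search is often sufficient".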
Researcher Affiliation: Academia — Janardhan Rao Doppa, Alan Fern, Prasad Tadepalli; School of EECS, Oregon State University, Corvallis, OR 97331-5501, USA.
Pseudocode: Yes — Algorithm 1 (Recurrent Classifier Learning via Exact Imitation); Algorithm 2 (Cost Function Learning via Exact Imitation).
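The idea behind Algorithm 1 (exact-imitation training of a recurrent classifier) can be sketched as follows. This is a simplified illustration, not the authors' code: the `features` map and the multiclass perceptron update are hypothetical placeholders for the paper's richer feature functions.

```python
from collections import defaultdict

def features(x, prev_label, t):
    # Hypothetical feature map: the current token and the previous label.
    return [f"tok={x[t]}", f"prev={prev_label}"]

def train_exact_imitation(data, labels, iters=100, lr=1.0):
    """Multiclass perceptron trained on the states generated by
    following the gold trajectory ('exact imitation')."""
    w = defaultdict(float)  # weights keyed by (label, feature)
    for _ in range(iters):
        for x, y in data:
            prev = "<s>"
            for t in range(len(x)):
                feats = features(x, prev, t)
                scores = {lab: sum(w[(lab, f)] for f in feats) for lab in labels}
                pred = max(labels, key=lambda lab: scores[lab])
                if pred != y[t]:
                    for f in feats:
                        w[(y[t], f)] += lr   # promote the gold label
                        w[(pred, f)] -= lr   # demote the mistaken label
                prev = y[t]  # exact imitation: advance with the GOLD label
    return w

def predict(w, x, labels):
    # At test time the classifier must condition on its OWN previous labels.
    prev, out = "<s>", []
    for t in range(len(x)):
        feats = features(x, prev, t)
        prev = max(labels, key=lambda lab: sum(w[(lab, f)] for f in feats))
        out.append(prev)
    return out
```

The defining detail is the last line of the training loop: states are generated by the gold labels rather than the classifier's own predictions, which is what distinguishes exact imitation from DAgger-style training.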
Open Source Code: No — The paper does not provide access to the source code for its methodology. It notes that publicly available code was used (or their own implementation run) for some comparison methods (CRF, SVM-Struct, Cascades), but gives no statement or link for the proposed framework's code.
Open Datasets: Yes —
- Handwriting Recognition (HW): "The input is a sequence of binary-segmented handwritten letters and the output is the corresponding character sequence [a-z]+. This data set contains roughly 6600 examples divided into 10 folds (Taskar et al., 2003)."
- NETtalk Stress: "This is a text-to-speech mapping problem, where the task is to assign one of the 5 stress labels to each letter of a word (Sejnowski and Rosenberg, 1987)."
- Chunking: "The goal in this task is to syntactically chunk English sentences into meaningful segments. We consider the full syntactic chunking task and use the data set from the CONLL 2000 shared task, which consists of 8936 sentences of training data and 2012 sentences of test data." The CONLL task can be found at http://www.cnts.ua.ac.be/conll2000/chunking/.
- POS tagging: "We consider the tagging problem for the English language, where the goal is to assign the part-of-speech tag to each word in a sentence. The standard data from the Wall Street Journal (WSJ) corpus was used in our experiments." The WSJ corpus can be found at http://www.cis.upenn.edu/~treebank/.
- Scene labeling: "This data set contains 700 images of outdoor scenes (Vogel and Schiele, 2007)."
Dataset Splits: Yes — "For the HW-Small version of the problem, we employ one fold for training and the remaining 9 folds for testing, and vice versa in HW-Large. There are 1000 training words and 1000 test words in the standard data set. ... CONLL 2000 shared task, which consists of 8936 sentences of training data and 2012 sentences of test data. Training was performed with 600 images, and the remaining 100 images were used for testing."
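The HW-Small/HW-Large fold protocol (1 fold for training and 9 for testing, then the reverse) can be illustrated with a simple fold splitter. The interleaved fold assignment below is an assumption for illustration, not the authors' exact partition:

```python
def make_fold_split(examples, n_folds=10, train_folds=1):
    """Split examples into train/test by folds.
    train_folds=1 mirrors HW-Small (1 fold train, 9 folds test);
    train_folds=9 mirrors HW-Large (the reverse)."""
    # Assumed round-robin fold assignment; any disjoint partition works.
    folds = [examples[i::n_folds] for i in range(n_folds)]
    train = [ex for fold in folds[:train_folds] for ex in fold]
    test = [ex for fold in folds[train_folds:] for ex in fold]
    return train, test
```

With roughly 6600 handwriting examples, HW-Small trains on about a tenth of the data, which is why it is the harder, lower-resource setting of the two.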
Hardware Specification: No — The paper does not provide specific hardware details (e.g., CPU or GPU models, memory) for the experiments. It discusses computational speedups and efficiency but never mentions the underlying hardware.
Software Dependencies: No — The paper mentions software components such as the Perceptron algorithm, the online Passive-Aggressive (PA) algorithm, SGD, SVMhmm, and Cascades, but specifies no version numbers for these or for other libraries/frameworks used in its implementation. URLs are provided for the SGD and Cascades code, again without version specifics.
Experiment Setup: Yes — "In all our experiments, we train the recurrent classifier using exact imitation (see Section 3) with the Perceptron algorithm for 100 iterations with a learning rate of 1. Unless otherwise indicated, the cost functions learned over input-output pairs are second order, meaning that they have features over neighboring label pairs and triples along with features of the structured input. We trained the cost function via exact imitation as described in Section 4 using 50 iterations of Passive-Aggressive training. Ten percent of the training data was used to tune the hyper-parameters. For SVMhmm ... the value of the parameter C was chosen from {10^-4, 10^-3, ..., 10^3, 10^4} based on the validation set. ... optimized the interpolation parameter β over the validation set. ... the greedy search was run for a number of steps equal to the length of the sequence. ... best-first beam search for different beam widths ... best-first beam search (beam width = 100) ... We picked the best cost function based on a validation set after 5 iterations of DAgger."
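The C-tuning protocol described for SVMhmm (a logarithmic grid from 10^-4 to 10^4, selected on a held-out validation set) can be sketched generically. The `fit` and `evaluate` callbacks below are hypothetical stand-ins, not SVMhmm's actual interface:

```python
def tune_C(train, valid, fit, evaluate):
    """Pick C from {10**-4, ..., 10**4} by validation score,
    mirroring the grid-search protocol described in the paper."""
    grid = [10.0 ** k for k in range(-4, 5)]  # nine values, log-spaced
    best_C, best_score = None, float("-inf")
    for C in grid:
        model = fit(train, C)          # train with this regularization setting
        score = evaluate(model, valid)  # higher is better
        if score > best_score:
            best_C, best_score = C, score
    return best_C, best_score
```

The same pattern covers the paper's other validation-set choices (the interpolation parameter β, beam width, and the best DAgger iteration): train or decode under each candidate setting, then keep the one that scores best on the held-out 10%.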