Unbiased Generative Semi-Supervised Learning
Authors: Patrick Fox-Roberts, Edward Rosten
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now examine the performance of the objective function given in Section 4 on real-world data sets, compared to standard semi-supervised learning, supervised learning, and several other alternative semi-supervised techniques. To maximally highlight the effect of mismatch between the model and the true distribution, a simple marginal distribution consisting of a single axis-aligned Gaussian was chosen to model each class. The following learning schemes were tested with this model: our unbiased semi-supervised expression (SSunb), that is, the natural log of Equation (20); the log likelihood of the labelled data (LL), that is, Equation (1); the log likelihood of the standard (biased) semi-supervised expression (SSb), that is, the natural log of Equation (3); the log likelihood of the standard semi-supervised expression plus an Entropy Regularisation term (Grandvalet and Bengio, 2006) with the parameter λ set by 5-fold cross-validation, selecting the λ with the lowest holdout-set error rate (ERer); Entropy Regularisation as before, except cross-validation is carried out on the log likelihood of the holdout set (ERnll); the semi-supervised equivalent of Multi Conditional learning (as investigated in Druck et al., 2007), again cross-validating hyperparameters once on error rate (MCer) and once on log likelihood (MCnll); and the log likelihood of the standard semi-supervised expression plus an Expectation Regularisation (Mann and McCallum, 2007) term (XR), with the trade-off parameter set (after some experimentation) as in the original paper to the equivalent of 10 times the number of labelled samples. Additionally, for the position parameter µ of each Gaussian a penalty term C||µ||² was added onto each objective function, with C set to a small constant (10⁻⁵). We would point out that many of these learning schemes were originally designed for use with a discriminative model. 
Here we are using them in a different manner, to augment the objective function during the learning of a generative model. They have been selected due to their reported good performance in improving discriminative learning, in the hope that this will counteract the bias introduced by the missing class information in the likelihood of the unlabelled samples. We chose 7 data sets from the UCI repository (Frank and Asuncion, 2010): Diabetes, Wine, glass identification (Glass), blood transfusion (Blood) (Yeh et al., 2009), Ecoli, Haberman survival (Haber), and Pima Indian diabetes (Pima); and 2 from libsvm: SVM guide 1 (SVMg) (Hsu et al., 2003) and fourclass (Four) (Ho and Kleinberg, 1996). Due to computational constraints, data sets with > 3 classes had one or more classes merged to create 3 approximately equally sized groupings. Each axis of the data was transformed to lie in the range [−1, 1]. Samples with missing attributes were excluded. Where a data set had a dedicated test set, this was used; otherwise, one fifth of the data was randomly separated a priori for this purpose. A range of values of NL and NU was trialled. As a proportion of the total available training data, NL varied over [0.025, 0.05, 0.1, 0.2], and NU over [0.025, 0.05, 0.1, 0.2, 0.4, 0.8], with NU being formed by discarding labels prior to training (for example, a test where NL = 0.05 and NU = 0.4 would indicate 0.45 of the available data was used for training, of which one ninth was labelled). For each repetition a random set of parameters was generated and used as the starting point for each of the above learning schemes. Each model was optimised by repeatedly alternating between a small number of iterations of downhill simplex search (Lagarias et al., 1998) and a large number of iterations of BFGS search (Nocedal and Wright, 1999), until convergence. This process was repeated 100 times for each combination of NL and NU values. 
The error rate and negative log likelihood of the test set were found for each solution. A selection of these results is shown here. Full results over all test sets are included in the appendix. |
| Researcher Affiliation | Collaboration | Patrick Fox-Roberts EMAIL Cambridge University Engineering Department Trumpington Street Cambridge, CB2 1PZ, UK Edward Rosten EMAIL Computer Vision Consulting 7th floor 14 Bonhill Street London, EC2A 4BX, UK |
| Pseudocode | No | The paper describes algorithms and derivations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or provide links to a code repository. |
| Open Datasets | Yes | We chose 7 data sets from the UCI repository (Frank and Asuncion, 2010); Diabetes, Wine, glass identification (Glass), blood transfusion (Blood) (Yeh et al., 2009), Ecoli, Haberman survival (Haber), and Pima Indian diabetes (Pima); and 2 from libsvm: SVM guide 1 (SVMg) (Hsu et al., 2003) and fourclass (Four) (Ho and Kleinberg, 1996). |
| Dataset Splits | Yes | Where a data set had a dedicated test set, this was used; otherwise, one fifth of the data was randomly separated a priori for this purpose. A range of values of NL and NU were trialled. As a proportion of the total available training data, NL varied from [0.025, 0.05, 0.1, 0.2], and NU from [0.025, 0.05, 0.1, 0.2, 0.4, 0.8], with NU being formed by discarding labels prior to training (for example, a test where NL = 0.05 and NU = 0.4 would indicate 0.45 of the available data was used for training, of which one ninth was labelled). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory amounts) used for running experiments. It mentions computational constraints but no specifics. |
| Software Dependencies | No | The paper mentions optimization methods like 'downhill simplex search (Lagarias et al., 1998)' and 'BFGS search (Nocedal and Wright, 1999)' which are algorithms, but does not specify any software libraries or their version numbers used for implementation. |
| Experiment Setup | Yes | Additionally, for the position parameter µ of each Gaussian a penalty term C||µ||² was added onto each objective function, with C set to a small constant (10⁻⁵). For each repetition a random set of parameters was generated and used as the starting point for each of the above learning schemes. Each model was optimised by repeatedly alternating between a small number of iterations of downhill simplex search (Lagarias et al., 1998) and a large number of iterations of BFGS search (Nocedal and Wright, 1999), until convergence. This process was repeated 100 times for each combination of NL and NU values. |
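The optimisation scheme quoted in the table (alternating a small number of downhill-simplex iterations with a large number of BFGS iterations until convergence, with a small C||µ||² penalty on the objective) can be sketched with SciPy. The iteration counts, convergence test, and toy quadratic objective below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def alternating_optimise(objective, theta0, simplex_iters=20,
                         bfgs_iters=500, tol=1e-8, max_rounds=50):
    """Alternate Nelder-Mead (downhill simplex) and BFGS until the
    objective stops improving. Iteration counts are illustrative."""
    theta = np.asarray(theta0, dtype=float)
    prev = objective(theta)
    for _ in range(max_rounds):
        # a small number of downhill simplex iterations
        res = minimize(objective, theta, method="Nelder-Mead",
                       options={"maxiter": simplex_iters})
        # followed by a large number of BFGS iterations
        res = minimize(objective, res.x, method="BFGS",
                       options={"maxiter": bfgs_iters})
        theta = res.x
        if abs(prev - res.fun) < tol:  # convergence test (assumed form)
            break
        prev = res.fun
    return theta

# Toy usage: a quadratic objective with the paper's small L2 penalty
# C*||mu||^2 (C = 1e-5) on the position parameter mu.
C = 1e-5
f = lambda mu: np.sum((mu - 2.0) ** 2) + C * np.sum(mu ** 2)
mu_hat = alternating_optimise(f, np.zeros(3))
```

In practice the objective would be the negative of the chosen semi-supervised log-likelihood expression rather than this toy quadratic; the alternation is useful because the gradient-free simplex steps can escape regions where BFGS stalls.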