ICANN 2011: John Shawe-Taylor: Leveraging the data generating distribution for learning

John Shawe-Taylor from University College London (UK) gave an interesting talk about prediction theory. Prediction theory makes statements about the future error rate of a learned classifier. The goal in the prediction theory is to find the lower and the upper bound, i.e. the confidence interval on the error rate of a learned classifier.

In classification, the aim is to find a function capable of predicting output given input. Data samples (input, output -pairs) are assumed to come independently from some unknown distribution. Complexity of the input data distribution, such as high dimensionality, poses several challenges to the learning of a classifier. Good learning algorithms do not overfit, meaning that they are able to learn the underlying distribution, but do not model the random noise in the data. Often the correct complexity of a model is found by a model selection. The model selection and investigation of the model performance are usually done by looking at the incorrectly classified instances in the training or test data; the training or test error. The best model would be the one having the lowest error rate, indicating that it generalizes well on future data. Because the underlying data distribution is unknown, the true error rate is not known. Nevertheless, the empirical error can be estimated from the data samples. Although the true error, i.e. the generalization error is not observable, its bounds can be obtained. A bound tells how well a classifier performs on data that is generated from the same distribution but which is not yet seen. Both upper and lower bounds for the true error rate can be determined. For example, the upper bound c_D is defined as follows: with probability 1-δ over an identically and independently distributed (i.i.d.) draw of some sample, the expected future error rate is bounded by f(δ, c_D). δ denotes significance; it is a probability that the observed sample is unusual.

Shawe-Taylor concentrated on his talk on PAC-Bayes learning (PAC stands for probably approximately correct), which borrows ideas both from frequentistic and Bayesian statistical learning framework. In frequentistic learning, the only assumption made is that the observed data is i.i.d. Frequentistic statistical analysis defines confidence intervals for learned model parameters, which determine the region, in which the estimate will fall in 1- δ% of time, when the data is sampled infinitely many times from the distribution. In contrast, in the Bayesian statistical analysis, a prior distribution is defined over the model parameters, and the probability of the model parameter values can be addressed from the posterior distribution of the parameters. This leads to more detailed probabilistic prediction. However, in the Bayesian data analysis, the prior needs to be chosen, thus making the analysis subjective, whereas often more objective analysis would be more desirable. Moreover, if the prior is not correct, the final result will be biased.

PAC-Bayes bound is a frequentistic training set bound, which borrows ideas from the Bayesian analysis. It can be applied to stochastic classifiers, such as support vector machines (SVM), to optimize the bounds of the true error rate. Consider the weight vector of a SVM classifier and its ability to classify training data samples correctly. The set of all possible weight vectors forms a weight space or a version space. For a certain sample, there is a weight vector, which separates the version space to regions where that certain sample is correctly classifier and where it is not. There might exist a region, where all of the samples are classified correctly. In this region, a hypersphere can be placed so that its volume is maximal. The volume of the hypersphere can be used as a measure of the generalization and it defines the bound. The volume of the hypersphere has also connection to the model evidence, i.e. the data likelihood given the model. Evidence can be used for example, for model selection. Actually the volume of the hypersphere equals the evidence under the uniform prior distribution. Large value of the evidence ensures good generalization for a classifier.

The most important lesson to learn from the talk of Shawe-Taylor is the PAC-Bayes theorem. The PAC-Bayes theorem involves a class of classifiers together with a prior distribution and posterior over the class of classifiers. The PAC-Bayes bound holds for all choices of posterior, hence posterior does not need to be the classical Bayesian posterior. Often posterior is chosen based on the best bound. Moreover, the bound holds for all prior choices of prior, hence its validity is not affected by a poor choice of prior. This is contrast to the standard Bayesian analysis which only holds, if the prior assumptions are correct.

Given the test input x, a classifier c is first chosen from the posterior, and the output is c(x). PAC-Bayes theorem says, that the upper bound of the true error of this classification is determined as the Kullback-Leibler -divergence between the average true error rate and the average train error rate. The averages are obtained as marginalizing over the posterior distribution, thus giving the method a Bayesian flavour. The KL-divergence measures the closeness of the true error and train error, or the misfit between them. The smaller the KL-divergence, the tighter the bound.

When applying the PAC-Bayes for SVM, Shawe-Taylor has used prior and posterior of unit variance and a prior, which is centered at origin. The posterior can be then chosen by optimizing the bound. Shawe-Taylor presented also variants of the PAC-Bayes approach. One of them is to use part of the data to form data-dependent prior distribution over a hypothesis class. Because bound depends on the distance between prior and posterior, a prior which is closer to the posterior might lead to better results. The application of this to SVM is called Prior-SVM. Further improvement is η-prior SVM, in which the prior distribution is elongated in the direction of the weight vector, which is estimated from the subset of training data. In the work of Shawe-Taylor, classification errors of the standard PAC-Bayes, Prior-SVM and η-prior SVM were compared to 10-fold cross-validation and, the bounds were determined for each PAC-Bayes method using small data sets from UCI repository. The PAC-Bayes classifiers did not improve the classification errors significantly, but the bounds improved from standard PAC-Bayes SVM to η-prior SVM. In general, more advanced PAC-Bayes methods gave tighter bounds for the true classification errors, and their classification errors were comparable to those obtained using cross-validation.

Videolectures from John Shawe-Taylor related to the topic can be found in: http://videolectures.net/john_shawe_taylor/

ICANN 2011

Wednesday, June 22, 2011

John Shawe-Taylor: Leveraging the data generating distribution for learning

No comments:

Post a Comment