One of the problems with Support Vector Machines (SVMs) is that applying them to large datasets is computationally expensive: the training cost often grows proportionally to N^3, where N is the number of training samples.

Many different approaches to the problem have been proposed. For example, chunking and decomposition methods lower the computational cost by optimizing the SVM with respect to subsets of the data.

Romero presents an approach that reduces the computational cost by reducing the number of training samples. The data set is first clustered using the Leader algorithm, and then only the samples chosen as cluster representatives (the leaders) are used for training the SVM.

The Leader algorithm uses a distance measure D and a predefined threshold T to partition the data. Points within distance T of each other (with respect to D) are clustered together, and each cluster is represented by one of its data points, which is then referred to as the leader. The Leader algorithm is very fast: it makes a single pass through the dataset. All areas of the input space are represented in the clustering solution.
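The single-pass scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes Euclidean distance for D, and the function name `leader_clustering` is my own.

```python
import numpy as np

def leader_clustering(X, T):
    """Single-pass Leader clustering (sketch, assuming Euclidean distance).

    Each sample is assigned to the first existing leader within distance T;
    if no leader is that close, the sample becomes a new leader.
    Returns the indices of the leaders in X.
    """
    leader_idx = []
    for i, x in enumerate(X):
        # Distance from x to every leader found so far.
        if leader_idx:
            dists = np.linalg.norm(X[leader_idx] - x, axis=1)
            if dists.min() <= T:
                continue  # x joins an existing cluster
        leader_idx.append(i)  # x becomes a new leader
    return np.array(leader_idx)

# Two tight groups of points: one leader emerges per group.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(leader_clustering(X, T=1.0))  # → [0 2]
```

The returned indices would then select the reduced training set, e.g. `X[idx], y[idx]`, on which the SVM is trained.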

Reducing the size of the training set naturally decreases predictive performance, but the computational cost decreases much more rapidly. As a future step, Romero proposes extending the Leader algorithm to preserve more data points close to the decision boundaries of the SVM.
