Searching large data repositories is an important research problem given the information overload we face daily. The poster by Tripathi, Oakes and Wermter presents a hybrid parallel classification approach for searching a large data repository more efficiently. The poster addresses a supervised learning problem: learning to classify text data when labeled training data is available.
Many ways of enhancing the predictive performance of classifiers by applying them in subspaces of the original input space have been studied in recent years. These methods include the Random Subspace Method (RSM), which randomly divides the original feature space into lower-dimensional subspaces, and variants of the RSM that select the subspaces according to some criterion instead of by random assignment.
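To make the RSM idea concrete, the following is a minimal sketch, not the poster's implementation: each base learner (here a deliberately simple nearest-centroid classifier, chosen only for brevity) is trained on a randomly drawn subset of the features, and predictions are combined by majority vote. All function names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid_fit(X, y):
    # A simple stand-in base learner: one centroid per class.
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def centroid_predict(model, X):
    # Assign each sample to the class of its nearest centroid.
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def rsm_train(X, y, n_learners=11, subspace_dim=3):
    # Random Subspace Method: each learner sees a random feature subset.
    ensemble = []
    for _ in range(n_learners):
        feats = rng.choice(X.shape[1], size=subspace_dim, replace=False)
        ensemble.append((feats, centroid_fit(X[:, feats], y)))
    return ensemble

def rsm_predict(ensemble, X):
    # Majority vote over the per-subspace predictions.
    votes = np.stack([centroid_predict(model, X[:, feats])
                      for feats, model in ensemble])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Because each learner only ever sees a small feature subset, the ensemble is cheap per learner and trivially parallelizable, which is what makes subspace methods attractive for large repositories.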
The novel idea of Tripathi et al. is to use semantic information about the data to optimize the selection of subspaces. After the subspaces have been learned, a set of classifiers classifies the data with respect to a number of topics within the new subspaces.
To study the approach, the Reuters corpus is used, which provides multiple levels of topics for its documents. The broadest topics (e.g. education, computers, politics) serve as the semantic information and are used for learning the lower-dimensional subspaces with a maximum-significance-based method. Multiple classifiers are then used within the learnt subspaces to classify the data with respect to more fine-grained topics (e.g. within the topic of education: schools, colleges, exams).
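The poster's exact maximum-significance criterion is not spelled out here, but the general shape of semantically guided subspace selection can be sketched as follows: for a given coarse topic, score each feature by how strongly it separates that topic's documents from the rest, and keep the top-scoring features as that topic's subspace. The scoring function below is a hypothetical stand-in, not the authors' method.

```python
import numpy as np

def select_subspace(X, coarse_labels, topic, k):
    # Hypothetical stand-in for a maximum-significance criterion:
    # score each feature by how well it separates the given coarse topic
    # from the remaining documents, then keep the k highest-scoring features.
    in_topic = coarse_labels == topic
    mu_in = X[in_topic].mean(axis=0)
    mu_out = X[~in_topic].mean(axis=0)
    spread = X.std(axis=0) + 1e-9            # avoid division by zero
    scores = np.abs(mu_in - mu_out) / spread
    return np.sort(np.argsort(scores)[-k:])  # indices of the retained features
```

Fine-grained classifiers for the topic would then be trained on `X[:, select_subspace(...)]`, so each one operates in a much smaller, semantically focused space.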
Tripathi et al. run experiments using multiple different algorithms (e.g. multilayer perceptrons, the naive Bayes classifier, and random forests) as part of their hybrid architecture. The hybrid architecture both improves classification results and decreases computation times.
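The combination mechanism of such a hybrid can be illustrated with a hard-voting sketch. The base learners below (nearest centroid and k-nearest neighbours) are simple placeholders for the perceptrons, naive Bayes and random forests the poster actually uses; only the voting structure is the point.

```python
import numpy as np

def fit_centroid(Xtr, ytr):
    # Nearest-centroid base learner; returns a predict function.
    cls = np.unique(ytr)
    cen = np.stack([Xtr[ytr == c].mean(axis=0) for c in cls])
    return lambda X: cls[((X[:, None] - cen[None]) ** 2).sum(-1).argmin(1)]

def fit_knn(Xtr, ytr, k):
    # k-nearest-neighbours base learner; returns a predict function.
    def predict(X):
        d = ((X[:, None] - Xtr[None]) ** 2).sum(-1)
        nn = ytr[d.argsort(1)[:, :k]]
        return np.array([np.bincount(row).argmax() for row in nn])
    return predict

def hybrid_predict(learners, X):
    # Majority vote across heterogeneous base learners.
    votes = np.stack([f(X) for f in learners])
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Since the base learners are independent once trained, they can be evaluated in parallel, which is consistent with the poster's emphasis on reduced computation times.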