The main goal of this paper is to show that a generative model based on restricted Boltzmann machines (RBMs) can be used to separate a foreground object (an object of interest) from the background of an image.
The proposed model starts from a layer of image pixels corresponding to a single image, with two directed edges leading to two separate layers that describe the foreground object and the background image, respectively. Each of those layers is then connected to a separate layer of latent variables with undirected edges, forming a restricted Boltzmann machine. In addition, there is a set of binary variables denoting a mask of the foreground object in the image; it is connected by undirected edges to the latent variables of the foreground layer.
In other words, there are two RBMs, conditioned on the original image, that model (1) jointly the appearance and shape of the foreground object (denoted fRBM from now on for simplicity) and (2) the background image (denoted bRBM).
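The core compositional assumption can be sketched in a few lines: each observed pixel is taken from the foreground where the binary mask is on and from the background elsewhere. This is only a toy illustration of the masking idea, not the authors' implementation; the array names and sizes are hypothetical, and random noise stands in for actual samples from the fRBM and bRBM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy size: a 4x4 image flattened to 16 pixels.
n_pixels = 16

# Noise as a stand-in for samples from the two generative components.
v_fg = rng.random(n_pixels)          # foreground appearance
v_bg = rng.random(n_pixels)          # background appearance
mask = rng.random(n_pixels) > 0.5    # binary mask: True = foreground pixel

# The observed image takes each pixel from the foreground where the
# mask is on, and from the background elsewhere.
v = np.where(mask, v_fg, v_bg)

assert np.all(v[mask] == v_fg[mask])
assert np.all(v[~mask] == v_bg[~mask])
```

Segmentation then amounts to inferring the mask (and the two appearances) that best explain the observed image under the two models.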
This approach suggests that, when good generative models are available for two distinct types of images (or, in fact, any other kinds of data), they can be used to separate a mixed image (in this case, simply foreground + background). Also, considering the depth of the proposed model (a directed layer plus an undirected layer), it can be seen as one of the early approaches to applying deep learning to image segmentation; see (Socher et al., 2011) for another possibility.
One important contribution of this approach is that it does not require explicit ground-truth segmentations of the training samples. Instead, the authors initialize the bRBM by training it on images that can easily be considered backgrounds. Intuitively, this drives the fRBM to learn the regularities of the foreground objects in the training samples, while background clutter is assumed to be already well modeled by the bRBM. This is a neat trick, but the authors needed further tricks in the learning procedure to overcome apparent problems, such as training samples having regular structure in the background (e.g., photos of people taken in a single place).
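The pretraining step above (fitting the background model on background-only images) is standard RBM training; a minimal contrastive-divergence (CD-1) sketch in numpy is given below. All sizes, learning rate, and data are hypothetical stand-ins, and this is a generic binary RBM rather than the paper's exact masked-RBM training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 16 visible units (pixels), 8 hidden units.
n_vis, n_hid = 16, 8
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

def cd1_update(v0, lr=0.05):
    """One contrastive-divergence (CD-1) step on a batch of visibles."""
    global W, b_vis, b_hid
    ph0 = sigmoid(v0 @ W + b_hid)              # hidden probs given data
    h0 = (rng.random(ph0.shape) < ph0) * 1.0   # sample hidden states
    pv1 = sigmoid(h0 @ W.T + b_vis)            # reconstruct visibles
    ph1 = sigmoid(pv1 @ W + b_hid)             # hidden probs given recon
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b_vis += lr * (v0 - pv1).mean(axis=0)
    b_hid += lr * (ph0 - ph1).mean(axis=0)

# Pretrain on "easy background" patches (random binary data as a stand-in).
backgrounds = (rng.random((100, n_vis)) > 0.5) * 1.0
for _ in range(10):
    cd1_update(backgrounds)
```

After this initialization, the background model already accounts for clutter, so subsequent joint training pushes foreground structure into the other RBM.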
The experimental results are impressive. However, more experiments on other data sets would have helped readers appreciate the value of the proposed model and learning method. The authors suggest interesting future research directions, such as replacing the RBMs with deep models, including a few ground-truth segmentations to make the method semi-supervised, and adding another layer of hidden nodes immediately after the original image layer.
Heess, N., Le Roux, N. and Winn, J. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. ICANN 2011.
Le Roux, N., Heess, N., Shotton, J. and Winn, J. Learning a Generative Model of Images by Factoring Appearance and Shape.