The second plenary talk at ICANN 2011 was given by Prof. Geoffrey Hinton from the University of Toronto. The topic was "Learning structural descriptions of objects using equivariant capsules". The accompanying paper in the proceedings is titled "Transforming Auto-encoders". In this talk, he discussed the limitations of convolutional neural networks and proposed a new way of learning features under a new neural network framework.
The human brain does not need to perform an explicit rotation step to recognize an object. This is supported by experiments that compare the task of recognizing objects presented at arbitrary angles with the task of mentally rotating the same objects. However, many recently popular computer vision algorithms violate this principle.
In most popular computer vision research, explicitly designed operators are used to extract invariant features from images. These operators, according to Prof. Hinton, turn out to be misleading and inefficient. For instance, a convolutional neural network tries to learn invariant features in different parts of the image and discards the spatial relationships between them. This breaks down for higher-level tasks such as face identification, which depends on precise spatial relationships, for example between the mouth and the eyes.
Prof. Hinton argues that the convolutional network's way of representing invariant features, where a single scalar output merely signals the presence of a feature, cannot represent highly complex feature sets. Subsampling (pooling) has been proposed to make convolutional neural networks invariant to small changes in the viewing angle of the object. Prof. Hinton argues that this is misguided, because the ultimate goal of feature learning should not be viewpoint invariance. Instead, the goal should be equivariant features, where changes in viewpoint lead to corresponding changes in the network's activities: when an object is rotated, the representations of its parts should rotate correspondingly.
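The contrast between invariance and equivariance can be illustrated with a toy sketch (not from the paper; a hypothetical 1-D example). Global max pooling yields the same output wherever the feature sits, while an equivariant representation carries the feature's position along with its presence, so a shift of the input produces a matching shift in the output:

```python
import numpy as np

# Toy 1-D "image" containing a single feature (one bright pixel).
image = np.zeros(8)
image[2] = 1.0

shifted = np.roll(image, 3)  # the same feature, moved 3 positions

# Invariant representation (global max pooling): both inputs give the
# same output, so the feature's position is discarded.
inv_a, inv_b = image.max(), shifted.max()

# Equivariant representation: output the feature's presence AND its
# position; shifting the input shifts the position output by the
# same amount.
eq_a = (image.max(), int(image.argmax()))
eq_b = (shifted.max(), int(shifted.argmax()))

print(inv_a, inv_b)  # identical: 1.0 1.0
print(eq_a, eq_b)    # (1.0, 2) vs (1.0, 5): position moved with the input
```

The invariant outputs are indistinguishable, while the equivariant ones preserve exactly the information a higher-level feature would need to check spatial relationships between parts.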
Therefore, he developed a new kind of feature extractor that learns equivariant features through computation in local units called "capsules", each producing an informative output. These local features are combined hierarchically into more abstract representations. The network is trained on pairs of images of the same object that are slightly shifted and rotated relative to each other. In this way, each learned capsule acts as a "generative model". The difference between a convolutional neural network and the capsule method is that a capsule carries the spatial pose of an image feature along with the probability that the feature is present.
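The data flow through a single capsule can be sketched as follows. This is a minimal, untrained NumPy sketch of the transforming auto-encoder idea under assumed dimensions and randomly initialized weights; the layer sizes, weight names, and the choice of a 2-D shift as the pose are illustrative, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a flattened 8x8 input and 16 hidden units.
D = 64
H = 16

# Recognition weights: image -> hidden -> (presence logit, pose x, y)
W_rec = rng.normal(0, 0.1, (H, D))
W_out = rng.normal(0, 0.1, (3, H))

# Generation weights: transformed pose -> hidden -> output image
W_gen = rng.normal(0, 0.1, (H, 2))
W_img = rng.normal(0, 0.1, (D, H))

def capsule_forward(image, delta):
    """One capsule: infer presence and pose from the image, apply the
    externally supplied shift delta, and generate an output image
    contribution gated by the presence probability."""
    h = np.tanh(W_rec @ image)
    p_logit, x, y = W_out @ h
    p = 1.0 / (1.0 + np.exp(-p_logit))   # presence probability in (0, 1)
    pose = np.array([x, y]) + delta      # apply the known transformation
    g = np.tanh(W_gen @ pose)
    return p * (W_img @ g)

image = rng.normal(0, 1, D)
delta = np.array([1.0, -2.0])            # the shift fed to the network
prediction = capsule_forward(image, delta)
print(prediction.shape)  # (64,)
```

During training, the outputs of many such capsules would be summed and matched against the correspondingly shifted target image, which is what forces each capsule to discover a meaningful pose.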
This new way of representing image transformations opens new possibilities for feature learning, and Prof. Hinton argues that the approach is closer to how the human brain functions and more promising than traditional computer vision methods.
For a detailed explanation and demonstration, please see the full paper in the proceedings of ICANN 2011.