Wednesday, June 22, 2011

K. A. Raftopoulos: Visual Pathways for Shape Abstraction

Konstantinos A. Raftopoulos gave a presentation called "Visual Pathways for Shape Abstraction" on Friday 17.6.2011. The paper was written by him and Stefanos D. Kollias, both from the National Technical University of Athens. The talk first presented the neuroscience background of shape recognition in cortical neurons, which has motivated this research and all related work. Cortical neurons have orientation-specific receptive fields (RF), which enable them to detect boundaries. How this builds up the perception of lines and curves is still an open question, but that hasn't stopped trials with experimental shape recognition algorithms built on neural networks.

Raftopoulos and Kollias propose a method which combines the skeleton abstraction of the shape with curvature information. This curvature-skeleton conveys both local and global shape information, which results in improved recognition capability of neural network classifiers. The method builds on several layers of neurons which try to imitate the boundary detection of cortical layers. Layers of perceptrons are used in parallel to determine the edges and then the curvature. The novelty of this work is an additional layer which combines the points of high curvature with the medial axis points of the shape.

The experiments were conducted by training a network with two hidden layers to classify 2-D images into two categories. They used 500 grayscale images of hands and rabbits from the KIMIA silhouette database, which the network could classify correctly. The generalization ability was then tested by applying partial occlusion, deformations and missing parts to 50 images. Recognizing such deformed images is usually a very difficult task, but their network achieved a correct recognition rate of 91.6 %. A network without the curvature-skeleton information achieved only 61.5 %.

This presentation/paper was really intriguing to me, especially because it built such strong biological connections into the model and still got good results. The only shortcoming is that they hadn't compared (or just didn't show us) their methodology to other state-of-the-art image recognition methods. Also, using just two image categories makes the results seem tentative, although the method seems very promising.

M. Oubbati, J. Frick, and G. Palm: A Distributed Behavioral Model Using Neural Fields

The poster by Mohamed Oubbati, Josef Frick, and Günther Palm implements the classic swarm behavioral model by C.W. Reynolds (1987, 1999) using neural fields (Amari, S. 1977), which are equivalent to continuous recurrent neural networks. This emergent grouping behaviour doesn't need any leaders. It is just the result of three steering behaviors (separation, cohesion and alignment) in every individual agent, or "boid", which are adjusted according to the other boids and objects nearby.
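For reference, the three steering behaviors can be sketched with plain vector arithmetic. This is a minimal illustration of Reynolds-style rules, not the neural-field implementation from the poster; the function name `steer`, the dictionary layout and the weights are assumptions made for this example.

```python
def steer(boid, neighbours, w_sep=1.5, w_coh=1.0, w_ali=1.0):
    """Return a steering vector (dx, dy) for one boid.

    boid and each neighbour are dicts with 'pos' and 'vel' as (x, y) tuples.
    The weights are illustrative assumptions, not values from the poster.
    """
    if not neighbours:
        return (0.0, 0.0)
    n = len(neighbours)
    px, py = boid["pos"]
    vx, vy = boid["vel"]

    # Cohesion: steer towards the average position of the neighbours.
    cx = sum(b["pos"][0] for b in neighbours) / n - px
    cy = sum(b["pos"][1] for b in neighbours) / n - py

    # Alignment: steer towards the average velocity of the neighbours.
    ax = sum(b["vel"][0] for b in neighbours) / n - vx
    ay = sum(b["vel"][1] for b in neighbours) / n - vy

    # Separation: steer away from each neighbour, more strongly when close.
    sx = sy = 0.0
    for b in neighbours:
        dx, dy = px - b["pos"][0], py - b["pos"][1]
        d2 = dx * dx + dy * dy
        if d2 > 0:
            sx += dx / d2
            sy += dy / d2

    return (w_sep * sx + w_coh * cx + w_ali * ax,
            w_sep * sy + w_coh * cy + w_ali * ay)
```

Every boid runs the same rule on its own neighbourhood, which is why no leader is needed.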

The poster itself doesn't go very deep into the theory. It mainly consists of pictures of the steering behaviors and the emergent global behavior. The basic equations of a 1-D dynamic neural field (DNF) are shown, together with the stimulus design of the steering behaviors. Every boid shares the same encoded steering stimulus design, which is determined quite intuitively. In the paper the authors have also presented the implementation in detail.
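For readers who have not seen a neural field before, the usual textbook form of a 1-D Amari field is given below. The notation is an assumption (the poster's symbols may differ): u(x,t) is the field activation, τ a time constant, w the lateral interaction kernel, f a firing-rate nonlinearity, S the external stimulus and h the resting level.

```latex
\tau \,\frac{\partial u(x,t)}{\partial t}
  = -u(x,t) + \int w(x - x')\, f\bigl(u(x',t)\bigr)\, dx' + S(x,t) + h
```

In this setting the steering behaviors would enter through the stimulus term S(x,t), with x ranging over possible heading directions.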

The results show that the swarm behavior can be implemented with neural fields. The value of doing so still eludes me, and the choice of neural fields isn't motivated anywhere. By comparison, neural fields are computationally heavy, and one can achieve the same results with ease in ordinary object-oriented programming. Still, the work is really an investigation, and perhaps the use of neural fields can be extended to other problems.

Aapo Hyvärinen: Brain Imaging at rest: the ultimate neuroscience data set?

Aapo Hyvärinen from Helsinki University gave his plenary talk about the study of brains at rest. There are a lot of studies in neuroscience, but why is it so interesting to study brains at rest? A subject is said to be at rest when he is not performing a given task and is not stimulated in any other way by the researcher. First of all, not many analyses have been done on this topic yet, even though the measurements are easy to repeat and there are no time limits. In addition, the setting is more objective because it is free from the researcher's experimental design. A new viewpoint is to use such studies to learn more about the brain's internal dynamics. Research shows that some parts of the brain, called the default network, are even more active during rest than during stimulation. Maybe this way it is possible to find the ultimate neuroscience data set.

Measurements of the brain are traditionally done with electroencephalography (EEG), magnetoencephalography (MEG) or functional magnetic resonance imaging (fMRI). Supervised methods cannot be used to analyze this data, and the most popular unsupervised method is independent component analysis (ICA). It is used to find components by maximizing the sparsity of a given variable. Hyvärinen also presented a spatial version of ICA, which is often used with fMRI, and how it could be used in MEG. ICA has been used to find resting state networks in fMRI with good results. The results were very similar to the ones acquired from research with very complex stimulation: movies.

Hyvärinen highlighted the importance of testing the significance of the results. ICA itself does not provide information about the reliability of its results, but there are ways to test it statistically: do a separate ICA on several subjects and pick the significant components, which appear in two or more subjects. It is possible that not all significant components appear in all subjects. Part of the analysis is seeking connections between the measured variables. Hyvärinen explained the different approaches used, for example structural equation models, and when those can be estimated.

Exploratory data analysis with ICA could bring us a better understanding of how the brain functions. Hyvärinen proposes that these methods could be used more in studies where complicated stimulation is used. He admits that speaking of the ultimate data set here is an overstatement, and that until we can properly do two-person neuroscience, we cannot fully understand the human brain, as Riitta Hari said in her plenary talk.

Antti Ajanki and Samuel Kaski: Probabilistic Proactive Timeline Browser

In this poster the authors had conducted usage experiments with what they called a proactive image browser. What made it proactive was that it used explicit and implicit relevance cues, i.e. mouse clicks and cursor behavior, to modulate the relative sizes of the images. Another point was that all the images were shown at once, so no scrolling was needed. The hypothesis was that displaying the estimated relevance of the images this way would decrease the effort of finding a specific one.

To estimate the relevance of an image, a probabilistic generative model was utilized. The relevance of an image is thought to be reflected by a latent variable, whose value is estimated with linear regression from the observed content feature vector (coded by hand for the test set) with a Gaussian error term. Then the number of clicks is predicted from a multinomial distribution whose weights are derived from the latent variables. The mouse movement features are predicted from a Gaussian distribution with a mean given by a certain linear mapping from the latent variables.
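Written out, the model described above might look roughly as follows. The symbols are assumptions introduced here for illustration: x_i is the content feature vector of image i, z_i its latent relevance, c the vector of click counts with total C, and m_i the mouse movement features. The softmax link for the multinomial weights is only one plausible reading of "derived from the latent variables"; the poster may use a different one.

```latex
\begin{aligned}
z_i &= \mathbf{w}^\top \mathbf{x}_i + \varepsilon_i,
      \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
(c_1, \dots, c_N) &\sim \mathrm{Multinomial}(C;\, \pi_1, \dots, \pi_N),
      \qquad \pi_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
      \quad \text{(assumed link)} \\
\mathbf{m}_i &\sim \mathcal{N}(\mathbf{a}\, z_i,\ \Sigma)
\end{aligned}
```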

In the experiment, the subjects were asked to view a 17-minute video clip and later recall certain events from it with the help of some of the frames shown on a timeline. The authors compared three different user interfaces: the proactive one described above, one that also shows all the images at once and zooms with a fisheye effect on hover, and an ordinary scrollable one with constant image size. The results showed that the mean average precision (the better the precision, the less the effort) of the proactive interface was better than that of the fisheye interface in all six tasks considered, and it lost to the traditional interface in only one task. This one loss seemed to be caused by the fact that the traditional scrolling interface initially showed the first image, and the relevant images in one of the tasks happened to be at the very beginning. The hypothesis seemed to be correct, but the absolute measured precision was still not very high.
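Mean average precision can be computed as in the sketch below. It assumes each task yields a ranked list of images with a 0/1 relevance flag per position; how exactly the poster ranked the images (presumably by displayed size or predicted relevance) is not stated here, so that part is an assumption.

```python
def average_precision(ranked_relevance):
    """Average precision of one ranked list of 0/1 relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the average precision over several ranked lists (tasks)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)
```

A list that puts all relevant images first scores 1.0, so a higher value means the user has to inspect fewer irrelevant images.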

What I thought would be interesting to see in this application was using eye movement detection. The author confirmed that this was indeed the future plan. It would have been enlightening to have a demo version of the application available at the poster session but I guess that would have violated some (un)written rules of poster sessions.

Valero et al.: Complex-Valued Independent Component Analysis of Natural Images

Natural image statistics is a useful tool for understanding the functions of the visual part of the brain. Research on natural image statistics has mainly focused on the “linear” model, which tries to extract independent sources from the original natural images. These sources correspond to Gabor-like receptive fields in the primary visual cortex.

More recently, researchers have moved on to describing the statistics of the signals that follow the simple “linear” model stage. It has been pointed out that the learned squared outputs of the simple cells may lead to complex responses. For instance, when the FFT is applied to natural images, the result has a phase term and a magnitude term, which together can be described by complex numbers. Similarly, a natural image might contain phase information from some source images and magnitude information from other source images. Furthermore, the information contained in the phase terms is generally of great importance. Complex Independent Component Analysis therefore comes into play quite naturally.

Traditional complex ICA, according to the authors of the paper, assumes that the phase term in natural images is uniformly distributed in the complex plane. This is not always the case. In the experiment made in the paper, the features learned in the complex plane with traditional complex ICA differ a lot from the actual features, which is shown in the following figure. The blue curve shows the phases of complex ICA sources, and the black curve shows the learned phases of complex ICA sources.

Therefore, in this paper, the authors propose an extension to traditional complex ICA that also models the phase information in natural images. This is done by assuming a von Mises distribution for the phase information of the output signal, instead of the uniform distribution that standard cICA assumes. The extension allows for a better fit to the signal, as the phase distributions are often far from uniform. The assumed distribution is capable of capturing two peaks in the phase information. After learning, the learned features fit better than the features learned by traditional complex independent component analysis. The result is shown as follows.
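As a reminder of what a von Mises distribution looks like, here is a small sketch of the textbook density on the circle, with mean direction `mu` and concentration `kappa`. This single-peak form is only the standard definition; the variant used in the paper, which can capture two peaks, is not reproduced here. The function names are chosen for this example.

```python
import math


def bessel_i0(kappa, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    total = 0.0
    term = 1.0  # the k = 0 term of sum_k ((kappa/2)^2)^k / (k!)^2
    for k in range(terms):
        total += term
        term *= (kappa / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def von_mises_pdf(theta, mu, kappa):
    """Textbook von Mises density at angle theta (radians)."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))
```

With `kappa = 0` the density reduces to the uniform distribution 1/(2π) that standard complex ICA assumes, and larger `kappa` concentrates the phases around `mu`.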

To summarize, the authors assumed a modified distribution over the phase information of natural images and thus obtained a better fit to the phase information. This new type of distribution is therefore good in terms of accuracy.

Groot et al.: Learning from Multiple Annotators with Gaussian Processes

Among the ICANN'11 posters, Perry Groot et al. proposed an approach for learning a consensus from multiple unreliable annotators. In their paper “Learning from Multiple Annotators with Gaussian Processes” they present a way of using noisy target value information from several sources to regress observed data onto these target values.

The problem can be considered a generalization of the multiple annotators' classification problem to the case of continuous target variables. An example of the multiple annotators' problem is an Internet service called Amazon's Mechanical Turk. It is a service for having simple tasks carried out by users, who receive a small payment for their trouble. The accuracy of the users, however, depends on how much effort they put into the task and on how experienced they are. The idea of the proposed model is that this variation between users (i.e. annotators) can be learned from the data in order to improve performance in the overall task of fusing the results from multiple users.

The authors solve the multiple annotators' problem with a slightly modified Gaussian process (GP) model. A Gaussian process is a widely recognized nonparametric Bayesian model for the covariance of a collection of random variables. The standard GP model learns a latent variable representing the variance of the observed target variables. The trick in this paper is to make this latent variable source-specific. Knowing the source (user/annotator) of each annotation, the model can learn a source-specific latent variable representing the uncertainty in the annotations from each source.
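In equations, the idea can be sketched as follows. This is a standard way to write GP regression with input-dependent noise and is given under assumed notation, not copied from the paper: a(i) denotes the annotator who produced the i-th annotation, K is the kernel matrix over the training inputs and k_* the kernel vector between a test input x_* and the training inputs.

```latex
\begin{aligned}
y_i &= f(\mathbf{x}_i) + \varepsilon_i,
  \qquad f \sim \mathcal{GP}(0, k),
  \qquad \varepsilon_i \sim \mathcal{N}\bigl(0, \sigma^2_{a(i)}\bigr) \\
\bar{f}(\mathbf{x}_*) &= \mathbf{k}_*^\top \bigl(K + \Sigma\bigr)^{-1} \mathbf{y},
  \qquad \Sigma = \mathrm{diag}\bigl(\sigma^2_{a(1)}, \dots, \sigma^2_{a(n)}\bigr)
\end{aligned}
```

The standard GP is the special case Σ = σ²I, where every annotation is trusted equally. With annotator-specific variances, annotations from noisy sources are automatically down-weighted in the prediction.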

In the study, the authors compare the proposed multi-annotator model to the standard GP model learned with pooled data, with an individual annotator's data, and with weighted individual annotators' data. The proposed model outperformed the comparison methods on the UCI 'housing' data set, for which the authors manually created the annotations. In principle, the GP framework also applies to a classification problem, where the target variables are binary, but an exact solution for that remains intractable.

The multiple annotators' problem undoubtedly remains a central question in machine learning. Services such as Amazon's Mechanical Turk have shown that there is and will be a need for computational methods for fusing data from several uncertain sources. This work is a nice, simple tweak to a widely recognized probabilistic model. Despite the straightforward nature of the solution, it provides promising results for the problem. Gaussian processes have been a field of intensive research over the past ten years. At ICANN'11, too, there was at least one GP poster in addition to this one.

Geoffrey Hinton: Learning structural descriptions of objects using equivariant capsules

The second plenary talk in ICANN 2011 was given by Prof. Geoffrey Hinton from the University of Toronto. The topic was "Learning structural descriptions of objects using equivariant capsules". The accompanying paper in the proceedings appears under the name “Transforming Auto-encoders”. In this talk, he discussed the limitations of convolutional neural networks and proposed a new way of learning invariant features under a new neural network framework.

The human brain does not need to go through a rotation step to recognize an object. This has been shown by a test that compares the task of recognizing objects positioned at arbitrary angles with the task of mentally rotating the same objects. However, several recently popular computer vision algorithms violate this rule.

In most popular computer vision research, people use explicitly designed operators to extract invariant features from images. These operators, according to Prof. Hinton, turn out to be misleading and inefficient. For instance, a convolutional neural network tries to learn invariant features in different parts of the image and discards the spatial relationships between them. This will not work for higher-level features, for instance in face identity analysis, which requires extremely precise spatial relationships between the mouth and the eyes.

Prof. Hinton argues that the convolutional network way of representing invariant features, where only a scalar output is used to represent the presence of a feature, is not capable of representing highly complex invariant feature sets. Subsampling methods have been proposed to make convolutional neural networks invariant to small changes in the viewing angle of the object. Prof. Hinton argues that this is not correct, as the ultimate goal of feature learning should not be viewpoint invariance. Instead, the goal should be equivariant features, where changes in viewpoint lead to corresponding changes in the neural network's activities. Equivariance means that the building blocks of the object features are rotated correspondingly when the object is rotated.

Therefore, he developed a new way of learning feature extractors, which learn equivariant features through computation in local units called "capsules" and output informative results. These local features are accumulated hierarchically towards a more abstract representation. The network is then trained with images of the same objects that are slightly shifted and rotated. In this way, each learned capsule is a "generative model". The difference between a convolutional neural network and the "capsule method" is that the capsule method considers the spatial relationships of image features, carrying spatial position along with the probability that the feature is present.

This new way of representing the transformations of images has opened a new possibility for training invariant features. Prof. Hinton argues that this approach behaves more like the human brain and is more promising than traditional computer vision methods.

For a detailed explanation and demonstration, please see the full paper included in the proceedings of ICANN 2011.

John Shawe-Taylor: Leveraging the data generating distribution for learning

John Shawe-Taylor from University College London (UK) gave an interesting talk about prediction theory. Prediction theory makes statements about the future error rate of a learned classifier. The goal in prediction theory is to find the lower and the upper bound, i.e. the confidence interval, on the error rate of a learned classifier.

In classification, the aim is to find a function capable of predicting the output given the input. Data samples (input-output pairs) are assumed to come independently from some unknown distribution. The complexity of the input data distribution, such as high dimensionality, poses several challenges to the learning of a classifier. Good learning algorithms do not overfit, meaning that they are able to learn the underlying distribution but do not model the random noise in the data. Often the correct complexity of a model is found by model selection. Model selection and the investigation of model performance are usually done by looking at the incorrectly classified instances in the training or test data, i.e. the training or test error. The best model would be the one having the lowest error rate, indicating that it generalizes well to future data. Because the underlying data distribution is unknown, the true error rate is not known. Nevertheless, the empirical error can be estimated from the data samples. Although the true error, i.e. the generalization error, is not observable, bounds on it can be obtained. A bound tells how well a classifier performs on data that is generated from the same distribution but has not yet been seen. Both upper and lower bounds for the true error rate can be determined. For example, the upper bound cD is defined as follows: with probability 1-δ over an independent and identically distributed (i.i.d.) draw of some sample, the expected future error rate is bounded by f(δ, cD). δ denotes significance; it is the probability that the observed sample is unusual.
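The simplest concrete example of such a statement is the Hoeffding bound for a held-out test set, sketched below. It is not the bound from the talk, just a standard illustration: it assumes the classifier was fixed before the m test samples were drawn i.i.d., and the function name is chosen for this example.

```python
import math


def hoeffding_upper_bound(empirical_error, m, delta):
    """Upper bound on the true error that holds with probability >= 1 - delta.

    Assumes empirical_error was measured on m i.i.d. samples that were
    not used to train or select the classifier.
    """
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return min(1.0, empirical_error + slack)
```

For instance, a test error of 0.1 on 1000 samples with δ = 0.05 gives a bound of roughly 0.14, and the slack shrinks as the sample grows.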

Shawe-Taylor concentrated in his talk on PAC-Bayes learning (PAC stands for probably approximately correct), which borrows ideas from both the frequentist and the Bayesian statistical learning frameworks. In frequentist learning, the only assumption made is that the observed data is i.i.d. Frequentist statistical analysis defines confidence intervals for learned model parameters, which determine the region into which the estimate will fall a fraction 1-δ of the time when the data is sampled infinitely many times from the distribution. In contrast, in Bayesian statistical analysis a prior distribution is defined over the model parameters, and the probability of the model parameter values can be read from the posterior distribution of the parameters. This leads to more detailed probabilistic predictions. However, in Bayesian data analysis the prior needs to be chosen, which makes the analysis subjective, whereas a more objective analysis would often be more desirable. Moreover, if the prior is not correct, the final result will be biased.

The PAC-Bayes bound is a frequentist training set bound which borrows ideas from Bayesian analysis. It can be applied to stochastic classifiers, such as support vector machines (SVM), to optimize the bounds on the true error rate. Consider the weight vector of an SVM classifier and its ability to classify training data samples correctly. The set of all possible weight vectors forms a weight space or a version space. For a certain sample, there is a weight vector which separates the version space into regions where that sample is correctly classified and where it is not. There might exist a region where all of the samples are classified correctly. In this region, a hypersphere can be placed so that its volume is maximal. The volume of the hypersphere can be used as a measure of generalization, and it defines the bound. The volume of the hypersphere is also connected to the model evidence, i.e. the data likelihood given the model. Evidence can be used, for example, for model selection. In fact, the volume of the hypersphere equals the evidence under the uniform prior distribution. A large value of the evidence ensures good generalization for a classifier.

The most important lesson to learn from the talk of Shawe-Taylor is the PAC-Bayes theorem. The PAC-Bayes theorem involves a class of classifiers together with a prior distribution and a posterior distribution over the class of classifiers. The PAC-Bayes bound holds for all choices of posterior, hence the posterior does not need to be the classical Bayesian posterior. Often the posterior is chosen based on the best bound. Moreover, the bound holds for all choices of prior, hence its validity is not affected by a poor choice of prior. This is in contrast to standard Bayesian analysis, which only holds if the prior assumptions are correct.

Given the test input x, a classifier c is first drawn from the posterior, and the output is c(x). The PAC-Bayes theorem says that the Kullback-Leibler (KL) divergence between the average train error rate and the average true error rate is bounded from above, which in turn determines an upper bound on the true error of this classification. The averages are obtained by marginalizing over the posterior distribution, thus giving the method a Bayesian flavour. The KL divergence measures the closeness of the true error and the train error, or the misfit between them. The smaller the allowed KL divergence, the tighter the bound.
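To make this concrete, one common statement of the theorem (the Langford/Seeger form, assumed here because the formula itself is not quoted in this post) is kl(train error || true error) ≤ (KL(Q||P) + ln((m+1)/δ)) / m, where Q is the posterior, P the prior and m the number of training samples. The sketch below turns that inequality into a numerical upper bound on the true error by bisection. All function names are chosen for this example.

```python
import math


def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def pac_bayes_error_bound(train_error, kl_posterior_prior, m, delta):
    """Largest true error p with binary_kl(train_error, p) <= right-hand side."""
    rhs = (kl_posterior_prior + math.log((m + 1) / delta)) / m
    lo, hi = train_error, 1.0
    for _ in range(100):  # bisection; binary_kl grows with p for p >= q
        mid = (lo + hi) / 2.0
        if binary_kl(train_error, mid) > rhs:
            hi = mid
        else:
            lo = mid
    return hi
```

The bound tightens when the posterior stays close to the prior (small KL(Q||P)) or when more training data is available, which is the motivation for data-dependent priors.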

When applying PAC-Bayes to SVM, Shawe-Taylor used a prior and a posterior of unit variance, with the prior centered at the origin. The posterior can then be chosen by optimizing the bound. Shawe-Taylor also presented variants of the PAC-Bayes approach. One of them is to use part of the data to form a data-dependent prior distribution over the hypothesis class. Because the bound depends on the distance between the prior and the posterior, a prior which is closer to the posterior might lead to better results. The application of this to SVM is called Prior-SVM. A further improvement is η-prior SVM, in which the prior distribution is elongated in the direction of the weight vector, which is estimated from a subset of the training data. In the work of Shawe-Taylor, the classification errors of the standard PAC-Bayes, Prior-SVM and η-prior SVM were compared to 10-fold cross-validation, and the bounds were determined for each PAC-Bayes method using small data sets from the UCI repository. The PAC-Bayes classifiers did not improve the classification errors significantly, but the bounds improved from the standard PAC-Bayes SVM to η-prior SVM. In general, the more advanced PAC-Bayes methods gave tighter bounds for the true classification errors, and their classification errors were comparable to those obtained using cross-validation.

Videolectures from John Shawe-Taylor related to the topic can be found in:

SNPBoost: Interaction Analysis of Risk Prediction on GWA Data

Among the few works related to bioinformatics was the poster of Ingrid Braenne et al. from Lübeck, Germany, titled SNPboost: Interaction Analysis of Risk Prediction on GWA Data. As is commonly known, the human genome consists of DNA, in which the order of the four nucleotides, A, T, C and G, determines the instructions to build the structural and functional parts of our bodies. The human genome consists of over three billion nucleotides divided into 23 pairs of chromosomes. Small variations in certain locations of the genome, i.e. genomic loci, make us unique individuals. For example, a single nucleotide variation at a certain locus between individuals is called a single-nucleotide polymorphism, SNP. The human genome contains hundreds of thousands of SNPs. If a SNP is within a gene, the different variants of that gene are called alleles. An individual can be a homozygote or a heterozygote with respect to an allele: a homozygote has the same variant of the gene in both chromosomes of a chromosome pair, whereas a heterozygote has different variants in the two chromosomes of the pair. A SNP is usually biallelic, meaning that there are only two high-frequency nucleotide variants of that SNP in the population. The frequencies of different alleles usually vary between populations, for example between ethnic groups.

Different variants of a genomic locus, for example a gene, might be responsible for the development of a certain disease. There are certain diseases which can be associated with a single SNP, such as sickle cell anemia, but most common diseases, i.e. national diseases, such as type I and type II diabetes, myocardial infarction and Crohn’s disease, are so-called complex diseases or multi-gene diseases. Complex diseases are caused by several disease-inducing SNPs interacting with each other. SNP profiles of individuals can be obtained using sequencing techniques, and they are routinely analyzed using genome-wide association analysis (GWA). One of the goals in GWA is to associate disease risk with certain SNPs. In addition, the treatment of a disease can be tailored for each individual based on their SNP profile, leading to personalized medicine. However, disease diagnostics, or the prediction of disease risk based on SNP profiles, is unfortunately very difficult and uncertain. It is also interesting to determine the functional role of a SNP associated with a certain disease. A SNP can fall into a coding region, i.e. a gene, but many of them also fall into non-coding regions, which makes their functional role harder to determine. Often the interaction of multiple SNPs is not investigated in GWA studies, as GWA only provides the statistical significance of disease association for single SNPs. However, one genetic variant can have only a limited effect on risk in complex diseases, and the disease effect may only be visible in the interaction between multiple SNPs. The work of Braenne et al. attempts to find solutions for this. In GWA studies, the data created is very high-dimensional and the sample size is low compared to the dimensionality. This causes difficulties for computational analysis, and the interpretation of the results also becomes challenging. Therefore, some feature selection is often needed. This issue is also addressed in the work of Braenne et al.

Classifiers such as the support vector machine (SVM) have been successfully used, for example, to improve the risk prediction for type I and type II diabetes from SNP profiles. SVM is a standard benchmark classification method that is particularly useful for studying multivariate data sets. SVM aims to determine a hyperplane in a higher-dimensional feature space that separates two classes with maximum separation, i.e. with a maximum margin. By finding a linear separator in the feature space, the classification boundaries become non-linear in the original variable space. In the work of Braenne et al., the authors studied the performance of two types of classification tools, SVM and boosting, in classifying healthy and diseased SNP profiles. Two different versions of SVM were used, namely SVMs with a linear and with a Gaussian kernel. When training an SVM classifier, the authors did feature selection by selecting a subset of SNPs based on the statistical significance of their association with the disease, i.e. their p-values obtained from GWA analysis. Using a preselected subset of SNPs might, however, miss important interaction effects in the data, so the selection of SNPs for a classifier should be improved somehow.

The boosting method used in the study of Braenne et al. is AdaBoost, the most popular boosting algorithm. In boosting, the idea is to combine several weak classifiers to obtain one strong classifier. In this work, the set of weak classifiers contained 6 classifiers for each SNP. Each SNP is assumed to be biallelic and can therefore have three discrete states: AA, AB and BB. AA and BB are homozygous genotypes and AB is heterozygous. Each of these three states can induce a disease or protect against it, so there are 2 × 3 = 6 possible combinations: diseased AA, diseased AB, diseased BB, protected AA, protected AB, or protected BB. Each weak classifier tries to predict the disease state of an individual correctly based on their genotype and the assumed effect (inducing or protecting) of that SNP. This method is called SNPBoost.
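The following is a minimal sketch of AdaBoost over such genotype-based weak classifiers. It reflects my reading of the description above rather than the authors' actual code: each weak classifier is a hypothetical triple (SNP index, genotype, sign) that predicts `sign` when the individual carries that genotype and `-sign` otherwise, with labels +1 for cases and -1 for controls.

```python
import math

GENOTYPES = ("AA", "AB", "BB")


def weak_predict(clf, sample):
    """Predict +1/-1 from a single SNP; clf is (snp_index, genotype, sign)."""
    snp, genotype, sign = clf
    return sign if sample[snp] == genotype else -sign


def snp_adaboost(samples, labels, rounds):
    """Greedy AdaBoost; returns a list of (alpha, weak_classifier) pairs."""
    n = len(samples)
    # Six candidate weak classifiers per SNP: 3 genotypes x 2 effect signs.
    candidates = [(j, g, s) for j in range(len(samples[0]))
                  for g in GENOTYPES for s in (+1, -1)]
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # Pick the weak classifier with the lowest weighted error.
        best, best_err = None, float("inf")
        for clf in candidates:
            err = sum(w for w, x, y in zip(weights, samples, labels)
                      if weak_predict(clf, x) != y)
            if err < best_err:
                best, best_err = clf, err
        best_err = min(max(best_err, 1e-10), 1.0 - 1e-10)
        if best_err >= 0.5:
            break  # no candidate is better than chance
        alpha = 0.5 * math.log((1.0 - best_err) / best_err)
        ensemble.append((alpha, best))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [w * math.exp(-alpha * y * weak_predict(best, x))
                   for w, x, y in zip(weights, samples, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, sample):
    """Weighted vote of the selected weak classifiers."""
    score = sum(alpha * weak_predict(clf, sample) for alpha, clf in ensemble)
    return 1 if score >= 0 else -1
```

The number of boosting rounds bounds the number of distinct SNPs in the final classifier, so the procedure doubles as feature selection.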

The weak classifiers are combined such that the classification errors of the first classifier are compensated as well as possible by the second classifier, and so forth. These weak classifiers are added one after another in order to gain a set of classifiers that together boost the classification. Selecting classifiers also controls the number of SNPs included in the final classifier. This is claimed to also find possible SNP-SNP interactions between the selected SNPs: if a weak classifier interacts positively with a previously selected one, it might be chosen because the interaction improves the classification.

The data set used in this analysis contained 127370 SNPs and 2032 individuals, with 1222 controls and 810 cases. The data was randomly divided into a training set and a test set, both containing 405 individuals. The different classifiers were trained using the training data, and the classification performance was assessed using the test data. Receiver operating characteristic (ROC) curves, i.e. the true positive rate versus the false positive rate, were obtained on the test set for each method.
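An ROC curve can be traced by sweeping a decision threshold over the classifier scores, as in this small sketch. It assumes labels of 1 for cases and 0 for controls, that both classes are present, and that a higher score means "more likely a case"; the function name is chosen for this example.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points
```

A classifier whose curve rises quickly towards the top-left corner separates cases from controls well, while the diagonal corresponds to random guessing.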

Linear SVM (LSVM) with preselected SNPs and SNPBoost both reached peak performance with a small number of SNPs. Gaussian SVM (GSVM) reached its peak performance with a slightly larger number of SNPs. The performance decreased with additional SNPs. SNPBoost outperformed the SVM with a linear kernel. Gaussian SVM outperformed linear SVM and SNPBoost when a preselected set of SNPs was used to train GSVM. With a very large number of SNPs, SNPBoost outperformed all other methods. In addition, with a small number of SNPs, the performance of the Gaussian SVM improved further when the subset of SNPs was chosen based on SNPBoost. The better selection of SNPBoost compared to LSVM might be due to the fact that the SNPBoost algorithm selects SNPs one at a time in order to increase the classification performance bit by bit, so these SNPs might simply fit together better.

Braenne et al. also investigated the biological interpretation of the results. With SVM and the preselected SNP sets, 14 SNPs were found within genes. Two of these genes can be linked through an additional gene, meaning that there might be some functional relationship between these genes. Of the 20 SNPs selected by the SNPBoost algorithm, 9 lie within genes, and two of these genes can be linked through an additional gene. Both methods thus revealed one set of three genes with possible functional relationships, but the relationship between the genes selected by SNPBoost might be an additional one. Future work is to identify whether the SNPs in these genes increase the disease risk.

Tuesday, June 21, 2011

Tom Griffiths: Discovering human inductive biases

Tom Griffiths from the Computational Cognitive Science Lab at the University of California, Berkeley presented his research on "Discovering human inductive biases". His talk was interesting as he delved into the topic with real-world examples and involved the audience in a series of question-and-answer rounds. The majority of his talk focused on answering a set of questions listed on one of his slides. Answering these questions would solve the problem of discovering human inductive biases.

Griffiths prefers to consider human learning as a black box that maps inputs to outputs. Hence, a strategy for solving this problem is to learn the mapping used by humans. He presented Bayes' theorem as a way to describe this mapping from input to output.

However, he showed that Bayes' rule cannot be applied directly because it is often very difficult to determine and express the priors humans use in decision making. He illustrated this with an interesting example involving the audience. The question was: a movie has made $90 million so far, how much will it make in total? The audience’s answers varied between $300 and $500 million. Similarly, another question was: a movie has made $6 million, how much will it make? The audience’s answers were around $10 million. To give a contrasting view, he gave a counterexample: if you see a 90-year-old man, how long do you think he will live? The answers were between 95 and 100 years. He followed that up with another question: you meet a 6-year-old boy, how long will he live? The audience’s answers were around 70.

From the answers to these questions, he emphasized that learning human inductive biases is hard because the answers/outputs differ even when the data/inputs are numerically the same. The moral of the example is that priors have a strong effect on predictions, so inductive biases can be inferred from human behavior if we can determine the relevant priors. Griffiths and colleagues performed a set of cognitive experiments and concluded that different priors (power-law, Gaussian and Erlang) were associated with different examples of human learning.

For example, a power-law prior was associated with the case of a movie making a certain amount of money, while predicting age was associated with a Gaussian prior. Thus, it is difficult to come up with a single strategy that can be used for many tasks. Hence, Griffiths and colleagues have proposed several models for many different tasks. Some of the notable ones include causal learning (Griffiths & Tenenbaum, 2009), category learning (Griffiths et al., 2008), speech perception (Feldman & Griffiths, 2007), and subjective randomness (Griffiths & Tenenbaum, 2003).
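The effect of the prior on such predictions can be illustrated with a small sketch. It is not taken from the talk: it assumes that the observed value t falls uniformly between 0 and the unknown total, so that the likelihood is 1 / t_total, and it predicts the posterior median on a discrete grid. The prior parameters (exponent 1.5, mean 75, standard deviation 15) are hypothetical values chosen only for illustration.

```python
import math


def posterior_median(t, prior, grid):
    """Median of p(t_total | t) on a discrete grid.

    Assumed likelihood: t falls uniformly in [0, t_total],
    so p(t | t_total) = 1 / t_total for t_total >= t.
    """
    weights = [(x, prior(x) / x) for x in grid if x >= t]
    total = sum(w for _, w in weights)
    running = 0.0
    for x, w in weights:
        running += w
        if running >= total / 2.0:
            return x
    return weights[-1][0]


def power_law(x, gamma=1.5):
    # Hypothetical power-law prior (e.g. for movie grosses).
    return x ** (-gamma)


def gaussian(x, mean=75.0, sd=15.0):
    # Hypothetical Gaussian prior (e.g. for life spans in years).
    return math.exp(-0.5 * ((x - mean) / sd) ** 2)


grid = [0.5 * k for k in range(1, 4001)]  # candidate totals 0.5 .. 2000

predictions = {
    t: (posterior_median(t, power_law, grid), posterior_median(t, gaussian, grid))
    for t in (6, 90)
}
for t, (pl, ga) in predictions.items():
    print(f"observed {t}: power-law prior -> {pl}, Gaussian prior -> {ga}")
```

Under the power-law prior the prediction is roughly a fixed multiple of the observed value, whereas under the Gaussian prior it stays near the prior mean until the observation exceeds it. The same numbers 6 and 90 therefore lead to very different predictions, which is the point of the audience experiment.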

Griffiths also presented the concept of iterated learning (Kirby, 2001) realised with Bayesian learners, where the distribution over hypotheses converges to the prior. They built a concept known as the “rational process model” (Sanborn et al., 2006) by approximating Bayesian inference and connecting the approximations to psychological processes. They use Monte Carlo mechanisms based on importance sampling (Shi, Griffiths et al., 2010) and “win-stay, lose-shift” to approximate Bayesian inference.
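The convergence to the prior can be checked numerically on a toy problem. The sketch below is an illustration under my own assumptions, not material from the talk: there are three hypotheses and two possible observations, each learner samples a hypothesis from its posterior given one observation produced by the previous learner, and the prior and likelihood values are made up. Instead of simulating individual learners, the code propagates the exact distribution over hypotheses from generation to generation.

```python
prior = [0.6, 0.3, 0.1]                       # hypothetical prior p(h)
likelihood = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # likelihood[h][d] = p(d | h)


def posterior(d):
    """p(h | d) by Bayes' rule."""
    joint = [prior[h] * likelihood[h][d] for h in range(len(prior))]
    z = sum(joint)
    return [j / z for j in joint]


def step(dist):
    """One generation: a teacher with hypothesis h emits d, the learner samples h' from p(h' | d)."""
    new = [0.0] * len(prior)
    for h, p_h in enumerate(dist):
        for d in range(2):
            post = posterior(d)
            for h2 in range(len(prior)):
                new[h2] += p_h * likelihood[h][d] * post[h2]
    return new


dist = [0.0, 0.0, 1.0]  # the first teacher holds the least probable hypothesis
for _ in range(50):
    dist = step(dist)
print(dist)
```

Whatever hypothesis the chain starts from, the distribution over hypotheses ends up at the prior, so the outcome of iterated learning reveals the learners' inductive bias rather than the initial data.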

Overall, the talk was interesting, comprising several real-world experiments and their mathematical formulation, and showing that Bayesian models of cognition provide a method to identify human inductive biases. However, the relationship of the priors and representations in those models to mental and neural processes was not transparent.

I personally liked the mathematical formulations of the problems and the development of models that focused more on mathematical and statistical methods than on mimicking the human learning process. These cognitive issues, along with several neural and other relevant issues, were also discussed in the panel discussion that immediately followed Griffiths' talk.

Panel discussants: Geoffrey Hinton, Joshua Tenenbaum, Thomas Griffiths and Aapo Hyvärinen

The panel discussion was attended by all the keynote speakers except John Shawe-Taylor and Riitta Hari and co-ordinated by Timo Honkela.

Schaffernicht et al. : Weighted Mutual Information for Feature Selection

The evening of the first day of the ICANN conference, i.e. the 14th of June, was reserved for the poster presentations. Of the papers accepted for publication in the ICANN’11 proceedings, around 50 were given the opportunity to be presented orally, while the remaining ones (around 60) were presented as posters. The poster session was held in the T-Building of the Computer Science department at Aalto University of Science, while the regular conference took place in the Dipoli Congress Center. There was enthusiastic participation from both poster presenters and other attendees of the conference. It was heartening to find that some of the oral presenters were also presenting their work as posters. This makes sense because poster sessions provide one of the best opportunities to discuss one's own research with fellow researchers and renowned professors in the field, which brings new perspectives and sometimes a new dimension to the research.

Out of the many posters, I decided to write about the one by Erik Schaffernicht and Horst-Michael Gross on Weighted Mutual Information for Feature Selection. There is no denying the importance of feature selection in learning algorithms. Hence, this research area is never saturated and always has openings for new ideas and methods. In this paper, the authors provide a simple trick to include only the relevant features and at the same time avoid redundant ones. Similar to other wrapper methods, they train a classifier on the features. However, they choose the next feature to include based on the misclassified samples rather than on all data samples. They assign weights to the samples and select the feature that maximizes the weighted mutual information.

This idea is similar to the well-known AdaBoost algorithm. The samples misclassified in the first round are given higher weights in the second round. This makes sense because correctly classified samples are already explained by the subset of features selected so far, and the crux of the problem is to find features that better classify the misclassified samples. They tested this method on different datasets from the UCI machine learning repository and also on some artificial datasets. Although it was not mentioned in the paper or shown on the poster, I asked them whether they were using the method on real-world datasets in a current project, and they told me that they had deployed it in a control system where the data dimension is in the thousands.
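To show how sample weights change which feature looks most informative, here is a minimal sketch of a weighted mutual information estimate for discrete features. It is my own simplification and not the estimator from the paper: the weighted joint distribution is just a weighted histogram, and the toy data and weights are invented.

```python
import math
from collections import defaultdict


def weighted_mutual_information(feature, labels, weights):
    """Mutual information I(X; Y) where each sample counts with its weight."""
    total = float(sum(weights))
    joint = defaultdict(float)
    px = defaultdict(float)
    py = defaultdict(float)
    for x, y, w in zip(feature, labels, weights):
        joint[(x, y)] += w / total
        px[x] += w / total
        py[y] += w / total
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0.0)


labels    = [0, 0, 0, 0, 1, 1, 1, 1]
feature_a = [0, 0, 0, 1, 1, 1, 1, 0]   # disagrees with the label on samples 3 and 7
feature_b = [0, 1, 0, 0, 1, 0, 1, 1]   # disagrees with the label on samples 1 and 5

uniform = [1] * 8
# Hypothetical weights after a round in which samples 3 and 7 were misclassified.
boosted = [1, 1, 1, 4, 1, 1, 1, 4]

for name, w in (("uniform", uniform), ("boosted", boosted)):
    print(name,
          weighted_mutual_information(feature_a, labels, w),
          weighted_mutual_information(feature_b, labels, w))
```

With uniform weights the two features are equally informative. Once the misclassified samples are weighted up, the feature that explains those samples scores clearly higher, so it would be selected next.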

Overall, a simple but very effective and intelligent trick to select features. It achieves computational efficiency by reducing training cycles significantly and also selects the set of best discriminating features.

Nandita Tripathi: Hybrid Parallel Classifiers for Semantic Subspace Learning

Searching large data repositories is an extremely important research problem due to the overwhelming information overload we are facing daily. The poster by Tripathi, Oakes and Wermter presents a hybrid parallel classification approach for searching a large data repository more efficiently. The poster presents an approach to a supervised learning problem: learning to classify text data when labeled training data exists.

Many ways of enhancing the predictive performance of classifiers by using them in a subspace of the original input space have been studied in recent years. These methods include, for example, the Random Subspace Method (RSM), which divides the original feature space randomly into lower-dimensional subspaces, and variants of the RSM that use some criterion to select the subspaces instead of random assignment.
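As a rough illustration of the random baseline, the following sketch splits the feature indices randomly into disjoint subspaces, following the description above. The function names and the fixed seed are my own choices, and other formulations of the RSM may draw overlapping random subsets instead.

```python
import random


def random_subspaces(n_features, n_subspaces, seed=0):
    """Randomly partition feature indices into n_subspaces disjoint groups."""
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    return [sorted(indices[k::n_subspaces]) for k in range(n_subspaces)]


def project(sample, subspace):
    """Keep only the features of a sample that belong to the given subspace."""
    return [sample[i] for i in subspace]


subspaces = random_subspaces(n_features=10, n_subspaces=3)
print(subspaces)
```

A separate classifier would then be trained on the projection of the data onto each subspace. The semantic approach of the poster replaces the random shuffle with a choice guided by topic information.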

The novel idea of Tripathi et al. is to use semantic information about the data to optimize the selection of subspaces. After learning the subspaces, a set of classifiers is then used to classify the data with respect to some topics in the new subspaces.

To study the approach, the Reuters data is used. The Reuters data provides multiple levels of topics for its documents. The broadest topics (e.g. education, computers, politics) are used as semantic information for learning the lower-dimensional subspaces with a maximum significance based method. After this, multiple classifiers are used within the learnt subspaces to classify the data with respect to more fine-grained topics (e.g. within the topic of education: schools, college, exams).

Tripathi et al. run experiments using several different algorithms (e.g. multilayer perceptrons, naive Bayes classifiers, random forests) as parts of their hybrid architecture. The hybrid architecture both improves classification results and decreases computation times.

Enrique Romero: Using the Leader Algorithm with Support Vector Machines for Large Data Sets

One of the problems of Support Vector Machines (SVM) is that applying them to large datasets is computationally expensive. The computational cost often grows proportionally to N^3, where N is the number of training samples.

Many different approaches to the problem have been proposed. For example, chunking and decomposition methods optimize the SVM with respect to subsets of the data to lower the computational cost.

Romero presents an approach that aims to reduce the computational cost by reducing the number of training samples. The data set is first clustered using the Leader algorithm, and then only the samples chosen as cluster leaders by the Leader algorithm are used for training the SVM.

The Leader algorithm uses a distance measure D and a predefined threshold T to partition the data. Data points that are within distance T of each other with respect to the distance measure D are clustered together, and the cluster is represented by one of its data points, which is then referred to as the leader. The Leader algorithm is very fast: it makes a single pass through the dataset. All areas of the input space are represented in the clustering solution.
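The single-pass clustering described above can be sketched as follows. This is a generic version of the Leader algorithm written for illustration, not Romero's implementation: it assigns each point to the first leader found within the threshold, uses the Euclidean distance by default, and the example points are made up.

```python
import math


def leader_clustering(points, threshold, distance=math.dist):
    """Single pass over the data: a point joins the first leader within
    `threshold`, otherwise it becomes the leader of a new cluster."""
    leaders = []
    assignment = []
    for p in points:
        for idx, leader in enumerate(leaders):
            if distance(p, leader) <= threshold:
                assignment.append(idx)
                break
        else:
            # No existing leader is close enough: p starts a new cluster.
            leaders.append(p)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.2), (0.0, 0.2), (10.0, 0.0)]
leaders, assignment = leader_clustering(points, threshold=1.0)
print(leaders)     # the reduced training set
print(assignment)  # cluster index of every original point
```

Only the leaders would be passed on to the SVM. Because the training cost grows roughly with N^3, keeping a tenth of the samples already reduces the cost by about a factor of a thousand.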

Reducing the size of the training set naturally decreases predictive performance but the computational cost decreases much more rapidly. As a future step, Romero proposes developing the Leader algorithm to preserve more data points close to the decision boundaries of the SVM.

Sunday, June 19, 2011

Joshua Tenenbaum: How to grow a mind: Statistics, structure and abstraction

Joshua Tenenbaum is an Associate Professor of Computational Cognitive Science at MIT. On Friday, the last day of ICANN 2011, he gave an inspiring plenary presentation about reverse-engineering learning and cognitive development.

He stated that the most perplexing quality of the brain from a machine learning perspective is its ability to grasp abstract concepts and infer causal relations from such sparse data, i.e. "how does the mind get so much from so little?". He gave an entertaining example of this by showing a grid full of pictures of computer-generated unidentifiable objects and naming three of them as "tufas". He then pointed at other objects in the grid and asked the audience whether each one was or wasn't a tufa. There was a strong consensus and the answers were quite confidently "yes" or "no".

To explore how this kind of inference could be possible, Tenenbaum focused on what he called abstract knowledge. His talk was then divided into three parts, answering three different questions about abstract knowledge.

How does abstract knowledge guide learning and inference from sparse data?

According to Tenenbaum, the mind learns and reasons according to Bayesian principles. Simply put, there exists some sort of generative model of data and hypotheses, and the probability of a certain hypothesis given the data is given by Bayes' rule. The abstract background knowledge affects the model through the available hypotheses and the prior probabilities given to these hypotheses. The likelihood gives the probability of the data given a hypothesis.
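Written out, with notation that I am introducing here for illustration (h for a hypothesis, d for the observed data and \mathcal{H} for the hypothesis space allowed by the background knowledge), Bayes' rule reads:

```latex
P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h' \in \mathcal{H}} P(d \mid h')\, P(h')}
```

Here P(h) is the prior, P(d | h) is the likelihood and P(h | d) is the posterior. Background knowledge enters in two places: it decides which hypotheses are in \mathcal{H} at all, and it sets P(h) for each of them.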

What forms does abstract knowledge take?

It doesn't seem feasible to assume that every logically possible hypothesis is somehow represented along with its prior and likelihood. The hypotheses need to be represented in a more structured way. As Tenenbaum puts it: "some more sophisticated forms of knowledge representation must underlie the probabilistic generative models needed for Bayesian cognition".

Causes and effects can be modeled in a general way with directed graphs. As an example, in a symptom-disease model we would have symptoms and diseases as nodes and edges running from the diseases to the symptoms. The role of background knowledge here would be to know that there are two kinds of nodes and that the edges always run from diseases to symptoms, in effect limiting the number of hypotheses to be considered.

On the other hand it seems that tree structured representations would be most effective for learning words and concepts from examples.

How is abstract knowledge acquired?

So it seems that abstract background knowledge is required to make learning possible. But how then is this background knowledge learned? How does one know when to use a tree-structured representation and when some other form is more suitable?

Tenenbaum presented the answer in hierarchical Bayesian models or HBMs. They enable hypothesis spaces of hypothesis spaces and priors on priors. More specifically, Tenenbaum proceeded to show how HBMs can be used to infer the form (e.g. tree, ring, chain) and the structure simultaneously. An impressive example was sorting synthesized faces varying in race and masculinity into a correct matrix structure, where race varied along one axis and masculinity along the other.


Clearly one of the goals of the talk was to establish that abstract background knowledge is essential in human learning. Its role is to constrain the logically valid hypotheses to make learning possible. Human learning was then formulated as Bayesian inference over richly structured hierarchical generative models.

Friday, June 17, 2011

Reichert, P., Series, P. and Storkey, A.. A Hierarchical Generative Model of Recurrent Object-Based Attention in the Visual Cortex

The deep Boltzmann machine (DBM) was proposed by Salakhutdinov & Hinton (2009) as a deep undirected probabilistic model. It differs from directed models, such as the deep belief network (DBN), in that a neuron in an intermediate layer is excited by signals from both the upper and lower layers (recurrent processing). It is, in some sense, closer to how a human brain works, as it is unlikely that each neuron is arranged to be activated only by signals from lower layers during recognition and only by signals from upper layers during generation (Hinton et al., 2006).

In this work, the authors considered DBM as a cortical model (Reichert et al., 2010), and tried to find empirical connections between DBM and the recurrent object-based attention.

The authors describe how some attentional theories suggest that higher cortical areas form representations that are specific to one object at a time. The paper experimentally explores how some properties of the DBM coincide with those theories. There are two main points on which the paper focuses:

(1) Recurrent processing helps the DBM (or the human brain, if you agree to accept the DBM as a valid cortical model) concentrate on meaningful objects when images contain noise, by letting higher layers tend to represent a single object at a time. This is in contrast to lower layers, which tend to encode a whole image (with multiple objects) into low-level features.
(2) A suppressive mechanism in the higher layers prevents the DBM from 'hallucinating' wrong objects by keeping lateral activations sparse.

In order to confirm those points, the paper uses various performance measures and inspection methods for the DBM. One such method is to inspect the states of a single layer of hidden neurons by clamping the layer to a specific state and sampling from the visible layer. This method revealed that recurrent processing (in contrast to a feed-forward sweep from the visible layer to the top layer) drives the higher layers to focus their attention on a single, more specific object at a time.

Additionally, the authors attempted a quantitative analysis by classifying cluttered data sets using the hidden states. For simple toy data sets, recurrent processing indeed turned out to outperform simple feed-forward processing. However, for a more realistic data set such as MNIST handwritten digits with clutter, this simple approach was not sufficient.

The authors explained that this difficulty arises because clutter in the image can let the higher layers 'hallucinate', so that one object is taken for (transformed into) another object. For instance, a cluttered digit 9 can be taken for a digit 8 during the recurrent processing in the higher layers.

As a naive (but, according to the paper, effective) remedy, they suggested initializing the biases to negative values to sparsify the hidden states, which possibly results in suppressed noise during the recurrent processing of the higher layers. The authors refer to this approach as an additional suppressive mechanism.
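Why a negative bias makes the hidden states sparse can be seen from the activation probability of a single binary unit. The sketch below is a simplification for illustration only: it looks at one layer driven from below, whereas a hidden unit in a DBM also receives input from the layer above, and the weights, input pattern and bias values are invented.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, bias):
    """Activation probability of each hidden unit: sigmoid(bias + sum_i v_i * w_ij)."""
    return [sigmoid(bias + sum(v * w for v, w in zip(visible, column)))
            for column in weights]


visible = [1, 0, 1, 1]
weights = [[0.5, -0.2, 0.3, 0.1],    # one list of incoming weights per hidden unit
           [2.0, 0.4, 1.5, 1.0],
           [-0.3, 0.8, 0.2, -0.1]]

active = {}
for bias in (0.0, -4.0):
    probs = hidden_probabilities(visible, weights, bias)
    # Indices of hidden units that are more likely on than off.
    active[bias] = [j for j, p in enumerate(probs) if p > 0.5]
    print(bias, [round(p, 3) for p in probs], active[bias])
```

With a zero bias, weakly driven units still switch on about as often as not. With a negative bias only the unit with strong evidence from its input remains likely to be active, which is the kind of suppression of weak, clutter-driven activity that the remedy aims for.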

These results show that DBM embodies a number of properties that can be somehow related to the attentional recurrent processing in the cortex. This work is meaningful, as it has shown that there is a middle point between neuroscience and machine learning where both fields are able to learn from each other.

Unfortunately, the authors trained the DBM using pre-training (Salakhutdinov & Hinton, 2009) only. As pre-training only greedily finds a solution that is close to a local maximum-likelihood solution, it can be debated whether the experimental results obtained in this paper indeed reflect the true nature of the DBM.


Hinton, G., Osindero, S., Teh, Y.-W.. A Fast Learning Algorithm for Deep Belief Nets. Neural Comp., Vol. 18, No. 7. (1 July 2006), pp. 1527-1554.
Salakhutdinov, R., Hinton, G.. Deep Boltzmann machines. Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS). Volume 5. (2009).
Reichert, D. P., Series, P., Storkey, A. J.. Hallucinations in Charles Bonnet Syndrome induced by homeostasis: a Deep Boltzmann Machine model. Advances in Neural Information Processing Systems 23 (2010).

Ramya Rasipuram and Mathew Magimai Doss: Improving Articulatory Feature and Phoneme Recognition using Multitask Learning

Articulatory features define properties of speech production, i.e. they describe the basic sounds we make. Phonemes, on the other hand, are the smallest units of sound used to form meaningful speech. In Finnish basically all the phonemes correspond to a letter, whereas in English they do not. However, phonemes are used to model pronunciation, and they are therefore cross- and multilingual.

The authors ran experiments on their model using the TIMIT corpus, which contains speech from American English speakers of different sexes and dialects. The corpus also contains the correct phonemes used in the speech. The following methods for phoneme recognition were applied:

  1. Independent MLP (multilayer perceptron)
  2. Multitask MLP
  3. Phoneme MLP

Independent MLP is a standard method, whereas (2) and (3) are novel methods presented in their paper. In each method, articulatory features were learned from the audio, and an MLP network was trained to predict the phonemes. In the independent MLP the classifiers are independent. However, since the features are actually interrelated, multitask learning was considered necessary. The prediction accuracies (speech to phoneme) for the independent, multitask and phoneme MLPs were 67.4%, 68.9% and 70.2%, respectively.

Additionally, a hierarchical version was presented for each method. They performed better than the original ones, maintaining the order of performance.

Rasipuram presented their work to be continued with:

  • Automatic speech recognition studies
  • Different importance weights for features
  • Adding gender and rate of speech as features

The talk gained some critique, as one researcher in the audience stated that performance better than this had been achieved already years ago. This wasn't really addressed by the author.

Kauppi et al: Face Prediction from fMRI Data during Movie Stimulus: Strategies for Feature Selection

The topic of the poster was to predict from a person's fMRI (functional magnetic resonance imaging) data whether or not they are seeing a face or faces in a movie. In an fMRI test setup, a stimulus, in this case the movie Crash, is presented to the test subject. The test subject's brain activity is measured, resulting in high-dimensional brain activity data that contains complex interactions. In the data, the brain is divided into voxels, i.e. cubes or 3D pixels.

Similar research had been done before, but there the test subjects were shown a set of movie clips, instead of a whole movie. The authors claim that showing a whole movie results in "more naturalistic" data.

The problem is a classification task with two classes: "face" and "non-face". It was solved using ordinary least squares (OLS) for regression. However, since there was a lot of data, OLS couldn't be used in the conventional way. The prediction was done using only a subset of the features, which were selected using prior information and different methods, resulting in four regression models:

  • Stepwise Regression (SWR)
  • Simulated Annealing (SA)
  • Least Absolute Shrinkage and Selection Operator (LASSO)
  • Least Angle Regression (LARS)

Of these, LASSO and LARS are regularized to be sparse, possibly resulting in less overfitting.

Figure: The best prediction acquired with LARS compared to the (roughly binary) annotation. As can be seen, a binary prediction (1 when > 0.5) would match the annotation well. The locations of 6 features in three brain regions are also visualized.
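The binarization mentioned in the caption can be sketched in a few lines. The regression outputs and annotation below are made-up numbers, not values from the poster, and the helper names are only illustrative.

```python
def binarize(predictions, threshold=0.5):
    """Turn continuous regression outputs into face (1) / non-face (0) labels."""
    return [1 if p > threshold else 0 for p in predictions]


def accuracy(predicted, annotated):
    """Fraction of time points where the predicted label matches the annotation."""
    hits = sum(1 for p, a in zip(predicted, annotated) if p == a)
    return hits / len(annotated)


predictions = [0.1, 0.7, 0.9, 0.45, 0.2, 0.55]   # hypothetical regression outputs
annotation = [0, 1, 1, 1, 0, 1]                   # hypothetical face annotation
predicted_labels = binarize(predictions)
score = accuracy(predicted_labels, annotation)
print(predicted_labels, score)
```

Treating the regression output this way turns the least-squares model into a two-class classifier, which is how the problem was posed in the first place.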

The human brain is divided into different regions with different tasks. This study provided a natural way (at least for a computer scientist) to find out which regions are associated with face recognition and can thus be used in the prediction. In their paper, the authors say, "our results support the view that face detection is distributed across the visual cortex, albeit the fusiform cortex has a strong influence on face detection."

Thursday, June 16, 2011

Heess, N., Le Roux, N. and Winn, J.. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs

This work (Heess et al., 2011) was presented at ICANN 2011 as a part of the poster session.

The main aim of this paper is to show that a generative model based on restricted Boltzmann machines can be used to distinguish a foreground object (an object of interest) from a background image.

The proposed model starts from a layer of image pixels corresponding to a single image, with two directed edges going forward to two separate layers that describe a foreground object and a background image, respectively. Each of those layers is then connected by undirected edges to a separate layer of latent variables, forming a restricted Boltzmann machine. There is also an additional set of binary variables that denotes the mask of the foreground object in the image; it is connected by undirected edges to the same latent variables as the foreground layer.

In other words, there are two RBMs: one that jointly models the appearance and shape of a foreground object (denoted fRBM from now on for simplicity) and one that models a background image (denoted bRBM). Both are conditioned on the original image.
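The role of the binary mask can be illustrated with a tiny sketch. It rests on my own assumption about how the three layers combine, namely that each pixel is taken from the foreground where the mask is 1 and from the background where it is 0. The pixel values are invented, and the RBMs that would generate each layer are left out.

```python
def compose(foreground, background, mask):
    """Pixelwise mix: mask 1 selects the foreground pixel, mask 0 the background pixel."""
    return [m * f + (1 - m) * b for f, b, m in zip(foreground, background, mask)]


# A hypothetical 2 x 3 image, flattened row by row.
foreground = [0.9, 0.8, 0.7, 0.9, 0.8, 0.7]   # appearance of the object
background = [0.1, 0.1, 0.2, 0.2, 0.1, 0.1]   # appearance of the scene
mask       = [0,   1,   1,   0,   1,   0]     # shape of the object

image = compose(foreground, background, mask)
print(image)
```

Segmentation is then the inverse problem: given only the image, infer the mask that best explains it under the foreground and background models.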

This approach suggests that when good generative models are available for two distinct types of images (or, in fact, any other kinds of data sets), it is possible to use them for separating a mixed image (in this case, simply foreground + background). Also, considering the depth of the proposed model (a directed layer + an undirected layer), it can be considered one of the early approaches to applying deep learning to image segmentation tasks; see (Socher et al., 2011) for another possibility.

One important contribution of this approach is that it does not require explicit ground-truth segmentation of the training samples to train the model. Instead, the authors initialize bRBM by training it with images that can easily be considered backgrounds. Intuitively, this method drives fRBM to learn regularities found in the foreground objects of the training samples, while background clutter is considered to be already well modeled by bRBM. This is a neat trick, but they needed some more tricks in the learning process to overcome some apparent problems, such as training samples having regular structure in the background (e.g. photos of people taken in a single space).

The experimental results are impressive. However, more experiments on other data would have helped readers to understand the value of the proposed model and learning method. The authors suggest interesting future research directions, such as replacing the RBMs with deep models, including a few ground-truth segmentations to turn the method into semi-supervised learning, and adding another layer of hidden nodes immediately after the original image layer.

Heess, N., Le Roux, N. and Winn, J.. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. ICANN 2011.
Le Roux, N., Heess, N., Shotton, J., Winn, J.. Learning a Generative Model of Images by Factoring Appearance and Shape.

Wednesday, June 15, 2011

Riitta Hari: Towards Two-Person Neuroscience

Prof. Riitta Hari kicked off ICANN'11 with her invited talk "Towards Two-Person Neuroscience". So far, research on the human brain has mostly focused on the study of a single brain. Humans, however, are social creatures, whose thoughts and actions are reflected by the other members of the community. In virtually any human culture, isolation is used as a punishment, not only for children but also for adults.

We all know that the interaction with other people affects our mood and thoughts very strongly. While an individual is interacting with another person, the brains of the two persons become coupled as one's brain analyzes the behavior of the other and vice versa. This is why the neuroscience community is now looking towards a pair instead of an individual as a proper unit of analysis.

There have already been studies on humans under controlled interaction, such as a movie or a computer game. While watching a movie, the brains of individual viewers have been shown to be activated in a very synchronous fashion. A game against a human opponent activates the brain differently from a game against a computer, which is also reflected in the reported feelings of the players.

Mirroring is a phenomenon that can be studied with existing technology. We feel pain when we are shown a picture of a suffering person. Ludwig Wittgenstein already noted that "The human body is the best picture of the human soul". How one individual's feelings tune into another person's feelings is a more complicated question. It is a combination of the following factors:

  • similar senses, motor systems and the brain that the individuals have

  • the experience that they collect throughout their lives, and

  • the beliefs they test by acting in the community.

Machine learning steps in for the analysis of the high-dimensional data produced by the functional measurement technologies. Dimensionality reduction methods such as independent component analysis (ICA) extract noise-free components that can potentially be biologically interpreted.

So far in most of the studies of human interaction, only the activity of one brain has been measured regardless of the presence of the other interacting person. Soon, however, accurate measurements of several subjects at a time will be possible, and that will most likely push for a leap in the development of computational data fusion techniques. Then, we will not only have a link between a stimulus and a brain image but between a stimulus and images of several subjects' brains.

When the focus of brain research moves towards the analysis of two or more interacting subjects, efficient multi-view methods will be needed. Thus, multi-view learning is currently a hot area of machine learning research.

Prof. Hari's message to the ICANN audience was that the analysis remains the bottleneck in brain research. As methodological researchers, we should next consider the opportunities opened by the new experiment settings and measurement technologies, and see how to learn more from the data.

Tuesday, June 7, 2011

ICANN 2011 blog created

This blog is intended to provide information about and experiences from the International Conference on Artificial Neural Networks 2011. ICANN 2011 is co-located with WSOM 2011 conference and a similar blog has also been created for WSOM at