ICANN 2011
Wednesday, June 22, 2011
K. A. Raftopoulos: Visual Pathways for Shape Abstraction
Raftopoulos and Kollias propose a method that combines the skeleton abstraction of a shape with curvature information. This curvature-skeleton conveys both local and global shape information, which improves the recognition capability of neural network classifiers. The method builds on several layers of neurons that imitate the boundary detection performed by cortical layers: layers of perceptrons compute the edges and then the curvature in parallel. The novel contribution is an additional layer that combines the points of high curvature with the medial axis points of the shape.
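The curvature-skeleton is the interesting construct here. Below is a minimal NumPy/scikit-image sketch of the underlying geometric idea (medial-axis points plus high-curvature boundary points); it is my own reconstruction of the geometry, not the authors' perceptron-layer implementation, and it assumes a clean binary mask as input.

```python
import numpy as np
from skimage import measure
from skimage.morphology import medial_axis

def curvature_skeleton(mask, top_k=20):
    """Return medial-axis points plus the top-k high-curvature boundary points."""
    # Medial axis (skeleton) of the binary shape.
    skel_points = np.argwhere(medial_axis(mask))

    # Longest closed contour of the shape boundary.
    contour = max(measure.find_contours(mask.astype(float), 0.5), key=len)

    # Discrete curvature from first and second derivatives along the contour.
    dx, dy = np.gradient(contour[:, 1]), np.gradient(contour[:, 0])
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2 + 1e-12) ** 1.5

    # Keep the k boundary points of highest curvature.
    high_curvature = contour[np.argsort(kappa)[-top_k:]]
    return skel_points, high_curvature
```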
The experiments were conducted by training a network with two hidden layers to classify 2-D images into two categories. They used 500 grayscale images of hands and rabbits from the KIMIA silhouette database, all of which the network could classify correctly. The generalization ability was then tested by applying partial occlusion, deformations and missing parts to 50 images. Recognizing such deformed images is usually a very difficult task, yet their network achieved a correct recognition rate of 91.6 %. A network without the curvature-skeleton information achieved only 61.5 %.
This presentation/paper was really intriguing to me, especially because it incorporated such strong biological connections into the model with good results. The only shortcoming is that they did not compare (or at least did not show) their methodology against other state-of-the-art image recognition methods. Using only two image categories also seems tentative, although the method looks very promising.
M. Oubbati, J. Frick, and G. Palm: A Distributed Behavioral Model Using Neural Fields
The poster itself doesn't go very deep into the theory; it mainly consists of pictures of the steering behaviors and the emergent global behavior. The basic equations of a 1-D dynamic neural field (DNF) are shown, along with the stimulus design of the steering behaviors. Every boid shares the same encoded steering stimulus design, which is determined quite intuitively. In the paper the authors also present the implementation in detail.
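For reference, here is a minimal discretized sketch of an Amari-style 1-D dynamic neural field; the interaction kernel, nonlinearity and parameters are illustrative and not taken from the poster.

```python
import numpy as np

def simulate_dnf(stimulus, steps=200, dt=0.1, tau=1.0, h=-2.0):
    """Evolve the field u(x) under lateral interaction w and external stimulus S(x)."""
    n = len(stimulus)
    x = np.arange(n)
    # Mexican-hat interaction kernel on a ring: local excitation, broader inhibition.
    d = np.minimum(np.abs(x[:, None] - x[None, :]), n - np.abs(x[:, None] - x[None, :]))
    w = 2.0 * np.exp(-d**2 / (2 * 3.0**2)) - 1.0 * np.exp(-d**2 / (2 * 9.0**2))

    u = np.full(n, h, dtype=float)
    for _ in range(steps):
        f = 1.0 / (1.0 + np.exp(-u))            # sigmoidal firing rate
        du = -u + (w @ f) / n + stimulus + h    # field dynamics
        u += dt / tau * du
    return u

# A single bump of activation should emerge around the stimulus peak,
# which is what would encode a selected steering direction.
field = simulate_dnf(5.0 * np.exp(-(np.arange(100) - 60)**2 / 50.0))
```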
The results show that swarm behavior can be implemented with neural fields. The worth of this exercise still eludes me, and the choice of neural fields is not elaborated anywhere. In comparison, neural fields are computationally heavy, and one can achieve the same results with ordinary object-oriented programming with ease. Still, the work is really an exploratory investigation, and perhaps the use of neural fields can be extended further.
Aapo Hyvärinen: Brain Imaging at rest: the ultimate neuroscience data set?
Measurements for studying the brain are traditionally done with electroencephalography (EEG), magnetoencephalography (MEG) or functional magnetic resonance imaging (fMRI). Supervised methods cannot be used to analyze this data, and the most popular unsupervised method is independent component analysis (ICA), which finds components by maximizing the sparsity of a given variable. Hyvärinen also presented a spatial version of ICA, often used with fMRI, and showed how it could be applied to MEG. ICA has been used to find resting-state networks in fMRI with good results, and those results were very similar to ones acquired from research with a very complex stimulus: movies.
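As a pointer for readers unfamiliar with ICA, a generic FastICA call looks roughly like this (plain scikit-learn usage on placeholder data, not Hyvärinen's actual spatial/temporal pipeline):

```python
import numpy as np
from sklearn.decomposition import FastICA

X = np.random.randn(1000, 50)        # placeholder for measured data (observations x signals)
ica = FastICA(n_components=20, random_state=0)
S = ica.fit_transform(X)             # estimated independent components
A = ica.mixing_                      # mixing matrix (spatial or temporal patterns)
```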
Hyvärinen highlighted the importance of testing the significance of the results. ICA itself does not provide information about the reliability of its results, but there are ways to test it statistically: run a separate ICA on several subjects and pick the significant components that appear in two or more subjects. It is possible that not all significant components appear in all subjects. Part of the analysis is seeking connections between the measured variables; Hyvärinen explained different approaches, for example structural equation models, and when they can be estimated.
Exploratory data analysis with ICA could bring a better understanding of how the brain functions. Hyvärinen proposes that these methods could be used more in studies where complicated stimuli are used. He admits that speaking of the ultimate data set is an overstatement, and until we can properly do two-person neuroscience, we cannot fully understand the human brain, as Riitta Hari said in her plenary talk.
Antti Ajanki and Samuel Kaski: Probabilistic Proactive Timeline Browser
Valero et al, Complex-Valued Independent Component Analysis of Natural Images
http://cl.ly/2s420N1C2W281B3W1V15
In this paper, the authors assume a modified distribution over the phase information of natural images, which gives a better fit to the observed phase statistics and therefore a more accurate model.
Groot et al.: Learning from Multiple Annotators with Gaussian Processes
Among the ICANN'11 posters, Perry Groot et al. proposed an approach for learning a consensus from multiple unreliable annotators. In their paper “Learning from Multiple Annotators with Gaussian Processes” they present a way of using noisy target value information from several sources to regress observed data onto these target values.
The problem can be considered as a generalization of the multiple annotators' classification problem to the case of continuous target values. An example of the multiple annotators' problem is Amazon's Mechanical Turk, an Internet service for farming out simple tasks to users who receive a small payment for the trouble. The accuracy of the users, however, varies depending on how much effort they put into the task and on how experienced they are. The idea of the proposed model is that this variation between users (i.e. annotators) can be learned from the data in order to improve performance in the overall task of fusing the results from multiple users.
The authors solve the multiple annotators' problem with a slightly modified Gaussian process (GP) model. A Gaussian process is a widely recognized nonparametric Bayesian model defined through the covariance of a collection of random variables. The standard GP model learns a latent variable representing the variance of the observed target variables. The trick in this paper is to make this latent variable source-specific: knowing the source (user/annotator) of each annotation, the model can learn a source-specific latent variable representing the uncertainty in the annotations from each source.
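Here is a minimal NumPy sketch of the central idea as I understood it: GP regression in which the noise variance on each observation depends on which annotator produced it. The variable names and kernel choices are mine, not the paper's, and the annotator variances are simply given rather than learned.

```python
import numpy as np

def gp_predict(X, y, annot, noise_var, X_star, lengthscale=1.0, signal_var=1.0):
    """annot[i] indexes the annotator of y[i]; noise_var[a] is that annotator's
    (assumed or learned) noise variance."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return signal_var * np.exp(-0.5 * d2 / lengthscale**2)

    K = k(X, X) + np.diag(noise_var[annot])   # per-annotation noise on the diagonal
    alpha = np.linalg.solve(K, y)
    return k(X_star, X) @ alpha               # predictive mean

# Example: two annotators labelling the same 1-D function, one much noisier.
X = np.random.rand(40, 1)
annot = np.repeat([0, 1], 20)
noise = np.array([0.01, 1.0])
y = np.sin(4 * X[:, 0]) + np.random.randn(40) * np.sqrt(noise[annot])
mean = gp_predict(X, y, annot, noise, np.linspace(0, 1, 100)[:, None])
```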
In the study, the authors compare the proposed multi-annotator model to the standard GP model learned with pooled data, individual annotators' data, and weighted individual annotators' data. The proposed model outperformed the comparison methods on the UCI 'housing' data set, for which the authors manually created annotations. In principle, the GP framework also applies to classification problems, where the target variables are binary, but an exact solution for that case remains intractable.
The multiple annotators' problem undoubtedly remains a central question in machine learning. Services such as Amazon's Mechanical Turk have shown that there is, and will continue to be, a need for computational methods for fusing data from several uncertain sources. This work is a nice, simple tweak to a widely recognized probabilistic model, and despite the straightforward nature of the solution, it provides promising results. Gaussian processes have been a field of intensive research over the past ten years; at ICANN'11 there was at least one other GP poster in addition to this one.
Geoffrey Hinton: Learning structural descriptions of objects using equivariant capsules
John Shawe-Taylor: Leveraging the data generating distribution for learning
John Shawe-Taylor from University College London (UK) gave an interesting talk about prediction theory. Prediction theory makes statements about the future error rate of a learned classifier; the goal is to find lower and upper bounds, i.e. a confidence interval, on that error rate.
In classification, the aim is to find a function capable of predicting the output given the input. Data samples (input-output pairs) are assumed to come independently from some unknown distribution. The complexity of the input data distribution, such as high dimensionality, poses several challenges to learning a classifier. Good learning algorithms do not overfit: they learn the underlying distribution but do not model the random noise in the data. Often the correct model complexity is found by model selection. Model selection and assessment of model performance are usually done by looking at the incorrectly classified instances in the training or test data, i.e. the training or test error. The best model is the one with the lowest error rate, indicating that it generalizes well to future data.

Because the underlying data distribution is unknown, the true error rate is not known; only the empirical error can be estimated from the data samples. Although the true error, i.e. the generalization error, is not observable, its bounds can be obtained. A bound tells how well a classifier performs on data that is generated from the same distribution but has not yet been seen. Both upper and lower bounds for the true error rate can be determined. For example, an upper bound is stated as follows: with probability 1-δ over an independently and identically distributed (i.i.d.) draw of some sample, the expected future error rate is bounded by a function of δ and the observed sample. Here δ denotes the significance level; it is the probability that the observed sample is unusual.
Shawe-Taylor concentrated in his talk on PAC-Bayes learning (PAC stands for probably approximately correct), which borrows ideas from both the frequentist and the Bayesian statistical learning frameworks. In frequentist learning, the only assumption made is that the observed data are i.i.d. Frequentist statistical analysis defines confidence intervals for learned model parameters, which determine the region in which the estimate will fall a fraction 1-δ of the time when the data is sampled repeatedly from the distribution. In contrast, in Bayesian statistical analysis a prior distribution is defined over the model parameters, and the probability of the model parameter values can be read from the posterior distribution of the parameters. This leads to a more detailed probabilistic prediction. However, in Bayesian data analysis the prior needs to be chosen, which makes the analysis subjective, whereas a more objective analysis would often be desirable. Moreover, if the prior is not correct, the final result will be biased.
A PAC-Bayes bound is a frequentist training-set bound that borrows ideas from the Bayesian analysis. It can be applied to stochastic classifiers, such as support vector machines (SVMs), to optimize the bounds on the true error rate. Consider the weight vector of an SVM classifier and its ability to classify the training samples correctly. The set of all possible weight vectors forms a weight space, or version space. For a given sample, there is a weight vector that separates the version space into the region where that sample is correctly classified and the region where it is not. There may exist a region where all of the samples are classified correctly; in this region a hypersphere can be placed so that its volume is maximal. The volume of the hypersphere can be used as a measure of generalization, and it defines the bound. The volume of the hypersphere is also connected to the model evidence, i.e. the data likelihood given the model. Evidence can be used, for example, for model selection. In fact, the volume of the hypersphere equals the evidence under a uniform prior distribution, and a large value of the evidence ensures good generalization for a classifier.
The most important lesson to take from Shawe-Taylor's talk is the PAC-Bayes theorem. The PAC-Bayes theorem involves a class of classifiers together with a prior and a posterior distribution over that class. The PAC-Bayes bound holds for all choices of posterior, so the posterior does not need to be the classical Bayesian posterior; often the posterior is chosen to give the best bound. Moreover, the bound holds for all choices of prior, so its validity is not affected by a poor prior. This is in contrast to standard Bayesian analysis, which only holds if the prior assumptions are correct.
Given a test input x, a classifier c is first drawn from the posterior, and the output is c(x). The PAC-Bayes theorem bounds the Kullback-Leibler divergence between the average training error rate and the average true error rate, where the averages are taken by marginalizing over the posterior distribution, giving the method its Bayesian flavour. The KL divergence measures the closeness, or misfit, between the true error and the training error: the smaller the bound on this divergence, the tighter the statement about the true error.
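For concreteness, one standard form of the PAC-Bayes theorem (this is the textbook statement, not necessarily the exact variant shown on the slides) reads:

```latex
% For any prior P fixed before seeing the m i.i.d. training samples,
% with probability at least 1 - \delta, simultaneously for all posteriors Q:
\mathrm{KL}\!\left( \hat{e}_S(Q) \,\Big\|\, e_D(Q) \right)
  \;\le\; \frac{ \mathrm{KL}(Q \,\|\, P) + \ln\frac{m+1}{\delta} }{ m },
% where \hat{e}_S(Q) and e_D(Q) are the average training and true error
% rates of the Gibbs classifier drawn from Q.
```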
Video lectures by John Shawe-Taylor related to the topic can be found at: http://videolectures.net/john_shawe_taylor/
SNPBoost: Interaction Analysis of Risk Prediction on GWA Data
Among the few works related to bioinformatics was the poster by Ingrid Braenne et al. from Lübeck, Germany, titled SNPboost: Interaction Analysis of Risk Prediction on GWA Data. As is commonly known, the human genome consists of DNA, in which the order of the four nucleotides A, T, C and G determines the instructions for building the structural and functional parts of our bodies. The human genome contains over three billion nucleotides divided into 23 pairs of chromosomes. Small variations at certain locations of the genome, i.e. genomic loci, make us unique individuals. For example, a single-nucleotide variation at a certain locus between individuals is called a single-nucleotide polymorphism (SNP). The human genome contains hundreds of thousands of SNPs. If a SNP is within a gene, the different variants of that gene are called alleles. An individual can be a homozygote or a heterozygote with respect to an allele: a homozygote has the same variant of the gene in both chromosomes of a pair, whereas a heterozygote has different variants in the two chromosomes. A SNP is usually biallelic, meaning that there are only two high-frequency nucleotide variants of that SNP in the population. The frequencies of the different alleles usually vary between populations, for example between ethnic groups.
Different variants of a genomic locus, for example a gene, might be responsible for the development of a certain disease. Some diseases can be associated with a single SNP, such as sickle cell anemia, but most common diseases (the so-called national diseases), such as type I and type II diabetes, myocardial infarction and Crohn's disease, are complex or multi-gene diseases. Complex diseases are caused by several disease-inducing SNPs interacting with each other. SNP profiles of individuals can be obtained using sequencing techniques, and they are routinely analyzed using genome-wide association (GWA) analysis. One of the goals in GWA is to associate disease risk with certain SNPs. In addition, the treatment of a disease could be tailored for each individual based on their SNP profile, leading to personalized medicine. However, disease diagnostics, or prediction of disease risk based on SNP profiles, is unfortunately very difficult and uncertain. It is also interesting to determine the functional role of a SNP associated with a certain disease. A SNP can fall into coding regions, i.e. genes, but many SNPs fall into non-coding regions, which makes their functional role harder to determine. Often the interaction of multiple SNPs is not investigated in GWA studies, as GWA only provides the statistical significance of disease association for single SNPs. However, one genetic variant can have only a limited effect on risk in complex diseases, and the disease effect may only become visible in the interaction between multiple SNPs. The work of Braenne et al. attempts to find solutions to this. In GWA studies the data are very high-dimensional and the sample size is small in comparison, which causes difficulties for computational analysis and makes interpretation of the results challenging. Therefore some feature selection is often needed, an issue also addressed in the work of Braenne et al.
Classifiers such as the support vector machine (SVM) have been successfully used, for example, to improve risk prediction for type I and type II diabetes from SNP profiles. The SVM is a standard benchmark classification method, particularly useful for multivariate data sets. An SVM determines a hyperplane in a higher-dimensional feature space that separates two classes with maximum separation, i.e. with a maximum margin; by finding a linear separator in the feature space, the classification boundaries become non-linear in the original variable space. In the work of Braenne et al., the authors studied the performance of two types of classification tools, SVMs and boosting, for classifying healthy and diseased SNP profiles. Two versions of the SVM were used, namely SVMs with linear and Gaussian kernels. When training an SVM classifier, the authors did feature selection by choosing a subset of SNPs based on the statistical significance of their association with the disease, i.e. their p-values from the GWA analysis. Using a preselected subset of SNPs might, however, miss important interaction effects in the data, so the selection of SNPs for a classifier should be improved somehow.
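A rough sketch of that comparison setup (my own reconstruction with scikit-learn and SciPy on placeholder genotypes, not the authors' code): preselect SNPs by univariate association p-value, then train linear and Gaussian-kernel SVMs on the selected columns.

```python
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.svm import SVC

def preselect_snps(X, y, n_keep=100):
    """X: individuals x SNPs with genotypes coded 0/1/2; y: case/control labels."""
    pvals = []
    for j in range(X.shape[1]):
        table = np.array([[np.sum((X[:, j] == g) & (y == c)) for g in (0, 1, 2)]
                          for c in (0, 1)]) + 1      # +1 avoids empty cells
        pvals.append(chi2_contingency(table)[1])     # association p-value of SNP j
    return np.argsort(pvals)[:n_keep]

X = np.random.randint(0, 3, size=(400, 1000))        # placeholder genotype matrix
y = np.random.randint(0, 2, size=400)
cols = preselect_snps(X, y)
linear_svm = SVC(kernel="linear").fit(X[:, cols], y)
gaussian_svm = SVC(kernel="rbf").fit(X[:, cols], y)
```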
The boosting method used in the study of Braenne et al. is AdaBoost, the most popular boosting algorithm. The idea in boosting is to combine several weak classifiers into one strong classifier. In this work, the set of weak classifiers contained six classifiers for each SNP. Each SNP is assumed to be biallelic and can therefore have three discrete states: AA, AB and BB, where AA and BB are homozygous and AB is heterozygous. Each of these genotypes can either induce the disease or protect against it, so there are 2x3 = 6 possible weak classifiers per SNP: disease-inducing AA, AB or BB, and protective AA, AB or BB. A weak classifier tries to predict the disease state of an individual based on their genotype and the assumed state (inducing or protecting) of that SNP. This method is called SNPBoost.
The weak classifiers are combined such that the classification errors of the first classifier are compensated as well as possible by the second classifier, and so forth. The weak classifiers are added one after another to obtain a set of classifiers that together boost the classification. The selection of classifiers also controls the number of SNPs included in the final classifier. This is claimed to also find possible SNP-SNP interactions between the selected SNPs: if a weak classifier interacts positively with a previously selected one, it may get chosen, because the interaction improves the classification.
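The following is a minimal AdaBoost-style sketch in the spirit of SNPBoost; it is my own simplification, not the authors' algorithm, and in particular the form of the weak classifiers and the re-weighting rule are only assumed.

```python
import numpy as np

def snpboost(X, y, rounds=20):
    """X: individuals x SNPs, genotypes in {0,1,2}; y in {-1,+1}."""
    n, m = X.shape
    w = np.ones(n) / n
    chosen = []
    for _ in range(rounds):
        best = None
        for j in range(m):
            for g in (0, 1, 2):
                for sign in (-1, 1):                  # genotype g induces (+1) or protects (-1)
                    pred = np.where(X[:, j] == g, sign, -sign)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, j, g, sign, pred)
        err, j, g, sign, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)         # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)                # up-weight misclassified samples
        w /= w.sum()
        chosen.append((j, g, sign, alpha))
    return chosen
```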
The data set used in the analysis contained 127,370 SNPs and 2,032 individuals, with 1,222 controls and 810 cases. The data was randomly divided into a training and a test set, each containing 405 individuals. The classifiers were trained on the training data and the classification performance was assessed on the test data. Receiver operating characteristic (ROC) curves, i.e. the true positive rate versus the false positive rate, were computed on the test set for each method.
The linear SVM (LSVM) with preselected SNPs and SNPBoost both reached their peak performance with a small number of SNPs, while the Gaussian SVM (GSVM) peaked with a slightly larger number of SNPs; performance decreased as more SNPs were added. SNPBoost outperformed the SVM with a linear kernel. The Gaussian SVM outperformed both the linear SVM and SNPBoost when a preselected set of SNPs was used to train it, but with a very large number of SNPs, SNPBoost outperformed all other methods. In addition, with a small number of SNPs, the performance of the Gaussian SVM improved further when the subset of SNPs was chosen based on SNPBoost. The better selection of SNPBoost compared to the LSVM may be due to the fact that the SNPBoost algorithm selects SNPs one at a time to incrementally improve the classification, so the selected SNPs may simply fit together better.
Braenne et al. also investigated the biological interpretation of the results. With the SVM and preselected SNP sets, 14 SNPs were found within genes; two of these genes can be linked through an additional gene, suggesting a possible functional relationship between them. Of the 20 SNPs selected by the SNPBoost algorithm, 9 lie within genes, and again two of the genes can be linked through an additional gene. Both methods revealed one set of three genes with possible functional relationships, but the gene relationship found by SNPBoost might be an additional one. Future work is to identify whether the SNPs in these genes increase the disease risk.
Tuesday, June 21, 2011
Tom Griffiths: Discovering human inductive biases
Griffiths prefers to consider human learning as a black box that maps inputs to outputs. A strategy for studying it is therefore to learn the mapping used by humans, and he presented Bayes' theorem as a way to model this mapping from input to output.
However, he showed that Bayes' rule cannot be applied directly, because it is often very difficult to determine and express the priors humans use in decision making. He also presented an interesting example involving the audience. The question was: a movie has made $90 million so far; how much will it make in total? The audience's answers varied between $300 and $500 million. Similarly: a movie has made $6 million, how much will it make? The answers were around $10 million. To give a contrasting view, he posed a counter-example: if you see a 90-year-old man, how long do you think he will live? The answers were between 95 and 100 years. He followed that up with another question: you meet a 6-year-old boy, how long will he live? The answers were around 70.
From these answers he emphasized that learning human inductive biases is hard, because numerically identical data can produce very different answers in different domains. The moral of the example is that priors have a strong effect on predictions, so inductive biases can be inferred from human behavior if we can determine the relevant priors. Griffiths and colleagues performed a set of cognitive experiments concluding that different priors (power-law, Gaussian and Erlang) are associated with different examples of human learning.
For example, a power-law prior is associated with the case of a movie having made a certain amount of money, while predicting a person's age is associated with a Gaussian prior. It is thus difficult to come up with a single strategy that works for many tasks, and Griffiths and colleagues have proposed several models for different tasks. Notable ones include causal learning (Griffiths & Tenenbaum, 2009), category learning (Griffiths et al., 2008), speech perception (Feldman & Griffiths, 2007), and subjective randomness (Griffiths & Tenenbaum, 2003).
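As far as I can reconstruct it from Griffiths and Tenenbaum's published work on optimal predictions, the quantitative story behind the movie/lifespan examples is the following prediction rule (my summary, not a formula from the talk slides):

```latex
% Given that a quantity (money earned, age) has reached t so far, the total
% t_total is predicted from the posterior
p(t_{\mathrm{total}} \mid t) \;\propto\; p(t \mid t_{\mathrm{total}})\, p(t_{\mathrm{total}}),
\qquad
p(t \mid t_{\mathrm{total}}) = \frac{1}{t_{\mathrm{total}}}, \quad t \le t_{\mathrm{total}},
% so the shape of the prior p(t_total) alone determines whether the
% prediction scales multiplicatively or stays near a fixed total.
```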
Griffiths also presented the concept of iterated learning (Kirby, 2001) realised with Bayesian learners, where the distribution over hypotheses converges to the prior. They have built the concept of a "rational process model" (Sanborn et al., 2006) by approximating Bayesian inference and connecting the approximations to psychological processes. They use Monte Carlo mechanisms based on importance sampling (Shi, Griffiths et al., 2010) and "win-stay, lose-shift" to approximate Bayesian inference.
Overall, the talk was interesting, comprising several real-world experiments and their mathematical formulation, and showing that Bayesian models of cognition provide a method for identifying human inductive biases. However, the relationship of the priors and representations in those models to mental and neural processes was not transparent.
I personally liked the mathematical formulation of the problems and the development of models that focus on mathematical and statistical methods rather than mimicking the human learning process. These cognitive issues, along with several related neural questions, were also discussed in the panel discussion that immediately followed Griffiths' talk.
The panel discussion was attended by all the keynote speakers except John Shawe-Taylor and Riitta Hari and co-ordinated by Timo Honkela.
Schaffernicht et al. : Weighted Mutual Information for Feature Selection
The evening of the first conference day, the 14th of June, was reserved for the poster presentations. Of the papers accepted for publication in the ICANN'11 proceedings, around 50 were given the opportunity to present their work orally, while the remaining papers (around 60) were presented as posters. The poster session was held in the T-building of the Computer Science department at Aalto University, while the regular conference took place in the Dipoli Congress Center. There was enthusiastic participation from both the poster presenters and the other conference attendants. It was heartening to see that some of the oral presenters were also presenting their work as posters. This makes sense, because a poster session provides one of the best opportunities to discuss one's own research with fellow researchers and renowned professors in the field, which can bring a new perspective, and sometimes a new dimension, to the research.
Out of the many posters, I decided to write about the one by Erik Schaffernicht and Horst-Michael Gross, titled Weighted Mutual Information for Feature Selection. There is no denying the importance of feature selection in learning algorithms; the research area is never saturated and always has room for new ideas and methods. In this paper the authors provide a simple trick for including only the relevant features while at the same time avoiding redundant ones. As in other wrapper methods, the classifier is repeatedly trained on the currently selected features. However, the next feature to be included is determined by how well it explains the misclassified samples rather than all data samples: the samples are given weights, and the feature that maximizes the weighted mutual information is selected.
The idea is similar to the well-known AdaBoost algorithm: samples misclassified in one round are given higher weights in the next. This makes sense, because the correctly classified samples are already explained by the selected subset of features, and the crux of the problem is to find the features that better classify the misclassified samples. They tested the method on several datasets from the UCI machine learning repository and on some artificial datasets. Although it was not mentioned in the paper or shown on the poster, I asked whether they are using the method on real-world data in a current project, and they told me they had deployed it in a control system where the data dimension is in the thousands.
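Here is a minimal sketch of how I understood the scheme; the exact weighting rule and the wrapped classifier are my own simplifications, not the authors' formulas.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def weighted_mi(x, y, w, bins=10):
    """Mutual information between a discretized feature and the class, where
    each sample contributes its weight w[i] to the joint histogram."""
    xb = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    classes = np.unique(y)
    p_xy = np.zeros((bins, len(classes)))
    for i, c in enumerate(classes):
        np.add.at(p_xy[:, i], xb[y == c], w[y == c])
    p_xy /= p_xy.sum()
    p_x, p_y = p_xy.sum(1, keepdims=True), p_xy.sum(0, keepdims=True)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))

def select_features(X, y, n_select=10):
    n, m = X.shape
    w = np.ones(n) / n
    selected = []
    for _ in range(n_select):
        candidates = [j for j in range(m) if j not in selected]
        best = max(candidates, key=lambda j: weighted_mi(X[:, j], y, w))
        selected.append(best)
        # Train a classifier on the features chosen so far and up-weight its errors.
        clf = GaussianNB().fit(X[:, selected], y)
        miss = clf.predict(X[:, selected]) != y
        w = np.where(miss, w * 2.0, w)      # assumed re-weighting rule
        w /= w.sum()
    return selected
```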
Overall, a simple but very effective and intelligent trick for selecting features. It achieves computational efficiency by significantly reducing the number of training cycles, and it also selects a set of well-discriminating features.
Nandita Tripathi: Hybrid Parallel Classifiers for Semantic Subspace Learning
Many ways of enhancing the predictive performance of classifiers by using them in subspaces of the original input space have been studied in recent years. These methods include e.g. the Random Subspace Method (RSM), which divides the original feature space randomly into lower-dimensional subspaces, and variants of the RSM that use some criterion to select the subspaces instead of random assignment.
The novel idea of Tripathi et al. is to use semantic information about the data to optimize the selection of subspaces. After learning the subspaces, a set of classifiers is then used to classify the data with respect to some topics in the new subspaces.
To study the approach, the Reuters data set is used; it provides multiple levels of topics for its documents. The broadest topics (e.g. education, computers, politics) serve as the semantic information and are used for learning the lower-dimensional subspaces with a maximum-significance-based method. After this, multiple classifiers are used within the learnt subspaces to classify the data with respect to more fine-grained topics (e.g. within the topic of education: schools, college, exams).
Tripathi et al. run experiments using multiple different algorithms (e.g. multilayer perceptrons, a naive Bayes classifier, random forests...) as part of their hybrid architecture. The hybrid architecture both improves classification results and decreases computation times.
Enrique Romero: Using the Leader Algorithm with Support Vector Machines for Large Data Sets
Training an SVM on a large data set is computationally demanding, and many different approaches to the problem have been proposed. E.g. chunking and decomposition methods optimize the SVM with respect to subsets of the data to lower the computational cost.
Romero presents an approach that aims to reduce the computational cost by reducing the number of training samples. The data set is first clustered using the Leader algorithm, and then only the samples chosen as cluster representatives by the Leader algorithm are used for training the SVM.
The Leader algorithm uses a distance measure D and a predefined threshold T to partition the data. Points that are within distance T of each other with respect to D are clustered together, and each cluster is represented by one of its data points, which is then referred to as the leader. The Leader algorithm is very fast, making only a single pass through the dataset, and all areas of the input space are represented in the clustering solution.
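A single-pass Leader clustering is simple enough to sketch in a few lines (a generic version; the distance D and threshold T are whatever the application needs):

```python
import numpy as np

def leader_clustering(X, T):
    """Single pass: each point joins the first leader within distance T,
    otherwise it becomes a new leader. Returns the leaders' indices."""
    leaders = []
    for i, x in enumerate(X):
        for j in leaders:
            if np.linalg.norm(x - X[j]) <= T:   # distance measure D = Euclidean here
                break
        else:
            leaders.append(i)                   # no leader close enough: new cluster
    return leaders

# The SVM would then be trained only on X[leaders], shrinking the training set.
```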
Reducing the size of the training set naturally decreases predictive performance but the computational cost decreases much more rapidly. As a future step, Romero proposes developing the Leader algorithm to preserve more data points close to the decision boundaries of the SVM.
Sunday, June 19, 2011
Joshua Tenenbaum: How to grow a mind: Statistics, structure and abstraction
Friday, June 17, 2011
Reichert, P., Series, P. and Storkey, A.. A Hierarchical Generative Model of Recurrent Object-Based Attention in the Visual Cortex
Ramya Rasipuram and Mathew Magimai Doss: Improving Articulatory Feature and Phoneme Recognition using Multitask Learning
Articulatory features describe properties of speech production, i.e. how the basic sounds we make are produced. Phonemes, on the other hand, are the smallest units of sound used to form meaningful speech. In Finnish basically every phoneme corresponds to a letter, whereas in English they do not. Phonemes are nevertheless used to model pronunciation, and they are therefore cross- and multilingual.
The authors experimented with their model using the TIMIT corpus, which contains speech from American English speakers of different sexes and dialects, together with the correct phoneme transcriptions. The following methods for phoneme recognition were applied:
- Independent MLP (multilayer perceptron)
- Multitask MLP
- Phoneme MLP
The independent MLP is a standard method, whereas the multitask MLP and the phoneme MLP are novel methods presented in their paper. In each method, articulatory features were learned from the audio, and an MLP network was trained to predict the phonemes. In the independent MLP the classifiers are independent; however, since the features actually are interrelated, multitask learning was considered necessary. The prediction accuracies (speech to phoneme) for the independent, multitask and phoneme MLPs were 67.4%, 68.9% and 70.2%, respectively.
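A minimal PyTorch sketch of the multitask idea (a shared hidden layer with one output head per task); the layer sizes are illustrative, and treating the articulatory features as a single categorical target is a simplification of the paper's setup.

```python
import torch
import torch.nn as nn

class MultitaskMLP(nn.Module):
    def __init__(self, n_in, n_hidden, n_artic, n_phonemes):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Tanh())
        self.artic_head = nn.Linear(n_hidden, n_artic)        # articulatory-feature head
        self.phoneme_head = nn.Linear(n_hidden, n_phonemes)   # phoneme head

    def forward(self, x):
        h = self.shared(x)
        return self.artic_head(h), self.phoneme_head(h)

model = MultitaskMLP(n_in=39, n_hidden=500, n_artic=26, n_phonemes=39)
ce = nn.CrossEntropyLoss()
x = torch.randn(8, 39)                                        # placeholder acoustic features
artic_y, phon_y = torch.randint(0, 26, (8,)), torch.randint(0, 39, (8,))
artic_out, phon_out = model(x)
loss = ce(artic_out, artic_y) + ce(phon_out, phon_y)          # joint multitask loss
loss.backward()
```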
Additionally, a hierarchical version was presented for each method. They performed better than the original ones, maintaining the order of performance.
Rasipuram said the work will be continued with:
- Automatic speech recognition studies
- Different importance weights for features
- Adding gender and rate of speech as features
The talk drew some criticism, as one researcher in the audience stated that better performance had already been achieved years ago. This wasn't really addressed by the author.
Kauppi et al: Face Prediction from fMRI Data during Movie Stimulus: Strategies for Feature Selection
Similar research had been done before, but there the test subjects were shown a set of movie clips, instead of a whole movie. The authors claim that showing a whole movie results in "more naturalistic" data.
The problem is a classification task with two classes, "face" and "non-face". It was solved using ordinary least squares (OLS) regression. However, since there was a lot of data, OLS couldn't be used in the conventional way: the prediction was done using only a subset of the features, selected using prior information and different methods, resulting in four regression models:
- Stepwise Regression (SWR)
- Simulated Annealing (SA)
- Least Absolute Shrinkage and Selection Operator (LASSO)
- Least Angle Regression (LARS)
Of these, LASSO and LARS are regularized to be sparse, possibly resulting in less overfitting.
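As an illustration of what sparse selection means in practice, generic scikit-learn calls would look roughly like this (placeholder data, not the authors' feature-selection pipeline):

```python
import numpy as np
from sklearn.linear_model import LassoCV, Lars

X = np.random.randn(300, 500)                  # placeholder fMRI features (time points x voxels)
y = (np.random.rand(300) > 0.5).astype(float)  # placeholder face/non-face annotation

lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)         # sparse set of predictive features

lars = Lars(n_nonzero_coefs=6).fit(X, y)       # LARS path truncated at 6 features
```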
Figure: The best prediction acquired with LARS compared to the (roughly binary) annotation. As can be seen, a binary prediction (1 when > 0.5) would match the annotation well. The locations of six features in three brain regions are also visualized.
The human brain is divided into different regions with different tasks. This study provides a natural way (at least for a computer scientist) to find out which regions are associated with face recognition and can thus be used in the prediction. In their paper the authors state that "our results support the view that face detection is distributed across the visual cortex, albeit the fusiform cortex has a strong influence on face detection."
Thursday, June 16, 2011
Heess, N., Le Roux, N. and Winn, J.. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs
Wednesday, June 15, 2011
Riitta Hari: Towards Two-Person Neuroscience
We all know that interaction with other people affects our mood and thoughts very strongly. While an individual is interacting with another person, the brains of the two become coupled, as each brain analyzes the behavior of the other. This is why the neuroscience community is now looking towards a pair, instead of an individual, as the proper unit of analysis.
There have already been studies on humans under controlled interaction, such as watching a movie or playing a computer game. While watching a movie, the brains of individual viewers have been shown to activate in a very synchronous fashion. A game against a human opponent activates the brain differently from a game against a computer, which is also reflected in the reported feelings of the players.
Mirroring is a phenomenon that has been possible to study with existing technology: we feel pain when we are shown a picture of a suffering person. Already Ludwig Wittgenstein noted that "The human body is the best picture of the human soul". How an individual's feelings tune into another person's feelings is a more complicated question. It is a combination of the following factors:
- similar senses, motor systems and the brain that the individuals have
- the experience that they collect throughout their lives, and
- the beliefs they test by acting in the community.
Machine learning steps in for the analysis of the high-dimensional data produced by the functional measurement technologies. Dimensionality reduction methods such as independent component analysis (ICA) extract noise-free components that can potentially be biologically interpreted.
So far in most of the studies of human interaction, only the activity of one brain has been measured regardless of the presence of the other interacting person. Soon, however, accurate measurements of several subjects at a time will be possible, and that will most likely push for a leap in the development of computational data fusion techniques. Then, we will not only have a link between a stimulus and a brain image but between a stimulus and images of several subjects' brains.
When the focus of brain research moves towards the analysis of two or more interacting subjects, efficient multi-view methods will be needed. Thus, multi-view learning is currently a hot area of machine learning research.
Prof. Hari's message to the ICANN audience was that data analysis remains the bottleneck in brain research. As methodological researchers, we should next consider the opportunities opened up by the new experimental settings and measurement technologies, and see how to learn more from the data.