Wednesday, June 22, 2011

SNPBoost: Interaction Analysis of Risk Prediction on GWA Data

Among the few works related to bioinformatics was the poster of Ingrid Braenne et al. from Lübeck, Germany, titled SNPboost: Interaction Analysis of Risk Prediction on GWA Data. As is commonly known, the human genome consists of DNA, in which the order of the four nucleotides A, T, C and G encodes the instructions for building the structural and functional parts of our bodies. The human genome comprises over three billion nucleotides divided into 23 pairs of chromosomes. Small variations at certain locations of the genome, i.e. genomic loci, make us unique individuals. For example, a variation of a single nucleotide at a given locus between individuals is called a single-nucleotide polymorphism (SNP). The human genome contains hundreds of thousands of SNPs. If a SNP lies within a gene, the different variants of that gene are called alleles. An individual can be a homozygote or a heterozygote with respect to an allele: a homozygote carries the same variant of the gene on both chromosomes of a pair, whereas a heterozygote carries different variants on the two chromosomes. A SNP is usually biallelic, meaning that only two nucleotide variants of that SNP occur at high frequency in the population. Allele frequencies usually vary between populations, for example between ethnic groups.
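Since each biallelic SNP has only three possible genotypes, SNP profiles can be represented very compactly. As a small illustration (my own sketch, not from the poster), a common convention is to code a genotype by the number of copies of one allele:

```python
# Hypothetical sketch: numeric coding of biallelic SNP genotypes.
# A biallelic SNP with alleles A and B has three genotypes: the two
# homozygotes (AA, BB) and the heterozygote (AB). Counting copies of
# the B allele gives the usual 0/1/2 coding.

def encode_genotype(genotype):
    """Map a genotype string to the number of B-allele copies."""
    mapping = {"AA": 0, "AB": 1, "BA": 1, "BB": 2}
    return mapping[genotype]

profile = ["AA", "AB", "BB", "AB"]   # one individual's genotypes at 4 SNPs
encoded = [encode_genotype(g) for g in profile]
print(encoded)  # [0, 1, 2, 1]
```

This 0/1/2 coding is what makes it straightforward to feed SNP profiles into standard classifiers later on.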

Different variants of a genomic locus, for example a gene, can be responsible for the development of a certain disease. Some diseases can be associated with a single SNP, such as sickle cell anemia, but most common diseases, such as type I and type II diabetes, myocardial infarction and Crohn's disease, are so-called complex diseases or multi-gene diseases. Complex diseases are caused by several disease-inducing SNPs interacting with each other. SNP profiles of individuals can be obtained using sequencing techniques, and they are routinely analyzed in genome-wide association (GWA) studies. One of the goals of GWA is to associate disease risk with certain SNPs. In addition, the treatment of a disease could be tailored for each individual based on their SNP profile, leading to personalized medicine. However, disease diagnostics, or the prediction of disease risk based on SNP profiles, unfortunately remains very difficult and uncertain. It is also interesting to determine the functional role of a SNP associated with a certain disease. A SNP can fall into a coding region, i.e. a gene, but many SNPs fall into non-coding regions, which makes their functional role harder to determine. The interaction of multiple SNPs is often not investigated in GWA studies, as GWA only provides the statistical significance of the disease association for single SNPs. However, a single genetic variant can have only a limited effect on the risk of a complex disease, and the disease effect may only become visible in the interaction between multiple SNPs. The work of Braenne et al. attempts to find solutions for this. In GWA studies the data is very high-dimensional and the sample size is small in comparison. This causes difficulties for computational analysis, and the interpretation of the results also becomes challenging, so some form of feature selection is often needed. This issue is also addressed in the work of Braenne et al.

Classifiers such as the support vector machine (SVM) have been successfully used, for example, to improve risk prediction for type I and type II diabetes from SNP profiles. The SVM is a standard benchmark classification method that is particularly useful for multivariate data sets. An SVM determines a hyperplane in a higher-dimensional feature space that separates the two classes with maximum separation, i.e. with a maximum margin. By finding a linear separator in the feature space, the classification boundaries become non-linear in the original variable space. In their work, Braenne et al. studied the performance of two types of classification tools, SVMs and boosting, in classifying healthy and diseased SNP profiles. Two versions of the SVM were used, namely SVMs with linear and Gaussian kernels. When training the SVM classifiers, the authors performed feature selection by choosing a subset of SNPs based on the statistical significance of their association with the disease, i.e. their p-values from the GWA analysis. Using a preselected subset of SNPs might, however, miss important interaction effects in the data, so the selection of SNPs for a classifier should somehow be improved.
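To make the setup concrete, here is an illustrative sketch (not the authors' code) of the p-value-based preselection followed by training linear and Gaussian-kernel SVMs, on synthetic SNP-like data; the data, cutoff of 10 SNPs, and contingency test are my own assumptions:

```python
# Illustrative sketch: per-SNP association p-values are used to
# preselect features, then SVMs with linear and Gaussian (RBF)
# kernels are trained on the selected SNPs. Synthetic data only.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.integers(0, 3, size=(n, p))   # genotypes coded 0/1/2
y = rng.integers(0, 2, size=n)        # 0 = control, 1 = case

# Per-SNP p-value from a 2x3 phenotype-by-genotype contingency test
# (+1 pseudo-count avoids empty cells in this toy example).
pvals = []
for j in range(p):
    table = np.array([[np.sum((X[:, j] == g) & (y == c)) + 1
                       for g in range(3)] for c in range(2)])
    pvals.append(chi2_contingency(table)[1])

top = np.argsort(pvals)[:10]          # keep the 10 most significant SNPs
linear_svm = SVC(kernel="linear").fit(X[:, top], y)
gaussian_svm = SVC(kernel="rbf").fit(X[:, top], y)
print(linear_svm.score(X[:, top], y), gaussian_svm.score(X[:, top], y))
```

Note that the preselection looks at one SNP at a time, which is exactly why it can miss interaction effects, as the text points out.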

The boosting method used in the study of Braenne et al. is AdaBoost, the most popular boosting algorithm. The idea in boosting is to combine several weak classifiers into one strong classifier. In this work, the set of weak classifiers contained six classifiers for each SNP. Each SNP is assumed to be biallelic and can therefore take three discrete states: AA, AB and BB, where AA and BB are the homozygous genotypes and AB is the heterozygous one. Each of these states can either induce a disease or protect against it, so each weak classifier corresponds to one of the 2x3 = 6 possible combinations: diseased AA, diseased AB, diseased BB, protected AA, protected AB, or protected BB. A weak classifier tries to correctly predict the disease status of an individual based on their genotype and the assumed state (inducing or protecting) of that SNP. This method is called SNPBoost.
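My reading of the six weak classifiers per SNP can be sketched as follows; the exact decision rule is my own guess (the poster does not spell it out), with "protecting" interpreted as predicting healthy when the genotype matches:

```python
# Hypothetical sketch of the 2x3 = 6 weak classifiers for one SNP.
# Each rule pairs a genotype state with a direction: an "inducing"
# rule predicts diseased (1) when the genotype matches its state,
# while a "protecting" rule predicts healthy (0) on a match.

GENOTYPES = ("AA", "AB", "BB")

def weak_classifier(state, direction):
    """Return a rule mapping a genotype to a disease prediction (1/0)."""
    def predict(genotype):
        match = (genotype == state)
        return int(match if direction == "inducing" else not match)
    return predict

rules = [(s, d, weak_classifier(s, d))
         for s in GENOTYPES for d in ("inducing", "protecting")]
print(len(rules))  # 6 weak classifiers for one SNP
```

Each rule on its own is a very poor predictor; boosting relies only on each being slightly better than chance.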

The weak classifiers are combined such that the classification errors of the first classifier are compensated as well as possible by the second classifier, and so forth. Weak classifiers are added one after another in order to obtain a set of classifiers that together boost the classification. The selection of classifiers also controls the number of SNPs included in the final classifier. This procedure is claimed to also find possible SNP-SNP interactions between the selected SNPs: if a weak classifier interacts positively with a previously selected one, it may be chosen because the interaction improves the classification.
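This sequential error-compensating combination is the standard AdaBoost scheme, which can be demonstrated with off-the-shelf tools (a generic sketch, not the authors' implementation; the toy data with two "interacting" SNPs is my own construction):

```python
# Schematic AdaBoost combination: examples misclassified by the
# classifiers chosen so far receive higher weight, so each new weak
# classifier is picked to compensate the previous errors. sklearn's
# default weak learner is a depth-1 decision stump, playing the role
# of a per-SNP rule here.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 20))    # SNP genotypes coded 0/1/2
y = (X[:, 0] + X[:, 1] > 2).astype(int)   # toy pair of interacting SNPs

model = AdaBoostClassifier(n_estimators=20, random_state=0).fit(X, y)
print(model.score(X, y))
```

Note that the disease label here depends on two SNPs jointly, the kind of effect that single-SNP significance testing can miss but a sequentially built ensemble can pick up.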

The data set used in the analysis contained 127,370 SNPs and 2,032 individuals: 1,222 controls and 810 cases. The data was randomly divided into a training and a test set, each containing 405 individuals. The classifiers were trained on the training data, and their classification performance was assessed on the test data. Receiver operating characteristic (ROC) curves, i.e. the true positive rate versus the false positive rate, were computed on the test set for each method.
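For readers unfamiliar with ROC curves, they are computed by sweeping a threshold over the classifier's continuous output scores; a minimal sketch on made-up scores:

```python
# Sketch of a ROC curve: true positive rate vs. false positive rate
# as the decision threshold on classifier scores is varied. The
# labels and scores below are synthetic.
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.array([0, 0, 1, 1, 0, 1])               # test-set labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # classifier outputs
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)   # area under the curve summarizes the trade-off
print(area)
```

The area under the ROC curve (AUC) gives a single threshold-free number for comparing the methods, with 0.5 meaning chance-level prediction and 1.0 perfect separation.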

The linear SVM (LSVM) with preselected SNPs and SNPBoost both reached peak performance with a small number of SNPs, while the Gaussian SVM (GSVM) peaked with a slightly larger number; in all cases performance decreased when further SNPs were added. SNPBoost outperformed the SVM with a linear kernel. When trained on a preselected set of SNPs, the Gaussian SVM outperformed both the linear SVM and SNPBoost. With a very large number of SNPs, however, SNPBoost outperformed all the other methods. In addition, with a small number of SNPs, the performance of the Gaussian SVM further improved when the subset of SNPs was chosen using SNPBoost. The better SNP selection of SNPBoost compared to the LSVM may be due to the fact that SNPBoost selects SNPs one at a time so as to incrementally improve the classification, and these SNPs might simply fit together better.

Braenne et al. also investigated the biological interpretation of the results. With the SVM and preselected SNP sets, 14 SNPs were found to lie within genes. Two of these genes can be linked through an additional gene, suggesting a possible functional relationship between them. Of the 20 SNPs selected by the SNPBoost algorithm, 9 lie within genes, and two of those genes can likewise be linked through an additional gene. Both methods thus revealed a set of three genes with possible functional relationships, but the set found by SNPBoost might be an additional one. Future work is to identify whether the SNPs in these genes increase the disease risk.
