### Archive

Archive for the ‘Classification’ Category

## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

Experiments on Synthetic Data

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two types per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier to minimize the overall “error” on the training data.

However, since an adversary may assign a different set of costs to the various types (than those given by the type frequencies), a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

Example 1

Example 1. Original training data set. Both the red and blue classes have two types in 9:1 ratio.

Example 1. Plain logistic regression. No minimax. Almost all of the red circles are misclassified.

Example 1. Minimax with gamma = 0.1

Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

Example 2

Example 2. Original training data set.

Example 2. Logistic regression. No minimax.

Example 2. Minimax with gamma = 0.5

Discussion

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

## Regularized Minimax for Robust Learning

March 13, 2010

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e., learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $\phi$ that is parametrized by a vector $w$ by minimizing a loss function

$L(w) = \sum_{i=1}^N l(x_i, y_i, w) + f(w)$

where $l(x_i, y_i, w)$ is the loss on example $i$ and $f(w)$ is some function that penalizes complexity. For example, for logistic regression the loss function looks like

$L(w) = -\sum_{i=1}^N \log P(y_i | x_i, w) + \lambda \|w\|^2$

for some $\lambda > 0$.

If, in addition, the examples came with costs $c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost-sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.)

$L(w, c) = \sum_{i=1}^N c_i \, l(x_i, y_i, w) + f(w)$

We further constrain $\sum_{i=1}^N c_i = N$ and $c_i \ge 0$. So the unweighted learning problem corresponds to the case where all $c_i = 1$.

Assume that the learner is playing a game against an adversary that will assign the costs $\{c_i\}_{i=1,\ldots,N}$ to the training examples that will lead to the worst possible loss for any weight vector the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the minimax solution

$w^* = \arg\min_w \max_c L(w, c)$

For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

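We define

$R(w, c) = L(w, c) + \gamma H(c)$

where $\gamma > 0$ and $H(c) = -\sum_i c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one).

The new minimax optimization problem can be posed as

$w^* = \arg\min_w \max_c R(w, c)$

subject to the constraints

$\sum_{i=1}^N c_i = N$ and $c_i \ge 0$ for $i = 1,\ldots,N$.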

Note that the regularization term on the cost vector $c$ essentially restricts the set of  possible cost vectors the adversary has at his disposal.

Optimization

For convex loss functions (such as the logistic loss) $L(w, c)$ is convex in $w$ for a fixed cost assignment, therefore so is $R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$ and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector $w$, if

$\tilde{c} = \arg\max_c R(w, c)$

and if $\tilde{c}$ is unique, then

$\nabla_w \max_c R(w, c) = \nabla_w R(w, \tilde{c})$

even though $\tilde{c}$ is a function of $w$.

Algorithm

The algorithm is very simple. Repeat the following steps until convergence.

1. At the $k^{th}$ iteration, for the weight vector $w^{k}$ find the cost vector $\tilde{c}$ that maximizes $R(w^{k},c)$.

2. Update $w^{k+1} = w^{k} - \eta \nabla_w R(w^{k}, \tilde{c})$, where $\eta$ is the learning rate.

The maximization in step 1 is also simple and can be shown to be

$\tilde{c}_i = N \frac{\exp(l(x_i, y_i, w^k)/\gamma)}{\sum_{j=1}^N \exp(l(x_j, y_j, w^k)/\gamma)}$

As expected, if $\gamma \rightarrow \infty$, the costs remain close to one and as $\gamma \rightarrow 0$ the entire cost budget is allocated to the example with the largest loss.
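The two steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code used for the experiments: it assumes binary labels in $\{-1,+1\}$, a linear model, an $L_2$ penalty `lam * ||w||^2` for $f(w)$, and a fixed learning rate; the function names and default values are hypothetical.

```python
import math

def logistic_losses(w, X, y):
    """Per-example loss log(1 + exp(-y_i * w.x_i)), with labels y_i in {-1, +1}."""
    losses = []
    for xi, yi in zip(X, y):
        m = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # numerically stable form of log(1 + exp(-m))
        losses.append(max(-m, 0.0) + math.log1p(math.exp(-abs(m))))
    return losses

def adversary_costs(losses, gamma):
    """Step 1: c_i = N * exp(l_i / gamma) / sum_j exp(l_j / gamma)."""
    top = max(losses)  # shift by the max so exp() cannot overflow
    e = [math.exp((l - top) / gamma) for l in losses]
    total = sum(e)
    return [len(losses) * v / total for v in e]

def minimax_fit(X, y, gamma, lam=0.01, eta=0.02, iters=500):
    """Alternate the closed-form cost update with a gradient step on w."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        c = adversary_costs(logistic_losses(w, X, y), gamma)
        # Step 2: gradient of sum_i c_i * l_i + lam * ||w||^2 with c held fixed
        grad = [2.0 * lam * wj for wj in w]
        for xi, yi, ci in zip(X, y, c):
            m = yi * sum(wj * xj for wj, xj in zip(w, xi))
            slope = -0.5 * (1.0 - math.tanh(0.5 * m))  # d/dm of log(1 + exp(-m))
            for j, xj in enumerate(xi):
                grad[j] += ci * slope * yi * xj
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w
```

Calling `adversary_costs` with a very large `gamma` returns costs close to one, and with a very small `gamma` puts nearly the whole budget $N$ on the example with the largest loss, which matches the limiting behavior described above.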

Of types and tokens

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going to see in general.

(The fact that ‘Joe Plumber’ occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type $t$ let $S_t$ be the set of all tokens of that type and $N_t$ be the number of tokens of that type. We now estimate $w$ by

under the same constraints on $c$ as above. Here $q$ is the observed type distribution in the training data and $KL(.\|.)$ is the KL-divergence.

The algorithm is identical to the one above except the maximum over $c$ for a fixed $w$ is slightly different.
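One objective consistent with these definitions is written out below. It is offered as an assumption rather than as the exact form intended: it treats the type costs as a distribution $p$ over types (a rescaling of $c$), uses the average loss $\bar{l}_t$ of each type, and takes $q_t = N_t/N$.

```latex
\min_w \max_p \; \sum_t p_t \, \bar{l}_t(w) + f(w) - \gamma \, KL(p \,\|\, q),
\qquad
\bar{l}_t(w) = \frac{1}{N_t} \sum_{i \in S_t} l(x_i, y_i, w),
\qquad
q_t = \frac{N_t}{N}

% inner maximization for a fixed w (from the Lagrangian of the problem above)
\tilde{p}_t = \frac{q_t \exp(\bar{l}_t(w)/\gamma)}{\sum_s q_s \exp(\bar{l}_s(w)/\gamma)}
```

Under this form, $\gamma \rightarrow \infty$ gives $\tilde{p} = q$, which recovers the usual token-averaged loss, while $\gamma \rightarrow 0$ puts all of the weight on the type with the largest average loss.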

Related Work and Discussion

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to be one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ($w$ for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.

Acknowledgment

I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.

## Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I’ll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes which are mixtures of Gaussians in 2D (shown in red, blue and green). The left panel is the training data and the right panel shows the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

Example 1 (using 3 prototypes)

After first iteration

After second iteration

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

Example 2 (five prototypes)

After first iteration

After 5 iterations

Example 3 (five prototypes)

After first iteration

The rightmost panel shows the first two “transformed features”, i.e., the kernel values of the examples to the first two prototypes.

After second iteration

Implementation details and discussion

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the  prototypes (fixing the betas and the kernel params), and then another time for the kernel parameter. These three update steps are repeated until convergence.

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast — the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.

The kernel parameter updating worked best when the RBF kernel was re-parametrized as $K(x,u) = \exp(-\exp(\theta) \|x-u\|^2)$.
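To make the re-parametrization concrete, here is a small sketch of the kernel and the two derivatives that the prototype and kernel-parameter updates would need. The function names are hypothetical and this is not the Java implementation mentioned above. One apparent advantage of this form is that $\exp(\theta)$ stays positive for any real $\theta$, so the gradient steps on $\theta$ need no constraint.

```python
import math

def sq_dist(x, u):
    """Squared Euclidean distance ||x - u||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, u))

def rbf(x, u, theta):
    """K(x, u) = exp(-exp(theta) * ||x - u||^2)."""
    return math.exp(-math.exp(theta) * sq_dist(x, u))

def rbf_grad_theta(x, u, theta):
    """dK/dtheta = -exp(theta) * ||x - u||^2 * K(x, u)."""
    return -math.exp(theta) * sq_dist(x, u) * rbf(x, u, theta)

def rbf_grad_prototype(x, u, theta):
    """dK/du_j = 2 * exp(theta) * (x_j - u_j) * K(x, u), one entry per dimension."""
    k = rbf(x, u, theta)
    return [2.0 * math.exp(theta) * (a - b) * k for a, b in zip(x, u)]
```

By the chain rule, these derivatives are multiplied by the derivative of the log-likelihood with respect to the kernel value to obtain the stochastic gradient for $\theta$ and for each prototype.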

The learning rate for the betas was annealed, but those for the prototypes and the kernel parameter were fixed at constant values.

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.

## Incremental complexity support vector machine

One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below.

(I generated them using a java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels: the first with $\sigma = 1$ and the second with $\sigma = 10$ on two different data sets.

Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition.

The reason for these anomalous boundaries is of course the large complexity of the function class induced by the RBF kernel with large $\sigma$, which gives the classifier a propensity to make subtle distinctions even in regions of  somewhat low example density.

A possible solution: using complex kernels only where they are needed

We propose to build a cascaded classifier, which we will call Incremental Complexity SVM (ICSVM), as follows.

We are given a sequence of kernels $K_1, K_2,\ldots,K_m$ of increasing complexity. For example, the sequence could consist of polynomial kernels, where $K_i$ is the polynomial kernel with degree $i$.

The learning algorithm first learns an SVM classifier $\psi_1$ with kernel $K_1$, that classifies a reasonable portion of the examples with a large margin $\lambda_1$. This can be accomplished by setting the SVM cost parameter $C$ to some low value.

Now all the examples outside the margin are thrown out, and another SVM classifier $\psi_2$ with kernel $K_2$ is learned, so that a reasonable portion of the remaining examples are classified with some large margin $\lambda_2$.

This procedure is continued until all the examples are classified outside the margin or the set of kernels is exhausted. The final classifier is a combination of all the classifiers $\psi_i$.

A test example can be classified as follows. We first apply classifier $\psi_1$ to the test example, and if it is classified with margin $\geq \lambda_1$, we output the assigned label and stop. If not we classify it with classifier $\psi_2$ in a similar fashion, and so on…

Such a scheme will avoid anomalous boundaries such as those in the pictures above.
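The test-time procedure can be sketched as follows. This is a minimal illustration in which each trained classifier $\psi_i$ is represented by a hypothetical scoring function returning a signed decision value; the toy stages are made up, and falling back to the last stage when no stage is confident is an assumption, since the description leaves that case open.

```python
def cascade_predict(x, stages):
    """Classify x with a cascade of (score_fn, margin) pairs.

    stages is ordered by increasing kernel complexity. score_fn(x) returns a
    signed decision value whose sign is the label and whose magnitude is
    compared against that stage's margin. Returns (label, stage_index).
    """
    if not stages:
        raise ValueError("need at least one stage")
    for index, (score_fn, margin) in enumerate(stages):
        score = score_fn(x)
        if abs(score) >= margin:
            break  # confident enough: stop at this stage
    # If no stage was confident, the loop ends on the last stage and we
    # use its label anyway (an assumption, see the text above).
    return (1 if score >= 0 else -1), index

# Toy stand-ins for psi_1 (linear) and psi_2 (quadratic); not trained SVMs.
toy_stages = [
    (lambda x: x[0], 1.0),
    (lambda x: x[0] + x[1] ** 2 - 0.5, 0.2),
]
```

Easy examples exit at the first stage, so the more expensive kernels are only evaluated on the examples that the simpler classifiers could not place confidently.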

Discussion

1. With all the work that has been done on SVMs it is very likely that this idea or something very similar has been thought of, but I haven’t come across it.

2. There is some work on kernel learning where a convex combination of kernels is learned but I think that is a different idea.

3. One nice thing about such a classification scheme is that at run-time it will expend less computational resources on easier examples and more on more difficult ones.  As my thesis supervisor used to say, it is silly for most classifiers to insist on acting exactly the same way on both easy and hard cases.

4. The choice of the cost parameters $C$ for the SVMs is critical for the accuracy of the final classifier. Is there a way of formulating the choice of the parameters in terms of minimizing some overall upper bound on the generalization error from statistical learning theory?

5. Is there a one-shot SVM formulation with the set of kernels that exactly or approximately acts like our classifier?

6. The weird island-effect and what Ken calls the lava-lamp problem in the boundaries above are not just artifacts of SVMs. We would expect a sparse kernel logistic regression to behave similarly. It would be interesting to do a similar incremental kernel thing with other kernel-based classifiers.


## Training data bias caused by active learning

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is pool-based. That is, we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $P(x,y)$, we have data drawn from some distribution $\hat{P}(x,y)$ (as always $x$ is the feature vector and $y$ is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general since $P(x,y)$ is unknown, if $\hat{P}(x,y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

The type of bias

Assume that the marginal feature distribution of the labeled points after active learning is given by $\hat{P}(x) = \sum_y\hat{P}(x,y)$. Therefore $\hat{P}(x)$ is the putative distribution from which we can assume the labeled feature vectors have been sampled.

For every feature vector thus sampled from $\hat{P}(x)$ we request its label from the oracle which returns a label according to the conditional distribution $P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)}$.  That is there is no bias in the conditional distribution. Therefore $\hat{P}(x,y) = \hat{P}(x) P(y|x)$. This type of bias has been called covariate shift.

The data

After actively sampling the labels $n$ times, let us say we have the following data: a biased labeled training data set $\{x_i, y_i\}_{i=1,\ldots,n} \sim \hat{P}(x,y)$, where the feature vectors $x_i$ come from the original pool of $M$ unlabeled feature vectors $\{x_i\}_{i=1,\ldots,M} \sim P(x)$.

Let us define $\beta=\frac{P(x,y)}{\hat{P}(x,y)}=\frac{P(x)}{\hat{P}(x)}$. If $\beta$ is large we expect the feature vector $x$ to be under-represented in the labeled data set and if it is small it is over-represented.

Now define for each labeled example $\beta_i=\frac{P(x_i)}{\hat{P}(x_i)}$ for $i = 1,\ldots,n$. If we knew the values of $\{\beta_i\}$ we could correct for the bias during training and evaluation.

This paper by Huang et al., and some of its references, deal with the estimation of $\{\beta_i\}_{i=1,\ldots,n}$. Remark: This estimation needs to take into account that $E_{\hat{P}(x)}[\beta_i] = 1$. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang et al.

Evaluation of the classifier

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $f()$, and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate $\{\beta_i\}_{i=1,\ldots,n}$. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by $\{\beta_i\}$.

If the empirical loss on the biased sample is written as $R = \frac{1}{n} \sum_i l(f(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $R_c= \frac{1}{n} \sum_i \beta_i l(f(x_i), y_i)$.

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.
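As a small illustration of the difference between $R$ and $R_c$, the sketch below computes both from per-example losses. The function names and the numbers are made up for the example, and the $\beta_i$ are assumed to have been estimated already.

```python
def empirical_loss(losses):
    """R = (1/n) * sum_i l_i, the plain average on the biased sample."""
    return sum(losses) / len(losses)

def weighted_loss(losses, betas):
    """R_c = (1/n) * sum_i beta_i * l_i, the bias-corrected estimate."""
    if len(losses) != len(betas):
        raise ValueError("need one beta per labeled example")
    return sum(b * l for b, l in zip(betas, losses)) / len(losses)

# Hypothetical 0/1 losses: the two errors fall on under-represented points
# (beta > 1), so the corrected error estimate is higher than the raw one.
zero_one = [0.0, 0.0, 1.0, 1.0]
betas = [0.5, 0.5, 1.5, 1.5]
raw = empirical_loss(zero_one)
corrected = weighted_loss(zero_one, betas)
```

Here the hypothetical betas have sample mean one, in line with the constraint $E_{\hat{P}(x)}[\beta_i] = 1$ noted earlier.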

Learning the classifier

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss $R_c$ (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether any bias correction is necessary. Note that the posterior class distribution $P(y|x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on $P(x,y)$ will still converge to the Bayes error rate with examples drawn from $\hat{P}(x,y)$.

For example imagine constructing a $k$-Nearest Neighbor classifier on the biased labeled dataset.  If we let $k \rightarrow \infty$ and $\frac{k}{n} \rightarrow 0$, the classifier will converge to the Bayes-optimal classifier as $n \rightarrow \infty$, even if $\hat{P}(x)$ is biased. This is somewhat paradoxical and can be explained by looking at the case of finite $n$.

For finite $n$, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.

## The redundancy of view-redundancy for co-training

Blum and Mitchell’s co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence, and view-redundancy (or view-agreement) for semi-supervised learning.

I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training  (which is not all that surprising considering that both are multi-view semi-sup algorithms that rely on class-conditional view-independence).

I’ll first explain co-training with an example.

Co-training – The setup

Consider a $y \in \{0,1\}$ classification problem on the feature space $\mathcal{X}=\mathcal{X}_1 \times \mathcal{X}_2$. I.e., a feature vector $x$ can be split into two as $x = [x_1, x_2]$.

We make the rather restrictive assumption that $x_1$ and $x_2$ are class-conditionally independent for both classes. I.e., $P(x_1, x_2|y) = P(x_1|y) P(x_2|y)$ for $y \in \{0,1\}$.

(Note that unlike surrogate learning with mean-independence, both $\mathcal{X}_1$  and $\mathcal{X}_2$ are allowed to be multi-dimensional.)

Co-training makes an additional assumption that either view is sufficient for classification. This view-redundancy assumption basically states that the probability mass in the region of the feature space, where the Bayes optimal classifiers on the two views disagree with each other, is zero.

(The original co-training paper actually relaxes this assumption in the epilogue, but it is unnecessary to begin with, and the assumption has proliferated in later manifestations of co-training.)

We are given some labeled data (or a weak classifier on one of the views) and a large supply of unlabeled data. We are now ready to proceed with co-training to construct a Bayes optimal classifier.

Co-training – The algorithm

The algorithm is very simple. We use our weak classifier, say $h_1(x_1)$, (which we were given, or which we constructed using the measly labeled data) on the one view ($x_1$) to classify all the unlabeled data.  We select the examples classified with high confidence, and use these as labeled examples (using the labels assigned by the weak classifier) to train a classifier $h_2(x_2)$ on the other view ($x_2$).

We now classify the unlabeled data with $h_2(x_2)$ to similarly generate labeled data to retrain $h_1(x_1)$. This back-and-forth procedure is repeated until exhaustion.

Under the above assumptions (and with “sufficient” unlabeled data) $h_1$ and $h_2$ converge to the Bayes optimal classifiers on the respective feature views. Since either view is enough for classification, we just pick one of the classifiers and release it into the wild.

Co-training – Why does it work?

I’ll try to present an intuitive explanation of co-training using the example depicted in the following figure. Please focus on it intently.

The feature vector $x$ in the example is 2-dimensional and both views $x_1$ and $x_2$ are 1-dimensional. The class-conditional distributions are uncorrelated and jointly Gaussian (which means independent) and depicted by their equiprobability contours in the figure. The marginal class-conditional distributions are shown along the two axes. Class $y=0$ is shown in red and class $y=1$ is shown in blue. The picture also shows some unlabeled examples.

Assume we have a weak classifier $h_1(x_1)$ on the first view. If we extend the classification boundary for this classifier to the entire space $x$, the boundary necessarily consists of lines parallel to the $x_2$ axis. Let’s say there is only one such line and all the examples below that line are assigned class $y=1$ and all the examples above are assigned class $y=0$.

We now ignore all the examples close to the classification boundary of $h_1$ (i.e., all the examples in the grey band) and project the rest of the points onto the $x_2$ axis.

How will these projected points be distributed along $x_2$?

Since the examples that were ignored (in the grey band) were selected based on their $x_1$ values, owing to class-conditional independence, the marginal distribution along $x_2$ for either class will be exactly the same as if none of the samples were ignored. This is the key reason for the conditional-independence assumption.

The procedure has two subtle, but largely innocuous, consequences.

First, since we don’t know how many class $0$ and class $1$ examples are in the grey band, the relative ratio of the examples of the two classes in the not-ignored set may not be the same as in the original full unlabeled sample set. If the class priors $P(y)$ are known, this can easily be corrected for when we learn $h_2(x_2)$. If the class priors are unknown, other assumptions on $h_1(x_1)$ are necessary.

Second, when we project the unlabeled examples on to $x_2$ we assign them the labels given to them by $h_1$ which can be erroneous. In the figure above, there will be examples in the region indicated by A that are actually class $1$ but have been assigned class $0$, and examples in region B that were from class $0$ but were called class $1$.

Again because of the class-conditional independence assumption these erroneously labeled examples will be distributed according to the marginal class-conditional $x_2$ distributions. I.e., in the figure above we imagine, along the $x_2$ axis, a very low amplitude blue distribution with the same shape and location as the red distribution, and a very low amplitude red distribution with the same shape under the blue distribution. (Note: this is the $(\alpha, \beta)$ noise in the original co-training paper.)

This amounts to having a labeled training set with label errors but with errors being generated independently of the location in the space. That is, the number of errors in a region in the space is proportional to the number of examples in that region. These proportionally distributed errors are then washed out by the correctly labeled examples when we learn $h_2$.

To recap, co-training works because of the following fact. Starting from a weak classifier $h_1$ on $x_1$, we can generate very accurate and unbiased training data to train a classifier on $x_2$.

No need for view-redundancy

Notice that, in the above example, we made no appeal to any kind of view-redundancy (other than whatever we may get gratis from the independence assumption).

The vigilant reader may however level the following two objections against the above argument-by-example.

1. We build $h_1(x_1)$ and $h_2(x_2)$ separately. So when the training is done, without view redundancy, we have not shown a way to pick from the two to apply to new test data.

2. At every iteration we need to select unlabeled samples that were classified with high confidence by $h_1$ to feed to the trainer for $h_2$. Without view-redundancy, perhaps none of the samples will be classified with high confidence.

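The first objection is easy to respond to. We pick neither $h_1$ nor $h_2$ for new test data. Instead we combine them to obtain a classifier $h(x_1,x_2)$. This is well justified because, under class-conditional independence, $P(y|x_1,x_2) \propto P(y|x_1) P(y|x_2) / P(y)$. (The division by the class prior $P(y)$ is needed because each single-view posterior already includes the prior once; with equal priors it reduces to the plain product.)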
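A short sketch of this combination step, assuming the two classifiers output calibrated posteriors and that the class priors are known (the function name is hypothetical). It follows from Bayes' rule under class-conditional independence: $P(y|x_1,x_2) \propto P(y) P(x_1|y) P(x_2|y) \propto P(y|x_1) P(y|x_2) / P(y)$, so the prior, which both single-view posteriors contain, is divided out once.

```python
def combine_views(p1, p2, prior):
    """Combine per-view posteriors into P(y | x1, x2).

    p1[y] = P(y | x1), p2[y] = P(y | x2), prior[y] = P(y).
    Assumes x1 and x2 are class-conditionally independent.
    """
    unnormalized = [a * b / p for a, b, p in zip(p1, p2, prior)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With equal class priors the division makes no difference after normalization, and the combination is simply the normalized product of the two posteriors.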

We react to the second objection by dropping the requirement of classifying with high-confidence altogether.

Dropping the high-confidence requirement by surrogate learning

Instead of training $h_2(x_2)$ with examples that are classified with high confidence by $h_1(x_1)$, we train $h_2(x_2)$ with all the examples (using the scores assigned to them by $h_1(x_1)$).

At some iteration of co-training, define the random variable $z_1 = h_1(x_1)$. Since $x_1$ and $x_2$ are class-conditionally independent, $z_1$ and $x_2$ are also class-conditionally independent. In particular $z_1$  is class-conditionally mean-independent of $x_2$. Furthermore if $h_1$ is even a weakly useful classifier, barring pathologies, it will satisfy $E[z_1|y=0] \neq E[z_1|y=1]$.

We can therefore apply surrogate learning under mean-independence to learn the classifier on $x_2$. (This is essentially the same idea as Co-EM, which was introduced without much theoretical justification.)

Discussion

Hopefully the above argument has convinced the reader that the class-conditional view independence assumption obviates the view-redundancy requirement.

A natural question to ask is whether the reverse is true. That is, if we are given view-redundancy, can we completely eliminate the requirement of class-conditional independence? We can immediately see that the answer is no.

For example, we can duplicate all the features for any classification problem so that view-redundancy holds trivially between the two replicates. Moreover, the second replicate will be statistically fully dependent on the first.

Now if we are given a weak classifier on the first view (or replicate) and try to use its predictions on an unlabeled data set to obtain training data for the second, it would be equivalent to feeding back the predictions of a classifier to retrain itself (because the two views are duplicates of one another).

This type of procedure (which is an idea decades old) has been called, among other things, self-learning, self-correction, self-training and decision-directed adaptation. The problem with these approaches is that the training set so generated is biased and other assumptions are necessary for the feedback procedure to improve over the original classifier.

Of course this does not mean that the complete statistical independence assumption cannot be relaxed. The above argument only shows that at least some amount of independence is necessary.

## An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

Consider an $(M+1)$-class logistic regression model given by

$P(y|x)\propto\exp(\beta_{y0} + \sum_{j=1}^{d}\beta_{yj}x_j)$ for $y =0,1,\ldots,M$

where $j$ indexes the $d$ features.

Fitting the model to a data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$ involves estimating the betas to maximize the likelihood of $D$.

The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write

$P(y|x) \propto \exp(\beta_{y0} + \sum_{i=1}^N \beta_{yi} k(x,x_i))$ for $y=0,1,\ldots,M$,

where $k(.,.)$ is a kernel function.

In order to be able to use this classifier at run-time we have to store all the training feature vectors as part of the model because we need to compute the kernel value of the test example to every one of them. This would be highly inefficient, not to mention the severe over-fitting of the model to the training data.

The solution to both the test time efficiency and the over-fitting problems is to enforce sparsity. That is, we somehow make sure that $\beta_{yi} =0$ for all but a few examples $x_i$ from the training data. The import vector machine does this by greedily picking some $n < N$ examples so that the reduced $n$ example model best approximates the full model.

Sparsification by randomized prototype selection

The sparsified kernel logistic regression therefore looks like

$P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^n\beta_{yi} k(x,u_i))$ for $y=1,\ldots,M$,

where the feature vectors $u_i$ are from the training data set. We can see that all we are doing is a vanilla logistic regression on a transformed feature space. The original $d$ dimensional feature vector has been transformed into an $n$ dimensional vector, where each dimension measures the kernel value of our test example $x$ to a prototype vector (or reference vector) $u_i$.

What happens if we just selected these $n$ prototypes randomly instead of greedily as in the import vector machine?

Avrim Blum showed that if the two classes can be linearly separated with margin $\gamma$ in the feature space induced by the kernel function, then, with high probability, they can be linearly separated with margin $\gamma/2$ and low error in the transformed feature space, provided we pick a sufficient number of prototypes randomly.

That’s a mouthful, but basically we can use Blum’s method for kernelizing logistic regression as follows. Just pick $n$ random vectors from your dataset (in fact they need not be labeled), compute the kernel value of an example to these $n$ points and use these as $n$ features to describe the example. We can then learn a straightforward logistic regression model on this $n$ dimensional feature space.
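The recipe can be sketched in a few lines of NumPy. This is only an illustration: the Gaussian r.b.f. kernel, the value of `lam`, and the helper names `rbf_kernel` and `random_prototype_features` are assumptions I picked for the example, and any similarity function could be substituted.

```python
import numpy as np

def rbf_kernel(X, U, lam):
    """k(x, u) = exp(-lam * ||x - u||^2) for every pair of rows of X and U."""
    sq_dists = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-lam * sq_dists)

def random_prototype_features(X, n, lam, rng):
    """Pick n rows of X at random as prototypes; return features Z and prototypes U."""
    idx = rng.choice(len(X), size=n, replace=False)
    U = X[idx]
    return rbf_kernel(X, U, lam), U

# Toy data: 200 unlabeled points in 2 dimensions mapped to 10 kernel features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Z, U = random_prototype_features(X, n=10, lam=0.5, rng=rng)
```

Each row of `Z` is the new $n$ dimensional description of an example, and it can be handed to any ordinary logistic regression fitter. At test time only the $n$ prototypes in `U` need to be stored.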

As Blum notes, $k(.,.)$ need not even be a valid kernel for using this method. Any reasonable similarity function would work, except that the theoretical guarantee above no longer holds.

Going a step further — Learning the reference vectors

A key point to note is that there is no reason for the prototypes $\{u_1, u_2,\ldots,u_n\}$ to be part of the training data. Any reasonable reference points in the original feature space would work. We just need to pick them so as to enable the resulting classifier to separate the classes well.

Therefore I propose kernelizing logistic regression by maximizing the log-likelihood with respect to the betas as well as the reference points. We can do this by gradient ascent (equivalently, gradient descent on the negative log-likelihood), starting from $n$ random points from our data set. The requirement is that the kernel function be differentiable with respect to the reference point $u$. (Note: learning vector quantization is a somewhat related idea.)

Because of obvious symmetries, the log-likelihood function is non-convex with respect to the reference vectors, but the local optima close to the randomly selected reference points are no worse than the random reference points themselves.

The gradient with respect to a reference vector

Let us derive the gradient of the log-likelihood function with respect to a reference vector. First let us denote $k(x_i, u_j)$, i.e., the kernel value of the $i^{th}$ feature vector with the $j^{th}$ prototype by $z_{ij}$.

The log-likelihood of the data is given by

$L = \sum_{i=1}^N \sum_{y=1}^M \mbox{log}P(y|x_i) I(y=y_i)$

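where $I(.)$ is the usual indicator function. The gradient of $L$ with respect to the parameters $\beta$ can be found in any textbook on logistic regression. The derivative of $P(y|x_i)$ with respect to the reference vector $u_l$ is

$\frac{\partial}{\partial u_l} P(y|x_i) = P(y|x_i)\left[\beta_{yl} - \sum_{y'=1}^M \beta_{y'l} P(y'|x_i)\right]\frac{\partial}{\partial u_l} z_{il}$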

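Putting it all together we have

$\frac{\partial}{\partial u_l} L = \sum_{i=1}^N \frac{\partial}{\partial u_l} z_{il}\left[\sum_{y=1}^M \beta_{yl} I(y=y_i) - \sum_{y=1}^M \beta_{yl} P(y|x_i)\right]$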

That’s it. We can update all the reference vectors in the direction given by the above gradient by an amount that is controlled by the learning rate.

Checking our sums

Let us check what happens if there is only one reference vector $u_1$ and $z_{i1} = k(x_i, u_1) = \langle x_i, u_1 \rangle$. That is, we use a linear kernel. We have

$\frac{\partial}{\partial u_1} z_{i1} = x_i$ and therefore

$\frac{\partial}{\partial u_1} L = \sum_{i=1}^N x_i[\sum_{y=1}^M \beta_{y1} I(y=y_i) - \sum_{y=1}^M \beta_{y1} P(y|x_i)]$

which is very similar to the gradient of $L$ with respect to the $\beta$ parameters. This is reasonable because with a linear kernel we are essentially learning a logistic regression classifier on the original feature space, where $\beta_{y1} u_1$ takes the place of $\beta_y$.
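The same collapse can be checked numerically for more than one reference vector. In this small sketch (array names and sizes are arbitrary choices of mine), the scores computed from linear-kernel features agree with those of a plain linear model whose weight vector for class $y$ is $\sum_l \beta_{yl} u_l$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # N = 6 examples, d = 3 features
U = rng.normal(size=(4, 3))        # n = 4 reference vectors
B = rng.normal(size=(2, 5))        # M = 2 classes: intercept + n betas each

# Scores of the kernelized model with a linear kernel z_{il} = <x_i, u_l>.
Z = X @ U.T
scores_kernel = B[:, 0] + Z @ B[:, 1:].T

# Scores of an ordinary linear model with weights w_y = sum_l beta_{yl} u_l.
W = B[:, 1:] @ U
scores_linear = B[:, 0] + X @ W.T
```

Since the two score matrices coincide, adding reference vectors under a linear kernel only reparameterizes the same linear classifier.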

If our kernel is the Gaussian radial basis function we have

$\frac{\partial}{\partial u_l} z_{il} = \frac{\partial}{\partial u_l} \mbox{exp}(-\lambda||x_i-u_l||^2) = 2\lambda (x_i - u_l) z_{il}$
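A quick way to gain confidence in the algebra is to compare the analytic gradient against finite differences. The sketch below does this for the Gaussian r.b.f. kernel, using the chain rule $\frac{\partial L}{\partial u_l} = \sum_i \frac{\partial z_{il}}{\partial u_l} \sum_y \beta_{yl}[I(y=y_i) - P(y|x_i)]$. The function names, the coefficient layout (intercepts in column 0 of `B`) and the labels `0..M-1` are assumptions made for the example.

```python
import numpy as np

def rbf(X, U, lam):
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-lam * d2)                       # z_{il}, shape (N, n)

def probs(Z, B):
    s = B[:, 0] + Z @ B[:, 1:].T                   # (N, M) scores
    s = s - s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)

def loglik(X, y, B, U, lam):
    P = probs(rbf(X, U, lam), B)
    return np.log(P[np.arange(len(y)), y]).sum()

def grad_U(X, y, B, U, lam):
    """Analytic gradient of the log-likelihood w.r.t. every reference vector."""
    Z = rbf(X, U, lam)
    P = probs(Z, B)
    Y = np.eye(B.shape[0])[y]                      # one-hot I(y = y_i)
    W = (Y - P) @ B[:, 1:]                         # sum_y beta_{yl} [I - P], (N, n)
    coef = 2.0 * lam * W * Z                       # multiplies (x_i - u_l)
    return coef.T @ X - coef.sum(axis=0)[:, None] * U   # (n, d)

# Finite-difference check on random toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = rng.integers(0, 3, size=30)
B = rng.normal(size=(3, 5))                        # 3 classes, 4 prototypes
U = X[:4].copy()
lam, eps = 0.7, 1e-6
G = grad_U(X, y, B, U, lam)
G_num = np.zeros_like(U)
for l in range(U.shape[0]):
    for k in range(U.shape[1]):
        Up, Um = U.copy(), U.copy()
        Up[l, k] += eps
        Um[l, k] -= eps
        G_num[l, k] = (loglik(X, y, B, Up, lam) - loglik(X, y, B, Um, lam)) / (2 * eps)
```

A gradient ascent step is then simply `U += learning_rate * G`.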

Learning the kernel parameters

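Of course gradient descent can be used to update the parameters of the kernel as well. For example we can initialize the parameter $\lambda$ of the Gaussian r.b.f. kernel to a reasonable value and optimize it to maximize the log-likelihood as well. The expression for the gradient with respect to the kernel parameter is

$\frac{\partial}{\partial \lambda} L = \sum_{i=1}^N \sum_{l=1}^n \frac{\partial}{\partial \lambda} z_{il}\left[\sum_{y=1}^M \beta_{yl} I(y=y_i) - \sum_{y=1}^M \beta_{yl} P(y|x_i)\right]$, where $\frac{\partial}{\partial \lambda} z_{il} = -||x_i - u_l||^2 z_{il}$ for the Gaussian r.b.f. kernel.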
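For the Gaussian r.b.f. kernel the derivative of $z_{il}$ with respect to $\lambda$ is $-||x_i - u_l||^2 z_{il}$, so the gradient is a sum over all examples and all reference vectors. The following sketch (helper names, coefficient layout and toy data are my own assumptions) computes it and compares it with a finite-difference estimate.

```python
import numpy as np

def sq_dists(X, U):
    return ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)

def loglik(X, y, B, U, lam):
    Z = np.exp(-lam * sq_dists(X, U))
    S = B[:, 0] + Z @ B[:, 1:].T
    S = S - S.max(axis=1, keepdims=True)
    logP = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return logP[np.arange(len(y)), y].sum()

def grad_lam(X, y, B, U, lam):
    """Analytic derivative of the log-likelihood w.r.t. the r.b.f. width lam."""
    D2 = sq_dists(X, U)
    Z = np.exp(-lam * D2)
    S = B[:, 0] + Z @ B[:, 1:].T
    S = S - S.max(axis=1, keepdims=True)
    P = np.exp(S)
    P = P / P.sum(axis=1, keepdims=True)
    Y = np.eye(B.shape[0])[y]                  # one-hot I(y = y_i)
    W = (Y - P) @ B[:, 1:]                     # dL/dz_{il}, shape (N, n)
    return float((W * (-D2 * Z)).sum())        # dz_{il}/dlam = -||x_i - u_l||^2 z_{il}

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = rng.integers(0, 3, size=30)
B = rng.normal(size=(3, 5))                    # 3 classes, 4 prototypes
U = X[:4].copy()
lam, eps = 0.7, 1e-6
g = grad_lam(X, y, B, U, lam)
g_num = (loglik(X, y, B, U, lam + eps) - loglik(X, y, B, U, lam - eps)) / (2 * eps)
```

In practice one would probably want to keep $\lambda$ positive, for instance by optimizing $\log\lambda$ instead. That is a design choice this sketch does not address.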

Going online

The optimization of the reference vectors can be done in an online fashion by stochastic gradient descent à la Bob Carpenter.

Is it better to update all the parameters of the model (betas, reference vectors, kernel parameters) at the same time or wait for one set (say the betas) to converge before updating the next set (reference vectors)?
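To make the online variant concrete, here is a minimal sketch that takes the first option: every example triggers a simultaneous update of the betas and the reference vectors, with the kernel parameter held fixed. It is not a reproduction of Carpenter's scheme. The Gaussian r.b.f. kernel, the constant learning rate, the zero initialization of the betas and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sgd_epoch(X, y, B, U, lam, lr, rng):
    """One pass of per-example gradient ascent on the betas and reference vectors."""
    M = B.shape[0]
    for i in rng.permutation(len(X)):
        diff = X[i] - U                                  # (n, d): x_i - u_l
        z = np.exp(-lam * (diff ** 2).sum(axis=1))       # (n,) kernel features
        s = B[:, 0] + B[:, 1:] @ z                       # (M,) scores
        p = np.exp(s - s.max())
        p = p / p.sum()
        err = np.eye(M)[y[i]] - p                        # I(y = y_i) - P(y | x_i)
        w = err @ B[:, 1:]                               # (n,) dL_i / dz_{il}
        grad_B = np.outer(err, np.concatenate(([1.0], z)))
        grad_U = (2.0 * lam * w * z)[:, None] * diff
        B += lr * grad_B                                 # both updates use the
        U += lr * grad_U                                 # pre-update parameters
    return B, U

def loglik(X, y, B, U, lam):
    d2 = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)
    S = B[:, 0] + np.exp(-lam * d2) @ B[:, 1:].T
    S = S - S.max(axis=1, keepdims=True)
    logP = S - np.log(np.exp(S).sum(axis=1, keepdims=True))
    return logP[np.arange(len(y)), y].sum()

# Toy problem: two Gaussian blobs, four prototypes drawn from the data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.repeat([0, 1], 50)
U = X[rng.choice(100, size=4, replace=False)].copy()
B = np.zeros((2, 5))
ll_before = loglik(X, y, B, U, 0.5)
for _ in range(20):
    B, U = sgd_epoch(X, y, B, U, lam=0.5, lr=0.1, rng=rng)
ll_after = loglik(X, y, B, U, 0.5)
```

Note that with the betas initialized to zero the reference vectors receive no gradient at first, so in this sketch the betas effectively move before the prototypes do. An alternating schedule would make that ordering explicit.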

Miscellany

1. Since conditional random fields are just generalized logistic regression classifiers, we can use the same approach to kernelize them. Even if all the features are binary, the reference vectors can be allowed to be continuous.

2. My colleague Ken Williams suggests keeping the model small by sparsifying the reference vectors themselves. The reference vectors can be encouraged to be sparse by imposing a Laplace (L1) prior.

3. The complexity of the resulting classifier can be controlled by the choice of the kernel and the number of reference vectors. I don’t have a good intuition about the effect of the two choices. For a linear kernel it seems obvious that any number of reference points should lead to the same classifier. What happens with a fixed degree polynomial kernel as the number of reference points increases?

4. Since the reference points can be moved around in the feature space, it seems extravagant to learn the betas as well. What happens when we fix the betas to random values uniformly distributed in [-1,1] and just learn the reference vectors? For what kernels do we obtain the same model as if we learned the betas as well?

5. I wonder if a similar thing can be done for support vector machines where a user specifies the kernel and the number of support vectors and the learning algorithm picks the required number of support vectors (not necessarily from the data set) such that the margin (on the training data) is maximized.

6. Ken pointed me to Archetypes, which is another related idea. In archetypal analysis the problem is to find a specified number of archetypes (reference vectors) such that all the points in the data set can be approximated as closely as possible by convex combinations of the archetypes. It does not directly relate to classification.

Categories: Classification, Estimation