## Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I’ll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes, each of which is a mixture of Gaussians in 2D (shown in red, blue and green). The left panel shows the training data and the right panel shows the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

**Example 1 (using 3 prototypes)**

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

**Example 2 (five prototypes)**

**Example 3 (five prototypes)**

The rightmost panel shows the first two “transformed features”, i.e., the kernel values of the examples to the first two prototypes.

**Implementation details and discussion**

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the prototypes (fixing the betas and the kernel params), and then another time for the kernel parameter. These three update steps are repeated until convergence.
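A minimal sketch of this three-pass schedule, assuming each pass is available as a function that updates its own parameter group in place. The name `train_alternating`, the callback arguments, and the stopping rule (change in log-likelihood below `tol`) are all hypothetical; the post does not say how convergence was measured. The toy objective at the bottom only stands in for the real log-likelihood.

```python
def train_alternating(update_betas, update_prototypes, update_kernel_param,
                      log_likelihood, tol=1e-6, max_rounds=1000):
    """Repeat the three passes until the objective stops changing.

    Each update_* callback makes one full pass over the data and changes
    only its own parameter group. Returns the number of rounds performed.
    """
    prev = log_likelihood()
    for round_no in range(1, max_rounds + 1):
        update_betas()          # prototypes and kernel parameter held fixed
        update_prototypes()     # betas and kernel parameter held fixed
        update_kernel_param()   # betas and prototypes held fixed
        cur = log_likelihood()
        if abs(cur - prev) < tol:
            return round_no
        prev = cur
    return max_rounds


# Toy stand-in: three scalar "parameter groups" and a concave objective.
state = {"beta": 0.0, "proto": 0.0, "theta": 0.0}
target = {"beta": 1.0, "proto": 2.0, "theta": 3.0}


def objective():
    return -sum((state[k] - target[k]) ** 2 for k in state)


def make_pass(name, lr=0.25):
    def one_pass():
        # One gradient-ascent step on the toy objective for this group only.
        state[name] += lr * 2.0 * (target[name] - state[name])
    return one_pass


rounds = train_alternating(make_pass("beta"), make_pass("proto"),
                           make_pass("theta"), objective)
```

In a real implementation each callback would loop over the examples and apply stochastic gradient steps to its group.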

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast — the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.
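The two approximations can be sketched as follows for a sparse binary example given by the list of its ON feature indices. This is an illustration under assumptions, not the actual Java implementation: the function names are hypothetical, and the kernel is taken to be `exp(-lam * squared_distance)` with an assumed parameter `lam`.

```python
import math


def rbf_on_features(on_idx, prototype, lam):
    """Approximate RBF kernel: the squared distance is accumulated only over
    the coordinates where the binary example is ON (value 1)."""
    sq = sum((1.0 - prototype[j]) ** 2 for j in on_idx)
    return math.exp(-lam * sq)


def lazy_prototype_step(on_idx, prototype, coeff, lam, lr):
    """Lazy update: only the ON coordinates of the prototype are touched.

    coeff is the scalar that multiplies dk/du in the log-likelihood gradient
    for this example; it is assumed to be computed elsewhere.
    """
    k = rbf_on_features(on_idx, prototype, lam)
    for j in on_idx:
        # d/du_j exp(-lam * (1 - u_j)^2 - ...) = 2 * lam * (1 - u_j) * k
        prototype[j] += lr * coeff * 2.0 * lam * (1.0 - prototype[j]) * k
    return prototype
```

With about 20 ON features out of roughly 7000, each step then costs on the order of 20 operations per prototype instead of 7000.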

The kernel parameter updating worked best when the RBF kernel was re-parametrized as .

The learning rate for the betas was annealed, but those of the prototypes and the kernel parameter were fixed at constant values.

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.

## An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

**Traditional sparse kernel logistic regression**

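Consider an $M$-class logistic regression model given by

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\beta_{y0} + \sum_{j=1}^{d} \beta_{yj}\, x_j\Big)$$

for $y = 1, \dots, M$,

where $j$ indexes the $d$ features and $Z(x)$ normalizes the probabilities so that they sum to one over the classes.

Fitting the model to a data set $D = \{(x_i, y_i)\}_{i=1}^{N}$ involves estimating the betas to maximize the likelihood of $D$.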

The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by *kernelization*. We write

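$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\beta_{y0} + \sum_{i=1}^{N} \beta_{yi}\, k(x, x_i)\Big)$$

for $y = 1, \dots, M$,

where $k(\cdot, \cdot)$ is a kernel function and $x_1, \dots, x_N$ are the training feature vectors.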

In order to be able to use this classifier at run-time we have to store all the training feature vectors as part of the model because we need to compute the kernel value of the test example to every one of them. This would be highly inefficient, not to mention the severe over-fitting of the model to the training data.

The solution to both the test time efficiency and the over-fitting problems is to enforce *sparsity*. That is, we somehow make sure that $\beta_{yi} = 0$ for all but a few examples $x_i$ from the training data. The import vector machine does this by greedily picking some $n < N$ examples so that the reduced $n$-example model best approximates the full model.

**Sparsification by randomized prototype selection**

The sparsified kernel logistic regression therefore looks like

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\beta_{y0} + \sum_{l=1}^{n} \beta_{yl}\, k(x, u_l)\Big)$$

for $y = 1, \dots, M$,

where the feature vectors $u_l$ are from the training data set. We can see that all we are doing is a vanilla logistic regression on a transformed feature space. The original $d$-dimensional feature vector has been transformed into an $n$-dimensional vector, where each dimension measures the kernel value of our test example $x$ to a prototype vector (or reference vector) $u_l$.
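To make the "vanilla logistic regression on a transformed feature space" reading concrete, here is a small sketch that computes the class probabilities from a set of prototypes. The names `rbf` and `class_probs`, the layout of `betas` (one list per class, bias first), and the Gaussian kernel with parameter `lam` are assumptions made for illustration.

```python
import math


def rbf(x, u, lam=1.0):
    """Gaussian RBF kernel exp(-lam * ||x - u||^2); lam is an assumed parametrization."""
    return math.exp(-lam * sum((a - b) ** 2 for a, b in zip(x, u)))


def class_probs(x, betas, prototypes, kernel=rbf):
    """Return [P(y | x) for each class y].

    betas[y] = [bias_y, beta_y1, ..., beta_yn], one weight per prototype.
    """
    feats = [kernel(x, u) for u in prototypes]           # the n transformed features
    scores = [b[0] + sum(w * f for w, f in zip(b[1:], feats)) for b in betas]
    m = max(scores)                                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)                                        # plays the role of the normalizer Z(x)
    return [e / z for e in exps]
```

Only the prototypes and the betas need to be stored at run-time, so the cost of a prediction grows with the number of prototypes rather than with the size of the training set.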

What happens if we just selected these prototypes *randomly* instead of greedily as in the import vector machine?

Avrim Blum showed that if the training data distribution is such that the two classes can be linearly separated with a margin $\gamma$ in the feature space induced by the kernel function, then, with high probability, the classes can be linearly separated with margin $\gamma/2$ and low error in the transformed feature space, provided we pick a sufficient number of prototypes *randomly*.

That’s a mouthful, but basically we can use Blum’s method for kernelizing logistic regression as follows. Just pick $n$ random vectors from your dataset (in fact they need not be labeled), compute the kernel value of an example to these points and use these as features to describe the example. We can then learn a straightforward logistic regression model on this $n$-dimensional feature space.
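A minimal sketch of the random-prototype feature map, using only the standard library. The function names and the `seed` argument are hypothetical, and the similarity function shown is a Gaussian RBF with an assumed width parameter; the resulting rows can be fed to any ordinary logistic regression trainer.

```python
import math
import random


def similarity(x, u, lam=0.5):
    """Gaussian RBF used as the similarity function (an assumed choice)."""
    return math.exp(-lam * sum((a - b) ** 2 for a, b in zip(x, u)))


def pick_random_prototypes(X, n, seed=0):
    """Choose n distinct examples uniformly at random; labels are not needed."""
    rng = random.Random(seed)
    return [list(x) for x in rng.sample(X, n)]


def kernel_features(X, prototypes, kernel=similarity):
    """Map every example to its vector of kernel values against the prototypes."""
    return [[kernel(x, u) for u in prototypes] for x in X]


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
prototypes = pick_random_prototypes(X, n=3, seed=1)
F = kernel_features(X, prototypes)   # 6 rows, 3 transformed features each
```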

As Blum notes, $k$ need not even be a valid kernel for using this method. Any reasonable similarity function would work, except the above theoretical guarantee doesn’t hold.

**Going a step further — Learning the reference vectors**

A key point to note is that there is no reason for the prototypes to be part of the training data. Any reasonable reference points in the original feature space would work. We just need to pick them so as to enable the resulting classifier to separate the classes well.

Therefore I propose kernelizing logistic regression by maximizing the log-likelihood with respect to the betas *as well as* the reference points. We can do this by gradient descent starting from $n$ random points from our data set. The requirement is that the kernel function $k(x, u_l)$ be differentiable with respect to the reference point $u_l$. (Note. Learning vector quantization is a somewhat related idea.)

Because of obvious symmetries, the log-likelihood function is non-convex with respect to the reference vectors, but the local optima close to the randomly selected reference points are no worse than the random reference points themselves.

**The gradient with respect to a reference vector**

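Let us derive the gradient of the log-likelihood function with respect to a reference vector. First let us denote $k(x_i, u_l)$, i.e., the kernel value of the feature vector $x_i$ with the prototype $u_l$, by $k_{il}$.

The log-likelihood of the data is given by

$$L = \sum_{i=1}^{N} \sum_{y=1}^{M} I(y_i = y)\, \log P(y \mid x_i)$$

where $I(\cdot)$ is the usual indicator function. The gradient of $L$ with respect to the $\beta$ parameters can be found in any textbook on logistic regression. The derivative of $\log P(y \mid x_i)$ with respect to the reference vector $u_l$ is

$$\frac{\partial \log P(y \mid x_i)}{\partial u_l} = \Big(\beta_{yl} - \sum_{y'=1}^{M} P(y' \mid x_i)\, \beta_{y'l}\Big) \frac{\partial k_{il}}{\partial u_l}$$

Putting it all together we have

$$\frac{\partial L}{\partial u_l} = \sum_{i=1}^{N} \sum_{y=1}^{M} \big(I(y_i = y) - P(y \mid x_i)\big)\, \beta_{yl}\, \frac{\partial k_{il}}{\partial u_l}$$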

That’s it. We can update all the reference vectors in the direction given by the above gradient by an amount that is controlled by the learning rate.
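As a sanity check on this kind of derivation, the analytic gradient can be compared against a finite difference of the log-likelihood. The sketch below assumes a Gaussian RBF kernel written as `exp(-lam * ||x - u||^2)` and a `betas[y] = [bias, beta_y1, ..., beta_yn]` layout; all function names are hypothetical.

```python
import math


def rbf(x, u, lam):
    return math.exp(-lam * sum((a - b) ** 2 for a, b in zip(x, u)))


def probs(x, betas, protos, lam):
    """Class probabilities P(y | x) of the sparsified model."""
    feats = [rbf(x, u, lam) for u in protos]
    scores = [b[0] + sum(w * f for w, f in zip(b[1:], feats)) for b in betas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def log_lik(X, Y, betas, protos, lam):
    return sum(math.log(probs(x, betas, protos, lam)[y]) for x, y in zip(X, Y))


def grad_wrt_prototype(X, Y, betas, protos, lam, l):
    """Gradient of the log-likelihood with respect to prototype number l."""
    u = protos[l]
    g = [0.0] * len(u)
    for x, y in zip(X, Y):
        p = probs(x, betas, protos, lam)
        # sum over classes of (I(y_i = c) - P(c | x_i)) * beta_{cl}
        coeff = sum(((1.0 if c == y else 0.0) - p[c]) * betas[c][l + 1]
                    for c in range(len(betas)))
        k = rbf(x, u, lam)
        for j in range(len(u)):
            # dk/du_j = 2 * lam * (x_j - u_j) * k for the RBF kernel
            g[j] += coeff * 2.0 * lam * (x[j] - u[j]) * k
    return g
```

Perturbing one coordinate of a prototype by a small `eps` and differencing `log_lik` should reproduce the corresponding entry of `grad_wrt_prototype` to several decimal places.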

**Checking our sums**

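Let us check what happens if there is only one reference vector $u_1$ and $k(x, u_1) = x^\top u_1$. That is, we use a linear kernel. We have

$$\frac{\partial k_{i1}}{\partial u_1} = x_i$$

and therefore

$$\frac{\partial L}{\partial u_1} = \sum_{i=1}^{N} \sum_{y=1}^{M} \big(I(y_i = y) - P(y \mid x_i)\big)\, \beta_{y1}\, x_i$$

which is very similar to the gradient of $L$ with respect to the $\beta$ parameter. This is reasonable because with a linear kernel we are essentially learning a logistic regression classifier on the original feature space, where $\beta_{y1} u_1$ takes the place of the weight vector $\beta_y$.

If our kernel is the Gaussian radial basis function, written here as $k(x, u) = \exp(-\lambda \lVert x - u \rVert^2)$, we have

$$\frac{\partial k_{il}}{\partial u_l} = 2 \lambda\, (x_i - u_l)\, k_{il}$$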

**Learning the kernel parameters**

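Of course gradient descent can be used to update the parameters of the kernel as well. For example we can initialize the parameter $\lambda$ of the Gaussian r.b.f. kernel $k(x, u) = \exp(-\lambda \lVert x - u \rVert^2)$ to a reasonable value and optimize it to maximize the log-likelihood as well. The expression for the gradient with respect to the kernel parameter is

$$\frac{\partial L}{\partial \lambda} = -\sum_{i=1}^{N} \sum_{y=1}^{M} \big(I(y_i = y) - P(y \mid x_i)\big) \sum_{l=1}^{n} \beta_{yl}\, \lVert x_i - u_l \rVert^2\, k_{il}$$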
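A small sketch of the kernel-parameter gradient, assuming the Gaussian kernel is parametrized as `exp(-lam * ||x - u||^2)` with a single shared `lam`. That parametrization, the `betas` layout (bias first) and the function names are assumptions for illustration.

```python
import math


def sq_dist(x, u):
    return sum((a - b) ** 2 for a, b in zip(x, u))


def probs(x, betas, protos, lam):
    """Class probabilities P(y | x) of the sparsified model."""
    feats = [math.exp(-lam * sq_dist(x, u)) for u in protos]
    scores = [b[0] + sum(w * f for w, f in zip(b[1:], feats)) for b in betas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def log_lik(X, Y, betas, protos, lam):
    return sum(math.log(probs(x, betas, protos, lam)[y]) for x, y in zip(X, Y))


def grad_wrt_lambda(X, Y, betas, protos, lam):
    """Gradient of the log-likelihood with respect to the kernel parameter lam."""
    g = 0.0
    for x, y in zip(X, Y):
        p = probs(x, betas, protos, lam)
        for c in range(len(betas)):
            err = (1.0 if c == y else 0.0) - p[c]
            for l, u in enumerate(protos):
                d2 = sq_dist(x, u)
                # dk/dlam = -||x - u||^2 * k for this parametrization
                g += err * betas[c][l + 1] * (-d2) * math.exp(-lam * d2)
    return g
```

Because `lam` must stay positive, a plain gradient step can overshoot into invalid values; optimizing a transformed parameter instead is one common way around that.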

**Going online**

The optimization of the reference vectors can be done in an online fashion by stochastic gradient descent à la Bob Carpenter.

Is it better to update all the parameters of the model (betas, reference vectors, kernel parameters) at the same time or wait for one set (say the betas) to converge before updating the next set (reference vectors)?
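The first option, updating everything at once, could look like the following single stochastic step. This is only a sketch under assumptions: a Gaussian kernel `exp(-lam * ||x - u||^2)`, `betas[y] = [bias, beta_y1, ..., beta_yn]`, separate hypothetical learning rates per group, and all gradients evaluated at the old parameter values before anything is changed.

```python
import math


def forward(x, betas, protos, lam):
    """Return (class probabilities, kernel features) for one example."""
    feats = [math.exp(-lam * sum((a - b) ** 2 for a, b in zip(x, u)))
             for u in protos]
    scores = [b[0] + sum(w * f for w, f in zip(b[1:], feats)) for b in betas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps], feats


def sgd_step(x, y, betas, protos, lam, lr_beta=0.1, lr_proto=0.1, lr_lam=0.01):
    """One online step on example (x, y). Updates betas and protos in place
    and returns the new kernel parameter."""
    p, feats = forward(x, betas, protos, lam)
    err = [(1.0 if c == y else 0.0) - p[c] for c in range(len(betas))]
    # Per-prototype coefficient sum_c err_c * beta_{cl}, using the OLD betas.
    coeff = [sum(err[c] * betas[c][l + 1] for c in range(len(betas)))
             for l in range(len(protos))]
    sqd = [sum((a - b) ** 2 for a, b in zip(x, u)) for u in protos]
    grad_lam = sum(coeff[l] * (-sqd[l]) * feats[l] for l in range(len(protos)))
    for c, b in enumerate(betas):                 # betas
        b[0] += lr_beta * err[c]
        for l, f in enumerate(feats):
            b[l + 1] += lr_beta * err[c] * f
    for l, u in enumerate(protos):                # reference vectors
        for j in range(len(u)):
            u[j] += lr_proto * coeff[l] * 2.0 * lam * (x[j] - u[j]) * feats[l]
    return lam + lr_lam * grad_lam                # kernel parameter
```

Note that when the betas start at zero the first step moves only the betas, since the prototype and kernel-parameter gradients are both proportional to them.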

**Miscellany**

1. Since conditional random fields are just generalized logistic regression classifiers, we can use the same approach to kernelize them. Even if all the features are binary, the reference vectors can be allowed to be continuous.

2. My colleague Ken Williams suggests keeping the model small by sparsifying the reference vectors themselves. The reference vectors can be encouraged to be sparse by imposing a Laplace (L1) prior.

3. The complexity of the resulting classifier can be controlled by the choice of the kernel and the number of reference vectors. I don’t have a good intuition about the effect of the two choices. For a linear kernel it seems obvious that any number of reference points should lead to the same classifier. What happens with a fixed degree polynomial kernel as the number of reference points increases?

4. Since the reference points can be moved around in the feature space, it seems extravagant to learn the betas as well. What happens when we fix the betas to random values uniformly distributed in [-1,1] and just learn the reference vectors? For what kernels do we obtain the same model as if we learned the betas as well?

5. I wonder if a similar thing can be done for support vector machines where a user specifies the kernel and the number of support vectors and the learning algorithm picks the required number of support vectors (not necessarily from the data set) such that the margin (on the training data) is maximized.

6. Ken pointed me to Archetypes, which is another related idea. In archetypal analysis the problem is to find a specified number of archetypes (reference vectors) such that all the points in the data set can be approximated as closely as possible by convex combinations of the archetypes. It does not directly relate to classification.
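On item 3, the linear-kernel case can be spelled out in one line. Writing $u_1, \dots, u_n$ for the reference vectors and $\beta_{yl}$ for the coefficient of class $y$ on reference vector $l$ (notation assumed here), the score of class $y$ under $k(x, u) = x^\top u$ is

$$\beta_{y0} + \sum_{l=1}^{n} \beta_{yl}\, x^\top u_l = \beta_{y0} + x^\top w_y, \qquad w_y = \sum_{l=1}^{n} \beta_{yl}\, u_l .$$

So the score is linear in $x$ for every $n$, and the effective weight vectors $w_y$ lie in the span of the reference vectors. The family of reachable classifiers is therefore the same once that span is rich enough; for two classes, a single movable reference vector already suffices, because only the difference $w_1 - w_2$ matters.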