## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

**Experiments on Synthetic Data**

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is a drawn from a mixture of two normal distributions (i.e., there are two *types* per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier will minimize the overall “error” on the training data.

However since an adversary may assign a different set of costs to the various types (than those given by the type frequencies) a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

**Example 1**

** **

** **

Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

**Example 2**

**Discussion**

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

## Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Gharahmani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I’ll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes which are mixtures of Gaussians in 2D (shown in red, blue and green). The left panel is the training data and the right panel are the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

**Example 1 (using 3 prototypes)
**

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

**Example 2 (five prototypes)
**

**Example 3 (five prototypes)
**

The right most panel shows the first two “transformed features”, i.e., the kernel values of the examples to the first two prototypes.

**Implementation details and discusssion**

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the prototypes (fixing the betas and the kernel params), and then another time for the kernel parameter. These three update steps are repeated until convergence.

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast — the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.

The kernel parameter updating worked best when the RBF kernel was re-parametrized as .

The learning rate for betas was annealed, but those of the prototypes and the kernel parameter was fixed at a constant value.

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.

## Online logistic regression

I like Bob Carpenter’s paper — “Lazy sparse online logistic regression …“. In addition to being a nice overview of logistic regression, it describes online training for logistic regression by stochastic gradient descent under various parameter priors.

Another cool feature is that if the feature dimensionality is large but the examples are sparse, only the parameters corresponding to the features that are non-zero (for the current example) need to be updated (this is the lazy part). It is super easy to implement (a few hundred lines in C, for an svm_light like stand-alone application) and trains very fast, as attested to by Leon Bottou.

There is one issue about the regularization discount in a truly online setting where there is no “end of epoch”, which was discussed by Carpenter. He suggests leaving it at a constant, which, as he points out, corresponds to steadily decreasing the variance of the prior with the number of examples.

In my implementation I used 1/(N_INIT+NumExamplesSeenThusFar), where N_INIT is some constant (say 100). The effect of this is that as the dataset becomes large the prior is ignored, as it should be. However, the earlier examples contribute less to the parameter estimates than later ones.

To allow for better representation capability with sufficient data, I implemented the polynomial degree 2 kernel by an explicit higher dimensional feature map. This is either cumbersome or impossible for other kernels. I will discuss a more general kernelization of the method in a later post.

(Update April 04 2010. Some slides for a basic tutorial on logistic regression, online learning, kernelization and sequence classification.)