## Random Fourier Features for Kernel Density Estimation

The NIPS paper Random Features for Large-Scale Kernel Machines by Rahimi and Recht presents a method for randomized feature mapping in which dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space.

We know that for any p.d. kernel there exists a *deterministic* map that has the aforementioned property but it may be infinite dimensional. The paper presents results indicating that with the randomized map we can get away with only a “small” number of features (at least for a classification setting).

Before applying the method to density estimation let us review the relevant section of the paper briefly.

**Bochner’s Theorem and Random Fourier Features**

Assume that we have data in $\mathbb{R}^d$ and a continuous p.d. kernel $K(x, y)$ defined for every pair of points $x, y \in \mathbb{R}^d$. Assume further that the kernel is shift-invariant, i.e., $K(x, y) = K(x - y)$, and that the kernel is scaled so that $K(0) = 1$.

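The theorem by Bochner states that under the above conditions $K(x - y)$ must be the Fourier transform of a non-negative measure on $\mathbb{R}^d$. In other words, there exists a probability density function $p(w)$ for $w \in \mathbb{R}^d$ such that

$$
\begin{aligned}
K(x - y) &= \int_{\mathbb{R}^d} p(w)\, e^{j w^T (x - y)}\, dw \\
&= \int_{\mathbb{R}^d} p(w) \cos\left(w^T (x - y)\right) dw \qquad (1) \\
&= E_w\left[\langle \phi_w(x), \phi_w(y) \rangle\right] \qquad (2)
\end{aligned}
$$

where (1) is because $K$ is real. Equation (2) says that if we draw a random vector $w$ according to $p(w)$ and form two vectors $\phi_w(x) = (\cos(w^T x), \sin(w^T x))$ and $\phi_w(y) = (\cos(w^T y), \sin(w^T y))$, then the expected value of $\langle \phi_w(x), \phi_w(y) \rangle$ is $K(x - y)$.

Therefore, for $x \in \mathbb{R}^d$, if we choose the transformation

$$\phi(x) = \frac{1}{\sqrt{D}} \left(\cos(w_1^T x), \sin(w_1^T x), \ldots, \cos(w_D^T x), \sin(w_D^T x)\right)$$

with $w_1, \ldots, w_D$ drawn according to $p(w)$, linear inner products in this transformed space will approximate $K(x - y)$.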

**Gaussian RBF Kernel**

The Gaussian radial basis function kernel satisfies all the above conditions and we know that the Fourier transform of the Gaussian is another Gaussian (with the reciprocal variance). Therefore for “linearizing” the Gaussian r.b.f. kernel, we draw samples from a Gaussian distribution for the transformation.
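Here is a minimal NumPy sketch of this recipe for the Gaussian kernel. It assumes the parametrization $K(x - y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, so that the frequencies are drawn from a zero-mean Gaussian with variance $1/\sigma^2$ per coordinate. The function names `sample_frequencies` and `rff_features` are mine, not from the paper.

```python
import numpy as np

def sample_frequencies(d, D, sigma, rng):
    """Draw D frequency vectors from N(0, sigma^-2 I) (assumed kernel parametrization)."""
    return rng.normal(scale=1.0 / sigma, size=(D, d))

def rff_features(X, W):
    """Map each row x of X to (cos(w_1.x), ..., cos(w_D.x), sin(w_1.x), ..., sin(w_D.x)) / sqrt(D)."""
    proj = X @ W.T                          # shape (n, D)
    D = W.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
sigma = 1.5
X = rng.normal(size=(5, 2))
W = sample_frequencies(d=2, D=5000, sigma=sigma, rng=rng)

Z = rff_features(X, W)
approx_gram = Z @ Z.T                       # inner products of random features

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
exact_gram = np.exp(-sq_dists / (2 * sigma ** 2))

max_abs_error = np.abs(approx_gram - exact_gram).max()
print("largest kernel approximation error:", max_abs_error)
```

The ordering of the cosine and sine coordinates differs from the interleaved form, which does not change any inner product. The error shrinks roughly like $1/\sqrt{D}$, so a few thousand features are needed for two-digit accuracy.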

**Parzen Window Density Estimation**

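Given a data set $\{x_1, x_2, \ldots, x_N\}$ in $\mathbb{R}^d$, the so-called Parzen window probability density estimator is defined as follows

$$\hat{p}(x) = \frac{1}{N h^d} \sum_{i=1}^{N} k\left(\frac{x - x_i}{h}\right)$$

where $k$ is often a positive, symmetric, shift-invariant kernel and $h$ is the bandwidth parameter that controls the scale of influence of the data points.

A common kernel that is used for Parzen window density estimation is the Gaussian density. If we make the same choice we can apply our feature transformation to linearize the procedure. Writing $K$ for the Gaussian kernel and $\phi$ for the random Fourier feature map, we have

$$\hat{p}(x) \propto \frac{1}{N} \sum_{i=1}^{N} K(x - x_i) \approx \frac{1}{N} \sum_{i=1}^{N} \langle \phi(x), \phi(x_i) \rangle = \left\langle \phi(x), \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) \right\rangle$$

where $h$ has been absorbed into the kernel variance.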

Therefore all we need to do is take the mean of the transformed data points and estimate the pdf at a new point to be (proportional to) the inner product of its transformed feature vector with the mean.
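The procedure can be sketched as follows. This is an illustration under my own assumptions (a 2D toy data set, a Gaussian kernel with scale `sigma`, and the Gaussian normalizer $2\pi\sigma^2$ to turn the kernel average into a density), compared against the plain Parzen window estimate.

```python
import numpy as np

def rff_features(X, W):
    """Random Fourier features (cos and sin of the projections), scaled by 1/sqrt(D)."""
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

rng = np.random.default_rng(1)
sigma, D = 0.5, 4000

# Toy data: a few dozen points from a mixture of two Gaussians.
data = np.vstack([
    rng.normal(loc=(-1.0, 0.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(1.0, 0.5), scale=0.3, size=(30, 2)),
])

W = rng.normal(scale=1.0 / sigma, size=(D, 2))   # frequencies for the Gaussian kernel
mean_feature = rff_features(data, W).mean(axis=0)
normalizer = 2 * np.pi * sigma ** 2               # Gaussian density normalizer in 2D

def rff_density(query):
    """Inner product of the query's features with the mean training feature."""
    return rff_features(query, W) @ mean_feature / normalizer

def parzen_density(query):
    """Plain Parzen window estimate with the same Gaussian kernel."""
    sq = ((query[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma ** 2)).mean(axis=1) / normalizer

query = np.array([[-1.0, 0.0], [0.0, 0.0], [3.0, 3.0]])
approx = rff_density(query)
exact = parzen_density(query)
print(np.column_stack([approx, exact]))
```

Note that nothing forces the approximate estimate to be non-negative far away from the data, which is one way in which it can differ from the exact estimate.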

Of course since the kernel value is only approximated by the inner product of the random Fourier features we expect that the estimated pdf will differ from a plain unadorned Parzen window estimate. But different how?

**Experiments**

Below are some pictures showing how the method performs on some synthetic data. I generated a few dozen points from a mixture of Gaussians and plotted contours of the estimated pdf for the region around the points. I did this for several choices of $D$ (the number of random features) and $\sigma$ (the scale parameter for the Gaussian kernel).

First let us check that the method performs as expected for large values of $D$, because the kernel value is then well approximated by the inner product of the Fourier features. The first 3 pictures are for a large $D$ and various values of $\sigma$.

—————————————————————————

—————————————————————————

Now let us see what happens when we decrease $D$. We expect the error in approximating the kernel to lead to an obviously erroneous pdf. This is clearly evident in the pictures for a smaller $D$.

—————————————————————————

—————————————————————————

The following picture, for yet another choice of $D$ and $\sigma$, is even stranger.

—————————————————————————

—————————————————————————

**Discussion**

It seems that even for a simple 2D example we need to compute a very large number of random Fourier features to make the estimated pdf accurate. (For this small example this is very wasteful, since a plain Parzen window estimate would require less memory and computation.)

However, the pictures do indicate that if the approach is to be used for outlier detection (aka novelty detection) *from a given data set*, we might be able to get away with a much smaller $D$. That is, even if the estimated pdf has a big error over the entire space, on the points from the data it seems to be reasonably accurate.

## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

**Experiments on Synthetic Data**

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two *types* per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier to minimize the overall “error” on the training data.

However since an adversary may assign a different set of costs to the various types (than those given by the type frequencies) a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

**Example 1**


Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

**Example 2**

**Discussion**

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

## Regularized Minimax for Robust Learning

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e., learning that is robust to data drift.

**Cost Sensitive Loss Functions**

Given a training data set $\{(x_i, y_i)\}_{i=1}^{N}$, most learning algorithms learn a classifier that is parametrized by a vector $w$ by minimizing a loss function

$$L(w) = \sum_{i=1}^{N} l_i(w) + \lambda f(w)$$

where $l_i(w)$ is the loss on example $i$ and $f$ is some function that penalizes complexity. For example for logistic regression (with labels $y_i \in \{-1, +1\}$) the loss function looks like

$$L(w) = \sum_{i=1}^{N} \log\left(1 + \exp(-y_i w^T x_i)\right) + \lambda \|w\|^2$$

for some $\lambda > 0$.

If, in addition, the examples came with costs $c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.)

$$L(w, c) = \sum_{i=1}^{N} c_i\, l_i(w) + \lambda f(w)$$

We further constrain $c_i \ge 0$ and $\sum_{i} c_i = N$. So the unweighted learning problem corresponds to the case where all $c_i = 1$.

**A Game Against An Adversary**

Assume that the learner is playing a game against an adversary that will assign the costs $c$ to the training examples that will lead to the worst possible loss for any weight vector $w$ the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the *minimax* solution

$$w^* = \arg\min_{w} \max_{c} L(w, c)$$

where $L(w, c)$ is the cost-weighted loss. For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

$$R(w, c) = L(w, c) + \gamma H(c)$$

where $H(c) = -\sum_{i} c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one).

The new minimax optimization problem can be posed as

$$w^* = \arg\min_{w} \max_{c} R(w, c)$$

subject to the constraints

$$c_i \ge 0, \qquad \sum_{i=1}^{N} c_i = N.$$

Note that the regularization term on the cost vector essentially restricts the set of possible cost vectors the adversary has at his disposal.

**Optimization**

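For convex loss functions (such as the logistic loss) the regularized objective $R(w, c) = L(w, c) + \gamma H(c)$ is convex in $w$ for a fixed cost assignment, therefore so is $\max_{c} R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$ and $c$ is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector $w$, if

$$c^* = \arg\max_{c} R(w, c)$$

and if $c^*$ is unique, then

$$\nabla_w \max_{c} R(w, c) = \nabla_w R(w, c^*)$$

even though $c^*$ is a function of $w$.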

**Algorithm**

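The algorithm is very simple. Perform until convergence the following

1. At the $t^{th}$ iteration, for the weight vector $w_t$ find the cost vector $c_t$ that maximizes $R(w_t, c) = L(w_t, c) + \gamma H(c)$.

2. Update $w_{t+1} = w_t - \eta \nabla_w R(w_t, c_t)$, where $\eta$ is the learning rate.

The maximization in step 1 is also simple and can be shown to be

$$c_i = \frac{N \exp\left(l_i(w_t) / \gamma\right)}{\sum_{j=1}^{N} \exp\left(l_j(w_t) / \gamma\right)}$$

where $l_i(w_t)$ is the loss on example $i$. As expected, if $\gamma \to \infty$, the costs remain close to one and as $\gamma \to 0$ the entire cost budget is allocated to the example with the largest loss.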
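The two steps above can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: logistic loss with labels in $\{-1, +1\}$, an L2 penalty with weight `lam`, costs constrained to sum to $N$, and a closed-form adversary step in which the costs are proportional to $\exp(\text{loss}/\gamma)$. All function names are hypothetical.

```python
import numpy as np

def logistic_losses(w, X, y):
    """Per-example loss log(1 + exp(-y_i * w.x_i)), labels in {-1, +1}."""
    return np.logaddexp(0.0, -y * (X @ w))

def adversary_costs(losses, gamma):
    """Step 1: costs proportional to exp(loss / gamma), rescaled to sum to N."""
    scaled = (losses - losses.max()) / gamma      # subtract the max for numerical stability
    weights = np.exp(scaled)
    return len(losses) * weights / weights.sum()

def minimax_fit(X, y, gamma, lam=1e-2, eta=0.05, n_iter=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        c = adversary_costs(logistic_losses(w, X, y), gamma)
        margins = y * (X @ w)
        neg_slope = np.exp(-np.logaddexp(0.0, margins))   # 1 / (1 + exp(margin))
        grad = -(X.T @ (c * y * neg_slope)) + 2 * lam * w
        w -= eta * grad / len(y)                  # step 2, with the step scaled by 1/N
    return w

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(40, 2)), np.ones((40, 1))])   # last column acts as a bias
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
w_hat = minimax_fit(X, y, gamma=1.0)

toy_losses = np.array([0.1, 0.5, 2.0])
print(adversary_costs(toy_losses, gamma=1e6))    # nearly uniform costs
print(adversary_costs(toy_losses, gamma=1e-3))   # all cost on the largest loss
```

By Danskin's theorem the gradient in step 2 treats the costs from step 1 as constants, which is why the code does not differentiate through `adversary_costs`.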

**Of types and tokens**

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going to see in general.

(The fact that “Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible *type* distribution rather than for the observed *token* distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type let be the set of all tokens of that type and be the number of tokens of that type. We now estimate by

under the same constraints on as above. Here is the observed type distribution in the training data and is the KL-divergence.

The algorithm is identical to the one above except the maximum over for a fixed is slightly different.

**Related Work and Discussion**

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to be one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ($w$ for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a *particular* user. Such a minimax approach may be suitable for optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.

**Acknowledgment**

I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.

## Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I’ll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes which are mixtures of Gaussians in 2D (shown in red, blue and green). The left panel is the training data and the right panel shows the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

**Example 1 (using 3 prototypes)**

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

**Example 2 (five prototypes)**

**Example 3 (five prototypes)**

The rightmost panel shows the first two “transformed features”, i.e., the kernel values of the examples to the first two prototypes.

**Implementation details and discussion**

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the prototypes (fixing the betas and the kernel params), and then another time for the kernel parameter. These three update steps are repeated until convergence.

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast — the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.

The kernel parameter updating worked best when the RBF kernel was re-parametrized as .

The learning rate for the betas was annealed, but those of the prototypes and the kernel parameter were fixed at a constant value.

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.

## Training data bias caused by active learning

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in *active learning* the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is *pool-based*. That is we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $p(x, y)$, we have data drawn from some distribution $q(x, y)$ (as always $x$ is the feature vector and $y$ is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general since $p(x, y)$ is unknown, if $q(x, y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

**The type of bias**

Assume that the marginal feature distribution of the labeled points after active learning is given by $q(x)$. Therefore $q(x)$ is the putative distribution from which we can assume the feature vectors with labels have been sampled.

For every feature vector $x$ thus sampled from $q(x)$ we request its label from the oracle, which returns a label according to the conditional distribution $p(y \mid x)$. That is, there is *no bias* in the conditional distribution. Therefore $q(x, y) = q(x)\, p(y \mid x)$. This type of bias has been called *covariate shift*.

**The data**

After actively sampling the labels $n$ times, let us say we have the following data: a biased labeled training data set $\{(x_i, y_i)\}_{i=1}^{n}$, where the feature vectors $x_i$ come from the original pool of unlabeled feature vectors.

Let us define $\beta(x) = p(x) / q(x)$, where $p(x)$ is the true marginal feature distribution and $q(x)$ is the marginal feature distribution of the labeled points. If $\beta(x)$ is large we expect the feature vector $x$ to be under-represented in the labeled data set and if it is small it is over-represented.

Now define $\beta_i = \beta(x_i)$ for each labeled example, for $i = 1, \ldots, n$. If we knew the values of $\beta_i$ we could correct for the bias during training and evaluation.

This paper by Huang *et al.*, and some of its references, deal with the estimation of $\beta_i$. *Remark*: This estimation needs to take into account that $E_{q}[\beta(x)] = 1$. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang *et al*.

**Evaluation of the classifier**

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $f$, and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate the weights $\beta_i$ of the labeled examples. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by $\beta_i$.

If the empirical loss on the biased sample is written as $\frac{1}{n} \sum_{i=1}^{n} l(f(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $\frac{1}{n} \sum_{i=1}^{n} \beta_i\, l(f(x_i), y_i)$.

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.
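As a small worked example of the weighted evaluation, here is a sketch with made-up numbers. The weights are assumed to be already estimated, and they are chosen so that their sample mean is one; `weighted_loss` is a hypothetical helper name.

```python
import numpy as np

def weighted_loss(losses, beta):
    """Importance-weighted estimate (1/n) * sum_i beta_i * loss_i."""
    losses = np.asarray(losses, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return float(np.mean(beta * losses))

# Hypothetical 0/1 errors of a classifier on four labeled examples.
errors = np.array([1.0, 0.0, 0.0, 1.0])
# Hypothetical weights: the first example is under-represented, the middle two over-represented.
beta = np.array([2.0, 0.5, 0.5, 1.0])

biased_estimate = float(errors.mean())             # (1 + 0 + 0 + 1) / 4 = 0.5
corrected_estimate = weighted_loss(errors, beta)   # (2 + 0 + 0 + 1) / 4 = 0.75
print(biased_estimate, corrected_estimate)
```

The error on the under-represented example counts double, so the corrected error estimate is higher than the naive one.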

**Learning the classifier**

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether *any* bias correction is necessary. Note that the posterior class distribution $p(y \mid x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on $p(x, y)$ will still converge to the Bayes error rate with examples drawn from $q(x, y)$.

For example imagine constructing a $k$-Nearest Neighbor classifier on the biased labeled dataset. If we let $k \to \infty$ and $k/n \to 0$, the classifier will converge to the Bayes-optimal classifier as $n \to \infty$, *even if $q(x)$ is biased*. This is somewhat paradoxical and can be explained by looking at the case of finite $n$.

For finite $n$, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.

## An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

**Traditional sparse kernel logistic regression**

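Consider an $M$ class logistic regression model given by

$$P(y = c \mid x) \propto \exp\left(\beta_{c0} + \sum_{j=1}^{d} \beta_{cj}\, x_j\right)$$

for $c = 1, \ldots, M$,

where $j$ indexes the $d$ features.

Fitting the model to a data set $\{(x_i, y_i)\}_{i=1}^{N}$ involves estimating the betas to maximize the likelihood of the data.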

The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by *kernelization*. We write

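$$P(y = c \mid x) \propto \exp\left(\beta_{c0} + \sum_{i=1}^{N} \beta_{ci}\, k(x, x_i)\right)$$

for $c = 1, \ldots, M$,

where $k$ is a kernel function and $x_1, \ldots, x_N$ are the training feature vectors.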

In order to be able to use this classifier at run-time we have to store all the training feature vectors as part of the model because we need to compute the kernel value of the test example to every one of them. This would be highly inefficient, not to mention the severe over-fitting of the model to the training data.

The solution to both the test time efficiency and the over-fitting problems is to enforce *sparsity*. That is, we somehow make sure that $\beta_{ci} = 0$ for all but a few examples $x_i$ from the training data. The import vector machine does this by greedily picking some $n < N$ examples so that the reduced $n$ example model best approximates the full model.

**Sparsification by randomized prototype selection**

The sparsified kernel logistic regression therefore looks like

$$P(y = c \mid x) \propto \exp\left(\beta_{c0} + \sum_{l=1}^{n} \beta_{cl}\, k(x, u_l)\right)$$

for $c = 1, \ldots, M$,

where the feature vectors $u_1, \ldots, u_n$ are from the training data set. We can see that all we are doing is a vanilla logistic regression on a transformed feature space. The original $d$ dimensional feature vector has been transformed into an $n$ dimensional vector, where each dimension measures the kernel value of our test example to a prototype vector (or reference vector) $u_l$.

What happens if we just selected these prototypes *randomly* instead of greedily as in the import vector machine?

Avrim Blum showed that if the training data distribution is such that the two classes can be linearly separated with a margin $\gamma$ in the feature space induced by the kernel function, then the classes can be, with high probability, linearly separated with margin $\gamma/2$ with low error, in the transformed feature space if we pick a sufficient number of prototypes *randomly*.

That’s a mouthful, but basically we can use Blum’s method for kernelizing logistic regression as follows. Just pick $n$ random vectors from your dataset (in fact they need not be labeled), compute the kernel value of an example to these points and use these as features to describe the example. We can then learn a straightforward logistic regression model on this $n$ dimensional feature space.

As Blum notes, $k$ need not even be a valid kernel for using this method. Any reasonable similarity function would work, except the above theoretical guarantee doesn’t hold.
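Here is a rough sketch of the random-prototype recipe on an XOR-like toy problem that a linear classifier on the raw features cannot solve. Everything specific is my own choice for illustration: the Gaussian kernel written as $\exp(-\lambda \|x - u\|^2)$, ten prototypes, and a bare-bones batch gradient ascent for the logistic regression.

```python
import numpy as np

def rbf_features(X, prototypes, lam):
    """z[i, l] = exp(-lam * ||x_i - u_l||^2): kernel value of example i to prototype l."""
    sq = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-lam * sq)

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def fit_logreg(Z, y, n_classes, eta=0.2, n_iter=3000):
    """Plain multiclass logistic regression by gradient ascent on the mean log-likelihood."""
    Zb = np.hstack([np.ones((len(Z), 1)), Z])     # prepend a bias column
    B = np.zeros((Zb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot labels
    for _ in range(n_iter):
        B += eta * Zb.T @ (Y - softmax(Zb @ B)) / len(Z)
    return B

rng = np.random.default_rng(0)
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
center_labels = np.array([0, 0, 1, 1])            # XOR-like class layout
idx = rng.integers(0, 4, size=200)
X = centers[idx] + 0.2 * rng.normal(size=(200, 2))
y = center_labels[idx]

# Pick prototypes at random from the data; their labels are never used.
prototypes = X[rng.choice(len(X), size=10, replace=False)]
Z = rbf_features(X, prototypes, lam=1.0)
B = fit_logreg(Z, y, n_classes=2)

pred = softmax(np.hstack([np.ones((len(Z), 1)), Z]) @ B).argmax(axis=1)
train_accuracy = (pred == y).mean()
print("training accuracy:", train_accuracy)
```

At test time only the ten prototypes and the betas need to be stored, rather than the whole training set.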

**Going a step further — Learning the reference vectors**

A key point to note is that there is no reason for the prototypes to be part of the training data. Any reasonable reference points in the original feature space would work. We just need to pick them so as to enable the resulting classifier to separate the classes well.

Therefore I propose kernelizing logistic regression by maximizing the log-likelihood with respect to the betas *as well as* the reference points. We can do this by gradient descent starting from $n$ random points from our data set. The requirement is that the kernel function be differentiable with respect to the reference point $u$. (Note. Learning vector quantization is a somewhat related idea.)

Because of obvious symmetries, the log-likelihood function is non-convex with respect to the reference vectors, but the local optima close to the randomly selected reference points are no worse than the random reference points themselves.

**The gradient with respect to a reference vector**

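Let us derive the gradient of the log-likelihood function with respect to a reference vector. First let us denote $k(x_i, u_l)$, i.e., the kernel value of the $i^{th}$ feature vector with the $l^{th}$ prototype, by $z_{il}$.

The log-likelihood of the data is given by

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{c=1}^{M} 1(y_i = c) \log P(y = c \mid x_i)$$

where $1(\cdot)$ is the usual indicator function. The gradient of $\mathcal{L}$ with respect to the $\beta$ parameters can be found in any textbook on logistic regression. The derivative of $\mathcal{L}$ with respect to the reference vector $u_l$ is

$$\frac{\partial \mathcal{L}}{\partial u_l} = \sum_{i=1}^{N} \frac{\partial \mathcal{L}}{\partial z_{il}} \frac{\partial z_{il}}{\partial u_l}$$

Putting it all together we have

$$\frac{\partial \mathcal{L}}{\partial u_l} = \sum_{i=1}^{N} \sum_{c=1}^{M} \left[1(y_i = c) - P(y = c \mid x_i)\right] \beta_{cl}\, \frac{\partial z_{il}}{\partial u_l}$$

where $\beta_{cl}$ is the weight of prototype $l$ for class $c$.

That’s it. We can update all the reference vectors in the direction given by the above gradient by an amount that is controlled by the learning rate.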
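One way to gain confidence in such a gradient is to compare it with finite differences. The sketch below does this for a Gaussian kernel that I assume to be parametrized as $k(x, u) = \exp(-\lambda \|x - u\|^2)$, for which the derivative of the kernel value with respect to the reference vector is $2\lambda (x - u)\, k(x, u)$. The array layout (`B[c, l]` for the betas, `b0[c]` for the biases) and all names are my own.

```python
import numpy as np

def kernel(X, U, lam):
    """z[i, l] = exp(-lam * ||x_i - u_l||^2) (assumed parametrization)."""
    sq = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-lam * sq)

def class_probs(Z, B, b0):
    A = b0 + Z @ B.T                      # A[i, c] = b0[c] + sum_l B[c, l] * z[i, l]
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(X, y, U, B, b0, lam):
    P = class_probs(kernel(X, U, lam), B, b0)
    return np.log(P[np.arange(len(y)), y]).sum()

def grad_reference_vectors(X, y, U, B, b0, lam):
    Z = kernel(X, U, lam)
    P = class_probs(Z, B, b0)
    Y = np.eye(B.shape[0])[y]             # one-hot indicator 1(y_i = c)
    G = (Y - P) @ B                       # G[i, l] = sum_c (1(y_i = c) - P_ic) * B[c, l]
    coeff = 2.0 * lam * G * Z             # multiplies (x_i - u_l) for each pair (i, l)
    return coeff.T @ X - coeff.sum(axis=0)[:, None] * U

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 3, size=20)
U = rng.normal(size=(4, 3))               # four reference vectors
B = rng.normal(size=(3, 4))
b0 = rng.normal(size=3)
lam = 0.5

analytic = grad_reference_vectors(X, y, U, B, b0, lam)
numeric = np.zeros_like(U)
eps = 1e-6
for l in range(U.shape[0]):
    for j in range(U.shape[1]):
        step = np.zeros_like(U)
        step[l, j] = eps
        numeric[l, j] = (log_likelihood(X, y, U + step, B, b0, lam)
                         - log_likelihood(X, y, U - step, B, b0, lam)) / (2 * eps)

max_gradient_gap = np.abs(analytic - numeric).max()
print("largest gap between analytic and numeric gradient:", max_gradient_gap)
```

A gradient ascent step on the reference vectors is then simply `U += eta * grad_reference_vectors(...)` for some learning rate `eta`.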

**Checking our sums**

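Let us check what happens if there is only one reference vector $u$ and $k(x, u) = x^T u$. That is, we use a linear kernel. We have

$$\frac{\partial z_i}{\partial u} = x_i$$

and therefore

$$\frac{\partial \mathcal{L}}{\partial u} = \sum_{i=1}^{N} \sum_{c=1}^{M} \left[1(y_i = c) - P(y = c \mid x_i)\right] \beta_c\, x_i$$

which is very similar to the gradient of $\mathcal{L}$ with respect to the $\beta$ parameter. This is reasonable because with a linear kernel we are essentially learning a logistic regression classifier on the original feature space, where $\beta_c u$ takes the place of the weight vector of class $c$.

If our kernel is the Gaussian radial basis function $k(x, u) = \exp(-\lambda \|x - u\|^2)$ we have

$$\frac{\partial z_{il}}{\partial u_l} = 2 \lambda\, (x_i - u_l)\, z_{il}$$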

**Learning the kernel parameters**

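Of course gradient descent can be used to update the parameters of the kernel as well. For example we can initialize the parameter $\lambda$ of the Gaussian r.b.f. kernel $k(x, u) = \exp(-\lambda \|x - u\|^2)$ to a reasonable value and optimize it to maximize the log-likelihood as well. The expression for the gradient with respect to the kernel parameter is

$$\frac{\partial \mathcal{L}}{\partial \lambda} = -\sum_{i=1}^{N} \sum_{c=1}^{M} \left[1(y_i = c) - P(y = c \mid x_i)\right] \sum_{l=1}^{n} \beta_{cl}\, \|x_i - u_l\|^2\, k(x_i, u_l)$$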

**Going online**

The optimization of the reference vectors can be done in an online fashion by stochastic gradient descent à la Bob Carpenter.

Is it better to update all the parameters of the model (betas, reference vectors, kernel parameters) at the same time or wait for one set (say the betas) to converge before updating the next set (reference vectors)?

**Miscellany**

1. Since conditional random fields are just generalized logistic regression classifiers, we can use the same approach to kernelize them. Even if all the features are binary, the reference vectors can be allowed to be continuous.

2. My colleague Ken Williams suggests keeping the model small by sparsifying the reference vectors themselves. The reference vectors can be encouraged to be sparse by imposing a Laplacian L1 prior.

3. The complexity of the resulting classifier can be controlled by the choice of the kernel and the number of reference vectors. I don’t have a good intuition about the effect of the two choices. For a linear kernel it seems obvious that any number of reference points should lead to the same classifier. What happens with a fixed degree polynomial kernel as the number of reference points increases?

4. Since the reference points can be moved around in the feature space, it seems extravagant to learn the betas as well. What happens when we fix the betas to random values uniformly distributed in [-1,1] and just learn the reference vectors? For what kernels do we obtain the same model as if we learned the betas as well?

5. I wonder if a similar thing can be done for support vector machines where a user specifies the kernel and the number of support vectors and the learning algorithm picks the required number of support vectors (not necessarily from the data set) such that the margin (on the training data) is maximized.

6. Ken pointed me to Archetypes, which is another related idea. In archetypal analysis the problem is to find a specified number of archetypes (reference vectors) such that all the points in the data set can be as closely approximated by convex sums of the archetypes as possible. It does not directly relate to classification.

## Estimation of a distribution from i.i.d. sums

Here’s an estimation problem that I ran into not long ago while working on a problem in entity co-reference resolution in natural language documents.

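Let $X$ be a random variable taking on values in the non-negative integers. We are given data $\{(n_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the sum of $n_i$ independent draws of $X$ for $i = 1, \ldots, N$. We are required to estimate the distribution of $X$ from the data.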

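For some distributions of $X$ we can use the method-of-moments. For example if $X \sim \mathrm{Poisson}(\lambda)$, we know that the mean of $X$ is $\lambda$. We can therefore estimate $\lambda$ as the sample mean, i.e., $\hat{\lambda} = \sum_i y_i / \sum_i n_i$, where $y_i$ is the $i^{th}$ observed sum and $n_i$ the number of draws in it. Because of the nice additive property of the parameters for sums of i.i.d. Poisson random variables, the maximum likelihood estimate also turns out to be the same as $\hat{\lambda}$.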
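A tiny numerical check of the Poisson case, with made-up counts. It assumes that each observed sum of $n_i$ draws is itself Poisson with rate $n_i \lambda$, so the log-likelihood (up to a constant) is $\sum_i [y_i \log(n_i \lambda) - n_i \lambda]$. The helper names are hypothetical.

```python
import numpy as np

def poisson_rate_from_sums(n, y):
    """Estimate lambda as (total of the observed sums) / (total number of draws)."""
    n = np.asarray(n, dtype=float)
    y = np.asarray(y, dtype=float)
    return y.sum() / n.sum()

def poisson_sum_log_lik(lam, n, y):
    """Log-likelihood of the sums, dropping the terms that do not depend on lambda."""
    n = np.asarray(n, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(y * np.log(n * lam) - n * lam))

n = np.array([3, 5, 2])     # number of draws behind each observed sum
y = np.array([7, 9, 4])     # observed sums
lambda_hat = poisson_rate_from_sums(n, y)   # 20 / 10 = 2.0

# The likelihood is lower on either side of the estimate.
for lam in (lambda_hat - 0.1, lambda_hat, lambda_hat + 0.1):
    print(lam, poisson_sum_log_lik(lam, n, y))
```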

The problem becomes more difficult when $X$ is say a six-sided die (i.e., the sample space is $\{1, 2, \ldots, 6\}$) and we would like to estimate the probability of the faces $p_1, \ldots, p_6$. How can one obtain the maximum likelihood estimate in such a case?
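I do not have a closed form, but the likelihood itself is at least easy to evaluate, since the distribution of a sum of independent draws is a repeated convolution of the face distribution. The sketch below only computes that likelihood for a candidate set of face probabilities; how best to maximize it (a generic numerical optimizer, EM, or something smarter) is left open. The function names are mine.

```python
import numpy as np

def sum_pmf(face_probs, n_draws):
    """PMF of the sum of n_draws i.i.d. rolls; entry s of the result is P(sum = s)."""
    single = np.concatenate([[0.0], face_probs])   # index = face value, and P(0) = 0
    pmf = np.array([1.0])                          # the sum of zero draws is 0
    for _ in range(n_draws):
        pmf = np.convolve(pmf, single)
    return pmf

def die_log_likelihood(face_probs, n, y):
    """Log-likelihood of observed sums y[i], each made of n[i] independent rolls."""
    return float(sum(np.log(sum_pmf(face_probs, ni)[yi]) for ni, yi in zip(n, y)))

fair = np.full(6, 1.0 / 6.0)
two_dice = sum_pmf(fair, 2)
print(two_dice[7])                                 # probability that two fair dice sum to 7

loaded = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])  # a hypothetical loaded die
n = [2, 3, 2]
y = [12, 17, 11]
print(die_log_likelihood(fair, n, y), die_log_likelihood(loaded, n, y))
```

For these (high) observed sums the loaded die gets the larger log-likelihood, as one would hope.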