### Archive

Archive for the ‘Estimation’ Category

## Random Fourier Features for Kernel Density Estimation

The NIPS paper Random Fourier Features for Large-scale Kernel Machines, by Rahimi and Recht presents a method for randomized feature mapping where dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space.

We know that for any p.d. kernel there exists a deterministic map that has the aforementioned property but it may be infinite dimensional. The paper presents results indicating that with the randomized map we  can get away with only a “small” number of features (at least for a classification setting).

Before applying the method to density estimation let us review the relevant section of the paper briefly.

Bochner’s Theorem and Random Fourier Features

Assume that we have data in $R^d$ and a continuous p.d. kernel $K(x,y)$ defined for every pair of points $x,y \in R^d$. Assume further that the kernel is shift-invariant, i.e., $K(x,y) = K(x-y) \triangleq K(\delta)$ and that the kernel is scaled so that $K(0) = 1$.

The theorem by Bochner states that under the above conditions $K(\delta)$ must be the Fourier transform of a non-negative measure on $R^d$. In other words, there exists a probability density function $p(\delta)$ for $\delta \in R^d$ such that $K(\delta) = \mathcal{F}(p(\delta))$.

where (1) is because $K(.)$ is real. Equation (2) says that if we draw a random vector $w$ according to $p(w)$ and form two vectors $\phi(x) = (cos(w^T x), sin(w^T x))$ and $\phi(y) = (cos(w^T y), sin(w^T y))$, then the expected value of $<\phi(x),\phi(y)>$ is $K(x-y)$.

Therefore, for $x \in R^d$, if we choose the transformation

$\phi(x) = \frac{1}{\sqrt{D}} (cos(w_1^T x), sin(w_1^T x), cos(w_2^T x), sin(w_2^T x), \ldots, cos(w_D^T x), sin(w_D^T x))$

with $w_1,\ldots, w_D$ drawn according to $p(w)$, linear inner products in this transformed space will approximate $K(.)$.

Gaussian RBF Kernel

The Gaussian radial basis function kernel satisfies all the above conditions and we know that the Fourier transform of the Gaussian is another Gaussian (with the reciprocal variance). Therefore for “linearizing” the Gaussian r.b.f. kernel, we draw $D$ samples from a Gaussian distribution for the transformation.

Parzen Window Density Estimation

Given a data  set  $\{x_1, x_2, \ldots, x_N\} \subset R^d$, the the so-called Parzen window probability density estimator is defined as follows

$\hat{p}(x) \propto \frac{1}{N} \sum_i K((x-x_i)/h)$

where $K(.)$ is often a positive, symmetric, shift-invariant kernel and $h$ is the bandwidth parameter that controls the scale of influence of the data points.

A common kernel that is used for Parzen window density estimation is the Gaussian density. If we make the same choice we can apply our feature transformation to linearize the procedure. We have

where $h$ has been absorbed into the kernel variance.

Therefore all we need to do is take the mean of the transformed data points and estimate the pdf at a new point to be (proportional to) the inner product its transformed feature vector with the mean.

Of course since the kernel value is only approximated by the inner product of the random Fourier features we expect that the estimate pdf will differ from a plain unadorned Parzen window estimate.  But different how?

Experiments

Below are some pictures showing how the method performs on some synthetic data. I generated a few dozen points from a mixture of Gaussians and plotted contours of the estimated pdf for the region around the points. I did this for several choices of $D$ and $\gamma$ (the scale parameter for the Gaussian kernel).

First let us check that the method performs as expected for large values of $D$ because the kernel value is well approximated by the inner product of the Fourier features. The first 3 pictures are for $D = 10000$ for various values of $\gamma$.

D = 10000 and gamma = 2.0

D = 10000 and gamma = 1.0

D = 10000 and gamma = 0.5

—————————————————————————

—————————————————————————

Now let us see what happens when we decrease $D$. We expect the error in approximating the kernel would lead to obviously erroneous pdf.  This is clearly evident for the case of $D=100$.

D=1000 and gamma = 1.0

D=100 and gamma = 1.0

—————————————————————————

—————————————————————————

The following picture for  $D = 1000$ and $\gamma = 2.0$ is even stranger.

D = 1000 and gamma = 2.0

—————————————————————————

—————————————————————————

Discussion

It seems that even for a simple 2D example, we seem to need to compute a very large number of random Fourier features to make the estimated pdf accurate. (For this small example this is very wasteful, since a plain Parzen window estimate would require less memory and computation.)

However, the pictures do indicate that if the approach is to be used for outlier detection (aka novelty detection) from a given data set, we might be able get away with much smaller $D$. That is, even if the estimated pdf has a big error on the entire space, on the points from the data it seems to be reasonably accurate.

## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

Experiments on Synthetic Data

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is a drawn from a  mixture of two normal distributions (i.e., there are two types per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier will minimize the overall “error” on the training data.

However since an adversary may assign a different set of costs to the various types (than those given by the type frequencies) a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

Example 1

Example1. Original training data set. Both the red and blue classes have two types in 9:1 ratio.

Example 1. Plain logistic regression. No minimax. Almost all of the red circles are misclassified.

Example1. Minimax with gamma = 0.1

Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

Example 2

Example2. Original training data set.

Example1. Logistic regression. No minimax.

Example2. Minimax with gamma = 0.5

Discussion

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

## Regularized Minimax for Robust Learning

March 13, 2010 1 comment

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $\phi$ that is parametrized by a vector $w$ by minimizing a loss function

where $l(x_i, y_i, w)$ is the loss on example $i$ and $f(w)$ is some function that penalizes complexity. For example for logistic regression the loss function looks like

for some $\lambda > 0$.

If, in addition, the examples came with costs $c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or minimize a cost-weighted loss function (see this paper by Zadrozny et. al. )

We further constrain $\sum_i^N c_i = N$ and $c_i \ge 0$. So the unweighted learning problem corresponds to the case where all $c_i = 1$.

Assume that the learner is playing a game against an adversary that will assign the costs $\{c_i\}_{i=1,\ldots,N}$ to the training examples that will lead to the worst possible loss for any weight vector the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the the minimax solution

For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

We define

where $H(c) = -\sum_i c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one).

The new minimax optimization problem can be posed as

subject to the constraints

Note that the regularization term on the cost vector $c$ essentially restricts the set of  possible cost vectors the adversary has at his disposal.

Optimization

For convex loss functions (such as the logistic loss) $L(w, c)$ is convex in $w$ for a fixed cost assignment, therefore so is $R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$ and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector $w$, if

and if $\tilde{c}$ is unique, then

even though $\tilde{c}$ is a function of $w$.

Algorithm

The algorithm is very simple. Perform until convergence the following

1. At the $k^{th}$ iteration, for the weight vector $w^{k}$ find the cost vector $\tilde{c}$ the maximizes $R(w^{k},c)$.

2. Update $w^{k+1} = w^{k} - \eta \nabla_w R(w^{k}, \tilde{c})$, where $\eta$ is the learning rate.

The maximization in step 1 is also simple and can be shown to be

As expected, if $\gamma \rightarrow \infty$, the costs remain close to one and as $\gamma \rightarrow 0$ the entire cost budget is allocated to the example with the largest loss.

Of types and tokens

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going see in general.

(The fact that ‘Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type $t$ let $S_t$ be the set of all tokens of that type and $N_t$ be the number of tokens of that type. We now estimate $w$ by

under the same constraints on $c$ as above. Here $q$ is the observed type distribution in the training data and $KL(.\|.)$ is the KL-divergence.

The algorithm is identical to the one above except the maximum over $c$ for a fixed $w$ is slightly different.

Related Work and Discussion

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et. al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ($w$ for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.

Acknowledgment

I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.

## Sparse online kernel logistic regression

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Gharahmani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I’ll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes which are mixtures of Gaussians in 2D (shown in red, blue and green). The left panel is the training data and the right panel are the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

Example 1 (using 3 prototypes)

After first iteration

After second iteration

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

Example 2 (five prototypes)

After first iteration

After 5 iterations

Example 3 (five prototypes)

After first iteration

The right most panel shows the first two “transformed features”, i.e., the kernel values of the examples to the first two prototypes.

After second iteration

Implementation details and discusssion

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the  prototypes (fixing the betas and the kernel params), and then another time for the kernel parameter. These three update steps are repeated until convergence.

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast — the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.

The kernel parameter updating worked best when the RBF kernel was re-parametrized as $K(x,u) = exp(-exp(\theta) ||x-u||^2)$.

The learning rate for betas was annealed, but those of the prototypes and the kernel parameter was fixed at a constant value.

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.

## Training data bias caused by active learning

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is pool-based. That is we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $P(x,y)$, we have data drawn from some distribution $\hat{P}(x,y)$ (as always $x$ is the feature vector and $y$ is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general since $P(x,y)$ is unknown, if $\hat{P}(x,y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

The type of bias

Assume that marginal feature distribution of the labeled points after active learning is given by $\hat{P}(x) = \sum_y\hat{P}(x,y)$. Therefore $\hat{P}(x)$ is the putative distribution from which we can assume the feature vectors with labels have been sampled from.

For every feature vector thus sampled from $\hat{P}(x)$ we request its label from the oracle which returns a label according to the conditional distribution $P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)}$.  That is there is no bias in the conditional distribution. Therefore $\hat{P}(x,y) = \hat{P}(x) P(y|x)$. This type of bias has been called covariate shift.

The data

After actively sampling the labels $n$ times, let us say we have the following data — a biased labeled training data set $\{x_i, y_i\}_{i=1,\ldots,n} \sim \hat{P}(x,y)$, where the feature vectors $x_i$ come from the original pool of $M$ unlabeled feature vectors  $\{x_i\}_{i=1,\ldots,M} \sim P(x)$

Let us define $\beta=\frac{P(x,y)}{\hat{P}(x,y)}=\frac{P(x)}{\hat{P}(x)}$. If $\beta$ is large we expect the feature vector $x$ to be under-represented in the labeled data set and if it is small it is over-represented.

Now define for each labeled example $\beta_i=\frac{P(x_i)}{\hat{P}(x_i)}$ for $i = 1,\ldots,n$. If we knew the values of $\{\beta_i\}$ we can correct for the bias during training and evaluation.

This paper by Huang et. al., and some of its references deal with the estimation of $\{\beta_i\}_{i=1,\ldots,n}$. Remark: This estimation needs take into account that $E_{\hat{P}(x)}[\beta_i] = 1$. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang et. al.

Evaluation of the classifier

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $f()$, and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate $\{\beta_i\}_{i=1,\ldots,n}$. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by $\{\beta_i\}$.

If the empirical loss on the biased sample is written as $R = \frac{1}{n} \sum_i l(f(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $R_c= \frac{1}{n} \sum_i \beta_i l(f(x_i), y_i)$.

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.

Learning the classifier

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss $R_c$ (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether any bias correction is necessary. Note that the posterior class distribution $P(y|x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on $P(x,y)$ will still converge to the Bayes error rate with examples drawn from $\hat{P}(x,y)$.

For example imagine constructing a $k$-Nearest Neighbor classifier on the biased labeled dataset.  If we let $k \rightarrow \infty$ and $\frac{k}{n} \rightarrow 0$, the classifier will converge to the Bayes-optimal classifier as $n \rightarrow \infty$, even if $\hat{P}(x)$ is biased. This is somewhat paradoxical and can be explained by looking at the case of finite $n$.

For finite $n$, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.

## An effective kernelization of logistic regression

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

Consider an $M$ class logistic regression model given by

$P(y|x)\propto\mbox{exp}(\beta_{y0} + \sum_{j}^{d}\beta_{yj}x_j)$ for $y =0,1,\ldots,M$

where $j$ indexes the $d$ features.

Fitting the model to a data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$ involves estimating the betas to maximize the likelihood of $D$.

The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write

$P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^N \beta_{yi} k(x,x_i))$ for $y=0,1,\ldots,M$.

where $k(.,.)$ is a kernel function.

In order to be able to use this classifier at run-time we have to store all the training feature vectors as part of the model because we need to compute the kernel value of the test example to every one of them. This would be highly inefficient, not to mention the severe over-fitting of the model to the training data.

The solution to both the test time efficiency and the over-fitting problems is to enforce sparsity. That is we somehow make sure that $\beta_{yi} =0$ for all but a few examples $x_i$ from the training data. The import vector machine does this by greedily picking some $n < N$ examples so that the reduced $n$ example model best approximates the full model.

Sparsification by randomized prototype selection

The sparsified kernel logistic regression therefore looks like

$P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^n\beta_{yi} k(x,u_i))$ for $y=0,1,\ldots,M$.

where the feature vectors $u_i$ are from the training data set. We can see that all we are doing is a vanilla logistic regression on a transformed feature space. The original $d$ dimensional feature vector has been transformed into an $n$ dimensional vector, where each dimension measures the kernel value of our test example $x$ to a prototype vector (or reference vector) $u_i$.

What happens if we just selected these $n$ prototypes randomly instead of greedily as in the import vector machine?

Avrim Blum showed that if the training data distribution is such that the two classes can be linearly separated with a margin $\gamma$ in the feature space induced by kernel function, then the classes can be, with high probability, linearly separated with margin $\gamma/2$ with low error, in the transformed feature space if we pick a sufficient number of prototypes randomly.

That’s a mouthful, but basically we can use Blum’s method for kernelizing logistic regression as follows. Just pick $n$ random vectors from your dataset (in fact they need not be labeled), compute the kernel value of an example to these $n$ points and use these as $n$ features to describe the example. We can then learn a straightforward logistic regression model on this $n$ dimensional feature space.

As Blum notes, $k(.,.)$ need not even be a valid kernel for using this method. Any reasonable similarity function would work, except the above theoretical guarantee doesn’t hold.

Going a step further — Learning the reference vectors

A key point to note is that there is no reason for the prototypes $\{u_1, u_2,\ldots,u_n\}$ to be part of the training data. Any reasonable reference points in the original feature space would work. We just need to pick them so as to enable the resulting classifier to separate the classes well.

Therefore I propose kernelizing logistic regression by maximizing the log-likelihood with respect to  the betas as well as the reference points. We can do this by gradient descent starting from a random $n$ points from our data set. The requirement is that the kernel function be differentiable with respect to the reference point $u$. (Note. Learning vector quantization is a somewhat related idea.)

Because of obvious symmetries, the log-likelihood function is non-convex with respect to the reference vectors, but  the local optima close to the randomly selected reference points are no worse than than the random reference points themselves.

The gradient with respect to a reference vector

Let us derive the gradient of the log-likelihood function with respect to a reference vector. First let us denote $k(x_i, u_j)$, i.e., the kernel value of the $i^{th}$ feature vector with the $j^{th}$ prototype by $z_{ij}$.

The log-likelihood of the data is given by

$L = \sum_{i=1}^N \sum_{y=1}^M \mbox{log}P(y|x_i) I(y=y_i)$

where $I(.)$ is the usual indicator function. The gradient of $L$ with respect to the parameters $\beta$ can be found in any textbook on logistic regression. The derivative of $P(y|x_i)$ with respect to the reference vector $u_l$ is

Putting it all together we have

That’s it. We can update all the reference vectors in the direction given by the above gradient by an amount that is controlled by the learning rate.

Checking our sums

Let us check what happens if there is only one reference vector $u_1$ and $z_{i1} = k(x_i, u_1) = $. That is, we use a linear kernel. We have

$\frac{\partial}{\partial u_1} z_{i1} = x_i$ and therefore

$\frac{\partial}{\partial u_1} L = \sum_{i=1}^N x_i[\beta_{y1} I(y=y_i) - \sum_{y=1}^M \beta_{y1} P(y|x_i)]$

which is very similar to the gradient of $L$ with respect to $\beta$ parameter. This is reasonable because with a linear kernel we are essentially learning a logistic regression classifier on the original feature space, where $beta_{y1} u_1$ takes the place of $\beta_y$.

If our kernel is the Gaussian radial basis function we have

$\frac{\partial}{\partial u_l} z_{il} = \frac{\partial}{\partial u_l} \mbox{exp}(-\lambda||x_i-u_l||^2) = 2\lambda (x_i - u_l) z_{il}$

Learning the kernel parameters

Of course gradient descent can be used to update the parameters of the kernel as well. For example we can initialize the parameter $\lambda$ of the Gaussian r.b.f. kernel to a reasonable value and optimize it to maximize the log-likelihood as well. The expression for the gradient with respect to the kernel parameter is

Going online

The optimization of the reference vectors can be done in an online fashion by stochastic gradient descent ala Bob Carpenter.

Is it better to update all the parameters of the model (betas, reference vectors, kernel parameters) at the same time or wait for one set (say the betas) to converge before updating the next set (reference vectors)?

Miscellany

1. Since conditional random fields are just generalized logistic regression classifiers, we can use the same approach to kernelize them. Even if the all the features are binary, the reference vectors can be allowed to be continuous.

2. My colleague Ken Williams suggests keeping the model small by sparsifying the reference vectors themselves. The reference vectors can be encouraged to be sparse by imposing a Laplacian L1 prior.

3. The complexity of the resulting classifier can be controlled by the choice of the kernel and the number of reference vectors. I don’t have a good intuition about the effect of the two choices. For a linear kernel it seems obvious that any number of reference points should lead to the same classifier. What happens with a fixed degree polynomial kernel as the number of reference points increases?

4. Since the reference points can be moved around in the feature space, it seems extravagant to learn the betas as well. What happens when we fix the betas to random values uniformly distributed in [-1,1] and just learn the reference vectors? For what kernels do we obtain the same model as if we learned the betas as well?

5. I wonder if a similar thing can be done for support vector machines where a user specifies the kernel and the number of support vectors and the learning algorithm picks the required number of support vectors (not necessarily from the data set) such that the margin (on the training data) is maximized.

6. Ken pointed me to Archetypes, which is another related idea. In archetypal analysis the problem is to find a specified number of archetypes (reference vectors) such that all the points the data set can be as closely approximated by convex sums of the archetypes as possible. Does not directly relate to classification.

Categories: Classification, Estimation

## Estimation of a distribution from i.i.d. sums

Let $X$ be a random variable taking on values in $\{0,1,2,\ldots\}$.  We are given data $D=\{(N_1,S_1), (N_2,S_2),\ldots,(N_k,S_k)\}$, where $S_i$ is the sum of $N_i$ independent draws of $X$ for $i = 1, 2,\ldots, k$. We are required to estimate the distribution of $X$ from $D$.
For some distributions of $X$ we can use the method-of-moments. For example if $X \sim \mbox{Poisson}(\lambda)$, we know that the mean of $X$ is $\lambda$. We can therefore estimate $\lambda$ as the sample mean, i.e., $\hat{\lambda}=\frac{S_1+S_2+\ldots+S_k}{N_1+N_2+\ldots+N_k}$. Because of the nice additive property of the parameters for sums of i.i.d. poisson random variables, the maximum likelihood estimate also turns out be the same as $\hat{\lambda}$.
The problem becomes more difficult when $X$ is say a six-sided die (i.e., the sample space is $\{1,2,3,4,5,6\}$) and we would like to estimate the probability of the faces . How can one obtain the maximum likelihood estimate in such a case?