BWT for NLP (1)

September 26, 2009 Leave a comment

The Burrows-Wheeler transform (BWT), which is the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea and can be useful for some string processing for natural language processing.  I will present one such use.

BWT massages the original string into being more amenable to compression. Of course the transform doesn’t alter the compressibility (entropy rate) of the original string. All it does is make the string more compressible by algorithms we know.

The reason string permutation by BWT (as opposed to say sorting the string, which makes it really compressible) is useful is that the reverse transform (undoing the permutation) can be done with very little additional information. Mark Nelson wrote a nice introduction to the transform.  Moreover, the BWT essentially involves the construction of the suffix array for the string, and therefore can be done in time and space linear in the length of the string.

Here is an example of the Burrows-Wheeler tranformation of the first stanza of Yeats’ Sailing to Byzantium. I added some newlines to the transformed string, and the underscores represent spaces in the original string. Notice the long runs of characters in the transformed string.

Original string

THAT is no country for old men. The young In one another’s arms, birds in the trees – Those dying generations – at their song, The salmon-falls, the mackerel-crowded seas, Fish, flesh, or fowl, commend all summer long Whatever is begotten, born, and dies. Caught in that sensual music all neglect Monuments of unageing intellect.

BWTransformed string

rsgnsnlhhs__lntsnH__T__.A____ss.,gt,.-gcd,es s,,,ode,yrgtsgrTredllssrn,edtrln,ntefemnu__fs___eh_hrC___ia__-eennlew_r_nshhhhslldrnbghrttmmgsmhvmnkielto-___nnnnna_ueesstWtTtTttTgsd__ye_teb__Fcweallolgfaaeaa_l




A useful property

Effros et. al. showed that for a string generated by a finite-memory source, the BWT of the string is asymptotically (in the length of the string) indistinguishable from a piece-wise independent and identically distributed (i.i.d.) string. This is not surprising given that symbols with similar contexts appear sequentially in the BWT string, and for finite memory sources the current symbol is generated i.i.d. given a finite length context.

This property can be exploited to easily cluster words according to context by using BWT.

Word clustering

In this paper, among other things, Brown present a word clustering algorithm based on maximizing the average mutual information between the cluster ids of adjacent words. Some results are presented in Table 2 in the paper.

Such word clusters can be useful for feature engineering for sequence tagging tasks such as part-of-speech tagging or named-entity recognition. One of the most commonly used features for such tasks is one which checks if the current word is in a carefully constructed list of words.

Brown et. al. admit that, even after optimizations, their algorithm is slow and resort to approximations. (I realize that computers have gotten much faster since but still their algorithm is cubic in the size of the vocabulary.)

Word clustering based on BWT

We will cluster two words together if they appear independently given certain contexts (albeit with different probabilities). We first perform a BW transform on the input string of words (considering each word as a symbol, unlike in the example above) and measure whether the two words appear independently in an i.i.d. fragment.

Instead of actually trying to chop the BWT string into i.i.d. fragments before analysis, we adopt a proxy metric. We check if the number of times the two words are next to each other in the BWT string is large compared to what we would expect from their frequencies. We compute this as probability ratio with appropriate smoothing.

Another neat consequence of doing the clustering by BWT is that we only need to consider pairs of words that do appear next to each other in the BWT string. Therefore the selection of candidates for clustering is linear in the length of the string and not quadratic in the size of the vocabulary.

Some results

I ran this algorithm on about a month’s worth of New York Times and Wall Street Journal news data and these are the pairs of words with the highest scores.

january february 0.177721578886
january march 0.143172972502
march february 0.142398170589
englandgeoneng jerseyusanj 0.141412321852
news becdnews 0.135642386152
finala final 0.131901568726
finala finalb 0.122728309966
finala finalc 0.113085215849
cafd cea 0.107549686029
february april 0.100734422316
january april 0.0993752546848
has have 0.0967101802923
march april 0.0929933503714
did does 0.0854452561942
has had 0.0833642704346
will would 0.0827179598199
have had 0.0773517518078

january february 0.177721578886

january march 0.143172972502

march february 0.142398170589

englandgeoneng jerseyusanj 0.141412321852

news becdnews 0.135642386152

finala final 0.131901568726

finala finalb 0.122728309966

finala finalc 0.113085215849

cafd cea 0.107549686029

february april 0.100734422316

january april 0.0993752546848

has have 0.0967101802923

march april 0.0929933503714

did does 0.0854452561942

has had 0.0833642704346

will would 0.0827179598199

have had 0.0773517518078

I constructed a graph by joining all word pairs that have a score above a threshold and ran a greedy maximal clique algorithm. These are some of the resulting word clusters.

older young younger

announced today yesterday said reported

month today week yesterday

days month months decade year weeks years

decades months decade weeks years

com org www

writing write wrote

directed edited produced

should will probably could would may might can

worries worried concerns

work worked working works

wearing wear wore

win lost losing

man people men

against like to about that by for on in with from of at

under by on with into over from of

baton moulin khmer

daughter husband sister father wife mother son

red green blue black

ice sour whipped

time days months year years day

eastern coast southeastern

bergen orange nassau westchester

east ivory west

goes gone go going went

known seen well

travel review leisure weekly editorial

cultural financial foreign editorial national metropolitan

thursdays wednesdays fridays sundays tuesdays

thursday today monday sunday yesterday wednesday saturday friday tuesday


1. For the above results, I only did the clustering based on right contexts. We can easily extend the word-pair score to take into account left contexts as well by concatenating the BWT of the reversed string to the BWT of the original string, and calculating the scores on this double length transformed string.

2. The word clustering algorithm of Brown et. al. proceeds by iteratively merging the best pair of words and replacing the two words in the alphabet (and the string) by a merged word. We can imagine doing something similar with our approach, except, because BWT uses the order on the alphabet, we need to decide where to insert the merged word.

3. One thing that I should have done but didn’t for the above results is to order the alphabet (of words) lexicographically. Instead I assign positive integers to the words based on their first appearance in the string, which is the order BWT uses to sort. Fixing this should improve the results.


Incremental complexity support vector machine

September 18, 2009 4 comments

One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below.



(I generated them using a java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels: the first with \sigma = 1 and the second with \sigma = 10 on two different data sets.

Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition.

The reason for these anomalous boundaries is of course the large complexity of the function class induced by the RBF kernel with large \sigma, which gives the classifier a propensity to make subtle distinctions even in regions of  somewhat low example density.

A possible solution: using complex kernels only where they are needed

We propose to build a cascaded classifier, which we will call Incremental Complexity SVM (ICSVM), as follows.

We are given a sequence of kernels K_1, K_2,\ldots,K_m of increasing complexity. For example the sequence is of polynomial kernels, where K_i is the polynomial kernel with degree i.

The learning algorithm first learns an SVM classifier \psi_1 with kernel K_1, that classifies a reasonable portion of the examples with a large margin \lambda_1. This can be accomplished by setting the SVM cost parameter C to some low value.

Now all the examples outside the margin are thrown out, and another SVM classifier \psi_2 with kernel K_2 is learned, so that a reasonable portion of the remaining examples are classified with some large margin \lambda_2.

This procedure is continued until all the examples are classified outside the margin or the set of kernels is exhausted. The final classifier is a combination of all the classifiers \psi_i.

A test example can be classified as follows. We first apply classifier \psi_1 to the test example, and if it is classified with margin \geq \lambda_1, we output the assigned label and stop. If not we classify it with classifier \psi_2 in a similar fashion, and so on…

Such a scheme will avoid anomalous boundaries as those in the pictures above.


1. With all the work that has been done on SVMs it is very likely that this idea or something very similar has been thought of, but I haven’t come across it.

2. There is some work on kernel learning where a convex combination of kernels is learned but I think that is a different idea.

3. One nice thing about such a classification scheme is that at run-time it will expend less computational resources on easier examples and more on more difficult ones.  As my thesis supervisor used to say, it is silly for most classifiers to insist on acting exactly the same way on both easy and hard cases.

4. The choices of the cost parameters C for the SVMs is critical for the accuracy of the final classifier. Is there a way of formulating the choice of the parameters in terms of minimizing some overall upper bound on the generalization error from statistical learning theory?

5. Is there a one-shot SVM formulation with the set of kernels that exactly or approximately acts like our classifier?

6. The weird island-effect and what Ken calls the lava-lamp problem in the boundaries above are not just artifacts of SVMs. We would expect a sparse kernel logistic regression to behave similarly. It would be interesting to do a similar incremental kernel thing with other kernel-based classifiers.

Training data bias caused by active learning

September 10, 2009 Leave a comment

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is pool-based. That is we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is P(x,y), we have data drawn from some distribution \hat{P}(x,y) (as always x is the feature vector and y is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general since P(x,y) is unknown, if \hat{P}(x,y) is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

The type of bias

Assume that marginal feature distribution of the labeled points after active learning is given by \hat{P}(x) = \sum_y\hat{P}(x,y). Therefore \hat{P}(x) is the putative distribution from which we can assume the feature vectors with labels have been sampled from.

For every feature vector thus sampled from \hat{P}(x) we request its label from the oracle which returns a label according to the conditional distribution P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)}.  That is there is no bias in the conditional distribution. Therefore \hat{P}(x,y) = \hat{P}(x) P(y|x). This type of bias has been called covariate shift.

The data

After actively sampling the labels n times, let us say we have the following data — a biased labeled training data set \{x_i, y_i\}_{i=1,\ldots,n} \sim \hat{P}(x,y), where the feature vectors x_i come from the original pool of M unlabeled feature vectors  \{x_i\}_{i=1,\ldots,M} \sim P(x)

Let us define \beta=\frac{P(x,y)}{\hat{P}(x,y)}=\frac{P(x)}{\hat{P}(x)}. If \beta is large we expect the feature vector x to be under-represented in the labeled data set and if it is small it is over-represented.

Now define for each labeled example \beta_i=\frac{P(x_i)}{\hat{P}(x_i)} for i = 1,\ldots,n. If we knew the values of \{\beta_i\} we can correct for the bias during training and evaluation.

This paper by Huang et. al., and some of its references deal with the estimation of \{\beta_i\}_{i=1,\ldots,n}. Remark: This estimation needs take into account that E_{\hat{P}(x)}[\beta_i] = 1. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang et. al.

Evaluation of the classifier

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier f(), and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate \{\beta_i\}_{i=1,\ldots,n}. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by \{\beta_i\}.

If the empirical loss on the biased sample is written as R = \frac{1}{n} \sum_i l(f(x_i), y_i), we write the estimate of the loss on the true distribution as the weighted loss R_c= \frac{1}{n} \sum_i \beta_i l(f(x_i), y_i).

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.

Learning the classifier

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss R_c (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether any bias correction is necessary. Note that the posterior class distribution P(y|x) is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on P(x,y) will still converge to the Bayes error rate with examples drawn from \hat{P}(x,y).

For example imagine constructing a k-Nearest Neighbor classifier on the biased labeled dataset.  If we let k \rightarrow \infty and \frac{k}{n} \rightarrow 0, the classifier will converge to the Bayes-optimal classifier as n \rightarrow \infty, even if \hat{P}(x) is biased. This is somewhat paradoxical and can be explained by looking at the case of finite n.

For finite n, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.

The redundancy of view-redundancy for co-training

August 23, 2009 Leave a comment

Blum and Mitchell’s co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence, and view-redundancy (or view-agreement) for semi-supervised learning.

I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training  (which is not all that surprising considering that both are multi-view semi-sup algorithms that rely on class-conditional view-independence).

I’ll first explain co-training with an example.

Co-training – The setup

Consider a y \in \{0,1\} classification problem on the feature space \mathcal{X}=\mathcal{X}_1 \times \mathcal{X}_2. I.e., a feature vector x can be split into two as x = [x_1, x_2].

We make the rather restrictive assumption that x_1 and x_2 are class-conditionally independent for both classes. I.e., P(x_1, x_2|y) = P(x_1|y) P(x_2|y) for y \in \{0,1\}.

(Note that unlike surrogate learning with mean-independence, both \mathcal{X}_1  and \mathcal{X}_2 are allowed to be multi-dimensional.)

Co-training makes an additional assumption that either view is sufficient for classification. This view-redundancy assumption basically states that the probability mass in the region of the feature space, where the Bayes optimal classifiers on the two views disagree with each other, is zero.

(The original co-training paper actually relaxes this assumption in the epilogue, but it is unnecessary to begin with, and the assumption has proliferated in later manifestations of co-training.)

We are given some labeled data (or a weak classifier on one of the views) and an large supply of unlabeled data. We are now ready to proceed with co-training to construct a Bayes optimal classifier.

Co-training – The algorithm

The algorithm is very simple. We use our weak classifier, say h_1(x_1), (which we were given, or which we constructed using the measly labeled data) on the one view (x_1) to classify all the unlabeled data.  We select the examples classified with high confidence, and use these as labeled examples (using the labels assigned by the weak classifier) to train a classifier h_2(x_2) on the other view (x_2).

We now classify the unlabeled data with h_2(x_2) to similarly generate labeled data to retrain h_1(x_1). This back-and-forth procedure is repeated until exhaustion.

Under the above assumptions (and with “sufficient” unlabeled data) h_1 and h_2 converge to the Bayes optimal classifiers on the respective feature views. Since either view is enough for classification, we just pick one of the classifiers and release it into the wild.

Co-training – Why does it work?

I’ll try to present an intuitive explanation of co-training using the example depicted in the following figure. Please focus on it intently.


The feature vector x in the example is 2-dimensional and both views x_1 and x_2 are  1-dimensional. The class-conditional distributions are uncorrelated and jointly Gaussian (which means independent) and depicted by their equiprobability contours in the figure. The marginal class-conditional distributions are show along the two axes. Class y=0 is shown in red and class y=1 is shown in blue. The picture also shows some unlabeled examples.

Assume we have a weak classifier h_1(x_1) on the first view. If we extend the classification boundary for this classifier to the entire space x,  the boundary necessarily comprises of lines parallel to the x_2 axis.  Let’s say there is only one such line and all the examples below that line are assigned class y=1 and all the examples above are assigned class y=0.

We now ignore all the examples close to the classification boundary of h_1 (i.e., all the examples in the grey band) and project the rest of the points onto the x_2 axis.

How will these projected points be distributed along x_2?

Since the examples that were ignored (in the grey band) were selected based on their x_1 values, owing to class-conditional independence, the marginal distribution along x_2 for either class will be exactly the same as if none of the samples were ignored. This is the key reason for the conditional-independence assumption.

The procedure has two subtle, but largely innocuous, consequences.

First, since we don’t know how many class 0 and class 1 examples are in the grey band the relative ratio of the examples of the two classes in the not-ignored set may not the same as in the original full unlabeled sample set. If the class priors P(y) are known, this can easily be corrected for when we learn h_2(x_2). If the class priors are unknown other assumptions on h_1(x_1) are necessary.

Second, when we project the unlabeled examples on to x_2 we assign them the labels given to them by h_1 which can be erroneous. In the figure above, there will be examples in the region indicated by A that are actually class 1 but have been assigned class 0, and examples in region B that were from class 0 but were called class 1.

Again because of the class-conditional independence assumption these erroneously labeled examples will be distributed according to the marginal class-conditional x_2 distributions. I.e., in the figure above we imagine, along the x_2 axis, a very low amplitude blue distribution with the same shape and location as the red distribution, and a very low amplitude red distribution with the same shape under the blue distribution. (Note . This is the (\alpha, \beta) noise in the original co-training paper.)

This amounts to having a labeled training set with label errors but with errors being generated independently of the location in the space. That is the number of errors in a region in the space is proportional to the number of examples in that region. These proportionally distributed errors are then washed out by the correctly labeled examples when we learn h_2.

To recap, co-training works because of the following fact. Starting from a weak classifier h_1 on x_1, we can generate very accurate and unbiased training data to train a classifier on x_2.

No need for view-redundancy

Notice that, in the above example, we made no appeal to any kind of view-redundancy (other than whatever we may get gratis from the independence assumption).

The vigilant reader may however level the following two objections against the above argument-by-example.

1. We build h_1(x_1) and h_2(x_2) separately. So when the training is done, without view redundancy, we have not shown a way to pick from the two to apply to new test data.

2. At every iteration we need to select unlabeled samples that were classified with high-confidence by h_1 to feed to the trainer for h_2. Without view-redundancy may be none of the samples will be classified with high confidence.

The first objection is easy to respond to. We pick neither h_1 nor h_2 for new test data. Instead we combine them to obtain a classifier h(x_1,x_2). This is well justified because, under class-conditional independence, P(y|x_1,x_2) \propto P(y|x_1) P(y|x_2).

We react to the second objection by dropping the requirement of classifying with high-confidence altogether.

Dropping the high-confidence requirement by surrogate learning

Instead of training h_2(x_2) with examples that are classified with high confidence by h_1(x_1), we train h_2(x_2) with all the examples (using the scores assigned to them by h_1(x_1)).

At some iteration of co-training, define the random variable z_1 = h_1(x_1). Since x_1 and x_2 are class-conditionally independent, z_1 and x_2 are also class-conditionally independent. In particular z_1  is class-conditionally mean-independent of x_2. Furthermore if h_1 is even a weakly useful classifier, barring pathologies, it will satisfy E[z_1|y=0] \neq E[z_1|y=1].

We can therefore apply surrogate learning under mean-independence to learn the classifier on x_2. (This is essentially the same idea as Co-EM, which was introduced without much theoretical justification.)


Hopefully the above argument has convinced the reader that the class-conditional view independence assumption obviates the view-redundancy requirement.

A natural question to ask is whether the reverse is true. That is, if we are given view-redundancy, can we completely eliminate the requirement of class-conditional independence? We can immediately see that the answer is no.

For example, we can duplicate all the features for any classification problem so that view-redundancy holds trivially between the two replicates. Moreover, the second replicate will be statistically fully dependent on the first.

Now if we are given a weak classifier on the first view (or replicate) and try to use its predictions on an unlabeled data set to obtain training data for the second, it would be equivalent to feeding back the predictions of a classifier to retrain itself (because the two views are duplicates of one another).

This type of procedure (which is an idea decades old) has been called, among other things, self-learning, self-correction, self-training and decision-directed adaptation. The problem with these approaches is that the training set so generated is biased and other assumptions are necessary for the feedback procedure to improve over the original classifier.

Of course this does not mean that the complete statistical independence assumption cannot be relaxed. The above argument only shows that at least some amount of independence is necessary.

An effective kernelization of logistic regression

August 17, 2009 3 comments

I will present a sparse kernelization of logistic regression where the prototypes are not necessarily from the training data.

Traditional sparse kernel logistic regression

Consider an M class logistic regression model given by

P(y|x)\propto\mbox{exp}(\beta_{y0} + \sum_{j}^{d}\beta_{yj}x_j) for y =0,1,\ldots,M

where j indexes the d features.

Fitting the model to a data set D = \{x_i, y_i\}_{i=1,\ldots,N} involves estimating the betas to maximize the likelihood of D.

The above logistic regression model is quite simple (because the classifier is a linear function of the features of the example), and in some circumstances we might want a classifier that can produce a more complex decision boundary. One way to achieve this is by kernelization. We write

P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^N \beta_{yi} k(x,x_i)) for y=0,1,\ldots,M.

where k(.,.) is a kernel function.

In order to be able to use this classifier at run-time we have to store all the training feature vectors as part of the model because we need to compute the kernel value of the test example to every one of them. This would be highly inefficient, not to mention the severe over-fitting of the model to the training data.

The solution to both the test time efficiency and the over-fitting problems is to enforce sparsity. That is we somehow make sure that \beta_{yi} =0 for all but a few examples x_i from the training data. The import vector machine does this by greedily picking some n < N examples so that the reduced n example model best approximates the full model.

Sparsification by randomized prototype selection

The sparsified kernel logistic regression therefore looks like

P(y|x) \propto \mbox{exp}(\beta_{y0} + \sum_{i=1}^n\beta_{yi} k(x,u_i)) for y=0,1,\ldots,M.

where the feature vectors u_i are from the training data set. We can see that all we are doing is a vanilla logistic regression on a transformed feature space. The original d dimensional feature vector has been transformed into an n dimensional vector, where each dimension measures the kernel value of our test example x to a prototype vector (or reference vector) u_i.

What happens if we just selected these n prototypes randomly instead of greedily as in the import vector machine?

Avrim Blum showed that if the training data distribution is such that the two classes can be linearly separated with a margin \gamma in the feature space induced by kernel function, then the classes can be, with high probability, linearly separated with margin \gamma/2 with low error, in the transformed feature space if we pick a sufficient number of prototypes randomly.

That’s a mouthful, but basically we can use Blum’s method for kernelizing logistic regression as follows. Just pick n random vectors from your dataset (in fact they need not be labeled), compute the kernel value of an example to these n points and use these as n features to describe the example. We can then learn a straightforward logistic regression model on this n dimensional feature space.

As Blum notes, k(.,.) need not even be a valid kernel for using this method. Any reasonable similarity function would work, except the above theoretical guarantee doesn’t hold.

Going a step further — Learning the reference vectors

A key point to note is that there is no reason for the prototypes \{u_1, u_2,\ldots,u_n\} to be part of the training data. Any reasonable reference points in the original feature space would work. We just need to pick them so as to enable the resulting classifier to separate the classes well.

Therefore I propose kernelizing logistic regression by maximizing the log-likelihood with respect to  the betas as well as the reference points. We can do this by gradient descent starting from a random n points from our data set. The requirement is that the kernel function be differentiable with respect to the reference point u. (Note. Learning vector quantization is a somewhat related idea.)

Because of obvious symmetries, the log-likelihood function is non-convex with respect to the reference vectors, but  the local optima close to the randomly selected reference points are no worse than than the random reference points themselves.

The gradient with respect to a reference vector

Let us derive the gradient of the log-likelihood function with respect to a reference vector. First let us denote k(x_i, u_j), i.e., the kernel value of the i^{th} feature vector with the j^{th} prototype by z_{ij}.

The log-likelihood of the data is given by

L = \sum_{i=1}^N \sum_{y=1}^M \mbox{log}P(y|x_i) I(y=y_i)

where I(.) is the usual indicator function. The gradient of L with respect to the parameters \beta can be found in any textbook on logistic regression. The derivative of P(y|x_i) with respect to the reference vector u_l is


Putting it all together we have


That’s it. We can update all the reference vectors in the direction given by the above gradient by an amount that is controlled by the learning rate.

Checking our sums

Let us check what happens if there is only one reference vector u_1 and z_{i1} = k(x_i, u_1) = <x_i, u_1>. That is, we use a linear kernel. We have

\frac{\partial}{\partial u_1} z_{i1} = x_i and therefore

\frac{\partial}{\partial u_1} L = \sum_{i=1}^N x_i[\beta_{y1} I(y=y_i) - \sum_{y=1}^M \beta_{y1} P(y|x_i)]

which is very similar to the gradient of L with respect to \beta parameter. This is reasonable because with a linear kernel we are essentially learning a logistic regression classifier on the original feature space, where beta_{y1} u_1 takes the place of \beta_y.

If our kernel is the Gaussian radial basis function we have

\frac{\partial}{\partial u_l} z_{il} = \frac{\partial}{\partial u_l} \mbox{exp}(-\lambda||x_i-u_l||^2) = 2\lambda (x_i - u_l) z_{il}

Learning the kernel parameters

Of course gradient descent can be used to update the parameters of the kernel as well. For example we can initialize the parameter \lambda of the Gaussian r.b.f. kernel to a reasonable value and optimize it to maximize the log-likelihood as well. The expression for the gradient with respect to the kernel parameter is


Going online

The optimization of the reference vectors can be done in an online fashion by stochastic gradient descent ala Bob Carpenter.

Is it better to update all the parameters of the model (betas, reference vectors, kernel parameters) at the same time or wait for one set (say the betas) to converge before updating the next set (reference vectors)?


1. Since conditional random fields are just generalized logistic regression classifiers, we can use the same approach to kernelize them. Even if the all the features are binary, the reference vectors can be allowed to be continuous.

2. My colleague Ken Williams suggests keeping the model small by sparsifying the reference vectors themselves. The reference vectors can be encouraged to be sparse by imposing a Laplacian L1 prior.

3. The complexity of the resulting classifier can be controlled by the choice of the kernel and the number of reference vectors. I don’t have a good intuition about the effect of the two choices. For a linear kernel it seems obvious that any number of reference points should lead to the same classifier. What happens with a fixed degree polynomial kernel as the number of reference points increases?

4. Since the reference points can be moved around in the feature space, it seems extravagant to learn the betas as well. What happens when we fix the betas to random values uniformly distributed in [-1,1] and just learn the reference vectors? For what kernels do we obtain the same model as if we learned the betas as well?

5. I wonder if a similar thing can be done for support vector machines where a user specifies the kernel and the number of support vectors and the learning algorithm picks the required number of support vectors (not necessarily from the data set) such that the margin (on the training data) is maximized.

6. Ken pointed me to Archetypes, which is another related idea. In archetypal analysis the problem is to find a specified number of archetypes (reference vectors) such that all the points the data set can be as closely approximated by convex sums of the archetypes as possible. Does not directly relate to classification.

A surrogate learning mystery

August 11, 2009 Leave a comment

I’ll present an application of the surrogate learning idea in the previous post. It is mildly surprising at first blush, which I’ll contrive to make more mysterious.

Murders she induced

For readers who appreciate this sort of a thing, here’s the grandmother of all murder mysteries. In this one Miss Marple solves a whole bunch of unrelated murders all at once.

Miss Marple having finally grown wise to the unreliability of feminine intuition in solving murder mysteries spent some time learning statistics. She then convinced one of her flat-footed friends at the Scotland Yard to give her the files on all the unsolved murders, that were just sitting around gathering dust and waiting for someone with her imagination and statistical wiles.

She came home with the massive pile of papers and sat down to study them.

The file on each murder carefully listed all the suspects, with their possible motives, their accessibility to the murder weapon, psychological characteristics, previous convictions and many other details. There was a large variety in the number of suspects for different murders.

Furthermore, for every case the investigating officer made a note that the murderer was very likely to be part of the suspect pool. Only, it was not possible to narrow down the pool to one, and none of the suspects confessed.

Thinking statistically, Miss Marple decided to encode the characteristics of each suspect by a feature vector. Drawing on her vast experience in these matters she assigned numeric values to features like motives, psychology, relationship to the victim and access to the weapon.

All that was left was to build a statistical model that predicted the probability of a suspect being the murderer from his feature vector.  She would then rank all the suspects of a case by this probability, and then she can have Scotland Yard try to get the most likely suspect to confess.

At this point Miss Marple paused for a moment to rue her oversight in not asking for the files on the solved murders as well. She could have used the suspects (and the known murderer) from the solved murders as labeled examples to train her statistical model. However, she soon realized …

I know you are tense with anticipation of the inevitable twist in the tale. Here it is.

… she soon realized that she could build her model by training a regressor to predict, from the feature vector of a suspect, the size of the suspect pool he belongs to.

The clues

Let’s re-examine the pertinent facts — 1. every suspect pool contains the murderer, 2.  the suspect pools are of varying sizes.

The one, perhaps controversial, assumption we’ll need to make is that the feature vectors for both the murderers and the non-murderers are mean-independent of the size of the suspect pool they come from. (However, if it is good enough for Miss Marple, it is good enough for me.)

In any case, the violation of this assumption would imply that, when whether or not a suspect is a murderer is known, knowing his feature vector allows us to better predict the size of the suspect pool he belongs to. Because this is far-fetched, we can believe that the assumption is true.

In fact it may seem that knowing whether or not a suspect is murderer itself adds nothing to the predictability of the size of the pool from which the suspect is drawn. We will however show that the means for the murderers and non-murderers is different.

Let us now see why Miss Marple’s method works.

Marple math

Let the suspect pools be denoted \{S_1, S_2,\ldots,S_k\} with sizes |S_i| = s_i. Let each suspect be described by the feature vector denoted x_2. We now define the class label y=1 if a particular suspect is a murderer and y=0 otherwise.

We append the feature x_1 = -s_i to the feature vector of a suspect from pool S_i. That is, the surrogate feature we use is the negative of the pool size. (The negative is for a technical reason to match the previous post. In practice I would recommend \frac{1}{s_i}, but this makes the argument below a bit more involved.)

The first thing to note is that the assumption we made above translates to E[x_1|y,x_2] = E[x_1|y] for both y=0 and y=1.

All we need to show is that E[x_1|y=1] > E[x_1|y=0] and we are home free, because the argument in the previous post implies that E[x_1|x_2] can be used to rank suspects in a pool just as well as P(y|x_2).

So all we need to do is to build a regressor to predict x_1 = -s_i from the feature vector x_2 and apply this regressor to the x_2 of all the suspects in each pool and rank them by its output.

So why is E[x_1|y=1] > E[x_1|y=0]?

First it is clear that E[x_1|y=1] = -E[s] (where s is the random variable describing the size of the suspect pool) because for each pool there is one murderer and we assign the negative of that pool size as x_1 for that murderer.

Now there are a total of s_1 -1 + s_2 -1+\ldots+s_k-1 = \sum s_i - k non-murderers. For the non-murderers from the i^{th} pool the assigned x_1 is -s_i. Therefore the estimated expected value is -\frac{\sum (s_i-1).s_i}{\sum s_i - k}

If we divide through by k and let the number of pools go to infinity, the estimate converges to the true mean, which is

E[x_1|y=0] = -\frac{E[s^2] - E[s]}{E[s] - 1}

It is easy to show that this quantity is always lower than -E[s] = E[x_1|y=1] if the variance of s is non-zero (and if E[s] > 1), which we were given by the fact that the pool sizes are variable across murders.

And there you have it.

A less criminal example

Let me try to demystify the approach even more.

Assume that we have a database of images of faces of people along with their last names. Now the problem is the following — for a given face-name input pair to find the matching face and name in the database. Of course if all the names are unique this is trivial, but there are some common names like Smith and Lee.

Let us say that our procedure for this matching is to first obtain from the database the face images of the people with the same last name as the input, and then using our state-of-the-art face image comparison features to decide which of those is the right one. Do we need a training set of human-annotated input-database record pairs to train such a ranker? The above discussion suggests not.

Let us say we create a “fake” training set from a lot of unlabeled input image-name pairs by getting the set of records from the database with the matching name and assigning each feature vector a label which is the reciprocal of the size of the set. We then learn to predict this label from the image features.

The reason we would expect this predictor to behave similarly to the predictor of image match is that we can view our fake labeling as a probabilistic labeling of the feature vectors. For a given set our label is the a priori probability of a face image being the right match to the input image.

The independence assumption just makes sure that our probabilistic labeling is not biased.

I am curious about other possible applications.

Surrogate learning with mean independence

August 7, 2009 Leave a comment

In this paper we showed that if we had a feature x_1 that was class-conditionally statistically independent of the rest of the features, denoted x_2, learning a classifier between the two classes y=0 and y = 1 can be transformed into learning a predictor of x_1 from x_2 and another of y from x_1. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense x_1 acts as a surrogate for y.

Similar ideas can be found in Ando and Zhang ’07, Quadrianto et. al. ’08, Blitzer et. al. ’06, and others.

Derivation from mean-independence

I’ll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable U is mean-independent of  the r.v. V if E[U|V] = E[U]. Albeit weaker than full independence, mean-independence is still a pretty strong assumption. In particular it is stronger than the lack of correlation.

We assume that the feature space contains the single feature x_1 and the rest of the features x_2. We are still in a two-class situation, i.e., y \in \{0,1\}. We further assume

1. x_1 is at least somewhat useful for classification, or in other words, E[x_1|y=0] \neq E[x_1|y=1].

2. x_1 is class-conditionally mean-independent of x_2, i.e., E[x_1|y,x_2] = E[x_1|y] for y \in \{0,1\}.

Now let us consider the quantity E[x_1|x_2]. We have


=E[x_1|x_2,y=0] P(y=0|x_2) + E[x_1|x_2,y=1] P(y=1|x_2)

=E[x_1|y=0] P(y=0|x_2) + E[x_1|y=1] P(y=1|x_2)

Notice that E[x_1|x_2] is a convex sum of E[x_1|y=0] and E[x_1|y=1].

Now using the fact that P(y=0|x_2) + P(y=1|x_2) = 1  we can show after some algebra that


We have succeeded in decoupling y and x_2 on the right hand side, which results in a simple semi-supervised classification method. We just need the class-conditional means of x_1 and a regressor (which can be learned on unlabeled data) to compute E[x_1|x_2].  Again x_1 acting as a surrogate for y is predicted from x_2.

As opposed to the formulation in the paper this formulation easily accommodates continuous valued x_1.


1. The first thing to note is that we are only able to write an expression for P(y|x_2) but not P(y|x_1, x_2). That is, we are able to weaken the independence to mean-independence at the expense of “wasting” feature x_1.

Of course if we have full statistical independence we can use x_1, by using Equation (1) and the fact that, under independence, we have

P(y|x_1, x_2) \propto P(y|x_2) P(x_1|y).

2. If (without loss of generality) we assume that E[x_1|y=1] > E[x_1|y=0], because E[x_1|x_2] lies somewhere between E[x_1|y=0] and E[x_1|y=1], Equation (1) says that P(y=1|x_2) is a monotonically increasing function of E[x_1|x_2].

This means that E[x_1|x_2] itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) therefore is built entirely from unlabeled samples.

I’ll post a neat little application of this method soon.