## BWT for NLP (1)

The Burrows-Wheeler transform (BWT), the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea and can be useful for some string-processing tasks in natural language processing. I will present one such use.

BWT massages the original string into being more amenable to compression. Of course the transform doesn’t alter the compressibility (entropy rate) of the original string. All it does is make the string more compressible by algorithms we know.

The reason permutation by BWT is useful (as opposed to, say, sorting the string, which makes it *really* compressible) is that the reverse transform (undoing the permutation) can be done with very little additional information. Mark Nelson wrote a nice introduction to the transform. Moreover, the BWT essentially involves the construction of the suffix array of the string, and can therefore be done in time and space linear in the length of the string.
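As a concrete illustration, here is a minimal Python sketch of the transform and its inverse. It sorts all rotations explicitly, so it is quadratic rather than linear; a serious implementation would instead build the suffix array, as noted above.

```python
def bwt(s, eos="\0"):
    """Burrows-Wheeler transform by sorting all rotations.

    O(n^2 log n) for illustration only; a real implementation would sort
    suffixes via a linear-time suffix array construction.
    `eos` is a sentinel character assumed not to occur in `s`."""
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def ibwt(t, eos="\0"):
    """Invert the transform: repeatedly prepend the transformed column
    and re-sort, recovering the table of sorted rotations."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    # The row ending in the sentinel is the original string.
    return next(row for row in table if row.endswith(eos))[:-1]
```

Note that the (naive) inversion needs only the transformed string itself plus the sentinel convention, which is the point: the permutation is undone with essentially no side information.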

Here is an example of the Burrows-Wheeler transformation of the first stanza of Yeats’ *Sailing to Byzantium*. I added some newlines to the transformed string, and the underscores represent spaces in the original string. Notice the long runs of characters in the transformed string.

*Original string*

THAT is no country for old men. The young In one another’s arms, birds in the trees – Those dying generations – at their song, The salmon-falls, the mackerel-crowded seas, Fish, flesh, or fowl, commend all summer long Whatever is begotten, born, and dies. Caught in that sensual music all neglect Monuments of unageing intellect.

*BWTransformed string*

rsgnsnlhhs__lntsnH__T__.A____ss.,gt,.-gcd,es s,,,ode,yrgtsgrTredllssrn,edtrln,ntefemnu__fs___eh_hrC___ia__-eennlew_r_nshhhhslldrnbghrttmmgsmhvmnkielto-___nnnnna_ueesstWtTtTttTgsd__ye_teb__Fcweallolgfaaeaa_l

__mumoulr_reoeIiiueao_eouoii_aoeiueon__cm_sliM_

fbhngycrfeoeeoieiteaoctamleen’idit_o__ieu_n_cchaanta

____oa_nnosans_oomeoord_

**A useful property**

Effros et al. showed that for a string generated by a finite-memory source, the BWT of the string is asymptotically (in the length of the string) indistinguishable from a piece-wise independent and identically distributed (i.i.d.) string. This is not surprising, given that symbols with similar contexts appear sequentially in the BWT string, and for finite-memory sources the current symbol is generated i.i.d. given a finite-length context.

This property can be exploited to easily cluster words according to context by using BWT.

**Word clustering**

In this paper, among other things, Brown et al. present a word clustering algorithm based on maximizing the average mutual information between the cluster ids of adjacent words. Some results are presented in Table 2 of the paper.

Such word clusters can be useful for feature engineering for sequence tagging tasks such as part-of-speech tagging or named-entity recognition. One of the most commonly used features for such tasks is one which checks if the current word is in a carefully constructed list of words.

Brown et al. admit that, even after optimizations, their algorithm is slow, and they resort to approximations. (I realize that computers have gotten much faster since, but their algorithm is still cubic in the size of the vocabulary.)

**Word clustering based on BWT**

We will cluster two words together if they appear independently given certain contexts (albeit with different probabilities). We first perform a BW transform on the input string of words (considering each *word* as a symbol, unlike in the example above) and measure whether the two words appear independently in an i.i.d. fragment.

Instead of actually trying to chop the BWT string into i.i.d. fragments before analysis, we adopt a proxy metric. We check whether the number of times the two words appear next to each other in the BWT string is large compared to what we would expect from their frequencies. We compute this as a probability ratio with appropriate smoothing.

Another neat consequence of doing the clustering by BWT is that we only need to consider pairs of words that do appear next to each other in the BWT string. Therefore the selection of candidates for clustering is linear in the length of the string and not quadratic in the size of the vocabulary.
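A minimal sketch of this scoring in Python; the add-alpha smoothing and the log-ratio form are illustrative choices, not necessarily the exact ones used for the results below.

```python
import math
from collections import Counter

def pair_scores(bwt_words, alpha=1.0):
    """Score unordered word pairs by how often they are adjacent in the
    BWT string, relative to what their frequencies alone would predict.
    Only pairs that actually occur adjacently are ever considered, so
    candidate generation is linear in the length of the string."""
    n = len(bwt_words)
    unigram = Counter(bwt_words)
    adjacent = Counter()
    for a, b in zip(bwt_words, bwt_words[1:]):
        if a != b:  # runs of the same word are not evidence for a pair
            adjacent[tuple(sorted((a, b)))] += 1
    scores = {}
    for (a, b), count in adjacent.items():
        observed = (count + alpha) / (n - 1 + alpha)        # smoothed adjacency prob.
        expected = 2 * (unigram[a] / n) * (unigram[b] / n)  # under independence
        scores[(a, b)] = math.log(observed / expected)
    return scores
```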

**Some results**

I ran this algorithm on about a month’s worth of New York Times and Wall Street Journal news data and these are the pairs of words with the highest scores.

january february 0.177721578886

january march 0.143172972502

march february 0.142398170589

englandgeoneng jerseyusanj 0.141412321852

news becdnews 0.135642386152

finala final 0.131901568726

finala finalb 0.122728309966

finala finalc 0.113085215849

cafd cea 0.107549686029

february april 0.100734422316

january april 0.0993752546848

has have 0.0967101802923

march april 0.0929933503714

did does 0.0854452561942

has had 0.0833642704346

will would 0.0827179598199

have had 0.0773517518078

…

I constructed a graph by joining all word pairs that have a score above a threshold and ran a greedy maximal clique algorithm. These are some of the resulting word clusters.

older young younger

announced today yesterday said reported

month today week yesterday

days month months decade year weeks years

decades months decade weeks years

com org www

writing write wrote

directed edited produced

should will probably could would may might can

worries worried concerns

work worked working works

wearing wear wore

win lost losing

man people men

against like to about that by for on in with from of at

under by on with into over from of

baton moulin khmer

daughter husband sister father wife mother son

red green blue black

ice sour whipped

time days months year years day

eastern coast southeastern

bergen orange nassau westchester

east ivory west

goes gone go going went

known seen well

travel review leisure weekly editorial

cultural financial foreign editorial national metropolitan

thursdays wednesdays fridays sundays tuesdays

thursday today monday sunday yesterday wednesday saturday friday tuesday

…
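The thresholding and clique-finding step can be sketched as follows; the particular greedy strategy (growing one maximal clique out of each surviving edge) is an illustrative choice.

```python
def greedy_cliques(scores, threshold):
    """Build a graph from word pairs scoring above `threshold`, then
    greedily grow a maximal clique from each edge. Cliques may overlap,
    as the clusters above do."""
    adj = {}
    for (a, b), s in scores.items():
        if s > threshold:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    cliques = set()
    for a in adj:
        for b in adj[a]:
            clique = {a, b}
            for w in adj:  # add any word adjacent to every current member
                if w not in clique and clique <= adj[w]:
                    clique.add(w)
            cliques.add(frozenset(clique))
    return [sorted(c) for c in cliques]
```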

**Discussion**

1. For the above results, I only did the clustering based on right contexts. We can easily extend the word-pair score to take left contexts into account as well, by concatenating the BWT of the *reversed* string to the BWT of the original string and calculating the scores on this double-length transformed string.

2. The word clustering algorithm of Brown et al. proceeds by iteratively merging the best pair of words and replacing the two words in the alphabet (and the string) with a merged word. We can imagine doing something similar with our approach, except that, because BWT uses the order on the alphabet, we need to decide where to insert the merged word.

3. One thing that I should have done but didn’t for the above results is to order the alphabet (of words) lexicographically. Instead I assigned positive integers to the words in order of their first appearance in the string, which is the order BWT then used to sort. Fixing this should improve the results.

## Incremental complexity support vector machine

One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below.

(I generated them using a Java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels, with different values of the kernel width parameter, on two different data sets.

Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition.

The reason for these anomalous boundaries is of course the large complexity of the function class induced by the RBF kernel with large $\gamma$ (i.e., a narrow kernel), which gives the classifier a propensity to make subtle distinctions even in regions of somewhat low example density.

**A possible solution: using complex kernels only where they are needed**

We propose to build a cascaded classifier, which we will call Incremental Complexity SVM (ICSVM), as follows.

We are given a sequence of kernels $K_1, K_2, \ldots, K_m$ of increasing complexity. For example, the sequence could be one of polynomial kernels, where $K_i$ is the polynomial kernel of degree $i$.

The learning algorithm first learns an SVM classifier $f_1$ with kernel $K_1$ that classifies a *reasonable* portion of the examples with a large margin $\rho_1$. This can be accomplished by setting the SVM cost parameter to some low value.

Now all the examples outside the margin are thrown out, and another SVM classifier $f_2$ with kernel $K_2$ is learned, so that a reasonable portion of the remaining examples are classified with some large margin $\rho_2$.

This procedure is continued until all the examples are classified outside the margin or the set of kernels is exhausted. The final classifier is a combination of all the classifiers $f_1, f_2, \ldots$.

A test example can be classified as follows. We first apply classifier $f_1$ to the test example; if it is classified with margin at least $\rho_1$, we output the assigned label and stop. If not, we classify it with classifier $f_2$ in a similar fashion, and so on…
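A sketch of the cascade using scikit-learn; the margin threshold (in `decision_function` units) and the cost parameter are illustrative placeholders rather than tuned values.

```python
import numpy as np
from sklearn.svm import SVC

def train_icsvm(X, y, kernels, margin=1.0, C=0.1):
    """Fit an SVM per kernel, keeping for the next stage only the examples
    the current stage classifies inside the margin (i.e. ambiguously)."""
    stages = []
    for kern in kernels:
        clf = SVC(C=C, **kern).fit(X, y)
        stages.append(clf)
        inside = np.abs(clf.decision_function(X)) < margin
        X, y = X[inside], y[inside]
        if len(X) == 0 or len(np.unique(y)) < 2:
            break  # nothing ambiguous left (or only one class remains)
    return stages

def predict_icsvm(stages, x, margin=1.0):
    """Return the label from the first stage that is confident enough;
    fall back to the last stage's decision."""
    for clf in stages[:-1]:
        score = clf.decision_function([x])[0]
        if abs(score) >= margin:
            return int(np.sign(score))
    return int(np.sign(stages[-1].decision_function([x])[0]))
```

For example, `train_icsvm(X, y, [dict(kernel="linear"), dict(kernel="poly", degree=3), dict(kernel="rbf", gamma=10.0)])` runs the linear-then-polynomial-then-RBF cascade.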

Such a scheme will avoid anomalous boundaries like those in the pictures above.

**Discussion**

1. With all the work that has been done on SVMs it is very likely that this idea or something very similar has been thought of, but I haven’t come across it.

2. There is some work on kernel learning where a convex combination of kernels is learned but I think that is a different idea.

3. One nice thing about such a classification scheme is that at run-time it will expend less computational resources on easier examples and more on more difficult ones. As my thesis supervisor used to say, it is silly for most classifiers to insist on acting exactly the same way on both easy and hard cases.

4. The choice of the cost parameters for the SVMs is critical for the accuracy of the final classifier. Is there a way of formulating the choice of these parameters in terms of minimizing some overall upper bound on the generalization error from statistical learning theory?

5. Is there a one-shot SVM formulation with the set of kernels that exactly or approximately acts like our classifier?

6. The weird island-effect and what Ken calls the lava-lamp problem in the boundaries above are not just artifacts of SVMs. We would expect a sparse kernel logistic regression to behave similarly. It would be interesting to do a similar incremental kernel thing with other kernel-based classifiers.

## Training data bias caused by active learning

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in *active learning* the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, active learning is almost universally *pool-based*. That is, we start with a large pool of unlabeled data, and the learner (usually sequentially) picks the objects from the pool for which labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $p(x, y)$, we have data drawn from some other distribution $q(x, y)$ (as always, $x$ is the feature vector and $y$ is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general, since $p(x, y)$ is unknown, if $q(x, y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

**The type of bias**

Assume that the marginal feature distribution of the labeled points after active learning is given by $q(x)$. That is, $q(x)$ is the putative distribution from which we can assume the labeled feature vectors have been sampled.

For every feature vector $x$ thus sampled from $q(x)$, we request its label from the oracle, which returns a label according to the conditional distribution $p(y|x)$. That is, there is *no bias* in the conditional distribution: $q(y|x) = p(y|x)$, and therefore $q(x, y) = q(x)\,p(y|x)$. This type of bias has been called *covariate shift*.

**The data**

After actively sampling the labels $n$ times, let us say we have the following data: a biased labeled training data set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where the feature vectors come from the original pool of $N$ unlabeled feature vectors.

Let us define $\beta(x) = p(x)/q(x)$. If $\beta(x)$ is large we expect the feature vector $x$ to be under-represented in the labeled data set, and if it is small, over-represented.

Now define $\beta_i = \beta(x_i)$ for each labeled example $(x_i, y_i)$, $i = 1, \ldots, n$. If we knew the values of the $\beta_i$ we could correct for the bias during training and evaluation.

This paper by Huang *et al.*, and some of its references, deal with the estimation of the $\beta_i$. *Remark*: this estimation needs to take into account that $E_q[\beta(x)] = \int q(x) \frac{p(x)}{q(x)}\,dx = 1$. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang *et al.*
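As a sketch of what such an estimate can look like — this is *not* Huang et al.’s method, but the common discriminative density-ratio trick — one can train a probabilistic classifier to distinguish the pool (samples from $p(x)$) from the labeled set (samples from $q(x)$) and read the beta values off its odds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_beta(X_pool, X_labeled):
    """Estimate beta(x) = p(x)/q(x): train a classifier to tell pool points
    (drawn from p) from actively labeled points (drawn from q); the class
    odds at x, rescaled by the sample sizes, estimate the density ratio."""
    X = np.vstack([X_pool, X_labeled])
    z = np.concatenate([np.ones(len(X_pool)), np.zeros(len(X_labeled))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p_pool = clf.predict_proba(X_labeled)[:, 1]
    beta = (p_pool / (1 - p_pool)) * len(X_labeled) / len(X_pool)
    return beta / beta.mean()  # impose the mean-close-to-unity constraint
```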

**Evaluation of the classifier**

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $f$ and are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate the $\beta_i$. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by the $\beta_i$.

If the empirical loss on the biased sample is written as $\hat{L} = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $\hat{L}_\beta = \frac{1}{n} \sum_{i=1}^{n} \beta_i\, \ell(f(x_i), y_i)$.

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.
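In code, with 0-1 loss, the bias-corrected evaluation is just:

```python
import numpy as np

def weighted_loss(y_true, y_pred, beta):
    """Importance-weighted empirical 0-1 loss: errors on under-represented
    examples (large beta) count more, errors on over-represented ones less."""
    errors = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return float(np.mean(np.asarray(beta) * errors))
```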

**Learning the classifier**

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether *any* bias correction is necessary. Note that the posterior class distribution $p(y|x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier will still converge to the Bayes error rate even with training examples drawn from $q(x, y)$ instead of $p(x, y)$.

For example, imagine constructing a $k$-Nearest Neighbor classifier on the biased labeled data set. If we let $k \to \infty$ and $k/n \to 0$, the classifier will converge to the Bayes-optimal classifier as $n \to \infty$, *even if* $q(x)$ *is biased*. This is somewhat paradoxical, and can be explained by looking at the case of finite $n$.

For finite $n$, the uncorrected classifier trades off proportionally more errors in low-density regions for fewer overall errors. This means that by correcting for the bias, i.e. by optimizing the weighted loss, we can obtain a lower error rate under the true distribution. Therefore, although both the bias-corrected and uncorrected classifiers converge to the Bayes error, the former converges faster.