### Archive

Archive for the ‘Natural Language Processing’ Category

## Regularized Minimax for Robust Learning

March 13, 2010 1 comment

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $\phi$ that is parametrized by a vector $w$ by minimizing a loss function

where $l(x_i, y_i, w)$ is the loss on example $i$ and $f(w)$ is some function that penalizes complexity. For example for logistic regression the loss function looks like

for some $\lambda > 0$.

If, in addition, the examples came with costs $c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or minimize a cost-weighted loss function (see this paper by Zadrozny et. al. )

We further constrain $\sum_i^N c_i = N$ and $c_i \ge 0$. So the unweighted learning problem corresponds to the case where all $c_i = 1$.

Assume that the learner is playing a game against an adversary that will assign the costs $\{c_i\}_{i=1,\ldots,N}$ to the training examples that will lead to the worst possible loss for any weight vector the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the the minimax solution

For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

We define

where $H(c) = -\sum_i c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one).

The new minimax optimization problem can be posed as

subject to the constraints

Note that the regularization term on the cost vector $c$ essentially restricts the set of  possible cost vectors the adversary has at his disposal.

Optimization

For convex loss functions (such as the logistic loss) $L(w, c)$ is convex in $w$ for a fixed cost assignment, therefore so is $R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$ and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector $w$, if

and if $\tilde{c}$ is unique, then

even though $\tilde{c}$ is a function of $w$.

Algorithm

The algorithm is very simple. Perform until convergence the following

1. At the $k^{th}$ iteration, for the weight vector $w^{k}$ find the cost vector $\tilde{c}$ the maximizes $R(w^{k},c)$.

2. Update $w^{k+1} = w^{k} - \eta \nabla_w R(w^{k}, \tilde{c})$, where $\eta$ is the learning rate.

The maximization in step 1 is also simple and can be shown to be

As expected, if $\gamma \rightarrow \infty$, the costs remain close to one and as $\gamma \rightarrow 0$ the entire cost budget is allocated to the example with the largest loss.

Of types and tokens

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going see in general.

(The fact that ‘Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type $t$ let $S_t$ be the set of all tokens of that type and $N_t$ be the number of tokens of that type. We now estimate $w$ by

under the same constraints on $c$ as above. Here $q$ is the observed type distribution in the training data and $KL(.\|.)$ is the KL-divergence.

The algorithm is identical to the one above except the maximum over $c$ for a fixed $w$ is slightly different.

Related Work and Discussion

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et. al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ($w$ for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.

Acknowledgment

I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.

## BWT for NLP (2)

I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference.

The task was to produce an automatic metric to evaluate machine generated summaries (i.e., system summaries) against human generated summaries for the TAC ’09 Update Summarization Task. Clearly the automatic metric is just some function that produces a similarity score between the system summary and the human generated (the so-called model) summary.

The  proposed metrics were evaluated by comparing their rankings of the system summaries from different peers to that of the ranking produced by human judges.

Similarity Metric

We use an estimate of the conditional “compressibility” of the model summary given the system summary as the similarity metric. The conditional compressibility is defined as the increase in the compressibility of the model summary when the system summary has been observed.

In order to judge the similarity of the system summary $S$, to the model summary $M$, we propose to use the difference in compressibility of $M$ when $S$ is not seen to when $S$ is given. This metric basically
captures the reduction in the uncertainty in $M$ when $S$ is known.

We define the compressibility $c(M)$ of any string $M$ by

$c(M) = \frac{H(M)}{|M|}$

and the conditional compressibility of string $M$ over an alphabet $\mathcal{A}$ given another string $S$ over the same alphabet as

$c(M|S) = \frac{H(S+M) - H(S)}{|M|}$

where $S+M$ is the concatenation of the strings $S$ and $M$, $H(S)$ is the entropy of string $S$, and $|M|$ is the length of the string $M$.

The fractional increase in compressibility of $M$ given $S$ can then measured by

$r(M|S) = \frac{c(M) - c(M|S)}{c(M)}$.

We use $r(M|S)$ as the similarity metric to measure the similarity of a system summary $S$ to the model summary $M$.

Our metric is similar to the one proposed by Li and Vitanyi and is theoretically well-justified from the perspective of algorithmic information theory. One peculiarity is that our similarity is asymmetric.

The only thing that is needed to implement the above similarity metric is an estimate of the entropy $H(S)$ for a string $S$. We use the BWT for this estimate.

BWT-based String Entropy Estimate

We use the Move-To-Front (MTF) entropy of the Burrows-Wheeler transform of a given string $S$ as an estimate for its entropy $H(S)$.

The MTF encoding of a string is performed by traversing the string and assigning to each symbol the position of that symbol in the alphabet and then moving the symbol to the front of the alphabet. Therefore a sequence with a lot of runs will  have a lot of zeros in its MTF encoding.

In this paper the MTF coding is used to define the MTF entropy (which the authors also call local entropy) of a string $R$ as

$\mbox{MTFE}(R) = \sum_i \mbox{log}(\mbox{MTF}(R)_i + 1)$

where $\mbox{MTF}(R)_i$ is the $i^{th}$ symbol of the MTF coding of the string $R$.

Now we define $H(S)$, the entropy of string $S$ as

$H(S) = \mbox{MTFE}(\mbox{BWT}(S))$

where $\mbox{BWT}(S)$ is the BWT of string $S$.

Since the Burrows-Wheeler transform involves just the construction of a suffix array, the computation of our compression based evaluation metric is linear in time and space in the length of the model and system summary strings.

Some Technical Details

For our implementation, we considered each word in a string as a separate symbol. Our alphabet of symbols therefore contained all the words in the two strings being compared. The words were normalized by lower casing and removing punctuation. Because BWT needs an ordered alphabet, we used the lexicographic order on the words in the alphabet.

Results

The results on the TAC-AESOP task (above) show that the BWT based metric (FraCC in the table) is reasonable for summarization evaluation, especially because there are not very many knobs to tune. I obtained these results from Frank (who will present them at TAC next week). The “best metric” is the AESOP submission that seemed to have high scores across several measures.

## BWT for NLP (1)

The Burrows-Wheeler transform (BWT), which is the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea and can be useful for some string processing for natural language processing.  I will present one such use.

BWT massages the original string into being more amenable to compression. Of course the transform doesn’t alter the compressibility (entropy rate) of the original string. All it does is make the string more compressible by algorithms we know.

The reason string permutation by BWT (as opposed to say sorting the string, which makes it really compressible) is useful is that the reverse transform (undoing the permutation) can be done with very little additional information. Mark Nelson wrote a nice introduction to the transform.  Moreover, the BWT essentially involves the construction of the suffix array for the string, and therefore can be done in time and space linear in the length of the string.

Here is an example of the Burrows-Wheeler tranformation of the first stanza of Yeats’ Sailing to Byzantium. I added some newlines to the transformed string, and the underscores represent spaces in the original string. Notice the long runs of characters in the transformed string.

Original string

THAT is no country for old men. The young In one another’s arms, birds in the trees – Those dying generations – at their song, The salmon-falls, the mackerel-crowded seas, Fish, flesh, or fowl, commend all summer long Whatever is begotten, born, and dies. Caught in that sensual music all neglect Monuments of unageing intellect.

BWTransformed string

rsgnsnlhhs__lntsnH__T__.A____ss.,gt,.-gcd,es s,,,ode,yrgtsgrTredllssrn,edtrln,ntefemnu__fs___eh_hrC___ia__-eennlew_r_nshhhhslldrnbghrttmmgsmhvmnkielto-___nnnnna_ueesstWtTtTttTgsd__ye_teb__Fcweallolgfaaeaa_l

__mumoulr_reoeIiiueao_eouoii_aoeiueon__cm_sliM_

fbhngycrfeoeeoieiteaoctamleen’idit_o__ieu_n_cchaanta

____oa_nnosans_oomeoord_

A useful property

Effros et. al. showed that for a string generated by a finite-memory source, the BWT of the string is asymptotically (in the length of the string) indistinguishable from a piece-wise independent and identically distributed (i.i.d.) string. This is not surprising given that symbols with similar contexts appear sequentially in the BWT string, and for finite memory sources the current symbol is generated i.i.d. given a finite length context.

This property can be exploited to easily cluster words according to context by using BWT.

Word clustering

In this paper, among other things, Brown et.al. present a word clustering algorithm based on maximizing the average mutual information between the cluster ids of adjacent words. Some results are presented in Table 2 in the paper.

Such word clusters can be useful for feature engineering for sequence tagging tasks such as part-of-speech tagging or named-entity recognition. One of the most commonly used features for such tasks is one which checks if the current word is in a carefully constructed list of words.

Brown et. al. admit that, even after optimizations, their algorithm is slow and resort to approximations. (I realize that computers have gotten much faster since but still their algorithm is cubic in the size of the vocabulary.)

Word clustering based on BWT

We will cluster two words together if they appear independently given certain contexts (albeit with different probabilities). We first perform a BW transform on the input string of words (considering each word as a symbol, unlike in the example above) and measure whether the two words appear independently in an i.i.d. fragment.

Instead of actually trying to chop the BWT string into i.i.d. fragments before analysis, we adopt a proxy metric. We check if the number of times the two words are next to each other in the BWT string is large compared to what we would expect from their frequencies. We compute this as probability ratio with appropriate smoothing.

Another neat consequence of doing the clustering by BWT is that we only need to consider pairs of words that do appear next to each other in the BWT string. Therefore the selection of candidates for clustering is linear in the length of the string and not quadratic in the size of the vocabulary.

Some results

I ran this algorithm on about a month’s worth of New York Times and Wall Street Journal news data and these are the pairs of words with the highest scores.

january february 0.177721578886
january march 0.143172972502
march february 0.142398170589
englandgeoneng jerseyusanj 0.141412321852
news becdnews 0.135642386152
finala final 0.131901568726
finala finalb 0.122728309966
finala finalc 0.113085215849
cafd cea 0.107549686029
february april 0.100734422316
january april 0.0993752546848
has have 0.0967101802923
march april 0.0929933503714
did does 0.0854452561942
will would 0.0827179598199

january february 0.177721578886

january march 0.143172972502

march february 0.142398170589

englandgeoneng jerseyusanj 0.141412321852

news becdnews 0.135642386152

finala final 0.131901568726

finala finalb 0.122728309966

finala finalc 0.113085215849

cafd cea 0.107549686029

february april 0.100734422316

january april 0.0993752546848

has have 0.0967101802923

march april 0.0929933503714

did does 0.0854452561942

will would 0.0827179598199

I constructed a graph by joining all word pairs that have a score above a threshold and ran a greedy maximal clique algorithm. These are some of the resulting word clusters.

older young younger

announced today yesterday said reported

month today week yesterday

days month months decade year weeks years

com org www

writing write wrote

directed edited produced

should will probably could would may might can

worries worried concerns

work worked working works

wearing wear wore

win lost losing

man people men

against like to about that by for on in with from of at

under by on with into over from of

baton moulin khmer

daughter husband sister father wife mother son

red green blue black

ice sour whipped

time days months year years day

eastern coast southeastern

bergen orange nassau westchester

east ivory west

goes gone go going went

known seen well

travel review leisure weekly editorial

cultural financial foreign editorial national metropolitan

thursdays wednesdays fridays sundays tuesdays

thursday today monday sunday yesterday wednesday saturday friday tuesday

Discussion

1. For the above results, I only did the clustering based on right contexts. We can easily extend the word-pair score to take into account left contexts as well by concatenating the BWT of the reversed string to the BWT of the original string, and calculating the scores on this double length transformed string.

2. The word clustering algorithm of Brown et. al. proceeds by iteratively merging the best pair of words and replacing the two words in the alphabet (and the string) by a merged word. We can imagine doing something similar with our approach, except, because BWT uses the order on the alphabet, we need to decide where to insert the merged word.

3. One thing that I should have done but didn’t for the above results is to order the alphabet (of words) lexicographically. Instead I assign positive integers to the words based on their first appearance in the string, which is the order BWT uses to sort. Fixing this should improve the results.