## Regularized Minimax for Robust Learning

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $\phi$ that is parametrized by a vector $w$ by minimizing a loss function

where $l(x_i, y_i, w)$ is the loss on example $i$ and $f(w)$ is some function that penalizes complexity. For example for logistic regression the loss function looks like

for some $\lambda > 0$.

If, in addition, the examples came with costs $c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or minimize a cost-weighted loss function (see this paper by Zadrozny et. al. )

We further constrain $\sum_i^N c_i = N$ and $c_i \ge 0$. So the unweighted learning problem corresponds to the case where all $c_i = 1$.

Assume that the learner is playing a game against an adversary that will assign the costs $\{c_i\}_{i=1,\ldots,N}$ to the training examples that will lead to the worst possible loss for any weight vector the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the the minimax solution

For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

We define

where $H(c) = -\sum_i c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one).

The new minimax optimization problem can be posed as

subject to the constraints

Note that the regularization term on the cost vector $c$ essentially restricts the set of  possible cost vectors the adversary has at his disposal.

Optimization

For convex loss functions (such as the logistic loss) $L(w, c)$ is convex in $w$ for a fixed cost assignment, therefore so is $R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$ and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector $w$, if

and if $\tilde{c}$ is unique, then

even though $\tilde{c}$ is a function of $w$.

Algorithm

The algorithm is very simple. Perform until convergence the following

1. At the $k^{th}$ iteration, for the weight vector $w^{k}$ find the cost vector $\tilde{c}$ the maximizes $R(w^{k},c)$.

2. Update $w^{k+1} = w^{k} - \eta \nabla_w R(w^{k}, \tilde{c})$, where $\eta$ is the learning rate.

The maximization in step 1 is also simple and can be shown to be

As expected, if $\gamma \rightarrow \infty$, the costs remain close to one and as $\gamma \rightarrow 0$ the entire cost budget is allocated to the example with the largest loss.

Of types and tokens

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going see in general.

(The fact that ‘Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type $t$ let $S_t$ be the set of all tokens of that type and $N_t$ be the number of tokens of that type. We now estimate $w$ by

under the same constraints on $c$ as above. Here $q$ is the observed type distribution in the training data and $KL(.\|.)$ is the KL-divergence.

The algorithm is identical to the one above except the maximum over $c$ for a fixed $w$ is slightly different.

Related Work and Discussion

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et. al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ($w$ for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.

Acknowledgment

I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.

Hi, there is evidence that in cost sensitive AdaBoost, the objective $\sum_i c_i \ell (x_i, y_i, w_i)$ seems not work, possibly due to the exponential form of \ell() such that the multiplicative weight c_i is quickly “absorbed” by AdaBoost procedure and the final classification boundary produced is almost identical to that of normal AdaBoost (by tuning the decision threshold for the classifier output).