Regularized Minimax for Robust Learning
This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e learning that is robust to data drift.
Cost Sensitive Loss Functions
Given a training data set , most learning algorithms learn a classifier that is parametrized by a vector by minimizing a loss function
where is the loss on example and is some function that penalizes complexity. For example for logistic regression the loss function looks like
for some .
If, in addition, the examples came with costs (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or minimize a cost-weighted loss function (see this paper by Zadrozny et. al. )
We further constrain and . So the unweighted learning problem corresponds to the case where all .
A Game Against An Adversary
Assume that the learner is playing a game against an adversary that will assign the costs to the training examples that will lead to the worst possible loss for any weight vector the learner produces.
How do we learn in order to minimize this maximum possible loss? The solution is to look for the the minimax solution
For any realistic learning problem the above optimization problem does not have a unique solution.
Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.
where (the Shannon entropy of the cost vector, save the normalization to sum to one).
The new minimax optimization problem can be posed as
subject to the constraints
Note that the regularization term on the cost vector essentially restricts the set of possible cost vectors the adversary has at his disposal.
For convex loss functions (such as the logistic loss) is convex in for a fixed cost assignment, therefore so is . Furthermore, is concave in and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.
The theorem allows us to say that, for a fixed weight vector , if
and if is unique, then
even though is a function of .
The algorithm is very simple. Perform until convergence the following
1. At the iteration, for the weight vector find the cost vector the maximizes .
2. Update , where is the learning rate.
The maximization in step 1 is also simple and can be shown to be
As expected, if , the costs remain close to one and as the entire cost budget is allocated to the example with the largest loss.
Of types and tokens
This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.
For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going see in general.
(The fact that ‘Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)
We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.
For type let be the set of all tokens of that type and be the number of tokens of that type. We now estimate by
under the same constraints on as above. Here is the observed type distribution in the training data and is the KL-divergence.
The algorithm is identical to the one above except the maximum over for a fixed is slightly different.
Related Work and Discussion
1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et. al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.
2. If each class is considered to one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ( for us) and not for the costs over the examples.
3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.
4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for optimizing for the worst possible assignment of costs over query types.
In the next post I will present some experimental results on toy examples with synthetic data.
I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.