### Archive

Archive for the ‘Logistic Regression’ Category

## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

Experiments on Synthetic Data

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is a drawn from a  mixture of two normal distributions (i.e., there are two types per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier will minimize the overall “error” on the training data.

However since an adversary may assign a different set of costs to the various types (than those given by the type frequencies) a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

Example 1

Example1. Original training data set. Both the red and blue classes have two types in 9:1 ratio.

Example 1. Plain logistic regression. No minimax. Almost all of the red circles are misclassified.

Example1. Minimax with gamma = 0.1

Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

Example 2

Example2. Original training data set.

Example1. Logistic regression. No minimax.

Example2. Minimax with gamma = 0.5

Discussion

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

## Regularized Minimax for Robust Learning

March 13, 2010 1 comment

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set $D = \{x_i, y_i\}_{i=1,\ldots,N}$, most learning algorithms learn a classifier $\phi$ that is parametrized by a vector $w$ by minimizing a loss function

where $l(x_i, y_i, w)$ is the loss on example $i$ and $f(w)$ is some function that penalizes complexity. For example for logistic regression the loss function looks like

for some $\lambda > 0$.

If, in addition, the examples came with costs $c_i$ (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or minimize a cost-weighted loss function (see this paper by Zadrozny et. al. )

We further constrain $\sum_i^N c_i = N$ and $c_i \ge 0$. So the unweighted learning problem corresponds to the case where all $c_i = 1$.

Assume that the learner is playing a game against an adversary that will assign the costs $\{c_i\}_{i=1,\ldots,N}$ to the training examples that will lead to the worst possible loss for any weight vector the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the the minimax solution

For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

We define

where $H(c) = -\sum_i c_i \log c_i$ (the Shannon entropy of the cost vector, save the normalization to sum to one).

The new minimax optimization problem can be posed as

subject to the constraints

Note that the regularization term on the cost vector $c$ essentially restricts the set of  possible cost vectors the adversary has at his disposal.

Optimization

For convex loss functions (such as the logistic loss) $L(w, c)$ is convex in $w$ for a fixed cost assignment, therefore so is $R(w, c)$. Furthermore, $R(w, c)$ is concave in $c$ and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector $w$, if

and if $\tilde{c}$ is unique, then

even though $\tilde{c}$ is a function of $w$.

Algorithm

The algorithm is very simple. Perform until convergence the following

1. At the $k^{th}$ iteration, for the weight vector $w^{k}$ find the cost vector $\tilde{c}$ the maximizes $R(w^{k},c)$.

2. Update $w^{k+1} = w^{k} - \eta \nabla_w R(w^{k}, \tilde{c})$, where $\eta$ is the learning rate.

The maximization in step 1 is also simple and can be shown to be

As expected, if $\gamma \rightarrow \infty$, the costs remain close to one and as $\gamma \rightarrow 0$ the entire cost budget is allocated to the example with the largest loss.

Of types and tokens

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going see in general.

(The fact that ‘Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type $t$ let $S_t$ be the set of all tokens of that type and $N_t$ be the number of tokens of that type. We now estimate $w$ by

under the same constraints on $c$ as above. Here $q$ is the observed type distribution in the training data and $KL(.\|.)$ is the KL-divergence.

The algorithm is identical to the one above except the maximum over $c$ for a fixed $w$ is slightly different.

Related Work and Discussion

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et. al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors ($w$ for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.

Acknowledgment

I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.