### Archive

Posts Tagged ‘biased training data’

## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

Experiments on Synthetic Data

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is a drawn from a  mixture of two normal distributions (i.e., there are two types per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier will minimize the overall “error” on the training data.

However since an adversary may assign a different set of costs to the various types (than those given by the type frequencies) a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

Example 1

Example1. Original training data set. Both the red and blue classes have two types in 9:1 ratio.

Example 1. Plain logistic regression. No minimax. Almost all of the red circles are misclassified.

Example1. Minimax with gamma = 0.1

Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

Example 2

Example2. Original training data set.

Example1. Logistic regression. No minimax.

Example2. Minimax with gamma = 0.5

Discussion

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

## Training data bias caused by active learning

As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is pool-based. That is we start with a large pool of unlabeled data and the learner (usually sequentially) picks the objects from the pool for which the labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $P(x,y)$, we have data drawn from some distribution $\hat{P}(x,y)$ (as always $x$ is the feature vector and $y$ is the class label).

We would like to correct for this bias so it does not lead to learning an incorrect classifier. And furthermore we want to use this biased data set to accurately evaluate the classifier.

In general since $P(x,y)$ is unknown, if $\hat{P}(x,y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

The type of bias

Assume that marginal feature distribution of the labeled points after active learning is given by $\hat{P}(x) = \sum_y\hat{P}(x,y)$. Therefore $\hat{P}(x)$ is the putative distribution from which we can assume the feature vectors with labels have been sampled from.

For every feature vector thus sampled from $\hat{P}(x)$ we request its label from the oracle which returns a label according to the conditional distribution $P(y|x) = \frac{P(x,y)}{\sum_y P(x,y)}$.  That is there is no bias in the conditional distribution. Therefore $\hat{P}(x,y) = \hat{P}(x) P(y|x)$. This type of bias has been called covariate shift.

The data

After actively sampling the labels $n$ times, let us say we have the following data — a biased labeled training data set $\{x_i, y_i\}_{i=1,\ldots,n} \sim \hat{P}(x,y)$, where the feature vectors $x_i$ come from the original pool of $M$ unlabeled feature vectors  $\{x_i\}_{i=1,\ldots,M} \sim P(x)$

Let us define $\beta=\frac{P(x,y)}{\hat{P}(x,y)}=\frac{P(x)}{\hat{P}(x)}$. If $\beta$ is large we expect the feature vector $x$ to be under-represented in the labeled data set and if it is small it is over-represented.

Now define for each labeled example $\beta_i=\frac{P(x_i)}{\hat{P}(x_i)}$ for $i = 1,\ldots,n$. If we knew the values of $\{\beta_i\}$ we can correct for the bias during training and evaluation.

This paper by Huang et. al., and some of its references deal with the estimation of $\{\beta_i\}_{i=1,\ldots,n}$. Remark: This estimation needs take into account that $E_{\hat{P}(x)}[\beta_i] = 1$. This implies that the sample mean of the beta values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang et. al.

Evaluation of the classifier

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $f()$, and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate $\{\beta_i\}_{i=1,\ldots,n}$. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by $\{\beta_i\}$.

If the empirical loss on the biased sample is written as $R = \frac{1}{n} \sum_i l(f(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $R_c= \frac{1}{n} \sum_i \beta_i l(f(x_i), y_i)$.

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.

Learning the classifier

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss $R_c$ (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.

However, a natural question that can be raised is whether any bias correction is necessary. Note that the posterior class distribution $P(y|x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on $P(x,y)$ will still converge to the Bayes error rate with examples drawn from $\hat{P}(x,y)$.

For example imagine constructing a $k$-Nearest Neighbor classifier on the biased labeled dataset.  If we let $k \rightarrow \infty$ and $\frac{k}{n} \rightarrow 0$, the classifier will converge to the Bayes-optimal classifier as $n \rightarrow \infty$, even if $\hat{P}(x)$ is biased. This is somewhat paradoxical and can be explained by looking at the case of finite $n$.

For finite $n$, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.