## Training data bias caused by active learning

As opposed to the traditional supervised learning setting, where the labeled training data is generated (we hope) independently and identically distributed, in *active learning* the learner is allowed to select the points for which labels are requested.

Because it is often impossible to construct the equivalent real-world object from its feature values, active learning is almost universally *pool-based*. That is, we start with a large pool of unlabeled data, and the learner (usually sequentially) picks the objects from the pool for which labels are requested.

One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is $p(x, y)$, we instead have data drawn from some distribution $q(x, y)$ (as always, $x$ is the feature vector and $y$ is the class label).

We would like to correct for this bias so that it does not lead to learning an incorrect classifier. Furthermore, we want to use this biased data set to accurately evaluate the classifier.

In general, since $p(x, y)$ is unknown, if $q(x, y)$ is arbitrarily different from it there is nothing that can be done. However, thankfully, the bias caused by active learning is more tame.

**The type of bias**

Assume that the marginal feature distribution of the labeled points after active learning is given by $q(x)$. Therefore $q(x, y)$ is the putative distribution from which we can assume the labeled feature vectors have been sampled.

For every feature vector $x$ thus sampled from $q(x)$, we request its label from the oracle, which returns a label according to the conditional distribution $p(y|x)$. That is, there is *no bias* in the conditional distribution. Therefore $q(x, y) = q(x)\,p(y|x)$, whereas the true distribution factorizes as $p(x, y) = p(x)\,p(y|x)$. This type of bias has been called *covariate shift*.

**The data**

After actively sampling the labels $n$ times, let us say we have the following data: a biased labeled training data set $\{(x_i, y_i)\}_{i=1}^{n}$, where the feature vectors $x_i$ come from the original pool of $N$ unlabeled feature vectors $\{x_j\}_{j=1}^{N}$.

Let us define $\beta(x) = p(x)/q(x)$. If $\beta(x)$ is large we expect the feature vector $x$ to be under-represented in the labeled data set, and if it is small, over-represented.
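To make the direction of the weighting concrete, here is a toy sketch in Python where, unlike in practice, both densities are known; the Gaussian choices for $p$ and $q$ are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

# Toy illustration with *known* densities (in practice p and q are unknown
# and beta must be estimated). p is the true feature distribution; q is the
# biased marginal induced by active sampling, here shifted toward x = 1.
p = norm(loc=0.0, scale=1.0)   # true marginal p(x)
q = norm(loc=1.0, scale=1.0)   # biased marginal q(x)

def beta(x):
    """Importance weight beta(x) = p(x) / q(x)."""
    return p.pdf(x) / q.pdf(x)

print(beta(1.0))   # exp(-0.5) ~ 0.61: over-represented under q
print(beta(-1.0))  # exp(1.5)  ~ 4.48: under-represented under q
```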

Now define $\beta_i = \beta(x_i)$ for each labeled example $(x_i, y_i)$, for $i = 1, \ldots, n$. If we knew the values of $\beta_i$ we could correct for the bias during training and evaluation.

This paper by Huang *et al.*, and some of its references, deal with the estimation of the $\beta_i$. *Remark*: this estimation needs to take into account that $\mathbb{E}_{q(x)}[\beta(x)] = \int \frac{p(x)}{q(x)}\, q(x)\, dx = 1$. This implies that the sample mean of the $\beta_i$ values on the labeled data set should be somewhere close to unity. This constraint is explicitly imposed in the estimation method of Huang *et al.*
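As a rough sketch of that idea (not the authors' exact formulation), kernel mean matching finds weights that align the kernel mean of the labeled sample with that of the pool, subject to the near-unity mean constraint. The function names, kernel width, and constraint bounds below are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kmm_weights(X_lab, X_pool, sigma=1.0, B=10.0, eps=0.1):
    """Kernel-mean-matching sketch: find beta >= 0 matching the labeled
    sample's kernel mean to the pool's, with mean(beta) constrained near 1."""
    n = len(X_lab)
    K = rbf_kernel(X_lab, X_lab, sigma)
    kappa = (n / len(X_pool)) * rbf_kernel(X_lab, X_pool, sigma).sum(axis=1)

    obj = lambda b: 0.5 * b @ K @ b - kappa @ b
    grad = lambda b: K @ b - kappa
    cons = [  # |mean(beta) - 1| <= eps, as two inequality constraints
        {"type": "ineq", "fun": lambda b: n * (1 + eps) - b.sum()},
        {"type": "ineq", "fun": lambda b: b.sum() - n * (1 - eps)},
    ]
    res = minimize(obj, np.ones(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, B)] * n, constraints=cons)
    return res.x

# usage: pool ~ N(0,1) stands in for p(x); labeled sample biased toward x = 1
rng = np.random.default_rng(0)
X_pool = rng.normal(0.0, 1.0, size=(200, 1))
X_lab = rng.normal(1.0, 1.0, size=(50, 1))
w = kmm_weights(X_lab, X_pool)   # larger weights where the sampler under-sampled
```

Since the objective is a positive semi-definite quadratic, a general-purpose constrained solver suffices at this scale; the method of Huang *et al.* is formulated as a quadratic program.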

**Evaluation of the classifier**

We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier $h$ and asked to use the biased labeled data set to evaluate its accuracy. Also assume that we have used the above method to estimate the $\beta_i$. Fixing the bias for evaluation then boils down to just using a weighted average of the errors, where the weights are given by the $\beta_i$.

If the empirical loss on the biased sample is written as $\hat{L} = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$, we write the estimate of the loss on the true distribution as the weighted loss $\hat{L}_{\beta} = \frac{1}{n} \sum_{i=1}^{n} \beta_i \, \ell(h(x_i), y_i)$.
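In code the bias-corrected estimate is a one-liner; here is a minimal sketch with 0-1 loss (the arrays are made up for illustration):

```python
import numpy as np

def weighted_loss(y_true, y_pred, beta):
    """Bias-corrected (importance-weighted) 0-1 loss."""
    errors = (y_true != y_pred).astype(float)
    return np.mean(beta * errors)

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1])          # two mistakes
beta   = np.array([1.0, 1.0, 2.0, 0.5])  # third point under-represented

print(np.mean(y_true != y_pred))              # unweighted loss: 0.5
print(weighted_loss(y_true, y_pred, beta))    # weighted loss: (2.0 + 0.5)/4 = 0.625
```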

Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.

**Learning the classifier**

How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.
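As a sketch, any loss minimized by gradient descent can absorb the weights by multiplying each example's gradient contribution by $\beta_i$; below is a minimal weighted logistic regression in NumPy, with synthetic data and stand-in weights (all names and hyperparameters are illustrative, not from the post).

```python
import numpy as np

# Minimal sketch of minimizing the beta-weighted logistic loss by gradient
# descent. The data, weights, step size, and iteration count are stand-ins.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # separable toy labels
beta = rng.uniform(0.5, 2.0, size=200)      # stand-in for estimated weights

w = np.zeros(2)
for _ in range(500):
    prob = 1.0 / (1.0 + np.exp(-X @ w))        # predicted P(y=1 | x)
    grad = X.T @ (beta * (prob - y)) / len(y)  # gradient of the weighted loss
    w -= 0.5 * grad

acc = np.mean((X @ w > 0) == (y == 1))         # training accuracy
```

Many libraries expose the same idea through a per-example `sample_weight` argument to their fitting routines, so in practice the $\beta_i$ can usually be passed in directly.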

However, a natural question that can be raised is whether *any* bias correction is necessary. Note that the posterior class distribution $p(y|x)$ is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier trained on examples drawn from $q(x, y)$ will still converge to the Bayes error rate on $p(x, y)$.

For example, imagine constructing a $k$-Nearest Neighbor classifier on the biased labeled data set. If we let $k \to \infty$ and $k/n \to 0$, the classifier will converge to the Bayes-optimal classifier as $n \to \infty$, *even if $q(x)$ is biased*. This is somewhat paradoxical, and can be explained by looking at the case of finite $n$.

For finite $n$, the un-corrected classifier trades off proportionally more errors in regions where $q(x)$ is low (but where $p(x)$ may be high) for fewer errors overall on the biased sample. This means that by correcting for the bias, i.e., by optimizing the weighted loss, we can obtain a lower error rate under the true distribution. Therefore, although both the bias-corrected and un-corrected classifiers converge to the Bayes error rate, the former converges faster.
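A small simulation illustrates the consistency claim: trained on covariate-shifted data with a noiseless posterior ($y = 1$ iff $x > 0$), $k$-NN still approximates the Bayes rule on the true distribution. All of the setup below (distributions, sample sizes, $k$) is illustrative.

```python
import numpy as np

# k-NN trained on a biased sample q(x) = N(1,1), evaluated on the true
# distribution p(x) = N(0,1). Labels follow the unbiased, noiseless
# posterior y = 1{x > 0}, so the Bayes rule is sign(x).
rng = np.random.default_rng(0)

def knn_predict(X_train, y_train, X_test, k):
    """1-D k-NN majority vote (brute force, for illustration only)."""
    d = np.abs(X_test[:, None] - X_train[None, :])
    nn = np.argsort(d, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)

X_train = rng.normal(1.0, 1.0, size=2000)   # biased q(x), shifted right
y_train = (X_train > 0).astype(int)         # unbiased p(y|x)
X_test = rng.normal(0.0, 1.0, size=1000)    # true p(x)
y_test = (X_test > 0).astype(int)

acc = np.mean(knn_predict(X_train, y_train, X_test, k=25) == y_test)
```

Despite training only on the shifted sample, the test accuracy lands close to the Bayes-optimal 100%; the residual errors concentrate near the decision boundary, which is exactly where a finite-$n$, bias-corrected classifier gains its edge.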