## Surrogate learning with mean independence

In this paper we showed that if we had a feature $x_1$ that was class-conditionally statistically independent of the rest of the features, denoted $x_2$, learning a classifier between the two classes $y = 0$ and $y = 1$ can be transformed into learning a predictor of $x_1$ from $x_2$ and another of $y$ from $x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense $x_1$ acts as a *surrogate* for $y$.

Similar ideas can be found in Ando and Zhang ’07, Quadrianto et al. ’08, Blitzer et al. ’06, and others.

**Derivation from mean-independence**

I’ll now derive a similar surrogate learning algorithm from *mean independence* rather than full statistical independence. Recall that the random variable $X$ is mean-independent of the r.v. $Z$ if $E[X|Z] = E[X]$. Although weaker than full independence, mean-independence is still a pretty strong assumption. In particular it is stronger than the lack of correlation.
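To make the distinction concrete, here is a quick numeric sketch (the variables and distributions are my own choices, not from the paper): take $Z \in \{1, 2\}$ and $X = Z\varepsilon$, where $\varepsilon$ is a fair $\pm 1$ coin independent of $Z$. Then $E[X|Z] = 0 = E[X]$, so $X$ is mean-independent of $Z$, yet $|X| = Z$ exactly, so the two are far from independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Z takes values 1 and 2; eps is a fair +/-1 coin independent of Z.
z = rng.choice([1, 2], size=n)
eps = rng.choice([-1, 1], size=n)
x = z * eps  # X = Z * eps

# Mean-independence: E[X|Z=z] = 0 = E[X] for both values of z ...
print(x[z == 1].mean(), x[z == 2].mean())   # both close to 0

# ... but X is not independent of Z: the magnitude |X| = Z exactly.
print(np.abs(x)[z == 1].mean(), np.abs(x)[z == 2].mean())   # 1.0 and 2.0
```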

We assume that the feature space contains the single feature $x_1$ and the rest of the features $x_2$. We are still in a two-class situation, i.e., $y \in \{0, 1\}$. We further assume

1. $x_1$ is at least somewhat useful for classification, or in other words, $E[x_1|y=0] \neq E[x_1|y=1]$.

2. $x_1$ is class-conditionally mean-independent of $x_2$, i.e., $E[x_1|x_2, y] = E[x_1|y]$ for $y \in \{0, 1\}$.

Now let us consider the quantity $E[x_1|x_2]$. We have

$$
\begin{aligned}
E[x_1|x_2] &= E[x_1|x_2, y=0]\,P(y=0|x_2) + E[x_1|x_2, y=1]\,P(y=1|x_2) \\
&= E[x_1|y=0]\,P(y=0|x_2) + E[x_1|y=1]\,P(y=1|x_2)
\end{aligned}
$$

Notice that $E[x_1|x_2]$ is a convex sum of $E[x_1|y=0]$ and $E[x_1|y=1]$.

Now using the fact that $P(y=0|x_2) + P(y=1|x_2) = 1$ we can show after some algebra that

$$
P(y=1|x_2) = \frac{E[x_1|x_2] - E[x_1|y=0]}{E[x_1|y=1] - E[x_1|y=0]} \qquad (1)
$$

We have succeeded in decoupling $x_1$ and $x_2$ on the right-hand side, which results in a simple semi-supervised classification method. We just need the class-conditional means of $x_1$ and a regressor $E[x_1|x_2]$ (which can be learned on unlabeled data) to compute $P(y=1|x_2)$. Again, $x_1$, acting as a surrogate for $y$, is predicted from $x_2$.
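As a sanity check, here is a minimal sketch of the method on synthetic data. Every concrete choice here is mine, not from the paper: class-conditionally independent Gaussian features, a Nadaraya–Watson kernel smoother (with an arbitrary bandwidth) standing in for the regressor $E[x_1|x_2]$, and a tiny labeled sample supplying the class-conditional means of $x_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data satisfying assumptions 1 and 2 (distributions are my choice):
# given y, x1 and x2 are independent Gaussians with different means.
n = 10_000
y = rng.integers(0, 2, size=n)
x2 = rng.normal(2.0 * y - 1.0, 1.0)   # x2 | y ~ N(-1, 1) or N(+1, 1)
x1 = rng.normal(2.0 * y, 1.0)         # E[x1|y=0] = 0,  E[x1|y=1] = 2

# Step 1: regress x1 on x2 using only unlabeled (x1, x2) pairs.
# A Nadaraya-Watson kernel smoother stands in for any regressor.
def e_x1_given_x2(q, bw=0.3):
    w = np.exp(-0.5 * ((q[:, None] - x2[None, :]) / bw) ** 2)
    return (w * x1).sum(axis=1) / w.sum(axis=1)

# Step 2: class-conditional means of x1 from a tiny labeled sample.
lab = rng.choice(n, size=50, replace=False)
m0 = x1[lab][y[lab] == 0].mean()
m1 = x1[lab][y[lab] == 1].mean()

# Step 3: Equation (1): P(y=1|x2) = (E[x1|x2] - m0) / (m1 - m0),
# clipped to [0, 1] since the estimated regressor can leave [m0, m1].
test = rng.choice(n, size=1_000, replace=False)
p1 = np.clip((e_x1_given_x2(x2[test]) - m0) / (m1 - m0), 0.0, 1.0)

acc = ((p1 > 0.5) == (y[test] == 1)).mean()
print(f"accuracy of thresholding P(y=1|x2) at 0.5: {acc:.3f}")
```

With these Gaussians the accuracy should land near the Bayes accuracy for $x_2$ alone, despite only 50 labeled points being used.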

As opposed to the formulation in the paper, this formulation easily accommodates a continuous-valued $x_1$.

**Discussion**

1. The first thing to note is that we are only able to write an expression for $P(y=1|x_2)$ but not $P(y=1|x_1, x_2)$. That is, we are able to weaken the independence to mean-independence at the expense of “wasting” the feature $x_1$.

Of course if we have full statistical independence we can use $x_1$ as well, by using Equation (1) and the fact that, under independence, we have

$$
P(y=1|x_1, x_2) = \frac{P(x_1|y=1)\,P(y=1|x_2)}{P(x_1|y=1)\,P(y=1|x_2) + P(x_1|y=0)\,P(y=0|x_2)}.
$$
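A sketch of this combination on synthetic Gaussian data (every modeling choice is mine: the distributions, the kernel smoother and bandwidth for $E[x_1|x_2]$, and a Gaussian fit for the class-conditional densities $P(x_1|y)$ from the same small labeled sample). Folding $x_1$ back into the posterior typically improves on using $x_2$ alone.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with full class-conditional independence (my choices).
n = 10_000
y = rng.integers(0, 2, size=n)
x2 = rng.normal(2.0 * y - 1.0, 1.0)   # x2 | y ~ N(+/-1, 1)
x1 = rng.normal(2.0 * y, 1.0)         # x1 | y ~ N(0, 1) or N(2, 1)

# Class-conditional moments of x1 from a small labeled sample.
lab = rng.choice(n, size=50, replace=False)
m0, s0 = x1[lab][y[lab] == 0].mean(), x1[lab][y[lab] == 0].std()
m1, s1 = x1[lab][y[lab] == 1].mean(), x1[lab][y[lab] == 1].std()

# P(y=1|x2) via Equation (1), with E[x1|x2] from a kernel smoother
# fitted on the unlabeled (x1, x2) pairs; clip away from 0 and 1.
test = rng.choice(n, size=1_000, replace=False)
w = np.exp(-0.5 * ((x2[test][:, None] - x2[None, :]) / 0.3) ** 2)
p1_x2 = np.clip(((w * x1).sum(1) / w.sum(1) - m0) / (m1 - m0), 1e-3, 1 - 1e-3)

# Fold x1 back in: P(y=1|x1,x2) proportional to P(x1|y=1) P(y=1|x2),
# with Gaussian estimates of the class-conditional densities of x1.
def gauss(v, mu, sd):
    return np.exp(-0.5 * ((v - mu) / sd) ** 2) / sd

num = gauss(x1[test], m1, s1) * p1_x2
p1_full = num / (num + gauss(x1[test], m0, s0) * (1.0 - p1_x2))

acc_x2 = ((p1_x2 > 0.5) == (y[test] == 1)).mean()
acc_full = ((p1_full > 0.5) == (y[test] == 1)).mean()
print(f"accuracy using x2 only: {acc_x2:.3f}, using x1 and x2: {acc_full:.3f}")
```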

2. If (without loss of generality) we assume that $E[x_1|y=1] > E[x_1|y=0]$, then because $E[x_1|x_2]$ lies somewhere between $E[x_1|y=0]$ and $E[x_1|y=1]$, Equation (1) says that $P(y=1|x_2)$ is a monotonically increasing function of $E[x_1|x_2]$.

This means that $E[x_1|x_2]$ itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) is therefore built *entirely* from unlabeled samples.
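This label-free ranker can be checked on synthetic data (again, the Gaussian setup, kernel smoother, and bandwidth are all my own assumptions, not the paper's): fit $E[x_1|x_2]$ on unlabeled pairs, use it as the score, and only touch labels to *evaluate* the ranking.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data matching the derivation's setting (my distributional choices).
n = 10_000
y = rng.integers(0, 2, size=n)
x2 = rng.normal(2.0 * y - 1.0, 1.0)   # x2 | y ~ N(+/-1, 1)
x1 = rng.normal(2.0 * y, 1.0)         # E[x1|y=1] = 2 > E[x1|y=0] = 0

# Score each point by E[x1|x2], fitted purely on unlabeled (x1, x2) pairs
# with a Nadaraya-Watson kernel smoother.
test = rng.choice(n, size=500, replace=False)
w = np.exp(-0.5 * ((x2[test][:, None] - x2[None, :]) / 0.3) ** 2)
score = (w * x1).sum(axis=1) / w.sum(axis=1)

# Labels enter only here, to evaluate the ranker (pairwise AUC).
pos, neg = score[y[test] == 1], score[y[test] == 0]
auc = (pos[:, None] > neg[None, :]).mean()
print(f"AUC of E[x1|x2] used as a ranker: {auc:.3f}")
```

By the monotonicity argument above, ranking by $E[x_1|x_2]$ is equivalent to ranking by $P(y=1|x_2)$, so the AUC should be high even though no labels were used to build the score.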

I’ll post a neat little application of this method soon.