
Surrogate learning with mean independence

August 7, 2009

In this paper we showed that if we have a feature x_1 that is class-conditionally statistically independent of the rest of the features, denoted x_2, then learning a classifier between the two classes y=0 and y=1 can be transformed into learning a predictor of x_1 from x_2 and another of y from x_1. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense, x_1 acts as a surrogate for y.

Similar ideas can be found in Ando and Zhang ’07, Quadrianto et al. ’08, Blitzer et al. ’06, and others.

Derivation from mean-independence

I’ll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable U is mean-independent of the r.v. V if E[U|V] = E[U]. Although weaker than full independence, mean independence is still a fairly strong assumption; in particular, it is stronger than mere lack of correlation.
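To see where mean independence sits in this hierarchy, here is a small numerical sketch (the distributions below are invented purely for illustration): one pair of variables that is uncorrelated but not mean-independent, and one that is mean-independent but not independent.

```python
import numpy as np

# Example A: uncorrelated but NOT mean-independent.
# V uniform on {-1, 0, 1}, U = V^2.
V = np.array([-1, 0, 1])
U = V ** 2
cov = np.mean(U * V) - np.mean(U) * np.mean(V)   # Cov(U, V) = 0
cond_mean_A = {v: float(U[V == v][0]) for v in (-1, 0, 1)}
# E[U|V] takes values {1, 0, 1} while E[U] = 2/3, so E[U|V] != E[U].

# Example B: mean-independent but NOT independent.
# Joint distribution on four equally likely (u, v) pairs.
pairs = [(-2, 1), (2, 1), (-1, -1), (1, -1)]
cond_mean_B, cond_var_B = {}, {}
for v in (1, -1):
    us = [u for u, vv in pairs if vv == v]
    cond_mean_B[v] = sum(us) / len(us)                 # E[U|V=v] = 0 = E[U]
    cond_var_B[v] = sum(u * u for u in us) / len(us)   # but Var(U|V=v) differs
print(cov, cond_mean_A, cond_mean_B, cond_var_B)
```

In Example B the conditional mean of U is the same for both values of V, yet the conditional variance is 4 in one case and 1 in the other, so the variables are not independent.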

We assume that the feature space contains the single feature x_1 and the rest of the features x_2. We are still in a two-class situation, i.e., y \in \{0,1\}. We further assume

1. x_1 is at least somewhat useful for classification, or in other words, E[x_1|y=0] \neq E[x_1|y=1].

2. x_1 is class-conditionally mean-independent of x_2, i.e., E[x_1|y,x_2] = E[x_1|y] for y \in \{0,1\}.

Now let us consider the quantity E[x_1|x_2]. We have

E[x_1|x_2] = E[x_1|x_2,y=0] P(y=0|x_2) + E[x_1|x_2,y=1] P(y=1|x_2)

= E[x_1|y=0] P(y=0|x_2) + E[x_1|y=1] P(y=1|x_2),

where the second equality uses assumption 2. Notice that E[x_1|x_2] is a convex combination of E[x_1|y=0] and E[x_1|y=1].

Now, using the fact that P(y=0|x_2) + P(y=1|x_2) = 1, we can show after some algebra that

P(y=1|x_2) = \frac{E[x_1|x_2] - E[x_1|y=0]}{E[x_1|y=1] - E[x_1|y=0]}. \qquad (1)

We have succeeded in decoupling y and x_2 on the right-hand side, which yields a simple semi-supervised classification method: all we need are the class-conditional means of x_1 and a regressor (which can be learned on unlabeled data) to compute E[x_1|x_2]. Again x_1, predicted from x_2, acts as a surrogate for y.
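To make the recipe concrete, here is a small self-contained simulation; the generative process, bin count, and class-conditional means below are invented for illustration. A binned conditional average stands in for the regressor of x_1 on x_2, and the class-conditional means mu0 and mu1 are taken as known (in practice they would be estimated from a handful of labeled examples).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Synthetic setup: P(y=1|x_2) depends on x_2; given y, x_1 = mu_y + noise,
# with noise independent of x_2, so E[x_1|y, x_2] = E[x_1|y] as assumed.
x2 = rng.uniform(-2, 2, n)
p1 = 1.0 / (1.0 + np.exp(-2.0 * x2))          # true P(y=1|x_2)
y = (rng.uniform(size=n) < p1).astype(int)
mu0, mu1 = -1.0, 2.0                           # class-conditional means of x_1
x1 = np.where(y == 1, mu1, mu0) + rng.normal(0, 0.5, n)

# "Unlabeled" regression of x_1 on x_2: binned conditional means of x_1
bins = np.linspace(-2, 2, 41)
idx = np.clip(np.digitize(x2, bins) - 1, 0, 39)
m = np.array([x1[idx == b].mean() for b in range(40)])  # m(x_2) ~ E[x_1|x_2]

# Plug into the decoupled expression for the posterior:
# P(y=1|x_2) = (E[x_1|x_2] - mu0) / (mu1 - mu0)
p1_hat = (m[idx] - mu0) / (mu1 - mu0)
print(np.max(np.abs(p1_hat - p1)))  # small discretization/estimation error
```

Note that the regressor is fit purely on (x_1, x_2) pairs with no labels; label information enters only through the two scalars mu0 and mu1.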

Unlike the formulation in the paper, this formulation easily accommodates a continuous-valued x_1.


A couple of remarks are in order.

1. The first thing to note is that we are only able to write an expression for P(y|x_2), but not for P(y|x_1, x_2). That is, we are able to weaken the independence assumption to mean independence at the expense of “wasting” the feature x_1.

Of course, if we have full statistical independence we can still use x_1, by combining Equation (1) with the fact that, under independence,

P(y|x_1, x_2) \propto P(y|x_2) P(x_1|y).

2. If we assume (without loss of generality) that E[x_1|y=1] > E[x_1|y=0], then, because E[x_1|x_2] lies between E[x_1|y=0] and E[x_1|y=1], Equation (1) says that P(y=1|x_2) is a monotonically increasing function of E[x_1|x_2].

This means that E[x_1|x_2] itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) therefore is built entirely from unlabeled samples.
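A minimal sketch of this label-free ranker, under the same kind of invented generative setup as above (sigmoid posterior, Gaussian class-conditional noise, binned regressor): the score E[x_1|x_2] is computed without any labels, and the true labels are used only afterwards to check its ranking quality.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Illustrative generative setup (assumed, not from the post)
x2 = rng.uniform(-2, 2, n)
p1 = 1.0 / (1.0 + np.exp(-2.0 * x2))
y = (rng.uniform(size=n) < p1).astype(int)
x1 = np.where(y == 1, 2.0, -1.0) + rng.normal(0, 0.5, n)

# Ranking score: binned estimate of E[x_1|x_2], built with zero labels
bins = np.linspace(-2, 2, 41)
idx = np.clip(np.digitize(x2, bins) - 1, 0, 39)
score = np.array([x1[idx == b].mean() for b in range(40)])[idx]

# Monte Carlo AUC: probability that a random positive outranks a
# random negative (ties counted as half)
a = rng.choice(score[y == 1], 20_000)
b = rng.choice(score[y == 0], 20_000)
auc = float(np.mean(a > b) + 0.5 * np.mean(a == b))
print(auc)  # well above 0.5: the unlabeled score ranks the classes apart
```

Labeled examples would then be spent only on choosing a cutoff on this score for the desired precision/recall trade-off.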

I’ll post a neat little application of this method soon.