Home > Classification, Semi-supervised learning > Surrogate learning with mean independence

## Surrogate learning with mean independence

In this paper we showed that if we had a feature $x_1$ that was class-conditionally statistically independent of the rest of the features, denoted $x_2$, learning a classifier between the two classes $y=0$ and $y = 1$ can be transformed into learning a predictor of $x_1$ from $x_2$ and another of $y$ from $x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense $x_1$ acts as a surrogate for $y$.

Similar ideas can be found in Ando and Zhang ’07, Quadrianto et. al. ’08, Blitzer et. al. ’06, and others.

Derivation from mean-independence

I’ll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable $U$ is mean-independent of  the r.v. $V$ if $E[U|V] = E[U]$. Albeit weaker than full independence, mean-independence is still a pretty strong assumption. In particular it is stronger than the lack of correlation.

We assume that the feature space contains the single feature $x_1$ and the rest of the features $x_2$. We are still in a two-class situation, i.e., $y \in \{0,1\}$. We further assume

1. $x_1$ is at least somewhat useful for classification, or in other words, $E[x_1|y=0] \neq E[x_1|y=1]$.

2. $x_1$ is class-conditionally mean-independent of $x_2$, i.e., $E[x_1|y,x_2] = E[x_1|y]$ for $y \in \{0,1\}$.

Now let us consider the quantity $E[x_1|x_2]$. We have

$E[x_1|x_2]=$

$=E[x_1|x_2,y=0] P(y=0|x_2) + E[x_1|x_2,y=1] P(y=1|x_2)$

$=E[x_1|y=0] P(y=0|x_2) + E[x_1|y=1] P(y=1|x_2)$

Notice that $E[x_1|x_2]$ is a convex sum of $E[x_1|y=0]$ and $E[x_1|y=1]$.

Now using the fact that $P(y=0|x_2) + P(y=1|x_2) = 1$  we can show after some algebra that

$P(y=1|x_2)=\frac{E[x_1|x_2]-E[x_1|y=0]}{E[x_1|y=1]-E[x_1|y=0]}\;\;\;\;\;\;(1)$

We have succeeded in decoupling $y$ and $x_2$ on the right hand side, which results in a simple semi-supervised classification method. We just need the class-conditional means of $x_1$ and a regressor (which can be learned on unlabeled data) to compute $E[x_1|x_2]$.  Again $x_1$ acting as a surrogate for $y$ is predicted from $x_2$.

As opposed to the formulation in the paper this formulation easily accommodates continuous valued $x_1$.

Discussion

1. The first thing to note is that we are only able to write an expression for $P(y|x_2)$ but not $P(y|x_1, x_2)$. That is, we are able to weaken the independence to mean-independence at the expense of “wasting” feature $x_1$.

Of course if we have full statistical independence we can use $x_1$, by using Equation (1) and the fact that, under independence, we have

$P(y|x_1, x_2) \propto P(y|x_2) P(x_1|y)$.

2. If (without loss of generality) we assume that $E[x_1|y=1] > E[x_1|y=0]$, because $E[x_1|x_2]$ lies somewhere between $E[x_1|y=0]$ and $E[x_1|y=1]$, Equation (1) says that $P(y=1|x_2)$ is a monotonically increasing function of $E[x_1|x_2]$.

This means that $E[x_1|x_2]$ itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) therefore is built entirely from unlabeled samples.

I’ll post a neat little application of this method soon.