### Archive

Posts Tagged ‘Semi-supervised learning’

## A surrogate learning mystery

I’ll present an application of the surrogate learning idea in the previous post. It is mildly surprising at first blush, which I’ll contrive to make more mysterious.

Murders she induced

For readers who appreciate this sort of a thing, here’s the grandmother of all murder mysteries. In this one Miss Marple solves a whole bunch of unrelated murders all at once.

Miss Marple having finally grown wise to the unreliability of feminine intuition in solving murder mysteries spent some time learning statistics. She then convinced one of her flat-footed friends at the Scotland Yard to give her the files on all the unsolved murders, that were just sitting around gathering dust and waiting for someone with her imagination and statistical wiles.

She came home with the massive pile of papers and sat down to study them.

The file on each murder carefully listed all the suspects, with their possible motives, their accessibility to the murder weapon, psychological characteristics, previous convictions and many other details. There was a large variety in the number of suspects for different murders.

Furthermore, for every case the investigating officer made a note that the murderer was very likely to be part of the suspect pool. Only, it was not possible to narrow down the pool to one, and none of the suspects confessed.

Thinking statistically, Miss Marple decided to encode the characteristics of each suspect by a feature vector. Drawing on her vast experience in these matters she assigned numeric values to features like motives, psychology, relationship to the victim and access to the weapon.

All that was left was to build a statistical model that predicted the probability of a suspect being the murderer from his feature vector.  She would then rank all the suspects of a case by this probability, and then she can have Scotland Yard try to get the most likely suspect to confess.

At this point Miss Marple paused for a moment to rue her oversight in not asking for the files on the solved murders as well. She could have used the suspects (and the known murderer) from the solved murders as labeled examples to train her statistical model. However, she soon realized …

I know you are tense with anticipation of the inevitable twist in the tale. Here it is.

… she soon realized that she could build her model by training a regressor to predict, from the feature vector of a suspect, the size of the suspect pool he belongs to.

The clues

Let’s re-examine the pertinent facts — 1. every suspect pool contains the murderer, 2.  the suspect pools are of varying sizes.

The one, perhaps controversial, assumption we’ll need to make is that the feature vectors for both the murderers and the non-murderers are mean-independent of the size of the suspect pool they come from. (However, if it is good enough for Miss Marple, it is good enough for me.)

In any case, the violation of this assumption would imply that, when whether or not a suspect is a murderer is known, knowing his feature vector allows us to better predict the size of the suspect pool he belongs to. Because this is far-fetched, we can believe that the assumption is true.

In fact it may seem that knowing whether or not a suspect is murderer itself adds nothing to the predictability of the size of the pool from which the suspect is drawn. We will however show that the means for the murderers and non-murderers is different.

Let us now see why Miss Marple’s method works.

Marple math

Let the suspect pools be denoted $\{S_1, S_2,\ldots,S_k\}$ with sizes $|S_i| = s_i$. Let each suspect be described by the feature vector denoted $x_2$. We now define the class label $y=1$ if a particular suspect is a murderer and $y=0$ otherwise.

We append the feature $x_1 = -s_i$ to the feature vector of a suspect from pool $S_i$. That is, the surrogate feature we use is the negative of the pool size. (The negative is for a technical reason to match the previous post. In practice I would recommend $\frac{1}{s_i}$, but this makes the argument below a bit more involved.)

The first thing to note is that the assumption we made above translates to $E[x_1|y,x_2] = E[x_1|y]$ for both $y=0$ and $y=1$.

All we need to show is that $E[x_1|y=1] > E[x_1|y=0]$ and we are home free, because the argument in the previous post implies that $E[x_1|x_2]$ can be used to rank suspects in a pool just as well as $P(y|x_2)$.

So all we need to do is to build a regressor to predict $x_1 = -s_i$ from the feature vector $x_2$ and apply this regressor to the $x_2$ of all the suspects in each pool and rank them by its output.

So why is $E[x_1|y=1] > E[x_1|y=0]$?

First it is clear that $E[x_1|y=1] = -E[s]$ (where $s$ is the random variable describing the size of the suspect pool) because for each pool there is one murderer and we assign the negative of that pool size as $x_1$ for that murderer.

Now there are a total of $s_1 -1 + s_2 -1+\ldots+s_k-1 = \sum s_i - k$ non-murderers. For the non-murderers from the $i^{th}$ pool the assigned $x_1$ is $-s_i$. Therefore the estimated expected value is $-\frac{\sum (s_i-1).s_i}{\sum s_i - k}$

If we divide through by $k$ and let the number of pools go to infinity, the estimate converges to the true mean, which is

$E[x_1|y=0] = -\frac{E[s^2] - E[s]}{E[s] - 1}$

It is easy to show that this quantity is always lower than $-E[s] = E[x_1|y=1]$ if the variance of $s$ is non-zero (and if $E[s] > 1$), which we were given by the fact that the pool sizes are variable across murders.

And there you have it.

A less criminal example

Let me try to demystify the approach even more.

Assume that we have a database of images of faces of people along with their last names. Now the problem is the following — for a given face-name input pair to find the matching face and name in the database. Of course if all the names are unique this is trivial, but there are some common names like Smith and Lee.

Let us say that our procedure for this matching is to first obtain from the database the face images of the people with the same last name as the input, and then using our state-of-the-art face image comparison features to decide which of those is the right one. Do we need a training set of human-annotated input-database record pairs to train such a ranker? The above discussion suggests not.

Let us say we create a “fake” training set from a lot of unlabeled input image-name pairs by getting the set of records from the database with the matching name and assigning each feature vector a label which is the reciprocal of the size of the set. We then learn to predict this label from the image features.

The reason we would expect this predictor to behave similarly to the predictor of image match is that we can view our fake labeling as a probabilistic labeling of the feature vectors. For a given set our label is the a priori probability of a face image being the right match to the input image.

The independence assumption just makes sure that our probabilistic labeling is not biased.

I am curious about other possible applications.

## Surrogate learning with mean independence

In this paper we showed that if we had a feature $x_1$ that was class-conditionally statistically independent of the rest of the features, denoted $x_2$, learning a classifier between the two classes $y=0$ and $y = 1$ can be transformed into learning a predictor of $x_1$ from $x_2$ and another of $y$ from $x_1$. Since the first predictor can be learned on unlabeled examples and the second is a classifier on a 1-D space, the learning problem becomes easy. In a sense $x_1$ acts as a surrogate for $y$.

Similar ideas can be found in Ando and Zhang ’07, Quadrianto et. al. ’08, Blitzer et. al. ’06, and others.

Derivation from mean-independence

I’ll now derive a similar surrogate learning algorithm from mean independence rather than full statistical independence. Recall that the random variable $U$ is mean-independent of  the r.v. $V$ if $E[U|V] = E[U]$. Albeit weaker than full independence, mean-independence is still a pretty strong assumption. In particular it is stronger than the lack of correlation.

We assume that the feature space contains the single feature $x_1$ and the rest of the features $x_2$. We are still in a two-class situation, i.e., $y \in \{0,1\}$. We further assume

1. $x_1$ is at least somewhat useful for classification, or in other words, $E[x_1|y=0] \neq E[x_1|y=1]$.

2. $x_1$ is class-conditionally mean-independent of $x_2$, i.e., $E[x_1|y,x_2] = E[x_1|y]$ for $y \in \{0,1\}$.

Now let us consider the quantity $E[x_1|x_2]$. We have

$E[x_1|x_2]=$

$=E[x_1|x_2,y=0] P(y=0|x_2) + E[x_1|x_2,y=1] P(y=1|x_2)$

$=E[x_1|y=0] P(y=0|x_2) + E[x_1|y=1] P(y=1|x_2)$

Notice that $E[x_1|x_2]$ is a convex sum of $E[x_1|y=0]$ and $E[x_1|y=1]$.

Now using the fact that $P(y=0|x_2) + P(y=1|x_2) = 1$  we can show after some algebra that

$P(y=1|x_2)=\frac{E[x_1|x_2]-E[x_1|y=0]}{E[x_1|y=1]-E[x_1|y=0]}\;\;\;\;\;\;(1)$

We have succeeded in decoupling $y$ and $x_2$ on the right hand side, which results in a simple semi-supervised classification method. We just need the class-conditional means of $x_1$ and a regressor (which can be learned on unlabeled data) to compute $E[x_1|x_2]$.  Again $x_1$ acting as a surrogate for $y$ is predicted from $x_2$.

As opposed to the formulation in the paper this formulation easily accommodates continuous valued $x_1$.

Discussion

1. The first thing to note is that we are only able to write an expression for $P(y|x_2)$ but not $P(y|x_1, x_2)$. That is, we are able to weaken the independence to mean-independence at the expense of “wasting” feature $x_1$.

Of course if we have full statistical independence we can use $x_1$, by using Equation (1) and the fact that, under independence, we have

$P(y|x_1, x_2) \propto P(y|x_2) P(x_1|y)$.

2. If (without loss of generality) we assume that $E[x_1|y=1] > E[x_1|y=0]$, because $E[x_1|x_2]$ lies somewhere between $E[x_1|y=0]$ and $E[x_1|y=1]$, Equation (1) says that $P(y=1|x_2)$ is a monotonically increasing function of $E[x_1|x_2]$.

This means that $E[x_1|x_2]$ itself can be used as the classifier, and labeled examples are needed only to determine a threshold for trading off precision vs. recall. The classifier (or perhaps we should call it a ranker) therefore is built entirely from unlabeled samples.

I’ll post a neat little application of this method soon.