<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Innuo</title>
	<atom:link href="http://mlstat.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://mlstat.wordpress.com</link>
	<description>Machine learning, statistics etc.</description>
	<lastBuildDate>Thu, 17 Feb 2011 15:47:06 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='mlstat.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Innuo</title>
		<link>http://mlstat.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://mlstat.wordpress.com/osd.xml" title="Innuo" />
	<atom:link rel='hub' href='http://mlstat.wordpress.com/?pushpress=hub'/>
		<item>
		<title>The Cult of Universality in Statistical Learning Theory</title>
		<link>http://mlstat.wordpress.com/2010/10/31/the-cult-of-universality-in-statistical-learning-theory/</link>
		<comments>http://mlstat.wordpress.com/2010/10/31/the-cult-of-universality-in-statistical-learning-theory/#comments</comments>
		<pubDate>Sun, 31 Oct 2010 18:27:16 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Learning Theory]]></category>
		<category><![CDATA[Rant]]></category>
		<category><![CDATA[error bound]]></category>
		<category><![CDATA[generalization error]]></category>
		<category><![CDATA[learning theory]]></category>
		<category><![CDATA[rademacher complexity]]></category>
		<category><![CDATA[VC dimension]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=525</guid>
		<description><![CDATA[The question is frequently raised as to why the theory and practice of machine learning are so divergent. If you glance at any article about classification, chances are that you will find symbol upon lemma &#38; equation upon inequality, making claims about bounds on the error rates that should putatively guide the engineer [...]]]></description>
			<content:encoded><![CDATA[<p>The question is frequently raised as to why the theory and practice of machine learning are so divergent. If you glance at any article about classification, chances are that you will find symbol upon lemma &amp; equation upon inequality, making claims about bounds on the error rates that should putatively guide the engineer in the solution of her problem.</p>
<p>However, the situation seems to be that the engineer, having been forewarned by her pragmatic colleagues (or having checked a few herself) that these bounds are vacuous for most realistic problems, circumvents them altogether in her search for any useful nuggets in the article.</p>
<p>So why do these oft-ignored analyses still persist in a field that is composed largely of engineers? From my brief survey of the literature it seems that one (but by no means the only) reason is the needless preponderance of <em>worst-case thinking</em>. (Being a Panglossian believer in the purity of science and in the intentions of its workers, I am immediately dismissing the cynical suggestion that these analyses are appended to an article only to intimidate the insecure reviewer.)</p>
<p><strong>The cult of universality</strong></p>
<p>An inventive engineer designs a learning algorithm for her problem of classifying birds from recordings of their calls. She suspects that her algorithm is more generally applicable and sits down to analyze it formally. She vaguely recalls various neat <em>generalization error</em> bounds she learned about during her days at the university, and wonders if they are applicable.</p>
<p>The bounds made claims of the kind</p>
<p>&#8220;for my classifier whose complexity is <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c' title='c' class='latex' />, if trained on <img src='http://s0.wp.com/latex.php?latex=m&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='m' title='m' class='latex' /> examples, then for any distribution that generated the data, it is guaranteed that the</p>
<p>generalization error rate <img src='http://s0.wp.com/latex.php?latex=%5Cleq&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;leq' title='&#92;leq' class='latex' /> error rate on the training set + some function of (c,m)</p>
<p>with high probability&#8221;.</p>
<p>Some widely used measures of the complexity of a classifier are its VC dimension and its Rademacher complexity, both of which measure the ability of the classifier to separate <em>any </em>training set. The intuition is that if the classifier can imitate any arbitrary labeling of a set of vectors, it will generalize poorly.</p>
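<p>To get a feel for the size of the "some function of (c,m)" term, here is a minimal sketch that evaluates one classical textbook form of the VC bound, assuming a classifier of VC dimension d, m training examples and confidence 1 - delta. The function name and the example numbers are illustrative assumptions, and other versions of the bound differ in their constants.</p>

```python
import math


def vc_bound_gap(d, m, delta=0.05):
    """Complexity term of a classical VC generalization bound.

    With probability at least 1 - delta, the test error is at most the
    training error plus sqrt((d * (ln(2m/d) + 1) + ln(4/delta)) / m).
    This is one standard form; it is only meaningful when m exceeds d.
    """
    return math.sqrt((d * (math.log(2.0 * m / d) + 1.0) + math.log(4.0 / delta)) / m)


# Hypothetical setting: a linear classifier on 100 features has VC dimension 101.
for m in (1_000, 10_000, 1_000_000):
    print(m, round(vc_bound_gap(d=101, m=m), 3))
```

<p>With a thousand training examples the guaranteed gap in this setting is above 0.6, which says very little about a classifier whose error rate can never exceed 1.</p>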
<p>Because of the phrase &#8220;for any distribution&#8221; in the statement of the bound, the bound is said to be <em>universally</em> applicable. It is this pursuit of <em>universality</em> that is a deplorable manifestation of worst-case thinking. It is tolerable in mathematicians who delight in pathologies, but can be debilitating in engineers.</p>
<p>The extent of pessimism induced by the requirement of universality is not well appreciated. The following example is designed to illustrate this by relaxing the requirement from &#8220;any distribution&#8221; to &#8220;any smooth distribution&#8221;, which is not much of a relaxation at all.</p>
<p>Assume that I have a small training data set <img src='http://s0.wp.com/latex.php?latex=%5C%7B%28x_i%2C+y_i%29%5C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{(x_i, y_i)&#92;}' title='&#92;{(x_i, y_i)&#92;}' class='latex' /> in <img src='http://s0.wp.com/latex.php?latex=R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R^d' title='R^d' class='latex' /> drawn from a continuous distribution <img src='http://s0.wp.com/latex.php?latex=p%28x%2C+y%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='p(x, y)' title='p(x, y)' class='latex' />.  Assume further that <img src='http://s0.wp.com/latex.php?latex=p%28x%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='p(x)' title='p(x)' class='latex' /> is reasonably smooth.</p>
<p>I now build a  linear classifier under some loss (say an SVM). I then take all the  training examples that are misclassified by the linear classifier and  memorize them along with their labels.</p>
<p>For a test vector <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' />, if <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' />  is within <img src='http://s0.wp.com/latex.php?latex=%5Cepsilon&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;epsilon' title='&#92;epsilon' class='latex' /> of a memorized training example I give it the label  of the training example. Otherwise I use the linear classifier to obtain  my prediction.</p>
<p>I can make <img src='http://s0.wp.com/latex.php?latex=%5Cepsilon&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;epsilon' title='&#92;epsilon' class='latex' /> very small and since the  training examples will be in general position with probability one, this  classification scheme is unambiguous.</p>
<p>This classifier  will have zero error on all training sets and therefore will have high  complexity according to the usual complexity measures like VC,  Rademacher etc. However, if I ignore the  contribution of the memorized points (which only play a  role for a set  of vanishingly small probability), I have a linear  classifier.</p>
<p>Therefore, although it is reasonable to expect any analysis to yield very similar bounds on the generalization error for a linear classifier and my linear+memorization classifier, the requirement of universality leads to vacuous bounds for the latter.</p>
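<p>The linear+memorization construction above can be sketched in a few lines. This is only an illustration under stated assumptions: a least-squares fit stands in for the SVM, labels are -1 or +1, and the class name, the value of epsilon and the synthetic data are all made up for the example.</p>

```python
import numpy as np


class LinearPlusMemorization:
    """Linear classifier that also memorizes its own training mistakes.

    The linear part is a least-squares fit (a stand-in for the SVM in the
    text); labels are assumed to be -1 or +1.
    """

    def __init__(self, eps=1e-9):
        self.eps = eps

    def fit(self, X, y):
        Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
        self.w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
        wrong = self._linear(X) != y                    # training mistakes
        self.mem_X, self.mem_y = X[wrong], y[wrong]     # memorize them with labels
        return self

    def _linear(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.where(Xb @ self.w >= 0, 1, -1)

    def predict(self, X):
        pred = self._linear(X)
        if len(self.mem_X) > 0:
            # distance from every query point to every memorized point
            dist = np.linalg.norm(X[:, None, :] - self.mem_X[None, :, :], axis=2)
            nearest = dist.argmin(axis=1)
            close = dist[np.arange(len(X)), nearest] <= self.eps
            pred[close] = self.mem_y[nearest][close]    # override inside the eps-balls
        return pred


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)   # noisy labels
clf = LinearPlusMemorization().fit(X, y)
X_new = rng.normal(size=(1000, 2))

train_error = np.mean(clf.predict(X) != y)
changed = np.mean(clf.predict(X_new) != clf._linear(X_new))
print("memorized points:", len(clf.mem_X))
print("training error:", train_error)
print("fraction of fresh points changed by memorization:", changed)
```

<p>The training error is zero by construction, yet on fresh draws from a continuous distribution the predictions coincide with those of the plain linear classifier, which is exactly the mismatch that a universal bound cannot see.</p>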
<p>Even if I assume nothing more than smoothness, I do not know how to derive reasonable statements with the existing tools. And we almost always know much more about the data distributions!</p>
<p>To reiterate, checking one&#8217;s learning algorithm against the worst possible distribution is akin to designing a bicycle and checking how well it serves for holding up one&#8217;s pants.</p>
<p><strong>&#8220;The medicine bottle rules&#8221;</strong></p>
<p>Our engineer ponders these issues, muses about the &#8220;no free lunch&#8221; results that imply that for any two classifiers there are distributions for which either one of them is better than the other, and wonders about the philosophical distinction between <em>a priori</em> restricting the function space that the learning algorithm searches, and <em>a priori</em> restricting the distributions that the learning algorithm is applicable to.</p>
<p>After a short nap, she decides on a sensible route for her analysis.</p>
<p>1. <em>State the restrictions on the distribution</em>. She shows that her algorithm will perform very well if her assumptions about the data distribution are satisfied. She further argues that the allowed distributions are still broad enough to cover many other problems.</p>
<p>2. <em>State to what extent the assumptions can be violated</em>. She analyzes how the quality of her algorithm degrades when the assumptions are satisfied only approximately.</p>
<p>3. <em>State which assumptions are necessary</em>. She analyzes the situations where her algorithm will definitely fail.</p>
<p>I believe that these are good rules to follow while analyzing classification algorithms.  My professor <a href="http://www.ecse.rpi.edu/~nagy/" target="_blank">George Nagy</a> calls these the <em>medicine bottle rules</em>, because, as on a medicine label, we require information on how to administer the drug, what it is for, what it is bad for, and perhaps on interesting side effects.</p>
<p>I do not claim to follow this advice unfailingly and I admit to some of the above crimes. I, however, do believe that medicine bottle analysis is vastly more useful than much of what passes for learning theory. I look forward to hearing from you, nimble reader, of your thoughts on the kinds of analyses you would care enough about to read.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2010/10/31/the-cult-of-universality-in-statistical-learning-theory/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>
	</item>
		<item>
		<title>Random Fourier Features for Kernel Density Estimation</title>
		<link>http://mlstat.wordpress.com/2010/10/04/random-fourier-features-for-kernel-density-estimation/</link>
		<comments>http://mlstat.wordpress.com/2010/10/04/random-fourier-features-for-kernel-density-estimation/#comments</comments>
		<pubDate>Mon, 04 Oct 2010 22:41:17 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Estimation]]></category>
		<category><![CDATA[Bochner's theorem]]></category>
		<category><![CDATA[density estimation]]></category>
		<category><![CDATA[Gaussian kernel]]></category>
		<category><![CDATA[novelty detection]]></category>
		<category><![CDATA[outlier detection]]></category>
		<category><![CDATA[Parzen window]]></category>
		<category><![CDATA[Random Fourier Features]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=470</guid>
		<description><![CDATA[The NIPS paper Random Fourier Features for Large-scale Kernel Machines, by Rahimi and Recht presents a method for randomized feature mapping where dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space. We know that for any p.d. kernel there exists a deterministic map that [...]]]></description>
			<content:encoded><![CDATA[<p>The NIPS paper <a href="http://pages.cs.wisc.edu/~brecht/papers/07.rah.rec.nips.pdf" target="_blank">Random Fourier Features for Large-scale Kernel Machines</a> by Rahimi and Recht presents a method for randomized feature mapping where dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space.</p>
<p>We know that for any p.d. kernel there exists a <em>deterministic</em> map that has the aforementioned property but it may be infinite dimensional. The paper presents results indicating that with the randomized map we  can get away with only a &#8220;small&#8221; number of features (at least for a classification setting).</p>
<p>Before applying the method to density estimation let us review the relevant section of the paper briefly.</p>
<p><strong>Bochner&#8217;s Theorem and Random Fourier Features</strong></p>
<p>Assume that we have data in <img src='http://s0.wp.com/latex.php?latex=R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R^d' title='R^d' class='latex' /> and a continuous p.d. kernel <img src='http://s0.wp.com/latex.php?latex=K%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(x,y)' title='K(x,y)' class='latex' /> defined for every pair of points <img src='http://s0.wp.com/latex.php?latex=x%2Cy+%5Cin+R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x,y &#92;in R^d' title='x,y &#92;in R^d' class='latex' />. Assume further that the kernel is shift-invariant, i.e., <img src='http://s0.wp.com/latex.php?latex=K%28x%2Cy%29+%3D+K%28x-y%29+%5Ctriangleq+K%28%5Cdelta%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(x,y) = K(x-y) &#92;triangleq K(&#92;delta)' title='K(x,y) = K(x-y) &#92;triangleq K(&#92;delta)' class='latex' /> and that the kernel is scaled so that <img src='http://s0.wp.com/latex.php?latex=K%280%29+%3D+1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(0) = 1' title='K(0) = 1' class='latex' />.</p>
<p>The theorem by Bochner states that under the above conditions <img src='http://s0.wp.com/latex.php?latex=K%28%5Cdelta%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(&#92;delta)' title='K(&#92;delta)' class='latex' /> must be the Fourier transform of a non-negative measure on <img src='http://s0.wp.com/latex.php?latex=R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R^d' title='R^d' class='latex' />. In other words, there exists a probability density function <img src='http://s0.wp.com/latex.php?latex=p%28%5Cdelta%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='p(&#92;delta)' title='p(&#92;delta)' class='latex' /> for <img src='http://s0.wp.com/latex.php?latex=%5Cdelta+%5Cin+R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;delta &#92;in R^d' title='&#92;delta &#92;in R^d' class='latex' /> such that <img src='http://s0.wp.com/latex.php?latex=K%28%5Cdelta%29+%3D+%5Cmathcal%7BF%7D%28p%28%5Cdelta%29%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(&#92;delta) = &#92;mathcal{F}(p(&#92;delta))' title='K(&#92;delta) = &#92;mathcal{F}(p(&#92;delta))' class='latex' />.</p>
<p style="text-align:center;"><a href="http://mlstat.files.wordpress.com/2010/10/fourier1.png"></a><a href="http://mlstat.files.wordpress.com/2010/10/fourier11.png"><img class="aligncenter size-full wp-image-477" title="fourier1" src="http://mlstat.files.wordpress.com/2010/10/fourier11.png?w=545&#038;h=198" alt="" width="545" height="198" /></a></p>
<p>where (1) is because <img src='http://s0.wp.com/latex.php?latex=K%28.%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(.)' title='K(.)' class='latex' /> is real. Equation (2) says that if we draw a random vector <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' /> according to <img src='http://s0.wp.com/latex.php?latex=p%28w%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='p(w)' title='p(w)' class='latex' /> and form two vectors <img src='http://s0.wp.com/latex.php?latex=%5Cphi%28x%29+%3D+%28cos%28w%5ET+x%29%2C+sin%28w%5ET+x%29%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;phi(x) = (cos(w^T x), sin(w^T x))' title='&#92;phi(x) = (cos(w^T x), sin(w^T x))' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Cphi%28y%29+%3D+%28cos%28w%5ET+y%29%2C+sin%28w%5ET+y%29%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;phi(y) = (cos(w^T y), sin(w^T y))' title='&#92;phi(y) = (cos(w^T y), sin(w^T y))' class='latex' />, then the expected value of <img src='http://s0.wp.com/latex.php?latex=%3C%5Cphi%28x%29%2C%5Cphi%28y%29%3E&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&lt;&#92;phi(x),&#92;phi(y)&gt;' title='&lt;&#92;phi(x),&#92;phi(y)&gt;' class='latex' /> is <img src='http://s0.wp.com/latex.php?latex=K%28x-y%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(x-y)' title='K(x-y)' class='latex' />.</p>
<p>Therefore, for <img src='http://s0.wp.com/latex.php?latex=x+%5Cin+R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x &#92;in R^d' title='x &#92;in R^d' class='latex' />, if we choose the transformation</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cphi%28x%29+%3D+%5Cfrac%7B1%7D%7B%5Csqrt%7BD%7D%7D+%28cos%28w_1%5ET+x%29%2C+sin%28w_1%5ET+x%29%2C+cos%28w_2%5ET+x%29%2C+sin%28w_2%5ET+x%29%2C+%5Cldots%2C+cos%28w_D%5ET+x%29%2C+sin%28w_D%5ET+x%29%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;phi(x) = &#92;frac{1}{&#92;sqrt{D}} (cos(w_1^T x), sin(w_1^T x), cos(w_2^T x), sin(w_2^T x), &#92;ldots, cos(w_D^T x), sin(w_D^T x))' title='&#92;phi(x) = &#92;frac{1}{&#92;sqrt{D}} (cos(w_1^T x), sin(w_1^T x), cos(w_2^T x), sin(w_2^T x), &#92;ldots, cos(w_D^T x), sin(w_D^T x))' class='latex' /></p>
<p>with <img src='http://s0.wp.com/latex.php?latex=w_1%2C%5Cldots%2C+w_D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w_1,&#92;ldots, w_D' title='w_1,&#92;ldots, w_D' class='latex' /> drawn according to <img src='http://s0.wp.com/latex.php?latex=p%28w%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='p(w)' title='p(w)' class='latex' />, linear inner products in this transformed space will approximate <img src='http://s0.wp.com/latex.php?latex=K%28.%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(.)' title='K(.)' class='latex' />.</p>
<p><strong>Gaussian RBF Kernel</strong></p>
<p>The Gaussian radial basis function kernel satisfies all the above conditions and we know that the Fourier transform of the Gaussian is another Gaussian (with the reciprocal variance). Therefore for &#8220;linearizing&#8221; the Gaussian r.b.f. kernel, we draw <img src='http://s0.wp.com/latex.php?latex=D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D' title='D' class='latex' /> samples from a Gaussian distribution for the transformation.</p>
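<p>As a minimal sketch of the transformation for the Gaussian case, the code below assumes the kernel is parameterized as K(delta) = exp(-gamma * ||delta||^2), for which the sampling density is a Gaussian with covariance 2 * gamma * I. That parameterization and the function names are assumptions made for the example. The features are grouped as all cosines followed by all sines rather than interleaved, which leaves inner products unchanged.</p>

```python
import numpy as np


def sample_frequencies(d, D, gamma, rng):
    """Draw D frequency vectors for K(delta) = exp(-gamma * ||delta||^2).

    For this kernel E[cos(w^T delta)] equals K(delta) when w is Gaussian
    with mean zero and covariance 2 * gamma * I.
    """
    return rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))


def fourier_features(X, W):
    """Map each row x of X to (cos(w_k^T x), sin(w_k^T x)) / sqrt(D), k = 1..D."""
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])


rng = np.random.default_rng(0)
gamma = 1.0
X = rng.normal(size=(5, 2))
W = sample_frequencies(d=2, D=10000, gamma=gamma, rng=rng)
Z = fourier_features(X, W)

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_exact = np.exp(-gamma * sq_dists)      # exact Gaussian kernel matrix
K_approx = Z @ Z.T                       # plain inner products of the features
max_error = np.abs(K_exact - K_approx).max()
print("largest absolute error:", max_error)
```

<p>The inner product works because cos(a)cos(b) + sin(a)sin(b) = cos(a - b), so each pair of features contributes cos(w^T (x - y)), whose expectation is the kernel value.</p>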
<p><strong>Parzen Window Density Estimation</strong></p>
<p>Given a data set <img src='http://s0.wp.com/latex.php?latex=%5C%7Bx_1%2C+x_2%2C+%5Cldots%2C+x_N%5C%7D+%5Csubset+R%5Ed&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{x_1, x_2, &#92;ldots, x_N&#92;} &#92;subset R^d' title='&#92;{x_1, x_2, &#92;ldots, x_N&#92;} &#92;subset R^d' class='latex' />, the so-called Parzen window probability density estimator is defined as follows</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Chat%7Bp%7D%28x%29+%5Cpropto+%5Cfrac%7B1%7D%7BN%7D+%5Csum_i+K%28%28x-x_i%29%2Fh%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{p}(x) &#92;propto &#92;frac{1}{N} &#92;sum_i K((x-x_i)/h)' title='&#92;hat{p}(x) &#92;propto &#92;frac{1}{N} &#92;sum_i K((x-x_i)/h)' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=K%28.%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(.)' title='K(.)' class='latex' /> is often a positive, symmetric, shift-invariant kernel and <img src='http://s0.wp.com/latex.php?latex=h&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h' title='h' class='latex' /> is the bandwidth parameter that controls the scale of influence of the data points.</p>
<p>A common kernel that is used for Parzen window density estimation is the Gaussian density. If we make the same choice we can apply our feature transformation to linearize the procedure. We have</p>
<p><a href="http://mlstat.files.wordpress.com/2010/10/fourier2.png"><img class="aligncenter size-full wp-image-495" title="fourier2" src="http://mlstat.files.wordpress.com/2010/10/fourier2.png?w=294&#038;h=175" alt="" width="294" height="175" /></a></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=h&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h' title='h' class='latex' /> has been absorbed into the kernel variance.</p>
<p>Therefore all we need to do is take the mean of the transformed data points and estimate the pdf at a new point to be (proportional to) the inner product of its transformed feature vector with the mean.</p>
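<p>The procedure just described can be sketched as follows, next to a plain Parzen window estimate for comparison. The kernel parameterization exp(-gamma * ||delta||^2), the mixture used to generate the data, the query points and all names are assumptions made for this illustration, and both estimates are left unnormalized.</p>

```python
import numpy as np


def fourier_features(X, W):
    """Random Fourier features: cosines and sines of the projections, scaled."""
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])


rng = np.random.default_rng(1)
gamma, D = 1.0, 10000                                   # illustrative values
data = np.vstack([rng.normal(loc=-2.0, size=(30, 2)),
                  rng.normal(loc=2.0, size=(30, 2))])   # two-component mixture
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, 2))

# The data set is summarized once by the mean of its feature vectors.
mean_feature = fourier_features(data, W).mean(axis=0)


def density_rff(query):
    """Unnormalized estimate: inner product with the mean feature vector."""
    return fourier_features(query, W) @ mean_feature


def density_parzen(query):
    """Unnormalized Parzen estimate with the same Gaussian kernel."""
    sq = ((query[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq).mean(axis=1)


query = np.array([[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0], [6.0, 6.0]])
est_rff = density_rff(query)
est_parzen = density_parzen(query)
print("random features:", est_rff)
print("plain Parzen:   ", est_parzen)
```

<p>Note that the random-feature estimate can come out slightly negative far from the data, since nothing in the inner product forces it to be positive.</p>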
<p>Of course, since the kernel value is only approximated by the inner product of the random Fourier features, we expect that the estimated pdf will differ from a plain unadorned Parzen window estimate.  But different how?</p>
<p><strong>Experiments</strong></p>
<p>Below are some pictures showing how the method performs on some synthetic data. I generated a few dozen points from a mixture of Gaussians and plotted contours of the estimated pdf for the region around the points. I did this for several choices of <img src='http://s0.wp.com/latex.php?latex=D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D' title='D' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Cgamma&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;gamma' title='&#92;gamma' class='latex' /> (the scale parameter for the Gaussian kernel).</p>
<p>First let us check that the method performs as expected for large values of <img src='http://s0.wp.com/latex.php?latex=D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D' title='D' class='latex' /> because the kernel value is well approximated by the inner product of the Fourier features. The first 3 pictures are for <img src='http://s0.wp.com/latex.php?latex=D+%3D+10000&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D = 10000' title='D = 10000' class='latex' /> for various values of <img src='http://s0.wp.com/latex.php?latex=%5Cgamma&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;gamma' title='&#92;gamma' class='latex' />.</p>
<div id="attachment_503" class="wp-caption aligncenter" style="width: 460px"><a href="http://mlstat.files.wordpress.com/2010/10/k2d10000.png"><img class="size-full wp-image-503" title="K2D10000" src="http://mlstat.files.wordpress.com/2010/10/k2d10000.png?w=450&#038;h=339" alt="" width="450" height="339" /></a><p class="wp-caption-text">D = 10000 and gamma = 2.0</p></div>
<div id="attachment_501" class="wp-caption aligncenter" style="width: 460px"><a href="http://mlstat.files.wordpress.com/2010/10/k1d10000.png"><img class="size-full wp-image-501  " title="k1D10000" src="http://mlstat.files.wordpress.com/2010/10/k1d10000.png?w=450&#038;h=339" alt="" width="450" height="339" /></a><p class="wp-caption-text"> D = 10000 and gamma = 1.0</p></div>
<div id="attachment_502" class="wp-caption aligncenter" style="width: 460px"><a href="http://mlstat.files.wordpress.com/2010/10/kp5d10000.png"><img class="size-full wp-image-502" title="Kp5D10000" src="http://mlstat.files.wordpress.com/2010/10/kp5d10000.png?w=450&#038;h=339" alt="" width="450" height="339" /></a><p class="wp-caption-text">D = 10000  and gamma = 0.5</p></div>
<p>—————————————————————————</p>
<p>—————————————————————————</p>
<p>Now let us see what happens when we decrease <img src='http://s0.wp.com/latex.php?latex=D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D' title='D' class='latex' />. We expect that the error in approximating the kernel will lead to an obviously erroneous pdf.  This is clearly evident for the case of <img src='http://s0.wp.com/latex.php?latex=D%3D100&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D=100' title='D=100' class='latex' />.</p>
<div id="attachment_506" class="wp-caption aligncenter" style="width: 460px"><a href="http://mlstat.files.wordpress.com/2010/10/k1d1000.png"><img class="size-full wp-image-506" title="k1D1000" src="http://mlstat.files.wordpress.com/2010/10/k1d1000.png?w=450&#038;h=339" alt="" width="450" height="339" /></a><p class="wp-caption-text">D=1000 and gamma = 1.0</p></div>
<div id="attachment_508" class="wp-caption aligncenter" style="width: 460px"><a href="http://mlstat.files.wordpress.com/2010/10/k1d1001.png"><img class="size-full wp-image-508" title="k1D100" src="http://mlstat.files.wordpress.com/2010/10/k1d1001.png?w=450&#038;h=339" alt="" width="450" height="339" /></a><p class="wp-caption-text">D=100 and gamma = 1.0</p></div>
<p>—————————————————————————</p>
<p>—————————————————————————</p>
<p>The following picture for  <img src='http://s0.wp.com/latex.php?latex=D+%3D+1000&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D = 1000' title='D = 1000' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Cgamma+%3D+2.0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;gamma = 2.0' title='&#92;gamma = 2.0' class='latex' /> is even stranger.</p>
<div id="attachment_509" class="wp-caption aligncenter" style="width: 460px"><a href="http://mlstat.files.wordpress.com/2010/10/k2d1000.png"><img class="size-full wp-image-509" title="K2D1000" src="http://mlstat.files.wordpress.com/2010/10/k2d1000.png?w=450&#038;h=339" alt="" width="450" height="339" /></a><p class="wp-caption-text">D = 1000 and gamma = 2.0</p></div>
<p>—————————————————————————</p>
<p>—————————————————————————</p>
<p><strong>Discussion</strong></p>
<p>Even for a simple 2D example, we seem to need a very large number of random Fourier features to make the estimated pdf accurate. (For this small example this is very wasteful, since a plain Parzen window estimate would require less memory and computation.)</p>
<p>However, the pictures do indicate that if the approach is to be used for outlier detection (aka novelty detection) <em>from a given data set</em>, we might be able to get away with a much smaller <img src='http://s0.wp.com/latex.php?latex=D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D' title='D' class='latex' />. That is, even if the estimated pdf has a big error over the entire space, on the points from the data it seems to be reasonably accurate.</p>
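<p>The need for a large number of features is what one would expect from a Monte Carlo average: each kernel value is approximated by the mean of D independent terms bounded between -1 and 1, so its standard deviation shrinks only like 1/sqrt(D). The sketch below measures this on synthetic points; the kernel parameterization exp(-gamma * ||delta||^2), the data and the names are assumptions made for the illustration, not a reproduction of the experiments above.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 1.0                                   # assumed kernel: exp(-gamma * ||delta||^2)
X = rng.normal(size=(50, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K_exact = np.exp(-gamma * sq_dists)


def rms_error(D):
    """Root-mean-square error of the D-feature kernel approximation over all pairs."""
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, 2))
    proj = X @ W.T
    Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)
    return np.sqrt(np.mean((Z @ Z.T - K_exact) ** 2))


errors = {D: rms_error(D) for D in (100, 1000, 10000)}
for D, err in errors.items():
    print(D, err, "reference 1/sqrt(D):", 1 / np.sqrt(D))
```

<p>Gaining one more decimal digit of accuracy in the kernel values therefore costs roughly a hundred times more features.</p>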
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2010/10/04/random-fourier-features-for-kernel-density-estimation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/fourier11.png" medium="image">
			<media:title type="html">fourier1</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/fourier2.png" medium="image">
			<media:title type="html">fourier2</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/k2d10000.png" medium="image">
			<media:title type="html">K2D10000</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/k1d10000.png" medium="image">
			<media:title type="html">k1D10000</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/kp5d10000.png" medium="image">
			<media:title type="html">Kp5D10000</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/k1d1000.png" medium="image">
			<media:title type="html">k1D1000</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/k1d1001.png" medium="image">
			<media:title type="html">k1D100</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/10/k2d1000.png" medium="image">
			<media:title type="html">K2D1000</media:title>
		</media:content>
	</item>
		<item>
		<title>Regularized Minimax on Synthetic Data</title>
		<link>http://mlstat.wordpress.com/2010/04/19/regularized-minimax-on-synthetic-data/</link>
		<comments>http://mlstat.wordpress.com/2010/04/19/regularized-minimax-on-synthetic-data/#comments</comments>
		<pubDate>Mon, 19 Apr 2010 23:20:12 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Classification]]></category>
		<category><![CDATA[Domain Adaptation]]></category>
		<category><![CDATA[Estimation]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[biased training data]]></category>
		<category><![CDATA[cost-sensitive learning]]></category>
		<category><![CDATA[logistic regression]]></category>
		<category><![CDATA[robust learning]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=451</guid>
		<description><![CDATA[First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=mlstat.wordpress.com&amp;blog=6090177&amp;post=451&amp;subd=mlstat&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>First I would like to mention that, since my <a href="http://mlstat.wordpress.com/2010/03/13/constrained-minimax-for-robust-learning/">last post</a>, I came across the paper from 2005 on <a href="http://www.ri.cmu.edu/pub_files/pub4/bagnell_james_2005_1/bagnell_james_2005_1.pdf" target="_blank">Robust Supervised Learning</a> by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.</p>
<p><strong>Experiments on Synthetic Data</strong></p>
<p>I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.</p>
<p>Both data sets contain examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two <em>types</em> per class).</p>
<p>The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.</p>
<p>We would expect a plain logistic regression classifier to minimize the overall &#8220;error&#8221; on the training data.</p>
<p>However, since an adversary may assign the various types a set of costs different from those given by the type frequencies, a minimax classifier will hopefully avoid incurring a large number of errors on the most confusable types.</p>
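<p>A data set with this structure can be generated along the following lines. This is only a sketch: the cluster centres, standard deviation, sample sizes and function names are hypothetical choices for illustration, not the values behind the figures below.</p>

```python
import random

def sample_class(rng, label, centres, n, minor_frac=0.1, sd=0.5):
    """Draw n points for one class from a two-component Gaussian mixture.

    centres lists two (type name, 2-D mean) pairs; the first is the
    frequent type and the second the rare one (9:1 when minor_frac=0.1).
    """
    (major, major_mu), (minor, minor_mu) = centres
    n_minor = round(n * minor_frac)
    plan = [(major, major_mu)] * (n - n_minor) + [(minor, minor_mu)] * n_minor
    rows = []
    for type_name, (mx, my) in plan:
        point = (rng.gauss(mx, sd), rng.gauss(my, sd))
        rows.append((point, label, type_name))
    return rows

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical centres; the rare types sit closer to the other class.
red = sample_class(rng, +1, [("square", (0.0, 0.0)), ("circle", (3.0, 3.0))], 200)
blue = sample_class(rng, -1, [("diamond", (2.0, 0.0)), ("triangle", (5.0, 3.0))], 200)
data = red + blue

# Count the tokens of each type to confirm the 9:1 skew within each class.
counts = {}
for _, _, type_name in data:
    counts[type_name] = counts.get(type_name, 0) + 1
print(counts)
```

<p>Keeping the type name with each point makes it easy to group examples by type later, which is what the type-level minimax needs.</p>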
<p><strong>Example 1</strong></p>
<p><strong> </strong></p>
<div id="attachment_453" class="wp-caption aligncenter" style="width: 459px"><strong><strong><a href="http://mlstat.files.wordpress.com/2010/04/original.png"><img class="size-full wp-image-453" title="original" src="http://mlstat.files.wordpress.com/2010/04/original.png?w=449&#038;h=339" alt="" width="449" height="339" /></a></strong></strong><p class="wp-caption-text">Example 1. Original training data set. Both the red and blue classes have two types in a 9:1 ratio.</p></div>
<div id="attachment_454" class="wp-caption aligncenter" style="width: 459px"><strong><strong><a href="http://mlstat.files.wordpress.com/2010/04/plain_logit.png"><img class="size-full wp-image-454" title="plain_logit" src="http://mlstat.files.wordpress.com/2010/04/plain_logit.png?w=449&#038;h=339" alt="" width="449" height="339" /></a></strong></strong><p class="wp-caption-text">Example 1. Plain logistic regression. No minimax. Almost all of the red circles are misclassified.</p></div>
<div id="attachment_455" class="wp-caption aligncenter" style="width: 459px"><strong><strong><a href="http://mlstat.files.wordpress.com/2010/04/gamma0point5.png"><img class="size-full wp-image-455" title="gamma0point5" src="http://mlstat.files.wordpress.com/2010/04/gamma0point5.png?w=449&#038;h=339" alt="" width="449" height="339" /></a></strong></strong><p class="wp-caption-text">Example 1. Minimax with <img src='http://s0.wp.com/latex.php?latex=%5Cgamma+%3D+0.5&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;gamma = 0.5' title='&#92;gamma = 0.5' class='latex' /></p></div>
<div id="attachment_456" class="wp-caption aligncenter" style="width: 459px"><strong><strong><a href="http://mlstat.files.wordpress.com/2010/04/gamma0point1.png"><img class="size-full wp-image-456" title="gamma0point1" src="http://mlstat.files.wordpress.com/2010/04/gamma0point1.png?w=449&#038;h=339" alt="" width="449" height="339" /></a></strong></strong><p class="wp-caption-text">Example 1. Minimax with gamma = 0.1</p></div>
<p><strong> </strong><br />
Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.</p>
<p><strong>Example 2</strong></p>
<div id="attachment_458" class="wp-caption aligncenter" style="width: 459px"><a href="http://mlstat.files.wordpress.com/2010/04/original2.png"><img class="size-full wp-image-458" title="original2" src="http://mlstat.files.wordpress.com/2010/04/original2.png?w=449&#038;h=339" alt="" width="449" height="339" /></a><p class="wp-caption-text">Example 2. Original training data set.</p></div>
<div id="attachment_459" class="wp-caption aligncenter" style="width: 459px"><a href="http://mlstat.files.wordpress.com/2010/04/plain_logit2.png"><img class="size-full wp-image-459" title="plain_logit2" src="http://mlstat.files.wordpress.com/2010/04/plain_logit2.png?w=449&#038;h=339" alt="" width="449" height="339" /></a><p class="wp-caption-text">Example 2. Plain logistic regression. No minimax.</p></div>
<div id="attachment_460" class="wp-caption aligncenter" style="width: 459px"><a href="http://mlstat.files.wordpress.com/2010/04/gamma0point5_2.png"><img class="size-full wp-image-460" title="gamma0point5_2" src="http://mlstat.files.wordpress.com/2010/04/gamma0point5_2.png?w=449&#038;h=339" alt="" width="449" height="339" /></a><p class="wp-caption-text">Example 2. Minimax with gamma = 0.5</p></div>
<p><strong>Discussion</strong></p>
<p>1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.</p>
<p>2. Unfortunately we didn&#8217;t quite get it to help on the named-entity recognition problem that motivated the work.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2010/04/19/regularized-minimax-on-synthetic-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/original.png" medium="image">
			<media:title type="html">original</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/plain_logit.png" medium="image">
			<media:title type="html">plain_logit</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/gamma0point5.png" medium="image">
			<media:title type="html">gamma0point5</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/gamma0point1.png" medium="image">
			<media:title type="html">gamma0point1</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/original2.png" medium="image">
			<media:title type="html">original2</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/plain_logit2.png" medium="image">
			<media:title type="html">plain_logit2</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/04/gamma0point5_2.png" medium="image">
			<media:title type="html">gamma0point5_2</media:title>
		</media:content>
	</item>
		<item>
		<title>Regularized Minimax for Robust Learning</title>
		<link>http://mlstat.wordpress.com/2010/03/13/constrained-minimax-for-robust-learning/</link>
		<comments>http://mlstat.wordpress.com/2010/03/13/constrained-minimax-for-robust-learning/#comments</comments>
		<pubDate>Sat, 13 Mar 2010 17:34:32 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Classification]]></category>
		<category><![CDATA[Domain Adaptation]]></category>
		<category><![CDATA[Estimation]]></category>
		<category><![CDATA[Logistic Regression]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[cost-sensitive learning]]></category>
		<category><![CDATA[danskin's theorem]]></category>
		<category><![CDATA[minimax]]></category>
		<category><![CDATA[robust learning]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=392</guid>
		<description><![CDATA[This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e learning that is robust to data drift. Cost Sensitive Loss Functions Given a training data set , most learning algorithms learn a classifier that is parametrized by a vector [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=mlstat.wordpress.com&amp;blog=6090177&amp;post=392&amp;subd=mlstat&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e., learning that is robust to data drift.</p>
<p><strong>Cost Sensitive Loss Functions<br />
</strong></p>
<p>Given a training data set <img src='http://s0.wp.com/latex.php?latex=D+%3D+%5C%7Bx_i%2C+y_i%5C%7D_%7Bi%3D1%2C%5Cldots%2CN%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='D = &#92;{x_i, y_i&#92;}_{i=1,&#92;ldots,N}' title='D = &#92;{x_i, y_i&#92;}_{i=1,&#92;ldots,N}' class='latex' />, most learning algorithms learn a classifier <img src='http://s0.wp.com/latex.php?latex=%5Cphi&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;phi' title='&#92;phi' class='latex' /> that is parametrized by a vector <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' /> by minimizing a loss function</p>
<p style="text-align:center;"><a href="http://mlstat.files.wordpress.com/2010/03/minimax1.png"><img class="size-full wp-image-396 aligncenter" title="minimax1" src="http://mlstat.files.wordpress.com/2010/03/minimax1.png?w=233&#038;h=38" alt="" width="233" height="38" /></a></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=l%28x_i%2C+y_i%2C+w%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='l(x_i, y_i, w)' title='l(x_i, y_i, w)' class='latex' /> is the loss on example <img src='http://s0.wp.com/latex.php?latex=i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='i' title='i' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=f%28w%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='f(w)' title='f(w)' class='latex' /> is some function that penalizes complexity. For example for logistic regression the loss function looks like</p>
<p style="text-align:center;"><a href="http://mlstat.files.wordpress.com/2010/03/minimax2.png"><img class="size-full wp-image-400 aligncenter" title="minimax2" src="http://mlstat.files.wordpress.com/2010/03/minimax2.png?w=268&#038;h=39" alt="" width="268" height="39" /></a></p>
<p>for some <img src='http://s0.wp.com/latex.php?latex=%5Clambda+%3E+0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;lambda &gt; 0' title='&#92;lambda &gt; 0' class='latex' />.</p>
<p>If, in addition, the examples come with costs <img src='http://s0.wp.com/latex.php?latex=c_i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c_i' title='c_i' class='latex' /> (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost-sensitive learning either by over/under-sampling the training data or by minimizing a cost-weighted loss function (see <a href="http://researchweb.watson.ibm.com/dar/papers/pdf/finalICDM2003.pdf" target="_blank">this paper by Zadrozny et al.</a>)</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax32.png"><img class="aligncenter size-full wp-image-430" title="minimax3" src="http://mlstat.files.wordpress.com/2010/03/minimax32.png?w=261&#038;h=35" alt="" width="261" height="35" /></a></p>
<p>We further constrain <img src='http://s0.wp.com/latex.php?latex=%5Csum_i%5EN+c_i+%3D+N&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;sum_i^N c_i = N' title='&#92;sum_i^N c_i = N' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=c_i+%5Cge+0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c_i &#92;ge 0' title='c_i &#92;ge 0' class='latex' />. So the unweighted learning problem corresponds to the case where all <img src='http://s0.wp.com/latex.php?latex=c_i+%3D+1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c_i = 1' title='c_i = 1' class='latex' />.</p>
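<p>As a minimal sketch of the cost-weighted objective, assume labels in {-1, +1}, the logistic loss log(1 + exp(-y w.x)) on each example, and the penalty f(w) = lambda ||w||^2. The loss formulas above are images, so these exact forms, the function names and the toy data are my assumptions for illustration.</p>

```python
import math

def logistic_loss(x, y, w):
    """Loss on one example: log(1 + exp(-y * w.x)), with y in {-1, +1}."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def weighted_loss(data, costs, w, lam):
    """Cost-weighted loss: sum_i c_i * l(x_i, y_i, w) + lam * ||w||^2."""
    fit = sum(c * logistic_loss(x, y, w) for (x, y), c in zip(data, costs))
    penalty = lam * sum(wj * wj for wj in w)
    return fit + penalty

# Hypothetical toy data: (feature vector, label) pairs.
data = [([1.0, 2.0], 1), ([1.0, -1.0], -1), ([1.0, 0.5], 1)]
w = [0.1, 0.3]
uniform = [1.0, 1.0, 1.0]  # all c_i = 1: the unweighted problem
skewed = [2.0, 0.5, 0.5]   # non-negative and still sums to N = 3
print(weighted_loss(data, uniform, w, lam=0.1))
print(weighted_loss(data, skewed, w, lam=0.1))
```

<p>With all costs equal to one, the weighted loss reduces to the ordinary regularized logistic loss, which is the unweighted case described above.</p>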
<p><strong>A Game Against An Adversary</strong></p>
<p>Assume that the learner is playing a game against an adversary that will assign the costs <img src='http://s0.wp.com/latex.php?latex=%5C%7Bc_i%5C%7D_%7Bi%3D1%2C%5Cldots%2CN%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{c_i&#92;}_{i=1,&#92;ldots,N}' title='&#92;{c_i&#92;}_{i=1,&#92;ldots,N}' class='latex' /> to the training examples that will lead to the worst possible loss for any weight vector the learner produces.</p>
<p>How do we learn in order to minimize this maximum possible loss? The solution is to look for the <em>minimax</em> solution</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax4.png"><img class="aligncenter size-full wp-image-412" title="minimax4" src="http://mlstat.files.wordpress.com/2010/03/minimax4.png?w=219&#038;h=27" alt="" width="219" height="27" /></a></p>
<p>For any realistic learning problem the above optimization problem does not have a unique solution.</p>
<p>Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.</p>
<p>We define<a href="http://mlstat.files.wordpress.com/2010/03/minimax63.png"><img class="aligncenter size-full wp-image-421" title="minimax6" src="http://mlstat.files.wordpress.com/2010/03/minimax63.png?w=230&#038;h=20" alt="" width="230" height="20" /></a></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=H%28c%29+%3D+-%5Csum_i+c_i+%5Clog+c_i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='H(c) = -&#92;sum_i c_i &#92;log c_i' title='H(c) = -&#92;sum_i c_i &#92;log c_i' class='latex' /> (the Shannon entropy of the cost vector, save the normalization to sum to one).</p>
<p>The new minimax optimization problem can be posed as</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax5.png"><img class="aligncenter size-full wp-image-416" title="minimax5" src="http://mlstat.files.wordpress.com/2010/03/minimax5.png?w=260&#038;h=30" alt="" width="260" height="30" /></a></p>
<p>subject to the constraints</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax7.png"><img class="aligncenter size-full wp-image-417" title="minimax7" src="http://mlstat.files.wordpress.com/2010/03/minimax7.png?w=204&#038;h=46" alt="" width="204" height="46" /></a></p>
<p>Note that the regularization term on the cost vector <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c' title='c' class='latex' /> essentially restricts the set of  possible cost vectors the adversary has at his disposal.</p>
<p><strong>Optimization</strong></p>
<p>For convex loss functions (such as the logistic loss) <img src='http://s0.wp.com/latex.php?latex=L%28w%2C+c%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='L(w, c)' title='L(w, c)' class='latex' /> is convex in <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' /> for a fixed cost assignment, and therefore so is <img src='http://s0.wp.com/latex.php?latex=R%28w%2C+c%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R(w, c)' title='R(w, c)' class='latex' />. Furthermore, <img src='http://s0.wp.com/latex.php?latex=R%28w%2C+c%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R(w, c)' title='R(w, c)' class='latex' /> is concave in <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c' title='c' class='latex' />, which is restricted to a convex and compact set. We can therefore apply <a href="http://en.wikipedia.org/wiki/Danskin%27s_theorem" target="_blank">Danskin&#8217;s theorem</a> to perform the optimization.</p>
<p>The theorem allows us to say that, for a fixed weight vector <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' />, if</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax8.png"><img class="aligncenter size-full wp-image-423" title="minimax8" src="http://mlstat.files.wordpress.com/2010/03/minimax8.png?w=279&#038;h=52" alt="" width="279" height="52" /></a></p>
<p>and if <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bc%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;tilde{c}' title='&#92;tilde{c}' class='latex' /> is unique, then</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax9.png"><img class="aligncenter size-full wp-image-424" title="minimax9" src="http://mlstat.files.wordpress.com/2010/03/minimax9.png?w=191&#038;h=19" alt="" width="191" height="19" /></a></p>
<p>even though <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bc%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;tilde{c}' title='&#92;tilde{c}' class='latex' /> is a function of <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' />.</p>
<p><strong>Algorithm</strong></p>
<p>The algorithm is very simple. Repeat the following two steps until convergence:</p>
<p><span style="color:#333399;">1. At the <img src='http://s0.wp.com/latex.php?latex=k%5E%7Bth%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='k^{th}' title='k^{th}' class='latex' /> iteration, for the weight vector <img src='http://s0.wp.com/latex.php?latex=w%5E%7Bk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w^{k}' title='w^{k}' class='latex' /> find the cost vector <img src='http://s0.wp.com/latex.php?latex=%5Ctilde%7Bc%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;tilde{c}' title='&#92;tilde{c}' class='latex' /> that maximizes <img src='http://s0.wp.com/latex.php?latex=R%28w%5E%7Bk%7D%2Cc%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R(w^{k},c)' title='R(w^{k},c)' class='latex' />.<br />
</span></p>
<p><span style="color:#333399;">2. Update <img src='http://s0.wp.com/latex.php?latex=w%5E%7Bk%2B1%7D+%3D+w%5E%7Bk%7D+-+%5Ceta+%5Cnabla_w+R%28w%5E%7Bk%7D%2C+%5Ctilde%7Bc%7D%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w^{k+1} = w^{k} - &#92;eta &#92;nabla_w R(w^{k}, &#92;tilde{c})' title='w^{k+1} = w^{k} - &#92;eta &#92;nabla_w R(w^{k}, &#92;tilde{c})' class='latex' />, where <img src='http://s0.wp.com/latex.php?latex=%5Ceta&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;eta' title='&#92;eta' class='latex' /> is the learning rate.</span></p>
<p>The maximization in step 1 is also simple; its solution can be shown to be</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax10.png"><img class="aligncenter size-full wp-image-431" title="minimax10" src="http://mlstat.files.wordpress.com/2010/03/minimax10.png?w=187&#038;h=44" alt="" width="187" height="44" /></a></p>
<p>As expected, if <img src='http://s0.wp.com/latex.php?latex=%5Cgamma+%5Crightarrow+%5Cinfty&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;gamma &#92;rightarrow &#92;infty' title='&#92;gamma &#92;rightarrow &#92;infty' class='latex' />, the costs remain close to one and as <img src='http://s0.wp.com/latex.php?latex=%5Cgamma+%5Crightarrow+0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;gamma &#92;rightarrow 0' title='&#92;gamma &#92;rightarrow 0' class='latex' /> the entire cost budget is allocated to the example with the largest loss.</p>
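<p>The two steps can be sketched as follows. The closed form for step 1 is given above only as an image, so this sketch assumes R(w, c) = sum_i c_i l_i + lambda ||w||^2 + gamma H(c) with sum_i c_i = N; maximizing that over c gives c_i = N exp(l_i / gamma) / sum_j exp(l_j / gamma), which has the two limits just described. The function names, step size, iteration count and toy data are hypothetical.</p>

```python
import math

def logistic_loss(x, y, w):
    """log(1 + exp(-y * w.x)) for a label y in {-1, +1}, computed stably."""
    margin = y * sum(wj * xj for wj, xj in zip(w, x))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def worst_case_costs(losses, gamma):
    """Step 1 (assumed form): c_i = N * exp(l_i / gamma) / sum_j exp(l_j / gamma)."""
    top = max(losses)  # subtract the maximum before exponentiating
    weights = [math.exp((loss - top) / gamma) for loss in losses]
    total = sum(weights)
    return [len(losses) * wt / total for wt in weights]

def minimax_fit(data, gamma, lam=0.01, eta=0.05, iters=500):
    """Alternate the adversary's best response with a gradient step on w."""
    w = [0.0] * len(data[0][0])
    for _ in range(iters):
        costs = worst_case_costs([logistic_loss(x, y, w) for x, y in data], gamma)
        # Step 2: gradient in w with the costs held fixed (Danskin's theorem).
        grad = [2.0 * lam * wj for wj in w]  # gradient of lam * ||w||^2
        for (x, y), c in zip(data, costs):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            slope = 0.5 * (1.0 - math.tanh(0.5 * margin))  # 1 / (1 + exp(margin))
            for j, xj in enumerate(x):
                grad[j] -= c * y * slope * xj
        w = [wj - eta * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data: (features with a leading bias term, label).
data = [([1.0, -2.0], -1), ([1.0, -1.5], -1), ([1.0, -1.0], -1),
        ([1.0, 1.0], 1), ([1.0, 1.5], 1), ([1.0, 0.2], -1)]
for gamma in (100.0, 1.0, 0.1):
    w = minimax_fit(data, gamma)
    losses = [logistic_loss(x, y, w) for x, y in data]
    print(gamma, [round(c, 2) for c in worst_case_costs(losses, gamma)])
```

<p>For large gamma the printed costs should stay near one, and for small gamma most of the budget should move to the hardest example. A fixed step size is used only to keep the sketch short.</p>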
<p><strong>Of types and tokens</strong></p>
<p>This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.</p>
<p>For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going to see in general.</p>
<p>(&#8220;Joe Plumber&#8221; occurs so frequently in our data because we were unlucky enough to collect annotated data in 2008.)</p>
<p>We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible <em>type</em> distribution rather than for the observed <em>token</em> distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.</p>
<p>For type <img src='http://s0.wp.com/latex.php?latex=t&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='t' title='t' class='latex' /> let <img src='http://s0.wp.com/latex.php?latex=S_t&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S_t' title='S_t' class='latex' /> be the set of all tokens of that type and <img src='http://s0.wp.com/latex.php?latex=N_t&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='N_t' title='N_t' class='latex' /> be the number of tokens of that type. We now estimate <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' /> by</p>
<p><a href="http://mlstat.files.wordpress.com/2010/03/minimax11.png"><img class="aligncenter size-full wp-image-433" title="minimax11" src="http://mlstat.files.wordpress.com/2010/03/minimax11.png?w=282&#038;h=166" alt="" width="282" height="166" /></a></p>
<p>under the same constraints on <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c' title='c' class='latex' /> as above. Here <img src='http://s0.wp.com/latex.php?latex=q&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='q' title='q' class='latex' /> is the observed type distribution in the training data and <img src='http://s0.wp.com/latex.php?latex=KL%28.%5C%7C.%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='KL(.&#92;|.)' title='KL(.&#92;|.)' class='latex' /> is the KL-divergence.</p>
<p>The algorithm is identical to the one above except the maximum over <img src='http://s0.wp.com/latex.php?latex=c&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c' title='c' class='latex' /> for a fixed <img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' /> is slightly different.</p>
<p style="text-align:center;"><a href="http://mlstat.files.wordpress.com/2010/03/minimax_new.png"><img class="aligncenter size-full wp-image-445" title="minimax_new" src="http://mlstat.files.wordpress.com/2010/03/minimax_new.png?w=152&#038;h=53" alt="" width="152" height="53" /></a></p>
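<p>Here is a rough sketch of the type-level maximization. The objective and its solution are shown above only as images, so I am assuming that the adversary maximizes sum_t c_t m_t - gamma KL(c || q) over probability vectors c, where m_t is the mean loss over the tokens in S_t. Under that assumption the maximizer is c_t proportional to q_t exp(m_t / gamma). The normalization and the function name are my own choices and may differ from the formula in the image.</p>

```python
import math
from collections import defaultdict

def type_costs(losses, types, gamma):
    """Assumed type-level best response: c_t proportional to q_t * exp(m_t / gamma)."""
    by_type = defaultdict(list)
    for loss, type_name in zip(losses, types):
        by_type[type_name].append(loss)
    n = len(losses)
    # m_t: mean token loss of type t; q_t = N_t / N: observed type frequency.
    mean_loss = {t: sum(v) / len(v) for t, v in by_type.items()}
    q = {t: len(v) / n for t, v in by_type.items()}
    top = max(mean_loss.values())  # subtract the maximum before exponentiating
    raw = {t: q[t] * math.exp((mean_loss[t] - top) / gamma) for t in by_type}
    total = sum(raw.values())
    return {t: r / total for t, r in raw.items()}

# Hypothetical token losses: nine easy "square" tokens and one hard "circle".
losses = [0.1] * 9 + [2.0]
types = ["square"] * 9 + ["circle"]
for gamma in (1e6, 1.0, 0.01):
    print(gamma, type_costs(losses, types, gamma))
```

<p>As gamma grows, the costs fall back to the observed type distribution q, which weights every token equally as in ordinary training. As gamma shrinks, the mass moves to the type with the largest mean loss.</p>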
<p><strong>Related Work and Discussion</strong></p>
<p>1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on the <a href="http://www.cs.nyu.edu/~roweis/papers/invar-chapter.pdf" target="_blank">adversarial view for covariate shift</a> by Globerson et al. and <a href="http://books.nips.cc/papers/files/nips22/NIPS2009_0534.pdf">the NIPS paper</a> by Bruckner and Scheffer. Both of these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen elsewhere the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.</p>
<p>2. If each class is considered to be one type, the usual Duda &amp; Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors (<img src='http://s0.wp.com/latex.php?latex=w&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='w' title='w' class='latex' /> for us) and not for the costs over the examples.</p>
<p>3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.</p>
<p>4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a <em>particular</em> user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.</p>
<p>In the next post I will present some experimental results on toy examples with synthetic data.</p>
<p><strong>Acknowledgment</strong></p>
<p>I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin&#8217;s theorem, and to Marc Light for suggesting the problem.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2010/03/13/constrained-minimax-for-robust-learning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax1.png" medium="image">
			<media:title type="html">minimax1</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax2.png" medium="image">
			<media:title type="html">minimax2</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax32.png" medium="image">
			<media:title type="html">minimax3</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax4.png" medium="image">
			<media:title type="html">minimax4</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax63.png" medium="image">
			<media:title type="html">minimax6</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax5.png" medium="image">
			<media:title type="html">minimax5</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax7.png" medium="image">
			<media:title type="html">minimax7</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax8.png" medium="image">
			<media:title type="html">minimax8</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax9.png" medium="image">
			<media:title type="html">minimax9</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax10.png" medium="image">
			<media:title type="html">minimax10</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax11.png" medium="image">
			<media:title type="html">minimax11</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2010/03/minimax_new.png" medium="image">
			<media:title type="html">minimax_new</media:title>
		</media:content>
	</item>
		<item>
		<title>Sparse online kernel logistic regression</title>
		<link>http://mlstat.wordpress.com/2009/12/06/sparse-online-kernel-logistic-regression/</link>
		<comments>http://mlstat.wordpress.com/2009/12/06/sparse-online-kernel-logistic-regression/#comments</comments>
		<pubDate>Sun, 06 Dec 2009 19:52:05 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Classification]]></category>
		<category><![CDATA[Estimation]]></category>
		<category><![CDATA[kernel logistic regression]]></category>
		<category><![CDATA[logistic regression]]></category>
		<category><![CDATA[online learning]]></category>
		<category><![CDATA[prototype kernel]]></category>
		<category><![CDATA[sparse kernel]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=361</guid>
		<description><![CDATA[In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.) (As a brief aside, I note that an essentially identical approach [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=mlstat.wordpress.com&amp;blog=6090177&amp;post=361&amp;subd=mlstat&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://mlstat.wordpress.com/2009/08/17/an-effective-kernelization-of-logistic-regression/" target="_self">previous post</a>, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. <a href="http://mlstat.files.wordpress.com/2010/04/handout.pdf" target="_self">Slides for a tutorial</a> on this stuff.)</p>
<p>(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.5333" target="_blank">Snelson and Ghahramani</a>. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)</p>
<p>I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).</p>
<p>It seems to perform reasonably well on large datasets. Below I&#8217;ll show its behavior on some pseudo-randomly generated classification problems.</p>
<p>All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples drawn from three classes, each of which is a mixture of Gaussians in 2D (shown in red, blue and green). The left panel shows the training data and the right panel shows the predictions of the learned logistic regression classifier on the same data set. The prototypes are shown as black squares.</p>
<p><strong>Example 1 (using 3 prototypes)<br />
</strong></p>
<div id="attachment_368" class="wp-caption alignleft" style="width: 550px"><a href="http://mlstat.files.wordpress.com/2009/12/iterb11.png"><img class="size-full wp-image-368" title="After first iteration" src="http://mlstat.files.wordpress.com/2009/12/iterb11.png?w=540&#038;h=278" alt="" width="540" height="278" /></a><p class="wp-caption-text">After first iteration</p></div>
<div id="attachment_371" class="wp-caption alignleft" style="width: 549px"><a href="http://mlstat.files.wordpress.com/2009/12/iterb21.png"><img class="size-full wp-image-371 " title="After second iteration" src="http://mlstat.files.wordpress.com/2009/12/iterb21.png?w=539&#038;h=283" alt="" width="539" height="283" /></a><p class="wp-caption-text">After second iteration</p></div>
<div id="attachment_373" class="wp-caption alignleft" style="width: 569px"><a href="http://mlstat.files.wordpress.com/2009/12/iterbn2.png"><img class="size-full wp-image-373   " title="iterbn" src="http://mlstat.files.wordpress.com/2009/12/iterbn2.png?w=559&#038;h=292" alt="" width="559" height="292" /></a><p class="wp-caption-text">After about 10 iterations</p></div>
<p>Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.</p>
<p><strong>Example 2 (five prototypes)<br />
</strong></p>
<div id="attachment_378" class="wp-caption alignleft" style="width: 550px"><a href="http://mlstat.files.wordpress.com/2009/12/itera1.png"><img class="size-full wp-image-378 " title="itera1" src="http://mlstat.files.wordpress.com/2009/12/itera1.png?w=540&#038;h=282" alt="" width="540" height="282" /></a><p class="wp-caption-text">After first iteration</p></div>
<div id="attachment_379" class="wp-caption alignleft" style="width: 550px"><a href="http://mlstat.files.wordpress.com/2009/12/iteran.png"><img class="size-full wp-image-379 " title="iteran" src="http://mlstat.files.wordpress.com/2009/12/iteran.png?w=540&#038;h=282" alt="" width="540" height="282" /></a><p class="wp-caption-text">After 5 iterations</p></div>
<p><strong>Example 3 (five prototypes)<br />
</strong></p>
<div id="attachment_380" class="wp-caption alignleft" style="width: 595px"><a href="http://mlstat.files.wordpress.com/2009/12/iter1.png"><img class="size-full wp-image-380 " title="Iter1" src="http://mlstat.files.wordpress.com/2009/12/iter1.png?w=585&#038;h=208" alt="" width="585" height="208" /></a><p class="wp-caption-text">After first iteration</p></div>
<p>The rightmost panel shows the first two &#8220;transformed features&#8221;, i.e., the kernel values of the examples to the first two prototypes.</p>
<div id="attachment_381" class="wp-caption alignnone" style="width: 505px"><a href="http://mlstat.files.wordpress.com/2009/12/iter2.png"><img class="size-full wp-image-381 " title="iter2" src="http://mlstat.files.wordpress.com/2009/12/iter2.png?w=495&#038;h=261" alt="" width="495" height="261" /></a><p class="wp-caption-text">After second iteration</p></div>
<p><strong>Implementation details and discussion</strong></p>
<p>The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the prototypes (fixing the betas and the kernel parameter), and then another time to update the kernel parameter. These three update steps are repeated until convergence.</p>
<p>As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).</p>
<p>I had to make some approximations to make the algorithm fast &#8212; the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.</p>
<p>The kernel parameter updating worked best when the RBF kernel was re-parametrized as <img src='http://s0.wp.com/latex.php?latex=K%28x%2Cu%29+%3D+exp%28-exp%28%5Ctheta%29+%7C%7Cx-u%7C%7C%5E2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K(x,u) = exp(-exp(&#92;theta) ||x-u||^2)' title='K(x,u) = exp(-exp(&#92;theta) ||x-u||^2)' class='latex' />.</p>
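<p>To make the re-parametrization concrete, here is a minimal Python sketch of the kernel and its derivative with respect to the parameter. The function names are hypothetical and this is not the Java implementation mentioned above. One plausible reason the form is convenient is that the bandwidth exp(theta) stays positive for any real theta, so unconstrained gradient steps on theta can never produce an invalid kernel.</p>

```python
import math


def squared_distance(x, u):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ui) ** 2 for xi, ui in zip(x, u))


def rbf_kernel(x, u, theta):
    """K(x, u) = exp(-exp(theta) * ||x - u||^2)."""
    return math.exp(-math.exp(theta) * squared_distance(x, u))


def rbf_kernel_grad_theta(x, u, theta):
    """dK/dtheta = -exp(theta) * ||x - u||^2 * K(x, u), by the chain rule."""
    gamma = math.exp(theta)  # always positive, whatever the sign of theta
    d2 = squared_distance(x, u)
    return -gamma * d2 * math.exp(-gamma * d2)
```

<p>A stochastic gradient step on theta would multiply this derivative by the derivative of the loss with respect to the kernel value; that part depends on the model and is left out here.</p>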
<p>The learning rate for the betas was annealed, but the learning rates for the prototypes and the kernel parameter were held fixed at a constant value.</p>
<p>Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2009/12/06/sparse-online-kernel-logistic-regression/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/iterb11.png" medium="image">
			<media:title type="html">After first iteration</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/iterb21.png" medium="image">
			<media:title type="html">After second iteration</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/iterbn2.png" medium="image">
			<media:title type="html">iterbn</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/itera1.png" medium="image">
			<media:title type="html">itera1</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/iteran.png" medium="image">
			<media:title type="html">iteran</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/iter1.png" medium="image">
			<media:title type="html">Iter1</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/12/iter2.png" medium="image">
			<media:title type="html">iter2</media:title>
		</media:content>
	</item>
		<item>
		<title>BWT for NLP (2)</title>
		<link>http://mlstat.wordpress.com/2009/11/12/bwt-for-nlp-2/</link>
		<comments>http://mlstat.wordpress.com/2009/11/12/bwt-for-nlp-2/#comments</comments>
		<pubDate>Thu, 12 Nov 2009 21:58:57 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Information theory]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Burrows-Wheeler Transform]]></category>
		<category><![CDATA[BWT]]></category>
		<category><![CDATA[String Similarity]]></category>
		<category><![CDATA[Summarization]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=335</guid>
		<description><![CDATA[I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference. The task [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=mlstat.wordpress.com&amp;blog=6090177&amp;post=335&amp;subd=mlstat&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference.</p>
<p>The task was to produce an automatic metric to evaluate machine generated summaries (i.e., <em>system</em> summaries) against human generated summaries for the TAC &#8217;09 Update Summarization Task. Clearly the automatic metric is just some function that produces a similarity score between the system summary and the human generated (the so-called <em>model</em>) summary.</p>
<p>The proposed metrics were evaluated by comparing their rankings of the system summaries from different peers to the ranking produced by human judges.</p>
<p><strong>Similarity Metric</strong></p>
<p>We use an estimate of the conditional &#8220;compressibility&#8221; of the model summary given the system summary as the similarity metric. The conditional compressibility is defined as the increase in the compressibility of the model summary when the system summary has been observed.</p>
<div>
<div id="_mcePaste">In order to judge the similarity of the system summary <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> to the model summary <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' />, we propose to use the difference between the compressibility of <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' /> when <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> is not seen and its compressibility when <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> is given. This metric basically captures the reduction in the uncertainty in <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' /> when <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> is known.</div>
<p>We define the compressibility <img src='http://s0.wp.com/latex.php?latex=c%28M%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c(M)' title='c(M)' class='latex' /> of any string <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' /> by</p>
<p><img src='http://s0.wp.com/latex.php?latex=c%28M%29+%3D+%5Cfrac%7BH%28M%29%7D%7B%7CM%7C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c(M) = &#92;frac{H(M)}{|M|}' title='c(M) = &#92;frac{H(M)}{|M|}' class='latex' /></p>
<p>and the conditional compressibility of string <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' /> over an alphabet <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BA%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mathcal{A}' title='&#92;mathcal{A}' class='latex' /> given another string <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> over the same alphabet as</p>
<p><img src='http://s0.wp.com/latex.php?latex=c%28M%7CS%29+%3D+%5Cfrac%7BH%28S%2BM%29+-+H%28S%29%7D%7B%7CM%7C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='c(M|S) = &#92;frac{H(S+M) - H(S)}{|M|}' title='c(M|S) = &#92;frac{H(S+M) - H(S)}{|M|}' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=S%2BM&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S+M' title='S+M' class='latex' /> is the concatenation of the strings <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=H%28S%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='H(S)' title='H(S)' class='latex' /> is the entropy of string <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' />, and <img src='http://s0.wp.com/latex.php?latex=%7CM%7C&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='|M|' title='|M|' class='latex' /> is the length of the string <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' />.</p>
<div id="_mcePaste">The fractional increase in compressibility of <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' /> given <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> can then be measured by</div>
<p><img src='http://s0.wp.com/latex.php?latex=r%28M%7CS%29+%3D+%5Cfrac%7Bc%28M%29+-+c%28M%7CS%29%7D%7Bc%28M%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='r(M|S) = &#92;frac{c(M) - c(M|S)}{c(M)}' title='r(M|S) = &#92;frac{c(M) - c(M|S)}{c(M)}' class='latex' />.</p>
<p>We use <img src='http://s0.wp.com/latex.php?latex=r%28M%7CS%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='r(M|S)' title='r(M|S)' class='latex' /> as the similarity metric to measure the similarity of a system summary <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> to the model summary <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' />.</p>
<p>Our metric is similar to the one <a href="http://homepages.cwi.nl/~paulv/papers/similarity.pdf" target="_blank">proposed</a> by Li and Vitanyi and is theoretically well-justified from the perspective of algorithmic information theory. One peculiarity is that our similarity is asymmetric.</p>
<p>The only thing that is needed to implement the above similarity metric is an estimate of the entropy <img src='http://s0.wp.com/latex.php?latex=H%28S%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='H(S)' title='H(S)' class='latex' /> for a string <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' />. We use the BWT for this estimate.</p>
<p><strong>BWT-based String Entropy Estimate</strong></p>
<p>We use the Move-To-Front (MTF) entropy of the Burrows-Wheeler transform of a given string <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> as an estimate for its entropy <img src='http://s0.wp.com/latex.php?latex=H%28S%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='H(S)' title='H(S)' class='latex' />.</p>
</div>
<div>The <a href="http://en.wikipedia.org/wiki/Move-to-front_transform" target="_blank">MTF encoding</a> of a string is performed by traversing the string and assigning to each symbol the position of that symbol in the alphabet and then moving the symbol to the front of the alphabet. Therefore a sequence with a lot of runs will  have a lot of zeros in its MTF encoding.</div>
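<p>The MTF encoding described above can be sketched in a few lines of Python. The function name <code>mtf_encode</code> is hypothetical, and the use of <code>list.index</code> makes each step linear in the alphabet size, which is fine for illustration but not tuned for speed.</p>

```python
def mtf_encode(symbols, alphabet):
    """Move-to-front encode `symbols` (any sequence) over an ordered `alphabet`."""
    table = list(alphabet)  # current ordering of the alphabet
    codes = []
    for symbol in symbols:
        position = table.index(symbol)  # current position of the symbol
        codes.append(position)
        table.insert(0, table.pop(position))  # move the symbol to the front
    return codes
```

<p>For instance, <code>mtf_encode("aaabbb", "ab")</code> gives <code>[0, 0, 0, 1, 0, 0]</code>: every repeated symbol inside a run is coded as zero.</p>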
<p>In <a href="http://www.cs.tau.ac.il/~haimk/adv-ds-2007/bwt_analysis_journal.pdf" target="_blank">this paper</a> the MTF coding is used to define the MTF entropy (which the authors also call <em>local entropy</em>) of a string <img src='http://s0.wp.com/latex.php?latex=R&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R' title='R' class='latex' /> as</p>
<p><img src='http://s0.wp.com/latex.php?latex=%5Cmbox%7BMTFE%7D%28R%29+%3D+%5Csum_i+%5Cmbox%7Blog%7D%28%5Cmbox%7BMTF%7D%28R%29_i+%2B+1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mbox{MTFE}(R) = &#92;sum_i &#92;mbox{log}(&#92;mbox{MTF}(R)_i + 1)' title='&#92;mbox{MTFE}(R) = &#92;sum_i &#92;mbox{log}(&#92;mbox{MTF}(R)_i + 1)' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=%5Cmbox%7BMTF%7D%28R%29_i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mbox{MTF}(R)_i' title='&#92;mbox{MTF}(R)_i' class='latex' /> is the <img src='http://s0.wp.com/latex.php?latex=i%5E%7Bth%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='i^{th}' title='i^{th}' class='latex' /> symbol of the MTF coding of the string <img src='http://s0.wp.com/latex.php?latex=R&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R' title='R' class='latex' />.</p>
<p>Now we define <img src='http://s0.wp.com/latex.php?latex=H%28S%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='H(S)' title='H(S)' class='latex' />, the entropy of string <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' /> as</p>
<p><img src='http://s0.wp.com/latex.php?latex=H%28S%29+%3D+%5Cmbox%7BMTFE%7D%28%5Cmbox%7BBWT%7D%28S%29%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='H(S) = &#92;mbox{MTFE}(&#92;mbox{BWT}(S))' title='H(S) = &#92;mbox{MTFE}(&#92;mbox{BWT}(S))' class='latex' /></p>
<p>where <img src='http://s0.wp.com/latex.php?latex=%5Cmbox%7BBWT%7D%28S%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mbox{BWT}(S)' title='&#92;mbox{BWT}(S)' class='latex' /> is the BWT of string <img src='http://s0.wp.com/latex.php?latex=S&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='S' title='S' class='latex' />.</p>
<p>Since the Burrows-Wheeler transform involves just the construction of a suffix array, the computation of our compression based evaluation metric is linear in time and space in the length of the model and system summary strings.</p>
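<p>Putting the definitions together, the following Python sketch computes the similarity for two token lists. It rests on several assumptions that are not in the text above: the names <code>bwt</code>, <code>mtf_entropy</code> and <code>similarity</code> are hypothetical, the logarithm is taken base 2, the BWT uses cyclic rotations without an end marker, and the rotations are sorted naively, which is simple but not the linear-time suffix array construction.</p>

```python
import math


def bwt(tokens):
    """Last column of the sorted cyclic rotations (naive, not linear time)."""
    rotations = sorted(tokens[i:] + tokens[:i] for i in range(len(tokens)))
    return [rotation[-1] for rotation in rotations]


def mtf_entropy(tokens, alphabet):
    """MTFE(R) = sum_i log(MTF(R)_i + 1); log base 2 is an assumption."""
    table = sorted(alphabet)  # lexicographic order on the symbols
    total = 0.0
    for token in tokens:
        position = table.index(token)
        total += math.log2(position + 1)
        table.insert(0, table.pop(position))  # move-to-front update
    return total


def similarity(model, system):
    """r(M|S) = (c(M) - c(M|S)) / c(M) for token lists `model` and `system`."""
    if not model:
        raise ValueError("model summary must contain at least one token")
    alphabet = set(model) | set(system)  # all words in the two strings

    def entropy(tokens):
        return mtf_entropy(bwt(tokens), alphabet)

    c_m = entropy(model) / len(model)
    c_m_given_s = (entropy(system + model) - entropy(system)) / len(model)
    if c_m == 0.0:
        return 0.0  # assumed fallback: the model is already fully compressible
    return (c_m - c_m_given_s) / c_m
```

<p>A call would look like <code>similarity("the cat sat".split(), "a cat sat down".split())</code>, with the words normalized beforehand. Note that the arguments are not interchangeable, since the metric is asymmetric.</p>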
<p><strong>Some Technical Details</strong></p>
<div id="_mcePaste">For our implementation, we considered each word in a string as a separate symbol. Our alphabet of symbols therefore contained all the words in the two strings being compared. The words were normalized by lower casing and removing punctuation. Because BWT needs an ordered alphabet, we used the lexicographic order on the words in the alphabet.</div>
<p><strong>Results</strong></p>
<div><img class="aligncenter size-full wp-image-350" title="table1" src="http://mlstat.files.wordpress.com/2009/11/table1.png?w=540&#038;h=283" alt="table1" width="540" height="283" /></div>
<p>The results on the TAC-AESOP task (above) show that the BWT based metric (FraCC in the table) is reasonable for summarization evaluation, especially because there are not very many knobs to tune. I obtained these results from Frank (who will present them at TAC next week). The &#8220;best metric&#8221; is the AESOP submission that seemed to have high scores across several measures.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2009/11/12/bwt-for-nlp-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/11/table1.png" medium="image">
			<media:title type="html">table1</media:title>
		</media:content>
	</item>
		<item>
		<title>BWT for NLP (1)</title>
		<link>http://mlstat.wordpress.com/2009/09/26/bwt-for-nlp-1/</link>
		<comments>http://mlstat.wordpress.com/2009/09/26/bwt-for-nlp-1/#comments</comments>
		<pubDate>Sat, 26 Sep 2009 19:30:02 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Information theory]]></category>
		<category><![CDATA[Natural Language Processing]]></category>
		<category><![CDATA[Burrows-Wheeler Transform]]></category>
		<category><![CDATA[BWT]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[word clustering]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=296</guid>
		<description><![CDATA[The Burrows-Wheeler transform (BWT), which is the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea and can be useful for some string processing for natural language processing.  I will present one such use. BWT massages the original string into being [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=mlstat.wordpress.com&amp;blog=6090177&amp;post=296&amp;subd=mlstat&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://en.wikipedia.org/wiki/Burrows–Wheeler_transform">Burrows-Wheeler transform</a> (BWT), which is the main step in the bzip2 compression algorithm, is a permutation transform on a string over an ordered alphabet. It is a clever idea and can be useful for some string processing for natural language processing.  I will present one such use.</p>
<p>BWT massages the original string into being more amenable to compression. Of course the transform doesn&#8217;t alter the compressibility (entropy rate) of the original string. All it does is make the string more compressible by algorithms we know.</p>
<p>The reason string permutation by BWT (as opposed to say sorting the string, which makes it <em>really</em> compressible) is useful is that the reverse transform (undoing the permutation) can be done with very little additional information. Mark Nelson wrote a <a href="http://marknelson.us/1996/09/01/bwt/" target="_blank">nice introduction</a> to the transform.  Moreover, the BWT essentially involves the construction of the <a href="http://en.wikipedia.org/wiki/Suffix_array" target="_blank">suffix array</a> for the string, and therefore can be done in time and space linear in the length of the string.</p>
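<p>As a rough illustration of how little extra information the reverse transform needs, here is a naive Python sketch of the forward and inverse transforms. The function names and the choice of <code>$</code> as a unique end marker are assumptions for the example, and both functions sort rotations directly rather than building a suffix array, so they are far from linear time.</p>

```python
def bwt_forward(text, end="$"):
    """Return the last column of the sorted rotations of text + end marker."""
    if end in text:
        raise ValueError("end marker must not occur in the text")
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def bwt_inverse(last_column, end="$"):
    """Rebuild the original text from the last column alone."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then re-sort the rows.
        table = sorted(ch + row for ch, row in zip(last_column, table))
    for row in table:
        if row.endswith(end):  # the rotation that ends with the marker
            return row[:-1]
    raise ValueError("end marker not found in the input")
```

<p>The only side information here is the end marker itself: <code>bwt_inverse(bwt_forward(text))</code> returns <code>text</code> for any string that does not contain the marker.</p>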
<p>Here is an example of the Burrows-Wheeler transformation of the first stanza of Yeats&#8217; <em>Sailing to Byzantium</em>. I added some newlines to the transformed string, and the underscores represent spaces in the original string. Notice the long runs of characters in the transformed string.</p>
<p><em>Original string</em></p>
<p style="font:12px Verdana;color:#333333;margin:0;">THAT is no country for old men. The young In one another&#8217;s arms, birds in the trees &#8211; Those dying generations &#8211; at their song, The salmon-falls, the mackerel-crowded seas, Fish, flesh, or fowl, commend all summer long Whatever is begotten, born, and dies. Caught in that sensual music all neglect Monuments of unageing intellect.</p>
<p><em>BWTransformed string</em></p>
<p style="font:12px Verdana;color:#333333;margin:0;">rsgnsnlhhs__lntsnH__T__.A____ss.,gt,.-gcd,es s,,,ode,yrgtsgrTredllssrn,edtrln,ntefemnu__fs___eh_hrC___ia__-eennlew_r_nshhhhslldrnbghrttmmgsmhvmnkielto-___nnnnna_ueesstWtTtTttTgsd__ye_teb__Fcweallolgfaaeaa_l</p>
<p style="font:12px Verdana;color:#333333;margin:0;">__mumoulr_reoeIiiueao_eouoii_aoeiueon__cm_sliM_</p>
<p style="font:12px Verdana;color:#333333;margin:0;">fbhngycrfeoeeoieiteaoctamleen&#8217;idit_o__ieu_n_cchaanta</p>
<p style="font:12px Verdana;color:#333333;margin:0;">____oa_nnosans_oomeoord_</p>
<p><strong>A useful property</strong></p>
<p><a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.2614" target="_blank">Effros et al. showed</a> that for a string generated by a finite-memory source, the BWT of the string is asymptotically (in the length of the string) indistinguishable from a piece-wise independent and identically distributed (i.i.d.) string. This is not surprising given that symbols with similar contexts appear sequentially in the BWT string, and for finite memory sources the current symbol is generated i.i.d. given a finite length context.</p>
<p>This property can be exploited to easily cluster words according to context by using BWT.</p>
<p><strong>Word clustering</strong></p>
<p><a href="http://acl.ldc.upenn.edu/J/J92/J92-4003.pdf" target="_blank">In this paper</a>, among other things, Brown et al. present a word clustering algorithm based on maximizing the average mutual information between the cluster ids of adjacent words. Some results are presented in Table 2 in the paper.</p>
<p>Such word clusters can be useful for feature engineering for sequence tagging tasks such as part-of-speech tagging or named-entity recognition. One of the most commonly used features for such tasks is one which checks if the current word is in a carefully constructed list of words.</p>
<p>Brown et al. admit that, even after optimizations, their algorithm is slow, and they resort to approximations. (I realize that computers have gotten much faster since, but their algorithm is still cubic in the size of the vocabulary.)</p>
<p><strong>Word clustering based on BWT</strong></p>
<p>We will cluster two words together if they appear independently given certain contexts (albeit with different probabilities). We first perform a BW transform on the input string of words (considering each <em>word</em> as a symbol, unlike in the example above) and measure whether the two words appear independently in an i.i.d. fragment.</p>
<p>Instead of actually trying to chop the BWT string into i.i.d. fragments before analysis, we adopt a proxy metric. We check if the number of times the two words are next to each other in the BWT string is large compared to what we would expect from their frequencies. We compute this as a probability ratio with appropriate smoothing.</p>
<p>Another neat consequence of doing the clustering by BWT is that we only need to consider pairs of words that do appear next to each other in the BWT string. Therefore the selection of candidates for clustering is linear in the length of the string and not quadratic in the size of the vocabulary.</p>
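<p>The proxy metric can be sketched roughly as follows in Python. The exact ratio and smoothing used for the scores in this post are not spelled out here, so the formula below (observed adjacent count divided by the smoothed count expected under independence) and the names <code>adjacency_scores</code> and <code>smoothing</code> are assumptions made only for illustration.</p>

```python
from collections import Counter


def adjacency_scores(bwt_tokens, smoothing=1.0):
    """Score word pairs by how often they are adjacent in the BWT string.

    Hypothetical formula: observed / (expected + smoothing), where
    `expected` is the adjacent count implied by the word frequencies alone.
    """
    n = len(bwt_tokens)
    unigram = Counter(bwt_tokens)
    observed = Counter()
    # One pass over adjacent positions: only pairs that actually occur
    # next to each other ever become candidates.
    for left, right in zip(bwt_tokens, bwt_tokens[1:]):
        if left != right:
            observed[frozenset((left, right))] += 1
    scores = {}
    for pair, count in observed.items():
        a, b = sorted(pair)
        # Expected adjacencies of a and b (either order) in n - 1 positions.
        expected = (n - 1) * 2.0 * unigram[a] * unigram[b] / (n * n)
        scores[(a, b)] = count / (expected + smoothing)
    return scores
```

<p>The input is the word-level BWT of the corpus as a list of tokens. Because the candidate pairs are collected in a single pass over that list, the work to find them grows with the length of the string rather than with the square of the vocabulary size.</p>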
<p><strong>Some results</strong></p>
<p>I ran this algorithm on about a month&#8217;s worth of New York Times and Wall Street Journal news data and these are the pairs of words with the highest scores.</p>
<p style="padding-left:60px;"><span style="color:#993366;">january february 0.177721578886</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">january march 0.143172972502</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">march february 0.142398170589</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">englandgeoneng jerseyusanj 0.141412321852</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">news becdnews 0.135642386152</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">finala final 0.131901568726</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">finala finalb 0.122728309966</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">finala finalc 0.113085215849</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">cafd cea 0.107549686029</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">february april 0.100734422316</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">january april 0.0993752546848</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">has have 0.0967101802923</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">march april 0.0929933503714</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">did does 0.0854452561942</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">has had 0.0833642704346</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">will would 0.0827179598199</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">have had 0.0773517518078</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">&#8230;</span></p>
<p>I constructed a graph by joining all word pairs that have a score above a threshold and ran a greedy maximal clique algorithm. These are some of the resulting word clusters.</p>
<p style="padding-left:60px;"><span style="color:#993366;">older young younger</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">announced today yesterday said reported</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">month today week yesterday</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">days month months decade year weeks years</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">decades months decade weeks years</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">com org www</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">writing write wrote</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">directed edited produced</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">should will probably could would may might can</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">worries worried concerns</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">work worked working works</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">wearing wear wore</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">win lost losing</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">man people men</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">against like to about that by for on in with from of at</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">under by on with into over from of</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">baton moulin khmer</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">daughter husband sister father wife mother son</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">red green blue black</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">ice sour whipped</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">time days months year years day</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">eastern coast southeastern</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">bergen orange nassau westchester</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">east ivory west</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">goes gone go going went</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">known seen well</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">travel review leisure weekly editorial</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">cultural financial foreign editorial national metropolitan</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">thursdays wednesdays fridays sundays tuesdays</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">thursday today monday sunday yesterday wednesday saturday friday tuesday</span></p>
<p style="padding-left:60px;"><span style="color:#993366;">&#8230;</span></p>
<p><strong>Discussion</strong></p>
<p>1. For the above results, I only did the clustering based on right contexts. We can easily extend the word-pair score to take into account left contexts as well by concatenating the BWT of the <em>reversed</em> string to the BWT of the original string, and calculating the scores on this double length transformed string.</p>
<p>2. The word clustering algorithm of Brown et al. proceeds by iteratively merging the best pair of words and replacing the two words in the alphabet (and the string) by a merged word. We can imagine doing something similar with our approach, except that, because the BWT depends on the ordering of the alphabet, we would need to decide where in that ordering to insert the merged word.</p>
<p>3. One thing that I should have done but didn&#8217;t for the above results is to order the alphabet (of words) lexicographically. Instead, I assigned positive integers to the words based on their first appearance in the string, and that is the order the BWT uses to sort. Fixing this should improve the results.</p>
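<p>For point 1, the concatenation of the two transforms might look like the sketch below. It is only an illustration under several assumptions: the names <code>bwt</code> and <code>two_sided_bwt</code> are hypothetical, the transform is the naive sort-all-rotations version (fine for toy inputs, far too slow for a real corpus), an empty string serves as the end marker, and words are compared with Python&#8217;s lexicographic string order (as in point 3) rather than by first appearance. The word-pair scores themselves are not computed here.</p>

```python
def bwt(tokens):
    """Naive Burrows-Wheeler transform of a token sequence (sorts all rotations)."""
    seq = list(tokens) + [""]  # "" is an end marker that sorts before any non-empty word
    rotations = sorted(seq[i:] + seq[:i] for i in range(len(seq)))
    return [rotation[-1] for rotation in rotations]  # last column of the sorted rotations


def two_sided_bwt(tokens):
    """BWT of the original sequence followed by the BWT of the reversed sequence."""
    tokens = list(tokens)
    return bwt(tokens) + bwt(tokens[::-1])


words = "the cat sat on the mat".split()
print(two_sided_bwt(words))
```

<p>In the first half of the output, words end up near each other when the text that follows them is similar; in the second half, the same holds for the text that precedes them, because the string was reversed before the transform.</p>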
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2009/09/26/bwt-for-nlp-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>
	</item>
		<item>
		<title>Incremental complexity support vector machine</title>
		<link>http://mlstat.wordpress.com/2009/09/18/incremental-complexity-support-vector-machine/</link>
		<comments>http://mlstat.wordpress.com/2009/09/18/incremental-complexity-support-vector-machine/#comments</comments>
		<pubDate>Sat, 19 Sep 2009 02:02:57 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Classification]]></category>
		<category><![CDATA[cascaded kernels]]></category>
		<category><![CDATA[incremental complexity SVM]]></category>
		<category><![CDATA[support vector machine]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=298</guid>
		<description><![CDATA[One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below. (I generated them using a java SVM applet from here, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian [...]]]></description>
			<content:encoded><![CDATA[<p>One of the problems with using complex kernels with support vector machines is that they tend to produce classification boundaries that are odd, like the ones below.</p>
<p><img class="aligncenter size-full wp-image-306" title="svm_rbf_s1" src="http://mlstat.files.wordpress.com/2009/09/svm_rbf_s11.png?w=450&#038;h=367" alt="svm_rbf_s1" width="450" height="367" /></p>
<p><img class="aligncenter size-full wp-image-308" title="svm_rbf_s10" src="http://mlstat.files.wordpress.com/2009/09/svm_rbf_s103.png?w=450&#038;h=356" alt="svm_rbf_s10" width="450" height="356" /></p>
<p>(I generated them using a Java SVM applet <a href="http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml" target="_blank">from here</a>, whose reliability I cannot swear to, but have no reason to doubt.) Both SVM boundaries are with Gaussian RBF kernels: the first with <img src='http://s0.wp.com/latex.php?latex=%5Csigma+%3D+1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;sigma = 1' title='&#92;sigma = 1' class='latex' /> and the second with <img src='http://s0.wp.com/latex.php?latex=%5Csigma+%3D+10&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;sigma = 10' title='&#92;sigma = 10' class='latex' /> on two different data sets.</p>
<p>Note the segments of the boundary to the east of the blue examples in the bottom figure, and those to the south and to the north-east of the blue examples in the top figure. They seem to violate intuition.</p>
<p>The reason for these anomalous boundaries is of course the large complexity of the function class induced by the RBF kernel with large <img src='http://s0.wp.com/latex.php?latex=%5Csigma&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;sigma' title='&#92;sigma' class='latex' />, which gives the classifier a propensity to make subtle distinctions even in regions of  somewhat low example density.</p>
<p><strong>A possible solution: using complex kernels only where they are needed</strong></p>
<p>We propose to build a cascaded classifier, which we will call Incremental Complexity SVM (ICSVM), as follows.</p>
<p>We are given a sequence of kernels <img src='http://s0.wp.com/latex.php?latex=K_1%2C+K_2%2C%5Cldots%2CK_m&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K_1, K_2,&#92;ldots,K_m' title='K_1, K_2,&#92;ldots,K_m' class='latex' /> of increasing complexity. For example, the sequence could consist of polynomial kernels, where <img src='http://s0.wp.com/latex.php?latex=K_i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K_i' title='K_i' class='latex' /> is the polynomial kernel of degree <img src='http://s0.wp.com/latex.php?latex=i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='i' title='i' class='latex' />.</p>
<p>The learning algorithm first learns an SVM classifier <img src='http://s0.wp.com/latex.php?latex=%5Cpsi_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;psi_1' title='&#92;psi_1' class='latex' /> with kernel <img src='http://s0.wp.com/latex.php?latex=K_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K_1' title='K_1' class='latex' />, that classifies a <em>reasonable</em> portion of the examples with a large margin <img src='http://s0.wp.com/latex.php?latex=%5Clambda_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;lambda_1' title='&#92;lambda_1' class='latex' />. This can be accomplished by setting the SVM cost parameter <img src='http://s0.wp.com/latex.php?latex=C&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='C' title='C' class='latex' /> to some low value.</p>
<p>Now all the examples outside the margin are thrown out, and another SVM classifier <img src='http://s0.wp.com/latex.php?latex=%5Cpsi_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;psi_2' title='&#92;psi_2' class='latex' /> with kernel <img src='http://s0.wp.com/latex.php?latex=K_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='K_2' title='K_2' class='latex' /> is learned, so that a reasonable portion of the remaining examples are classified with some large margin <img src='http://s0.wp.com/latex.php?latex=%5Clambda_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;lambda_2' title='&#92;lambda_2' class='latex' />.</p>
<p>This procedure is continued until all the examples are classified outside the margin or the set of kernels is exhausted. The final classifier is a combination of all the classifiers <img src='http://s0.wp.com/latex.php?latex=%5Cpsi_i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;psi_i' title='&#92;psi_i' class='latex' />.</p>
<p>A test example can be classified as follows. We first apply classifier <img src='http://s0.wp.com/latex.php?latex=%5Cpsi_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;psi_1' title='&#92;psi_1' class='latex' /> to the test example, and if it is classified with margin <img src='http://s0.wp.com/latex.php?latex=%5Cgeq+%5Clambda_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;geq &#92;lambda_1' title='&#92;geq &#92;lambda_1' class='latex' />, we output the assigned label and stop. If not, we classify it with classifier <img src='http://s0.wp.com/latex.php?latex=%5Cpsi_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;psi_2' title='&#92;psi_2' class='latex' /> in a similar fashion, and so on&#8230;</p>
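<p>Such a scheme should avoid anomalous boundaries such as those in the pictures above, because the more complex kernels are only ever applied to the examples that the simpler ones could not classify with a large margin.</p>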
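<p>The training and test procedures described above can be sketched as a small cascade. This is a hedged Python sketch rather than a tested ICSVM implementation: the <code>Trainer</code> interface is hypothetical (any routine that fits an SVM with the i-th kernel and returns its signed decision function would do), &#8220;outside the margin&#8221; is taken to mean that the absolute decision value is at least the stage margin, and falling back to the last stage when no stage is confident is my own assumption.</p>

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]        # (feature vector, label in {-1, +1})
Scorer = Callable[[Sequence[float]], float]  # signed decision value of one stage
# Hypothetical trainer interface: (stage index i, examples) -> scorer trained with K_i.
Trainer = Callable[[int, List[Example]], Scorer]


def train_cascade(examples: List[Example], trainer: Trainer,
                  margins: Sequence[float]) -> List[Tuple[Scorer, float]]:
    """Train one stage per margin, passing on only the examples inside the margin."""
    stages = []
    remaining = list(examples)
    for i, margin in enumerate(margins, start=1):
        if not remaining:  # every example is already outside some stage's margin
            break
        scorer = trainer(i, remaining)
        stages.append((scorer, margin))
        # Assumption: "outside the margin" means |score| >= margin.
        remaining = [(x, y) for x, y in remaining if abs(scorer(x)) < margin]
    return stages


def classify(stages: List[Tuple[Scorer, float]], x: Sequence[float]) -> int:
    """Return the label from the first stage that is confident about x."""
    score = 0.0
    for scorer, margin in stages:
        score = scorer(x)
        if abs(score) >= margin:
            break
    # Assumption: if no stage is confident, fall back to the last stage's sign.
    return 1 if score >= 0 else -1


# Toy stand-ins for trained SVMs: a linear scorer, then a degree-2 scorer.
toy_stages = [(lambda x: x[0], 1.0), (lambda x: x[0] * x[1], 0.5)]
print(classify(toy_stages, (2.0, -1.0)))  # decided by the first stage
print(classify(toy_stages, (0.5, -2.0)))  # passed on to the second stage
```

<p>Because <code>classify</code> stops at the first confident stage, easy examples never touch the more expensive kernels.</p>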
<p><strong>Discussion</strong></p>
<p>1. With all the work that has been done on SVMs it is very likely that this idea or something very similar has been thought of, but I haven&#8217;t come across it.</p>
<p>2. There is some work on kernel learning where a convex combination of kernels is learned but I think that is a different idea.</p>
<p>3. One nice thing about such a classification scheme is that at run-time it will expend fewer computational resources on easier examples and more on more difficult ones. As my <a href="http://www.ecse.rpi.edu/~nagy/" target="_blank">thesis supervisor</a> used to say, it is silly for most classifiers to insist on acting exactly the same way on both easy and hard cases.</p>
<p>4. The choice of the cost parameters <img src='http://s0.wp.com/latex.php?latex=C&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='C' title='C' class='latex' /> for the SVMs is critical for the accuracy of the final classifier. Is there a way of formulating the choice of the parameters in terms of minimizing some overall upper bound on the generalization error from statistical learning theory?</p>
<p>5. Is there a one-shot SVM formulation with the set of kernels that exactly or approximately acts like our classifier?</p>
<p>6. The weird island-effect and what Ken calls the lava-lamp problem in the boundaries above are not just artifacts of SVMs. We would expect a <a href="http://mlstat.wordpress.com/2009/08/17/an-effective-kernelization-of-logistic-regression/" target="_self">sparse kernel logistic regression </a>to behave similarly. It would be interesting to do a similar incremental kernel thing with other kernel-based classifiers.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2009/09/18/incremental-complexity-support-vector-machine/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/09/svm_rbf_s11.png" medium="image">
			<media:title type="html">svm_rbf_s1</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/09/svm_rbf_s103.png" medium="image">
			<media:title type="html">svm_rbf_s10</media:title>
		</media:content>
	</item>
		<item>
		<title>Training data bias caused by active learning</title>
		<link>http://mlstat.wordpress.com/2009/09/10/training-data-bias-caused-by-active-learning/</link>
		<comments>http://mlstat.wordpress.com/2009/09/10/training-data-bias-caused-by-active-learning/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 01:19:32 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Active learning]]></category>
		<category><![CDATA[Classification]]></category>
		<category><![CDATA[Estimation]]></category>
		<category><![CDATA[biased training data]]></category>
		<category><![CDATA[sample selection bias]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=268</guid>
		<description><![CDATA[As opposed to the traditional supervised learning setting where the labeled training data is generated (we hope) independently and identically, in active learning the learner is allowed to select points for which labels are requested. Because it is often impossible to construct the equivalent real-world object from its feature values, almost universally, active learning is [...]]]></description>
			<content:encoded><![CDATA[<p>As opposed to the traditional supervised learning setting, where the labeled training data is (we hope) independent and identically distributed, in <em>active learning</em> the learner is allowed to select the points for which labels are requested.</p>
<p>Because it is often impossible to construct the equivalent real-world object from its feature values, active learning is almost universally <em>pool-based</em>. That is, we start with a large pool of unlabeled data, and the learner (usually sequentially) picks the objects from the pool for which labels are requested.</p>
<p>One unavoidable effect of active learning is that we end up with a biased training data set. If the true data distribution is <img src='http://s0.wp.com/latex.php?latex=P%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(x,y)' title='P(x,y)' class='latex' />, we have data drawn from some distribution <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x,y)' title='&#92;hat{P}(x,y)' class='latex' /> (as always <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' /> is the feature vector and <img src='http://s0.wp.com/latex.php?latex=y&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y' title='y' class='latex' /> is the class label).</p>
<p>We would like to correct for this bias so that it does not lead to learning an incorrect classifier. Furthermore, we want to use this biased data set to accurately evaluate the classifier.</p>
<p>In general, since <img src='http://s0.wp.com/latex.php?latex=P%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(x,y)' title='P(x,y)' class='latex' /> is unknown, if <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x,y)' title='&#92;hat{P}(x,y)' class='latex' /> is arbitrarily different from it, there is nothing that can be done. Thankfully, however, the bias caused by active learning is tamer.</p>
<p><strong>The type of bias</strong></p>
<p>Assume that the marginal feature distribution of the labeled points after active learning is given by <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%29+%3D+%5Csum_y%5Chat%7BP%7D%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x) = &#92;sum_y&#92;hat{P}(x,y)' title='&#92;hat{P}(x) = &#92;sum_y&#92;hat{P}(x,y)' class='latex' />. That is, <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x)' title='&#92;hat{P}(x)' class='latex' /> is the putative distribution from which the labeled feature vectors can be assumed to have been sampled.</p>
<p>For every feature vector thus sampled from <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x)' title='&#92;hat{P}(x)' class='latex' />, we request its label from the oracle, which returns a label according to the conditional distribution <img src='http://s0.wp.com/latex.php?latex=P%28y%7Cx%29+%3D+%5Cfrac%7BP%28x%2Cy%29%7D%7B%5Csum_y+P%28x%2Cy%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(y|x) = &#92;frac{P(x,y)}{&#92;sum_y P(x,y)}' title='P(y|x) = &#92;frac{P(x,y)}{&#92;sum_y P(x,y)}' class='latex' />. That is, there is <em>no bias</em> in the conditional distribution. Therefore <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%2Cy%29+%3D+%5Chat%7BP%7D%28x%29+P%28y%7Cx%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x,y) = &#92;hat{P}(x) P(y|x)' title='&#92;hat{P}(x,y) = &#92;hat{P}(x) P(y|x)' class='latex' />. This type of bias has been called <em><a href="http://www.is.titech.ac.jp/~shimo/pub/Shimodaira%20JSPI2000.pdf" target="_blank">covariate shift</a></em>.</p>
<p><strong>The data</strong></p>
<p>After actively sampling the labels <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='n' title='n' class='latex' /> times, let us say we have the following data &#8212; a biased labeled training data set <img src='http://s0.wp.com/latex.php?latex=%5C%7Bx_i%2C+y_i%5C%7D_%7Bi%3D1%2C%5Cldots%2Cn%7D+%5Csim+%5Chat%7BP%7D%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{x_i, y_i&#92;}_{i=1,&#92;ldots,n} &#92;sim &#92;hat{P}(x,y)' title='&#92;{x_i, y_i&#92;}_{i=1,&#92;ldots,n} &#92;sim &#92;hat{P}(x,y)' class='latex' />, where the feature vectors <img src='http://s0.wp.com/latex.php?latex=x_i&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_i' title='x_i' class='latex' /> come from the original pool of <img src='http://s0.wp.com/latex.php?latex=M&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='M' title='M' class='latex' /> unlabeled feature vectors  <img src='http://s0.wp.com/latex.php?latex=%5C%7Bx_i%5C%7D_%7Bi%3D1%2C%5Cldots%2CM%7D+%5Csim+P%28x%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{x_i&#92;}_{i=1,&#92;ldots,M} &#92;sim P(x)' title='&#92;{x_i&#92;}_{i=1,&#92;ldots,M} &#92;sim P(x)' class='latex' /></p>
<p>Let us define <img src='http://s0.wp.com/latex.php?latex=%5Cbeta%3D%5Cfrac%7BP%28x%2Cy%29%7D%7B%5Chat%7BP%7D%28x%2Cy%29%7D%3D%5Cfrac%7BP%28x%29%7D%7B%5Chat%7BP%7D%28x%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;beta=&#92;frac{P(x,y)}{&#92;hat{P}(x,y)}=&#92;frac{P(x)}{&#92;hat{P}(x)}' title='&#92;beta=&#92;frac{P(x,y)}{&#92;hat{P}(x,y)}=&#92;frac{P(x)}{&#92;hat{P}(x)}' class='latex' />. If <img src='http://s0.wp.com/latex.php?latex=%5Cbeta&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;beta' title='&#92;beta' class='latex' /> is large we expect the feature vector <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' /> to be under-represented in the labeled data set and if it is small it is over-represented.</p>
<p>Now define for each labeled example <img src='http://s0.wp.com/latex.php?latex=%5Cbeta_i%3D%5Cfrac%7BP%28x_i%29%7D%7B%5Chat%7BP%7D%28x_i%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;beta_i=&#92;frac{P(x_i)}{&#92;hat{P}(x_i)}' title='&#92;beta_i=&#92;frac{P(x_i)}{&#92;hat{P}(x_i)}' class='latex' /> for <img src='http://s0.wp.com/latex.php?latex=i+%3D+1%2C%5Cldots%2Cn&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='i = 1,&#92;ldots,n' title='i = 1,&#92;ldots,n' class='latex' />. If we knew the values of <img src='http://s0.wp.com/latex.php?latex=%5C%7B%5Cbeta_i%5C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{&#92;beta_i&#92;}' title='&#92;{&#92;beta_i&#92;}' class='latex' />, we could correct for the bias during training and evaluation.</p>
<p><a href="http://books.nips.cc/papers/files/nips19/NIPS2006_0915.pdf" target="_blank">This paper</a> by Huang <em>et al.</em> and some of its references deal with the estimation of <img src='http://s0.wp.com/latex.php?latex=%5C%7B%5Cbeta_i%5C%7D_%7Bi%3D1%2C%5Cldots%2Cn%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{&#92;beta_i&#92;}_{i=1,&#92;ldots,n}' title='&#92;{&#92;beta_i&#92;}_{i=1,&#92;ldots,n}' class='latex' />. <em>Remark</em>: This estimation needs to take into account that <img src='http://s0.wp.com/latex.php?latex=E_%7B%5Chat%7BP%7D%28x%29%7D%5B%5Cbeta_i%5D+%3D+1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='E_{&#92;hat{P}(x)}[&#92;beta_i] = 1' title='E_{&#92;hat{P}(x)}[&#92;beta_i] = 1' class='latex' />. This implies that the sample mean of the beta values on the labeled data set should be close to unity. This constraint is explicitly imposed in the estimation method of Huang <em>et al</em>.</p>
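<p>The unit-mean property in the remark follows in one line from the definition of the weights. The derivation below assumes a discrete feature space and that the labeled-sample marginal is nonzero wherever the true marginal is nonzero; neither assumption is stated explicitly in the text.</p>

```latex
\[
E_{\hat{P}(x)}\left[\beta(x)\right]
  = \sum_x \hat{P}(x)\,\frac{P(x)}{\hat{P}(x)}
  = \sum_x P(x)
  = 1 .
\]
\[
\frac{1}{n}\sum_{i=1}^{n}\beta_i \approx 1
\quad\text{when } x_1,\ldots,x_n \sim \hat{P}(x).
\]
```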
<p><strong>Evaluation of the classifier</strong></p>
<p>We shall first look at bias-correction for evaluation. Imagine that we are handed a classifier <img src='http://s0.wp.com/latex.php?latex=f%28%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='f()' title='f()' class='latex' />, and we are asked to use the biased labeled data set to evaluate its accuracy. Also assume that we used the above method to estimate <img src='http://s0.wp.com/latex.php?latex=%5C%7B%5Cbeta_i%5C%7D_%7Bi%3D1%2C%5Cldots%2Cn%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{&#92;beta_i&#92;}_{i=1,&#92;ldots,n}' title='&#92;{&#92;beta_i&#92;}_{i=1,&#92;ldots,n}' class='latex' />. Now fixing the bias for evaluation boils down to just using a weighted average of the errors, where the weights are given by <img src='http://s0.wp.com/latex.php?latex=%5C%7B%5Cbeta_i%5C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;{&#92;beta_i&#92;}' title='&#92;{&#92;beta_i&#92;}' class='latex' />.</p>
<p>If the empirical loss on the biased sample is written as <img src='http://s0.wp.com/latex.php?latex=R+%3D+%5Cfrac%7B1%7D%7Bn%7D+%5Csum_i+l%28f%28x_i%29%2C+y_i%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R = &#92;frac{1}{n} &#92;sum_i l(f(x_i), y_i)' title='R = &#92;frac{1}{n} &#92;sum_i l(f(x_i), y_i)' class='latex' />, we write the estimate of the loss on the true distribution as the weighted loss <img src='http://s0.wp.com/latex.php?latex=R_c%3D+%5Cfrac%7B1%7D%7Bn%7D+%5Csum_i+%5Cbeta_i+l%28f%28x_i%29%2C+y_i%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R_c= &#92;frac{1}{n} &#92;sum_i &#92;beta_i l(f(x_i), y_i)' title='R_c= &#92;frac{1}{n} &#92;sum_i &#92;beta_i l(f(x_i), y_i)' class='latex' />.</p>
<p>Therefore we increase the contribution of the under-represented examples, and decrease that of the over-represented examples, to the overall loss.</p>
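<p>As a small illustration of the two estimates, here is a minimal Python sketch. The function names and the toy numbers are made up for this example; in practice the weights would come from an estimator such as the one referenced above.</p>

```python
def empirical_risk(losses):
    """R = (1/n) * sum_i l_i on the biased labeled sample."""
    return sum(losses) / len(losses)


def corrected_risk(losses, betas):
    """R_c = (1/n) * sum_i beta_i * l_i, the importance-weighted estimate."""
    if len(losses) != len(betas):
        raise ValueError("need one weight per labeled example")
    return sum(b * l for b, l in zip(betas, losses)) / len(losses)


# Made-up 0-1 losses and weights; the weights average to 1.
losses = [0, 1, 0, 1]
betas = [0.5, 1.5, 0.5, 1.5]
print(empirical_risk(losses))         # 0.5
print(corrected_risk(losses, betas))  # 0.75
```

<p>Here the two misclassified examples are under-represented in the labeled sample (their weights exceed 1), so the corrected estimate of the error is higher than the raw one.</p>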
<p><strong>Learning the classifier</strong></p>
<p>How can the bias be accounted for during learning? The straightforward way is to learn the classifier parameters to minimize the weighted loss <img src='http://s0.wp.com/latex.php?latex=R_c&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='R_c' title='R_c' class='latex' /> (plus some regularization term) as opposed to the un-weighted empirical loss on the labeled data set.</p>
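<p>To see how the weights can change what is learned, consider a deliberately tiny example: a one-dimensional threshold classifier fitted by exhaustive search over the weighted 0-1 loss. Everything in this Python sketch (the stump model, the function names, the data and the weights) is made up for illustration, and the regularization term is omitted.</p>

```python
def weighted_loss(predict, xs, ys, betas):
    """R_c for 0-1 loss: (1/n) * sum_i beta_i * [predict(x_i) != y_i]."""
    return sum(b for x, y, b in zip(xs, ys, betas) if predict(x) != y) / len(xs)


def fit_stump(xs, ys, betas):
    """Pick the (threshold, sign) pair with the smallest weighted loss."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            # Predict `sign` at or above the threshold and `-sign` below it.
            predict = lambda x, t=t, sign=sign: sign if x >= t else -sign
            loss = weighted_loss(predict, xs, ys, betas)
            if best is None or loss < best[0]:
                best = (loss, t, sign)
    return best[1], best[2]


xs = [1, 2, 3, 4, 5]
ys = [-1, -1, 1, -1, 1]
uniform = [1.0] * 5                # unweighted empirical loss
betas = [1.0, 1.0, 0.2, 1.8, 1.0]  # made-up weights with mean 1
print(fit_stump(xs, ys, uniform))
print(fit_stump(xs, ys, betas))
```

<p>With uniform weights the search settles on the threshold at 3, which misclassifies the example at 4. With the made-up weights, that example counts for more and the one at 3 for less, so the threshold moves to 5.</p>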
<p>However, a natural question that can be raised is whether <em>any</em> bias correction is necessary. Note that the posterior class distribution <img src='http://s0.wp.com/latex.php?latex=P%28y%7Cx%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(y|x)' title='P(y|x)' class='latex' /> is unbiased in the labeled sample. This means that any Bayes-consistent diagnostic classifier on <img src='http://s0.wp.com/latex.php?latex=P%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(x,y)' title='P(x,y)' class='latex' /> will still converge to the Bayes error rate with examples drawn from <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%2Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x,y)' title='&#92;hat{P}(x,y)' class='latex' />.</p>
<p>For example imagine constructing a <img src='http://s0.wp.com/latex.php?latex=k&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='k' title='k' class='latex' />-Nearest Neighbor classifier on the biased labeled dataset.  If we let <img src='http://s0.wp.com/latex.php?latex=k+%5Crightarrow+%5Cinfty&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='k &#92;rightarrow &#92;infty' title='k &#92;rightarrow &#92;infty' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7Bk%7D%7Bn%7D+%5Crightarrow+0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;frac{k}{n} &#92;rightarrow 0' title='&#92;frac{k}{n} &#92;rightarrow 0' class='latex' />, the classifier will converge to the Bayes-optimal classifier as <img src='http://s0.wp.com/latex.php?latex=n+%5Crightarrow+%5Cinfty&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='n &#92;rightarrow &#92;infty' title='n &#92;rightarrow &#92;infty' class='latex' />, <em>even if</em> <img src='http://s0.wp.com/latex.php?latex=%5Chat%7BP%7D%28x%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;hat{P}(x)' title='&#92;hat{P}(x)' class='latex' /><em> is biased</em>. This is somewhat paradoxical and can be explained by looking at the case of finite <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='n' title='n' class='latex' />.</p>
<p>For finite <img src='http://s0.wp.com/latex.php?latex=n&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='n' title='n' class='latex' />, the classifier trades off proportionally more errors in low density regions for fewer overall errors. This means that by correcting for the bias by optimizing the weighted loss, we can obtain a lower error rate. Therefore although both the bias-corrected and un-corrected classifiers converge to the Bayes error, the former converges faster.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2009/09/10/training-data-bias-caused-by-active-learning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>
	</item>
		<item>
		<title>The redundancy of view-redundancy for co-training</title>
		<link>http://mlstat.wordpress.com/2009/08/23/the-redundancy-of-view-redundancy-for-co-training/</link>
		<comments>http://mlstat.wordpress.com/2009/08/23/the-redundancy-of-view-redundancy-for-co-training/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 22:08:38 +0000</pubDate>
		<dc:creator>mlstat</dc:creator>
				<category><![CDATA[Classification]]></category>
		<category><![CDATA[Semi-supervised learning]]></category>
		<category><![CDATA[class-conditional independence]]></category>
		<category><![CDATA[co-training]]></category>
		<category><![CDATA[multi-view learning]]></category>
		<category><![CDATA[surrogate learning]]></category>

		<guid isPermaLink="false">http://mlstat.wordpress.com/?p=230</guid>
		<description><![CDATA[Blum and Mitchell&#8217;s co-training is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence, and view-redundancy (or view-agreement) for semi-supervised learning. I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training  (which is not all that surprising considering that [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=mlstat.wordpress.com&amp;blog=6090177&amp;post=230&amp;subd=mlstat&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf" target="_blank">Blum and Mitchell&#8217;s </a><em><a href="http://www.cs.cmu.edu/~avrim/Papers/cotrain.pdf" target="_blank">co-training</a></em> is a (very deservedly) popular semi-supervised learning algorithm that relies on class-conditional feature independence, and view-redundancy (or view-agreement) for semi-supervised learning.</p>
<p>I will argue that the view-redundancy assumption is unnecessary, and along the way show how surrogate learning can be plugged into co-training (which is not all that surprising considering that both are multi-view semi-supervised algorithms that rely on class-conditional view-independence).</p>
<p>I&#8217;ll first explain co-training with an example.</p>
<p><strong> Co-training &#8211; The setup</strong></p>
<p>Consider a <img src='http://s0.wp.com/latex.php?latex=y+%5Cin+%5C%7B0%2C1%5C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y &#92;in &#92;{0,1&#92;}' title='y &#92;in &#92;{0,1&#92;}' class='latex' /> classification problem on the feature space <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BX%7D%3D%5Cmathcal%7BX%7D_1+%5Ctimes+%5Cmathcal%7BX%7D_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mathcal{X}=&#92;mathcal{X}_1 &#92;times &#92;mathcal{X}_2' title='&#92;mathcal{X}=&#92;mathcal{X}_1 &#92;times &#92;mathcal{X}_2' class='latex' />. I.e., a feature vector <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' /> can be split into two as <img src='http://s0.wp.com/latex.php?latex=x+%3D+%5Bx_1%2C+x_2%5D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x = [x_1, x_2]' title='x = [x_1, x_2]' class='latex' />.</p>
<p>We make the rather restrictive assumption that <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are class-conditionally independent for both classes. I.e., <img src='http://s0.wp.com/latex.php?latex=P%28x_1%2C+x_2%7Cy%29+%3D+P%28x_1%7Cy%29+P%28x_2%7Cy%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(x_1, x_2|y) = P(x_1|y) P(x_2|y)' title='P(x_1, x_2|y) = P(x_1|y) P(x_2|y)' class='latex' /> for <img src='http://s0.wp.com/latex.php?latex=y+%5Cin+%5C%7B0%2C1%5C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y &#92;in &#92;{0,1&#92;}' title='y &#92;in &#92;{0,1&#92;}' class='latex' />.</p>
<p>(Note that unlike <a href="http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/">surrogate learning with mean-independence</a>, both <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BX%7D_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mathcal{X}_1' title='&#92;mathcal{X}_1' class='latex' />  and <img src='http://s0.wp.com/latex.php?latex=%5Cmathcal%7BX%7D_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;mathcal{X}_2' title='&#92;mathcal{X}_2' class='latex' /> are allowed to be multi-dimensional.)</p>
<p>Co-training makes an additional assumption that either view is sufficient for classification. This <em>view-redundancy</em> assumption basically states that the probability mass in the region of the feature space, where the Bayes optimal classifiers on the two views disagree with each other, is zero.</p>
<p>(The original co-training paper actually relaxes this assumption in the epilogue, but it is unnecessary to begin with, and the assumption has proliferated in later manifestations of co-training.)</p>
<p>We are given some labeled data (or a weak classifier on one of the views) and a large supply of unlabeled data. We are now ready to proceed with co-training to construct a Bayes optimal classifier.</p>
<p><strong>Co-training &#8211; The algorithm</strong></p>
<p>The algorithm is very simple. We use our weak classifier, say <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />, (which we were given, or which we constructed using the measly labeled data) on the one view (<img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' />) to classify all the unlabeled data.  We select the examples classified with high confidence, and use these as labeled examples (using the labels assigned by the weak classifier) to train a classifier <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> on the other view (<img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' />).</p>
<p>We now classify the unlabeled data with <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> to similarly generate labeled data to retrain <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />. This back-and-forth procedure is repeated until exhaustion.</p>
<p>Under the above assumptions (and with &#8220;sufficient&#8221; unlabeled data) <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2' title='h_2' class='latex' /> converge to the Bayes optimal classifiers on the respective feature views. Since either view is enough for classification, we just pick one of the classifiers and release it into the wild.</p>
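<p>The back-and-forth procedure can be sketched in a few lines. This is only a minimal illustration under assumptions that are not in the original description: both views are 1-dimensional, each view is modelled with one Gaussian per class, the class priors are known and equal, and "high confidence" means a posterior outside a fixed margin around 0.5. The names <code>fit_gaussian_view</code>, <code>co_train</code>, <code>weak_h1</code> and the synthetic data are all hypothetical.</p>

```python
import numpy as np

def fit_gaussian_view(x, labels, prior1=0.5):
    """Fit one 1-D Gaussian per class; return a function giving P(y=1 | x).

    prior1 is the class prior P(y=1), assumed known, so that the skewed
    class ratio among the selected examples does not leak into the model.
    """
    mu0, mu1 = x[labels == 0].mean(), x[labels == 1].mean()
    s0, s1 = x[labels == 0].std() + 1e-6, x[labels == 1].std() + 1e-6

    def posterior(v):
        l0 = (1.0 - prior1) * np.exp(-0.5 * ((v - mu0) / s0) ** 2) / s0
        l1 = prior1 * np.exp(-0.5 * ((v - mu1) / s1) ** 2) / s1
        return l1 / (l0 + l1 + 1e-300)  # tiny constant guards against 0/0

    return posterior

def co_train(x1, x2, h1, rounds=3, margin=0.4):
    """Alternate views: confident predictions on one view label data for the other."""
    views, h, src = [x1, x2], [h1, None], 0
    for _ in range(2 * rounds):
        p = h[src](views[src])
        confident = np.abs(p - 0.5) > margin       # drop the uncertain band
        pseudo = (p[confident] > 0.5).astype(int)  # labels from the source view
        dst = 1 - src
        h[dst] = fit_gaussian_view(views[dst][confident], pseudo)
        src = dst
    return h[0], h[1]

# Synthetic unlabeled pool: the two views are independent given the class.
rng = np.random.default_rng(0)
n = 5000
y = rng.integers(0, 2, size=n)  # true labels, used only to score the result
x1 = rng.normal(np.where(y == 1, 1.5, -1.5), 1.0)
x2 = rng.normal(np.where(y == 1, 1.5, -1.5), 1.0)

def weak_h1(v):
    """A deliberately poor starting classifier: its boundary sits at 1.0, not 0."""
    return 1.0 / (1.0 + np.exp(-2.0 * (v - 1.0)))

def accuracy(h, x):
    return ((h(x) > 0.5).astype(int) == y).mean()

h1, h2 = co_train(x1, x2, weak_h1)
acc_weak, acc1, acc2 = accuracy(weak_h1, x1), accuracy(h1, x1), accuracy(h2, x2)
print(acc_weak, acc1, acc2)
```

<p>In this toy setting the retrained classifiers on both views should score noticeably better than the misplaced starting classifier, even though no true label is ever used for training.</p>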
<p><strong>Co-training &#8211; Why does it work?</strong></p>
<p>I&#8217;ll try to present an intuitive explanation of co-training using the example depicted in the following figure. Please focus on it intently.</p>
<p><img class="alignnone size-full wp-image-232" title="co-training" src="http://mlstat.files.wordpress.com/2009/08/co-training1.png?w=450&#038;h=299" alt="co-training" width="450" height="299" /></p>
<p>The feature vector <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' /> in the example is 2-dimensional and both views <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are 1-dimensional. Within each class the two views are uncorrelated and jointly Gaussian (and therefore independent); the class-conditional distributions are depicted by their equiprobability contours in the figure. The marginal class-conditional distributions are shown along the two axes. Class <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y=0' title='y=0' class='latex' /> is shown in red and class <img src='http://s0.wp.com/latex.php?latex=y%3D1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y=1' title='y=1' class='latex' /> is shown in blue. The picture also shows some unlabeled examples.</p>
<p>Assume we have a weak classifier <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' /> on the first view. If we extend the classification boundary for this classifier to the entire space <img src='http://s0.wp.com/latex.php?latex=x&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x' title='x' class='latex' />, the boundary necessarily consists of lines parallel to the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> axis. Let&#8217;s say there is only one such line and all the examples below that line are assigned class <img src='http://s0.wp.com/latex.php?latex=y%3D1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y=1' title='y=1' class='latex' /> and all the examples above are assigned class <img src='http://s0.wp.com/latex.php?latex=y%3D0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='y=0' title='y=0' class='latex' />.</p>
<p>We now ignore all the examples close to the classification boundary of <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> (i.e., all the examples in the grey band) and project the rest of the points onto the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> axis.</p>
<p>How will these projected points be distributed along <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' />?</p>
<p>Since the examples that were ignored (in the grey band) were selected based on their <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' /> values, owing to class-conditional independence, the marginal distribution along <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> for either class will be <em>exactly</em> the same as if none of the samples were ignored. This is the key reason for the conditional-independence assumption.</p>
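<p>This claim is easy to check numerically. The simulation below is a sketch on made-up data (the Gaussian means, the band location and the 0.8 coupling coefficient are arbitrary assumptions). It drops every example whose <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' /> falls inside a band and measures how far the per-class mean of <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> moves, once for a class-conditionally independent second view and once for a dependent one.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y = rng.integers(0, 2, size=n)
center = np.where(y == 1, 1.5, -1.5)  # class-dependent mean, same for both views

x1 = rng.normal(center, 1.0)
x2_indep = rng.normal(center, 1.0)        # independent of x1 given the class
x2_dep = x2_indep + 0.8 * (x1 - center)   # correlated with x1 within each class

# Ignore everything inside a band around a (misplaced) boundary at x1 = 0.3.
keep = np.abs(x1 - 0.3) > 0.8

def max_mean_shift(x2_vals):
    """Largest per-class change in the mean of x2 caused by the x1-based filter."""
    shifts = []
    for c in (0, 1):
        in_class = y == c
        kept = np.logical_and(in_class, keep)
        shifts.append(abs(x2_vals[kept].mean() - x2_vals[in_class].mean()))
    return max(shifts)

shift_indep = max_mean_shift(x2_indep)
shift_dep = max_mean_shift(x2_dep)
print(shift_indep, shift_dep)
```

<p>With the independent view the shift is only sampling noise, while with the dependent view the filter visibly distorts the class-conditional distribution along the second view.</p>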
<p>The procedure has two subtle, but largely innocuous, consequences.</p>
<p>First, since we don&#8217;t know how many class <img src='http://s0.wp.com/latex.php?latex=0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='0' title='0' class='latex' /> and class <img src='http://s0.wp.com/latex.php?latex=1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='1' title='1' class='latex' /> examples are in the grey band, the relative ratio of the examples of the two classes in the not-ignored set may not be the same as in the original full unlabeled sample set. If the class priors <img src='http://s0.wp.com/latex.php?latex=P%28y%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(y)' title='P(y)' class='latex' /> are known, this can easily be corrected for when we learn <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' />. If the class priors are unknown, other assumptions on <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' /> are necessary.</p>
<p>Second, when we project the unlabeled examples on to <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> we assign them the labels given to them by <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> which can be erroneous. In the figure above, there will be examples in the region indicated by A that are actually class <img src='http://s0.wp.com/latex.php?latex=1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='1' title='1' class='latex' /> but have been assigned class <img src='http://s0.wp.com/latex.php?latex=0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='0' title='0' class='latex' />, and examples in region B that were from class <img src='http://s0.wp.com/latex.php?latex=0&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='0' title='0' class='latex' /> but were called class <img src='http://s0.wp.com/latex.php?latex=1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='1' title='1' class='latex' />.</p>
<p>Again, because of the class-conditional independence assumption, these erroneously labeled examples will be distributed according to the marginal class-conditional <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> distributions. I.e., in the figure above we imagine, along the <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> axis, a very low amplitude blue distribution with the same shape and location as the red distribution, and a very low amplitude red distribution with the same shape under the blue distribution. (Note: this is the <img src='http://s0.wp.com/latex.php?latex=%28%5Calpha%2C+%5Cbeta%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='(&#92;alpha, &#92;beta)' title='(&#92;alpha, &#92;beta)' class='latex' /> noise in the original co-training paper.)</p>
<p>This amounts to having a labeled training set with label errors, but with the errors being generated <em>independently</em> of the location in the space. That is, the number of errors in a region of the space is proportional to the number of examples in that region. These proportionally distributed errors are then washed out by the correctly labeled examples when we learn <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2' title='h_2' class='latex' />.</p>
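<p>The "washing out" can be made explicit with a short calculation. The notation here is shorthand introduced for this sketch and may not match the paper&#8217;s exact convention: the noisy label handed over by the first view is written with a tilde, and the two flip rates are defined in the first line. The only property used is that, given the true class, the noisy label is independent of <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' />.</p>

```latex
\begin{aligned}
\eta(x_2) &= P(y=1 \mid x_2), \qquad
\alpha = P(\tilde{y}=1 \mid y=0), \qquad
\beta = P(\tilde{y}=0 \mid y=1), \\
P(\tilde{y}=1 \mid x_2)
  &= (1-\beta)\,\eta(x_2) + \alpha\,\bigl(1-\eta(x_2)\bigr) \\
  &= \alpha + (1-\alpha-\beta)\,\eta(x_2).
\end{aligned}
```

<p>As long as the two flip rates sum to less than one, the noisy posterior is an increasing affine function of the true posterior. Thresholding it at the midpoint between its two extremes, i.e. at alpha + (1 - alpha - beta)/2, therefore gives the same decisions as thresholding the true posterior at 1/2.</p>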
<p>To recap, co-training works because of the following fact. Starting from a weak classifier <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> on <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' />, we can generate very accurate and <em>unbiased</em> training data to train a classifier on <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' />.</p>
<p><strong>No need for view-redundancy</strong></p>
<p>Notice that, in the above example, we made no appeal to any kind of view-redundancy (other than whatever we may get gratis from the independence assumption).</p>
<p>The vigilant reader may however level the following two objections against the above argument-by-example.</p>
<p>1. We build <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> separately. So when the training is done, without view redundancy, we have not shown a way to pick from the two to apply to new test data.</p>
<p>2. At every iteration we need to select unlabeled samples that were classified with <em>high</em>-confidence by <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> to feed to the trainer for <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2' title='h_2' class='latex' />. Without view-redundancy, maybe <em>none</em> of the samples will be classified with high confidence.</p>
<p>The first objection is easy to respond to. We pick neither <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> nor <img src='http://s0.wp.com/latex.php?latex=h_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2' title='h_2' class='latex' /> for new test data. Instead we combine them to obtain a classifier <img src='http://s0.wp.com/latex.php?latex=h%28x_1%2Cx_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h(x_1,x_2)' title='h(x_1,x_2)' class='latex' />. This is well justified because, under class-conditional independence, <img src='http://s0.wp.com/latex.php?latex=P%28y%7Cx_1%2Cx_2%29+%5Cpropto+P%28y%7Cx_1%29+P%28y%7Cx_2%29+%2F+P%28y%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(y|x_1,x_2) &#92;propto P(y|x_1) P(y|x_2) / P(y)' title='P(y|x_1,x_2) &#92;propto P(y|x_1) P(y|x_2) / P(y)' class='latex' />. (The division by the prior <img src='http://s0.wp.com/latex.php?latex=P%28y%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='P(y)' title='P(y)' class='latex' /> can be dropped only when the two classes are equally likely.)</p>
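<p>To see where the combination rule comes from, apply Bayes&#8217; rule to each view: under class-conditional independence, P(y|x_1,x_2) is proportional to P(x_1|y) P(x_2|y) P(y), and replacing each likelihood by posterior/prior leaves P(y|x_1) P(y|x_2) / P(y). The prior matters whenever the classes are unbalanced. A small sketch that checks this against the exact posterior on a made-up discrete problem (the function name <code>combine_posteriors</code> and all the probabilities are illustrative assumptions):</p>

```python
import numpy as np

def combine_posteriors(p1, p2, prior1):
    """P(y=1 | x1, x2) from p1 = P(y=1 | x1), p2 = P(y=1 | x2) and prior1 = P(y=1).

    Valid when x1 and x2 are independent given the class.
    """
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    score1 = p1 * p2 / prior1
    score0 = (1.0 - p1) * (1.0 - p2) / (1.0 - prior1)
    return score1 / (score0 + score1)

# Toy problem with binary views and an unbalanced prior.
prior1 = 0.2
lik_x1 = {1: 0.9, 0: 0.3}   # P(x1 = 1 | y)
lik_x2 = {1: 0.7, 0: 0.2}   # P(x2 = 1 | y)

# Exact posterior for the observation x1 = 1, x2 = 1.
joint1 = prior1 * lik_x1[1] * lik_x2[1]
joint0 = (1.0 - prior1) * lik_x1[0] * lik_x2[0]
exact = joint1 / (joint0 + joint1)

# Single-view posteriors, as h1 and h2 would ideally report them.
p1 = prior1 * lik_x1[1] / (prior1 * lik_x1[1] + (1.0 - prior1) * lik_x1[0])
p2 = prior1 * lik_x2[1] / (prior1 * lik_x2[1] + (1.0 - prior1) * lik_x2[0])

combined = float(combine_posteriors(p1, p2, prior1))
naive = p1 * p2 / (p1 * p2 + (1.0 - p1) * (1.0 - p2))  # product without the prior
print(exact, combined, naive)
```

<p>The prior-corrected combination reproduces the exact posterior, whereas the plain normalized product does not when the prior is far from one half.</p>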
<p>We react to the second objection by dropping the requirement of classifying with high-confidence altogether.</p>
<p><strong>Dropping the high-confidence requirement by surrogate learning</strong></p>
<p>Instead of training <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> with examples that are classified with high confidence by <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />, we train <img src='http://s0.wp.com/latex.php?latex=h_2%28x_2%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_2(x_2)' title='h_2(x_2)' class='latex' /> with all the examples (using the scores assigned to them by <img src='http://s0.wp.com/latex.php?latex=h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1(x_1)' title='h_1(x_1)' class='latex' />).</p>
<p>At some iteration of co-training, define the random variable <img src='http://s0.wp.com/latex.php?latex=z_1+%3D+h_1%28x_1%29&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='z_1 = h_1(x_1)' title='z_1 = h_1(x_1)' class='latex' />. Since <img src='http://s0.wp.com/latex.php?latex=x_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_1' title='x_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are class-conditionally independent, <img src='http://s0.wp.com/latex.php?latex=z_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='z_1' title='z_1' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> are also class-conditionally independent. In particular <img src='http://s0.wp.com/latex.php?latex=z_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='z_1' title='z_1' class='latex' />  is class-conditionally <em>mean-independent</em> of <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' />. Furthermore if <img src='http://s0.wp.com/latex.php?latex=h_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='h_1' title='h_1' class='latex' /> is even a weakly useful classifier, barring pathologies, it will satisfy <img src='http://s0.wp.com/latex.php?latex=E%5Bz_1%7Cy%3D0%5D+%5Cneq+E%5Bz_1%7Cy%3D1%5D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='E[z_1|y=0] &#92;neq E[z_1|y=1]' title='E[z_1|y=0] &#92;neq E[z_1|y=1]' class='latex' />.</p>
<p>We can therefore apply <a href="http://mlstat.wordpress.com/2009/08/07/surrogate-learning-with-mean-independence/" target="_self">surrogate learning under mean-independence</a> to learn the classifier on <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' />. (This is essentially the same idea as <a href="http://www.informatik.uni-freiburg.de/cgnm/lehre/pm-05s/bib/multi-view/Nigam2000-effectiveness-and-applicability-of-cotraining.pdf" target="_blank">Co-EM</a>, which was introduced without much theoretical justification.)</p>
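<p>The identity that makes this work follows directly from mean-independence: if E[z_1|x_2,y] = E[z_1|y], then E[z_1|x_2] = m_0 + (m_1 - m_0) P(y=1|x_2), where m_0 and m_1 are the class-conditional means of <img src='http://s0.wp.com/latex.php?latex=z_1&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='z_1' title='z_1' class='latex' />. So regressing the scores on <img src='http://s0.wp.com/latex.php?latex=x_2&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='x_2' title='x_2' class='latex' /> and rescaling gives the posterior. The sketch below is one possible way to do that, not necessarily the procedure of the linked post. It assumes a 1-dimensional second view, a crude binned-average regression, and that m_0 and m_1 can be estimated from a small labeled subset (here the first 200 examples); the data and all names are hypothetical.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
y = rng.integers(0, 2, size=n)  # true labels; only the first 200 are "seen"
x1 = rng.normal(np.where(y == 1, 1.5, -1.5), 1.0)
x2 = rng.normal(np.where(y == 1, 1.5, -1.5), 1.0)

# Scores of a weak classifier on the first view (boundary misplaced at 1.0).
z1 = 1.0 / (1.0 + np.exp(-2.0 * (x1 - 1.0)))

# Class-conditional means of the score, estimated from a small labeled subset.
labeled = slice(0, 200)
m0 = z1[labeled][y[labeled] == 0].mean()
m1 = z1[labeled][y[labeled] == 1].mean()

# Regress z1 on x2 with binned averages, using every example and no labels.
edges = np.linspace(-5.0, 5.0, 41)
bin_of = np.digitize(x2, edges)
counts = np.bincount(bin_of, minlength=len(edges) + 1)
sums = np.bincount(bin_of, weights=z1, minlength=len(edges) + 1)
ez_given_x2 = sums / np.maximum(counts, 1)  # estimate of E[z1 | x2] per bin

# Rescale the regression output into an estimate of P(y=1 | x2).
posterior = np.clip((ez_given_x2[bin_of] - m0) / (m1 - m0), 0.0, 1.0)
acc_surrogate = ((posterior > 0.5).astype(int) == y).mean()
print(m0, m1, acc_surrogate)
```

<p>Note that every unlabeled example contributes to the regression, including those the weak classifier is unsure about, so no confidence threshold is needed.</p>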
<p><strong>Discussion</strong></p>
<p>Hopefully the above argument has convinced the reader that the class-conditional view independence assumption obviates the view-redundancy requirement.</p>
<p>A natural question to ask is whether the reverse is true. That is, if we are given view-redundancy, can we completely eliminate the requirement of class-conditional independence? We can immediately see that the answer is no.</p>
<p>For example, we can duplicate all the features for any classification problem so that view-redundancy holds trivially between the two replicates. Moreover, the second replicate will be statistically fully dependent on the first.</p>
<p>Now if we are given a weak classifier on the first view (or replicate) and try to use its predictions on an unlabeled data set to obtain training data for the second, it would be equivalent to feeding back the predictions of a classifier to retrain itself (because the two views are duplicates of one another).</p>
<p>This type of procedure (which is <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.77.6032" target="_blank">an idea decades old</a>) has been called, among other things, self-learning, self-correction, self-training and decision-directed adaptation. The problem with these approaches is that the training set so generated is <em>biased</em> and other assumptions are necessary for the feedback procedure to improve over the original classifier.</p>
<p>Of course this does not mean that the complete statistical independence assumption cannot be relaxed. The above argument only shows that at least <em>some amount</em> of independence is necessary.</p>
]]></content:encoded>
			<wfw:commentRss>http://mlstat.wordpress.com/2009/08/23/the-redundancy-of-view-redundancy-for-co-training/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">mlstat</media:title>
		</media:content>

		<media:content url="http://mlstat.files.wordpress.com/2009/08/co-training1.png" medium="image">
			<media:title type="html">co-training</media:title>
		</media:content>
	</item>
	</channel>
</rss>
