A Short Presentation on Probabilistic Programming and Variational Inference

May 9, 2018


Probabilistic Programming for Bayesian Computation: Part I



The Cult of Universality in Statistical Learning Theory

October 31, 2010 3 comments

The question is frequently raised as to why the theory and practice of machine learning are so divergent. If you glance at any article about classification, chances are that you will find symbol upon lemma and equation upon inequality, making claims about bounds on the error rates that should putatively guide the engineer in the solution of her problem.

However, the situation seems to be that the engineer, having been forewarned by her pragmatic colleagues (or having checked a few herself) that these bounds are vacuous for most realistic problems, circumvents them altogether in her search for any useful nuggets in the article.

So why do these oft-ignored analyses still persist in a field that is composed largely of engineers? From my brief survey of the literature it seems that one (but, by no means, the only) reason is the needless preponderance of worst-case thinking. (Being a Panglossian believer in the purity of science and of the intentions of its workers, I am immediately dismissing the cynical suggestion that these analyses are appended to an article only to intimidate the insecure reviewer.)

The cult of universality

An inventive engineer designs a learning algorithm for her problem of classifying birds from the recordings of their calls. She suspects that her algorithm is more generally applicable and sits down to analyze it formally. She vaguely recalls various neat generalization error bounds she learned about during her days at the university, and wonders if they are applicable.

The bounds made claims of the kind

“for my classifier whose complexity is c, if trained on m examples, then for any distribution that generated the data, it is guaranteed that the

generalization error rate \leq error rate on the training set + some function of (c,m)

with high probability”.

Some widely used measures of the complexity of a classifier are its VC dimension and its Rademacher complexity, both of which measure the ability of the classifier to separate any training set. The intuition is that if the classifier can imitate any arbitrary labeling of a set of vectors, it will generalize poorly.

Because of the phrase “for any distribution” in the statement of the bound, the bound is said to be universally applicable. It is this pursuit of universality which is a deplorable manifestation of worst-case thinking. It is tolerable in mathematicians who delight in pathologies, but can be debilitating in engineers.

The extent of pessimism induced by the requirement of universality is not well appreciated. The following example is designed to illustrate this by relaxing the requirement from “any distribution” to “any smooth distribution”, which is not much of a relaxation at all.

Assume that I have a small training data set \{(x_i, y_i)\} in R^d drawn from a continuous distribution p(x, y).  Assume further that p(x) is reasonably smooth.

I now build a linear classifier under some loss (say an SVM). I then take all the training examples that are misclassified by the linear classifier and memorize them along with their labels.

For a test vector x, if x is within \epsilon of a memorized training example I give it the label of the training example. Otherwise I use the linear classifier to obtain my prediction.

I can make \epsilon very small and since the training examples will be in general position with probability one, this classification scheme is unambiguous.

This classifier will have zero error on all training sets and therefore will have high complexity according to the usual complexity measures like VC, Rademacher etc. However, if I ignore the contribution of the memorized points (which only play a role for a set of vanishingly small probability), I have a linear classifier.

Therefore, although it is reasonable to expect any analysis to yield very similar bounds on the generalization error for a linear classifier and my linear+memorization classifier, the requirement of universality leads to vacuous bounds for the latter.
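The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a least-squares linear classifier stands in for the SVM, labels are taken to be -1/+1, and the class name `LinearPlusMemorization` and the parameter `eps` are made up for this sketch.

```python
import numpy as np

class LinearPlusMemorization:
    """Linear classifier that also memorizes its own training mistakes."""

    def __init__(self, eps=1e-6):
        self.eps = eps  # radius around a memorized point (assumed very small)

    def _linear(self, X):
        # Plain linear decision rule with a bias term.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.where(Xb @ self.w >= 0, 1, -1)

    def fit(self, X, y):
        # Least squares stands in for the SVM mentioned in the text.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        self.w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        wrong = self._linear(X) != y
        # Memorize the misclassified training examples with their labels.
        self.mem_X, self.mem_y = X[wrong], y[wrong]
        return self

    def predict(self, X):
        pred = self._linear(X)
        if len(self.mem_X):
            dist = np.linalg.norm(X[:, None, :] - self.mem_X[None, :, :], axis=2)
            nearest = dist.argmin(axis=1)
            close = dist[np.arange(len(X)), nearest] <= self.eps
            # Within eps of a memorized example: use its label instead.
            pred[close] = self.mem_y[nearest[close]]
        return pred
```

On any training set with distinct points this scheme reproduces the training labels exactly, yet away from the tiny eps-balls it is just the linear rule, which is the point of the example.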

Even if I assume nothing more than smoothness, I do not know how to derive reasonable statements with the existing tools. And we almost always know much more about the data distributions!

To reiterate, checking one’s learning algorithm against the worst possible distribution is akin to designing a bicycle and checking how well it serves for holding up one’s pants.

“The medicine bottle rules”

Our engineer ponders these issues, muses about the “no free lunch” results that imply that for any two classifiers there are distributions for which either one of them is better than the other, and wonders about the philosophical distinction between a priori restricting the function space that the learning algorithm searches in, and a priori restricting the distributions that the learning algorithm is applicable to.

After a short nap, she decides on a sensible route for her analysis.

1. State the restrictions on the distribution. She shows that her algorithm will perform very well if her assumptions of the data distribution are satisfied. She further argues that the allowed distributions are still broad enough to cover many other problems.

2. State to what extent the assumptions can be violated. She analyzes how the quality of her algorithm degrades when the assumptions are satisfied only approximately.

3. State which assumptions are necessary. She analyzes the situations where her algorithm will definitely fail.

I believe that these are good rules to follow while analyzing classification algorithms. My professor George Nagy calls these the medicine bottle rules because, as on a medicine label, we require information on how to administer the drug, what it is for, what it is bad for, and perhaps on interesting side effects.

I do not claim to follow this advice unfailingly and I admit to some of the above crimes. I, however, do believe that medicine bottle analysis is vastly more useful than much of what passes for learning theory. I look forward to hearing from you, nimble reader, of your thoughts on the kinds of analyses you would care enough about to read.

Random Fourier Features for Kernel Density Estimation

October 4, 2010 4 comments

The NIPS paper Random Fourier Features for Large-scale Kernel Machines, by Rahimi and Recht presents a method for randomized feature mapping where dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space.

We know that for any p.d. kernel there exists a deterministic map that has the aforementioned property but it may be infinite dimensional. The paper presents results indicating that with the randomized map we  can get away with only a “small” number of features (at least for a classification setting).

Before applying the method to density estimation let us review the relevant section of the paper briefly.

Bochner’s Theorem and Random Fourier Features

Assume that we have data in R^d and a continuous p.d. kernel K(x,y) defined for every pair of points x,y \in R^d. Assume further that the kernel is shift-invariant, i.e., K(x,y) = K(x-y) \triangleq K(\delta) and that the kernel is scaled so that K(0) = 1.

The theorem by Bochner states that under the above conditions K(\delta) must be the Fourier transform of a non-negative measure on R^d. In other words, there exists a probability density function p(w) for w \in R^d such that K(\delta) = \mathcal{F}(p(w)). Writing this out for \delta = x - y,

K(x-y) = \int_{R^d} p(w) e^{j w^T (x-y)} dw = E_w[e^{j w^T (x-y)}]

= E_w[cos(w^T (x-y))]    (1)

= E_w[cos(w^T x) cos(w^T y) + sin(w^T x) sin(w^T y)]    (2)

where (1) is because K(.) is real. Equation (2) says that if we draw a random vector w according to p(w) and form two vectors \phi(x) = (cos(w^T x), sin(w^T x)) and \phi(y) = (cos(w^T y), sin(w^T y)), then the expected value of <\phi(x),\phi(y)> is K(x-y).

Therefore, for x \in R^d, if we choose the transformation

\phi(x) = \frac{1}{\sqrt{D}} (cos(w_1^T x), sin(w_1^T x), cos(w_2^T x), sin(w_2^T x), \ldots, cos(w_D^T x), sin(w_D^T x))

with w_1,\ldots, w_D drawn according to p(w), linear inner products in this transformed space will approximate K(.).

Gaussian RBF Kernel

The Gaussian radial basis function kernel satisfies all the above conditions and we know that the Fourier transform of the Gaussian is another Gaussian (with the reciprocal variance). Therefore for “linearizing” the Gaussian r.b.f. kernel, we draw D samples from a Gaussian distribution for the transformation.
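As a rough sketch of the recipe above, the following NumPy code draws the frequencies and builds the feature map. It assumes the parametrization K(\delta) = exp(-\gamma ||\delta||^2), for which the sampling density is a zero-mean Gaussian with variance 2\gamma per coordinate; the function names are made up for this example, and the cosines and sines are stacked in blocks rather than interleaved, which does not change inner products.

```python
import numpy as np

def sample_frequencies(d, D, gamma, rng):
    # Assumed kernel: K(delta) = exp(-gamma * ||delta||^2).
    # Its Fourier transform is Gaussian with variance 2 * gamma per coordinate.
    return rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))

def rff_map(X, W):
    # Rows of X are points in R^d; rows of W are the sampled frequencies w_k.
    proj = X @ W.T                      # shape (n, D), entries w_k^T x
    D = W.shape[0]
    # 2D features per point: all cosines, then all sines, scaled by 1/sqrt(D).
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 2))
    W = sample_frequencies(d=2, D=5000, gamma=1.0, rng=rng)
    Z = rff_map(X, W)
    exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    # Largest gap between the linear inner products and the exact kernel.
    print(np.abs(Z @ Z.T - exact).max())
```

Each entry of Z Z^T is an average of D cosines, so its error shrinks roughly like 1/\sqrt{D}.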

Parzen Window Density Estimation

Given a data set \{x_1, x_2, \ldots, x_N\} \subset R^d, the so-called Parzen window probability density estimator is defined as follows

\hat{p}(x) \propto \frac{1}{N} \sum_i K((x-x_i)/h)

where K(.) is often a positive, symmetric, shift-invariant kernel and h is the bandwidth parameter that controls the scale of influence of the data points.

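A common kernel that is used for Parzen window density estimation is the Gaussian density. If we make the same choice we can apply our feature transformation to linearize the procedure. We have

\hat{p}(x) \propto \frac{1}{N} \sum_i K(x-x_i) \approx \frac{1}{N} \sum_i <\phi(x), \phi(x_i)> = <\phi(x), \frac{1}{N} \sum_i \phi(x_i)>

where h has been absorbed into the kernel variance.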

Therefore all we need to do is take the mean of the transformed data points and estimate the pdf at a new point to be (proportional to) the inner product of its transformed feature vector with the mean.
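A minimal sketch of this procedure, next to the plain Parzen window estimate for comparison, might look as follows. It again assumes the kernel exp(-\gamma ||\delta||^2) with the bandwidth absorbed into \gamma, returns unnormalized scores, and uses hypothetical function names.

```python
import numpy as np

def rff_map(X, W):
    # Random Fourier features: cosines and sines of the projections.
    proj = X @ W.T
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

def fit_rff_density(X, gamma, D, rng):
    # Sample frequencies for the assumed kernel exp(-gamma * ||delta||^2)
    # and store only the mean of the transformed training points.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, X.shape[1]))
    return W, rff_map(X, W).mean(axis=0)

def rff_density(X_new, W, mean_feature):
    # Unnormalized density estimate: one inner product per query point.
    return rff_map(X_new, W) @ mean_feature

def parzen_density(X_new, X, gamma):
    # Plain (unnormalized) Parzen window estimate for comparison.
    sq = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq).mean(axis=1)
```

After fitting, the training set can be discarded: the estimator keeps W and a single 2D-dimensional mean vector, regardless of N.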

Of course, since the kernel value is only approximated by the inner product of the random Fourier features, we expect that the estimated pdf will differ from a plain unadorned Parzen window estimate. But different how?


Below are some pictures showing how the method performs on some synthetic data. I generated a few dozen points from a mixture of Gaussians and plotted contours of the estimated pdf for the region around the points. I did this for several choices of D and \gamma (the scale parameter for the Gaussian kernel).

First let us check that the method performs as expected for large values of D because the kernel value is well approximated by the inner product of the Fourier features. The first 3 pictures are for D = 10000 for various values of \gamma.

D = 10000 and gamma = 2.0

D = 10000 and gamma = 1.0

D = 10000 and gamma = 0.5



Now let us see what happens when we decrease D. We expect the error in approximating the kernel to lead to an obviously erroneous pdf. This is clearly evident for the case of D=100.

D=1000 and gamma = 1.0

D=100 and gamma = 1.0



The following picture for  D = 1000 and \gamma = 2.0 is even stranger.

D = 1000 and gamma = 2.0




Even for a simple 2D example, we seem to need to compute a very large number of random Fourier features to make the estimated pdf accurate. (For this small example this is very wasteful, since a plain Parzen window estimate would require less memory and computation.)

However, the pictures do indicate that if the approach is to be used for outlier detection (aka novelty detection) from a given data set, we might be able to get away with a much smaller D. That is, even if the estimated pdf has a big error on the entire space, on the points from the data it seems to be reasonably accurate.

Regularized Minimax on Synthetic Data

April 19, 2010

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

Experiments on Synthetic Data

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two types per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.

We would expect a plain logistic regression classifier to minimize the overall “error” on the training data.

However, since an adversary may assign a different set of costs to the various types (than those given by the type frequencies), a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

Example 1

Example 1. Original training data set. Both the red and blue classes have two types in 9:1 ratio.

Example 1. Plain logistic regression. No minimax. Almost all of the red circles are misclassified.

Example 1. Minimax with gamma = 0.1

Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

Example 2

Example 2. Original training data set.

Example 2. Logistic regression. No minimax.

Example 2. Minimax with gamma = 0.5


1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.

Regularized Minimax for Robust Learning

March 13, 2010 1 comment

This post is about using minimax estimation for robust learning when the test data distribution is expected to be different from the training data distribution, i.e., learning that is robust to data drift.

Cost Sensitive Loss Functions

Given a training data set D = \{x_i, y_i\}_{i=1,\ldots,N}, most learning algorithms learn a classifier \phi that is parametrized by a vector w by minimizing a loss function

L(w) = \sum_{i=1}^N l(x_i, y_i, w) + f(w)

where l(x_i, y_i, w) is the loss on example i and f(w) is some function that penalizes complexity. For example, for logistic regression (with labels y_i \in \{-1, +1\}) the loss function looks like

L(w) = \sum_{i=1}^N \log(1 + \exp(-y_i w^T x_i)) + \lambda \|w\|^2

for some \lambda > 0.

If, in addition, the examples came with costs c_i (that somehow specify the importance of minimizing the loss on that particular example), we can perform cost sensitive learning by over/under-sampling the training data or by minimizing a cost-weighted loss function (see this paper by Zadrozny et al.)

L(w, c) = \sum_{i=1}^N c_i l(x_i, y_i, w) + f(w)

We further constrain \sum_{i=1}^N c_i = N and c_i \ge 0. So the unweighted learning problem corresponds to the case where all c_i = 1.

A Game Against An Adversary

Assume that the learner is playing a game against an adversary that will assign the costs \{c_i\}_{i=1,\ldots,N} to the training examples that will lead to the worst possible loss for any weight vector the learner produces.

How do we learn in order to minimize this maximum possible loss? The solution is to look for the minimax solution

w^* = \arg\min_w \max_c L(w, c)

For any realistic learning problem the above optimization problem does not have a unique solution.

Instead, let us assume that the adversary has to pay a price for assigning his costs, which depends upon how much they deviate from uniform. One way is to make the price proportional to the negative of the entropy of the cost distribution.

We define

R(w, c) = L(w, c) + \gamma H(c)

where H(c) = -\sum_i c_i \log c_i (the Shannon entropy of the cost vector, save the normalization to sum to one) and \gamma > 0 scales the price the adversary pays.

The new minimax optimization problem can be posed as

w^* = \arg\min_w \max_c R(w, c)

subject to the constraints

\sum_{i=1}^N c_i = N and c_i \ge 0.

Note that the regularization term on the cost vector c essentially restricts the set of  possible cost vectors the adversary has at his disposal.


For convex loss functions (such as the logistic loss) L(w, c) is convex in w for a fixed cost assignment, therefore so is R(w, c). Furthermore, R(w, c) is concave in c and is restricted to a convex and compact set. We can therefore apply Danskin’s theorem to perform the optimization.

The theorem allows us to say that, for a fixed weight vector w, if

\tilde{c} = \arg\max_c R(w, c)

and if \tilde{c} is unique, then

\nabla_w \max_c R(w, c) = \nabla_w R(w, \tilde{c})

even though \tilde{c} is a function of w.


The algorithm is very simple. Perform until convergence the following

1. At the k^{th} iteration, for the weight vector w^{k} find the cost vector \tilde{c} that maximizes R(w^{k},c).

2. Update w^{k+1} = w^{k} - \eta \nabla_w R(w^{k}, \tilde{c}), where \eta is the learning rate.

The maximization in step 1 is also simple and can be shown to be

\tilde{c}_i = N \frac{\exp(l(x_i, y_i, w^{k})/\gamma)}{\sum_j \exp(l(x_j, y_j, w^{k})/\gamma)}

As expected, if \gamma \rightarrow \infty, the costs remain close to one and as \gamma \rightarrow 0 the entire cost budget is allocated to the example with the largest loss.
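The two-step procedure can be sketched for logistic regression as below. This is an illustration, not the original implementation: it assumes labels in \{-1, +1\}, takes f(w) = \lambda ||w||^2, uses a softmax over the per-example losses for the adversary's costs (which is what maximizing a cost-weighted loss plus \gamma times the entropy gives under the sum-to-N constraint), and scales the step by 1/N for convenience. All function and parameter names are made up.

```python
import numpy as np

def adversary_costs(losses, gamma):
    # Step 1: entropy-regularized worst-case costs, c_i proportional to
    # exp(loss_i / gamma), rescaled so that the costs sum to N.
    z = losses / gamma
    z = z - z.max()                     # stabilize the exponentials
    e = np.exp(z)
    return len(losses) * e / e.sum()

def fit_minimax(X, y, gamma, lam=0.01, eta=0.1, iters=500):
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        losses = np.logaddexp(0.0, -margins)        # log(1 + exp(-margin))
        c = adversary_costs(losses, gamma)          # costs held fixed below
        sig = np.exp(-np.logaddexp(0.0, margins))   # sigmoid(-margin)
        # Step 2: gradient of the cost-weighted loss plus the L2 penalty.
        grad = -(X * (c * y * sig)[:, None]).sum(axis=0) + 2.0 * lam * w
        w -= (eta / N) * grad
    return w
```

For very large gamma the costs returned by `adversary_costs` are all close to one, so the loop reduces to ordinary gradient descent on regularized logistic regression.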

Of types and tokens

This line of work was motivated by the following intuition of my colleague Marc Light about the burstiness of types in language data.

For named entity recognition the training data is often drawn from a small time window and is likely to contain entity types whose distribution is not representative of the data that the recognizer is going to see in general.

(The fact that “Joe Plumber” occurs so frequently in our data is because we were unlucky enough to collect annotated data in 2008.)

We can build a recognizer that is robust to such misfortunes by optimizing for the worst possible type distribution rather than for the observed token distribution. One way to accomplish this is to learn the classifier by minimax over the cost assignments for different types.

For type t let S_t be the set of all tokens of that type and N_t be the number of tokens of that type. We now estimate w by

under the same constraints on c as above. Here q is the observed type distribution in the training data and KL(.\|.) is the KL-divergence.

The algorithm is identical to the one above except the maximum over c for a fixed w is slightly different.

Related Work and Discussion

1. The only other work I am aware of that optimizes for a similar notion of robustness is the one on adversarial view for covariate shift by Globerson et al. and the NIPS paper by Bruckner and Scheffer. Both these papers deal with minimax learning for robustness to additive transformation of feature vectors (or addition/deletion of features). Although it is an obvious extension, I have not seen the regularization term that restricts the domain for the cost vectors. I think it allows for learning models that are not overly pessimistic.

2. If each class is considered to be one type, the usual Duda & Hart kind of minimax over class priors can be obtained. Minimax estimation is usually done for optimizing for the worst possible prior over the parameter vectors (w for us) and not for the costs over the examples.

3. For named entity recognition, the choice of how to group examples by types is interesting and requires further theory and experimentation.

4. For information retrieval often the ranker is learned from several example queries. The learning algorithm tries to obtain a ranker that matches human judgments for the document collection for the example queries. Since the queries are usually sampled from the query logs, the learned ranker may perform poorly for a particular user. Such a minimax approach may be suitable for  optimizing for the worst possible assignment of costs over query types.

In the next post I will present some experimental results on toy examples with synthetic data.


I am very grateful to Michael Bruckner for clarifying his NIPS paper and some points about the applicability of Danskin’s theorem, and to Marc Light for suggesting the problem.

Sparse online kernel logistic regression

December 6, 2009

In a previous post, I talked about an idea for sparsifying kernel logistic regression by using random prototypes. I also showed how the prototypes themselves (as well as the kernel parameters) can be updated. (Update Apr 2010. Slides for a tutorial on this stuff.)

(As a brief aside, I note that an essentially identical approach was used to sparsify Gaussian Process Regression by Snelson and Ghahramani. For GPR they use gradient ascent on the log-likelihood to learn the prototypes and labels, which is akin to learning the prototypes and betas for logistic regression. The set of prototypes and labels generated by their algorithm can be thought of as a pseudo training set.)

I recently (with the help of my super-competent Java developer colleague Hiroko Bretz) implemented the sparse kernel logistic regression algorithm. The learning is done in an online fashion (i.e., using stochastic gradient descent).

It seems to perform reasonably well on large datasets. Below I’ll show its behavior on some pseudo-randomly generated classification problems.

All the pictures below are for logistic regression with the Gaussian RBF kernel. All data sets have 1000 examples from three classes which are mixtures of Gaussians in 2D (shown in red, blue and green). The left panel is the training data and the right panel shows the predictions on the same data set by the learned logistic regression classifier. The prototypes are shown as black squares.

Example 1 (using 3 prototypes)

After first iteration

After second iteration

After about 10 iterations

Although the classifier changes considerably from iteration to iteration, the prototypes do not seem to change much.

Example 2 (five prototypes)

After first iteration

After 5 iterations

Example 3 (five prototypes)

After first iteration

The rightmost panel shows the first two “transformed features”, i.e., the kernel values of the examples to the first two prototypes.

After second iteration

Implementation details and discussion

The algorithm runs through the whole data set to update the betas (fixing everything else), then runs over the whole data set again to update the  prototypes (fixing the betas and the kernel params), and then another time for the kernel parameter. These three update steps are repeated until convergence.

As an indication of the speed, it takes about 10 minutes until convergence with 50 prototypes, on a data set with a quarter million examples and about 7000 binary features (about 20 non-zero features/example).

I had to make some approximations to make the algorithm fast — the prototypes had to be updated lazily (i.e., only the feature indices that have the feature ON are updated), and the RBF kernel is computed using the distance only along the subspace of the ON features.

The kernel parameter updating worked best when the RBF kernel was re-parametrized as K(x,u) = exp(-exp(\theta) ||x-u||^2).
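To make the re-parametrization concrete, here is a small sketch of the kernel and its derivative with respect to \theta. One plausible reason it helps (my reading, not something measured here) is that exp(\theta) stays positive under unconstrained gradient steps on \theta. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(x, u, theta):
    # K(x, u) = exp(-exp(theta) * ||x - u||^2); exp(theta) is always positive.
    return np.exp(-np.exp(theta) * np.sum((x - u) ** 2))

def rbf_kernel_grad_theta(x, u, theta):
    # Chain rule: dK/dtheta = K * d/dtheta(-exp(theta) * sq) = -exp(theta) * sq * K.
    sq = np.sum((x - u) ** 2)
    scale = np.exp(theta)
    return -scale * sq * np.exp(-scale * sq)
```

The derivative feeds into the gradient of the logistic loss with respect to \theta through the kernel values of each example to each prototype.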

The learning rate for the betas was annealed, but those of the prototypes and the kernel parameter were fixed at a constant value.

Finally, and importantly, I did not play much with the initial choice of the prototypes. I just picked a random subset from the training data. I think more clever ways of initialization will likely lead to much better classifiers. Even a simple approach like K-means will probably be very effective.

BWT for NLP (2)

November 12, 2009 2 comments

I show how the Burrows-Wheeler Transform can be used to compute the similarity between two strings. We submitted results from this method (along with results from the Context-Chain metric developed by my colleagues Frank Schilder and Ravi Kondadadi) for the Automatically Evaluating the Summaries of Peers (AESOP) task of the TAC 2009 conference.

The task was to produce an automatic metric to evaluate machine generated summaries (i.e., system summaries) against human generated summaries for the TAC ’09 Update Summarization Task. Clearly the automatic metric is just some function that produces a similarity score between the system summary and the human generated (the so-called model) summary.

The proposed metrics were evaluated by comparing their rankings of the system summaries from different peers to the rankings produced by human judges.

Similarity Metric

We use an estimate of the conditional “compressibility” of the model summary given the system summary as the similarity metric. The conditional compressibility is defined as the increase in the compressibility of the model summary when the system summary has been observed.

In order to judge the similarity of the system summary S to the model summary M, we propose to use the difference in compressibility of M when S is not seen to when S is given. This metric basically captures the reduction in the uncertainty in M when S is known.

We define the compressibility c(M) of any string M by

c(M) = \frac{H(M)}{|M|}

and the conditional compressibility of string M over an alphabet \mathcal{A} given another string S over the same alphabet as

c(M|S) = \frac{H(S+M) - H(S)}{|M|}

where S+M is the concatenation of the strings S and M, H(S) is the entropy of string S, and |M| is the length of the string M.

The fractional increase in compressibility of M given S can then be measured by

r(M|S) = \frac{c(M) - c(M|S)}{c(M)}.

We use r(M|S) as the similarity metric to measure the similarity of a system summary S to the model summary M.

Our metric is similar to the one proposed by Li and Vitanyi and is theoretically well-justified from the perspective of algorithmic information theory. One peculiarity is that our similarity is asymmetric.

The only thing that is needed to implement the above similarity metric is an estimate of the entropy H(S) for a string S. We use the BWT for this estimate.

BWT-based String Entropy Estimate

We use the Move-To-Front (MTF) entropy of the Burrows-Wheeler transform of a given string S as an estimate for its entropy H(S).

The MTF encoding of a string is performed by traversing the string and assigning to each symbol the position of that symbol in the alphabet and then moving the symbol to the front of the alphabet. Therefore a sequence with a lot of runs will  have a lot of zeros in its MTF encoding.

In this paper the MTF coding is used to define the MTF entropy (which the authors also call local entropy) of a string R as

\mbox{MTFE}(R) = \sum_i \mbox{log}(\mbox{MTF}(R)_i + 1)

where \mbox{MTF}(R)_i is the i^{th} symbol of the MTF coding of the string R.

Now we define H(S), the entropy of string S as

H(S) = \mbox{MTFE}(\mbox{BWT}(S))

where \mbox{BWT}(S) is the BWT of string S.

Since the Burrows-Wheeler transform involves just the construction of a suffix array, the computation of our compression based evaluation metric is linear in time and space in the length of the model and system summary strings.
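The definitions above can be put together in a short sketch. To keep it brief, it computes the BWT naively by sorting all rotations (quadratic or worse, and without an end-of-string marker) rather than through a suffix array, so it does not have the linear cost mentioned above. Symbols can be characters or words; the function names are made up for this example.

```python
import math

def bwt(symbols):
    # Naive Burrows-Wheeler transform: last column of the sorted rotations.
    n = len(symbols)
    rotations = sorted(tuple(symbols[i:]) + tuple(symbols[:i]) for i in range(n))
    return [rot[-1] for rot in rotations]

def mtf_encode(symbols, alphabet):
    # Emit each symbol's current position, then move it to the front.
    table = sorted(alphabet)
    codes = []
    for s in symbols:
        idx = table.index(s)
        codes.append(idx)
        table.insert(0, table.pop(idx))
    return codes

def string_entropy(symbols, alphabet):
    # H(S) = MTFE(BWT(S)) = sum_i log(MTF(BWT(S))_i + 1)
    return sum(math.log(code + 1) for code in mtf_encode(bwt(symbols), alphabet))

def similarity(model, system):
    # r(M|S) = (c(M) - c(M|S)) / c(M); undefined when c(M) is zero.
    alphabet = set(model) | set(system)
    h_m = string_entropy(model, alphabet)
    h_s = string_entropy(system, alphabet)
    h_sm = string_entropy(system + model, alphabet)
    c_m = h_m / len(model)
    c_m_given_s = (h_sm - h_s) / len(model)
    return (c_m - c_m_given_s) / c_m
```

For word-level comparison, pass lists of normalized tokens, for example `similarity(model_text.lower().split(), system_text.lower().split())`.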

Some Technical Details

For our implementation, we considered each word in a string as a separate symbol. Our alphabet of symbols therefore contained all the words in the two strings being compared. The words were normalized by lower casing and removing punctuation. Because BWT needs an ordered alphabet, we used the lexicographic order on the words in the alphabet.



The results on the TAC-AESOP task (above) show that the BWT based metric (FraCC in the table) is reasonable for summarization evaluation, especially because there are not very many knobs to tune. I obtained these results from Frank (who will present them at TAC next week). The “best metric” is the AESOP submission that seemed to have high scores across several measures.