- May 18, 2016 Probabilistic Numerics for Computer Scientists
- Dec 3, 2015 Probabilistic Integration
- Jan 16, 2015 Connections, Part III: Bayesian Optimization
- Jan 15, 2015 Connections, Part II: Stochastic numerical methods
- Jan 14, 2015 Connections, Part I: Uncertainty Quantification
- Sep 5, 2014 Tübingen Manifesto: Community
- Sep 3, 2014 Tübingen Manifesto: Priors and Prior Work
- Sep 1, 2014 Tübingen Manifesto: Probabilistic Numerics and Probabilistic Programming
- Aug 27, 2014 Tübingen Manifesto: Uncertainty
- Aug 22, 2014 Roundtable in Tübingen

*As I go around presenting the idea of probabilistic numerics to various
audiences, certain questions about related areas come up repeatedly. This post
explains how probabilistic numerics compares to the area of Bayesian
Optimization. Previous posts discussed connections to
uncertainty quantification and
stochastic methods.*

*A disclaimer: Obviously, everyone has different opinions about the scope and
purpose of certain concepts and academic fields. And I am not an expert in the
areas discussed here. This post relates a part of my own personal
justification, why I think certain ideas are novel and interesting. It is not
intended as a holistic overview of a field. If anyone disagrees with
characterizations I make here, I would be glad if you could relate your
opinion in the comments section below.*

One of the most frequent questions regarding PN from the machine learning community is “what about Bayesian Optimization? Shouldn’t it be part of PN, too?” Some of things I write here arose from an email thread mainly driven by Roman Garnett, John Cunningham, Mike Osborne and myself.

**Bayesian Optimization (BO)** is a
probabilistic description of the task of finding the global extremum of a
function that is not “directly” accessible, either because it is embodied in
some physical process, or because it has very high evaluation cost. A good
example is the search for good parameters of a robotic control problem (where
each function evaluation involves a physical experiment), or finding good
setups for a large machine learning algorithm, such as a deep net (where each
function evaluation involves training the net to convergence, which may take
weeks on a cluster).

BO is a relatively young community, but already very successful. A workshop on BO is now an annual fixture at NIPS. The recent 2014 installment attracted around 50 people (my own rough guess), making it one NIPS’s larger satellites. And there is some personal overlap between the PN and BO communities, for example through Roman and Mike (together with Christian Schuler, I also contributed a BO algorithm myself (Hennig & Schuler, 2012)). Nevertheless, in my personal opinion, there is a difference in focusses that makes BO rather different from what we are trying to achieve with probabilistic numerical methods.

There is a spectrum of algorithms in BO, but most Bayesian optimizers fit into
the following scheme: We aim to find the minimum of
. To decide at which to collect the
next datapoint(s) , the algorithm builds a regression posterior
from previously collected evaluations . This is the
“Bayesian” bit, and virtually always involves Gaussian process models. Said
posterior is then transformed into a utility for the next evaluation,
. For example, this functional may predict how likely
is to be lower than previous function values, or the expected distance
between and the previous lowest achieved value, or something altogether
more complicated, like the expected information gain about the location of the
minimum from an evaluation at . Then – and this is crucial in this
context – the Bayesian global optimizer **performs a numerical global
optimization** to find the global minimum . How exactly this is done varies across implementations. There
are two main differences between the original optimization task of the Bayesian
optimizer and this new numerical global optimization problem: The latter is
free of noise, and the objective may be a lot cheaper, because it only involves
computations on the surrogate, not physical experiments. The “structural
complexity” (the shape) of the two problems, may be quite similar though.

In this sense BO “just” turns a tough physical optimization problem into a tough numerical optimization problem (the quotation marks are truly necessary, the models used in this are are anything but trivial). This doesn’t mean BO is pointless. It has been hugely successful recently at providing automated, surprisingly smart strategies saving time and money on design problems. But for our purposes it puts BO on a conceptual level “above” the base layer of numerics.

In our email thread, Roman gave a pointed criticism to this characterization:
Don’t all numerical methods just turn a hard problem into an easier one?
Quasi-Newton methods cheaply approximate expensive Hessian functions, for
example. This is true, and one could have a philosophical debate about the
boundaries between a numerical method and an algorithm merely *using* numerics
(Bayesian quadrature methods, for example, may well use a numerical
optimization algorithm to design their evaluation strategy). But what is more
important from a practical standpoint is the difference in research focus:
Having attended most of the recent NIPS workshop on BO, I would say the focus
of work in this community is currently on more elaborate structured models
(high-dimensional, heteroscedastic, with varying evaluation cost, etc.), and on
identifying interesting application areas (molecular biology, automated machine
learning, robotics, etc.). There is only limited debate about the computational
cost of BO methods, and on the empirical robustness of these algorithms. This
make sense, because BO algorithms are quite elaborate; so their behaviour and
cost is difficult to analyse.

In contrast, numerical methods form the base layer, the *inner loop* of
algorithms across many problem domains. They have to be parsimonious, and
trustworthy. When building probabilistic numerical methods, we have to compare
ourselves with the elegant, rugged algorithms built in the applied mathematical
communities over the past century. In my opinion it would be a mistake to
demand a large computational budget from our potential users. Instead we need
to show that probabilistic functionality can deliver efficiency and
expressivity gains at very limited overhead. In so far, I think it is a good
strategy to keep the developments in BO and PN separate for the time being.

However, our tiny cottage industry can learn a lot from the great success that BO has had in recent years. In about half a decade, BO has started from an existing basis of just a handful of (much older) early papers to a thriving community with many different competing key players. There are about as many existing basic papers on PN ideas as there were on BO a few years ago. I would be thrilled if, in 5 years, PN were where BO is today.

- Hennig, P., & Schuler, C. J. (2012). Entropy Search for Information-Efficient Global Optimization.
*Journal of Machine Learning Research*,*13*, 1809–1837.Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly adresses the decision problem of maximizing information gain from each evaluation.

@article{HennigS2012, title = {Entropy Search for Information-Efficient Global Optimization}, author = {Hennig, P. and Schuler, CJ.}, month = jun, volume = {13}, pages = {1809-1837}, journal = {Journal of Machine Learning Research}, year = {2012}, file = {http://jmlr.csail.mit.edu/papers/volume13/hennig12a/hennig12a.pdf}, link = {http://jmlr.csail.mit.edu/papers/v13/hennig12a.html}, code = {http://probabilistic-optimization.org/Global.html} }