Probabilistic-Numerics.org

# Literature

This page collects literature on all areas of probabilistic numerics, sorted by problem type. If you would like your publication to be featured in this list, please do not hesitate to contact us. The fastest way to get your documents onto the site is to clone our github repository, add your documents to the relevant BibTeX-file in /_bibliography, then either send us a pull-request, or an email with the updated file (see box on the frontpage for our contacts).

## General and Foundational

The following papers are often cited as early works on the idea of uncertainty over the result of deterministic computations. Some entries have a “notes” field providing further information about the relevance of the cited work, or pointers to specific results therein.

1. Poincaré, H. (1896). Calcul des probabilités. Paris: Gauthier-Villars.

Likely the first text explicitly discussing the notion of computation as inference. The introduction (§1.7) contains the following insightful, albeit informal, sentence: Une question de probabilités ne se pose que par suite de notre ignorance: il n’y aurait place que pour la certitude si nous connaissions toutes les données du problème. D’autre part, notre ignorance ne doit pas être complète, sans quoi nous ne pourrions rien évaluer. Une classification s’opérerait donc suivant le plus ou moins de profondeur de notre ignorance. Ainsi la probabilité pour que la sixième décimale d’un nombre dans une table de logarithmes soit égale à 6 est a priori de 1/10; en réalité, toutes les données du problème sont bien déterminées, et, si nous voulions nous en donner la peine, nous connaîtrions exactement cette probabilité. De même, dans les interpolations, dans le calcul des intégrales définies par la méthode de Cotes ou celle de Gauss, etc. Further along, in chapter XV, page 292, he constructs a (basic) version of the Wiener integral from scratch and applies it to interpolation problems (in the beginning of that chapter he reviews the problem of interpolating an unknown function over a known basis of polynomials using classical deterministic methods, in the second part he assumes that the coefficients in the basis are Gaussian and treats interpolation points as data in an inference problem). [Many thanks to Houman Owhadi for pointing out this second reference.]

@book{poincare1896,
author = {Poincar{\'e}, H.},
publisher = {Gauthier-Villars},
title = {{Calcul des probabilit{\'e}s}},
year = {1896},
notes = {Likely the first text explicitly discussing the notion of computation as inference. The introduction (§1.7) contains the following insightful, albeit informal, sentence:

Une question de probabilités ne se pose que par suite de notre ignorance: il n’y aurait place que pour la certitude si nous connaissions toutes les données du problème. D’autre part, notre ignorance ne doit pas être complète, sans quoi nous ne pourrions rien évaluer. Une classification s’opérerait donc suivant le plus ou moins de profondeur de notre ignorance.

Ainsi la probabilité pour que la sixième décimale d’un nombre dans une table de logarithmes soit égale à 6 est a priori de 1/10; en réalité, toutes les données du problème sont bien déterminées, et, si nous voulions nous en donner la peine, nous connaîtrions exactement cette probabilité. De même, dans les interpolations, dans le calcul des intégrales définies par la méthode de Cotes ou celle de Gauss, etc.

Further along, in chapter XV, page 292, he constructs a (basic) version of the Wiener integral from scratch and applies it to interpolation problems (in the beginning of that chapter he reviews the problem of interpolating an unknown function over a known basis of polynomials using classical deterministic methods, in the second part he assumes that the coefficients in the basis are Gaussian and treats interpolation points as data in an inference problem). [Many thanks to Houman Owhadi for pointing out this second reference.]
}
}

2. Erdös, P., & Kac, M. (1940). The Gaussian law of errors in the theory of additive number theoretic functions. American Journal of Mathematics, 738–742.
@article{kac1940gaussian,
title = {The Gaussian law of errors in the theory of additive number
theoretic functions},
author = {Erd{\"o}s, P. and Kac, M},
journal = {American Journal of Mathematics},
pages = {738--742},
year = {1940},
}

3. Kac, M. (1949). On distributions of certain Wiener functionals. Transactions of the AMS, 65(1), 1–13.
@article{kac1949distributions,
title = {On distributions of certain Wiener functionals},
author = {Kac, M.},
journal = {Transactions of the AMS},
volume = {65},
number = {1},
pages = {1--13},
year = {1949},
}

4. Sul’din, A. V. (1959). Wiener measure and its applications to approximation methods. I. Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika, (6), 145–158.
@article{sul1959wiener,
title = {Wiener measure and its applications to approximation methods. I},
author = {Sul'din, Al'bert Valentinovich},
journal = {Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika},
number = {6},
pages = {145--158},
year = {1959},
publisher = {Kazan (Volga region) Federal University}
}

5. Sul’din, A. V. (1960). Wiener measure and its applications to approximation methods. II. Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika, (5), 165–179.
@article{sul1960wiener,
title = {Wiener measure and its applications to approximation methods. II},
author = {Sul'din, Al'bert Valentinovich},
journal = {Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika},
number = {5},
pages = {165--179},
year = {1960},
publisher = {Kazan (Volga region) Federal University}
}

6. Sard, A. (1963). Linear approximation. Providence, R.I.: American Mathematical Society.

Towards the end of this book, Sard introduces the probability concepts of his time (Wiener and Kolmogorov) into the classical theory of linear approximation. [Many thanks to Houman Owhadi for this reference.]

@book{sard1963linear,
title = {Linear approximation},
author = {Sard, Arthur},
year = {1963},
publisher = {American Mathematical Society},
notes = {Towards the end of this book, Sard introduces the probability concepts of his time (Wiener and Kolmogorov) into the classical theory of linear approximation. [Many thanks to Houman Owhadi for this reference.]}
}

7. Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2), 495–502.

This seminal paper points out that splines can be interpreted as the posterior mean functions of certain Gaussian processes. The paper does not explicitly make the connection to the use of splines in numerics – which means that many classic numerical methods in various areas can be interpreted as average-case Gaussian inference rules – but later authors, most notably Diaconis, make it clear that they got the idea.

@article{kimeldorf1970correspondence,
title = {A correspondence between Bayesian estimation on stochastic processes and smoothing by splines},
author = {Kimeldorf, George S and Wahba, Grace},
journal = {The Annals of Mathematical Statistics},
volume = {41},
number = {2},
pages = {495--502},
year = {1970},
notes = {This seminal paper points out that splines can be interpreted as the posterior mean functions of certain Gaussian processes. The paper does not explicitly make the connection to the use of splines in numerics -- which means that many classic numerical methods in various areas can be interpreted as average-case Gaussian inference rules -- but later authors, most notably Diaconis, make it clear that they got the idea.}
}

8. Larkin, F. M. (1972). Gaussian measure in Hilbert space and applications in numerical analysis. Rocky Mountain Journal of Mathematics, 2(3), 379–422.

The numerical analyst is often called upon to estimate a function from a very limited knowledge of its properties (e.g. a finite number of ordinate values). This problem may be made well posed in a variety of ways, but an attractive approach is to regard the required function as a member of a linear space on which a probability measure is constructed, and then use established techniques of probability theory and statistics in order to infer properties of the function from the given information. This formulation agrees with established theory, for the problem of optimal linear approximation (using a Gaussian probability distribution), and also permits the estimation of nonlinear functionals, as well as extension to the case of "noisy" data.

@article{larkin1972gaussian,
title = {Gaussian measure in {H}ilbert space and applications in numerical analysis},
author = {Larkin, F. M.},
journal = {Rocky Mountain Journal of Mathematics},
volume = {2},
number = {3},
year = {1972},
pages = {379--422}
}

9. Kadane, J. B., & Wasilkowski, G. W. (1985). Average case epsilon-complexity in computer science: A Bayesian view. In Bayesian Statistics 2, Proceedings of the Second Valencia International Meeting (pp. 361–374).

Relations between average case epsilon-complexity and Bayesian statistics are discussed. An algorithm corresponds to a decision function, and the choice of information to the choice of an experiment. Adaptive information in epsilon-complexity theory corresponds to the concept of sequential experiment. Some results are reported, giving epsilon-complexity and minimax-Bayesian interpretations for factor analysis. Results from epsilon-complexity are used to establish the optimal sequential design is no better than optimal nonsequential design for that problem.

@inproceedings{Kadane1985a,
author = {Kadane, J. B. and Wasilkowski, G. W.},
booktitle = {Bayesian Statistics 2, Proceedings of the Second Valencia International Meeting},
number = {July},
pages = {361--374},
title = {{Average case epsilon-complexity in computer science: A Bayesian view}},
year = {1985},
}

10. Kadane, J. B. (1985). Parallel and Sequential Computation: A Statistician’s View. Journal of Complexity, 1, 256–263.

I borrow themes from statistics - especially the Bayesian ideas underlying average case analysis and ideas of sequential design of experiments - to discuss when parallel computation is likely to be an attractive technique.

@article{Kadane1985b,
journal = {Journal of Complexity},
pages = {256--263},
title = {{Parallel and Sequential Computation: A Statistician's View}},
volume = {1},
year = {1985},
}

11. Diaconis, P. (1988). Bayesian numerical analysis. Statistical Decision Theory and Related Topics IV, 1, 163–175.

This text is arguably the first to very explicitly discuss the notion of "Bayesian Numerics".

@article{diaconis1988bayesian,
title = {Bayesian numerical analysis},
author = {Diaconis, Persi},
journal = {Statistical decision theory and related topics IV},
volume = {1},
pages = {163--175},
year = {1988},
publisher = {Springer-Verlag, New York},
file = {../assets/pdf/Diaconis_1988.pdf},
notes = {This text is arguably the first to very explicitly discuss the notion of "Bayesian Numerics".}
}

12. O’Hagan, A. (1992). Some Bayesian numerical analysis. Bayesian Statistics, 4(345–363), 4–2.
@article{o1992some,
title = {Some Bayesian numerical analysis},
author = {O’Hagan, Anthony},
journal = {Bayesian Statistics},
volume = {4},
number = {345--363},
pages = {4--2},
file = {../assets/pdf/OHagan1991.pdf},
year = {1992}
}

13. Ritter, K. (2000). Average-case analysis of numerical problems. Springer.
@book{ritter2000average,
title = {Average-case analysis of numerical problems},
author = {Ritter, Klaus},
number = {1733},
year = {2000},
publisher = {Springer},
series = {Lecture Notes in Mathematics}
}

14. Hennig, P., Osborne, M. A., & Girolami, M. (2015). Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179).

We deliver a call to arms for probabilistic numerical methods: algorithms for numerical tasks, including linear algebra, integration, optimization and solving differential equations, that return uncertainties in their calculations. Such uncertainties, arising from the loss of precision induced by numerical calculation with limited time or hardware, are important for much contemporary science and industry. Within applications such as climate science and astrophysics, the need to make decisions on the basis of computations with large and complex data has led to a renewed focus on the management of numerical uncertainty. We describe how several seminal classic numerical methods can be interpreted naturally as probabilistic inference. We then show that the probabilistic view suggests new algorithms that can flexibly be adapted to suit application specifics, while delivering improved empirical performance. We provide concrete illustrations of the benefits of probabilistic numeric algorithms on real scientific problems from astrometry and astronomical imaging, while highlighting open problems with these new algorithms. Finally, we describe how probabilistic numerical methods provide a coherent framework for identifying the uncertainty in calculations performed with a combination of numerical algorithms (e.g. both numerical optimisers and differential equation solvers), potentially allowing the diagnosis (and control) of error sources in computations.

@article{HenOsbGirRSPA2015,
author = {Hennig, Philipp and Osborne, Michael A. and Girolami, Mark},
title = {Probabilistic numerics and uncertainty in computations},
volume = {471},
number = {2179},
year = {2015},
publisher = {The Royal Society},
journal = {Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences},
file = {http://rspa.royalsocietypublishing.org/content/royprsa/471/2179/20150142.full.pdf}
}

15. Owhadi, H., & Scovel, C. (2016). Toward Machine Wald. In Springer Handbook of Uncertainty Quantification (pp. 1–35). Springer.

The past century has seen a steady increase in the need of estimating and predicting complex systems and making (possibly critical) decisions with limited information. Although computers have made possible the numerical evaluation of sophisticated statistical models, these models are still designed by humans because there is currently no known recipe or algorithm for dividing the design of a statistical model into a sequence of arithmetic operations. Indeed enabling computers to think as humans, especially when faced with uncertainty, is challenging in several major ways: (1) Finding optimal statistical models remains to be formulated as a well-posed problem when information on the system of interest is incomplete and comes in the form of a complex combination of sample data, partial knowledge of constitutive relations and a limited description of the distribution of input random variables. (2) The space of admissible scenarios along with the space of relevant information, assumptions, and/or beliefs, tends to be infinite dimensional, whereas calculus on a computer is necessarily discrete and finite. With this purpose, this paper explores the foundations of a rigorous framework for the scientific computation of optimal statistical estimators/models and reviews their connections with decision theory, machine learning, Bayesian inference, stochastic optimization, robust optimization, optimal uncertainty quantification, and information-based complexity.

@incollection{Owhadi-Scovel-TowardsMachineWald,
author = {Owhadi, Houman and Scovel, Clint},
title = {Toward Machine {W}ald},
booktitle = {Springer Handbook of Uncertainty Quantification},
year = {2016},
pages = {1--35},
publisher = {Springer},
file = {http://arxiv.org/pdf/1508.02449v2.pdf}
}

16. Cockayne, J., Oates, C., Sullivan, T., & Girolami, M. (2017). Bayesian Probabilistic Numerical Methods. ArXiv e-Prints, stat.ME 1702.03673.

The emergent field of probabilistic numerics has thus far lacked rigorous statistical principals. This paper establishes Bayesian probabilistic numerical methods as those which can be cast as solutions to certain Bayesian inverse problems, albeit problems that are non-standard. This allows us to establish general conditions under which Bayesian probabilistic numerical methods are well-defined, encompassing both non-linear and non-Gaussian models. For general computation, a numerical approximation scheme is developed and its asymptotic convergence is established. The theoretical development is then extended to pipelines of computation, wherein probabilistic numerical methods are composed to solve more challenging numerical tasks. The contribution highlights an important research frontier at the interface of numerical analysis and uncertainty quantification, with some illustrative applications presented.

@article{2017arXiv170203673C,
author = {{Cockayne}, J. and {Oates}, C. and {Sullivan}, T. and {Girolami}, M.},
title = {{{B}ayesian Probabilistic Numerical Methods}},
journal = {ArXiv e-prints},
volume = {stat.ME 1702.03673},
year = {2017},
month = feb,
file = {https://arxiv.org/pdf/1702.03673.pdf}
}


1. O’Hagan, A. (1987). Monte Carlo is Fundamentally Unsound. Journal of the Royal Statistical Society. Series D (The Statistician), 36(2/3), pp. 247–249.

We present some fundamental objections to the Monte Carlo method of numerical integration.

@article{1987,
jstor_articletype = {research-article},
title = {Monte Carlo is Fundamentally Unsound},
author = {O'Hagan, A.},
journal = {Journal of the Royal Statistical Society. Series D (The Statistician)},
jstor_issuetitle = {Special Issue: Practical Bayesian Statistics},
volume = {36},
number = {2/3},
jstor_formatteddate = {1987},
pages = {pp. 247-249},
issn = {00390526},
language = {English},
year = {1987},
publisher = {Wiley for the Royal Statistical Society},
}

2. Kennedy, M. C., & O’Hagan, A. (1996). Iterative rescaling for Bayesian quadrature. Bayesian Statistics, 5, 639–645.
@article{kennedy1996iterative,
title = {Iterative rescaling for Bayesian quadrature},
author = {Kennedy, MC and O’Hagan, A},
journal = {Bayesian Statistics},
volume = {5},
pages = {639--645},
year = {1996},
publisher = {Oxford University Press Oxford}
}

3. Kennedy, M. (1998). Bayesian quadrature with non-normal approximating functions. Statistics and Computing, 8(4), 365–375.
@article{kennedy1998bayesian,
title = {Bayesian quadrature with non-normal approximating functions},
author = {Kennedy, Marc},
journal = {Statistics and Computing},
volume = {8},
number = {4},
pages = {365--375},
year = {1998},
publisher = {Springer}
}

4. O’Hagan, A. (1991). Bayes–Hermite quadrature. Journal of Statistical Planning and Inference, 29(3), 245–260.

Bayesian quadrature treats the problem of numerical integration as one of statistical inference. A prior Gaussian process distribution is assumed for the integrand, observations arise from evaluating the integrand at selected points, and a posterior distribution is derived for the integrand and the integral. Methods are developed for quadrature in p. A particular application is integrating the posterior density arising from some other Bayesian analysis. Simulation results are presented, to show that the resulting Bayes–Hermite quadrature rules may perform better than the conventional Gauss–Hermite rules for this application. A key result is derived for product designs, which makes Bayesian quadrature practically useful for integrating in several dimensions. Although the method does not at present provide a solution to the more difficult problem of quadrature in high dimensions, it does seem to offer real improvements over existing methods in relatively low dimensions.

@article{o1991bayes,
author = {O'Hagan, A.},
journal = {Journal of statistical planning and inference},
volume = {29},
number = {3},
pages = {245--260},
year = {1991}
}

5. Minka, T. P. (2000). Deriving quadrature rules from Gaussian processes. Statistics Department, Carnegie Mellon University.

Quadrature rules are often designed to achieve zero error on a small set of functions, e.g. polynomials of specified degree. A more robust method is to minimize average error over a large class or distribution of functions. If functions are distributed according to a Gaussian process, then designing an average-case quadrature rule reduces to solving a system of 2n equations, where n is the number of nodes in the rule (O’Hagan, 1991). It is shown how this very general technique can be used to design customized quadrature rules, in the style of Yarvin & Rokhlin (1998), without the need for singular value decomposition and in any number of dimensions. It is also shown how classical Gaussian quadrature rules, trigonometric lattice rules, and spline rules can be extended to the average-case and to multiple dimensions by deriving them from Gaussian processes. In addition to being more robust, multidimensional quadrature rules designed for the average-case are found to be much less ambiguous than those designed for a given polynomial degree.

@techreport{minka2000deriving,
author = {Minka, T.P.},
institution = {Statistics Department, Carnegie Mellon University},
title = {{Deriving quadrature rules from {G}aussian processes}},
year = {2000},
}

6. Ghahramani, Z., & Rasmussen, C. E. (2002). Bayesian Monte Carlo. In Advances in neural information processing systems (pp. 489–496).

We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the in- corporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more chal- lenging multidimensional integrals involved in computing marginal like- lihoods of statistical models (a.k.a. partition functions and model evi- dences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

@inproceedings{ghahramani2002bayesian,
title = {Bayesian {Monte Carlo}},
author = {Ghahramani, Zoubin and Rasmussen, Carl E},
booktitle = {Advances in neural information processing systems},
pages = {489--496},
file = {http://machinelearning.wustl.edu/mlpapers/paper_files/AA01.pdf},
year = {2002}
}

7. Kong, A., McCullagh, P., Meng, X.-L., Nicolae, D. L., & Tan, Z. (2003). A theory of statistical models for Monte Carlo integration. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 65(3), 585–618.

The task of estimating an integral by Monte Carlo methods is formulated as a statistical model using simulated observations as data. The difficulty in this exercise is that we ordinarily have at our disposal all of the information required to compute integrals exactly by calculus or numerical integration, but we choose to ignore some of the information for simplicity or computational feasibility. Our proposal is to use a semiparametric statistical model that makes explicit what information is ignored and what information is retained. The parameter space in this model is a set of measures on the sample space, which is ordinarily an infinite dimensional object. None-the-less, from simulated data the base-line measure can be estimated by maximum likelihood, and the required integrals computed by a simple formula previously derived by Vardi and by Lindsay in a closely related model for biased sampling. The same formula was also suggested by Geyer and by Meng and Wong using entirely different arguments. By contrast with Geyer’s retrospective likelihood, a correct estimate of simulation error is available directly from the Fisher information. The principal advantage of the semiparametric model is that variance reduction techniques are associated with submodels in which the maximum likelihood estima-tor in the submodel may have substantially smaller variance than the traditional estimator. The method is applicable to Markov chain and more general Monte Carlo sampling schemes with multiple samplers.

@article{Kong2003,
author = {Kong, Augustine and McCullagh, Peter and Meng, Xiao-Li and Nicolae, Dan L. and Tan, Zhiquiang},
journal = {Journal of the Royal Statistical Society, Series B (Statistical Methodology)},
number = {3},
pages = {585--618},
title = {{A theory of statistical models for Monte Carlo integration}},
volume = {65},
year = {2003},
}

8. Tan, Z. (2004). On a Likelihood Approach for Monte Carlo Integration. Journal of the American Statistical Association, 99(468), 1027–1036.

The use of estimating equations has been a common approach for constructing Monte Carlo estimators. Recently, Kong et al. proposed a formulation of Monte Carlo integration as a statistical model, making explicit what information is ignored and what is retained about the baseline measure. From simulated data, the baseline measure is estimated by maximum likelihood, and then integrals of interest are estimated by substituting the estimated measure. For two different situations in which independent observations are simulated from multiple distributions, we show that this likelihood approach achieves the lowest asymptotic variance possible by using estimating equations. In the first situation, the normalizing constants of the design distributions are estimated, and Meng andWong’s bridge sampling estimating equation is considered. In the second situation, the values of the normalizing constants are known, thereby imposing linear constraints on the baseline measure. Estimating equations including Hesterberg’s stratified importance sampling estimator, Veach and Guibas’s multiple importance sampling estimator, and Owen and Zhou’s method of control variates are considered.

@article{Tan2004,
author = {Tan, Zhiqiang},
journal = {Journal of the American Statistical Association},
number = {468},
pages = {1027--1036},
title = {{On a Likelihood Approach for Monte Carlo Integration}},
volume = {99},
year = {2004},
}

9. Kong, A., McCullagh, P., Meng, X.-L., & Nicolae, D. L. (2007). Further Explorations of Likelihood Theory for Monte Carlo Integration. Advances in Statistical Modelling and Inference, (March), 563–592.

Monte-Carlo estimation of an integral is usually based on the method of moments or an estimating equation. Recently, Kong, McCullagh, Meng, Nicolae and Tan (2003) proposed a likelihood based theory, which puts Monte-Carlo estimation of integrals on a firmer, less ad hoc, basis by formulating the problem as a likelihood inference problems for the baseline measure with simulated observations as data. In this paper, we provide further exploration and development of this theory. After an overview of the likelihood formulation, we first demonstrate the power of the likelihood-based method by presenting a universally improved importance sampling estimator. We then prove that the formal, infinite-dimensional Fisher-information based variance calculation given in Kong et al. (2003) is asymptotically the same as the sampling based "sandwich" variance estimator. Next, we explore the gain in Monte Carlo efficiency when the baseline measure can be parameterized. Furthermore, we show how the Monte Carlo integration problem can also be dealt with by the method of empirical likelihood, and how the baseline measure parameter can be properly profiled out to form a profile likelihood for the integrals only. As a byproduct, we obtain four equivalent conditions for the existence of unique maximum likelihood estimate for mixture models with known components. We also discuss an apparent paradox for Bayesian inference with Monte Carlo integration.

@article{Kong2007,
author = {Kong, Augustine and McCullagh, Peter and Meng, Xiao-Li and Nicolae, Dan L.},
journal = {Advances in Statistical Modelling and Inference},
number = {March},
pages = {563--592},
title = {{Further Explorations of Likelihood Theory for Monte Carlo Integration}},
year = {2007},
}

10. Osborne, M. A., Garnett, R., Roberts, S. J., Hart, C., Aigrain, S., & Gibson, N. (2012). Bayesian quadrature for ratios. In International Conference on Artificial Intelligence and Statistics (pp. 832–840).

We describe a novel approach to quadrature for ratios of probabilistic integrals, such as are used to compute posterior probabilities. This approach offers performance superior to Monte Carlo methods by exploiting a Bayesian quadrature framework. We improve upon previous Bayesian quadrature techniques by explicitly modelling the non-negativity of our integrands, and the correlations that exist between them. It offers most where the integrand is multi-modal and expensive to evaluate. We demonstrate the efficacy of our method on data from the Kepler space telescope.

@inproceedings{osborne2012bayesian,
author = {Osborne, M.A. and Garnett, R. and Roberts, S.J. and Hart, C. and Aigrain, S. and Gibson, N.},
booktitle = {{International Conference on Artificial Intelligence and
Statistics}},
pages = {832--840},
title = {{Bayesian quadrature for ratios}},
year = {2012},
file = {../assets/pdf/Osborne2012Bayesian.pdf}
}

11. Osborne, M. A., Duvenaud, D. K., Garnett, R., Rasmussen, C. E., Roberts, S. J., & Ghahramani, Z. (2012). Active Learning of Model Evidence Using Bayesian Quadrature. In Advances in Neural Information Processing Systems (NIPS) (pp. 46–54).

Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model’s hyperparameters in closed form, and introduces an ac- tive learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.

@inproceedings{osborne2012active,
author = {Osborne, M.A. and Duvenaud, D.K. and Garnett, R. and Rasmussen, C.E. and Roberts, S.J. and Ghahramani, Z.},
booktitle = {{Advances in Neural Information Processing Systems (NIPS)}},
pages = {46--54},
title = {{Active Learning of Model Evidence Using Bayesian
year = {2012},
file = {../assets/pdf/Osborne2012active.pdf
}
}

12. Särkkä, S., Hartikainen, J., Svensson, L., & Sandblom, F. (2014). Gaussian Process Quadratures in Nonlinear Sigma-Point Filtering and Smoothing. In FUSION.

This paper is concerned with the use of Gaussian process regression based quadrature rules in the context of sigma- point-based nonlinear Kalman filtering and smoothing. We show how Gaussian process (i.e., Bayesian or Bayes–Hermite) quadratures can be used for numerical solving of the Gaussian integrals arising in the filters and smoothers. An interesting additional result is that with suitable selections of Hermite polynomial covariance functions the Gaussian process quadratures can be reduced to unscented transforms, spherical cubature rules, and to Gauss- Hermite rules previously proposed for approximate nonlinear Kalman filter and smoothing. Finally, the performance of the Gaussian process quadratures in this context is evaluated with numerical simulations.

@inproceedings{sarkkagaussian,
title = {Gaussian Process Quadratures in Nonlinear Sigma-Point
Filtering and Smoothing},
author = {S{\"a}rkk{\"a}, Simo and Hartikainen, Jouni and Svensson, Lennart and Sandblom, Fredrik},
booktitle = {FUSION},
year = {2014},
}

13. Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. (2014). Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In C. Cortes & N. Lawrence (Eds.), Advances in Neural Information Processing Systems (NIPS).

We propose a novel sampling framework for inference in probabilistic models: an active learning approach that converges more quickly (in wall-clock time) than Markov chain Monte Carlo (MCMC) benchmarks. The central challenge in probabilistic inference is numerical integration, to average over ensembles of models or unknown (hyper-)parameters (for example to compute marginal likelihood or a partition function). MCMC has provided approaches to numerical integration that deliver state-of-the-art inference, but can suffer from sample inefficiency and poor convergence diagnostics. Bayesian quadrature techniques offer a model-based solution to such problems, but their uptake has been hindered by prohibitive computation costs. We introduce a warped model for probabilistic integrands (likelihoods) that are known to be non-negative, permitting a cheap active learning scheme to optimally select sample locations. Our algorithm is demonstrated to offer faster convergence (in seconds) relative to simple Monte Carlo and annealed importance sampling on both synthetic and real-world examples.

@inproceedings{gunter14-fast-bayesian-quadrature,
author = {Gunter, Tom and Osborne, Michael A. and Garnett, Roman and Hennig, Philipp and Roberts, Stephen},
title = {Sampling for Inference in Probabilistic Models with Fast
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
year = {2014},
editor = {Cortes, C. and Lawrence, N.},
code = {https://github.com/OxfordML/wsabi}
}

14. Oates, C. J., Girolami, M., & Chopin, N. (2014). Control functionals for Monte Carlo integration. ArXiv Preprint 1410.2392.

This paper introduces a novel class of estimators for Monte Carlo integration, that leverage gradient information in order to provide the improved estimator performance demanded by contemporary statistical applications. The proposed estimators, called "control functionals", achieve sub-root-n convergence and often require orders of magnitude fewer simulations, compared with existing approaches, in order to achieve a fixed level of precision. We focus on a particular sub-class of estimators that permit an elegant analytic form and study their properties, both theoretically and empirically. Results are presented on Bayes-Hermite quadrature, hierarchical Gaussian process models and non-linear ordinary differential equation models, where in each case our estimators are shown to offer state of the art performance.

@article{oates14-contr-monte-carlo,
author = {Oates, Chris J. and Girolami, Mark and Chopin, Nicolas},
title = {Control functionals for Monte Carlo integration},
journal = {arXiv preprint 1410.2392},
year = {2014},
file = {http://arxiv.org/pdf/1410.2392v1}
}

15. Särkkä, S., Hartikainen, J., Svensson, L., & Sandblom, F. (2015). On the relation between Gaussian process quadratures and sigma-point methods. ArXiv Preprint Stat.ME 1504.05994.

This article is concerned with Gaussian process quadratures, which are numerical integration methods based on Gaussian process regression methods, and sigma-point methods, which are used in advanced non-linear Kalman filtering and smoothing algorithms. We show that many sigma-point methods can be interpreted as Gaussian quadrature based methods with suitably selected covariance functions. We show that this interpretation also extends to more general multivariate Gauss–Hermite integration methods and related spherical cubature rules. Additionally, we discuss different criteria for selecting the sigma-point locations: exactness for multivariate polynomials up to a given order, minimum average error, and quasi-random point sets. The performance of the different methods is tested in numerical experiments.

@article{2015arXiv150405994S,
author = {{S{\"a}rkk{\"a}}, S. and {Hartikainen}, J. and {Svensson}, L. and {Sandblom}, F.},
title = {{On the relation between Gaussian process quadratures and sigma-point methods}},
journal = {arXiv preprint stat.ME 1504.05994},
year = {2015},
month = apr,
file = {http://arxiv.org/pdf/1504.05994v1.pdf}
}

16. Eslami, S. M. A., Tarlow, D., Kohli, P., & Winn, J. (2014). Just-In-Time Learning for Fast and Flexible Inference. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (NIPS) 27 (pp. 154–162).

Much of research in machine learning has centered around the search for inference algorithms that are both general-purpose and efficient. The problem is extremely challenging and general inference remains computationally expensive. We seek to address this problem by observing that in most specific applications of a model, we typically only need to perform a small subset of all possible inference computations. Motivated by this, we introduce just-in-time learning, a framework for fast and flexible inference that learns to speed up inference at run-time. Through a series of experiments, we show how this framework can allow us to combine the flexibility of sampling with the efficiency of deterministic message-passing.

@inproceedings{NIPS2014_5595,
title = {Just-In-Time Learning for Fast and Flexible Inference},
author = {Eslami, S. M. Ali and Tarlow, Daniel and Kohli, Pushmeet and Winn, John},
booktitle = {Advances in Neural Information Processing Systems (NIPS) 27},
editor = {Ghahramani, Z. and Welling, M. and Cortes, C. and Lawrence, N.D. and Weinberger, K.Q.},
pages = {154--162},
year = {2014},
file = {http://papers.nips.cc/paper/5595-just-in-time-learning-for-fast-and-flexible-inference.pdf}
}

17. Jitkrittum, W., Gretton, A., Heess, N., Eslami, S. M. A., Lakshminarayanan, B., Sejdinovic, D., & Szabó, Z. (2015). Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages. In Uncertainty in Artificial Intelligence (UAI) 31.

We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.

@inproceedings{Jitkrittum1503,
author = {{Jitkrittum}, W. and {Gretton}, A. and {Heess}, N. and {Eslami}, S.~M.~A. and {Lakshminarayanan}, B. and {Sejdinovic}, D. and {Szab{\'o}}, Z.},
title = {{Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages}},
booktitle = {Uncertainty in Artificial Intelligence (UAI) 31},
year = {2015},
file = {http://auai.org/uai2015/proceedings/papers/235.pdf}
}

18. Briol, F.-X., Oates, C. J., Girolami, M., & Osborne, M. A. (2015). Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Advances in Neural Information Processing Systems (NIPS). Retrieved from http://arxiv.org/abs/1506.02681

There is renewed interest in formulating integration as an inference problem, motivated by obtaining a full distribution over numerical error that can be propagated through subsequent computation. Current methods, such as Bayesian Quadrature, demonstrate impressive empirical performance but lack theoretical analysis. An important challenge is to reconcile these probabilistic integrators with rigorous convergence guarantees. In this paper, we present the first probabilistic integrator that admits such theoretical treatment, called Frank-Wolfe Bayesian Quadrature (FWBQ). Under FWBQ, convergence to the true value of the integral is shown to be exponential and posterior contraction rates are proven to be superexponential. In simulations, FWBQ is competitive with state-of-the-art methods and out-performs alternatives based on Frank-Wolfe optimisation. Our approach is applied to successfully quantify numerical error in the solution to a challenging model choice problem in cellular biology.

@inproceedings{briol_frank-wolfe_2015,
title = {Frank-{Wolfe} {Bayesian} {Quadrature}: {Probabilistic} {Integration} with {Theoretical} {Guarantees}},
url = {http://arxiv.org/abs/1506.02681},
booktitle = {Advances in Neural Information Processing Systems (NIPS)},
author = {Briol, François-Xavier and Oates, Chris J. and Girolami, Mark and Osborne, Michael A.},
year = {2015},
keywords = {Statistics - Machine Learning},
file = {http://arxiv.org/pdf/1506.02681v1.pdf}
}

19. Bach, F. (2015). On the Equivalence between Quadrature Rules and Random Features. ArXiv Preprint ArXiv:1502.06800.

We show that kernel-based quadrature rules for computing in tegrals can be seen as a special case of random feature expansions for positive definite kernels, for a particular decomposition that always exists for such kernels. We provide a theoretical analysis of the number of required samples for a given approximation error, leading to both upper and lower bounds that are based solely on the eigenvalues of the associated integral operator and match up to logarithmic terms. In particular, we show that the upper bound may be obtained from independent an d identically distributed samples from a specific non-uniform distribution, while the lower bo und if valid for any set of points. Applying our results to kernel-based quadrature, while our results are fairly general, we recover known upper and lower bounds for the special cases of Sobolev spaces. Moreover, our results extend to the more general problem of full function approxim ations (beyond simply computing an integral), with results in L2- and L∞-norm that match known results for special cases. Applying our results to random features, we show an improvement of the number of random features needed to preserve the generalization guarantees for learning with Lipshitz-continuous losses.

@article{bach2015equivalence,
title = {On the Equivalence between Quadrature Rules and Random Features},
author = {Bach, Francis},
journal = {arXiv preprint arXiv:1502.06800},
file = {http://arxiv.org/pdf/1502.06800v2.pdf},
year = {2015}
}

20. Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., & Sejdinovic, D. (2015). Probabilistic Integration: A Role for Statisticians in Numerical Analysis? ArXiv:1512.00933 [Cs, Math, Stat]. Retrieved from http://arxiv.org/abs/1512.00933

A research frontier has emerged in scientific computation, founded on the principle that numerical error entails epistemic uncertainty that ought to be subjected to statistical analysis. This viewpoint raises several interesting challenges, including the design of statistical methods that enable the coherent propagation of probabilities through a (possibly deterministic) computational pipeline. This paper examines thoroughly the case for probabilistic numerical methods in statistical computation and a specific case study is presented for Markov chain and Quasi Monte Carlo methods. A probabilistic integrator is equipped with a full distribution over its output, providing a measure of epistemic uncertainty that is shown to be statistically valid at finite computational levels, as well as in asymptotic regimes. The approach is motivated by expensive integration problems, where, as in krigging, one is willing to expend, at worst, cubic computational effort in order to gain uncertainty quantification. There, probabilistic integrators enjoy the "best of both worlds", leveraging the sampling efficiency of Monte Carlo methods whilst providing a principled route to assessment of the impact of numerical error on scientific conclusions. Several substantial applications are provided for illustration and critical evaluation, including examples from statistical modelling, computer graphics and uncertainty quantification in oil reservoir modelling.

@article{briol_probabilistic_2015,
title = {Probabilistic {Integration}: A Role for Statisticians in Numerical Analysis?},
url = {http://arxiv.org/abs/1512.00933},
urldate = {2015-07-22},
journal = {arXiv:1512.00933 [cs, math, stat]},
author = {Briol, François-Xavier and Oates, Chris J. and Girolami, Mark and Osborne, Michael A. and Sejdinovic, Dino},
year = {2015},
note = {arXiv: 1512.00933},
file = {http://arxiv.org/pdf/1512.00933.pdf}
}

21. Prüher, J., & Šimandl, M. (2016). Bayesian Quadrature Variance in Sigma-Point Filtering. In J. Filipe, K. Madani, O. Gusikhin, & J. Sasiadek (Eds.), International Conference on Informatics in Control, Automation and Robotics (ICINCO) Revised Selected Papers (Vol. 12, pp. 355–370). Colmar, France: Springer International Publishing.

Sigma-point filters are algorithms for recursive state estimation of the stochastic dynamic systems from noisy measurements, which rely on moment integral approximations by means of various numerical quadrature rules. In practice, however, it is hardly guaranteed that the system dynamics or measurement functions will meet the restrictive requirements of the classical quadratures, which inevitably results in approximation errors that are not accounted for in the current state-of-the-art sigma-point filters. We propose a method for incorporating information about the integral approximation error into the filtering algorithm by exploiting features of a Bayesian quadrature—an alternative to classical numerical integration. This is enabled by the fact that the Bayesian quadrature treats numerical integration as a statistical estimation problem, where the posterior distribution over the values of the integral serves as a model of numerical error. We demonstrate superior performance of the proposed filters on a simple univariate benchmarking example.

@incollection{Pruher2016,
author = {Pr{\"u}her, Jakub and {\v{S}}imandl, Miroslav},
title = {{Bayesian Quadrature Variance in Sigma-Point Filtering}},
editor = {Filipe, Joaquim and Madani, Kurosh and Gusikhin, Oleg and Sasiadek, Jurek},
booktitle = {International Conference on Informatics in Control, Automation and Robotics (ICINCO) Revised Selected Papers},
volume = {12},
pages = {355--370},
publisher = {Springer International Publishing},
year = {2016},
code = {https://github.com/jacobnzw/icinco-code}
}

22. Oates, C. J., Briol, F.-X., & Girolami, M. (2016). Probabilistic Integration and Intractable Distributions. ArXiv e-Prints, stat.ME 1606.06841.

This paper considers numerical approximation for integrals of the form $∫f(x) p(\mathrmdx) in the case where f(x) is an expensive black-box function and p(\mathrmdx) is an intractable distribution (meaning that it is accessible only through a finite collection of samples). Our proposal extends previous work that treats numerical integration as a problem of statistical inference, in that we model both f as an a priori unknown random function and p$ as an a priori unknown random distribution. The result is a posterior distribution over the value of the integral that accounts for these dual sources of approximation error. This construction is designed to enable the principled quantification and propagation of epistemic uncertainty due to numerical error through a computational pipeline. The work is motivated by such problems that occur in the Bayesian calibration of computer models.

@article{2016arXiv160606841O,
author = {{Oates}, C.~J. and {Briol}, F.-X. and {Girolami}, M.},
title = {{Probabilistic Integration and Intractable Distributions}},
journal = {ArXiv e-prints},
volume = {stat.ME 1606.06841},
year = {2016},
month = jun,
file = {http://arxiv.org/pdf/1606.06841v2},
}

23. Kanagawa, M., Sriperumbudur, B. K., & Fukumizu, K. (2016). Convergence guarantees for kernel-based quadrature rules in misspecified settings. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 3288–3296). Curran Associates, Inc.

Kernel-based quadrature rules are becoming important in machine learning and statistics, as they achieve super-\sqrtn convergence rates in numerical integration, and thus provide alternatives to Monte Carlo integration in challenging settings where integrands are expensive to evaluate or where integrands are high dimensional. These rules are based on the assumption that the integrand has a certain degree of smoothness, which is expressed as that the integrand belongs to a certain reproducing kernel Hilbert space (RKHS). However, this assumption can be violated in practice (e.g., when the integrand is a black box function), and no general theory has been established for the convergence of kernel quadratures in such misspecified settings. Our contribution is in proving that kernel quadratures can be consistent even when the integrand does not belong to the assumed RKHS, i.e., when the integrand is less smooth than assumed. Specifically, we derive convergence rates that depend on the (unknown) lesser smoothness of the integrand, where the degree of smoothness is expressed via powers of RKHSs or via Sobolev spaces.

@incollection{NIPS2016-6174,
title = {Convergence guarantees for kernel-based quadrature rules in misspecified settings},
author = {Kanagawa, Motonobu and Sriperumbudur, Bharath K. and Fukumizu, Kenji},
booktitle = {Advances in Neural Information Processing Systems 29},
editor = {Lee, D. D. and Sugiyama, M. and Luxburg, U. V. and Guyon, I. and Garnett, R.},
pages = {3288--3296},
year = {2016},
publisher = {Curran Associates, Inc.},
}


## Linear Algebra

1. Hennig, P. (2015). Probabilistic Interpretation of Linear Solvers. SIAM J on Optimization, 25(1).

This paper proposes a probabilistic framework for algorithms that iteratively solve unconstrained linear problems Bx = b with positive definite B for x. The goal is to replace the point estimates returned by existing methods with a Gaussian posterior belief over the elements of the inverse of B, which can be used to estimate errors. Recent probabilistic interpretations of the secant family of quasi-Newton optimization algorithms are extended. Combined with properties of the conjugate gradient algorithm, this leads to uncertainty-calibrated methods with very limited cost overhead over conjugate gradients, a self-contained novel interpretation of the quasi-Newton and conjugate gradient algorithms, and a foundation for new nonlinear optimization methods.

@article{2014arXiv14022058H,
author = {{Hennig}, P.},
journal = {SIAM J on Optimization},
month = jan,
title = {{Probabilistic Interpretation of Linear Solvers}},
year = {2015},
volume = {25},
issue = {1},
file = {http://probabilistic-numerics.org/assets/pdf/HennigLinear2015.pdf}
}

2. Bartels, S., & Hennig, P. (2016). Probabilistic Approximate Least-Squares. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016) (Vol. 51, pp. 676–684).

Least-squares and kernel-ridge / Gaussian process regression are among the foundational algorithms of statistics and machine learning. Famously, the worst-case cost of exact nonparametric regression grows cubically with the data-set size; but a growing number of approximations have been developed that estimate good solutions at lower cost. These algorithms typically return point estimators, without measures of uncertainty. Leveraging recent results casting elementary linear algebra operations as probabilistic inference, we propose a new approximate method for nonparametric least-squares that affords a probabilistic uncertainty estimate over the error between the approximate and exact least-squares solution (this is not the same as the posterior variance of the associated Gaussian process regressor). This allows estimating the error of the least-squares solution on a subset of the data relative to the full-data solution. The uncertainty can be used to control the computational effort invested in the approximation. Our algorithm has linear cost in the data-set size, and a simple formal form, so that it can be implemented with a few lines of code in programming languages with linear algebra functionality.

@conference{BarHen16,
title = {Probabilistic Approximate Least-Squares},
author = {Bartels, S. and Hennig, P.},
booktitle = {Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS 2016)},
volume = {51},
pages = {676--684},
series = {JMLR Workshop and Conference Proceedings},
editors = {Gretton, A. and Robert, C. C. },
year = {2016},
file = {http://jmlr.org/proceedings/papers/v51/bartels16.pdf}
}


## Optimization

1. Hennig, P., & Kiefel, M. (2012). Quasi-Newton methods – a new direction. In International Conference on Machine Learning (ICML).

Four decades after their invention, quasi- Newton methods are still state of the art in unconstrained numerical optimization. Al- though not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of clas- sical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors.

@inproceedings{HennigKiefel,
author = {Hennig, P. and Kiefel, M.},
booktitle = {{International Conference on Machine Learning (ICML)}},
title = {{Quasi-{N}ewton methods -- a new direction}},
year = {2012},
video = {http://techtalks.tv/talks/quasi-newton-methods-a-new-direction/57289/},
file = {../assets/pdf/hennig13quasiNewton.pdf}
}

2. Hennig, P., & Kiefel, M. (2013). Quasi-Newton Methods – a new direction. Journal of Machine Learning Research, 14, 834–865.

Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors.

@article{hennig13_quasi_newton_method,
author = {Hennig, P. and Kiefel, M.},
journal = {Journal of Machine Learning Research},
month = mar,
pages = {834--865},
title = {Quasi-{N}ewton Methods -- a new direction},
volume = {14},
year = {2013},
file = {http://jmlr.org/papers/volume14/hennig13a/hennig13a.pdf}
}

3. Hennig, P. (2013). Fast Probabilistic Optimization from Noisy Gradients. In International Conference on Machine Learning (ICML).

Stochastic gradient descent remains popular in large-scale machine learning, on account of its very low computational cost and robust- ness to noise. However, gradient descent is only linearly efficient and not transformation invariant. Scaling by a local measure can substantially improve its performance. One natural choice of such a scale is the Hessian of the objective function: Were it available, it would turn linearly efficient gradient descent into the quadratically efficient Newton-Raphson optimization. Existing covariant methods, though, are either super-linearly expensive or do not address noise. Generalising recent results, this paper constructs a nonparametric Bayesian quasi-Newton algorithm that learns gradient and Hessian from noisy evaluations of the gradient. Importantly, the resulting algorithm, like stochastic gradient descent, has cost linear in the number of input dimensions

@inproceedings{StochasticNewton,
author = {Hennig, P.},
booktitle = {{International Conference on Machine Learning (ICML)}},
title = {{Fast Probabilistic Optimization from Noisy Gradients}},
year = {2013},
file = {../assets/pdf/hennig13noisy.pdf}
}

4. Mahsereci, M., & Hennig, P. (2015). Probabilistic Line Searches for Stochastic Optimization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 181–189). Curran Associates, Inc.

In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.

@incollection{NIPS2015_5753,
title = {Probabilistic Line Searches for Stochastic Optimization},
author = {Mahsereci, Maren and Hennig, Philipp},
booktitle = {Advances in Neural Information Processing Systems 28},
editor = {Cortes, C. and Lawrence, N. D. and Lee, D. D. and Sugiyama, M. and Garnett, R.},
pages = {181--189},
year = {2015},
publisher = {Curran Associates, Inc.},
file = {http://papers.nips.cc/paper/5753-probabilistic-line-searches-for-stochastic-optimization.pdf},
}


## Ordinary Differential Equations

To avoid a frequent initial confusion for new readers, it may be helpful to point out that there are two common ways in which probabilistic methods are used in combination with ordinary differential equations: The “classic” problem of numerics is to infer the solution to an initial value problem given access to the differential equation. Below, we term this problem “solving ODEs”. The reverse problem, in some sense, has also found interest in machine learning: inferring a differential equation from (noisy) observations of trajectories that are assumed to be governed by this ODE. Below, this is listed under “inferring ODEs”.

### Work regarding "solving ODEs"

1. Hull, T. E., & Swenson, J. R. (1966). Test of Probabilistic Models for the Propagation of Roundoff Errors. Communications of the ACM, 9(2), 108–113.

In any prolonged computation it is generally assumed that the accumulated effect of roundoff errors is in some sense statistical. The purpose of this paper is to give precise descriptions of certain probabilistic models for roundoff error, and then to described a series of experiments for testing the validity of these models. It is concluded that the models are in general very good. Discrepancies are both rare and mild. The test techniques can also be used to experiment with various types of special arithmetic.

@article{Hull1966,
author = {Hull, T. E. and Swenson, J. R.},
journal = {Communications of the ACM},
number = {2},
pages = {108--113},
title = {{Test of Probabilistic Models for the Propagation of Roundoff Errors}},
volume = {9},
year = {1966},
}

2. Kuki, H., & Cody, W. J. (1973). A Statistical Study Of The Accuracy Of Floating Point Number Systems. Communications of the ACM, 16(1), 223–230.

This paper presents the statistical results of tests of the accuracy of certain arithmetic systems in evaluating sums, products and inner products, and analytic error estimates for some of the computations. The arithmetic systems studied are 6-digit hexadecimal and 22-digit binary floating point number representations combined with the usual chop and round modes of arithmetic with various numbers of guard digits, and with a modified round mode with guard digits. In a certain sense, arithmetic systems differing only in their use of binary or hexadecimal number representations are shown to be approximately statistically equivalent in accuracy. Further, the usual round mode with guard digits is shown to be statistically superior in accuracy to the usual chop mode in all cases save one. The modified round mode is found to be superior to the chop mode in all cases.

@article{Kuki1973,
author = {Kuki, H. and Cody, W. J.},
journal = {Communications of the ACM},
number = {1},
pages = {223--230},
title = {{A Statistical Study Of The Accuracy Of Floating Point Number Systems}},
volume = {16},
year = {1973},
}

3. Skilling, J. (1991). Bayesian solution of ordinary differential equations. Maximum Entropy and Bayesian Methods, Seattle.

In the numerical solution of ordinary differential equations, a function y(x) is to be reconstructed from knowledge of the functional form of its derivative: dy/dx=f(x,y), together with an appropriate boundary condition. The derivative f is evaluated at a sequence of suitably chosen points (x_k,y_k), from which the form of y(.) is estimated. This is an inference problem, which can and perhaps should be treated by Bayesian techniques. As always, the inference appears as a probability distribution prob(y(.)), from which random samples show the probabilistic reliability of the results. Examples are given.

@article{skilling1991bayesian,
author = {Skilling, J.},
journal = {Maximum Entropy and Bayesian Methods, Seattle},
title = {{Bayesian solution of ordinary differential equations}},
year = {1991}
}

4. Mosbach, S., & Turner, A. G. (2009). A quantitative probabilistic investigation into the accumulation of rounding errors in numerical ODE solution. Computers &Amp; Mathematics with Applications, 57(7), 1157–1167.

We examine numerical rounding errors of some deterministic solvers for systems of ordinary differential equations (ODEs). We show that the accumulation of rounding errors results in a solution that is inherently random and we obtain the theoretical distribution of the trajectory as a function of time, the step size and the numerical precision of the computer. We consider, in particular, systems which amplify the effect of the rounding errors so that over long time periods the solutions exhibit divergent behaviour. By performing multiple repetitions with different values of the time step size, we observe numerically the random distributions predicted theoretically. We mainly focus on the explicit Euler and RK4 methods but also briefly consider more complex algorithms such as the implicit solvers VODE and RADAU5.

@article{Mosbach2009,
author = {Mosbach, Sebastian and Turner, Amanda G.},
journal = {Computers {\&} Mathematics with Applications},
number = {7},
pages = {1157--1167},
title = {{A quantitative probabilistic investigation into the accumulation of rounding errors in numerical ODE solution}},
volume = {57},
year = {2009},
}

5. Hennig, P., & Hauberg, S. (2014). Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. In Proc. of the 17th int. Conf. on Artificial Intelligence and Statistics (AISTATS) (Vol. 33). JMLR, W&CP.

We study a probabilistic numerical method for the solution of both boundary and initial value problems that returns a joint Gaussian process posterior over the solution. Such methods have concrete value in the statistics on Riemannian manifolds, where non-analytic ordinary differential equations are involved in virtually all computations. The probabilistic formulation permits marginalising the uncertainty of the numerical solution such that statistics are less sensitive to inaccuracies. This leads to new Riemannian algorithms for mean value computations and principal geodesic analysis. Marginalisation also means results can be less precise than point estimates, enabling a noticeable speed-up over the state of the art. Our approach is an argument for a wider point that uncertainty caused by numerical calculations should be tracked throughout the pipeline of machine learning algorithms.

@inproceedings{HennigAISTATS2014,
author = {Hennig, Philipp and Hauberg, S{\o}ren},
booktitle = {{Proc. of the 17th int. Conf. on Artificial Intelligence and
Statistics ({AISTATS})}},
publisher = {JMLR, W\&CP},
title = {{Probabilistic Solutions to Differential Equations and their
Application to Riemannian Statistics}},
volume = {33},
year = {2014},
file = {http://jmlr.org/proceedings/papers/v33/hennig14.pdf},
code = {http://www.probabilistic-numerics.org/GP_ODE_Solver.zip},
supplements = {http://www.probabilistic-numerics.org/22-supp.zip}
}

6. Chkrebtii, O., Campbell, D. A., Girolami, M. A., & Calderhead, B. (2013). Bayesian Uncertainty Quantification for Differential Equations. Bayesin Analysis (Discussion Paper), in press.

This paper advocates expansion of the role of Bayesian statistical inference when formally quantifying uncertainty in computer models defined by systems of ordinary or partial differential equations. We adopt the perspective that implicitly defined infinite dimensional functions representing model states are objects to be inferred probabilistically. We develop a general methodology for the probabilistic integration of differential equations via model based updating of a joint prior measure on the space of functions and their temporal and spatial derivatives. This results in a posterior measure over functions reflecting how well they satisfy the system of differential equations and corresponding initial and boundary values. We show how this posterior measure can be naturally incorporated within the Kennedy and O’Hagan framework for uncertainty quantification and provides a fully Bayesian approach to model calibration. By taking this probabilistic viewpoint, the full force of Bayesian inference can be exploited when seeking to coherently quantify and propagate epistemic uncertainty in computer models of complex natural and physical systems. A broad variety of examples are provided to illustrate the potential of this framework for characterising discretization uncertainty, including initial value, delay, and boundary value differential equations, as well as partial differential equations. We also demonstrate our methodology on a large scale system, by modeling discretization uncertainty in the solution of the Navier-Stokes equations of fluid flow, reduced to over 16,000 coupled and stiff ordinary differential equations. Finally, we discuss the wide range of open research themes that follow from the work presented.

@article{13_bayes_uncer_quant_differ_equat,
author = {Chkrebtii, O. and Campbell, D.A. and Girolami, M.A. and Calderhead, B.},
journal = {Bayesin Analysis (discussion paper)},
title = {{B}ayesian Uncertainty Quantification for Differential
Equations},
year = {2013},
pages = {in press},
}

7. Schober, M., Kasenburg, N., Feragen, A., Hennig, P., & Hauberg, S. (2014). Probabilistic shortest path tractography in DTI using Gaussian Process ODE solvers. In P. Golland, N. Hata, C. Barillot, J. Hornegger, & R. Howe (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2014 (Vol. 8675, pp. 265–272). Springer.

Tractography in diffusion tensor imaging estimates connectivity in the brain through observations of local diffusivity. These observations are noisy and of low resolution and, as a consequence, connections cannot be found with high precision. We use probabilistic numerics to estimate connectivity between regions of interest and contribute a Gaussian Process tractography algorithm which allows for both quantification and visualization of its posterior uncertainty. We use the uncertainty both in visualization of individual tracts as well as in heat maps of tract locations. Finally, we provide a quantitative evaluation of different metrics and algorithms showing that the adjoint metric combined with our algorithm produces paths which agree most often with experts.

@inproceedings{LNCS86750265,
author = {Schober, Michael and Kasenburg, Niklas and Feragen, Aasa and Hennig, Philipp and Hauberg, S{\o}ren},
editor = {Golland, Polina and Hata, Nobuhiko and Barillot, Christian and Hornegger, Joachim and Howe, Robert},
booktitle = {Medical Image Computing and Computer-Assisted Intervention --
MICCAI 2014},
publisher = {Springer},
location = {Heidelberg},
series = {Lecture Notes in Computer Science},
volume = {8675},
year = {2014},
pages = {265--272},
title = {{Probabilistic shortest path tractography in {DTI} using
{G}aussian {P}rocess {ODE} solvers}},
file = {../assets/pdf/Schober2014MICCAI.pdf},
supplements = {http://www.probabilistic-numerics.org/MICCAI2014-supp.zip}
}

8. Schober, M., Duvenaud, D. K., & Hennig, P. (2014). Probabilistic ODE Solvers with Runge-Kutta Means. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27 (pp. 739–747). Curran Associates, Inc.

Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state of the art. Like most numerical methods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.

@incollection{schober2014nips,
title = {Probabilistic {ODE} Solvers with {R}unge-{K}utta Means},
author = {Schober, Michael and Duvenaud, David K and Hennig, Philipp},
booktitle = {Advances in Neural Information Processing Systems 27},
editor = {Ghahramani, Z. and Welling, M. and Cortes, C. and Lawrence, N.D. and Weinberger, K.Q.},
pages = {739--747},
year = {2014},
publisher = {Curran Associates, Inc.},
file = {http://papers.nips.cc/paper/5451-probabilistic-ode-solvers-with-runge-kutta-means.pdf},
supplements = {http://papers.nips.cc/paper/5451-probabilistic-ode-solvers-with-runge-kutta-means-supplemental.zip}
}

9. Barber, D. (2014). On solving Ordinary Differential Equations using Gaussian Processes. ArXiv Pre-Print 1408.3807.

We describe a set of Gaussian Process based approaches that can be used to solve non-linear Ordinary Differential Equations. We suggest an explicit probabilistic solver and two implicit methods, one analogous to Picard iteration and the other to gradient matching. All methods have greater accuracy than previously suggested Gaussian Process approaches. We also suggest a general approach that can yield error estimates from any standard ODE solver.

@article{2014arXiv14083807B,
author = {{Barber}, D.},
title = {{On solving Ordinary Differential Equations using Gaussian
Processes}},
journal = {ArXiv pre-print 1408.3807},
year = {2014},
month = aug,
}

10. Hauberg, S., Schober, M., Liptrot, M., Hennig, P., & Feragen, A. (2015). A Random Riemannian Metric for Probabilistic Shortest-Path Tractography. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) (Vol. 18). Munich, Germany.

Shortest-path tractography (SPT) algorithms solve global optimization problems defined from local distance functions. As diffusion MRI data is inherently noisy, so are the voxelwise tensors from which local distances are derived. We extend Riemannian SPT by modeling the stochasticity of the diffusion tensor as a “random Riemannian metric”, where a geodesic is a distribution over tracts. We approximate this distribution with a Gaussian process and present a probabilistic numerics algorithm for computing the geodesic distribution. We demonstrate SPT improvements on data from the Human Connectome Project.

@inproceedings{Hauberg_MICCAI_2015,
title = {A Random Riemannian Metric for Probabilistic Shortest-Path Tractography},
author = {Hauberg, S{\o}ren and Schober, Michael and Liptrot, Matthew and Hennig, Philipp and Feragen, Aasa},
booktitle = {Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
volume = {18},
month = sep,
year = {2015},
file = {http://www2.compute.dtu.dk/~sohau/papers/miccai2015/MICCAI2015.pdf},
}

11. Conrad, P. R., Girolami, M., Särkkä, S., Stuart, A., & Zygalakis, K. (2015). Probability Measures for Numerical Solutions of Differential Equations. ArXiv:1506.04592 [Stat].

In this paper, we present a formal quantification of epistemic uncertainty induced by numerical solutions of ordinary and partial differential equation models. Numerical solutions of differential equations contain inherent uncertainties due to the finite dimensional approximation of an unknown and implicitly defined function. When statistically analysing models based on differential equations describing physical, or other naturally occurring, phenomena, it is therefore important to explicitly account for the uncertainty introduced by the numerical method. This enables objective determination of its importance relative to other uncertainties, such as those caused by data contaminated with noise or model error induced by missing physical or inadequate descriptors. To this end we show that a wide variety of existing solvers can be randomised, inducing a probability measure over the solutions of such differential equations. These measures exhibit contraction to a Dirac measure around the true unknown solution, where the rates of convergence are consistent with the underlying deterministic numerical method. Ordinary differential equations and elliptic partial differential equations are used to illustrate the approach to quantifying uncertainty in both the statistical analysis of the forward and inverse problems.

@article{conrad_probability_2015,
title = {Probability {Measures} for {Numerical} {Solutions} of {Differential} {Equations}},
file = {http://arxiv.org/pdf/1506.04592v1.pdf},
urldate = {2015-06-16},
journal = {arXiv:1506.04592 [stat]},
author = {Conrad, Patrick R. and Girolami, Mark and Särkkä, Simo and Stuart, Andrew and Zygalakis, Konstantinos},
month = jun,
year = {2015},
note = {arXiv: 1506.04592},
keywords = {Statistics - Methodology}
}

12. Kersting, H. P., & Hennig, P. (2016). Active Uncertainty Calibration in Bayesian ODE Solvers. In Janzing & Ihlers (Eds.), Uncertainty in Artificial Intelligence (UAI) (Vol. 32).

There is resurging interest, in statistics and machine learning, in solvers for ordinary differential equations (ODEs) that return probability measures instead of point estimates. Recently, Conrad et al. introduced a sampling-based class of methods that are ’well-calibrated’ in a specific sense. But the computational cost of these methods is significantly above that of classic methods. On the other hand, Schober et al. pointed out a precise connection between classic Runge-Kutta ODE solvers and Gaussian filters, which gives only a rough probabilistic calibration, but at negligible cost overhead. By formulating the solution of ODEs as approximate inference in linear Gaussian SDEs, we investigate a range of probabilistic ODE solvers, that bridge the trade-off between computational cost and probabilistic calibration, and identify the inaccurate gradient measurement as the crucial source of uncertainty. We propose the novel filtering-based method Bayesian Quadrature filtering (BQF) which uses Bayesian quadrature to actively learn the imprecision in the gradient measurement by collecting multiple gradient evaluations.

@inproceedings{KerstingHennigUAI2016,
author = {Kersting, Hans P. and Hennig, Philipp},
title = {Active Uncertainty Calibration in {B}ayesian {ODE} Solvers},
editor = {Janzing and Ihlers},
booktitle = {Uncertainty in Artificial Intelligence (UAI)},
volume = {32},
year = {2016},
file = {http://arxiv.org/pdf/1605.03364v2.pdf}
}

13. Schober, M., Särkkä, S., & Hennig, P. (2016). A probabilistic model for the numerical solution of initial value problems. ArXiv e-Prints.

Like many numerical methods, solvers for initial value problems (IVPs) on ordinary differential equations estimate an analytically intractable quantity, using the results of tractable computations as inputs. This structure is closely connected to the notion of inference on latent variables in statistics. We describe a class of algorithms that formulate the solution to an IVP as inference on a latent path that is a draw from a Gaussian process probability measure (or equivalently, the solution of a linear stochastic differential equation). We then show that certain members of this class are connected precisely to generalized linear methods for ODEs, a number of Runge–Kutta methods, and Nordsieck methods. This probabilistic formulation of classic methods is valuable in two ways: analytically, it highlights implicit prior assumptions favoring certain approximate solutions to the IVP over others, and gives a precise meaning to the old observation that these methods act like filters. Practically, it endows the classic solvers with ‘docking points’ for notions of uncertainty and prior information about the initial value, the value of the ODE itself, and the solution of the problem.

@article{2016arXiv161005261S,
author = {{Schober}, M. and {S{\"a}rkk{\"a}}, S. and {Hennig}, P.},
title = {A probabilistic model for the numerical solution of initial value problems},
journal = {ArXiv e-prints},
eprint = {1610.05261},
year = {2016},
month = oct,
file = {https://arxiv.org/pdf/1610.05261.pdf}
}


### Work regarding "inferring ODEs"

1. Graepel, T. (2003). Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. In ICML (pp. 234–241).
@inproceedings{graepel2003solving,
title = {Solving noisy linear operator equations by Gaussian
processes: Application to ordinary and partial differential
equations},
author = {Graepel, Thore},
booktitle = {ICML},
pages = {234--241},
year = {2003},
file = {http://www.aaai.org/Papers/ICML/2003/ICML03-033.pdf}
}

2. Calderhead, B., Girolami, M., & Lawrence, N. D. (2009). Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems 21 (pp. 217–224). Curran Associates, Inc.
@incollection{NIPS2008_3497,
title = {Accelerating Bayesian Inference over Nonlinear Differential
Equations with Gaussian Processes},
author = {Calderhead, Ben and Girolami, Mark and Lawrence, Neil D.},
booktitle = {Advances in Neural Information Processing Systems 21},
editor = {Koller, D. and Schuurmans, D. and Bengio, Y. and Bottou, L.},
pages = {217--224},
year = {2009},
publisher = {Curran Associates, Inc.},
file = {http://papers.nips.cc/paper/3497-accelerating-bayesian-inference-over-nonlinear-differential-equations-with-gaussian-processes.pdf}
}

3. Korostil, I. A., Peters, G. W., Cornebise, J., & Regan, D. G. (2013). Adaptive Markov chain Monte Carlo forward projection for statistical analysis in epidemic modelling of human papillomavirus. Statistics in Medicine, 32(11), 1917–1953.

A Bayesian statistical model and estimation methodology based on forward projection adaptive Markov chain Monte Carlo is developed in order to perform the calibration of a high-dimensional nonlinear system of ordinary differential equations representing an epidemic model for human papillomavirus types 6 and 11 (HPV-6, HPV-11). The model is compartmental and involves stratification by age, gender and sexual-activity group. Developing this model and a means to calibrate it efficiently is relevant because HPV is a very multi-typed and common sexually transmitted infection with more than 100 types currently known. The two types studied in this paper, types 6 and 11, are causing about 90% of anogenital warts. We extend the development of a sexual mixing matrix on the basis of a formulation first suggested by Garnett and Anderson, frequently used to model sexually transmitted infections. In particular, we consider a stochastic mixing matrix framework that allows us to jointly estimate unknown attributes and parameters of the mixing matrix along with the parameters involved in the calibration of the HPV epidemic model. This matrix describes the sexual interactions between members of the population under study and relies on several quantities that are a priori unknown. The Bayesian model developed allows one to estimate jointly the HPV-6 and HPV-11 epidemic model parameters as well as unknown sexual mixing matrix parameters related to assortativity. Finally, we explore the ability of an extension to the class of adaptive Markov chain Monte Carlo algorithms to incorporate a forward projection strategy for the ordinary differential equation state trajectories. Efficient exploration of the Bayesian posterior distribution developed for the ordinary differential equation parameters provides a challenge for any Markov chain sampling methodology, hence the interest in adaptive Markov chain methods. We conclude with simulation studies on synthetic and recent actual data.

@article{korostil2013adaptive,
title = {Adaptive Markov chain Monte Carlo forward projection for
statistical analysis in epidemic modelling of human
papillomavirus},
author = {Korostil, Igor A and Peters, Gareth W and Cornebise, Julien and Regan, David G},
journal = {Statistics in medicine},
volume = {32},
number = {11},
pages = {1917--1953},
year = {2013},
file = {http://arxiv.org/pdf/1108.3137v1.pdf}
}

4. Wang, Y., & Barber, D. (2014). Gaussian Processes for Bayesian Estimation in Ordinary Differential Equations. In International Conference on Machine Learning – ICML.

Bayesian parameter estimation in coupled ordinary differential equations (ODEs) is challenging due to the high computational cost of numerical integration. In gradient matching a separate data model is introduced with the property that its gradient may be calculated easily. Parameter estimation is then achieved by requiring consistency between the gradients computed from the data model and those specified by the ODE. We propose a Gaussian process model that directly links state derivative information with system observations, simplifying previous approaches and improving estimation accuracy.

@inproceedings{wang-barber-ICML-14,
author = {Wang, Yali and Barber, David},
title = {{G}aussian Processes for {B}ayesian Estimation in Ordinary
Differential Equations},
booktitle = {International Conference on Machine Learning -- ICML},
year = {2014},
file = {http://web4.cs.ucl.ac.uk/staff/d.barber/publications/gpodeICML2014.pdf}
}


## Partial Differential Equations

1. Bui-Thanh, T., & Girolami, M. (2014). Solving Large-Scale PDE-constrained Bayesian Inverse Problems with Riemann Manifold Hamiltonian Monte Carlo. ArXiv Pre-Print 1407.1517.

We consider the Riemann manifold Hamiltonian Monte Carlo (RMHMC) method for solving statistical inverse problems governed by partial differential equations (PDEs). The power of the RMHMC method is that it exploits the geometric structure induced by the PDE constraints of the underlying inverse problem. Consequently, each RMHMC posterior sample is almost independent from the others providing statistically efficient Markov chain simulation. We reduce the cost of forming the Fisher information matrix by using a low rank approximation via a randomized singular value decomposition technique. This is efficient since a small number of Hessian-vector products are required. The Hessian-vector product in turn requires only two extra PDE solves using the adjoint technique. The results suggest RMHMC as a highly efficient simulation scheme for sampling from PDE induced posterior measures.

@article{2014arXiv14071517B,
author = {{Bui-Thanh}, T. and {Girolami}, M.},
title = {{Solving Large-Scale PDE-constrained Bayesian Inverse
Problems with Riemann Manifold Hamiltonian Monte Carlo}},
journal = {arXiv pre-print 1407.1517},
keywords = {Mathematics - Statistics Theory},
year = {2014},
month = jul,
}

2. Owhadi, H. (2015). Bayesian Numerical Homogenization. Multiscale Modeling &Amp; Simulation, 13(3), 812–828.

Numerical homogenization, i.e. the finite-dimensional approximation of solution spaces of PDEs with arbitrary rough coefficients, requires the identification of accurate basis elements. These basis elements are oftentimes found after a laborious process of scientific investigation and plain guesswork. Can this identification problem be facilitated? Is there a general recipe/decision framework for guiding the design of basis elements? We suggest that the answer to the above questions could be positive based on the reformulation of numerical homogenization as a Bayesian Inference problem in which a given PDE with rough coefficients (or multi-scale operator) is excited with noise (random right hand side/source term) and one tries to estimate the value of the solution at a given point based on a finite number of observations. We apply this reformulation to the identification of bases for the numerical homogenization of arbitrary integro-differential equations and show that these bases have optimal recovery properties. In particular we show how Rough Polyharmonic Splines can be re-discovered as the optimal solution of a Gaussian filtering problem.

@article{owhadi2015bayesian,
title = {Bayesian Numerical Homogenization},
journal = {Multiscale Modeling \& Simulation},
volume = {13},
number = {3},
pages = {812--828},
year = {2015},
publisher = {SIAM},
file = {http://arxiv.org/pdf/1406.6668v2.pdf}
}

3. Conrad, P. R., Girolami, M., Särkkä, S., Stuart, A., & Zygalakis, K. (2015). Probability Measures for Numerical Solutions of Differential Equations. ArXiv:1506.04592 [Stat].

In this paper, we present a formal quantification of epistemic uncertainty induced by numerical solutions of ordinary and partial differential equation models. Numerical solutions of differential equations contain inherent uncertainties due to the finite dimensional approximation of an unknown and implicitly defined function. When statistically analysing models based on differential equations describing physical, or other naturally occurring, phenomena, it is therefore important to explicitly account for the uncertainty introduced by the numerical method. This enables objective determination of its importance relative to other uncertainties, such as those caused by data contaminated with noise or model error induced by missing physical or inadequate descriptors. To this end we show that a wide variety of existing solvers can be randomised, inducing a probability measure over the solutions of such differential equations. These measures exhibit contraction to a Dirac measure around the true unknown solution, where the rates of convergence are consistent with the underlying deterministic numerical method. Ordinary differential equations and elliptic partial differential equations are used to illustrate the approach to quantifying uncertainty in both the statistical analysis of the forward and inverse problems.

@article{conrad_probability_2015,
title = {Probability {Measures} for {Numerical} {Solutions} of {Differential} {Equations}},
file = {http://arxiv.org/pdf/1506.04592v1.pdf},
urldate = {2015-06-16},
journal = {arXiv:1506.04592 [stat]},
author = {Conrad, Patrick R. and Girolami, Mark and Särkkä, Simo and Stuart, Andrew and Zygalakis, Konstantinos},
month = jun,
year = {2015},
note = {arXiv: 1506.04592},
keywords = {Statistics - Methodology}
}

4. Owhadi, H. (2015). Multigrid with rough coefficients and Multiresolution operator decomposition from Hierarchical Information Games. ArXiv, math.NA(1503.03467).

We introduce a near-linear complexity (geometric and meshless/algebraic) multigrid/multiresolution method for PDEs with rough (L^∞) coefficients with rigorous a-priori accuracy and performance estimates. The method is discovered through a decision/game theory formulation of the problems of (1) identifying restriction and interpolation operators (2) recovering a signal from incomplete measurements based on norm constraints on its image under a linear operator (3) gambling on the value of the solution of the PDE based on a hierarchy of nested measurements of its solution or source term. The resulting elementary gambles form a hierarchy of (deterministic) basis functions of H^1_0(\Omega) (gamblets) that (1) are orthogonal across subscales/subbands with respect to the scalar product induced by the energy norm of the PDE (2) enable sparse compression of the solution space in H^1_0(\Omega) (3) induce an orthogonal multiresolution operator decomposition. The operating diagram of the multigrid method is that of an inverted pyramid in which gamblets are computed locally (by virtue of their exponential decay), hierarchically (from fine to coarse scales) and the PDE is decomposed into a hierarchy of independent linear systems with uniformly bounded condition numbers. The resulting algorithm is parallelizable both in space (via localization) and in bandwith/subscale (subscales can be computed independently from each other). Although the method is deterministic it has a natural Bayesian interpretation under the measure of probability emerging (as a mixed strategy) from the information game formulation and multiresolution approximations form a martingale with respect to the filtration induced by the hierarchy of nested measurements.

@article{2015arXiv150303467O,
title = {Multigrid with rough coefficients and Multiresolution operator decomposition from Hierarchical Information Games},
journal = {ArXiv},
volume = {math.NA},
issue = {1503.03467},
year = {2015},
month = mar,
file = {http://arxiv.org/pdf/1503.03467v4.pdf}
}

5. Cockayne, J., Oates, C., Sullivan, T., & Girolami, M. (2016). Probabilistic Meshless Methods for Partial Differential Equations and Bayesian Inverse Problems. ArXiv, (1605.07811).

This paper develops a class of meshless methods that are well-suited to statistical inverse problems involving partial differential equations (PDEs). The methods discussed in this paper view the forcing term in the PDE as a random field that induces a probability distribution over the residual error of a symmetric collocation method. This construction enables the solution of challenging inverse problems while accounting, in a rigorous way, for the impact of the discretisation of the forward problem. In particular, this confers robustness to failure of meshless methods, with statistical inferences driven to be more conservative in the presence of significant solver error. In addition, (i) a principled learning-theoretic approach to minimise the impact of solver error is developed, and (ii) the challenging setting of inverse problems with a non-linear forward model is considered. The method is applied to parameter inference problems in which non-negligible solver error must be accounted for in order to draw valid statistical conclusions.

@article{2016arXiv160507811C,
author = {{Cockayne}, J. and {Oates}, C. and {Sullivan}, T. and {Girolami}, M.},
title = {Probabilistic Meshless Methods for Partial Differential Equations and {B}ayesian Inverse Problems},
journal = {ArXiv},
issue = {1605.07811},
year = {2016},
month = may,
file = {http://arxiv.org/pdf/1605.07811v1.pdf}
}

6. Owhadi, H., & Zhang, L. (2016). Gamblets for opening the complexity-bottleneck of implicit schemes for hyperbolic and parabolic ODEs/PDEs with rough coefficients. ArXiv e-Prints, (math.NA 1606.07686).

Implicit schemes are popular methods for the integration of time dependent PDEs such as hyperbolic and parabolic PDEs. However the necessity to solve corresponding linear systems at each time step constitutes a complexity bottleneck in their application to PDEs with rough coefficients. We present a generalization of gamblets introduced in arXiv:1503.03467 enabling the resolution of these implicit systems in near-linear complexity and provide rigorous a-priori error bounds on the resulting numerical approximations of hyperbolic and parabolic PDEs. These generalized gamblets induce a multiresolution decomposition of the solution space that is adapted to both the underlying (hyperbolic and parabolic) PDE (and the system of ODEs resulting from space discretization) and to the time-steps of the numerical scheme.

@article{2016arXiv160607686O,
author = {{Owhadi}, H. and {Zhang}, L.},
title = {{Gamblets for opening the complexity-bottleneck of implicit schemes for hyperbolic and parabolic ODEs/PDEs with rough coefficients}},
journal = {ArXiv e-prints},
issue = {math.NA 1606.07686},
year = {2016},
month = jun,
file = {http://arxiv.org/pdf/1606.07686v1.pdf}
}

7. Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017). Numerical Gaussian Processes for Time-dependent and Non-linear Partial Differential Equations. ArXiv e-Prints, (stat.ML 1703.10230).

We introduce the concept of numerical Gaussian processes, which we define as Gaussian processes with covariance functions resulting from temporal discretization of time-dependent partial differential equations. Numerical Gaussian processes, by construction, are designed to deal with cases where: (1) all we observe are noisy data on black-box initial conditions, and (2) we are interested in quantifying the uncertainty associated with such noisy data in our solutions to time-dependent partial differential equations. Our method circumvents the need for spatial discretization of the differential operators by proper placement of Gaussian process priors. This is an attempt to construct structured and data-efficient learning machines, which are explicitly informed by the underlying physics that possibly generated the observed data. The effectiveness of the proposed approach is demonstrated through several benchmark problems involving linear and nonlinear time-dependent operators. In all examples, we are able to recover accurate approximations of the latent solutions, and consistently propagate uncertainty, even in cases involving very long time integration.

@article{2017arXiv170310230,
author = {Raissi, Maziar and Perdikaris, Paris and Karniadakis, George Em},
title = {Numerical {G}aussian Processes for Time-dependent and Non-linear Partial Differential Equations},
journal = {ArXiv e-prints},
issue = {stat.ML 1703.10230},
year = {2017},
month = mar,
file = {https://arxiv.org/pdf/1703.10230.pdf}
}


## Sampling (Monte Carlo)

coming soon

1. Liberty, E., Woolfe, F., Martinsson, P.-G., Rokhlin, V., & Tygert, M. (2007). Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51), 20167–20172.
@article{liberty2007randomized,
title = {Randomized algorithms for the low-rank approximation of
matrices},
author = {Liberty, Edo and Woolfe, Franco and Martinsson, Per-Gunnar and Rokhlin, Vladimir and Tygert, Mark},
journal = {Proceedings of the National Academy of Sciences},
volume = {104},
number = {51},
pages = {20167--20172},
year = {2007}
}

2. Halko, N., Martinsson, P.-G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 217–288.
@article{halko2011finding,
title = {Finding structure with randomness: Probabilistic algorithms
for constructing approximate matrix decompositions},
author = {Halko, Nathan and Martinsson, Per-Gunnar and Tropp, Joel A},
journal = {SIAM review},
volume = {53},
number = {2},
pages = {217--288},
year = {2011},
publisher = {SIAM}
}

3. O’Hagan, A. (2013). Polynomial Chaos: A Tutorial and Critique from a Statistician’s Perspective. University of Sheffield, UK.
@techreport{ohagan13-polyn-chaos,
author = {O'Hagan, Anthony},
title = {Polynomial Chaos: A Tutorial and Critique from a
Statistician's Perspective},
institution = {University of Sheffield, UK},
year = {2013},
month = may
}

4. Garnett, R., Osborne, M., & Hennig, P. (2014). Active Learning of Linear Embeddings for Gaussian Processes. In N. L. Zhang & J. Tian (Eds.), Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (pp. 230–239). AUAI Press. Retrieved from http://auai.org/uai2014/proceedings/individuals/152.pdf

We propose an active learning method for discovering low-dimensional structure in high-dimensional Gaussian process (GP) tasks. Such problems are increasingly frequent and important, but have hitherto presented severe practical difficulties. We further introduce a novel technique for approximately marginalizing GP hyperparameters, yielding marginal predictions robust to hyperparameter misspecification. Our method offers an efficient means of performing GP regression, quadrature, or Bayesian optimization in high-dimensional spaces.

@inproceedings{GarnettOH2013,
title = {Active Learning of Linear Embeddings for Gaussian Processes},
author = {Garnett, R. and Osborne, M. and Hennig, P.},
booktitle = {Proceedings of the 30th Conference on Uncertainty in
Artificial Intelligence},
editor = {Zhang, NL and Tian, J},
publisher = {AUAI Press},
pages = {230-239},
year = {2014},
url = {http://auai.org/uai2014/proceedings/individuals/152.pdf},
url2 = {https://github.com/rmgarnett/mgp},
department = {Department Sch{\"o}lkopf},
file = {http://auai.org/uai2014/proceedings/individuals/152.pdf},
code = {https://github.com/rmgarnett/mgp}
}

5. Hennig, P., & Schuler, C. J. (2012). Entropy Search for Information-Efficient Global Optimization. Journal of Machine Learning Research, 13, 1809–1837.

Contemporary global optimization algorithms are based on local measures of utility, rather than a probability measure over location and value of the optimum. They thus attempt to collect low function values, not to learn about the optimum. The reason for the absence of probabilistic global optimizers is that the corresponding inference problem is intractable in several ways. This paper develops desiderata for probabilistic optimization algorithms, then presents a concrete algorithm which addresses each of the computational intractabilities with a sequence of approximations and explicitly adresses the decision problem of maximizing information gain from each evaluation.

@article{HennigS2012,
title = {Entropy Search for Information-Efficient Global Optimization},
author = {Hennig, P. and Schuler, CJ.},
month = jun,
volume = {13},
pages = {1809-1837},
journal = {Journal of Machine Learning Research},
year = {2012},
file = {http://jmlr.csail.mit.edu/papers/volume13/hennig12a/hennig12a.pdf},
code = {http://probabilistic-optimization.org/Global.html}
}

6. Garrabrant, S., Benson-Tilsen, T., Critch, A., Soares, N., & Taylor, J. (2016). Logical Induction. ArXiv Preprint 1609.03543v3.

This monograph gives a strong theoretical optimality notion for using bounded computational resources to assign accurate probabilities to the outcomes of computations.

We present a computable algorithm that assigns probabilities to every logical statement in a given formal language, and refines those probabilities over time. For instance, if the language is Peano arithmetic, it assigns probabilities to all arithmetical statements, including claims about the twin prime conjecture, the outputs of long-running computations, and its own probabilities. We show that our algorithm, an instance of what we call a logical inductor, satisfies a number of intuitive desiderata, including: (1) it learns to predict patterns of truth and falsehood in logical statements, often long before having the resources to evaluate the statements, so long as the patterns can be written down in polynomial time; (2) it learns to use appropriate statistical summaries to predict sequences of statements whose truth values appear pseudorandom; and (3) it learns to have accurate beliefs about its own current beliefs, in a manner that avoids the standard paradoxes of self-reference. For example, if a given computer program only ever produces outputs in a certain range, a logical inductor learns this fact in a timely manner; and if late digits in the decimal expansion of π are difficult to predict, then a logical inductor learns to assign ≈10% probability to "the nth digit of π is a 7" for large n. Logical inductors also learn to trust their future beliefs more than their current beliefs, and their beliefs are coherent in the limit (whenever ϕ⟹ψ, ℙ∞(ϕ)≤ℙ∞(ψ), and so on); and logical inductors strictly dominate the universal semimeasure in the limit. These properties and many others all follow from a single logical induction criterion, which is motivated by a series of stock trading analogies. Roughly speaking, each logical sentence ϕ is associated with a stock that is worth \$1 per share if f φ is true and nothing otherwise, and we interpret the belief-state of a logically uncertain reasoner as a set of market prices, where Pn(φ) = 50% means that on day n, shares of φ may be bought or sold from the reasoner for 50¢. The logical induction criterion says (very roughly) that there should not be any polynomial-time computable trading strategy with finite risk tolerance that earns unbounded profits in that market over time. This criterion bears strong resemblance to the “no Dutch book” criteria that support both expected utility theory (von Neumann and Morgenstern 1944) and Bayesian probability theory (Ramsey 1931; de Finetti 1937).

@article{garrabrant,
author = {Garrabrant, Scott and Benson-Tilsen, Tsvi and Critch, Andrew and Soares, Nate and Taylor, Jessica},
title = {Logical Induction},
journal = {arXiv preprint 1609.03543v3},
year = {2016},