Probabilistic Numerical Methods

Probabilistic numerical methods solve numerical problems from linear algebra, optimization, quadrature and differential equations using probabilistic inference. This approach captures uncertainty arising from finite computational resources and from stochastic input.

Even though PN methods return random variables which in their distribution represent this aforementioned uncertainty, they do not necessarily make use of random numbers. In order to illustrate what PN methods are and how they work consider the following deliberately simple numerical problem.

[1]:
# Make inline plots vector graphics instead of raster graphics
%matplotlib inline
from IPython.display import set_matplotlib_formats

set_matplotlib_formats("pdf", "svg")

# Plotting
import matplotlib.pyplot as plt

plt.style.use("../probnum.mplstyle")

A Simple Numerical Problem

Consider the following one-dimensional optimization problem

\[\underset{x \in \mathbb{R}}{\operatorname{min}} f(x) = \underset{x \in \mathbb{R}}{\operatorname{min}} \frac{1}{2} ax^2 + bx + c,\]

where \(f\) is a quadratic function and we assume \(a > 0\). Since \(f\) is a strictly convex function there exists a unique minimum given by

\[f'(x_*) = 0 \iff x_*=-\frac{b}{a}.\]
[2]:
import numpy as np
import probnum as pn

# Quadratic objective function
a = 2
b = -1.0
c = 3.0
x_opt = -b / a


def f(x):
    return 0.5 * a * x ** 2 + b * x + c

Now suppose we are not given access to the coefficients, but only the ability to evaluate \(f\) at arbitrary points. We can then evaluate the objective three times and solve the resulting \(3 \times 3\) linear system to find the parameters \(a\), \(b\) and \(c\).

Introducing Noise

However, what happens if we go one step further and assume the parameters \((a,b,c)\) and therefore function evaluations are corrupted by noise? This is often the case in practice where \(f\) might describe a complex physical system or depend on data. Can we still design a computationally efficient algorithm which finds the minimum?

Suppose we only have access to noisy evaluations

\[\hat{y} = \hat{f}(x) = \frac{1}{2}(a + \varepsilon_a)x^2 + (b+\varepsilon_b)x + (c + \varepsilon_c)\]

of the objective function \(f\) at \(x\), where \(\varepsilon = (\varepsilon_a, \varepsilon_b, \varepsilon_c) \sim \mathcal{N}(0, \Lambda)\) and \(\Lambda \in \mathbb{R}^{3 \times 3}\) symmetric positive definite.

Remark: The \(n\)-dimensional analogue of this case arises for example in supervised (deep) learning. In large-scale empirical risk minimization the available training data often does not fit into memory. Therefore during training only batches of data are considered which induces noise on the objective function and its gradient. Here \(\hat{f}\) is analogous to the empirical risk on a gvien batch.

[3]:
from probnum import random_variables as rvs

# Noisy objective function
Lambda = np.diag([0.1, 0.05, 0.01])
eps = rvs.Normal(np.zeros(3), Lambda)


def f_hat(x, noise=None):
    if noise is None:
        noise = eps.sample()
    return 0.5 * (a + noise[0]) * x ** 2 + (b + noise[1]) * x + c + noise[2]
[4]:
"""Plot objective and noisy function evaluations."""
from mpl_toolkits.mplot3d import Axes3D

# Random seed
np.random.seed(42)

# Plot setup
n_samples = 20
xx = np.linspace(x_opt - 1, x_opt + 1, 1000)

fig = plt.figure(figsize=[10, 6])
gs = fig.add_gridspec(ncols=2, nrows=2, width_ratios=[5, 1], height_ratios=[2.5, 1])

ax00 = fig.add_subplot(gs[0, 0])
ax10 = fig.add_subplot(gs[1, 0], sharex=ax00)
ax01 = fig.add_subplot(gs[0, 1], sharey=ax00)
ax11 = fig.add_subplot(gs[1, 1], projection="3d")
ylim00 = [2.25, 3.75]
ylim10 = [-0.25, 1.1]

# Function
ax00.plot(xx, f(xx), label="$f$", color="C0")

# Parameters
ax11.scatter(a, b, c, color='C0', label="$(a, b, c)$")

# Function range
ax01.scatter(ylim10[1] + ylim10[0], f(x_opt), label="$\min f(x)$")
ax01.hlines(y=f(x_opt), xmin=ylim10[0], xmax=ylim10[1] + ylim10[0], zorder=-1, color="C0")

# Domain
ax10.scatter(x_opt, 0, label="$\operatorname{argmin} f(x)$")
ax10.vlines(x_opt, ymin=0, ymax=ylim10[1], zorder=-1, color="C0")

# Samples
for i in range(n_samples):
    noise = eps.sample()
    x_opt_noise = -(b + noise[1]) / (a + noise[0])
    if i == 0:
        sample_label = "$\\hat{f}_i$"
        sample_min_label = "$\min \\hat{f}_i$"
        sample_argmin_label = "$\operatorname{argmin} \hat{f}_i(x)$"
    else:
        sample_label = None
        sample_argmin_label = None
        sample_min_label = None
    ax00.plot(xx, f_hat(xx, noise), color="C0", alpha=0.2, label=sample_label)
    ax10.scatter(x_opt_noise, 0, color="C0", alpha=0.2, label=sample_argmin_label)
    ax01.scatter(
        ylim10[1] + ylim10[0],
        f_hat(x_opt_noise, noise),
        color="C0",
        alpha=0.2,
        label=sample_min_label,
    )
    ax11.scatter(a + noise[0], b + noise[1], c + noise[2], color='C0', alpha=0.2)

# Adjust axis visibility, labels and legends
ax00.get_xaxis().set_visible(False)
ax10.get_yaxis().set_visible(False)
ax01.get_yaxis().set_visible(False)
ax01.get_xaxis().set_visible(False)
ax11.get_xaxis().set_ticklabels([])
ax11.get_yaxis().set_ticklabels([])
ax11.get_zaxis().set_ticklabels([])

ax00.set_ylim(ylim00)
ax00.set_xlim([x_opt - 1, x_opt + 1])
ax10.set_ylim(ylim10)
ax01.set_xlim(ylim10)
ax11.set_xlim([a - 1, a + 1])
ax11.set_ylim([b - 1, b + 1])
ax11.set_zlim([c - 1, c + 1])
ax00.set_ylabel("$y$")
ax10.set_xlabel("$x$")
ax11.set_xlabel("$a$", labelpad=-12)
ax11.set_ylabel("$b$", labelpad=-12)
ax11.set_zlabel("$c$", labelpad=-12)
ax00.legend(loc="upper center")
ax01.legend(loc="upper center", fontsize=10, labelspacing=0)
ax10.legend(loc="upper right", fontsize=10, labelspacing=0)

plt.tight_layout(h_pad=0, w_pad=0)
plt.show()
../_images/tutorials_pn_methods_7_0.svg

This makes the optimization problem considerably more difficult depending on the degree of noise. Can we still come up with a computationally efficient and accurate algorithm that makes use of the information we have about the problem?

In fact, we can. We will design a custom probabilistic numerical method.

Components of PN Methods

We will now build up a probabilistic numerical method from its individual components to solve the given optimization problem. In a nutshell a PN method consists of prior knowledge about the problem (e.g. differentiability), a policy, which returns an action that collects information about the problem resulting in observations. Our prior belief is updated with the observations about the problem using statistical inference techniques leading to a posterior belief over the numerical quantity in question. Finally, a stopping criterion determines, when to terminate.

Prior Knowledge

A naive strategy to solve the problem given only the ability to evaluate the noisy function \(\hat{f}\) could be to simply choose random points on the real axis and return the point of lowest observed function value. However, clearly this is suboptimal since we might get unlucky and observe a low function value for an \(x\) far away from the minimum simply by chance. Also crucially it completely ignores information we have and collect about the problem. For example, we know the latent function is quadratic.

Further, we might have some idea of where the minimum of the function \(f\) lies. This could come from experience, a physical application from which the problem arises or related optimization problems we have solved in the past.

We can combine these two forms of prior information, inherent properties of the problem and a belief over its solution, into a prior distribution over the coefficients of the quadratic function. This induces a belief over the optimal function value \(f(x_*)\). Here, we assume that we are reasonably certain about the value of \(a\) but less so about \(b\) and \(c\).

[5]:
# Prior on parameters: a, b, c
mu = np.array([1.8, 0.0, 3.5])
Sigma = np.array([[0.1, 0.0, 0.0],
                  [0.0, 0.4, 0.0],
                  [0.0, 0.0, 0.2]])
fun_params0 = rvs.Normal(mean=mu, cov=Sigma)
x0 = -fun_params0.mean[1] / fun_params0.mean[0]

# Induced belief over optimal function value
def fun_opt0(fun_params0, x):
    x = np.asarray(x).reshape(-1, 1)
    return np.hstack((0.5 * x ** 2, x, np.ones_like(x))) @ fun_params0
[6]:
"""Plot objective, noisy function evaluations and prior knowledge."""

# Random seed
np.random.seed(42)

# Evaluate some quantities to plot
yy = np.linspace(ylim00[0], ylim00[1], 1000)
fun_opt0_mean = fun_opt0(fun_params0, xx).mean
fun_opt0_std = np.sqrt(fun_opt0(fun_params0, xx).var)
fun0_pdf = fun_opt0(fun_params0, x0).pdf(yy)
fun0_pdf_max = np.max(fun0_pdf)

# Plot setup
fig = plt.figure(figsize=[10, 6])
gs = fig.add_gridspec(ncols=2, nrows=2, width_ratios=[5, 1], height_ratios=[2.5, 1])

ax00 = fig.add_subplot(gs[0, 0])
ax10 = fig.add_subplot(gs[1, 0], sharex=ax00)
ax01 = fig.add_subplot(gs[0, 1], sharey=ax00)
ax11 = fig.add_subplot(gs[1, 1], projection="3d")

# Function
ax00.plot(xx, f(xx), label="$f$", color="C0")
ax00.plot(xx, fun_opt0_mean, color="C1", label="$p(f)$")
ax00.fill_between(
    x=xx,
    y1=fun_opt0_mean - 2 * fun_opt0_std,
    y2=fun_opt0_mean + 2 * fun_opt0_std,
    color="C1",
    alpha=0.2,
)

# Parameters
def gaussian_credible_region_3d(mean, cov, alpha=0.95):
    """Compute the x,y,z coordinates of a 3D Gaussian credible region."""
    from scipy.stats import chi2

    # Determine sphere radius (https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Interval)
    dim = 3
    radius = np.sqrt(chi2.ppf(alpha, df=dim))

    # Cholesky decomposition
    L = np.linalg.cholesky(cov)

    # Spheroid using polar coordinates
    u = np.linspace(0.0, 2.0 * np.pi, 100)
    v = np.linspace(0.0, np.pi, 100)
    x = radius * np.outer(np.cos(u), np.sin(v))
    y = radius * np.outer(np.sin(u), np.sin(v))
    z = radius * np.outer(np.ones_like(u), np.cos(v))

    # Affine transformation
    for i in range(len(x)):
        for j in range(len(x)):
            [x[i,j],y[i,j],z[i,j]] = np.dot(L, [x[i,j],y[i,j],z[i,j]]) + mean

    return x, y, z

x, y, z = gaussian_credible_region_3d(mean=fun_params0.mean, cov=fun_params0.cov)
ax11.plot_surface(x, y, z, color='C1', alpha=0.3, linewidth=1)
ax11.scatter(a, b, c, color='C0')

# Function range
ax01.scatter(ylim10[1] + ylim10[0], f(x_opt), label="$\min f(x)$")
ax01.hlines(y=f(x_opt), xmin=ylim10[0], xmax=ylim10[1] + ylim10[0], zorder=-1, color="C0")
ax01.plot(ylim10[1] + ylim10[0] - fun0_pdf / fun0_pdf_max, yy, color="C1", label="$p(f_*)$")
ax01.fill_betweenx(
    y=yy,
    x1=ylim10[1] + ylim10[0] - fun0_pdf / fun0_pdf_max,
    x2=ylim10[1] + ylim10[0],
    color="C1",
    alpha=0.1,
)

# Domain
ax10.scatter(x_opt, 0, label="$\operatorname{argmin} f(x)$")
ax10.vlines(x_opt, ymin=0, ymax=ylim10[1], zorder=-1, color="C0")
ax10.scatter(x0, 0, color="C1", label="$- \\frac{\\mathbb{E}[b]}{\\mathbb{E}[a]}$")
ax10.vlines(x0, ymin=0, ymax=ylim10[1], zorder=-1, color="C1")

# Samples
for i in range(n_samples):
    noise = eps.sample()
    x_opt_noise = -(b + noise[1]) / (a + noise[0])
    if i == 0:
        sample_label = "$\\hat{f}_i$"
        sample_min_label = "$\min \\hat{f}_i$"
        sample_argmin_label = "$\operatorname{argmin} \hat{f}_i(x)$"
    else:
        sample_label = None
        sample_argmin_label = None
        sample_min_label = None
    ax00.plot(xx, f_hat(xx, noise), color="C0", alpha=0.2, label=sample_label)
    ax10.scatter(x_opt_noise, 0, color="C0", alpha=0.2, label=sample_argmin_label)
    ax01.scatter(
        ylim10[1] + ylim10[0],
        f_hat(x_opt_noise, noise),
        color="C0",
        alpha=0.2,
        label=sample_min_label,
    )
    ax11.scatter(a + noise[0], b + noise[1], c + noise[2], color='C0', alpha=0.2)

# Adjust axis visibility, labels and legends
ax00.get_xaxis().set_visible(False)
ax10.get_yaxis().set_visible(False)
ax01.get_yaxis().set_visible(False)
ax01.get_xaxis().set_visible(False)
ax11.get_xaxis().set_ticklabels([])
ax11.get_yaxis().set_ticklabels([])
ax11.get_zaxis().set_ticklabels([])

ax00.set_ylim(ylim00)
ax00.set_xlim([x_opt - 1, x_opt + 1])
ax10.set_ylim(ylim10)
ax01.set_xlim(ylim10)
ax11.set_xlim([a - 1, a + 1])
ax11.set_ylim([b - 1, b + 1])
ax11.set_zlim([c - 1, c + 1])
ax00.set_ylabel("$y$")
ax10.set_xlabel("$x$")
ax11.set_xlabel("$a$", labelpad=-12)
ax11.set_ylabel("$b$", labelpad=-12)
ax11.set_zlabel("$c$", labelpad=-12)
ax00.legend(loc="upper center")
ax01.legend(loc="upper center", fontsize=10, labelspacing=0)
ax10.legend(loc="upper right", fontsize=10, labelspacing=0)

plt.tight_layout(h_pad=0, w_pad=0)
plt.show()
../_images/tutorials_pn_methods_13_0.svg

We would like to use this prior information to solve the optimization problem in question. In order to do so we need to collect information about the problem, which we do by evaluating \(\hat{f}\) at chosen points \(x_i\).

Policy

In order to collect information about the problem we need to decide where to evaluate the (noisy) objective function \(\hat{f}\). This is done via a policy \(\pi\). In principle this can be any function which maps to the domain of \(\hat{f}\), however it is sensible to use the model of the problem our algorithm is iteratively constructing. This is called a model-based policy.

Here, we will collect observations by selecting actions randomly around where we currently belief the minimum to be. This is motivated by an exploration-exploitation trade-off. While we are unsure where the minimum is we should evaluate \(\hat{f}\) with different actions to explore and find smaller function values. If we are certain about the parameter values \((a,b,c)\), we should keep evaluating close to the minimum to find its exact location.

[7]:
# Stochastic model-based policy
def explore_exploit_policy(fun_params0):
    """Policy exploring around the estimate of the minimum based on the certainty about the parameters."""
    a0, b0, c0 = fun_params0
    return -b0.mean / a0.mean + rvs.Normal(0, np.trace(fun_params0.cov)).sample()

Action and Observation

We now use our policy to collect observations about the problem by evaluating the objective function at the suggested evaluation point(s).

[8]:
def observe(fun, x):
    """Evaluate the objective at a given location."""
    return fun(x)

Inference Step

Given the noise model, we can compute the posterior distribution over the parameters \((a, b, c)\) for a given action \(x\) using Bayes’ theorem. To that end, let

\[\begin{split}\Phi = \begin{pmatrix}\frac{1}{2}x^2 \\ x\\1\end{pmatrix} \in \mathbb{R}^{3}\end{split}\]

be the feature vector for a single action. We can rewrite the evaluation of the objective function \(f(x)\) as a linear function in the parameters \((a,b,c)\) like so

\[\begin{split}f(x) = \frac{1}{2}ax^2 + bx + c = \Phi^\top \begin{pmatrix}a \\ b \\ c \end{pmatrix}\end{split}\]

Prior

We assume a Gaussian prior distribution

\[(a, b, c) \sim \mathcal{N}(\mu, \Sigma),\]

which expresses our belief over the parameters of the latent 2nd degree polynomial.

Likelihood

We assumed a Gaussian noise model of the form \(\varepsilon = (\varepsilon_a, \varepsilon_b, \varepsilon_c) \sim \mathcal{N}(0, \Lambda)\). Therefore the noise corrupted objective evaluated at an action \(x\) is distributed as

\[\begin{split}\hat{y} = \hat{f}(x) = \Phi^\top \begin{pmatrix}a + \varepsilon_a\\ b + \varepsilon_b\\ c + \varepsilon_c\end{pmatrix} = f(x) + \Phi^\top \varepsilon\end{split}\]

This implies the following Gaussian likelihood defining the observation model.

\[p(y \mid a, b, c, x) = \mathcal{N}\big(y; f(x), \Phi^\top \Lambda \Phi\big)\]

Posterior

Using the rules of Gaussian inference, we arrive at the posterior distribution

\[p(a, b, c \mid x, y) = \mathcal{N}((a,b,c); m, S),\]

where

\[\begin{split}\begin{aligned} m &= \mu + \Sigma \Phi \big(\Phi^\top (\Sigma + \Lambda) \Phi \big)^{-1}(y - \Phi^\top \mu)\\ S &= \Sigma - \Sigma \Phi \big(\Phi^\top (\Sigma + \Lambda) \Phi \big)^{-1} \Phi^\top \Sigma. \end{aligned}\end{split}\]

In the posterior mean for the parameters of \(f\) the prior mean \(\mu\) is updated with the difference between the observed function value and the expected function value based on the prior \(y - \Phi^\top \mu\). This update to the current mean of the belief is scaled with how uncertain we are about the parameters and what level of noise we expect \(\big(\Phi^\top (\Sigma + \Lambda) \Phi \big)^{-1}\). If the combined uncertainty is small the update term of the mean gets weighted more highly than if the combined uncertainty is large. Similarly for the covariance our prior uncertainty about the parameters \(\Sigma\) is reduced a certain amount in each iteration depending on how certain we are about the parameters and the degree of noise. This Bayesian update therefore combines what is known and what is observed about the problem.

[9]:
def belief_update(fun_params, x, y, Lambda=Lambda):
    """Update the belief over the parameters with an observation."""
    # Feature vector
    x = np.asarray(x).reshape(1, -1)
    Phi = np.vstack((0.5 * x ** 2, x, np.ones_like(x)))

    # Mean and covariance
    mu = fun_params.mean
    Sigma = fun_params.cov

    # Gram matrix
    gram = Phi.T @ (Sigma + Lambda) @ Phi

    # Posterior Mean
    m = mu + Sigma @ Phi @ np.linalg.solve(gram, y - Phi.T @ mu)

    # Posterior Covariance
    S = Sigma - Sigma @ Phi @ np.linalg.solve(gram, Phi.T @ Sigma)

    return rvs.Normal(m, S)

Note, for simplicity we assumed the noise level of the data is known. Usually in practice this is not the case. However, one can actually infer the noise level from the data using the probabilistic framework. See the section on advantages of PN methods for more details.

Stopping Criteria

Finally, we need to decide when to stop. From a probabilistic perspective we should terminate when our uncertainty about the quantity of interest, in this case the minimum, is sufficiently small either in absolute or relative terms. Here, we use the belief over the parameters as a proxy instead since the uncertainty over the minimum is computationally intractable. Our combined uncertainty over the parameters is given as the trace \(\operatorname{tr}(S)\) of the covariance matrix. Additionally, we also make sure a certain computational budget is not exceeded.

[10]:
def stopping_criterion(fun_params0, current_iter, maxiter=10 ** 4, tol=10 ** -3):
    """Convergence criteria."""
    # Uncertainty over parameters given by the trace of the covariance.
    trace_cov = np.trace(fun_params0.cov)
    if trace_cov < tol:
        return True, "uncertainty_abstol"
    elif trace_cov < np.linalg.norm(fun_params0.mean, ord=2) ** 2 * tol:
        return True, "uncertainty_reltol"
    # Fixed maximum budget
    elif current_iter >= maxiter:
        return True, "maxiter"
    else:
        return False, None

The Algorithm

Putting all the components together we arrive at the following algorithm.

[11]:
def probsolve_qp(
    fun,
    fun_params0,
    policy=explore_exploit_policy,
    stopping_criterion=stopping_criterion,
    belief_update=belief_update,
    callback=None,
):
    """
    Probabilistic 1D Quadratic Optimization.

    PN method solving unconstrained one-dimensional (noisy) quadratic
    optimization problems only needing access to function evaluations.

    Parameters
    ----------
    fun : callable
        Objective function to optimize.
    fun_params0 : pn.RandomVariable
        Prior on the parameters of the objective function.
    policy : callable
        Policy returning actions.
    stopping_criterion : callable
        Stopping criterion determining convergence.
    belief_update : callable
        Inference routine updating the belief over the parameters.
    callback : callable
        Callback function used to return intermediate quantities of the iteration. Used for plotting.

    Returns
    -------
    x_opt : float
        Estimated minimum of ``f``.
    fun_opt : pn.RandomVariable
        Belief over the optimal value of the objective function.
    fun_params : pn.RandomVariable
        Belief over the parameters of the objective function.
    info : dict
        Additional information about the solve, e.g. convergence.
    """
    # Initialization
    iteration = 0

    # Iterate until convergence
    while True:

        # Stopping criterion
        has_converged, conv_crit = stopping_criterion(fun_params0, iteration)
        if has_converged:
            break

        # Policy
        x = policy(fun_params0)

        # Observation
        noise = eps.sample()  # for plotting purposes only, this is not available in practice
        y = observe(lambda x: fun(x, noise), x)

        # Belief update
        fun_params0 = belief_update(fun_params0, x, y)

        # Callback function
        if callback is not None:
            callback(x, y, fun_params0, noise)

        iteration += 1

    # Belief over optimal function value and optimum
    x_opt0 = -fun_params0.mean[1] / fun_params0.mean[0]
    fun_opt0 = np.array([0.5 * x_opt0 ** 2, x_opt0, 1]).T @ fun_params0

    # Information on convergence
    info = {"iter": iteration, "conv_crit": conv_crit}

    return x_opt0, fun_opt0, fun_params0, info

We now solve the problem using our newly implemented probabilistic numerical method. We also use a callback function to extract quantities at each iteration for analysis and plotting purposes.

[12]:
np.random.seed(42)

# Callback function to return iterates
x_iterates = [None]
y_iterates = [None]
fun_params0_iterates = [fun_params0]
noise_samples = [None]


def callback(x, y, fun_params, noise):
    x_iterates.append(x)
    y_iterates.append(y)
    fun_params0_iterates.append(fun_params)
    noise_samples.append(noise)


# Solve noisy quadratic optimization problem
x_opt_post, fun_opt_post, fun_params_post, info = probsolve_qp(f_hat, fun_params0, callback=callback)

# Print results
print(f"Estimated optimum: {x_opt_post:.4f} vs. true optimum: {x_opt:.4f}")
print(
    f"Estimated optimal value: {fun_opt_post.mean:.4f} "
    + "\u00B1"
    + f" {2 * fun_opt_post.std:.4f} vs. true optimal value: {f(x_opt):.4f}"
)
print(info)
Estimated optimum: 0.5154 vs. true optimum: 0.5000
Estimated optimal value: 2.7500 ± 0.0042 vs. true optimal value: 2.7500
{'iter': 7285, 'conv_crit': 'uncertainty_reltol'}

The PN method iteratively updates the prior belief over the coefficients of the quadratic function based on observations of function values at inputs given by its internal policy. This results in a posterior belief over the parameters (bottom right) and the optimal function value (top right). We also obtain an estimate for the minimum given as \(x_* = - \frac{\mathbb{E}[b]}{\mathbb{E}[a]}\) from the posterior distribution over the parameters.

[13]:
"""Animation of 1D stochastic quadratic optimization using a PN method."""

# Animation parameters
total_frames = np.minimum(20, len(x_iterates))

# Set up figure and axes
fig = plt.figure(figsize=[10, 6])
gs = fig.add_gridspec(ncols=2, nrows=2, width_ratios=[5, 1], height_ratios=[2.5, 1])

ax00 = fig.add_subplot(gs[0, 0])
ax10 = fig.add_subplot(gs[1, 0], sharex=ax00)
ax01 = fig.add_subplot(gs[0, 1], sharey=ax00)
ax11 = fig.add_subplot(gs[1, 1], projection="3d")

fig.patch.set_alpha(0.) # set figure background opacity to 0
plt.close()

# Plot setup
yy = np.linspace(ylim00[0], ylim00[1], 1000)

def animate(i):
    # Clear axes
    ax00.cla()
    ax10.cla()
    ax01.cla()
    ax11.cla()

    # Select data
    _max_first_iters = 10
    if len(x_iterates) > _max_first_iters:
        alg_iter_list = np.concatenate(
            [np.arange(0, _max_first_iters),
             np.round(np.exp(np.linspace(np.log(_max_first_iters), np.log(len(x_iterates) - 1),
                                  total_frames - _max_first_iters))).astype(int)])
    else:
        alg_iter_list = np.arange(0, len(x_iterates))

    alg_iter = alg_iter_list[i]
    x = x_iterates[alg_iter]
    y = y_iterates[alg_iter]
    fun_params0 = fun_params0_iterates[alg_iter]
    noise = noise_samples[alg_iter]

    # Plot objective and noisy evaluations
    fun_opt0_mean = fun_opt0(fun_params0, xx).mean
    fun_opt0_std = np.sqrt(fun_opt0(fun_params0, xx).var)
    x_opt0 = -fun_params0.mean[1] / fun_params0.mean[0]
    fun0_pdf = fun_opt0(fun_params0, x_opt0).pdf(yy)
    fun0_pdf_max = np.max(fun0_pdf)

    # Function
    ax00.plot(xx, f(xx), label="$f$", color="C0")
    if i > 0:
        ax00.plot(xx, f_hat(xx, noise=noise), label=f"$\hat{{f}}_{{{alg_iter}}}$", alpha=0.2, color="C0")
    ax00.plot(xx, fun_opt0_mean, color="C1", label="$p(f)$")
    ax00.fill_between(
        x=xx,
        y1=fun_opt0_mean - 2 * fun_opt0_std,
        y2=fun_opt0_mean + 2 * fun_opt0_std,
        color="C1",
        alpha=0.2,
    )

    # Parameters
    x_credregion, y_credregion, z_credregion = gaussian_credible_region_3d(mean=fun_params0.mean,
                                                                           cov=fun_params0.cov)
    ax11.plot_surface(x_credregion, y_credregion, z_credregion, color='C1', alpha=0.3, linewidth=1)
    ax11.scatter(a, b, c, color='C0')
    if i > 0:
        ax11.scatter(a + noise[0], b + noise[1], c + noise[2], color="C0", alpha=0.2)

    # Function range
    ax01.scatter(ylim10[1] + ylim10[0], f(x_opt), label="$\min f(x)$")
    ax01.hlines(y=f(x_opt), xmin=ylim10[0], xmax=ylim10[1] + ylim10[0], zorder=-1, color="C0")
    ax01.plot(ylim10[1] + ylim10[0] - fun0_pdf / fun0_pdf_max, yy, color="C1", label="$p(f_*)$")
    ax01.fill_betweenx(
        y=yy,
        x1=ylim10[1] + ylim10[0] - fun0_pdf / fun0_pdf_max,
        x2=ylim10[1] + ylim10[0],
        color="C1",
        alpha=0.1,
    )
    ax01.scatter(ylim10[1] + ylim10[0], y, color="C3", marker="x", label=f"obs. $\hat{{y}}_{{{alg_iter}}}$")
    ax01.hlines(y=y, xmin=ylim10[0], xmax=ylim10[1] + ylim10[0], zorder=-1, color="C0", alpha=0.2)

    # Domain
    ax10.scatter(x_opt, 0, label="$\operatorname{argmin} f(x)$")
    ax10.vlines(x_opt, ymin=0, ymax=ylim10[1], zorder=-1, color="C0")
    ax10.scatter(x_opt0, 0, color="C1", label="$- \\frac{\\mathbb{E}[b]}{\\mathbb{E}[a]}$")
    ax10.vlines(x_opt0, ymin=0, ymax=ylim10[1], zorder=-1, color="C1")
    ax10.scatter(x, 0, color="C3", marker="x", label=f"action $x_{{{alg_iter}}}$")
    ax10.vlines(x, ymin=0, ymax=ylim10[1], zorder=-1, color="C0", alpha=0.2)

    # Adjust axis visibility, labels and legends
    ax00.get_xaxis().set_visible(False)
    ax10.get_yaxis().set_visible(False)
    ax01.get_yaxis().set_visible(False)
    ax01.get_xaxis().set_visible(False)
    ax11.get_xaxis().set_ticklabels([])
    ax11.get_yaxis().set_ticklabels([])
    ax11.get_zaxis().set_ticklabels([])

    ax00.set_ylim(ylim00)
    ax00.set_xlim([x_opt - 1, x_opt + 1])
    ax10.set_ylim(ylim10)
    ax01.set_xlim(ylim10)
    ax11.set_xlim([a - 1, a + 1])
    ax11.set_ylim([b - 1, b + 1])
    ax11.set_zlim([c - 1, c + 1])
    ax00.set_ylabel("$y$")
    ax10.set_xlabel("$x$")
    ax11.set_xlabel("$a$", labelpad=-12)
    ax11.set_ylabel("$b$", labelpad=-12)
    ax11.set_zlabel("$c$", labelpad=-12)
    ax00.legend(loc="upper center")
    ax01.legend(loc="upper center", fontsize=10, labelspacing=0)
    ax10.legend(loc="upper right", fontsize=10, labelspacing=0)

    fig.tight_layout(h_pad=0, w_pad=0)
[14]:
from matplotlib import animation
from IPython.display import HTML

# Animation setup
anim = animation.FuncAnimation(
    fig, func=animate, frames=total_frames, interval=1250, repeat_delay=4000, blit=False
)
HTML(anim.to_jshtml())
[14]: