Bayesian Optimization from scratch¶

Introduction¶

In this notebook we're going to build Bayesian Optimization from scratch.

Sources:

  • https://juanitorduz.github.io/reg_bayesian_regression/
  • http://www.cse.psu.edu/~rtc12/CSE598C/completingTheSquare.pdf
  • https://learnbayes.org/comp_sqr/page-2/
  • http://gaussianprocess.org/gpml/chapters/RW.pdf

Problem Statement¶

In Bayesian optimization we're trying to optimize a certain objective function by using a Bayesian optimization loop. In this loop we estimate our true (unknown) objective function with a model (almost always Gaussian process regression) fitted on our known data points. We use an acquisition function on this model to determine the next best query points. We query those points using our expensive real objective function and update the model. We repeat the process until done.

Objective function¶

We are interested in solving

$$x_* = \arg\max_{x\in X} f(x)$$

under the constraints that

  • $f$ is a black box for which no closed form is known (nor its gradients);
  • $f$ is expensive to evaluate;
  • and evaluations of $y = f(x)$ may be noisy.

Bayesian Optimization Loop¶

For $t=1: T$

  1. Given observations $(x_i, y_i = f(x_i))$ for $i=1:t$, build a probabilistic model for the objective $f$. Integrate out all possible true functions, using Gaussian process regression.

  2. Optimize a cheap acquisition/utility function $u$ based on the posterior distribution for sampling the next point: $x_{t+1} = \arg\min_x u(x)$. Exploit uncertainty to balance exploration against exploitation.

  3. Sample the next observation $y_{t+1}$ at $x_{t+1}$
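
Putting these three steps together, a minimal structural sketch of the loop could look as follows. Note that `fit_gp`, `propose_next` and `objective` are hypothetical placeholders for the pieces we build later in this notebook; this is a sketch of the control flow, not a finished implementation.

In [ ]:
def bayesian_optimization(objective, X_init, y_init, n_iterations):
    """
    Structural sketch of the Bayesian optimization loop.
    fit_gp and propose_next are placeholders for the GP posterior
    and acquisition-function steps described above.
    """
    X, y = list(X_init), list(y_init)
    for t in range(n_iterations):
        gp = fit_gp(X, y)              # 1. probabilistic model of the objective
        x_next = propose_next(gp, X)   # 2. optimize the acquisition function u
        y_next = objective(x_next)     # 3. expensive (possibly noisy) evaluation
        X.append(x_next)
        y.append(y_next)
    return X, y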

Acquisition functions¶

Acquisition functions $u(x)$ specify which sample $x$ should be tried next:

  • Expected improvement (default): $-EI(x) = -\mathbb{E}[\max(0, f(x)-f(x_t^+))]$
  • Lower confidence bound: $LCB(x) = \mu_{GP}(x) - k\sigma_{GP}(x)$
  • Probability of improvement: $-PI(x) = -P(f(x) \geq f(x_t^+)+k)$

where $x_t^+$ is the best point observed so far.

In most cases, acquisition functions provide knobs (e.g., $k$) for controlling the exploration-exploitation trade-off:

  • Search in regions where $\mu_{GP}(x)$ is high (exploitation)
  • Probe regions where uncertainty $\sigma_{GP}(x)$ is high (exploration)
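
As a concrete example, here is a small sketch of the expected improvement acquisition in closed form. It assumes the surrogate's posterior at $x$ is Gaussian with mean `mu` and standard deviation `sigma` (these come from the fitted GP, and `f_best` is the best observed value $f(x_t^+)$ so far); scipy is used for the normal CDF/PDF.

In [ ]:
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """
    Closed-form expected improvement for a maximization problem:
    EI(x) = E[max(0, f(x) - f_best)] under f(x) ~ N(mu, sigma^2).
    """
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)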

Starting off¶

To start we need a fake objective function. For this we're going to use the Branin function, which has a two-dimensional input space, since the 1d case would be too easy to implement.

$$f(x) = a(x_2-bx_1^2+cx_1-r)^2+s(1-t)\cos(x_1)+s$$

Where $a=1, b=5.1/(4\pi^2), c=5/\pi, r=6, s=10, t=1/(8\pi)$. The global minimum is $f(x^*)=0.397887$, attained at $x^*=(-\pi, 12.275)$, $(\pi, 2.275)$ and $(9.42478, 2.475)$. This function is usually evaluated on the square $x_1 \in [-5, 10]$, $x_2 \in [0, 15]$.

In [1]:
%load_ext autoreload
%autoreload 2
In [2]:
import math
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()
In [3]:
def f(x: list):
    """
    Branin function
    """
    x1, x2 = x
    a = 1
    b = 5.1/(4*math.pi**2)
    c = 5/math.pi
    r = 6
    s = 10
    t = 1/(8*math.pi)
    # Note: the constant "+ s" term from the formula above is left out here;
    # it only shifts the function by a constant and doesn't move the optima.
    return a*(x2-b*x1**2+c*x1-r)**2 + s*(1-t)*math.cos(x1)
In [4]:
f([1,2])
Out[4]:
11.627635392062379
In [5]:
def plot_2d_function(func, x1_min, x1_max, x2_min, x2_max):
    """
    func: 2d function which takes input as a list [x1, x2]
    """
    x1 = np.linspace(x1_min, x1_max)
    x2 = np.linspace(x2_min, x2_max)

    X, Y = np.meshgrid(x1, x2)

    A = np.concatenate((X.reshape(1, -1), Y.reshape(1, -1)))
    n_grid_points = A.shape[1]

    z = [func(A[:,i]) for i in range(n_grid_points)]
    z = np.array(z).reshape(int(math.sqrt(n_grid_points)), int(math.sqrt(n_grid_points)))

    fig = plt.figure(figsize=(10,6))
    plt.contourf(X, Y, z, 50, cmap = 'RdGy');
    plt.colorbar();
In [6]:
plot_2d_function(f, -5, 10, 0, 15)

Now that we have our nice objective function, we have to simulate our noisy observations!

In [7]:
def noisy_f(x, noise_level=0.5**2):
    """
    Returns the objective function f evaluated at x, with some added Gaussian noise.
    """
    return f(x) + np.random.randn() * noise_level
In [8]:
noisy_f([0,0]), noisy_f([0,0])
Out[8]:
(45.55655343981535, 45.79452601600886)

Great, now we have everything to start the Bayesian optimization! To start off we need some randomly initialized data points and a Gaussian process prior. Once we have that we can compute the Gaussian process posterior. With the Gaussian process posterior we can compute the acquisition function and optimize it to get the next query point!

For this example we start off with 5 randomly initialized data points. Normally we would assign these data points in a bit smarter manner (see the sketch below), but for now it doesn't really matter.
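
As an aside, one such "smarter" choice is a space-filling design like Latin hypercube sampling. A minimal numpy-only sketch (not used further in this notebook; the bounds below are the Branin domain):

In [ ]:
def latin_hypercube(n_points, bounds, seed=None):
    """
    Simple Latin hypercube sample: each dimension is split into n_points
    equal-width slices and every slice is hit exactly once.
    bounds: list of (low, high) tuples, one per dimension.
    """
    rng = np.random.default_rng(seed)
    d = len(bounds)
    u = np.empty((n_points, d))
    for j in range(d):
        # one uniform draw per slice, in a random slice order
        u[:, j] = (rng.permutation(n_points) + rng.random(n_points)) / n_points
    lows = np.array([b[0] for b in bounds])
    highs = np.array([b[1] for b in bounds])
    return lows + u * (highs - lows)

latin_hypercube(5, [(-5, 10), (0, 15)])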

t=0 The Initialization (Noiseless)¶

So we need to initialize some data points, our Gaussian process prior, the Gaussian process posterior and an acquisition function. We start off by using the noiseless function; that will make things a bit easier in the beginning.

Initial data points¶

In [9]:
start_x1 = np.random.uniform(low=-5, high=10, size=5)
start_x2 = np.random.uniform(low=0, high=15, size=5)

X = [[x1, x2] for x1, x2 in list(zip(start_x1, start_x2))]
y = [f(x) for x in X]
In [10]:
plot_2d_function(f, -5, 10, 0, 15)
plt.scatter(start_x1, start_x2, c=y, cmap = 'RdGy', edgecolors="black", vmin=-5, vmax=290);

Great, so we have a couple of initial data points!

Gaussian Process prior¶

To create a 2d Gaussian process we of course need the main component of a Gaussian process: the kernel! The kernel, or covariance function, determines how different points within our input domain are correlated. Every point in our domain is of course by definition Gaussian distributed. Let's define our kernel first. We can use a lot of different kernels, but for this example let's use the squared exponential (RBF) kernel given by

$$K(\textbf{x}_i, \textbf{x}_j) = \sigma^2\exp\left(\frac{-1}{2L^2}\cdot(\textbf{x}_i - \textbf{x}_j)^T(\textbf{x}_i - \textbf{x}_j)\right)$$

In [11]:
# set the hyperparameters
sigma = 1
l = 20
In [12]:
def Kernel(x, x_):
    """
    Squared exponential (RBF) kernel; returns the covariance between two n-dimensional vectors.
    """
    return sigma**2*np.exp(-1/(2*l**2)*(x-x_).T@(x-x_))
In [13]:
Kernel(np.array([2,1]), np.array([1,2]))
Out[13]:
0.9975031223974601

Nice, so let's do a couple of draws from our Gaussian process with a 2-dimensional input space! All these draws could potentially be our model; luckily we have infinitely many options. This means we have to define our Gaussian process. To do this we have to take a little step back and look at something we all understand: linear regression.
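
Before taking that step back, here is a small sketch of what such a draw would look like in code: we evaluate the kernel on a coarse grid of input points and sample from the corresponding zero-mean multivariate normal. The grid is kept coarse because the covariance matrix grows quadratically with the number of points.

In [ ]:
# Draw one sample function from the GP prior on a coarse 2d grid,
# reusing the Kernel defined above and assuming a zero prior mean.
g1, g2 = np.meshgrid(np.linspace(-5, 10, 15), np.linspace(0, 15, 15))
grid = np.stack([g1.ravel(), g2.ravel()], axis=1)   # (225, 2) input points

K = np.array([[Kernel(xi, xj) for xj in grid] for xi in grid])
prior_draw = np.random.multivariate_normal(
    mean=np.zeros(len(grid)),
    cov=K + 1e-8 * np.eye(len(grid)),   # small jitter for numerical stability
)

plt.figure(figsize=(10, 6))
plt.contourf(g1, g2, prior_draw.reshape(g1.shape), 50, cmap='RdGy')
plt.colorbar()
plt.title("One draw from the GP prior");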

A step back¶

For the definition of a Gaussian process prior and posterior it is insightful to first review the Bayesian analysis of the standard linear regression model, and build from there. Let's say we have a linear model $f$.

$$f(\textbf{x}) = \textbf{x}^T\textbf{w}, \hspace{2cm} y=f(x)+\epsilon$$

Where $\textbf{x}$ is the input vector, $\textbf{w}$ is a vector of weights (parameters) of the linear model, $f$ is the function value and $y$ is the observed target value. We assume the observed values $y$ differ from the function values $f(\textbf{x})$ by additive noise, and we will further assume that this noise follows an i.i.d. Gaussian distribution with zero mean and variance $\sigma_n^2$: $$\epsilon\sim N(0, \sigma_n^2)$$ In a regression problem we are interested in learning the weights (parameters) of our model, that is, the weights that minimize our error $\epsilon$. In the one-dimensional case this is easily visualized for $n$ points.

In [55]:
n = 3
true_w = np.array([2, 0.8])
mu_0 = 0
sigma_n = 0.1
epsilon = np.random.normal(mu_0, sigma_n, n)
x = np.random.uniform(0, 1, n).reshape(1, -1)
X = np.append(np.ones(n).reshape(1, -1), x, axis=0)  # We add a one column vector for the bias
y = X.T@true_w + epsilon

plt.figure(figsize=(6, 4))
plt.scatter(x, y)
x0, x1 = 0, 1
def true_line(x): return true_w[0] + true_w[1]*x  # named true_line to avoid shadowing the Branin f above
plt.plot([x0, x1], [true_line(x0), true_line(x1)], label=f"$y = {true_w[0]} + {true_w[1]}x$");
plt.title("Some randomly generated data with their true w");
plt.legend();

This noise assumption together with the model directly gives rise to the likelihood: the probability density of the observations given the parameters, which factors over the cases in the training set. In other words, how "likely" is it to see these $y$ values given our input data and our weights? Mathematically that is: $$p(\textbf{y}|X, \textbf{w}) = \displaystyle\prod^n_{i=1}p(y_i|\textbf{x}_i, \textbf{w})$$

Given the mean $\mu$ and standard deviation $\sigma$ of a Gaussian, the probability density function for a value $x$ is given by

$$p(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right).$$

This means

$$ \begin{align} p(\textbf{y}|X, \textbf{w}) &= \displaystyle\prod^n_{i=1}p(y_i|\textbf{x}_i, \textbf{w}) = \displaystyle\prod^n_{i=1}\frac{1}{\sigma_n\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{y_i-\textbf{x}_i^T\textbf{w}}{\sigma_n}\right)^2\right) \\ &= \frac{1}{\left(2\pi\sigma_n^2\right)^{n/2}}\exp\left(-\frac{1}{2\sigma^2_n}\left|\textbf{y}-X^T\textbf{w}\right|^2\right) = N\left(X^T\textbf{w}, \sigma_n^2\textbf{I}\right) \end{align} $$

Here $|\textbf{z}|$ denotes the Euclidean length of the vector $\textbf{z}$.
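
As a quick numeric sanity check of this identity, we can evaluate the likelihood of the toy data generated above both as a product of per-point Gaussian densities and via the vectorized expression. This uses scipy's `norm` plus the `X`, `y`, `sigma_n` and `true_w` from the 1d example cell above.

In [ ]:
from scipy.stats import norm

w = true_w
# product of per-point Gaussian densities
per_point = np.prod(norm.pdf(y, loc=X.T @ w, scale=sigma_n))
# vectorized multivariate form
vectorized = (2 * np.pi * sigma_n**2) ** (-len(y) / 2) \
    * np.exp(-0.5 / sigma_n**2 * np.sum((y - X.T @ w) ** 2))
np.allclose(per_point, vectorized)   # True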

Since we now have a neat formula for the likelihood, we still need a prior and a marginal likelihood for our posterior. Remember Bayes' rule?

$$\text{posterior} = \frac{\text{likelihood}\times\text{prior}}{\text{marginal likelihood}} \hspace{3cm} p(\textbf{w}|y, X) = \frac{p(\textbf{y}|X, \textbf{w})p(\textbf{w})}{p(\textbf{y}|X)}$$

We put a zero mean gaussian prior with covariance matrix $\Sigma_p$ on the weights

$$\textbf{w}\sim N(\textbf{0}, \Sigma_p).$$

Since $p(\textbf{y}|X)$ is independent of $\textbf{w}$, it is usually treated as a normalizing constant. This marginal likelihood, which does not depend on the weights, is given by

$$p(\textbf{y}|X) = \int p(\textbf{y}|X, \textbf{w})p(\textbf{w})d\textbf{w}$$

The posterior combines the likelihood and the prior, capturing everything we know about the parameters!

$$ \begin{align} p(\textbf{w}|X, y) & \propto \exp \left( -\frac{1}{2\sigma_n^2}(\textbf{y}-X^T\textbf{w})^T(\textbf{y}-X^T\textbf{w})\right)\exp\left(-\frac{1}{2}\textbf{w}^T\Sigma_p^{-1}\textbf{w}\right) & \\ & \propto \exp\left(-\frac{1}{2}(\textbf{w}-\overline{\textbf{w}})^T(\frac{1}{\sigma^2_n}XX^T+\Sigma_p^{-1})(\textbf{w}-\overline{\textbf{w}})\right) & (1) \end{align} $$

where

$$\overline{\textbf{w}} = \sigma_n^{-2}(\sigma_n^{-2}XX^T + \Sigma_p^{-1})^{-1}X\textbf{y}.$$

We recognize the form of the posterior right!? It is a Gaussian with mean $\overline{\textbf{w}}$ and covariance matrix $\textbf{A}^{-1}$

$$\textbf{w}|X, \textbf{y}\sim N\left(\overline{\textbf{w}}=\frac{1}{\sigma_n^{2}}A^{-1}X\textbf{y}, A^{-1}\right),$$

where $A=\sigma_n^{-2}XX^T + \Sigma_p^{-1}$. So we're able to compute the posterior analytically!

Interested in the full math behind (1)?

Note: I left out the bold notation, but $w$ is still a 2d vector with the intercept and the slope, $y$ is still an $n$-dimensional vector with the response values of our data points, and $X$ is still a $2 \times n$ matrix whose first row consists of ones for the bias and whose second row holds the $x$ coordinates of our data points.

$$ \begin{align} p(w|X, y) & \propto\exp \left( -\frac{1}{2\sigma_n^2}(y-X^Tw)^T(y-X^Tw)\right)\exp\left(-\frac{1}{2}w^T\Sigma_p^{-1}w\right) \\ & = \exp\left\{-\frac{1}{2}\left(\sigma_n^{-2}(y-X^Tw)^T(y-X^Tw) + w^T\Sigma_p^{-1}w\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(\sigma_n^{-2}(y^T-w^TX)(y-X^Tw) + w^T\Sigma_p^{-1}w\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(\sigma_n^{-2}(y^Ty-y^TX^Tw-w^TXy+w^TXX^Tw) + w^T\Sigma_p^{-1}w\right)\right\} \\ & \propto \exp\left\{-\frac{1}{2}\left(\sigma_n^{-2}(-(w^TXy)^T-w^TXy)+ w^T(\sigma_n^{-2}XX^T+\Sigma_p^{-1})w\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(-\sigma_n^{-2}(w^TXy)^T-\sigma_n^{-2}w^TXy+ w^TAw\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(-\sigma_n^{-2}(w^TAA^{-1}Xy)^T-\sigma_n^{-2}w^TAA^{-1}Xy+ w^TAw\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(-(w^TA\sigma_n^{-2}A^{-1}Xy)^T-w^TA\sigma_n^{-2}A^{-1}Xy+ w^TAw\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(-(w^TA\overline{w})^T-w^TA\overline{w}+ w^TAw\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left(-\overline{w}^TAw-w^TA\overline{w}+ w^TAw\right)\right\} \\ & = \exp\left\{-\frac{1}{2}\left((w-\overline{w})^TA(w-\overline{w}) - \overline{w}^TA\overline{w} \right)\right\} \\ & \propto \exp\left\{-\frac{1}{2}(w-\overline{w})^TA(w-\overline{w})\right\} \\ \end{align} $$

where

$$ A=\sigma_n^{-2}XX^T+\Sigma_p^{-1} \\ \overline{w} = \sigma_n^{-2}A^{-1}Xy $$

Note: at the $\propto$ steps we drop terms which do not depend on $w$. Also note we can introduce $AA^{-1}=I$ and bring the constant $\sigma_n^{-2}$ inside.
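
If you want to verify the completing-the-square step numerically: the log of the unnormalized posterior and the quadratic form around $\overline{w}$ should differ only by a constant that does not depend on $w$. A small check, reusing `X`, `y` and `sigma_n` from the 1d example and taking $\Sigma_p = I$ (the same prior we use below):

In [ ]:
Sigma_p = np.eye(2)
A = sigma_n**(-2) * X @ X.T + np.linalg.inv(Sigma_p)
w_bar = sigma_n**(-2) * np.linalg.solve(A, X @ y)

def log_unnormalized_posterior(w):
    # log(likelihood * prior), dropping normalization constants
    return -0.5 * (sigma_n**(-2) * (y - X.T @ w) @ (y - X.T @ w)
                   + w @ np.linalg.inv(Sigma_p) @ w)

def completed_square(w):
    return -0.5 * (w - w_bar) @ A @ (w - w_bar)

# the difference should be the same constant for any w
w1, w2 = np.random.randn(2), np.random.randn(2)
np.isclose(log_unnormalized_posterior(w1) - completed_square(w1),
           log_unnormalized_posterior(w2) - completed_square(w2))   # True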

Let's show an analytically computed posterior for the 1 dimensional example above!

Example for the 1d case¶

Enough math, let's show how all this works with an example in the 1d case.

In [70]:
# Prior on w
mu_0 = np.zeros(2)
sigma_p = np.array([[1, 0], [0, 1]])

# Let's sample from the prior m times
m = 1000
z = np.random.multivariate_normal(mu_0, sigma_p, size=m)
In [72]:
# Let's plot the prior
plt.figure(figsize=(8, 8))
plt.scatter(X[1], y)
for i in range(m):
    plt.plot([0, 1], [z[i, 0], z[i, 0] + z[i, 1]], color='grey', linewidth=1, alpha=0.1)
plt.title(f"Samples: {m} draws from the prior on f");

We see that our prior knowledge is pretty garbage. But that's fine. We're more interested in the posterior anyway! Let's also show the distribution on the parameters $w$.

In [73]:
fig = plt.figure(figsize=(8,8))
sns.jointplot(x=z[:,0], y=z[:,1], kind='kde', space=0, color='purple')
plt.title("Joint distribution between the two parameters slope and intercept in w");
<Figure size 576x576 with 0 Axes>
In [74]:
# Compute A.
A = (sigma_n)**(-2)*np.dot(X, X.T) + np.linalg.inv(sigma_p)
# Compute its inverse. 
A_inv = np.linalg.inv(A)

# Compute w_bar (reusing the inverse computed above)
w_bar = sigma_n**(-2)*A_inv@X@y

# sample from the posterior m times
z = np.random.multivariate_normal(w_bar, A_inv, m)
In [75]:
x,y
Out[75]:
(array([[0.97071986, 0.4380278 , 0.51477552]]),
 array([3.00063856, 2.37245988, 2.53990343]))
In [76]:
# Let's plot the posterior samples
plt.figure(figsize=(8, 8))

for i in range(m):
    plt.plot([0, 1], [z[i, 0], z[i, 0] + z[i, 1]], color='grey', linewidth=1, alpha=0.01)
plt.scatter(x, y);
plt.title(f"Samples: {m} draws from the posterior on f");
In [77]:
fig = plt.figure(figsize=(8,8));
sns.jointplot(x=z[:,0], y=z[:,1], kind='kde', space=0, color='purple');
plt.title("Joint distribution between the two parameters slope and intercept in w");
<Figure size 576x576 with 0 Axes>

We can see our posterior distribution on $w$ is way more accurate! To use our posterior as a model to predict, we can take the mode of the posterior, which is called the maximum a posteriori (MAP) estimate of $w$ (for a Gaussian posterior this coincides with the mean). Alternatively, we can make fully Bayesian predictions by averaging over all possible parameter values, weighted by their posterior probability.
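
To make an actual prediction at a new input $x_*$, we can push the posterior on $w$ through the linear model: $f_* = x_*^T w$ is then again Gaussian, with mean $x_*^T\overline{w}$ and variance $x_*^TA^{-1}x_*$ (see chapter 2 of the GPML book linked in the sources). A small sketch using the `w_bar` and `A_inv` computed above:

In [ ]:
# Predictive mean and standard deviation of f at new 1d inputs,
# using the posterior N(w_bar, A_inv) computed above.
x_star = np.linspace(0, 1, 100)
X_star = np.vstack([np.ones_like(x_star), x_star])   # bias row on top, like X

pred_mean = X_star.T @ w_bar
pred_std = np.sqrt(np.sum(X_star * (A_inv @ X_star), axis=0))   # x_*^T A^{-1} x_* per column

plt.figure(figsize=(8, 5))
plt.plot(x_star, pred_mean, label="posterior mean")
plt.fill_between(x_star, pred_mean - 2 * pred_std, pred_mean + 2 * pred_std,
                 alpha=0.3, label="$\\pm 2$ std")
plt.scatter(x, y)
plt.legend();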

Okay great, but what the hell happened?

Remember all the math from before? Well, 1d linear regression is a very simple instance of that multidimensional case, with $\textbf{w} = (w_0, w_1)$, where $w_0$ is the intercept and $w_1$ the slope.

$$f(x) = w_0 + w_1 \cdot x$$

We've picked a 1d input space, so the weight space is two-dimensional. We set a normal prior on both weights, which means the weights could be anything falling under the normal bell curve.

$$\textbf{w} \sim N(\boldsymbol\mu, \Sigma) = N(0, I)$$

Then we observed three data points, which gave us more information about the possible weights. Using the fact that $w_0$ and $w_1$ are still normally distributed random variables, we could use Bayes' rule to compute the posterior distribution of $\textbf{w}$. This gives us a better (posterior) distribution for $\textbf{w}$, with updated mean and covariance. We were able to compute this posterior analytically, with some fancy linear algebra (all the amazing math above), which gave us new estimates for $\boldsymbol\mu$ and $\Sigma$.

In [ ]: