Bayesian recursive update

The adopted setup corresponds to standard GP regression, which is thoroughly described in [11] for the batch setting. Here we consider the online setting, in which new observations are incorporated sequentially. Therefore, a new posterior $ p({\boldsymbol{\mathbf{f}}}_{t}\vert{\cal D}_{t})$ that includes the most recent observation at time $ t$ must be computed at each time instant. We compute the joint posterior only at the locations at which data has been observed, rather than the full GP posterior. This is because $ p({\boldsymbol{\mathbf{f}}}_{t}\vert{\cal D}_{t})$ implicitly defines the full GP posterior $ p(f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_{t})$

$\displaystyle p(f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_{t}) = \int p(f({\boldsymbol{\mathbf{x}}})\vert{\boldsymbol{\mathbf{f}}}_t)\, p({\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_{t})\, \mathrm{d}{\boldsymbol{\mathbf{f}}}_t$ (2)

and therefore, both posterior distributions hold the same information.
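
To make this explicit: the posterior over $ {\boldsymbol{\mathbf{f}}}_t$ will turn out to be Gaussian, $ \mathcal{N}({\boldsymbol{\mathbf{\mu}}}_t, {\boldsymbol{\mathbf{\Sigma}}}_t)$, and writing $ {\boldsymbol{\mathbf{k}}}_\ast$ for the vector with entries $ k({\boldsymbol{\mathbf{x}}}_i, {\boldsymbol{\mathbf{x}}})$ and $ {\boldsymbol{\mathbf{Q}}}_t = {\boldsymbol{\mathbf{K}}}_t^{-1}$ (both quantities are introduced below), a standard Gaussian marginalization evaluates the integral in (2) in closed form as

$\displaystyle p(f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t) = \mathcal{N}\big(f({\boldsymbol{\mathbf{x}}})\,\big\vert\, {\boldsymbol{\mathbf{k}}}_\ast^\top{\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\mu}}}_t,\; k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}) - {\boldsymbol{\mathbf{k}}}_\ast^\top({\boldsymbol{\mathbf{Q}}}_t - {\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\Sigma}}}_t{\boldsymbol{\mathbf{Q}}}_t){\boldsymbol{\mathbf{k}}}_\ast\big),$

so the full GP posterior at any test input can be recovered from the stored moments alone.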

Instead of recomputing $ p({\boldsymbol{\mathbf{f}}}_{t}\vert{\cal D}_{t})$ from scratch at every time instant, we can obtain a simple recursive update as follows:

\begin{subequations}\begin{align}
p({\boldsymbol{\mathbf{f}}}_{t+1}\vert{\cal D}_{t+1}) &= p({\boldsymbol{\mathbf{f}}}_{t+1}\vert y_{t+1}, {\cal D}_t) \\
&= \frac{p(y_{t+1}\vert {\boldsymbol{\mathbf{f}}}_{t+1})\, p({\boldsymbol{\mathbf{f}}}_{t+1}\vert{\cal D}_t)}{p(y_{t+1}\vert{\cal D}_t)} \\
&= \frac{p(y_{t+1}\vert f_{t+1})\, p(f_{t+1}\vert{\boldsymbol{\mathbf{f}}}_t)}{p(y_{t+1}\vert{\cal D}_t)} \times p({\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t).
\end{align}\end{subequations}

Eq. (3b) follows from (3a) by direct application of Bayes' rule. Eq. (3c) includes an additional expansion, which exploits that $ f_{t+1}$ is conditionally independent of $ {\cal D}_t$ given $ {\boldsymbol{\mathbf{f}}}_t$.

If the posterior at time $ t$ is a known Gaussian $ p({\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t) = \mathcal{N}({\boldsymbol{\mathbf{f}}}_t\vert {\boldsymbol{\mathbf{\mu}}}_t, {\boldsymbol{\mathbf{\Sigma}}}_t)$, then we can use (3) to update it to include a new observation $ ({\boldsymbol{\mathbf{x}}}_{t+1}, y_{t+1})$. All the expressions in (3) can be easily obtained from the stated assumptions, as follows.

First, introducing the new quantities $ {\boldsymbol{\mathbf{Q}}}_t = {\boldsymbol{\mathbf{K}}}_t ^{-1}$, $ {\boldsymbol{\mathbf{q}}}_{t+1} = {\boldsymbol{\mathbf{Q}}}_t {\boldsymbol{\mathbf{k}}}_{t+1}$ and $ \gamma_{t+1}^2 = k_{t+1}- {\boldsymbol{\mathbf{k}}}_{t+1}^\top {\boldsymbol{\mathbf{Q}}}_t {\boldsymbol{\mathbf{k}}}_{t+1}$, where $ [{\boldsymbol{\mathbf{k}}}_{t+1}]_i = k({\boldsymbol{\mathbf{x}}}_i,{\boldsymbol{\mathbf{x}}}_{t+1})$ and $ k_{t+1}= k({\boldsymbol{\mathbf{x}}}_{t+1},{\boldsymbol{\mathbf{x}}}_{t+1})$, we can express the density of the latent function at the new input given its value at previous inputs as

$\displaystyle p(f_{t+1}\vert{\boldsymbol{\mathbf{f}}}_t) = \mathcal{N}(f_{t+1}\vert\, {\boldsymbol{\mathbf{q}}}_{t+1}^\top {\boldsymbol{\mathbf{f}}}_t,\, \gamma_{t+1}^2).$ (4)

This result follows directly from the known prior probability $ p(f_{t+1}, {\boldsymbol{\mathbf{f}}}_t)$ and conditioning on $ {\boldsymbol{\mathbf{f}}}_t$. The inverse of the kernel matrix, $ {\boldsymbol{\mathbf{Q}}}_t$, has been defined because we will be computing and storing it instead of $ {\boldsymbol{\mathbf{K}}}_t$, which will never be directly used.
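
As an illustration, these quantities are cheap to obtain from the stored inverse. The following sketch (in Python/NumPy, with an assumed kernel function `kern` and illustrative variable names that are not taken from the text) computes $ {\boldsymbol{\mathbf{k}}}_{t+1}$, $ k_{t+1}$, $ {\boldsymbol{\mathbf{q}}}_{t+1}$ and $ \gamma_{t+1}^2$:

```python
import numpy as np

def new_input_quantities(kern, X_t, Q_t, x_new):
    """Quantities needed for a new input x_{t+1}, given Q_t = K_t^{-1}.

    kern : assumed kernel function, kern(x, x') -> float
    X_t  : (t, d) array of the inputs observed so far
    Q_t  : (t, t) stored inverse kernel matrix
    """
    k_vec = np.array([kern(x_i, x_new) for x_i in X_t])  # k_{t+1}, vector
    k_ss = kern(x_new, x_new)                            # k_{t+1}, scalar
    q_vec = Q_t @ k_vec                                   # q_{t+1} = Q_t k_{t+1}
    gamma2 = k_ss - k_vec @ q_vec                         # gamma_{t+1}^2
    return k_vec, k_ss, q_vec, gamma2
```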

Then, the denominator of Eq. (3), which corresponds to the marginal likelihood (also known as evidence), provides the predictive distribution of a new observation $ y_{t+1}$ given past data, and can be computed by integration of the numerator

$\displaystyle p(y_{t+1}\vert {\cal D}_t) = \int\!\!\int p(y_{t+1}\vert f_{t+1})\, p(f_{t+1}\vert{\boldsymbol{\mathbf{f}}}_t)\, p({\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t)\, \mathrm{d}{\boldsymbol{\mathbf{f}}}_t\, \mathrm{d}f_{t+1} = \mathcal{N}(y_{t+1}\vert\,\hat{y}_{t+1}, \hat{\sigma}_{y,t+1}^2)$ (5)

with the mean and the variance of this Gaussian being

$\displaystyle \hat{y}_{t+1} = {\boldsymbol{\mathbf{q}}}_{t+1}^\top{\boldsymbol{\mathbf{\mu}}}_t \qquad\text{and}\qquad \hat{\sigma}_{y,t+1}^2 = \sigma_n^2 + \hat{\sigma}_{f,t+1}^2,$

respectively. We have also introduced the predictive variance of the latent function at the new input

$\displaystyle \hat{\sigma}_{f,t+1}^2 = k_{t+1} + {\boldsymbol{\mathbf{k}}}_{t+1}^\top\left({\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\Sigma}}}_t{\boldsymbol{\mathbf{Q}}}_t - {\boldsymbol{\mathbf{Q}}}_t\right){\boldsymbol{\mathbf{k}}}_{t+1} = \gamma_{t+1}^2 + {\boldsymbol{\mathbf{q}}}_{t+1}^\top{\boldsymbol{\mathbf{h}}}_{t+1}$

with $ {\boldsymbol{\mathbf{h}}}_{t+1} = {\boldsymbol{\mathbf{\Sigma}}}_t {\boldsymbol{\mathbf{q}}}_{t+1}$.
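
In code, and continuing the sketch above (same illustrative names), the predictive moments of (5) follow in a few lines:

```python
def predict(mu_t, Sigma_t, q_vec, gamma2, sigma_n2):
    """Predictive mean and variance of y_{t+1}, Eq. (5).

    mu_t, Sigma_t : posterior moments at time t
    sigma_n2      : noise variance sigma_n^2
    """
    h_vec = Sigma_t @ q_vec          # h_{t+1} = Sigma_t q_{t+1}
    y_hat = q_vec @ mu_t             # predictive mean of y_{t+1}
    sf2 = gamma2 + q_vec @ h_vec     # predictive variance of f(x_{t+1})
    sy2 = sigma_n2 + sf2             # predictive variance of y_{t+1}
    return y_hat, sy2, sf2, h_vec
```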

Finally, the likelihood follows from the model definition (1):

$\displaystyle p(y_{t+1}\vert f_{t+1}) = \mathcal{N}(y_{t+1}\vert f_{t+1}, \sigma_n^2).$ (6)

Thus, all the distributions involved, (4), (5) and (6), are univariate normal with simple, known expressions. Using them, (3) can be evaluated and the posterior updates emerge

\begin{subequations}\begin{align}
p({\boldsymbol{\mathbf{f}}}_{t+1}\vert{\cal D}_{t+1}) &= \mathcal{N}({\boldsymbol{\mathbf{f}}}_{t+1}\vert\, {\boldsymbol{\mathbf{\mu}}}_{t+1}, {\boldsymbol{\mathbf{\Sigma}}}_{t+1}), \quad\text{with} \\
{\boldsymbol{\mathbf{\mu}}}_{t+1} &= \begin{bmatrix}{\boldsymbol{\mathbf{\mu}}}_t \\ \hat{y}_{t+1}\end{bmatrix} + \frac{y_{t+1}-\hat{y}_{t+1}}{\hat{\sigma}_{y,t+1}^2}\begin{bmatrix}{\boldsymbol{\mathbf{h}}}_{t+1} \\ \hat{\sigma}_{f,t+1}^2\end{bmatrix} \\
{\boldsymbol{\mathbf{\Sigma}}}_{t+1} &= \begin{bmatrix}{\boldsymbol{\mathbf{\Sigma}}}_t & {\boldsymbol{\mathbf{h}}}_{t+1} \\ {\boldsymbol{\mathbf{h}}}_{t+1}^\top & \hat{\sigma}_{f,t+1}^2\end{bmatrix} - \frac{1}{\hat{\sigma}_{y,t+1}^2}\begin{bmatrix}{\boldsymbol{\mathbf{h}}}_{t+1} \\ \hat{\sigma}_{f,t+1}^2\end{bmatrix}\begin{bmatrix}{\boldsymbol{\mathbf{h}}}_{t+1} \\ \hat{\sigma}_{f,t+1}^2\end{bmatrix}^\top.
\end{align}\end{subequations}

The inverse of the kernel matrix including the new input can also be easily computed using the corresponding low-rank update

$\displaystyle {\boldsymbol{\mathbf{K}}}_{t+1}^{-1} = {\boldsymbol{\mathbf{Q}}}_{t+1} = \begin{bmatrix}{\boldsymbol{\mathbf{Q}}}_t & {\boldsymbol{\mathbf{0}}} \\ {\boldsymbol{\mathbf{0}}}^\top & 0\end{bmatrix} + \frac{1}{\gamma_{t+1}^2}\begin{bmatrix}{\boldsymbol{\mathbf{q}}}_{t+1} \\ -1\end{bmatrix}\begin{bmatrix}{\boldsymbol{\mathbf{q}}}_{t+1} \\ -1\end{bmatrix}^\top.$ (8)

This illustrates both how probabilistic predictions for new observations can be made (using Eq. (5), which does not require knowledge of $ y_{t+1}$), and how these new observations can be included in the posterior once they are available. All computations for a given update can be made in $ \mathcal{O}(t^2)$ time, as is obvious from the update formulas. Only $ {\boldsymbol{\mathbf{\mu}}}_{t+1}$, $ {\boldsymbol{\mathbf{\Sigma}}}_{t+1}$ and $ {\boldsymbol{\mathbf{Q}}}_{t+1}$ will be reused in the next iteration, and the remaining quantities will be computed on demand.
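
A complete recursive step might then look as follows. This is a minimal sketch under the same naming assumptions as the snippets above (`kern`, `X_t`, etc. are illustrative), combining the prediction (5), the posterior update (7) and the inverse update (8):

```python
def update(kern, X_t, mu_t, Sigma_t, Q_t, x_new, y_new, sigma_n2):
    """One O(t^2) recursive step absorbing the observation (x_new, y_new)."""
    k_vec, k_ss, q_vec, gamma2 = new_input_quantities(kern, X_t, Q_t, x_new)
    y_hat, sy2, sf2, h_vec = predict(mu_t, Sigma_t, q_vec, gamma2, sigma_n2)

    # Posterior mean and covariance including the new point, Eq. (7)
    g = np.append(h_vec, sf2)                       # [h_{t+1}; sigma_f^2]
    mu_new = np.append(mu_t, y_hat) + (y_new - y_hat) / sy2 * g
    Sigma_new = np.block([[Sigma_t, h_vec[:, None]],
                          [h_vec[None, :], np.array([[sf2]])]]) - np.outer(g, g) / sy2

    # Grow the stored kernel-matrix inverse with the low-rank update, Eq. (8)
    r = np.append(q_vec, -1.0)
    Q_new = np.block([[Q_t, np.zeros((len(X_t), 1))],
                      [np.zeros((1, len(X_t))), np.array([[0.0]])]]) + np.outer(r, r) / gamma2

    X_new = np.vstack([X_t, np.atleast_2d(x_new)])
    return X_new, mu_new, Sigma_new, Q_new
```

Only `mu_new`, `Sigma_new` and `Q_new` (together with the inputs) need to be carried over to the next iteration, matching the storage requirements noted above.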

The recursion updates can be initialized by setting

$\displaystyle {\boldsymbol{\mathbf{\mu}}}_1 = \frac{y_{1}\, k({\boldsymbol{\mathbf{x}}}_1, {\boldsymbol{\mathbf{x}}}_1)}{\sigma_n^2 + k({\boldsymbol{\mathbf{x}}}_1, {\boldsymbol{\mathbf{x}}}_1)}$ (9)

$\displaystyle {\boldsymbol{\mathbf{\Sigma}}}_1 = k({\boldsymbol{\mathbf{x}}}_1, {\boldsymbol{\mathbf{x}}}_1) - \frac{k({\boldsymbol{\mathbf{x}}}_1, {\boldsymbol{\mathbf{x}}}_1)^2}{\sigma_n^2 + k({\boldsymbol{\mathbf{x}}}_1, {\boldsymbol{\mathbf{x}}}_1)}$ (10)

$\displaystyle {\boldsymbol{\mathbf{Q}}}_1 = \frac{1}{k({\boldsymbol{\mathbf{x}}}_1, {\boldsymbol{\mathbf{x}}}_1)},$ (11)

which corresponds to inference according to the proposed model for a single data point.
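
For completeness, a matching initialization under the same illustrative naming is:

```python
def initialize(kern, x_1, y_1, sigma_n2):
    """State after the first observation (x_1, y_1), Eqs. (9)-(11)."""
    k11 = kern(x_1, x_1)
    mu_1 = np.array([y_1 * k11 / (sigma_n2 + k11)])          # Eq. (9)
    Sigma_1 = np.array([[k11 - k11**2 / (sigma_n2 + k11)]])  # Eq. (10)
    Q_1 = np.array([[1.0 / k11]])                            # Eq. (11)
    X_1 = np.atleast_2d(x_1)
    return X_1, mu_1, Sigma_1, Q_1
```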

Since the model is exactly that of GP regression and all provided formulas are exact, probabilistic predictions made at time $ t$ for observation $ t+1$ are exactly the same as those obtained using a standard GP in the batch setting. Using the batch formulation from [11], we can equivalently express the predictive mean and variance from (5) as

\begin{subequations}\begin{align}
\hat{y}_{t+1} &= {\boldsymbol{\mathbf{k}}}_{t+1}^\top({\boldsymbol{\mathbf{K}}}_t + \sigma_n^2{\boldsymbol{\mathbf{I}}})^{-1}{\boldsymbol{\mathbf{y}}}_t \\
\hat{\sigma}_{y,t+1}^2 &= \sigma_n^2 + k_{t+1} - {\boldsymbol{\mathbf{k}}}_{t+1}^\top({\boldsymbol{\mathbf{K}}}_t + \sigma_n^2{\boldsymbol{\mathbf{I}}})^{-1}{\boldsymbol{\mathbf{k}}}_{t+1}.
\end{align}\end{subequations}

Direct application of (12a) and (12b) involves a higher, $ \mathcal{O}(t^3)$ cost, so the recursive procedure of the previous section is preferred. However, these equations are useful to illustrate the form of the predictive distribution after several iterations, which is somewhat obscured in the recursive formulation.
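
As a quick numerical sanity check (not part of the algorithm; `kern` and the variable names are the same illustrative ones used above), the batch expressions (12a) and (12b) can be evaluated from scratch and compared with the recursive prediction:

```python
def batch_predict(kern, X_t, y_t, x_new, sigma_n2):
    """O(t^3) batch GP prediction, Eqs. (12a)-(12b)."""
    K_t = np.array([[kern(a, b) for b in X_t] for a in X_t])
    k_vec = np.array([kern(a, x_new) for a in X_t])
    A = np.linalg.solve(K_t + sigma_n2 * np.eye(len(X_t)),
                        np.column_stack([y_t, k_vec]))
    y_hat = k_vec @ A[:, 0]                                  # Eq. (12a)
    sy2 = sigma_n2 + kern(x_new, x_new) - k_vec @ A[:, 1]    # Eq. (12b)
    return y_hat, sy2
```

Up to round-off error, its output should coincide with `predict` applied to the recursively maintained state.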

In the standard KRLS setting, the predictive mean is often expressed as $ \hat{y}_{t+1} = {\boldsymbol{\mathbf{k}}}_{t+1}^\top {\boldsymbol{\mathbf{\alpha}}}_t$, where $ {\boldsymbol{\mathbf{\alpha}}}_t$ contains the weight of each kernel. In the batch formulation these weights are obtained as $ {\boldsymbol{\mathbf{\alpha}}}_t = ({\boldsymbol{\mathbf{K}}}_t + \sigma_n^2{\boldsymbol{\mathbf{I}}})^{-1}{\boldsymbol{\mathbf{y}}}_t$, whereas in our recursive formulation the same result is obtained at each step $ t$ by computing $ {\boldsymbol{\mathbf{\alpha}}}_t = {\boldsymbol{\mathbf{K}}}_t^{-1}{\boldsymbol{\mathbf{\mu}}}_t = {\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\mu}}}_t$. Observe the resemblance between the two formulations: the batch formulation uses the noisy observations $ {\boldsymbol{\mathbf{y}}}_t$, so the kernel matrix includes the regularization term $ \sigma_n^2{\boldsymbol{\mathbf{I}}}$, whereas the recursive formulation uses $ {\boldsymbol{\mathbf{\mu}}}_t$, the posterior mean of the noiseless function at the inputs, so no noise term is added. Both formulations yield the same value of $ {\boldsymbol{\mathbf{\alpha}}}_t$.
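
The same equivalence can be checked numerically; a short sketch under the previous assumptions:

```python
def compare_alphas(kern, X_t, y_t, mu_t, Q_t, sigma_n2):
    """alpha_t computed via the batch and the recursive routes should agree."""
    K_t = np.array([[kern(a, b) for b in X_t] for a in X_t])
    alpha_batch = np.linalg.solve(K_t + sigma_n2 * np.eye(len(X_t)), y_t)
    alpha_recursive = Q_t @ mu_t
    return np.allclose(alpha_batch, alpha_recursive)
```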

Steven Van Vaerenbergh
Last modified: 2011-09-20