Back-to-the-prior forgetting

First, we select the form of the GP $ n({\boldsymbol{\mathbf{x}}})$ that acts as noise. This GP holds no information about the data and is independent of $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$. Assume for a moment that we want to forget all past data. In this case we must set $ \alpha=0$ to completely remove the informative GP, so that our posterior GP becomes $ \beta n({\boldsymbol{\mathbf{x}}})$. The distribution of the posterior when no data has been observed should, by definition, be equal to the prior. Therefore $ n({\boldsymbol{\mathbf{x}}})$ must be a scaled version of the GP prior. Without loss of generality, we can choose this scale to be 1, so that the noise GP becomes $ n({\boldsymbol{\mathbf{x}}}) \sim \mathcal{GP}(0,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'))$. With this choice, setting $ \alpha=0$ should imply $ \beta = 1$, which, as we will see later, is indeed the case. Observe that $ n({\boldsymbol{\mathbf{x}}})$ corresponds to colored noise, using the same coloring as the prior.
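
As a quick sanity check, the combination defined in (15) has the form $ \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t = \alpha\, f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t + \beta\, n({\boldsymbol{\mathbf{x}}})$, so setting $ \alpha=0$ and $ \beta=1$ leaves

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t = n({\boldsymbol{\mathbf{x}}}) \sim \mathcal{GP}(0,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')),
$

which is exactly the prior, as intended.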

Once $ n({\boldsymbol{\mathbf{x}}})$ has been defined, the distribution of $ \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ can be obtained from its definition (15). Since both GPs are independent, their linear combination is distributed as

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big(\alpha{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\mu}}}_t,~(\alpha^2+\beta^2)k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')+\alpha^2{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t({\boldsymbol{\mathbf{\Sigma}}}_t-{\boldsymbol{\mathbf{K}}}_t){\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}}')\big).$ (17)
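
To see where (17) comes from: since the two GPs in (15) are independent, scaling them by $ \alpha$ and $ \beta$ and adding them gives

$\displaystyle \mathbb{E}\left[\breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t\right] = \alpha\,\mathbb{E}\left[f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t\right], \qquad \mathrm{cov}\left[\breve f({\boldsymbol{\mathbf{x}}}),\breve f({\boldsymbol{\mathbf{x}}}')\vert{\cal D}_t\right] = \alpha^2\,\mathrm{cov}\left[f({\boldsymbol{\mathbf{x}}}),f({\boldsymbol{\mathbf{x}}}')\vert{\cal D}_t\right] + \beta^2 k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'),
$

and substituting the posterior mean and covariance from (16) and collecting the $ k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')$ terms yields (17).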

Comparing (16) and (17) and identifying terms, we obtain

$\displaystyle \breve {\boldsymbol{\mathbf{\mu}}}_t = \alpha{\boldsymbol{\mathbf{\mu}}}_t;~~~~\breve {\boldsymbol{\mathbf{\Sigma}}}_t = \alpha^2{\boldsymbol{\mathbf{\Sigma}}}_t+(1-\alpha^2){\boldsymbol{\mathbf{K}}}_t;~~~~\alpha^2 + \beta^2 = 1
$

which provides the relationship between the posterior distribution before and after forgetting occurs. Forgetting depends on a single positive parameter, $ \alpha$, and one can find the corresponding $ \beta = \sqrt{1-\alpha^2}$. This latter equation implies that $ \alpha$ cannot be bigger than 1. Its values are therefore in the range from 0 (all past data is forgotten and we arrive back at the prior) to 1 (no forgetting occurs and we are left with the original, unmodified posterior). Reparameterizing $ \alpha^2=\lambda$ for convenience, the forgetting updates are finally:
\begin{subequations}\begin{align}
{\boldsymbol{\mathbf{\Sigma}}}_t &\leftarrow \lambda{\boldsymbol{\mathbf{\Sigma}}}_t+(1-\lambda){\boldsymbol{\mathbf{K}}}_t\\
{\boldsymbol{\mathbf{\mu}}}_t &\leftarrow \sqrt{\lambda}{\boldsymbol{\mathbf{\mu}}}_t
\end{align}\end{subequations}

where $ \lambda\in (0,1]$ denotes the forgetting factor. The smaller the value of $ \lambda$, the faster the algorithm can track changes (and the less it is able to learn, since information is quickly discarded). Usually, only values in the $ [0.95,1]$ range are sensible. We call this technique ``back-to-the-prior'' forgetting.
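
As an illustration, the following is a minimal NumPy sketch of this forgetting step, operating on the finite-dimensional posterior over the observed inputs. The names forget, mu, Sigma, K and lam are hypothetical, chosen only to mirror $ {\boldsymbol{\mathbf{\mu}}}_t$, $ {\boldsymbol{\mathbf{\Sigma}}}_t$, $ {\boldsymbol{\mathbf{K}}}_t$ and $ \lambda$ above; it is a sketch of the update, not a particular implementation.

import numpy as np

def forget(mu, Sigma, K, lam):
    """Back-to-the-prior forgetting step.

    mu    : posterior mean over the observed inputs, shape (n,)
    Sigma : posterior covariance over the observed inputs, shape (n, n)
    K     : prior kernel matrix over the same inputs, shape (n, n)
    lam   : forgetting factor, 0 < lam <= 1
    """
    # Blend the posterior covariance back towards the prior covariance
    Sigma_new = lam * Sigma + (1.0 - lam) * K
    # Shrink the posterior mean towards the prior mean (zero)
    mu_new = np.sqrt(lam) * mu
    return mu_new, Sigma_new

With lam equal to 1 the posterior is left untouched, while values close to 0 collapse it towards the prior $ \mathcal{GP}(0,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'))$, matching the two limiting cases discussed above.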

Steven Van Vaerenbergh
Last modified: 2011-09-20