Back-to-the-prior forgetting

First, we select the form of the GP $ n({\boldsymbol{\mathbf{x}}})$ that acts as noise. This GP holds no information about the data and is independent of $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$. Assume for a moment that we want to forget all past data. In this case we must set $ \alpha=0$ to completely remove the informative GP, so that our posterior GP becomes $ \beta n({\boldsymbol{\mathbf{x}}})$. The distribution of the posterior when no data has been observed should, by definition, be equal to the prior. Therefore $ n({\boldsymbol{\mathbf{x}}})$ must be a scaled version of the GP prior. Without loss of generality, we can choose this scale to be 1, so that the noise GP becomes $ n({\boldsymbol{\mathbf{x}}}) \sim \mathcal{GP}(0,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'))$. With this choice, setting $ \alpha=0$ should imply $ \beta = 1$, which, as we will see later, is indeed the case. Observe that $ n({\boldsymbol{\mathbf{x}}})$ corresponds to colored noise, using the same coloring as the prior.
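
As a quick sanity check, the combination defined in (15) has the form $ \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t = \alpha\, f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t + \beta\, n({\boldsymbol{\mathbf{x}}})$, so setting $ \alpha=0$ and $ \beta=1$ leaves

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t = n({\boldsymbol{\mathbf{x}}}) \sim \mathcal{GP}(0,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')),
$

which is exactly the prior, as intended.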

Once $ n({\boldsymbol{\mathbf{x}}})$ has been defined, the distribution of $ \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ can be obtained from its definition (15). Since both GPs are independent, their linear combination is distributed as

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big(\alpha{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\mu}}}_t,~(\alpha^2+\beta^2)k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')+\alpha^2{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t({\boldsymbol{\mathbf{\Sigma}}}_t-{\boldsymbol{\mathbf{K}}}_t){\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}}')\big).$ (17)
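
To see where (17) comes from: since the two GPs in (15) are independent, scaling them by $ \alpha$ and $ \beta$ and adding them gives

$\displaystyle \mathbb{E}\left[\breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t\right] = \alpha\,\mathbb{E}\left[f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t\right], \qquad \mathrm{cov}\left[\breve f({\boldsymbol{\mathbf{x}}}),\breve f({\boldsymbol{\mathbf{x}}}')\vert{\cal D}_t\right] = \alpha^2\,\mathrm{cov}\left[f({\boldsymbol{\mathbf{x}}}),f({\boldsymbol{\mathbf{x}}}')\vert{\cal D}_t\right] + \beta^2 k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'),
$

and substituting the posterior mean and covariance from (16) and collecting the $ k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')$ terms yields (17).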

Comparing (16) and (17) and identifying terms, we obtain

$\displaystyle \breve {\boldsymbol{\mathbf{\mu}}}_t = \alpha{\boldsymbol{\mathbf{\mu}}}_t;~~~~\breve {\boldsymbol{\mathbf{\Sigma}}}_t = \alpha^2{\boldsymbol{\mathbf{\Sigma}}}_t+(1-\alpha^2){\boldsymbol{\mathbf{K}}}_t;~~~~\alpha^2 + \beta^2 = 1
$

which provides the relationship between the posterior distribution before and after forgetting occurs. Forgetting depends on a single positive parameter, $ \alpha$, and one can find the corresponding $ \beta = \sqrt{1-\alpha^2}$. This latter equation implies that $ \alpha$ cannot be bigger than 1. Its values are therefore in the range from 0 (all past data is forgotten and we arrive back at the prior) to 1 (no forgetting occurs and we are left with the original, unmodified posterior). Reparameterizing $ \alpha^2=\lambda$ for convenience, the forgetting updates are finally:
\begin{subequations}\begin{align}
{\boldsymbol{\mathbf{\Sigma}}}_t &\leftarrow \lambda{\boldsymbol{\mathbf{\Sigma}}}_t+(1-\lambda){\boldsymbol{\mathbf{K}}}_t\\
{\boldsymbol{\mathbf{\mu}}}_t &\leftarrow \sqrt{\lambda}{\boldsymbol{\mathbf{\mu}}}_t
\end{align}\end{subequations}

where $ \lambda\in (0,1]$ denotes the forgetting factor. The smaller the value of $ \lambda$, the faster the algorithm can track changes (and the less it is able to learn, since information is quickly discarded). Usually, only values in the $ [0.95,1]$ range are sensible. We call this technique ``back-to-the-prior'' forgetting.
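
As an illustration, the following is a minimal NumPy sketch of this forgetting step, operating on the finite-dimensional posterior over the observed inputs. The names forget, mu, Sigma, K and lam are hypothetical, chosen only to mirror $ {\boldsymbol{\mathbf{\mu}}}_t$, $ {\boldsymbol{\mathbf{\Sigma}}}_t$, $ {\boldsymbol{\mathbf{K}}}_t$ and $ \lambda$ above; it is a sketch of the update, not a particular implementation.

import numpy as np

def forget(mu, Sigma, K, lam):
    """Back-to-the-prior forgetting step.

    mu    : posterior mean over the observed inputs, shape (n,)
    Sigma : posterior covariance over the observed inputs, shape (n, n)
    K     : prior kernel matrix over the same inputs, shape (n, n)
    lam   : forgetting factor, 0 < lam <= 1
    """
    # Blend the posterior covariance back towards the prior covariance
    Sigma_new = lam * Sigma + (1.0 - lam) * K
    # Shrink the posterior mean towards the prior mean (zero)
    mu_new = np.sqrt(lam) * mu
    return mu_new, Sigma_new

With lam equal to 1 the posterior is left untouched, while values close to 0 collapse it towards the prior $ \mathcal{GP}(0,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'))$, matching the two limiting cases discussed above.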

Steven Van Vaerenbergh
Last modified: 2011-09-20