A general forgetting setup

After several inclusion-deletion steps, all information available up to time $ t$ has (approximately) been stored in the posterior density over the dictionary bases $ {\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t\sim\mathcal{N}({\boldsymbol{\mathbf{f}}}_t\vert{\boldsymbol{\mathbf{\mu}}}_t,{\boldsymbol{\mathbf{\Sigma}}}_t)$. Inserting this $ p({\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t)$ in Eq. (2) and solving the integral, we can obtain the implied posterior GP over the whole input space,

$\displaystyle f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big({\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\mu}}}_t,~k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')+{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t({\boldsymbol{\mathbf{\Sigma}}}_t-{\boldsymbol{\mathbf{K}}}_t){\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}}')\big),$ (14)

where $ {\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})$ is the vector of covariances between $ {\boldsymbol{\mathbf{x}}}$ and all the bases in the dictionary at time $ t$. Observe that (14) has the same form as the prediction equation (5), but extended to the whole input space.
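As a numerical illustration, the following minimal sketch (NumPy; it assumes a squared-exponential kernel and uses illustrative variable names) evaluates the mean and variance of (14) at a single test input, given the dictionary bases and the posterior moments $ {\boldsymbol{\mathbf{\mu}}}_t$, $ {\boldsymbol{\mathbf{\Sigma}}}_t$:

import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def posterior_gp(x, D, mu_t, Sigma_t):
    # Mean and variance of Eq. (14) at a single test input x (shape (d,)),
    # given the dictionary bases D (m x d), the posterior mean mu_t (m,)
    # and the posterior covariance Sigma_t (m x m) over the dictionary.
    K_t = rbf(D, D)                    # kernel matrix of the dictionary
    Q_t = np.linalg.inv(K_t)           # Q_t = K_t^{-1}
    k_t = rbf(D, x[None, :]).ravel()   # covariances k_t(x)
    mean = k_t @ Q_t @ mu_t
    var = rbf(x[None, :], x[None, :])[0, 0] \
          + k_t @ Q_t @ (Sigma_t - K_t) @ Q_t @ k_t
    return mean, var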

In order for KRLS to adapt to non-stationary environments, it must be able to ``forget'' past samples, i.e., the posterior $ p(f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t)$ must be intentionally forced to lose some information. A very general approach is to linearly combine $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ with another, independent GP $ n({\boldsymbol{\mathbf{x}}})$ that holds no information about the data. Since the posterior after forgetting is a linear combination of two GPs, it is itself a GP, and we denote it as

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t = \alpha f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t + \beta n({\boldsymbol{\mathbf{x}}}),$ (15)

where $ \alpha,\beta>0$ are used to control the trade-off between the informative GP $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ and the uninformative ``forgetting noise'' $ n({\boldsymbol{\mathbf{x}}})$.
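Since $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ and $ n({\boldsymbol{\mathbf{x}}})$ are independent, the moments of this combination follow directly. Writing $ n({\boldsymbol{\mathbf{x}}})\sim\mathcal{GP}(m_n({\boldsymbol{\mathbf{x}}}),k_n({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'))$, a notation introduced here only for illustration,

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big(\alpha\,\mathbb{E}[f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t] + \beta\, m_n({\boldsymbol{\mathbf{x}}}),~\alpha^2\,\mathrm{cov}[f({\boldsymbol{\mathbf{x}}}),f({\boldsymbol{\mathbf{x}}}')\vert{\cal D}_t] + \beta^2\, k_n({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')\big),$

so $ \alpha$ and $ \beta$ directly scale how much of the data-driven moments survives and how much is replaced by those of the forgetting noise.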

The posterior GP after forgetting, $ p(\breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t)$, should be expressible in terms of a distribution over the latent points in the dictionary (to avoid a budget increase). We will refer to this distribution as $ \mathcal{N}(\breve {\boldsymbol{\mathbf{\mu}}}_t, \breve {\boldsymbol{\mathbf{\Sigma}}}_t)$. Using Eq. (2) again, the posterior after forgetting in terms of $ \breve {\boldsymbol{\mathbf{\mu}}}_t$ and $ \breve {\boldsymbol{\mathbf{\Sigma}}}_t$ is

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big({\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t\breve{\boldsymbol{\mathbf{\mu}}}_t,~k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')+{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t(\breve{\boldsymbol{\mathbf{\Sigma}}}_t-{\boldsymbol{\mathbf{K}}}_t){\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}}')\big).$ (16)

Different definitions for $ \alpha$, $ \beta$ and $ n({\boldsymbol{\mathbf{x}}})$ will result in different types of forgetting. One reasonable approach is discussed next.
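For illustration only, suppose the forgetting noise restricted to the dictionary bases has mean $ {\boldsymbol{\mathbf{\mu}}}_n$ and covariance $ {\boldsymbol{\mathbf{\Sigma}}}_n$ (names introduced here solely for this sketch). Matching (15) and (16) at the dictionary points then gives the dictionary-level update below, which can be fed straight back into the predictive form (14):

import numpy as np

def forget(mu_t, Sigma_t, alpha, beta, Sigma_n, mu_n=None):
    # Dictionary-level moments implied by Eq. (15), assuming the forgetting
    # noise n(x), restricted to the dictionary bases, has mean mu_n
    # (zero by default) and covariance Sigma_n.
    if mu_n is None:
        mu_n = np.zeros_like(mu_t)
    mu_breve = alpha * mu_t + beta * mu_n
    Sigma_breve = alpha ** 2 * Sigma_t + beta ** 2 * Sigma_n
    return mu_breve, Sigma_breve

Reusing the earlier sketch of (14), a prediction after forgetting would read mean, var = posterior_gp(x, D, *forget(mu_t, Sigma_t, alpha, beta, Sigma_n)).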
