A general forgetting setup

After several inclusion-deletion steps, all information available up to time $ t$ has (approximately) been stored in the posterior density over the dictionary bases $ {\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t\sim\mathcal{N}({\boldsymbol{\mathbf{f}}}_t\vert{\boldsymbol{\mathbf{\mu}}}_t,{\boldsymbol{\mathbf{\Sigma}}}_t)$. Inserting this $ p({\boldsymbol{\mathbf{f}}}_t\vert{\cal D}_t)$ in Eq. (2) and solving the integral, we can obtain the implied posterior GP over the whole input space,

$\displaystyle f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big({\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{\mu}}}_t,~k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')+{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t({\boldsymbol{\mathbf{\Sigma}}}_t-{\boldsymbol{\mathbf{K}}}_t){\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}}')\big),$ (14)

where $ {\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})$ is the vector of covariances between $ {\boldsymbol{\mathbf{x}}}$ and all the bases in the dictionary at time $ t$. Observe that (14) has the same form as the prediction equation (5), but extended to the whole input space.
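As a numerical illustration, the following minimal sketch (NumPy; it assumes a squared-exponential kernel and uses illustrative variable names) evaluates the mean and variance of (14) at a single test input, given the dictionary bases and the posterior moments $ {\boldsymbol{\mathbf{\mu}}}_t$, $ {\boldsymbol{\mathbf{\Sigma}}}_t$:

import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def posterior_gp(x, D, mu_t, Sigma_t):
    # Mean and variance of Eq. (14) at a single test input x (shape (d,)),
    # given the dictionary bases D (m x d), the posterior mean mu_t (m,)
    # and the posterior covariance Sigma_t (m x m) over the dictionary.
    K_t = rbf(D, D)                    # kernel matrix of the dictionary
    Q_t = np.linalg.inv(K_t)           # Q_t = K_t^{-1}
    k_t = rbf(D, x[None, :]).ravel()   # covariances k_t(x)
    mean = k_t @ Q_t @ mu_t
    var = rbf(x[None, :], x[None, :])[0, 0] \
          + k_t @ Q_t @ (Sigma_t - K_t) @ Q_t @ k_t
    return mean, var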

In order for KRLS to adapt to non-stationary environments, it must be able to ``forget'' past samples, i.e., the posterior $ p(f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t)$ must be intentionally forced to lose some information. A very general approach is to linearly combine $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ with another, independent GP $ n({\boldsymbol{\mathbf{x}}})$ that holds no information about the data. Since the posterior after forgetting is a linear combination of two GPs, it is itself a GP, and we denote it as

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t = \alpha f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t + \beta n({\boldsymbol{\mathbf{x}}}),$ (15)

where $ \alpha,\beta>0$ are used to control the trade-off between the informative GP $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ and the uninformative ``forgetting noise'' $ n({\boldsymbol{\mathbf{x}}})$.
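Since $ f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t$ and $ n({\boldsymbol{\mathbf{x}}})$ are independent, the moments of this combination follow directly. Writing $ n({\boldsymbol{\mathbf{x}}})\sim\mathcal{GP}(m_n({\boldsymbol{\mathbf{x}}}),k_n({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}'))$, a notation introduced here only for illustration,

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big(\alpha\,\mathbb{E}[f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t] + \beta\, m_n({\boldsymbol{\mathbf{x}}}),~\alpha^2\,\mathrm{cov}[f({\boldsymbol{\mathbf{x}}}),f({\boldsymbol{\mathbf{x}}}')\vert{\cal D}_t] + \beta^2\, k_n({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')\big),$

so $ \alpha$ and $ \beta$ directly scale how much of the data-driven moments survives and how much is replaced by those of the forgetting noise.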

The posterior GP after forgetting, $ p(\breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t)$, should be expressible in terms of a distribution over the latent points in the dictionary (to avoid a budget increase). We will refer to this distribution as $ \mathcal{N}(\breve {\boldsymbol{\mathbf{\mu}}}_t, \breve {\boldsymbol{\mathbf{\Sigma}}}_t)$. Using Eq. (2) again, the posterior after forgetting in terms of $ \breve {\boldsymbol{\mathbf{\mu}}}_t$ and $ \breve {\boldsymbol{\mathbf{\Sigma}}}_t$ is

$\displaystyle \breve f({\boldsymbol{\mathbf{x}}})\vert{\cal D}_t \sim \mathcal{GP}\big({\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t\breve{\boldsymbol{\mathbf{\mu}}}_t,~k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}')+{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}})^\top{\boldsymbol{\mathbf{Q}}}_t(\breve{\boldsymbol{\mathbf{\Sigma}}}_t-{\boldsymbol{\mathbf{K}}}_t){\boldsymbol{\mathbf{Q}}}_t{\boldsymbol{\mathbf{k}}}_t({\boldsymbol{\mathbf{x}}}')\big).$ (16)

Different definitions for $ \alpha$, $ \beta$ and $ n({\boldsymbol{\mathbf{x}}})$ will result in different types of forgetting. One reasonable approach is discussed next.
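For illustration only, suppose the forgetting noise restricted to the dictionary bases has mean $ {\boldsymbol{\mathbf{\mu}}}_n$ and covariance $ {\boldsymbol{\mathbf{\Sigma}}}_n$ (names introduced here solely for this sketch). Matching (15) and (16) at the dictionary points then gives the dictionary-level update below, which can be fed straight back into the predictive form (14):

import numpy as np

def forget(mu_t, Sigma_t, alpha, beta, Sigma_n, mu_n=None):
    # Dictionary-level moments implied by Eq. (15), assuming the forgetting
    # noise n(x), restricted to the dictionary bases, has mean mu_n
    # (zero by default) and covariance Sigma_n.
    if mu_n is None:
        mu_n = np.zeros_like(mu_t)
    mu_breve = alpha * mu_t + beta * mu_n
    Sigma_breve = alpha ** 2 * Sigma_t + beta ** 2 * Sigma_n
    return mu_breve, Sigma_breve

Reusing the earlier sketch of (14), a prediction after forgetting would read mean, var = posterior_gp(x, D, *forget(mu_t, Sigma_t, alpha, beta, Sigma_n)).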
