How to optimally remove a basis

After the inclusion of several observations, we are left with a posterior of the form $ p({\boldsymbol{\mathbf{f}}}_t, f_{t+1}\vert{\cal D}_{t+1}) = \mathcal{N}({\boldsymbol{\mathbf{f}}}_t, f_{t+1}\vert {\boldsymbol{\mathbf{\mu}}}_{t+1}, {\boldsymbol{\mathbf{\Sigma}}}_{t+1})$. Without loss of generality we will assume that we want to remove the basis corresponding to $ f_{t+1}$. To this end we can approximate $ p({\boldsymbol{\mathbf{f}}}_t, f_{t+1}\vert{\cal D}_{t+1})$ with the product of $ p(f_{t+1}\vert{\boldsymbol{\mathbf{f}}}_t)$ (which is independent of the data) and some distribution $ q({\boldsymbol{\mathbf{f}}}_t)$ that does not depend on the removed basis. The optimal form of $ q({\boldsymbol{\mathbf{f}}}_t)$ is then obtained by minimizing the Kullback-Leibler (KL) divergence between the exact and approximate posteriors, KL$ (p({\boldsymbol{\mathbf{f}}}_{t+1}\vert{\cal D}_{t+1}) \vert\vert p(f_{t+1}\vert{\boldsymbol{\mathbf{f}}}_t) q({\boldsymbol{\mathbf{f}}}_t))$, which yields $ q({\boldsymbol{\mathbf{f}}}_t) = p({\boldsymbol{\mathbf{f}}}_{t}\vert{\cal D}_{t+1})$. Unsurprisingly, the optimal way to remove a basis from the posterior within this Bayesian framework is simply to marginalize it out.
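
To make the last step explicit, note that the only term of this KL divergence that depends on $ q$ is a cross-entropy with respect to the marginal posterior (a short sketch in the notation above):

$\displaystyle \mathrm{KL}\big(p({\boldsymbol{\mathbf{f}}}_t, f_{t+1}\vert{\cal D}_{t+1}) \,\vert\vert\, p(f_{t+1}\vert{\boldsymbol{\mathbf{f}}}_t)\, q({\boldsymbol{\mathbf{f}}}_t)\big) = -\mathbb{E}_{p({\boldsymbol{\mathbf{f}}}_{t}\vert{\cal D}_{t+1})}\left[\log q({\boldsymbol{\mathbf{f}}}_t)\right] + \mathrm{const},$

where the constant collects every term that does not involve $ q$. By Gibbs' inequality, this cross-entropy is minimized exactly when $ q({\boldsymbol{\mathbf{f}}}_t) = p({\boldsymbol{\mathbf{f}}}_{t}\vert{\cal D}_{t+1})$.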

Marginalizing out a variable in a joint Gaussian distribution amounts to removing the corresponding element from its mean vector and the corresponding row and column from its covariance matrix, so the removal equations become $ {\boldsymbol{\mathbf{\mu}}}_{t+1} \leftarrow [{\boldsymbol{\mathbf{\mu}}}_{t+1}]_{-i}$ and $ {\boldsymbol{\mathbf{\Sigma}}}_{t+1} \leftarrow[{\boldsymbol{\mathbf{\Sigma}}}_{t+1}]_{-i, -i}$, where the notation $ [\cdot]_{-i}$ refers to a vector from which the $ i$-th element has been removed, and $ [\cdot]_{-i, -i}$ to a matrix from which the $ i$-th row and column have been removed. Following this notation, we will use $ [\cdot]_{-i, i}$ to refer to the $ i$-th column of a matrix, excluding the element in the $ i$-th row.
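
As an illustration, a minimal NumPy sketch of this pruning step could look as follows (the names mu, Sigma and the index i are placeholders, not notation taken from this document):

import numpy as np

def remove_basis(mu, Sigma, i):
    # Marginalize out the i-th basis of a Gaussian posterior N(mu, Sigma).
    mu_new = np.delete(mu, i)                                       # [mu]_{-i}
    Sigma_new = np.delete(np.delete(Sigma, i, axis=0), i, axis=1)   # [Sigma]_{-i,-i}
    return mu_new, Sigma_new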

The $ i$-th basis can be removed from $ {\boldsymbol{\mathbf{Q}}}_{t+1}$ using

$\displaystyle {\boldsymbol{\mathbf{Q}}}_{t+1} \leftarrow [{\boldsymbol{\mathbf{Q}}}_{t+1}]_{-i,-i} - \frac{[{\boldsymbol{\mathbf{Q}}}_{t+1}]_{-i,i}[{\boldsymbol{\mathbf{Q}}}_{t+1}]_{-i,i}^\top}{[{\boldsymbol{\mathbf{Q}}}_{t+1}]_{i,i}}.$ (13)
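
A corresponding NumPy sketch of Eq. (13), under the assumption that $ {\boldsymbol{\mathbf{Q}}}_{t+1}$ is stored as a dense square array Q:

import numpy as np

def remove_basis_from_Q(Q, i):
    # Downdate Q when the i-th basis is removed, following Eq. (13).
    q_col = np.delete(Q[:, i], i)                             # [Q]_{-i,i}
    Q_red = np.delete(np.delete(Q, i, axis=0), i, axis=1)     # [Q]_{-i,-i}
    return Q_red - np.outer(q_col, q_col) / Q[i, i]

If $ {\boldsymbol{\mathbf{Q}}}_{t+1}$ denotes an inverse (kernel) matrix, as is common in this setting, this rank-one downdate avoids recomputing the inverse of the reduced matrix from scratch.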

Additionally, it can be proved that whenever $ \gamma^2_{t+1}$ is (numerically) zero, the above KL divergence is also zero, i.e., discarding the last basis produces no information loss. In such cases, after updating the posterior with Eq. (7), we can immediately prune the last row and column. This is sometimes known in the literature as a reduced update, and it is especially useful because it avoids update (8), which would be ill-defined in this case.
