Online regression on stationary data

In the first experiment we apply the algorithms to perform online regression on the KIN40K data set. This set contains $ 40,000$ examples, each consisting of an $ 8$-dimensional input vector and a scalar output, representing the forward kinematics of an $ 8$-link all-revolute robot arm. We randomly selected $ 10,000$ data points for training and used the remaining $ 30,000$ points for testing the regression.
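For concreteness, the following is a minimal sketch of this split, assuming the KIN40K examples are available as NumPy arrays; the data loading is replaced by random placeholders of the correct shape, and the random seed is arbitrary.

import numpy as np

# Minimal sketch of the 10,000/30,000 split described above. Loading KIN40K
# itself is not shown; the arrays below are random placeholders with the
# correct shapes (8-dimensional inputs, scalar outputs).
rng = np.random.default_rng(0)
X = rng.standard_normal((40000, 8))   # placeholder for the KIN40K input vectors
y = rng.standard_normal(40000)        # placeholder for the scalar outputs

perm = rng.permutation(len(X))
train_idx, test_idx = perm[:10000], perm[10000:]
X_train, y_train = X[train_idx], y[train_idx]   # 10,000 points used for online training
X_test, y_test = X[test_idx], y[test_idx]       # 30,000 points held out for NMSE evaluation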

Figure 1: NMSE performance of KRLS algorithms on the KIN40K regression problem, each using a dictionary of $ M=500$ bases.

An anisotropic Gaussian kernel was used, whose hyperparameters were determined off-line by standard GP regression. In particular, the noise-to-signal ratio (regularization) was $ \sigma_n^2/\sigma_0^2=0.0021$. The first algorithm was SW-KRLS with a window of $ 500$ data points. The second algorithm was ALD-KRLS with sensitivity $ \nu=0.45$; for this parameter value, the final dictionary contained exactly $ 500$ bases. We also ran ALD-KRLS with $ \nu=0$ and stopped its dictionary expansion once it contained $ 500$ bases. After this point the dictionary is left unchanged, but ALD-KRLS continues to adapt by performing reduced updates of its remaining parameters. The last algorithm was the proposed KRLS-T algorithm, with a dictionary size of $ 500$ bases and $ \lambda=1$. To prune its dictionary, we used the slower criterion that minimizes the KL divergence (``fullKL'') in one test and the simpler MSE-based criterion in another.
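As an illustration, the sketch below shows one common form of anisotropic Gaussian (ARD) kernel, with a separate length-scale per input dimension. The parameterization and the placeholder values are assumptions; only the noise-to-signal ratio $ \sigma_n^2/\sigma_0^2=0.0021$ is taken from the text, the remaining hyperparameter values obtained by GP regression are not reported here.

import numpy as np

def ard_gaussian_kernel(X1, X2, lengthscales, sigma0_sq=1.0):
    # Anisotropic (ARD) Gaussian kernel, one length-scale per input dimension:
    # k(x, x') = sigma0^2 * exp(-0.5 * sum_d ((x_d - x'_d) / ell_d)^2)
    S1 = X1 / lengthscales
    S2 = X2 / lengthscales
    sq_dists = (np.sum(S1**2, axis=1)[:, None]
                + np.sum(S2**2, axis=1)[None, :]
                - 2.0 * S1 @ S2.T)
    return sigma0_sq * np.exp(-0.5 * np.maximum(sq_dists, 0.0))

# Placeholder hyperparameters: in the experiment these were obtained off-line by
# standard GP regression. Only the ratio sigma_n^2 / sigma0^2 = 0.0021 comes from
# the text; the length-scales and signal power below are illustrative.
lengthscales = np.ones(8)
sigma0_sq = 1.0
sigma_n_sq = 0.0021 * sigma0_sq   # noise power used as regularization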

Each algorithm performed a single run over the data. Performance was measured by calculating the normalized mean-square error (NMSE) on the test set at different points throughout the training run. The results are displayed in Fig. 1. During the first $ 500$ iterations, four of the algorithms show identical performance, since they accept every observed data point into their dictionaries. The fifth algorithm, ALD-KRLS with $ \nu=0.45$, grows its dictionary more slowly and therefore starts with a larger error. From iteration $ 501$ onwards, SW-KRLS maintains its performance level, since it performs regression only on the $ 500$ most recent samples. ALD-KRLS selects the most relevant bases (as a growing set) and therefore converges to a better NMSE. While the sensitivity level $ \nu=0.45$ results in slower convergence, it eventually achieves results similar to those obtained with $ \nu=0$. The KRLS-T algorithm outperforms all others, since it is able to properly weight all samples and to trade weaker bases in the dictionary for more relevant ones during the entire training process. Interestingly, it obtains similar results with the simple MSE-based pruning criterion and the computationally more expensive KL-based criterion.
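The evaluation protocol is easy to state in code: each learner sees the training data exactly once, and the NMSE on the held-out test set is recorded at a set of checkpoints. The sketch below assumes a hypothetical online learner object exposing update(x, y) and predict(X) methods (not the interface of any particular implementation), and the checkpoint positions are left to the caller.

import numpy as np

def nmse(y_true, y_pred):
    # Normalized mean-square error: MSE divided by the variance of the test targets.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def run_online(learner, X_train, y_train, X_test, y_test, checkpoints):
    # Single pass over the training data; the test NMSE is recorded whenever the
    # iteration counter reaches one of the checkpoints.
    curve = []
    for i, (x, t) in enumerate(zip(X_train, y_train), start=1):
        learner.update(x, t)                # one online update per training sample
        if i in checkpoints:
            curve.append((i, nmse(y_test, learner.predict(X_test))))
    return curve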
