Introduction

During the past decade there has been a growing interest in kernel methods, motivated by their successful applications in many fields such as image processing, biomedical engineering and communications. Similar to neural networks, kernel methods are universal nonlinear approximators, but they show the advantage of yielding convex optimization problems. Popular kernel-based algorithms include support vector machines (SVM) and kernel principal component analysis (KPCA) [1].

An online scenario assumes that data is received as a stream of input-output patterns $ \{({\mathbf x}_1,y_1),({\mathbf x}_2,y_2),\dots\}$, in which $ {\mathbf x}_i$ is a vector and $ y_i$ is a scalar, for $ i=1,2,\dots$. Note that we assume a supervised setting in which both input and output data are provided. The goal of an online algorithm is to update its solution in every iteration $ n$, based on the newly available pattern $ ({\mathbf x}_n,y_n)$, while maintaining a low computational complexity. In their classic formulation, kernel methods are not suitable for online applications, as their functional representation grows linearly with the number of processed patterns. This poses both computational and memory issues. Therefore, substantial effort has gone into designing ``sparsification'' techniques that curb the growth of the networks constructed by these methods.
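To illustrate the growth problem, the following minimal sketch implements a naive online kernel filter with a KLMS-style update. The class structure, learning rate, and kernel width are illustrative assumptions, not part of any algorithm discussed here; the point is that, without sparsification, every processed pattern is stored as a new kernel unit, so the expansion grows linearly with the stream:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian (RBF) kernel between two input vectors.
    d = x - y
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

class NaiveOnlineKernelFilter:
    """KLMS-style online filter without sparsification: the
    functional representation stores every processed input, so
    memory and per-prediction cost grow with each iteration."""

    def __init__(self, eta=0.5, sigma=1.0):
        self.eta = eta        # learning rate (assumed value)
        self.sigma = sigma    # kernel width (assumed value)
        self.centers = []     # stored input patterns x_i
        self.alphas = []      # expansion coefficients

    def predict(self, x):
        # f(x) = sum_i alpha_i * kappa(x_i, x)
        return sum(a * gaussian_kernel(c, x, self.sigma)
                   for a, c in zip(self.alphas, self.centers))

    def update(self, x, y):
        e = y - self.predict(x)           # prediction error on the new pattern
        self.centers.append(x)            # the network grows by one unit...
        self.alphas.append(self.eta * e)  # ...with coefficient eta * error
        return e
```

After processing $n$ patterns the filter holds $n$ kernel units, which is precisely the linear growth that sparsification techniques aim to curb.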

Recently, a number of kernel adaptive filtering algorithms have been presented that introduce sparsification procedures similar to the resource-allocating networks (RAN) proposed by Platt [2]. In [3], Engel et al. introduce an approximate linear dependency (ALD) criterion that constructs a dictionary of significant patterns based on the linear dependency of their feature vectors. In [4], Liu et al. propose a criterion that measures the information a new point contributes to the learning system, based on Gaussian process theory. Unlike ALD, this ``surprise''-based criterion also takes the output data $ y_i$ into account.
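As an illustration, the ALD test of [3] admits a new pattern into the dictionary only when its feature-space image cannot be well approximated by a linear combination of the stored patterns' images. The sketch below recomputes the kernel matrix at every call for clarity (the actual algorithm maintains its inverse recursively); the threshold and jitter values are assumptions:

```python
import numpy as np

def ald_admit(dictionary, x_new, kernel, nu=1e-2):
    """Approximate linear dependency (ALD) test, sketched.
    Computes the residual of projecting phi(x_new) onto the span of
    the dictionary's feature vectors; admits x_new if the residual
    exceeds the threshold `nu` (assumed value)."""
    if not dictionary:
        return True  # an empty dictionary always admits the first pattern
    # Kernel matrix of the current dictionary and kernel vector of x_new.
    K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
    k = np.array([kernel(d, x_new) for d in dictionary])
    # Small jitter for numerical stability of the solve (assumption).
    a = np.linalg.solve(K + 1e-10 * np.eye(len(dictionary)), k)
    delta = kernel(x_new, x_new) - k @ a  # ALD residual
    return bool(delta > nu)
```

With a Gaussian kernel, a pattern identical to a dictionary entry yields a residual near zero and is rejected, while a distant pattern yields a residual near one and is admitted.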

As a counterpart to criteria that optimally construct a sparse network, a number of pruning criteria have been developed to discard redundant patterns from existing large networks. Pruning techniques have been well studied in the context of neural network design, for instance in optimal brain damage [5] and optimal brain surgeon [6], which are based on an analysis of the Hessian of the error surface. In [7,8], several different, easier-to-evaluate criteria were presented to prune least-squares support vector machines (LS-SVM).
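A simple example of such an easy-to-evaluate criterion, in the spirit of LS-SVM pruning, is to discard the pattern whose expansion coefficient has the smallest magnitude, i.e., the one contributing least to the functional representation. The sketch below is purely illustrative and simpler than the refined, error-based criteria of [7,8]:

```python
import numpy as np

def prune_smallest_alpha(centers, alphas):
    """Remove the stored pattern whose coefficient |alpha_k| is
    smallest. Returns the reduced center list, the reduced
    coefficient array, and the index that was pruned."""
    k = int(np.argmin(np.abs(alphas)))
    pruned_centers = [c for i, c in enumerate(centers) if i != k]
    pruned_alphas = np.delete(np.asarray(alphas), k)
    return pruned_centers, pruned_alphas, k
```

Its appeal is that it requires no retraining or Hessian analysis: a single pass over the coefficients identifies the candidate for removal.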

A few methods have been proposed that combine growing and pruning procedures, for instance generalized growing and pruning (GGAP) [9], the forgetron [10], and sliding-window algorithms [11,12,13], which limit the memory to the $ M$ newest patterns. These approaches are attractive with practical applications in mind, such as implementations on a microchip, since they allow an exact upper bound to be placed on the memory size and the number of computations needed. Moreover, they are capable of forgetting past data, which makes them suitable for operation in time-varying environments.
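The sliding-window memory itself can be sketched in a few lines: a fixed-capacity buffer that automatically forgets the oldest pattern whenever a new one arrives, so storage is bounded by $ M$ regardless of the stream length. The budget value and placeholder patterns below are assumptions for illustration:

```python
from collections import deque

M = 3  # memory budget: number of stored patterns (assumed value)
window = deque(maxlen=M)  # sliding window over the M newest patterns

for n in range(10):
    # Appending to a full deque silently discards the oldest entry,
    # which implements the "forgetting" behavior of a sliding window.
    window.append((f"x_{n}", f"y_{n}"))
```

After the loop the window holds only patterns 7, 8, and 9, giving the exact upper bound on memory that makes these methods attractive for constrained hardware.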

The contributions of this paper are threefold. First, we show how the solution of the kernel recursive least-squares (KRLS) algorithm can be updated when one new point is added to the support and one chosen point is discarded. Second, we show how to choose the point to discard by using a suitable, easy-to-evaluate pruning criterion. And third, we introduce a label update procedure that equips the proposed algorithm with tracking capability.

The rest of this paper is organized as follows. In section 2 we briefly overview the basics of KRLS, followed by a description of the proposed method in section 3. The results of two numerical experiments are reported in section 4 and, finally, the main conclusions of this work are listed in section 5.

Steven Van Vaerenbergh
Last modified: 2010-08-07