First, we select the form of the GP
which acts as noise. This GP holds no information about the data and it is independent of
. Assume for a moment that we want to forget all past data. In this case we must set
to completely remove the informative GP. Then, our posterior GP would be
. When no data has been observed, the posterior distribution should, by definition, equal the prior. Therefore
must be a scaled version of the GP prior. Without loss of generality, we can choose this scale to be 1, so that the noise GP becomes
.
With this choice, setting
should imply
, which, as we will see later, is indeed the case. Observe that
corresponds to colored noise, using the same coloring as the prior.
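
As a rough illustration of what "colored noise with the same coloring as the prior" means on a finite set of inputs, the following sketch draws zero-mean Gaussian samples whose covariance matches a prior kernel matrix. The RBF kernel, the lengthscale, the input grid, and the function names `rbf_kernel` and `sample_colored_noise` are assumptions made for this example only; the text does not specify them.

```python
import numpy as np


def rbf_kernel(x, lengthscale=1.0):
    """Squared-exponential kernel matrix (an assumed prior covariance)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def sample_colored_noise(K, n_samples, rng, jitter=1e-6):
    """Draw zero-mean samples with covariance (approximately) K.

    White noise w ~ N(0, I) is colored by the Cholesky factor L of K,
    so that L @ w has covariance L @ L.T = K (up to the small jitter).
    """
    L = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))
    white = rng.standard_normal((K.shape[0], n_samples))
    return L @ white


rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 5)          # hypothetical input locations
K = rbf_kernel(x, lengthscale=0.5)    # prior covariance on those inputs

noise = sample_colored_noise(K, 200_000, rng)

# The empirical covariance of the noise samples should approach K,
# i.e. the noise carries no data, only the prior's correlation structure.
empirical_cov = noise @ noise.T / noise.shape[1]
max_abs_error = np.max(np.abs(empirical_cov - K))
print(max_abs_error)
```

The printed deviation shrinks as the number of samples grows, which is the finite-dimensional counterpart of the noise GP sharing the covariance function of the prior.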
Once
has been defined, the distribution of
can be obtained from its definition (15). Since both GPs are independent, their linear combination is distributed as
Comparing (16) and (17) and identifying terms, we obtain