Intro to AbstractGPs: one-dimensional regression

You are seeing the HTML output generated by Documenter.jl and Literate.jl from the Julia source file. The corresponding notebook can be viewed in nbviewer.

Setup

Loading the necessary packages.

using AbstractGPs



    
        
    
    JuliaGP
    
        
            AbstractGPs
        
        
            KernelFunctions
        
        
            GPLikelihoods
        
        
            ApproximateGPs
        
        
            Co-developed with Turing.jl
        
        
        
    
    
        
    
    
    
    ☰



using Distributions
using FillArrays
using StatsFuns

using Plots
default(; legend=:outertopright, size=(700, 400))

using Random
Random.seed!(42)  # setting the seed for reproducibility of this notebook

Load toy regression dataset taken from GPflow examples.

x = [
    0.8658165855998895,
    0.6661700880180962,
    0.8049218148148531,
    0.7714303440386239,
    0.14790478354654835,
    0.8666105548197428,
    0.007044577166530286,
    0.026331737288148638,
    0.17188596617099916,
    0.8897812990554013,
    0.24323574561119998,
    0.028590102134105955,
]
y = [
    1.5255314337144372,
    3.6434202968230003,
    3.010885733911661,
    3.774442382979625,
    3.3687639483798324,
    1.5506452040608503,
    3.790447985799683,
    3.8689707574953,
    3.4933565751758713,
    1.4284538820635841,
    3.8715350915692364,
    3.7045949061144983,
]
scatter(x, y; xlabel="x", ylabel="y", legend=false)

We split the observations into train and test data.

x_train = x[1:8]
y_train = y[1:8]
x_test = x[9:end]
y_test = y[9:end]

We instantiate a Gaussian process with a Matern kernel. The kernel has fixed variance and length scale parameters of default value 1.

f = GP(Matern52Kernel())

We create a finite dimensional projection at the inputs of the training dataset observed under Gaussian noise with variance $noise\_var = 0.1$, and compute the log-likelihood of the outputs of the training dataset.

noise_var = 0.1
fx = f(x_train, noise_var)
logpdf(fx, y_train)

-25.53057444906228

We compute the posterior Gaussian process given the training data, and calculate the log-likelihood of the test dataset.

p_fx = posterior(fx, y_train)
logpdf(p_fx(x_test, noise_var), y_test)

-2.878533694097381

We plot the posterior Gaussian process (its mean and a ribbon of 2 standard deviations around it) on a grid along with the observations.

scatter(
    x_train,
    y_train;
    xlim=(0, 1),
    xlabel="x",
    ylabel="y",
    title="posterior (default parameters)",
    label="Train Data",
)
scatter!(x_test, y_test; label="Test Data")
plot!(0:0.001:1, p_fx; label=false, ribbon_scale=2)

Markov Chain Monte Carlo

Previously we computed the log likelihood of the untuned kernel parameters of the GP. We now also perform approximate inference over said kernel parameters using different Markov chain Monte Carlo (MCMC) methods. I.e., we approximate the posterior distribution of the kernel parameters with samples from a Markov chain.

We define a function which returns the log-likelihood of the data for different variance and inverse lengthscale parameters of the Matern kernel. We ensure that these parameters are positive with the softplus function

\[f(x) = \log (1 + \exp x).\]

function gp_loglikelihood(x, y)
    function loglikelihood(params)
        kernel =
            softplus(params[1]) * (Matern52Kernel() ∘ ScaleTransform(softplus(params[2])))
        f = GP(kernel)
        fx = f(x, noise_var)
        return logpdf(fx, y)
    end
    return loglikelihood
end

const loglik_train = gp_loglikelihood(x_train, y_train)

We define a Gaussian prior for the joint distribution of the two transformed kernel parameters. We assume that both parameters are independent with mean 0 and variance 1.

logprior(params) = logpdf(MvNormal(Eye(2)), params)

Hamiltonian Monte Carlo

We start with a Hamiltonian Monte Carlo (HMC) sampler. More precisely, we use the No-U-Turn sampler (NUTS), which is provided by the Julia packages AdvancedHMC.jl and DynamicHMC.jl.

AdvancedHMC

We start with performing inference with AdvancedHMC.

using AdvancedHMC
using ForwardDiff

Set the number of samples to draw and warmup iterations.

n_samples = 2_000
n_adapts = 1_000

AdvancedHMC and DynamicHMC below require us to implement the LogDensityProblems interface for the log joint probability.

using LogDensityProblems

struct LogJointTrain end

# Log joint density
LogDensityProblems.logdensity(::LogJointTrain, θ) = loglik_train(θ) + logprior(θ)

# The parameter space is two-dimensional
LogDensityProblems.dimension(::LogJointTrain) = 2

# `LogJointTrain` does not allow to evaluate derivatives of the log density function
function LogDensityProblems.capabilities(::Type{LogJointTrain})
    return LogDensityProblems.LogDensityOrder{0}()
end

We use ForwardDiff.jl to compute the derivatives of the log joint density with automatic differentiation.

using LogDensityProblemsAD

const logjoint_train = ADgradient(Val(:ForwardDiff), LogJointTrain())

We define an Hamiltonian system of the log joint probability.

metric = DiagEuclideanMetric(2)
hamiltonian = Hamiltonian(metric, logjoint_train)

Define a leapfrog solver, with initial step size chosen heuristically.

initial_params = rand(2)
initial_ϵ = find_good_stepsize(hamiltonian, initial_params)
integrator = Leapfrog(initial_ϵ)

Define an HMC sampler, with the following components:

multinomial sampling scheme,
generalised No-U-Turn criteria, and
windowed adaption for step-size and diagonal mass matrix

proposal = HMCKernel(Trajectory{MultinomialTS}(integrator, GeneralisedNoUTurn()))
adaptor = StanHMCAdaptor(MassMatrixAdaptor(metric), StepSizeAdaptor(0.8, integrator))

We draw samples from the posterior distribution of kernel parameters. These samples are in the unconstrained space $\mathbb{R}^2$.

samples, _ = sample(
    hamiltonian, proposal, initial_params, n_samples, adaptor, n_adapts; progress=false
)

┌ Info: Finished 1000 adapation steps
│   adaptor =
│    StanHMCAdaptor(
│        pc=WelfordVar,
│        ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.8, state.ϵ=0.8765703439909124),
│        init_buffer=75, term_buffer=50, window_size=25,
│        state=window(76, 950), window_splits(100, 150, 250, 450, 950)
│    )
│   κ.τ.integrator = Leapfrog(ϵ=0.877)
└   h.metric = DiagEuclideanMetric([0.37816624906919755, 0.520 ...])
┌ Info: Finished 2000 sampling steps for 1 chains in 0.590829272 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.37816624906919755, 0.520 ...]), kinetic=AdvancedHMC.GaussianKinetic())
│   κ = AdvancedHMC.HMCKernel{AdvancedHMC.FullMomentumRefreshment, AdvancedHMC.Trajectory{AdvancedHMC.MultinomialTS, AdvancedHMC.Leapfrog{Float64}, AdvancedHMC.GeneralisedNoUTurn{Float64}}}(AdvancedHMC.FullMomentumRefreshment(), Trajectory{AdvancedHMC.MultinomialTS}(integrator=Leapfrog(ϵ=0.877), tc=AdvancedHMC.GeneralisedNoUTurn{Float64}(10, 1000.0)))
│   EBFMI_est = 1.1272887976650698
└   average_acceptance_rate = 0.8503065689665961

We transform the samples back to the constrained space and compute the mean of both parameters:

samples_constrained = [map(softplus, p) for p in samples]
mean_samples = mean(samples_constrained)

2-element Vector{Float64}:
 2.31725033692087
 2.261365939249381

We plot a histogram of the samples for the two parameters. The vertical line in each graph indicates the mean of the samples.

histogram(
    reduce(hcat, samples_constrained)';
    xlabel="sample",
    ylabel="counts",
    layout=2,
    title=["variance" "inverse length scale"],
    legend=false,
)
vline!(mean_samples'; linewidth=2)

We approximate the log-likelihood of the test data using the posterior Gaussian processes for kernels with the sampled kernel parameters. We can observe that there is a significant improvement over the log-likelihood of the test data with respect to the posterior Gaussian process with default kernel parameters of value 1.

function gp_posterior(x, y, p)
    kernel = softplus(p[1]) * (Matern52Kernel() ∘ ScaleTransform(softplus(p[2])))
    f = GP(kernel)
    return posterior(f(x, noise_var), y)
end

mean(logpdf(gp_posterior(x_train, y_train, p)(x_test, noise_var), y_test) for p in samples)

-1.0013167293696807

We sample 5 functions from each posterior GP given by the final 100 samples of kernel parameters.

plt = plot(; xlim=(0, 1), xlabel="x", ylabel="y", title="posterior (AdvancedHMC)")
for (i, p) in enumerate(samples[(end - 100):end])
    sampleplot!(
        plt,
        0:0.02:1,
        gp_posterior(x_train, y_train, p);
        samples=5,
        seriescolor="red",
        label=(i == 1 ? "samples" : nothing),
    )
end
scatter!(plt, x_train, y_train; label="Train Data", markercolor=1)
scatter!(plt, x_test, y_test; label="Test Data", markercolor=2)
plt

DynamicHMC

We repeat the inference with DynamicHMC.

using DynamicHMC

samples =
    mcmc_with_warmup(
        Random.GLOBAL_RNG, logjoint_train, n_samples; reporter=NoProgressReport()
    ).posterior_matrix

We transform the samples back to the constrained space and compute the mean of both parameters:

samples_constrained = map(softplus, samples)
mean_samples = vec(mean(samples_constrained; dims=2))

2-element Vector{Float64}:
 2.3107754616634013
 2.2857933721109345

We plot a histogram of the samples for the two parameters. The vertical line in each graph indicates the mean of the samples.

histogram(
    samples_constrained';
    xlabel="sample",
    ylabel="counts",
    layout=2,
    title=["variance" "inverse length scale"],
    legend=false,
)
vline!(mean_samples'; linewidth=2)

Again we can observe that there is a significant improvement over the log-likelihood of the test data with respect to the posterior Gaussian process with default kernel parameters.

mean(logpdf(gp_posterior(x_train, y_train, p)(x_test), y_test) for p in eachcol(samples))

-43.564053361846696

We sample a function from the posterior GP for the final 100 samples of kernel parameters.

plt = plot(; xlim=(0, 1), xlabel="x", ylabel="y", title="posterior (DynamicHMC)")
scatter!(plt, x_train, y_train; label="Train Data")
scatter!(plt, x_test, y_test; label="Test Data")
for i in (n_samples - 100):n_samples
    p = @view samples[:, i]
    sampleplot!(plt, 0:0.02:1, gp_posterior(x_train, y_train, p); seriescolor="red")
end
plt

Elliptical slice sampling

Instead of HMC, we use elliptical slice sampling which is provided by the Julia package EllipticalSliceSampling.jl.

using EllipticalSliceSampling

We draw 2000 samples from the posterior distribution of kernel parameters.

samples = sample(ESSModel(
    MvNormal(Eye(2)), # Gaussian prior
    loglik_train,
), ESS(), n_samples; progress=false)

We transform the samples back to the constrained space and compute the mean of both parameters:

samples_constrained = [map(softplus, p) for p in samples]
mean_samples = mean(samples_constrained)

2-element Vector{Float64}:
 2.2377661226681465
 2.2419035752463805

We plot a histogram of the samples for the two parameters. The vertical line in each graph indicates the mean of the samples.

histogram(
    reduce(hcat, samples_constrained)';
    xlabel="sample",
    ylabel="counts",
    layout=2,
    title=["variance" "inverse length scale"],
)
vline!(mean_samples'; layout=2, labels="mean")

Again we can observe that there is a significant improvement over the log-likelihood of the test data with respect to the posterior Gaussian process with default kernel parameters.

mean(logpdf(gp_posterior(x_train, y_train, p)(x_test), y_test) for p in samples)

-11.588584500819495

We sample a function from the posterior GP for the final 100 samples of kernel parameters.

plt = plot(;
    xlim=(0, 1), xlabel="x", ylabel="y", title="posterior (EllipticalSliceSampling)"
)
scatter!(plt, x_train, y_train; label="Train Data")
scatter!(plt, x_test, y_test; label="Test Data")
for p in samples[(end - 100):end]
    sampleplot!(plt, 0:0.02:1, gp_posterior(x_train, y_train, p); seriescolor="red")
end
plt

Variational Inference

Sanity check for the Evidence Lower BOund (ELBO) implemented according to M. K. Titsias's Variational learning of inducing variables in sparse Gaussian processes.

elbo(VFE(f(rand(5))), fx, y_train)

-25.722444893456252

We use the LBFGS algorithm to maximize the given ELBO. It is provided by the Julia package Optim.jl.

using Optim

We define a function which returns the negative ELBO for different variance and inverse lengthscale parameters of the Matern kernel and different pseudo-points. We ensure that the kernel parameters are positive with the softplus function

\[f(x) = \log (1 + \exp x),\]

and that the pseudo-points are in the unit interval $[0,1]$ with the logistic function

\[f(x) = \frac{1}{1 + \exp{(-x)}}.\]

jitter = 1e-6  # "observing" the latent process with some (small) amount of jitter improves numerical stability

function objective_function(x, y)
    function negative_elbo(params)
        kernel =
            softplus(params[1]) * (Matern52Kernel() ∘ ScaleTransform(softplus(params[2])))
        f = GP(kernel)
        fx = f(x, noise_var)
        z = logistic.(params[3:end])
        approx = VFE(f(z, jitter))
        return -elbo(approx, fx, y)
    end
    return negative_elbo
end

We randomly initialize the kernel parameters and 5 pseudo points, and minimize the negative ELBO with the LBFGS algorithm and obtain the following optimal parameters:

x0 = rand(7)
opt = optimize(objective_function(x_train, y_train), x0, LBFGS())

 * Status: success

 * Candidate solution
    Final objective value:     1.086925e+01

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 2.97e-09 ≰ 0.0e+00
    |x - x'|/|x'|          = 3.55e-10 ≰ 0.0e+00
    |f(x) - f(x')|         = 0.00e+00 ≤ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 0.00e+00 ≤ 0.0e+00
    |g(x)|                 = 7.62e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    41
    f(x) calls:    115
    ∇f(x) calls:   115

opt.minimizer

7-element Vector{Float64}:
  8.379380104656423
  3.9327375632143817
  1.8479571183525028
 -4.133679982747246
 -1.7583267680191783
  1.2763097360572246
  0.6917102806552108

The optimized value of the variance is

softplus(opt.minimizer[1])

8.379609630513048

and of the inverse lengthscale is

softplus(opt.minimizer[2])

3.9521381197869268

We compute the log-likelihood of the test data for the resulting approximate posterior. We can observe that there is a significant improvement over the log-likelihood with the default kernel parameters of value 1.

opt_kernel =
    softplus(opt.minimizer[1]) *
    (Matern52Kernel() ∘ ScaleTransform(softplus(opt.minimizer[2])))
opt_f = GP(opt_kernel)
opt_fx = opt_f(x_train, noise_var)
ap = posterior(VFE(opt_f(logistic.(opt.minimizer[3:end]), jitter)), opt_fx, y_train)
logpdf(ap(x_test, noise_var), y_test)

-2.0685997732665813

We visualize the approximate posterior with optimized parameters.

scatter(
    x_train,
    y_train;
    xlim=(0, 1),
    xlabel="x",
    ylabel="y",
    title="posterior (VI with sparse grid)",
    label="Train Data",
)
scatter!(x_test, y_test; label="Test Data")
plot!(0:0.001:1, ap; label=false, ribbon_scale=2)
vline!(logistic.(opt.minimizer[3:end]); label="Pseudo-points")

Exact Gaussian Process Inference

Here we use Type-II MLE to train the hyperparameters of the Gaussian process. This means that our loss function is the negative log marginal likelihood.

We re-calculate the log-likelihood of the test dataset with the default kernel parameters of value 1 for the sake of comparison.

logpdf(p_fx(x_test), y_test)

-232.51565975779937

We define a function which returns the negative log marginal likelihood for different variance and inverse lengthscale parameters of the Matern kernel and different pseudo-points. We ensure that the kernel parameters are positive with the softplus function $f(x) = \log (1 + \exp x)$.

function loss_function(x, y)
    function negativelogmarginallikelihood(params)
        kernel =
            softplus(params[1]) * (Matern52Kernel() ∘ ScaleTransform(softplus(params[2])))
        f = GP(kernel)
        fx = f(x, noise_var)
        return -logpdf(fx, y)
    end
    return negativelogmarginallikelihood
end

We randomly initialize the kernel parameters, and minimize the negative log marginal likelihood with the LBFGS algorithm and obtain the following optimal parameters:

θ0 = randn(2)
opt = Optim.optimize(loss_function(x_train, y_train), θ0, LBFGS())

 * Status: success

 * Candidate solution
    Final objective value:     1.085252e+01

 * Found with
    Algorithm:     L-BFGS

 * Convergence measures
    |x - x'|               = 2.48e-08 ≰ 0.0e+00
    |x - x'|/|x'|          = 2.95e-09 ≰ 0.0e+00
    |f(x) - f(x')|         = 2.17e-13 ≰ 0.0e+00
    |f(x) - f(x')|/|f(x')| = 2.00e-14 ≰ 0.0e+00
    |g(x)|                 = 2.70e-09 ≤ 1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    9
    f(x) calls:    31
    ∇f(x) calls:   31

opt.minimizer

2-element Vector{Float64}:
 8.38552055926984
 3.9687942110234706

The optimized value of the variance is

softplus(opt.minimizer[1])

8.385748680211925

and of the inverse lengthscale is

softplus(opt.minimizer[2])

3.9875140996623806

We compute the log-likelihood of the test data for the resulting optimized posterior. We can observe that there is a significant improvement over the log-likelihood with the default kernel parameters of value 1.

opt_kernel =
    softplus(opt.minimizer[1]) *
    (Matern52Kernel() ∘ ScaleTransform(softplus(opt.minimizer[2])))

opt_f = GP(opt_kernel)
opt_fx = opt_f(x_train, noise_var)
opt_p_fx = posterior(opt_fx, y_train)
logpdf(opt_p_fx(x_test, noise_var), y_test)

-2.092647854433654

We visualize the posterior with optimized parameters.

scatter(
    x_train,
    y_train;
    xlim=(0, 1),
    xlabel="x",
    ylabel="y",
    title="posterior (optimized parameters)",
    label="Train Data",
)
scatter!(x_test, y_test; label="Test Data")
plot!(0:0.001:1, opt_p_fx; label=false, ribbon_scale=2)

Package and system information

Package information (click to expand)

Status `~/work/AbstractGPs.jl/AbstractGPs.jl/examples/0-intro-1d/Project.toml`
  [99985d1d] AbstractGPs v0.5.24 `/home/runner/work/AbstractGPs.jl/AbstractGPs.jl#master`
  [0bf59076] AdvancedHMC v0.7.1
  [31c24e10] Distributions v0.25.120
  [bbc10e6e] DynamicHMC v3.5.1
  [cad2338a] EllipticalSliceSampling v2.0.0
  [1a297f60] FillArrays v1.13.0
  [f6369f11] ForwardDiff v1.0.1
  [98b081ad] Literate v2.20.1
  [6fdf6af0] LogDensityProblems v2.1.2
⌃ [996a588d] LogDensityProblemsAD v1.9.2
  [429524aa] Optim v1.12.0
  [91a5bcdd] Plots v1.40.13
  [4c63d2b9] StatsFuns v1.5.0
  [9a3f8284] Random
Info Packages marked with ⌃ have new versions available and may be upgradable.

To reproduce this notebook's package environment, you can download the full Manifest.toml.

System information (click to expand)

Julia Version 1.10.9
Commit 5595d20a287 (2025-03-10 12:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_DEBUG = Documenter
  JULIA_LOAD_PATH = :/home/runner/.julia/packages/JuliaGPsDocs/7M86H/src

This page was generated using Literate.jl.