In Bayesian statistical inference, we start with a probability distribution that fits our prior information about a parameter (called the prior probability distribution or prior). We update this distribution as we collect data to yield the posterior probability distribution, which we may use for decision making.
Prior information can come in many forms. For example, we may have information about the moments of a random variable $X$, such as its variance. Or we may know the probability of some event, such as $\mathbf{P}(X \geq 0)$. The prior should adhere to these known constraints.
In this post, we focus on a common (and computationally tractable) method of constructing priors: finding the distribution with maximum entropy (i.e., the least informative one) that fits a set of constraints.
In some cases, this prior is known in closed form. Famously, the Gaussian distribution is the maximum entropy continuous distribution given a known mean and variance. However, for general constraints on the random variable $X$, this distribution must be found computationally.
The max entropy distribution problem can be readily formulated as a convex optimization problem:

$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n p_i \log p_i \\
\text{subject to} & Ap = b \\
& \mathbf{1}^T p = 1, \quad p \geq 0,
\end{array}
$$

where $p \in \mathbf{R}^n$ is the probability mass function of random variable $X$, which takes values $x_1, \dots, x_n$, and each row of $A \in \mathbf{R}^{m \times n}$ encodes some expectation constraint $\mathbf{E}[f_j(X)] = \sum_{i=1}^n f_j(x_i)\, p_i = b_j$. It turns out that this problem not only is convex but also can be solved very efficiently by looking at the dual problem.
The Lagrangian for this problem (treating $p \geq 0$ implicitly through the domain of the objective) is

$$
L(p, \lambda, \nu) = \sum_{i=1}^n p_i \log p_i + \lambda^T (Ap - b) + \nu\,(\mathbf{1}^T p - 1).
$$
Minimizing over $p$, we get the dual function

$$
g(\lambda, \nu) = \inf_p L(p, \lambda, \nu) = -b^T \lambda - \nu - f^*\!\left(-A^T \lambda - \nu \mathbf{1}\right),
$$

where $f^*$ denotes the conjugate of the negative entropy $f(p) = \sum_{i=1}^n p_i \log p_i$.
The conjugate function for negative entropy is [BV04, Example 3.21]

$$
f^*(y) = \sum_{i=1}^n e^{y_i - 1}.
$$
Plugging this in, we have that

$$
g(\lambda, \nu) = -b^T \lambda - \nu - \sum_{i=1}^n e^{-a_i^T \lambda - \nu - 1},
$$

where $a_i$ is the $i$th column of $A$. To find the optimal dual variables $\lambda^\star$ and $\nu^\star$, we solve the unconstrained problem

$$
\underset{\lambda,\,\nu}{\text{maximize}} \quad g(\lambda, \nu).
$$
The dual function is concave, so it can be maximized using a standard technique (e.g., Newton's method or LBFGS). For reference, the gradients are

$$
\nabla_\lambda g(\lambda, \nu) = -b + \sum_{i=1}^n a_i\, e^{-a_i^T \lambda - \nu - 1},
\qquad
\frac{\partial g}{\partial \nu}(\lambda, \nu) = -1 + \sum_{i=1}^n e^{-a_i^T \lambda - \nu - 1}.
$$
Assuming Slater's condition holds, we have strong duality, and the primal and dual problems have the same optimal value. We can substitute $\lambda^\star$ and $\nu^\star$ into the Lagrangian to get the unique (due to strict convexity) solution

$$
p_i^\star = \exp\!\left(-a_i^T \lambda^\star - \nu^\star - 1\right), \qquad i = 1, \dots, n.
$$
If this is primal feasible, then it is optimal.
Several well-known distributions come from solving this problem with specific choices of $A$ and $b$ (corresponding to the constraint functions $f_j$). The Gaussian distribution solves this problem when $X$ is defined over the reals and is constrained such that $\mathbf{E}[X] = \mu$ and $\mathbf{E}[(X - \mu)^2] = \sigma^2$. This Wikipedia article has several more examples.
This dual formulation also hints at how we can solve this problem for a continuous probability density function by solving a convex optimization problem with only $m + 1$ variables. However, we will end up with an integral instead of a sum in the definition of $g$.
The geometric distribution with parameter $p$ is the maximum entropy distribution on the positive integers $\{1, 2, 3, \dots\}$ such that $\mathbf{E}[X] = 1/p$. We will verify this by formulating it as a convex optimization problem (truncating the support to $k = 1, \dots, 8$, which carries nearly all of the mass when $p = 0.75$), which we can solve with LBFGS from Optim.jl and Julia's powerful automatic differentiation capabilities.
using Optim
using LinearAlgebra
using Plots
# Geometric distribution with parameter p
p = .75
k = 1:8
pk = (1-p) .^ (k .- 1) * p
# Setup problem data
A = k'
b = [1/p]
n, m = length(k), length(b)
# function to evaluate g(λ, ν)
# note that we negate everything since the convention for Optim is to minimize
function f1(y, A, b)
n = size(A, 2)
λ, ν = @view(y[1:end-1]), y[end]
return sum([exp(-dot(@view(A[:,i]), λ)-ν-1) for i in 1:n]) + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = randn(m+1)
res = Optim.optimize(y->f1(y, A, b), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# Construct distribution
pstar = [1 / exp(dot(λstar, A[:,i]) + νstar + 1) for i in 1:n]
rel_err = norm(pstar - pk) / min(norm(pstar), norm(pk))
# Plot computed & true distribution
plt = plot(
k,
pstar,
lw=2,
title="Geometric Distribution & Computed Distribution",
markershape=:circle,
seriestype=:sticks,
label="Computed",
dpi=300
)
plot!(plt, k, pk, label="Geometric", lw=2)
We get a relative error of 9.36e-5 and a clear visual match, confirming that our computational result matches the theory.
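As a sanity check on the derivation, we could also hand Optim the analytic gradient of $-g$ from above instead of relying on forward-mode AD. The sketch below reuses `f1`, `A`, `b`, and `y0`; the helper names `neg_g_grad!` and `res_grad` are my own.

# Analytic gradient of -g(λ, ν), following the gradient expressions derived above.
# Optim expects an in-place gradient of the form g!(storage, y).
function neg_g_grad!(storage, y, A, b)
    n, m = size(A, 2), length(b)
    λ, ν = @view(y[1:end-1]), y[end]
    w = [exp(-dot(@view(A[:, i]), λ) - ν - 1) for i in 1:n]               # wᵢ = pᵢ(λ, ν)
    storage[1:m] .= b .- [sum(A[j, i] * w[i] for i in 1:n) for j in 1:m]  # ∇_λ(-g) = b - A w
    storage[m+1] = 1 - sum(w)                                             # ∂(-g)/∂ν = 1 - Σᵢ wᵢ
    return storage
end
res_grad = Optim.optimize(y -> f1(y, A, b), (st, y) -> neg_g_grad!(st, y, A, b), y0, LBFGS())

Both versions should land on (numerically) the same dual solution.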
Note that we could have also solved the original problem directly using a general purpose conic solver and a modeling tool like Convex.jl; however, when $m \ll n$ (which is often the case), the dual problem will be much faster to solve.
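For completeness, a rough sketch of that primal formulation in Convex.jl might look like the following (assuming the SCS solver is installed; `pvar`, `Amat`, `problem`, and `p_primal` are names I chose, and `n`, `A`, `b` are the problem data from the geometric example):

using Convex, SCS
# Solve the primal problem directly: maximize entropy subject to the constraints.
Amat = Matrix{Float64}(A)           # convert the problem data to a plain matrix
pvar = Variable(n)
problem = maximize(entropy(pvar),   # entropy(x) = Σ -xᵢ log xᵢ, a concave atom
                   Amat * pvar == b,
                   sum(pvar) == 1,
                   pvar >= 0)
solve!(problem, SCS.Optimizer)
p_primal = evaluate(pvar)           # should match pstar up to solver tolerance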
In the above derivation, we can essentially replace the sums with integrals, which gives us a way to handle continuous distributions with only a finite number of variables in the associated optimization problem. Our expectation constraints become

$$
\mathbf{E}[f_j(X)] = \int f_j(t)\, p(t)\, dt = b_j
$$

(where we switched notation s.t. $p$ now denotes a probability density function). After committing some mathematical sins and securing my place in analysis hell, we get

$$
g(\lambda, \nu) = -b^T \lambda - \nu - \int e^{-a(t)^T \lambda - \nu - 1}\, dt,
$$

where $a(t) = (f_1(t), \dots, f_m(t))$, and the constraint $\mathbf{1}^T p = 1$ is equivalent to $\int p(t)\, dt = 1$. We can similarly solve this problem with LBFGS using our favorite numerical integration technique. The optimal distribution is given by

$$
p^\star(t) = \exp\!\left(-a(t)^T \lambda^\star - \nu^\star - 1\right).
$$
It's well known that the Gaussian distribution maximizes entropy given a fixed mean $\mu$ and variance $\sigma^2$. We can verify this empirically. To encode the moment constraints, we take $a(t) = (t, t^2)$ and $b = (\mu, \sigma^2 + \mu^2)$, so that $\mathbf{E}[X] = \mu$ and $\mathbf{E}[X^2] = \sigma^2 + \mu^2$ (with $\mu = 0$ below, this is simply $b = (0, \sigma^2)$). We only define the negated dual function and let Julia's automatic differentiation capabilities compute the gradient for us.
using QuadGK
# Gaussian distribution with mean 0 and variance σ^2
μ, σ2 = 0.0, 1.0
f(t) = 1/sqrt(2π) * exp(-t^2/2)
# Setup problem data
b = [μ, σ2]
m = length(b)
function f2(y, b, l, u)
λ, ν = @view(y[1:end-1]), y[end]
I, E = quadgk(t->exp(-dot([t, t^2], λ) - ν - 1), l, u)
return I + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f2(y, b, -5, 5), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# construct solution & compare distribution
t = -5:0.1:5
pstar2(t) = 1 / exp(dot(λstar, [t; t^2]) + νstar + 1)
rel_err = norm(pstar2.(t) - f.(t)) / min(norm(pstar2.(t)), norm(f.(t)))
# Plot computed & true distribution
plt = plot(
t,
pstar2.(t),
lw=2,
title="Gaussian Distribution & Computed Distribution",
markershape=:circle,
label="Computed",
dpi=300
)
plot!(plt, t, f.(t), label="Gaussian", lw=2)
Again, we get a low relative error of 6.13e-6 and a clear visual match.
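We can also back out the Gaussian parameters implied by the dual solution: completing the square in $p^\star(t) = \exp(-\lambda_1^\star t - \lambda_2^\star t^2 - \nu^\star - 1)$ gives $\sigma^2 = 1/(2\lambda_2^\star)$ and $\mu = -\lambda_1^\star / (2\lambda_2^\star)$. A quick check (variable names are my own):

# Recover (μ, σ²) from the dual variables by completing the square
σ2_rec = 1 / (2 * λstar[2])
μ_rec = -λstar[1] / (2 * λstar[2])
println("recovered (μ, σ²) ≈ $(round.((μ_rec, σ2_rec), digits=4))")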
We can repeat the example above but add the constraint $\mathbf{P}(X \geq 0) = 0.25$ by taking $a(t) = (t, t^2, \mathbf{1}\{t \geq 0\})$ and $b = (0, \sigma^2, 0.25)$. The associated code is below.
# Setup problem data
b = [0, σ2, 0.25]
m = length(b)
function f3(y, b, l, u)
λ, ν = @view(y[1:end-1]), y[end]
I, E = quadgk(t->exp(-dot([t, t^2, t >= 0], λ) - ν - 1), l, u)
return I + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f3(y, b, -5, 5), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# Construct distribution
pstar3(t) = 1 / exp(dot(λstar, [t; t^2; t >= 0]) + νstar + 1)
# Plot computed & true distribution
tplot = -5:0.01:5
plt = plot(
tplot,
pstar3.(tplot),
lw=2,
title="Computed Distribution",
legend=false,
dpi=300
)
It's easy to verify that the constraints are met:
# Verify results
μ_comp = quadgk(t -> t * pstar3(t), tplot[1], tplot[end])[1]
σ2_comp = quadgk(t -> t^2 * pstar3(t), tplot[1], tplot[end])[1] - μ_comp^2
prob_noneg = quadgk(t -> (t >= 0) * pstar3(t), tplot[1], tplot[end])[1]
println("μ̂, σ̂2, P(X ≥ 0) = $(round.((μ_comp, σ2_comp, prob_noneg), digits=4))")
μ̂, σ̂2, P(X ≥ 0) = (0.0, 1.0, 0.25)
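As a final check, the computed density should integrate to (approximately) one over the truncated domain:

# Check normalization of the computed density over [-5, 5]
total_mass = quadgk(t -> pstar3(t), tplot[1], tplot[end])[1]
println("∫ p⋆(t) dt ≈ $(round(total_mass, digits=4))")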
This distribution is definitely somewhat unintuitive. It looks like a combo of an exponential and a Gaussian, which are solutions to the max entropy problem under different constraints. However, I certainly don't know how to read the tea leaves here.
While these ideas aren't new at all, I don't see them mentioned super frequently. If any of this is interesting to you, or you think these ideas may be useful for your own work, please reach out! (tdiamand@mit.edu)
Thank you to Chris Rackauckas, Parth Nobel, and Logan Engstrom for helpful comments on drafts of this post!
[BV04] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.