Theo Diamandis

Better Priors via Convex Optimization

In Bayesian statistical inference, we start with a probability distribution that fits our prior information about a parameter XX (called the prior probability distribution or prior). We update this distribution as we collect data to yield the posterior probability distribution, which we may use for decision making.

Prior information can come in many forms. For example, we may have information about the moments of a random variable XX, such as the variance. Or we may know that P(X>a)0.25P(X > a) \leq 0.25. The prior should adhere to these known constraints.

In this post, we focus on a common (and computationally tractable) method of constructing priors: finding the distribution with maximum entropy (i.e., the least informative one) that fits a set of constraints.

In some cases, this prior is known in closed form. Famously, the Gaussian distribution is the maximum entropy continuous distribution given a known mean and variance. However, for general constraints on the random variable XX, this distribution must be found computationally.

Discrete distributions

The max entropy distribution problem can be readily formulated as a convex optimization problem:

minimizei=1npilogpisubject toAp=b1Tp=1, \begin{array}{ll} \text{minimize} & \sum_{i=1}^n p_i\log p_i \\ \text{subject to} & Ap = b \\ &\mathbf{1}^Tp = 1, \end{array}

where pRnp \in \mathbf{R}^n is the probability mass function of random variable XX which takes values {x1,xn}\{x_1, \dots x_n\}, and each row of Ap=bAp=b encodes some expectation constraint E[fi(X)]=kf(xk)pk=kAikpk=bi\mathbf{E}[f_i(X)] = \sum_k f(x_k) p_k = \sum_k A_{ik}p_k = b_i. It turns out that this problem not only is convex but also can be solved very efficiently by looking at the dual problem.

The Lagrange dual function

The Lagrangian for this problem is

L(p,λ,ν)=f0(p)+λT(Apb)+ν(1Tp1) L(p, \lambda, \nu) = f_0(p) + \lambda^T(Ap - b) + \nu(\mathbf{1}^Tp - 1)

Minimizing over pp, we get the dual function

g(λ,ν)=infp(f0(p)+(λTA+ν1T)p)λTbν=supp((λTAν1T)pf0(p))λTbν=f0(ATλ1ν)λTbν.\begin{aligned} g(\lambda, \nu) &= \inf_p(f_0(p) + (\lambda^TA + \nu\mathbf{1}^T)p) -\lambda^Tb - \nu \\ &= -\sup_p((-\lambda^TA - \nu\mathbf{1}^T)p - f_0(p)) -\lambda^Tb - \nu \\ &= -f_0^*(-A^T\lambda - \mathbf{1}\nu) - \lambda^Tb - \nu. \end{aligned}

The conjugate function for negative entropy is [BV04, Example 3.21]

f0(q)=i=1neqi1. f_0^*(q) = \sum_{i=1}^n e^{q_i - 1}.

Plugging this in, we have that

g(λ,ν)=i=1neaiTλν1λTbν, g(\lambda, \nu) = -\sum_{i=1}^ne^{-a_i^T\lambda - \nu - 1} - \lambda^Tb - \nu,

where aia_i is the ithi^{th} column of AA. To find the optimal dual variables λ\lambda^\star and ν\nu^\star, we solve

maximizeg(λ,ν). \begin{array}{ll} \text{maximize} & g(\lambda, \nu). \end{array}

The dual function is concave, so it can be maximized using a standard technique (e.g., Newton's method or LBFGS). For reference, the gradients are

λg(λ,ν)=i=1neaiTλν1aibνg(λ,ν)=i=1neaiTλν11.\begin{aligned} \nabla_\lambda g(\lambda, \nu) &= \sum_{i=1}^ne^{-a_i^T\lambda - \nu - 1}a_i - b\\ \nabla_\nu g(\lambda, \nu) &= \sum_{i=1}^ne^{-a_i^T\lambda - \nu - 1} - 1. \end{aligned}

Solving the original problem via the dual

Assuming Slater's condition holds, we have strong duality and the primal and dual problems have the same optimal value. We can substitute λ\lambda^\star and ν\nu^\star into the Lagrangian to get the unique (due to strict convexity) solution

pi=1/exp(aiTλ+ν+1). p_i^\star = 1/\exp\left(a_i^T\lambda^\star + \nu^\star + 1\right).

If this pp^\star is primal feasible, then it is optimal.

Several well known distributions come from solving this problem with specific choices of E[fi(X)]=bi\mathbf{E}[f_i(X)]=b_i (corresponding to Ap=bAp = b). The Gaussian distribution solves this problem when XX is defined over the reals and is constrained such that E[X]=μ\mathbf{E}[X] = \mu and E[(XE[X])2]=σ2\mathbf{E}[(X - \mathbf{E}[X])^2] = \sigma^2. This wikipedia article has several more examples.

This dual formulation also hints at how we can solve this problem for a continuous probability density function p(x)p(x) by solving a convex optimization problem with only m+1m + 1 variables. However, we will end up with an integral instead of a sum in the definition of g(λ,ν)g(\lambda, \nu).

Example: geometric distribution

The geometric distribution with parameter pp is the maximum entropy distribution on {1,2, }\{1, 2, \dots\} such that E[X]=1/p\mathbf{E}[X] = 1/p. We will verify this by formulating it as a convex optimization problem, which we can solve with LBFGS from Optim.jl and Julia's powerful automatic differentiation capabilities.

using Optim
using LinearAlgebra
using Plots

# Geometric distribution with parameter p
p = .75
k = 1:8
pk = (1-p) .^ (k .- 1) * p

# Setup problem data
A = k'
b = [1/p]
n, m = length(k), length(b)
# function to evaluate g(λ, ν)
# note that we negate everything since the convention for Optim is to minimize
function f1(y, A, b)
    n = size(A, 2)
    λ, ν = @view(y[1:end-1]), y[end]
    return sum([exp(-dot(@view(A[:,i]), λ)-ν-1) for i in 1:n]) + dot(λ, b) + ν

# Solve via LBFGS
y0 = randn(m+1)
res = Optim.optimize(y->f1(y, A, b), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]

# Construct distribution
pstar = [1 / exp(dot(λstar, A[:,i]) + νstar + 1) for i in 1:n]
rel_err = norm(pstar - pk) / min(norm(pstar), norm(pk))

# Plot computed & true distribution
plt = plot(
    title="Geometric Distribution & Computed Distribution",
plot!(plt, k, pk, label="Geometric", lw=2)

We get a relative error of 9.36e-5 and a clear visual match, confirming that our computational result matches the theory.

Note that we could have also solved the original problem directly using a general purpose conic solver and a modeling tool like Convex.jl; however, when mnm \ll n (which is often the case), the dual problem will be much faster to solve.

Continuous distributions

In the above derivation, we can essentially replace the sums with integrals, which gives us a way to handle continuous distributions with only a finite number of variables in the associated optimization problem. Our expectation constraints E[fi(X)]=kAikpk=bi\mathbf{E}[f_i(X)] = \sum_k A_{ik}p_k = b_i become E[f(X)]=ai(x)p(x)dx=bi\mathbf{E}[f(X)] = \int a_i(x)p(x)dx = b_i (where we switched notation s.t. ai(x)=fi(x)a_i(x) = f_i(x)). After committing some mathematical sins and securing my place in analysis hell, we get

g(λ,ν)=luexp(a(x)Tλν1)dxλTbν g(\lambda, \nu) = -\int_l^u \exp\left(-a(x)^T\lambda - \nu - 1 \right)dx - \lambda^Tb - \nu

where a(x)=[a1(x)am(x)]Ta(x) = \begin{bmatrix} a_1(x) & \cdots a_m(x) \end{bmatrix}^T and the Ax=bAx=b constraint is equivalent to

luai(x)p(x)dx=bi, for i=1,,m. \int_l^u a_i(x)p(x)dx = b_i, \quad\text{ for } i = 1, \dots, m.

We can similarly solve this problem with LBFGS using our favorite numerical integration technique. The optimal distribution is given by

p(x)=1/exp(a(x)Tλ+ν+1). p^\star(x) = 1/\exp\left(a(x)^T\lambda^\star + \nu^\star + 1\right).

Example: Gaussian distribution

It's well known that the Gaussian distribution maximizes entropy given a fixed mean μ\mu and variance σ2\sigma^2. We can verify this empirically. To encode the moment constraints, we take a1(x)=xa_1(x) = x and a2(x)=x2a_2(x) = x^2, so b=[μμ2+σ2]Tb = \begin{bmatrix}\mu & \mu^2 + \sigma^2 \end{bmatrix}^T. We only define g(λ,ν)g(\lambda, \nu) and let Julia's automatic differentiation capabilities compute the gradient for us.

using QuadGK

# Gaussian distribution with mean 0 and variance σ^2
μ, σ2 = 0.0, 1.0
f(t) = 1/sqrt(2π) * exp(-t^2/2)

# Setup problem data
b = [μ, σ2]
m = length(b)

function f2(y, b, l, u)
    λ, ν = @view(y[1:end-1]), y[end]
    I, E = quadgk(t->exp(-dot([t, t^2], λ) - ν - 1), l, u)
    return I + dot(λ, b) + ν

# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f2(y, b, -5, 5), y0, LBFGS())
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]

# construct solution & compare distribution
t = -5:0.1:5
pstar2(t) = 1 / exp(dot(λstar, [t; t^2]) + νstar + 1)
rel_err = norm(pstar2.(t) - f.(t)) / min(norm(pstar2.(t)), norm(f.(t)))

# Plot computed & true distribution
plt = plot(
    title="Gaussian Distribution & Computed Distribution",
plot!(plt, t, f.(t), label="Gaussian", lw=2)

Again, we get a low relative error of 6.13e-6 and a clear visual match.

Something more complicated

We can repeat the example above but add that P(X0)=0.25P(X \geq 0) = 0.25 by taking f(X)=I0(X)f(X) = \mathbb{I}_{\geq 0}(X). The associated code is below.

# Setup problem data
b = [0, σ2, 0.25]
m = length(b)

function f3(y, b, l, u)
    λ, ν = @view(y[1:end-1]), y[end]
    I, E = quadgk(t->exp(-dot([t, t^2, t > 0], λ) - ν - 1), l, u)
    return I + dot(λ, b) + ν

# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f3(y, b, -5, 5), y0, LBFGS())
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]

# Construct distribution
pstar3(t) = 1 / exp(dot(λstar, [t; t^2; t >= 0]) + νstar + 1)

# Plot computed & true distribution
tplot = -5:0.01:5
plt = plot(
    title="Computed Distribution",

It's easy to quickly verify that the constraints are met:

# Verify results
μ_comp = quadgk(t -> t * pstar3(t), tplot[1], tplot[end])[1]
σ2_comp = quadgk(t -> t^2 * pstar3(t), tplot[1], tplot[end])[1] - μ_comp^2
prob_noneg = quadgk(t -> (t >= 0) * pstar3(t), tplot[1], tplot[end])[1]
println("μ̂, σ̂2, P(X ≥ 0) = $(round.((μ_comp, σ2_comp, prob_noneg), digits=4))")

μ̂, σ̂2, P(X ≥ 0) = (0.0, 1.0, 0.25)

This distribution is definitely somewhat unintuitive. It looks like a combo of an exponential and a Gaussian, which are solutions to the max entropy problem under different constraints. However, I certainly don't know how to read the tea leaves here.

Concluding thoughts

While these ideas aren't new at all, I don't see them mentioned super frequently. If any of this is interesting to you, or you think these ideas may be useful for your own work, please reach out! (

Thank you to Chris Rackauckas, Parth Nobel, and Logan Engstrom for helpful comments on drafts of this post!


[1] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.