In Bayesian statistical inference, we start with a probability distribution that fits our prior information about a parameter (called the prior probability distribution or prior). We update this distribution as we collect data to yield the posterior probability distribution, which we may use for decision making.
Prior information can come in many forms. For example, we may have information about the moments of a random variable $X$, such as its variance. Or we may know the probability of some event, such as $\mathbf{P}(X \geq 0)$. The prior should adhere to these known constraints.
In this post, we focus on a common (and computationally tractable) method of constructing priors: finding the distribution with maximum entropy (i.e., the least informative one) that fits a set of constraints.
In some cases, this prior is known in closed form. Famously, the Gaussian distribution is the maximum entropy continuous distribution given a known mean and variance. However, for general constraints on the random variable $X$, this distribution must be found computationally.
The max entropy distribution problem can be readily formulated as a convex optimization problem:

$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^n p_i \log p_i \\
\text{subject to} & Ap = b \\
& \mathbf{1}^T p = 1, \quad p \geq 0,
\end{array}
$$

where $p \in \mathbf{R}^n$ is the probability mass function of random variable $X$, which takes values $x_1, \dots, x_n$, and each row of $A \in \mathbf{R}^{m \times n}$ encodes some expectation constraint $\mathbf{E}[f_j(X)] = \sum_{i=1}^n f_j(x_i)\, p_i = b_j$. It turns out that this problem not only is convex but also can be solved very efficiently by looking at the dual problem.
The Lagrangian for this problem (treating $p \geq 0$ implicitly through the domain of the objective) is

$$
L(p, \lambda, \nu) = \sum_{i=1}^n p_i \log p_i + \lambda^T (Ap - b) + \nu\,(\mathbf{1}^T p - 1).
$$
Minimizing over $p$, we get the dual function

$$
g(\lambda, \nu) = \inf_p L(p, \lambda, \nu) = -b^T \lambda - \nu - f^*\!\left(-A^T \lambda - \nu \mathbf{1}\right),
$$

where $f^*$ denotes the conjugate of the negative entropy $f(p) = \sum_{i=1}^n p_i \log p_i$.
The conjugate function for negative entropy is [BV04, Example 3.21]

$$
f^*(y) = \sum_{i=1}^n e^{y_i - 1}.
$$
Plugging this in, we have that

$$
g(\lambda, \nu) = -b^T \lambda - \nu - \sum_{i=1}^n e^{-a_i^T \lambda - \nu - 1},
$$

where $a_i$ is the $i$th column of $A$. To find the optimal dual variables $\lambda^\star$ and $\nu^\star$, we solve the unconstrained problem

$$
\underset{\lambda,\,\nu}{\text{maximize}} \quad g(\lambda, \nu).
$$
The dual function is concave, so it can be maximized using a standard technique (e.g., Newton's method or LBFGS). For reference, the gradients are

$$
\nabla_\lambda g(\lambda, \nu) = -b + \sum_{i=1}^n a_i\, e^{-a_i^T \lambda - \nu - 1},
\qquad
\frac{\partial g}{\partial \nu}(\lambda, \nu) = -1 + \sum_{i=1}^n e^{-a_i^T \lambda - \nu - 1}.
$$
Assuming Slater's condition holds, we have strong duality, and the primal and dual problems have the same optimal value. We can substitute $\lambda^\star$ and $\nu^\star$ into the Lagrangian to get the unique (due to strict convexity) solution

$$
p_i^\star = \exp\!\left(-a_i^T \lambda^\star - \nu^\star - 1\right), \qquad i = 1, \dots, n.
$$
If this is primal feasible, then it is optimal.
Several well-known distributions come from solving this problem with specific choices of $A$ and $b$ (corresponding to the constraint functions $f_j$). The Gaussian distribution solves this problem when $X$ is defined over the reals and is constrained such that $\mathbf{E}[X] = \mu$ and $\mathbf{E}[(X - \mu)^2] = \sigma^2$. This Wikipedia article has several more examples.
This dual formulation also hints at how we can solve this problem for a continuous probability density function by solving a convex optimization problem with only $m + 1$ variables. However, we will end up with an integral instead of a sum in the definition of $g$.
The geometric distribution with parameter $p$ is the maximum entropy distribution on the positive integers $\{1, 2, 3, \dots\}$ such that $\mathbf{E}[X] = 1/p$. We will verify this by formulating it as a convex optimization problem (truncating the support to $k = 1, \dots, 8$, which carries nearly all of the mass when $p = 0.75$), which we can solve with LBFGS from Optim.jl and Julia's powerful automatic differentiation capabilities.
using Optim
using LinearAlgebra
using Plots
# Geometric distribution with parameter p
p = .75
k = 1:8
pk = (1-p) .^ (k .- 1) * p
# Setup problem data
A = k'
b = [1/p]
n, m = length(k), length(b)
# function to evaluate g(λ, ν)
# note that we negate everything since the convention for Optim is to minimize
function f1(y, A, b)
n = size(A, 2)
λ, ν = @view(y[1:end-1]), y[end]
return sum([exp(-dot(@view(A[:,i]), λ)-ν-1) for i in 1:n]) + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = randn(m+1)
res = Optim.optimize(y->f1(y, A, b), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# Construct distribution
pstar = [1 / exp(dot(λstar, A[:,i]) + νstar + 1) for i in 1:n]
rel_err = norm(pstar - pk) / min(norm(pstar), norm(pk))
# Plot computed & true distribution
plt = plot(
k,
pstar,
lw=2,
title="Geometric Distribution & Computed Distribution",
markershape=:circle,
seriestype=:sticks,
label="Computed",
dpi=300
)
plot!(plt, k, pk, label="Geometric", lw=2)
We get a relative error of 9.36e-5 and a clear visual match, confirming that our computational result matches the theory.
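As a sanity check on the derivation, we could also hand Optim the analytic gradient of $-g$ from above instead of relying on forward-mode AD. The sketch below reuses `f1`, `A`, `b`, and `y0`; the helper names `neg_g_grad!` and `res_grad` are my own.

# Analytic gradient of -g(λ, ν), following the gradient expressions derived above.
# Optim expects an in-place gradient of the form g!(storage, y).
function neg_g_grad!(storage, y, A, b)
    n, m = size(A, 2), length(b)
    λ, ν = @view(y[1:end-1]), y[end]
    w = [exp(-dot(@view(A[:, i]), λ) - ν - 1) for i in 1:n]               # wᵢ = pᵢ(λ, ν)
    storage[1:m] .= b .- [sum(A[j, i] * w[i] for i in 1:n) for j in 1:m]  # ∇_λ(-g) = b - A w
    storage[m+1] = 1 - sum(w)                                             # ∂(-g)/∂ν = 1 - Σᵢ wᵢ
    return storage
end
res_grad = Optim.optimize(y -> f1(y, A, b), (st, y) -> neg_g_grad!(st, y, A, b), y0, LBFGS())

Both versions should land on (numerically) the same dual solution.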
Note that we could have also solved the original problem directly using a general purpose conic solver and a modeling tool like Convex.jl; however, when $m \ll n$ (which is often the case), the dual problem will be much faster to solve.
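For completeness, a rough sketch of that primal formulation in Convex.jl might look like the following (assuming the SCS solver is installed; `pvar`, `Amat`, `problem`, and `p_primal` are names I chose, and `n`, `A`, `b` are the problem data from the geometric example):

using Convex, SCS
# Solve the primal problem directly: maximize entropy subject to the constraints.
Amat = Matrix{Float64}(A)           # convert the problem data to a plain matrix
pvar = Variable(n)
problem = maximize(entropy(pvar),   # entropy(x) = Σ -xᵢ log xᵢ, a concave atom
                   Amat * pvar == b,
                   sum(pvar) == 1,
                   pvar >= 0)
solve!(problem, SCS.Optimizer)
p_primal = evaluate(pvar)           # should match pstar up to solver tolerance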
In the above derivation, we can essentially replace the sums with integrals, which gives us a way to handle continuous distributions with only a finite number of variables in the associated optimization problem. Our expectation constraints become

$$
\mathbf{E}[f_j(X)] = \int f_j(t)\, p(t)\, dt = b_j
$$

(where we switched notation s.t. $p$ now denotes a probability density function). After committing some mathematical sins and securing my place in analysis hell, we get

$$
g(\lambda, \nu) = -b^T \lambda - \nu - \int e^{-a(t)^T \lambda - \nu - 1}\, dt,
$$

where $a(t) = (f_1(t), \dots, f_m(t))$, and the constraint $\mathbf{1}^T p = 1$ is equivalent to $\int p(t)\, dt = 1$. We can similarly solve this problem with LBFGS using our favorite numerical integration technique. The optimal distribution is given by

$$
p^\star(t) = \exp\!\left(-a(t)^T \lambda^\star - \nu^\star - 1\right).
$$
It's well known that the Gaussian distribution maximizes entropy given a fixed mean $\mu$ and variance $\sigma^2$. We can verify this empirically. To encode the moment constraints, we take $a(t) = (t, t^2)$ and $b = (\mu, \sigma^2 + \mu^2)$, so that $\mathbf{E}[X] = \mu$ and $\mathbf{E}[X^2] = \sigma^2 + \mu^2$ (with $\mu = 0$ below, this is simply $b = (0, \sigma^2)$). We only define the negated dual function and let Julia's automatic differentiation capabilities compute the gradient for us.
using QuadGK
# Gaussian distribution with mean 0 and variance σ^2
μ, σ2 = 0.0, 1.0
f(t) = 1/sqrt(2π) * exp(-t^2/2)
# Setup problem data
b = [μ, σ2]
m = length(b)
function f2(y, b, l, u)
λ, ν = @view(y[1:end-1]), y[end]
I, E = quadgk(t->exp(-dot([t, t^2], λ) - ν - 1), l, u)
return I + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f2(y, b, -5, 5), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# construct solution & compare distribution
t = -5:0.1:5
pstar2(t) = 1 / exp(dot(λstar, [t; t^2]) + νstar + 1)
rel_err = norm(pstar2.(t) - f.(t)) / min(norm(pstar2.(t)), norm(f.(t)))
# Plot computed & true distribution
plt = plot(
t,
pstar2.(t),
lw=2,
title="Gaussian Distribution & Computed Distribution",
markershape=:circle,
label="Computed",
dpi=300
)
plot!(plt, t, f.(t), label="Gaussian", lw=2)
Again, we get a low relative error of 6.13e-6 and a clear visual match.
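We can also back out the Gaussian parameters implied by the dual solution: completing the square in $p^\star(t) = \exp(-\lambda_1^\star t - \lambda_2^\star t^2 - \nu^\star - 1)$ gives $\sigma^2 = 1/(2\lambda_2^\star)$ and $\mu = -\lambda_1^\star / (2\lambda_2^\star)$. A quick check (variable names are my own):

# Recover (μ, σ²) from the dual variables by completing the square
σ2_rec = 1 / (2 * λstar[2])
μ_rec = -λstar[1] / (2 * λstar[2])
println("recovered (μ, σ²) ≈ $(round.((μ_rec, σ2_rec), digits=4))")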
We can repeat the example above but add the constraint $\mathbf{P}(X \geq 0) = 0.25$ by taking $a(t) = (t, t^2, \mathbf{1}\{t \geq 0\})$ and $b = (0, \sigma^2, 0.25)$. The associated code is below.
# Setup problem data
b = [0, σ2, 0.25]
m = length(b)
function f3(y, b, l, u)
λ, ν = @view(y[1:end-1]), y[end]
I, E = quadgk(t->exp(-dot([t, t^2, t >= 0], λ) - ν - 1), l, u)
return I + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f3(y, b, -5, 5), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# Construct distribution
pstar3(t) = 1 / exp(dot(λstar, [t; t^2; t >= 0]) + νstar + 1)
# Plot computed & true distribution
tplot = -5:0.01:5
plt = plot(
tplot,
pstar3.(tplot),
lw=2,
title="Computed Distribution",
legend=false,
dpi=300
)
It's easy to verify that the constraints are met:
# Verify results
μ_comp = quadgk(t -> t * pstar3(t), tplot[1], tplot[end])[1]
σ2_comp = quadgk(t -> t^2 * pstar3(t), tplot[1], tplot[end])[1] - μ_comp^2
prob_noneg = quadgk(t -> (t >= 0) * pstar3(t), tplot[1], tplot[end])[1]
println("μ̂, σ̂2, P(X ≥ 0) = $(round.((μ_comp, σ2_comp, prob_noneg), digits=4))")
μ̂, σ̂2, P(X ≥ 0) = (0.0, 1.0, 0.25)
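As a final check, the computed density should integrate to (approximately) one over the truncated domain:

# Check normalization of the computed density over [-5, 5]
total_mass = quadgk(t -> pstar3(t), tplot[1], tplot[end])[1]
println("∫ p⋆(t) dt ≈ $(round(total_mass, digits=4))")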
This distribution is definitely somewhat unintuitive. It looks like a combo of an exponential and a Gaussian, which are solutions to the max entropy problem under different constraints. However, I certainly don't know how to read the tea leaves here.
While these ideas aren't new at all, I don't see them mentioned super frequently. If any of this is interesting to you, or you think these ideas may be useful for your own work, please reach out! (tdiamand@mit.edu)
Thank you to Chris Rackauckas, Parth Nobel, and Logan Engstrom for helpful comments on drafts of this post!
[BV04] Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.