In Bayesian statistical inference, we start with a probability distribution that fits our prior information about a parameter $X$ (called the *prior probability distribution* or *prior*). We update this distribution as we collect data to yield the *posterior probability distribution*, which we may use for decision making.

Prior information can come in many forms. For example, we may have information about the moments of a random variable $X$, such as the variance. Or we may know that $P(X > a) \leq 0.25$. The prior should adhere to these known constraints.

In this post, we focus on a common (and computationally tractable) method of constructing priors: finding the distribution with *maximum entropy* (i.e., the least informative one) that fits a set of constraints.

In some cases, this prior is known in closed form. Famously, the Gaussian distribution is the maximum entropy continuous distribution given a known mean and variance. However, for general constraints on the random variable $X$, this distribution must be found computationally.

The max entropy distribution problem can be readily formulated as a convex optimization problem:

$\begin{array}{ll} \text{minimize} & \sum_{i=1}^n p_i\log p_i \\ \text{subject to} & Ap = b \\ &\mathbf{1}^Tp = 1, \end{array}$where $p \in \mathbf{R}^n$ is the probability mass function of random variable $X$ which takes values $\{x_1, \dots x_n\}$, and each row of $Ap=b$ encodes some expectation constraint $\mathbf{E}[f_i(X)] = \sum_k f(x_k) p_k = \sum_k A_{ik}p_k = b_i$. It turns out that this problem not only is convex but also can be solved very efficiently by looking at the dual problem.

The Lagrangian for this problem is

$L(p, \lambda, \nu) = f_0(p) + \lambda^T(Ap - b) + \nu(\mathbf{1}^Tp - 1)$Minimizing over $p$, we get the dual function

$\begin{aligned} g(\lambda, \nu) &= \inf_p(f_0(p) + (\lambda^TA + \nu\mathbf{1}^T)p) -\lambda^Tb - \nu \\ &= -\sup_p((-\lambda^TA - \nu\mathbf{1}^T)p - f_0(p)) -\lambda^Tb - \nu \\ &= -f_0^*(-A^T\lambda - \mathbf{1}\nu) - \lambda^Tb - \nu. \end{aligned}$The conjugate function for negative entropy is [BV04, Example 3.21]

$f_0^*(q) = \sum_{i=1}^n e^{q_i - 1}.$Plugging this in, we have that

$g(\lambda, \nu) = -\sum_{i=1}^ne^{-a_i^T\lambda - \nu - 1} - \lambda^Tb - \nu,$where $a_i$ is the $i^{th}$ column of $A$. To find the optimal dual variables $\lambda^\star$ and $\nu^\star$, we solve

$\begin{array}{ll} \text{maximize} & g(\lambda, \nu). \end{array}$The dual function is concave, so it can be maximized using a standard technique (e.g., Newton's method or LBFGS). For reference, the gradients are

$\begin{aligned} \nabla_\lambda g(\lambda, \nu) &= \sum_{i=1}^ne^{-a_i^T\lambda - \nu - 1}a_i - b\\ \nabla_\nu g(\lambda, \nu) &= \sum_{i=1}^ne^{-a_i^T\lambda - \nu - 1} - 1. \end{aligned}$Assuming Slater's condition holds, we have strong duality and the primal and dual problems have the same optimal value. We can substitute $\lambda^\star$ and $\nu^\star$ into the Lagrangian to get the unique (due to strict convexity) solution

$p_i^\star = 1/\exp\left(a_i^T\lambda^\star + \nu^\star + 1\right).$If this $p^\star$ is primal feasible, then it is optimal.

Several well known distributions come from solving this problem with specific choices of $\mathbf{E}[f_i(X)]=b_i$ (corresponding to $Ap = b$). The Gaussian distribution solves this problem when $X$ is defined over the reals and is constrained such that $\mathbf{E}[X] = \mu$ and $\mathbf{E}[(X - \mathbf{E}[X])^2] = \sigma^2$. This wikipedia article has several more examples.

This dual formulation also hints at how we can solve this problem for a continuous probability density function $p(x)$ by solving a convex optimization problem with only $m + 1$ variables. However, we will end up with an integral instead of a sum in the definition of $g(\lambda, \nu)$.

The geometric distribution with parameter $p$ is the maximum entropy distribution on $\{1, 2, \dots\}$ such that $\mathbf{E}[X] = 1/p$. We will verify this by formulating it as a convex optimization problem, which we can solve with LBFGS from `Optim.jl`

and Julia's powerful automatic differentiation capabilities.

```
using Optim
using LinearAlgebra
using Plots
# Geometric distribution with parameter p
p = .75
k = 1:8
pk = (1-p) .^ (k .- 1) * p
# Setup problem data
A = k'
b = [1/p]
n, m = length(k), length(b)
# function to evaluate g(λ, ν)
# note that we negate everything since the convention for Optim is to minimize
function f1(y, A, b)
n = size(A, 2)
λ, ν = @view(y[1:end-1]), y[end]
return sum([exp(-dot(@view(A[:,i]), λ)-ν-1) for i in 1:n]) + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = randn(m+1)
res = Optim.optimize(y->f1(y, A, b), y0, LBFGS(); autodiff = :forward)
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# Construct distribution
pstar = [1 / exp(dot(λstar, A[:,i]) + νstar + 1) for i in 1:n]
rel_err = norm(pstar - pk) / min(norm(pstar), norm(pk))
# Plot computed & true distribution
plt = plot(
k,
pstar,
lw=2,
title="Geometric Distribution & Computed Distribution",
markershape=:circle,
seriestype=:sticks,
label="Computed",
dpi=300
)
plot!(plt, k, pk, label="Geometric", lw=2)
```

We get a relative error of `9.36e-5`

and a clear visual match, confirming that our computational result matches the theory.

Note that we could have also solved the original problem directly using a general purpose conic solver and a modeling tool like `Convex.jl`

; however, when $m \ll n$ (which is often the case), the dual problem will be much faster to solve.

In the above derivation, we can essentially replace the sums with integrals, which gives us a way to handle continuous distributions with only a finite number of variables in the associated optimization problem. Our expectation constraints $\mathbf{E}[f_i(X)] = \sum_k A_{ik}p_k = b_i$ become $\mathbf{E}[f(X)] = \int a_i(x)p(x)dx = b_i$ (where we switched notation s.t. $a_i(x) = f_i(x)$). After committing some mathematical sins and securing my place in analysis hell, we get

$g(\lambda, \nu) = -\int_l^u \exp\left(-a(x)^T\lambda - \nu - 1 \right)dx - \lambda^Tb - \nu$where $a(x) = \begin{bmatrix} a_1(x) & \cdots a_m(x) \end{bmatrix}^T$ and the $Ax=b$ constraint is equivalent to

$\int_l^u a_i(x)p(x)dx = b_i, \quad\text{ for } i = 1, \dots, m.$We can similarly solve this problem with LBFGS using our favorite numerical integration technique. The optimal distribution is given by

$p^\star(x) = 1/\exp\left(a(x)^T\lambda^\star + \nu^\star + 1\right).$It's well known that the Gaussian distribution maximizes entropy given a fixed mean $\mu$ and variance $\sigma^2$. We can verify this empirically. To encode the moment constraints, we take $a_1(x) = x$ and $a_2(x) = x^2$, so $b = \begin{bmatrix}\mu & \mu^2 + \sigma^2 \end{bmatrix}^T$. We only define $g(\lambda, \nu)$ and let Julia's automatic differentiation capabilities compute the gradient for us.

```
using QuadGK
# Gaussian distribution with mean 0 and variance σ^2
μ, σ2 = 0.0, 1.0
f(t) = 1/sqrt(2π) * exp(-t^2/2)
# Setup problem data
b = [μ, σ2]
m = length(b)
function f2(y, b, l, u)
λ, ν = @view(y[1:end-1]), y[end]
I, E = quadgk(t->exp(-dot([t, t^2], λ) - ν - 1), l, u)
return I + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f2(y, b, -5, 5), y0, LBFGS())
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# construct solution & compare distribution
t = -5:0.1:5
pstar2(t) = 1 / exp(dot(λstar, [t; t^2]) + νstar + 1)
rel_err = norm(pstar2.(t) - f.(t)) / min(norm(pstar2.(t)), norm(f.(t)))
# Plot computed & true distribution
plt = plot(
t,
pstar2.(t),
lw=2,
title="Gaussian Distribution & Computed Distribution",
markershape=:circle,
label="Computed",
dpi=300
)
plot!(plt, t, f.(t), label="Gaussian", lw=2)
```

Again, we get a low relative error of `6.13e-6`

and a clear visual match.

We can repeat the example above but add that $P(X \geq 0) = 0.25$ by taking $f(X) = \mathbb{I}_{\geq 0}(X)$. The associated code is below.

```
# Setup problem data
b = [0, σ2, 0.25]
m = length(b)
function f3(y, b, l, u)
λ, ν = @view(y[1:end-1]), y[end]
I, E = quadgk(t->exp(-dot([t, t^2, t > 0], λ) - ν - 1), l, u)
return I + dot(λ, b) + ν
end
# Solve via LBFGS
y0 = rand(m+1)
res = Optim.optimize(y->f3(y, b, -5, 5), y0, LBFGS())
y = Optim.minimizer(res)
λstar, νstar = y[1:m], y[end]
# Construct distribution
pstar3(t) = 1 / exp(dot(λstar, [t; t^2; t >= 0]) + νstar + 1)
# Plot computed & true distribution
tplot = -5:0.01:5
plt = plot(
tplot,
pstar3.(tplot),
lw=2,
title="Computed Distribution",
legend=false,
dpi=300
)
```

It's easy to quickly verify that the constraints are met:

```
# Verify results
μ_comp = quadgk(t -> t * pstar3(t), tplot[1], tplot[end])[1]
σ2_comp = quadgk(t -> t^2 * pstar3(t), tplot[1], tplot[end])[1] - μ_comp^2
prob_noneg = quadgk(t -> (t >= 0) * pstar3(t), tplot[1], tplot[end])[1]
println("μ̂, σ̂2, P(X ≥ 0) = $(round.((μ_comp, σ2_comp, prob_noneg), digits=4))")
```

```
μ̂, σ̂2, P(X ≥ 0) = (0.0, 1.0, 0.25)
```

This distribution is definitely somewhat unintuitive. It looks like a combo of an exponential and a Gaussian, which are solutions to the max entropy problem under different constraints. However, I certainly don't know how to read the tea leaves here.

While these ideas aren't new at all, I don't see them mentioned super frequently. If any of this is interesting to you, or you think these ideas may be useful for your own work, please reach out! (tdiamand@mit.edu)

Thank you to Chris Rackauckas, Parth Nobel, and Logan Engstrom for helpful comments on drafts of this post!

[1] Boyd, S., Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.

© Theo Diamandis. Last modified: March 16, 2023. Website built with Franklin.jl and the Julia programming language.