Theo Diamandis

Concentration Bounds via Optimization

When dealing with random variables, we're often concerned with bounding the probability of large deviations. For example, if the random variable $X$ is the number of parking tickets I get next year, I may be interested in bounding $P(X \geq a)$.

Markov's inequality for a nonnegative random variable $X$ bounds the probability of large deviation when we know that $\mathbf{E}[X] = \mu$:

$$ P(X \geq a) \leq \frac{\mu}{a}. $$

A typical proof follows from the fact that

μ=0xf(x)dxaxf(x)dxaaf(x)dx=aP(Xa), \mu = \int_0^\infty x f(x) dx \ge \int_a^\infty x f(x) dx \ge a \int_a^\infty f(x) dx = a P(X \ge a),

where $f$ is the probability density function of the distribution.
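As a quick numerical sanity check (my own illustration, not from the original argument), we can verify the inequality empirically for an exponential random variable with mean $\mu = 1$ and threshold $a = 3$, where the true tail probability is $e^{-3} \approx 0.05$, well under the bound $\mu/a = 1/3$:

```python
import random

random.seed(0)

mu = 1.0   # mean of the Exp(1) distribution
a = 3.0    # deviation threshold
n = 100_000

# Draw samples from an exponential distribution with rate 1 (mean 1).
samples = [random.expovariate(1.0) for _ in range(n)]
empirical = sum(s >= a for s in samples) / n

markov_bound = mu / a
print(empirical, markov_bound)  # empirical ≈ exp(-3) ≈ 0.05, bound ≈ 0.333
assert empirical <= markov_bound
```

The large gap between the empirical tail and the bound previews the post's point: the mean alone doesn't pin down the tail very well.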

But what if we know more than just $\mathbf{E}[X]$? How do we incorporate the fact that we have bounds on the third moment? Or that we know $P(X > b) \leq 0.25$? In this post, we'll pose probability bounds as an optimization problem into which we can incorporate additional prior information. We'll also see that Markov's inequality can be derived from the dual of this optimization problem.

Note that this post is similar in flavor and techniques to one of my previous posts. When in doubt, take the dual...

The Optimization Problem

When we only have mean information, we can formulate finding a bound on $P(X \ge a)$ as a simple optimization problem:

$$ \begin{aligned} &\text{maximize} && P(X \geq a) \\ &\text{subject to} && \mathbf{E}[X] = \mu, \end{aligned} $$

where the optimization variable is the probability distribution $P$. Note that the objective and constraint are both linear in $P$.

For simplicity, we'll make $X$ take nonnegative discrete values $x_1 > x_2 > \dots > x_k \ge a > x_{k+1} > \dots > x_n \ge 0$, so the distribution $p$ is a nonnegative vector that sums to $1$. The optimization problem can be formulated as

$$ \begin{aligned} &\text{maximize} && p_1 + \dots + p_k \\ &\text{subject to} && x^T p = \mu \\ &&& \mathbf{1}^T p = 1 \\ &&& p \ge 0. \end{aligned} $$
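To make this concrete, here's a small numerical sketch (my own example, with a hypothetical support chosen for illustration) that solves this LP with `scipy.optimize.linprog`. On support $\{0, 1, \dots, 5\}$ with $a = 4$ and $\mu = 1$, the worst-case distribution should achieve exactly Markov's bound $\mu/a = 0.25$:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical support points (sorted decreasing, as in the text) and data.
xs = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
a, mu = 4.0, 1.0
k = int(np.sum(xs >= a))  # number of support points >= a

# linprog minimizes, so negate the objective 1_k^T p.
c = -np.concatenate([np.ones(k), np.zeros(len(xs) - k)])
A_eq = np.vstack([xs, np.ones(len(xs))])  # rows encode x^T p = mu and 1^T p = 1
b_eq = np.array([mu, 1.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(-res.fun)  # ≈ 0.25, i.e. Markov's bound mu / a
```

The maximizing distribution puts mass $0.25$ at $4$ and the rest at $0$, which is exactly the worst case that makes Markov's inequality tight.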

To derive a bound on the objective, we will construct a dual problem, since any feasible point in the dual provides a valid upper bound on the primal (by weak duality). In fact, this problem is a linear program (a type of convex optimization problem), so a minimizer of the dual problem will give a tight bound.

The Dual Problem

We take a particular dual where we keep the nonnegativity constraint intact. This approach results in the (partial) Lagrangian

$$ \mathcal{L}(p, y, z) = \mathbf{1}_k^T p + y(\mu - x^T p) + z(1 - \mathbf{1}^T p) - \mathbf{I}(p \ge 0), $$

where the indicator function $\mathbf{I}(p \ge 0)$ equals $0$ if $p \geq 0$ and $\infty$ otherwise, and $\mathbf{1}_k$ is $1$ in the first $k$ entries and $0$ otherwise. We find the dual function by maximizing over the primal variable $p$:

$$ \begin{aligned} g(y, z) &= \sup_p \mathcal{L}(p, y, z) \\ &= \sup_p \left\{ (\mathbf{1}_k - yx - z\mathbf{1})^T p - \mathbf{I}(p \ge 0) \right\} + y\mu + z \\ &= \begin{cases} y\mu + z & \mathbf{1}_k - yx - z\mathbf{1} \le 0 \\ \infty & \text{otherwise}. \end{cases} \end{aligned} $$

Thus, the dual problem is

$$ \begin{aligned} &\text{minimize} && y\mu + z \\ &\text{subject to} && 1 \le y x_i + z \qquad i = 1, \dots, k \\ &&& 0 \le y x_i + z \qquad i = k+1, \dots, n. \end{aligned} $$

Recall that, by weak duality, any feasible point for this problem gives a valid upper bound on the primal objective $P(X \ge a)$. One such point is $y = 1/a$, $z = 0$. These values satisfy the first set of constraints since $x_i \geq a$ for $i = 1, \dots, k$, and the second set since $x_i \ge 0$ for all $i$. This point has objective value $\mu/a$, which recovers Markov's inequality.
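We can check both claims numerically; the sketch below (my own example, on a hypothetical support $\{0, 1, \dots, 5\}$ with $a = 4$, $\mu = 1$) verifies that the Markov point $(y, z) = (1/a, 0)$ is dual feasible and that the dual optimum also equals $\mu/a$ here:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical support (sorted decreasing) and data, for illustration.
xs = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
a, mu = 4.0, 1.0
k = int(np.sum(xs >= a))

# Variables are v = (y, z). The constraints y*x_i + z >= 1 (i <= k) and
# y*x_i + z >= 0 (i > k) become -x_i*y - z <= -1 or <= 0 in A_ub @ v <= b_ub form.
A_ub = np.column_stack([-xs, -np.ones(len(xs))])
b_ub = np.concatenate([-np.ones(k), np.zeros(len(xs) - k)])

# The Markov point (y, z) = (1/a, 0) is dual feasible...
v_markov = np.array([1.0 / a, 0.0])
assert np.all(A_ub @ v_markov <= b_ub + 1e-12)

# ...and minimizing y*mu + z over the dual feasible set gives the same value.
res = linprog(c=[mu, 1.0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
print(res.fun)  # ≈ 0.25 = mu / a
```

So on this support the Markov point is not just feasible but optimal, consistent with LP strong duality and the tightness claim below.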

We can also show that this inequality is tight (for example, one can find a solution to the primal problem that matches the bound). However, Markov's inequality is usually a pretty terrible bound for practical purposes: just knowing the mean of a random variable isn't much information.

A More General Framework

To get a better bound, we can incorporate additional prior information into this optimization-based framework. In fact, we can compute any bound of the form

$$ \begin{aligned} &\text{maximize} && P(X \geq a) \\ &\text{subject to} && \mathbf{E}[f_i(X)] = b_i \\ &&& \mathbf{E}[g_i(X)] \le c_i \end{aligned} $$

for arbitrary functions $f_i$, $g_i$ by simply solving a linear program. As in a previous post, we can extend these ideas to the continuous case with a little more care.
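For instance (a hypothetical example of my own, not from the post), adding a second-moment bound $\mathbf{E}[X^2] \le 2$ on top of $\mathbf{E}[X] = 1$ is just one more linear row in the LP, and it tightens the tail bound well below Markov's $\mu/a = 0.25$:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical support (sorted decreasing) and data, for illustration.
xs = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
a, mu, second_moment = 4.0, 1.0, 2.0
k = int(np.sum(xs >= a))

c = -np.concatenate([np.ones(k), np.zeros(len(xs) - k)])  # maximize P(X >= a)
A_eq = np.vstack([xs, np.ones(len(xs))])                  # E[X] = mu, sum(p) = 1
b_eq = np.array([mu, 1.0])
A_ub = (xs**2)[None, :]                                   # extra row: E[X^2] <= 2
b_ub = np.array([second_moment])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(-res.fun)  # ≈ 1/12, much tighter than Markov's mu / a = 0.25
```

Here the worst-case distribution spreads mass over $\{0, 1, 4\}$ instead of concentrating at $\{0, 4\}$, since the old worst case would have $\mathbf{E}[X^2] = 4$, violating the new constraint.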

While a computational approach is not always appropriate, in many situations we can do better than the bounds given by the standard set of inequalities. Furthermore, computational bounds can scale to higher dimensions and more easily incorporate non-i.i.d. information.

Thank you to Kshitij Kulkarni for reviewing a draft of this post!