Concentration Bounds via Optimization

When dealing with random variables, we're often concerned with bounding the probability of large deviations. For example, if the random variable $X$ is the number of parking tickets I get next year, I may be interested in bounding $P(X \geq a)$ .

Markov's inequality for a nonnegative random variable $X$ bounds the probability of large deviation when we know that $\mathbf{E}[X] = \mu$ :

P(X \geq a) \leq \frac{\mu}{a}.

A typical proof follows from the fact that

\mu = \int_0^\infty x f(x) dx \ge \int_a^\infty x f(x) dx \ge a \int_a^\infty f(x) dx = a P(X \ge a),

where $f$ is the probability mass function of the distribution.

But what if we know more than just $\mathbf{E}[X]$ ? How do we incorporate the fact that we have bounds on the third moment? Or that we know $P(X > b) \leq 0.25$ ? In this post, we'll define probability bounds as an optimization problem in which we can incorporate additional prior information. We'll also see that Markov's inequality can be derived from the dual of this optimization problem.

Note that this post is similar in flavor and techniques to one of my previous posts. When in doubt, take the dual...

The Optimization Problem

When we only have mean information, we can formulate finding a bound on $P(X \ge a)$ as a simple optimization problem:

\begin{aligned} &\text{maximize} && P(X \geq a) \\ &\text{subject to} && \mathbf{E}[X] = \mu, \end{aligned}

where the optimization variable is the probability distribution $P$ . Note that the objective and constraint are both linear in $P$ .

For simplicity, we'll make $X$ take nonnegative discrete values $x_1 > x_2 > \dots x_k \ge a > x_{k+1} > \dots x_n \ge 0$ , so the distribution $p$ is a nonnegative vector that sums to $1$ . The optimization problem can be formulated as

\begin{aligned} &\text{maximize} && p_1 + \dots + p_k \\ &\text{subject to} && x^Tp = \mu \\ &&& \mathbf{1}^Tp = 1 \\ &&& p \ge 0. \end{aligned}

To derive a bound on the objective, we will construct a dual problem, since any feasible point in the dual provides a valid upper bound on the primal (by weak duality). In fact, this problem is a linear program (a type of convex optimization problem), so a minimizer of the dual problem will give a tight bound.

The Dual Problem

We take a particular dual where we keep the nonnegativity constraint intact. This approach results in the (partial) Lagrangian

\mathcal{L}(p, y, z) = \mathbf{1}_k^Tp + y(\mu - x^Tp) + z(1 - \mathbf{1}^Tp) - \mathbf{I}(p \ge 0),

where the indicator function $\mathbf{I}(p \ge 0)$ equals $0$ if $p \geq 0$ and $\infty$ otherwise, and $\mathbf{1}_k$ is $1$ in the first $k$ entries and $0$ otherwise. We find the dual function by maximizing over the primal variable $p$ :

\begin{aligned} g(y, z) &= \sup_p \mathcal{L}(p, y, z) \\ &= \sup_p \left\{(\mathbf{1}_k - yx - z \mathbf{1})^Tp - \mathbf{I}(p \ge 0) \right\}+ y\mu + z \\ &= \begin{cases} y\mu + z & \mathbf{1}_k -yx - z \mathbf{1} \le 0 \\ \infty & \text{otherwise}. \end{cases} \end{aligned}

Thus, the dual problem is

\begin{aligned} &\text{minimize} && y\mu + z \\ &\text{subject to} && 1 \le yx_i + z \qquad i = 1, \dots, k \\ &&& 0 \le yx_i + z \qquad i = k+1, \dots, n. \\ \end{aligned}

Recall that, by weak duality, any feasible point for this problem gives a valid upper bound on the primal objective $P(X \ge a)$ . One such point is $y = 1/a$ , $z = 0$ . These values satisfy the constraints since $x_i \geq a$ for $i = 1, \dots k$ . This point has objective value $\mu / a$ , which recovers Markov's inequality.

We can also show that this inequality is tight (for example, one can find a solution to the primal problem that matches the bound).However, Markov's inequality is usually a pretty terrible bound for practical purposes. Just knowing the mean of a random variable isn't much information.

A More General Framework

To get a better bound, we can incorporate additional primal information with this optimization-based framework. In fact, we can compute any bound of the form

\begin{aligned} &\text{maximize} && P(X \geq a) \\ &\text{subject to} && \mathbf{E}[f_i(X)] = b_i \\ &&& \mathbf{E}[g_i(X)] \le c_i \end{aligned}

for arbitrary functions $f_i$ , $g_i$ by simply solving a linear program. Similarly to in a previous post, we can extend these ideas to the continuous case with a little more care.

While a computational approach is not always appropriate, in many situations we can do better than the bounds given by the standard set of inequalities. Furthermore, computational bounds can scale to higher dimensions and more easily incorporate non-i.i.d. information.

Thank you to Kshitij Kulkarni for reviewing a draft of this post!