Foundations of Data Science

Chapter notes for the textbook by Blum et al. (2020)

Author

Dan Killian

Published

February 1, 2026

2 High Dimensional Space

2.1 Markov’s Inequality

2.1.1 Application

Markov’s Inequality states the following:

\(P(X \geq a) \leq \frac{E(X)}{a}\)

The probability that a data point drawn from the distribution of \(X\) meets or exceeds some threshold \(a\) cannot exceed the mean of the distribution divided by \(a\).

When all we know about a non-negative distribution is its mean, Markov’s Inequality sets an upper bound on the probability of observing a value at or above any given threshold.

How might this look in practice?

Let \(X\) be a discrete random variable with values \(x_1, x_2, \ldots, x_n\) and probabilities \(p_1, p_2, \ldots, p_n\) where \(X \geq 0\).

Code
library(dplyr)      # tibble() and the pipe
library(flextable)  # table rendering

x <- c(1, 2, 4, 8, 16)
p <- c(0.4, 0.2, 0.2, 0.1, 0.1)

dis <- tibble(x=x,
              p=p)

dis %>%
    flextable()

x     p
1     0.4
2     0.2
4     0.2
8     0.1
16    0.1

We get the expected value, or mean, of the variable by summing each value in the distribution weighted by its probability.

\[E(X) = \sum_i x_i \cdot P(X = x_i)\]

\(E(X)\) = \((1)(.4) + (2)(.2) + (4)(.2) + (8)(.1) + (16)(.1)\)

So \(E(X)\) = 4

Code
EX <- sum(x * p)  # expected value: 4

Now let’s establish an arbitrary threshold \(a=4\)

In this example, what is \(P(X\geq a)\leq\frac{E(X)}{a}\)?

\(P(X\geq4)\leq\frac{4}{4}= 1\)

An uninformative case, because the threshold is also the mean: the inequality only tells us the probability is at most one!

What if \(a=8\)?

\(P(X\geq8)\leq\frac{4}{8}=.5\)

For a distribution with mean four, at most half of the probability mass can sit at or above eight. If more than half of the values met or exceeded eight, the mean would have to be larger than four!

And \(a=16\)?

\(P(X\ge16)\leq\frac{4}{16}=.25\)

For a distribution with mean four, at most 25 percent of all data values may equal or exceed 16. If more than 25 percent of the values exceeded 16, the mean would have to shift above four.
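The three bounds above can be checked against the exact tail probabilities of the toy distribution; a minimal sketch in R (variable names are mine):

```r
# Toy distribution from the example above
x <- c(1, 2, 4, 8, 16)
p <- c(0.4, 0.2, 0.2, 0.1, 0.1)
EX <- sum(x * p)   # expected value, 4

# Compare the Markov bound E(X)/a with the exact tail probability
for (a in c(4, 8, 16)) {
    bound  <- EX / a          # upper bound from Markov's Inequality
    actual <- sum(p[x >= a])  # exact P(X >= a) for this distribution
    cat(sprintf("a = %2d: bound = %.2f, actual = %.2f\n", a, bound, actual))
}
```

In every row the actual tail probability sits at or below the Markov bound, as it must.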

2.1.2 Derivation

Let’s work out how to derive Markov’s Inequality, with additional illustrations along the way.

Note how we computed \(E(X)\) as the sum of all values times their weight.

\[E(X) = \sum_i x_i \cdot P(X = x_i)\]

Let’s do that again, but this time grouping terms around our threshold \(a=4\).1

1 This is the Law of Total Expectation: \(E(X) = E(X|A)P(A) + E(X|A^c)P(A^c)\)

\[E(X) = \sum\limits_{x_i < a} x_i \cdot P(X = x_i) + \sum\limits_{x_i \geq a} x_i \cdot P(X = x_i)\]

In the example we’re working with, that would be

\(E(X) = \Big[ (1)(.4) + (2)(.2) \Big] + \Big[ (4)(.2) + (8)(.1) + (16)(.1) \Big]\)

We’re only interested in the second part of the decomposition. What happens if we remove the first part, the terms where \(x_i \lt a\)?

Since all values of \(X\) are non-negative, removing the first part removes non-negative terms from the sum. The remaining sum is therefore no larger than \(E(X)\).

This allows us to replace the equality with greater than or equal to.

\[E(X) \geq \sum\limits_{x_i \geq a} x_i \cdot P(X = x_i)\]

Returning to our example, that would be

\(E(X) \geq (4)(.2) + (8)(.1) + (16)(.1)\)

Recall that \(E(X)=4\), while the remaining sum is \(.8 + .8 + 1.6 = 3.2\)

\(4\geq3.2\), and our inequality holds.
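The decomposition is easy to verify numerically; a quick sketch reusing the toy vectors:

```r
x <- c(1, 2, 4, 8, 16)
p <- c(0.4, 0.2, 0.2, 0.1, 0.1)
a <- 4

below <- sum(x[x < a]  * p[x < a])   # (1)(.4) + (2)(.2)            = 0.8
above <- sum(x[x >= a] * p[x >= a])  # (4)(.2) + (8)(.1) + (16)(.1) = 3.2

below + above        # recovers E(X) = 4
sum(x * p) >= above  # TRUE: dropping the lower terms can only shrink the sum
```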

Let’s go back to the generalized notation:

\[E(X) \geq \sum\limits_{x_i \geq a} x_i \cdot P(X = x_i)\]

Note that each remaining \(x_i\) is greater than or equal to \(a\). So our inequality still holds if we replace each \(x_i\) with \(a\).

\[E(X) \geq \sum\limits_{x_i \geq a} a \cdot P(X = x_i)\]

We’re still summing over the same values where \(x_i\geq a\), but now we multiply by \(a\) instead of \(x_i\).

This is where the loosening of the bound occurs! Back to our simple example with \(a=4\): what is the value of the remaining sum?

\(x_i\) = 4: 4 * 0.2 = 0.8
\(x_i\) = 8: 8 * 0.1 = 0.8
\(x_i\) = 16: 16 * 0.1 = 1.6

Add them up: 4 * 0.2 + 8 * 0.1 + 16 * 0.1 = 3.2

Now we repeat that step, but replace each \(x_i\) with \(a\), a value that is never larger:

\(x_i\) = 4: 4 * 0.2 = 0.8
\(x_i\) = 8: 4 * 0.1 = 0.4
\(x_i\) = 16: 4 * 0.1 = 0.4

Add them up: 4 * 0.2 + 4 * 0.1 + 4 * 0.1 = 1.6

Replacing each \(x_i\) with \(a\) loosens the bound (3.2 drops to 1.6), but that is exactly what lets the proof go through.
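The full chain of inequalities so far can be confirmed in one pass; a small sketch, again with the toy distribution:

```r
x <- c(1, 2, 4, 8, 16)
p <- c(0.4, 0.2, 0.2, 0.1, 0.1)
a <- 4

EX      <- sum(x * p)                  # 4
partial <- sum(x[x >= a] * p[x >= a])  # 3.2, after dropping terms below a
loose   <- a * sum(p[x >= a])          # 1.6, after replacing each x_i with a

EX >= partial && partial >= loose  # TRUE: each step only shrinks the sum
```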

Now let’s factor the constant \(a\) out of the sum, using the linearity of summation.

\[E(X) \geq a\cdot \sum\limits_{x_i \geq a} P(X = x_i)\]

Now note that the sum of \(P(X = x_i)\) over all \(x_i\geq a\) is just the cumulative probability \(P(X\geq a)\). So let’s replace the sum with that probability.

\[E(X) \geq a \cdot P(X \geq a)\]

Solve for the probability to get Markov’s Inequality:

\[P(X \geq a) \leq \frac{E(X)}{a}\]
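The derivation can be wrapped in a small helper; the function name `markov_bound` is my own choosing, not from the text:

```r
# Hypothetical helper: Markov bound for a non-negative discrete distribution
markov_bound <- function(x, p, a) {
    stopifnot(all(x >= 0), isTRUE(all.equal(sum(p), 1)))
    min(1, sum(x * p) / a)  # cap at 1, since the bound is vacuous beyond that
}

x <- c(1, 2, 4, 8, 16)
p <- c(0.4, 0.2, 0.2, 0.1, 0.1)

markov_bound(x, p, 8)  # 0.5
sum(p[x >= 8])         # 0.2, the exact tail probability, well under the bound
```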

Recap:

  • After establishing some arbitrary threshold \(a\), we used the Law of Total Expectation to group the values of the distribution below and above the threshold.

  • We dropped all terms below the threshold. Because every dropped term is non-negative, the remaining sum is at most the expected value.

  • Knowing that the expected value is the same as or higher than the remaining terms, we replaced the equality operator in the total expectation with a greater than or equal to operator after dropping the terms less than \(a\).

  • Knowing that the remaining values \(x_i\) are greater than or equal to \(a\), we replaced each \(x_i\) with \(a\). This is where the loosening of the bound occurs, but it enables us to derive the proof.

  • We factored the constant \(a\) out of the summation operator.

  • We recognized that the summation of terms above a threshold is the same thing as the cumulative probability above the threshold, so we replaced the summation operator with cumulative probability.

  • We solved for \(P(X\geq a)\) to be \(\frac{E(X)}{a}\).

And that is Markov’s Inequality.

Let’s look at a few more data illustrations.

Code
set.seed(432)

con <- tibble(x=round(rnorm(100, 50, 10), 0))
e_x <- round(mean(con$x), 0)
sd_x <- sd(con$x)  # renamed to avoid masking sd()

head(con) %>%
    flextable()

x
43
64
56
56
55
54

Code
library(psych)  # provides describe()

describe(con) %>% flextable()

vars  n    mean  sd    median  trimmed  mad   min  max  range  skew  kurtosis  se
1     100  48.8  9.59  48.5    48.5     11.1  31   73   42     0.23  -0.84     0.959

Code
set.seed(32)
a <- sample(con$x, 1)

Let’s say \(a = 52\)

\[P(X\geq52)\leq\frac{49}{52}\approx 0.942\]

For context, what is the actual probability of observing 52 or more?

Code
mean(con$x>=a)
[1] 0.41

Under the Markov bound, as many as 94 percent of data values could be 52 or larger. In the data simulated from a normal distribution, 41 percent of the values actually are.

Now let’s set values of \(a\) at higher and higher quantiles2

2 Values of \(a\) at low quantiles are uninteresting because the bound is at or above one, which is trivially true

Code
a <- quantile(con$x,
              probs=c(.75,.9,.95,.99)) %>%
    round(0)
Code
out <- tibble(
    quantile = names(a),
    a=a,
    ex_a = e_x / a,
    actual_prob = c(
        mean(con$x >= a[1]),  # proportion >= 75th percentile
        mean(con$x >= a[2]),  # proportion >= 90th percentile
        mean(con$x >= a[3]),  # proportion >= 95th percentile
        mean(con$x >= a[4])   # proportion >= 99th percentile
    ))
out %>%
    flextable()

quantile  a   ex_a   actual_prob
75%       56  0.875  0.26
90%       61  0.803  0.15
95%       64  0.766  0.08
99%       69  0.710  0.02

The Markov bounds are much higher than the actual tail probabilities found in the data. The gap is large enough that Markov’s Inequality isn’t useful for hypothesis testing; its value is in theoretical work, where it sets bounds and supports proofs. A helpful way to read it: what is the maximum proportion of data points that can meet or exceed a given threshold before the mean must shift toward that threshold?
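Loose as it is for the normal sample above, the bound cannot be improved without more assumptions: a two-point distribution that puts all of its upper mass exactly at the threshold attains it. A sketch (the distribution is my own illustration, not from the text):

```r
# Two-point distribution: mass at 0 and at the threshold itself
x <- c(0, 16)
p <- c(0.75, 0.25)
a <- 16

EX <- sum(x * p)  # 4, same mean as the earlier toy example

EX / a            # Markov bound: 0.25
sum(p[x >= a])    # exact tail probability: 0.25, so the bound is attained
```

Any distribution with mean four and a quarter of its mass at or above 16 must look like this, which is why Markov's Inequality cannot be tightened using the mean alone.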

Let’s conclude by looking at Markov’s Inequality across several distributions:

2.2 Chebyshev’s Inequality