\( \def\e{\operatorname{E}} \def\p{\operatorname{P}} \def\var{\operatorname{Var}} \def\sd{\operatorname{SD}} \def\bin{\operatorname{Bin}} \def\n{\operatorname{N}} \)

10 Intro to Probability

In this section, we’ll introduce you to some foundational probability theory that will be necessary for later models. We will do this in a way that is only semi-rigorous, with an emphasis on teaching the materials in an intuitive fashion.

10.1 Random variables (RVs)

Suppose we have an experiment that produces an outcome each time it’s observed. This is modeled by a random variable, often denoted with a capital letter, e.g. \(X\) or \(Y\). The set of all possible outcomes is called the sample space and often denoted \(\Omega\).

Sets of possible outcomes are called events. Each event has some probability associated with it. The probability of some event is often denoted \(\p(\text{event})\).

A distribution is any specification of both the outcomes and the associated probabilities of a random variable.

Let \(X\) be the result of rolling a standard 6-sided die that is fair, i.e. the outcomes 1, 2, 3, 4, 5, and 6 all have equal probability. Here are a few examples of events and their corresponding probabilities:

  • Probability of getting a 1: \(\p(X=1)=\frac16\)
  • Probability of getting more than 4: \(\p(X>4)=\frac13\)
  • Probability of getting an even number: \(\p(X=2,4,\text{or }6)=\frac12\)
  • Probability of getting a 7: \(\p(X=7)=0\)

10.2 Axioms of probability

In math, axioms are basic rules which formally define an object and which are assumed to be true without proof. These form the basis on which everything else rests. These are the axioms of probability:

  1. The probability of an event is always non-negative.
    • Mathematically, \(\p(E)\ge0\) for any event \(E\) of some random variable.
  2. The probability of the entire sample space is always 1.
    • Mathematically, \(\p(\Omega)=1\) for any random variable. Note 1 is equivalent to 100%.
  3. The probability of the union of mutually exclusive events is equal to the sum of the probabilities of each event.
    • Mathematically, if \(A\cap B\) is empty, then \(\p(A\cup B)=\p(A)+\p(B)\)

Let \(A\) and \(B\) be two events for some random variable.

The union of \(A\) and \(B\), denoted \(A\cup B\), is the event of observing \(A\) OR \(B\).

The intersection of \(A\) and \(B\), denoted \(A\cap B\), is the event of observing \(A\) AND \(B\).

\(A\), \(B\) are called mutually exclusive if they don’t intersect, i.e. they have no outcomes in common.

Let’s see these in an example. Let \(X\) again be the result of rolling a fair, 6-sided die with outcomes \(1,2,\ldots,6\).

Let \(A\) be the event of observing \(X\) to be more than 4, and let \(B\) be the event of observing \(X\) to be an even number. Then, \(A\cap B=\{6\}\) and \(A\cup B=\{2,4,5,6\}\). Note that \(1,3\) are in neither \(A\) nor \(B\).

Since \(A\cap B=\{6\}\) which is NOT empty, \(A\) and \(B\) are NOT mutually exclusive. Suppose we define a third event \(C\) as observing \(X\) to be either \(1\) or \(3\). Then, \(C\) is mutually exclusive with both \(A\) and \(B\), since both \(A\cap C\) and \(B\cap C\) are empty.

10.2.1 Corollaries

From these axioms, we have a few important corollaries (i.e. derived statements) that are also true:

  1. Probabilities are always between 0 and 1.
    • Mathematically, for any event \(E\), we have \(0\le \p(E)\le1\)
  2. To get the probability of the “opposite” event, subtract from 1.
    • Mathematically, for any event \(E\), we have \(\p(\text{not }E)=1-\p(E)\).
  3. In general, for any events \(A,B\), the probability of \(A\) or \(B\) is the probability of \(A\) plus the probability of \(B\) minus the probability of the intersection of \(A\) and \(B\). This is the generalized form of the 3rd axiom.
    • Mathematically, \(\p(A\cup B)=\p(A)+\p(B)-\p(A\cap B)\).

These are not difficult to derive from the axioms, but we omit the proofs for brevity.34

Here’s an example of how to use these axioms. Suppose in Nice Town, on an average day, there’s a 70% chance it’s sunny, and a 40% chance of a light breeze. Suppose there’s a 20% chance of being neither sunny nor breezy. What’s the probability it’s both sunny and breezy?

Let \(S\) represent sunny, and \(B\) represent breezy. Then, from the information given, we know \(\p(S)=0.7\), \(\p(B)=0.4\), and \(\p(\text{neither \(S\) nor \(B\)})=0.2\).

By corollary 2, \(\p(\text{neither \(S\) nor \(B\)})=1-\p(S\cup B)\), so we have \(\p(S\cup B)=0.8\).

By corollary 3, \(\p(S\cup B)=\p(S)+\p(B)-\p(S\cap B)\). Rearranging the terms, we get \(\p(S\cap B)=\p(S)+\p(B)-\p(S\cup B)=0.7+0.4-0.8=0.3\). Thus, there’s a 30% chance of it being both sunny and breezy.
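If you'd like, you can also verify this arithmetic in R. Below is a quick sketch (not part of the original example) that translates the two corollaries directly into code; the variable names are just for illustration:

# quick check of the Nice Town example using the corollaries
p_S <- 0.7                     # P(sunny)
p_B <- 0.4                     # P(breezy)
p_neither <- 0.2               # P(neither sunny nor breezy)
p_union <- 1 - p_neither       # corollary 2: P(S or B) = 1 - P(neither)
p_S + p_B - p_union            # corollary 3, rearranged: P(S and B)
[1] 0.3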

10.3 Discrete vs Continuous RVs

Generally, random variables are either discrete or continuous.

A discrete RV is one whose outcomes can be listed out one-by-one. The list is allowed to be infinite.35

A continuous RV is the opposite, where outcomes are in a continuous range and not listable.

In practice, we usually use discrete RVs to model integer valued outcomes, e.g. counts of something; and continuous RVs to model real number valued outcomes, e.g. lengths/weights/durations. These require slightly different mathematical notations/treatments.

10.3.1 Discrete (PMFs)

For discrete RVs, the distribution of outcomes and probabilities is specified with a probability mass function (PMF). Note the PMF must satisfy all rules of probability. In particular, all probabilities must be in \([0,1]\), and \(\p(\Omega)=\sum_k\p(k)=1\) where \(k\) represents each possible outcome.

Let \(X\) be a discrete RV. The probability mass function (PMF) of \(X\) is a function \(P\) which, for each possible outcome \(k\) in the sample space, specifies the probability that \(X\) is observed to be \(k\), denoted \(\p(X\! =\!k)\), or sometimes \(\p(k)\) for short.

To be a valid PMF, \(P\) must satisfy the probability axioms, namely it must always be non-negative and sum to 1 across all possible outcomes in the sample space.

PMFs can be specified using a table, a function, or a plot.

Let \(X\) be the number of dollars you win from a new casino game where there’s only 4 possible outcomes: you either win nothing with 40% chance, or you win $1 with 30% chance, or you win $2 with 20% chance, or you win $3 with 10% chance.

First, it’s easy to see the axioms are satisfied, since all probabilities are in \([0,1]\) and \(0.1+0.2+0.3+0.4=1\). Thus, this is a valid PMF. We can specify the PMF in any of the following equivalent ways:

Table:

 k | \(\p(k)\)
---+----------
 0 | 0.4 
 1 | 0.3 
 2 | 0.2 
 3 | 0.1 

Function:

\[\p(X\! =\!k)=\begin{cases}\frac1{10}\big(4-k\big) & k=0,1,2,3 \\ 0 & \text{otherwise}\end{cases}\]

Plot:

# remember to import tidyverse (and optionally, update theme options)
tibble(k = 0:3, p = (4:1)/10) %>%
  ggplot(aes(x = k, y = p)) + geom_col() +
  labs(title = "Distribution of X (winnings from made-up casino game)")

For another example, let \(X\) be the sum of rolling 2 ordinary, fair 6-sided dice (independently)36. What is the PMF of \(X\)?

First, note the possible outcomes \(k\) in the sample space are the integers \(k=2,3,\ldots,12\). Next, since the dice are fair, we can find the probability of each outcome \(k\) by counting the number of combinations that add to \(k\). For example, for \(k=5\) the combinations are (1,4), (2,3), (3,2), and (4,1). Each combination has probability 1/36, so summing them we get \(\p(X=5)=4\cdot\frac1{36}=\frac19\).

From this, you can show the probability for each \(k=2,3,\ldots,12\) is \(\p(X=k)=(6-|k-7|)/36\). Thus we can write the PMF as:

\[\p(X\! =\!k)=\begin{cases}\frac1{36}\big(6-|k-7|\big)&k=2,3,\ldots,12\\0&\text{otherwise}\end{cases}\]

You can easily check this PMF satisfies the probability axioms. Here’s a plot of this PMF:

tibble(k = 2:12, p = (6-abs(k-7))/36) %>%
  ggplot(aes(x = k, y = p)) + geom_col() +
  labs(title = "Distribution of X (sum of rolling two fair 6-sided dice)") +
  scale_x_continuous(breaks = 2:12) +
  scale_y_continuous(breaks = seq(0,.2,.02))
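As a quick numeric sanity check (a small sketch, reusing the same vectorized expression as the plot above), we can confirm the PMF values are all non-negative and sum to 1:

# check the axioms numerically for the two-dice PMF
p <- (6 - abs(2:12 - 7))/36
all(p >= 0)   # axiom 1: all probabilities non-negative
[1] TRUE
sum(p)        # axiom 2: total probability is 1
[1] 1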

10.3.2 Continuous (PDFs)

For continuous RVs, distributions are specified with a probability density function (PDF). PDFs are similar to PMFs but with a key distinction: a PDF does NOT output the probability of an outcome; rather, it outputs a “density”, which can be thought of as the rate of change of probability.

Let \(X\) be a continuous RV. The probability density function (PDF) of \(X\) is a function \(P\) which, for each outcome \(x\) in the sample space, specifies the density of probability around \(x\).

To be a valid PDF, \(P\) must also satisfy the probability axioms, i.e. \(P\) must always be non-negative and integrate to 1 across the sample space.

This is important enough to warrant repeating: a continuous PDF does NOT output a probability! For continuous PDFs, probabilities ALWAYS correspond to areas under the PDF.

Also note it’s customary to use \(k\) to represent possible outcomes of discrete PMFs, and \(x\) to represent possible outcomes of continuous PDFs.

PDFs are the continuous analog of PMFs, so whenever you might use PMFs in a summation \(\sum\) expression, you would switch to a definite integral \(\int\) for a PDF. In STAT 240, we will NOT require you to evaluate these integrals, but we may occasionally show them to familiarize you with the notation. However, you may be asked to do computations with simple PMFs.

PDFs can feel strange at first, so here’s an easy example to start. According to the 2011-12 National Health and Nutrition Examination Survey (NHANES) by the CDC, US adult human height has an approximately normal distribution with mean 65.8 inches and standard deviation 3.98 inches. Below is a plot of the distribution:

# use geom_function to plot dnorm (the normal PDF function)
# also plot the mean ± up to 3 standard deviations
ggplot(tibble(SDs = 65.8 + (-3:3)*3.98)) +
  geom_function(fun = \(x) dnorm(x, 65.8, 3.98), xlim = c(52, 80), n = 1e3) +
  geom_segment(aes(x = SDs, xend = SDs, y = 0, yend = dnorm(SDs, 65.8, 3.98)),
               color = "red", linetype = "dashed", linewidth = 0.7) +
  scale_x_continuous(breaks = seq(52, 80, 2), expand = 0) +
  scale_y_continuous(breaks = seq(0, 0.10, 0.01), expand = c(0, 0.001)) +
  labs(title = "US adult human height distribution",
       subtitle = '(approx normal w/ mean 65.8", SD 3.98"; SDs shown in dashed)',
       x = "height (inches)", y = "probability density")

You can easily see the density is non-negative, and with some careful math the area can be shown to be 1. Thus it satisfies the probability axioms.
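If you're curious, R can also approximate that area numerically. Here's a small sketch using integrate() over a wide range (the mean ± 10 SDs, which contains essentially all of the probability); this is not something you'll be asked to do in this course:

# numerically integrate the height PDF over mean ± 10 SDs
round(integrate(\(x) dnorm(x, 65.8, 3.98), 65.8 - 10*3.98, 65.8 + 10*3.98)$value, 4)
[1] 1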

Remember the density at \(x\) is NOT the probability of \(x\). Again, probabilities of events correspond to areas under the curve. For example, one can show that for a normal distribution:

  • Approx. 68% of outcomes are within ±1 standard deviation of the mean,
  • Approx. 95% of outcomes are within ±2 standard deviations of the mean,
  • Approx. 99.7% of outcomes are within ±3 standard deviations of the mean.

This rule is called the empirical rule. For our distribution of heights, this means about 68% of people are between 62 and 70 inches, about 95% are between 58 and 74 inches, etc.
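The function pnorm(), covered later in this chapter, computes these areas exactly. As a preview sketch, here's a check of the first two statements for the height distribution:

# area within ±1 SD and ±2 SDs of the mean for N(65.8, 3.98)
pnorm(65.8 + 3.98, 65.8, 3.98) - pnorm(65.8 - 3.98, 65.8, 3.98)
[1] 0.6826895
pnorm(65.8 + 2*3.98, 65.8, 3.98) - pnorm(65.8 - 2*3.98, 65.8, 3.98)
[1] 0.9544997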

For another example, consider the uniform distribution on \((0,1)\). This distribution generalizes rolling a fair die to drawing a number from an interval, where every value inside the interval is equally likely to be selected. Below is a plot of the distribution:

# use geom_function to plot dunif (the uniform PDF function)
ggplot() + geom_function(fun = dunif, xlim = c(-.5, 1.5), n = 10001) +
  labs(title = "Uniform distribution between 0 and 1",
       x = "x", y = "probability density")

One can also show percentiles are always uniformly distributed, so this is actually a very useful distribution (more on this in STAT 340).

10.4 Expectation & Variance

10.4.1 Expected value

Expected value refers to the average value of some random variable (or function thereof). For a discrete random variable \(X\), the expected value of \(X\)—also called \(\e(X)\), \(\mu\), or simply the mean of \(X\)—is defined:

\[\mu=\e(X)=\sum_k\,k\cdot \p(X\! =\!k)\]

The summation is performed over all possible outcomes \(k\) in the sample space. In plain words, the mean is the sum of the product of each outcome with its probability (i.e. a weighted sum using the probabilities as weights). The formula for a continuous variable is similar but with an integral instead of a summation.

The expected value of a function of a random variable \(f(X)\) can also be defined, which represents the average value of \(f(X)\) :

\[\e\big(f(X)\big)=\sum_k\,f(k)\cdot \p(X\! =\!k)\]

Again, as a simple example, let \(X\) be the result of rolling a single fair, 6-sided die. Find \(\e(X)\).

If the die is fair, then \(\p(X\! =\!k)=1/6\) for all \(k=1,2,\ldots,6\). Then, we get

\[\e(X)=\sum_{k=1}^6\,k\cdot(1/6)=\frac16(1+2+\cdots+6)=\frac{21}6=3.5\]

This means the average value of a die roll is 3.5. Note that the average does NOT need to be a possible observation! Even though it’s impossible to roll a 3.5, the average of each roll over many rolls is in fact 3.5.
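We can also see this empirically with a quick simulation sketch (not required material): roll a virtual die many times with sample() and average the results.

# simulate 100,000 fair die rolls; the average should be close to 3.5
rolls <- sample(1:6, size = 1e5, replace = TRUE)
mean(rolls)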

Now consider \(f(x)=x^2\). What is \(\e\big(f(X)\big)\), i.e. the average value of \(X^2\)?

\[\e(X^2)=\sum_{k=1}^6\,k^2\cdot(1/6)=\frac16(1+4+\cdots+36)=\frac{91}6\approx15.17\]

Thus, the average of the square of a 6-sided die is about 15.17.

Let’s reconsider the casino game example above, where \(X\) is the number of dollars won from a single play. Find the expected value of \(X\).

Recall the outcomes \(0,1,2,3\) have corresponding probabilities \(0.4,0.3,0.2,0.1\). Using the formula, we get

\[\e(X)=0\!\cdot\!(0.4)+1\!\cdot\!(0.3)+2\!\cdot\!(0.2)+3\!\cdot\!(0.1)=0.3+0.4+0.3=1\]

Thus, each play on average wins $1. This means if a casino wants to not lose money, they need to charge at least $1 per play.
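The same weighted sum is easy to compute in R; here's a quick sketch that mirrors the formula:

# E(X) = sum over outcomes of (outcome * probability)
outcomes <- 0:3
probs <- c(0.4, 0.3, 0.2, 0.1)
sum(outcomes * probs)
[1] 1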

10.4.2 Variance

The variance of a random variable—also called \(\var(X)\) or \(\sigma^2\)—can be thought of as the average squared distance from the mean and is defined as:

\[\sigma^2=\var(X)=\e\big((X-\mu)^2\big)=\sum_k\,(k\!-\!\mu)^2\cdot \p(X\! =\!k)\]

where \(\mu\) represents \(\e(X)\). The variance is used as a measure of spread. The higher the variance, the more “spread out” a RV is.

The standard deviation of \(X\), often denoted \(\sigma\), is then defined as the square root of the variance. It is often more convenient to work with than the variance since it has the same units as the data. It can be thought of as the average distance of an observation from the mean.

Consider again the previous single, fair 6-sided die example. Find the variance and standard deviation of \(X\).

We found earlier that \(\mu=\e(X)=3.5\). Applying the variance formula gives:

\[\var(X)=\sum_{k=1}^6(k\!-\!3.5)^2\cdot(1/6)=\frac16\big((1\!-\!3.5)^2+\cdots+(6\!-\!3.5)^2\big)=\frac{35}{12}\approx2.92\]

So, we found the variance \(\sigma^2\approx2.92\), and thus the standard deviation \(\sigma\approx1.71\).
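Here's the same computation as a quick R sketch, mirroring the variance formula:

# Var(X) and SD(X) for a single fair 6-sided die
k <- 1:6
mu <- sum(k * (1/6))              # E(X) = 3.5
sum((k - mu)^2 * (1/6))           # variance = 35/12
[1] 2.916667
sqrt(sum((k - mu)^2 * (1/6)))     # standard deviation
[1] 1.707825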

Let’s consider the casino game example a final time. Find the variance and standard deviation of \(X\).

We found earlier \(\mu=\e(X)=1\). Applying the variance formula gives:

\[\var(X)=(0\!-\!1)^2\!\cdot\!(0.4)+(1\!-\!1)^2\!\cdot\!(0.3)+(2\!-\!1)^2\!\cdot\!(0.2)+(3\!-\!1)^2\!\cdot\!(0.1)=1\]

Thus, the variance is also \(\sigma^2=1\), and so is the standard deviation \(\sigma=1\).
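And once more as a quick R check (a sketch using the same outcomes and probabilities as before):

# Var(X) for the casino game, with mu = 1
sum((0:3 - 1)^2 * c(0.4, 0.3, 0.2, 0.1))
[1] 1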

10.5 Binomial distribution

The binomial distribution is perhaps the most important example of a discrete random variable. It models the sum of a sequence of independent, repeated, binary trials, e.g. the number of heads in a series of coin flips, or the number of patients who respond to an experimental treatment.

10.5.1 Definition

Suppose you have \(n\) independent trials, each of which has \(p\) probability of producing \(1\) (called a “success”), and \(1-p\) probability of producing \(0\) (called a “failure”). Then, the sum of these trials \(X\) follows a binomial distribution with parameters \(n,p\) or in other words \(X\sim\bin(n,p)\).
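To make the definition concrete, here's a small simulation sketch (using rbinom(), which is covered in more detail later in this section): summing \(n\) independent 0/1 trials produces a single draw from the binomial.

# one Bin(10, 0.5) draw, built "by hand" as the sum of 10 independent 0/1 trials
trials <- rbinom(10, size = 1, prob = 0.5)   # ten 0/1 outcomes (fair coin flips)
sum(trials)                                  # equivalent to a single rbinom(1, 10, 0.5) draw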

Suppose we flip a fair coin 10 times and let \(X=\) total number of heads. Then we can see \(X\sim\bin(10,0.5)\). Below is a plot of this distribution:

# use dbinom (the binomial PMF function) to plot distribution
tibble(k = 0:10, p = dbinom(k, 10, 0.5)) %>% 
ggplot(aes(x = k, y = p)) + geom_col() +
  scale_x_continuous(breaks = seq(0,10,1), expand = 0) +
  scale_y_continuous(breaks = seq(0, 0.25, 0.05), limits = c(0, 0.25),
                     minor_breaks = seq(0, 0.25, 0.01), expand = 0) +
  labs(title = "Binomial(10, 0.5) distribution (number of heads in 10 flips of a fair coin)",
       x = "k", y = "probability")

The following assumptions in the acronym BINS must hold for a binomial distribution to apply:

  • Binary outcomes: each individual trial must produce a 1 or 0
  • Independence: each individual trial must be independent from others (i.e. neither affects nor affected by other trials)
  • N fixed: the number of trials \(n\) must be a fixed, predetermined quantity, NOT a variable/condition
  • Same probability: each trial must have the same probability \(p\) of observing a 1

If any of these criteria are not met, the resultant variable will NOT follow a binomial distribution.

10.5.2 PMF

First, what is the formula for the PMF of \(X\sim\bin(n,p)\)? It can be shown to be the following:

\[\p(X\!=\!k)={n\choose k}p^k(1\!-\!p)^{n-k}\quad\text{for outcomes $k=0,1,\ldots,n$}\]

\[\small\text{where}~~{n\choose k}=\dfrac{n!}{k!(n\!-\!k)!}~~~\text{and}~~n!=n\!\times\!(n\!-\!1)\!\times\cdots\!\times\!1\]

  • \(n!\) is called “\(n\) factorial” and counts the number of permutations (i.e. ways of ordering) of \(n\) distinct objects
    • Note: \(0!=1\) by definition
  • \(n\choose k\) is called “\(n\) choose \(k\)” and counts the number of ways of choosing \(k\) objects from \(n\) distinct objects (order of choice doesn’t matter)

For example, if you have 5 objects, there are 5×4×3×2×1=120 ways to order them. If you want to choose 2 from the 5, there are 5 choose 2 or 10 different combinations possible.

# using R, we can find 5!
factorial(5)
[1] 120
# as well as 5 choose 2
choose(5, 2)
[1] 10

This should help you understand the PMF shown above. The \(n\choose k\) counts the number of ways those \(k\) successes could be arranged among the \(n\) total trials, whereas \(p^k(1\!-\!p)^{n-k}\) is the probability of any one particular sequence of exactly \(k\) successes and \(n-k\) failures.

Let \(X\sim\bin(10,0.5)\) again. What is \(\p(X=4)\), i.e. the probability of observing exactly 4 total heads?

\[P(X=4)={10\choose4}0.5^4(1\!-\!0.5)^{10-4}=\frac{10\cdot9\cdot8\cdot7}{4\cdot3\cdot2}\cdot\frac1{2^{10}}=\frac{210}{1024}\approx20.5\%\]

We can also use R to help us compute this:

# apply the formula
choose(10, 4) * 0.5^4 * (1-0.5)^6
[1] 0.2050781

This is consistent with the plot of the PMF we previously made. We could also ask what’s the probability of getting 3 or fewer heads total, i.e. \(\p(X\le3)\).

\[ \begin{aligned} P(X\le3)&=P(X\!=\!0)+P(X\!=\!1)+P(X\!=\!2)+P(X\!=\!3)\\ &={\small{10\choose0}0.5^0(1\!-\!0.5)^{10-0}+\cdots+{10\choose3}0.5^3(1\!-\!0.5)^{10-3}}\\ &=\frac{176}{1024}\\ &\approx17.2\% \end{aligned} \]

# compute P(X≤3) with vectorization
sum(
  choose(10, 0:3) * 0.5^(0:3) * (1-0.5)^(10-0:3)
)
[1] 0.171875

10.5.3 E & Var

Recall the general expectation and variance definitions from before:

\[ \begin{aligned} \e(X)&=\sum_k\,k\cdot \p(X\! =\!k)\\ \var(X)&=\sum_k\,(k\!-\!\mu)^2\cdot \p(X\! =\!k) \end{aligned} \]

You can plug in the binomial PMF for \(\p(X\! =\!k)\) and derive the following identities for the binomial distribution:

\[ \text{for binomial,}\quad \begin{aligned} \e(X)&=np\\ \var(X)&=np(1\!-\!p)\\ \sd(X)&=\sqrt{\var(X)}=\sqrt{np(1\!-\!p)} \end{aligned} \]

Consider \(X\sim\bin(10,0.5)\) again. Compute the expectation & variance using their definitions and check they agree with the identities given above.

# compute expectation using definition
sum(0:10 * dbinom(0:10, 10, 0.5))
[1] 5
# compare with identity E=np
10 * 0.5
[1] 5
# compute variance using definition
sum((0:10 - 5)^2 * dbinom(0:10, 10, 0.5))
[1] 2.5
# compare with identity Var=np(1-p)
10 * 0.5 * (1-0.5)
[1] 2.5

10.5.4 *binom() functions:

There are 4 main R functions for working with binomial distributions, dbinom(), pbinom(), qbinom(), and rbinom(). We will go over each below.

10.5.5 dbinom() (PMF)

dbinom() simply implements the PMF equation shown in section 10.5.2. In order to compute \(\p(X=k)\) for \(X\sim\bin(n,p)\), simply do dbinom(k, n, p).

# check the computation earlier for P(X=4) where X~Bin(10,0.5)
dbinom(4, 10, 0.5)
[1] 0.2050781
# show the probability of every possible outcome
# (this is how the PMF was plotted earlier)
dbinom(0:10, 10, 0.5) %>% round(3)
 [1] 0.001 0.010 0.044 0.117 0.205 0.246 0.205 0.117 0.044 0.010 0.001

Here’s the plot of the PMF again for \(n=10\), \(p=0.5\).

# use dbinom (the binomial PMF function) to plot distribution
tibble(k = 0:10, p = dbinom(k, 10, 0.5)) %>% 
ggplot(aes(x = k, y = p)) + geom_col() +
  scale_x_continuous(breaks = seq(0, 10, 1), expand = 0) +
  scale_y_continuous(breaks = seq(0, 0.25, 0.05), limits = c(0, 0.25),
                     minor_breaks = seq(0, 0.25, 0.01), expand = 0) +
  labs(title = "Binomial(10, 0.5) PMF", x = "k", y = "probability")

10.5.6 pbinom() (CDF)

pbinom() is what’s called the binomial’s cumulative distribution function or CDF, which finds the probability of getting less than or equal to a given input. To find \(\p(X\le k)\) for \(X\sim\bin(n,p)\), you can do pbinom(k, n, p).

# check the computation earlier for P(X≤3) where X~Bin(10,0.5)
pbinom(3, 10, 0.5)
[1] 0.171875
# another way
sum(dbinom(0:3, 10, 0.5))
[1] 0.171875

Note that 1-pbinom(k, n, p) can also give you \(\p(X>k)\).

# example: find P(X>7) where X~Bin(10,0.5)
1 - pbinom(7, 10, 0.5)
[1] 0.0546875

Here’s a plot of the binomial CDF for \(n=10\), \(p=0.5\). Note the stepwise nature of the function, since this is a discrete distribution and cumulative probabilities increase in discrete “steps” at the possible values of \(k\).

Also note the CDF always goes from 0 on the left to 1 on the right, and can never decrease as you move from left to right; it can only increase or stay the same.37

ggplot() + geom_function(fun = \(x) pbinom(x, 10, 0.5), xlim = c(0,10), n = 1e4) +
  scale_x_continuous(breaks = seq(0, 10, 1), expand = 0) +
  scale_y_continuous(breaks = seq(0, 1, 0.1), expand = c(0, 0.005)) +
  labs(title = "Binomial(10, 0.5) CDF", x = "k", y = "probability")

10.5.7 qbinom() (inverse CDF)

qbinom() is the inverse function of the CDF, i.e. it does the opposite. For a given probability \(p\), it finds the smallest \(k\) such that \(\p(X\le k)\ge p\). It’s also called the quantile function since it is used to compute what observation \(k\) is at a given percentile \(p\).

# what observation is at the 17.1%-ile of X~Bin(10,0.5)?
qbinom(0.171, 10, 0.5)
[1] 3
# what if we "cross" the value of P(X≤3) and ask for the 17.2%-ile?
qbinom(0.172, 10, 0.5)
[1] 4

Here’s a plot of the binomial inverse CDF for \(n=10\), \(p=0.5\); note it’s the reflection of the CDF across the \(y=x\) diagonal line. For discrete distributions like the binomial, the inverse CDF is also a stepwise function.

ggplot() + geom_function(fun = \(x) qbinom(x, 10, 0.5), xlim = c(0,1), n = 1e4) +
  scale_x_continuous(breaks = seq(0, 1, 0.1), expand = c(0, 0.005)) +
  scale_y_continuous(breaks = seq(0, 10, 1), expand = 0) +
  labs(title = "Binomial(10, 0.5) inverse CDF", x = "probability", y = "k")

10.5.8 rbinom() (random generator)

rbinom() is the random generator function, which lets us simulate drawing random samples from a binomial distribution. This can be useful in simulation studies, empirical computations, etc.

# draw a sample of size 100 from Bin(10,0.5)
samp <- rbinom(100, 10, 0.5)
samp
  [1] 4 4 5 7 4 7 7 6 6 3 4 4 6 5 6 5 6 9 5 6 7 4 6 3 4 5 2 5 7 4 5 5 5 4 6 6 6 3 6 5
 [41] 6 6 6 5 5 6 2 5 6 6 5 7 5 4 3 3 4 5 6 5 7 4 5 4 6 4 5 6 3 7 4 7 4 4 5 7 7 5 6 8
 [81] 5 6 5 4 6 4 6 3 4 3 4 3 6 7 6 6 5 5 6 5
# what are the mean and SD in our sample?
# these should be close to the theoretical E & SD
mean(samp)
[1] 5.14
sd(samp)
[1] 1.325888
# a histogram of our sample
ggplot(tibble(samp), aes(x = samp)) +
  geom_histogram(binwidth = 1, center = 0, color = "white") +
  scale_x_continuous(breaks = seq(0, 10, 1), expand = 0) +
  scale_y_continuous(breaks = seq(0, 30, 5), minor_breaks = 0:30, expand = 0) +
  labs(title = "Histogram of 100 observations sampled from Bin(10,0.5)", x = "k")

10.6 Normal distribution

The normal distribution is definitely the most important of the continuous distributions, and perhaps the most important in all of statistics. It’s named “normal” because it is ubiquitous throughout nature, showing up everywhere we look; it’s the distribution we “normally” encounter.

The specific circumstances which give rise to a normal distribution have to do with something called the Central Limit Theorem which is outside the scope of 240, but as an extremely oversimplified summary, normal distributions often appear when an observation can be viewed as the sum or average of many smaller, independent processes.

10.6.1 Definition & PDF

The normal distribution is parametrized (i.e. specified) by two parameters: the mean \(\mu\) and the standard deviation \(\sigma\), which are so named because they are the mean and SD of the population. If \(X\) follows such a distribution, we also write \(X\sim\n(\mu,\sigma)\).

Given these parameters, the PDF of the normal is given by the following equation:

\[f(x) = \frac1{\sigma\sqrt{2\pi}}\,e^{-\frac12\left(\frac{x-\mu}\sigma\right)^2}\]

This expression produces the familiar “bell curve” shape we all know and love. For example, IQ scores are standardized to have mean 100 and SD 15 in the general population. Below is the corresponding normal curve for this distribution:

SDs <- seq(55, 145, 15)
ggplot() + geom_segment(aes(x = SDs, xend = SDs, y = 0, yend = dnorm(SDs, 100, 15)),
                        color = rep(c("red", "blue", "red"), times = c(3, 1, 3))) + 
  geom_function(fun = \(x) dnorm(x, 100, 15), xlim = c(50, 150)) +
  scale_x_continuous(breaks = seq(55, 145, 15), expand = 0, minor_breaks = NULL) +
  scale_y_continuous(expand = c(0.004, 0)) +
  labs(title = "N(100, 15) distribution (standardized IQ scores)",
       subtitle = "(mean shown in blue, SDs shown in red)",
       x = "score", y = "probability density")

Recall from earlier that since this is a continuous random variable, the PDF does NOT plot the probability, but rather the probability density! Probabilities of events should instead always be interpreted as corresponding areas under the curve!

10.6.2 Empirical rule

For normal distributions, it can be shown that:

  • About 68% of the values will be between ±1 SD of the mean,
  • About 95% of the values will be between ±2 SDs of the mean,
  • About 99.7% of the values will be between ±3 SDs of the mean.

This is called the empirical rule of the normal and is useful for roughly approximating what you’d observe based on the parameters.

Using the \(X\sim\n(100,15)\) model of IQ scores, approximately what percent of people would you expect to score in the following ranges?

  1. 85 to 115
  2. 70 to 130
  3. 85 to 130
  4. Less than 70
  5. Greater than 145

Using the empirical rule, we know 68% would fall between 85 to 115 (±1 SD), and 95% would fall between 70 to 130 (±2 SDs).

Using symmetry, we know from 85-100 is 68%/2 or 34%, and 100-130 is 95%/2 or 47.5%. Thus, 85-130 covers about 81.5% of the distribution.

We know 70-130 covers about 95% from the empirical rule, which means <70 or >130 must be 5%. By symmetry, <70 must be half that or 2.5%.

Again using the empirical rule, we know 55 to 145 covers about 99.7%, which means <55 or >145 makes up about 0.3%. By symmetry again, we know that >145 must be around 0.15%.

10.6.3 E & Var

As you’d expect from the definition, the expectation and variance are equal to \(\mu\) and \(\sigma^2\).

\[ \text{for normal,}\quad \begin{aligned} \e(X)&=\mu\\ \var(X)&=\sigma^2\\ \sd(X)&=\sqrt{\var(X)}=\sigma \end{aligned} \]
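Just as we checked the binomial identities numerically, here's a sketch that checks these identities for \(\n(100,15)\) using numeric integration over a wide range (again, you will not be asked to evaluate integrals in this course):

# E(X) = integral of x * density; should be mu = 100
round(integrate(\(x) x * dnorm(x, 100, 15), -50, 250)$value, 4)
[1] 100
# Var(X) = integral of (x - mu)^2 * density; should be sigma^2 = 225
round(integrate(\(x) (x - 100)^2 * dnorm(x, 100, 15), -50, 250)$value, 4)
[1] 225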

10.6.4 *norm() functions:

There are also 4 main R functions for working with normal distributions, dnorm(), pnorm(), qnorm(), and rnorm(). They are each the continuous analogs of the *binom() functions.

10.6.5 dnorm() (PDF)

dnorm() is the normal PDF (probability density function), which was plotted just above. Remember the PDF does NOT represent the probability of a certain outcome, but rather the probability density, which is akin to the rate of change of probability around that point. For us, the main use case of dnorm() is to show a plot of the PDF.

# what's the density at the mean of N(100,15)?
dnorm(100, 100, 15)
[1] 0.02659615
# replot the N(100,15) PDF
ggplot() + geom_function(fun = \(x) dnorm(x, 100, 15), xlim = c(50, 150)) +
    scale_x_continuous(breaks = seq(55, 145, 15), minor_breaks = NULL) +
  labs(title = "N(100, 15) PDF", x = "x", y = "probability density")

10.6.6 pnorm() (CDF)

pnorm() is the normal CDF (cumulative distribution function), which finds the probability of getting less than or equal to a given input. This function is extremely useful for computing probabilities of ranges under the normal curve, e.g. to compute \(\p(X\le k)\) when \(X\sim\n(\mu,\sigma)\), simply do pnorm(k, µ, σ).

Note that to find the probability \(\p(a<X\le b)\), we can do pnorm(b, µ, σ) - pnorm(a, µ, σ). Together with remembering the area under the entire curve is 1 just like any other RV, we can check the probabilities of each range in the previous example:

c(  # find the probability an observation from X~N(100,15) falls in each range:
  pnorm(115, 100, 15) - pnorm(85, 100, 15),   # between 85-115
  pnorm(130, 100, 15) - pnorm(70, 100, 15),   # between 70-130
  pnorm(130, 100, 15) - pnorm(85, 100, 15),   # between 85-130
  pnorm(70, 100, 15),                         # less than 70
  1 - pnorm(145, 100, 15)                     # greater than 145
) %>% round(4)
[1] 0.6827 0.9545 0.8186 0.0228 0.0013

Due to the way probabilities work for continuous variables (they’re always areas of ranges under the curve), you can show that observing any single number38 has probability 0. More specifically, for any number \(a\), \(\p(X=a)=0\).

This may seem strange, but if you try to imagine the area the single number \(a\) “bounds” under the curve, it makes a line, which has no area. This does not mean \(a\) is impossible to observe as an outcome39, we just can’t meaningfully assign a probability to it due to the way probabilities have to be defined for continuous RVs.

A consequence of this is that for continuous variables, \(<\) and \(\le\) are interchangeable, as well as \(>\) and \(\ge\). In other words, you don’t need to worry about counting or not counting “boundaries” of ranges when doing math with probabilities over different intervals.
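One way to build intuition for why single points have probability 0 (a small sketch): shrink an interval around a single value and watch the probability shrink toward 0 with it.

# P(X within a narrowing window around 100) for X ~ N(100,15)
widths <- c(10, 1, 0.1, 0.01)
pnorm(100 + widths/2, 100, 15) - pnorm(100 - widths/2, 100, 15)
# these probabilities get smaller and smaller as the window narrows toward a single point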

Below is a plot of the normal CDF for \(\mu=100\), \(\sigma=15\). Again note as you move from left to right, the CDF goes from 0 to 1 and never decreases.

ggplot() + geom_function(fun = \(x) pnorm(x, 100, 15), xlim = c(55, 145), n = 1e4) + 
  scale_x_continuous(breaks = seq(55, 145, 15), minor_breaks = NULL, expand = 0) + 
  scale_y_continuous(breaks = seq(0, 1, 0.2), expand = c(0, 0.002)) + 
  labs(title = "N(100, 15) CDF", x = "x", y = "probability")

10.6.7 qnorm() (inverse CDF)

qnorm() is the inverse CDF or the quantile function for the normal. Since the normal distribution is continuous (unlike the binomial), the inverse CDF here is a little simpler: for a given probability \(p\), it finds the \(k\) such that \(\p(X\le k)=p\).

# what observation is at the 90%-ile of X~N(100,15)?
qnorm(0.9, 100, 15)
[1] 119.2233
# we can also check that P(X≤119.2233) = 90% (up to rounding)
pnorm(119.2233, 100, 15)
[1] 0.9000003

Below is a plot of the normal inverse CDF for \(\mu=100\), \(\sigma=15\); again, note it’s the reflection of the CDF across the \(y=x\) diagonal line.

ggplot() + geom_function(fun = \(x) qnorm(x, 100, 15), xlim = c(0, 1), n = 1e4) + 
  scale_x_continuous(breaks = seq(0, 1, 0.2), expand = c(0, 0.002)) + 
  scale_y_continuous(breaks = seq(55, 145, 15), minor_breaks = NULL, expand = 0) + 
  labs(title = "N(100, 15) inverse CDF", x = "probability", y = "x")

10.6.8 rnorm() (random generator)

rnorm() is the normal random generator function, which simulates drawing random samples from a normal distribution.

# draw a sample of size 80 from N(100,15)
samp <- rnorm(80, 100, 15)
samp
 [1] 105.97159  90.81960 105.11680  83.05955 121.49536 129.70600  94.49168  84.33798
 [9] 108.54579  97.97418 136.02427  99.41140 110.34609 100.42003  88.85090 102.83188
[17]  72.92562 121.98332 102.29880 132.58918 107.13264  89.35080 109.16090  85.98854
[25]  81.19550 104.37169  93.35062 100.01658 101.11512  91.15719  91.46997  97.97232
[33] 117.67130  77.14650 108.90919 104.99426 115.94650  95.43724 105.55028 104.00648
[41]  91.86220 118.11802 117.40604 110.50320 123.80250 108.37730  80.85112  91.40102
[49]  81.63081  92.89899  90.69450 100.63174  86.33618 102.37043  90.18123 126.50931
[57] 110.75061 113.65261 105.76278 125.23264  90.46395  93.07533 121.48423  90.23955
[65]  96.88929  94.10788  95.20011  95.81330 107.41282  97.34004  92.41064 120.14558
[73]  96.78131  97.30665  98.49714 110.68999  98.89653  99.43549  89.77509  95.13595
# again, we can check our sample mean/sd are close to theory
mean(samp)
[1] 101.5152
sd(samp)
[1] 13.28363
# a histogram of our sample
ggplot(tibble(samp), aes(x = samp)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "white") +
  scale_x_continuous(expand = 0) + scale_y_continuous(expand = 0) +
  labs(title = "Histogram of 80 observations sampled from N(100,15)", x = "x")

10.7 Standard Normal \(\n(0,1)\)

An important member of the normal family is the standard normal, which is usually denoted \(Z\) and has mean \(\mu=0\) and SD \(\sigma=1\). In other words, \(Z\sim\n(0,1)\).

The standard normal is important because it serves as a reference distribution to which all other normal distributions can be compared. This is because every normal distribution turns out to be similar to every other, i.e. they all have the same shape, up to axes scaling. For example, in the plot below, the 3 distributions appear to have different shapes:

normals_plot <- ggplot() + geom_function(fun = dnorm, n = 1e3, color = "red3") + 
  geom_function(fun = \(x) dnorm(x, -3, 0.5), n = 1e3, color = "seagreen") + 
  geom_function(fun = \(x) dnorm(x, 5, 2), n = 1e3, color = "navy") + 
  scale_x_continuous(breaks = -5:10, limits = c(-5, 10), expand = 0) + 
  scale_y_continuous(expand = 0) + 
  labs(title = "Plot of 3 normals: N(0,1), N(-3,0.5), N(5,2)",
       y = "probability density")
normals_plot

However, if you take the exact same plot and simply zoom in to each curve, you can see in fact they are all the exact same shape, up to axes scaling.

p1 <- normals_plot + xlim(-4.5, -1.5) + ylim(0, dnorm(-3, -3, 0.5)) + 
  labs(title = "Same plot, zoomed to N(-3,0.5)", y = NULL, x = NULL)
p2 <- normals_plot + xlim(-3, 3) + ylim(0, dnorm(0, 0, 1)) + 
  labs(title = "Same plot, zoomed to N(0,1)", y = NULL, x = NULL)
p3 <- normals_plot + xlim(-1, 11) + ylim(0, dnorm(5, 5, 2)) + 
  labs(title = "Same plot, zoomed to N(5,2)", y = NULL, x = NULL)
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

A consequence of this is that any problem involving an arbitrary normal distribution \(X\sim\n(\mu,\sigma)\) can be turned into an equivalent problem for the standard normal \(Z\sim\n(0,1)\) by simply applying a transformation that converts \(X\) into \(Z\). This is called standardization, or computing a Z-score.

10.7.1 Z-score

Suppose you draw an observation \(x\) from a normal population with mean \(\mu\) and SD \(\sigma\). The Z-score of \(x\) is defined as:

\[z=\frac{x-\mu}\sigma\]

Then, solving the same problem but using \(z\) and the standard normal is equivalent to solving the original problem.

Continuing with the IQ example, suppose you’re asked to find the percent of the population with IQ score less than 115. In other words, you want to find \(\p(X<115)\) where \(X\sim\n(100,15)\). We can do the following:

# use pnorm() to get area to the left of 115
pnorm(115, 100, 15)
[1] 0.8413447
# showing this area on a quick plot
ggplot() + geom_function(fun = \(x) dnorm(x, 100, 15), xlim = c(55, 145)) + 
  stat_function(fun = \(x) dnorm(x, 100, 15), xlim = c(55, 115),
                geom = "area", fill = "red4") + 
  labs(title = "Plot showing P(X<115) where X~N(100,15)")

Using the Z-score approach, we have

\[z=\frac{x-\mu}\sigma=\frac{115-100}{15}=1\]

Thus, this problem is equivalent to asking what’s \(\p(Z<1)\) where again \(Z\sim\n(0,1)\) is the standard normal:

# pnorm() defaults to mean 0 SD 1, i.e. standard normal, if not specified
pnorm(1)
[1] 0.8413447
# showing the same equivalent area on a plot
ggplot() + geom_function(fun = dnorm, xlim = c(-3, 3)) + 
  stat_function(fun = dnorm, xlim = c(-3, 1), geom = "area", fill = "red4") + 
  labs(title = "Plot showing P(Z<1) where Z~N(0,1)")

Note how these two areas exactly correspond, so solving one solves the other.

Since we have R, we won’t need to rely on Z-scores as much; however, it’s an important concept for normal curves, and it also shows up later when discussing inference, e.g. test statistics in hypothesis testing.