Peter Zhang

Stat 134 MT


I took these notes for Stat 134: Concepts of Probability taught by Professor Ibser. It is based on Probability by Jim Pitman.



The set of possible outcomes is denoted by \(\Omega\), while an event \(A\) is a subset of \(\Omega\). Letting \(\#(A)\) denote the number of outcomes of \(A\), the probability of \(A\) under equally likely outcomes is \(P(A) = \frac{\#(A)}{\#{\Omega}}\). Odds for \(A\) are the ratio of \(\#(A):#(\not A)\). Payoff odds express the ratio of your stake to a casino’s stake. In a fair bet, the payoff odds are equal to the chance odds. Assume payoff odds of \(r_{pay}\) to \(1\), the fair price would be \(P(A)(r_{pay} + 1)\).

A relative frequency is how often something occurs in a sequence of observation. According the empirical law of averages, the relative frequency tends to stabilize with more trials. In the real world, we should assume that objects are distinguishable. Another interpretation of probability is that of an opinion - it’s often reasonable to think in such terms for one-off events.

An event \(B\) is partitioned into \(n\) events if \(B = B_1 \cup B_2 \cup ... \cup B_n\) and the events are mutually exclusive. Probabilities of events are always non-negative. A distribution over \(\Omega\) is a probability function of subsets where \(P(\Omega) = 1\). The analogous set notation of events is below.


  • Complement rule: \(P(A) + P(A^c) = 1\)
  • Difference rule: If \(A \subseteq B\), then \(P(B not A) = P(B) - P(A)\).
  • Inclusion-exclusion: \(P(A \cup B) = P(A) + P(B) - P(AB)\).


A Bernoulli distribution has outcomes \(0\) and \(1\) with probabilities \(1-p\) and \(p\), respectively. A uniform distribution is defined by equally likely outcomes and can be made over a finite set, continuum, area of a plane, etc. An empirical distribution on \((-\infty, \infty)\) is defined by \(P_n(a, b) = \#{i : 1 \leq i \leq n, a < x_i < b}/n\) where \(P_n(a, b)\) is the proportion of numbers that lie on the interval \((a, b)\); it can be displayed via a histogram.

Conditional probability of \(A\) given \(B\) is \(P(A \| B) = \frac{P(AB)}{P(B)}\). The multiplication rule states that \(P(AB) = P(A \| B)P(B)\). For any events \(A\) and \(B\), \(P(A) = P(A \| B)P(B) + P(A \| B^c)P(B^c)\); it generalizes to any number of partitions. For independent events where \(P(A \| B) = P(A)\), the multiplication rule simplifies to \(P(AB) = P(A)P(B)\).

Baye’s Rule

Behold, Bayes’ Rule for a partition \(B_1, B_2, ..., B_n\) of outcomes: \(P(B_i | A) = \frac{P(A | B_i)P(B_i)}{P(A|B_1)P(B_1) + ... + P(A|B_n)P(B_n)}\) The multiplication rule for n events is: \(P(A_1, A_2, ...,A_n) = P(A_1)P(A_2 | A_1) P(A_3 | A_1A_2) ... P(A_n|A_1A_2...A_{n-1})\) Consider a sequence of trials where each trials is equally likely to result in one of \(N\) outcomes. The probability that the first event repeat appears by trial number \(n\) is: \(P(B_n) = 1 - (1 - \frac{1}{N}) (1- \frac{2}{N}) ... (1-\frac{(n-1)}{N}) \approx 1 - e^{-\frac{n^2}{2N}}\) A set of events \(A, B, C\) are independent if \(P(B|A) = P(B|A^c) = P(B)\) and \(P(C|AB) = P(C|A^cB) = P(C|AB^c) = P(C | A^cB^c) = P(C)\). The multiplication rule for independent events is \(P(ABC) = P(A)P(B)P(C)\). Pairwise independence is not as strong as independence. Consider a set of coin tosses with events \(H_1\), \(H_2\), and \(S\), where \(S\) is the event that both heads land the same way. Each pair is independent, but the three events are not independent.



The probability of \(k\) success in \(n\) trials is \({n \choose k} p^k q^{n-k}\) where \(n \choose k = \frac{n(n-1)...(n-k+1)}{k(k-1)...1} = \frac{n!}{k!(n-k)!}\). The odds of $k$ consecutive successes relative to $k-1$ successes are: \(R(k) = [\frac{n - k + 1}{k}] \frac{p}{q}\)

The mode is \(m = \text{floor} (np + p)\), except when \(p= 1/2\) and \(n\) odd, where \(m\) and \(m-1\) are equally likely. The mean of the binomial distribution is \(\mu = np\).


The normal distribution with mean \(\mu\) and standard deviation \(\sigma\) has a normal curve: \(y = \frac{1}{\sqrt{2\pi} \sigma} e^{-1/2 (x-\mu)^2/\sigma^2}\) We can late \(z = (x-\mu)/\sigma\), denoting standard units from the mean. We can fit a normal curve to the binomial distribution as follows: \(P(a \text{ to } b \text{ successes}) \approx \mathcal{N}(\frac{b + 0.5 - \mu}{\sigma}) - \mathcal{N}(\frac{a - 0.5 - \mu}{\sigma})\) Here, \(\mu = np\) is the mean and \(\sigma = \sqrt{npq}\) is the standard deviation.

The square root law states that the number of successes will lie in a small interval of numbers centered on \(np\) with width a multiple of \(\sqrt{n}\). The proportion of successes will lie in a small interval centered on \(p\) with width a multiple of \(1/sqrt{n}\). The law of large numbers states that the proportion of successes will converge to \(p\) as the number of trial goes to infinity.

As an approximation, the normal distribution is better for \(p\) closer to 1/2 and greater \(\sigma\). The worst error in the normal approximation is less than 0.01 for \(n \geq 10\). The skew-normal approximation includes the curve \(\mathcal{N}'''(z) = (3z - z^3)\) and can be modeled as \(y = \mathcal{N} - \frac{1}{6} \text{Skewness} (n,p) \mathcal{N}'''(z) \text{ where } \text{Skewness}(n, p) = (1-2p)/\sqrt{npq} = (1-2p)/\sigma\)

Poisson Approximation

Binomial distributions with \(n\) draws and \(1/n\) probability of success cannot be accurately estimated with normal curves. Instead, as \(n \go \infty\), the shape of the distribution approaches the Poisson distribution with parameter \(\mu\). The chance of zero successes in \(n\) trials with probability \(p\) can be approximated exponentially as \(P(0) = e^{-\mu} \text{ for } p\approx 0\) The probability of \(k\) successes in \(n\) trials is \(R(k) = \frac{P(k)}{P(k-1)} \approx \frac{\mu}{k}\) More generally, according to Poisson approximation: \(P(k\text{ successes}) = P_\mu(k) \approx e^{-\mu} \frac{u^k}{k!}\)

Random Sample

Suppose we want to estimate the proportion of “good” elements from a sample. Consider sampling \(n\) draws from a population of size \(N\) so that all \(N^n\) sequences are equally likely. There is a probability \(p = G/N\) of success in each trial, which is well approximated by the normal curve. For n sufficiently large, the number of good elements will be in \(np \pm 2\sqrt{npq}\)

When sampling without replacement, there are \({N \choose n}\) possible sequences. The chance of getting \(g\) good and \(b\) bad elements is: \({G \choose g} {B \choose b}/{N \choose n}\)

The hypergeometric distribution is the distribution of good elements in a sample of size \(n\) without replacement from a population of \(G\) good and \(N-G\) bad elements. For large enough \(N\), \(G\), \(B\), the hypergeometric distribution can be approximated by the binomial distribution.

Random Variables


A random variable, denoted with a capital letter, represents something that can take on a set of values. The range of a random variable \(X\) is the set of all possible values \(X\) can produce. Although random variables usually take on numbers, we could also have random pairs, random sequences, random permutations, or even random suits.

A statement about a random variable (e.g. \(X \geq 3\)) defines an event. If a statement is made about a RV \(X\), then the event (say, \(B\)) is determined by \(X\). The distribution of \(X\) is determined by the probabilities of the individual values, while the probability of event \(B\) is determined via the addition rule: \(P(X = x) \ \ x \in \text{range of } X\\ P(X \in B) = \sum_{x \in B} P(X = x)\)

The symbol used to represent a generic possible value of a distribution is called a dummy variable and need not necessarily be denoted by \(x\). Take for example: \(P(X = v) = P(Y = v)\)

A random variable can be expressed a s a function of another random variable. If we say \(X = g(W)\), then \(g\) represents a function defined on the range of \(W\) with values in the range of \(X\). It is a deterministic rule which means that \(X\) can never be more detailed than \(W\). This means \(P(x = x) = P(g(W) = x) = \sum_{w : g(w) = x} P(W = w)\)

We can create new random variables using common functions, such as \(2X\) and \(X^2\).

The combined or joint outcome of a pair of random variables describes the distribution of a set of values \((X, Y)\). The event that \(((X, Y) = (x, y))\) is the intersection of the events \((X = x)\) and \((Y = y)\). The distribution is determined by: \(P(x, y) = P(x = x, Y = y) \\ P(x, y) \geq 0 \text{ and } \sum_{(x, y)} P(x, y) = 1\)

The events \((x, y)\) form a partition of \((X = x)\), so \(P(X = x) = \sum_{y \in Y} P(X, y)\). While two random variables can have identical distributions, they may not be equal. Consider two draws without replacement, where \(P(X = Y) = 0\). Two variables are equal only if \(P(X = Y) = !\).

For a random variable \(Y\) and events \(A\) and \(B\), we express conditional probabilities as: \(P(Y \in B | A) = \frac{P[(Y \in B) \and A]}{P(A)}\)

For random variables \(X\) and \(Y\), $$P(Y = y X = x)\(defines a distribution over the range of\)Y\(given\)X = x\(. The joint distribution is fond using\)P(X = x, Y = y) = P(X = x) P(Y = y X = x)\(. They are independent if\)P(X = x, Y = y) = P(X = x)P(Y = x)\(for all\)x\(and\)y$$.

The joint distribution of several random variables \(X_1, X_2, ..., X_n\) is defined by the joint probabilities \(P(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n)\)

for all values \(x_i\) of each \(X_i\). The group of RVs are independent if \(P(x_1, x_2, ..., x_n) = P(X_1 = x_1) P(X_2 = x_2)...P(X_n = x_n)\)

Letting \(N_1$4 denote the number of results in category\)i\(with probability\)p_i\(, for every\)m\(-tuple of non-negative integers with sum\)n\(:\) P(N_1 = n_1, N_2 = n_2, …, N_m = n_m) = \frac{n!}{n_1! n_2! … n_m!} p_1^{n_1} p_2^{n_2} … p_m^{n_m} $$

Symmetry arguments can simplify calculations. A distribution is symmetric about 0 if \(P(X = -x) = P(X = x)\). Similarly, a distribution is symmetric about \(b\) if \(P(Y = b + x) = P(Y = b - x) \text{ for all } x\).


The mean of a probability distribution $P(x)$ is the average of all the values \(x\) weighted by their probabilities, i.e. \(\mu = \sum_{x} xP(x)\). It can be thought of as the center of gravity. The expectation of a random variable \(X\) is the mean of its distribution. The binomial distribution has expectation \(E(X) = np\).

An indicator \(I_A\) for an event \(A\) takes on the value \(1\) if \(A\) occurs amd \(0\) otherwise, so \(E(I_A) = P(A)\). Over the long run, the average converges to the expectation. Use indicators to simplify expressions. The sum of indicators is an indicator if the corresponding events are mutually exclusive. In general, Boole’s inequality dictates that \(P(X \geq 1) \leq E(X)\). It can generalized to Markov’s inequality, which states that for \(X \geq 0\), \(P(X \geq a) \leq \frac{E(X)}{a}\).

According to the addition rule, the expectation of two RVs equal to the sum of the two RVs expectations, i.e. \(E(X + Y) = E(X) + E(Y)\). For a set of events \(A_1, A_2, ..., A_n\), the expected number of events to occur is \(E(X) = P(A_1) + P(A_2) + ... + P(A_n)\). A RV with values \(\{0, 1, ..., n\}\) can be represented as counting the number of events so that \(E(X) = \sum_{j=1}^n P(X \geq j)\).

For a function of \(X\), we compute the expectation as \(E[g(X)] = \sum_x g(x) P(X = x)\). Moments refer to averages of powers of a distribution \(X\). If distributions \(X\) and \(Y\) are independent, then \(E(XY) = [E(X)][E(Y)]\).

A loss function \(L(x, b)\) models the loss of predicting \(b\) when the value is \(x\). Common functions are absolute error (where the solution is the median) and mean squared error (where the solution is the expectation).

SD and Approximation

The variance of a random variable \(X\) is the mean squared deviation from its expected value: \(Var(X) = E[(X - E[X])^2]\). The standard deviation of \(X\) is the square root of the variance: \(SD(X) = \sqrt{Var(X)}\). An equivalent but easier calculation for the variance is: \(Var(X) = \sum_x x^2 P(X = x) - [\sum_x xP(X = x)]^2\)

The variance of an indicator is \(\sqrt{p(1-p)}\). Variances scale, so \(SD(aX + b) = \|a\| SD(X)\). We standardized random variables as: \(X^* = (X - \mu)/\sigma\). Variances add if they are independent: \(Var(X+Y) = Var(X) + Var(Y)\).

Even for distributions that aren’t normal, most elements will be within a few standard deviations of \(E[X]\). Chebychev’s inequality states that for a RV \(X\) and \(k > 0\), \(P(|[X - E(X)]| \geq kSD(X) \leq \frac{1}{k^2}\)

The Central Limit Theorem states that the sum of independent variables is approximately normal, regardless of the underlying distribution.

The skewness of \(X\) is the third moment \(\text{Skewness}(X) = E(X^3_*) \\ \text{Skewness}(S_n) = \text{Skewness}(X) /\sqrt{n}\)

Discrete Distributions

The expectation of a discrete random variable \(X\) is \(E(X) = \sum_x xP(X = x)\) if it is absolutely convergent. For example, the number of times \(T\) that you need to make a draw before a success is \(P(T = i) = q^{i-1}p\).ww

Poisson Distribution

The Poisson distribution approximates the number \(N\) occurrences of \(n\) events when events have small probability \(p\) and are nearly independent. \(P(N = k) \approx e^{-\mu} \mu^k / k!\)

If \(N\) has Poisson distribution, then \(E(N) = \mu\) and \(SD(N) = \sqrt{\mu}}\). For \(\mu\) large enough that the standard deviation is small, the distribution becomes normal in shape. It has skewness \(1/sqrt{u}\). The skew-normal approximation is: \(P(N_\mu \leq b) \approx \mathcal{N}(z) - \frac{1}{6\sqrt{\mu}} (z^2 - 1) \mathcal{N}(z) \text{ where } z = (b+\frac{1}{2} - \mu)/ \sqrt{\mu}\)

For independent Poisson variables, \(N_1 + ... + N_j\) is a Poisson random variable with parameter \(\mu_1 + ... + \mu_j\).

A random scatter involves counting the number of points in a continuous region. We assume no multiple hits (that distinct hits defined distinct points) and randomness of hits on subsquares (that subsquares are hit with the same probabilities). There is a constant \(\lambda\) such that the number \(N(B)\) of hits in a subset \(B\) is a Poisson random variable with \(\mu = \lambda \times \text{area}(B)\) and that for disjoint subsets, the number of hits are mutually independent.


Two random variables RVs \((X, Y)\) with joint distribution \(P(x, y) = P(X = x, Y= y)\) are symmetric if \(P(x, y) = P(y, x) \forall (x, y)\). Analogously, for three RVs, the joint distribution is symmetric if \(P(x, y, z) = P(x, z, y) = ...\). In general, a function \(f(x_1, ..., x_n)\) is symmetric if the value is unchanged for all permutation of the same variables. Symmetry applies to sampling without replacement, since random permutation are exchangeable.

The hypergeometric distribution measures the number of good elements \(S_n\) is a sample of size \(n\) for a population of size \(N\) containing \(G\) good elements: \(P(S_n = g) = {G \choose g} {N - G \choose n - g} / {N \choose n}\)

The mean and standard deviation are \(E(S_n) = np\) and \(SD(S_n) = \sqrt{\frac{N - n}{N - 1}} \sqrt{npq}\). Here, the factor \(\sqrt{\frac{N-n}{N-1}}\) is the finite population correction factor.

Built with Jekyll on the Swiss theme.