Peter Zhang

CS 188 MT


These notes are for CS 188: Introduction to Artificial Intelligence taught by Mesut Yang and Carl Qi in Summer 2021. The notes up to the midterm are here. These notes cover the topics up to the final.

Bayes Nets


In an environment with uncertainty, we use observed variables (evidence) to build models that relate known variables to unknown variables. A random variable (RV) is an aspect of the world; the variable is denoted with a capital letter and the value is denoted with a lower-case letter. A probability distribution outlines the probability that a RV takes on a given value. Unobserved RVs have distributions; we can notate \(P(X = x) = p\). Distributions are denoted \(\forall x P(X = x) \geq 0\) and \(\sum_x P(X = x) = 1\).

A joint distribution specifies a real number for each assignment with the same constraints. For \(n\) variables with domain sizes \(d\), the size of the distribution will be \(d^n\). A probabilistic model is a joint distribution over a set of RVs with domains, outcomes (assignment), and normalized probability (sums to 1). An event is a set \(E\) of outcomes with \(P(E) = \sum_{(x_1, ... , x_n) \in E} P(x_1, ..., x_n)\). Marginal distributions are sub-tables that eliminate variables by summing them out. A conditional probability is defined as \(P(a \| b) = P(a, b) / P(b)\). Conditional distributions are probability distributions over some variables given fixed values of other variables.

Probabilistic inference involves computing a probability from known probabilities. In the general case, you have evidence variables \(E_1, ..., E_k = e_1, ..., e_k\), a query variable \(Q\), and hidden variables \(H_1, ..., H_r\). We want to evaluate $$P(Q e_1, …, e_k)$$.
The product rule dictates that $$P(y) P(x y) = P(x, y)\(. More generally, the chain rule dictates that any joint distribution is the product of conditional distributions. **Bayes rule** lets us build conditional probabilities as\)P(x | y) = \frac{P(y, x) P(x)}{P(y)}$$. We can compute a posterior distribution by updating with Bayes rule.


Models are always simplifications, but some models are useful. Bayes nets are a type of model that let us reason about unknown variables given evidence. Two variables are independent if \(\forall x, y : P(x, y) = P(x) P(y)\), denoted \(X \indep Y\). Independence is a simplifying modeling assumption, since it makes computing joints easier. A RV \(X\) is conditionally independent of \(Y\) given \(Z\) if $$\forall x, y, z: P(x, z, y) = P(x z)\(, denoted\)X \indep Y Y$$.

Full joint distributions grow exponentially in size. Bayes nets (graphical models, “BN”) describe complex joint distributions with simple, local distributions. A set of nodes, each representing a variable \(X\), comprise a directed, acyclic graph. Each node is described by a distribution over each combination of its parent’s values. A BN encodes joint distributions as the product of local conditional distributions.

For a graph like \(X \to Y \to Z \to W\), we have $$P(x, y, z, w) = P(x)P(y x) P(z y) P(w z)\(. An unintuitive independence relationship is\)W \indep Y | X$$. We can prove independence using tedious algebra, but disprove independence with a single example.

A triple is a group of 3 nodes. D-separation is an algorithm for analyzing the independence of triples. In a causal chain, \(X \to Y \to Z\), \(X \indep Z \| Y\). In common cause, \(X \leftarrow Y \to Z$4,\)X \indep Z | Y\(. In **common effect**,\)X \to Z \leftarrow Y\(,\)X \indep Y$.

Any BN can be broken down into the three cases. Two variables are independent if there are no active paths between them. A path is active if every component triple is active.

Given a BN structure, the d-separation algorithm builds a complete list of conditional independences. The topology of a BN limits which joint distributions can be encoded.


Inference in BN is equivalent to 3-SAT and NP-comoplete. Some orderings make computation more efficient. A polytree is a directed graph with no undirected cycle.

General variable elimination refers to eliminating rows corresponding to hidden variables. You instatiate a CPT with evidence. You find all factors corresponding a given hidden variable \(H\) and join/normalize the probabilities. Ordering matters: if you eliminate a variable with many dependents, when the maximum factor will be much larger.

The intermediate probability tables are called factors. In inference by enumeration, we start with the CPTs as the initial factors. We select known factors, eliminate the hidden variables, and then normalize Marginalizing early creates savings. Always start with factors that select the evidence.

The first type of factor is a joint distribution. There’s one entry for each joint. If one variable has taken on a value, say a fixed X=x, then it’ll sum to P(x). The number of capitals is the dimensionality, multiplying the dimensions of the capital variables.

The second type of factor is a single or family of conditionals. It could be $$P(Y x)\(for a fix\)x\(, or a family of conditionals\)P(Y, X)$$.

The final type of factor is the specified family, We fix the \(Y = y\) and get the table \(P(y, X)\).

In general, for a factor \(P(Y_1 ... Y_N \| X_1 ... X_M)\), any assigned variable is a dimension missing from the array.


Variable elimination involves eliminating hidden variables and marginalizing hiddne variables. In the worst case, it’s still exponential in the size of the Bayes’ net.

Sampling is repeated simulation. Instead of calculating the model directly, we quickly ask the what if questions. In prior sampling, ** we sample the priors and ultimate generate the distribution. As we take more samples, it approaches the real probability; it is **consistent. For rejection sampling, we reject samples that are inconsistent with the evidence. The problem is that this is inefficient; instead we can use likelihood weighting, weighting the entire sample by the probability consistent with the evidence. But, we can’t fix upstream variables to make evidence more likely. Start with an arbitrary instantiation. Sample one variable at a time, but fix evidence.

Markov Models

The value of \(X\) at some time is the state. The parameters are the transition probabilities/dynamics, which specify the probability of transitions.

The first order Markov property is that the past and future are independent given the present. The chain is just a BN. A forward algorithm could compute: \(P(x_t) = \sum_{x_{t-1}} (P(x_t) | x_{t-1}) P(x_{t-1})\) For most chains, the initial distribution matters less over time. We end up in a stationary distribution \(P_\infty\) which satisfies: \(P_\infty(X) = P_{\infty + 1}(X)\) Hidden Markov models have the additional property of the current observation being independent of other observations. We update our beliefs by filtering, tracking the distribution of \(B_t(X) = P_T(X_T | e_1, ..., e_t)\). Usually \(B_!1\) is uniform, since we don’t have evidence. If we have evidence up to the state before, we update with \(B'(X_{t+1}) = \sum_{x_t} P(X' | x_t) B(x_t)\). If we also receive new evidence, we update with \(B(X_{t+1}) ∝ x_{t+1} P(e_{t+1} | X_{t+1}) B'(X_{t+1})\).

The forward algorithm lets us calculate \(B_t(X)\) given evidence at each time. We can compute: \(P(x_1 | e_{1:t}) \propto P(e_t | x_t) \sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1}, e_{1:t-1})\) Particle filtering gives an approximate solution in scenarios where the solution space is too big. Instead, we can track samples called particles. We represent \(P(X)\) with a list of \(N\) particles with \(N << |X|\). Each particle is moved by sampling the transition model so that \(x' = \text{sample}(P(X'|x))\), similar to prior sampling. After making an observation \(e\), we weigh the samples according to the probability of that observation. To resample, we sample a new set of particles from the previous, weighted distribution of particles.

Decision Networks

A decision network lets us calculate expected utility for actions under uncertainty. To select an action, we instantiate all evidence, select all possible actions, calculate the posterior for all parents of a utility node, and pick the action that maximizes expected utility. A decision tree can be modeled as an outcome tree.

The value of perfect information is the gain in MEU from knowing information. \(VPI(E'|e) = (\sum_{e'} P(e'|e) MEU(e, e')) - MEU(e)\) VPI is nonnegative (information won’t hurt), nonadditive (information isn’t necessarily additive), and order-independent (the order of information doesn’t matter). There’s no “imperfect” information - if an observation is noisy, we can model it with another node.

POMDPs add a noisy measurement of the true state based on the observation \(O\), modeled by an observation function $$P(o s)$$. Solving POMDPs requires a VPI agent that repeatedly calculates the marginal value of information.

Machine Learning

Naïve Bayes

Examples of classification problems include the spam/ham problem, digit recognition, and a variety of commercial application. In Naïve Bayes, we assume all features are independent effects of the label, where the model only specifies how each feature depends on the class: \(P(Y|F_{1} ... F_{n}) \propto P(Y) \prod_{i} P(F_{i} | Y)\) To use NB, we need an inference method and local conditional probability tables, which are specified by parameters denoted by \(theta\).

The experimental cycle involves learning parameters on the training set, adjusting hyperparameters, and computing the accuracy of the test set. Accuracy is the fraction of instances predicted correctly. Overfitting refers to fitting a model to the training data without generalizing to future cases. To generalize, estimates need to be smoothed and regularized.

To estimate parameters, we use the empirical rate of the value to pick the estimate that maximizes the likelihood of the data. \(P_{ML}(x) = \frac{\text{count}(x)}{\text{total samples}}, L(x, \theta) = \prod_i P_\theta(x_i)\) In maximum likelihood estimation, the hypothesis space is comprised on binomial distributions. Taking the derivative of the likelihood yields that the empirical rate maximizes likelihood.

Smoothing gives us a way to deal with unseen events and avoid overfitting. In Laplace smoothing, we pretend that we’ve seen every outcome \(k\) extra times: \(P_{LAP, k}(x) = \frac{c(x) + k}{N + k|X|}\) The amount of smoothing is a hyperparameter that we tune after finding parameters. A baseline is a straw man procedure that measures how hard the task is and gives a comparison for other accuracy achievements. The confidence of a classifier represents how sure the classifier is in the classification.


In a linear classifier, the inputs are feature values, each feature has a weight, and the sum is the activation. \(\text{activation}_w(x) = \sum_i w_i \cdot f_i(x) = w \cdot f(x)\) In binary decision rule, the model outputs +1 when the activation is positive and -1 otherwise. The decision boundary is a hyperplane orthogonal to the weight vector. To learn with a binary perceptron, we can adjust the weights when a prediction is incorrect. \(w = w + y^* \cdot f\)

If we have multiple classes, we introduce a weight vector \(w_y\) for each class and pick the prediction with the highest score: \(y = \arg \max_y w_y \cdot f(x)\) We train by subtracting weights from the wrong answer and adding weights to the right answer.

Separability refers to the existence of parameters that get the training set correct; in this case, there is convergence. The mistake bound is the number of mistakes related to the margin \(\delta\) and number of feature \(k\): \(\text{mistakes} < \frac{k}{\delta^2}\) If the data aren’t separable, weights might thrash. Averaging weights can help. Sometimes it can be a poor generalization, barely separating the solutions, leading to overfitting. We can get probabilistic decision boundaries with scoring. If the score is very positive, we want the probability is very positive, we want probability to be close to 1, and vice versa. The Sigmoid function models this: \(z = w \cdot f(x), \phi (z) = \frac{1}{1+ e ^{-z}}\) The new learning method is maximum likelihood estimation: \(\max _w ll(w) = \max_w \sum_i \log P(y^{(i)} | x^{(i)} ; w), P(y^{(i)} = +1 | x^{(u)} ; w) = \frac{1}{1 + e^{-w \cdot f(x^{(i)})}}\) Multiclass logistic regression evaluates scores in the same way.

Neural Networks

The simplest way to optimize weights is hill climbing. Move to the best neighbor until there isn’t another one. For multiclass logistic regression, there is a continuous space with infinite neighbors. Instead of randomly picking neighbors, gradient ascent updates in the uphill direction. For \(g(w_1, w_2, ...)\), we update: \(w \leftarrow w + \alpha * \Delta_w g(w), \Delta_wg(w) = [\frac{\partial g}{\partial w_1}(w), \frac{\partial g}{\partial w_2}(w)...]\) We can use the logistic regression function as \(g\), where each update adds \(f\) to the correct class weight.

One alternative is stochastic gradient ascent, which involves picking random samples. Another is mini-batch were the gradient over a batch is computed in parallel.

Multi-class logistic regression is a special case of neural networks. The difference is that a deep neural network learns the features. The value of node \(i\) in layer \(k\) is a weighted average of the previous layers’ neurons: \(z_i^{k} = g(\sum_j W{_i, j}^{(k-1, k)}) z_j^{(k-1)}\) Common activation fundtions are sigmoid, hyperbolic (\(\frac{e^z - e^{-z}}{e^z + e^{-z}}\)) and ReLU (\(\max(0, z)\)). A two-layer neural network with sufficient neurons can closely approximate any continuous function.

Backpropogation involves computing partial derivatives backwards. It gives the exact answer, in contrast to finite difference approximation. Applications of nueral nets include computer vision, visual QA, and machine translation. Neural nets are limited by shortcuts, using related information and particular features. In deep RL, coordination tasks can overfit and be sensitive to noise in collaborators.

Built with Jekyll on the Swiss theme.