Variable Elimination: Efficient Probabilistic Inference in Bayesian Networks
Learn how variable elimination dramatically reduces computational complexity in Bayesian network inference through strategic factor manipulation and elimination ordering.
Probabilistic inference in Bayesian networks is fundamental to intelligent systems, allowing us to answer queries such as “What is the probability of disease X given symptoms Y and Z?” While the theoretical approach—computing the full joint distribution and then marginalizing—is sound, it suffers from exponential complexity in the worst case. This makes it impractical for networks with more than a handful of variables.
Variable elimination provides a systematic, algorithmic method to compute exact probabilities while avoiding the redundancy of the naive approach. By carefully selecting the order in which we “eliminate” (sum out) hidden variables, we can often achieve dramatic computational savings, rendering inference tractable for many real-world networks.
This lecture explores the variable elimination algorithm in depth, demonstrates its efficiency gains over naive enumeration, and examines the critical role of elimination ordering in determining computational complexity.
The Inference Problem
Problem Statement
In the context of a Bayesian network, the general inference problem can be stated as follows:
Given:
- A Bayesian network structure and its conditional probability tables (CPTs).
- Evidence variables $E = e$ (observed values).
- Query variable(s) $Q$.
- Hidden variables $H$ (variables that are neither observed nor queried).
Compute: The posterior probability distribution of the query variable(s) given the evidence,

$$P(Q \mid E = e) \;=\; \frac{P(Q, E = e)}{P(E = e)} \;=\; \frac{\sum_{h} P(Q, E = e, H = h)}{\sum_{q, h} P(Q = q, E = e, H = h)}.$$
Illustrative Example: The Traffic Domain
Consider a simple Bayesian network modeling the relationship between rain, traffic, and being late for class.
Figure 1: Traffic Bayesian Network Structure
[ R (Raining) ]
|
v
[ T (Traffic) ]
|
v
[ L (Late) ]
The network is defined by the following probability tables:
Table 1: Prior Probability for Rain $P(R)$
| R | P(R) |
|---|---|
| +r | 0.1 |
| -r | 0.9 |
Table 2: Conditional Probability for Traffic $P(T \mid R)$
| R | T | P(T \mid R) |
|---|---|---|
| +r | +t | 0.8 |
| +r | -t | 0.2 |
| -r | +t | 0.1 |
| -r | -t | 0.9 |
Table 3: Conditional Probability for Being Late $P(L \mid T)$
| T | L | P(L \mid T) |
|---|---|---|
| +t | +l | 0.3 |
| +t | -l | 0.7 |
| -t | +l | 0.1 |
| -t | -l | 0.9 |
Query: We wish to calculate $P(L \mid R = +r)$, or “What is the probability I will be late given that it is raining?”
The Naive Approach: Inference by Enumeration
To appreciate the efficiency of variable elimination, we first look at the naive approach, known as inference by enumeration.
Step 1: Joint Distribution Formulation
First, we write out the full joint probability distribution using the chain rule of Bayesian networks:

$$P(R, T, L) = P(R)\, P(T \mid R)\, P(L \mid T).$$
Step 2: Evidence Instantiation
Next, we fix the observed evidence variable to its value ($R = +r$):

$$P(+r, T, L) = P(+r)\, P(T \mid +r)\, P(L \mid T).$$
Step 3: Marginalization
We then sum out the hidden variable $T$ to obtain the joint probability of $L$ and the evidence:

$$P(+r, L) = \sum_{t} P(+r)\, P(t \mid +r)\, P(L \mid t).$$
Step 4: Normalization
Finally, we normalize to obtain the conditional probability:

$$P(L \mid +r) = \frac{P(+r, L)}{P(+r)} = \frac{P(+r, L)}{\sum_{l} P(+r, l)}.$$
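The following minimal Python sketch walks through these four steps for the traffic network. The CPT values come from Tables 1 to 3; the variable names and code structure are illustrative only, not part of the original lecture.

```python
# Inference by enumeration for the traffic network: P(L | R = +r).
# CPT values taken from Tables 1-3; the code structure is just a sketch.
P_R = {"+r": 0.1, "-r": 0.9}
P_T_given_R = {("+t", "+r"): 0.8, ("-t", "+r"): 0.2,
               ("+t", "-r"): 0.1, ("-t", "-r"): 0.9}
P_L_given_T = {("+l", "+t"): 0.3, ("-l", "+t"): 0.7,
               ("+l", "-t"): 0.1, ("-l", "-t"): 0.9}

evidence_r = "+r"

# Steps 1-3: build the joint entries consistent with the evidence and sum out T.
unnormalized = {}
for l in ("+l", "-l"):
    unnormalized[l] = sum(
        P_R[evidence_r] * P_T_given_R[(t, evidence_r)] * P_L_given_T[(l, t)]
        for t in ("+t", "-t")
    )

# Step 4: normalize by P(+r).
Z = sum(unnormalized.values())
posterior = {l: p / Z for l, p in unnormalized.items()}
print(posterior)  # approximately {'+l': 0.26, '-l': 0.74}
```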
Computational Complexity Analysis
For a network with $n$ variables, each having domain size $d$:
- The Joint Distribution has a size of $O(d^n)$.
- The Time Complexity is $O(d^n)$ because we must sum over the exponential joint space.
- The Space Complexity is $O(d^n)$ if we store the full joint table.
Example: For a network with $n = 100$ binary variables, the joint table would contain $2^{100} \approx 10^{30}$ entries. This scale renders the naive approach completely intractable for non-trivial networks.
Variable Elimination: The Key Insight
The fundamental insight of variable elimination is that we do not need to compute the full joint distribution. Instead, we can perform the summation over hidden variables in a specific order, creating intermediate factors that are much smaller than the full joint table.
Key Idea: By pushing summations inward as far as possible, we can sum out a variable as soon as all its dependent factors are collected, preventing the creation of the massive joint table.
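For the traffic query, this is simply the algebra of moving the sum over $T$ inside the product instead of summing over full joint entries:

$$P(L \mid +r) \;\propto\; \sum_{t} P(+r)\, P(t \mid +r)\, P(L \mid t) \;=\; P(+r) \sum_{t} P(t \mid +r)\, P(L \mid t).$$

The term $\sum_{t} P(t \mid +r)\, P(L \mid t)$ is exactly the intermediate factor that variable elimination builds when it sums out $T$.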
The Algorithm
Input:
- Bayesian network with CPTs.
- Query $Q$ and evidence $E = e$.
- Elimination ordering for hidden variables $H$.
Output: The posterior distribution $P(Q \mid E = e)$.
Procedure:
- Initialize Factors: Start with the set of all CPTs from the network.
- Instantiate Evidence: For every factor containing an evidence variable, fix it to its observed value.
- Eliminate Hidden Variables: For each hidden variable $H_i$ in the ordering:
- Join: Multiply all factors that mention $H_i$.
- Eliminate (Marginalize): Sum out $H_i$ from the resulting product factor.
- Store: Replace the old factors with this new factor (which no longer involves $H_i$).
- Join Remaining Factors: Multiply the remaining factors, which now only involve the query variable $Q$.
- Normalize: Divide by the sum over all values of $Q$ to obtain a valid probability distribution.
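Below is a minimal, illustrative Python sketch of this procedure for discrete networks. A factor is represented as a (variables, table) pair, where the table maps assignment tuples to numbers; the representation and helper names (restrict, join, sum_out) are my own choices, and the code assumes every hidden variable appears in at least one factor (true for a Bayesian network, since each variable has its own CPT).

```python
from itertools import product

# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps an assignment tuple (one value per variable, in order) to a number.

def restrict(factor, var, value):
    """Fix an evidence variable to its observed value and drop it from the factor."""
    variables, table = factor
    if var not in variables:
        return factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value}
    return (new_vars, new_table)

def join(f1, f2, domains):
    """Pointwise product of two factors over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    variables = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assignment in product(*(domains[v] for v in variables)):
        a = dict(zip(variables, assignment))
        table[assignment] = (t1[tuple(a[v] for v in v1)] *
                             t2[tuple(a[v] for v in v2)])
    return (variables, table)

def sum_out(factor, var):
    """Marginalize (sum) a variable out of a factor."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return (new_vars, new_table)

def variable_elimination(factors, evidence, order, domains):
    """Factors in, normalized posterior over the remaining (query) variables out."""
    # Step 2: instantiate evidence in every factor that mentions it.
    for var, val in evidence.items():
        factors = [restrict(f, var, val) for f in factors]
    # Step 3: for each hidden variable, join the factors that mention it, then sum it out.
    for var in order:
        mentioning = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        joined = mentioning[0]
        for f in mentioning[1:]:
            joined = join(joined, f, domains)
        factors = rest + [sum_out(joined, var)]
    # Steps 4-5: join what is left and normalize.
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f, domains)
    z = sum(result[1].values())
    return result[0], {k: p / z for k, p in result[1].items()}

# Traffic example (Tables 1-3): query L, evidence R = +r, eliminate T.
domains = {"R": ["+r", "-r"], "T": ["+t", "-t"], "L": ["+l", "-l"]}
f_R = (("R",), {("+r",): 0.1, ("-r",): 0.9})
f_T = (("T", "R"), {("+t", "+r"): 0.8, ("-t", "+r"): 0.2,
                    ("+t", "-r"): 0.1, ("-t", "-r"): 0.9})
f_L = (("L", "T"), {("+l", "+t"): 0.3, ("-l", "+t"): 0.7,
                    ("+l", "-t"): 0.1, ("-l", "-t"): 0.9})
print(variable_elimination([f_R, f_T, f_L], {"R": "+r"}, ["T"], domains))
# (('L',), {('+l',): 0.26, ('-l',): 0.74}) up to floating-point rounding
```

The printed distribution matches the hand computation in the worked example that follows.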
Worked Example: Traffic Domain
Let’s apply variable elimination to our query $P(L \mid R = +r)$.
Variables:
- Query: $L$
- Evidence: $R = +r$
- Hidden: $T$ (needs to be eliminated)
Elimination Order: We will eliminate $T$.
Step-by-Step Execution
1. Initial Factors. We begin with the CPTs as our initial factors:
- $f_1(R) = P(R)$
- $f_2(T, R) = P(T \mid R)$
- $f_3(L, T) = P(L \mid T)$
2. Instantiate Evidence ($R = +r$). We restrict each factor to the rows consistent with $R = +r$:
- $f_1(+r) = P(+r) = 0.1$
- $f_2(T, +r) = P(T \mid +r)$
- $f_2(+t, +r) = 0.8$
- $f_2(-t, +r) = 0.2$
- $f_3(L, T)$ remains unchanged as it does not depend on $R$.
3. Eliminate T. We find all factors involving $T$: $f_2(T, +r)$ and $f_3(L, T)$.
Join: We compute the product $f_{2,3}(L, T, +r) = f_2(T, +r) \cdot f_3(L, T)$.
| L | T | $f_{2,3}$ Calculation | Value |
|---|---|---|---|
| +l | +t | $0.8 \times 0.3$ | 0.24 |
| +l | -t | $0.2 \times 0.1$ | 0.02 |
| -l | +t | $0.8 \times 0.7$ | 0.56 |
| -l | -t | $0.2 \times 0.9$ | 0.18 |
Marginalize: We sum out $T$ to create a new factor $f_4(L, +r)$.
| L | $f_4(L, +r)$ Calculation | Value |
|---|---|---|
| +l | $0.24 + 0.02$ | 0.26 |
| -l | $0.56 + 0.18$ | 0.74 |
4. Join Remaining Factors. The remaining factors are $f_1(+r)$ and $f_4(L, +r)$; multiplying them gives $f_5(L, +r) = f_1(+r) \cdot f_4(L, +r)$:
| L | $f_5(L, +r)$ Calculation | Value |
|---|---|---|
| +l | $0.1 \times 0.26$ | 0.026 |
| -l | $0.1 \times 0.74$ | 0.074 |
5. Normalize. We compute the normalizing constant $Z = 0.026 + 0.074 = 0.1$. Note that $Z$ equals $P(+r)$, the probability of the evidence.
| L | Calculation | Probability |
|---|---|---|
| +l | $0.026 / 0.1$ | 0.26 |
| -l | $0.074 / 0.1$ | 0.74 |
Answer: The probability of being late given it is raining is 0.26.
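As a cross-check, the pgmpy library listed under Further Reading implements this same algorithm. The sketch below is an assumption-laden illustration: it assumes a reasonably recent pgmpy version (where the model class is named BayesianNetwork; older releases call it BayesianModel) and uses state index 0 for the "+" value of each variable.

```python
from pgmpy.models import BayesianNetwork          # older pgmpy versions: BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# State-index convention assumed here: index 0 = "+" value, index 1 = "-" value.
model = BayesianNetwork([("R", "T"), ("T", "L")])
model.add_cpds(
    TabularCPD("R", 2, [[0.1], [0.9]]),
    TabularCPD("T", 2, [[0.8, 0.1], [0.2, 0.9]], evidence=["R"], evidence_card=[2]),
    TabularCPD("L", 2, [[0.3, 0.1], [0.7, 0.9]], evidence=["T"], evidence_card=[2]),
)
inference = VariableElimination(model)
print(inference.query(["L"], evidence={"R": 0}))  # expect P(L=0 | R=0) = 0.26
```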
Complexity Analysis: Enumeration vs. Variable Elimination
Enumeration:
- Computed the full joint $P(R, T, L)$ with size $2^3 = 8$.
- Required 8 multiplications and 4 additions.
Variable Elimination:
- The largest intermediate factor was $f_{2,3}(L, T, +r)$ with size $2 \times 2 = 4$.
- Required 4 multiplications and 2 additions.
Savings: In this small example, the savings are modest. However, in large networks, avoiding the full joint distribution allows us to reduce exponential costs to polynomial ones (depending on the network structure).
General Complexity and Elimination Ordering
The computational cost of variable elimination is dominated by the size of the largest intermediate factor formed during execution.
Factor Sizes
The size of a factor $f(X_1, \ldots, X_k)$, i.e. the number of entries in its table, is $d^k$, where $d$ is the domain size of each variable. The number of variables $k$ appearing in the largest factor is therefore the critical determinant of complexity.
Worst-Case Complexity and NP-Hardness
Theorem: Exact inference in Bayesian networks is NP-hard in the worst case; no elimination ordering can guarantee small factors for every network.
Proof Sketch: The size of the intermediate factors is determined by the elimination ordering. A good ordering keeps factors small, but for some networks every ordering eventually creates a factor over a large fraction of the variables, causing exponential blowup. Moreover, finding the best ordering is itself intractable: the smallest achievable maximum factor size corresponds to the treewidth of the moralized graph, and computing treewidth is NP-hard.
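A useful way to quantify this: for a given elimination ordering, let $w$ be the induced width, i.e. one less than the number of variables in the largest factor created. Then the running time and space of variable elimination are on the order of

$$O\!\left(n \cdot d^{\,w + 1}\right),$$

and the smallest achievable $w$ over all orderings is the treewidth of the moralized network.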
Example: The Chain Network
Consider a chain network: $X_1 \to X_2 \to X_3 \to \cdots \to X_n$.
Good Ordering: Eliminate in sequence $X_2, X_3, \ldots, X_{n-1}$.
- At each step, we join $f(X_{i-1}, X_i)$ and $f(X_i, X_{i+1})$.
- The resulting factor involves only 3 variables.
- Complexity: $O(n \cdot d^3)$, which is linear in $n$.
Bad Ordering: In a chain, there is in fact no catastrophic ordering: every node has at most two neighbors, so even eliminating $X_1, X_3, X_5, \ldots$ first (skipping nodes) only couples pairs of neighbors and keeps factors at three variables or fewer.
- The danger appears in more connected graphs, where eliminating a variable joins all of its current neighbors into one factor, so a poor ordering couples more and more variables together.
- Complexity: In such networks, a bad ordering can approach $O(d^n)$.
Heuristics for Elimination Ordering
Since finding the optimal ordering is NP-hard, we rely on greedy heuristics:
- Min-Fill: Choose the variable whose elimination adds the fewest new edges (fill-in) to the graph.
- Min-Weight: Choose the variable whose elimination creates the smallest factor, measured by the product of the domain sizes of its neighbors.
- Min-Degree (sometimes called min-width): Eliminate the variable that currently has the fewest neighbors, as sketched below.
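As a concrete illustration of the min-degree idea, here is a small self-contained Python sketch that builds a greedy elimination ordering on an undirected moral graph and reports how many variables the largest factor would contain. The function and variable names are my own.

```python
def greedy_min_degree_ordering(adjacency):
    """Greedy elimination ordering on an undirected moral graph: repeatedly
    eliminate the node with the fewest current neighbors, connecting the
    neighbors it leaves behind (fill-in edges). Returns the ordering and the
    number of variables in the largest factor it would create."""
    graph = {v: set(ns) for v, ns in adjacency.items()}
    order, max_factor_vars = [], 0
    while graph:
        v = min(graph, key=lambda node: len(graph[node]))  # min-degree choice
        neighbors = graph[v]
        max_factor_vars = max(max_factor_vars, len(neighbors) + 1)
        for a in neighbors:                 # connect v's neighbors to each other
            graph[a] |= neighbors - {a}
            graph[a].discard(v)
        del graph[v]
        order.append(v)
    return order, max_factor_vars

# Moral graph of the chain X1 - X2 - X3 - X4 - X5 from the example above.
chain = {"X1": {"X2"}, "X2": {"X1", "X3"}, "X3": {"X2", "X4"},
         "X4": {"X3", "X5"}, "X5": {"X4"}}
print(greedy_min_degree_ordering(chain))  # e.g. (['X1', 'X2', 'X3', 'X4', 'X5'], 2)
```

Swapping the min-degree choice for a count of fill-in edges turns the same loop into the min-fill heuristic.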
Practical Advice:
- Use domain knowledge to guide ordering.
- Eliminate leaf nodes and nodes with few connections first.
- Avoid eliminating “hub” nodes (nodes connected to many others) early, as this couples all their neighbors.
Advanced Example: The Alarm Network
Consider the classic “Burglary Alarm” network.
Figure 2: Alarm Network Structure
B (Burglary) E (Earthquake)
\ /
\ /
v v
A (Alarm)
/ \
/ \
v v
J (JohnCalls) M (MaryCalls)
Query: $P(B \mid J = \text{true}, M = \text{true})$
Evidence: $J = t, M = t$
Hidden: $E, A$
Scenario 1: Elimination Order [E, A]
- Eliminate E:
- Factors involving $E$: $P(E)$ and $P(A \mid B, E)$ (since $B$ and $E$ are independent priors, there is no initial factor over both).
- Join: $P(E) \times P(A \mid B, E) \to g(A, B, E)$; summing out $E$ gives $f'(B, A)$.
- Size: The joined factor spans $A, B, E$, i.e. $2^3 = 8$ entries; after summing out $E$, $f'(B, A)$ has $2^2 = 4$.
- Eliminate A:
- Factors involving $A$: $f'(B, A)$, $f(J \mid A)$, and $f(M \mid A)$, the latter two restricted to the observed values $J = t$ and $M = t$.
- Join: $f'(B, A) \times f(J{=}t \mid A) \times f(M{=}t \mid A) \to g(A, B)$; summing out $A$ leaves a factor over $B$ alone.
- Size: Because $J$ and $M$ are fixed by the evidence, the joined factor effectively involves only $A$ and $B$. Size 4.
Max Factor Size: 8 (the join over $A, B, E$).
Scenario 2: Elimination Order [A, E]
- Eliminate A:
- Factors involving $A$: $P(A \mid B, E)$, $P(J \mid A)$, $P(M \mid A)$, the latter two restricted to $J = t$ and $M = t$.
- Join: All three. With the evidence fixed, the joined factor spans $A, B, E$; summing out $A$ leaves a factor over $B, E$.
- Size: $2^3 = 8$ for the join, 4 after summing out $A$.
- Note: Eliminating $A$ first couples its parents $B, E$ with its children $J, M$. The coupling is masked here because $J$ and $M$ are observed; if they were hidden, the joined factor would span $A, B, E, J, M$, i.e. $2^5 = 32$ entries.
Max Factor Size: 8 (the join over $A, B, E$).
Conclusion: With both children observed, the two orderings happen to create joined factors of the same maximum size. The general lesson still holds, though: eliminating a central hub such as $A$ early couples all of its unobserved neighbors into a single factor, so orderings that deal with peripheral variables (such as the cause $E$ or unobserved leaves) before the hub tend to keep factors smaller; the simulation below makes this concrete.
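To see the hub effect directly, the following self-contained sketch simulates elimination on the moralized alarm graph while treating every non-query variable, including $J$ and $M$, as hidden (i.e. ignoring the evidence instantiation that masks the effect above). The helper name is mine.

```python
def largest_factor(adjacency, order):
    """Number of variables in the largest factor created when the nodes in
    `order` are eliminated from the given undirected (moral) graph."""
    graph = {v: set(ns) for v, ns in adjacency.items()}
    worst = 0
    for v in order:
        neighbors = graph[v]
        worst = max(worst, len(neighbors) + 1)  # the factor covers v and its neighbors
        for a in neighbors:                     # fill-in: connect v's neighbors
            graph[a] |= neighbors - {a}
            graph[a].discard(v)
        del graph[v]
    return worst

# Moral graph of the alarm network; the B-E edge appears because B and E share the child A.
alarm = {"B": {"E", "A"}, "E": {"B", "A"},
         "A": {"B", "E", "J", "M"}, "J": {"A"}, "M": {"A"}}
print(largest_factor(alarm, ["E", "A", "J", "M"]))  # 4: peripheral variables first
print(largest_factor(alarm, ["A", "E", "J", "M"]))  # 5: the hub A first couples B, E, J, M
```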
Conclusion
Variable elimination transforms probabilistic inference from an intractable exponential problem into one that is often efficiently solvable.
- Efficiency: By creating small intermediate factors, we avoid computing the massive full joint distribution.
- Ordering: The order of elimination is the single most critical factor for performance. A good ordering yields polynomial complexity; a bad one returns to exponential.
- Applicability: This algorithm is ideal for low-treewidth networks (sparse, tree-like structures) and when exact probabilities are required.
For networks where even the best variable elimination ordering fails (high treewidth), we must turn to approximate methods like Markov Chain Monte Carlo (MCMC) sampling, which we will cover in future lectures.
Further Reading
- Probabilistic Graphical Models by Koller and Friedman (Chapter 9) - The definitive comprehensive reference.
- Bayesian Reasoning and Machine Learning by Barber - Excellent textbook with a focus on algorithms.
- Artificial Intelligence: A Modern Approach by Russell and Norvig (Chapter 13.4) - Standard introductory text.
- pgmpy Documentation - Practical Python implementation of Variable Elimination.
This lecture covered variable elimination for efficient probabilistic inference. Next, we will transition from exact inference to machine learning, starting with Naive Bayes classification.