Variable Elimination: Efficient Probabilistic Inference in Bayesian Networks
Learn how variable elimination dramatically reduces computational complexity in Bayesian network inference through strategic factor manipulation and elimination ordering.
Probabilistic inference in Bayesian networks is fundamental to intelligent systems, allowing us to answer queries such as “What is the probability of disease X given symptoms Y and Z?” While the theoretical approach—computing the full joint distribution and then marginalizing—is sound, it suffers from exponential complexity in the worst case. This makes it impractical for networks with more than a handful of variables.
Variable elimination provides a systematic, algorithmic method to compute exact probabilities while avoiding the redundancy of the naive approach. By carefully selecting the order in which we “eliminate” (sum out) hidden variables, we can often achieve dramatic computational savings, rendering inference tractable for many real-world networks.
This lecture explores the variable elimination algorithm in depth, demonstrates its efficiency gains over naive enumeration, and examines the critical role of elimination ordering in determining computational complexity.
The Inference Problem
Problem Statement
In the context of a Bayesian network, the general inference problem can be stated as follows:
Given:
- A Bayesian network structure and its conditional probability tables (CPTs).
- Evidence variables $E = e$ (observed values).
- Query variable(s) $Q$.
- Hidden variables $H$ (variables that are neither observed nor queried).
Compute: The posterior probability distribution of the query variable(s) given the evidence,

$$P(Q \mid E = e) \;=\; \frac{P(Q, E = e)}{P(E = e)} \;=\; \frac{\sum_{h} P(Q, E = e, H = h)}{\sum_{q, h} P(Q = q, E = e, H = h)}.$$
Illustrative Example: The Traffic Domain
Consider a simple Bayesian network modeling the relationship between rain, traffic, and being late for class.
Figure 1: Traffic Bayesian Network Structure
[ R (Raining) ]
|
v
[ T (Traffic) ]
|
v
[ L (Late) ]
The network is defined by the following probability tables:
Table 1: Prior Probability for Rain $P(R)$
| R | P(R) |
|---|---|
| +r | 0.1 |
| -r | 0.9 |
Table 2: Conditional Probability for Traffic $P(T \mid R)$
| R | T | P(T \mid R) |
|---|---|---|
| +r | +t | 0.8 |
| +r | -t | 0.2 |
| -r | +t | 0.1 |
| -r | -t | 0.9 |
Table 3: Conditional Probability for Being Late $P(L \mid T)$
| T | L | P(L \mid T) |
|---|---|---|
| +t | +l | 0.3 |
| +t | -l | 0.7 |
| -t | +l | 0.1 |
| -t | -l | 0.9 |
Query: We wish to calculate $P(L \mid R = +r)$, or “What is the probability I will be late given that it is raining?”
The Naive Approach: Inference by Enumeration
To appreciate the efficiency of variable elimination, we first look at the naive approach, known as inference by enumeration.
Step 1: Joint Distribution Formulation
First, we write out the full joint probability distribution using the chain rule of Bayesian networks:

$$P(R, T, L) = P(R)\, P(T \mid R)\, P(L \mid T).$$
Step 2: Evidence Instantiation
Next, we fix the observed evidence variable to its value ($R = +r$):

$$P(+r, T, L) = P(+r)\, P(T \mid +r)\, P(L \mid T).$$
Step 3: Marginalization
We then sum out the hidden variable $T$ to obtain the joint probability of $L$ and the evidence:

$$P(+r, L) = \sum_{t} P(+r)\, P(t \mid +r)\, P(L \mid t).$$
Step 4: Normalization
Finally, we normalize to obtain the conditional probability:

$$P(L \mid +r) = \frac{P(+r, L)}{P(+r)} = \frac{P(+r, L)}{\sum_{l} P(+r, l)}.$$
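The following minimal Python sketch walks through these four steps for the traffic network. The CPT values come from Tables 1 to 3; the variable names and code structure are illustrative only, not part of the original lecture.

```python
# Inference by enumeration for the traffic network: P(L | R = +r).
# CPT values taken from Tables 1-3; the code structure is just a sketch.
P_R = {"+r": 0.1, "-r": 0.9}
P_T_given_R = {("+t", "+r"): 0.8, ("-t", "+r"): 0.2,
               ("+t", "-r"): 0.1, ("-t", "-r"): 0.9}
P_L_given_T = {("+l", "+t"): 0.3, ("-l", "+t"): 0.7,
               ("+l", "-t"): 0.1, ("-l", "-t"): 0.9}

evidence_r = "+r"

# Steps 1-3: build the joint entries consistent with the evidence and sum out T.
unnormalized = {}
for l in ("+l", "-l"):
    unnormalized[l] = sum(
        P_R[evidence_r] * P_T_given_R[(t, evidence_r)] * P_L_given_T[(l, t)]
        for t in ("+t", "-t")
    )

# Step 4: normalize by P(+r).
Z = sum(unnormalized.values())
posterior = {l: p / Z for l, p in unnormalized.items()}
print(posterior)  # approximately {'+l': 0.26, '-l': 0.74}
```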
Computational Complexity Analysis
For a network with $n$ variables, each having domain size $d$:
- The Joint Distribution has a size of $O(d^n)$.
- The Time Complexity is $O(d^n)$ because we must sum over the exponential joint space.
- The Space Complexity is $O(d^n)$ if we store the full joint table.
Example: For a network with $n = 100$ binary variables, the joint table would contain $2^{100} \approx 10^{30}$ entries. This scale renders the naive approach completely intractable for non-trivial networks.
Variable Elimination: The Key Insight
The fundamental insight of variable elimination is that we do not need to compute the full joint distribution. Instead, we can perform the summation over hidden variables in a specific order, creating intermediate factors that are much smaller than the full joint table.
Key Idea: By pushing summations inward as far as possible, we can sum out a variable as soon as all its dependent factors are collected, preventing the creation of the massive joint table.
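For the traffic query, this is simply the algebra of moving the sum over $T$ inside the product instead of summing over full joint entries:

$$P(L \mid +r) \;\propto\; \sum_{t} P(+r)\, P(t \mid +r)\, P(L \mid t) \;=\; P(+r) \sum_{t} P(t \mid +r)\, P(L \mid t).$$

The term $\sum_{t} P(t \mid +r)\, P(L \mid t)$ is exactly the intermediate factor that variable elimination builds when it sums out $T$.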
The Algorithm
Input:
- Bayesian network with CPTs.
- Query $Q$ and evidence $E = e$.
- Elimination ordering for hidden variables $H$.
Output: The posterior distribution $P(Q \mid E = e)$.
Procedure:
- Initialize Factors: Start with the set of all CPTs from the network.
- Instantiate Evidence: For every factor containing an evidence variable, fix it to its observed value.
- Eliminate Hidden Variables: For each hidden variable $H_i$ in the ordering:
- Join: Multiply all factors that mention $H_i$.
- Eliminate (Marginalize): Sum out $H_i$ from the resulting product factor.
- Store: Replace the old factors with this new factor (which no longer involves $H_i$).
- Join Remaining Factors: Multiply the remaining factors, which now only involve the query variable $Q$.
- Normalize: Divide by the sum over all values of $Q$ to obtain a valid probability distribution.
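Below is a minimal, illustrative Python sketch of this procedure for discrete networks. A factor is represented as a (variables, table) pair, where the table maps assignment tuples to numbers; the representation and helper names (restrict, join, sum_out) are my own choices, and the code assumes every hidden variable appears in at least one factor (true for a Bayesian network, since each variable has its own CPT).

```python
from itertools import product

# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps an assignment tuple (one value per variable, in order) to a number.

def restrict(factor, var, value):
    """Fix an evidence variable to its observed value and drop it from the factor."""
    variables, table = factor
    if var not in variables:
        return factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value}
    return (new_vars, new_table)

def join(f1, f2, domains):
    """Pointwise product of two factors over the union of their variables."""
    v1, t1 = f1
    v2, t2 = f2
    variables = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assignment in product(*(domains[v] for v in variables)):
        a = dict(zip(variables, assignment))
        table[assignment] = (t1[tuple(a[v] for v in v1)] *
                             t2[tuple(a[v] for v in v2)])
    return (variables, table)

def sum_out(factor, var):
    """Marginalize (sum) a variable out of a factor."""
    variables, table = factor
    i = variables.index(var)
    new_vars = variables[:i] + variables[i + 1:]
    new_table = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return (new_vars, new_table)

def variable_elimination(factors, evidence, order, domains):
    """Factors in, normalized posterior over the remaining (query) variables out."""
    # Step 2: instantiate evidence in every factor that mentions it.
    for var, val in evidence.items():
        factors = [restrict(f, var, val) for f in factors]
    # Step 3: for each hidden variable, join the factors that mention it, then sum it out.
    for var in order:
        mentioning = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        joined = mentioning[0]
        for f in mentioning[1:]:
            joined = join(joined, f, domains)
        factors = rest + [sum_out(joined, var)]
    # Steps 4-5: join what is left and normalize.
    result = factors[0]
    for f in factors[1:]:
        result = join(result, f, domains)
    z = sum(result[1].values())
    return result[0], {k: p / z for k, p in result[1].items()}

# Traffic example (Tables 1-3): query L, evidence R = +r, eliminate T.
domains = {"R": ["+r", "-r"], "T": ["+t", "-t"], "L": ["+l", "-l"]}
f_R = (("R",), {("+r",): 0.1, ("-r",): 0.9})
f_T = (("T", "R"), {("+t", "+r"): 0.8, ("-t", "+r"): 0.2,
                    ("+t", "-r"): 0.1, ("-t", "-r"): 0.9})
f_L = (("L", "T"), {("+l", "+t"): 0.3, ("-l", "+t"): 0.7,
                    ("+l", "-t"): 0.1, ("-l", "-t"): 0.9})
print(variable_elimination([f_R, f_T, f_L], {"R": "+r"}, ["T"], domains))
# (('L',), {('+l',): 0.26, ('-l',): 0.74}) up to floating-point rounding
```

The printed distribution matches the hand computation in the worked example that follows.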
Worked Example: Traffic Domain
Let’s apply variable elimination to our query $P(L \mid R = +r)$.
Variables:
- Query: $L$
- Evidence: $R = +r$
- Hidden: $T$ (needs to be eliminated)
Elimination Order: We will eliminate $T$.
Step-by-Step Execution
1. Initial Factors. We begin with the CPTs as our initial factors:
- $f_1(R) = P(R)$
- $f_2(T, R) = P(T \mid R)$
- $f_3(L, T) = P(L \mid T)$
2. Instantiate Evidence ($R = +r$). We restrict each factor to the rows consistent with $R = +r$:
- $f_1(+r) = P(+r) = 0.1$
- $f_2(T, +r) = P(T \mid +r)$
- $f_2(+t, +r) = 0.8$
- $f_2(-t, +r) = 0.2$
- $f_3(L, T)$ remains unchanged as it does not depend on $R$.
3. Eliminate T. We find all factors involving $T$: $f_2(T, +r)$ and $f_3(L, T)$.
Join: We compute the product $f_{2,3}(L, T, +r) = f_2(T, +r) \cdot f_3(L, T)$.
| L | T | $f_{2,3}$ Calculation | Value |
|---|---|---|---|
| +l | +t | $0.8 \times 0.3$ | 0.24 |
| +l | -t | $0.2 \times 0.1$ | 0.02 |
| -l | +t | $0.8 \times 0.7$ | 0.56 |
| -l | -t | $0.2 \times 0.9$ | 0.18 |
Marginalize: We sum out $T$ to create a new factor $f_4(L, +r)$.
| L | $f_4(L, +r)$ Calculation | Value |
|---|---|---|
| +l | $0.24 + 0.02$ | 0.26 |
| -l | $0.56 + 0.18$ | 0.74 |
4. Join Remaining Factors. The remaining factors are $f_1(+r)$ and $f_4(L, +r)$; multiplying them gives $f_5(L, +r) = f_1(+r) \cdot f_4(L, +r)$:
| L | $f_5(L, +r)$ Calculation | Value |
|---|---|---|
| +l | $0.1 \times 0.26$ | 0.026 |
| -l | $0.1 \times 0.74$ | 0.074 |
5. Normalize. We compute the normalizing constant $Z = 0.026 + 0.074 = 0.1$. Note that $Z$ equals $P(+r)$, the probability of the evidence.
| L | Calculation | Probability |
|---|---|---|
| +l | $0.026 / 0.1$ | 0.26 |
| -l | $0.074 / 0.1$ | 0.74 |
Answer: The probability of being late given it is raining is 0.26.
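As a cross-check, the pgmpy library listed under Further Reading implements this same algorithm. The sketch below is an assumption-laden illustration: it assumes a reasonably recent pgmpy version (where the model class is named BayesianNetwork; older releases call it BayesianModel) and uses state index 0 for the "+" value of each variable.

```python
from pgmpy.models import BayesianNetwork          # older pgmpy versions: BayesianModel
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# State-index convention assumed here: index 0 = "+" value, index 1 = "-" value.
model = BayesianNetwork([("R", "T"), ("T", "L")])
model.add_cpds(
    TabularCPD("R", 2, [[0.1], [0.9]]),
    TabularCPD("T", 2, [[0.8, 0.1], [0.2, 0.9]], evidence=["R"], evidence_card=[2]),
    TabularCPD("L", 2, [[0.3, 0.1], [0.7, 0.9]], evidence=["T"], evidence_card=[2]),
)
inference = VariableElimination(model)
print(inference.query(["L"], evidence={"R": 0}))  # expect P(L=0 | R=0) = 0.26
```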
Complexity Analysis: Enumeration vs. Variable Elimination
Enumeration:
- Computed the full joint $P(R, T, L)$ with size $2^3 = 8$.
- Required 8 multiplications and 4 additions.
Variable Elimination:
- The largest intermediate factor was $f_{2,3}(L, T, +r)$ with size $2 \times 2 = 4$.
- Required 4 multiplications and 2 additions.
Savings: In this small example, the savings are modest. However, in large networks, avoiding the full joint distribution allows us to reduce exponential costs to polynomial ones (depending on the network structure).
General Complexity and Elimination Ordering
The computational cost of variable elimination is dominated by the size of the largest intermediate factor formed during execution.
Factor Sizes
The size of a factor $f(X_1, \ldots, X_k)$, i.e. the number of entries in its table, is $d^k$, where $d$ is the domain size of each variable. The number of variables $k$ appearing in the largest factor is therefore the critical determinant of complexity.
Worst-Case Complexity and NP-Hardness
Theorem: Exact inference in Bayesian networks is NP-hard in the worst case; no elimination ordering can guarantee small factors for every network.
Proof Sketch: The size of the intermediate factors is determined by the elimination ordering. A good ordering keeps factors small, but for some networks every ordering eventually creates a factor over a large fraction of the variables, causing exponential blowup. Moreover, finding the best ordering is itself intractable: the smallest achievable maximum factor size corresponds to the treewidth of the moralized graph, and computing treewidth is NP-hard.
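A useful way to quantify this: for a given elimination ordering, let $w$ be the induced width, i.e. one less than the number of variables in the largest factor created. Then the running time and space of variable elimination are on the order of

$$O\!\left(n \cdot d^{\,w + 1}\right),$$

and the smallest achievable $w$ over all orderings is the treewidth of the moralized network.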
Example: The Chain Network
Consider a chain network: $X_1 \to X_2 \to X_3 \to \cdots \to X_n$.
Good Ordering: Eliminate in sequence $X_2, X_3, \ldots, X_{n-1}$.
- At each step, we join $f(X_{i-1}, X_i)$ and $f(X_i, X_{i+1})$.
- The resulting factor involves only 3 variables.
- Complexity: $O(n \cdot d^3)$, which is linear in $n$.
Bad Ordering: In a chain, there is in fact no catastrophic ordering: every node has at most two neighbors, so even eliminating $X_1, X_3, X_5, \ldots$ first (skipping nodes) only couples pairs of neighbors and keeps factors at three variables or fewer.
- The danger appears in more connected graphs, where eliminating a variable joins all of its current neighbors into one factor, so a poor ordering couples more and more variables together.
- Complexity: In such networks, a bad ordering can approach $O(d^n)$.
Heuristics for Elimination Ordering
Since finding the optimal ordering is NP-hard, we rely on greedy heuristics:
- Min-Fill: Choose the variable whose elimination adds the fewest new edges (fill-in) to the graph.
- Min-Weight: Choose the variable whose elimination creates the smallest factor, measured by the product of the domain sizes of its neighbors.
- Min-Degree (sometimes called min-width): Eliminate the variable that currently has the fewest neighbors, as sketched below.
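As a concrete illustration of the min-degree idea, here is a small self-contained Python sketch that builds a greedy elimination ordering on an undirected moral graph and reports how many variables the largest factor would contain. The function and variable names are my own.

```python
def greedy_min_degree_ordering(adjacency):
    """Greedy elimination ordering on an undirected moral graph: repeatedly
    eliminate the node with the fewest current neighbors, connecting the
    neighbors it leaves behind (fill-in edges). Returns the ordering and the
    number of variables in the largest factor it would create."""
    graph = {v: set(ns) for v, ns in adjacency.items()}
    order, max_factor_vars = [], 0
    while graph:
        v = min(graph, key=lambda node: len(graph[node]))  # min-degree choice
        neighbors = graph[v]
        max_factor_vars = max(max_factor_vars, len(neighbors) + 1)
        for a in neighbors:                 # connect v's neighbors to each other
            graph[a] |= neighbors - {a}
            graph[a].discard(v)
        del graph[v]
        order.append(v)
    return order, max_factor_vars

# Moral graph of the chain X1 - X2 - X3 - X4 - X5 from the example above.
chain = {"X1": {"X2"}, "X2": {"X1", "X3"}, "X3": {"X2", "X4"},
         "X4": {"X3", "X5"}, "X5": {"X4"}}
print(greedy_min_degree_ordering(chain))  # e.g. (['X1', 'X2', 'X3', 'X4', 'X5'], 2)
```

Swapping the min-degree choice for a count of fill-in edges turns the same loop into the min-fill heuristic.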
Practical Advice:
- Use domain knowledge to guide ordering.
- Eliminate leaf nodes and nodes with few connections first.
- Avoid eliminating “hub” nodes (nodes connected to many others) early, as this couples all their neighbors.
Advanced Example: The Alarm Network
Consider the classic “Burglary Alarm” network.
Figure 2: Alarm Network Structure
B (Burglary) E (Earthquake)
\ /
\ /
v v
A (Alarm)
/ \
/ \
v v
J (JohnCalls) M (MaryCalls)
Query: $P(B \mid J = \text{true}, M = \text{true})$
Evidence: $J = t, M = t$
Hidden: $E, A$
Scenario 1: Elimination Order [E, A]
- Eliminate E:
- Factors involving $E$: $P(E)$ and $P(A \mid B, E)$ (since $B$ and $E$ are independent priors, there is no initial factor over both).
- Join: $P(E) \times P(A \mid B, E) \to g(A, B, E)$; summing out $E$ gives $f'(B, A)$.
- Size: The joined factor spans $A, B, E$, i.e. $2^3 = 8$ entries; after summing out $E$, $f'(B, A)$ has $2^2 = 4$.
- Eliminate A:
- Factors involving $A$: $f'(B, A)$, $f(J \mid A)$, and $f(M \mid A)$, the latter two restricted to the observed values $J = t$ and $M = t$.
- Join: $f'(B, A) \times f(J{=}t \mid A) \times f(M{=}t \mid A) \to g(A, B)$; summing out $A$ leaves a factor over $B$ alone.
- Size: Because $J$ and $M$ are fixed by the evidence, the joined factor effectively involves only $A$ and $B$. Size 4.
Max Factor Size: 8 (the join over $A, B, E$).
Scenario 2: Elimination Order [A, E]
- Eliminate A:
- Factors involving $A$: $P(A \mid B, E)$, $P(J \mid A)$, $P(M \mid A)$, the latter two restricted to $J = t$ and $M = t$.
- Join: All three. With the evidence fixed, the joined factor spans $A, B, E$; summing out $A$ leaves a factor over $B, E$.
- Size: $2^3 = 8$ for the join, 4 after summing out $A$.
- Note: Eliminating $A$ first couples its parents $B, E$ with its children $J, M$. The coupling is masked here because $J$ and $M$ are observed; if they were hidden, the joined factor would span $A, B, E, J, M$, i.e. $2^5 = 32$ entries.
Max Factor Size: 8 (the join over $A, B, E$).
Conclusion: With both children observed, the two orderings happen to create joined factors of the same maximum size. The general lesson still holds, though: eliminating a central hub such as $A$ early couples all of its unobserved neighbors into a single factor, so orderings that deal with peripheral variables (such as the cause $E$ or unobserved leaves) before the hub tend to keep factors smaller; the simulation below makes this concrete.
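To see the hub effect directly, the following self-contained sketch simulates elimination on the moralized alarm graph while treating every non-query variable, including $J$ and $M$, as hidden (i.e. ignoring the evidence instantiation that masks the effect above). The helper name is mine.

```python
def largest_factor(adjacency, order):
    """Number of variables in the largest factor created when the nodes in
    `order` are eliminated from the given undirected (moral) graph."""
    graph = {v: set(ns) for v, ns in adjacency.items()}
    worst = 0
    for v in order:
        neighbors = graph[v]
        worst = max(worst, len(neighbors) + 1)  # the factor covers v and its neighbors
        for a in neighbors:                     # fill-in: connect v's neighbors
            graph[a] |= neighbors - {a}
            graph[a].discard(v)
        del graph[v]
    return worst

# Moral graph of the alarm network; the B-E edge appears because B and E share the child A.
alarm = {"B": {"E", "A"}, "E": {"B", "A"},
         "A": {"B", "E", "J", "M"}, "J": {"A"}, "M": {"A"}}
print(largest_factor(alarm, ["E", "A", "J", "M"]))  # 4: peripheral variables first
print(largest_factor(alarm, ["A", "E", "J", "M"]))  # 5: the hub A first couples B, E, J, M
```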
Conclusion
Variable elimination transforms probabilistic inference from an intractable exponential problem into one that is often efficiently solvable.
- Efficiency: By creating small intermediate factors, we avoid computing the massive full joint distribution.
- Ordering: The order of elimination is the single most critical factor for performance. A good ordering yields polynomial complexity; a bad one returns to exponential.
- Applicability: This algorithm is ideal for low-treewidth networks (sparse, tree-like structures) and when exact probabilities are required.
For networks where even the best variable elimination ordering fails (high treewidth), we must turn to approximate methods like Markov Chain Monte Carlo (MCMC) sampling, which we will cover in future lectures.
Further Reading
- Probabilistic Graphical Models by Koller and Friedman (Chapter 9) - The definitive comprehensive reference.
- Bayesian Reasoning and Machine Learning by Barber - Excellent textbook with a focus on algorithms.
- Artificial Intelligence: A Modern Approach by Russell and Norvig (Chapter 13.4) - Standard introductory text.
- pgmpy Documentation - Practical Python implementation of Variable Elimination.
This lecture covered variable elimination for efficient probabilistic inference. Next, we will transition from exact inference to machine learning, starting with Naive Bayes classification.