Laplace Smoothing and Maximum Likelihood Estimation in Naive Bayes

In the previous lecture, we learned the Naive Bayes classification algorithm and saw its power for supervised learning. However, we encountered a critical problem: What happens when a feature value never appears in the training data for a particular class?

This zero-frequency problem can cause catastrophic failures—a single unseen feature can force the probability of an entire class to zero, regardless of all other evidence. This lecture explores Laplace smoothing, a principled solution that prevents this problem while maintaining the spirit of Naive Bayes.

We’ll also formalize our parameter learning through maximum likelihood estimation (MLE) and understand the trade-offs between overfitting and generalization.

The Zero-Frequency Problem

Example: Spam Classification Catastrophe

Training Data: 1000 spam emails, 1000 ham emails

Feature: Contains word “unsubscribe”

  • Spam: 500/1000 emails contain “unsubscribe”

  • Ham: 0/1000 emails contain “unsubscribe”

Learned Parameters:

$$P(F_{\text{unsubscribe}} = \text{yes} \mid \text{spam}) = 0.5$$

$$P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = 0.0$$

New Email: Contains “unsubscribe” + other features suggesting ham

Classification:

$$P(\text{ham} \mid \text{features}) \propto P(\text{ham}) \prod_i P(F_i \mid \text{ham})$$

Since $P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = 0$:

$$P(\text{ham} \mid \text{features}) \propto 0.5 \times \cdots \times 0 = 0$$

Result: Classified as spam regardless of all other features!

Problem: One zero probability destroys the entire inference, even if 99 other features strongly suggest ham.
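A tiny numerical sketch of the failure (hypothetical probabilities; imagine 99 features that each favor ham):

```python
# 99 features each strongly favor ham, but "unsubscribe" was never seen in ham
factors = [0.9] * 99 + [0.0]

posterior_ham = 0.5          # prior P(ham)
for p in factors:
    posterior_ham *= p
print(posterior_ham)         # 0.0 -- ham is ruled out by a single zero factor
```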

Why This Happens

Maximum Likelihood Estimation (naive version):

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y)}{\text{count}(Y = y)}$$

If count is zero: $P(F = f \mid Y = y) = 0$

Consequences:

  1. Overfitting: Model is too confident about unseen events

  2. Fragility: Classification can fail catastrophically

  3. Poor generalization: Training data never covers all possibilities

Laplace Smoothing: The Solution

Basic Laplace Smoothing (Add-One)

Idea: Pretend we’ve seen every feature value at least once in every class.

Modified Estimate:

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + 1}{\text{count}(Y = y) + |F|}$$

where $|F|$ is the number of possible values of feature $F$.

Interpretation:

  • Add 1 pseudo-count to every feature-value combination

  • Add $|F|$ to the denominator (to maintain normalization)

Effect:

  • No more zeros: Every probability is at least $\frac{1}{\text{count}(Y=y) + |F|} > 0$

  • Small change for frequent events: If count is large, $+1$ is negligible

  • Significant change for rare events: Prevents overfitting to small counts

Example: Binary Feature

Training Data:

  • Class A: Feature present 8 times, absent 2 times

  • Class B: Feature present 0 times, absent 10 times

Without Smoothing:

$$P(F = 1 \mid A) = \frac{8}{10} = 0.8$$

$$P(F = 1 \mid B) = \frac{0}{10} = 0.0 \quad \text{(PROBLEM!)}$$

With Laplace Smoothing ($|F| = 2$ for binary):

$$P(F = 1 \mid A) = \frac{8 + 1}{10 + 2} = \frac{9}{12} = 0.75$$

$$P(F = 1 \mid B) = \frac{0 + 1}{10 + 2} = \frac{1}{12} \approx 0.083$$

Observations:

  • Class A: Probability decreased slightly (0.8 → 0.75)

  • Class B: Probability no longer zero (0 → 0.083)

  • Both probabilities sum correctly: $0.75 + 0.25 = 1$ for A, $0.083 + 0.917 = 1$ for B
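As a sanity check, here is a minimal sketch of the smoothed estimate (the helper name and signature are ours); we reuse it below when varying the smoothing strength:

```python
def laplace(count, class_count, n_values, k=1):
    """Add-k smoothed estimate of P(F = f | Y = y)."""
    return (count + k) / (class_count + k * n_values)

print(laplace(8, 10, 2))   # class A: 9/12 = 0.75
print(laplace(0, 10, 2))   # class B: 1/12 ~ 0.083, no longer zero
```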

Generalized Laplace Smoothing

Adding Parameter $k$

Basic Laplace can be too strong for some applications. We generalize with parameter $k$:

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + k}{\text{count}(Y = y) + k \cdot |F|}$$

Interpretation:

  • $k = 0$: No smoothing (maximum likelihood)

  • $k = 1$: Standard Laplace smoothing

  • $k > 1$: Stronger smoothing (more conservative)

  • $0 < k < 1$: Weaker smoothing

Choosing $k$: The Smoothing Trade-off

Small $k$ (e.g., $k = 0.1$):

  • Pros: Stays closer to observed data

  • Cons: Less protection against zeros, more overfitting

  • Use when: Lots of training data

Large $k$ (e.g., $k = 10$):

  • Pros: Strong protection against zeros, reduces overfitting

  • Cons: May over-smooth, losing distinctions in data

  • Use when: Little training data or many features

Just right $k$ (e.g., $k = 1$):

  • Standard Laplace is often a good default

  • Can tune on validation set

Example: Smoothing Extremes

Training: 100 examples, feature appears 20 times in class A

$k = 0$ (No smoothing):

$$P(F \mid A) = \frac{20}{100} = 0.20$$

$k = 1$ (Standard Laplace):

$$P(F \mid A) = \frac{20 + 1}{100 + 2} = \frac{21}{102} \approx 0.206$$

$k = 100$ (Heavy smoothing):

$$P(F \mid A) = \frac{20 + 100}{100 + 200} = \frac{120}{300} = 0.40$$

Observation: As $k$ increases, probability shifts toward uniform (0.5 for binary)

Risk of Over-Smoothing: With $k = 100$, we’ve drowned out the signal from the data (the estimate moves from $20/100 = 20\%$ up to $40\%$)
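Reusing the laplace helper sketched earlier, the three regimes above are one loop:

```python
for k in (0, 1, 100):
    print(k, laplace(20, 100, 2, k))   # 0.2, ~0.206, 0.4 -- large k pulls the estimate toward 0.5
```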

Maximum Likelihood Estimation (MLE)

Formal Framework

Goal: Find parameters $\theta$ that make observed data most likely.

Likelihood Function:

$$L(\theta) = P(D \mid \theta) = \prod_{i=1}^N P(x^{(i)}, y^{(i)} \mid \theta)$$

where $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is the training set.

Maximum Likelihood Estimator:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta L(\theta)$$

Log-Likelihood (easier to optimize):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^N \log P(x^{(i)}, y^{(i)} \mid \theta)$$

Key Insight: Maximizing $L(\theta)$ is equivalent to maximizing $\ell(\theta)$ (logarithm is monotonic).

MLE for Naive Bayes

Parameters: $\theta = (P(Y), P(F_1 \mid Y), \ldots, P(F_n \mid Y))$

Log-Likelihood (using Naive Bayes factorization):

$$\ell(\theta) = \sum_{i=1}^N \left[ \log P(y^{(i)}) + \sum_{j=1}^n \log P(f_j^{(i)} \mid y^{(i)}) \right]$$

Optimization: Take derivatives and set them to zero, enforcing the constraint that each distribution sums to 1 (e.g., with Lagrange multipliers).

Result (for categorical distributions):

$$\hat{P}(Y = y) = \frac{\text{count}(Y = y)}{N}$$

$$\hat{P}(F_j = f \mid Y = y) = \frac{\text{count}(F_j = f, Y = y)}{\text{count}(Y = y)}$$

This is exactly what we’ve been doing! Counting frequencies is MLE for Naive Bayes.
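A minimal sketch of these counting estimates on a toy dataset (the data and variable names are ours, for illustration):

```python
from collections import Counter

# (label, feature value) pairs for one binary feature
data = [("spam", 1), ("spam", 1), ("spam", 0), ("ham", 0), ("ham", 0)]

y_counts = Counter(y for y, _ in data)   # count(Y = y)
fy_counts = Counter(data)                # count(Y = y, F = f)
N = len(data)

p_spam = y_counts["spam"] / N                                 # MLE of P(Y = spam): 3/5
p_f1_given_spam = fy_counts[("spam", 1)] / y_counts["spam"]   # MLE of P(F = 1 | spam): 2/3
print(p_spam, p_f1_given_spam)
```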

The Overfitting Problem in MLE

MLE Issue: Assigns probability zero to unseen events.

Why? MLE maximizes likelihood of observed data only. It doesn’t care about generalization.

Example:

  • Flip a coin 3 times, get HHH

  • MLE estimate: $P(H) = 1.0$, $P(T) = 0.0$

  • Clearly overfit! We don’t believe the coin has no tails.
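For comparison, the add-one estimate applied to those same three flips gives $\hat{P}(H) = \frac{3 + 1}{3 + 2} = 0.8$ and $\hat{P}(T) = \frac{0 + 1}{3 + 2} = 0.2$: still confident, but no outcome is ruled out.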

Smoothing as Regularization: Laplace smoothing adds a prior that prevents extreme estimates.

Bayesian Interpretation of Smoothing

Maximum A Posteriori (MAP) Estimation

Bayesian View: Parameters are random variables with a prior distribution.

Prior: $P(\theta)$ - Our belief about parameters before seeing data

Posterior: $P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)$ (Bayes’ rule)

MAP Estimate:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \left[ P(D \mid \theta) \cdot P(\theta) \right]$$

Laplace Smoothing = MAP with Dirichlet Prior:

For the parameter $P(F = f \mid Y = y)$, use a symmetric Dirichlet prior that contributes $k$ pseudo-counts to every value $f$ (a Dirichlet with concentration parameter $k + 1$):

$$P(\theta) \propto \prod_f \theta_f^{k}$$

MAP Estimate:

$$\hat{P}(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + k}{\text{count}(Y = y) + k \cdot |F|}$$

Exactly Laplace smoothing!
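A quick sketch of why, writing $n_f = \text{count}(F = f, Y = y)$ and $N_y = \text{count}(Y = y)$: the log posterior is

$$\log P(\theta \mid D) = \text{const} + \sum_f (n_f + k) \log \theta_f \quad \text{subject to} \quad \sum_f \theta_f = 1$$

Maximizing with a Lagrange multiplier gives $\theta_f \propto n_f + k$, and normalizing yields $\theta_f = \frac{n_f + k}{N_y + k \cdot |F|}$.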

Interpretation:

  • MLE: “Only trust the data”

  • MAP: “Combine data with prior belief”

  • Prior encodes: “I believe feature values should be somewhat uniform”

Log Probabilities: Numerical Stability

The Underflow Problem

Naive Bayes Classification:

$$P(Y \mid F_1, \ldots, F_n) \propto P(Y) \prod_{i=1}^n P(F_i \mid Y)$$

Problem: With many features, product of small probabilities underflows.

Example: 500 features, each with $P(F_i \mid Y) \approx 0.1$

$$0.1^{500} = 10^{-500}$$

Far below the smallest positive double-precision value (about $10^{-308}$)! The computer stores it as 0.0, losing all information.
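A quick way to see both the failure and the fix in ordinary double-precision arithmetic (a minimal sketch):

```python
import math

p = 1.0
for _ in range(500):
    p *= 0.1              # underflows once the product drops below ~1e-308
print(p)                  # 0.0 -- the true value 1e-500 is unrepresentable

s = sum(math.log(0.1) for _ in range(500))
print(s)                  # about -1151.3 -- the log-space sum is perfectly well behaved
```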

Solution: Log-Space Computation

Log Probabilities:

$$\log P(Y \mid F_1, \ldots, F_n) = \log P(Y) + \sum_{i=1}^n \log P(F_i \mid Y) + \text{const}$$

Product becomes sum: Numerically stable!

Classification Algorithm (log-space):

```python
from math import log, inf

def classify_logspace(features, classes, p_y, p_f_given_y):
    """Pick the class with the highest unnormalized log posterior.

    p_y maps class -> prior probability; p_f_given_y maps
    (feature_index, value, class) -> smoothed conditional probability.
    """
    best_y = None
    best_score = -inf

    for y in classes:
        # Compute the log probability (unnormalized):
        # log P(y) + sum_i log P(f_i | y) -- a sum, so no underflow
        score = log(p_y[y])
        for i, f in enumerate(features):
            score += log(p_f_given_y[(i, f, y)])

        if score > best_score:
            best_score = score
            best_y = y

    return best_y
```


Note: For classification we only need relative scores (the argmax), so the normalization constant doesn’t matter.

If you need actual probabilities:

  1. Compute log probabilities: $\ell_1, \ell_2, \ldots, \ell_k$

  2. Find max: $\ell_{\max} = \max_i \ell_i$

  3. Compute: $P_i = \frac{e^{\ell_i - \ell_{\max}}}{\sum_j e^{\ell_j - \ell_{\max}}}$

Subtracting $\ell_{\max}$ prevents overflow in exponentiation.
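A minimal sketch of that three-step recipe (the function name is ours):

```python
from math import exp

def normalize_log_probs(log_probs):
    """Turn unnormalized log scores into probabilities via the log-sum-exp trick."""
    m = max(log_probs)                      # step 2: find the max
    exps = [exp(l - m) for l in log_probs]  # step 3: every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

print(normalize_log_probs([-1000.0, -1001.0]))  # ~[0.731, 0.269]; naive exp(-1000) underflows to 0.0
```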

Practical Example: Spam Detection with Smoothing

Dataset

Training: 10,000 emails (5,000 spam, 5,000 ham)

Features:

  • Word “lottery”: appears in 2000 spam, 10 ham

  • Word “meeting”: appears in 100 spam, 1500 ham

  • Word “xyzabc” (rare): appears in 0 spam, 0 ham

Without Smoothing (MLE)

$$P(\text{lottery} \mid \text{spam}) = \frac{2000}{5000} = 0.4$$

$$P(\text{lottery} \mid \text{ham}) = \frac{10}{5000} = 0.002$$

$$P(\text{xyzabc} \mid \text{spam}) = \frac{0}{5000} = 0.0 \quad \text{(PROBLEM)}$$

New email: Contains “xyzabc” and 50 other features suggesting spam

Classification fails! Zero probability for both classes.

With Laplace Smoothing ($k = 1$, $|F| = 2$)

$$P(\text{lottery} \mid \text{spam}) = \frac{2000 + 1}{5000 + 2} \approx 0.4$$

$$P(\text{lottery} \mid \text{ham}) = \frac{10 + 1}{5000 + 2} \approx 0.0022$$

$$P(\text{xyzabc} \mid \text{spam}) = \frac{0 + 1}{5000 + 2} \approx 0.0002$$

$$P(\text{xyzabc} \mid \text{ham}) = \frac{0 + 1}{5000 + 2} \approx 0.0002$$

Observations:

  • “lottery” probabilities barely changed (data-driven)

  • “xyzabc” gets small non-zero probability (safe default)

  • Classification can proceed normally

Log-Space Computation:

```python
from math import log

log_p_spam = log(0.5)       # prior P(spam)
log_p_spam += log(0.4)      # P(lottery | spam)
log_p_spam += log(0.0002)   # P(xyzabc | spam)
# ... plus log P(f | spam) for each remaining feature

# Sums of logs stay in a comfortable numeric range: no underflow!
```


Practice Problems

Problem 1: Smoothing Effect

Training data: 20 examples of class A, feature present 0 times

Compute $P(F \mid A)$ with:

  1. $k = 0$ (no smoothing)

  2. $k = 0.5$

  3. $k = 1$

  4. $k = 5$

How does the probability change? When would each be appropriate?

Problem 2: Prior Strength

You have 10 training examples. Compare:

  • Laplace smoothing with $k = 1$

  • No smoothing

Which has stronger influence: the prior or the data? What if you have 10,000 examples?

Problem 3: Log-Space Arithmetic

Compute $P(Y \mid F_1, F_2, F_3)$ given:

  • $P(Y) = 0.01$

  • $P(F_1 \mid Y) = 0.1$

  • $P(F_2 \mid Y) = 0.2$

  • $P(F_3 \mid Y) = 0.05$

Do this:

  1. Directly (multiplying probabilities)

  2. In log-space

Which is more numerically stable?

Conclusion

Laplace smoothing and maximum likelihood estimation complete our understanding of Naive Bayes:

  1. Zero-Frequency Problem: MLE assigns zero probability to unseen events, causing catastrophic failures

  2. Laplace Smoothing: Add pseudo-counts to prevent zeros while minimally affecting frequent events

  3. Smoothing Parameter $k$: Controls trade-off between data fidelity and generalization

  4. MLE Formalization: Counting frequencies maximizes likelihood of observed data

  5. MAP Interpretation: Smoothing is Bayesian estimation with Dirichlet prior

  6. Log-Space Computation: Prevents numerical underflow for many features

Key Takeaways:

  • Always use smoothing in Naive Bayes (even $k = 0.1$ is better than $k = 0$)

  • Default $k = 1$ works well; tune on validation set if needed

  • Use log probabilities for numerical stability

  • Smoothing is a form of regularization that improves generalization

Practical Impact: These techniques make Naive Bayes robust and reliable for real-world applications with limited training data.

Further Reading

  • Machine Learning: A Probabilistic Perspective by Murphy (Chapter 3.4) - Bayesian parameter estimation

  • Pattern Recognition and Machine Learning by Bishop (Chapter 2.2) - Dirichlet distributions and priors

  • Additive Smoothing (Wikipedia) - Mathematical treatment

  • Speech and Language Processing by Jurafsky and Martin (Chapter 4) - Smoothing in NLP

  • sklearn.naive_bayes.MultinomialNB - Implementation with smoothing parameter alpha


This lecture covered Laplace smoothing, maximum likelihood estimation, and numerical stability in Naive Bayes. Next, we transition to discriminative models with the Perceptron algorithm.