Laplace Smoothing and Maximum Likelihood Estimation in Naive Bayes
Learn how to handle zero-frequency problems and overfitting in Naive Bayes through Laplace smoothing and understand the theoretical foundations of maximum likelihood estimation.
In the previous lecture, we learned the Naive Bayes classification algorithm and saw its power for supervised learning. However, we encountered a critical problem: What happens when a feature value never appears in the training data for a particular class?
This zero-frequency problem can cause catastrophic failures—a single unseen feature can force the probability of an entire class to zero, regardless of all other evidence. This lecture explores Laplace smoothing, a principled solution that prevents this problem while maintaining the spirit of Naive Bayes.
We’ll also formalize our parameter learning through maximum likelihood estimation (MLE) and understand the trade-offs between overfitting and generalization.
The Zero-Frequency Problem
Example: Spam Classification Catastrophe
Training Data: 1000 spam emails, 1000 ham emails
Feature: Contains word “unsubscribe”
- Spam: 500/1000 emails contain “unsubscribe”
- Ham: 0/1000 emails contain “unsubscribe”
Learned Parameters:

$$P(F_{\text{unsubscribe}} = \text{yes} \mid \text{spam}) = \frac{500}{1000} = 0.5, \qquad P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = \frac{0}{1000} = 0$$
New Email: Contains “unsubscribe” + other features suggesting ham
Classification:

Since $P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = 0$, the entire product for the ham class collapses:

$$P(\text{ham}) \cdot P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) \cdot \prod_{\text{other } i} P(F_i = f_i \mid \text{ham}) = 0$$

Result: Classified as spam regardless of all other features!
Problem: One zero probability destroys the entire inference, even if 99 other features strongly suggest ham.
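To see the failure concretely, here is a minimal Python sketch. The non-“unsubscribe” feature probabilities are made-up illustrative values; only $P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = 0$ and $0.5$ for spam come from the counts above.

# Illustrative per-feature likelihoods for one email; the last factor is "unsubscribe".
ham_factors = [0.9, 0.8, 0.95, 0.0]    # strong ham evidence, but P(unsubscribe | ham) = 0
spam_factors = [0.1, 0.2, 0.05, 0.5]   # weak spam evidence, P(unsubscribe | spam) = 0.5

p_ham, p_spam = 0.5, 0.5               # equal class priors
for p in ham_factors:
    p_ham *= p
for p in spam_factors:
    p_spam *= p

print(p_ham, p_spam)  # 0.0 vs 0.00025 -> spam wins despite all the other evidence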
Why This Happens
Maximum Likelihood Estimation (naive version):

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y)}{\text{count}(Y = y)}$$

If the count is zero: $P(F = f \mid Y = y) = 0$.
Consequences:
- Overfitting: Model is too confident about unseen events
- Fragility: Classification can fail catastrophically
- Poor generalization: Training data never covers all possibilities
Laplace Smoothing: The Solution
Basic Laplace Smoothing (Add-One)
Idea: Pretend we’ve seen every feature value at least once in every class.
Modified Estimate:

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + 1}{\text{count}(Y = y) + |F|}$$

where $|F|$ is the number of possible values for feature $F$.
Interpretation:
- Add 1 pseudo-count to every feature-value combination
- Add $|F|$ to the denominator (to maintain normalization)
Effect:
- No more zeros: Every probability is at least $\frac{1}{\text{count}(Y=y) + |F|} > 0$
- Small change for frequent events: If the count is large, $+1$ is negligible
- Significant change for rare events: Prevents overfitting to small counts
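As a concrete illustration, here is a minimal Python sketch of the add-one estimate; the function name laplace_estimate and the example counts are purely illustrative:

def laplace_estimate(count_fy, count_y, num_values):
    """Add-one (Laplace) estimate of P(F = f | Y = y)."""
    return (count_fy + 1) / (count_y + num_values)

# Binary feature (|F| = 2): even a value never seen with a class
# gets a small, strictly positive probability.
print(laplace_estimate(0, 10, 2))   # 1/12 ~= 0.083 instead of 0.0
print(laplace_estimate(8, 10, 2))   # 9/12 = 0.75  instead of 0.8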
Example: Binary Feature
Training Data:
- Class A: Feature present 8 times, absent 2 times
- Class B: Feature present 0 times, absent 10 times
Without Smoothing:

$$P(\text{present} \mid A) = \frac{8}{10} = 0.8, \qquad P(\text{present} \mid B) = \frac{0}{10} = 0$$

With Laplace Smoothing ($|F| = 2$ for binary):

$$P(\text{present} \mid A) = \frac{8 + 1}{10 + 2} = 0.75, \qquad P(\text{present} \mid B) = \frac{0 + 1}{10 + 2} \approx 0.083$$
Observations:
- Class A: Probability decreased slightly (0.8 → 0.75)
- Class B: Probability no longer zero (0 → 0.083)
- Both probabilities sum correctly: $0.75 + 0.25 = 1$ for A, $0.083 + 0.917 = 1$ for B
Generalized Laplace Smoothing
Adding Parameter $k$
Basic Laplace can be too strong for some applications. We generalize with a smoothing parameter $k$:

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + k}{\text{count}(Y = y) + k\,|F|}$$
Interpretation:
- $k = 0$: No smoothing (maximum likelihood)
- $k = 1$: Standard Laplace smoothing
- $k > 1$: Stronger smoothing (more conservative)
- $0 < k < 1$: Weaker smoothing
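The add-one sketch above extends directly to add-$k$ smoothing; again, the function name and counts are illustrative only:

def add_k_estimate(count_fy, count_y, num_values, k=1.0):
    """Add-k (generalized Laplace) estimate of P(F = f | Y = y)."""
    return (count_fy + k) / (count_y + k * num_values)

# k = 0 recovers plain MLE; larger k pulls the estimate toward uniform (0.5 here).
for k in (0.0, 0.1, 1.0, 10.0):
    print(k, add_k_estimate(0, 10, 2, k))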
Choosing $k$: The Smoothing Trade-off
Small $k$ (e.g., $k = 0.1$):
- Pros: Stays closer to observed data
- Cons: Less protection against zeros, more overfitting
- Use when: Lots of training data
Large $k$ (e.g., $k = 10$):
- Pros: Strong protection against zeros, reduces overfitting
- Cons: May over-smooth, losing distinctions in the data
- Use when: Little training data or many features
Just right $k$ (e.g., $k = 1$):
- Standard Laplace is often a good default
- Can tune on a validation set
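Tuning $k$ on a validation set usually comes down to a small grid search. A minimal sketch, assuming you already have a train(k) routine that fits a smoothed Naive Bayes model and an accuracy(model, data) routine that scores it (both are hypothetical helpers, not defined in this lecture):

def tune_k(train, accuracy, validation_data,
           candidates=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Return the smoothing value with the best validation accuracy."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        model = train(k)                       # fit with smoothing parameter k
        acc = accuracy(model, validation_data)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k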
Example: Smoothing Extremes
Training: 100 examples of class A; a binary feature ($|F| = 2$) appears 20 times

$k = 0$ (No smoothing):

$$P(F \mid A) = \frac{20}{100} = 0.20$$

$k = 1$ (Standard Laplace):

$$P(F \mid A) = \frac{20 + 1}{100 + 2} \approx 0.206$$

$k = 100$ (Heavy smoothing):

$$P(F \mid A) = \frac{20 + 100}{100 + 200} = 0.40$$
Observation: As $k$ increases, probability shifts toward uniform (0.5 for binary)
Risk of Over-Smoothing: With $k = 100$, we’ve lost the signal from data (20/100 = 20% → 40%)
Maximum Likelihood Estimation (MLE)
Formal Framework
Goal: Find parameters $\theta$ that make observed data most likely.
Likelihood Function:

$$L(\theta) = P(D \mid \theta) = \prod_{i=1}^{N} P\!\left(x^{(i)}, y^{(i)} \mid \theta\right)$$

where $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is the training set.

Maximum Likelihood Estimator:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} L(\theta)$$

Log-Likelihood (easier to optimize):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log P\!\left(x^{(i)}, y^{(i)} \mid \theta\right)$$
Key Insight: Maximizing $L(\theta)$ is equivalent to maximizing $\ell(\theta)$ (logarithm is monotonic).
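A tiny numerical illustration of why the log-likelihood is the convenient form to work with (the per-example probabilities below are made up):

import math

# Hypothetical values of P(x_i, y_i | theta) for five training examples
probs = [0.2, 0.05, 0.1, 0.3, 0.01]

likelihood = math.prod(probs)                        # L(theta): product over examples
log_likelihood = sum(math.log(p) for p in probs)     # l(theta): sum of logs

print(likelihood)        # 3e-06: shrinks exponentially with N
print(log_likelihood)    # about -12.72: stays in a comfortable range
print(math.isclose(math.log(likelihood), log_likelihood))  # True: log is monotone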
MLE for Naive Bayes
Parameters: $\theta = (P(Y), P(F_1 \mid Y), \ldots, P(F_n \mid Y))$
Log-Likelihood (using the Naive Bayes factorization):

$$\ell(\theta) = \sum_{i=1}^{N} \left[ \log P\!\left(y^{(i)}\right) + \sum_{j=1}^{n} \log P\!\left(f_j^{(i)} \mid y^{(i)}\right) \right]$$
Optimization: Take derivatives and set to zero (with constraint that probabilities sum to 1).
Result (for categorical distributions):

$$\hat{P}(Y = y) = \frac{\text{count}(Y = y)}{N}, \qquad \hat{P}(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y)}{\text{count}(Y = y)}$$
This is exactly what we’ve been doing! Counting frequencies is MLE for Naive Bayes.
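A minimal counting sketch of these MLE estimates (the toy dataset and variable names are illustrative):

from collections import Counter

# Toy dataset of (feature_value, class_label) pairs
data = [("yes", "spam"), ("yes", "spam"), ("no", "spam"), ("no", "ham"), ("no", "ham")]

N = len(data)
class_counts = Counter(y for _, y in data)
joint_counts = Counter((f, y) for f, y in data)

# MLE = relative frequencies
p_y = {y: c / N for y, c in class_counts.items()}
p_f_given_y = {(f, y): c / class_counts[y] for (f, y), c in joint_counts.items()}

print(p_y)            # {'spam': 0.6, 'ham': 0.4}
print(p_f_given_y)    # e.g. P(F='yes' | spam) = 2/3, P(F='no' | ham) = 1.0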
The Overfitting Problem in MLE
MLE Issue: Assigns probability zero to unseen events.
Why? MLE maximizes likelihood of observed data only. It doesn’t care about generalization.
Example:
- Flip a coin 3 times, get HHH
- MLE estimate: $P(H) = 1.0$, $P(T) = 0.0$
- Clearly overfit! We don’t believe the coin has no tails.
Smoothing as Regularization: Laplace smoothing adds a prior that prevents extreme estimates.
Bayesian Interpretation of Smoothing
Maximum A Posteriori (MAP) Estimation
Bayesian View: Parameters are random variables with a prior distribution.
Prior: $P(\theta)$ - Our belief about parameters before seeing data
Posterior: $P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)$ (Bayes’ rule)
MAP Estimate:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)$$
Laplace Smoothing = MAP with Dirichlet Prior:
For the parameter $P(F = f \mid Y = y)$, use a Dirichlet prior that adds $k$ pseudo-counts to every value of $F$.

MAP Estimate:

$$\hat{P}(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + k}{\text{count}(Y = y) + k\,|F|}$$

Exactly Laplace smoothing!
Interpretation:
- MLE: “Only trust the data”
- MAP: “Combine data with prior belief”
- Prior encodes: “I believe feature values should be somewhat uniform”
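A quick numerical sanity check of the equivalence, following the lecture’s pseudo-count convention (the counts reused here are class B from the binary-feature example above):

# Bayesian view: start from k pseudo-counts per value, add the observed counts,
# then normalize the augmented counts to get the MAP estimate.
k = 1.0
observed = {"present": 0, "absent": 10}        # class B from the earlier example
augmented = {v: c + k for v, c in observed.items()}
total = sum(augmented.values())
map_estimate = {v: c / total for v, c in augmented.items()}
print(map_estimate)   # {'present': 0.083..., 'absent': 0.916...} -- same as Laplace smoothing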
Log Probabilities: Numerical Stability
The Underflow Problem
Naive Bayes Classification:

$$\hat{y} = \arg\max_{y} \; P(Y = y) \prod_{i=1}^{n} P(F_i = f_i \mid Y = y)$$
Problem: With many features, product of small probabilities underflows.
Example: 100 features, each with $P(F_i \mid Y) \approx 0.1$, give a product of roughly $0.1^{100} = 10^{-100}$.

With a few hundred such factors the product drops below the smallest positive double (about $10^{-308}$); the computer stores it as 0.0, losing all information.
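A quick demonstration of the underflow in standard double-precision floats:

import math

p = 1.0
for _ in range(400):      # 400 factors of 0.1
    p *= 0.1
print(p)                  # 0.0 -- the product has underflowed

log_p = sum(math.log(0.1) for _ in range(400))
print(log_p)              # about -921.0 -- the log-space sum is perfectly representable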
Solution: Log-Space Computation
Log Probabilities:

$$\log\!\left(P(Y = y) \prod_{i} P(F_i = f_i \mid Y = y)\right) = \log P(Y = y) + \sum_{i} \log P(F_i = f_i \mid Y = y)$$

The product becomes a sum: numerically stable!
Classification Algorithm (log-space):
import math

def classify_logspace(features, P_Y, P_F_given_Y):
    """Return the most probable class, scoring in log-space.

    P_Y[y] = P(Y = y); P_F_given_Y[(i, f, y)] = P(F_i = f | Y = y).
    """
    best_y = None
    best_score = -math.inf
    for y in P_Y:                                    # iterate over the classes
        # Compute log probability (unnormalized)
        score = math.log(P_Y[y])
        for i, f_i in enumerate(features):
            score += math.log(P_F_given_Y[(i, f_i, y)])
        if score > best_score:
            best_score = score
            best_y = y
    return best_y
Note: For classification, we only need relative probabilities (argmax), so normalization constant doesn’t matter.
If you need actual probabilities:
- Compute log probabilities: $\ell_1, \ell_2, \ldots, \ell_k$
- Find the max: $\ell_{\max} = \max_i \ell_i$
- Compute: $P_i = \frac{e^{\ell_i - \ell_{\max}}}{\sum_j e^{\ell_j - \ell_{\max}}}$
Subtracting $\ell_{\max}$ prevents overflow in exponentiation.
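A minimal sketch of this normalization step (often called the log-sum-exp or softmax trick); the function name is illustrative:

import math

def normalize_log_probs(log_scores):
    """Convert unnormalized log probabilities into probabilities that sum to 1."""
    m = max(log_scores)                           # subtract the max to avoid overflow/underflow
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

print(normalize_log_probs([-800.0, -801.0, -805.0]))
# ~ [0.726, 0.268, 0.005]; exponentiating -800 directly would underflow to 0.0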
Practical Example: Spam Detection with Smoothing
Dataset
Training: 10,000 emails (5,000 spam, 5,000 ham)
Features:
- Word “lottery”: appears in 2000 spam, 10 ham
- Word “meeting”: appears in 100 spam, 1500 ham
- Word “xyzabc” (rare): appears in 0 spam, 0 ham
Without Smoothing (MLE)

$$P(\text{xyzabc} \mid \text{spam}) = \frac{0}{5000} = 0, \qquad P(\text{xyzabc} \mid \text{ham}) = \frac{0}{5000} = 0$$

New email: Contains “xyzabc” and 50 other features suggesting spam

Classification fails! Zero probability for both classes.
With Laplace Smoothing ($k = 1$, $|F| = 2$)

$$P(\text{lottery} \mid \text{spam}) = \frac{2000 + 1}{5000 + 2} \approx 0.4, \qquad P(\text{lottery} \mid \text{ham}) = \frac{10 + 1}{5000 + 2} \approx 0.0022$$

$$P(\text{xyzabc} \mid \text{spam}) = P(\text{xyzabc} \mid \text{ham}) = \frac{0 + 1}{5000 + 2} \approx 0.0002$$
Observations:
- “lottery” probabilities barely changed (data-driven)
- “xyzabc” gets a small non-zero probability (safe default)
- Classification can proceed normally
Log-Space Computation:
from math import log

log_p_spam = log(0.5)        # prior P(spam)
log_p_spam += log(0.4)       # P(lottery | spam)
log_p_spam += log(0.0002)    # P(xyzabc | spam)
# ... add the log-probabilities of the remaining features ...
# No underflow!
Practice Problems
Problem 1: Smoothing Effect
Training data: 20 examples of class A, feature present 0 times
Compute $P(F \mid A)$ with:
- $k = 0$ (no smoothing)
- $k = 0.5$
- $k = 1$
- $k = 5$
How does the probability change? When would each be appropriate?
Problem 2: Prior Strength
You have 10 training examples. Compare:
- Laplace smoothing with $k = 1$
- No smoothing
Which has stronger influence: the prior or the data? What if you have 10,000 examples?
Problem 3: Log-Space Arithmetic
Compute $P(Y \mid F_1, F_2, F_3)$ given:
- $P(Y) = 0.01$
- $P(F_1 \mid Y) = 0.1$
- $P(F_2 \mid Y) = 0.2$
- $P(F_3 \mid Y) = 0.05$
Do this:
- Directly (multiplying probabilities)
- In log-space
Which is more numerically stable?
Conclusion
Laplace smoothing and maximum likelihood estimation complete our understanding of Naive Bayes:
- Zero-Frequency Problem: MLE assigns zero probability to unseen events, causing catastrophic failures
- Laplace Smoothing: Add pseudo-counts to prevent zeros while minimally affecting frequent events
- Smoothing Parameter $k$: Controls the trade-off between data fidelity and generalization
- MLE Formalization: Counting frequencies maximizes the likelihood of the observed data
- MAP Interpretation: Smoothing is Bayesian estimation with a Dirichlet prior
- Log-Space Computation: Prevents numerical underflow with many features
Key Takeaways:
- Always use smoothing in Naive Bayes (even $k = 0.1$ is better than $k = 0$)
- The default $k = 1$ works well; tune on a validation set if needed
- Use log probabilities for numerical stability
- Smoothing is a form of regularization that improves generalization
Practical Impact: These techniques make Naive Bayes robust and reliable for real-world applications with limited training data.
Further Reading
- Machine Learning: A Probabilistic Perspective by Murphy (Chapter 3.4) - Bayesian parameter estimation
- Pattern Recognition and Machine Learning by Bishop (Chapter 2.2) - Dirichlet distributions and priors
- Additive Smoothing (Wikipedia) - Mathematical treatment
- Speech and Language Processing by Jurafsky and Martin (Chapter 4) - Smoothing in NLP
- sklearn.naive_bayes.MultinomialNB - Implementation with smoothing parameter alpha
This lecture covered Laplace smoothing, maximum likelihood estimation, and numerical stability in Naive Bayes. Next, we transition to discriminative models with the Perceptron algorithm.