Laplace Smoothing and Maximum Likelihood Estimation in Naive Bayes

In the previous lecture, we learned the Naive Bayes classification algorithm and saw its power for supervised learning. However, we encountered a critical problem: What happens when a feature value never appears in the training data for a particular class?

This zero-frequency problem can cause catastrophic failures—a single unseen feature can force the probability of an entire class to zero, regardless of all other evidence. This lecture explores Laplace smoothing, a principled solution that prevents this problem while maintaining the spirit of Naive Bayes.

We’ll also formalize our parameter learning through maximum likelihood estimation (MLE) and understand the trade-offs between overfitting and generalization.

The Zero-Frequency Problem

Example: Spam Classification Catastrophe

Training Data: 1000 spam emails, 1000 ham emails

Feature: Contains word “unsubscribe”

  • Spam: 500/1000 emails contain “unsubscribe”

  • Ham: 0/1000 emails contain “unsubscribe”

Learned Parameters:

$$P(F_{\text{unsubscribe}} = \text{yes} \mid \text{spam}) = 0.5$$

$$P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = 0.0$$

New Email: Contains “unsubscribe” + other features suggesting ham

Classification:

$$P(\text{ham} \mid \text{features}) \propto P(\text{ham}) \prod_i P(F_i \mid \text{ham})$$

Since $P(F_{\text{unsubscribe}} = \text{yes} \mid \text{ham}) = 0$:

$$P(\text{ham} \mid \text{features}) \propto 0.5 \times \cdots \times 0 = 0$$

Result: Classified as spam regardless of all other features!

Problem: One zero probability destroys the entire inference, even if 99 other features strongly suggest ham.
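A tiny numerical sketch of the failure (hypothetical probabilities; imagine 99 features that each favor ham):

```python
# 99 features each strongly favor ham, but "unsubscribe" was never seen in ham
factors = [0.9] * 99 + [0.0]

posterior_ham = 0.5          # prior P(ham)
for p in factors:
    posterior_ham *= p
print(posterior_ham)         # 0.0 -- ham is ruled out by a single zero factor
```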

Why This Happens

Maximum Likelihood Estimation (naive version):

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y)}{\text{count}(Y = y)}$$

If count is zero: $P(F = f \mid Y = y) = 0$

Consequences:

  1. Overfitting: Model is too confident about unseen events

  2. Fragility: Classification can fail catastrophically

  3. Poor generalization: Training data never covers all possibilities

Laplace Smoothing: The Solution

Basic Laplace Smoothing (Add-One)

Idea: Pretend we’ve seen every feature value at least once in every class.

Modified Estimate:

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + 1}{\text{count}(Y = y) + |F|}$$

where $|F|$ is the number of possible values of feature $F$.

Interpretation:

  • Add 1 pseudo-count to every feature-value combination

  • Add $|F|$ to the denominator (to maintain normalization)

Effect:

  • No more zeros: Every probability is at least $\frac{1}{\text{count}(Y=y) + |F|} > 0$

  • Small change for frequent events: If count is large, $+1$ is negligible

  • Significant change for rare events: Prevents overfitting to small counts

Example: Binary Feature

Training Data:

  • Class A: Feature present 8 times, absent 2 times

  • Class B: Feature present 0 times, absent 10 times

Without Smoothing:

$$P(F = 1 \mid A) = \frac{8}{10} = 0.8$$

$$P(F = 1 \mid B) = \frac{0}{10} = 0.0 \quad \text{(PROBLEM!)}$$

With Laplace Smoothing ($|F| = 2$ for binary):

$$P(F = 1 \mid A) = \frac{8 + 1}{10 + 2} = \frac{9}{12} = 0.75$$

$$P(F = 1 \mid B) = \frac{0 + 1}{10 + 2} = \frac{1}{12} \approx 0.083$$

Observations:

  • Class A: Probability decreased slightly (0.8 → 0.75)

  • Class B: Probability no longer zero (0 → 0.083)

  • Both probabilities sum correctly: $0.75 + 0.25 = 1$ for A, $0.083 + 0.917 = 1$ for B
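As a sanity check, here is a minimal sketch of the smoothed estimate (the helper name and signature are ours); we reuse it below when varying the smoothing strength:

```python
def laplace(count, class_count, n_values, k=1):
    """Add-k smoothed estimate of P(F = f | Y = y)."""
    return (count + k) / (class_count + k * n_values)

print(laplace(8, 10, 2))   # class A: 9/12 = 0.75
print(laplace(0, 10, 2))   # class B: 1/12 ~ 0.083, no longer zero
```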

Generalized Laplace Smoothing

Adding Parameter $k$

Basic Laplace can be too strong for some applications. We generalize with parameter $k$:

$$P(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + k}{\text{count}(Y = y) + k \cdot |F|}$$

Interpretation:

  • $k = 0$: No smoothing (maximum likelihood)

  • $k = 1$: Standard Laplace smoothing

  • $k > 1$: Stronger smoothing (more conservative)

  • $0 < k < 1$: Weaker smoothing

Choosing $k$: The Smoothing Trade-off

Small $k$ (e.g., $k = 0.1$):

  • Pros: Stays closer to observed data

  • Cons: Less protection against zeros, more overfitting

  • Use when: Lots of training data

Large $k$ (e.g., $k = 10$):

  • Pros: Strong protection against zeros, reduces overfitting

  • Cons: May over-smooth, losing distinctions in data

  • Use when: Little training data or many features

Just right $k$ (e.g., $k = 1$):

  • Standard Laplace is often a good default

  • Can tune on validation set

Example: Smoothing Extremes

Training: 100 examples, feature appears 20 times in class A

$k = 0$ (No smoothing):

$$P(F \mid A) = \frac{20}{100} = 0.20$$

$k = 1$ (Standard Laplace):

$$P(F \mid A) = \frac{20 + 1}{100 + 2} = \frac{21}{102} \approx 0.206$$

$k = 100$ (Heavy smoothing):

$$P(F \mid A) = \frac{20 + 100}{100 + 200} = \frac{120}{300} = 0.40$$

Observation: As $k$ increases, probability shifts toward uniform (0.5 for binary)

Risk of Over-Smoothing: With $k = 100$, we’ve drowned out the signal from the data (the estimate moves from $20/100 = 20\%$ up to $40\%$)
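Reusing the laplace helper sketched earlier, the three regimes above are one loop:

```python
for k in (0, 1, 100):
    print(k, laplace(20, 100, 2, k))   # 0.2, ~0.206, 0.4 -- large k pulls the estimate toward 0.5
```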

Maximum Likelihood Estimation (MLE)

Formal Framework

Goal: Find parameters $\theta$ that make observed data most likely.

Likelihood Function:

$$L(\theta) = P(D \mid \theta) = \prod_{i=1}^N P(x^{(i)}, y^{(i)} \mid \theta)$$

where $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is the training set.

Maximum Likelihood Estimator:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta L(\theta)$$

Log-Likelihood (easier to optimize):

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^N \log P(x^{(i)}, y^{(i)} \mid \theta)$$

Key Insight: Maximizing $L(\theta)$ is equivalent to maximizing $\ell(\theta)$ (logarithm is monotonic).

MLE for Naive Bayes

Parameters: $\theta = (P(Y), P(F_1 \mid Y), \ldots, P(F_n \mid Y))$

Log-Likelihood (using Naive Bayes factorization):

$$\ell(\theta) = \sum_{i=1}^N \left[ \log P(y^{(i)}) + \sum_{j=1}^n \log P(f_j^{(i)} \mid y^{(i)}) \right]$$

Optimization: Take derivatives and set them to zero, enforcing the constraint that each distribution sums to 1 (e.g., with Lagrange multipliers).

Result (for categorical distributions):

$$\hat{P}(Y = y) = \frac{\text{count}(Y = y)}{N}$$

$$\hat{P}(F_j = f \mid Y = y) = \frac{\text{count}(F_j = f, Y = y)}{\text{count}(Y = y)}$$

This is exactly what we’ve been doing! Counting frequencies is MLE for Naive Bayes.
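A minimal sketch of these counting estimates on a toy dataset (the data and variable names are ours, for illustration):

```python
from collections import Counter

# (label, feature value) pairs for one binary feature
data = [("spam", 1), ("spam", 1), ("spam", 0), ("ham", 0), ("ham", 0)]

y_counts = Counter(y for y, _ in data)   # count(Y = y)
fy_counts = Counter(data)                # count(Y = y, F = f)
N = len(data)

p_spam = y_counts["spam"] / N                                 # MLE of P(Y = spam): 3/5
p_f1_given_spam = fy_counts[("spam", 1)] / y_counts["spam"]   # MLE of P(F = 1 | spam): 2/3
print(p_spam, p_f1_given_spam)
```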

The Overfitting Problem in MLE

MLE Issue: Assigns probability zero to unseen events.

Why? MLE maximizes likelihood of observed data only. It doesn’t care about generalization.

Example:

  • Flip a coin 3 times, get HHH

  • MLE estimate: $P(H) = 1.0$, $P(T) = 0.0$

  • Clearly overfit! We don’t believe the coin has no tails.
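For comparison, the add-one estimate applied to those same three flips gives $\hat{P}(H) = \frac{3 + 1}{3 + 2} = 0.8$ and $\hat{P}(T) = \frac{0 + 1}{3 + 2} = 0.2$: still confident, but no outcome is ruled out.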

Smoothing as Regularization: Laplace smoothing adds a prior that prevents extreme estimates.

Bayesian Interpretation of Smoothing

Maximum A Posteriori (MAP) Estimation

Bayesian View: Parameters are random variables with a prior distribution.

Prior: $P(\theta)$ - Our belief about parameters before seeing data

Posterior: $P(\theta \mid D) \propto P(D \mid \theta) \cdot P(\theta)$ (Bayes’ rule)

MAP Estimate:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \left[ P(D \mid \theta) \cdot P(\theta) \right]$$

Laplace Smoothing = MAP with Dirichlet Prior:

For the parameter $P(F = f \mid Y = y)$, use a symmetric Dirichlet prior that contributes $k$ pseudo-counts to every value $f$ (a Dirichlet with concentration parameter $k + 1$):

$$P(\theta) \propto \prod_f \theta_f^{k}$$

MAP Estimate:

$$\hat{P}(F = f \mid Y = y) = \frac{\text{count}(F = f, Y = y) + k}{\text{count}(Y = y) + k \cdot |F|}$$

Exactly Laplace smoothing!
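A quick sketch of why, writing $n_f = \text{count}(F = f, Y = y)$ and $N_y = \text{count}(Y = y)$: the log posterior is

$$\log P(\theta \mid D) = \text{const} + \sum_f (n_f + k) \log \theta_f \quad \text{subject to} \quad \sum_f \theta_f = 1$$

Maximizing with a Lagrange multiplier gives $\theta_f \propto n_f + k$, and normalizing yields $\theta_f = \frac{n_f + k}{N_y + k \cdot |F|}$.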

Interpretation:

  • MLE: “Only trust the data”

  • MAP: “Combine data with prior belief”

  • Prior encodes: “I believe feature values should be somewhat uniform”

Log Probabilities: Numerical Stability

The Underflow Problem

Naive Bayes Classification:

$$P(Y \mid F_1, \ldots, F_n) \propto P(Y) \prod_{i=1}^n P(F_i \mid Y)$$

Problem: With many features, product of small probabilities underflows.

Example: 500 features, each with $P(F_i \mid Y) \approx 0.1$

$$0.1^{500} = 10^{-500}$$

Far below the smallest positive double-precision value (about $10^{-308}$)! The computer stores it as 0.0, losing all information.
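A quick way to see both the failure and the fix in ordinary double-precision arithmetic (a minimal sketch):

```python
import math

p = 1.0
for _ in range(500):
    p *= 0.1              # underflows once the product drops below ~1e-308
print(p)                  # 0.0 -- the true value 1e-500 is unrepresentable

s = sum(math.log(0.1) for _ in range(500))
print(s)                  # about -1151.3 -- the log-space sum is perfectly well behaved
```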

Solution: Log-Space Computation

Log Probabilities:

$$\log P(Y \mid F_1, \ldots, F_n) = \log P(Y) + \sum_{i=1}^n \log P(F_i \mid Y) + \text{const}$$

Product becomes sum: Numerically stable!

Classification Algorithm (log-space):

```python
from math import log, inf

def classify_logspace(features, classes, p_y, p_f_given_y):
    """Pick the class with the highest unnormalized log posterior.

    p_y maps class -> prior probability; p_f_given_y maps
    (feature_index, value, class) -> smoothed conditional probability.
    """
    best_y = None
    best_score = -inf

    for y in classes:
        # Compute the log probability (unnormalized):
        # log P(y) + sum_i log P(f_i | y) -- a sum, so no underflow
        score = log(p_y[y])
        for i, f in enumerate(features):
            score += log(p_f_given_y[(i, f, y)])

        if score > best_score:
            best_score = score
            best_y = y

    return best_y
```


Note: For classification we only need relative scores (the argmax), so the normalization constant doesn’t matter.

If you need actual probabilities:

  1. Compute log probabilities: $\ell_1, \ell_2, \ldots, \ell_k$

  2. Find max: $\ell_{\max} = \max_i \ell_i$

  3. Compute: $P_i = \frac{e^{\ell_i - \ell_{\max}}}{\sum_j e^{\ell_j - \ell_{\max}}}$

Subtracting $\ell_{\max}$ prevents overflow in exponentiation.
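A minimal sketch of that three-step recipe (the function name is ours):

```python
from math import exp

def normalize_log_probs(log_probs):
    """Turn unnormalized log scores into probabilities via the log-sum-exp trick."""
    m = max(log_probs)                      # step 2: find the max
    exps = [exp(l - m) for l in log_probs]  # step 3: every exponent is <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

print(normalize_log_probs([-1000.0, -1001.0]))  # ~[0.731, 0.269]; naive exp(-1000) underflows to 0.0
```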

Practical Example: Spam Detection with Smoothing

Dataset

Training: 10,000 emails (5,000 spam, 5,000 ham)

Features:

  • Word “lottery”: appears in 2000 spam, 10 ham

  • Word “meeting”: appears in 100 spam, 1500 ham

  • Word “xyzabc” (rare): appears in 0 spam, 0 ham

Without Smoothing (MLE)

$$P(\text{lottery} \mid \text{spam}) = \frac{2000}{5000} = 0.4$$

$$P(\text{lottery} \mid \text{ham}) = \frac{10}{5000} = 0.002$$

$$P(\text{xyzabc} \mid \text{spam}) = \frac{0}{5000} = 0.0 \quad \text{(PROBLEM)}$$

New email: Contains “xyzabc” and 50 other features suggesting spam

Classification fails! Zero probability for both classes.

With Laplace Smoothing ($k = 1$, $|F| = 2$)

$$P(\text{lottery} \mid \text{spam}) = \frac{2000 + 1}{5000 + 2} \approx 0.4$$

$$P(\text{lottery} \mid \text{ham}) = \frac{10 + 1}{5000 + 2} \approx 0.0022$$

$$P(\text{xyzabc} \mid \text{spam}) = \frac{0 + 1}{5000 + 2} \approx 0.0002$$

$$P(\text{xyzabc} \mid \text{ham}) = \frac{0 + 1}{5000 + 2} \approx 0.0002$$

Observations:

  • “lottery” probabilities barely changed (data-driven)

  • “xyzabc” gets small non-zero probability (safe default)

  • Classification can proceed normally

Log-Space Computation:

```python
from math import log

log_p_spam = log(0.5)       # prior P(spam)
log_p_spam += log(0.4)      # P(lottery | spam)
log_p_spam += log(0.0002)   # P(xyzabc | spam)
# ... plus log P(f | spam) for each remaining feature

# Sums of logs stay in a comfortable numeric range: no underflow!
```


Practice Problems

Problem 1: Smoothing Effect

Training data: 20 examples of class A, feature present 0 times

Compute $P(F \mid A)$ with:

  1. $k = 0$ (no smoothing)

  2. $k = 0.5$

  3. $k = 1$

  4. $k = 5$

How does the probability change? When would each be appropriate?

Problem 2: Prior Strength

You have 10 training examples. Compare:

  • Laplace smoothing with $k = 1$

  • No smoothing

Which has stronger influence: the prior or the data? What if you have 10,000 examples?

Problem 3: Log-Space Arithmetic

Compute $P(Y \mid F_1, F_2, F_3)$ given:

  • $P(Y) = 0.01$

  • $P(F_1 \mid Y) = 0.1$

  • $P(F_2 \mid Y) = 0.2$

  • $P(F_3 \mid Y) = 0.05$

Do this:

  1. Directly (multiplying probabilities)

  2. In log-space

Which is more numerically stable?

Conclusion

Laplace smoothing and maximum likelihood estimation complete our understanding of Naive Bayes:

  1. Zero-Frequency Problem: MLE assigns zero probability to unseen events, causing catastrophic failures

  2. Laplace Smoothing: Add pseudo-counts to prevent zeros while minimally affecting frequent events

  3. Smoothing Parameter $k$: Controls trade-off between data fidelity and generalization

  4. MLE Formalization: Counting frequencies maximizes likelihood of observed data

  5. MAP Interpretation: Smoothing is Bayesian estimation with Dirichlet prior

  6. Log-Space Computation: Prevents numerical underflow for many features

Key Takeaways:

  • Always use smoothing in Naive Bayes (even $k = 0.1$ is better than $k = 0$)

  • Default $k = 1$ works well; tune on validation set if needed

  • Use log probabilities for numerical stability

  • Smoothing is a form of regularization that improves generalization

Practical Impact: These techniques make Naive Bayes robust and reliable for real-world applications with limited training data.

Further Reading

  • Machine Learning: A Probabilistic Perspective by Murphy (Chapter 3.4) - Bayesian parameter estimation

  • Pattern Recognition and Machine Learning by Bishop (Chapter 2.2) - Dirichlet distributions and priors

  • Additive Smoothing (Wikipedia) - Mathematical treatment

  • Speech and Language Processing by Jurafsky and Martin (Chapter 4) - Smoothing in NLP

  • sklearn.naive_bayes.MultinomialNB - Implementation with smoothing parameter alpha


This lecture covered Laplace smoothing, maximum likelihood estimation, and numerical stability in Naive Bayes. Next, we transition to discriminative models with the Perceptron algorithm.