Gaussian Naive Bayes: Part 3 — From Theory to Practice with Python
Series: Gaussian Naive Bayes Classifier
- Gaussian Naive Bayes: Part 1 — Introduction and Bayes Theorem Refresher
- Gaussian Naive Bayes: Part 2 — Mathematical Deep Dive and Optimization
- Gaussian Naive Bayes: Part 3 — From Theory to Practice with Python
In the previous part of the series, we took a deep dive into the mathematics behind Gaussian Naive Bayes and the optimizations we can apply to prevent numerical errors and improve computational efficiency. In this part of the series, we implement those insights and build a Gaussian Naive Bayes classifier from scratch in Python, using Test-Driven Development (TDD) to ensure that each mathematical concept is correctly translated into working code. We will train the model on the Iris dataset and use it to classify unseen samples.
A quick mathematical recap
Recall Bayes’ Theorem:
$$P(H \vert e) = \frac{P(e \vert H)\,P(H)}{P(e)}$$

Given the class variable $y$, and dependent feature vector $\vec{x} = [x_1, \ldots, x_n]$, the theorem defines the following relationship:

$$P(y \vert \vec{x}) = \frac{P(y)\,P(\vec{x} \vert y)}{P(\vec{x})}$$

Assuming conditional independence for all $x_i$, this is simplified to

$$P(x_i \in \vec{x} \vert y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \in \vec{x} \vert y)$$

which simplifies the relationship to

$$P(y \vert \vec{x}) = \frac{P(y)\prod\limits_{i=1}^{n} P(x_i \vert y)}{P(\vec{x})}$$

Since $P(\vec{x})$ is constant given the input, we can derive the following classification rule
$$P(y \vert \vec{x}) \varpropto P(y)\prod\limits_{i=1}^{n} P(x_i \vert y)$$

Applying the log optimization results in the following classification rule

$$P(y \vert \vec{x}) \varpropto \log\bigl(P(y)\bigr) + \sum\limits_{i=1}^{n} \log\bigl(P(x_i \vert y)\bigr)$$

where, for Gaussian Naive Bayes,

$$\log\bigl(P(x_i \vert y)\bigr) = -\frac{1}{2}\log\bigl(2\pi\sigma_{ic}^2\bigr) - \frac{(x_{ic}-\mu_{ic})^2}{2\sigma_{ic}^2}$$

Vectorization
We extract constants that can be precomputed during training by expanding the factor $(x_{ic}-\mu_{ic})^2$, enabling extremely fast log-likelihood computation at inference time.

$$\log\bigl(P(x_i \vert y)\bigr) = \underbrace{-\frac{1}{2}\log\bigl(2\pi\sigma_{ic}^2\bigr) - \frac{\mu_{ic}^2}{2\sigma_{ic}^2}}_{\text{log normalization constants}} + \underbrace{\frac{\mu_{ic}}{\sigma_{ic}^2}}_{\text{mean pull}}\,x_{ic} \underbrace{-\frac{1}{2\sigma_{ic}^2}}_{\text{curvature}}\,x_{ic}^2$$

We can precompute the log normalization constants, mean pull, and curvature, along with the prior probabilities, for each class $y$ and each feature $i$, and compute log-likelihoods fast at inference time.
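For completeness, this decomposition follows directly from expanding the squared term in the Gaussian log-likelihood:

$$-\frac{(x_{ic}-\mu_{ic})^2}{2\sigma_{ic}^2} = -\frac{x_{ic}^2 - 2\mu_{ic}x_{ic} + \mu_{ic}^2}{2\sigma_{ic}^2} = -\frac{\mu_{ic}^2}{2\sigma_{ic}^2} + \frac{\mu_{ic}}{\sigma_{ic}^2}\,x_{ic} - \frac{1}{2\sigma_{ic}^2}\,x_{ic}^2$$

Only the terms multiplying $x_{ic}$ and $x_{ic}^2$ depend on the input; everything else is fixed by the training data.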
Epsilon variance smoothing
To prevent division by zero or numerical instability, we apply ϵ variance smoothing:
$$\sigma_{ic}^2 \longrightarrow \sigma_{ic}^2 + \epsilon, \quad \text{where } \epsilon = 10^{-9}$$
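To see why the smoothing is needed, here is a quick illustrative snippet (not part of the classifier): a feature that is constant within a class has zero variance, which sends the log term to negative infinity and makes the divisions blow up, while the smoothed variance stays finite.

import numpy as np

x = np.array([5.0, 5.0, 5.0])                    # a feature that is constant within a class
variance = x.var()                               # 0.0

print(np.log(2 * np.pi * variance))              # -inf, with a divide-by-zero warning

epsilon = 1e-9
print(np.log(2 * np.pi * (variance + epsilon)))  # finite, roughly -18.9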
The final classification rule is:
$$\boxed{P(y \vert \vec{x}) \varpropto \boldsymbol{\log\bigl(P(y)\bigr)} + \sum\limits_{i=1}^n -\frac{1}{2}\log\bigl(2\pi(\sigma_{ic}^2+\epsilon)\bigr) - \frac{\mu_{ic}^2}{2(\sigma_{ic}^2+\epsilon)} + \frac{\mu_{ic}}{\sigma_{ic}^2+\epsilon}x_{ic} - \frac{1}{2(\sigma_{ic}^2+\epsilon)}x_{ic}^2}$$

Step-by-step implementation
Now that we have refreshed our mathematics, we can start implementing. Let’s walk through each step.
Imports
# Libraries
import numpy as np
# Typings
from typing import Self, Any
from numpy.typing import NDArray
Define the class
We need the following methods for training, classification, and probability estimations:
Method | Description |
---|---|
.fit() | Fits our Gaussian Naive Bayes model according to the given training data |
.predict() | Performs classification on an array of test vectors X |
.predict_proba() | Estimates probability outputs on an array of test vectors |
.predict_log_proba() | Estimates log probability outputs on an array of test vectors |
Let’s create a class with a descriptive name, like GaussianNaiveBayes
(but you are free to pick any name you like), and add the methods.
class GaussianNaiveBayes:
"""
A Gaussian Naive Bayes Classifier
Implements Bayes' Theorem assuming feature
independence and Gaussian likelihoods.
"""
def __init__(self):
pass
def fit(self) -> Self:
"""
Fits the Gaussian Naive Bayes model
according to the given training data
"""
pass
def predict(self) -> NDArray:
"""
Perform classification on an array of test vectors X
"""
pass
def predict_log_proba(self) -> NDArray[np.float64]:
"""
Estimate log probability outputs on an array of test vectors
"""
pass
def predict_proba(self) -> NDArray[np.float64]:
"""
Estimate probability outputs on an array of test vectors
"""
pass
Fitting the model
Next, we implement .fit() to train our model. The fit phase has two pre-computation steps:
- Compute log prior probabilities P(y)
- Compute the log-likelihood constants: curvature, mean pull & log normalization constants
Notice that we apply log-optimization (part 2).
The .fit() method accepts three parameters:
- Feature matrix X
- Target vector y
- Epsilon variance smoothing value (default: 1e-9)

We added class attributes to store the pre-computed values, making them available at inference time.
def __init__(self):
+ self.classes_: NDArray[Any] = None
+ self.class_count_: NDArray[np.int64] = None
+ self.class_prior_: NDArray[np.float64] = None
+ self.class_curvature_: NDArray[np.float64] = None
+ self.class_mean_pull_: NDArray[np.float64] = None
+ self.class_log_likelihood_consts_: NDArray[np.float64] = None
- pass
To keep the code legible, maintainable, and interoperable, we added two helper methods:
.__compute_log_prior_probabilities()
.__compute_log_likelihood_constants()
def __compute_log_prior_probabilities(self, y: NDArray) -> Self:
"""
Computes log prior probabilities for each class
Parameters
----------
y: nd.array of shape (n_samples,)
Target vector
Returns
-------
self: class-instance
"""
# Store unique classes and n_samples_per_class in class attributes
self.classes_, self.class_count_ = np.unique(y, return_counts=True)
# Compute log prior probabilities
self.class_prior_ = np.log(self.class_count_ / self.class_count_.sum())
return self
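To make the helper concrete, here is a small illustration with a made-up target vector; note that, despite its name, class_prior_ ends up holding the log of the priors:

import numpy as np

y = np.array([0, 0, 0, 1, 1, 2])
classes, counts = np.unique(y, return_counts=True)

print(classes)                        # [0 1 2]
print(counts)                         # [3 2 1]
print(np.log(counts / counts.sum()))  # [-0.693 -1.099 -1.792] -> log priors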
def __compute_log_likelihood_constants(self, X: NDArray, y: NDArray, epsilon: float) -> Self:
"""
Compute log-likelihood constants for feature matrix X
Parameters
----------
X: nd.array of shape (n_samples, n_features)
Feature matrix
y: nd.array of shape (n_samples,)
Target vector
epsilon: float
variance smoothing value to prevent division by zero and provide numerical stability
Returns
-------
self: class-instance
"""
# First, we have to group the feature matrix by class
# We use a boolean mask, a matrix (shape: n_classes x n_features)
# mask[i,j] = True if sample j belongs to class i
mask = (y == self.classes_[:, None])
# Reshape class_count_ class attribute from shape: (n_classes, ) to shape: (n_classes, 1)
# Required for matrix multiplication below
counts = self.class_count_[:, None]
# Compute per-class sums of features (shape: n_classes x n_features)
sums = mask @ X
# Compute per-class means
thetas = sums / counts
# Compute sums of squares for variance computation
X_squared = X ** 2
# Compute per-class sums of squares
sums_of_squares = mask @ X_squared
# Compute variances = E[x^2] - (E[x])^2 + ϵ
variances = (sums_of_squares / counts) - (thetas ** 2) + epsilon
# With thetas and variances computed, we can pre-compute the log-likelihood constants
self.class_curvature_ = -0.5 / variances
self.class_mean_pull_ = thetas / variances
self.class_log_likelihood_consts_ = -0.5 * np.log(2 * np.pi * variances) - (thetas ** 2) / (2 * variances)
return self
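The boolean mask is the workhorse of the vectorized fit, so here is a tiny worked example (toy values only) showing how mask @ X collapses the per-class feature sums into a single matrix product:

import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
y = np.array([0, 1, 0])
classes = np.array([0, 1])

# mask[i, j] is True when sample j belongs to class i
mask = (y == classes[:, None])        # shape: (n_classes, n_samples)

# The matrix product sums the rows of X per class
print(mask @ X)                       # [[ 4. 40.]
                                      #  [ 2. 20.]]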
With the helpers implemented, we can call them in .fit().
def fit(self, X: NDArray, y: NDArray, epsilon: float = 1e-9) -> Self:
"""
Fits the Gaussian Naive Bayes model
according to the given training data
Parameters
----------
X: nd.array of shape (n_samples, n_features)
Feature matrix
y: nd.array of shape (n_samples,)
Target vector
epsilon: float
variance smoothing value to prevent division by zero and provide numerical stability
Returns
-------
self: class-instance
"""
# Always check if the provided data meets the most basic needs.
# In our case: Both X and y cannot be empty
if len(X) == 0:
raise ValueError("Feature Matrix cannot be empty")
if len(y) == 0:
raise ValueError("Target Vector cannot be empty")
# Compute prior probabilities
self.__compute_log_prior_probabilities(y)
# Compute log-likelihood constants for feature matrix X, and target vector y
self.__compute_log_likelihood_constants(X, y, epsilon)
return self
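As a quick sanity check with made-up numbers, we can fit the classifier on a tiny dataset and inspect the learned attributes; exponentiating class_prior_ recovers the actual prior probabilities:

import numpy as np

X = np.array([[1.0, 2.0],
              [1.2, 1.8],
              [3.0, 4.0],
              [3.2, 4.1]])
y = np.array([0, 0, 1, 1])

model = GaussianNaiveBayes().fit(X, y)

print(model.classes_)                  # [0 1]
print(np.exp(model.class_prior_))      # [0.5 0.5]
print(model.class_curvature_.shape)    # (2, 2) -> (n_classes, n_features)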
Classification
Classification is done on an array of test vectors X that we provide as parameters to .predict(). This method has two steps:
- Compute the joint log-likelihoods
- Perform classification by selecting the class with the highest posterior probability score with argmax
To keep the code legible, maintainable, and interoperable, we added another helper method:
.__compute_joint_log_likelihood()
def __compute_joint_log_likelihood(self, X: NDArray) -> NDArray:
"""
Compute joint log likelihood
Parameters
----------
X: nd.array of shape (n_samples, n_features)
Feature matrix
Returns
-------
jll: nd.array of shape (n_samples, n_classes)
Joint log-likelihood of each sample for each class in the model
"""
# Compute log-likelihood terms
log_likelihood = (
(X ** 2) @ self.class_curvature_.T +
X @ self.class_mean_pull_.T +
np.sum(self.class_log_likelihood_consts_, axis=1)
)
# Add prior (log) probabilities to each row
joint_log_likelihood = log_likelihood + self.class_prior_
return joint_log_likelihood
def predict(self, X: NDArray) -> NDArray:
"""
Perform classification on an array of test vectors X
Parameters
----------
X: nd.array of shape (n_samples, n_features)
Feature matrix
Returns
-------
C: nd.array of shape (n_samples,)
Classification results vector
"""
joint_log_likelihood = self.__compute_joint_log_likelihood(X)
# Map the argmax indices back to the original class labels
return self.classes_[np.argmax(joint_log_likelihood, axis=1)]
Compute posterior (log) probabilities
The final step is to estimate (log) probability outputs on an array of test vectors. We have two methods, .predict_log_proba() and .predict_proba(), where the latter exponentiates the normalized log posterior probabilities computed by the former.
.predict_log_proba() has two steps:
- Compute the joint log-likelihoods
- Normalize with the log-sum-exp transform, as shown below
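The normalization is Bayes' rule applied in log space: we subtract the log of the sum of the exponentiated joint log-likelihoods, and shift by the row-wise maximum $m$ first so the exponentials stay in a safe numerical range. Writing $\mathrm{jll}_c = \log\bigl(P(y_c)\,P(\vec{x} \vert y_c)\bigr)$ for class $c$:

$$\log P(y_c \vert \vec{x}) = \mathrm{jll}_c - \log\sum_{k} e^{\mathrm{jll}_k} = \mathrm{jll}_c - \Bigl(m + \log\sum_{k} e^{\mathrm{jll}_k - m}\Bigr), \qquad m = \max_{k}\,\mathrm{jll}_k$$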
def predict_log_proba(self, X: NDArray) -> NDArray:
"""
Estimate log probability outputs on an array of test vectors
Parameters
----------
X: nd.array of shape (n_samples, n_features)
Feature matrix
Returns
-------
C: nd.array of shape (n_samples, n_classes)
Returns the log probability of the samples for each class in
the model
"""
# Compute joint log likelihood
joint_log_likelihood = self.__compute_joint_log_likelihood(X)
# Apply log-sum-exp transform
max_joint_log_likelihood = np.max(joint_log_likelihood, axis=1, keepdims=True)
logsumexp = max_joint_log_likelihood + np.log(np.sum(np.exp(joint_log_likelihood - max_joint_log_likelihood), axis=1, keepdims=True))
return joint_log_likelihood - logsumexp
def predict_proba(self, X: NDArray) -> NDArray:
"""
Return probability estimates for the test vector X.
Parameters
----------
X: nd.array of shape (n_samples, n_features)
Feature matrix
Returns
-------
C : nd.array of shape (n_samples, n_classes)
Returns the probability of the samples for each class in
the model
"""
return np.exp(self.predict_log_proba(X))
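Finally, to close the loop on the Iris dataset mentioned in the introduction, here is a minimal usage sketch. It assumes scikit-learn is available, but only for loading and splitting the data; the exact accuracy will depend on the split.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris feature matrix and target vector
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = GaussianNaiveBayes().fit(X_train, y_train)

predictions = model.predict(X_test)
print(f"Accuracy: {np.mean(predictions == y_test):.2%}")

# Posterior probabilities for the first three test samples
print(model.predict_proba(X_test[:3]).round(3))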
Conclusion
In this tutorial, we translated theory into practice by implementing a Gaussian Naive Bayes classifier from scratch using nothing but NumPy. We explored every component — from fitting the model with class-wise means and variances to efficiently computing log-likelihoods for predictions.
Along the way, we took care to address numerical stability with techniques like the log-sum-exp trick, and built a vectorized implementation that scales well and is easy to read.