Series: Gaussian Naive Bayes Classifier

In the previous part of the series, we took a deep dive into the mathematics behind Gaussian Naive Bayes and the optimizations we can apply to prevent computation errors and improve computational efficiency. In this part, we will implement those insights and build a Gaussian Naive Bayes classifier from scratch in Python, using Test-Driven Development (TDD) to ensure that each mathematical concept is correctly translated into working code. We will train the model on the Iris dataset and use it to classify its samples.

A quick mathematical recap

Recall Bayes’ Theorem:

$$P(H \vert e) = \frac{P(e \vert H)\,P(H)}{P(e)}$$

Given the class variable $y$ and dependent feature vector $\vec{x} = [x_1, \ldots, x_n]$, the theorem defines the following relationship:

$$P(y \vert \vec{x}) = \frac{P(y)\,P(\vec{x} \vert y)}{P(\vec{x})}$$

Assuming conditional independence for all $x_i$, this is simplified to

$$P(x_i \vert y, x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \vert y)$$

Which simplifies the relationship to

$$P(y \vert \vec{x}) = \frac{P(y)\prod\limits_{i=1}^{n} P(x_i \vert y)}{P(\vec{x})}$$

Since $P(\vec{x})$ is constant given the input, we can derive the following classification rule

$$P(y \vert \vec{x}) \varpropto P(y)\prod\limits_{i=1}^{n} P(x_i \vert y)$$

Applying log optimization results in the following classification rule

$$P(y \vert \vec{x}) \varpropto \log\bigl(P(y)\bigr) + \sum\limits_{i=1}^{n} \log\bigl(P(x_i \vert y)\bigr)$$

Where, for Gaussian Naive Bayes,

$$\log\bigl(P(x_i \vert y)\bigr) = -\frac{1}{2}\log\bigl(2\pi\sigma_{ic}^2\bigr) - \frac{(x_{ic}-\mu_{ic})^2}{2\sigma_{ic}^2}$$

Vectorization

We extract constants that can be precomputed during training by expanding the factor $(x_{ic}-\mu_{ic})^2$, enabling extremely fast log-likelihood computation at inference time.

$$\log\bigl(P(x_i \vert y)\bigr) = \underbrace{-\frac{1}{2}\log\bigl(2\pi\sigma_{ic}^2\bigr) - \frac{\mu_{ic}^2}{2\sigma_{ic}^2}}_{\text{log normalization constants}} + \underbrace{\frac{\mu_{ic}}{\sigma_{ic}^2}}_{\text{mean pull}}\,x_{ic} \underbrace{-\frac{1}{2\sigma_{ic}^2}}_{\text{curvature}}\,x_{ic}^2$$

We can precompute the log normalization constants, mean pull, and curvature, along with the prior probabilities, for each class $y$ and each feature $i$, and then compute log-likelihoods quickly at inference time.
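To see how this plays out in NumPy, here is a minimal sketch with hypothetical array names, using randomly generated stand-ins for the per-class means and variances that training will later produce. It shows that inference reduces to two matrix products plus broadcast additions:

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 6, 4, 3

# Hypothetical stand-ins for quantities normally produced during training
means = rng.normal(size=(n_classes, n_features))
variances = rng.uniform(0.5, 2.0, size=(n_classes, n_features))
log_prior = np.log(np.full(n_classes, 1 / n_classes))

curvature = -0.5 / variances                                                  # multiplies x^2
mean_pull = means / variances                                                 # multiplies x
log_consts = -0.5 * np.log(2 * np.pi * variances) - means ** 2 / (2 * variances)

# Inference: two matrix products plus broadcast additions
X = rng.normal(size=(n_samples, n_features))
jll = (X ** 2) @ curvature.T + X @ mean_pull.T + log_consts.sum(axis=1) + log_prior
print(jll.shape)  # (6, 3): one joint log-likelihood per sample and class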

Epsilon Variance smoothing

To prevent division by zero or numerical instability, we apply ϵ variance smoothing:

$$\sigma_{ic}^2 \leftarrow \sigma_{ic}^2 + \epsilon, \quad \text{where } \epsilon = 10^{-9}$$

The final classification rule is:

$$\boxed{P(y\vert \vec{x}) \varpropto \boldsymbol{\log\biggl(P(y)\biggr)} + \sum\limits_{i=1}^n -\frac{1}{2}\log(2\pi(\sigma_{ic}^2+\epsilon))- \frac{\mu_{ic}^2}{2(\sigma_{ic}^2+\epsilon)}+\frac{\mu_{ic}}{\sigma_{ic}^2+\epsilon}x_{ic}-\frac{1}{2(\sigma_{ic}^2 +\epsilon)}x_{ic}^2}$$

Step-by-step implementation

Now that we have refreshed our mathematics, we can start implementing. Let’s walk through each step.

Imports

# Libraries
import numpy as np

# Typings
from typing import Self, Any
from numpy.typing import NDArray
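
One caveat: typing.Self is only available from Python 3.11 onwards. On an older interpreter, a common fallback is the typing_extensions package, for example:

# Fallback for Python < 3.11 (requires typing_extensions to be installed)
try:
    from typing import Self
except ImportError:
    from typing_extensions import Self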

Define the class

We need the following methods for training, classification, and probability estimations:

Method Description
.fit() Fits our Gaussian Naive Bayes model according to the given training data
.predict() Perform classification on an array of test vectors X
.predict_proba() Estimate probability outputs on an array of test vectors
.predict_log_proba() Estimate log probability outputs on an array of test vectors

Let’s create a class with a descriptive name, like GaussianNaiveBayes (but you are free to pick any name you like), and add the methods.

class GaussianNaiveBayes:
    """
    A Gaussian Naive Bayes Classifier

    Implements Bayes' Theorem assuming feature
    independence and Gaussian likelihoods.
    """

    def __init__(self):
        pass


    def fit(self) -> Self:
        """
        Fits the Gaussian Naive Bayes model
        according to the given training data
        """
        pass


    def predict(self) -> NDArray:
        """
        Perform classification on an array of test vectors X
        """
        pass


    def predict_log_proba(self) -> NDArray[np.float64]:
        """
        Estimate log probability outputs on an array of test vectors
        """
        pass


    def predict_proba(self) -> NDArray[np.float64]:
        """
        Estimate probability outputs on an array of test vectors
        """
        pass

Fitting the model

Next, we implement .fit() to train our model. The fit phase has two pre-computation steps:

  1. Compute the log prior probabilities $P(y)$
  2. Compute the log-likelihood constants: curvature, mean pull, and log normalization constants

Notice that we apply log-optimization (part 2).

The .fit() method accepts three parameters:

  1. Feature matrix X
  2. Target vector $\vec{y}$
  3. Epsilon variance smoothing value (default: 1e-9)

We added class attributes to store the pre-computed values, which makes them available at inference time.

def __init__(self):
+   self.classes_: NDArray[Any] = None
+   self.class_count_: NDArray[np.int64] = None
+   self.class_prior_: NDArray[np.float64] = None
+   self.class_curvature_: NDArray[np.float64] = None
+   self.class_mean_pull_: NDArray[np.float64] = None
+   self.class_log_likelihood_consts_: NDArray[np.float64] = None
-    pass

To keep the code legible, maintainable, and interoperable, we added two helper methods:

  1. .__compute_log_prior_probabilities()
  2. .__compute_log_likelihood_constants()
def __compute_log_prior_probabilities(self, y: NDArray) -> Self:
    """
    Computes log prior probabilities for each class

    Parameters
    ----------
    y: nd.array of shape (n_samples,)
        Target vector

    Returns
    -------
    self: class-instance
    """

    # Store unique classes and n_samples_per_class in class attributes
    self.classes_, self.class_count_ = np.unique(y, return_counts=True)

    # Compute log prior probabilities
    self.class_prior_ = np.log(self.class_count_ / self.class_count_.sum())

    return self


def __compute_log_likelihood_constants(self, X: NDArray, y: NDArray, epsilon: float) -> Self:
    """
    Compute log-likelihood constants for feature matrix X

    Parameters
    ----------
    X: nd.array of shape (n_samples, n_features)
        Feature matrix
    y: nd.array of shape (n_samples,)
        Target vector
    epsilon: float
        Variance smoothing value to prevent division by zero and provide numerical stability

    Returns
    -------
    self: class-instance
    """

    # First, we have to group the feature matrix by class
    # We use a boolean mask, a matrix (shape: n_classes x n_features)
    # mask[i,j] = True if sample j belongs to class i
    mask = (y == self.classes_[:, None])

    # Reshape class_count_ class attribute from shape: (n_classes, ) to shape: (n_classes, 1)
    # Required for matrix multiplication below
    counts = self.class_count_[:, None]

    # Compute per-class sums of features (shape: n_classes x n_features)
    sums = mask @ X

    # Compute per-class means
    thetas = sums / counts

    # Compute sums of squares for variance computation
    X_squared = X ** 2

    # Compute per-class sums of squares
    sums_of_squares = mask @ X_squared

    # Compute variances = E[x^2] - (E[x])^2 + ϵ
    variances = (sums_of_squares / counts) - (thetas ** 2) + epsilon

    # With thetas and variances computed, we can pre-compute the log-likelihood constants
    self.class_curvature_ = -0.5 / variances
    self.class_mean_pull_ = thetas / variances
    self.class_log_likelihood_consts_ = -0.5 * np.log(2 * np.pi * variances) - (thetas ** 2) / (2 * variances)

    return self
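
In the TDD spirit of this series, a quick sanity check is to verify on a small hypothetical toy dataset that the masked matrix products reproduce NumPy's per-class means and variances, and that the priors sum to one. A minimal sketch:

import numpy as np

# Hypothetical toy data: 4 samples, 2 features, 2 classes
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])

classes_, counts = np.unique(y, return_counts=True)
log_priors = np.log(counts / counts.sum())
assert np.isclose(np.exp(log_priors).sum(), 1.0)

mask = (y == classes_[:, None])
thetas = (mask @ X) / counts[:, None]
variances = (mask @ (X ** 2)) / counts[:, None] - thetas ** 2

# Compare against direct per-class computations
assert np.allclose(thetas[0], X[y == 0].mean(axis=0))
assert np.allclose(variances[0], X[y == 0].var(axis=0))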

With the helpers implemented, we can call them in .fit().

def fit(self, X: NDArray, y: NDArray, epsilon: float = 1e-9) -> Self:
    """
    Fits the Gaussian Naive Bayes model
    according to the given training data

    Parameters
    ----------
    X: nd.array of shape (n_samples, n_features)
        Feature matrix
    y: nd.array of shape (n_samples,)
        Target vector
    epsilon: float
        Variance smoothing value to prevent division by zero and provide numerical stability

    Returns
    -------
    self: class-instance
    """

    # Always check if the provided data meets the most basic needs.
    # In our case: Both X and y cannot be empty
    if len(X) == 0:
        raise ValueError("Feature Matrix cannot be empty")

    if len(y) == 0:
        raise ValueError("Target Vector cannot be empty")

    # Compute log prior probabilities
    self.__compute_log_prior_probabilities(y)

    # Compute log-likelihood constants for feature matrix X, and target vector y
    self.__compute_log_likelihood_constants(X, y, epsilon)

    return self
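
As a quick usage sketch, assuming the snippets above have been assembled into the GaussianNaiveBayes class and that scikit-learn is available (used here only to load the Iris dataset), fitting looks like this:

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

model = GaussianNaiveBayes()
model.fit(X, y)

print(model.classes_)      # [0 1 2]
print(model.class_prior_)  # log prior probabilities, one per class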

Classification

Classification is done on an array of test vectors X that we provide as parameters to .predict(). This method has two steps:

  1. Compute the joint log-likelihoods
  2. Perform classification by selecting the class with the highest posterior score via argmax.

To keep the code legible, maintainable, and interoperable, we added another helper method:

  1. .__compute_joint_log_likelihood()
def __compute_joint_log_likelihood(self, X: NDArray) -> NDArray:
    """
    Compute joint log likelihood

    Parameters
    ----------
    X: nd.array of shape (n_samples, n_features)
        Feature matrix

    Returns
    -------
    C: nd.array of shape (n_samples, n_classes)
        Joint log-likelihood of the samples for each class in the model
    """

    # Compute log-likelihood terms
    log_likelihood = (
            (X ** 2) @ self.class_curvature_.T +
            X @ self.class_mean_pull_.T +
            np.sum(self.class_log_likelihood_consts_, axis=1)
    )

    # Add prior (log) probabilities to each row
    joint_log_likelihood = log_likelihood + self.class_prior_

    return joint_log_likelihood


def predict(self, X: NDArray) -> NDArray:
    """
    Perform classification on an array of test vectors X

    Parameters
    ----------
    X: nd.array of shape (n_samples, n_features)
        Feature matrix

    Returns
    -------
    C: nd.array of shape (n_samples,)
        Predicted class label for each sample
    """
    joint_log_likelihood = self.__compute_joint_log_likelihood(X)

    return self.classes_[np.argmax(joint_log_likelihood, axis=1)]
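
Reusing the fitted model, X, and y from the hypothetical Iris sketch in the previous section, classification and a rough training-set accuracy check could look like this:

predictions = model.predict(X)
print(predictions[:5])            # predicted class labels for the first five samples
print((predictions == y).mean())  # training accuracy; Iris is easy, so expect a high value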

Compute posterior (log) probabilities

The final step is to estimate (log) probability outputs on an array of test vectors. We have two methods, .predict_log_proba() and .predict_proba(), where the latter exponentiates the normalized log posterior probabilities computed by the former.

.predict_log_proba() has two steps:

  1. Compute the joint log-likelihoods
  2. Normalize with the log-sum-exp transform
def predict_log_proba(self, X: NDArray) -> NDArray:
    """
    Estimate log probability outputs on an array of test vectors

    Parameters
    ----------
    X: nd.array of shape (n_samples, n_features)
        Feature matrix

    Returns
    -------
    C: nd.array of shape (n_samples, n_classes)
        Returns the log probability of the samples for each class in
        the model
    """

    # Compute joint log likelihood
    joint_log_likelihood = self.__compute_joint_log_likelihood(X)

    # Apply log-sum-exp transform
    max_joint_log_likelihood = np.max(joint_log_likelihood, axis=1, keepdims=True)
    logsumexp = max_joint_log_likelihood + np.log(np.sum(np.exp(joint_log_likelihood - max_joint_log_likelihood), axis=1, keepdims=True))

    return joint_log_likelihood - logsumexp


def predict_proba(self, X: NDArray) -> NDArray:
    """
    Return probability estimates for the test vector X.

    Parameters
    ----------
    X: nd.array of shape (n_samples, n_features)
        Feature matrix

    Returns
    -------
    C : nd.array of shape (n_samples, n_classes)
        Returns the probability of the samples for each class in
        the model
    """

    return np.exp(self.predict_log_proba(X))
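
One last check on the same hypothetical Iris example: every row returned by .predict_proba() should sum to one, because the log-sum-exp transform normalizes the posteriors.

probabilities = model.predict_proba(X)
print(probabilities.shape)                          # (150, 3): one row per sample, one column per class
print(np.allclose(probabilities.sum(axis=1), 1.0))  # True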

Conclusion

In this tutorial, we translated theory into practice by implementing a Gaussian Naive Bayes classifier from scratch using nothing but NumPy. We explored every component — from fitting the model with class-wise means and variances to efficiently computing log-likelihoods for predictions.

Along the way, we took care to address numerical stability with techniques like the log-sum-exp trick, and built a vectorized implementation that scales well and is easy to read.