Dirichlet Processes

Sat, Apr 3, 2021 5-minute read

Dirichlet Processes

A Dirichlet Process (DP) is a stochastic process used in Bayesian non-parametric modeling. Each draw from a DP is itself a distribution, making the DP a distribution over distributions. It derives its name from the fact that the marginal distributions of a DP are finite-dimensional Dirichlet distributions, just as the marginal distributions of a Gaussian Process are finite-dimensional Gaussians. The DP has an infinite number of parameters, which places it in the non-parametric family.

To motivate our exploration of DPs, imagine that we would like to extend a finite mixture model to the case of infinitely many components. Infinitely many components require an infinite probability vector \(\pi\), and typically when working with a Dirichlet Distribution (not to be confused with a Dirichlet Process) we have \(K \ll N\). So what happens when \(K \gg N\)? We know that the number of unique groups in our sample, \(K^*\), will be less than \(N\). To see what this looks like, we can generate \(\pi \sim Dir\left(\frac{\alpha}{K}, ..., \frac{\alpha}{K}\right)\) and then sample \(Z \sim Cat(\pi)\) with an arbitrarily large \(K\) for increasing values of \(N\).

import numpy as np

from matplotlib import pyplot as plt
from scipy.stats import dirichlet

K=1000
alpha=10
N=500

def num_clusters(alpha, K, N):
    # Draw a sparse probability vector and sample N cluster assignments
    p = dirichlet([alpha / K] * K).rvs().squeeze()
    samples = np.random.choice(K, N, p=p, replace=True)
    # Relabel clusters in order of first appearance so that the i-th
    # distinct cluster observed gets the label i
    labels = {}
    output = []
    for s in samples:
        if s not in labels:
            labels[s] = len(labels) + 1
        output.append(labels[s])
    return output

y = num_clusters(alpha=alpha, K=K, N=N)
x = np.arange(len(y))
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
ax.plot(x, alpha * np.log(1 + x / alpha), color="red")
fig.suptitle("Number of Clusters (alpha = 10)")
ax.grid()
ax.set_xlabel("N")
ax.set_ylabel("Cluster Assignment")
fig.savefig("num_clusters.png")
plt.close(fig)

Interesting! It seems that the number of clusters grows logarithmically in \(N\), and it turns out that we can solve for the expected value, plotted in red:

\[ \begin{align*} \mathbb{E}\left[K^* \mid \alpha, N\right] & = \alpha \log\left(1 + \frac{N}{\alpha}\right) \end{align*} \]
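As a quick sanity check, we can re-run the simulation above several times and compare the average number of unique clusters after \(N\) draws against this formula. This is a minimal sketch reusing the num_clusters function defined above; since clusters are labeled in order of first appearance, the maximum label equals the number of unique clusters:

n_sims = 50
counts = [max(num_clusters(alpha=alpha, K=K, N=N)) for _ in range(n_sims)]
print("empirical mean:", np.mean(counts))
print("theoretical:", alpha * np.log(1 + N / alpha))

The two should roughly agree, with the empirical mean fluctuating around the theoretical value.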

We can also note the apparent stability of the probability mass on the first few clusters, even as \(N\) grows. This helps us understand how the Dirichlet Distribution behaves with large \(K\) and gives us confidence that it remains well behaved as \(K\) grows, but our original question was: what happens when \(K \to \infty\)? Intuitively, as \(K \to \infty\) the probability mass should still concentrate on the first few clusters, since we are sampling categorically from the probability vector \(\pi\). If we generalize the categorical draw to a draw from an arbitrary base distribution \(\mathcal{H}\), we arrive at the formal definition of a Dirichlet Process.

\[ \begin{align*} \pi \mid \alpha & \sim \lim_{k \to \infty} Dir\left(\frac{\alpha}{k}, ..., \frac{\alpha}{k}\right) \end{align*} \]

For each component \(\pi_k\) of this vector we associate a draw \(\theta_k\) from the base measure \(\mathcal{H}\):

\[ \begin{align*} \theta_k & \sim \mathcal{H} \quad \forall \; k \in \mathbb{N} \end{align*} \]

Then we define: \[ \begin{align*} G \doteq \sum_{k=1}^{\infty}\pi_k \delta\left(\theta_k\right) \end{align*} \]

and conclude that:

\[ \begin{align*} G \sim DP\left(\alpha, \mathcal{H} \right) \end{align*} \]

where \(G\) is a discrete distribution over the (possibly continuous) space \(\Omega\) on which \(\mathcal{H}\) is defined.
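To make the definition concrete, here is a minimal sketch that approximates a draw \(G \sim DP(\alpha, \mathcal{H})\) by truncating the infinite sum at a large finite \(K\). Taking \(\mathcal{H}\) to be a standard normal is an arbitrary choice, purely for illustration:

def truncated_dp_sample(alpha, K=1000):
    # Truncated approximation of G ~ DP(alpha, H): finite-K Dirichlet
    # weights paired with atoms drawn from the base measure H
    pi = dirichlet([alpha / K] * K).rvs().squeeze()
    theta = np.random.normal(size=K)  # theta_k ~ H, here H = N(0, 1)
    return pi, theta

pi, theta = truncated_dp_sample(alpha=10)
# G is discrete: almost all of its mass sits on a handful of atoms
print("mass on the 10 largest atoms:", np.sort(pi)[-10:].sum())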

Stick Breaking

There are a few alternative ways to construct a Dirichlet Process, and one common method is the “stick breaking” construction. Formally, let \((\phi_1,\phi_2, ...)\) be a sequence of independent random variables distributed \(Beta(1, \alpha)\). Independent of this sequence, let \((Z_1, Z_2, ...)\) be a sequence of i.i.d. random variables with base distribution \(\mathcal{H}\). If we define \(p_1 = \phi_1\) and \(p_i = \phi_i\prod_{j<i}(1-\phi_j)\), then the random measure \(G = \sum_{i\in \mathbb{N}}p_i\delta(Z_i)\) is a Dirichlet Process with concentration parameter \(\alpha\) and base measure \(\mathcal{H}\). The nice thing about this formulation is that it lets us sample from the infinite-dimensional Dirichlet distribution while only ever instantiating finitely many components.
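The weight sequence \(p_i\) can be generated directly from this definition. Below is a minimal sketch, where the truncation at a finite number of sticks is our own choice:

def stick_breaking_weights(alpha, n_sticks):
    # p_i = phi_i * prod_{j<i} (1 - phi_j), with phi_i ~ Beta(1, alpha)
    phi = np.random.beta(1, alpha, size=n_sticks)
    leftover = np.concatenate([[1.0], np.cumprod(1 - phi[:-1])])
    return phi * leftover

p = stick_breaking_weights(alpha=10, n_sticks=1000)
print(p.sum())  # approaches 1 as n_sticks grows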

To see this in action, we can generate class assignments with a lazy variant of stick breaking, only breaking off a new piece of the stick when the unbroken remainder is selected, and verify that the samples are visually similar to those we generated from the Dirichlet Distribution with large \(K\).

def stick_breaking(N, alpha):
    table = [1]
    # Break off the first piece of the stick for table 1; the last entry
    # of probs is always the unbroken remainder of the stick
    phi = np.random.beta(1, alpha)
    remaining_length = 1 - phi
    probs = [phi, remaining_length]
    for i in range(1, N):
        sample = np.random.choice(range(len(probs)), p=np.array(probs)) + 1
        table.append(sample)
        if sample == len(probs):
            # The remainder was selected: break off a new piece for the
            # new table and keep what is left as the remainder
            new_phi = np.random.beta(1, alpha)
            probs[-1] = new_phi * remaining_length
            remaining_length = 1 - np.sum(probs)
            probs.append(remaining_length)
    return table, probs

y, prob = stick_breaking(500, 10)
x = np.arange(len(y))
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
ax.plot(x, 10 * np.log(1 + x / 10), color="red")
fig.suptitle("Stick Breaking Process (alpha = 10)")
ax.grid()
ax.set_xlabel("Iteration")
ax.set_ylabel("Class Assignment")
fig.savefig("stick_breaking.png")
plt.close(fig)

Chinese Restaurant Process

Finally, we’ll explore one more way to construct a Dirichlet Process: the Chinese Restaurant Process (CRP). Suppose we own a restaurant with an infinite amount of space and an infinite line of customers waiting to be seated. The hostess seats the first customer at the first table with probability 1. Each subsequent customer is seated at an occupied table \(k\) with probability \(\frac{m_k}{N-1+\alpha}\), or at a new table with probability \(\frac{\alpha}{N-1+\alpha}\), where: \[ \begin{align*} m_k & = \text{Number of people sitting at table } k\\ N & = \text{Total number of customers, including the one being seated}\\ \alpha & = \text{Concentration parameter} \end{align*} \] As customers continue to be seated, an infinite number of tables may be created, and the tables with the most people already seated also have the highest probability of gaining the newest customer: a “rich get richer” dynamic. This process seems simple enough, but what’s the connection to the Dirichlet Process? Let’s code up a CRP and see if we can draw a connection:

def CRP(N, alpha):
    # First person sits at the first table with probability 1
    table = [1]
    for i in range(1, N):
        # i customers are seated: occupied table k is chosen with
        # probability m_k / (i + alpha), a new table with alpha / (i + alpha)
        _, counts = np.unique(table, return_counts=True)
        probs = np.append(counts / (i + alpha), alpha / (i + alpha))
        table.append(np.random.choice(range(len(probs)), p=probs) + 1)
    return table, probs

y, prob = CRP(500, 10)
x = np.arange(len(y))
fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
ax.plot(x, 10 * np.log(1 + x / 10), color="red")
fig.suptitle("Chinese Restaurant Process (alpha = 10)")
ax.grid()
ax.set_xlabel("Iteration")
ax.set_ylabel("Class Assignment")
fig.savefig("crp.png")
plt.close(fig)
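Sure enough, the CRP draws look just like the stick-breaking draws: most customers end up at the first few tables, while the number of occupied tables grows like \(\alpha \log\left(1 + \frac{N}{\alpha}\right)\), the same curve plotted in red. This is no accident: the CRP is exactly the distribution over class assignments obtained by integrating \(G\) out of the Dirichlet Process, which is why all three constructions exhibit the same clustering behavior.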