Let’s first state the result, and then show how to derive it.
where $Z$ is the latent variable and $S = \{Z^{1}, Z^{2}, \dots\}$ is the collection of samples. Given a joint model
one may recognize it as the collapsed form of Latent Dirichlet Allocation (LDA); the hyperparameters $\alpha, \beta$ are omitted here for simplicity.
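In other words (a sketch of my reading of the result, with each $Z^{s} \in S$ drawn from the posterior $p(Z \mid X, \phi)$):

$$
\nabla_\phi \log p(\phi \mid X) \;\approx\; \nabla_\phi \log p(\phi) \;+\; \frac{1}{|S|} \sum_{s=1}^{|S|} \nabla_\phi \log p(X, Z^{s} \mid \phi).
$$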
Now we want to calculate the gradient of its marginalized (log) posterior $\log p(\phi \mid X)$, as follows:
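(A sketch, by Bayes' rule and assuming the data points are conditionally independent given $\phi$; the term $\log p(X)$ drops out since it does not depend on $\phi$:)

$$
\nabla_\phi \log p(\phi \mid X) \;=\; \sum_{i} \nabla_\phi \log p(x_i \mid \phi) \;+\; \nabla_\phi \log p(\phi).
$$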
The second term is just the gradient of the log prior of $\phi$; the first term is the incomplete-data log likelihood, which we cannot calculate directly. Since the complete-data likelihood is $p(x_i, z_i \mid \phi)$, we work it out as follows:
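(A sketch of the intended derivation, lines numbered top to bottom:)

$$
\begin{aligned}
& \nabla_\phi \log p(x_i \mid \phi) \\
& \quad = \nabla_\phi \log \sum_{z_i} p(x_i, z_i \mid \phi) \\
& \quad = \frac{\sum_{z_i} \nabla_\phi\, p(x_i, z_i \mid \phi)}{\sum_{z_i} p(x_i, z_i \mid \phi)}
\end{aligned}
$$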
In line 2, the log cannot be moved inside the summation; note that lines 2 and 3 are equivalent, since $\nabla_\phi \log f = \nabla_\phi f / f$.
Here, we rewrite this as an expectation that we can approximate:
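(A sketch, using $\nabla_\phi\, p = p\, \nabla_\phi \log p$ and $p(x_i, z_i \mid \phi)/p(x_i \mid \phi) = p(z_i \mid x_i, \phi)$:)

$$
\nabla_\phi \log p(x_i \mid \phi) \;=\; \sum_{z_i} p(z_i \mid x_i, \phi)\, \nabla_\phi \log p(x_i, z_i \mid \phi) \;=\; \mathbb{E}_{p(z_i \mid x_i, \phi)}\!\big[\nabla_\phi \log p(x_i, z_i \mid \phi)\big].
$$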
We can use an MCMC method to approximate this expectation: specifically, we construct a Markov chain that samples $z_i$ from $p(z_i \mid x_i, \phi)$, and average the gradient evaluated at each sample $z_i^{s}$ to estimate the expectation.
We denote the collection of samples by $\{z_i^{s}\}_{s=1}^{|S|}$; the approximation is:
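(Presumably of the form:)

$$
\nabla_\phi \log p(x_i \mid \phi) \;\approx\; \frac{1}{|S|} \sum_{s=1}^{|S|} \nabla_\phi \log p(x_i, z_i^{s} \mid \phi), \qquad z_i^{s} \sim p(z_i \mid x_i, \phi).
$$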
In this example, the Gibbs sampler for $p(z_i \mid x_i, \phi)$ is:
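To make the whole procedure concrete, here is a minimal NumPy sketch on a toy mixture of categoricals rather than collapsed LDA itself; the softmax parameterization of $\phi$, the uniform prior over $z_i$, and all names below are my own assumptions. Since $p(z_i \mid x_i, \phi)$ is tractable in this toy model, each Gibbs step reduces to an exact draw from that conditional:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N, S = 3, 10, 50, 100      # topics, vocabulary size, tokens, samples per token
phi = rng.normal(size=(K, V))    # topic-word logits: p(x | z=k, phi) = softmax(phi[k])[x]
x = rng.integers(V, size=N)      # observed tokens
pi = np.full(K, 1.0 / K)         # fixed uniform prior p(z_i = k), so it contributes no gradient

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def sample_posterior_z(x_i, phi):
    # Draw z_i ~ p(z_i | x_i, phi) ∝ p(z_i) p(x_i | z_i, phi); exact here, so the chain mixes in one step.
    word_probs = softmax(phi, axis=1)        # (K, V)
    post = pi * word_probs[:, x_i]
    post /= post.sum()
    return rng.choice(K, p=post)

def grad_log_joint(x_i, z_i, phi):
    # grad_phi of log p(x_i, z_i | phi); only row z_i of phi receives a non-zero gradient.
    g = np.zeros_like(phi)
    g[z_i] = -softmax(phi[z_i])
    g[z_i, x_i] += 1.0
    return g

# Monte Carlo estimate of sum_i grad_phi log p(x_i | phi), i.e. the incomplete-data term.
grad_estimate = np.zeros_like(phi)
for x_i in x:
    for _ in range(S):
        z_s = sample_posterior_z(x_i, phi)
        grad_estimate += grad_log_joint(x_i, z_s, phi) / S
```

Adding $\nabla_\phi \log p(\phi)$ for whatever prior is placed on $\phi$ would then give the full posterior gradient estimate.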
Now, we consider a more complex model with two sets of variables whose gradients we want; un-collapsed LDA is a good example:
The log likelihood:
It’s easy to see that we only need to concern ourselves with how to calculate the gradient of
The approximation:
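(Presumably the same estimator as before, now taken jointly over both parameter sets; writing the second set as $\theta$ is my notation, and $Z^{s} \sim p(Z \mid X, \theta, \phi)$:)

$$
\nabla_{\theta,\phi} \log p(X \mid \theta, \phi) \;\approx\; \frac{1}{|S|} \sum_{s=1}^{|S|} \nabla_{\theta,\phi} \log p(X, Z^{s} \mid \theta, \phi).
$$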
This draft may contain some mistakes; if you find any, I would appreciate the correction.