What is a markov chain's equilibrium distribution? For illustration purposes, take the following example, from Applied Linear Algebra and Matrix Analysis:
"Suppose two toothpaste companies compete for customers in a fixed market in which each customer uses either brand A or brand B. Suppose also that a market analysis shows that the buying habits of the customers fit the following pattern: each quarter, 30% of A users will switch to B, while the rest stay with A. During the same time, 40% of B users will switch to A, while the rest will stay with B."
As the python session below shows, it does not matter what the initial state distribution is (i.e. what percentage of customers uses brand A and B), after enough quarters, the state distribution will converge to an equilibrium:
#tpaste_inter.py
from numpy import *
def evolve(x,T,n):
for i in range(n):
x=dot(T,x)
print str(x)+'\n'
#transition matrix
T=array([[.7,.4],
[.3,.6]])
#one initial state distribution
x0=array([.5,
.5]).reshape(2,1)
evolve(x0,T,20)
#another initial state distribution
x0=array([1,
0]).reshape(2,1)
evolve(x0,T,20)
#yet another initial state distribution
x0=array([0,
1]).reshape(2,1)
evolve(x0,T,20)
Output:
[[ 0.57142857]
[ 0.42857143]]
[[ 0.57142857]
[ 0.42857143]]
[[ 0.57142857]
[ 0.42857143]]
As shown, all the initial distributions converged to A retaining 57% of customers, with B retaining the remaining 43%. This convergence to an equilibrium distribution does not always happen in markov chains (check this out for necessary conditions for convergence), but we do need it in MCMC. PageRank is also an example of this, where the transition matrix is a link graph and the equilibrium state distribution denotes the rank of each page.
The idea in MCMC, is that we need to construct a transition matrix so that its equilibrium distribution represents the target distribution we want to sample from, so that independently of what we initialize the initial state distribution to, by applying the transition matrix enough times, we will converge to our desired distribution. One obvious benefit of this method over say rejection sampling is that in rejection sampling, we are sampling all over the distribution space (the proposal distribution's space, that is), which may generate a lot of rejections (wasted samples). The same is true with importance sampling (sampling importance resampling, that is). With MCMC, on the other hand, we are smarter about navigating the state space of the desired distribution by navigating the links on its chain. For example, imagine we are sapling from some bivariate distribution, one approach is to use use something like rejection sampling and to generate a bunch of i.i.d (x0,x1) tuples; while another approach would be to start with a random sample for 'x0', then generate 'x1' from a distribution of p(x1|x0), which gives us our first tuple (x0,x1), then we generate a new sample for 'x0' based on our last 'x1' by drawing this time from the conditional p(x0|x1), which gives us our next tuple (not i.i.d), and then we generate a new 'x1' based on this last 'x0' for the next tuple, and so on. This last approach is known as Gibbs Sampling, and is one variant of MCMC.
A more general form of MCMC is the so called Metropolis-Hastings algorithm, where rather than always taking the next link in the chain as a valid sample, we only take it if it is a much better sample than the last sample (according to some fitness function). If it is not much better, we don't take it. If they are about the same (similar fitness scores), we flip a coin to decide. Good introductory resources include Nando's paper on MCMC and chapter 11 of Bishop's book. In there you will find more details about how we go about constructing a transition matrix (the proposal conditional distribution) with the desired properties, which can be considered somewhat of an art (different choices for the proposal distribution lead to different variants of the Metropolis-Hastings algorithms, one of which is Gibbs Sampling).
Below is an example from Marsland's book, where he uses Metropolis-Hastings to sample from a mixture of two gaussians based on a proposal distribution that is a gaussian of the form N(x,b^2) for some b>0. This means that the proposal is a drawn from a Normal centered at the current value. This variant of Metropolis-Hastings is known as the "Random-walk-Metropolis-Hastings" algorithm, the reason for the name being that if we did not do the accept-reject step (evaluating the new sample by a fitness function) we would then be simulating a random walk:
#mcmc.py
# Code from Chapter 14 of Machine Learning: An Algorithmic Perspective
# by Stephen Marsland (http://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html)
# You are free to use, change, or redistribute the code in any way you wish for
# non-commercial purposes, but please maintain the name of the original author.
# This code comes with no warranty of any kind.
# Stephen Marsland, 2008
# The Metropolis-Hastings algorithm
from pylab import *
from numpy import *
def p(x):
mu1 = 3
mu2 = 10
v1 = 10
v2 = 3
return 0.3*exp(-(x-mu1)**2/v1) + 0.7* exp(-(x-mu2)**2/v2)
stepsize = 0.5
x = arange(-10,20,stepsize)
px = zeros(shape(x))
for i in range(len(x)):
px[i] = p(x[i])
N = 5000
# random walk chain
u2 = random.rand(N)
sigma = 10
y2 = zeros(N)
#start with a random sample
y2[0] = random.normal(0,sigma)
for i in range(N-1):
#get next sample in the chain
y2new = y2[i] + random.normal(0,sigma)
#apply fitness function
alpha = min(1,p(y2new)/p(y2[i]))
#only take this sample if better than last
if u2[i] < alpha:
y2[i+1] = y2new
else:
y2[i+1] = y2[i]
figure(1)
nbins = 30
hist(y2, bins = x)
plot(x, px*N/sum(px), color='r', linewidth=2)
show()
And for which we get:
0 comments:
Post a Comment