Monday, May 18, 2009

Hidden Markov Model III

Last time we used an existing HMM to showcase examples of the inference scenarios in which HMM's might be useful. Today we explore how we might infer the HMM itself from data.

The parameters to be estimated are the transition probabilities and the evidence probabilities, and the inputs we have to work with include the observation sequence and the number of latent states, similar to how the number of gaussians was an input to the mixture of gaussians algorithm. In fact, the standard algorithm for learning a hidden markov model is an instance of the Expectation-Maximization algorithm known as the Baum-Welch algorithm, which uses a slightly modified version of the Forward-Backward algorithm:


#hmm_imp.py
from numpy import *

############################################################
#hmmAlgorithms
############################################################

def fwd(tProb,eProb,prior,ev):
nstates=shape(prior)[0]
T=shape(ev)[1]-1
fMatrix=zeros((nstates,T))

#our initial forward message equals the prior
#state distribution (time=0)
fMatrix[:,0]=prior.reshape(nstates,)

for t in range(1,T):
fTemp=zeros((nstates,nstates))

for i in range(nstates):
for j in range(nstates):
#for every state i at time t,
#calculate the probability that each
#state j at t-1 would have lead to i at t,
#weighted by the belief that we were in
#state j at t-1, and weighted by the
#probability that state i would have
#produced the evidence we observed at t
fTemp[i,j]= \
tProb[j,i]*fMatrix[j,t-1]*eProb[i,ev[0,t]]

#collapse fTemp from right to left, summing contributions
#of each state j at time t-1 to a given state i at t
fMatrix[:,t]=normEachRow(fTemp.sum(1))

return fMatrix

#same as fwd, w/o norm
def likeli(tProb,eProb,prior,ev):
nstates=shape(prior)[0]
T=shape(ev)[1]-1
fMatrix=zeros((nstates,T))

#our initial forward message equals the prior
#state distribution (time=0)
fMatrix[:,0]=prior.reshape(nstates,)

for t in range(1,T):
fTemp=zeros((nstates,nstates))

for i in range(nstates):
for j in range(nstates):
#for every state i at time t,
#calculate the probability that each
#state j at t-1 would have lead to i at t,
#weighted by the belief that we were in
#state j at t-1, and weighted by the
#probability that state i would have
#produced the evidence we observed at t
fTemp[i,j]= \
tProb[j,i]*fMatrix[j,t-1]*eProb[i,ev[0,t]]

#collapse fTemp from right to left, summing contributions
#of each state j at time t-1 to a given state i at t
fMatrix[:,t]=fTemp.sum(1)

return fMatrix

#same as fwd, but backwards
def bwd(tProb,eProb,prior,ev):
nstates=shape(prior)[0]
T=shape(ev)[1]-1

#by initializing the entire matrix to ones,
#the last backward message will be set to all ones,
#which indicates our diffuse prior belief on our
#final state. If we knew what state we ended up on,
#we could use that information here.
bMatrix=ones((nstates,T))

for t in range(T-1,0,-1):
fTemp=zeros((nstates,nstates))

#for every state i at time t-1,
#calculate the probability that it would
#have lead to state j at time t, weighted
#by the probabilistic belief that we are
#in state j at time t, also weighted by
#the likelihood that j would emit the evidence
#it emitted at time t
for i in range(nstates):
for j in range(nstates):
fTemp[i,j]=tProb[i,j]*bMatrix[j,t]*eProb[j,ev[0,t]]

#collapse fTemp from right to left, summing contributions
#of each state j at time t to a given state i at t-1
bMatrix[:,t-1]=normEachRow(fTemp.sum(1))

#allow the backward message to be neutral
#for the prior state distribution (time=0)
bMatrix[:,0]=ones((1,nstates))
return bMatrix

#uses only filtered estimates
def viterbi(tProb,eProb,prior,ev):
nstates=shape(prior)[0]
T=shape(ev)[1]-1
fMatrix=zeros((nstates,T))
pMatrix=zeros((nstates,T),int)

#our initial forward message equals the prior
#state distribution (time=0)
fMatrix[:,0]=prior.reshape(nstates,)

for t in range(1,T):
fTemp=zeros((nstates,nstates))

for i in range(nstates):
for j in range(nstates):
#for every state i at time t,
#calculate the probability that each
#state j at t-1 would have lead to i at t,
#weighted by the belief that we were in
#state j at t-1, and weighted by the
#probability that state i would have
#produced the evidence we observed at t
fTemp[i,j]= \
tProb[j,i]*fMatrix[j,t-1]*eProb[i,ev[0,t]]

#sweep fTemp from right to left, choosing the most likely
#of all states j at time t-1 to reach state i at t
fMatrix[:,t]=fTemp.max(1)
pMatrix[:,t]=fTemp.argmax(1)
fMatrix[:,t]=normEachRow(fMatrix[:,t])

return fMatrix ,pMatrix

def baumwelch(ev,nstates):
counter=1
T=shape(ev)[1]-1
evclass=set(filter(lambda i: i is not None,ev.tolist()[0]))
nclass=len(evclass)
tProb=normEachRow(random.rand(nstates,nstates))
eProb=normEachRow(random.rand(nstates,nclass))
prior=normEachRow(ones((nstates,)))
gamma=zeros((nstates,nstates,T))
eps=zeros((nstates,T))

tProb_old=tProb*0
eProb_old=eProb*0
while (abs(tProb_old-tProb)).max()>1e-3 and \
(abs(eProb_old-eProb)).max()>1e-3:
tProb_old=tProb+0
eProb_old=eProb+0

###########
#E-step:
###########

f=fwd(tProb,eProb,prior,ev)
b=bwd(tProb,eProb,prior,ev)

#update gamma
#for all time t,
#compute the probability that we transition
#from state i at time t to state j at time t+1,
#weighted by the beliefs about where we are at
#each time and the probability that the evidence
#seen would have been emitted at time t+1, divided
#by the smoothed likelihood of the observed sequence
#observed up until time t would have been generated
#by the hmm
for t in range(T-1):
for i in range(nstates):
for j in range(nstates):
gamma[i,j,t]= \
f[i,t]*tProb[i,j]* \
b[j,t+1]*eProb[j,ev[0,t+1]]/ \
sum(f[i,t]*b[j,t])

#update eps
#which is our smoothed state distribution
#for each time step
eps=f*b
eps=transpose(normEachRow(transpose(eps)))

###########
#M-step:
###########

#updated tProb_new
for i in range(nstates):
for j in range(nstates):
#num is the weighted number of times that
#i has transitioned to j in our evidence sequence
num=sum([gamma[i,j,t] for t in range(T)])

#den is the weighted number of times that
#i has transitioned to any other state in
#each time t of our observation sequence
den=0
for k in range (nstates):
den+=sum([gamma[i,k,t] for t in range(T)])
tProb[i,j]=num/den

#update eProb_new
#weighted probabilitic belief about each state
#combined with the probability that each state
#will emit the given evidence
for j in range(nstates):
for k in range(nclass):
num=sum([eps[j,t]
for t in range(T)
if ev[0,t]==k])
den=sum([eps[j,t]
for t in range(T)])
eProb[j,k]=num/den

#update prior
#sum each state here, which we will normalize later.
#how about doing some viterbu here and then
#looking at the most likely path and getting the
#prior distirbution from there?
for j in range(nstates):
prior[j,]=sum([eps[j,t] for t in range(T)])

tProb=normEachRow(tProb)
eProb=normEachRow(eProb)
prior=normEachRow(prior)

print 'iter: %d'%counter
counter+=1

return prior,eProb,tProb

############################################################
#helperFunctions
############################################################

def normEachRow(pDist):
dim=shape(pDist)
rows=dim[0]
if len(dim)>1 and rows>1:
for row in range(rows):
pDist[row,:]=pDist[row,:]/sum(pDist[row,:])
else:
pDist=pDist/sum(pDist)
return pDist

def printMatrix(matrix,msg,start=0,dec=True):
if not dec:
format='%d\t'
else:
format='%1.3f\t'
print '\n'+msg
rows,cols=shape(matrix)
if cols>10: cols=11
for col in range(start,cols): print 't=%d\t'%col,
print '\n',
for row in range(rows):
for col in range(start,cols):
print format%matrix[row,col],
print '\n',

def accu(a,b):
assert shape(a)==shape(b)
total=float(shape(a)[1])
c=0
for i in range(shape(a)[1]):
if a[0,i]==b[0,i]:
c=c+1
return c/total

def load(filename):
inf=open(filename,'r')
inf.readline()
data=[]
for i in inf.readlines():
x=i.split()
y=x[0].split(",")
data.append(y)
return data

def backtrack(vMatrix,pMatrix):
rows,cols=shape(vMatrix)
path=zeros((1,cols),int)

#the most likely state at the last time step
#is the one with the highest probability in
#the vMatrix at that time step:
lastPathCol=shape(path)[1]-1
lastvMatrixCol=shape(vMatrix)[1]-1
lastpMatrixCol=shape(pMatrix)[1]-1
path[0,lastPathCol]=vMatrix[:,lastvMatrixCol].argmax(0)

#once we have the most likely last state's index, we
#use it to look up in the backpointer matrix to
#trace our way back the most likely path:
while lastPathCol>0:
path[0,lastPathCol-1]= \
pMatrix[path[0,lastPathCol],lastpMatrixCol]
lastPathCol-=1
lastpMatrixCol-=1

return path


Driven with:


#hmm_inter.py
from hmm_imp import *

############################################################
#weatherModel
############################################################

#state map
stateMap={'sunny':0,'rainy':1,'foggy':2}
stateIndex={0:'sunny',1:'rainy',2:'foggy'}
nStates=len(stateMap)

#observation map
obsMap={'no':0,'yes':1}
obsIndex={0:'no',1:'yes'}

#prior probability on weather states
#P(sunny)=0.50
#P(rainy)=0.25
#P(foggy)=0.25
pi=array([0.50,
0.25,
0.25]).reshape(nStates,1)

#transition probabilities of weather states
# tomorrrow
# today sunny rainy foggy
# sunny 0.8 0.05 0.15
# rainy 0.2 0.60 0.20
# foggy 0.2 0.30 0.50
A=array([[0.8, 0.05, 0.15],
[0.2, 0.60, 0.20],
[0.2, 0.30, 0.50]])

#conditional probabilities of evidence given weather
# P(umbrella=no|weather) P(umbrella=yes|weather)
# sunny 0.9 0.1
# rainy 0.2 0.8
# foggy 0.7 0.3
B=array([[0.9, 0.1],
[0.2, 0.8],
[0.7, 0.3]])

############################################################
#session
############################################################

#data (evidence has the dummy 'None' entry for time=0)
evidence=array([None,0,0,0,1,0,0,1,1,0,1,0])
classPath=array([2,2,2,1,0,2,1,1,2,1,1])
classPath=classPath.reshape(1,len(classPath))
evidence=evidence.reshape(1,len(evidence))

#compute forward messages
fMatrix=fwd(A,B,pi,evidence)

#normalize probabilities
fMatrix=transpose(normEachRow(transpose(fMatrix)))

#show state distribution as a result of filtering
printMatrix(fMatrix,'fwd matrix')

#compute backward messages
bMatrix=bwd(A,B,pi,evidence)

#show backward messages
printMatrix(bMatrix,'bwd matrix')

#compute smoothed estimate (hindsight)
sMatrix=fMatrix*bMatrix
sMatrix=transpose(normEachRow(transpose(sMatrix)))

#show smooth matrix
printMatrix(sMatrix,'smooth matrix',start=1)

#by summing over the un-normalized state distribution
#resulting from filtering, we obtain the likelihood
#of the hmm having generated the given observation sequence
fMatrix=likeli(A,B,pi,evidence)
lkh=sum(fMatrix[:,shape(fMatrix)[1]-1])
print '\nLikelihood of obs. seq. given original hmm: %f '%lkh

def doViterbi(tProb,eProb,prior, \
evidence_seq,state_seq,printPath=False):
print '\nViterbi...'

#compute viterbi matrix and backpointer matrix
vMatrix,pMatrix=viterbi(tProb,eProb,prior,evidence_seq)

#show viterbi matrix and backpointer matrix
printMatrix(vMatrix,'viterbi matrix',start=1)
printMatrix(pMatrix,'backpointer matrix',start=1,dec=False)

#backtrack backpointer matrix to find most likely
#sequence of states
estPath=backtrack(vMatrix,pMatrix)
assert shape(state_seq)==shape(estPath)

if printPath:
print '\nEstimated class path:\n',estPath
print '\nActual class path:\n',state_seq

print '\nViterbi decoder accuracy: %1.2f'% \
accu(state_seq,estPath)

doViterbi(A,B,pi,evidence,classPath,printPath=True)

#load more data from disk
data=load('weather-test1-1000.txt')

#adding 'None' as per note at beginning of session
observations=[None]
states=[]
for c,o in data:
observations.append(obsMap[o])
states.append(stateMap[c])

observations=array(observations)
observations=observations.reshape(1,len(observations))
states=array(states)
states=states.reshape(1,len(states))

doViterbi(A,B,pi,observations,states)

print '\nBaum-Welch...'

random.seed(135)
prior,eProb,tProb=baumwelch(observations,3)

print '\nprior:\n',prior
print '\ntProb:\n',tProb
print '\neProb:\n',eProb

doViterbi(tProb,eProb,prior,observations,states)


With output:


fwd matrix
t=0 t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0.500 0.667 0.728 0.752 0.282 0.558 0.679 0.241 0.082 0.447 0.142
0.250 0.074 0.042 0.034 0.424 0.120 0.055 0.466 0.722 0.198 0.600
0.250 0.259 0.231 0.214 0.294 0.322 0.265 0.293 0.197 0.356 0.258

bwd matrix
t=0 t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
1.000 0.448 0.346 0.180 0.404 0.272 0.101 0.152 0.317 0.145 1.000
1.000 0.228 0.276 0.450 0.241 0.312 0.533 0.473 0.292 0.493 1.000
1.000 0.324 0.378 0.370 0.355 0.416 0.365 0.376 0.391 0.361 1.000

smooth matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0.747 0.718 0.589 0.356 0.470 0.352 0.100 0.083 0.223 0.142
0.042 0.033 0.067 0.318 0.115 0.151 0.601 0.672 0.335 0.600
0.210 0.249 0.344 0.326 0.415 0.496 0.300 0.245 0.441 0.258

Likelihood of obs. seq. given original hmm: 0.000654

Viterbi...

viterbi matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0.754 0.858 0.862 0.485 0.737 0.856 0.485 0.198 0.480 0.196
0.063 0.017 0.012 0.242 0.061 0.019 0.242 0.594 0.240 0.589
0.183 0.125 0.126 0.273 0.202 0.125 0.273 0.209 0.280 0.215

backpointer matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0 0 0 0 0 0 0 0 0 0
1 2 0 0 1 2 0 1 1 1
2 0 0 0 2 0 0 2 1 2

Estimated class path:
[[0 0 0 0 0 0 0 1 1 1 1]]

Actual class path:
[[2 2 2 1 0 2 1 1 2 1 1]]

Viterbi decoder accuracy: 0.36

Viterbi...

viterbi matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0.754 0.858 0.862 0.485 0.737 0.856 0.485 0.198 0.480 0.196
0.063 0.017 0.012 0.242 0.061 0.019 0.242 0.594 0.240 0.589
0.183 0.125 0.126 0.273 0.202 0.125 0.273 0.209 0.280 0.215

backpointer matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0 0 0 0 0 0 0 0 0 0
1 2 0 0 1 2 0 1 1 1
2 0 0 0 2 0 0 2 1 2

Viterbi decoder accuracy: 0.61

Baum-Welch...
iter: 1
iter: 2
[...]
iter: 26
iter: 27

prior:
[ 7.33520128e-01 2.75132722e-05 2.66452359e-01]

tProb:
[[ 6.88340698e-01 9.00867784e-06 3.11650293e-01]
[ 7.81393712e-01 1.87048374e-05 2.18587583e-01]
[ 8.65993461e-01 4.23514682e-05 1.33964187e-01]]

eProb:
[[ 0.68360127 0.31639873]
[ 0.13036119 0.86963881]
[ 0.66894045 0.33105955]]

Viterbi...

viterbi matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0.693 0.693 0.693 0.679 0.693 0.693 0.679 0.679 0.693 0.679
0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
0.307 0.307 0.307 0.321 0.307 0.307 0.321 0.321 0.307 0.321

backpointer matrix
t=1 t=2 t=3 t=4 t=5 t=6 t=7 t=8 t=9 t=10
0 0 0 0 0 0 0 0 0 0
2 2 2 2 2 2 2 2 2 2
0 0 0 0 0 0 0 0 0 0

Viterbi decoder accuracy: 0.49


It helps to look at a good diagram of the trellis that results as we unroll the HMM through time; an example of such a diagram can be found in DHS, towards the end of chapter 3. Another good resource to get a good grasp of the HMM algorithms is an interactive spreadsheet that can be found here.

You can check out the estimated transition and emission probability matrices and try to make out how the entries correspond to our original state and evidence vocabulary. In practice, in systems that do things like recognize speech, a lot of human guidance is used when estimating hidden markov models. Running the viterbi algorithm on the observation sequence with the estimated model returns an accuracy of 0.49, which sounds better than it is, since one of the states (the sunny state) is much more prevalent in our dataset (a prior of 0.50), and so if you always guessed that state you would get a score close to the prior of that state given enough data.

Completely unsupervised learning is hard; even humans will come to varying answers in a manual clustering exercise. Reinforcement learning (what we'll be focusing on the next few posts) allows us to use at least some sort of feedback, although not as much as supervised learning, and so should yield better results than learning without any kind of supervision (provided, of course, that a proper reward system is engineered).

0 comments:

Post a Comment