CS 262 Computational Genomics
Lecture 7: Hidden Markov Models Intro
Scribe: Vivian Lin
January 31, 2012
1. Introduction
A Hidden Markov Model (HMM) is a statistical model for sequences of symbols; it is used
in many different fields of computer science for modeling sequential data. In
computational biology, the first HMM-based program, GENSCAN, was built in 1997.
This program was developed by Chris Burge at Stanford University to predict the
location of genes.
In this lecture, the HMM definition, its properties, and the three main questions (evaluation,
decoding, and learning) are discussed. Throughout the discussion, the example of the
dishonest casino is used for better understanding of the material. In addition, the
Viterbi algorithm and the forward algorithm are presented to solve the decoding and
evaluation questions, respectively.
2. Example: The Dishonest Casino
The lecture started with the dishonest casino example to illustrate HMMs. Consider a
casino that has two dice:
• Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
The casino player switches back and forth between the fair and the loaded die approximately once every 20 turns.
The rules of the game are:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2
Then, a sequence of rolls by the casino player is given:
1245526462146146136136661664661636616366163616515615115146123562344
With this given sequence, there are three main questions we would be able to answer
using HMM:
1. Evaluation: How likely is this sequence, given our model of how the casino works?
2. Decoding: What portion of the sequence was generated with the fair die, and what
portion with the loaded die?
3. Learning: How “loaded” is the loaded die? How “fair” is the fair die? How often
does the casino player change from fair to loaded, and back?
3. Definition of a Hidden Markov Model
A Markov chain is useful when computing a probability for a sequence of states that we
can observe in the world. In many cases, the states we are interested in may not be
directly observable. In contrast, a hidden Markov model allows us to talk about both
observed states and hidden states that we think of as causal factors in the probabilistic
model.
Formally, the HMM is defined as follows:
• Observation alphabet: Σ = { b1, b2, …, bM }
• Set of hidden states: Q = { 1, ..., K }
• Transition probabilities: alk
  • alk = transition probability from state l to state k = P(πi = k | πi-1 = l).
  • In addition, the transition probabilities from a given state to every state
    (including itself) should sum to 1: ai1 + … + aiK = 1, for all states i = 1…K.
• Start probabilities: a0i
  • a0i = probability of starting in state i.
  • Note that the start probabilities should also sum to 1: a01 + … + a0K = 1.
• Emission probabilities: ek(b)
  • ek(b) = the probability that symbol b is seen when in state k = P(xi = b | πi = k).
  • Again, the emission probabilities of each state should sum to 1: ei(b1)
    + … + ei(bM) = 1, for all states i = 1…K.
[Figure: a generic HMM with hidden states 1, 2, …, K.]
Note that the end probabilities ai0 used in Durbin (Biological Sequence Analysis) are not needed
here, since we assume the length of the sequence we are analyzing is given.
In the dishonest casino example, we can represent the model with the following graphical
model and settings:
[Figure: two-state HMM for the dishonest casino]
FAIR:   P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Transitions: Fair→Fair = Loaded→Loaded = 0.95; Fair→Loaded = Loaded→Fair = 0.05
Alphabet: Σ = numbers on the dice = { 1, 2, 3, 4, 5, 6 }
Set of states: Q = { Fair, Loaded }
Transition probabilities:
• probability from fair to loaded (0.05)
• probability from fair to fair (0.95)
• probability from loaded to fair (0.05)
• probability from loaded to loaded (0.95)
Start probabilities: probability of starting with a loaded die or a fair die
Emission probabilities: probabilities of rolling a certain number in a fair or loaded
state. For instance, the probability of rolling a 6 is 1/6 in the fair state but ½ in the
loaded state.
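To make this parameterization concrete, here is a minimal sketch of the casino HMM in Python (not part of the original notes; the variable names start_p, trans_p, and emit_p are our own, and the ½/½ start probabilities are the ones assumed in the worked example of Section 6):

```python
# A minimal, illustrative encoding of the dishonest casino HMM.
# The parameter values are the ones used in the lecture; the variable
# names are arbitrary choices for this sketch.

states = ["Fair", "Loaded"]
alphabet = [1, 2, 3, 4, 5, 6]

# Start probabilities a_0i (the 1/2, 1/2 split assumed in the worked example).
start_p = {"Fair": 0.5, "Loaded": 0.5}

# Transition probabilities a_lk: stay with probability 0.95, switch with 0.05.
trans_p = {
    "Fair":   {"Fair": 0.95, "Loaded": 0.05},
    "Loaded": {"Fair": 0.05, "Loaded": 0.95},
}

# Emission probabilities e_k(b).
emit_p = {
    "Fair":   {b: 1 / 6 for b in alphabet},
    "Loaded": {b: (1 / 2 if b == 6 else 1 / 10) for b in alphabet},
}
```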
4. An HMM is memoryless
Memorylessness is an important property of an HMM. This property is based on two
assumptions: the Markov assumption and the independence assumption. Because of these two
assumptions, solving problems becomes simple. The Markov assumption states that the
current state depends only on the previous state. In other words, at each time step t,
the only thing that affects future states is the current state πt:
P(πt+1 = k | “whatever happened so far”) =
P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt) =
P(πt+1 = k | πt)
The second assumption, the independence assumption, states that the output observation at
time t depends only on the current state. In other words, at each time step t, the only
thing that affects xt is the current state πt:
P(xt = b | “whatever happened so far”) =
P(xt = b | π1, π2, …, πt, x1, x2, …, xt-1) =
P(xt = b | πt)
5. A parse of a sequence
Given a sequence of symbols x = x1……xN, and
a sequence of states, called a parse, π = π1, ……, πN,
we can represent a parse as follows:
[Figure: trellis of the hidden states 1, 2, …, K over positions 1…n; each column emits one symbol x1, x2, …, xn, and the start transition a02 and the emission e2(x1) are highlighted.]
Given an HMM like the one above, we can generate a sequence of length n, as sketched in the
code below. First, we begin at a start state, then we generate an emission, and then we move
from one state to another. Formally, we have:
1. Start at state π1 according to probability a0π1
2. Emit letter x1 according to probability eπ1(x1)
3. Go to state π2 according to probability aπ1π2
4. … until emitting xn
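As an illustration of this generative procedure, here is a minimal sketch in Python (not part of the original notes): the function name generate_sequence is our own, and it assumes the start_p, trans_p, and emit_p dictionaries from the earlier casino sketch (or any HMM encoded the same way).

```python
import random

def generate_sequence(n, start_p, trans_p, emit_p, seed=None):
    """Sample a parse (hidden states) and a sequence of n emitted symbols."""
    rng = random.Random(seed)
    states = list(start_p)
    # Step 1: choose the first state according to the start probabilities a_0i.
    state = rng.choices(states, weights=[start_p[s] for s in states])[0]
    parse, symbols = [], []
    for _ in range(n):
        parse.append(state)
        # Step 2: emit a symbol according to the emission probabilities e_k(b).
        letters = list(emit_p[state])
        symbols.append(rng.choices(letters, weights=[emit_p[state][b] for b in letters])[0])
        # Step 3: move to the next state according to the transition probabilities a_lk.
        state = rng.choices(states, weights=[trans_p[state][s] for s in states])[0]
    return parse, symbols

# Example (using the casino parameters sketched earlier):
# parse, rolls = generate_sequence(100, start_p, trans_p, emit_p, seed=0)
```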
6. Likelihood of a Parse
Given a sequence x = x1……xN and a parse π = π1, ……, πN, we can find how likely this
parse is. From the given HMM, we can calculate this by multiplying all the probabilities of
emissions and transitions:
P(x, π) = P(x1, …, xN, π1, ……, πN)
= P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
= a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN)
Now, the equation can be written in a more compact way by letting each parameter aij
and ei(b) be represented by a θj. In our dishonest casino example, we have 18 parameters in
total: 2 start probabilities (a0Fair : θ1; a0Loaded : θ2), 4 transition probabilities (aFairFair : θ3,
aFairLoaded : θ4, aLoadedFair : θ5, aLoadedLoaded : θ6), and 12 emission probabilities (6 from each of the
two states).
Then, we define the feature count F(j, x, π) to be the number of times parameter θj occurs
in (x, π). Replacing all the aij and ei(b) with the feature counts, the compact equation for the
parse likelihood becomes
P(x, π) = Πj θj^F(j, x, π) = exp[ Σj log(θj) F(j, x, π) ], where j ranges over the parameters.
Let’s practice this equation with our dishonest casino example. Suppose our initial
probabilities are a0Fair = ½ and a0Loaded = ½, and we have a sequence of rolls x = 1, 2, 1, 5,
6, 2, 1, 5, 2, 4. Then the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair,
Fair is
P(start at Fair) P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
= ½ × (1/6)^10 × (0.95)^9 = .00000000521158647211 ≈ 5.2 × 10^-9
where ½ is the start probability, 1/6 is the emission probability from the fair die, and 0.95
is the transition probability from the Fair state to the Fair state.
With the same sequence of rolls, the likelihood of π = Loaded, Loaded, Loaded, Loaded,
Loaded, Loaded, Loaded, Loaded, Loaded, Loaded is
P(start at Loaded) P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
= ½ × (1/10)^9 × (1/2)^1 × (0.95)^9 = .00000000015756235243 ≈ 1.6 × 10^-10
where the first ½ is the start probability, 1/10 is the emission probability of NOT rolling a
six from the loaded die, the second ½ is the emission probability of rolling a six from the
loaded die, and 0.95 is the transition probability from the Loaded state to the Loaded state.
Therefore, it is somewhat more likely that all the rolls were made with the fair die than that
they were all made with the loaded die.
Now, change the sequence of rolls to x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6. The likelihood of π = F,
F, …, F equals ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9, which is the same as before. But
the likelihood of π = L, L, …, L equals ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 ≈ 4.9 × 10^-7. This
time, it is about 100 times more likely that the die is loaded.
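To check these numbers, here is a small sketch (the helper parse_likelihood is our own; it assumes the casino parameter dictionaries from the earlier sketch) that multiplies out the start, transition, and emission probabilities of a parse:

```python
def parse_likelihood(x, parse, start_p, trans_p, emit_p):
    """P(x, pi) = a_{0,pi_1} e_{pi_1}(x_1) * product over i>1 of a_{pi_{i-1} pi_i} e_{pi_i}(x_i)."""
    p = start_p[parse[0]] * emit_p[parse[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans_p[parse[i - 1]][parse[i]] * emit_p[parse[i]][x[i]]
    return p

rolls = [1, 2, 1, 5, 6, 2, 1, 5, 2, 4]
print(parse_likelihood(rolls, ["Fair"] * 10, start_p, trans_p, emit_p))    # ≈ 5.2e-09
print(parse_likelihood(rolls, ["Loaded"] * 10, start_p, trans_p, emit_p))  # ≈ 1.6e-10
```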
7. Overview of the Three Main Questions on HMMs
We can define the three main questions on HMMs as follows:
• Evaluation: What is the probability of the observed sequence?
  o Given: an HMM M, and a sequence x
  o Find: Prob[ x | M ]
  o Algorithm: Forward
  o Example of dishonest casino: How likely is this sequence, given our
    model of how the casino works?
• Decoding: Find the most likely parse of a sequence
  o Given: an HMM M, and a sequence x
  o Find: the sequence π of states that maximizes P[ x, π | M ]
  o Algorithm: Viterbi, Forward-Backward
  o Example of dishonest casino: What portion of the sequence was generated
    with the fair die, and what portion with the loaded die?
• Learning: Under what parameterization are the observed sequences most probable?
  o Given: an HMM M, with unspecified transition/emission probabilities,
    and a sequence x
  o Find: parameters θ = (ei(.), aij) that maximize P[ x | θ ]
  o Algorithm: Baum-Welch
  o Example of dishonest casino: How “loaded” is the loaded die? How “fair”
    is the fair die? How often does the casino player change from fair to
    loaded, and back?
Before discussing the algorithms, let’s talk about the notation:
P[ x | M ]: the probability that sequence x was generated by the model,
where the model is the architecture (number of states, etc.) + the parameters θ = (aij, ei(.)).
So, P[ x | M ] is the same as P[ x | θ ] and P[ x ] when the architecture and the
parameters, respectively, are implied. Similarly, P[ x, π | M ], P[ x, π | θ ], and P[ x, π ]
are the same when the architecture and the parameters are implied. In the LEARNING
problem we always write P[ x | θ ] to emphasize that we are seeking the θ* that maximizes
P[ x | θ ].
8. Decoding and Viterbi Algorithm
The purpose of decoding is to find the hidden state sequence that was most likely to have
produced a given observation sequence. One way to solve this problem is to use the
Viterbi algorithm.
Derivation of the Viterbi Algorithm
We can formally describe the decoding problem as follows:
Given observations: x = x1x2……xN
Find the sequence of states π = π1, …, πN that maximizes P[ x, π ]. In other words, find
π* = argmaxπ P[ x, π ], the parse that maximizes a0π1 eπ1(x1) aπ1π2 …… aπN-1πN eπN(xN).
This problem can be solved with dynamic programming. Let Vk(i) be the probability of the
most likely sequence of states ending at state πi = k:
Vk(i) = max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ]
Given Vk(i), we are looking for a general formula to compute Vl(i+1) for the iteration
step of the dynamic programming.
From the definition of Vl(i+1):
Vl(i+1) = max{π1…πi} P[ x1…xi, π1, …, πi, xi+1, πi+1 = l ]
By the chain rule,
= max{π1…πi} P(xi+1, πi+1 = l | x1…xi, π1, …, πi) P[ x1…xi, π1, …, πi ]
With the Markov assumption,
= max{π1…πi} P(xi+1, πi+1 = l | πi) P[ x1…xi-1, π1, …, πi-1, xi, πi ]
Separating the two max functions,
= maxk [ P(xi+1, πi+1 = l | πi = k) max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ] ]
By the definition of Vk(i), we replace the inner maximization,
= maxk [ P(xi+1, πi+1 = l | πi = k) Vk(i) ]
By the chain rule,
= maxk [ P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) Vk(i) ]
By the definition of the emission and transition probabilities,
= el(xi+1) maxk [ akl Vk(i) ]
Viterbi Algorithm
Combining with the iteration step, the overall Viterbi Algorithm is as follows:
Input: x = x1……xN
Initialization:
    V0(0) = 1        (0 is the imaginary first position)
    Vk(0) = 0, for all k > 0
Iteration:
    Vj(i) = ej(xi) × maxk [ akj Vk(i – 1) ]
    Ptrj(i) = argmaxk [ akj Vk(i – 1) ]
Termination:
    P(x, π*) = maxk Vk(N)
Traceback:
    πN* = argmaxk Vk(N)
    πi-1* = Ptrπi*(i)
[Figure: trellis over states 1…K and positions x1…xN; given that we end up in state k at step i, the recursion maximizes the product of terms to the left and to the right.]
The backtracking allows the best state sequence to be found from the back pointers stored
in the recursion step, but it should be noted that there is no easy way to find the second
best state sequence.
Complexity of the Viterbi Algorithm
The figure below shows how each Vj(i) looks for the max value over all the previous
Vk(i-1) in the iteration step. It also indicates that a K × N matrix is needed while running
the Viterbi algorithm.
[Figure: K × N dynamic-programming matrix with states 1…K as rows and positions x1…xN as columns; each cell holds Vj(i).]
Since each Vj(i) needs O(K) time to compute and the matrix size is K × N, the space
complexity is O(KN) and the time complexity is O(K²N).
Underflow Issues
When calculating
P[ x1, …, xi, π1, …, πi ] = a0π1 aπ1π2 …… aπi-1πi eπ1(x1) …… eπi(xi),
extremely small numbers are produced and the problem of underflow occurs. To fix
this issue, we can take the logs of all values, so that the recursion becomes
Vl(i) = log el(xi) + maxk [ Vk(i-1) + log akl ]
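A minimal log-space Viterbi sketch along these lines (our own function names; it assumes the dictionary representation of the HMM used in the earlier sketches, and that no probability is exactly zero so all logs are defined):

```python
import math

def viterbi(x, start_p, trans_p, emit_p):
    """Return the most likely parse pi* and log P(x, pi*), working in log space."""
    states = list(start_p)
    # V[i][k] = log probability of the best path emitting x[0..i] and ending in state k.
    V = [{k: math.log(start_p[k]) + math.log(emit_p[k][x[0]]) for k in states}]
    ptr = [{}]  # ptr[i][k] = best previous state for state k at position i
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for k in states:
            # Maximize V_l(i-1) + log a_lk over the previous state l.
            best_l = max(states, key=lambda l: V[i - 1][l] + math.log(trans_p[l][k]))
            ptr[i][k] = best_l
            V[i][k] = (V[i - 1][best_l] + math.log(trans_p[best_l][k])
                       + math.log(emit_p[k][x[i]]))
    # Termination: pick the best final state, then trace the pointers back.
    last = max(states, key=lambda k: V[-1][k])
    parse = [last]
    for i in range(len(x) - 1, 0, -1):
        parse.append(ptr[i][parse[-1]])
    parse.reverse()
    return parse, V[-1][last]
```

Run with the casino parameters and a sequence of rolls, it returns the most likely Fair/Loaded labeling together with log P(x, π*).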
Back to the Dishonest Casino Example
Let x be a long sequence in which roughly 1/6 of the rolls in the first portion are 6’s and
roughly ½ of the rolls in the second portion are 6’s:
x = 123456123456…12345 6626364656…1626364656
Then, it is not hard to show that the optimal parse is:
π = FFF…………………...F LLL………………………...L
If the 6-character substring “123456” is picked from the first portion:
parsed as F, it contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5
parsed as L, it contributes 0.95^6 × (1/2)^1 × (1/10)^5 = 0.4 × 10^-5
thus, the first portion is more likely to be parsed as Fair.
If the 6-character substring “162636” is picked from the second portion:
parsed as F, it contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5
parsed as L, it contributes 0.95^6 × (1/2)^3 × (1/10)^3 = 9.0 × 10^-5
thus, the second portion is more likely to be parsed as Loaded.
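These contributions can be checked with a quick bit of arithmetic (a sketch using only the probabilities quoted above):

```python
# Contribution of a 6-roll window, using the casino probabilities from the lecture.
stay = 0.95 ** 6                            # staying in the same state across the window
print(stay * (1 / 6) ** 6)                  # "123456" or "162636" parsed as F: ~1.6e-05
print(stay * (1 / 2) * (1 / 10) ** 5)       # "123456" parsed as L (one 6):      ~0.4e-05
print(stay * (1 / 2) ** 3 * (1 / 10) ** 3)  # "162636" parsed as L (three 6's):  ~9.0e-05
```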
9. Evaluation and Forward Algorithm
Given an HMM and a sequence of observations x, we’d like to be able to compute the
probability of the observation sequence given the model. So, we can ask:
• What is the probability that x was generated by the model? Or, what is P(x)?
• What is the probability of a substring of x given the model? Or, what is P(xi…xj)?
• Given a position i, what is the most likely state that emitted xi? Or, what is P(πi = k | x)?
These problems can be viewed as evaluating how well a model predicts a given
observation sequence, and thus allow us to choose the most appropriate model from a set.
One way of solving the evaluation problem is to use the forward algorithm.
Derivation of the Forward Algorithm
We can formally describe the evaluation problem as follows:
Given observations: x = x1x2……xN
Find: the probability of x, P(x)
We can find P(x) by summing over all the possible ways of producing x:
P(x) = Σπ P(x, π) = Σπ P(x | π) P(π)
By doing so, we are summing over an exponential number of paths π. To avoid that, we
solve the problem with dynamic programming. Let’s define the forward probability as
the probability of generating the first i characters of x and ending up in state k:
fk(i) = P(x1…xi, πi = k)
Given fl(i-1), we are looking for a general formula to compute fk(i) for the iteration step
of the dynamic programming.
From the definition of fk(i):
fk(i) = P(x1…xi, πi = k)
Summing over all possible paths,
= Σ{π1…πi-1} P(x1…xi-1, π1, …, πi-1, πi = k) ek(xi)
Separating the last state from all the previous states,
= Σl Σ{π1…πi-2} P(x1…xi-1, π1, …, πi-2, πi-1 = l) alk ek(xi)
By the Markov assumption,
= Σl P(x1…xi-1, πi-1 = l) alk ek(xi)
By the definition of fl(i-1), we replace the inner sum,
= ek(xi) Σl fl(i – 1) alk
Forward Algorithm
Combining with the iteration step, the overall Forward Algorithm is as follows:
Input: x = x1……xN
Initialization:
    f0(0) = 1
    fk(0) = 0, for all k > 0
Iteration:
    fk(i) = ek(xi) × Σl fl(i – 1) alk
Termination:
    P(x) = Σk fk(N)
The main difference from the Viterbi algorithm is that in the recursion step we sum rather
than maximize.
Complexity of the Forward Algorithm
As with the Viterbi algorithm, the space complexity is O(KN) and the time complexity is O(K²N).
Underflow Issues
Like the Viterbi algorithm, the forward algorithm can give underflow errors when
implemented on a computer. Thus, we can address the issue by working in log space as well.
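A corresponding log-space sketch of the forward algorithm (again with our own function names, assuming the same dictionary representation of the HMM as in the earlier sketches; a small log-sum-exp helper replaces the sums so that the result is log P(x)):

```python
import math

def logsumexp(vals):
    """Compute log(sum(exp(v) for v in vals)) without underflow."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward(x, start_p, trans_p, emit_p):
    """Return log P(x) under the model, working in log space."""
    states = list(start_p)
    # f[k] = log f_k(i), the log forward probability for the current position i.
    f = {k: math.log(start_p[k]) + math.log(emit_p[k][x[0]]) for k in states}
    for xi in x[1:]:
        f = {k: math.log(emit_p[k][xi])
                + logsumexp([f[l] + math.log(trans_p[l][k]) for l in states])
             for k in states}
    # Termination: P(x) = sum over k of f_k(N), i.e. logsumexp of the final column.
    return logsumexp(list(f.values()))
```

Exponentiating the result gives P(x); for long sequences the log value itself is usually what one compares across models.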
10. Comparison between Viterbi Algorithm and Forward Algorithm
VITERBI
    Initialization:  V0(0) = 1;  Vk(0) = 0, for all k > 0
    Iteration:       Vj(i) = ej(xi) × maxk [ akj Vk(i – 1) ]
    Termination:     P(x, π*) = maxk Vk(N)

FORWARD
    Initialization:  f0(0) = 1;  fk(0) = 0, for all k > 0
    Iteration:       fl(i) = el(xi) × Σk fk(i – 1) akl
    Termination:     P(x) = Σk fk(N)
The backward algorithm is explored in the next lecture.