CS 262 Computational Genomics
Lecture 7: Hidden Markov Models Intro
Scribe: Vivian Lin
January 31, 2012
1. Introduction
A Hidden Markov Model (HMM) is a statistical model for sequences of symbols; it is used
in many different fields of computer science for modeling sequential data. In
computational biology, the first HMM-based program, GENSCAN, was built in 1997.
This program was developed by Chris Burge at Stanford University to predict the
location of genes.
In this lecture, the HMM definition, its properties, and the three main questions (evaluation,
decoding, and learning) are discussed. Throughout the discussion, the example of the
dishonest casino is used for better understanding of the material. In addition, the
Viterbi algorithm and the forward algorithm are presented to solve the decoding and
evaluation questions, respectively.
2. Example: The Dishonest Casino
The lecture started with the dishonest casino example to illustrate HMMs. Consider a
casino that has two dice:
• Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
• Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10
P(6) = 1/2
The casino player switches back and forth between the fair and the loaded die approximately once every 20 turns.
The rules of the game are:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2
Then, a sequence of rolls by the casino player is given:
1245526462146146136136661664661636616366163616515615115146123562344
With this given sequence, there are three main questions we would be able to answer
using HMM:
1. Evaluation: How likely is this sequence, given our model of how the casino works?
2. Decoding: What portion of the sequence was generated with the fair die, and what
portion with the loaded die?
3. Learning: How “loaded” is the loaded die? How “fair” is the fair die? How often
does the casino player change from fair to loaded, and back?
3. Definition of a Hidden Markov Model
A Markov chain is useful when computing a probability for a sequence of states that we
can observe in the world. In many cases, the states we are interested in may not be
directly observable. In contrast, a hidden Markov model allows us to talk about both
observed states and hidden states that we think of as causal factors in the probabilistic
model.
Formally, the HMM is defined as follows:
• Observation alphabet: Σ = { b1, b2, …, bM }
• Set of hidden states: Q = { 1, ..., K }
• Transition probabilities: alk
  • alk = transition probability from state l to state k = P(πi = k | πi-1 = l).
  • In addition, the transition probabilities from a given state to every state
    (including itself) should sum to 1: ai1 + … + aiK = 1, for all states i = 1…K.
• Start probabilities: a0i
  • a0i = probability of starting in state i.
  • Note that the start probabilities should also sum to 1: a01 + … + a0K = 1.
• Emission probabilities: ek(b)
  • ek(b) = the probability that symbol b is seen when in state k = P(xi = b | πi = k).
  • Again, the emission probabilities of each state should sum to 1: ei(b1)
    + … + ei(bM) = 1, for all states i = 1…K.
[Figure: a generic HMM with hidden states 1, 2, …, K.]
Note that the end probabilities ai0 used in Durbin (Biological Sequence Analysis) are not needed
here, since we assume the length of the sequence we are analyzing is given.
In the dishonest casino example, we can represent the model with the following graphical
model and settings:
[Figure: two-state HMM for the dishonest casino]
FAIR:   P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
LOADED: P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2
Transitions: Fair→Fair = Loaded→Loaded = 0.95; Fair→Loaded = Loaded→Fair = 0.05
Alphabet: Σ = numbers on the dice = { 1, 2, 3, 4, 5, 6 }
Set of states: Q = { Fair, Loaded }
Transition probabilities:
• probability from fair to loaded (0.05)
• probability from fair to fair (0.95)
• probability from loaded to fair (0.05)
• probability from loaded to loaded (0.95)
Start probabilities: probability of starting with a loaded die or a fair die
Emission probabilities: probabilities of rolling a certain number in a fair or loaded
state. For instance, the probability of rolling a 6 is 1/6 in the fair state but ½ in the
loaded state.
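To make this parameterization concrete, here is a minimal sketch of the casino HMM in Python (not part of the original notes; the variable names start_p, trans_p, and emit_p are our own, and the ½/½ start probabilities are the ones assumed in the worked example of Section 6):

```python
# A minimal, illustrative encoding of the dishonest casino HMM.
# The parameter values are the ones used in the lecture; the variable
# names are arbitrary choices for this sketch.

states = ["Fair", "Loaded"]
alphabet = [1, 2, 3, 4, 5, 6]

# Start probabilities a_0i (the 1/2, 1/2 split assumed in the worked example).
start_p = {"Fair": 0.5, "Loaded": 0.5}

# Transition probabilities a_lk: stay with probability 0.95, switch with 0.05.
trans_p = {
    "Fair":   {"Fair": 0.95, "Loaded": 0.05},
    "Loaded": {"Fair": 0.05, "Loaded": 0.95},
}

# Emission probabilities e_k(b).
emit_p = {
    "Fair":   {b: 1 / 6 for b in alphabet},
    "Loaded": {b: (1 / 2 if b == 6 else 1 / 10) for b in alphabet},
}
```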
4. An HMM is memoryless
Memorylessness is an important property of an HMM. This property is based on two
assumptions: the Markov assumption and the independence assumption. Because of these two
assumptions, solving problems becomes simple. The Markov assumption states that the
current state depends only on the previous state. In other words, at each time step t,
the only thing that affects future states is the current state πt:
P(πt+1 = k | “whatever happened so far”) =
P(πt+1 = k | π1, π2, …, πt, x1, x2, …, xt) =
P(πt+1 = k | πt)
The second assumption, the independence assumption, states that the output observation at
time t depends only on the current state. In other words, at each time step t, the only
thing that affects xt is the current state πt:
P(xt = b | “whatever happened so far”) =
P(xt = b | π1, π2, …, πt, x1, x2, …, xt-1) =
P(xt = b | πt)
5. A parse of a sequence
Given a sequence of symbols x = x1……xN, and
a sequence of states, called a parse, π = π1, ……, πN,
we can represent a parse as follows:
[Figure: trellis of the hidden states 1, 2, …, K over positions 1…n; each column emits one symbol x1, x2, …, xn, and the start transition a02 and the emission e2(x1) are highlighted.]
Given an HMM like the one above, we can generate a sequence of length n, as sketched in the
code below. First, we begin at a start state, then we generate an emission, and then we move
from one state to another. Formally, we have:
1. Start at state π1 according to probability a0π1
2. Emit letter x1 according to probability eπ1(x1)
3. Go to state π2 according to probability aπ1π2
4. … until emitting xn
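As an illustration of this generative procedure, here is a minimal sketch in Python (not part of the original notes): the function name generate_sequence is our own, and it assumes the start_p, trans_p, and emit_p dictionaries from the earlier casino sketch (or any HMM encoded the same way).

```python
import random

def generate_sequence(n, start_p, trans_p, emit_p, seed=None):
    """Sample a parse (hidden states) and a sequence of n emitted symbols."""
    rng = random.Random(seed)
    states = list(start_p)
    # Step 1: choose the first state according to the start probabilities a_0i.
    state = rng.choices(states, weights=[start_p[s] for s in states])[0]
    parse, symbols = [], []
    for _ in range(n):
        parse.append(state)
        # Step 2: emit a symbol according to the emission probabilities e_k(b).
        letters = list(emit_p[state])
        symbols.append(rng.choices(letters, weights=[emit_p[state][b] for b in letters])[0])
        # Step 3: move to the next state according to the transition probabilities a_lk.
        state = rng.choices(states, weights=[trans_p[state][s] for s in states])[0]
    return parse, symbols

# Example (using the casino parameters sketched earlier):
# parse, rolls = generate_sequence(100, start_p, trans_p, emit_p, seed=0)
```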
6. Likelihood of a Parse
Given a sequence x = x1……xN and a parse π = π1, ……, πN, we can find how likely this
parse is. From the given HMM, we can calculate this by multiplying all the probabilities of
emissions and transitions:
P(x, π) = P(x1, …, xN, π1, ……, πN)
= P(xN | πN) P(πN | πN-1) …… P(x2 | π2) P(π2 | π1) P(x1 | π1) P(π1)
= a0π1 aπ1π2 …… aπN-1πN eπ1(x1) …… eπN(xN)
Now, the equation can be written in a more compact way by letting each parameter aij
and ei(b) be represented by a θj. In our dishonest casino example, we have 18 parameters in
total: 2 start probabilities (a0Fair : θ1; a0Loaded : θ2), 4 transition probabilities (aFairFair : θ3,
aFairLoaded : θ4, aLoadedFair : θ5, aLoadedLoaded : θ6), and 12 emission probabilities (6 from each of the
two states).
Then, we define the feature count F(j, x, π) to be the number of times parameter θj occurs
in (x, π). Replacing all the aij and ei(b) with the feature counts, the compact equation for the
parse likelihood becomes
P(x, π) = Πj θj^F(j, x, π) = exp[ Σj log(θj) F(j, x, π) ], where j ranges over the parameters.
Let’s practice this equation with our dishonest casino example. Suppose our initial
probabilities are a0Fair = ½ and a0Loaded = ½, and we have a sequence of rolls x = 1, 2, 1, 5,
6, 2, 1, 5, 2, 4. Then the likelihood of π = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair,
Fair is
P(start at Fair) P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
= ½ × (1/6)^10 × (0.95)^9 = .00000000521158647211 ≈ 5.2 × 10^-9
where ½ is the start probability, 1/6 is the emission probability from the fair die, and 0.95
is the transition probability from the Fair state to the Fair state.
With the same sequence of rolls, the likelihood of π = Loaded, Loaded, Loaded, Loaded,
Loaded, Loaded, Loaded, Loaded, Loaded, Loaded is
P(start at Loaded) P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
= ½ × (1/10)^9 × (1/2)^1 × (0.95)^9 = .00000000015756235243 ≈ 1.6 × 10^-10
where the first ½ is the start probability, 1/10 is the emission probability of NOT rolling a
six from the loaded die, the second ½ is the emission probability of rolling a six from the
loaded die, and 0.95 is the transition probability from the Loaded state to the Loaded state.
Therefore, it is somewhat more likely that all the rolls were made with the fair die than that
they were all made with the loaded die.
Now, change the sequence of rolls to x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6. The likelihood of π = F,
F, …, F equals ½ × (1/6)^10 × (0.95)^9 ≈ 5.2 × 10^-9, which is the same as before. But
the likelihood of π = L, L, …, L equals ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 ≈ 4.9 × 10^-7. This
time, it is about 100 times more likely that the die is loaded.
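To check these numbers, here is a small sketch (the helper parse_likelihood is our own; it assumes the casino parameter dictionaries from the earlier sketch) that multiplies out the start, transition, and emission probabilities of a parse:

```python
def parse_likelihood(x, parse, start_p, trans_p, emit_p):
    """P(x, pi) = a_{0,pi_1} e_{pi_1}(x_1) * product over i>1 of a_{pi_{i-1} pi_i} e_{pi_i}(x_i)."""
    p = start_p[parse[0]] * emit_p[parse[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans_p[parse[i - 1]][parse[i]] * emit_p[parse[i]][x[i]]
    return p

rolls = [1, 2, 1, 5, 6, 2, 1, 5, 2, 4]
print(parse_likelihood(rolls, ["Fair"] * 10, start_p, trans_p, emit_p))    # ≈ 5.2e-09
print(parse_likelihood(rolls, ["Loaded"] * 10, start_p, trans_p, emit_p))  # ≈ 1.6e-10
```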
7. Overview of the Three Main Questions on HMMs
We can define the three main questions on HMMs as follows:
• Evaluation: What is the probability of the observed sequence?
  o Given: an HMM M, and a sequence x
  o Find: Prob[ x | M ]
  o Algorithm: Forward
  o Example of dishonest casino: How likely is this sequence, given our
    model of how the casino works?
• Decoding: Find the most likely parse of a sequence
  o Given: an HMM M, and a sequence x
  o Find: the sequence π of states that maximizes P[ x, π | M ]
  o Algorithm: Viterbi, Forward-Backward
  o Example of dishonest casino: What portion of the sequence was generated
    with the fair die, and what portion with the loaded die?
• Learning: Under what parameterization are the observed sequences most probable?
  o Given: an HMM M, with unspecified transition/emission probabilities,
    and a sequence x
  o Find: parameters θ = (ei(.), aij) that maximize P[ x | θ ]
  o Algorithm: Baum-Welch
  o Example of dishonest casino: How “loaded” is the loaded die? How “fair”
    is the fair die? How often does the casino player change from fair to
    loaded, and back?
Before discussing the algorithms, let’s talk about the notation:
P[ x | M ]: the probability that sequence x was generated by the model,
where the model is the architecture (number of states, etc.) + the parameters θ = (aij, ei(.)).
So, P[ x | M ] is the same as P[ x | θ ] and P[ x ] when the architecture and the
parameters, respectively, are implied. Similarly, P[ x, π | M ], P[ x, π | θ ], and P[ x, π ]
are the same when the architecture and the parameters are implied. In the LEARNING
problem we always write P[ x | θ ] to emphasize that we are seeking the θ* that maximizes
P[ x | θ ].
8. Decoding and Viterbi Algorithm
The purpose of decoding is to find the hidden state sequence that was most likely to have
produced a given observation sequence. One way to solve this problem is to use the
Viterbi algorithm.
Derivation of the Viterbi Algorithm
We can formally describe the decoding problem as follows:
Given observations: x = x1x2……xN
Find the sequence of states π = π1, …, πN that maximizes P[ x, π ]. In other words, find
π* = argmaxπ P[ x, π ], the parse that maximizes a0π1 eπ1(x1) aπ1π2 …… aπN-1πN eπN(xN).
This problem can be solved with dynamic programming. Let Vk(i) be the probability of the
most likely sequence of states ending at state πi = k:
Vk(i) = max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ]
Given Vk(i), we are looking for a general formula to compute Vl(i+1) for the iteration
step of the dynamic programming.
From the definition of Vl(i+1):
Vl(i+1) = max{π1…πi} P[ x1…xi, π1, …, πi, xi+1, πi+1 = l ]
By the chain rule,
= max{π1…πi} P(xi+1, πi+1 = l | x1…xi, π1, …, πi) P[ x1…xi, π1, …, πi ]
With the Markov assumption,
= max{π1…πi} P(xi+1, πi+1 = l | πi) P[ x1…xi-1, π1, …, πi-1, xi, πi ]
Separating the two max functions,
= maxk [ P(xi+1, πi+1 = l | πi = k) max{π1…πi-1} P[ x1…xi-1, π1, …, πi-1, xi, πi = k ] ]
By the definition of Vk(i), we replace the inner maximization,
= maxk [ P(xi+1, πi+1 = l | πi = k) Vk(i) ]
By the chain rule,
= maxk [ P(xi+1 | πi+1 = l) P(πi+1 = l | πi = k) Vk(i) ]
By the definition of the emission and transition probabilities,
= el(xi+1) maxk [ akl Vk(i) ]
Viterbi Algorithm
Combining with the iteration step, the overall Viterbi Algorithm is as follows:
Input: x = x1……xN
Initialization:
    V0(0) = 1        (0 is the imaginary first position)
    Vk(0) = 0, for all k > 0
Iteration:
    Vj(i) = ej(xi) × maxk [ akj Vk(i – 1) ]
    Ptrj(i) = argmaxk [ akj Vk(i – 1) ]
Termination:
    P(x, π*) = maxk Vk(N)
Traceback:
    πN* = argmaxk Vk(N)
    πi-1* = Ptrπi*(i)
[Figure: trellis over states 1…K and positions x1…xN; given that we end up in state k at step i, the recursion maximizes the product of terms to the left and to the right.]
The backtracking allows the best state sequence to be found from the back pointers stored
in the recursion step, but it should be noted that there is no easy way to find the second
best state sequence.
Complexity of the Viterbi Algorithm
The figure below shows how each Vj(i) looks for the max value over all the previous
Vk(i-1) in the iteration step. It also indicates that a K × N matrix is needed while running
the Viterbi algorithm.
[Figure: K × N dynamic-programming matrix with states 1…K as rows and positions x1…xN as columns; each cell holds Vj(i).]
Since each Vj(i) needs O(K) time to compute and the matrix size is K × N, the space
complexity is O(KN) and the time complexity is O(K²N).
Underflow Issues
When calculating
P[ x1, …, xi, π1, …, πi ] = a0π1 aπ1π2 …… aπi-1πi eπ1(x1) …… eπi(xi),
extremely small numbers are produced and the problem of underflow occurs. To fix
this issue, we can take the logs of all values, so that the recursion becomes
Vl(i) = log el(xi) + maxk [ Vk(i-1) + log akl ]
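A minimal log-space Viterbi sketch along these lines (our own function names; it assumes the dictionary representation of the HMM used in the earlier sketches, and that no probability is exactly zero so all logs are defined):

```python
import math

def viterbi(x, start_p, trans_p, emit_p):
    """Return the most likely parse pi* and log P(x, pi*), working in log space."""
    states = list(start_p)
    # V[i][k] = log probability of the best path emitting x[0..i] and ending in state k.
    V = [{k: math.log(start_p[k]) + math.log(emit_p[k][x[0]]) for k in states}]
    ptr = [{}]  # ptr[i][k] = best previous state for state k at position i
    for i in range(1, len(x)):
        V.append({})
        ptr.append({})
        for k in states:
            # Maximize V_l(i-1) + log a_lk over the previous state l.
            best_l = max(states, key=lambda l: V[i - 1][l] + math.log(trans_p[l][k]))
            ptr[i][k] = best_l
            V[i][k] = (V[i - 1][best_l] + math.log(trans_p[best_l][k])
                       + math.log(emit_p[k][x[i]]))
    # Termination: pick the best final state, then trace the pointers back.
    last = max(states, key=lambda k: V[-1][k])
    parse = [last]
    for i in range(len(x) - 1, 0, -1):
        parse.append(ptr[i][parse[-1]])
    parse.reverse()
    return parse, V[-1][last]
```

Run with the casino parameters and a sequence of rolls, it returns the most likely Fair/Loaded labeling together with log P(x, π*).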
Back to the Dishonest Casino Example
Let x be a long sequence in which roughly 1/6 of the rolls in the first portion are 6’s and
roughly ½ of the rolls in the second portion are 6’s:
x = 123456123456…12345 6626364656…1626364656
Then, it is not hard to show that the optimal parse is:
π = FFF…………………...F LLL………………………...L
If the 6-character substring “123456” is picked from the first portion:
parsed as F, it contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5
parsed as L, it contributes 0.95^6 × (1/2)^1 × (1/10)^5 = 0.4 × 10^-5
thus, the first portion is more likely to be parsed as Fair.
If the 6-character substring “162636” is picked from the second portion:
parsed as F, it contributes 0.95^6 × (1/6)^6 = 1.6 × 10^-5
parsed as L, it contributes 0.95^6 × (1/2)^3 × (1/10)^3 = 9.0 × 10^-5
thus, the second portion is more likely to be parsed as Loaded.
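These contributions can be checked with a quick bit of arithmetic (a sketch using only the probabilities quoted above):

```python
# Contribution of a 6-roll window, using the casino probabilities from the lecture.
stay = 0.95 ** 6                            # staying in the same state across the window
print(stay * (1 / 6) ** 6)                  # "123456" or "162636" parsed as F: ~1.6e-05
print(stay * (1 / 2) * (1 / 10) ** 5)       # "123456" parsed as L (one 6):      ~0.4e-05
print(stay * (1 / 2) ** 3 * (1 / 10) ** 3)  # "162636" parsed as L (three 6's):  ~9.0e-05
```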
9. Evaluation and Forward Algorithm
Given an HMM and a sequence of observations x, we’d like to be able to compute the
probability of the observation sequence given the model. So, we can ask:
• What is the probability that x was generated by the model? Or, what is P(x)?
• What is the probability of a substring of x given the model? Or, what is P(xi…xj)?
• Given a position i, what is the most likely state that emitted xi? Or, what is P(πi = k | x)?
These problems can be viewed as evaluating how well a model predicts a given
observation sequence, and thus allow us to choose the most appropriate model from a set.
One way of solving the evaluation problem is to use the forward algorithm.
Derivation of the Forward Algorithm
We can formally describe the evaluation problem as follows:
Given observations: x = x1x2……xN
Find: the probability of x, P(x)
We can find P(x) by summing over all the possible ways of producing x:
P(x) = Σπ P(x, π) = Σπ P(x | π) P(π)
By doing so, we are summing over an exponential number of paths π. To avoid that, we
solve the problem with dynamic programming. Let’s define the forward probability as
the probability of generating the first i characters of x and ending up in state k:
fk(i) = P(x1…xi, πi = k)
Given fl(i-1), we are looking for a general formula to compute fk(i) for the iteration step
of the dynamic programming.
From the definition of fk(i):
fk(i) = P(x1…xi, πi = k)
Summing over all possible paths,
= Σ{π1…πi-1} P(x1…xi-1, π1, …, πi-1, πi = k) ek(xi)
Separating the last state from all the previous states,
= Σl Σ{π1…πi-2} P(x1…xi-1, π1, …, πi-2, πi-1 = l) alk ek(xi)
By the Markov assumption,
= Σl P(x1…xi-1, πi-1 = l) alk ek(xi)
By the definition of fl(i-1), we replace the inner sum,
= ek(xi) Σl fl(i – 1) alk
Forward Algorithm
Combining with the iteration step, the overall Forward Algorithm is as follows:
Input: x = x1……xN
Initialization:
    f0(0) = 1
    fk(0) = 0, for all k > 0
Iteration:
    fk(i) = ek(xi) × Σl fl(i – 1) alk
Termination:
    P(x) = Σk fk(N)
The main difference from the Viterbi algorithm is that in the recursion step we sum rather
than maximize.
Complexity of the Forward Algorithm
As with the Viterbi algorithm, the space complexity is O(KN) and the time complexity is O(K²N).
Underflow Issues
Like the Viterbi algorithm, the forward algorithm can give underflow errors when
implemented on a computer. Thus, we can address the issue by working in log space as well.
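A corresponding log-space sketch of the forward algorithm (again with our own function names, assuming the same dictionary representation of the HMM as in the earlier sketches; a small log-sum-exp helper replaces the sums so that the result is log P(x)):

```python
import math

def logsumexp(vals):
    """Compute log(sum(exp(v) for v in vals)) without underflow."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward(x, start_p, trans_p, emit_p):
    """Return log P(x) under the model, working in log space."""
    states = list(start_p)
    # f[k] = log f_k(i), the log forward probability for the current position i.
    f = {k: math.log(start_p[k]) + math.log(emit_p[k][x[0]]) for k in states}
    for xi in x[1:]:
        f = {k: math.log(emit_p[k][xi])
                + logsumexp([f[l] + math.log(trans_p[l][k]) for l in states])
             for k in states}
    # Termination: P(x) = sum over k of f_k(N), i.e. logsumexp of the final column.
    return logsumexp(list(f.values()))
```

Exponentiating the result gives P(x); for long sequences the log value itself is usually what one compares across models.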
10. Comparison between Viterbi Algorithm and Forward Algorithm
VITERBI
    Initialization:  V0(0) = 1;  Vk(0) = 0, for all k > 0
    Iteration:       Vj(i) = ej(xi) × maxk [ akj Vk(i – 1) ]
    Termination:     P(x, π*) = maxk Vk(N)

FORWARD
    Initialization:  f0(0) = 1;  fk(0) = 0, for all k > 0
    Iteration:       fl(i) = el(xi) × Σk fk(i – 1) akl
    Termination:     P(x) = Σk fk(N)
The backward algorithm is explored in the next lecture.