The renowned article describing hidden Markov models (HMMs). I may not have enough background knowledge to appreciate the speech recognition part, but the article is very clear and easy to follow for understanding the HMM mechanics.

For some background: a Markov chain models an autonomous system whose state is observable. If we can influence the state transitions of the system, we have a Markov decision process. An HMM, in contrast, models an autonomous system whose state is unobservable.

A Markov process is modelled as follows: we have a set of states $S = \{S_1, S_2, \ldots, S_N\}$, the state at time $t$ is specified as $q_t$ (with the state sequence denoted $Q = q_1 q_2 \cdots q_T$), and the transition probability is specified by

$$a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i), \qquad 1 \le i, j \le N.$$

The hidden Markov model introduces an observation sequence on top of the Markov process. We can describe an HMM as follows: there is a set of $N$ urns and a Markov process of selecting an urn. A ball is then drawn from the selected urn with replacement; each ball has one of a set of $M$ colours $V = \{v_1, v_2, \ldots, v_M\}$, i.e. $O_t \in V$. Only the sequence of balls drawn, $O = O_1 O_2 \cdots O_T$, is observable. The observation probability is given by

$$b_j(k) = P(O_t = v_k \mid q_t = S_j), \qquad 1 \le j \le N,\ 1 \le k \le M,$$

and the initial state probability is given by

$$\pi_i = P(q_1 = S_i), \qquad 1 \le i \le N.$$

The HMM is parameterized by $\lambda = (A, B, \pi)$, where $A = [a_{ij}]$ is the state transition probability matrix, $B = [b_j(k)]$ is the observation probability matrix, and $\pi = [\pi_i]$ is the vector of initial state probabilities.
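
As a concrete illustration, the parameters can be written out as arrays. This is a minimal sketch of a 2-state, 3-symbol model with made-up numbers, not an example from the paper:

```python
import numpy as np

# A toy 2-state, 3-symbol HMM (all numbers are arbitrary, for illustration only).
# States: 0 = "urn 1", 1 = "urn 2"; symbols: 0 = red, 1 = green, 2 = blue.
A = np.array([[0.7, 0.3],       # A[i, j] = P(q_{t+1} = S_j | q_t = S_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[j, k] = P(O_t = v_k | q_t = S_j)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # pi[i] = P(q_1 = S_i)

# Each row of A and B, and the vector pi, must sum to 1.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and np.isclose(pi.sum(), 1)
```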

The HMM is usually used to solve one of the following three problems:

  1. $P(O \mid \lambda)$: Given the model $\lambda$, find the probability of a given observation sequence $O$.
  2. $Q$: Given the observation sequence $O$ and the model $\lambda$, determine the optimal state sequence $Q = q_1 q_2 \cdots q_T$ that corresponds to $O$.
  3. $\lambda$: Given the observation sequence $O$, determine the model parameters $\lambda$ that maximize $P(O \mid \lambda)$.

Below we address each of these problems:

Solution to $P(O \mid \lambda)$

Consider a fixed state sequence $Q = q_1 q_2 \cdots q_T$. Then

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} b_{q_t}(O_t) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)$$

and

$$P(Q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}.$$

Hence

$$P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda) = \sum_{q_1, \ldots, q_T} \pi_{q_1} b_{q_1}(O_1)\, a_{q_1 q_2} b_{q_2}(O_2) \cdots a_{q_{T-1} q_T} b_{q_T}(O_T),$$

where the last equation above is explained as follows: the product inside the summation involves roughly $2T$ factors and computes the probability that, at $t = 1$, we start in state $q_1$ with probability $\pi_{q_1}$; in state $q_1$, observation $O_1$ is generated with probability $b_{q_1}(O_1)$; at $t = 2$, the transition from $q_1$ to $q_2$ happens with probability $a_{q_1 q_2}$; and so on until state $q_T$, observing $O_T$ with probability $b_{q_T}(O_T)$. This probability is then summed over the $N^T$ possible state sequences $Q$.
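
To make the summation concrete, here is a deliberately naive sketch that enumerates all $N^T$ state sequences directly, assuming the toy `A`, `B`, `pi` arrays above:

```python
from itertools import product

def prob_obs_brute_force(obs, A, B, pi):
    """Compute P(O | lambda) by summing over every possible state sequence (naive)."""
    N, T = A.shape[0], len(obs)
    total = 0.0
    for q in product(range(N), repeat=T):              # all N^T state sequences
        p = pi[q[0]] * B[q[0], obs[0]]                  # pi_{q1} * b_{q1}(O_1)
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]    # a_{q_{t-1} q_t} * b_{q_t}(O_t)
        total += p
    return total
```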

A naive computation of this kind is on the order of $2T N^T$ operations, but we can speed it up to the order of $N^2 T$ by dynamic programming on a lattice structure through the forward-backward procedure. Define the forward variable

$$\alpha_t(i) = P(O_1 O_2 \cdots O_t,\ q_t = S_i \mid \lambda).$$

Solving recursively, we have

$$\alpha_1(i) = \pi_i\, b_i(O_1), \qquad \alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big] b_j(O_{t+1}), \qquad P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i).$$
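
The forward recursion translates into a short sketch (again assuming the toy arrays above):

```python
import numpy as np

def forward(obs, A, B, pi):
    """Forward procedure: returns alpha (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
    return alpha, alpha[-1].sum()                      # P(O | lambda) = sum_i alpha_T(i)
```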

For the convenience of the subsequent discussion we can also define the backward counterpart

$$\beta_t(i) = P(O_{t+1} O_{t+2} \cdots O_T \mid q_t = S_i, \lambda),$$

which is solved recursively as

$$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j).$$
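
A corresponding sketch for the backward variable:

```python
import numpy as np

def backward(obs, A, B):
    """Backward procedure: returns beta (T x N)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                          # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])      # sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
    return beta
```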

Solution to $Q$

We can treat each time step separately and find the most likely state at each step. The probability of being in state $S_i$ at time $t$, given the observations, is

$$\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)},$$

where the denominator in the fraction above is there to make $\sum_{i=1}^{N} \gamma_t(i) = 1$, and this allows us to solve $q_t = \operatorname*{argmax}_{1 \le i \le N} \gamma_t(i)$ individually for all $1 \le t \le T$.
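
A sketch of this per-time-step choice, reusing the `forward` and `backward` sketches above:

```python
import numpy as np

def most_likely_states(obs, A, B, pi):
    """Pick the individually most likely state at each time step via gamma_t(i)."""
    alpha, prob = forward(obs, A, B, pi)
    beta = backward(obs, A, B)
    gamma = alpha * beta / prob           # gamma_t(i) = alpha_t(i) beta_t(i) / P(O | lambda)
    return gamma.argmax(axis=1), gamma    # argmax_i gamma_t(i) for each t
```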

This approach, however, maximizes the expected number of individually correct states in the sequence, but not the probability of the sequence as a whole. To find the single best sequence $Q^*$, we use the Viterbi algorithm:

Define the highest running probability along a single path ending in state $S_i$ at time $t$ as

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 q_2 \cdots q_{t-1},\ q_t = S_i,\ O_1 O_2 \cdots O_t \mid \lambda),$$

i.e., this is the highest probability of a partial state sequence $q_1 \cdots q_t$ ending in $S_i$, given the model $\lambda$ and the observation sequence $O_1 \cdots O_t$. The next time step follows recursively,

$$\delta_{t+1}(j) = \Big[ \max_{1 \le i \le N} \delta_t(i)\, a_{ij} \Big] b_j(O_{t+1}),$$

and ultimately $P^* = \max_{1 \le i \le N} \delta_T(i)$. So we have the following algorithm:

Initially we have, for $1 \le i \le N$,

$$\delta_1(i) = \pi_i\, b_i(O_1), \qquad \psi_1(i) = 0.$$

Then for time steps $2 \le t \le T$ and for all states $1 \le j \le N$,

$$\delta_t(j) = \max_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big]\, b_j(O_t), \qquad \psi_t(j) = \operatorname*{argmax}_{1 \le i \le N} \big[ \delta_{t-1}(i)\, a_{ij} \big],$$

and finally, after time step $T$, we backtrack the whole sequence $Q^* = q_1^* q_2^* \cdots q_T^*$:

$$P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \operatorname*{argmax}_{1 \le i \le N} \delta_T(i), \qquad q_t^* = \psi_{t+1}(q_{t+1}^*) \quad \text{for } t = T-1, T-2, \ldots, 1.$$
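
A minimal sketch of the Viterbi recursion and backtracking (probabilities are multiplied directly here; in practice one would work in log space to avoid underflow):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Viterbi algorithm: returns the single best state sequence and its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                       # delta_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A             # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                 # psi_t(j) = argmax_i delta_{t-1}(i) a_ij
        delta[t] = scores.max(axis=0) * B[:, obs[t]]   # delta_t(j) = max_i [...] * b_j(O_t)
    # Backtrack the best path.
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1][q[t + 1]]
    return q, delta[-1].max()
```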

Solution to $\lambda$

By local maximization using the expectation-maximization (EM) method: define the probability of a transition from $S_i$ at time $t$ to $S_j$ at time $t+1$, given the model and the observation sequence, as

$$\xi_t(i, j) = P(q_t = S_i,\ q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)},$$

and hence the probability of being in state $S_i$ at an individual time step is $\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$ for $1 \le t \le T-1$. The corresponding expected number of transitions from state $S_i$ is $\sum_{t=1}^{T-1} \gamma_t(i)$, and the expected number of transitions from state $S_i$ to state $S_j$ is $\sum_{t=1}^{T-1} \xi_t(i, j)$.

Given that we have enough data to train the model, we can estimate the model parameters as follows:

$$\bar{\pi}_i = \gamma_1(i), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \bar{b}_j(k) = \frac{\sum_{t=1,\ O_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}.$$

These estimates depend on $\gamma_t(i)$ and $\xi_t(i, j)$, which in turn depend on $\lambda$. We repeatedly apply these estimation formulas and update $\lambda$ until convergence. This is the re-estimation procedure, and each step gives a greater (or equal) $P(O \mid \lambda)$.
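
A single re-estimation pass can be sketched as follows, reusing the `forward` and `backward` helpers above; this simplification omits the scaling that a practical implementation needs for long sequences:

```python
import numpy as np

def baum_welch_step(obs, A, B, pi):
    """One re-estimation (EM) step; returns updated (A, B, pi). No scaling, so short sequences only."""
    obs = np.asarray(obs)
    N, M, T = A.shape[0], B.shape[1], len(obs)
    alpha, prob = forward(obs, A, B, pi)
    beta = backward(obs, A, B)
    gamma = alpha * beta / prob                                   # gamma_t(i)
    # xi[t, i, j] = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / prob
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]      # expected transitions i->j / out of i
    new_B = np.zeros_like(B)
    for k in range(M):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi
```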

An alternative approach is to solve for the maxima of $P(O \mid \lambda)$ directly using Lagrange multipliers, which yields this set of formulas:

$$\pi_i = \frac{\pi_i \frac{\partial P}{\partial \pi_i}}{\sum_{k=1}^{N} \pi_k \frac{\partial P}{\partial \pi_k}}, \qquad a_{ij} = \frac{a_{ij} \frac{\partial P}{\partial a_{ij}}}{\sum_{k=1}^{N} a_{ik} \frac{\partial P}{\partial a_{ik}}}, \qquad b_j(k) = \frac{b_j(k) \frac{\partial P}{\partial b_j(k)}}{\sum_{l=1}^{M} b_j(l) \frac{\partial P}{\partial b_j(l)}},$$

where $P$ denotes $P(O \mid \lambda)$.

Iterative procedure in Markov decision process

The following formulas are based on the notation used in Wikipedia. It turns out that an iterative procedure converging to the solution is common to this sort of problem.

A Markov decision process is modelled as a tuple $(S, A, P_a, R_a)$, where $S$ is the set of states, $A$ is the set of actions, and

$$P_a(s, s') = P(s_{t+1} = s' \mid s_t = s,\ a_t = a) \qquad \text{and} \qquad R_a(s, s')$$

are, respectively, the probability of transitioning from state $s$ to state $s'$ given action $a$, and the reward of that transition given action $a$.

The MDP problem is to find a policy function $\pi(s)$ that selects the action $a = \pi(s)$ for the state $s$. This is a simplified form, as the function is time-invariant; fixing the policy reduces the MDP to a Markov chain, and

$$P_{\pi(s)}(s, s')$$

becomes the state transition probability matrix. To determine the policy function $\pi$, we aim at maximizing the present value of the reward,

$$\sum_{t=0}^{\infty} \gamma^t R_{a_t}(s_t, s_{t+1}) \qquad \text{with } a_t = \pi(s_t),$$

where $0 \le \gamma < 1$ is a discount factor. The algorithm is to apply the following two updates, for all states $s$, until convergence:

$$V(s) := \sum_{s'} P_{\pi(s)}(s, s') \big( R_{\pi(s)}(s, s') + \gamma V(s') \big),$$

$$\pi(s) := \operatorname*{argmax}_{a} \sum_{s'} P_a(s, s') \big( R_a(s, s') + \gamma V(s') \big).$$
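
A compact sketch of this iteration; the array layout `P[a, s, s']` and `R[a, s, s']` is my own choice for illustration, not notation from the paper:

```python
import numpy as np

def solve_mdp(P, R, gamma=0.9, tol=1e-8):
    """Iterate the value and policy updates until the value function converges.

    P[a, s, s'] are transition probabilities, R[a, s, s'] are rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        pi = Q.argmax(axis=0)                  # pi(s) := argmax_a Q[a, s]
        V_new = Q[pi, np.arange(n_states)]     # V(s) := Q[pi(s), s]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, pi
        V = V_new
```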

Bibliographic data

@article{rabiner1989,
   title = "A tutorial on hidden Markov models and selected applications in speech recognition",
   author = "Lawrence R. Rabiner",
   journal = "Proceedings of the IEEE",
   volume = "77",
   number = "2",
   month = "Feb",
   year = "1989",
   pages = "257--286",
   doi = "0018-9219/89/0200-0257",
}