Solution 1:

A Markov chain is a discrete random process with the property that the next state depends only on the current state (Wikipedia).

So $P(X_n \mid X_1, X_2, \dots, X_{n-1}) = P(X_n \mid X_{n-1})$. An example is modelling the weather: you make the assumption that today's weather can be predicted using only knowledge of yesterday's weather.

Let's say the states are Rainy and Sunny. When it is Rainy on one day, the next day is Sunny with probability $0.3$ (and Rainy with probability $0.7$); when it is Sunny, the probability of Rain the next day is $0.4$ (and of Sun, $0.6$).

Now, if today is Sunny, we can predict the weather for the day after tomorrow: take the probability of Rain tomorrow times the probability of Sun after Rain, plus the probability of Sun tomorrow times the probability of Sun after Sun. In total, the probability of Sun the day after tomorrow is $P(R \mid S) \cdot P(S \mid R) + P(S \mid S) \cdot P(S \mid S) = 0.4 \cdot 0.3 + 0.6 \cdot 0.6 = 0.48$.
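
As a sanity check, here is a minimal Python sketch (my own illustration, with made-up variable names) that encodes these two states in a transition matrix and gets the same two-step probability by matrix multiplication:

    import numpy as np

    # Transition matrix: rows = today's state, columns = tomorrow's state.
    # State order: 0 = Sunny, 1 = Rainy.
    P = np.array([[0.6, 0.4],   # Sunny -> Sunny 0.6, Sunny -> Rainy 0.4
                  [0.3, 0.7]])  # Rainy -> Sunny 0.3, Rainy -> Rainy 0.7

    # Two-step transition probabilities are the entries of P @ P.
    P2 = P @ P
    print(P2[0, 0])  # P(Sunny in two days | Sunny today) = 0.48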

Solution 2:

In a nutshell, a Markov chain is (the behavior of) a random process that can find itself in some (not necessarily finite) set of states. The process moves from one state to another at discrete times (that is, you get a sequence S(t) of states at times t=0,1,2,...), and the probability of going from state S to state R depends only on S and R; there is no "memory of the past", and the transition probabilities do not change over time. This means that a Markov chain with n possible states may be modeled as an n×n matrix of transition probabilities.

An example of a process which may be modeled by a Markov chain is the sequence of faces of a die showing up, if at each step you are allowed to tip the die over one of its edges. The corresponding matrix is

    1   2   3   4   5   6
   ------------------------ 
1 | 0   1/4 1/4 1/4 1/4 0
2 | 1/4 0   1/4 1/4 0   1/4
3 | 1/4 1/4 0   0   1/4 1/4
4 | 1/4 1/4 0   0   1/4 1/4
5 | 1/4 0   1/4 1/4 0   1/4
6 | 0   1/4 1/4 1/4 1/4 0
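
To make the tipping-die example concrete, here is a short Python sketch (my own illustration, not part of the original answer) that builds the adjacency behind this matrix and simulates a random walk of tips:

    import random

    # From each face, a tip moves to one of the four adjacent faces with
    # probability 1/4; the current face and its opposite (faces of a die
    # opposite each other sum to 7) are unreachable in a single tip.
    ADJACENT = {f: [g for g in range(1, 7) if g != f and g != 7 - f]
                for f in range(1, 7)}

    def tip_walk(start, steps):
        """Simulate `steps` random tips of the die starting from `start`."""
        face = start
        faces = [face]
        for _ in range(steps):
            face = random.choice(ADJACENT[face])  # each neighbor has prob 1/4
            faces.append(face)
        return faces

    print(tip_walk(1, 10))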

As usual, Wikipedia and MathWorld are your friends.

Solution 3:

Markov chains, especially hidden Markov models, are hugely important in computational linguistics. A hidden Markov model is one where we can't directly observe the state, but we do have some information about what the state might be. For example, consider breaking a sentence down into what are called "parts of speech", such as verbs, adjectives, etc. We don't know what the parts of speech are, but we can attempt to deduce them from the words. For example, the word "run" might be used 80% of the time as a verb, 18% of the time as a noun, and 2% of the time as an adjective. We also have (Markov) relations between the parts of speech, so, for example, an adjective might be followed by a noun 70% of the time and by another adjective 30% of the time.

We can use the Viterbi algorithm to decide which sequence of parts of speech is most likely to have generated the observed sentence. This algorithm takes into account two factors (a toy implementation follows the list):

  • the probability of such a sequence of parts of speech occurring together in a sentence
  • the relative chance that such a sequence of parts of speech would be responsible for us observing the given sentence.
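
Here is a toy Viterbi implementation in Python. The tiny tag set and all the transition and emission probabilities are invented numbers for illustration, loosely echoing the "run" example above; they are not from any real corpus:

    # Toy hidden Markov model: hidden states are parts of speech,
    # observations are words. All probabilities here are made up.
    states = ["NOUN", "VERB", "ADJ"]
    start_p = {"NOUN": 0.4, "VERB": 0.3, "ADJ": 0.3}
    trans_p = {
        "NOUN": {"NOUN": 0.2, "VERB": 0.6, "ADJ": 0.2},
        "VERB": {"NOUN": 0.5, "VERB": 0.1, "ADJ": 0.4},
        "ADJ":  {"NOUN": 0.7, "VERB": 0.0, "ADJ": 0.3},
    }
    emit_p = {
        "NOUN": {"dogs": 0.60, "run": 0.18, "fast": 0.22},
        "VERB": {"dogs": 0.05, "run": 0.80, "fast": 0.15},
        "ADJ":  {"dogs": 0.00, "run": 0.02, "fast": 0.98},
    }

    def viterbi(obs):
        """Return the most likely hidden state sequence for the words."""
        # V[t][s] = (best probability of any path ending in state s at
        #            time t, the state at time t-1 on that best path)
        V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
        for t in range(1, len(obs)):
            V.append({})
            for s in states:
                prev = max(states, key=lambda r: V[t-1][r][0] * trans_p[r][s])
                prob = V[t-1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]]
                V[t][s] = (prob, prev)
        # Backtrack from the best final state.
        path = [max(states, key=lambda s: V[-1][s][0])]
        for t in range(len(obs) - 1, 0, -1):
            path.append(V[t][path[-1]][1])
        return list(reversed(path))

    print(viterbi(["dogs", "run", "fast"]))  # ['NOUN', 'VERB', 'ADJ']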

Solution 4:

Markov chains are used in Markov Chain Monte Carlo (MCMC). This computational technique is extremely common in Bayesian statistics.

In Bayesian statistics, you want to compute properties of a posterior distribution. You'd like to draw independent samples from this distribution, but often this is impractical. So you construct a Markov chain whose limiting distribution is the distribution you want. Then, for example, to get the mean of your posterior distribution, you can take the mean of the states of your Markov chain. (Ergodic theory blesses this process.)
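
For a flavor of how this looks in code, here is a minimal random-walk Metropolis sketch in Python (my own illustration, not a production MCMC routine); it targets a standard normal "posterior" just so the example is self-contained:

    import math
    import random

    def log_target(x):
        """Unnormalized log density of the target; here a standard normal."""
        return -0.5 * x * x

    def metropolis(n_samples, step=1.0):
        """Random-walk Metropolis: the chain's limiting distribution is the target."""
        x = 0.0
        samples = []
        for _ in range(n_samples):
            proposal = x + random.gauss(0.0, step)
            # Accept with probability min(1, target(proposal) / target(x)).
            log_alpha = log_target(proposal) - log_target(x)
            if log_alpha >= 0 or random.random() < math.exp(log_alpha):
                x = proposal
            samples.append(x)
        return samples

    chain = metropolis(50_000)
    # Estimate the posterior mean by averaging the chain's states.
    print(sum(chain) / len(chain))  # close to 0, the true mean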

Solution 5:

I had a programming project in college where we generated large amounts of pseudo-English text using Markov chains. The assignment is here, although I don't know if that link will be good forever. From that page:

For example, suppose that [our Markov chains are of length] 2 and the sample file contains

I like the big blue dog better than the big elephant with the big blue hat on his tusk. 

Here is how the first three words might be chosen:

  • A two-word sequence is chosen at random to become the initial prefix. Let's suppose that "the big" is chosen.

  • The first word must be chosen based on the probability that it follows the prefix (currently "the big") in the source. The source contains three occurrences of "the big". Two times it is followed by "blue", and once it is followed by "elephant". Thus, the next word must be chosen so that there is a 2/3 chance that "blue" will be chosen, and a 1/3 chance that "elephant" will be chosen. Let's suppose that we choose "blue" this time.

  • The next word must be chosen based on the probability that it follows the prefix (currently "big blue") in the source. The source contains two occurrences of "big blue". Once it is followed by "dog", and the other time it is followed by "hat". Thus, the next word must be chosen so that there is a 50-50 probability of choosing "dog" vs. "hat". Let's suppose that we choose "hat" this time.

  • The next word must be chosen based on the probability that it follows the prefix (currently "blue hat") in the source. The source contains only one occurrence of "blue hat", and it is followed by "on". Thus, the next word must be "on" (100% probability).

  • Thus, the first three words in the output text would be "blue hat on".

You keep going like that, generating text that is completely nonsensical but ends up having sort of the same "tone" as the original text. For example, if your sample file is the complete text of Alice in Wonderland (one of the texts we tried it on), then your nonsense comes out kind of whimsical and Carrollian (if that's a word). If your sample file is The Tell-Tale Heart, you get somewhat dark, morbid nonsense.
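
For the curious, here is a compact Python sketch of this kind of generator (not the original course code, just my reconstruction of the idea, using the sample sentence above):

    import random
    from collections import defaultdict

    def build_chain(text, k=2):
        """Map each k-word prefix to the list of words that follow it."""
        words = text.split()
        chain = defaultdict(list)
        for i in range(len(words) - k):
            chain[tuple(words[i:i + k])].append(words[i + k])
        return chain

    def generate(chain, n_words):
        """Start at a random prefix and repeatedly sample a next word."""
        prefix = random.choice(list(chain))
        out = []
        for _ in range(n_words):
            followers = chain.get(prefix)
            if not followers:        # dead end: this prefix only ends the text
                break
            # Duplicates in the list reproduce the source frequencies,
            # e.g. ("the", "big") -> ["blue", "elephant", "blue"].
            word = random.choice(followers)
            out.append(word)
            prefix = prefix[1:] + (word,)
        return " ".join(out)

    sample = ("I like the big blue dog better than the big elephant "
              "with the big blue hat on his tusk.")
    print(generate(build_chain(sample), 10))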

Anyway, while not a rigorous, formal, definition, I hope this helps give you a sense of what a Markov chain is.