Homogeneous Markov chains, the transition matrix and the transition graph. Markov chains are simple: we analyze the principle in detail

  • Date: 21.09.2019

This article gives a general idea of how to generate texts by modeling Markov processes. In particular, we will get acquainted with Markov chains, and as practice we will implement a small text generator in Python.

To begin with, let us write out the definitions we need from the Wikipedia page - not yet very clear to us - in order to at least roughly imagine what we are dealing with:

Markov process

Markov chain

What does all of this mean? Let's figure it out.

Basics

The first example is extremely simple. Using a sentence from a children's book, we will master the basic concept of a Markov chain and define what, in our context, a corpus, links, a probability distribution and histograms are. Although the sentence is in English, the essence of the theory will be easy to grasp.

This sentence is the frame, that is, the base on which the text will be generated in the future. It consists of eight words, but there are only five unique words - these are the links (we are, after all, talking about Markov chains). For clarity, let's color each link in its own color:

And write down the number of occurrences of each of the links in the text:

The picture above shows that the word fish appears in the text 4 times more often than each of the other words (One, two, red, blue). That is, the probability of meeting the word fish in our corpus is 4 times higher than the probability of meeting any other word in the picture. Speaking in the language of mathematics, we can determine the distribution law of a random variable and calculate the probability with which each of the words will appear in the text after the current one. The probability is calculated as follows: divide the number of occurrences of the word we need in the corpus by the total number of all words in it. For the word fish this probability is 50%, since it appears 4 times in an 8-word sentence. For each of the remaining links, this probability is 12.5% (1/8).
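
To make this computation concrete, here is a minimal Python sketch (not part of the original article) that derives these probabilities from the example sentence; the corpus string below is simply the line quoted above:

from collections import Counter

corpus = "One fish two fish red fish blue fish"
words = corpus.split()

counts = Counter(words)            # number of occurrences of each link
total = len(words)                 # total number of words in the corpus

for word, count in counts.items():
    print(word, count, count / total)   # e.g. fish 4 0.5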

The distribution of a random variable can be represented graphically with the help of a histogram. In this case, the frequency of occurrence of each of the links in the sentence is clearly visible:

So, our text consists of words and unique links, and we displayed the probability distribution of the appearance of each of the links in the sentence on a histogram. If it seems to you that it is better not to mess with statistics, read up on it. It might even save your life.

The essence of the definition

Now let's add to our text the elements that are always implied but not voiced in everyday speech - the beginning and the end of the sentence:

Any sentence contains these invisible "beginning" and "end", let's add them as links to our distribution:

Let's return to the definition given at the beginning of the article:

A Markov process is a random process whose evolution after any fixed value of the time parameter t does not depend on the evolution that preceded t, provided that the value of the process at that moment is fixed.

A Markov chain is a special case of a Markov process in which the space of its states is discrete (i.e., at most countable).

So what does this mean? Roughly speaking, we are modeling a process in which the state of the system at the next moment of time depends only on its state at the current moment, and does not depend in any way on all previous states.

Imagine that in front of you is a window that displays only the current state of the system (in our case, it is one word), and you need to determine what the next word will be based only on the data presented in this window. In our corpus, the words follow one another according to the following pattern:

Thus, pairs of words are formed (even the end of the sentence has its own pair - an empty value):

Let's group these pairs by the first word. We will see that each word has its own set of links that, in the context of our sentence, may follow it:

Let's present this information in a different way - each link will be assigned an array of all words that may appear in the text after this link:

Let's take a closer look. We see that each link has words that may come after it in a sentence. If we were to show the diagram above to someone else, that person could, with some probability, reconstruct our initial sentence, i.e. the corpus.

An example. Let's start with the word Start. Next, we choose the word One, since according to our scheme this is the only word that can follow the beginning of a sentence. After the word One, again, only one word can follow - fish. Now the new sentence in its intermediate version looks like One fish. Further, the situation becomes more complicated: after fish, the words "two", "red", "blue" and the end-of-sentence link End can each follow with an equal probability of 25%. If we assume that the next word is "two", the reconstruction will continue. But we could also choose the link End. In this case, based on our scheme, a sentence will be generated that is very different from the corpus - One fish.

We have just modeled a Markov process - we determined each next word only on the basis of knowledge about the current one. For a complete assimilation of the material, let's build diagrams that show the dependencies between the elements within our corpus. Ovals represent links. The arrows lead to potential links that can follow the word in the oval. Near each arrow is the probability with which the next link will appear after the current one:

Fine! We have learned the necessary information to move on and analyze more complex models.

Expanding vocabulary base

In this part of the article, we will build a model according to the same principle as before, but we will omit some steps in the description. If you have any difficulties, go back to the theory in the first block.

Let's take four more quotes from the same author (also in English, it won't hurt us):

“Today you are you. That is truer than true. There is no one alive who is you-er than you."

“You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose. You're on your own."

“The more that you read, the more things you will know. The more that you learn, the more places you'll go."

“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try."

The complexity of the corpus has increased, but in our case this is only a plus - now the text generator will be able to produce more meaningful sentences. The fact is that in any language there are words that occur in speech more often than others (for example, we use the preposition "in" much more often than the word "cryogenic"). The more words in our corpus (and hence the more dependencies between them), the more information the generator has about which word is most likely to appear in the text after the current one.

The easiest way to explain this is from the point of view of the program. We know that for each link there is a set of words that can follow it. And also, each word is characterized by the number of its occurrences in the text. We need to capture all this information in one place; a dictionary that stores "(key, value)" pairs is best suited for this purpose. The dictionary key records the current state of the system, that is, one of the links in the corpus (for example, "the" in the picture below); and another dictionary is stored in the dictionary value. In the nested dictionary, the keys are the words that can appear in the text after the current corpus link ("thinks" and "more" can follow "the" in the text), and the values are the number of occurrences of these words in the text after our link (the word "thinks" appears in the text after the word "the" 1 time, the word "more" after the word "the" - 4 times):
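
For concreteness, this is roughly how such a structure could look in Python (a sketch of the data layout only; the counts are taken from the example described above, and the rest of the corpus is elided):

# outer key: current link; nested dictionary: words that may follow it and their counts
model = {
    "the": {"thinks": 1, "more": 4},
    # ... one entry of this kind for every link in the corpus
}

# probability that "more" follows "the": its count divided by the row total
total = sum(model["the"].values())
print(model["the"]["more"] / total)   # 4 / 5 = 0.8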

Re-read the paragraph above several times to make sure you understand it. Please note that the nested dictionary in this case is the same histogram; it helps us track the links and the frequency of their occurrence in the text relative to other words. It should be noted that even such a vocabulary base is very small for proper generation of texts in natural language - it should contain more than 20,000 words, preferably more than 100,000, and even better more than 500,000. But let's look at the vocabulary base that we got.

The Markov chain in this case is built similarly to the first example - each next word is selected only on the basis of knowledge about the current word; all other words are not taken into account. But thanks to storing in the dictionary the data about which words appear more often than others, we can make a weighted decision when choosing. Let's look at a specific example:

More:

That is, if the current word is "more", the words "things" and "places" can follow it with an equal probability of 25% each, and the word "that" with a probability of 50%. But the probabilities can also all be equal to each other:

think:

Working with windows

So far, we have only considered one-word windows. You can increase the size of the window so that the text generator produces more "verified" sentences. This means that the larger the window, the smaller the deviation from the corpus during generation. Increasing the window size corresponds to a transition of the Markov chain to a higher order. Previously, we built a chain of the first order; for a window of two words we get a chain of the second order, of three - of the third, and so on.

A window is the data in the current state of the system that is used for decision-making. If we combine a large window with a small data set, then we are likely to get the same sentence every time. Let's take the dictionary base from our first example and expand the window to size 2:

The extension has resulted in each window now having only one option for the next system state - no matter what we do, we will always get the same sentence, identical to our corpus (a quick sketch of this effect is shown below). Therefore, in order to experiment with windows and have the text generator return unique content, stock up on a vocabulary base of 500,000 words or more.
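
Here is a minimal sketch (not from the original article) that builds a two-word-window model over the fish sentence and confirms that every window has exactly one possible continuation; the START/END markers are the same service links introduced earlier:

from collections import defaultdict

words = "START One fish two fish red fish blue fish END".split()

# second-order model: the key is a pair of words, the value is the set of possible next words
model = defaultdict(set)
for i in range(len(words) - 2):
    model[(words[i], words[i + 1])].add(words[i + 2])

for window, followers in model.items():
    print(window, "->", followers)   # every window maps to exactly one follower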

Implementation in Python

Dictogram data structure

The Dictogram (dict is the built-in dictionary data type in Python) will store the relationship between the links and their frequency of occurrence in the text, i.e. their distribution. At the same time, it has the dictionary property we need - the time to access an element does not depend on the amount of input data, which means we are creating an efficient algorithm.

import random

class Dictogram(dict):
    def __init__(self, iterable=None):
        # Initialize our distribution as a new dict object and
        # add the existing elements, if any
        super(Dictogram, self).__init__()
        self.types = 0   # number of unique keys in the distribution
        self.tokens = 0  # total number of all words in the distribution
        if iterable:
            self.update(iterable)

    def update(self, iterable):
        # Update the distribution with the items from an iterable dataset
        for item in iterable:
            if item in self:
                self[item] += 1
                self.tokens += 1
            else:
                self[item] = 1
                self.types += 1
                self.tokens += 1

    def count(self, item):
        # Return the item's counter value, or 0 if the item is absent
        if item in self:
            return self[item]
        return 0

    def return_random_word(self):
        # Another way: random.choice(list(self.keys()))
        random_key = random.sample(list(self.keys()), 1)
        return random_key[0]

    def return_weighted_random_word(self):
        # Generate a pseudo-random number between 0 and (n-1),
        # where n is the total number of words
        random_int = random.randint(0, self.tokens - 1)
        index = 0
        list_of_keys = list(self.keys())
        for i in range(0, self.types):
            index += self[list_of_keys[i]]
            if index > random_int:
                return list_of_keys[i]

Any object that can be iterated over can be passed to the constructor of the Dictogram structure. The elements of this object will be the words used to initialize the Dictogram, for example, all the words from some book. Here we are counting the elements so that accessing any of them does not require running through the entire data set each time.

We also made two functions to return a random word. One function selects a random key in the dictionary, and the other, taking into account the number of occurrences of each word in the text, returns the word we need.

Markov chain structure

from histograms import Dictogram

def make_markov_model(data):
    markov_model = dict()
    for i in range(0, len(data) - 1):
        if data[i] in markov_model:
            # Just append to the already existing distribution
            markov_model[data[i]].update([data[i + 1]])
        else:
            markov_model[data[i]] = Dictogram([data[i + 1]])
    return markov_model

In the implementation above, we have a dictionary that stores windows as the key in the "(key, value)" pair and distributions as the values ​​in that pair.

Structure of a Markov chain of the Nth order

from histograms import Dictogram

def make_higher_order_markov_model(order, data):
    markov_model = dict()
    for i in range(0, len(data) - order):
        # Create the window
        window = tuple(data[i: i + order])
        # Add to the dictionary
        if window in markov_model:
            # Attach to an existing distribution
            markov_model[window].update([data[i + order]])
        else:
            markov_model[window] = Dictogram([data[i + order]])
    return markov_model

This is very similar to a first-order Markov chain, but here we store a tuple as the key in the "(key, value)" pair of the dictionary. We use a tuple instead of a list because a tuple is protected from changes, and this is important for us - after all, the keys in a dictionary must not change.

Model parsing

Great, we have implemented the dictionary. But how do we now generate content based on the current state and step to the next state? Let's walk through our model:

from histograms import Dictogram
import random

def generate_random_start(model):
    # To generate any start word, uncomment the line:
    # return random.choice(list(model.keys()))

    # To generate a "correct" start word, use the code below:
    # correct start words are those that began sentences in the corpus
    if 'END' in model:
        seed_word = 'END'
        while seed_word == 'END':
            seed_word = model['END'].return_weighted_random_word()
        return seed_word
    return random.choice(list(model.keys()))

def generate_random_sentence(length, markov_model):
    current_word = generate_random_start(markov_model)
    sentence = [current_word]
    for i in range(0, length):
        current_dictogram = markov_model[current_word]
        random_weighted_word = current_dictogram.return_weighted_random_word()
        current_word = random_weighted_word
        sentence.append(current_word)
    sentence[0] = sentence[0].capitalize()
    return ' '.join(sentence) + '.'
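
For completeness, here is a hedged sketch of how these pieces could be wired together; the tokenization and the tiny corpus are my assumptions, and the Dictogram class and the functions above are assumed to be importable or defined in the same module:

# sketch: assumes Dictogram and the functions defined above are in scope
corpus = ("One fish two fish red fish blue fish END "
          "Today you are you END That is truer than true END")
words = corpus.split()

model = make_markov_model(words)
print(generate_random_sentence(10, model))   # e.g. "Fish red fish blue fish two fish ..."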

What's next?

Try to think of where you could use a text generator based on Markov chains yourself. Just do not forget that the most important thing is how you parse the model and what special restrictions you set on generation. The author of this article, for example, used a large window when creating a tweet generator, limited the generated content to 140 characters, and used only "correct" words to start sentences, that is, those that were the beginnings of sentences in the corpus.

Markov chains

Introduction

§1. Markov chain

§2. Homogeneous Markov chain. Transition probabilities. Transition matrix

§3. Markov equality

§4. Stationary distribution. Theorem on limiting probabilities

§5. Proof of the theorem on limiting probabilities in a Markov chain

§6. Applications of Markov chains

Conclusion

List of used literature

Introduction

The theme of our term paper is Markov chains. Markov chains are named after the outstanding Russian mathematician Andrei Andreevich Markov, who worked a lot on random processes and made a great contribution to the development of this field. Recently you can hear about the application of Markov chains in a variety of areas: modern web technologies, the analysis of literary texts, and even the development of football team tactics. For those who do not know what Markov chains are, there may be a feeling that this is something very complex and almost inaccessible to understanding.

No, it's just the opposite. A Markov chain is one of the simplest cases of a sequence of random events. But, despite its simplicity, it can often be useful even when describing rather complex phenomena. A Markov chain is a sequence of random events in which the probability of each event depends only on the previous one, but does not depend on earlier events.

Before delving deeper, we need to consider a few subsidiary issues that are well known, but are absolutely necessary for further presentation.

The purpose of my term paper is to study in more detail the applications of Markov chains, the problem statement and the Markov problems.

§1. Markov chain

Imagine that a sequence of tests is being performed.

Definition. A Markov chain is a sequence of trials in each of which one and only one of k incompatible events A1, A2, ..., Ak of a complete group appears, and the conditional probability that a given event occurs in the s-th trial, provided that a certain event occurred in the (s-1)-th trial, does not depend on the results of earlier trials.

For example, if the sequence of trials forms a Markov chain and the complete group consists of four incompatible events, and it is known which of them appeared in the sixth trial, then the conditional probability that a given event occurs in the seventh trial does not depend on what events appeared in the first, second, ..., fifth trials.

Note that independent trials are a special case of a Markov chain. Indeed, if the trials are independent, then the occurrence of some specific event in any trial does not depend on the results of previously performed tests. It follows that the concept of a Markov chain is a generalization of the concept of independent trials.

Often, when presenting the theory of Markov chains, a different terminology is adopted: one speaks of some physical system S which at each moment of time is in one of the states S1, S2, ..., Sk and changes its state only at certain moments of time; that is, the system passes from one state to another (for example, from Si to Sj). For Markov chains, the probability of passing to some state at the next moment depends only on the state of the system at the current moment and does not change when its states at earlier moments become known. Also, in particular, after a trial the system can remain in the same state ("pass" from the state Si to the same state Si).

To illustrate, consider an example.

Example 1. Imagine that a particle located on a straight line moves along this line under the influence of random shocks occurring at the moments t1, t2, t3, .... The particle can be located at points with integer coordinates; at two of these points there are reflecting walls. Each push moves the particle to the right with probability p and to the left with probability 1 - p, unless the particle is near a wall. If the particle is near a wall, then any push moves it one unit inside the gap between the walls. Here we see that this example of a particle walk is a typical Markov chain.

Thus, events are called states of the system, and tests are called changes in its states.

We now give a definition of a Markov chain using the new terminology.

A discrete-time Markov chain is a chain whose states change at certain fixed points in time.

A continuous-time Markov chain is a chain whose states change at any random possible time.

§2. Homogeneous Markov chain. Transition probabilities. Transition matrix

Definition. A Markov chain is called homogeneous if the conditional probability p_ij (of transition from the state Si to the state Sj) does not depend on the number of the trial. Therefore, instead of p_ij(n) one writes simply p_ij.

Example 1. A random walk. Let a material particle be located on a straight line at a point with an integer coordinate. At certain moments of time the particle experiences shocks. Under the action of a push the particle is displaced by one unit to the right with probability p and by one unit to the left with probability 1 - p. It is clear that the position (coordinate) of the particle after a shock depends on where the particle was after the immediately preceding shock and does not depend on how it moved under the action of the other, earlier shocks.

Thus, a random walk is an example of a homogeneous Markov chain with discrete time.

The transition probability p_ij is the conditional probability that from the state Si (in which the system ended up as a result of some trial, no matter which one) the system will go to the state Sj as a result of the next trial.

Thus, in the notation p_ij the first index indicates the number of the preceding state, and the second indicates the number of the subsequent state. For example, p_23 is the probability of transition from the second state to the third.

Let the number of states be finite and equal to k.

The transition matrix P1 of the system is the matrix that contains all the transition probabilities of this system:

P1 = (p_ij), i, j = 1, 2, ..., k.

Since each row of the matrix contains the probabilities of events (transitions from one and the same state to every possible state) which form a complete group, the sum of the probabilities of these events is equal to one. In other words, the sum of the transition probabilities in each row of the transition matrix is equal to one:

p_i1 + p_i2 + ... + p_ik = 1 for every i = 1, 2, ..., k.

Let us give an example of the transition matrix of a system that can be in three states; the transition from state to state occurs according to the scheme of a homogeneous Markov chain; transition probabilities are given by the matrix:

Here we see that if the system was in the first state, then after one change of state it will remain in that state with a probability of 0.5, with a probability of 0.2 it will pass to another state, and certain transitions are impossible for it (their probabilities are zero). The last row of the matrix shows that from the last state the system can pass to any of the possible states with the same probability of 0.1.

Based on the transition matrix of the system, one can build a so-called state graph of the system; it is also called a labeled state graph. This is useful for visualizing the chain. Let us consider the order of graph construction using an example.

Example 2 Based on the given transition matrix, construct a state graph.

Since the matrix is of the fourth order, the system accordingly has 4 possible states.

The graph does not indicate the probabilities of the system passing from a state to itself. When considering specific systems, it is convenient first to construct a state graph, then to determine the probabilities of the system passing from a state to itself (based on the requirement that the sum of the elements of each row of the matrix be equal to one), and then to compile the transition matrix of the system. A small sketch of this bookkeeping is given below.
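
As an illustration (not from the original text), here is a small Python sketch that fills in the missing self-transition probabilities so that every row of a transition matrix sums to one; the off-diagonal values are arbitrary assumed numbers:

# assumed off-diagonal transition probabilities; the diagonal entries are unknown for now
P = [
    [0.0, 0.2, 0.1, 0.0],
    [0.3, 0.0, 0.3, 0.2],
    [0.0, 0.1, 0.0, 0.4],
    [0.2, 0.2, 0.1, 0.0],
]

# fill each diagonal entry so that the row sums to one
for i, row in enumerate(P):
    row[i] = 1.0 - (sum(row) - row[i])

for row in P:
    print(row, "row sum =", sum(row))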

§3. Markov equality

Definition. Denote by P_ij(n) the probability that as a result of n steps (trials) the system will pass from the state i to the state j. For example, P_25(10) is the probability of transition in 10 steps from the second state to the fifth.

We emphasize that for n = 1 we obtain the ordinary transition probabilities: P_ij(1) = p_ij.

Let us set ourselves the following task: knowing the transition probabilities p_ij, find the probabilities P_ij(n) of the system passing from the state i to the state j in n steps.

For this purpose, let us introduce into consideration an intermediate (between i and j) state r. In other words, we will assume that the system passes from the initial state i in m steps to the intermediate state r with probability P_ir(m), after which, in the remaining n - m steps, it passes from the intermediate state r to the final state j with probability P_rj(n - m).

According to the total probability formula, we get

P_ij(n) = ∑_r P_ir(m) P_rj(n - m).   (1)

This formula is called Markov's equality.

Explanation. Let us introduce the notation: A is the event of interest to us (in n steps the system passes from the initial state i to the final state j), so that P(A) = P_ij(n); B_r are the hypotheses (in m steps the system passes from the initial state i to the intermediate state r), so that P(B_r) = P_ir(m), and the conditional probability is P(A | B_r) = P_rj(n - m).

According to the total probability formula,

P(A) = ∑_r P(B_r) P(A | B_r),

or, in our notation,

P_ij(n) = ∑_r P_ir(m) P_rj(n - m),

which coincides with the Markov formula (1).

Knowing all the transition probabilities p_ij = P_ij(1), i.e. knowing the matrix P1 of transition from state to state in one step, one can find the probabilities P_ij(2) of transition from state to state in two steps, and therefore the transition matrix P2 itself; using the known matrix P2, one can find the transition matrix P3 of transition from state to state in three steps, and so on.

Indeed, putting n = 2, m = 1 in the Markov equality, we obtain

P_ij(2) = ∑_r P_ir(1) P_rj(1) = ∑_r p_ir p_rj.   (2)

Thus, by formula (2) one can find all the probabilities P_ij(2), and hence the matrix P2 itself. Since the direct use of formula (2) turns out to be tedious, while the matrix calculus leads to the goal faster, I will write the relation following from (2) in matrix form:

P2 = P1 · P1 = P1^2.

Putting n = 3, m = 2 in (1), we similarly obtain

P3 = P1 · P2 = P1^3.

In general,

Pn = P1^n.

Theorem 1. For any s, t

P_ij(s + t) = ∑_r P_ir(s) P_rj(t).   (3)

Proof. We calculate the probability P_ij(s + t) by the total probability formula, taking as hypotheses B_r the events "after the first s steps the system is in the state r". From the equalities

P(B_r) = P_ir(s), P(A | B_r) = P_rj(t),   (4)

and from the total probability formula we obtain the assertion of the theorem.

Let us define the matrix P(n) = (P_ij(n)). In matrix notation, (3) has the form

P(s + t) = P(s) P(t).   (5)

Since P(1) = P1, where P1 is the transition probability matrix, it follows from (5) that

P(n) = P1^n.   (6)

The results obtained in the theory of matrices make it possible, using formula (6), to calculate the P_ij(n) and to study their behavior as n → ∞.

Example 1. A transition matrix P1 is given. Find the transition matrix P2.

Solution. Let us use the formula P2 = P1^2.

Multiplying the matrix by itself, we finally obtain P2.
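
Since the matrices of the original example were given as pictures and are not reproduced here, the following sketch simply illustrates the computation P2 = P1^2 with an assumed two-state transition matrix:

import numpy as np

# assumed transition matrix (not the one from the original example)
P1 = np.array([[0.7, 0.3],
               [0.4, 0.6]])

P2 = P1 @ P1                            # two-step transition matrix, P2 = P1^2
P10 = np.linalg.matrix_power(P1, 10)    # n-step matrix for n = 10

print(P2)
print(P10)   # the rows converge towards the stationary distribution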

§4. Stationary distribution. Theorem on limiting probabilities

The probability distribution at an arbitrary moment of time can be found using the total probability formula:

p_j(n) = ∑_i p_i(0) P_ij(n).   (7)

It may turn out that p(n) does not depend on time. Let us call a stationary distribution a vector π satisfying the conditions

π_j ≥ 0, ∑_j π_j = 1, π_j = ∑_i π_i p_ij,   (8)

where p_ij are the transition probabilities.

If in a Markov chain p(0) = π, then p(n) = π for any n.

This assertion follows by induction from (7) and (8).

Let us state the theorem on limiting probabilities for one important class of Markov chains.

Theorem 1. If for some n0 > 0 all elements of the matrix P(n0) are positive, then for any i, j, as n → ∞,

|P_ij(n) - π_j| ≤ C h^n,   (9)

where π = (π_j) is a stationary distribution with π_j > 0, and C > 0, 0 < h < 1 are some constants.

Since P(n0) = P1^n0, according to the condition of the theorem one can get from any state to any state in n0 steps with a positive probability. The conditions of the theorem exclude chains that are in some sense periodic.

If the condition of Theorem 1 is satisfied, then the probability that the system is in some state j does not, in the limit, depend on the initial distribution. Indeed, it follows from (9) and (7) that for any initial distribution p(0), p_j(n) = ∑_i p_i(0) P_ij(n) → π_j as n → ∞.

Let us consider examples of Markov chains for which the conditions of Theorem 1 are not satisfied. In one such example the transition probabilities P_ij(n) have limits, but these limits depend on the initial state. In other examples the limits of the probabilities as n → ∞ obviously do not exist.

Let us find the stationary distribution in Example 1. We need to find a vector π satisfying conditions (8). Writing out and solving this system of equations, we conclude that a stationary distribution exists, but not all coordinates of the vector π are positive.
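
In practice the stationary distribution is found numerically as the solution of π P = π together with the normalization ∑ π_j = 1. A hedged Python sketch (the matrix is an assumed example, not the one from the text):

import numpy as np

# assumed example of a transition matrix
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.5, 0.5]])

n = P.shape[0]
# solve pi (P - I) = 0 together with sum(pi) = 1:
# transpose the system and replace one redundant equation by the normalization condition
A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
b = np.zeros(n)
b[-1] = 1.0

pi = np.linalg.solve(A, b)
print(pi, pi @ P)   # pi and pi*P coincide, which confirms stationarity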

For the polynomial scheme, random variables equal to the number of outcomes of a given type were introduced. Let us introduce similar quantities for Markov chains. Let ν_j(n) be the number of visits of the system to the state j during the time n. Then ν_j(n)/n is the frequency of visits of the system to the state j. Using formulas (9), one can prove that as n → ∞ this frequency approaches π_j. To do this, one needs to obtain asymptotic formulas for the mathematical expectation and the variance of ν_j(n) and use the Chebyshev inequality. We present the derivation of the formula for the expectation. Let us represent ν_j(n) in the form

ν_j(n) = ∑_{t=1..n} I_j(t),   (10)

where I_j(t) = 1 if the system is in the state j at the time t, and I_j(t) = 0 otherwise. Since

E I_j(t) = p_j(t),

then, using the additivity of the mathematical expectation and formula (9), we obtain

E ν_j(n) = ∑_{t=1..n} p_j(t) = n π_j + R_n,

where the second term R_n on the right-hand side of this equality, by virtue of Theorem 1, is a partial sum of a convergent series and therefore remains bounded. Dividing by n, we get

E [ν_j(n)/n] → π_j as n → ∞.   (11)

From formula (11), in particular, it follows that the mean frequency of visits to the state j tends to the stationary probability π_j as n → ∞.

One can also obtain a formula for E ν_j(n)², which is used to calculate the variance.

§5. Proof of the theorem on limiting probabilities in a Markov chain

Let us first prove two lemmas. Let's put

Lemma 1. There are limits for any

Proof. Using equation (3) with we get

Thus, the sequences are both monotone and bounded. This implies the assertion of Lemma 1.

Lemma 2. If the conditions of Theorem 1 are satisfied, then there exist constants such that

For any


where , means summation over all , for which is positive, and summation over the rest . From here

. (12)

Since under the conditions of Theorem 1 the transition probabilities for all , then for any

And due to the finiteness of the number of states

Let us now estimate the difference . Using equation (3) with , , we obtain


Hence, using (8)-(10), we find

=.

Combining this relation with inequality (14) , we obtain the assertion of the lemma.

Let us proceed to the proof of the theorem. Since the sequences are monotone, then

0<. (15)

From here and from Lemma 1 we find

Therefore, when you get and

The positivity follows from inequality (15). Passing to the limit in equation (3), we obtain that the limit satisfies equation (12). The theorem is proven.

§6. Applications of Markov chains

Markov chains serve as a good introduction to the theory of random processes, i.e. the theory of sequences of families of random variables, usually depending on a parameter which in most applications plays the role of time. It is intended primarily to describe fully both the long-term and the local behavior of a process. Here are some of the most studied questions in this regard.

Brownian motion and its generalizations - diffusion processes and processes with independent increments . The theory of stochastic processes contributed to the deepening of the connection between the theory of probability, the theory of operators and the theory of differential equations, which, among other things, was of great importance for physics and other applications. Applications include processes of interest to actuarial (insurance) mathematics, queuing theory, genetics, traffic control, electrical circuit theory, and accounting and accumulation theory.

Martingales. These processes preserve enough of the properties of Markov chains for important ergodic theorems to hold for them. Martingales differ from Markov chains in that when the current state is known, only the expectation of the future, but not necessarily the probability distribution itself, is independent of the past. In addition to being an important research tool, martingale theory has enriched the theory of random processes arising in statistics, the theory of nuclear fission, genetics, and information theory with new limit theorems.

Stationary processes . The oldest known ergodic theorem, as noted above, can be interpreted as a result describing the limiting behavior of a stationary random process. Such a process has the property that all probabilistic laws that it satisfies remain invariant under time shifts. The ergodic theorem, first formulated by physicists as a hypothesis, can be represented as a statement that, under certain conditions, the average over the ensemble coincides with the average over time. This means that the same information can be obtained from long-term observation of the system and from the simultaneous (and instantaneous) observation of many independent copies of the same system. The law of large numbers is nothing but a special case of Birkhoff's ergodic theorem. Interpolation and prediction of the behavior of stationary Gaussian processes, understood in a broad sense, serve as an important generalization of the classical theory of least squares. The theory of stationary processes is a necessary research tool in many areas, for example, in communication theory, which deals with the study and creation of systems that transmit messages in the presence of noise or random interference.

Markov processes (processes without aftereffect) play a huge role in modeling queuing systems (QS), as well as in modeling and choosing a strategy for managing socio-economic processes occurring in society.

A Markov chain can also be used to generate texts. Several texts are fed to the input, then a Markov chain is built with the probabilities of one word following another, and the resulting text is created based on this chain. The toy is very entertaining!

Conclusion

Thus, our term paper dealt with the scheme of Markov chains. We learned in which areas and how they are applied, saw that independent trials are a special case of a Markov chain, proved the theorem on limiting probabilities in a Markov chain, and gave examples of a typical and a homogeneous Markov chain, as well as of finding the transition matrix.

We have seen that the Markov chain scheme is a direct generalization of the independent test scheme.

List of used literature

1. Chistyakov V.P. Probability course. 6th ed., rev. - St. Petersburg: Publishing house "Lan", 2003. - 272 p. − (Textbook for universities. Special literature).

2. Gnedenko B.V. Probability course.

3. Gmurman V.E. Theory of Probability and Mathematical Statistics.

4. Wentzel E.S. Probability theory and its engineering applications.

5. Kolmogorov A.N., Zhurbenko I.G., Prokhorov A.V. Introduction to the theory of probability. − Moscow-Izhevsk: Institute for Computer Research, 2003, 188 pages.

6. Kac M. Probability and Related Questions in Physics.

A Markov chain is called homogeneous if the conditional probability p_ij of transition from the state i into the state j does not depend on the number of the trial. For homogeneous chains, instead of p_ij(n) we use the notation p_ij.

Random walks are an example of a homogeneous Markov chain. Let a material particle be on the straight line Ox at the point with integer coordinate x = n. At certain moments of time the particle abruptly changes its position (for example, it can move to the right with probability p and to the left with probability 1 - p). Obviously, the position of the particle after a jump depends on where the particle was after the immediately preceding jump and does not depend on how it moved at the earlier moments of time.

In what follows, we confine ourselves to considering finite homogeneous Markov chains.

Transition probabilities. Transition matrix.

The transition probability p_ij is the conditional probability that from the state i, as a result of the next trial, the system will go into the state j. Thus, the first index refers to the preceding state and the second to the subsequent one.

The transition matrix of the system is the matrix that contains all the transition probabilities of this system:

P1 = (p_ij), where the p_ij represent the transition probabilities in one step.

Let us note a key feature of the transition matrix: since each of its rows lists the probabilities of a complete group of events, the sum of the elements of every row equals one.

Markov equality

Denote by P_ij(n) the probability that as a result of n steps (trials) the system will pass from the state i into the state j. For example, P_36(10) is the probability of transition in 10 steps from the third state to the sixth. Note that for n = 1 this probability reduces simply to the transition probability p_ij.

The question arises: how, knowing the transition probabilities p_ij, can one find the probabilities P_ij(n) of transition from the state i into the state j in n steps? For this purpose, an intermediate (between i and j) state r is introduced. In other words, it is assumed that from the initial state i, after m steps, the system passes to the intermediate state r with probability P_ir(m), after which, in the remaining n - m steps, it passes from the intermediate state r to the final state j with probability P_rj(n - m). Using the total probability formula, one can show that the following formula is valid:

P_ij(n) = ∑_r P_ir(m) P_rj(n - m).

This formula is called the Markov equality.

Knowing all the transition probabilities p_ij = P_ij(1), i.e. knowing the matrix P1 of transition from state to state in one step, one can find the probabilities P_ij(2) of transition from state to state in two steps, and hence the transition matrix P2 itself; then, from the known matrix P2, one can find P3, etc.

Indeed, assuming n = 2, m = 1 in the Markov equality, we get

P_ij(2) = ∑_r P_ir(1) P_rj(1),

or P_ij(2) = ∑_r p_ir p_rj. In matrix form, this can be written as P2 = P1 · P1 = P1^2.

Assuming n = 3, m = 2, we get P3 = P1 · P2 = P1^3. In the general case, the relation Pn = P1^n holds.

Example. Let the transition matrix P1 be given.

It is required to find the transition matrix P2.

Multiplying the matrix by itself, we get P2 = P1^2.

For practical applications, the issue of calculating the probability of a system being in a particular state at a particular moment in time is extremely important. The solution of this question requires knowledge of the initial conditions, i.e. the probabilities of the system being in certain states at the initial moment of time. The initial probability distribution of a Markov chain is the probability distribution of the states at the beginning of the process.

Here p_i(0) denotes the probability of finding the system in the state i at the initial moment of time. In a particular case, if the initial state of the system is known exactly (for example, the system starts in the state i), then the initial probability p_i(0) = 1, and all the others are equal to zero.

If for a homogeneous Markov chain the initial probability distribution and the transition matrix are given, then the probabilities of the states of the system at the n-th step are calculated by the recursive formula

p(n) = p(n - 1) · P1, that is, p_j(n) = ∑_i p_i(n - 1) p_ij.

Let us take a simple example as an illustration. Consider the process of functioning of some system (for example, a device). Let the device, during one day, be in one of two states - serviceable or faulty. As a result of mass observations of the operation of the device, the following transition matrix was compiled,

where p11 is the probability that the device will remain in a serviceable state;

p12 is the probability of the device passing from a serviceable state to a faulty one;

p21 is the probability of the device passing from a faulty state to a serviceable one;

p22 is the probability that the device will remain in the "faulty" state.

Let the vector of initial probabilities of the device states be given by the relation p(0) = (0, 1), i.e. p1(0) = 0, p2(0) = 1 (at the initial moment the device was faulty). It is required to determine the probabilities of the device states after three days.

Solution: Using the transition matrix, we determine the probabilities of the states after the first step (after the first day):

The probabilities of the states after the second step (the second day) are

Finally, the probabilities of states after the third step (third day) are

Thus, the probability that the device will be in a serviceable state is 0.819, and that it will be in a faulty state is, accordingly, 0.181.
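
The numerical transition matrix of this example was given in the original as a picture and is lost here. The sketch below uses p11 = 0.8, p12 = 0.2, p21 = 0.9, p22 = 0.1 - assumed values chosen because they reproduce the quoted result of 0.819 / 0.181 - and applies the recursive formula p(n) = p(n-1) · P1 for three days:

import numpy as np

# assumed transition matrix (consistent with the final answer quoted in the text)
P1 = np.array([[0.8, 0.2],    # serviceable -> serviceable / faulty
               [0.9, 0.1]])   # faulty      -> serviceable / faulty

p = np.array([0.0, 1.0])      # at the initial moment the device is faulty

for day in range(1, 4):
    p = p @ P1                # p(n) = p(n-1) * P1
    print("day", day, p)      # day 3 gives approximately [0.819, 0.181]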


This article is a digest, written on the basis of the sources cited at the end, which are quoted in places.

Introduction to the theory of Markov chains

A Markov chain is a sequence of random events in which the probability of each event depends only on the state in which the process is currently located and does not depend on earlier states. A finite discrete chain is defined by:

  1. set of states S = ( s 1 , …, s n ), an event is a transition from one state to another as a result of a random test
  2. vector of initial probabilities (initial distribution) p (0) = ( p (0) (1),…, p (0) (n)), which determines the probabilities p (0) (i) that at the initial time t = 0 process was in state s i
  3. the matrix of transition probabilities P = ( p i j ), which characterizes the probability of transition of the process from the current state s i to the next state s j , where the sum of the probabilities of transitions from any one state is equal to 1:

∑ j =1… n p i j = 1

An example of a matrix of transition probabilities with a set of states S = ( S 1 , …, S 5 ), a vector of initial probabilities p (0) = (1, 0, 0, 0, 0):

Using the initial probability vector and the transition matrix, we can calculate the stochastic vector p (n) - a vector composed of the probabilities p (n) (i) that the process will be in state i at time n. You can get p(n) using the formula:

p(n) = p(0) × P^n

The vectors p (n) with the growth of n in some cases stabilize - they converge to some probability vector ρ, which can be called the stationary distribution of the chain. Stationarity manifests itself in the fact that taking p (0) = ρ, we get p (n) = ρ for any n.
The simplest criterion that guarantees convergence to a stationary distribution is as follows: if all elements of the transition probability matrix P are positive, then as n tends to infinity, the vector p (n) tends to the vector ρ , which is the only solution to the system of the form
p × P = p.
It can also be shown that if, for some positive value of n, all elements of the matrix P n are positive, then the vector p (n) will stabilize anyway.
The proof of these assertions is given in detail.
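
As a quick check of this criterion, here is a small sketch (with an assumed everywhere-positive transition matrix) that computes p(n) = p(0) × P^n for growing n and watches the vector stabilize:

import numpy as np

# assumed transition matrix with all elements positive
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
p0 = np.array([1.0, 0.0, 0.0])   # initial distribution p(0)

for n in (1, 5, 20, 50):
    print(n, p0 @ np.linalg.matrix_power(P, n))   # p(n) stabilizes as n grows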

A Markov chain is depicted as a transition graph, whose vertices correspond to the states of the chain and whose arcs correspond to the transitions between them. The weight of the arc (i, j) connecting the vertices s i and s j is equal to the probability p i j of the transition from the first state to the second. The graph corresponding to the matrix shown above:


Classification of states of Markov chains

When considering Markov chains, we may be interested in the behavior of the system in a short period of time. In this case, the absolute probabilities are calculated using the formulas from the previous section. However, it is more important to study the behavior of the system over a large time interval, when the number of transitions tends to infinity. Next, definitions of the states of Markov chains are introduced, which are necessary to study the behavior of the system in the long term.
Markov chains are classified depending on the possibility of transition from one state to another.
The groups of states of a Markov chain (a subset of vertices of the transition graph) to which dead-end vertices of the order diagram of the transition graph correspond are called ergodic classes of the chain. If we consider the graph shown above, we can see that it contains 1 ergodic class M 1 = ( S 5 ), reachable from the strongly connected component corresponding to the subset of vertices M 2 = ( S 1 , S 2 , S 3 , S 4 ). States that are in ergodic classes are called essential, and the rest are called inessential (although such names do not agree well with common sense). The absorbing state s i is a special case of the ergodic class. Then, once in such a state, the process will stop. For S i it will be true p ii = 1, i.e. in the transition graph, only one edge will come out of it - a loop.

Absorbing Markov chains are used as temporary models of programs and computational processes. When modeling a program, the states of the chains are identified with the blocks of the program, and the matrix of transition probabilities determines the order of transitions between the blocks, depending on the structure of the program and the distribution of the initial data, the values ​​of which affect the development of the computational process. As a result of the representation of the program by an absorbing chain, it is possible to calculate the number of calls to program blocks and the program execution time, estimated by means, variances, and, if necessary, distributions. Using these statistics in the future, you can optimize the program code - apply low-level methods to speed up critical parts of the program. This technique is called code profiling.

For example, in Dijkstra's algorithm, there are the following circuit states:

  • vertex (v), extracting a new vertex from the priority queue, transition to state b only;
  • begin (b), the beginning of the cycle of enumeration of outgoing arcs for the weakening procedure;
  • analysis (a), analysis of the next arc, jump to a , d , or e ;
  • decrease (d), reduce the score for some graph vertex, go to a ;
  • end (e), end of the loop, move to the next vertex.

It remains to set the transition probabilities between vertices, and we can study the duration of transitions between vertices, the probabilities of getting into different states, and other average characteristics of the process.

Similarly, the computational process, which is reduced to requests for system resources in the order determined by the program, can be represented by an absorbing Markov chain, the states of which correspond to the use of system resources - processor, memory and peripheral devices, transition probabilities reflect the order of access to various resources. Due to this, the computational process is presented in a form convenient for the analysis of its characteristics.

A Markov chain is called irreducible if any state S j can be reached from any other state S i in a finite number of transitions. In this case, all states of the chain are said to be communicating, and the transition graph is a strongly connected component. The process generated by an ergodic chain, starting in a certain state, never ends, but successively passes from one state to another, falling into different states with different frequencies, depending on the transition probabilities. Therefore, the main characteristic of an ergodic chain is
the probabilities of the process being in states S j , j = 1,…, n , the fraction of time that the process spends in each of the states. Irreducible chains are used as systems reliability models. Indeed, if a resource that the process uses very often fails, the health of the entire system will be in jeopardy. In such a case, duplication of such a critical resource can help avoid failures. At the same time, the states of the system, which differ in the composition of the serviceable and failed equipment, are interpreted as the states of the circuit, the transitions between which are associated with failures and restoration of devices and changes in the connections between them, carried out to maintain the system's operability. Estimates of the characteristics of an irreducible circuit give an idea of ​​the reliability of the behavior of the system as a whole. Also, such chains can be models of the interaction of devices with tasks coming for processing.

Examples of using

Failure Service System

The server consists of several blocks, such as modems or network cards, which receive requests from users for service. If all blocks are busy, the request is lost. If a free block receives a request, it becomes busy until the end of its processing. As states, we take the number of busy blocks. Time will be discrete. Denote by α the probability of receiving a request. We also assume that the service time is random and memoryless, i.e. a request is served in one step with probability β, and with probability (1 - β) it continues to be served after this step as if it were a new request. This gives the probability (1 - β) β for a two-step service, (1 - β) 2 β for a three-step service, and so on. Consider an example with 4 devices operating in parallel. Let us write the matrix of transition probabilities for the chosen states:

1-α α 0 0 0
β 1-α-β α 0 0
0 2β 1-α-2β α 0
0 0 3β 1-α-3β α
0 0 0 4β 1-4β

It can be seen that the chain has a unique ergodic class, and hence the system p × P = p has a unique solution in the class of probability vectors. Let us write down the equations of the system that allows us to find this solution:

p 0 (1 - α) + p 1 β = p 0 ,
p 0 α + p 1 (1 - α - β) + p 2 2 β = p 1 ,
p 1 α + p 2 (1 - α - 2 β) + p 3 3 β = p 2 ,
p 2 α + p 3 (1 - α - 3 β) + p 4 4 β = p 3 ,
p 3 α + p 4 (1 - 4 β) = p 4 .

From here we obtain (for γ = α / β):

p 1 = γ p 0 ,
p 2 = γ 2 p 0 /2!,
p 3 = γ 3 p 0 /3!,
p 4 = γ 4 p 0 /4!,

from the normalization condition it follows:

p 0 = C = (1 + γ + γ 2 /2! + γ 3 /3! + γ 4 /4!) -1 .

Now we know the set of probabilities π i that i blocks are busy in the system in the stationary regime. The fraction of time p 4 = C γ 4 /4! is when all blocks of the system are occupied and the system does not respond to requests. The results obtained apply to any number of blocks. Now you can use them: you can compare the cost of additional devices against the reduction in the time during which the system is completely busy. A short numerical sketch is given below.
You can read more about this example in .
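
A quick numerical sketch of this model (α and β below are assumed values, chosen only for illustration):

from math import factorial

# assumed request-arrival and service probabilities
alpha, beta = 0.3, 0.5
gamma = alpha / beta

# unnormalized stationary probabilities p_i = gamma^i / i! * p0
weights = [gamma ** i / factorial(i) for i in range(5)]
p0 = 1.0 / sum(weights)
p = [w * p0 for w in weights]

print(p)      # stationary distribution over the number of busy blocks
print(p[4])   # fraction of time all four blocks are busy and requests are lost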

Decision processes with a finite and infinite number of stages

Consider a process in which there are several matrices of transition probabilities. For each moment of time, the choice of one or another matrix depends on the decision we have made. This can be understood through the following example. As a result of soil analysis, a gardener evaluates its condition with one of three numbers: (1) good, (2) satisfactory, or (3) poor. At the same time, the gardener noticed that the productivity of the soil in the current year depends only on its condition in the previous year. Therefore, the probabilities of the soil passing from one state to another without external influences can be represented by the following Markov chain with matrix P1:

Now we can assign to each transition from one state to another some income function, which is defined as the profit or loss over a one-year period. The gardener can choose whether or not to use fertilizer; his final income or loss depends on this choice. Let us introduce the matrices R1 and R2, which determine the income functions depending on the fertilizer costs and the soil quality:

R1
7.00 6.00 3.00
0.00 5.00 1.00
0.00 0.00 -1.00

R2
6.00 5.00 -1.00
7.00 4.00 0.00
6.00 3.00 -2.00

Finally, the gardener is faced with the problem of which strategy to choose in order to maximize the average expected income. Two types of problems can be considered: with a finite and with an infinite number of stages. Here we consider the first case, in which the gardener's activity will certainly end someday. In addition, the visualizers solve the decision problem for a finite number of stages. Let the gardener intend to stop his occupation in N years. Our task now is to determine the optimal strategy of the gardener's behavior, that is, the strategy that maximizes his income. The finiteness of the number of stages in our problem is manifested in the fact that the gardener does not care what happens to his agricultural land in year N + 1 (all years up to and including N are important to him). Now it is clear that in this case the problem of finding a strategy turns into a dynamic programming problem. If we denote by f n (i) the maximum average expected income that can be obtained at stages from n to N inclusive, starting from the state with number i, then it is easy to derive a recursive relation connecting f n (i) with the numbers f n+1 (j):

f n (i) = max k (∑ j =1… m p ij k [ r ij k + f n +1 (j)]), n = 1, 2, …, N

Here k is the number of the strategy used. This equation is based on the fact that the total income r ij k + f n +1 (j) is obtained as a result of the transition from state i at stage n to state j at stage n +1 with probability p ij k .
Now the optimal solution can be found by sequentially computing f n (i) in the downward direction (n = N ... 1). At the same time, the introduction of the vector of initial probabilities into the condition of the problem will not complicate its solution.
This example is also discussed in .
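
A sketch of this backward dynamic programming recursion in Python; the transition matrices P1 and P2 below are assumed placeholders (the original gives them as pictures), while R1 and R2 are taken from the tables above:

# assumed transition matrices for "no fertilizer" (index 0) and "fertilizer" (index 1)
P = [
    [[0.2, 0.5, 0.3], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],    # P1 (assumption)
    [[0.3, 0.6, 0.1], [0.1, 0.6, 0.3], [0.05, 0.4, 0.55]],  # P2 (assumption)
]
# income matrices from the text: R1 (no fertilizer) and R2 (fertilizer)
R = [
    [[7.0, 6.0, 3.0], [0.0, 5.0, 1.0], [0.0, 0.0, -1.0]],
    [[6.0, 5.0, -1.0], [7.0, 4.0, 0.0], [6.0, 3.0, -2.0]],
]

N, m = 5, 3                  # planning horizon in years and number of soil states
f = [0.0] * m                # f_{N+1}(j) = 0: nothing after year N matters

for n in range(N, 0, -1):    # move backwards from stage N to stage 1
    f = [
        max(
            sum(P[k][i][j] * (R[k][i][j] + f[j]) for j in range(m))
            for k in range(2)
        )
        for i in range(m)
    ]

print(f)   # maximum expected income over N years for each starting soil state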

Modeling word combinations in text

Consider a text consisting of words w. Imagine a process in which the states are words, so that when the system is in the state s i it goes to the state s j according to the matrix of transition probabilities. First of all, it is necessary to "train" the system: feed a sufficiently large text to the input in order to estimate the transition probabilities. Then one can build the trajectories of the Markov chain. An increase in the semantic load of a text constructed using the Markov chain algorithm is possible only with an increase in the order, where the state is not one word but a set with greater power - pairs (u, v), triples (u, v, w), etc. However, in chains of the first or even of the fifth order there will still be little sense. Meaning begins to appear when the order increases to at least the average number of words in a typical phrase of the source text. But it is impossible to proceed this way, because the growth of the semantic load of the text in high-order Markov chains is much slower than the decline in the uniqueness of the text. A text built on Markov chains of, say, the thirtieth order will still not be meaningful enough to interest a person, but it will already be quite similar to the original text; moreover, the number of states in such a chain will be enormous.
This technology is now very widely used (unfortunately) on the Internet to create content for web pages. People who want to increase traffic to their site and improve its ranking in search engines tend to put as many search keywords on their pages as possible. But search engines use algorithms that can distinguish real text from a rambling jumble of keywords. Then, in order to deceive the search engines, they use texts created by the generator based on the Markov chain. There are, of course, positive examples of using Markov chains to work with text, they are used to determine authorship and analyze the authenticity of texts.

Markov chains and lotteries

In some cases, a probabilistic model is used in predicting numbers in various lotteries. Apparently, there is no point in using Markov chains to model the sequence of different circulations. What happened to the balls in the draw will not affect the results of the next draw, because after the draw the balls are collected, and in the next draw they are placed in the lottery drum in a fixed order. The connection with the previous edition is lost. Another thing is the sequence of balls falling within the same draw. In this case, the fall of the next ball is determined by the state of the lottery drum at the time of the fall of the previous ball. Thus, the sequence of balls falling in one draw is a Markov chain, and such a model can be used. When analyzing numerical lotteries, there is a great difficulty here. The state of the lottery drum after the next ball has fallen determines further events, but the problem is that this state is unknown to us. All we know is that some ball fell out. But when this ball falls out, the other balls can be arranged in different ways, so that there is a group of a very large number of states corresponding to the same observed event. Therefore, we can only construct a matrix of transition probabilities between such groups of states. These probabilities are an average of the transition probabilities between different individual states, which of course reduces the effectiveness of applying the Markov chain model to number lotteries.
Similarly, such a model can be used for weather forecasting, currency quotes, and other systems for which historical data are available and new information can be used in the future. A good option in this case, when only the manifestations of the system are known but not its internal (hidden) states, is to apply hidden Markov models, which are discussed in detail in the corresponding Wikibooks article (hidden Markov models).

Literature

  1. Romanovsky I.V. Discrete Analysis: Textbook for Students, 3rd ed. - St. Petersburg: Nevsky Dialect; BHV Petersburg, 2003.
  2. Taha, Hamdy A. An Introduction to Operations Research, 6th ed. - M .: Williams Publishing House, 2001.
  3. Werner M. Fundamentals of coding. Textbook for universities. — M.: Technosfera, 2004.
  4. Belyaev A., Gavrilov M., Masalskikh A., Medvinsky M. Markov processes, 2004.

Visualizers

  1. Alekseev V. Markov Decision Processes, 2002.
  2. Belyaev A. Markov chains, 2002.
June 1, 2016 at 04:31 pm

Developing a class for working with Markov chains

  • C++ ,
  • Algorithms

Today I would like to tell you about writing a class to simplify the work with Markov chains.

Details are under the cut.

Basic knowledge:

Representation of graphs in the form of an adjacency matrix, knowledge of the basic concepts of graphs. Knowledge of C++ for the practical part.

Theory

A Markov chain is a sequence of random events with a finite or countable number of outcomes, characterized by the property that, loosely speaking, with a fixed present, the future is independent of the past. Named after A. A. Markov (senior).

In simple terms, a Markov chain is a weighted graph. Events are located at its vertices, and the weight of the edge connecting vertices A and B is the probability that event A will be followed by event B.

Quite a few articles have been written about the use of Markov chains, but we will continue to develop the class.

Here is an example of a Markov chain:

In the following, we will consider this scheme as an example.

Obviously, if there is only one outgoing edge from vertex A, then its weight will be equal to 1.

Notation
At the vertices we have events (A, B, C, D, E...). On the edges we have the probability that the i-th event will be followed by an event j > i. For convenience, I numbered the vertices (No. 1, No. 2, etc.).

A matrix is ​​the adjacency matrix of a directed weighted graph, which represents the Markov chain. (more on this later). In this particular case, this matrix is ​​also called the transition probability matrix or simply the transition matrix.

Matrix representation of a Markov chain
We will represent Markov chains using a matrix, exactly the adjacency matrix with which we represent graphs.

Let me remind you:

The adjacency matrix of a graph G with a finite number of vertices n (numbered from 1 to n) is a square matrix A of size n, in which the value of the element aij is equal to the number of edges from the i-th vertex of the graph to the j-th vertex.

More about adjacency matrices - in the course of discrete mathematics.

In our case, the matrix will have a size of 10x10, let's write it:

0 50 0 0 0 0 50 0 0 0
0 0 80 20 0 0 0 0 0 0
0 0 0 0 100 0 0 0 0 0
0 0 0 0 100 0 0 0 0 0
0 0 0 0 0 100 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 70 30 0
0 0 0 0 0 0 0 0 0 100
0 0 0 0 0 0 0 0 0 100
0 0 0 0 0 100 0 0 0 0

Idea
Take a closer look at our matrix. In each row, we have non-zero values ​​in those columns whose number matches the next event, and the non-zero value itself is the probability that this event will occur.

Thus, we have the probability of the occurrence of the event whose number equals the column number right after the event whose number equals the row number.

Those who know the theory of probability could guess that each line is a probability distribution function.

Markov chain traversal algorithm

1) initialize the initial position k with a zero vertex.
2) If the vertex is not final, then we generate a number m from 0...n-1 based on the probability distribution in row k of the matrix, where n is the number of vertices and m is the number of the next event (!). Otherwise, we exit.
3) The number of the current position k is equal to the number of the generated vertex
4) Go to step 2

Note: a vertex is final if its row of the probability distribution consists of zeros (see the 6th row in the matrix).

Pretty neat algorithm, right?

Implementation

In this article, I want to present separately the implementation code of the described traversal. The initialization and filling of the Markov chain are of no particular interest (see the complete code at the end).

Implementation of the traversal algorithm

template <typename Element>
Element *Markov<Element>::Next(int StartElement = -1)
{
    if (Markov<Element>::Initiated)                  // if the adjacency matrix has been created
    {
        if (StartElement == -1)                      // if the start element is the default one
            StartElement = Markov<Element>::Current; // then continue from the current one (Current = 0 in the constructor)

        std::random_device rd;
        std::mt19937 gen(rd());
        std::discrete_distribution<> dicr_distr(
            Markov<Element>::AdjacencyMatrix.at(Current).begin(),
            Markov<Element>::AdjacencyMatrix.at(Current).end());
        // initialize the container that generates a number based on the probability distribution

        int next = dicr_distr(gen);                  // generate the next vertex
        if (next == Markov<Element>::size())         // a subtlety of the generator: if the probability
            return NULL;                             // distribution is all zeros, it returns the number of elements

        Markov<Element>::Current = next;             // change the current vertex
        return &(Markov<Element>::elems.at(next));   // return the value at the vertex
    }
    return NULL;
}

This algorithm looks especially simple due to the features of the container discrete_distribution. It is rather difficult to describe in words how this container works, so let's take the 0th row of our matrix as an example:

0 50 0 0 0 0 50 0 0 0

As a result, the generator will return either 1 or 6, with a probability of 0.5 for each. That is, it returns the number of the column (which is equivalent to the number of the vertex in the chain) where the movement should continue.

An example program that uses the class:

Implementation of a program that traverses the Markov chain from the example

#include #include "Markov.h" #include #include using namespace std; int main() ( Markov chain; outstream outs; outs.open("out.txt"); ifstream ins; ins.open("matrix.txt"); int num; doubleProb = 0; (ins >> num).get(); // number of vertices string str; for (int i = 0; i< num; i++) { getline(ins, str); chain.AddElement(str); // добавляем вершину } if (chain.InitAdjacency()) // инициализируем матрицу нулями { for (int i = 0; i < chain.size(); i++) { for (int j = 0; j < chain.size(); j++) { (ins >>Prob).get(); if (!chain.PushAdjacency(i, j, Prob)) // push matrix ( cerr<< "Adjacency matrix write error" << endl; } } } outs << chain.At(0) << " "; // выводим 0-ю вершину for (int i = 0; i < 20 * chain.size() - 1; i++) // генерируем 20 цепочек { string *str = chain.Next(); if (str != NULL) // если предыдущая не конечная outs << (*str).c_str() << " "; // выводим значение вершины else { outs << std::endl; // если конечная, то начинаем с начала chain.Current = 0; outs << chain.At(0) << " "; } } chain.UninitAdjacency(); // понятно } else cerr << "Can not initialize Adjacency matrix" << endl;; ins.close(); outs.close(); cin.get(); return 0; }


An example of the output that the program generates: