Transition probabilities of a homogeneous Markov chain. Homogeneous Markov chains

  • Date: 22.09.2019

Markov chains

Introduction

§ 1. Markov chain

§ 2. Homogeneous Markov chain. Transition probabilities. Transition matrix

§3. Markov equality

§4. Stationary distribution. Theorem on limiting probabilities

§5. Proof of the theorem on limiting probabilities in a Markov chain

§6. Applications of Markov chains

Conclusion

List of used literature

Introduction

The theme of our course work is the Markov chain. Markov chains are named after the outstanding Russian mathematician Andrei Andreevich Markov, who worked extensively on random processes and made a great contribution to the development of this field. In recent times one can hear about the application of Markov chains in a variety of areas: modern web technologies, the analysis of literary texts, or even the development of football team tactics. For those who do not know what Markov chains are, there may be a feeling that this is something very complex and almost inaccessible to understanding.

No, it's just the opposite. A Markov chain is one of the simplest cases of a sequence of random events. But, despite its simplicity, it can often be useful even when describing rather complex phenomena. A Markov chain is a sequence of random events in which the probability of each event depends only on the previous one, but does not depend on earlier events.

Before delving deeper, it is necessary to consider a few auxiliary issues that are well known, but are absolutely necessary for further presentation.

The purpose of my course work is to study in more detail the applications of Markov chains and the statement of Markov problems.

§1. Markov chain

Imagine that a sequence of tests is being performed.

Definition. A Markov chain is a sequence of trials in each of which one and only one of the k incompatible events A1, A2, ..., Ak of a full group occurs, and the conditional probability pij(s) that event Aj occurs in the s-th trial, given that event Ai occurred in the (s-1)-th trial, does not depend on the results of earlier trials.

For example, if the sequence of trials forms a Markov chain and the full group consists of four disjoint events A1, A2, A3, A4, and it is known that, say, event A2 appeared in the sixth trial, then the conditional probability that event A4 occurs in the seventh trial does not depend on what events appeared in the first, second, ..., fifth trials.

Note that independent trials are a special case of a Markov chain. Indeed, if the trials are independent, then the occurrence of some specific event in any trial does not depend on the results of previously performed trials. It follows that the concept of a Markov chain is a generalization of the concept of independent trials.

Often, when presenting the theory of Markov chains, a different terminology is used: one speaks of some physical system S, which at each moment of time is in one of the states S1, S2, ..., Sn and changes its state only at certain moments of time, that is, the system passes from one state to another (for example, from Si to Sj). For Markov chains, the probability of passing to any state Sj at the moment t + 1 depends only on the state of the system at the moment t and does not change when its states at earlier moments become known. In particular, after a trial the system can remain in the same state ("transition" from state Si to state Si).

To illustrate, consider an example.

Example 1. Imagine that a particle located on a straight line moves along this line under the influence of random pushes occurring at moments t = 1, 2, 3, .... The particle can be located at the points with integer coordinates 1, 2, ..., n; at points 1 and n there are reflecting walls. Each push moves the particle to the right with probability p and to the left with probability q = 1 - p, unless the particle is near a wall. If the particle is near a wall, any push moves it one unit inside the gap between the walls. This particle walk is a typical Markov chain.
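To make the example concrete, here is a minimal simulation sketch of this walk in Python; the number of points n, the push probability p, the number of steps, and the starting point are hypothetical values chosen only for illustration.

import random

def reflecting_walk(n=5, p=0.6, steps=20, start=3):
    """Simulate the particle walk with reflecting walls at points 1 and n."""
    x = start
    path = [x]
    for _ in range(steps):
        if x == 1:          # near the left wall: any push moves the particle right
            x = 2
        elif x == n:        # near the right wall: any push moves the particle left
            x = n - 1
        elif random.random() < p:
            x += 1          # push to the right with probability p
        else:
            x -= 1          # push to the left with probability q = 1 - p
        path.append(x)
    return path

print(reflecting_walk())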

Thus, the events are called the states of the system, and the trials are called changes of its states.

We now give a definition of a Markov chain using the new terminology.

A discrete-time Markov chain is a chain whose states change at certain fixed points in time.

A continuous-time Markov chain is a chain whose states can change at arbitrary random moments of time.

§2. Homogeneous Markov chain. Transition probabilities. Transition matrix

Definition. A Markov chain is called homogeneous if the conditional probability pij(s) (of the transition from state i to state j) does not depend on the trial number s. Therefore, instead of pij(s) one writes simply pij.

Example 1. Random walk. Let a material particle be located on a straight line at a point with an integer coordinate. At certain moments of time the particle experiences pushes. Under the action of a push the particle is displaced by one unit to the right with probability p and by one unit to the left with probability 1 - p. It is clear that the position (coordinate) of the particle after a push depends on where the particle was after the immediately preceding push, and does not depend on how it moved under the action of the earlier pushes.

Thus, a random walk is an example of a homogeneous Markov chain with discrete time.

A Markov chain is a chain of events in which the probability of each event depends only on the previous state.

This article is of an abstract nature, based on the sources listed at the end, which are cited in places.

Introduction to the theory of Markov chains

A Markov chain is a sequence of random events in which the probability of each event depends only on the state in which the process is currently located and does not depend on earlier states. A finite discrete chain is defined by a set of states S = {s1, ..., sn}, a vector of initial probabilities p(0) giving the probability of each state at the initial moment of time, and a matrix of transition probabilities P = {pij}, where pij is the probability of passing from state si to state sj; each row of this matrix sums to one:

∑ j=1…n p ij = 1

An example of a matrix of transition probabilities with the set of states S = {S1, ..., S5} and the vector of initial probabilities p(0) = (1, 0, 0, 0, 0):

Using the vector of initial probabilities and the transition matrix, one can calculate the stochastic vector p(n), composed of the probabilities p(n)(i) that the process will be in state i at time n. You can obtain p(n) using the formula:

p(n) = p(0) × P^n

In some cases the vectors p(n) stabilize as n grows: they converge to a probability vector ρ, which can be called the stationary distribution of the chain. Stationarity manifests itself in the fact that, taking p(0) = ρ, we get p(n) = ρ for any n.

The simplest criterion that guarantees convergence to a stationary distribution is as follows: if all elements of the transition probability matrix P are positive, then as n tends to infinity the vector p(n) tends to the vector ρ, which is the unique solution of the system ρ × P = ρ, ∑ i ρ(i) = 1.

It can also be shown that if for some positive n all elements of the matrix P^n are positive, then the vector p(n) will still stabilize.

The proof of these assertions is given in detail.
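As an illustration of the formula p(n) = p(0) × P^n and of this stabilization effect, here is a minimal numpy sketch; the matrix P and the vector p(0) are hypothetical, chosen only so that all entries of P are positive.

import numpy as np

# Hypothetical row-stochastic transition matrix (all entries positive)
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
p0 = np.array([1.0, 0.0, 0.0])  # initial distribution

# p(n) = p(0) * P^n
for n in (1, 5, 20, 50):
    pn = p0 @ np.linalg.matrix_power(P, n)
    print(n, pn.round(4))
# For large n the vectors p(n) stop changing: they converge
# to the stationary distribution rho satisfying rho = rho * P.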

A Markov chain can be depicted as a transition graph, whose vertices correspond to the states of the chain and whose arcs correspond to the transitions between them. The weight of the arc (i, j) connecting the vertices si and sj is equal to the probability pij of the transition from the first state to the second. The graph corresponding to the matrix shown above:

Classification of states of Markov chains

When considering Markov chains, we may be interested in the behavior of the system in a short period of time. In this case, the absolute probabilities are calculated using the formulas from the previous section. However, it is more important to study the behavior of the system over a large time interval, when the number of transitions tends to infinity. Next, definitions of the states of Markov chains are introduced, which are necessary to study the behavior of the system in the long term.

Markov chains are classified depending on the possibility of transition from one state to another.

The groups of states of a Markov chain (subsets of vertices of the transition graph) to which the dead-end vertices of the condensation of the transition graph correspond are called the ergodic classes of the chain. If we consider the graph shown above, we see that it has one ergodic class M1 = {S5}, reachable from the strongly connected component corresponding to the subset of vertices M2 = {S1, S2, S3, S4}. The states belonging to ergodic classes are called essential, and the rest are called inessential (although such names do not agree well with common sense). An absorbing state si is a special case of an ergodic class; once the process enters such a state, it stops. For si we have pii = 1, i.e. in the transition graph only one edge leaves it, a loop.

Absorbing Markov chains are used as temporary models of programs and computational processes. When modeling a program, the states of the chains are identified with the blocks of the program, and the matrix of transition probabilities determines the order of transitions between the blocks, depending on the structure of the program and the distribution of the initial data, the values ​​of which affect the development of the computational process. As a result of the representation of the program by the absorbing chain, it is possible to calculate the number of calls to program blocks and the program execution time, estimated by means, variances, and, if necessary, distributions. Using these statistics in the future, you can optimize the program code - apply low-level methods to speed up critical parts of the program. This technique is called code profiling.
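As a sketch of how such estimates can be obtained, the standard approach for an absorbing chain uses the fundamental matrix N = (I - Q)^(-1), whose entry N[i, j] is the expected number of visits to transient state j starting from state i. The matrix below is a hypothetical example chosen only for illustration; the cited sources may carry out the calculation differently.

import numpy as np

# Hypothetical absorbing chain: states 0..2 are program blocks (transient),
# state 3 is the absorbing "exit" state.
P = np.array([[0.0, 0.7, 0.3, 0.0],
              [0.0, 0.0, 0.8, 0.2],
              [0.1, 0.0, 0.0, 0.9],
              [0.0, 0.0, 0.0, 1.0]])

Q = P[:3, :3]                      # transitions between transient states
N = np.linalg.inv(np.eye(3) - Q)   # fundamental matrix: N[i, j] is the expected number
                                   # of visits to block j when starting from block i
print(N.round(3))
print(N.sum(axis=1).round(3))      # expected total number of steps before absorption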

For example, when modeling Dijkstra's algorithm, the chain has the following states:

    vertex (v), extracting a new vertex from the priority queue, transition to state b only;

    begin (b), the beginning of the cycle of enumeration of outgoing arcs for the weakening procedure;

    analysis (a), analysis of the next arc, possible transition to a, d, or e;

    decrease (d), decrease in the estimate for some graph vertex, transition to a;

    end (e), end of the loop, move to the next vertex.

It remains to set the transition probabilities between vertices, and we can study the duration of transitions between vertices, the probabilities of getting into different states, and other average characteristics of the process.

Similarly, the computational process, which is reduced to requests for system resources in the order determined by the program, can be represented by an absorbing Markov chain, the states of which correspond to the use of system resources - processor, memory and peripheral devices, transition probabilities reflect the order of access to various resources. Due to this, the computational process is presented in a form convenient for the analysis of its characteristics.

A Markov chain is said to be irreducible if any state Sj can be reached from any other state Si in a finite number of transitions. In this case, all states of the chain are said to be communicating, and the transition graph is a strongly connected component. The process generated by an ergodic chain, starting in a certain state, never ends, but successively passes from one state to another, falling into different states with different frequencies, depending on the transition probabilities. Therefore, the main characteristic of an ergodic chain is

the probabilities of the process being in the states Sj, j = 1, ..., n, i.e. the fraction of time the process spends in each of the states. Irreducible chains are used as models of system reliability. Indeed, if a resource that the process uses very often fails, the operability of the entire system will be in jeopardy. In such a case, duplication of this critical resource can help avoid failures. The states of the system, which differ in the composition of the operational and failed equipment, are interpreted as the states of the chain, and the transitions between them are associated with failures and restorations of devices and with changes in the connections between them carried out to maintain the system's operability. Estimates of the characteristics of an irreducible chain give an idea of the reliability of the behavior of the system as a whole. Such chains can also serve as models of the interaction of devices with tasks arriving for processing.

Examples of use

A service system with failures

A server consists of several blocks, such as modems or network cards, which receive user requests for service. If all blocks are busy, the request is lost. If a block receives a request, it remains busy until the end of its processing. As states we take the number of idle blocks. Time is discrete. Denote by α the probability that a request arrives. We also assume that the service time is random and consists of independent continuations, i.e. a request is served in one step with probability β, and with probability (1 - β) it is served after this step as a new request. This gives the probability (1 - β)β for a two-step service, (1 - β)^2 β for a three-step service, and so on. Consider an example with 4 devices operating in parallel. Let us write the matrix of transition probabilities for the chosen states:

It can be seen that it has a unique ergodic class, and hence the system p × P = p has a unique solution in the class of probability vectors. Let us write down the equations of the system that allows us to find this solution:


Now we know the set of probabilities πi that i blocks are occupied in the system in the stationary regime. Then p4 is the fraction of time during which all blocks in the system are occupied and the system does not respond to requests. The results obtained apply to any number of blocks. Now you can use them: for example, you can compare the cost of additional devices with the reduction in the time during which the system is completely busy.
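A minimal sketch of how such a system p × P = p, together with the normalization condition, can be solved numerically; the matrix below is a hypothetical 3-state example, not the actual matrix of the service system, which is not reproduced here.

import numpy as np

def stationary_distribution(P):
    """Solve p = p * P together with sum(p) = 1 for a row-stochastic matrix P."""
    n = P.shape[0]
    # (P^T - I) p^T = 0, plus a normalization row of ones
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Hypothetical transition matrix of a small service system (3 states)
P = np.array([[0.6, 0.3, 0.1],
              [0.4, 0.4, 0.2],
              [0.2, 0.3, 0.5]])
print(stationary_distribution(P).round(4))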

You can read more about this example in .

Decision processes with a finite and infinite number of stages

Consider a process in which there are several matrices of transition probabilities. At each moment of time, the choice of one or another matrix depends on the decision we have made. This can be understood from the following example. As a result of a soil analysis, a gardener evaluates its condition with one of three numbers: (1) good, (2) satisfactory, or (3) poor. The gardener has noticed that the productivity of the soil in the current year depends only on its condition in the previous year. Therefore, the probabilities of soil transition from one state to another without external influence can be represented by the following Markov chain with matrix P1:

Naturally, soil productivity deteriorates over time. For example, if last year the state of the soil was satisfactory, then this year it can only remain the same or become poor, but it will never become good. However, the gardener can influence the state of the soil and change the transition probabilities in the matrix P1 to those of the matrix P2:

Now we can assign to each transition from one state to another an income function, defined as a profit or loss over a one-year period. The gardener can choose whether or not to use fertilizer; his final income or loss will depend on this. Let us introduce the matrices R1 and R2, which define the income functions depending on the costs of fertilizer and the quality of the soil:

Finally, the gardener faces the task of choosing a strategy that maximizes the average expected income. Two types of problems can be considered: with a finite and with an infinite number of stages. In the first case, the gardener's activity will certainly end someday. In addition, the visualizers solve the decision problem for a finite number of stages. Let the gardener intend to stop his occupation in N years. Our task now is to determine the optimal strategy of the gardener's behavior, that is, the strategy that will maximize his income. The finiteness of the number of stages in our problem manifests itself in the fact that the gardener does not care what happens to his agricultural land in year N + 1 (all years up to N inclusive are important to him). It is now clear that in this case the problem of finding a strategy turns into a dynamic programming problem. If we denote by fn(i) the maximum average expected income that can be obtained in stages from n to N inclusive, starting from the state with number i, then it is easy to derive the recurrence relation

fn(i) = max over k of ∑ j pijk (rijk + fn+1(j)), with fN+1(j) = 0,

where k is the number of the strategy used. This equation is based on the fact that the total income rijk + fn+1(j) is obtained as a result of the transition from state i at stage n to state j at stage n + 1 with probability pijk.

Now the optimal solution can be found by sequentially computing fn(i) backward (n = N, ..., 1). Introducing the vector of initial probabilities into the problem statement does not complicate its solution.
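A minimal sketch of this backward recursion in Python; the matrices P1, P2, R1, R2 from the article are not reproduced above, so the matrices below are hypothetical placeholders used only to show the mechanics of the computation.

import numpy as np

# Hypothetical data: P[k] and R[k] are the transition and income matrices
# for strategy k (k = 0: no fertilizer, k = 1: fertilizer).
P = [np.array([[0.2, 0.5, 0.3],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]),
     np.array([[0.3, 0.6, 0.1],
               [0.1, 0.6, 0.3],
               [0.05, 0.4, 0.55]])]
R = [np.array([[7, 6, 3],
               [0, 5, 1],
               [0, 0, -1]]),
     np.array([[6, 5, -1],
               [7, 4, 0],
               [6, 3, -2]])]

def optimal_policy(P, R, N):
    n = P[0].shape[0]
    f = np.zeros(n)                 # f_{N+1}(j) = 0
    policy = []
    for stage in range(N, 0, -1):   # go backwards, n = N .. 1
        # expected income of each strategy k in each state i:
        # sum over j of p_ij^k * (r_ij^k + f(j))
        values = np.array([(P[k] * (R[k] + f)).sum(axis=1) for k in range(len(P))])
        policy.append(values.argmax(axis=0))   # best strategy for each state
        f = values.max(axis=0)                 # f_n(i)
    return f, policy[::-1]

f1, policy = optimal_policy(P, R, N=3)
print(f1.round(2), policy)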

This example is also discussed in .

Modeling word combinations in text

Consider a text consisting of words. Imagine a process in which the states are words, so that when the system is in state (si) it passes to state (sj) according to the matrix of transition probabilities. First of all, it is necessary to "train" the system: feed it a sufficiently large text to estimate the transition probabilities. Then one can build trajectories of the Markov chain. An increase in the semantic load of a text constructed by the Markov chain algorithm is possible only with an increase in the order, where a state is not one word but a set of greater power: pairs (u, v), triples (u, v, w), and so on. In chains of both the first and the fifth order there will still be little sense. Meaning begins to appear when the dimension of the order grows to at least the average number of words in a typical phrase of the source text. But it is impossible to go far in this way, because the growth of the semantic load of the text in high-order Markov chains is much slower than the decline in the uniqueness of the text. A text built on Markov chains of, say, the thirtieth order will still not be meaningful enough to be of interest to a person, but it will already be quite similar to the original text; moreover, the number of states in such a chain will be astounding.

This technology is now very widely used (unfortunately) on the Internet to create content for web pages. People who want to increase traffic to their site and improve its ranking in search engines tend to put as many search keywords on their pages as possible. But search engines use algorithms that can distinguish real text from a rambling jumble of keywords. Then, in order to deceive the search engines, they use texts created by the generator based on the Markov chain. There are, of course, positive examples of using Markov chains to work with text, they are used to determine authorship and analyze the authenticity of texts.

Markov chains and lotteries

In some cases a probabilistic model is used to predict numbers in various lotteries. Apparently, there is no point in using Markov chains to model the sequence of different draws. What happened to the balls in one draw does not affect the results of the next draw, because after a draw the balls are collected, and in the next draw they are placed in the lottery drum in a fixed order; the connection with the previous draw is lost. The sequence of balls falling within one draw is another matter. In this case, the fall of the next ball is determined by the state of the lottery drum at the moment the previous ball fell. Thus, the sequence of balls falling in one draw is a Markov chain, and such a model can be used. But when analyzing numerical lotteries there is a great difficulty. The state of the lottery drum after the next ball has fallen determines further events, but the problem is that this state is unknown to us. All we know is that a certain ball has fallen out. But when this ball falls out, the remaining balls can be arranged in different ways, so that there is a group of a very large number of states corresponding to the same observed event. Therefore, we can only construct a matrix of transition probabilities between such groups of states. These probabilities are averages of the transition probabilities between individual states, which of course reduces the effectiveness of applying the Markov chain model to number lotteries.

Similarly, such a model can be used for weather forecasting, currency quotes, and other systems where historical data are available and newly arriving information can be used later. In cases where only the manifestations of the system are known, but not its internal (hidden) states, hidden Markov models can be applied, which are discussed in detail in Wikibooks (hidden Markov models).

A Markov chain is a discrete-time Markov process defined on a measurable space.

Introduction

Markov random processes are named after the outstanding Russian mathematician A. A. Markov (1856-1922), who first began the study of probabilistic relationships between random variables and created a theory that can be called "probability dynamics".

Later, the foundations of this theory became the starting point for the general theory of random processes, as well as for such important applied sciences as the theory of diffusion processes, reliability theory, queuing theory, and others. At present, the theory of Markov processes and its applications are widely used in various fields.

Due to the comparative simplicity and clarity of the mathematical apparatus, the high reliability and accuracy of the solutions obtained, Markov processes have received special attention from specialists involved in operations research and the theory of optimal decision making.

Simple example: tossing a coin

Before describing the general scheme, let us turn to a simple example. Suppose that we are talking about successive tosses of a coin in a game of heads or tails; the coin is tossed at conditional moments of time t = 0, 1, ... and at each step the player can win ±1 with the same probability 1/2, so that at time t his total gain is a random variable ξ(t) with possible values j = 0, ±1, ...

Provided that ξ(t) = k, at the next step the payoff will already be equal to ξ(t+1) = k ± 1, taking the indicated values ​​j = k ± 1 with the same probability 1/2.

Conventionally, we can say that here, with the corresponding probability, there is a transition from the state ξ(t) = k to the state ξ(t + 1) = k ± 1.
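A minimal sketch simulating the player's total gain ξ(t); the number of tosses is an arbitrary choice for the example.

import random

def winnings(T=10):
    """Simulate the player's total gain xi(t) for t = 0..T."""
    xi = [0]
    for _ in range(T):
        xi.append(xi[-1] + random.choice([+1, -1]))  # win or lose 1, each with probability 1/2
    return xi

print(winnings())   # e.g. [0, 1, 0, -1, 0, 1, 2, 1, 2, 3, 2]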

Formulas and definitions

Generalizing this example, one can imagine a "system" with a countable number of possible "phase" states, which randomly passes from state to state over a discrete time t = 0, 1, ....

Let ξ(t) be its position at time t as a result of a chain of random transitions

ξ(0) → ξ(1) → ... → ξ(t) → ...   (1)

Let us formally denote all possible states by the integers i = 0, ±1, ... Suppose that, given the known state ξ(t) = k, at the next step the system passes into state ξ(t+1) = j with conditional probability

p kj = P(ξ(t+1) = j|ξ(t) = k) ... (2)

regardless of its behavior in the past, more precisely, regardless of the chain of transitions (1) up to the moment t:

P(ξ(t+1) = j|ξ(0) = i, ..., ξ(t) = k) = P(ξ(t+1) = j|ξ(t) = k) for all t, k, j ... (3) - Markov property.

Such a probabilistic scheme is called a homogeneous Markov chain with a countable number of states. Its homogeneity lies in the fact that the transition probabilities p kj defined in (2), with ∑ j p kj = 1, k = 0, ±1, ..., do not depend on time, i.e. P(ξ(t+1) = j|ξ(t) = k) = p kj, and the one-step transition probability matrix does not depend on t.

It is clear that (p kj) is a square matrix with non-negative entries whose rows sum to one.

Such a matrix (finite or infinite) is called a stochastic matrix. Any stochastic matrix can serve as a matrix of transition probabilities.

In practice: delivery of equipment by districts

Suppose that a certain company delivers equipment around Moscow: to the northern district (denote it A), the southern (B), and the central (C). The firm has a group of couriers serving these districts. It is clear that for the next delivery a courier travels to the district that is closest to him at the moment. The following has been determined statistically:

1) after a delivery to A, the next delivery is made to A in 30 cases out of 100, to B in 30 cases, and to C in 40 cases;

2) after a delivery to B, the next delivery is made to A in 40 cases, to B in 40 cases, and to C in 20 cases;

3) after a delivery to C, the next delivery is made to A in 50 cases, to B in 30 cases, and to C in 20 cases.

Thus, the area of ​​the next delivery is determined only by the previous delivery.

The matrix of transition probabilities (columns correspond to the previous district, rows to the next one) will look like this:

        from A  from B  from C
to A     0.3     0.4     0.5
to B     0.3     0.4     0.3
to C     0.4     0.2     0.2

For example, p12 = 0.4 is the probability that after a delivery to district B the next delivery will be made in district A.

Let's say that each delivery, followed by a move to the next district, takes 15 minutes. Then, according to statistics, after 15 minutes, 30% of the couriers who were in A will be in A, 30% will be in B and 40% in C.

Since at the next moment of time each courier will necessarily be in one of the districts, the sum over each column is 1. And since we are dealing with probabilities, each element of the matrix satisfies 0 < pij < 1.

The most important circumstance that allows us to interpret this model as a Markov chain is that the location of the courier at time t + 1 depends only on his location at time t.

Now let us ask a simple question: if the courier starts from C, what is the probability that after making two deliveries he will be in B, i.e. how can he reach B in 2 steps? There are several paths from C to B in 2 steps:

1) first from C to C and then from C to B;

2) C-->B and B-->B;

3) C-->A and A-->B.

Taking into account the rule of multiplication of independent events, we get that the desired probability is equal to:

P = P(CA)*P(AB) + P(CB)*P(BB) + P(CC)*P(CB)

Substituting numerical values:

P = 0.5*0.3 + 0.3*0.4 + 0.2*0.3 = 0.33

The result obtained says that if the courier started work from C, then in 33 cases out of 100 he will be in B after two deliveries.

It is clear that the calculations are simple, but if you need to determine the probability after 5 or 15 deliveries, this can take quite a long time.

Let us show a simpler way to calculate such probabilities. In order to get the transition probabilities from different states in 2 steps, we square the matrix P:

Then the element (2, 3) is the probability of going from C to B in 2 steps, which was obtained above in a different way. Note that the elements in the P2 matrix also range from 0 to 1, and the sum over the columns is 1.

Thus, if you need to determine the probability of the transition from C to B in 3 steps:

Method 1. p(CA)*P(AB) + p(CB)*P(BB) + p(CC)*P(CB) = 0.37*0.3 + 0.33*0.4 + 0.3*0.3 = 0.333, where p(CA) is the probability of the transition from C to A in 2 steps (i.e. the element (1, 3) of the matrix P2).

Method 2. Compute the matrix P3:

The matrix of transition probabilities to the 7th power will look like this:

It is easy to see that the elements of each row tend to certain numbers. This means that after a sufficiently large number of deliveries it no longer matters in which district the courier started work. Thus, at the end of the week, approximately 38.9% of couriers will be in A, 33.3% in B, and 27.8% in C.

Such convergence is guaranteed to take place if all elements of the matrix of transition probabilities belong to the interval (0, 1).
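A minimal numpy sketch reproducing the calculations of this example; the matrix is written with columns corresponding to the current district (A, B, C) and rows to the next one, in line with the column sums mentioned above.

import numpy as np

# Transition matrix of the delivery example; each column sums to 1
P = np.array([[0.3, 0.4, 0.5],
              [0.3, 0.4, 0.3],
              [0.4, 0.2, 0.2]])

P2 = np.linalg.matrix_power(P, 2)
print(P2[1, 2])            # from C to B in 2 steps: 0.33
P7 = np.linalg.matrix_power(P, 7)
print(P7.round(3))         # every column is close to the limiting distribution
                           # (approximately 0.389, 0.333, 0.278)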

This article gives a general idea of how to generate texts by modeling Markov processes. In particular, we will get acquainted with Markov chains, and as practice we will implement a small text generator in Python.

To begin with, let us write out the definitions we need from the Wikipedia page, even if they are not yet very clear to us, in order to at least roughly imagine what we are dealing with:

Markov process

Markov chain

What does all of this mean? Let's figure it out.

Basics

The first example is extremely simple. Using a sentence from a children's book, we will master the basic concept of a Markov chain, as well as define what, in our context, a corpus, links, a probability distribution, and histograms are. Although the sentence is in English, the essence of the theory will be easy to grasp.

This sentence, "One fish two fish red fish blue fish", is our corpus, that is, the base on which text will be generated in the future. It consists of eight words, but there are only five unique words: these are the links (we are, after all, talking about Markov chains). For clarity, let's color each link in its own color:

And write down the number of occurrences of each of the links in the text:

The picture above shows that the word fish occurs in the text 4 times more often than each of the other words (One, two, red, blue). That is, the probability of encountering the word fish in our corpus is 4 times higher than the probability of encountering any other word in the picture. In the language of mathematics, we can determine the distribution law of a random variable and calculate with what probability one of the words will appear in the text after the current one. The probability is computed as follows: divide the number of occurrences of the word we need in the corpus by the total number of all words in it. For the word fish this probability is 50%, since it occurs 4 times in a sentence of 8 words. For each of the remaining links this probability is 12.5% (1/8).

You can graphically represent the distribution of random variables using histograms. In this case, the frequency of occurrence of each of the links in the sentence is clearly visible:

So, our text consists of words and unique links, and we displayed the probability distribution of the appearance of each of the links in the sentence on a histogram. If it seems to you that you should not mess with statistics, read. And possibly save your life.
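A minimal sketch of this counting in Python, assuming the corpus is the sentence "One fish two fish red fish blue fish" described above.

from collections import Counter

corpus = "One fish two fish red fish blue fish".split()

histogram = Counter(corpus)                 # number of occurrences of each link
total = sum(histogram.values())             # total number of words in the corpus
probabilities = {word: count / total for word, count in histogram.items()}
print(histogram)        # fish: 4, One/two/red/blue: 1 each
print(probabilities)    # fish: 0.5, the rest: 0.125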

The essence of the definition

Now let's add to our text elements that are always implied, but not voiced in everyday speech - the beginning and end of the sentence:

Any sentence contains these invisible "beginning" and "end", let's add them as links to our distribution:

Let's return to the definition given at the beginning of the article:

A Markov process is a random process whose evolution after any given value of the time parameter t does not depend on the evolution that preceded t, provided that the value of the process at that moment is fixed.

A Markov chain is a special case of a Markov process in which the space of its states is discrete (i.e. at most countable).

So what does this mean? Roughly speaking, we are modeling a process in which the state of the system at the next moment of time depends only on its state at the current moment, and does not depend in any way on all previous states.

Imagine that in front of you there is a window that displays only the current state of the system (in our case, one word), and you need to determine what the next word will be based only on the data shown in this window. In our corpus the words follow one another according to the following pattern:

Thus, pairs of words are formed (even the end of the sentence has its own pair - an empty value):

Let's group these pairs by the first word. We will see that each word has its own set of links which, in the context of our sentence, may follow it:

Let's present this information in a different way - each link will be assigned an array of all words that may appear in the text after this link:

Let's take a closer look. We see that each link has words that may come after it in a sentence. If we were to show the diagram above to someone else, that person could, with some probability, reconstruct our initial sentence, i.e. the corpus.

Example. Let's start with the word Start. Next we choose the word One, since according to our scheme this is the only word that can follow the beginning of a sentence. After the word One, too, only one word can follow: fish. Now the new sentence in its intermediate version looks like "One fish". Further the situation becomes more complicated: after fish, the words "two", "red", "blue", or the end-of-sentence link End can each follow with an equal probability of 25%. If we assume that the next word is "two", the reconstruction will continue. But we may also choose the link End. In that case, based on our scheme, a sentence will be randomly generated that is very different from the corpus: "One fish".

We have just modeled a Markov process - we have determined each next word only on the basis of knowledge about the current one. For complete assimilation of the material, let's build diagrams that display the dependencies between the elements within our corpus. Ovals represent links. The arrows lead to potential links that can follow the word in the oval. Near each arrow - the probability with which the next link will appear after the current one:
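A minimal sketch of the process just modeled, with explicit Start and End links; the corpus is the same eight-word sentence.

import random

corpus = "One fish two fish red fish blue fish".split()
words = ["START"] + corpus + ["END"]

# For every link, collect the list of words that can follow it
followers = {}
for current, nxt in zip(words, words[1:]):
    followers.setdefault(current, []).append(nxt)

def generate():
    word, result = "START", []
    while True:
        word = random.choice(followers[word])   # the next word depends only on the current one
        if word == "END":
            return " ".join(result)
        result.append(word)

print(generate())   # e.g. "One fish" or "One fish red fish two fish blue fish"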

Excellent! We learned the necessary information to move on and parse more complex models.

Expanding the vocabulary base

In this part of the article, we will build a model according to the same principle as before, but we will omit some steps in the description. If you have any difficulties, go back to the theory in the first block.

Let's take four more quotes from the same author (also in English, it won't hurt us):

“Today you are you. That is truer than true. There is no one alive who is you-er than you."

“You have brains in your head. You have feet in your shoes. You can steer yourself any direction you choose. You're on your own."

“The more that you read, the more things you will know. The more that you learn, the more places you'll go."

“Think left and think right and think low and think high. Oh, the thinks you can think up if only you try."

The complexity of the corpus has increased, but in our case this is only a plus - now the text generator will be able to produce more meaningful sentences. The fact is that in any language there are words that occur in speech more often than others (for example, we use the preposition "in" much more often than the word "cryogenic"). The more words in our corpus (and hence the dependencies between them), the more information the generator has about which word is most likely to appear in the text after the current one.

The easiest way to explain this is from the point of view of the program. We know that for each link there is a set of words that can follow it, and each word is also characterized by the number of its occurrences in the text. We need to capture all this information in one place; a dictionary that stores "(key, value)" pairs is best suited for this purpose. The dictionary key records the current state of the system, that is, one of the links of the corpus (for example, "the" in the picture below); and another dictionary is stored in the dictionary value. In the nested dictionary, the keys are the words that can appear in the text after the current link ("thinks" and "more" can follow "the"), and the values are the numbers of occurrences of these words in the text after our link (the word "thinks" appears after the word "the" 1 time, the word "more" after the word "the" 4 times):

Re-read the paragraph above several times to make sure you understand it. Note that the nested dictionary in this case is the same histogram; it helps us track the links and the frequency of their occurrence in the text relative to other words. It should be noted that even such a vocabulary base is very small for proper generation of texts in natural language: it should contain more than 20,000 words, preferably more than 100,000, and even better, more than 500,000. But let us work with the vocabulary base that we have obtained.
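A minimal sketch of this nested-dictionary structure using plain dicts (the Dictogram class below does the same job with extra bookkeeping); the corpus here is one of the quotes above with punctuation stripped, an assumption made only for the example.

corpus = ("Today you are you That is truer than true There is no one alive "
          "who is you-er than you").split()

model = {}
for current, nxt in zip(corpus, corpus[1:]):
    inner = model.setdefault(current, {})     # nested dictionary for the current link
    inner[nxt] = inner.get(nxt, 0) + 1        # count how often `nxt` follows `current`

print(model["is"])   # {'truer': 1, 'no': 1, 'you-er': 1}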

The Markov chain in this case is built in the same way as in the first example: each next word is selected only on the basis of knowledge about the current word, and all other words are not taken into account. But thanks to the storage in the dictionary of data about which words appear more often than others, we can make a weighted decision when choosing the next word. Let's look at a specific example:

More:

That is, if the current word is more, it can be followed by the words "things" and "places" with an equal probability of 25% each, and by the word "that" with a probability of 50%. But the probabilities can also all be equal to each other:

think:

Working with windows

So far we have considered only one-word windows. You can increase the size of the window so that the text generator produces more "verified" sentences. This means that the larger the window, the smaller the deviation from the corpus during generation. Increasing the window size corresponds to a transition of the Markov chain to a higher order: previously we built a chain of the first order, for a window of two words we get a chain of the second order, of three words, the third, and so on.

A window is the data in the current state of the system that is used for decision making. If we combine a large window with a small data set, we are likely to get the same sentence every time. Let's take the vocabulary base from our first example and expand the window to size 2:

The extension has resulted in each window now having only one option for the next system state - no matter what we do, we will always get the same sentence, identical to our corpus. Therefore, in order to experiment with windows, and for the text generator to return unique content, stock up on a vocabulary base of 500,000 words or more.

Implementation in Python

Dictogram data structure

Dictogram (dict is the built-in dictionary data type in Python) will store the relationship between the links and their frequency of occurrence in the text, i.e. their distribution. At the same time, it has the dictionary property we need: lookup time does not depend on the amount of data, which means we are creating an efficient algorithm.

import random

class Dictogram(dict):
    def __init__(self, iterable=None):
        # Initialize our distribution as a new dictionary object
        # and add the existing elements
        super(Dictogram, self).__init__()
        self.types = 0   # number of unique keys in the distribution
        self.tokens = 0  # total number of all words in the distribution
        if iterable:
            self.update(iterable)

    def update(self, iterable):
        # Update the distribution with items from an existing iterable data set
        for item in iterable:
            if item in self:
                self[item] += 1
                self.tokens += 1
            else:
                self[item] = 1
                self.types += 1
                self.tokens += 1

    def count(self, item):
        # Return the counter value of the item, or 0 if it is absent
        if item in self:
            return self[item]
        return 0

    def return_random_word(self):
        random_key = random.sample(list(self.keys()), 1)
        # Another way:
        # random.choice(list(self.keys()))
        return random_key[0]

    def return_weighted_random_word(self):
        # Generate a pseudo-random number between 0 and (n - 1),
        # where n is the total number of words
        random_int = random.randint(0, self.tokens - 1)
        index = 0
        list_of_keys = list(self.keys())
        for i in range(0, self.types):
            index += self[list_of_keys[i]]
            if index > random_int:
                return list_of_keys[i]

Any object that can be iterated over can be passed to the constructor of the Dictogram structure. The elements of this object will be the words to initialize the Dictogram, for example, all the words from some book. In this case, we are counting the elements so that in order to access any of them, it would not be necessary to run through the entire data set each time.

We also made two functions to return a random word. One function selects a random key in the dictionary, and the other, taking into account the number of occurrences of each word in the text, returns the word we need.

Markov chain structure

from histograms import Dictogram

def make_markov_model(data):
    markov_model = dict()

    for i in range(0, len(data) - 1):
        if data[i] in markov_model:
            # Just append to the already existing distribution
            markov_model[data[i]].update([data[i + 1]])
        else:
            markov_model[data[i]] = Dictogram([data[i + 1]])
    return markov_model

In the implementation above, we have a dictionary that stores windows as the key in the "(key, value)" pair and distributions as the values ​​in that pair.

Structure of a Markov chain of the Nth order

from histograms import Dictogram

def make_higher_order_markov_model(order, data):
    markov_model = dict()

    for i in range(0, len(data) - order):
        # Create the window
        window = tuple(data[i:i + order])
        # Add to the dictionary
        if window in markov_model:
            # Append to the already existing distribution
            markov_model[window].update([data[i + order]])
        else:
            markov_model[window] = Dictogram([data[i + order]])
    return markov_model

Very similar to a first-order Markov chain, but in this case we store tuple as a key in a "(key, value)" pair in a dictionary. We use it instead of a list, since the tuple is protected from any changes, and this is important for us - after all, the keys in the dictionary should not change.

Model parsing

Great, we have implemented a dictionary. But how now to generate content based on the current state and step to the next state? Let's go through our model:

from histograms import Dictogram
import random
from collections import deque
import re

def generate_random_start(model):
    # To generate any start word, uncomment the line:
    # return random.choice(list(model.keys()))

    # To generate a "correct" start word, use the code below:
    # correct start words are those that began sentences in the corpus
    if 'END' in model:
        seed_word = 'END'
        while seed_word == 'END':
            seed_word = model['END'].return_weighted_random_word()
        return seed_word
    return random.choice(list(model.keys()))

def generate_random_sentence(length, markov_model):
    current_word = generate_random_start(markov_model)
    sentence = [current_word]
    for i in range(0, length):
        current_dictogram = markov_model[current_word]
        random_weighted_word = current_dictogram.return_weighted_random_word()
        current_word = random_weighted_word
        sentence.append(current_word)
    sentence[0] = sentence[0].capitalize()
    return ' '.join(sentence) + '.'

What's next?

Try to think of where you yourself can use a text generator based on Markov chains. Just do not forget that the most important thing is how you parse the model and what special restrictions you set on generation. The author of this article, for example, used a large window when creating the tweet generator, limited the generated content to 140 characters, and used only “correct” words to start sentences, that is, those that were the beginning of sentences in the corpus.

Today I would like to tell you about writing a class to simplify the work with Markov chains.

Details are under the cut.

Basic knowledge:

Representation of graphs in the form of an adjacency matrix, knowledge of the basic concepts of graphs. Knowledge of C++ for the practical part.

Theory

A Markov chain is a sequence of random events with a finite or countable number of outcomes, characterized by the property that, loosely speaking, with a fixed present, the future is independent of the past. Named after A. A. Markov (senior).

In simple terms, a Markov chain is a weighted graph. Events are located at its vertices, and the weight of the edge connecting vertices A and B is the probability that event A will be followed by event B.

Quite a few articles have been written about the use of Markov chains, but we will continue to develop the class.

Here is an example of a Markov chain:

In the following, we will consider this scheme as an example.

Obviously, if there is only one outgoing edge from vertex A, then its weight will be equal to 1.

Notation
At the vertices we have events (A, B, C, D, E, ...). On the edges we have the probability that the i-th event will be followed by event j. For convention and convenience, I numbered the vertices (No. 1, No. 2, etc.).

The matrix A is the adjacency matrix of a directed weighted graph that represents the Markov chain (more on this later). In this particular case this matrix is also called the transition probability matrix, or simply the transition matrix.

Matrix representation of a Markov chain
We will represent Markov chains using a matrix, namely the adjacency matrix with which we represent graphs.

Let me remind you:

The adjacency matrix of a graph G with a finite number of vertices n (numbered from 1 to n) is a square matrix A of size n in which the value of the element aij is equal to the number of edges from the i-th vertex of the graph to the j-th vertex.

More about adjacency matrices - in the course of discrete mathematics.

In our case, the matrix will have a size of 10x10, let's write it:

0 50 0 0 0 0 50 0 0 0
0 0 80 20 0 0 0 0 0 0
0 0 0 0 100 0 0 0 0 0
0 0 0 0 100 0 0 0 0 0
0 0 0 0 0 100 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 70 30 0
0 0 0 0 0 0 0 0 0 100
0 0 0 0 0 0 0 0 0 100
0 0 0 0 0 100 0 0 0 0

Idea
Take a closer look at our matrix. In each row we have non-zero values in those columns whose number matches the number of the next event, and the non-zero value itself is the probability that this event will occur.

Thus, we have the probability of the occurrence of the event whose number equals the column number after the event whose number equals the row number.

Those who know probability theory will have guessed that each row is a probability distribution.

Markov chain traversal algorithm

1) Initialize the initial position k with the zero vertex.
2) If the vertex is not final, generate a number m from 0...n-1 based on the probability distribution in row k of the matrix, where n is the number of vertices and m is the number of the next event. Otherwise, we stop.
3) Set the number of the current position k to the number of the generated vertex.
4) Go to step 2.

Note: the vertex is final if the probability distribution is zero (see the 6th row in the matrix).

Pretty neat algorithm, right?

Implementation

In this article I want to present separately the implementation code of the traversal described above. The initialization and filling of the Markov chain are of no particular interest (see the complete code at the end).

Implementation of the traversal algorithm

template <typename Element>
Element *Markov<Element>::Next(int StartElement = -1)
{
    if (Markov<Element>::Initiated)                  // if the adjacency matrix has been created
    {
        if (StartElement == -1)                      // if the start element is the default one,
            StartElement = Markov<Element>::Current; // then continue (Current = 0 in the constructor)

        std::random_device rd;
        std::mt19937 gen(rd());
        // initialize the container that generates a number based on the probability distribution
        std::discrete_distribution<> dicr_distr(
            Markov<Element>::AdjacencyMatrix.at(Current).begin(),
            Markov<Element>::AdjacencyMatrix.at(Current).end());

        int next = dicr_distr(gen);                  // generate the next vertex

        if (next == Markov<Element>::size())         // a subtlety of the generator: if the probability
            return NULL;                             // distribution is all zeros, it returns the number of elements

        Markov<Element>::Current = next;             // change the current vertex
        return &(Markov<Element>::elems.at(next));   // return the value at the vertex
    }
    return NULL;
}

This algorithm looks especially simple due to the features of the container discrete_distribution. It is rather difficult to describe in words how this container works, so let's take the 0th row of our matrix as an example:

0 50 0 0 0 0 50 0 0 0

The generator will return either 1 or 6, each with probability 0.5. That is, it returns the number of the column (which is equivalent to the number of the vertex in the chain) to which the walk continues.

An example program that uses the class:

Implementation of a program that traverses the Markov chain from the example

#include <iostream>
#include <fstream>
#include <string>
#include "Markov.h"

using namespace std;

int main()
{
    Markov<string> chain;
    ofstream outs;
    outs.open("out.txt");
    ifstream ins;
    ins.open("matrix.txt");

    int num;
    double Prob = 0;
    (ins >> num).get(); // number of vertices
    string str;
    for (int i = 0; i < num; i++)
    {
        getline(ins, str);
        chain.AddElement(str); // add a vertex
    }
    if (chain.InitAdjacency()) // initialize the matrix with zeros
    {
        for (int i = 0; i < chain.size(); i++)
        {
            for (int j = 0; j < chain.size(); j++)
            {
                (ins >> Prob).get();
                if (!chain.PushAdjacency(i, j, Prob)) // fill the matrix
                {
                    cerr << "Adjacency matrix write error" << endl;
                }
            }
        }
        outs << chain.At(0) << " "; // print the 0th vertex
        for (int i = 0; i < 20 * chain.size() - 1; i++) // generate 20 chains
        {
            string *str = chain.Next();
            if (str != NULL)                   // if the previous vertex is not final,
                outs << (*str).c_str() << " "; // print the value of the vertex
            else
            {
                outs << std::endl; // if it is final, start again from the beginning
                chain.Current = 0;
                outs << chain.At(0) << " ";
            }
        }
        chain.UninitAdjacency(); // clean up
    }
    else
        cerr << "Can not initialize Adjacency matrix" << endl;

    ins.close();
    outs.close();
    cin.get();
    return 0;
}


An example of the output that the program generates: