DRAFT

Building a Markov Model

Often patterns may be observed that give rise to insight into data. Unfortunately, in many common situations, the volume and complexity of the data being observed makes it impossible for a human to perceive and interpret such patterns. The goal of this exercise is to implement a program that will enable a computer to quantify structure in data so that structure may be used to characterize the generator, compare the generator with others, or predict other elements of the context from which the data were sampled.

As an example, consider the set W = {steam, teams, meets, teems, eat, ate, state, tease, test, mast, mates} to be a random sample of words taken from frequently used vocabulary in some context. The key elements of this example are:

1.  Letters, which are primitive names for sounds, and which may be viewed as representative of arbitrary tokens;

2.  Words that express or suggest relationship between tokens (letters), such as “e” often, but not always, appears near “a”; and

3.  Transitions from one token to another within a word.

In order to quantify the structure inherent within W, we first observe the set of starting letters S = {a, e, m, s, t}. There are 11 words in W. Thus, if all that was known regarding the language from which this sample was drawn was the sample W, one might hypothesize the following probabilities:

1.  P(“a” is the first letter of a new word) = 1/11

2.  P(“e” is the first letter of a new word) = 1/11

3.  P(“m” is the first letter of a new word) = 3/11

4.  P(“s” is the first letter of a new word) = 2/11

5.  P(“t” is the first letter of a new word) = 4/11

Similarly one may tally the token transitions that occur within the words and determine the probabilities of transition from one token to another.

Transition Probabilities

Starting Letter Probabilities

Sample output of a python program is given in this section. The numeric output values have been truncated to no more than 4 digits after the decimal. The first output arose from code that calculated the first letter probabilities, which are displayed in the following dictionary:

{' e': 0.0909, ' t': 0.3636, ' s': 0.1818, ' m': 0.2727, ' a': 0.0909}

Note the leading blank spaces in each key.

Letter-to-letter and Letter-to-end Transition Probabilities

The second block of output is a list of dictionaries, which provide the transition probabilities for each letter in W. The dictionaries are presented in alphabetic order of the first symbol of each key, though the content of any given dictionary may not be in alphabetic order.

{'at': 0.5, 'am': 0.25, 'as': 0.25}

{'em': 0.0769, 'e ': 0.2307, 'ea': 0.3077, 'es': 0.1538, 'et': 0.0769, 'ee': 0.1538}

{'m ': 0.1666, 'ma': 0.3333, 'ms': 0.3333, 'me': 0.1666}

{'st': 0.4444, 'se': 0.1111, 's ': 0.4444}

{'ts': 0.0769, 'ta': 0.0769, 't ': 0.2307, 'te': 0.6154}

In the displays, the start-of-word and end-of-word marks are blank spaces.

Generating Words

Shown below, the next block of program output reveals results produced by running a “word” generator five times. For each trial, the generator function produces the probability that the word might occur under the assumption that the sample words are representative of the words in the language. Note that actual words may be produced, such as “mate” during this run, but many nonsense words may also be generated.

Also note that an asterisk, “*”, is pre-pended to each word and a hash mark, “#”, is appended to each word. These were added to the display in order to highlight the start and particularly, the finish of each word, which includes a blank space in the implementation being used.

Probability= 0.006455083378160301
*mate #

Probability= 0.0034002085284136566
*stes #

Probability= 1.5176308961160885e-08
*ateateastesteates #

Probability= 0.0808080808080808
*s #

Probability= 0.011475703783396091
*ste #

State Diagram

Thanks to Mr. Mark Mason, here is a diagram showing the states {Start, End, A, E, M, S, T} and the transition probabilities arising from the samples provided in W.

Assignment (Demonstration due by the end of the last lab meeting of the semester)

Create and apply a python program that will:

1.  With a text file (20 points):

  1. Read the contents;
  2. Assemble the contents into a single string with punctuation deleted and the letters in each of the words mapped to a single case;
  3. Provide a leading blank for the first word, a trailing blank for the last word, and a separating blank between consecutive words;
  4. Close the file

2.  Determine the transition probabilities (30 points) for:

  1. First letters (these are blank-to-letter transitions);
  2. Letter-to-letter and letter-to-blank pairs

3.  Present the probabilities in a formatted and organized order (30 points);

4.  Use the transition probabilities to generate 10-1000 hypothetical words in the language represented by the sample and for each of the words (20 points):

  1. Display the word;
  2. Calculate and display the probability of occurrence of the word.

5.  Compare writing samples of three authors with an unknown author and determine which known author’s work is most nearly similar to that of the unknown author (30 points).

Applications

1.  Online shopping suggestions

2.  Authorship comparisons

3.  Word completion applications

4.  Natural language processing