Chapter 5 Rule E2-2 and the Horspool Algorithm

In Chapter 4, we introduced Rule E2, which is a substring matching rule: given a substring U in the window W of the text string T, we try to find whether there exists a substring identical to U lying to the left of U in the pattern string P. In Chapter 4, we also introduced a variant of Rule E2, namely Rule E2-1, in which the substring U is the longest suffix of W which is equal to a prefix of P. In this chapter, we shall introduce another variant of Rule E2.

Section 5.1 Rule E2-2: The 1-Suffix Rule

Consider Fig. 5.1-1. Note that the last character of W is x. If we have to move the pattern P, we must find the rightmost x in P, excluding the last position of P, if such an x exists, and align it with the x in W, as shown in Fig. 5.1-1(b). If no such x exists in P, we move P entirely past the x, as shown in Fig. 5.1-1(c).

Fig. 5.1-1 The Basic Idea of Rule E2-2

The following is a formal statement of Rule E2-2.

Rule E2-2: We are given a text string T, a pattern string P of length m, and a window W of T which is aligned with P. Assume that the last character of W is x and we have to shift P. If x exists in P_1 P_2 ... P_{m-1}, let j be the location of the rightmost x in P_1 P_2 ... P_{m-1}; shift P to such an extent that P_j is aligned with the x in W. If no x exists in P_1 P_2 ... P_{m-1}, shift P to such an extent that P_1 is aligned with the character of T immediately to the right of the x in W.
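A direct way to read Rule E2-2 is as a function from the head character x to a shift distance. The following Python sketch is a minimal illustration of this reading; the function name rule_e2_2_shift and the 0-based indexing are our own choices, not part of the rule's statement.

```python
def rule_e2_2_shift(pattern, x):
    """Number of steps Rule E2-2 shifts P when the last character of the
    window is x: the distance from the rightmost x in P_1 ... P_{m-1} to
    the end of P, or m if x does not occur there."""
    m = len(pattern)
    for j in range(m - 2, -1, -1):        # scan P_1 ... P_{m-1} from the right
        if pattern[j] == x:
            return (m - 1) - j            # align that x with the x in the window
    return m                              # x absent: move P past the x entirely

print(rule_e2_2_shift("aggttgaat", "t"))  # prints 4
print(rule_e2_2_shift("aggttgaat", "c"))  # prints 9
```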

Section 5.2 The Horspool Algorithm

The Horspool Algorithm scans the window from the right, as shown in Fig. 5.2-1. We compare the characters of the text and the pattern, pair by pair, until we find a mismatch. After we find a mismatch, we know that we should shift the pattern to the right. The shifting is based upon Rule E2-2.

Fig. 5.2-1 The right to left scanning in the Horspool Algorithm

To implement Rule E2-2, we must have a mechanism which finds, for a given character x, the location of the rightmost x in P_1 P_2 ... P_{m-1}. This is not done at run time. Instead, we do a pre-processing.

Definition 5.2-1 The Location Table for the Horspool Algorithm

Given an alphabet set A with σ characters and a pattern P with length m, we create a table, denoted as the location table of P, containing σ entries, one for each character x of A. The entry of x stores the location of the rightmost x in P_1 P_2 ... P_{m-1}, counted from location m; that is, it stores m minus the position of the rightmost x in P_1 P_2 ... P_{m-1}, if such an x exists. If x does not exist in P_1 P_2 ... P_{m-1}, we store m in the entry.

Example 5.2-1

Let P = aggttgaat. The location table of P is displayed as follows:

Table 5.2-1 The location table of P = aggttgaat

Character: a / c / g / t
Entry: 1 / 9 / 3 / 4
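To make the pre-processing concrete, the following Python sketch builds the location table as a dictionary keyed by character; the function name build_location_table and the dictionary representation are illustrative choices rather than part of Definition 5.2-1.

```python
def build_location_table(pattern, alphabet):
    """Location table of Definition 5.2-1: for each character x, store m minus
    the position of the rightmost x in P_1 ... P_{m-1}; if x does not occur
    there, store m (the length of the pattern)."""
    m = len(pattern)
    table = {x: m for x in alphabet}      # default: x absent, entry is m
    for j in range(m - 1):                # positions 1 .. m-1 (0-based: 0 .. m-2)
        table[pattern[j]] = (m - 1) - j   # later occurrences overwrite earlier ones
    return table

# Reproduces Table 5.2-1 for P = aggttgaat:
print(build_location_table("aggttgaat", "acgt"))   # {'a': 1, 'c': 9, 'g': 3, 't': 4}
```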

Each time we have to move the pattern P, we consult the location table of P.

For instance, consider the case shown in Fig. 5.2-2.

T = a c c g a g g t t g a a t t g c
P = a g g t t g a a t

Fig. 5.2-2 An example for the Horspool Algorithm

As can be seen, we have to move P. The last character of the window is t. There are two t's in P_1 P_2 ... P_8; both of them can be seen in Fig. 5.2-2. We consult Table 5.2-1. The entry of t is 4. We therefore move P 4 steps to the right, as shown in Fig. 5.2-3. A match is now found.

T = a c c g a g g t t g a a t t g c
P =         a g g t t g a a t

Fig. 5.2-3 The moving of P in the Horspool Algorithm

The Horspool Algorithm is very similar to the Reverse Factor Algorithm. It is now given as Algorithm 5.1 below:

Algorithm 5.1 The Horspool Algorithm Based upon Rule E2-2

Input: A text string T and a pattern string P, with lengths n and m respectively.

Output: All occurrences of P in T.

Construct the location table of P.

Set i = 1.

Step 1: Let W = T_i T_{i+1} ... T_{i+m-1} be a window.

Align P_1 with T_i.

Set j = m and k = i + m - 1.

If i + m - 1 > n, exit.

While j ≥ 1 and T_k = P_j
    Set j = j - 1 and k = k - 1.
End of While

If j = 0, report that T_i T_{i+1} ... T_{i+m-1} is an exact match of P at location i.

Find the entry of T_{i+m-1} in the location table of P. Let it be denoted as d.

Set i = i + d.

Go to Step 1.
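The following Python sketch is one way to implement Algorithm 5.1; the function name horspool_search, the 0-based indexing, and the dictionary used for the location table are implementation choices, not part of the pseudocode above.

```python
def horspool_search(text, pattern, alphabet):
    """Report all starting locations (1-based) at which pattern occurs in text,
    shifting the window according to Rule E2-2."""
    n, m = len(text), len(pattern)
    # Pre-processing: the location table of Definition 5.2-1.
    table = {x: m for x in alphabet}
    for j in range(m - 1):
        table[pattern[j]] = (m - 1) - j
    occurrences = []
    i = 0                                             # window is text[i .. i+m-1]
    while i + m <= n:                                 # Step 1: window still inside text
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:   # right-to-left scan of the window
            j -= 1
        if j < 0:                                     # all m characters matched
            occurrences.append(i + 1)                 # report the 1-based location
        i += table[text[i + m - 1]]                   # shift by the entry of the head
    return occurrences

# The example of Fig. 5.2-2 and Fig. 5.2-3:
print(horspool_search("accgaggttgaattgc", "aggttgaat", "acgt"))   # prints [5]
```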

Example 5.2-2

Let T = aagttattgac and P = att. The location table of P is as shown in Table 5.2-2.

Table 5.2-2 The location table of P = att

Character: a / c / g / t
Entry: 2 / 3 / 3 / 1

The Horspool Algorithm is initialized as shown in Fig. 5.2-4.

T = a a g t t a t t g a c
P = a t t

Fig. 5.2-4 The initial alignment for Example 5.2-2

The last character of the window is g, which is not found in P. We move P 3 steps, as shown in Fig. 5.2-5.

T = a a g t t a t t g a c
P =       a t t

Fig. 5.2-5 The first movement of P in Example 5.2-2

The last character of the window is a, which is found in P. From Table 5.2-2, we know that we should move P 2 steps, as shown in Fig. 5.2-6.

T = a a g t t a t t g a c
P =           a t t

Fig. 5.2-6 The second movement of P in Example 5.2-2

A match is found. Since the last character of the window is t and its entry in Table 5.2-2 is 1, P is moved 1 step, as shown in Fig. 5.2-7.

T = a a g t t a t t g a c
P =             a t t

Fig. 5.2-7 The third movement of P in Example 5.2-2

The last character of the window, namely g, does not exist in P. P is moved 3 steps, as shown in Fig. 5.2-8. The window now extends beyond the end of T, so the algorithm terminates.

T = a a g t t a t t g a c
P =                   a t t

Fig. 5.2-8 The fourth movement of P in Example 5.2-2
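Under the assumption that the sketch of Algorithm 5.1 given earlier is modified to print its intermediate state, the movements of Figs. 5.2-4 to 5.2-8 can be traced as follows; the function name trace_horspool is ours.

```python
def trace_horspool(text, pattern, alphabet):
    """Print, for every window, its starting location, its head character,
    whether it is an exact match, and the shift taken."""
    n, m = len(text), len(pattern)
    table = {x: m for x in alphabet}
    for j in range(m - 1):
        table[pattern[j]] = (m - 1) - j
    i = 0
    while i + m <= n:
        head = text[i + m - 1]
        matched = text[i:i + m] == pattern
        print(f"window at {i + 1}: head '{head}', "
              f"{'match' if matched else 'no match'}, shift {table[head]}")
        i += table[head]

trace_horspool("aagttattgac", "att", "acgt")
# window at 1: head 'g', no match, shift 3
# window at 4: head 'a', no match, shift 2
# window at 6: head 't', match, shift 1
# window at 7: head 'g', no match, shift 3
```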

Section 5.3 The Time-Complexity Analysis of the Horspool Algorithm

The worst case time-complexity of the Horspool Algorithm is easy to obtain. Let σ denote the number of characters in the alphabet. Then

Preprocessing phase: O(m + σ) time and O(σ) space complexity.

Searching phase: O(mn) time complexity in the worst case.

Before we analyze the average number of comparisons of this algorithm, we must state our probabilistic assumptions. We assume that the distribution of the characters occurring in T or in P is uniform. That is, the random variable X of the characters, ranging over the σ-character alphabet A, satisfies P(X = x) = 1/σ for any x in A.

We shall assume that the given pattern string P is fixed and that the text string T is random. We first define a term called a "head".

Definition 5.3-1 The last character of a window is called a head.

To obtain an average case analysis of the Horspool Algorithm, we must know the probability that a location of the text is a head, denoted as p_h. It is intuitive and correct that p_h is the same for all locations because, so far as being a head is concerned, there is no reason for one location to be different from another. To find p_h, we denote the average number of steps of a shift by E(s). With this term, we may easily have the following equation:

p_h = 1 / E(s).    (5.3-1)

Let us imagine that E(s) = 1. Then obviously every location will be a head. Suppose that E(s) = 2. It is expected that half of the locations in T will be heads. If the average number of steps of a shift is large, then only a small fraction of the locations in T will be heads.

Let d(x) denote the value stored in the entry of character x in the location table. Then we have

E(s) = Σ_{x in A} P(X = x) d(x) = (1/σ) Σ_{x in A} d(x).    (5.3-2)

For example, for the Location Table shown in Table 5.2-2,

E(s) = (1/4)(2 + 3 + 3 + 1) = 2.25.
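The value of E(s) can be computed directly from the location table; the helper name average_shift in the snippet below is illustrative.

```python
def average_shift(table):
    """E(s) of Equation (5.3-2): the mean of the location-table entries, assuming
    every character of the alphabet is equally likely to be the head."""
    return sum(table.values()) / len(table)

print(average_shift({'a': 2, 'c': 3, 'g': 3, 't': 1}))   # 2.25, as for Table 5.2-2
```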

To obtain the average case time-complexity of the Horspool Algorithm, we must have the average number of character comparisons for a window. Let E(c) denote the average number of character comparisons for a window of size m. Then we can reason as follows:

(1) The first character comparison is a mismatch. In this case there is exactly 1 comparison, and this case contributes 1 · (σ-1)/σ to the expected number of character comparisons, as (σ-1)/σ is the probability that a comparison yields a mismatch.

(2) The first comparison is a match and the second comparison is a mismatch. In this case there are exactly 2 comparisons, and this case contributes 2 · (1/σ) · (σ-1)/σ to the expected number of character comparisons. Note that 1/σ is the probability that a comparison yields a match.

In general, for k < m, the first k-1 comparisons all yield matches and the k-th comparison is a mismatch; this case occurs with probability (1/σ)^{k-1} (σ-1)/σ and involves k comparisons. Finally, the first m-1 comparisons may all yield matches; then there are m comparisons in total, and this happens with probability (1/σ)^{m-1}.

Based upon the above reasoning, we have:

E(c) = Σ_{k=1}^{m-1} k ((σ-1)/σ) (1/σ)^{k-1} + m (1/σ)^{m-1} = (1 - (1/σ)^m) / (1 - 1/σ) ≈ σ/(σ-1)    (5.3-3)

when m is reasonably large.
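The sketch below evaluates the exact sum of Equation (5.3-3) and compares it with the approximation σ/(σ-1); the function name expected_comparisons is ours.

```python
def expected_comparisons(m, sigma):
    """Exact value of Equation (5.3-3): k comparisons occur with probability
    (1/sigma)^(k-1) * (sigma-1)/sigma for k < m, and m comparisons occur with
    probability (1/sigma)^(m-1)."""
    p = 1.0 / sigma
    return sum(k * (1 - p) * p ** (k - 1) for k in range(1, m)) + m * p ** (m - 1)

print(expected_comparisons(3, 4))   # 1.3125
print(expected_comparisons(9, 4))   # 1.3333..., already close to sigma/(sigma-1) = 4/3
```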

Let us denote the expected number of character comparisons for a text string with length n and a pattern string with length m by E(C). Then,

E(C) = n p_h E(c) = n E(c) / E(s).    (5.3-4)

The expected number of character comparisons per character of the text is therefore:

E(C)/n = E(c)/E(s) ≈ σ / ((σ-1) E(s)).    (5.3-5)

For the case of the location table shown in Table 5.2-2, we have:

E(C)/n ≈ (4/3) / (9/4) = 16/27 ≈ 0.5925.    (5.3-6)

The above result is obtained under the assumption that the pattern string is given and fixed, as we stated at the very beginning. We must understand that this is not a very good average case analysis because it fails to give an analysis based upon the assumption that the pattern is random. In the following, we show some experiments measuring the average number of character comparisons per character of the text. For each of the following three pattern strings, we randomly generated 500 text strings of length 1000. The average number of character comparisons per character is shown below. It shows that the theoretical result is quite close to the experimental results.

P / Theoretical result / Experimental result
att / 0.5925 / 0.6031
cgtac / 0.5333 / 0.5592
aggttgaat / 0.3137 / 0.3302
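The experiment described above can be reproduced approximately by the following simulation sketch. The number of texts (500) and the text length (1000) follow the description in the text; the random seed is left unset, the alphabet is assumed to be {a, c, g, t}, and the function names are our own, so individual runs will differ slightly from the table.

```python
import random

def comparisons_per_character(text, pattern, alphabet):
    """Character comparisons made by the Horspool scan of text, divided by len(text)."""
    n, m = len(text), len(pattern)
    table = {x: m for x in alphabet}
    for j in range(m - 1):
        table[pattern[j]] = (m - 1) - j
    comparisons, i = 0, 0
    while i + m <= n:
        j = m - 1
        while j >= 0:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
            j -= 1
        i += table[text[i + m - 1]]
    return comparisons / n

def estimate(pattern, alphabet="acgt", trials=500, length=1000):
    """Average the per-character comparison count over random texts."""
    total = 0.0
    for _ in range(trials):
        text = "".join(random.choice(alphabet) for _ in range(length))
        total += comparisons_per_character(text, pattern, alphabet)
    return total / trials

for p in ("att", "cgtac", "aggttgaat"):
    print(p, round(estimate(p), 4))
```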

The above discussion is what we shall call the first approximation of the average case analysis of the Horspool Algorithm. In this discussion, we ignored one fact: there may be another head to the left of the head of the window. Consider Fig. 5.3-1. The case shown in Fig. 5.3-1 is a special one in which the distance between the two heads is equal to 3: Head 1 is the last character of the window and Head 2 is the third character of the window counted from the right. Note that at Head 2 there is an exact match between the corresponding characters of T and P, because the shift which produced the present window aligned the character of T at Head 2 with an identical character of P. Consequently, the number of comparisons cannot be exactly 3, because as soon as the comparison at Head 2 is done it is guaranteed to succeed, and the fourth comparison will automatically be performed. We may of course ask: under what condition will the characters corresponding to Head 2 be compared? They will be compared if the first two comparisons both yield exact matches. In other words, as soon as the first two comparisons yield exact matches, we will have at least 4 comparisons.

Fig. 5.3-1 The case where there are two heads in the window.

The expected number of character comparisons for such a window, which we denote by E'(c), is:

E'(c) = 1 · ((σ-1)/σ) + 2 · (1/σ) ((σ-1)/σ) + Σ_{k=4}^{m-1} k ((σ-1)/σ) (1/σ)^{k-2} + m (1/σ)^{m-2}.    (5.3-7)

We may now ask another question: what is the expected number of character comparisons if there is no Head 2? It is equal to

E(c) = Σ_{k=1}^{m-1} k ((σ-1)/σ) (1/σ)^{k-1} + m (1/σ)^{m-1},    (5.3-8)

which is the same as Equation (5.3-3).

We may now rewrite Equation (5.3-7) as follows:

E'(c) = E(c) + (1/σ)^2 - (1/σ)^{m-1}.    (5.3-9)

Since (1/σ)^2 - (1/σ)^{m-1} > 0 for m > 3, we may mathematically conclude that the expected number of character comparisons is increased if there is more than one head in the window.
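A small numerical check of Equation (5.3-9) is given below. It models the right-to-left scan directly, treating the comparison at Head 2 (the third comparison from the right, as discussed above) as guaranteed to match; the function name expected_scan_cost and the free_position parameter are our own devices.

```python
def expected_scan_cost(m, sigma, free_position=None):
    """Expected number of right-to-left comparisons in a window of size m, where each
    comparison matches with probability 1/sigma, except that the comparison at
    free_position (counted from the right, 1-based) always matches."""
    p = 1.0 / sigma
    expected, prob_reach = 0.0, 1.0       # prob_reach = probability this comparison happens
    for k in range(1, m + 1):
        if k == m:                        # the scan never makes more than m comparisons
            expected += k * prob_reach
        elif k == free_position:          # guaranteed match: the scan cannot stop here
            pass
        else:
            expected += k * prob_reach * (1 - p)   # stop here on a mismatch
            prob_reach *= p
    return expected

m, sigma = 9, 4
left = expected_scan_cost(m, sigma, free_position=3)                   # E'(c)
right = expected_scan_cost(m, sigma) + (1/sigma)**2 - (1/sigma)**(m-1)
print(left, right)   # both values are 1.39581298828125, agreeing with (5.3-9)
```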

In the general case, where Head 2 is at distance j from the right end of the window, it can be derived in the same way that the expected number of character comparisons is

E'(c) ≈ σ/(σ-1) + (1/σ)^{j-1}    (5.3-10)

if m is reasonably large.

The above discussion only gives the reader some feeling for how to handle the case where there are two heads in a window. The above discussion is simple enough to understand, but it would be quite complicated mathematically if we wanted to treat the general case. Since the experimental results show that the first-approximation theoretical result is close enough, the general case, in which several heads may exist in a window, will not be discussed in this book.

Section 5.4 Some Variations of the Horspool Algorithm

In this section, we shall introduce four algorithms which are variations of the Horspool Algorithm. They are easy to understand and we shall only give a brief sketch of them.

1. The Raita Algorithm.

The Raita Algorithm is different from the Horspool Algorithm in only one aspect: the order in which characters are compared. In the Horspool Algorithm, the comparison starts from the right end of the window. The Raita Algorithm instead uses a specified order of character comparison. For instance, it may first compare P_m with the last character of W and then P_1 with the first character of W, before comparing the remaining characters.
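A minimal sketch of the Raita idea is given below. We assume, for illustration only, the order: last character, then first character, then the remaining characters from right to left; the shift rule stays Rule E2-2, and only the window test changes.

```python
def raita_style_match(window, pattern):
    """Return True if window equals pattern, comparing the last character first,
    then the first character, then the remaining characters from right to left."""
    m = len(pattern)
    if window[m - 1] != pattern[m - 1]:   # last character first
        return False
    if window[0] != pattern[0]:           # then the first character
        return False
    for j in range(m - 2, 0, -1):         # then the rest, from right to left
        if window[j] != pattern[j]:
            return False
    return True

print(raita_style_match("aggttgaat", "aggttgaat"))   # True
print(raita_style_match("aggttgaac", "aggttgaat"))   # False after one comparison
```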

2. The Nebel Algorithm

The Nebel Algorithm is also different from the Horspool Algorithm in only one aspect: the order in which characters are compared. In the Horspool Algorithm, the comparison starts from the right end of W. The Nebel Algorithm also has a specified order of character comparison. Let the alphabet set of P be {x_1, x_2, ..., x_k}. Without loss of generality, we may assume that the number of occurrences of x_i in P is the i-th smallest. First, it compares the positions of x_1 in P with the corresponding positions of W, and then the positions of x_2 with the corresponding positions of W. Finally, it compares the positions of x_k with the corresponding positions of W.
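The following sketch illustrates one reading of the Nebel comparison order: the positions of the pattern are examined in increasing order of how often their character occurs in P, so that the positions of the rarest character x_1 are compared first. The function names and the tie-breaking by position are our own assumptions.

```python
from collections import Counter

def nebel_comparison_order(pattern):
    """Pattern positions (0-based) sorted so that positions holding the rarest
    characters of the pattern are compared first."""
    counts = Counter(pattern)
    return sorted(range(len(pattern)), key=lambda j: (counts[pattern[j]], j))

def nebel_style_match(window, pattern, order):
    """Return True if window equals pattern, comparing positions in the given order."""
    return all(window[j] == pattern[j] for j in order)

order = nebel_comparison_order("cgtac")
print(order)                                        # [1, 2, 3, 0, 4]: g, t, a before the two c's
print(nebel_style_match("cgtac", "cgtac", order))   # True
```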