Chapter 7 The Boyer and Moore Algorithm Based upon Rule E2 and Rule E3

In the Horspool Algorithm, we scan the strings from right to left. We should note that in the algorithm, we do find the longest suffix of which is equal to a suffix of . But, we do not utilize this particular suffix. In this chapter, we shall introduce the Boyer and Moore Algorithm which utilizes this suffix. The algorithm is based upon Rule E2 and Rule E3 of exact string matching. We introduced Rule E2 in Section 4.1 before. Rule E3 will be introduced in this chapter. To remind the reader about Rule E2, let us consider Fig. 7-1. For any substring in , if there is a substring in which is the first one to the left of in and is equal to , as illustrated in Fig. 7-1, we can shift in such a way that the two ’s are aligned, as shown in Fig. 7-2. If there is no such a in which is to the left of in , we can now shift a much larger number of steps, as shown in Fig. 7-3.

Fig. 7-1 The substring matching: central idea of Rule E2

Fig. 7-2 The shifting of according to Rule E2

Fig. 7-3 The longer shifting of according to Rule E2

This Rule E2 is a general case in which may be any substring. In the Boyer and Moore Algorithm, is a particular one and therefore, the way of shifting is quite complicated and effective.

Section 7.1 The Basic Idea of the Boyer and Moore Algorithm

The Boyer and Moore Algorithm is, as nearly all other string matching algorithms, a window sliding algorithm. Unlike the MP Algorithm which scans from the left, the Boyer and Moore Algorithm scans from the right, as shown in Fig. 7.1-1.

Fig. 7.1-1 The basic idea of the Boyer and Moore Algorithm

As in the MP Algorithm, the scanning will stop as soon as a mismatching occurs. Thus, we may assume that we have found the longest suffix of which is equal to a suffix of as shown in Fig. 7.1-2. There is also a character in which is not equal to its corresponding character in .

Fig. 7.1-2 The longest suffix of which is equal to

a suffix of in the Boyer and Moore Algorithm

The Boyer and Moore Algorithm utilizes the suffix which is called the good suffix and the character in , which is called the bad character. As can be seen, we can apply Rule E2 to both good suffix and bad character .

Section 7.2 The Good Suffix Rule 1 of the Boyer and Moore Algorithm: An Application of Rule E2

In this section, we shall introduce the Good Suffix Rule 1 of the Boyer and Moore Algorithm. As shown in the previous section, the Boyer and Moore Algorithm scans from right to left and stops at its first mismatch.. Thus we may say that the algorithm finds the longest suffix of which is equal to a suffix of . Since is a substring, we can now apply Rule E2 to it. Consider suffix in first. Let , and be the largest such that , and as shown in Fig. 7.2-1. We then claim that the Good Suffix Rule 1 works and may utilize this situation by shifting the pattern to the right as shown in Fig. 7.2-2.

Fig. 7.2-1 The existence of in

Fig. 7.2-2 The shifting of based upon the existence of in

But, the Boyer and Moore Algorithm has another rule, called the Bad Character Rule which will be explained later. If the Good Suffix Rule E2 works, we then compare the result of using this rule against the result of using the Bad Character Rule. We would always use the result which gives larger number of steps of shifting.

If there is no such an such that , and , we claim that the Good Suffix Rule 1 fails and we shall apply the Good Suffix Rule 2 and will shift even farther. We shall get to Bad Character Rule and Good Suffix Rule 2 later. One possible shifting is now shown in Fig. 7.2-3.

Fig. 7.2-3 Another possibility of shifting related to

the suffix in

We now give the Good Suffix Rule 1 as follows:

Definition 7.2-1 The Good Suffix Rule 1 of the Boyer and Moore Algorithm

Good Suffix Rule 1 of the Boyer and Moore Algorithm: Let , andbe the largest such that , and . If such an exists, we claim that Good Suffix Rule 1 works and we then compare the result of using this Good Suffix Rule 1 against the result of using the Bad Character Rule. If the Bad Character Rule does not give a larger number of steps to shift, we shift so that is aligned exactly with . If the Bad Character Rule gives a larger number of steps to shift, use the Bad Character Rule. If no such an exists, we claim that Good Suffix Rule 1 fails and we then apply Good Suffix Rule 2.

Example 7.2-1 An Example of the Good Suffix Rule 1 of the Boyer and Moore Algorithm

Consider the strings in the following table:

Table 7.2-1 The first case to illustrate Good Suffix Rule 1

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13
W / = / a / c / c / t / g / t / a / c / g / g / t / c / a
P / = / c / c / t / c / a / t / g / a / a / t / t / c / a

As we can see, in this case, because and We find the largest in this case to be as and . Thus we claim that the Good Suffix Rule 1 works.

Example 7.2-2 Another Example of the Good Suffix Rule 1

Consider the strings in Table 7.2-2.

Table 7.2-2 The second case illustrating the Good Suffix Rule 1

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13 / 14 / 15
W / = / a / c / c / t / g / t / g / a / c / t / t / a / t / g / c
P / = / c / g / t / t / a / a / c / t / t / c / a / c / t / g / c

In this case, . There is simply no other substring in which is equal to . Therefore, the Good Suffix Rule 1 fails in this case.

Example 7.2-3 Still Another Example of the Good Suffix Rule 1

Consider the strings in Table 7.2-3.

Table 7.2-3 The third case illustrating the Good Suffix Rule 1

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13
W / = / g / t / t / g / a / c / t / a / c / c / g / t / a
P / = / g / t / t / c / a / t / a / c / g / g / a / t / a

In this case, it an be seen that and . We find that the only other substring which is equal to is. But, unfortunately, we have . Therefore, we claim that Good Suffix Rule 1 fails.

Example 7.2-4 The Fourth Example of the Good Suffix Rule 1

Consider the case in Table 7.2-4.

Table 7.2-4 The third case illustrating the Good Suffix Rule 1

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13
W / = / g / t / t / g / a / c / t / a / c / c / g / t / a
P / = / c / t / a / c / a / t / a / c / g / g / a / t / a

In this case, we can see that and . We note that although , . Therefore, we cannot use this substring. But we can use because would be 1 in such a situation and . We conclude that the Good Suffix Rule 1 works in this case.

Example 7.2-5 The Fifth Example of the Good Suffix Rule 1

Consider the case in Table 7.2-5.

Table 7.2-5 The fourth case illustrating the Good Suffix Rule 1

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11 / 12 / 13
W / = / g / t / t / g / a / c / t / a / c / c / g / t / a
P / = / g / t / a / c / a / t / a / c / g / g / g / t / a

In this case, we have and . Although , there is no such that because and will become 0 which is not acceptable. Thus in this case, the Good Suffix Rule 1 fails.

Finally, let us consider the extreme case where . In this case, the mismatching occurs at the very beginning of the scanning. We want to use the Bad Character Rule. If we claim that the Good Suffix Rule 1 fails, we would then use the Good Suffix Rule 2.

The Good Suffix Rule 1 of the Boyer and Moore Algorithm is a natural application of Rule E2. But, we must pause and ask ourselves two questions:

(1) Why would we apply Rule E2 to this suffix in ? Note that Rule E2 can be applied to any substring in . Is there any reason that we should utilize this particular suffix in ?

(2) How can we determine whether there is another in which is to the left of the in ? We know that we cannot determine this in run-time. What is the pre-processing mechanism which allows us to know where the other in exists if it exists?

To answer these two questions, we must note that this particular suffix in is not ordinary. It has the special property that it is the longest suffix of which is a suffix in . Thus, to see whether there is another in which is to the left of in , we may simply ask whether there is a substring in which is equal to a suffix of . We shall show that after we perform a pre-processing, this problem can be easily solved. The pre-processing will be introduced in a later section. Besides, we would have large number of steps to shift because the substring is a suffix.

Section 7.3 The Bad Character Rule of the Boyer and Moore Algorithm: Another Application of Rule E2

The above discussion shows that the Boyer and Moore Algorithm obviously follows Rule E2 when it utilizes the suffix in . But, of course, we ignored the bad character in . To apply Rule E2 to , we should find whether exists in to the left of in as shown in Fig. 7.3-1.

Fig. 7.3-1 The existence of in

If such an exists in to the left of in , we can shift to the right as shown in Fig. 7.3-2.

Fig. 7.3-2 The shifting of based upon the existence of

to the left of in in the Boyer and Moore Algorithm

If no such an exists in , we shift as shown in Fig. 7.3-3.

Fig. 7.3-3 Another possibility of shifting related to

the bad character in

The above discussion gives us the Bad Character Rule of the Boyer and Moore Algorithm which will be formally given later. One problem which we must deal with is how we can determine whether exists in and is to the left of in . This cannot be done in run-time. The Boyer and Moore Algorithm actually only creates a location table which is defined as follows:

Definition 7.3-1 The Location Table of the Boyer and Moore Algorithm

The location table of the Boyer and Moore Algorithm: The location table of Boyer and Moore Algorithm stores the locations of the first distinct characters counted from location in to the left, as we did in the Horspool Algorithm.

Example 7.3-1 An Example for the Bad Character Rule

Assume that we have the following case:

Table 7.3-1 A case for illustrating the Bad Character Rule

1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 10 / 11
W / = / g / t / t / a / c / t / a / g / g / t / a
P / = / a / c / c / t / g / t / a / c / c / t / a

Then the location table of is as shown in Table 7.3-2.