Sequence Query Processing

To processing sequence queries, we need to first based on the sequence of features in the images and construct an efficient indexing system. Next, we shall search for similar sequences and compare to the target query sequence and rank them according to a certain measure. In the following we shall discuss the Index construction procedure and sequence query processing

Technique and follow by an demonstration.

Index Construction Procedure

The index construction procedure consists of the three steps:

1. normalization

2. categorization

3. suffix tree construction

The short description for the normalization step is shown in

in (norm.ps).

At the categorization step, we convert each element value into

the symbol of the corresponding leaf node of TAH.

The detailed description of this step can be found in Appendix A.

At the index construction step, we extract every possible suffix

from each sequence and insert it into a suffix tree. The concept

and the structure of a suffix tree is shown in Appendix A.

Sequence query processing

The sequence query processing procedure consists of the following three steps:

1. normalization

2. suffix tree traversal

3. post-processing

At the normalization step, we convert each element value of a

query sequence into its normalized value. For this step, we maintain

the average and the standard deviation values of each dimension.

At the suffix tree traversal step, we traverse the suffix tree

to find a set of candidate subsequences using lower-bound time warping

distance function. The concept, the definition, and the lower-bound

definition of the time warping distance are shown in Appendix A.

The false-hit is detected and discarded at the post-processing step

where actual data sub-sequences corresponding to candidate subsequences

are retrieved from the database and compared with a query sequence.

The method for handling the nearest-neighbor query is described

in (nn.ps)

Appendix A

Searches for patients with similar temporal characteristics can augment the process of patient care by providing physicians with insight into the treatment of previous patients with similar medical conditions. For example, an oncologist might search the database for patients with similar tumor evolution patterns to find the optimal course of treatment.

Supporting similarity searches of temporal sequences requires the definition of a similarity measure. Unfortunately, defining such a similarity measure for medical data sequences is difficult because sequences that are qualitatively the same may be quantitatively different; the reasons are twofold: 1) the compared sequences may be of different lengths, making it difficult to embed sequences in a metric space and use a Euclidean distance measure; and 2) the sampling rates of compared sequences may be different, negating the utility of similarity measures like cross-correlation. In the area of speech recognition, these problems have been approached using time warping, allowing sequences to be stretched or compressed along the temporal axis [RJ93]. Analogous methods can be used to measure the similarities between sequences of medical data; but without an appropriate indexing scheme, query processing performance suffers as the total number and length of sequences increase.

Spatial access methods, such as R-trees, have been used as index structures for the fast retrieval of similar sequences. While spatial methods work well using Euclidean distances as a similarity measure, these methods may produce false dismissals when using time warping distance as a metric. A false dismissal refers to a sequence that is actually similar to the query sequence but which is judged by the system to be dissimilar [AFS93]. Indexing methods predicated upon the triangular inequality (TI) may generate false dismissals when a distance function not satisfying the TI is used as a similarity measure [YJF98]. As the time warping distance metric does not satisfy the TI, spatial access methods using this metric are subject to false dismissals.

Suffix trees have been used extensively as an index structure to find sub-strings matched exactly to a given query string [Ste94], and are extremely useful for subsequence matching as all possible subsequences in a sequence are easily determined. Unlike spatial access methods, suffix trees do not assume a specific geometry, nor a distance function. As such, suffix trees are a suitable candidate for an indexing structure based on time warping distances. However, for a suffix tree to be used in such a manner, the following problems need to be solved: 1) a suffix tree is built on data sequences whose elements take values from finite alphabets – but most elements of medical data sequences have continuous real values; and 2) conventional suffix trees are used for finding exactly matched sub-strings, so suffix tree algorithms need to be adapted to search for similarly matched subsequences.

We propose an indexing technique for similarity searching of multi-dimensional data sequences. In our approach, similarities between sequences are measured by the time warping distance. To achieve fast retrieval of similar sub-sequences without false dismissals, a disk-based suffix tree is used as an index structure. In addition, we introduce a categorization technique to compress the suffix tree, making the index structure efficient and scalable.

A.1 Related work

Several approaches for the fast retrieval of similar sequences have recently been proposed, many of them focusing on one-dimensional data sequences and using Euclidean distance as a similarity measure. In [AFS93], sequences of single-dimensional real numbers are converted into the frequency domain by a discrete Fourier Transform (DFT) and are subsequently mapped into multi-dimensional points that are managed by an R*-tree; the technique is extended to locate similar subsequences in [FRM94]. As this approach uses Euclidean distance as the similarity measure metric, data sequences of different lengths or different sampling rates cannot be matched.

The access methods of [BYO97, YO96, YJF98] permit the matching of sequences of different lengths. [BYO97] presents a modified version of an edit distance, considering two sequences as being matched if a majority of the elements match; this approach is extended to the matching of multi-dimensional data sequences in [YO96]. In [YJF98], the time warping distance is used as a similarity measure with a two-step filtering process: a FastMap index filter followed by a lower-bound distance filter [FL95]. These different approaches focus on whole sequence searches and use index structures based on the triangular inequality, therefore leading to possible false dismissals.

Similarity-based temporal sequence searching has also been addressed in the area of video databases. Work in [SYV97, LSYR98] describes similarity subsequence searching of video clips; however, this research focuses primarily on multimedia modeling, and applies spatial indexing techniques with sequential scanning to perform matching.

A.2 Similarity measure

In our approach, similarities between sequences are measured by the time warping distance metric [RJ93]. Time warping is a generalization method for comparing discrete sequences to sequences of continuous values, and has been used in the matching of voice, audio, and medical data (e. g., electrocardiograms). To find the minimum difference between two sequences, time warping is used to map each element of a sequence to one or more neighboring elements of a secondary sequence. Let X = <x1,…,xn> be a multi-dimensional temporal data sequence and let x = <x1,…,xm> be a feature vector consisting of m features. Then for two multi-dimensional sequences, X and Y, the multi-dimensional time warping distance Dmtw(X,Y) is defined as:

Dmtw(X,Y) = Dmbase(X[1], Y[1])

+ min {Dmtw(X,Y[2:-]), Dmtw(X[2:-],Y), Dmtw(X[2:-],Y[2:-])}

Dmbase(x,y) = i=1..m wi * |x[i]-y[i]|

where m is the number of features and wi is the weight of the ith feature, where (w1 + … + wm = 1).

In the definition of Dmtw(X,Y), X[p] represents the pth element of X, and X[p:r] denotes the subsequence of X including elements in positions p through r. The notation X[p:-] is used for the suffix of X starting at the pth element. That is, X[p:-] is identical to X[p:|X|] where |X| is the number of elements contained in X. Dmtw(X,Y) can be efficiently calculated with algorithmic complexity O(m|X||Y|) using a dynamic programming technique based on recurrence relations [BC96].

A.3 Index construction

We propose the use of a suffix tree as an index structure for similarity searches based on a time warping distance function. Unlike spatial access methods, a suffix tree does not assume any geometries nor an underlying distance function, and therefore, can guarantee the absence of false dismissals if actual distances are always lower-bounded within the indexing space. We first present the definition and structure of a suffix tree.

The suffix tree structure is based upon tries and suffix tries. A trie is an indexing structure used for indexing sets of keywords of varying sizes. A suffix trie is a trie whose set of keywords comprises the suffixes of sequences. Nodes of a suffix trie with a single outgoing edge can be collapsed, yielding a suffix tree [Ste94]. Each suffix of a sequence is represented by a leaf node. The concatenation of the edge labels on the path from the root of the tree to the internal node ni represents the longest common prefix of the suffixes represented by the leaf nodes under ni . We denote the labels on the path connecting nodes ni and nj as label(ni, nj).

Based on these definitions and notation, we now present the index construction methodology, which consists of three steps: pre-processing, categorization, and suffix tree construction.

Pre-processing: In this step, the raw data sequences are converted to sequences of feature vectors. To obtain a feature vector for an image, the automatic knowledge-based image segmentation is applied; the resultant contours are then used to generate a set of spatial features. As an example, consider the lung image sequence containing lung lesions, shown in Figure 1. To characterize the lesion evolution pattern, each lung lesion is represented by three features: area, perimeter, and circularity. After pre-processing, a sequence of three feature vectors, X = <(2.1, 3.0, 0.3), (3.9, 5.3; 0.4), (8.1, 9.3, 0.9)>, is obtained to represent the image sequence.


Figure 1: Visual stream representation for a temporal CT lung image sequence, where the size of the tumor is increasing over time. Three snapshots of a tumor's progression are depicted; for each image, appropriate features are extracted from the segmented lesion, including the area, perimeter, and circularity.

Categorization: In this step, every feature vector is converted to a corresponding symbol using the categorization (i.e., clustering) technique of MDISC [CC94, CCHY96]. MDISC maps a large set of element values into a much smaller set of categories, and has the following benefits: 1) the algorithm considers both value and frequency distributions, thus generating more accurate categories than equal-length interval categorization methods; and 2) MDISC is easier to implement than maximum entropy categorization methods [CC94]. Based on MDISC, each category is assigned a unique symbol, denoted as C = ([lb1, ub1],…,[lbm, ubm]), where lbi and ubi are the lower and upper bounds of categories for the ith feature. Thus, for any element included in a category, C, the ith feature of the element is contained between lbi and ubi. For example, given two categories

C1 =( [lb1=1.0, ub1=4.0], [lb2=1.0, ub2=6.0], [lb3=0.1, ub3=0.5]) and

C2 =( [4.1,10.0], [6.1,12.0], [0.6,1.0]), the sequence X for Figure 1 can be converted to Xc = <C1, C1, C2>. Similarly, a second sequence, Y = <(6.5, 9.2, 0.8), (3.5, 5.8, 0.4)>, with the corresponding categorization sequence being Yc = <C2, C1>. The sequences of feature vectors are converted to sequences of finite symbols after categorization, greatly reducing the overall size of the suffix tree [PCYH00].

Suffix tree construction: Based on the sequences of symbols obtained from the previous step, a suffix tree is constructed.


To handle large set of sequences, we employ an incremental disk-based suffix tree construction method, whereby the tree is constructed by performing a series of binary merges of suffix trees of increasing size [PRC94]. Two suffix trees for X and Y are merged with complexity O(|X| + |Y|); hence the suffix tree for M sequences is constructed with complexity O(ML) where L is the average length of M sequences. The total number of nodes in a suffix tree is constrained due to two factors: there are O(M L) leaf nodes, and the degree of any internal node is at least two. Therefore, the maximum number of nodes and overall space requirement of the suffix tree is linear in M L [Ste94]. As an example, consider the two categorized sequences Xc and Yc. Figure 2 shows the suffix tree constructed from these two categorized sequences.

Figure 2: Suffix tree from two categorized sequences Xc = <C1, C1, C2> and Yc = <C2, C1>. The three suffixes X[1:-] = <C1, C1, C2>, X[2:-] = <C1, C2>, and X[3:-] = <C2> from Xc are represented by the three leaf nodes n3, n4, and n6, respectively. Likewise, the two suffixes Y[1:-] = <C2, C1> and Y[2:-] = <C1> from Yc are represented by the two leaf nodes n7 and n2. $ denotes the end marker of a suffix.

A.4 Query processing

Given a multi-dimensional query sequence, Q, the goal of query processing is to find the nearest k similar subsequences from a set of M multi-dimensional medical data sequences, {X1,…,XM} where Xi is of arbitrary length. The query processing consists of three steps (Figure 3): pre-processing, index searching, and post-processing.


Figure 3: Query processing for similarity-matching of temporal data sequences.


Pre-processing: Based on the given feature (i.e., query attribute) names, their associated weights, and the type of query sequence, the query is converted into a sequence of feature vectors. For example, Figure 4 shows the query for a lung image sequence with specific feature names (area, perimeter, circularity), their associated weights (area=0.6, perimeter=0.2, circularity=0.2), and the desired number of answers (k = 2). From pre-processing, a query sequence with two feature vectors, Q= <(3.0, 4.2, 0.4), (8.0, 8.5, 0.8)>, is obtained.

Figure 4: Visual query for finding patients having similar lung tumor evolution patterns. The lesions are segmented and three features (area, perimeter, circularity) are calculated. Weights for features and the desired number of retrieved similarity-matched answers are specified in this query.

Index searching: Recall that the categorization process partitions the sequence elements into groups. The edges of the suffix tree are labeled by the groups of alphabetic symbols. The range of the group is represented by lower-and upper-bound values. Based on the time warping distance function, Dmtw(), we define Dmtw-lb() based on the lower-bound values and Dmtw-ub() based on the upper-bound values. For two multi-dimensional sequences, X and Y, Dmtw-lb(X, Y)  Dmtw(X, Y)  Dmtw-ub(X, Y) is always satisfied.

To find subsequences that may be included in the set of k-nearest subsequences, depth-first traversal is performed on the suffix tree. Let n0 represent the root of the suffix tree, and UBk the kth largest upper-bound distance in the set of candidate answers found thus far (initialized to infinity). Starting at n0, each child node, ni, is inspected; if Dmtw-lb(label(n0, ni), Q) UBk, the search algorithm adds label(n0, ni) to the set of candidate answers. As these new candidate answers are added, the value of UBk is updated accordingly. To reduce the search space, a branch pruning technique is applied to node ni to determine if further traversal is required [PCYH00]. This index searching step produces a set of k’(k) candidate answers.

Consider the query sequence, ~ Q, shown in Figure 4, and the set of data sequences represented by the suffix tree in Figure 2. The search algorithm starts by inspecting node n1. As Dmtw-lb(label(n0, n1), Q) UBk (= ), then label(n0, n1) is inserted into the set of candidate answers. Two more subsequences, label(n0, n3) and label(n0, n4), are added to the set of candidates during the subsequent stages of the search. The final set of candidates resulting from the traversal of the complete suffix tree is {label(n0, n1), label(n0, n3), label(n0, n4)}.

Post-processing: For each candidate found during the index searching step, the corresponding subsequences are retrieved and the exact distances are computed using the distance function Dmtw(). Subsequences are ranked according to these distances and the nearest k subsequences are returned as the final answers. For example, given the set of candidates obtained from index searching, the three subsequences Y[2:2], X[1:1], and X[2:2], are retrieved from label(n0, n1), and their distances to Q are calculated. Similarly, the two subsequences, X[1:3] and X[2:3], are determined from label(n0, n3) and label(n0, n4), respectively. Based on the distance values to Q, the nearest two subsequences, X[2:3] and X[1:3], are returned as the final answers.

References

[AFS93] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Proceedings FODO, 1993.

[BC96] D. J. Berndt and J. Clifford. Finding patterns in time series: A dynamic programming approach. In Advances in Knowledge Discovery and Data Mining, 1996.

[BYO97] T. Bozkaya, N. Yazdani, and M. Ozsoyouglu. Matching and indexing sequences of different lengths. In Proceedings of the ACM CIKM, 1997.

[CC94] W. W. Chu and K. Chiang. Abstraction of high level concepts from numerical values in databases. In Proceedings of the AAAI Workshop on Knowledge Discovery in Databases, 1994.

[CCHY96] Wesley W. Chu, Kuorong Chiang, Chih-Cheng Hsu, and Henrick Yau. An error-based conceptual clustering method for providing approximate query answers. In Communications of ACM, Virtual Extension Ed., 1996.

[FL95] C. Faloutsos and K. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the ACM SIGMOD, 1995.

[FRM94] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series data-bases. In Proceedings of the ACM SIGMOD, 1994.

[LSYR98] K. L. Liu, P. Sistla, C. Yu, and N. Rishe. Query processing in a video retrieval system. In Proceedings of the 14th International Conference on Data Engineering, Orlando, Florida, USA, February 1998. IEEE, IEEE Computer Society Press.

[PCYH00] S. Park, W. W. Chu, J. Yoon, and C. Hsu. Efficient searches for similar subsequences of different lengths in sequence databases. In Proceedings of the IEEE ICDE, 2000.

[PRC94] P. Bieganski, J. Riedl, and J. V. Carlis. Generalized suffix trees for biological sequence data: Applications and implementation. In Proceedings of the International Conference on System Sciences, 1994.

[RJ93] L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.