TDMinerUser Manual

By

Debprakash Patnaik

Version 0.0d

June20, 2008

Table of Contents

1

Introduction to TDMiner

Serial episodes

Parallel episodes

Installation and Setup

System Requirements

Installing and Running the Software

Getting Started

Loading/Creating an Event Sequence

Loading Events from a Comma Separated File

Mining the Events

Mining for serial episodes with inter-event interval constraint

Anatomy of an episode:

Saving the episodes

Mining for parallel episodes with expiry constraint

Visualizing

The Event Sequence

The Episode Instances

Graph Visualization

1

Introduction to TDMiner

Temporal data mining is concerned with the mining of large sequential or ordered data streams. The data streams dealt with here are only categorical data sequences or event streams. The data can be viewed as a single long sequence of ordered pairs (Ei, ti) which are called instantaneous events. In each event (Ei, ti), Ei is referred to as an event type (which takes its value from a finite alphabet) and tiis the time of occurrence of the event. There are many applications where data appears in this form, e.g., spike trains in multi-neuronal recordings, alarm sequences in telecom networks, web navigation logs etc. The data here can be viewed as a sequence of ordered pairs (Ei, ti), where Ei denotes the event type and ti is the start time of the event. The data must have been sorted according to start times.

Frequent episodes are temporal patterns that occur sufficiently often along the data sequence. The temporal patterns referred to as episodes are ordered collections of event types. For example A→B→C is a temporal pattern where an event type A is followed some time later by B and then a C, in that order. The frequent episode framework aims at discovering all episodes that occur often in the data.

TDMiner is a temporal data mining tool that minesFrequent Episodes from long sequence of events. The program is written in Java and provides a suit of mining techniques for different types of frequent episodes and also tools for visualizing the results obtained from mining.

Frequent episode discovery framework was proposed by Mannila et al (Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery, 1997), in the context analyzing alarm sequences in a communication network. Laxman et al (Discovering frequent episodes and learning hidden markov models: A formal connection, IEEE Transactions on Knowledge and Data Engineering, 17(11), 2005) introduced the notion of non-overlapped occurrences as episode frequency and proposed efficient counting algorithms. This work was then extended by Debprakash et.al. to include temporal constraints on non-overlapped occurrences of frequent episodes (Application of Frequent Episode Discovery Framework in Microelectrode Array Data Analysis, Masters Thesis, Electrical Engineering Dept, Indian Institute of Science, 2006). For example constraints can be added to time interval between two consecutive events in a serial episode etc. The TDMiner package provides the algorithms for mining episodes with all different temporal constraints.

The data to be analyzed is a sequence of events denoted by (E1, t1), (E2, t2) … where Ei represents an event type and ti, the time of occurrence of the ith event. Ei's are drawn from a finite set of event types. The sequence is ordered with respect to time of occurrences of the events so that, ti< ti+1, for all i=1, 2… The following is an example event sequence:

Event, Time of occurrence

------

A, 0.00396

B, 0.05284

V, 0.05800

R, 0.07332

W, 0.11700

D, 0.21044

Y, 0.24852

A, 0.32056

The patterns to be discovered in the event sequence are called Episodes. An episode is a collection of events occurring together. Three different kinds of episodes are shown in the figure below.

Figure 1: Types of episodes

A) Serial Episode, chained events

B) Parallel Episode, synchronized events

C) Neither Serial nor Parallel Episode, synfire-chain events

Serial episodes

Figure 1A shows a serial episode: it occurs in a data stream only if there are events of types X, Y and Z that occur in order in the sequence, and denoted by (X→Y→Z). In the sequence there can be other events interspersed between the occurrences of event types of an episode (e.g. …(X,1),(A,3),(F,4),(Y,6),(M,9),(Z,10)…).

Parallel episodes

Figure 1B shows a parallel episode: there is no constraint on the relative order of X, Y and Z. Parallel episodes are denoted e.g. as (XYZ).

Figure 1C shows an example of a non-serial and non-parallel episode: it occurs in a sequence if events of type X and Y occur in any order followed some time later by an event of type Z.

In multi-neuron data, a spike event has the label of the neuron (or the electrode number in case of multi-electrode array recordings) which generated the spike as its event type and has the associated time of occurrence. The neurons in the ensemble under observation fire action potentials at different times, that is, generate spike events. All these spike events are strung together, in time order, to give a single long data sequence.

Installation and Setup

The programis distributed as TDMiner.exe for Windows operating system and TDMiner.jar for all other platforms.

System Requirements

Software Requirements:

Java Runtime Environment (JRE) 5.0 or above

Download:

Basic Hardware Requirements:

1GB RAM

Intel Pentium Series 1 GHz onwards

Recommended Hardware Requirements:

200MB of free disk space

2GB real memory

Intel Pentium IV 3 GHz onwards

Installing and Running the Software

.

TDMiner.exe – distribution package for Windows platform.

TDMiner.jar – distribution package for all other platforms.

To run TDMiner.jar

$ java –jar TDMiner.jar

Getting Started

In this section we give step by step directions for how to load/create an event sequence, mine for frequent episodes and visualize the episodes discovered.

Loading/Creating an Event Sequence

Loading Events from a Comma Separated File

The program can load events from an ASCII file stored in comma separated format: event, time on each line of the input file. The following is a portion of a typical input file:

A, 0.00396

B, 0.05284

V, 0.05800

R, 0.07332

W, 0.11700

J, 0.14344

D, 0.21044

Y, 0.24852

A, 0.32056

Events can be alpha-numeric strings of any size and time must be a numeric value. The times in the file must be in ascending order.

  1. Go to Event Sequence Loader Tab
  2. Click on button in File Name panel.
  3. Select the data file using the file selection dialog.
  4. The following dialog provides options to specify the data format of the file. It is assumed that each line of the data file contains an event with its time of occurrence.

  1. After the event sequence is loaded, plot of the no. of occurrence of each event type is displayed along with a table indicating the actual numbers. See below:

Mining the Events

Frequent episode discovery is an iterative procedure in which frequent episodes of a given size are found in each pass over the event stream. These frequent episodes are used to generate the candidate episodes of the next size. That is, first 1-node frequent episodes are found through one pass over the data. Using the appropriate candidate generation scheme, the 1-node frequent episodes are then combined to obtain a set of 2-node candidate episodes. By counting the frequencies of 2-node candidates through one pass over the data, frequent 2-node episodes are selected. This process is continued till frequent episodes of all different sizes are obtained. The candidate generation strategies depend on the frequency measure used. For details refer to Chapter 2 of Application of Frequent Episode Discovery Framework in Microelectrode Array Data Analysis, Masters Thesis, Electrical Engineering Dept, Indian Institute of Science, 2006.

Mining for serial episodes with inter-event interval constraint

  1. Select “Discovery of episodes & inter-event intervals(Serial)” from the drop-down list of Algorithms. This mining algorithm discovers serial episodes with suitable inter-event intervals chosen from a set of intervals provided by the user.
  2. A threshold type determines whether an episode is frequent/significant based on its number of occurrence in the data. The “Episode Connection Strength” based Threshold Type is selected by default. The Estrong panel takes the parameters for this threshold type.
  3. Connection Strength: It is the probability of the subsequent event to follow an event that is part of an episode.
  4. Type I Error: This is the allowed type I error of incorrectly classifying an episode as frequent/significant.
  5. Time Granularity: This specifies the granularity at which the data sequence is discretized for computing significance.
  6. For this algorithm we need to specify a set of inter-event intervals. The algorithm then automatically selects the suitable interval for each pair of consecutive event types in an episode.The intervals are entered in the Constraints Panel using button.The following dialog allows us to enter intervals as pairs of Tlow and Thigh (indicating the start and end of an interval).
  1. “Episode Size” on constraints panel indicates the maximum size of an episode up to which the mining algorithm will run.
  2. After all the parameters are selected we mine the data using button.

Note:

The other type of frequency threshold available for this mining is “Explicit Decaying Threshold”.

This is the cut-off frequency expressed as a fraction of the number events in the data sequence. All episodes that occur more number of times than the cut-off frequency times the number of events in the sequence are declared as frequent. For example if threshold = 0.01 and number of events = 20,000, then all frequent episodes must occur more than 200 (0.01 x 20,000) time is the data. The threshold can be expressed separately for one-node and two-node onwards episodes.The Decay Multiplier is used to determine the threshold for episodes of size > 2 if the mining process continues. That is for 3-node episodes threshold = 0.018 (2-node threshold) x 0.9 = 0.0162, and the threshold for 4-node episodes = 0.0162 (3-node threshold) x 0.9 = 0.01458.

After the event sequence is mined the results are displayed in the Result panel as shown below. The left panel can be used to browse through episode sets of different sizes. Clicking on different headers of the table sort the episode sets in different orders.

Anatomy of an episode:

A[0.004-0.006]-B[0.004-0.006]-C[0.004-0.006]-D #(0.855) : 255

  1. A,B,C,D – Event types in the episode (appearing in that order)
  2. [0.004-0.006] – The inter-event time between consecutive event types. This can be different for different pairs.
  3. #(0.855) – The conditional probability of one event being followed by the other (assuming this is same for each consecutive pair of events)
  4. 255 – Number of non-overlapped counts in the data sequence of the given episode.

Saving the episodes

The button on Results panel saves the mined episodes into a text file. Show below is an example of a stored episode set.

Frequent Episode discovery results

------

Episode set size = 4

Episodes of size = 1

------

No. of 1 node frequent episodes = 26

N : 432

J : 413

Z : 408

R : 408

L : 407

F : 406

K : 404

V : 402

T : 402

M : 401

I : 400

Y : 399

B : 394

U : 392

C : 390

G : 390

P : 389

D : 383

A : 381

W : 381

E : 378

Q : 377

H : 374

O : 371

S : 364

X : 357

------

Episodes of size = 2

------

No. of 2 node frequent episodes = 3

B[0.004-0.006]-C #(0.738) : 343

C[0.004-0.006]-D #(0.738) : 338

A[0.004-0.006]-B #(0.738) : 333

------

Episodes of size = 3

------

No. of 3 node frequent episodes = 2

B[0.004-0.006]-C[0.004-0.006]-D #(0.824) : 297

A[0.004-0.006]-B[0.004-0.006]-C #(0.824) : 290

------

Episodes of size = 4

------

No. of 4 node frequent episodes = 1

A[0.004-0.006]-B[0.004-0.006]-C[0.004-0.006]-D #(0.855) : 255

------

Mining for parallel episodes with expiry constraint

  1. Select “Non-overlapped count with episode expiry constraint (Parallel)” from the drop-down list of Algorithms. This mining algorithm discovers parallel episodes with the expiry time constraint provided by the user.
  2. A threshold type determines whether an episode is frequent/significant based on its number of occurrence in the data. Here we can use “Explicit Decay Threshold” or “Auto-threshold for parallel episode” in the Threshold Type panel.Explicit Decay Threshold is same as that described above in Serial Episode mining section.

Auto-threshold for parallel episode: Here a chance occurrence of an episode in a window of time less than the expiry constraint is considered as success of a random binomial trail with p = probability of arrival of each of the constituent event types in the time interval of the expiry constraint.

Error for Parallel Episode is probability of incorrectly selecting a randomly occurring set of events as a significant parallel episode. Hence it may be useful to have extremely small values for this e.g. 1e-5.

  1. Enter the expiry constraint in sec in the Constraints panel. All episode occurrences mined here complete within the specified time duration.

After the event sequence is mined the results are displayed in the Result panel as shown below. The left panel can be used to browse through episode sets of different sizes. Clicking on different headers of the table sort the episode sets in different orders. Although parallel episodes are shown as a sequence of even types, there is noorder between the shown event types.

Visualizing

The Event Sequence

The event sequence can be visualized on a typical raster plot on Event Sequence Display panel.

Each spike represents a separate event. The y-axis shows the id/label of the event type and the x-axis shows the time stamp.

Standard Zoom In, Zoom Outand Scroll operations are supported on the plot. The reset button can be used to fit the entire event sequence again into the display.

The Episode Instances

  1. Episode occurrence can be displayed on the raster plot using from the operations panel.
  2. The following dialog is displayed for selecting episodes.
  1. We select the episodes using the check box beside them. The Select All button selects all episodes shown on the dialog.
  2. The Add Episode panel on the above dialog can be used to enter episodes not already mined.

The format for specifying episodes here is as follows

Serial Episodes with inter-event intervals:M [0.009-0.011] N [0.004-0.006] O [0.014-0.016] P

Each pair event typeis separated by the inter-event interval applicable to it.

Parallel Episodes: M N O P (after parallel episode mining has been done)

The above plot shows occurrences of episodes A[0.004-0.006]-B[0.004-0.006]-C[0.004-0.006]-D and M[0.009-0.011]-N[0.004-0.006]-O[0.014-0.016]-P. It turns out that there are no occurrences of the second episode satisfying the specified inter-event time constraints.

Graph Visualization

The frequent episodes mined in the process can be strung together to build the structure of the under-lying network. The Graph Visualization panel provided a way to see the graph structure based on the mined episodes. The display button loads the frequent episodes into the graph display. The Level indicates the size of episodes that are displayed as edges of the graph. The EStrong slider helps limit episodes of certain conditional probability and above.

Shown below is an example of graph visualization.

1