Support Vector Machines (Svms) Classifiers, ROC Analysis and Grid Search for IDS

Proceedings of 2nd International Conference on Intelligent Knowledge Systems (IKS-2005), 06-08 July 2005

Classification Models for Intrusion Detection Systems

Srinivas Mukkamala, Andrew H. Sung, Rajeev Veeraghattam


Department of Computer Science, New Mexico Tech, Socorro, NM 87801, USA

Institute of Complex Additive Systems Analysis, New Mexico Tech, Socorro, NM 87801, USA

Key words: Machine learning, Intrusion detection systems, CART, MARS, TreeNet

Abstract

This paper reports results on the classification capability of supervised machine learning techniques in detecting intrusions from network audit trails. We investigate three well-known machine learning techniques: classification and regression trees (CART), multivariate adaptive regression splines (MARS), and TreeNet. The best model is chosen based on classification accuracy (ROC curve analysis). The results show that high classification accuracies can be achieved in a fraction of the time required by the well-known support vector machines and artificial neural networks. TreeNet performs best for the normal, probe, and denial of service (DoS) classes. CART performs best for the user to super user (U2Su) and remote to local (R2L) classes.

I. Introduction

Since the ability of an Intrusion Detection System (IDS) to identify a large variety of intrusions in real time with high accuracy is of primary concern, we will in this paper consider performance of machine learning-based IDSs with respect to classification accuracy and false alarm rates.

AI techniques have been used to automate the intrusion detection process; they include neural networks, fuzzy inference systems, evolutionary computation, machine learning, and support vector machines (SVMs) [1-6]. Model selection using SVMs and other popular machine learning methods often requires extensive resources and long execution times [7,8]. In this paper, we present three machine learning methods (MARS, CART, and TreeNet) that can perform model selection with higher or comparable accuracies in a fraction of the time required by SVMs.

MARS is a nonparametric regression procedure based on the “divide and conquer” strategy, which partitions the input space into regions, each with its own regression equation [9]. CART is a tree-building algorithm that determines a set of if-then logical (split) conditions that permit accurate prediction or classification of classes [10]. TreeNet is a tree-building algorithm that uses stochastic gradient boosting to combine trees via a weighted voting scheme, achieving accuracy without the tendency to be misled by bad data [11,12].

We performed experiments using MARS, CART, and TreeNet to classify each of the five classes (normal, probe, denial of service, user to super-user, and remote to local) of network traffic patterns in the DARPA data.

A brief introduction to MARS and model selection is given in section II. CART and a tree generated for classifying normal vs. intrusions in the DARPA data are explained in section III. TreeNet is briefly described in section IV. The intrusion detection data used for the experiments are explained in section V. In section VI, we analyze the classification accuracies of MARS, CART, and TreeNet using ROC curves. Conclusions of our work are given in section VII.

II. MARS

Multivariate Adaptive Regression Splines (MARS) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARS constructs this relation from a set of coefficients and basis functions that are entirely “driven” from the data.

The method is based on the “divide and conquer” strategy, which partitions the input space into regions, each with its own regression equation. This makes MARS particularly suitable for problems with higher input dimensions, where the curse of dimensionality would likely create problems for other techniques.

Basis functions: MARS uses two-sided truncated functions of the form $(x-t)_+$ and $(t-x)_+$ as basis functions for linear or nonlinear expansion, which approximates the relationships between the response and predictor variables [9,11]. Here $(x-t)_+ = \max(0, x-t)$ and $(t-x)_+ = \max(0, t-x)$. The parameter t is the knot of the basis functions (defining the "pieces" of the piecewise linear regression); these knots are also determined from the data. The "+" subscript simply denotes that only positive results of the respective expressions are considered; otherwise the respective function evaluates to zero.
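As a small illustration of the paired basis functions described above, the following Python sketch evaluates $(x-t)_+$ and $(t-x)_+$ at a few points. The function names and the knot value 2.0 are assumptions chosen for the example; in MARS the knots are selected from the data.

```python
def hinge_right(x: float, t: float) -> float:
    """(x - t)+ : zero to the left of the knot t, linear to the right."""
    return max(0.0, x - t)


def hinge_left(x: float, t: float) -> float:
    """(t - x)+ : linear to the left of the knot t, zero to the right."""
    return max(0.0, t - x)


if __name__ == "__main__":
    knot = 2.0  # hypothetical knot, not taken from the paper
    for x in (0.0, 1.0, 2.0, 3.5):
        print(x, hinge_left(x, knot), hinge_right(x, knot))
```

At any given x, at most one function of the pair is nonzero, which is what lets each region of the input space carry its own slope.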

The MARS Model

The basis functions together with the model parameters (estimated via least squares estimation) are combined to produce the predictions given the inputs. The general MARS model equation is

$$y = f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$$

where the summation is over the M nonconstant terms in the model. Here y is predicted as a function of the predictor variables X (and their interactions); this function consists of an intercept parameter ($\beta_0$) and the sum of one or more basis functions $h_m(X)$, each weighted by a coefficient ($\beta_m$).
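To make the structure of the model concrete, the sketch below evaluates a prediction as an intercept plus a weighted sum of basis functions. The helper name `mars_predict`, the two basis functions, and all coefficient values are hypothetical; a fitted MARS model would estimate them from data by least squares.

```python
from typing import Callable, List, Sequence

BasisFunction = Callable[[Sequence[float]], float]


def mars_predict(
    x: Sequence[float],
    intercept: float,
    weights: Sequence[float],
    basis_functions: Sequence[BasisFunction],
) -> float:
    """Return intercept + sum_m weights[m] * basis_functions[m](x)."""
    if len(weights) != len(basis_functions):
        raise ValueError("one weight is required per basis function")
    return intercept + sum(b * h(x) for b, h in zip(weights, basis_functions))


# Hypothetical two-term model on a single predictor with a knot at 2.0.
basis: List[BasisFunction] = [
    lambda x: max(0.0, x[0] - 2.0),  # (x - t)+
    lambda x: max(0.0, 2.0 - x[0]),  # (t - x)+
]
y_hat = mars_predict([3.0], intercept=1.0, weights=[0.5, -0.25], basis_functions=basis)
```

Interaction terms fit the same interface: a basis function may simply be the product of two hinge functions on different predictors.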

Model Selection

After the forward stepwise selection of basis functions, a backward procedure is applied in which the model is pruned by removing those basis functions whose removal causes the smallest increase in the least squares error (the inverse of goodness-of-fit). The so-called Generalized Cross Validation (GCV) error is a measure of the goodness of fit that takes into account not only the residual error but also the model complexity. It is given by

$$GCV = \frac{\sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2}{\left(1 - C/N\right)^2}$$

with

$$C = 1 + c\,d$$

where N is the number of cases in the data set and d is the effective degrees of freedom, which is equal to the number of independent basis functions. The quantity c is the penalty for adding a basis function. Experiments have shown that the best value for c can be found somewhere in the range 2 < c < 3 [9].
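The effect of the complexity penalty can be seen in a short calculation. The sketch below assumes the common form of the criterion, the residual sum of squares divided by $(1 - C/N)^2$ with $C = 1 + c\,d$; the function name `gcv` and the default penalty of 2.5 are assumptions made for illustration.

```python
from typing import Sequence


def gcv(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    n_basis: int,
    penalty: float = 2.5,  # hypothetical choice of c inside the range 2 < c < 3
) -> float:
    """Generalized cross validation error: RSS / (1 - C/N)^2 with C = 1 + c*d."""
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    effective_params = 1.0 + penalty * n_basis  # C = 1 + c*d
    if effective_params >= n:
        raise ValueError("model is too complex for this number of cases")
    rss = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    return rss / (1.0 - effective_params / n) ** 2


# Two candidate models with the same residual error but different sizes.
y = [float(i) for i in range(1, 11)]
pred = [v + 0.5 for v in y]
small_model = gcv(y, pred, n_basis=1)
large_model = gcv(y, pred, n_basis=2)
```

With equal residual error, the model with more basis functions receives the larger GCV value, so backward pruning prefers the smaller model.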

III. CART

CART builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification) [10,11].

CART analysis consists of four basic steps [12]:

The first step consists of tree building, during which a tree is built using recursive splitting of nodes. Each resulting node is assigned a predicted class, based on the distribution of classes in the learning dataset which would occur in that node and the decision cost matrix.
The second step consists of stopping the tree building process. At this point a “maximal” tree has been produced, which probably greatly overfits the information contained within the learning dataset.
The third step consists of tree “pruning,” which results in the creation of a sequence of simpler and simpler trees, through the cutting off of increasingly important nodes.
The fourth step consists of optimal tree selection, during which the tree which fits the information in the learning dataset, but does not overfit the information, is selected from among the sequence of pruned trees.

The decision tree begins with a root node t derived from whichever variable in the feature space minimizes a measure of the impurity of the two sibling nodes. The measure of impurity, or entropy, at node t, denoted by i(t), is given by [11]:

$$i(t) = -\sum_{j=1}^{k} p(w_j \mid t) \log p(w_j \mid t)$$

where $p(w_j \mid t)$ is the proportion of patterns $x_i$ allocated to class $w_j$ at node t, and k is the number of classes. Each non-terminal node is then divided into two further nodes, $t_L$ and $t_R$, such that $p_L$ and $p_R$ are the proportions of entities passed to the new nodes $t_L$ and $t_R$ respectively. The best division s is that which maximizes the difference [11]:

$$\Delta i(s, t) = i(t) - p_L\, i(t_L) - p_R\, i(t_R)$$

The decision tree grows by means of successive sub-divisions until a stage is reached at which a further division s produces no significant decrease in the measure of impurity. When this stage is reached, the node t is not sub-divided further and automatically becomes a terminal node. The class $w_j$ associated with the terminal node t is that which maximizes the conditional probability $p(w_j \mid t)$. The number of nodes generated and the terminal node values for each class, for the DARPA data set described in section V, are presented in Table 1.
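The splitting rule described in this section can be sketched in a few lines of Python. The code below computes the entropy impurity of a node, the impurity decrease of a candidate division, and the best threshold on a single numeric feature. The function names, the base-2 logarithm, and the toy data are assumptions made for illustration; this is not the CART implementation used in the experiments.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def entropy(labels: Sequence[str]) -> float:
    """i(t) = -sum_j p(w_j | t) * log2 p(w_j | t) for the labels at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent: Sequence[str], left: Sequence[str], right: Sequence[str]) -> float:
    """i(t) - p_L * i(t_L) - p_R * i(t_R) for one candidate division."""
    n = len(parent)
    p_left, p_right = len(left) / n, len(right) / n
    return entropy(parent) - p_left * entropy(left) - p_right * entropy(right)


def best_threshold(values: Sequence[float], labels: Sequence[str]) -> Tuple[Optional[float], float]:
    """Scan midpoints between sorted feature values; return (threshold, decrease)."""
    pairs = sorted(zip(values, labels))
    best: Tuple[Optional[float], float] = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left: List[str] = [lab for _, lab in pairs[:i]]
        right: List[str] = [lab for _, lab in pairs[i:]]
        gain = impurity_decrease(labels, left, right)
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Toy data: one feature that separates "normal" from "attack" records.
feature = [1, 2, 3, 4, 10, 11, 12, 13]
classes = ["normal"] * 4 + ["attack"] * 4
split = best_threshold(feature, classes)
```

A full tree builder would apply this search over every feature at each node and stop when the best decrease is no longer significant, as described above.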

Figure 1. Tree for classifying normal vs. intrusions

Figure 1 represents a classification tree generated for the DARPA data described in section V for classifying normal activity vs. intrusive activity. Each terminal node describes a data value; each record is classified into one of the terminal nodes through the decisions made at the non-terminal nodes that lead from the root to that leaf.

Table 1. Summary of tree splitters for all five classes.

Class / No of Nodes / Terminal Node Value
Normal / 23 / 0.016
Probe / 22 / 0.019
DoS / 16 / 0.004
U2Su / 7 / 0.113
R2L / 10 / 0.025

IV. TreeNet

In a TreeNet model, classification and regression models are built up gradually through a potentially large collection of small trees. A model typically consists of a few dozen to several hundred trees, each normally no larger than two to eight terminal nodes. The model is similar to a long series expansion (such as a Fourier or Taylor series): a sum of factors that becomes progressively more accurate as the expansion continues. The expansion can be written as [11,13]:

$$F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \dots + \beta_M T_M(X)$$

where each $T_i$ is a small tree.

Each tree improves on its predecessors through an error-correcting strategy. Individual trees may be as small as one split, but the final models can be accurate and are resistant to overfitting.
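The error-correcting strategy can be illustrated with a minimal stochastic gradient boosting sketch for squared-error regression on one feature, in which every tree is a single split. This is an assumption-laden toy, not the TreeNet implementation: the function names, the learning rate, the subsample fraction, and the data are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)
Model = Tuple[float, float, List[Stump]]  # (F0, learning_rate, stumps)


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Fit a one-split tree to the residuals by least squares."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, ((lo + hi) / 2, lv, rv))
    if best is None:  # all feature values equal: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1]


def stump_predict(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.1, subsample=0.5, seed=0) -> Model:
    """Each new stump is fit to the current residuals on a random subsample."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(xs)
    stumps: List[Stump] = []
    m = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_trees):
        idx = rng.sample(range(len(xs)), m)
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def predict_boosted(model: Model, x: float) -> float:
    """F(x) = F0 + learning_rate * sum_i T_i(x)."""
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)
```

The small learning rate plays the role of the weights in the expansion: no single tree dominates, and accuracy accumulates over many small corrections.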

V. Data Used for Analysis

A subset of the DARPA intrusion detection data set is used for offline analysis. In the DARPA intrusion detection evaluation program, an environment was set up to acquire raw TCP/IP dump data for a network by simulating a typical U.S. Air Force LAN. The LAN was operated like a real environment but was subjected to multiple attacks [14,15]. For each TCP/IP connection, 41 quantitative and qualitative features were extracted [16] for intrusion analysis.

The 41 extracted features fall into three categories: “intrinsic” features, which describe the individual TCP/IP connections and can be obtained from network audit trails; “content-based” features, which describe the payload of the network packet and can be obtained from the data portion of the packet; and “traffic-based” features, which are computed using a specific window (connection time or number of connections). DoS and Probe attacks involve several connections in a short time frame, whereas R2L and U2Su attacks are embedded in the data portions of the connection and often involve just a single connection; “traffic-based” features therefore play an important role in deciding whether a particular network activity is engaged in probing or not.

Attack types fall into four main categories:

Denial of Service (DOS) Attacks: A denial of service attack is a class of attacks in which an attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine. Examples are Apache2, Back, Land, Mail bomb, SYN Flood, Ping of death, Process table, Smurf, Syslogd, Teardrop, Udpstorm.
User to Superuser or Root Attacks (U2Su): User to root exploits are a class of attacks in which an attacker starts out with access to a normal user account on the system and is able to exploit a vulnerability to gain root access to the system. Examples are Eject, Ffbconfig, Fdformat, Loadmodule, Perl, Ps, Xterm.
Remote to User Attacks (R2L): A remote to user attack is a class of attacks in which an attacker who does not have an account on a machine sends packets to that machine over a network and exploits some vulnerability to gain local access as a user of that machine. Examples are Dictionary, Ftp_write, Guest, Imap, Named, Phf, Sendmail, Xlock, Xsnoop.
Probing (Probe): Probing is a class of attacks in which an attacker scans a network of computers to gather information or find known vulnerabilities. An attacker with a map of machines and services that are available on a network can use this information to look for exploits. Examples are Ipsweep, Mscan, Nmap, Saint, Satan.

In our experiments, we perform 5-class classification. The (training and testing) data set contains 11982 points randomly selected from the DARPA data, representing the five classes, with the number of points from each class proportional to its size, except that the smallest class is completely included. The 5092 training points and 6890 testing points are divided into five classes: normal, probe, denial of service, user to super user, and remote to local. The attack data are a collection of 22 different types of instances that belong to the four attack classes described above, and the remainder is normal data. Note that two separate, randomly selected data sets of sizes 5092 and 6890 are used for training and testing MARS, CART, and TreeNet, respectively. Section VI summarizes the classifier accuracies.
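One way such a sample could be drawn is sketched below: the smallest class is kept whole, and the remaining classes are sampled in proportion to their sizes. The exact procedure used for the experiments is not specified here, so the function name, the rounding rule, and the toy class counts are all assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

Record = TypeVar("Record")


def proportional_sample(
    records: Sequence[Record],
    label_of: Callable[[Record], Hashable],
    total: int,
    seed: int = 0,
) -> List[Record]:
    """Keep the smallest class whole; sample the other classes proportionally."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record in records:
        by_class[label_of(record)].append(record)

    smallest = min(by_class, key=lambda c: len(by_class[c]))
    sample = list(by_class[smallest])
    remaining = total - len(sample)
    if remaining < 0:
        raise ValueError("total is smaller than the smallest class")

    others = {c: rows for c, rows in by_class.items() if c != smallest}
    n_others = sum(len(rows) for rows in others.values())
    for rows in others.values():
        # Rounding means the final size can differ from `total` by a few records.
        k = min(len(rows), round(remaining * len(rows) / n_others))
        sample.extend(rng.sample(rows, k))
    rng.shuffle(sample)
    return sample


# Hypothetical class sizes, not the DARPA counts.
data = (
    [("normal", i) for i in range(700)]
    + [("dos", i) for i in range(290)]
    + [("u2su", i) for i in range(10)]
)
subset = proportional_sample(data, label_of=lambda r: r[0], total=110)
counts = Counter(label for label, _ in subset)
```

Keeping the rarest class whole avoids losing the few available examples of it, at the cost of slightly over-representing that class relative to its true frequency.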

VI. ROC Curves

Detection rates and false alarms are evaluated for the five-class pattern in the DARPA data set, and the obtained results are used to form the ROC curves. The point (0,1) represents the perfect classifier, since it classifies all positive cases and negative cases correctly. Thus an ideal system will begin by identifying all the positive examples, so the curve will rise to (0,1) immediately, having a zero rate of false positives, and then continue along to (1,1).

Figures 2 to 6 show the ROC curves of the detection models by attack categories as well as on all intrusions. In each of these ROC plots, the x-axis is the false positive rate, calculated as the percentage of normal connections considered as intrusions; the y-axis is the detection rate, calculated as the percentage of intrusions detected. A data point in the upper left corner corresponds to optimal performance, i.e., a high detection rate with a low false alarm rate. The areas under the ROC curves and the numbers of false positives and false negatives are presented in Tables 2 to 6.
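For readers who wish to reproduce this kind of analysis, the following sketch builds ROC points from classifier scores and computes the area under the curve with the trapezoidal rule. The function names and the toy scores are assumptions; the curves in this paper were not produced with this code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, detection rate)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Sweep the decision threshold from high to low. Labels: 1 = intrusion, 0 = normal."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Records with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: Sequence[Point]) -> float:
    """Trapezoidal area under a curve given as points sorted by increasing x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores: higher means "more likely an intrusion".
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = area_under_curve(roc_points(toy_scores, toy_labels))
```

A classifier that ranks every intrusion above every normal connection yields a curve through (0,1) and an area of 1.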

Table 2. Summary of classification accuracy for normal.

Curve / Area / False Positives / False Negatives
MARS / 0.993 / 56 / 4
CART / 0.991 / 75 / 5
TreeNet / 0.997 / 18 / 0

Figure 2. Classification accuracy for normal

Table 3. Summary of classification accuracy for probe.

Curve / Area / False Positives / False Negatives
MARS / 0.777 / 64 / 305
CART / 0.998 / 24 / 0
TreeNet / 0.999 / 14 / 0

Figure 3. Classification accuracy for probe

Table 4. Summary of classification accuracy for DoS.

Curve / Area / False Positives / False Negatives
MARS / 0.945 / 185 / 169
CART / 0.998 / 1 / 16
TreeNet / 0.998 / 3 / 9

Figure 4. Classification accuracy for DoS

Table 5. Summary of classification accuracy for U2Su.

Curve / Area / False Positives / False Negatives
MARS / 0.700 / 3 / 15
CART / 0.720 / 3 / 14
TreeNet / 0.699 / 7 / 16

Figure 5. Classification accuracy for U2Su

Table 6. Summary of classification accuracy for R2L.

Curve / Area / False Positives / False Negatives
MARS / 0.992 / 17 / 7
CART / 0.993 / 15 / 6
TreeNet / 0.992 / 19 / 7

Figure 6. Classification accuracy for R2L

VII. Conclusions

A number of observations and conclusions are drawn from the results reported in this paper:

TreeNet easily achieves high detection accuracy (higher than 99%) for each of the 5 classes of DARPA data. TreeNet performed the best for normal with 18 false positives (FP) and 0 false negatives (FN), probe with 14 FP and 0 FN, and denial of service attacks (DoS) with 3 FP and 9 FN.
CART performed the best for user to super user (U2Su) with 3 FP and 14 FN and remote to local (R2L) with 15 FP and 6 FN.

We demonstrate that, using these fast-executing machine learning methods, we can achieve high classification accuracies in a fraction of the time required by the well-known support vector machines and artificial neural networks.

We note, however, that the differences in accuracy figures tend to be small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in their sizes. More definitive conclusions can perhaps only be drawn after analyzing more comprehensive sets of network data.

Acknowledgements

Partial support for this research from ICASA (Institute for Complex Additive Systems Analysis, a division of New Mexico Tech), a DoD IASP grant, and an NSF SFS Capacity Building grant is gratefully acknowledged.

References

[1] S. Mukkamala, G. Janowski, A. H. Sung, Intrusion Detection Using Neural Networks and Support Vector Machines. Proceedings of IEEE International Joint Conference on Neural Networks 2002, IEEE Press, pp. 1702-1707, 2002.
[2] M. Fugate, J. R. Gattiker, Computer Intrusion Detection with Classification and Anomaly Detection, Using SVMs. International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17(3), pp. 441-458, 2003.
[3] W. Hu, Y. Liao, V. R. Vemuri, Robust Support Vector Machines for Anomaly Detection in Computer Security. International Conference on Machine Learning, pp. 168-174, 2003.
[4] K. A. Heller, K. M. Svore, A. D. Keromytis, S. J. Stolfo, One Class Support Vector Machines for Detecting Anomalous Windows Registry Accesses. Proceedings of IEEE Conference Data Mining Workshop on Data Mining for Computer Security, 2003.
[5] A. Lazarevic, L. Ertoz, A. Ozgur, J. Srivastava, V. Kumar, A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection. Proceedings of Third SIAM Conference on Data Mining, 2003.
[6] S. Mukkamala, A. H. Sung, Feature Selection for Intrusion Detection Using Neural Networks and Support Vector Machines. Journal of the Transportation Research Board of the National Academies, Transportation Research Record No. 1822, pp. 33-39, 2003.
[7] S. J. Stolfo, F. Wei, W. Lee, A. Prodromidis, P. K. Chan, Cost-based Modeling and Evaluation for Data Mining with Application to Fraud and Intrusion Detection. Results from the JAM Project, 1999.
[8] S. Mukkamala, B. Ribeiro, A. H. Sung, Model Selection for Kernel Based Intrusion Detection Systems. Proceedings of International Conference on Adaptive and Natural Computing Algorithms (ICANNGA), Springer-Verlag, pp. 458-461, 2005.
[9] T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[10] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees. Wadsworth and Brooks/Cole Advanced Books and Software, 1986.
[11] Salford Systems, TreeNet, CART, MARS Manual.
[12] R. J. Lewis, An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.
[13] J. H. Friedman, Stochastic Gradient Boosting. Journal of Computational Statistics and Data Analysis, Elsevier Science, Vol. 38, pp. 367-378, 2002.
[14] K. Kendall, A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master's Thesis, Massachusetts Institute of Technology (MIT), 1998.
[15] S. E. Webster, The Development and Analysis of Intrusion Detection Algorithms. Master's Thesis, MIT, 1998.
[16] W. Lee, S. J. Stolfo, A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, Vol. 3, pp. 227-261, 2000.

Note: Reference [12] was accidentally omitted during the editing process of the original manuscript. The complete reference is: R. J. Lewis, An Introduction to Classification and Regression Tree (CART) Analysis. Annual Meeting of the Society for Academic Emergency Medicine, 2000.