On Integrating System Support for Intelligence/Information Fusion
A final report submitted to the Network Information and Space Security Center (NISSC) for the Spring 2004 Sponsored Project
Ganesh Godavari
C. Edward Chow
Marijke Augusteijn
Xiaobo Zhou
Abstract
This paper identifies critical tasks in the large-scale information fusion process in the context of enterprise intrusion detection. An algorithm for alert refinement is designed and implemented. Results on several intrusion alert datasets and the performance of the algorithm are presented. They show that the algorithm achieves significant data reduction without losing important alert information. We discuss how the alert refinement results can be used for decision support and how the components of the multi-tier information fusion system further correlate and verify the refined alert results.
1. Project Goal
The project also aims to investigate the challenging issues of how to exchange, verify, and correlate intelligence information for decision support, and how to allocate and coordinate sensors in different agencies for a set of tasks with different priorities.
When an intruder attacks a system, the ideal response would be to stop the activity before any damage or access to sensitive information occurs. This would require recognizing the attack in real time, as it takes place. A single instance of suspect behavior on a host in a network may not warrant any serious action. However, repeated suspect behavior across several hosts in a network may indeed suggest an attack, with a response definitely warranted. This is very difficult for a human to recognize because Intrusion Detection Systems (IDSes) typically produce many false positive alerts. This is understandable, because an IDS usually has no way of knowing whether the attack actually succeeded or failed. Alerts from various IDSes therefore need to be correlated so that an effective decision can be made based on alerts from multiple sources.
IDSes typically provide a constant feed of new alerts, which are written to a log file. To achieve the correlation objective, we propose refining alerts at each host before information from the various IDSes is sent on, based on the following three functions:
Alert management function: alert messages generated by different IDSes are stored and managed in a local database.
Alert clustering function: occurrences of an attack are mapped into a "cluster" based on the temporal relationship between alerts.
Alert decision function: alerts from each cluster identified by the clustering function are analyzed and correlated in IDMEF[].
Two critical requirements for data refinement are a) removal of unwanted or unrelated information and b) high-level abstraction of data without loss of its meaning. In intrusion detection, removal of unrelated data requires knowledge of external environment variables such as human interference, knowledge of the network, vocabulary, etc. Data refinement methods that require external information often lead to misinterpretation or loss of data in the high-level decision-making process. For example, removal of alerts from a friendly host may backfire if the friendly host has been compromised.
One of the interesting problems encountered during data refinement in the mining of intrusion detection data is the high dimensionality of the data. A dimension refers to an attribute that one is interested in examining in order to come to a meaningful conclusion. Applying traditional data mining algorithms to high-dimensional data does not provide optimal results, as they look at a single dimension [paper references goes here].
With the number of intrusion and hacking incidents around the world on the rise, the importance of having dependable intrusion detection systems in place is greater than ever. Having an IDS in place does not by itself solve the problem, as organizations frequently have trouble addressing the challenges of detecting, alerting on, and responding to unauthorized access to their computing environment.
IDSes can be categorized into host-based IDSes and network-based IDSes. Host-based IDSes typically generate alerts based on events generated by the host's operating system (e.g., syslog, Solaris BSM log, Windows NT event log). Network-based IDSes inspect packets as they pass over the network, but can only see the packets carried on the network segment they are attached to. Packets are considered to be of interest if they match a signature or pattern; network-based IDSes look primarily for string, port, and header-format patterns. A network-based IDS can detect exploits such as IP spoofing but is blind to buffer overflow exploits, which can be identified by a host-based IDS. An effective defense against intrusions therefore requires both host-based and network-based IDSes, and cooperation between them.
One of the major challenges for today's IDSes is measuring their performance effectively. The effectiveness of intrusion detection refers to the ability to efficiently and correctly classify events as malicious or not. This most often involves a binary (yes/no) decision, with a percentage of certainty.
In the case of malicious activity on the network, countermeasures such as blocking or limiting traffic typically require the following attributes: {Source IP address, Destination IP address, Target Service}. Malicious activities typically create abnormal behavioral patterns; for example, a large number of Internet Control Message Protocol (ICMP) host unreachable messages in a network is abnormal. Such activities can be treated as alerts. A false positive alert in an IDS is an attack alarm that is raised incorrectly. For example, a User Datagram Protocol (UDP) scan can be performed to determine the list of live hosts in an address range. A UDP message to a nonexistent machine results in an ICMP host unreachable error. Suppose an IDS detects this type of scan based on a certain threshold of ICMP host unreachable error messages. If someone was actually attempting something malicious, the alarm is not a false positive. The alert is a false positive if the ICMP host unreachable errors were raised because of a faulty router.
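The threshold-based detection described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular IDS: the function name, the event tuple layout (timestamp in seconds, address of the host that received the ICMP errors), and the threshold and window values are all assumptions made for the example.

```python
from collections import defaultdict, deque

def detect_unreachable_bursts(events, threshold=3, window=10.0):
    """Flag hosts that receive many ICMP host-unreachable errors in a short time.

    events: iterable of (timestamp_seconds, prober_ip) tuples, ordered by time,
            where prober_ip is the host the ICMP errors were sent back to.
    threshold, window: hypothetical tuning values, not taken from the report.
    """
    recent = defaultdict(deque)  # prober_ip -> timestamps inside the window
    flagged = set()
    for ts, prober_ip in events:
        stamps = recent[prober_ip]
        stamps.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and ts - stamps[0] > window:
            stamps.popleft()
        if len(stamps) >= threshold:
            flagged.add(prober_ip)
    return flagged

events = [(0.0, "10.0.0.5"), (0.0, "10.0.0.9"), (1.0, "10.0.0.5"),
          (2.0, "10.0.0.5"), (50.0, "10.0.0.9")]
print(detect_unreachable_bursts(events))
```

A counter of this kind cannot tell a UDP scan from a faulty router: both produce the same burst of errors, which is exactly why the resulting alert may be a false positive.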
Discuss why alert refine and correlation problem are challenging and difficult? Use synchronized attacks as examples.
Because the overall number of alerts generated by an IDS is overwhelming for a human operator to handle, there is a need to reduce the alerts that falsely indicate security issues. These alerts require investigation of audit events for diagnosis by human operators or automated agents. Most of the time, the diagnosis information associated with the alerts is so poor that the operator must go back to the original data source to understand the diagnosis and assess the real severity of the alert. Care must also be taken to remove or minimize the number of undetected attacks (false negatives) in order to increase the effectiveness of intrusion detection.
2. Literature Survey
2.1 Common Intrusion Detection Framework and Inter-Component Negotiation
One application area of information fusion is enterprise intrusion detection and handling [6,7,16]. In an enterprise system, intrusion detection devices, firewalls, and routers are deployed strategically throughout the networks. Traditionally, intrusion and traffic data are collected and forwarded to a database server for further analysis. Distributed intrusion detection aims at coordinating intrusion detection in a distributed manner. Fusing local intrusion data and correlating them with traffic data not only reduces the volume of data to be processed, but also provides an opportunity for early preventive defense. The Common Intrusion Detection Framework (CIDF) [11,12] is a DARPA-initiated effort to define a common set of APIs and protocols for IDS component interoperability. Protocols for the configuration and coordination of enterprise intrusion detection systems are an important area of research [8,9,10,13,14,15]. The framework, related protocols, and lessons learned from the information fusion tasks in enterprise intrusion detection/handling systems can be extended to other information fusion systems. We have examined cyber attack scenarios and studied how CIDF, IDIP, or IDIAN can be used to defend against such attacks. A testbed for IDIP-based enterprise intrusion detection and pushback was established to examine the performance of IDIP-based protocols [7]. We are investigating how large-scale compromised node detection and network quarantine can be handled with the help of CIDF, IDIP, and IDIAN.
2.2 Outlier Detection Using Clustering
Typical clustering approaches for categorical data, such as in [Guha et al., 1999], are not generally available commercially. Unsupervised approaches for detecting outliers in large data sets for the purposes of fraud or intrusion detection are starting to appear in the literature, but these approaches are primarily based on ordered data. Knorr and Ng [1998] developed a distance-based clustering approach for outlier detection in large data sets. Ramaswamy et al. [2000] define a new outlier criterion based on the distance of a point to its kth nearest neighbor. Breunig et al. [2000] define a new local outlier factor, which is the degree to which a data point is an outlier.
Anomaly detection papers: this cannot be done at this stage, but may be possible during the decision-making stage (read the papers once again).
icdm02: mines association rules using a clustering approach, but is more relevant to databases with ordered datasets than to unordered datasets.
lpminer tech report: discusses finding frequent item-sets with decreasing support/threshold/occurrence. However, it does not take the position of an item into consideration in its processing.
bloedorn_datamining_report: describes how various clustering methods can be used for intrusion detection.
slct-ipom03: the paper that we drew inspiration from; its algorithm does not do any temporal relationship identification.
2.3 IDMEF
Intrusion Detection Message Exchange Format: The Intrusion Detection Working Group of the Internet Engineering Task Force (IETF) has proposed the Intrusion Detection Message Exchange Requirements [], which, in addition to defining the requirements for the Intrusion Detection Message Exchange Format, also specify the architecture of an IDS.
The Intrusion Detection Message Exchange Format Data Model (IDMEF) and accompanying Extensible Markup Language Document Type Definition [3] is a profound effort to establish an industry-wide data model which defines computer intrusions. IDMEF, however, has its shortcomings. Specifically, it uses XML which is limited to a syntactic representation of the data model which does not convey the semantics, relationships, attributes and characteristics of the objects which it represents. This limitation requires that each individual IDS interpret and implement the data model programmatically.
Figure 1: IDMEF model with some core class information
As Figure 1 shows, an IDMEF message can be either an alert or a heartbeat message.
The Alert Class: Every time an analyzer detects an event that it has been configured to look for, it sends an Alert message to its manager(s). Depending on the analyzer, an Alert message may correspond to a single detected event or to multiple detected events. Alerts occur asynchronously in response to outside events. The Alert class has one optional attribute, ident, a unique identifier for the alert. The ident attribute is unique only within the sensor; to make the alert unique within the whole intrusion detection infrastructure, it must be combined with the analyzerid of the Analyzer class. An Alert is represented in the XML DTD as follows:
<!ELEMENT Alert (
    Analyzer, CreateTime, DetectTime?, AnalyzerTime?, Source*,
    Target*, Classification+, Assessment?, (ToolAlert |
    OverflowAlert | CorrelationAlert)?, AdditionalData*
  )>
<!ATTLIST Alert
    ident               CDATA                   '0'
  >
Classification Class: The main purpose of the Classification class is to provide an alert name, and indicate where additional information may be found. The "name" of the alert or other information allows the manager to determine what the alert is.
The Analyzer Class: The Analyzer class identifies the analyzer from which the alert or heartbeat message originates. Only one analyzer may be encoded for each alert or heartbeat, and that MUST be the analyzer at which the alert or heartbeat originated. The Analyzer class has two main purposes during correlation. First, the analyzerid (together with ident values) may be used to uniquely identify which sensor produced a specific alert message. This information is important when performing inter-sensor correlation.
The AdditionalData Class: The purpose of this class is to provide a means of extending the basic IDMEF data model. It carries additional information from the analyzer that does not fit into the data model. This may be an atomic piece of data, or a large amount of data provided through an extension to the IDMEF.
The Heartbeat Class: Analyzers use Heartbeat messages to indicate their current status to managers. Heartbeats are intended to be sent at regular intervals, say every ten minutes or every hour. The receipt of a Heartbeat message from an analyzer indicates to the manager that the analyzer is up and running; the absence of one or more consecutive Heartbeat messages indicates that the analyzer or its network connection has failed.
Source Class: The main purpose of the Source class is to identify the possible source(s) of the event(s) that generated an alert. An event may have more than one source in cases such as distributed attack scenarios.
Target Class: The purpose of the target class is to specify possibly affected entities. Target class contains information about the possible target(s) of the event(s) that generated an alert.
The Time classes have one attribute (ntpstamp) representing different aspects of date and time. This makes it possible, first, to pinpoint the actual time of the intrusion and, second, to correlate alerts in time.
CreateTime Class: The CreateTime class is used to indicate the date and time the alert or heartbeat was created by the analyzer.
DetectTime Class: The DetectTime class contains the date and time at which the event(s) leading to the alert were first detected.
AnalyzerTime Class: The AnalyzerTime class is used to indicate the current date and time on the analyzer. This class should be filled in just before the message is transmitted for further processing.
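To make the class descriptions above more concrete, the following sketch assembles a simplified Alert element and derives the infrastructure-wide key (analyzerid plus ident) discussed under the Alert class. It is an illustration under stated assumptions, not a conforming IDMEF implementation: the helper names build_alert and alert_key are hypothetical, the child layout under Source, Target, and Classification is simplified, the ntpstamp value is a placeholder rather than a real timestamp, and the result is not validated against the DTD.

```python
import xml.etree.ElementTree as ET

def build_alert(ident, analyzerid, ntpstamp, source_ip, target_ip, name):
    """Build a simplified IDMEF-style Alert element (not DTD-validated)."""
    alert = ET.Element("Alert", {"ident": ident})
    # Child order follows the Alert content model: Analyzer, CreateTime,
    # Source, Target, Classification (optional elements are omitted).
    ET.SubElement(alert, "Analyzer", {"analyzerid": analyzerid})
    ET.SubElement(alert, "CreateTime", {"ntpstamp": ntpstamp})
    for tag, ip in (("Source", source_ip), ("Target", target_ip)):
        node = ET.SubElement(ET.SubElement(alert, tag), "Node")
        address = ET.SubElement(node, "Address")
        ET.SubElement(address, "address").text = ip
    classification = ET.SubElement(alert, "Classification")
    ET.SubElement(classification, "name").text = name
    return alert

def alert_key(alert):
    """ident is unique only per sensor, so pair it with the analyzerid."""
    return (alert.find("Analyzer").get("analyzerid"), alert.get("ident"))

alert = build_alert("42", "sensor-1", "0x0.0x0",  # placeholder ntpstamp
                    "206.133.123.19", "172.16.1.107", "web-cgi-phf")
print(alert_key(alert))
print(ET.tostring(alert, encoding="unicode"))
```

Using the (analyzerid, ident) pair as a dictionary key is one straightforward way to keep alerts from different sensors apart during inter-sensor correlation.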
3. Alert Refinement
Log files typically contain information about various activities performed on the system. Data mining provides valuable information, which can be used not only to identify but also to analyze the intrusion pattern of the intruder. Table 1 and Table 2 show, respectively, malicious activity captured using Snort and the system log of a malicious remote user trying to gain root access.
## Signatures captured using snort (
#
May 9 08:02:54 lisa snort[2370]: spp_portscan: PORTSCAN DETECTED from 216.61.43.89
May 9 08:21:02 lisa snort[2370]: spp_portscan: PORTSCAN DETECTED from 204.2.13.22
May 9 09:39:28 lisa snort[2370]: IDS212/dns-zone-transfer: 206.133.123.19:2421 -> 172.16.1.101:53
May 9 11:03:20 lisa snort[2370]: IDS197/trin00-master-to-daemon: 137.132.17.202:2984 -> 172.16.1.107:27444
May 9 11:03:20 lisa snort[2370]: IDS187/trin00-daemon-to-master-pong: 172.16.1.107:1025 -> 137.132.17.202:31335
May 9 11:26:04 lisa snort[2370]: IDS197/trin00-master-to-daemon: 137.132.17.202:2988 -> 172.16.1.107:27444
May 9 11:26:04 lisa snort[2370]: IDS187/trin00-daemon-to-master-pong: 172.16.1.107:1027 -> 137.132.17.202:31335
May 9 14:04:55 lisa snort[2370]: spp_portscan: PORTSCAN DETECTED from 206.133.123.19
May 9 14:04:57 lisa snort[2370]: IDS8/telnet-daemon-active: 172.16.1.101:23 -> 206.133.123.19:1720
May 9 14:04:58 lisa snort[2370]: IDS8/telnet-daemon-active: 172.16.1.101:23 -> 206.133.123.19:1741
May 9 14:05:08 lisa snort[2370]: IDS128/web-cgi-phf: 206.133.123.19:1815 -> 172.16.1.107:80
May 9 14:05:09 lisa snort[2370]: IDS218/web-cgi-test-cgi: 206.133.123.19:1820 -> 172.16.1.107:80
May 9 14:05:09 lisa snort[2370]: IDS235/web-cgi-handler: 206.133.123.19:1824 -> 172.16.1.107:80
May 9 20:48:14 lisa snort[2370]: IDS197/trin00-master-to-daemon: 137.132.17.202:3076 -> 172.16.1.107:27444
May 9 20:48:14 lisa snort[2370]: IDS187/trin00-daemon-to-master-pong: 172.16.1.107:1028 -> 137.132.17.202:31335
Table 1 shows probes captured using snort [
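As an illustration of how lines such as those in Table 1 could be turned into the {Source IP address, Destination IP address, Target Service} attributes, the sketch below parses the signature alerts with a regular expression. The pattern and the function name parse_snort_line are assumptions fitted to the sample lines only; they are not a general Snort log grammar, and portscan lines (which carry no destination) are deliberately left unmatched.

```python
import re

# Pattern fitted to the sample lines in Table 1; an assumption, not a full grammar.
SNORT_ALERT = re.compile(
    r"snort\[\d+\]: (?P<signature>\S+): "
    r"(?P<src>\d+\.\d+\.\d+\.\d+):(?P<sport>\d+) -> "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+):(?P<dport>\d+)"
)

def parse_snort_line(line):
    """Return the alert attributes as a dict, or None if the line does not match."""
    match = SNORT_ALERT.search(line)
    if match is None:
        return None
    return {
        "signature": match.group("signature"),
        "source_ip": match.group("src"),
        "destination_ip": match.group("dst"),
        "target_port": int(match.group("dport")),
    }

line = ("May 9 09:39:28 lisa snort[2370]: IDS212/dns-zone-transfer: "
        "206.133.123.19:2421 -> 172.16.1.101:53")
print(parse_snort_line(line))
```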
## System logs showing malicious activity
#
Aug 25 13:32:51 malice sshd[1632]: Failed password for illegal user user from 65.75.130.110 port 41963 ssh2
Aug 25 13:32:54 malice sshd[1634]: Failed password for root from 65.75.130.110 port 42068 ssh2
Aug 25 13:32:57 malice sshd[1636]: Failed password for root from 65.75.130.110 port 42179 ssh2
Aug 25 13:33:00 malice sshd[1638]: Failed password for root from 65.75.130.110 port 42280 ssh2
Aug 25 13:48:09 malice sshd[1657]: Failed password for test from 65.75.130.110 port 45824 ssh2
Aug 25 13:48:09 malice sshd[1659]: Illegal user guest from 65.75.130.110
Table 2 shows malicious activity from 65.75.130.110
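Repeated failures from one address, as in Table 2, are easy to surface by counting. The sketch below is a minimal example under the assumption that the sshd lines look like the samples above; the regular expression and the function name failed_logins_by_source are hypothetical and cover only the "Failed password" form.

```python
import re
from collections import Counter

# Matches the "Failed password" lines shown in Table 2 (an assumed format).
FAILED_PASSWORD = re.compile(
    r"sshd\[\d+\]: Failed password for (?:illegal user )?(?P<user>\S+) "
    r"from (?P<ip>\S+) port \d+"
)

def failed_logins_by_source(lines):
    """Count failed password attempts per source IP address."""
    counts = Counter()
    for line in lines:
        match = FAILED_PASSWORD.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts

log = [
    "Aug 25 13:32:51 malice sshd[1632]: Failed password for illegal user user from 65.75.130.110 port 41963 ssh2",
    "Aug 25 13:32:54 malice sshd[1634]: Failed password for root from 65.75.130.110 port 42068 ssh2",
    "Aug 25 13:48:09 malice sshd[1659]: Illegal user guest from 65.75.130.110",
]
print(failed_logins_by_source(log))
```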
The goal was to design an algorithm that is fast and detects clusters present in subspaces of the original data space.
The data space (R^n) is assumed to contain data points with categorical attributes (i.e., attributes have a small number of unordered values), where each point represents a line from a log file data set. The attributes of each data point are the words from the corresponding log file line. The data space has n dimensions, where n is the maximum number of words per line in the data set.
A region S is a subset of the data space in which certain attributes word1, ..., wordk (1 ≤ k ≤ n) of all points that belong to S have identical values value1, ..., valuek;
that is, for every line that belongs to S, line[word1] = value1, ..., line[wordk] = valuek.
All lines in the dataset containing the attribute-value pairs {(word1, value1), (word2, value2), (word3, value3), …, (wordk, valuek)} are therefore grouped into a cluster.
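The region definition above can be illustrated with a small sketch in the spirit of SLCT-style clustering, where a (position, word) pair plays the role of an attribute-value pair. This is not the algorithm developed in this project: the function name cluster_lines, the whitespace tokenization, and the support threshold are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def cluster_lines(lines, support=2):
    """Group log lines that share the same frequent (position, word) pairs.

    support: hypothetical minimum count for a pair, and for a cluster, to be kept.
    """
    tokenized = [line.split() for line in lines]

    # Pass 1: count how often each word occurs at each position.
    frequency = Counter()
    for words in tokenized:
        for position, word in enumerate(words):
            frequency[(position, word)] += 1

    # Pass 2: the frequent pairs of a line define the region it falls into.
    clusters = defaultdict(list)
    for line, words in zip(lines, tokenized):
        region = tuple((position, word) for position, word in enumerate(words)
                       if frequency[(position, word)] >= support)
        if region:
            clusters[region].append(line)

    # Keep only regions that are populated densely enough.
    return {region: members for region, members in clusters.items()
            if len(members) >= support}

lines = [
    "sshd Failed password for root from 10.0.0.1",
    "sshd Failed password for root from 10.0.0.2",
    "sshd Failed password for test from 10.0.0.3",
    "kernel out of memory",
]
for region, members in cluster_lines(lines).items():
    print(region, len(members))
```

In this example the two root login failures fall into one region; the variable words (here the IP addresses) are simply not part of the region's attribute set.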
Log files contain a timestamp of when each event was logged by the system. This timestamp can be used for correlating log events to provide qualitative temporal relationships between interval-based events. With these relationships one can provide more concise information about the cluster. Listed below are the seven interval-based relations from [James F. Allen, Maintaining knowledge about temporal intervals].
Relation / Meaning
e1 equal e2 / e1.begin_time == e2.begin_time and e1.end_time == e2.end_time
e1 before e2 / e1.end_time < e2.begin_time
e1 meets e2 / e1.end_time == e2.begin_time
e1 overlaps e2 / e1.begin_time < e2.begin_time and e1.end_time < e2.end_time and e1.end_time > e2.begin_time
e1 during e2 / e1.begin_time > e2.begin_time and e1.end_time < e2.end_time
e1 starts e2 / e1.begin_time == e2.begin_time and e1.end_time < e2.end_time
e1 finishes e2 / e1.begin_time > e2.begin_time and e1.end_time == e2.end_time
Table showing the qualitative temporal relationships
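The relations in the table translate directly into comparisons on interval endpoints. The sketch below is one possible encoding, assuming each event carries begin_time and end_time with begin_time <= end_time; the Event type and the relation function are illustrative names, not part of the project code.

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    begin_time: float
    end_time: float

def relation(e1: Event, e2: Event) -> Optional[str]:
    """Return the relation of e1 to e2 per the table, or None if none applies."""
    if e1.begin_time == e2.begin_time and e1.end_time == e2.end_time:
        return "equal"
    if e1.end_time < e2.begin_time:
        return "before"
    if e1.end_time == e2.begin_time:
        return "meets"
    if e1.begin_time < e2.begin_time < e1.end_time < e2.end_time:
        return "overlaps"
    if e1.begin_time > e2.begin_time and e1.end_time < e2.end_time:
        return "during"
    if e1.begin_time == e2.begin_time and e1.end_time < e2.end_time:
        return "starts"
    if e1.begin_time > e2.begin_time and e1.end_time == e2.end_time:
        return "finishes"
    # None of the seven relations holds in this direction;
    # relation(e2, e1) may hold instead.
    return None

print(relation(Event(1, 4), Event(3, 6)))
```

When the function returns None, swapping the arguments gives the inverse view, for example relation(Event(3, 4), Event(1, 2)) is None while relation(Event(1, 2), Event(3, 4)) is "before".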