Security methods for Instant Messaging

Keerthi V Putane

PG Student, Dept. of CSE, MSRIT

Monica R Mundada

Associate Professor

Dept. of CSE, MSRIT, Bangalore

Abstract

Instant messaging (IM) is a form of computer-mediated communication (CMC) whose messages give a realistic picture of an author’s online writing style. IM communications use virtual identities, which hinder social accountability and facilitate IM-related cybercrimes. Criminals often use virtual identities to hide their true identity and may also supply false information with them. This paper surveys methods for IM authorship identification and validation: an intrusion detection system (IDS), an IM authorship analysis framework, a feature set taxonomy and presence-based secure instant messaging.

1. Introduction

This paper describes frameworks that apply data mining to Instant Messaging for author identification, also called authorship attribution, for the purpose of intrusion detection. We use various linguistic patterns and characteristics for author profiling and matching. Although IM is used in many legitimate activities for conversations and message exchange, it can also be misused by various means. This paper focuses on the case where an attacker attempts to masquerade as another user. It provides the foundations of a detection system that alerts the user when he/she is potentially communicating with a masquerader. The principal objective is to create a set of characteristics that remains relatively constant over a large number of messages written by an author, and to classify conversation messages as belonging to a particular author. This allows the IDS to create a profile of each user's communication characteristics.

All humans have unique patterns of behavior, much like the uniqueness of biometric data. Therefore, certain characteristics pertaining to language, composition and writing, such as particular syntactic and structural layout traits, patterns of vocabulary usage, unusual language usage and stylistic traits, will remain relatively constant. The identification and learning of these characteristics with a sufficiently high accuracy is the principal challenge in authorship identification. An important consideration is whether the writing characteristics or style of an author evolve over time or change with different contexts such as location, mood, time of day, presence of other people, etc. However, humans tend to have certain persistent personal traits. Formally, authorship identification is the task of determining the author of a piece of work. The IM authorship identification and validation system creates a user profile based on linguistic patterns and characteristics and compares future communication to the profile. Communication that does not match the profile within a predefined variance is flagged as anomalous user activity, and a warning indicator is presented. [2]

2. Related Work

There has been significant research in identifying authors of literary texts, such as Shakespeare’s works and The Federalist Papers [Men1887, MW1964]. Some of the earliest research dates back to the fourth century BC, when librarians in the library of Alexandria studied the authentication of texts attributed to Homer [Lov2002]. Other early known research dates back to the 19th century, when English logician Augustus De Morgan theorized that authorship can be determined by examining whether one text contains longer words than another.

Recent research has introduced authorship analysis to CMC, including e-mail, online newsgroups and chat groups, with promising results. Recent applications of authorship analysis in computer-mediated communications include several publications by Olivier de Vel [DeV2000, DACM2001a, DACM2001b] that studied classification of e-mail documents for the purpose of authorship identification. Another important contribution to the study of authorship analysis for cyber forensics is the research by Rong Zheng, Yi Qin, Zan Huang and Hsinchun Chen [ZQHC2003, ZLCH2006]. The authors compared techniques for automating author identification, applying several classification algorithms to extracted features such as style markers, structural features and content-specific features. A visualization-based authorship analysis study is the research by Ahmed Abbasi and Hsinchun Chen [AC2006, AC2008], which introduced an authorship writeprint visualization technique that can assist in identifying online authors based on their writing style. Another writeprint authorship analysis study is the research by Jiexun Li, Rong Zheng and Hsinchun Chen, which introduced “A method of identifying the key writeprint features for authors of online messages to facilitate identity tracing in cybercrime investigation” [LZC2006]. A study that performs both authorship identification and authorship characterization is the research by Tayfun Kucukyilmaz, B. Barla Cambazoglu, Cevdet Aykanat and Fazli Can [KCAC2008]. This research classifies online chat messages to determine both author attributes (gender, age, educational environment and Internet connection domain) and message attributes (author, receiver and time of day). A large research gap exists in applying authorship analysis techniques to instant messaging communications to determine the author identity. There are no known studies that present a comprehensive examination of IM authorship analysis. [1]

3. Research Methodologies

3.1. IM Authorship Analysis Framework

We propose an IM authorship analysis framework, shown in Figure 3.1, that extracts features from the messages to create author writeprints and applies several data mining algorithms to build classification models for automated authorship analysis.

Fig.3.1. IM Authorship Analysis Framework.

The framework's parameters (feature set, classification algorithm, number of authors and number of messages) are systematically modified in an iterative process to assess their impact on the prediction accuracy of the classification methods. The goal of the framework is to identify the combination of these parameters with the highest prediction accuracy. The framework consists of three stages: data pre-processing, feature extraction and classification. The data pre-processing stage parses the messages to extract the data for each particular author and to remove metadata and noise such as timestamps, usernames and message responses. Next, the logs are input into the extractor.pl Perl program. The extractor first splits the logs into conversations of a configurable size, such as 50 messages per conversation.
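The pre-processing stage described above can be illustrated with a minimal sketch. The original work uses a Perl program (extractor.pl) whose internals are not given here, so the Python below is only an approximation under stated assumptions: the log line format `(HH:MM:SS) user: text`, the function names `parse_log` and `split_conversations`, and the choice to drop a trailing chunk shorter than the conversation size are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) log line format: "(10:15:32) alice: hello there"
LINE_RE = re.compile(r"^\((\d{2}:\d{2}:\d{2})\)\s+(\w+):\s?(.*)$")


def parse_log(lines):
    """Group message bodies by author, dropping timestamps and usernames."""
    by_author = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip noise such as status lines
        _timestamp, author, body = match.groups()
        if body:
            by_author[author].append(body)
    return dict(by_author)


def split_conversations(messages, size=50):
    """Split messages into consecutive chunks of `size` messages.

    Assumption: a final chunk shorter than `size` is dropped.
    """
    return [messages[i:i + size]
            for i in range(0, len(messages) - size + 1, size)]
```

Each author's message list would then be passed through `split_conversations` so that every conversation handed to the feature extraction stage has the same number of messages.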

The feature extraction stage inputs conversations and pre-selected features to the totaling module of the extractor.pl program, which creates totals for each feature. The output is a writeprint (Wn) for each set of messages {M1,…,Mn} of each supplied author (An). A writeprint is an n-dimensional vector, where n represents the total number of features. Each writeprint is assigned a class, which is the author (An) of the writeprint (Wn). The extractor.pl program then outputs each writeprint in comma-separated value (CSV) format for input into the WEKA data mining toolkit.

The classification stage uses the writeprints as input to build classification models and obtain the resulting prediction accuracies. This stage uses the WEKA data mining toolkit to perform classification on the writeprints using the C4.5, k-nearest neighbor, Naïve Bayes and SVM classifiers. The cross-validation module supplies training and testing instances to the training module and testing module, respectively. The training module creates a classification model and the testing module calculates the prediction accuracy of the classification model. To evaluate the effectiveness of the prediction, the framework uses the average prediction accuracy produced by the testing module. A number of parameters affect the prediction accuracy of the data mining classifiers. To assess their impact, the framework is run repeatedly with varying numbers of authors and messages, features and classification models. [1]
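As a rough illustration of the last two stages, the sketch below serializes writeprints to CSV with the author as the class column and then estimates prediction accuracy. It does not call WEKA: a 1-nearest-neighbor classifier with leave-one-out evaluation is used as a simple stand-in for the toolkit's classifiers and cross-validation module. All function names are hypothetical.

```python
import csv
import io
import math


def writeprints_to_csv(writeprints, feature_names):
    """Serialize (feature_vector, author) pairs as CSV text with a class column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(list(feature_names) + ["author"])
    for vector, author in writeprints:
        writer.writerow(list(vector) + [author])
    return buf.getvalue()


def nearest_author(vector, training):
    """Return the author of the closest training writeprint (Euclidean distance)."""
    return min(training, key=lambda item: math.dist(vector, item[0]))[1]


def leave_one_out_accuracy(writeprints):
    """Hold out each writeprint in turn and predict its author from the rest."""
    correct = 0
    for i, (vector, author) in enumerate(writeprints):
        rest = writeprints[:i] + writeprints[i + 1:]
        correct += nearest_author(vector, rest) == author
    return correct / len(writeprints)
```

Repeating `leave_one_out_accuracy` over different feature subsets, author counts and conversation sizes mirrors, in miniature, the iterative parameter sweep the framework describes.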

3.2. IM IDS Framework

The IM IDS framework applies data mining techniques to the analysis of JIM conversation logs. The framework first applies pattern analysis to the collected user conversation logs. The goal of pattern analysis is to construct an accurate user profile based on the user's IM usage characteristics. This is a mapping of user activity over a period of time that attempts to determine what patterns a user exhibits. User profile mapping involves the use of various algorithms to build an accurate profile. Next, the framework applies an anomaly intrusion detection system. This system collects information on various user actions and characteristics and compares them to the existing user profile. Based upon the collected information, it determines whether or not a user is behaving as expected. Given the structure of Instant Messaging networks, their anonymity and the potential for masquerading, this model of intrusion detection is applicable and beneficial. Many intrusion detection systems already deploy what is known as profile-based anomaly detection. Profiles are created based on the results of the user pattern analysis.

Using these profiles, users whose actions deviate from the profile are flagged. To compensate for natural variation, each profile must include acceptable deviations that a user can exhibit without being flagged. This is termed the degree of confidence. Flexible profiles are essential for accurate authorship identification and validation. Profile comparison is performed as follows:

  1. Compare data against a user's previous actions. A single profile of a user's activity patterns can be used for quick comparisons to determine when a user strays from their personal patterns.
  2. Compare data for a new user against another user's previous actions. This attempts to identify users that use multiple IM names. For example, if a new user appears and claims to be "BobSmithatHome", the system can compare this user's actions to the legitimate "BobSmith" user profile. [2]
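The profile comparison above can be sketched as follows. This is a minimal illustration rather than the framework's actual algorithm: it assumes a profile is the per-feature mean and standard deviation of a user's past conversations, and it models the degree of confidence as a hypothetical `tolerance` on the average absolute z-score. The names `build_profile`, `deviation` and `is_anomalous` are invented for this example.

```python
from statistics import mean, pstdev


def build_profile(history):
    """Build a profile from equal-length feature vectors of past conversations.

    Returns one (mean, standard deviation) pair per feature.
    """
    columns = list(zip(*history))
    return [(mean(col), pstdev(col)) for col in columns]


def deviation(profile, vector, floor=1e-6):
    """Average absolute z-score of `vector` against the profile.

    `floor` avoids division by zero for features that never varied.
    """
    scores = [abs(x - mu) / max(sigma, floor)
              for x, (mu, sigma) in zip(vector, profile)]
    return sum(scores) / len(scores)


def is_anomalous(profile, vector, tolerance=3.0):
    """Flag a conversation whose deviation exceeds the accepted tolerance."""
    return deviation(profile, vector) > tolerance
```

The same check serves both comparisons in the list: a known user's new conversation is tested against that user's own profile, while a conversation from a new name such as "BobSmithatHome" is tested against the existing "BobSmith" profile.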

3.3. Feature Set Taxonomy

A feature set is composed of a predefined set of measurable writing style attributes. This research presents a feature set taxonomy of Instant Messaging writing style characteristics for performing authorship analysis for the purpose of cyber forensics. The proposed feature set is a 356-dimensional vector including lexical, syntactic and structural features, shown in Figure 3.3.

Fig.3.3. IM Feature Set Taxonomy

Content-specific features are highly dependent on the topic of the messages; therefore the feature set taxonomy excludes them in order to achieve generic authorship identification across various applications. Instant messaging communications have several characteristics that are useful in forming a well-rounded feature set, which may help reveal the writing style of the author. The IM feature set taxonomy includes several new stylistic features, such as abbreviations and emoticons, which are frequently found in Instant Messaging communications. Lexical features mainly consist of frequency totals and are further broken down into emoticons, abbreviations, word-based and character-based features. Syntactic features include punctuation and function words in order to capture an author’s habits of organizing sentences. Structural features capture the way an author organizes the layout of text. IM communications have no standard headers, greetings, farewells or signatures, which leaves only the average characters and words per message as structural features. Each feature in the taxonomy was selected for its relevance to IM communications. The goal of the IM feature set taxonomy is to develop a streamlined set of features that best reveals the true writing style of the author. [1]
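To make the three feature categories concrete, the following sketch computes a handful of lexical, syntactic and structural features for one conversation. The emoticon, abbreviation, function-word and punctuation lists are tiny hypothetical samples chosen for illustration; they are not the 356 features of the actual taxonomy, and the feature names are invented.

```python
import re

# Small illustrative samples only; the real taxonomy is far larger.
EMOTICONS = (":)", ":(", ";)", ":-)")
ABBREVIATIONS = ("lol", "brb", "btw", "omg", "ttyl")
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "but")
PUNCTUATION = (".", ",", "!", "?")


def extract_features(messages):
    """Return a dict of feature totals for one conversation (list of messages)."""
    text = " ".join(messages)
    words = re.findall(r"[a-z']+", text.lower())
    features = {}
    # Lexical: emoticon, abbreviation and word-based frequency totals
    for emoticon in EMOTICONS:
        features["emoticon:" + emoticon] = text.count(emoticon)
    for abbreviation in ABBREVIATIONS:
        features["abbrev:" + abbreviation] = words.count(abbreviation)
    features["word:total"] = len(words)
    # Syntactic: punctuation and function-word totals
    for mark in PUNCTUATION:
        features["punct:" + mark] = text.count(mark)
    for word in FUNCTION_WORDS:
        features["function:" + word] = words.count(word)
    # Structural: average characters and words per message
    features["struct:avg_chars"] = sum(len(m) for m in messages) / len(messages)
    features["struct:avg_words"] = len(words) / len(messages)
    return features
```

Ordering the dictionary values consistently across conversations yields the fixed-length vector that serves as a writeprint.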

3.4. Presence Enabled Proposed Secure Architecture

Presence service has proved to be a good service provisioning mechanism when it is used with other services, and its use inside the messaging service has also been proposed. Normally, users have to subscribe to the presence service, and the subscription and notification messages put an extra load on the air interface of the network. In our solution there is no need to subscribe to the presence service. Presence information is checked and verified while the Instant Message is in transit.

Whenever a user sends an immediate message or tries to initiate a session with a particular user, the message or request is received by the P-CSCF. After receiving the request, the P-CSCF forwards it to the S-CSCF. In the normal scenario, the S-CSCF performs the necessary actions and then forwards the message or request to the receiver’s P-CSCF or to a particular application server. In our solution, however, the S-CSCF checks the presence information of both the sender and the receiver before forwarding a message or request to the receiver.

Fig.3.4.1. Presence enabled instant messaging architecture.

Fig.3.4.2. Presence enabled instant messaging flow

Checking the presence information of the sender is necessary because, if someone is misusing (spoofing) the identity of the sender, this can be detected from the sender's presence information. If the presence information of the sender contradicts the sent request or message, the identity of the sender is being misused by someone else. In that case the message or request is discarded and the sender is notified about the situation. If the message or request does not contradict the presence information of the sender, it is then matched against the presence information of the receiver. The receiver's presence information may state that he/she is currently not willing to receive any Instant Message, or is in a meeting and does not want to receive any message. In this case, forwarding the request or message to the receiver would not only disturb the receiver but also waste valuable resources. So in our solution, if the message or request contradicts the receiver’s presence information, it is discarded immediately. [3]
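The two-step check performed at the S-CSCF can be summarized in a short sketch. The text does not specify what the presence information contains or how a contradiction is detected, so the fields below (`online`, `contact`, `accepts_messages`) and the rule that a sender contradiction means "offline or sending from an unregistered contact" are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    FORWARD = "forward"
    DISCARD_NOTIFY_SENDER = "discard_notify_sender"
    DISCARD = "discard"


@dataclass
class Presence:
    online: bool
    contact: str                   # hypothetical: address the user registered from
    accepts_messages: bool = True  # hypothetical: False when busy or in a meeting


def check_message(source_contact, sender, receiver):
    """Decide what to do with a message before forwarding it to the receiver."""
    # Step 1: the request must agree with the sender's presence information;
    # otherwise the identity may be spoofed, so discard and notify the sender.
    if not sender.online or sender.contact != source_contact:
        return Decision.DISCARD_NOTIFY_SENDER
    # Step 2: the receiver must be willing to receive messages right now.
    if not receiver.accepts_messages:
        return Decision.DISCARD
    return Decision.FORWARD
```

Checking the sender first means a spoofed request is rejected before any receiver-side information is consulted, matching the order given in the text.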

4. Conclusion and Future Work

The Instant Messaging market has seen explosive growth, with millions of users participating in online conversations. Little has been explored in terms of research and analysis of the network, messages, user behavior and data mining of these systems. Users have several concerns about Instant Messaging systems. One such concern is whether you are really talking to the person you think you are talking to. The threats involve account hijacking, man-in-the-middle attacks and masquerading. There are various reasons someone would wish to masquerade as someone else, including spying, disgruntlement, snooping or malicious intentions. The IM framework has to analyze user behavior in Instant Messages to create a user profile of activity. Future conversations are matched against the user profile for authorship identification and validation.

Human behavior presents its own challenges to analysis. It has an extremely wide range of "normal" and can be very unpredictable; abnormal activities are sometimes perfectly normal and, most of all, people change. By definition, an intrusion detection system alerts when an event has triggered an alarm. In an IM authorship identification and validation system, however, an adequate comparison requires enough message text, so a user will not know that someone is masquerading until some time into the conversation. A time period therefore exists in which a user may be conversing with a masquerader before an alert is triggered.

This paper presented the unexplored area of data mining in instant message communications. It focused on applying character frequency analysis to IM messages for authorship attribution and instance-based learning to detect anomalies in user activity. The experiments supported the identification of IM authors based on user behavior. The experiments also supported the use of the nearest neighbor classification method to classify new conversations with known profiles. The data showed that users tend to exhibit the same character frequency characteristics throughout different conversations. The data also showed that conversations with different users exhibit different characteristics. This is supported by various measures such as use or non-use of the various classes of characters, variability in character use within a class and character outliers used for user profiling. Character frequency analysis is only one subset of the stylometric features used by an IM anomaly intrusion detection system. The remaining stylometric features must also be used to create an accurate, well-rounded user profile.

There are several future avenues available for research in this area, such as statistical user profiling, pattern analysis using clustering, computational linguistics and user anomaly detection. There are also several aspects of the IM intrusion detection system that still need to be explored, such as the frequency of profile computation, the resolution of a nearest neighbor tie, the application of weighted attributes, the storage of the complete training instance and other classification performance enhancements. More research is already underway to incorporate the remaining stylometric features into the anomaly detection system and to automate the entire data collection, mining and detection processes. [2]

5. Acknowledgement

I am grateful to the management of my institution, M S Ramaiah Institute of Technology, for its ideals and inspiration and for providing the facilities that made this technical paper a success.

I would like to extend my regards to my HOD, Dr. K G Srinivas, Department of Computer Science and Engineering, for his support in all endeavours. I also extend my sincere gratitude to my guide, Dr. Monica R Mundada, Associate Professor, Department of Computer Science and Engineering, for her technical guidance in the completion of this technical paper.

6. References

[1] Angela Orebaugh and Jeremy Allnutt. “Data Mining Instant Messaging Communications to Perform Author Identification for Cybercrime Investigations”. S. Goel (Ed.): ICDF2C 2009, LNICST 31, pp. 99-110, 2010.

[2] Angela Orebaugh. “An Instant Messaging Intrusion Detection System Framework: Using Character Frequency Analysis for Authorship Identification and Validation”. 1-4244-0174-7/06 ©2006 IEEE.

[3] Zeeshan Shafi Khan, Muhammad Sher and Khalid Rashid. “Presence Based Secure Instant Messaging Mechanism for IP Multimedia Subsystem”. B. Murgante et al. (Eds.): ICCSA 2011, Part V, LNCS 6786, pp. 447-457, 2011. © Springer-Verlag Berlin Heidelberg 2011.

[4] Mohammad Mannan and Paul C. van Oorschot. “A Protocol for