A Collaborative Approach to Spam E-Mail Filtering: Recommendation Using Ontological User Profiling
Dr. Lina Zhou
Xiaoli Jiao
Department of Information Systems
University of Maryland Baltimore County
Abstract
Most definitions assume UCE (Unsolicited Commercial Email) and spam to be synonymous. However, people do not classify email as spam objectively, purely on whether it fits a definition, but subjectively, on whether the email is of interest to them. It is noteworthy that some consider email to be spam even if they have explicitly given the sender permission to contact them. This reflects some of the conundrums of legislative debates on spam.
Server-based collaborative filtering has been very successful in many systems, but such systems favor ubiquitous computing settings. We propose a personalized, collaborative approach to filtering spam using a recommendation system. In addition, we explore a novel ontological approach to improving user profiling and hence recommendation accuracy.
1. Introduction
The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. There are a number of very successful collaborative filters today: Vipul’s Razor [17], Distributed Checksum Clearinghouse [18] and SpamNet [19]. However, very few of them take domain or user personalization into account.
Studies [16] show that people have their personal views on what constitutes spam. A centralized server filter will cause false positives for users whose opinions differ from the majority. In our research, we address the issue with the aid of user profiling in collaborative spam email filtering.
The user profiling approach used by most recommendation systems is behavior based, commonly using a binary class model to represent what users find interesting and uninteresting. However, a binary profile does not lend itself to sharing examples of interest or integrating any domain knowledge that might be available.
We use the term ontology to refer to the classification structure and instances within a knowledge base. We propose an approach to filtering spam email that combines collaborative and content-based recommendation techniques and represents user profiles in ontological terms.
2. Related work
Alan Gray and Mads Haahr analyzed three assumptions implicit in centralized spam filtering and described how they affect spam filtering. They presented an architecture for personalized, collaborative spam filtering and described the design and implementation of a proof-of-concept, peer-to-peer, signature-based system based on that architecture. Ernesto Damiani and his colleagues proposed a decentralized, privacy-preserving approach to spam filtering in their paper “P2P-Based Collaborative Spam Detection and Filtering.” They exploit digests to identify messages that are slight variations of one another, and a structured peer-to-peer architecture between mail servers to collaboratively share knowledge about spam.
3. Our Proposed Approach
We propose a personalized, collaborative approach to filtering spam using a recommendation system. Our approach is intended to improve the overall performance of conventional server-based collaborative filters. Recommendations are based on a group of users with similar profiles, so that when a false positive or false negative happens to one user, the correction helps the others. A user interest profile is computed by correlating previous email messages with their classifications, and it forms the part of the user profile used to determine similarities among users. We use ontological inference to improve user profiling and hence recommendation accuracy.
The proposed approach leaves much flexibility for user input and feedback. Users can define parameters such as cost values, interest topics, recommendation groups, and trust levels. For any parameter a user leaves unspecified, the system can work in the background and determine it by watching the user’s behavior.
3.1 Message representation and classification
Messages are represented as term vectors in which each term has a weight. Similarities among messages can be computed from their term vectors. Features may include the sender’s ID (name, domain, IP address), subject, classification, and so on.
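As an illustration of comparing messages through their term vectors, the following minimal sketch uses raw term frequencies as weights and cosine similarity as the similarity measure. Both choices are assumptions made for the example, as are the function names; the proposal does not fix a weighting scheme or a similarity function.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a sparse term vector (term -> weight) from raw text.

    Assumption: the weight is the plain term frequency.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first = term_vector("Cheap loans, apply now")
    second = term_vector("Apply now for cheap loans")
    print(cosine_similarity(first, second))
```

Other features mentioned above, such as the sender’s domain, could be added to the same vector as extra terms.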
The classification of a message is determined by its content. Several machine learning methods, such as k-Nearest Neighbor, Expectation Maximization (EM), Support Vector Machines, Naïve Bayes and decision trees, may be applied to develop a classifier.
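To make one of these options concrete, the sketch below shows a small multinomial Naïve Bayes classifier with add-one smoothing over tokenized messages. It is only an illustration under stated assumptions: the class name, the two labels, and the smoothing details are chosen for the example and are not prescribed by the proposal.

```python
import math
from collections import Counter

LABELS = ("spam", "legitimate")


class NaiveBayesFilter:
    """Multinomial Naive Bayes over message tokens (illustrative sketch)."""

    def __init__(self):
        self.term_counts = {label: Counter() for label in LABELS}
        self.message_counts = Counter()

    def train(self, tokens, label):
        """Record one labelled message given as a list of tokens."""
        self.term_counts[label].update(tokens)
        self.message_counts[label] += 1

    def log_score(self, tokens, label):
        """Log of prior times likelihood for one label."""
        vocabulary = set()
        for counts in self.term_counts.values():
            vocabulary.update(counts)
        total_terms = sum(self.term_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Smoothed prior, so an unseen class never yields log(0).
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        # Add-one smoothing; the extra +1 reserves a slot for unknown terms.
        denominator = total_terms + len(vocabulary) + 1
        for token in tokens:
            score += math.log((self.term_counts[label][token] + 1) / denominator)
        return score

    def classify(self, tokens):
        """Return the label with the higher score (ties go to 'spam')."""
        return max(LABELS, key=lambda label: self.log_score(tokens, label))
```

A per-user instance of such a classifier could be trained on that user’s own spam and legitimate folders.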
To help identify spam that varies only slightly in content, a signature is computed for every new email received and compared to the signatures of known messages. For the signature algorithms to be robust (or “fuzzy”, i.e. to ignore small randomizations in the text), they need to be content-aware so that unimportant discrepancies between messages do not change the signature. When a new message arrives for a user, its signature is first compared to the known message signatures of this user. If there is a match, its spamminess is taken from the known message and no further steps are needed.
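The signature lookup can be sketched as follows. This is a deliberately simple stand-in, not the algorithm used by any of the filters cited in this paper: it assumes that stripping case, digits, punctuation and extra whitespace before hashing is enough to absorb small randomizations. The function names and the dictionary of known signatures are hypothetical.

```python
import hashlib
import re


def fuzzy_signature(body):
    """Digest of a normalized message body.

    Assumption: normalization removes the parts of the text that spammers
    typically randomize (case, digits, punctuation, spacing).
    """
    text = body.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # drop digits and punctuation
    text = " ".join(text.split())         # collapse runs of whitespace
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def lookup(body, known_signatures):
    """Return the stored label for a matching signature, or None.

    known_signatures maps signature -> label ("spam" or "legitimate")
    for messages this user has already classified.
    """
    return known_signatures.get(fuzzy_signature(body))


if __name__ == "__main__":
    known = {fuzzy_signature("Buy now!!! Offer 12345"): "spam"}
    print(lookup("buy   NOW offer 99871", known))
    print(lookup("Minutes of the project meeting", known))
```

Only when `lookup` finds no match would the message be passed on to the recommendation step.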
3.2 Recommendation
A threshold is set for every user based on the cost values assigned to false positive and false negative messages. When a message arrives, an accumulated value is calculated from the recommendations of a group of users with similar profiles. If the accumulated value is greater than the threshold, the message is classified as spam for this user.
A user can define the user group from which recommendations are made, or let the system define one based on the similarity of user profiles. The accumulated value for a message is calculated from the recommendation value and the trust level of each user in this user’s recommendation group. The system looks into the message folders of each user in the recommendation group and assigns a recommendation value indicating how likely the message is to be spam, by comparing its similarity with previous messages, either spam or legitimate. (This procedure can be seen as a particular message classification problem in which only two classes are available: spam or legitimate. In this way, a classifier is trained for each user.) Individual sets of spam and non-spam messages are maintained for each user’s profile.
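One possible reading of this decision rule is sketched below. Two assumptions are made that the text does not specify: the accumulated value is taken to be the trust-weighted mean of the group’s recommendation values (each in the range 0 to 1), and the per-user threshold is derived from a single cost ratio t as t / (1 + t), where t expresses how much more costly a false positive is than a false negative. All names are hypothetical.

```python
def spam_threshold(cost_ratio):
    """Per-user threshold from a cost ratio t (assumed form: t / (1 + t))."""
    return cost_ratio / (1.0 + cost_ratio)


def accumulated_value(recommendations):
    """Trust-weighted mean of the group's recommendation values.

    recommendations: list of (recommendation_value, trust_level) pairs,
    one per user in the recommendation group.
    """
    total_trust = sum(trust for _, trust in recommendations)
    if total_trust == 0:
        return 0.0  # nobody trusted: no evidence that the message is spam
    return sum(value * trust for value, trust in recommendations) / total_trust


def is_spam(recommendations, cost_ratio):
    """Classify as spam when the accumulated value exceeds the threshold."""
    return accumulated_value(recommendations) > spam_threshold(cost_ratio)


if __name__ == "__main__":
    group = [(0.9, 1.0), (0.8, 0.5), (0.2, 0.1)]
    print(is_spam(group, cost_ratio=1.0))
    print(is_spam(group, cost_ratio=9.0))
```

Under this reading, a user who assigns a high cost to false positives gets a higher threshold, so the same group recommendation is less likely to move a message into the spam folder.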
3.3 User Profiling
In addition to an individual interest profile, computed by correlating previous email messages browsed or flagged as spam with their classifications, a user’s profile may include features such as keyword lists, rules, a blacklist, a whitelist, etc. Ontological relationships between email classifications are used to infer topics that might not have been specified explicitly.
When the system starts, a uniform profile is assigned to every user. Each individual user profile is then updated on a daily basis, upon acceptance of recommendations, or upon receipt of a false positive/negative report from the user. The system does this by unobtrusively monitoring user behavior. After a user profile is updated, the similarities between the changed profile and the other user profiles need to be updated, followed by an update of the recommendation groups.
User interest feedback specifies a level of interest or dislike in a topic. This feedback enables the spam filter to track concept drift in spam and to be retrained in the case of a false positive [15]. We are going to develop a profiling algorithm that automatically adjusts user profiles to match any topic interest/dislike levels declared via profile feedback. An instance of spam for a specific class may add a percentage of its value to the super-class.
A time-decay function and other existing profiling algorithms may also be invoked to find current interests.
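The super-class rule in the last sentence can be sketched as follows. The tiny topic hierarchy, the 50% share passed upward, and the function name are all assumptions made for illustration; the hierarchy is assumed to be a tree, so following parents always terminates.

```python
# Hypothetical ontology fragment: each class maps to its super-class.
PARENT = {
    "weight-loss": "health",
    "health": "commercial",
    "mortgage": "finance",
    "finance": "commercial",
}


def add_evidence(profile, topic, value, parent=PARENT, share=0.5):
    """Add `value` to a topic and a shrinking share to each ancestor.

    profile maps topic -> accumulated value and is updated in place.
    Each step up the hierarchy passes on `share` of the previous amount.
    """
    profile[topic] = profile.get(topic, 0.0) + value
    ancestor = parent.get(topic)
    while ancestor is not None:
        value *= share
        profile[ancestor] = profile.get(ancestor, 0.0) + value
        ancestor = parent.get(ancestor)
    return profile


if __name__ == "__main__":
    profile = {}
    add_evidence(profile, "weight-loss", 1.0)
    add_evidence(profile, "mortgage", 1.0)
    print(profile)
```

In this sketch two unrelated spam instances both raise the value of their shared ancestor, which is how a topic the user never flagged directly can still be inferred.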
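As one example of a time-decay function, the sketch below uses exponential decay with a half-life. The exponential form, the 30-day default and the function name are assumptions; the proposal does not commit to a particular decay function.

```python
def decayed_interest(events, now_day, half_life_days=30.0):
    """Sum of interest values, each halved for every half-life of age.

    events: list of (day, value) pairs, with days on any common time axis.
    """
    return sum(
        value * 0.5 ** ((now_day - day) / half_life_days)
        for day, value in events
    )


if __name__ == "__main__":
    history = [(0, 1.0), (30, 1.0), (60, 1.0)]
    print(decayed_interest(history, now_day=60))
```

With such a function, older evidence still contributes to a topic’s value but recent behavior dominates the estimate of current interests.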
3.4 Ontology construction
An ontology is a conceptualization of a domain into a human-understandable and machine-readable format consisting of entities, attributes, relationships, and axioms [14]. Ontologies can provide a rich conceptualization of the working domain of an organization, representing the main concepts and relationships of the work activities.
There are two ways to accomplish this task: use an existing ontology, or construct one based on the classification of email messages.
Although there are many topic-specific messages in a moderated mailing list, most of them will fall into standard categories. We are going to use an existing taxonomy with appropriate customizations based on specific domain knowledge.
For example, if the application is in the academic research domain, we can add additional topics to the ontology for the target researchers.
Kazem Taghva, Julie Borsack and their colleagues reported on the construction of an ontology that applies rules for the identification of features to be used for email classification [12].
4. Evaluation Measures
Suppose S and N are the numbers of spam and non-spam messages for each user. S+ is the number of spam messages that are correctly classified by the system, and S- is the number of spam messages misclassified as non-spam. Similarly, N+ denotes the number of non-spam messages that are correctly classified, and N- the number of non-spam messages misclassified as spam. The following measures can be calculated from these four values to measure the performance of the system.
4.1 Filtering accuracy
4.1.1 Precision, recall and F measure value
Precision, recall and the F-measure can be calculated from these four values.
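For reference, these measures can be written in terms of the counts defined at the start of Section 4, treating spam as the positive class: precision is S+ / (S+ + N-), recall is S+ / (S+ + S-), and the F-measure is their harmonic mean. The sketch below assumes the balanced F-measure (equal weight on precision and recall); the function name is hypothetical.

```python
def precision_recall_f(s_pos, s_neg, n_neg):
    """Spam precision, spam recall and balanced F-measure.

    s_pos: spam correctly classified (S+)
    s_neg: spam misclassified as non-spam (S-)
    n_neg: non-spam misclassified as spam (N-)
    """
    flagged = s_pos + n_neg  # everything the filter called spam
    actual = s_pos + s_neg   # everything that really was spam (S)
    precision = s_pos / flagged if flagged else 0.0
    recall = s_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    print(precision_recall_f(s_pos=40, s_neg=10, n_neg=5))
```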
4.1.2 Utility measures
In this measure, a loss value V is attached to each of S-, S+, N- and N+. The overall performance of a system is the sum, over the four counts, of each count multiplied by its corresponding V value.
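A minimal sketch of this weighted sum follows. The particular loss values in the example (zero for correct decisions, 1 for a missed spam, 9 for a blocked legitimate message) are illustrative assumptions only, as are the function and key names.

```python
OUTCOMES = ("S+", "S-", "N+", "N-")


def utility_score(counts, values):
    """Sum of each outcome count multiplied by its V value.

    counts and values are dicts keyed by "S+", "S-", "N+" and "N-".
    """
    return sum(counts[key] * values[key] for key in OUTCOMES)


if __name__ == "__main__":
    counts = {"S+": 40, "S-": 10, "N+": 95, "N-": 5}
    losses = {"S+": 0, "S-": 1, "N+": 0, "N-": 9}  # hypothetical V values
    print(utility_score(counts, losses))
```

Because the V values here are losses, a lower total indicates a better filter under this choice.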
4.1.3 Weighted accuracy
Suppose misclassifying a non-spam message as spam is t times more costly than the opposite misclassification. A version of accuracy sensitive to the cost t is then: Wacc = (t*N+ + S+)/(t*N + S)
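The weighted accuracy can be computed directly from the counts named at the start of Section 4 (S+, S, N+ and N): each non-spam message is counted t times, both when it is classified correctly and in the total. The sketch below follows that reading; the function name and the sample numbers are hypothetical.

```python
def weighted_accuracy(s_pos, s_total, n_pos, n_total, t):
    """Cost-sensitive accuracy in which each non-spam message counts t times.

    s_pos: spam correctly classified (S+); s_total: all spam (S)
    n_pos: non-spam correctly classified (N+); n_total: all non-spam (N)
    """
    denominator = t * n_total + s_total
    if denominator == 0:
        raise ValueError("no messages to evaluate")
    return (t * n_pos + s_pos) / denominator


if __name__ == "__main__":
    # With t = 1 this reduces to ordinary accuracy.
    print(weighted_accuracy(s_pos=40, s_total=50, n_pos=90, n_total=100, t=1))
    print(weighted_accuracy(s_pos=40, s_total=50, n_pos=90, n_total=100, t=9))
```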
4.2 Ontological inference in user profiling
We are going to compare the accuracy values defined in Section 4.1 with those obtained in the absence of ontological inference, to identify the extent to which ontological profiling helps improve overall system performance.
4.3 User satisfaction
A higher accuracy value is taken to indicate higher user satisfaction.
(Some metrics for measuring recommendation performance are suggested by Schein et al. [2002]. Herlocker et al. [3] give an overview of the factors that have been considered in evaluations and introduce new factors that they believe should be considered.)
References
[1] Middleton, S. E., Shadbolt, N. R. and De Roure, D. C. (2004) Ontological User Profiling in Recommender Systems. ACM Transactions on Information Systems (TOIS) 22, Pages: 54 - 88
[2] Joseph A. Konstan. Introduction to recommender systems: Algorithms and Evaluation. ACM Transactions on Information Systems (TOIS) 22, Pages: 1 - 4
[3] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, Pages: 5 - 53
[4] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47
[5] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)
[6] Ken Lang. NewsWeeder: Learning to filter netnews. In Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, California, 1995.
[7] J. Canny. Collaborative filtering with privacy. In IEEE Symposium on Security and Privacy, pages 45--57, Oakland, CA, May 2002.
[8] Mark Claypool, Anuja Gokhale, Tim Miranda, Pavel Murnikov, Dmitry Netes and Matthew Sartin. Combining Content-Based and Collaborative Filters in an Online Newspaper. ACM SIGIR Workshop on Recommender Systems, Berkeley, CA (1999)
[9] Cranor, L.F. and B.A. LaMacchia. 1998. Spam! Communications of the ACM, 41(8):74–83.
[10] “MIT Spam Conference looks beyond filters” by Paul Roberts, IDG News Service, Boston Bureau, January 20, 2004
[11] Mitchell, T.M. 1997. Machine Learning. McGraw-Hill.
[12] Kazem Taghva, Julie Borsack, Jeffrey Coombs, Allen Condit, Steve Lumos, Tom Nartker. Ontology-based Classification of Email. Proceedings of the International Conference on Information Technology: Computers and Communications (2003)
[13] Masahiro Morita and Yoichi Shinoda. Information filtering based on User Behavior Analysis and Best Match Text Retrieval
[14] Guarino, N. and Giaretta, P. 1995. Ontologies and knowledge bases: Towards a terminological clarification. In Towards Very Large Knowledge Bases: Knowledge Building and Knowledge Sharing, N. Mars, Ed. IOS Press, 25-32.
[16] Deborah Fallows. Spam: How it is hurting email and degrading life on the Internet. Pew Internet and American Life Project, October 2003.
Internet resources: