Supporting Privacy Protection in Personalized Web Search

SUPPORTING PRIVACY PROTECTION IN PERSONALIZED WEB SEARCH

ABSTRACT

Personalized web search (PWS) has demonstrated its effectiveness in improving the quality of various search services onthe Internet. However, evidences show that users’ reluctance to disclose their private information during search has become a majorbarrier for the wide proliferation of PWS. We study privacy protection in PWS applications that model user preferences as hierarchicaluser profiles. We propose a PWS framework called UPS that can adaptively generalize profiles by queries while respecting userspecifiedprivacy requirements. Our runtime generalization aims at striking a balance between two predictive metrics that evaluate theutility of personalization and the privacy risk of exposing the generalized profile. We present two greedy algorithms, namely GreedyDPand GreedyIL, for runtime generalization. We also provide an online prediction mechanism for deciding whether personalizing a queryis beneficial. Extensive experiments demonstrate the effectiveness of our framework. The experimental results also reveal thatGreedyIL significantly outperforms GreedyDP in terms of efficiency.

INTRODUCTION

The web search engine has long become the mostimportant portal for ordinary people looking for usefulinformation on the web. However, users might experiencefailure when search engines return irrelevant results that donot meet their real intentions. Such irrelevance is largelydue to the enormous variety of users’ contexts andbackgrounds, as well as the ambiguity of texts. Personalizedweb search (PWS) is a general category of search techniquesaiming at providing better search results, which are tailoredfor individual user needs. As the expense, user informationhas to be collected and analyzed to figure out the userintention behind the issued query.

The solutions to PWS can generally be categorized intotwo types

Click-log-based methods and

Profile-basedmethods

Click-log-based methods

The click-log based methods are straightforward they simply impose bias to clicked pages in the user’s queryhistory.

It can onlywork on repeated queries from the same user, which is astrong limitation confining its applicability.

Profile-based methods

Profile-based methods can be potentiallyeffective for almost all sorts of queries, but arereported to be unstable under some circumstances.

Improve the search experience withcomplicated user-interest models generated from userprofiling techniques.

PWS has demonstrated moreeffectiveness in improving the quality of web searchrecently, with increasing usage of personal and behaviorinformation to profile its users, which is usually gatheredimplicitly from query history, browsing history, click-through data bookmarks, userdocuments and so forth.

PROBLEM DEFINITION

To protect user privacy in profile-based PWS, researchershave to consider two contradicting effects during the searchprocess. On the one hand, they attempt to improve thesearch quality with the personalization utility of the userprofile. They need to hide the privacycontents existing in the user profile to place the privacy riskunder control. Significantgain can be obtained by personalization at the expenseof only a small (and less-sensitive) portion of the userprofile, namely a generalized profile. Thus, user privacy canbe protected without compromising the personalizedsearch quality. In general, there is a tradeoff between thesearch quality and the level of privacy protection achievedfrom generalization.Unfortunately, the previous works of privacy preservingPWS are far from optimal.

EXISTING SYSTEM

Profile based PWS

A user profile is typically generalized foronly once offline, and used to personalize all queriesfrom a same user indiscriminatingly.

Such “oneprofile fits all” strategy certainly has drawbacksgiven the variety of queries.

Profile-based personalization maynot even help to improve the search quality forsome ad hoc queries, though exposing user profile toa server has put the user’s privacy at risk.

A betterapproach is to make an online decision onwhether to personalize the query andwhat to expose in the user profile at runtime.

Customization of privacy requirements

This considers, all the sensitive topics are detected using anabsolute metric called surprisal based on theinformation theory, assuming that the interests withless user document support are more sensitive.

Iterative user interactions

They usually refine the search results with somemetrics which require multiple user interactions,such as rank scoring, average rank, and so on.

This paradigm is, however, infeasible for runtimeprofiling, as it will not only pose too much risk ofprivacy breach, but also demand prohibitive processingtime for profiling.

Thus, we need predictivemetrics to measure the search quality and breachrisk after personalization, without incurring iterativeuser interaction.

Disadvantages

The existing profile-based PWS do not support runtimeprofiling.

The existing methods do not take into account thecustomization of privacy requirements.

Many personalization techniques require iterative userinteractions when creating personalized search results.

PROPOSED SYSTEM

To propose UPS (User customizable Privacy-preserving Search) framework, which is a privacy-preserving personalized websearch framework, which can generalize profilesfor each query according to user-specifiedprivacy requirements.

To develop two simple but effective generalizationalgorithms, GreedyDP and GreedyIL, to supportruntime profiling. GreedyDP tries to maximizethe discriminating power (DP), GreedyIL attempts to minimize the information loss (IL).

The framework assumes that the queries do not containany sensitive information, and aims at protecting theprivacy in individual user profiles while retaining theirusefulness for PWS.

UPS consists of a nontrusty searchengine server and a number of clients. Each client (user)accessing the search service trusts no one but himself/herself.

The key component for privacy protection is anonline profiler implemented as a search proxy running on theclient machine itself.

The proxy maintains both thecomplete user profile, in a hierarchy of nodes with semantics,and the user-specified (customized) privacy requirements representedas a set of sensitive-nodes.

During the offline phase, ahierarchical user profile is constructed and customized withthe user-specified privacy requirements.

The online phasehandles queries as When a user issues a query qi on the client, theproxy generates a user profile in runtime in thelight of query terms. The output of this step is ageneralized user profile Gi satisfying the privacyrequirements. The generalization process is guidedby considering two conflicting metrics, namely thepersonalization utility and the privacy risk, bothdefined for user profiles.

The query and the generalized userprofile are sent together to the PWS server forpersonalized search.

The search results are personalized with the profileand delivered back to the query proxy.

Finally, the proxy either presents the raw results to theuser, or reranks them with the complete user profile.

Advantages

UPS provides runtime profiling, which in effect optimizes thepersonalization utility while respecting user’s privacyrequirements;

Allows for customization of privacy needs;and

Does not require iterative user interaction.

Provides an inexpensive mechanism for the clientto decide whether to personalize a query in UPS.

LITERATURE SUMMARY

Profile-Based Personalization

Many profile representations are available in theliterature to facilitate different personalization strategies.Earlier techniques utilize term lists/vectors or bag ofwords to represent their profile.

However, most recentworks build profiles in hierarchical structures due to theirstronger descriptive ability, better scalability, and higheraccess efficiency. The majority of the hierarchical representationsare constructed with existing weighted topichierarchy/graph, such as ODP, Wikipediaand so on.

Another work builds thehierarchical profile automatically via term-frequency analysison the user data.

Privacy Protection in PWS System

Generally there are two classes of privacy protectionproblems for PWS.

One class includes those treat privacyas the identification of an individual.
The other includes those consider the sensitivity of the data,particularly the user profiles, exposed to the PWS server.

Typical works in the literature of protecting useridentifications (class one) try to solve the privacy problemon different levels, including the pseudoidentity, the groupidentity, no identity, and no personal information. Solution tothe first level is proved to fragile.The third and fourthlevels are impractical due to high cost in communicationand cryptography. Therefore, the existing efforts focus onthe second level.

The useless userprofile (UUP) protocol is proposed to shuffle queriesamong a group of users who issue them. As a result anyentity cannot profile a certain individual.

These worksassume the existence of a trustworthy third-party anonymizer,which is not readily available over the Internet atlarge.

Viejo and Castell-a-Roca use legacy social networksinstead of the third party to provide a distorted userprofile to the web search engine. In the scheme, every useracts as a search agency of his or her neighbors. They candecide to submit the query on behalf of who issued it, orforward it to other neighbors. The shortcomings of currentsolutions in class one is the high cost introduced due to thecollaboration and communication.

The solutions in class two do not require third-partyassistance or collaborations between social network entries.

In these solutions, users only trust themselves and cannottolerate the exposure of their complete profiles an anonymityserver.

Krause and Horvitz employ statisticaltechniques to learn a probabilistic model, and then use thismodel to generate the near-optimal partial profile. Limitation in this work is that it builds the user profileas a finite set of attributes, and the probabilistic model istrained through predefined frequent queries. These assumptionsare impractical in the context of PWS.

Xu et al.proposed a privacy protection solution for PWS basedon hierarchical profiles. Using a user-specified threshold, ageneralized profile is obtained in effect as a rooted subtreeof the complete profile. Unfortunately, this work does notaddress the query utility, which is crucial for the servicequality of PWS.

Xiao and Tao proposed Privacy-Preserving DataPublishing (PPDP). A person can specify the degree ofprivacy protection for her/his sensitive values by specifying“guarding nodes” in the taxonomy of the sensitiveattribute.

Teevan et al. collect a setof features of the query to classify queries by their clickentropy.While these works are motivative in questioningwhether to personalize or not to, they assume the availabilityof massive user query logs anduser feedback.

Hardware requirements:

Processor: Any Processor above 500 MHz.

Ram: 128Mb.

Hard Disk: 10 Gb.

Compact Disk: 650 Mb.

Input device: Standard Keyboard and Mouse.

Output device: VGA and High Resolution Monitor.

Software requirements:

Operating System: Windows Family.

Language: JDK 1.5

Database: MySQL 5.0

Tool: HeidiSQL 3.0