A

Seminar Report

On

“DATA LEAKAGE DETECTION”

ATUL KUMAR SINGH

0913313036

SUBMITTED TO:

Mrs.Priya Chaudhary

NOIDA INSTITUTE OF ENGINEERING AND TECHNOLOGY

19, Knowledge park-2, Institutional Area,Phase-2 Greater Noida 201306

ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would be incomplete without mention of the people whose ceaseless cooperation made it possible, and whose constant guidance and encouragement crown all efforts with success.

I am grateful to my project guide, Mrs. Priya Choudhary, for the guidance, inspiration, and constructive suggestions that helped me in the preparation of this project.

CERTIFICATE

This is to certify that the project work titled “Data Leakage Detection” is a bona fide work of Atul Kumar Singh, carried out in partial fulfillment of the requirements of the B.Tech third year [GBTU] under my guidance. This project work is original and has not been submitted earlier for the award of any degree at any other university.

Mrs. Priya Choudhary

Lecturer

NIET, Greater Noida

ABSTRACT

We study the following problem: A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data are leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. We propose data allocation strategies (across the agents) that improve the probability of identifying leakages. These methods do not rely on alterations of the released data (e.g., watermarks). In some cases, we can also inject “realistic but fake” data records to further improve our chances of detecting leakage and identifying the guilty party.

Key Terms: Allocation strategies, data leakage, data privacy, fake records, leakage model.

TABLE OF CONTENTS

S.NO / CONTENTS
1 / Introduction
2 / Techniques Used
3 / Conclusion
4 / Limitations
5 / References

TABLE OF FIGURES

S.NO / CONTENTS
1 / Creating Watermarking
2 / Watermarking Algorithm
3 / Graph Showing Perturbation
4 / Process of Data Loss
5 / Evaluation of Explicit Data Request
6 / Evaluation of Sample Data Request

INTRODUCTION

In the course of doing business, sensitive data must sometimes be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data.

We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique where the data are modified and made “less sensitive” before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges [4]. However, in some cases, it is important not to alter the original distributor’s data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

Traditionally, leakage detection is handled by watermarking, e.g., a unique code is embedded in each distributed copy. If that copy is later discovered in the hands of an unauthorized party, the leaker can be identified. Watermarks can be very useful in some cases, but again, they involve some modification of the original data. Furthermore, watermarks can sometimes be destroyed if the data recipient is malicious. Here we study unobtrusive techniques for detecting leakage of a set of objects or records.

Creating a watermark

We study the following scenario: After giving a set of objects to agents, the distributor discovers some of those same objects in an unauthorized place. (For example, the data may be found on a website, or may be obtained through a legal discovery process.) At this point, the distributor can assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means. Using an analogy with cookies stolen from a cookie jar, if we catch Freddie with a single cookie, he can argue that a friend gave him the cookie. But if we catch Freddie with five cookies, it will be much harder for him to argue that his hands were not in the cookie jar. If the distributor sees “enough evidence” that an agent leaked data, he may stop doing business with him, or may initiate legal proceedings.

DIGITAL WATERMARKING:

Digital watermarking is the process of embedding information into a digital signal, which may be used to verify its authenticity or the identity of its owners, in the same manner as paper bearing a watermark for visible identification. In digital watermarking, the signal may be audio, pictures, or video. If the signal is copied, then the information is also carried in the copy. A signal may carry several different watermarks at the same time. An early precedent is Paul Levinson's work on the future of the information revolution (1997), in which he called for the use of "smart patent numbers" (p. 202), that is, the embedding of electronic chips in every piece of technology that would give an updated listing of all of its inventors.

In visible digital watermarking, the information is visible in the picture or video. Typically, the information is text or a logo which identifies the owner of the media. When a television broadcaster adds its logo to the corner of transmitted video, this also is a visible watermark.

In invisible digital watermarking, information is added as digital data to audio, picture, or video, but it cannot be perceived as such (although it may be possible to detect that some amount of information is hidden in the signal). The watermark may be intended for widespread use and thus is made easy to retrieve, or it may be a form of steganography, where a party communicates a secret message embedded in the digital signal. In either case, as in visible watermarking, the objective is to attach ownership or other descriptive information to the signal in a way that is difficult to remove. It is also possible to use hidden embedded information as a means of covert communication between individuals.
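To make the idea concrete, the following is a minimal Python sketch of one common invisible-watermarking approach, least-significant-bit (LSB) embedding. It assumes an 8-bit grayscale image stored as a flat list of integer pixel values; the function names and toy data are illustrative only and are not part of any particular watermarking standard.

# Minimal sketch of invisible watermarking via least-significant-bit (LSB)
# embedding. Assumes an 8-bit grayscale image stored as a flat list of ints;
# all names here are illustrative.

def embed_watermark(pixels, bits):
    """Hide a bit string in the least-significant bits of the pixels."""
    marked = list(pixels)
    for i, bit in enumerate(bits):
        marked[i] = (marked[i] & ~1) | bit  # overwrite only the lowest bit
    return marked

def extract_watermark(pixels, length):
    """Read the hidden bit string back out of the pixels."""
    return [p & 1 for p in pixels[:length]]

if __name__ == "__main__":
    image = [200, 13, 77, 154, 90, 31, 64, 255]   # toy "image"
    mark = [1, 0, 1, 1]                           # recipient-identifying code
    stamped = embed_watermark(image, mark)
    assert extract_watermark(stamped, len(mark)) == mark
    print(stamped)

Because only the lowest bit of each pixel changes, the marked copy is visually indistinguishable from the original, yet a distributor who knows where to look can read the recipient-identifying code back out of a leaked copy.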

One application of watermarking is in copyright protection systems, which are intended to prevent or deter unauthorized copying of digital media. In this use, a copy device retrieves the watermark from the signal before making a copy; the device makes a decision whether to copy or not, depending on the contents of the watermark. Another application is in source tracing. A watermark is embedded into a digital signal at each point of distribution. If a copy of the work is found later, then the watermark may be retrieved from the copy and the source of the distribution is known. This technique reportedly has been used to detect the source of illegally copied movies.

Annotation of digital photographs with descriptive information is another application of invisible watermarking.

PERTURBATION TECHNIQUE:

Perturbation is a very useful technique where the data are modified and made “less sensitive” before being handed to agents. For example, one can add random noise to certain attributes, or one can replace exact values by ranges [4].
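As an illustration only, the short Python sketch below shows the two perturbation ideas just mentioned: adding random noise to a numeric attribute and replacing an exact value with the range that contains it. The attribute (a salary) and the bucket width are assumptions chosen for the example.

# Illustrative sketch of two perturbation ideas: adding random noise to a
# numeric attribute and replacing exact values by ranges. Field names and
# bucket sizes are assumptions made for this example.
import random

def add_noise(value, scale=0.05):
    """Return the value perturbed by up to +/- scale (5 percent by default)."""
    return value * (1 + random.uniform(-scale, scale))

def to_range(value, width=10000):
    """Replace an exact value with the range (bucket) that contains it."""
    low = (value // width) * width
    return (low, low + width)

if __name__ == "__main__":
    salary = 52340
    print(add_noise(salary))   # e.g. 53101.7 -- less sensitive, still useful
    print(to_range(salary))    # (50000, 60000)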

Graph showing perturbation

However, in some cases, it is important not to alter the original distributor’s data. For example, if an outsourcer is doing our payroll, he must have the exact salary and customer bank account numbers. If medical researchers will be treating patients (as opposed to simply computing statistics), they may need accurate data for the patients.

UNOBTRUSIVE TECHNIQUES:

In this section we develop a model for assessing the “guilt” of agents. We also present algorithms for distributing objects to agents, in a way that improves our chances of identifying a leaker. Finally, we also consider the option of adding “fake” objects to the distributed set. Such objects do not correspond to real entities but appear realistic to the agents. In a sense, the fake objects act as a type of watermark for the entire set, without modifying any individual members. If it turns out an agent was given one or more fake objects that were leaked, then the distributor can be more confident that the agent was guilty.

Typical block diagram showing the process of data loss in blocking spam.

PROBLEM SETUP AND NOTATION:

A distributor owns a set T = {t1, …, tm} of valuable data objects. The distributor wants to share some of the objects with a set of agents U1, U2, …, Un, but does not wish the objects to be leaked to other third parties. The objects in T could be of any type and size, e.g., they could be tuples in a relation, or relations in a database. An agent Ui receives a subset of objects, determined either by a sample request or an explicit request:

  1. Sample request

Ri = SAMPLE(T, mi): Any subset of mi records from T can be given to Ui.

  2. Explicit request

Ri = EXPLICIT(T, condi): Agent Ui receives all T objects that satisfy condi.
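A minimal Python sketch of the two request types follows, assuming T is modelled as a list of records and condi as an arbitrary predicate; the function names follow the SAMPLE/EXPLICIT notation above, but the implementation is only illustrative.

# Sketch of the two request types. T is a list of records; cond_i is a
# predicate over records. Names follow the SAMPLE/EXPLICIT notation above.
import random

def sample_request(T, m_i):
    """R_i = SAMPLE(T, m_i): any subset of m_i records from T."""
    return random.sample(T, m_i)

def explicit_request(T, cond_i):
    """R_i = EXPLICIT(T, cond_i): all objects of T that satisfy cond_i."""
    return [t for t in T if cond_i(t)]

if __name__ == "__main__":
    T = [{"id": k, "state": "CA" if k % 2 else "NY"} for k in range(10)]
    R1 = sample_request(T, 3)                               # sample request
    R2 = explicit_request(T, lambda t: t["state"] == "CA")  # explicit request
    print(len(R1), len(R2))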

GUILTY AGENTS

Suppose that after giving objects to agents, the distributor discovers that a set S ⊆ T has leaked. This means that some third party, called the target, has been caught in possession of S. For example, this target may be displaying S on its website, or perhaps as part of a legal discovery process, the target turned over S to the distributor. Since the agents U1, …, Un have some of the data, it is reasonable to suspect them of leaking the data. However, the agents can argue that they are innocent, and that the S data were obtained by the target through other means.

For example, say that one of the objects in S represents a customer X. Perhaps X is also a customer of some other company, and that company provided the data to the target. Or perhaps X can be reconstructed from various publicly available sources on the web. Our goal is to estimate the likelihood that the leaked data came from the agents as opposed to other sources. Intuitively, the more data in S, the harder it is for the agents to argue they did not leak anything. Similarly, the “rarer” the objects, the harder it is to argue that the target obtained them through other means. Not only do we want to estimate the likelihood the agents leaked data, but we would also like to find out if one of them, in particular, was more likely to be the leaker. For instance, if one of the S objects was only given to agent U1, while the other objects were given to all agents, we may suspect U1 more. The model we present next captures this intuition. We say an agent Ui is guilty if it contributes one or more objects to the target. We denote the event that agent Ui is guilty by Gi and the event that agent Ui is guilty for a given leaked set S by Gi|S. Our next step is to estimate Pr{Gi|S}, i.e., the probability that agent Ui is guilty given evidence S.

RELATED WORK:

The guilt detection approach we present is related to the data provenance problem: tracing the lineage of S objects implies essentially the detection of the guilty agents. Tutorial [1] provides a good overview of the research conducted in this field. Suggested solutions are domain specific, such as lineage tracing for data warehouses, and assume some prior knowledge of the way a data view is created out of data sources. Our problem formulation with objects and sets is more general and simplifies lineage tracing, since we do not consider any data transformation from Ri sets to S.

As far as the data allocation strategies are concerned, our work is mostly relevant to watermarking that is used as a means of establishing original ownership of distributed objects. Watermarks were initially used in images [5], video [5], and audio data [5], whose digital representation includes considerable redundancy. Recently, [1] and other works have also studied mark insertion into relational data. Our approach and watermarking are similar in the sense of providing agents with some kind of receiver-identifying information. However, by its very nature, a watermark modifies the item being watermarked. If the object to be watermarked cannot be modified, then a watermark cannot be inserted. In such cases, methods that attach watermarks to the distributed data are not applicable.

Finally, there are also many other works on mechanisms that allow only authorized users to access sensitive data through access control policies. Such approaches prevent data leakage, in some sense, by sharing information only with trusted parties. However, these policies are restrictive and may make it impossible to satisfy agents’ requests.

AGENT GUILT MODEL

To compute this Pr{Gi|S}, we need an estimate for the probability that values in S can be “guessed” by the target. For instance, say that some of the objects in S are e-mails of individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of the target to find the e-mails of, say, 100 individuals. If this person can find, say, 90 e-mails, then we can reasonably guess that the probability of finding one e-mail is 0.9. On the other hand, if the objects in question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We call this estimate pt, the probability that object t can be guessed by the target.

Probability pt is analogous to the probabilities used in designing fault-tolerant systems. That is, to estimate how likely it is that a system will be operational throughout a given period, we need the probabilities that individual components will or will not fail. A component failure in our case is the event that the target guesses an object of S. The component failure is used to compute the overall system reliability, while we use the probability of guessing to identify agents that have leaked information. The component failure probabilities are estimated based on experiments, just as we propose to estimate the pts. Similarly, the component probabilities are usually conservative estimates, rather than exact numbers. For example, say that we use a component failure probability that is higher than the actual probability, and we design our system to provide a desired high level of reliability. Then we will know that the actual system will have at least that level of reliability, but possibly higher. In the same way, if we use pts that are higher than the true values, we will know that the agents will be guilty with at least the computed probabilities.

To simplify the formulas that we present in the rest of the paper, we assume that all T objects have the same pt, which we call p. Our equations can be easily generalized to diverse pts, though they become cumbersome to display.
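The Python sketch below shows one plausible way to turn the guessing probability p into an estimate of Pr{Gi|S}. It assumes that leaked objects are leaked independently, and that an object which was not guessed by the target was supplied, with equal probability, by one of the agents that received it; these independence assumptions are made here only for illustration.

# Sketch of an agent guilt estimate built from the guessing probability p.
# Assumptions (for illustration only): objects leak independently; a leaked
# object was either guessed by the target (probability p) or provided by one
# of the agents that received it, each equally likely.

def guilt_probability(agent, S, allocation, p):
    """Estimate Pr{G_i | S} for one agent.

    allocation maps each agent to the set of objects R_i it received,
    S is the set of leaked objects, p the per-object guessing probability.
    """
    prob_innocent = 1.0
    for t in S & allocation[agent]:
        holders = sum(1 for a in allocation if t in allocation[a])
        # probability this particular object did NOT come from this agent
        prob_innocent *= 1.0 - (1.0 - p) / holders
    return 1.0 - prob_innocent

if __name__ == "__main__":
    allocation = {"U1": {1, 2, 3}, "U2": {2, 3, 4}}
    S = {1, 2}   # leaked objects found at the target
    print(guilt_probability("U1", S, allocation, p=0.5))  # higher: holds object 1 alone
    print(guilt_probability("U2", S, allocation, p=0.5))

In this toy run, U1 is assigned more guilt than U2 because one of the leaked objects was given only to U1, which matches the intuition described above.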

ALGORITHMS:

1. Evaluation of Explicit Data Request Algorithms

In the first place, the goal of these experiments was to see whether fake objects in the distributed data sets yield significant improvement in our chances of detecting a guilty agent. In the second place, we wanted to evaluate our e-optimal algorithm relative to a random allocation.

Evaluation of EXPLICIT data request algorithms.

We focus on scenarios with a few objects that are shared among multiple agents. These are the most interesting scenarios, since object sharing makes it difficult to distinguish a guilty agent from non-guilty agents. Scenarios with more objects to distribute, or scenarios with objects shared among fewer agents, are obviously easier to handle. As far as scenarios with many objects to distribute and many overlapping agent requests are concerned, they are similar to the scenarios we study, since we can map them to the distribution of many small subsets.

2. Evaluation of Sample Data Request Algorithms

With sample data requests, agents are not interested in particular objects. Hence, object sharing is not explicitly defined by their requests. The distributor is “forced” to allocate certain objects to multiple agents only if the number of requested objects exceeds the number of objects in set T. The more data objects the agents request in total, the more recipients on average an object has; and the more objects are shared among different agents, the more difficult it is to detect a guilty agent.
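As a point of reference, the following Python sketch shows a simple greedy baseline for sample requests: each agent receives the requested number of objects, always picking the objects that currently have the fewest recipients so that sharing stays low. This is only an illustrative baseline, not the optimized allocation algorithm evaluated in the experiments.

# Baseline sketch of allocation for sample requests: give each agent U_i
# the m_i objects that currently have the fewest recipients, so that object
# sharing (and hence ambiguity about the leaker) is kept low. Illustrative
# greedy baseline only.

def allocate_samples(T, requests):
    """requests maps agent name -> number of objects m_i requested."""
    recipients = {t: 0 for t in T}   # how many agents hold each object
    allocation = {}
    for agent, m_i in requests.items():
        # pick the m_i least-shared objects for this agent
        chosen = sorted(T, key=lambda t: recipients[t])[:m_i]
        for t in chosen:
            recipients[t] += 1
        allocation[agent] = set(chosen)
    return allocation

if __name__ == "__main__":
    T = list(range(8))
    print(allocate_samples(T, {"U1": 5, "U2": 5, "U3": 4}))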

Evaluation of Sample Data Request

MODULES:

1. Data Allocation Module:

The main focus of our project is the data allocation problem: how can the distributor “intelligently” give data to agents in order to improve the chances of detecting a guilty agent?

2. Fake Object Module:

Fake objects are objects generated by the distributor in order to increase the chances of detecting agents that leak data. The distributor may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty agents. Our use of fake objects is inspired by the use of “trace” records in mailing lists.
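A small Python sketch of the fake-object idea follows: the distributor mints records that look realistic but correspond to no real person, remembers which agent received which fake record, and later checks any leaked data against that log. All field names and values are assumptions made for the example.

# Illustrative sketch of the fake-object idea: mint realistic-looking but
# fictitious records, log which agent received which fake record, and check
# leaked data against that log. Field names are assumptions for the example.
import random

FIRST = ["Asha", "Rohit", "Neha", "Vikram"]
LAST = ["Sharma", "Verma", "Iyer", "Singh"]

def make_fake_record(record_id):
    """Generate a realistic-looking but fictitious customer record."""
    return {"id": record_id,
            "name": f"{random.choice(FIRST)} {random.choice(LAST)}",
            "account": f"{random.randint(10**9, 10**10 - 1)}"}

def plant_fakes(agent_data, agent, n, registry, next_id):
    """Add n fake records to one agent's data set and log them."""
    for k in range(n):
        fake = make_fake_record(next_id + k)
        agent_data.append(fake)
        registry[fake["id"]] = agent   # remember which agent got this fake
    return next_id + n

if __name__ == "__main__":
    registry = {}
    data_u1 = [{"id": 1, "name": "Real Customer", "account": "123"}]
    plant_fakes(data_u1, "U1", 2, registry, next_id=1000)
    leaked_ids = {1001}   # ids later found in an unauthorized place
    print([registry[i] for i in leaked_ids if i in registry])  # -> ['U1']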

3. Optimization Module:

In the Optimization Module, the distributor’s data allocation to agents has one constraint and one objective. The distributor’s constraint is to satisfy agents’ requests, by providing them with the number of objects they request or with all available objects that satisfy their conditions. His objective is to be able to detect an agent who leaks any portion of his data.

4. Data Distributor:

A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the data is leaked and found in an unauthorized place (e.g., on the web or somebody’s laptop). The distributor must assess the likelihood that the leaked data came from one or more agents, as opposed to having been independently gathered by other means.

DATA ALLOCATION PROBLEM

Our main focus is the data allocation problem: how can the distributor “intelligently” give data to agents in order to improve the chances of detecting a guilty agent? As illustrated in Fig. 2, there are four instances of this problem we address, depending on the type of data requests made by agents and whether “fake objects” are allowed. The two types of requests we handle were defined in Section 2: sample and explicit. Fake objects are objects generated by the distributor that are not in set T. The objects are designed to look like real objects, and are distributed to agents together with T objects, in order to increase the chances of detecting agents that leak data. We discuss fake objects in more detail in a later section.

As shown in Fig. 2, we represent our four problem instances with the names EF̄, EF, SF̄, and SF, where E stands for explicit requests, S for sample requests, F for the use of fake objects, and F̄ for the case where fake objects are not allowed. Note that, for simplicity, we are assuming that in the E problem instances, all agents make explicit requests, while in the S instances, all agents make sample requests. Our results can be extended to handle mixed cases, with some explicit and some sample requests. We provide here a small example to illustrate how mixed requests can be handled, but then do not elaborate further.