Knowledge Discovery
in
Protected Vertical Information
Dr. William Perrizo
University Distinguished Professor
of Computer Science
North Dakota State University,
Fargo, ND 58108
Abstract
The predicate-Tree (pTree) is a lossless compression of vertical data structures. The pTree is specifically designedfor efficient data mining of a bit-level vertical database. A very desirable feature of any data storeis the abilityto secure the data from unauthorized access and manipulation. In fact, it is a requirement in many cases (consider the HIPPA requirements for medical records and the FERPA requirements for academic records). In this paper, we develop an approach which contribute to this end. The method relies on a storage architecture that anonymizes and pads the individual vertical structures to secure the information they contain. The method can be used by itself all alone or it can be combined with other methods for enhanced information availability and security. That is to say, it is orthogonal to most know security methods.
Vertical Data andpTrees
A file or dataset, as represented in a computer, comprises one or more records, each record comprises one or morefields, and each field comprises some number of bits that represent a piece of data. Traditionally, the data set isviewed horizontally as a set of records, and the data are processed vertically, record by record. Databasemanagement systems (DBMS) based such traditional data representation are called row-store architectures.
In contrast to a data set "as a set of rows", a data set may also be viewed vertically. The data sets may comprise a columnof data fields (attributes) and is stored contiguously as such. The DBMS designs that are based on a data set "asa set of columns" are called column-store architectures [1]. As part of the column-store architecture,the stored columns are densepacked, and this simple method of compression provides significant algorithm and codingadvantages compared to a row-store design. As an integral part of column-store DBMS, its operations areapplied to the compressed representation whenever possible [1].
At a next level within the column-store architecture, a column-based data set in the database may be structured as aset of vertical data bit-vectors. In this column-store architecture, a data set may be viewed as an ordered set of sets ofbits, such that each set of bits is a list of bit values found in a pre-determined position within the binary representationof the field value that comprises the data column. This list of bit values is called a bit-vector and one or more of thesebit-vectors, when stored in a manner that mirrors the order of their position in the field's binary representation, thenencodes the column as a vertical data object.
Once structured in this way, the data may then be processed horizontally, bit by bit, using bit-level operations. Thus,the order of the bit-vectors stored in the database is a essential component of each data set. Importantly, the structureof a data set also depends on the structure of the column-store of the data as well, when conceptualized as amulti-dimensional array. The number of dimensions and the dimension sizes are important order relations to thestructure of the vertical data object. Together with the binary bit-order, these additional structures placed on thecolumn-store format are the overall order substructure of the database when structured as a bit-based verticaldatabase.
Each vertical data structure, as an n-dimensional set of ordered bit sets (representing a vector, matrix, ortensor) has a unique pTree representation [2] [25]. A pTree representation is a tree structurehaving a zero bit value for each internal node and a zero or one bit value for each of its leaves (external nodes). Vertical data structures, when represented by pTrees, areparticularly adapted to fundamental data mining operations applied to spatial data [3].
There are three basic pTree operations: AND, OR, and complement. These operations are defined bit-wise againstpTree-based vertical data structures. The pTree-based vertical structures provide a format that permits an efficientmethodology to mine data relationships [2][4].
A pTree is thus a lossless and compressed representation of a vertical data object as a bit-based structure. Fastalgorithms for pTree generation and the associated operations have been implemented, demonstrating smallspace and time costs when compared to using the original data structures for data mining [4].These important features of pTree-compressed vertical objects may lend significantly to the already high performanceobtained by using column-store architectures for data mining[5][1].
Anonymizing Vertical Data to provide Security
Just as a data set can be structured vertically, an entire database can be structured as a large collection of pTrees(compressed or uncompressed bit-vectors)stored in a very specific order. The database, as such, is structured for efficient storage and for processing by data mining operations.Knowing the order in which the vertical data structures are stored is paramount to working with the data. Once structured in sucha way, vertically defined bit-based operations can be applied to these vectors to extract answers to queries that are applied to thedatabase.
Many DBMS have the ability to secure the data within the database. Given a vertically structured database, how might thedata be made secure and yet there is no compromise to the efficiencies inherent in the data structure design? How can such adatabase be made secure in such a way that the pTree-based vertical data can be processed without an expensive pair ofencrypting and decrypting steps?
A solution was proposed in [23]: suppose there are N data objects, with N ≥ 1. The data objects havean (implicit) essential order that is directly related to their use. By repositioning the objects, they can effectively appear to benonsensical and meaningless. If one is unable to identify which vertical slice or vertical map it is, then the data is useless. A simple way to provide this anonymization, as a security measure, is to generate a random permutation of size N, anduse this bijective function to permute the position of the data objects and therefore to fully anonymize their position identity. By doing so, we would implicitly know the correct positionsfor the data objects only by using the permutation as a set of indexes. However, an adversary would find it infeasible to rearrange theobjects in hope of finding the correct order. One could argue that a very effective adversary could reduce the infeasibility of the task of discovering the correct permutation through intelligent incremental search by focusing on only the very first bit in each vertical bit structure (slice). The idea would be to make sense of the very first horizontal record and thereby, possibly, unravel the anonymization. It is certainly true that reducing the size of the problem by many many orders of magnitude (If there are a billion records altogether, then focusing on just the first one, reduces the de-anonymizing search by 9 orders of magnitude.). It is precisely for that reason that we introduce the second phase of anonymization, namely random bit padding of each vertical bit structure so that the position of the “first bit” is now anonymized as well. This, of course, increases the size of the key that must be known by qualified users by twice. Assuming there are N pTrees in all, now instead of one array[N] identity anonymizing permutation, we would need that identity anonymizing array plus just one more array[N] specifying the length of the bit pad for each pTree.
A first level of security for the database may be a secret rearrangement of the order of the vertical data pieces. If N is thenumber of vertical data vectors then a natural bijective function, such as a permutation, will reorder the N data objects. The inverseof the permutation will place the N data objects into their original order. The rearrangement of the vertical data order is called anonymizingand the permutation that is constructed to reorder the vertical data objects is called the anonymizing permutation.
This same method can also be applied to the bit-vectors themselves to provide a second level of security in the form of random bitpadding, as introduced briefly above. Padding is applied to the database vectors to add another level of protection against adversaries and their ability to breakthe reordering scheme. Padding will randomly place the true starting position of the data in the vertical structures somewhere in themiddle of the perceived structure by filling the front (and back) of the structure with a random length of random bits of data.
A second permutation would serve as a method to pad each bit-vector in the database front and back with a random string of bits (0/1)prior to the database reordering process. The padding would be unique to each vertical data structure. The padding would also accountfor the fact that bit-vectors in the database are not required to have the same lengths. This second permutation is called the paddingpermutation.
Why Use Permutations
There are two important reasons that permutations are proposed as a solution to the problem of securing a bit-based vertical database.
The first reason is based on the theoretical concern to provide a cryptographically strong reordering of the vertical data objects in thedatabase. It can be shown that pseudorandom permutations are cryptographically strong when constructedwith the novel algorithms presented in this paper. Therefore, using random permutations can provide superior security againstadversaries attempting to determine the original and meaningful order of the database objects.
The second reason is based on the practical concern of how to securely store and then use the secret indexes with which the databasewas reordered. Regardless of the method for reordering the database, the bijection between the original order and the new secure ordermust be readily available to the database each time it is called upon to perform a data operation. Therefore, this bijection, as a list ofpointers, must be stored securely, meaning it must be encrypted and a database user must have the decryption half of the key. Dependingon the size of the database, this bijection can be large, and the decryption process, even as a one time occurrence for each user, may betime-expensive.
As a reordering bijection, a random permutation can be deterministically generated with a single random number between 1 and N,where N represents the total number of vertical data objects (bit-vectors) in the database. The value of N will almost certainlyrequire a binary representation with no more than 128 bits. Therefore, using random permutations can provide a significantlycompressed version of the required order-reorder bijection that can be repeatedly reconstructed in O(n) time.
Defining The Permutations
In this paper, we recommend permutation generating algorithms to securely pad and then reorder the structures that makeup a vertical database. The algorithm will construct a permutation in a manner that makes it infeasible for adversaries to reconstruct the permutation without knowinghow it was constructed. By padding and ordering the bit-vectors according to the padding and anonymizng permutations, the data become securefrom unwanted scrutiny.
We therefore construct our permutations such that:
(1) each permutation is a random permutation,
(2)it is easily reconstructed whenever needed, and
(3)itpermutes the positions of all the vertical data objects in the database.
Given such a permutation, the vertical data objects are then rearranged accordingly, using the permutation as a list of pointers(exactly equal to the full-length cyclic permutation that it is). To apply a vertical data-based query to the database, the DBMSreconstructs the permutation (the secret reordering), and then applies it as a list of pointers (implemented as indirect addressing),to query the vertical database using vertical data-based bit operations.
The Padding Algorithm
Even though each of the bit-vectors have been repositioned to secure the database, it may be possible that an adversary can re-anonymizethe bit-vectors such that parts of some data files make sense. If each of the bit-vectors were padded at the beginning of the vector witha random string of 0/1 values, this random front-end padding would significantly hinder this possibility. For this reason, a paddingalgorithm was developed to place unique, random, length-varying strings of 0/1 at the front-end and at the back end of each bit-vector.
Under normal circumstances, the size, or length, of a pTree structure will vary with the type of data set that it represents. Different data filesin a database normally comprise different numbers and structures of records, depending on the purpose of the data file. The paddingalgorithm adds a random number of random bits (0/1) to the front and back of a bit-vector, regardless of its original length.
The issue with this straight-forward algorithm relates to the size of the database itself. If the number of bit-vectors in the databaseis large, then the bit-vectors will each be padded, counting the number in front and the numbers of bits in back, with a numberof bits at least equal to or greater than the total number of bit-vectors in the database. This amount of padding therefore maybe significantly more than the amount of padding needed and may significantly increase the overall size of the database.
Therefore, the padding algorithm can be modified slightly to allow the user to present it with a number M that is smaller thanN. Given the number M, the algorithm would generate a randompermutation of length M, and would then use this padding permutation repeatedly in the algorithm until all bit-vectors are padded.
Random Permutations Are Sufficient
Applying random permutations provides a significant level of security to the database without being burdened with encryptionand decryption during the course of using the database for data-mining activities. An important question to ask is whether usingrandom permutations are a secure method for anonymizing and padding the pTree-based vertical data structures.
A database management system may apply a secure block cipher to encrypt the reordering of the vertical data structure.Using format-preserving encryption and a secret key, the correct order of the data structures would serve as the plaintext and thecorresponding ciphertext would serve as the newly reordered (and encrypted) positions. Hence, the ciphertext output by a secureblock cipher is the reordering random permutation.
One issue with a block cipher in general, as it relates the problem discussed here, is that its cyphertext output is much longerthan its plaintext input. A block cipher expands the data, sometimes significantly. Another issue, again as it relates to our problem,is that the cyphertext is not formatted to match the input. These two issues together are called the format-preserving problem andblock ciphers that solvethem are called format-preserving encryption.
Implementation And Performance
The process of constructing and adding two numbers (as address offsets) to the address of a pTree (i) to find the true positionof the pTree in the reordered database and (ii) to find the true starting bit of the pTree within the padded bit-vector adds a minimalamount of time to vertical data mining operations. We note that the post pad length (the bits padded at the end of each pTree) need not be known. The authorized user will know the starting bit and the fixed table length and will therefore be able to extract the needed pTree without knowing the post pad length.
The database will most likely grow and shrink, as it reflects the growth and shrinkage of its data files during expected use andmaintenance of such a system. It is easy to understand how the security algorithms presented in this paper can be adaptedto expected changes to the data. For example, if data are inserted or deleted from an existing file, the representative set ofpTrees will change, but their anonymized position in the database do not change and the front-back paddings do not change.Also, if a new data file is created, then it can be secured into the database at the next round of periodic administratedre-securing of the database. Also, if a data file is deleted in its entirety, it can be marked as such and remain in thedatabase until the next round of periodic administrated re-securing.
References
[1]Stonebraker M, Abadi D, Batkin, et al. C-store: A column-oriented DBMS. Proceedings of the 31st Very Large Database Conference, Trondheim, Norway, 2005.
[2]Perrizo W, Ding Q, Ding Q, Roy A. Deriving high confidence rules from spatial data using Peano count trees. Pacific-Asian Data Mining, Conference, Springer-VerlagLectures In CS, 2001; 2118: 91-102.
[3]Ding Q, Khan M, Roy A, Perrizo W. p-tree algebra. Association of Computing Machinery (ACM) Symposium on Applied Computing 2002;426-431
[4]Perrizo W, Jockheck W, Perera A, Ren D, Wu W, Zhang Y. Multimedia data mining using p-trees. Lecture Notes In Computer Science 2003; 2797: 100-117.
[5]Ding Q, Ding Q, Perrizo W. PARM - an efficient algorithm to mine association rules from spatial data. IEEE Transactions Systems Man Cybernetics- Part B 2008; 38(6): 1513-1524.