Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases

Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases

Abstract

The Resource Description Framework (RDF) is a general framework for how to describe any Internet resource such as a Web site and its content. An RDF description (such descriptions are often referred to as metadata, or "data about data") can include the authors of the resource, date of creation or updating, the organization of the pages on a site (the sitemap), information that describes content in terms of audience or content rating, key words for search engine data collection, subject categories, and so forth.Resource Description Framework has been widely used in the Semantic Web to describe resources and their relationships. The RDF graph is one of the most commonly used representations for RDF data. However, in many real applications such as the data extraction/integration, RDF graphs integrated from different data sources may often contain uncertain and inconsistent information (e.g., uncertain labels or that violate facts/rules), due to the unreliability of data sources. In this paper, we formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-aware subgraph matching over inconsistent probabilistic RDF graphs , which retrieves subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and with high quality scores (considering both consistency and uncertainty). In order to efficiently answer QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an effective index to facilitate our proposed pruning methods, and propose an efficient approach for processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our proposed approaches through extensive experiments.

EXISTING SYSTEM

probabilistic graphs are often obtained from real-world applications such as the data extraction/integration in the SemanticWeb. Due to the unreliability of data sources or inaccurate extraction/integration techniques, probabilistic graph data often contain inconsistencies, violating some rules or facts. Here, rules or facts can be specified by knowledge base or inferred by data mining techniques. RDF graphs integrated from different data sources may often contain uncertain and inconsistent information , due to the unreliability of data sources. we formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and with high quality scores (considering both consistency and uncertainty.

PROPOSED SYATEM

In this paper, we propose the quality-aware subgraph matching problem (namely, QA-gMatch) in a novel context of inconsistent probabilistic graphs G with quality guarantees. Specifically, given a query graph q, a QA-gMatch query retrieves subgraphs g of probabilistic graph G that match with q and have high quality scores. The QA-gMatch problem has many practical applications such as the Semantic Web. For example, we can answer standard queries, SPARQL queries, over inconsistent probabilistic RDF graphs by issuing QA-gMatch queries. we will propose effective pruning methods, namely adaptive label pruning (based on a cost model) and quality score pruning, to reduce the QAgMatch search space and improve the query efficiency.

Advantages

· We propose the QA-gMatch problem in inconsistent probabilistic graphs, which, to our best knowledge, no prior work has studied.

· We carefully design effective pruning methods, adaptive label and quality score pruning, specific for inconsistent and probabilistic features of RDF graphs.

· We build a tree index over pre-computed data of inconsistent probabilistic graphs, and illustrate efficient QA-gMatch query procedure by traversing the index.

IMPLEMENTATION

MODULES DESCRIPTION

1. Possibles of a probabilistic graph

2. QA-gMatch in an inconsistent probabilistic graph

3. Inconsistencies in Probabilistic Graphs

4. Adaptive Label Pruning

Possibles of a probabilistic graph

The possible worlds semantics of a probabilistic graph. Specifically, each possible world is a materialized instance of the probabilistic graph, which corresponds to a certain graph with one label assignment in vertices that may appear in the real world. We formally define possible worlds in a probabilistic graph. Given a probabilistic graph G, a possible world, pw(G), of G is a materialized graph where each vertex vi ∈ V (G) is assigned with a certain label l(vi). Then, the appearance probability, Pr{pw(G)}, of a possible world pw(G) is given by: Pr{pw(G)} = Pr = ∀vi ∈V (G) l(vi).p. (1) where Xi is a variable of possible label l(vi) that vertex vi may be assigned with.

QA-gMatch in an inconsistent probabilistic graph

The quality-aware subgraph matching problem over inconsistent probabilistic graphs G, under the all-possible-repair semantics (i.e., exploring all possible repairs on possible worlds of probabilistic graphs). Specifically, our problem retrieves subgraphs in G that match with a given query graph and have their quality scores higher than a threshold. Given an inconsistent probabilistic graph G, a fact table FT, a query graph q, and a threshold αg, a quality-aware subgraph matching in inconsistent probabilistic graph (QA-gMatch) retrieves subgraphs g ⊆ G such that: • g is isomorphic to q (denoted as g ≡ q); and • the quality score score(g) > αg. where score(g) is defined as: score(g) = ∀pw(G) Pr{pw(G)} · χ(∃pwR(G), g ⊆ pwR(G) ∧ g ≡ q) ·W(g). (3) Here, if z is true, χ(z) = 1; otherwise, χ(z) = 0. Moreover, g ≡ q is true, if g and q are isomorphic.

Inconsistencies in Probabilistic Graphs

Adaptive Label Pruning

The adaptive label pruning method specific for probabilistic graphs, which adaptively encodes label/structural information in signatures and filters out false alarms of QA-gMatch candidates via signatures. Here, the design of signatures takes into account a special feature of probabilistic RDF graphs, that is, some vertices in graphs may incur high degrees. Thus, our basic idea of the signature design is to adaptively allocate more space for encoding vertices with high degrees (less space for those with low degrees), and maximize the pruning power. We also design a cost model to guide the signature generation.

System Requirements:

H/W System Configuration:-

Processor - Pentium –III

Speed - 1.1 Ghz

RAM - 256 MB(min)

Hard Disk - 20 GB

Key Board - Standard Windows Keyboard

Mouse - Two or Three Button Mouse

Monitor - SVGA

S/W System Configuration:-

v Operating System :Windows95/98/2000/XP

v Application Server : Tomcat5.0/6.X

v Front End : HTML, Java, Jsp

v Scripts : JavaScript.

v Server side Script : Java Server Pages.

v Database Connectivity : Mysql.

Architecture Diagram

Adaptive label pruning

Algorithm

Subgraph Matching

QAProcedure

QA-gMatch Processing {

Input: an inconsistent probabilistic RDF graph G, a tree index I over G, a query graph q, and a score threshold αg

Output: subgraphs g that are QA-gMatch answers

(1) pre-process q to obtain label signatures and numbers of vertices/edges

(2) initialize a candidate set cand(qi) for each vertex qi ∈ V (q)

(3) for each entry Nz in root(I)

(4) for each qi

(5) if Nz cannot be pruned by qi via adaptive label and quality score pruning

(6) add Nz to cand(qi)

(7) while cand(・) is not empty

(8) for each Nz ∈∀i cand(qi)

(9) remove Nz from cand(qi)

(10) if Nz is a leaf node

(11) for each subgraph gunder Nz

(12) apply adaptive label pruning and quality score pruning to gw.r.t., qi

(13) if g cannot be pruned

(14) add g to candnew(qi)

(15) else // intermediate node

(16) for each child node Nc under Nz

(17) apply adaptive label pruning and quality score pruning to Nc w.r.t., qi

(18) if Nc cannot be pruned

(19) add Nc to candnew(qi)

(20) if leaf level is not encountered, then cand(qi) = candnew(qi) for all qi

(21) refine candidate graphs g by joining candidates in candnew(qi) and return

the QA-gMatch answers