Semantic Web Architecture and Applications

Semantic Web architecture and applications are the next generation in information architecture.

This paper reviews four generations of information management applications (Keywords, Statistical, Natural Language, Semantic Web) and defines their key features and limitations.

Organizations will migrate to Semantic Web architecture and applications in the next one to three years, repeating the 1981–1984 user-led migration from centralized mainframe-terminal systems with rigid applications to distributed Intel/Microsoft PC architecture and flexible VisiCalc applications.

First Generation - Keywords

Keyword technologies were originally used in IBM’s free-text retrieval system in the late 1960s. These tools are based on a simple scan of a text document to find a keyword or the root stem of a keyword. This approach can find keywords in a document, and can list and rank documents containing keywords. But these tools have no ability to extract the meaning of the word or root stem, and no ability to understand the meaning of the sentence.
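A minimal Python sketch of this approach, using an invented two-document corpus: scan each document for a keyword or root stem and rank by raw match count. Note that the stem “process” also matches “processor”, the false-positive problem described below.

    import re

    # Toy corpus; document IDs and contents are illustrative assumptions.
    documents = {
        "doc1": "The processor completes the process quickly.",
        "doc2": "A big dog and a large cat.",
    }

    def keyword_search(docs, stem):
        """Return (doc_id, match_count) pairs for words starting with `stem`."""
        pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
        results = []
        for doc_id, text in docs.items():
            hits = pattern.findall(text)
            if hits:
                results.append((doc_id, len(hits)))
        # Rank by raw match count; there is no notion of meaning or context.
        return sorted(results, key=lambda pair: pair[1], reverse=True)

    print(keyword_search(documents, "process"))  # also matches "processor"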

Advanced Search

Most keyword systems now include some form of Boolean logic (“AND”, “OR”) to narrow searches. This is often called “advanced search”. But using Boolean logic to exclude documents from a search is not “advanced”. It is an arbitrary means of shrinking the source set to reduce the number of documents retrieved. This “advanced search” significantly increases false negatives by missing many relevant source documents.
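A hedged Python sketch of Boolean narrowing over the same kind of toy corpus: “AND” intersects the per-word result sets, and “OR” unions them. It shows how narrowing with “AND” silently drops a relevant document that happens to use a synonym.

    # Toy corpus; contents are illustrative assumptions.
    documents = {
        "doc1": "big data platform",
        "doc2": "large data platform",  # relevant, but says "large", not "big"
    }

    def docs_containing(docs, word):
        """Set of document IDs whose text contains `word` exactly."""
        return {doc_id for doc_id, text in docs.items() if word in text.split()}

    # "big AND data" misses doc2 entirely: a false negative.
    print(docs_containing(documents, "big") & docs_containing(documents, "data"))
    # "big OR data" retrieves both documents.
    print(docs_containing(documents, "big") | docs_containing(documents, "data"))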

Applications:

Keyword tools are appropriate for finding word locations or producing a list of documents that contain specific, defined keywords and root stems. They are not capable of recognizing similar words, or of understanding the meanings of words, their relationships, or their context.

Problems:

The most common problems with keyword tools are: a) false negatives (no matches found because the word or stem is not exactly identical: “big” and “large”); b) false positives (too many unrelated matches found because a root stem matches many unrelated words: “process” and “processor”); and c) scale factors (keyword search tools produce very long, effectively random lists of documents when the source database is large, and the relevance rankings are highly misleading).

Examples:

The most common examples of keyword tools are website “Search” tools and the “Find” function (Ctrl+F) in Microsoft Office applications.

Second Generation - Statistical Forecasting

Statistical forecasting first finds keywords, and then calculates the frequency and distance of these keywords. Statistical forecasting tools now include many techniques for predictive forecasting, most often using inference theory. The frequency and distribution of words has some general value in understanding content, but it cannot capture the meaning of words or sentences, or provide context. These tools are still limited by keyword constraints, and can only infer simplistic meaning from the frequency and distribution of words.
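A minimal Python sketch of these statistics, with an invented scoring scheme (not any specific product’s algorithm): count term frequencies and measure the smallest distance between two keywords. Both clauses of the sample text produce identical statistics, which is exactly the word-order blindness described in the Problems list below.

    from collections import Counter

    def term_frequencies(text):
        """Raw word counts: the basic statistical signal."""
        return Counter(text.lower().split())

    def min_distance(text, w1, w2):
        """Smallest number of word positions separating w1 and w2, or None."""
        words = text.lower().split()
        pos1 = [i for i, w in enumerate(words) if w == w1]
        pos2 = [i for i, w in enumerate(words) if w == w2]
        if not pos1 or not pos2:
            return None
        return min(abs(i - j) for i in pos1 for j in pos2)

    text = "man bites dog ; dog bites man"
    print(term_frequencies(text).most_common(3))  # same counts for both clauses
    print(min_distance(text, "man", "dog"))       # distance cannot recover word order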

Applications:

Statistical forecasting tools are appropriate for performing simple document searches where the desired output is a list of documents containing specific words, which end users must then read, classify, and summarize manually. They are not capable of understanding the meaning, context, or relationships of documents.

Problems:

The most common problems with statistical forecasting tools are: a) the keyword limitations of false positives and false negatives; b) misunderstanding the meaning of words and sentences (“man bites dog” is treated the same as “dog bites man”); c) lack of context: “Duke” could be the Duke of Windsor, the Duke of Earl, or John Wayne; and d) scale factors: a single statistical relevance ranking creates huge “Google”-style lists of many irrelevant documents (“you have 100,000 hits”).

Examples:

The most common statistical forecasting tool is Google, along with many other tools that use inference theory and similar analytical and predictive algorithms.

Third Generation - Natural Language Processing

Natural language processors focus on the structure of language. They recognize that certain words in each sentence (nouns and verbs) play a different role (subject-verb-object) than others (adjectives, adverbs, articles). This understanding of grammar increases the understanding of keywords and their relationships (“man bites dog” is different from “dog bites man”). But these tools cannot extract an understanding of the words or their logical relationships beyond basic grammar, and they cannot perform any information summary, analysis, or integration functions.
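A hedged sketch of this kind of grammatical analysis using the spaCy library (assuming the en_core_web_sm model is installed): the dependency labels distinguish subject from object, so the two sentences are no longer equivalent.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English model, assumed installed

    for sentence in ["man bites dog", "dog bites man"]:
        doc = nlp(sentence)
        # Each token receives a grammatical role (dependency label).
        print(sentence, "->", [(token.text, token.dep_) for token in doc])

    # Typical output labels "man" as nsubj and "dog" as dobj in the first
    # sentence, and the reverse in the second, so the two parses differ.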

Applications:

Natural language tools are appropriate for linguistic research and word-for-word translation applications where the desired output is a linguistic definition or a translation. These are not capable of understanding the meaning or context of sentences in documents, or integrating information within a database.

Problems:

The most common problems with linguistic tools are: a) the keyword limitations of false positives and false negatives; and b) misunderstanding the context (does “I like java” refer to an island in Indonesia, a computer programming language, or coffee?). Without understanding the broader context, a linguistic tool has only a dictionary definition of “java” and does not know which “Java” is relevant, or what other data relates to that specific “Java” concept.
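This ambiguity is easy to see with a dictionary-style lookup, sketched here with NLTK’s WordNet interface (assuming the wordnet corpus has been downloaded): the lookup returns several senses of “java” but offers no way to choose among them.

    from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

    # A dictionary lookup lists the senses of "java" (island, coffee,
    # programming language) but cannot say which one a sentence intends.
    for synset in wordnet.synsets("java"):
        print(synset.name(), "-", synset.definition())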

Examples:

The most common natural language tools are translation programs, which use dictionary lookup tables and language-specific grammar rules to convert source languages to target languages.

Fourth Generation – Semantic Web Architecture and Applications

Semantic Web architecture and applications are a dramatic departure from earlier database and application generations. Semantic processing includes the earlier statistical and natural language techniques, and enhances them with semantic processing tools. First, Semantic Web architecture is the automated conversion and storage of unstructured text sources in a Semantic Web database. Second, Semantic Web applications automatically extract and process the concepts and context in the database through a range of highly flexible tools.

a. Architecture; not only Application

First, the Semantic Web is a complete database architecture, not only an application program. Semantic Web architecture combines a two-step process. First, a Semantic Web database is created from unstructured text documents. Then, Semantic Web applications run against the Semantic Web database, not the original source documents.

The Semantic Web architecture is created by first converting text files to XML and then analyzing them with a semantic processor. This process captures the meaning of the words and the grammar of each sentence, and also the semantic relationships of the context. These meanings and relationships are then stored in a Semantic Web database. The Semantic Web is similar to the schematic logic of an electronic device or the DNA of a living organism. It contains all of the logical content AND context of the original source, and it links each word and concept back to the original document.
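A minimal sketch of what such a database could look like, using the Python rdflib library; the namespace, URIs, and hand-written triples stand in for the output of the semantic processor and are illustrative assumptions.

    from rdflib import Graph, Namespace, URIRef

    EX = Namespace("http://example.org/")  # hypothetical vocabulary
    g = Graph()

    # Step 1: a semantic processor would emit triples like these from the
    # source text; here they are written by hand for illustration.
    g.add((EX.java_island, EX.isA, EX.Island))
    g.add((EX.java_island, EX.locatedIn, EX.Indonesia))
    g.add((EX.java_island, EX.extractedFrom, URIRef("http://example.org/doc/42")))

    # Step 2: each concept stays linked back to its original document.
    for concept, _, doc in g.triples((None, EX.extractedFrom, None)):
        print(concept, "links back to", doc)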

Semantic Web applications directly access the logical relationships in the Semantic Web database. They can efficiently and accurately search, retrieve, summarize, analyze, and report discrete concepts or entire documents from huge databases.

A search for “Java” links directly to the three Semantic Web logical clusters for “Java” (the island in Indonesia, the computer programming language, and coffee). The processor can then ask the user which “Java” is intended, and expand the search to all other concepts and documents related to that specific “Java”.
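Continuing the rdflib sketch above, this disambiguation step could be expressed as a SPARQL query over invented cluster URIs; everything below is illustrative, not a specific product’s schema.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    for cluster in (EX.java_island, EX.java_language, EX.java_coffee):
        g.add((cluster, EX.label, EX.Java))

    # Find every concept cluster labeled "Java"; the application can then
    # ask the user which cluster to expand.
    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?cluster WHERE { ?cluster ex:label ex:Java . }
    """)
    for row in results:
        print(row.cluster)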

b. Structured and Unstructured Data

Second, Semantic Web architecture and applications handle both structured and unstructured data. Structured data is stored in relational databases with static classification systems, and also in discrete documents. These databases and documents can be processed and converted to Semantic Web databases, and then processed together with unstructured data.

Much of the data we read, produce, and share is now unstructured: emails, reports, presentations, media content, web pages. And these documents are stored in many different formats: plain text, email files, Microsoft word processing, spreadsheet, and presentation files, Lotus Notes, Adobe PDF, and HTML. It is difficult, expensive, slow, and inaccurate to attempt to classify and store these in a structured database. All of these sources can be automatically converted to a common Semantic Web database, and integrated into one common information source.

c. Dynamic and Automatic; not Static and Manual

Third, Semantic Web database architecture is dynamic and automated. Each new document that is analyzed, extracted, and stored in the Semantic Web expands the logical relationships in all earlier documents. These expanding logical relationships increase the understanding of the content and context of each document, and of the entire database. The Semantic Web conversion process is automated: no human action is required to maintain a taxonomy, metadata tags, or classifications. The semantic database is constantly updated and becomes steadily more accurate.
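A hedged rdflib sketch of this dynamic growth, again with invented URIs: merging triples from a new document immediately extends what can be reached from concepts stored earlier, with no taxonomy or tagging step.

    from rdflib import Graph, Namespace, URIRef

    EX = Namespace("http://example.org/")
    g = Graph()

    # Triples extracted earlier: document 1 mentions the Java language.
    g.add((EX.java_language, EX.extractedFrom, URIRef("http://example.org/doc/1")))

    # A new document relates the Java language to the JVM; adding its
    # triples requires no manual classification.
    g.add((EX.java_language, EX.runsOn, EX.jvm))
    g.add((EX.jvm, EX.extractedFrom, URIRef("http://example.org/doc/2")))

    # A query starting from the JVM now also reaches document 1 through
    # the shared java_language node, although doc 1 never mentions the JVM.
    for subject, _, _ in g.triples((None, EX.runsOn, EX.jvm)):
        for _, _, doc in g.triples((subject, EX.extractedFrom, None)):
            print("JVM connects to", doc)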

Semantic Web architecture is different from relational database systems. Relational databases are manual and static because they depend on a manual process of taxonomy maintenance, metadata tagging, and document classification in static file structures. Documents are manually captured, read, tagged, classified, and stored in a relational database only once, and are not updated. More important, adding new documents and information to a relational database does not make the database more “intelligent” about the concepts, relationships, or documents it contains.

d. From Machine Readable to Machine Understandable

Fourth, Semantic Web architecture and applications support both human and machine intelligence systems. Humans can use Semantic Web applications on a manual basis, and improve the efficiency of search, summary, analysis, and reporting tasks. Machines can also use Semantic Web applications to perform tasks that humans cannot do, because of the cost, speed, accuracy, complexity, and scale of those tasks.

e. Synthetic vs. Artificial Intelligence

Semantic Web technology is NOT “Artificial Intelligence”. AI was a mythical marketing goal of creating “thinking” machines. The Semantic Web supports a much more limited and realistic goal: “Synthetic Intelligence”. The concepts and relationships stored in the Semantic Web database are “synthesized”, or brought together and integrated, to automatically create a new summary, analysis, report, email, or alert, or to launch another machine application. The goal of Synthetic Intelligence information systems is to bring together all information sources and user knowledge, and to synthesize these in global networks.

Future of Information Management: Network Spreadsheets for Ideas

The future of information management will be based on Semantic Web architecture and applications. The most important question is which technologies and firms will take immediate leadership in driving the migration, and therefore guide the information architecture of the future.

1) Tidal Wave of Information Shifts Power

End users and corporations will drive the rapid expansion of Semantic Web architecture and applications in order to survive the tidal wave of data, and to improve costs, speed, and performance. IT management will either resist or accelerate this trend. Information power will shift from database managers back to departments and end users, just as the PC and spreadsheet did in 1981-1984.

2) Migration to XML and RDF Standards

Application programs will follow Microsoft’s migration to XML standards for document authoring and exchange. XML and RDF standards will become the dominant approach for capturing, understanding, storing, and exchanging external document descriptions and document content. Unstructured text documents become synthetic expert networks.

3) Universal Internet Web Portals

Information access will migrate to web portals, both within organizations and among the general population; and web portals based on Semantic Web applications will become the central user application. Operating systems and legacy applications will become transparent beneath Semantic Web portals with highly flexible applications: network spreadsheets for ideas.

4) Parallel Legacy Database Integration

Legacy databases will be extracted into parallel Semantic Web architecture databases to provide access to fragmented sources. Parallel architecture dramatically reduces the costs, risks, and schedules of the ERP “tear down and rebuild” approach: a transparent grid architecture.

5) Global and Language Expansion

Information sources, users, and entities will expand globally and support many languages. Because Semantic Web architectures and applications “learn and think” in the original language, the production and exchange of multi-language information between language domains will increase dramatically: for example, interactive Japanese-language sources on China, accessed in English.

6) Network Access and Distribution

Networks will get better, faster, cheaper, wireless, and distributed. Semantic Web architecture and applications will expand to link global data sources, from mainframe servers, desktop workstations, and laptops to handheld PDAs and cell phones: voice-driven expert systems.

7) Network Transactions and Capacity

Human transactions will grow slowly, and machine transactions will grow exponentially. The migration from human to machine intelligence transactions will rapidly take over private and public networks. This rapid capacity demand will force a major increase in network hardware investment and stimulate new value-added network services, as with Japan’s DoCoMo mobile network.
