OKN: Open Knowledge Network

Creating the Semantic Information Infrastructure for the Future[1]

RV Guha, Schema.org; Andrew Moore, Carnegie Mellon University

Motivation

Natural interfaces to large knowledge structures have the potential to impact science, education and business to an extent comparable to the WWW. We are already seeing the first wave of this in consumer services such as Siri, Cortana and Alexa. But these services are limited in their scope of knowledge, are not open to direct access or contribution beyond their corporate firewalls, and can answer only relatively limited questions in their business areas. We now have the technology and know-how to expand to thousands of new topic areas and many more useful classes of questions, if we mount an open effort. For example, the following kinds of questions could be supported:

Which Hodgkin’s Lymphoma treatments are covered under the Affordable Care Act for my mother?

Which US representatives and senators from California received MBAs?

Did the recent rainstorm in Oakland cause unusual pollution levels in local streams?

Have there been unusual clusters of earthquakes in the US in the past six months?

How much of Chicago's power has transitioned to solar?

What do the cells in capillary systems of liver tumors unresponsive to sorafenib have in common?

The architecture should allow people to encode knowledge for their topics of interest and hook it into the larger network, without having to go through gatekeepers (such as Google or Apple).

Once this knowledge is encoded, access to it should not be restricted to a small priesthood of SQL or other programmatic-interface users. There will be a wide range of interfaces, including natural language interfaces, graphical interfaces, and visualizations that no one has yet invented. Developers will be able to independently create more sophisticated programs for answering queries and providing summaries that help ordinary people make decisions in their lives.

A critical aspect of this vision is the availability of an open web-scale knowledge network. What is a knowledge network? First, it aspires to be a listing of every known concept from the worlds of science, business, medicine and human affairs. Second, it includes not merely raw data, but semantic information: for example, how different concepts relate to each other ([John F Kennedy] was [male], a [US] citizen, and held the [office] of [President of the United States] [during] the time period [{Jan 20 1961} to {Nov 22 1963}]). Third, it is machine readable: such a collection might be conceptually similar to, say, Wikipedia, or textbooks, or published scientific knowledge, but would be at web scale (trillions of concepts) and machine understandable (not in natural language, but in data structures that machines can piece together to provide expert advice on detailed questions).
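As a concrete illustration of "machine understandable," the sketch below (purely illustrative; the identifiers and predicate names are not a proposed OKN format) encodes the facts about John F. Kennedy as simple subject-predicate-object triples that a program can query directly:

    from typing import NamedTuple

    class Triple(NamedTuple):
        subject: str
        predicate: str
        obj: str

    # The JFK facts above, expressed as data rather than prose.
    facts = [
        Triple("JohnFKennedy", "gender", "Male"),
        Triple("JohnFKennedy", "citizenOf", "UnitedStates"),
        Triple("JohnFKennedy", "heldOffice", "PresidentOfTheUnitedStates"),
        Triple("JohnFKennedy", "officeStart", "1961-01-20"),
        Triple("JohnFKennedy", "officeEnd", "1963-11-22"),
    ]

    # A program can answer "who held the office of President?" by matching
    # structure instead of parsing natural language.
    presidents = [t.subject for t in facts
                  if t.predicate == "heldOffice"
                  and t.obj == "PresidentOfTheUnitedStates"]
    print(presidents)  # ['JohnFKennedy']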

The current situation with knowledge networks is reminiscent of the mid-1980s in computer networking: at that time, many proprietary, disconnected islands of networking technology were in existence (e.g., AOL, Prodigy, CompuServe, IBM, DEC, etc.). The subsequent advent of the Internet and the Web, with their open protocols, generated an explosion in innovation across all aspects of networking. This enabled "permission-less innovation," with anyone able to create and publish a website and thus become a part of the web.

Why now?

The success of Siri, Google Now, and other digital assistants has left many thirsting for similar interfaces for their domains and applications. However, the fundamentally proprietary nature of these services, together with the very high startup cost of these systems, leaves them with few options. In particular, the long tail of more specialized areas, such as scientific topics, remains unserved.

The second reason for urgency is that we, and all other advanced economies, now see that this vision is likely reachable, and so there is now a race that did not exist five years ago. This new confidence is due to strong "existence proofs" for the success of this approach, e.g., Google Search, online retail and commerce (e.g., Amazon.com), and IBM's Watson for Oncology, even if they are proprietary in nature or confined to specific domains. An open initiative would allow for full national experimentation: supporting research and innovation in academia, enabling industry to experiment, and enabling government to create new services using Open Data. This is similar to the impact of the open source software ecosystem: one can expect an open knowledge network to accelerate innovation, resulting in transfer of technology and other interactions between the open and proprietary data environments/systems.

The third reason for urgency is a pragmatic business case: small “island” attempts to fuse concepts between government agencies have been frustrating. For example, the laudable open.gov goal has been impeded by the lack of a national infrastructure for sharing semantic information.

Getting Started

Our goal is to create the largest possible Open Knowledge Network (OKN). OKN would evolve continuously with new data and information. OKN is envisioned as a distributed, federated system that anyone can participate in. Its open nature will allow participants to bring in data from a wide range of sources: web crawls, scientific databases, PubMed, natural language processing systems, and so on. To succeed, OKN would need to provide incentives that encourage stakeholders from around the world to contribute data and algorithms to the system.

It is important that statements in OKN have context and provenance information associated with them so machines can trace who made a particular claim and end users/applications can decide which data are safe to use. One must also be aware that the roadblocks to creating such an open system may eventually be “non-technical” in nature, including legal issues and appropriate incentives for participation.
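One way such context might be carried, shown here purely as an illustrative sketch (the field names are assumptions, not an OKN standard), is to attach source, retrieval date, and license metadata to each statement so that applications can filter claims by policy:

    from dataclasses import dataclass

    @dataclass
    class Statement:
        subject: str
        predicate: str
        obj: str
        source: str     # who made the claim
        retrieved: str  # when the claim was obtained
        license: str    # what consumers may do with it

    # A made-up claim from a hypothetical source, for illustration only.
    claim = Statement("ExampleCity", "population", "52000",
                      source="https://stats.example.org",
                      retrieved="2016-07-29",
                      license="CC-BY-4.0")

    def usable(stmt, trusted_sources):
        # A simple application-level policy: only use claims whose
        # provenance points to a source the application trusts.
        return stmt.source in trusted_sources

    print(usable(claim, {"https://stats.example.org"}))  # True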

OKN is envisioned as a piece of the Internet's infrastructure. The initiative would begin by populating an initial knowledge network and identifying initial application domains and scenarios. We intend to approach the National Institute of Standards and Technology (NIST) to ascertain how standards can help our initiative.

In 3 months:
  1. Create the Open Knowledge Network Alliance (OKN-A), as a non-profit organization, with industry and academic participation.
  2. Assemble an initial set of cloud storage and computing resources; populate an initial knowledge base containing, say, 1 billion triples (a triple is a simple form of the semantic information described above), as a start.
  3. Arrange for an initial seed grant to help with the initial setup.
  4. Identify a few application domains and describe simple initial application examples.
In 6 months:
  1. Run a community workshop to launch the Open Knowledge Network and develop a research agenda.
  2. Fund exploratory research grants, e.g., to further populate OKN; connect specific scientific and/or government data repositories; etc.
  3. Enlist support from other industry, foundation, and government agency sources, e.g., NSF, NIH, NIST, DOE, DARPA, and NSA, to support OKN projects.
  4. Devise some initial competitions to de-duplicate, merge and aggregate low-level entity descriptions in a way that supports scalability, diversity of inputs, and downstream reasoning that agrees with human ontologists.
In 1 year:
  1. Develop a full-blown research agenda, sustained data infrastructure, and a vibrant research community. This includes TREC-like competitions for technologies to make the OKN network and its front ends as useful as possible.
  2. Connect with NSF CISE directorate for technology-based research and other NSF directorates for content-based research, and also connect with other agencies.

Existing Technologies

This initiative is possible because of a series of previous successful experiments and component technologies. In a nutshell:

Big Data

The US has invested heavily, and with great dividend, in techniques to infer useful facts and relations from streams of data from sensors and transactional databases. This has already transformed the pace of development in science, medicine, law, manufacturing and transportation. We now have very large repositories of observed concepts and relations.

The Science of Representing Knowledge

Intelligent systems need to represent and reason about large, complex domains. There is a rich corpus of research on symbolic knowledge representation that this effort can draw upon. The past decade has seen tremendous advances in the more statistical aspects of Artificial Intelligence (AI), especially in machine learning and statistical approaches to natural language understanding and question answering. However, these two approaches have remained largely isolated from each other. Bridging this gap is an important research challenge that will help build the next generation of question answering systems.

Scale of Concepts and Relations

Cyc was the first attempt to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge; its scale was tens of thousands of common generic objects. Since then, the systems that drive consumer applications such as shopping websites or car navigation systems have found that they need to contain hundreds of millions of items to serve their business domains. How much data is sufficient for an open knowledge network? A web crawl at the scale of Google, Baidu or Bing may gather on the order of four trillion concepts and relations from the open web. Noncommercial items, such as all detected astronomical objects, geological objects, symptom presentations and ocean currents, will be several orders of magnitude larger. OKN may initially aspire to produce a network of tens of billions of entities, and then grow, say, to the scale used by Microsoft Cortana or Amazon Echo.

Representing Hypotheses and Uncertainty

Cyc's legacy includes other systems that scale to millions of concepts, such as Freebase, Satori, and common retail and navigation catalogs such as CNET and Navteq. The representation of relations in models such as Freebase is simple triples of the form (concept X has relationship Y with concept Z), such as ([Belvita cookies] [contain] [wheat flour]). The extremely limited expressiveness of simple triple-based systems poses many challenges when it comes to complex question answering that may involve conclusions not explicitly stated in the knowledge graph. It is clear that to answer the kinds of questions we are interested in, we will need both more expressive constructs from symbolic KR and probabilistic representations that enable us to combine evidence from multiple sources.
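As a sketch of what a probabilistic layer over triples could look like (under the simplifying, and often unrealistic, assumption that sources are independent), each source's assertion can carry a confidence, and evidence for the same fact can be combined, for example with a noisy-OR rule:

    from collections import defaultdict

    # (subject, predicate, object) -> confidences reported by different sources
    evidence = defaultdict(list)
    evidence[("BelvitaCookies", "contains", "WheatFlour")] += [0.9, 0.7]

    def combined_confidence(confidences):
        # Noisy-OR: probability that at least one independent source is correct.
        p_all_wrong = 1.0
        for c in confidences:
            p_all_wrong *= (1.0 - c)
        return 1.0 - p_all_wrong

    for fact, confs in evidence.items():
        print(fact, round(combined_confidence(confs), 3))  # ... 0.97

Real systems would also need calibrated confidences and correlation-aware combination; this only illustrates the basic idea of weighing multiple pieces of evidence.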

Tera-scale knowledge networks

As we ramp up to trillions of items, the technology to manage the data becomes a limiting factor. Possibly the largest collection of concepts in the world today consists of annotations on web pages called tags. For example, one class of tag, which provides machine-readable identifiers for concepts and relations on open web pages, is the schema.org tag. Schema.org markup is estimated to represent trillions of facts on the open web, but turning these into a knowledge base requires computation at the scale of a major internet company, and resolving the hundreds of millions of different namespaces used by the tags requires big data algorithms of almost unprecedented scale.
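To make this concrete, the sketch below shows how a crawler might turn a single schema.org JSON-LD annotation into naive triples; the annotation is invented for illustration, and the hard problems of identifier resolution across namespaces are deliberately ignored:

    import json

    # A made-up schema.org JSON-LD annotation of the kind embedded in web pages.
    jsonld = """
    {
      "@context": "https://schema.org",
      "@type": "Book",
      "name": "An Example Book",
      "author": "Jane Doe"
    }
    """

    node = json.loads(jsonld)
    subject = node["name"]
    # Keep the non-"@" properties as naive (subject, property, value) triples.
    triples = [(subject, prop, value)
               for prop, value in node.items()
               if not prop.startswith("@")]
    print(triples)
    # [('An Example Book', 'name', 'An Example Book'),
    #  ('An Example Book', 'author', 'Jane Doe')]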

Representation Languages

Another example of past work is KIF (Knowledge Interchange Format). KIF was ahead of its time (in 1992), when the WWW had not yet found widespread adoption. KIF also made the mistake of focusing purely on the syntax of the exchange language and assumed that no standard schemas would be required. Experience with the Semantic Web and Schema.org has shown that one does need a minimum number of shared schemas so that data from different sources can be merged in a fashion that applications can consume. It is also important for the system to include notions of provenance and compliance, i.e., where did the data and information come from? What can I do with it? Who can I give it to? What are my contractual obligations with the data? How long can I keep the data?
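The value of shared schemas can be illustrated with a small sketch: two sources describe the same fact with different property names, and a shared vocabulary mapping (the names below are assumptions for illustration only) lets an application merge them into one consistent set:

    source_a = [("CityHall", "constructionYear", "1911")]
    source_b = [("CityHall", "yearBuilt", "1911")]

    # A shared vocabulary mapping source-specific property names onto one schema.
    SHARED_SCHEMA = {
        "constructionYear": "dateBuilt",
        "yearBuilt": "dateBuilt",
    }

    def normalize(triples):
        return {(s, SHARED_SCHEMA.get(p, p), o) for (s, p, o) in triples}

    merged = normalize(source_a) | normalize(source_b)
    print(merged)  # {('CityHall', 'dateBuilt', '1911')}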

Roles for the Research, Commercial, and Government Sectors

Researchers could contribute by:

●Collecting and incorporating facts/assertions from new sources.

●Developing new big data technologies for aggregating, disambiguating, and resolving references, and for maintaining provenance (the history of where a concept or relationship came from); a toy sketch of reference resolution follows this list.

●Providing support for cross-domain inferences.

●Addressing, in an open academic forum, the design decisions around privacy and societal expectations regarding storage and dissemination of knowledge. For example, there would be well-justified and grave public concern if a politically charged historical account were to be included as a fact rather than a reported assertion. This is a topic for linguists, digital humanities experts and ethicists to work on in collaboration with computer scientists and statisticians.

●Studying scenarios for supporting multiple schemas created with the same data.

●Studying how to support free text assertions (schemas), and how they can be treated as evidence of knowledge.

●How to transcend from "narrow AI" to "broad AI"? How to efficiently learn and transfer structure, knowledge, and experience from one application domain to another? A system like IBM Watson, for example, is really a family of siloed knowledge bases. Can OKN lead to insights on how a common knowledge infrastructure could be created to make this knowledge transfer across domains much easier and more efficient?

●Designing for dynamic growth of knowledge: how to verify and modify existing assertions when new assertions come in.

●Though OKN is an open system, it may include links to proprietary knowledge bases. How does one address security, access control, knowledge representation, and inference in such an environment? How does one combine proprietary facts with “open facts” in an open architecture?

●Implementing compliance with legal/contractual constraints: who owns the data (and therefore who might own the derived information)? Who has the right to use the data and its derivations, and for what purpose?

●Defining "Grand Challenges" related to populating and using the knowledge network.

●Multilingual knowledge. We have the opportunity to engineer the system to provide knowledge that can be used to support dialogues in any world language.
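As a toy illustration of the reference-resolution challenge mentioned in the list above, the sketch below clusters entity mentions by a crude normalized key; production systems would rely on far richer features, provenance tracking, and auditable merge decisions:

    from collections import defaultdict

    mentions = ["John F. Kennedy", "john f kennedy", "J. F. Kennedy", "Jane Doe"]

    def key(mention):
        # Crude normalization: lowercase, drop punctuation, collapse whitespace.
        cleaned = "".join(ch for ch in mention.lower()
                          if ch.isalnum() or ch.isspace())
        return " ".join(cleaned.split())

    clusters = defaultdict(list)
    for m in mentions:
        clusters[key(m)].append(m)

    print(dict(clusters))
    # {'john f kennedy': ['John F. Kennedy', 'john f kennedy'],
    #  'j f kennedy': ['J. F. Kennedy'], 'jane doe': ['Jane Doe']}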

Commercial companies could contribute by:

●Using startups and business-development units of existing major companies to build new consumer applications, ranging from a Bosch washing machine that looks up the optimal conditions for the items it detects on its racks, to a fresh question-answering app a student develops about the prerequisites for the classes in their school.

●Incorporating open knowledge network data into their own front ends. This is analogous to the great success of Wikipedia and public academic citation databases in strengthening the quality of result sets in existing search engines.

Government agencies could contribute by:

●Working with the OKN community to contribute sections of the knowledge network corresponding to the agency’s domains of expertise and interest.

●Identifying use cases to demonstrate new applications enabled by the OKN, including those relating to strategic areas, e.g., the Precision Medicine Initiative, the Materials Genome Initiative, Smart and Connected Communities, Smart Manufacturing, etc.

●Developing new agency applications utilizing the OKN.

●Funding R&D activities related to the development of OKN.

●Funding challenges and competitions to enable creation and use of OKN.


[1] Based on discussions at the Entities, Facts, Questions, Answers (EFQA) Meeting held on Friday, July 29th, 2016, at the White House Office of Science and Technology Policy (OSTP), Eisenhower Executive Office Building, Washington, DC, sponsored by the Networking and Information Technology Research and Development (NITRD) Big Data Interagency Working Group. The list of attendees is in the last section of the document.