1. Overall Project Objectives and Significance
The growth of the Internet and the World Wide Web (WWW) has led to the creation and implementation of many hyperlinked information networks, including the Web itself. [i don’t feel like this first sentence is saying enough…shorten or change?] This explosion in the availability of information has created a need for tools that assist users in navigating and managing hyperlinked information sources. Our research goal is to develop an agent-based system that helps users manage their personal information spaces and locate useful documents within a large [multi-source] information network. [mention “pushing” information to users…exposing the context of documents…impromptu learning…maybe also say something about “exploiting” the fact that our networks and processing are now fast enough to allow …]
We posit that Web users maintain a personal information space, or personal web, which consists of documents they have created or edited and sites they regularly view.
[concerning the below, I’m not sure if following links outward is “extending” the personal web. I like to think it more as there being information at various degrees of separation from the personal space, and that webtop allows for seamless navigation between the personal space and other spaces (the www space, a group space, or some other info source space, including some other user’s space.
Following the links within a particular document allows a user to extend this web. For example, locating pages that link to my home page, or finding the home pages of authors referenced in this proposal, will extend my personal web. This personal web can then be extended another step outward, for example by finding the home pages of those people mentioned in pages that link to my home page. We refer to a connection between documents such as this, based on their explicit or implicit reference to each other, as an associative link. As the web of associative links is expanded, a user is able to discover new documents that potentially satisfy their needs. However, since this web grows exponentially as it expands, a user is presented with both a retrieval problem (finding the associative links) and an organizational problem (ordering and displaying this information). As a means of confronting these problems, we propose to develop an agent-based system that will automatically search and expand the potentially promising areas of its user’s personal web and display those documents deemed most helpful in assisting the user with her current task. We refer to an agent that engages in this sort of document association as an associative thinking agent. [may want to say that the user and agent work together…i.e., the user will sometimes want to explicitly choose an information source or sources (e.g., an expert’s space). At other times, he might want the agent to implicitly find sources related to the current topic.]
WebTop is a prototype associative thinking agent that displays contextual information from the user’s personal web and from various information sources, including the Web (Google) and the Internet Archive. A screenshot of WebTop is included below in Figure 1. Three types of links are currently supported: outward links (pages that a document links to), inward links (pages that link to a document), and content-based links (pages deemed to have similar content). The single page that the user is currently viewing or editing roots this personal web; we refer to this page as the working document. The working document’s associated links can be expanded, allowing the user to view networks of documents at various degrees of separation from the working one. Different relationship types can be expanded at each level, thus allowing a user to view, for instance, the documents that link to a document that has similar content to the working document. WebTop also serves as a file manager, allowing users to “save” web and local documents by linking them in a contextual manner. These features aid users in both the construction and extension of their personal webs, as well as the maintenance and display of this information. [they also show him contextual data that he might not have seen…]
[may want to say something about zero-click interfaces and impromptu learning here…
how the creator need not perform a context switch to open up a search engine or various search engines, but can just work and now and then glance at the info being pushed.
While WebTop is already a useful tool, it has several limitations that we plan to address in this research. WebTop is personalized only in that it uses the user’s working document to perform searches and displays related links from the user’s personal space as well as from external sources. It does not learn a long-term model of a user’s preferences in order to better select what is displayed; for instance, pages returned from Google are displayed using Google’s impersonal page ranking. We plan to integrate such a learning module into WebTop, with the goal of allowing a WebTop agent to make better decisions about how to expand a user’s personal web and organize newly discovered links.
Also, we propose to extend the sorts of associative links that WebTop can provide to include documents that are implicitly related, as well as domain-specific relations. For example, we would like for WebTop to be able to recognize a proper name in a working document and locate that person’s home page, the papers they have written (available through CiteSeer[1]), and their co-authors. This requires a much broader notion of association, since the “link” in the working document is an implicit one, based on the presence of a proper name, rather than an explicit hypertext link.
Finally, we would like for multiple WebTop agents to be able to share their personal webs. This will involve constructing a set of protocols that allow multiple WebTop agents to locate each other, describe and exchange information, announce their interests and areas of expertise, and request associations for a document.
[i think we need to reformulate the main objectives a bit]
This research will allow creators and consumers of hyperlinked documents to more easily organize their information, search through large document spaces and discover new and previously unknown information sources. It will also contribute to the study of how peer-to-peer information sharing networks can be constructed and maintained, how participants in these networks can locate each other and how they can describe their services to each other.
We outline these objectives in more detail below.
Objective 1: Develop a user-specific algorithm for selecting and ranking documents from multiple information sources.
Currently, WebTop allows a user to browse and expand her personal web, selecting related documents, documents associated with related documents, and so on. However, the work of sorting through these related documents and determining which are most valuable is left to the user. As the personal web is extended in a breadth-first manner, the user is quickly presented with a large number of documents to organize. Current search engine technology faces a similar problem; algorithms such as PageRank[cite] and HITS[cite] tackle this problem by taking advantage of the Web’s networked structure, awarding high ranks to documents that occupy central locations within the network. This is a very effective solution for search engines, but less useful for WebTop, since WebTop does not retain a large database of Web pages.
Unlike a search engine, each WebTop agent is essentially a single-user application. This provides us with the ability to allow WebTop to customize its recommendations to match a user’s preferences. We plan to extend WebTop to both construct and adapt a model of a user’s preferences, and then use this model to modulate the collaborative recommendations received either through Google or from other WebTop agents (see below).
Our initial proposal for learning user preferences involves the combination of TFIDF[cite], a standard information filtering algorithm for extracting keywords from documents, and a Naïve Bayes classifier[cite], a simple yet effective probabilistic classifier that has been widely used in text-classification[cite]. TFIDF allows WebTop to extract relevant keywords from a user’s personal web. These then provide positive examples for a Naïve Bayes classifier. Since WebTop is embedded within a web browser, the agent also has access to the user’s browsing history. Pages viewed provide a secondary source of data about preferences, albeit a noisier and less informative source.
Once such a model is learned, it can be used to help judge the relevance of different documents within the personal web.
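The combination described above can be illustrated with a minimal sketch. It is not WebTop's implementation: the tokenizer, the smoothed IDF formula, the "like"/"dislike" labels, and all function and class names are assumptions chosen for illustration. In particular, the proposal only specifies positive examples, so the source of "dislike" examples here is hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real text needs more care)."""
    return [w for w in text.lower().split() if w.isalpha()]


def tfidf_keywords(doc, corpus, k=3):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `doc` is a token list and `corpus` a list of token lists.
    The smoothed IDF variant used here is one common choice, not the only one.
    """
    tf = Counter(doc)
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        idf = math.log((1 + n) / (1 + df)) + 1.0
        scores[term] = (count / len(doc)) * idf
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over two hypothetical labels."""

    def __init__(self, labels=("like", "dislike")):
        self.word_counts = {label: Counter() for label in labels}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def log_score(self, tokens, label):
        total_docs = sum(self.doc_counts.values())
        n_labels = len(self.word_counts)
        # Smoothed class prior, so an unseen label never yields log(0).
        logp = math.log((self.doc_counts[label] + 1) / (total_docs + n_labels))
        total_words = sum(self.word_counts[label].values())
        for t in tokens:
            logp += math.log(
                (self.word_counts[label][t] + 1) / (total_words + len(self.vocab))
            )
        return logp

    def classify(self, tokens):
        return max(self.word_counts, key=lambda lab: self.log_score(tokens, lab))
```

In this sketch, keywords extracted from documents in the personal web would be passed to `train(..., "like")`, and a candidate page would be tokenized and passed to `classify`.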
[one thing that confuses the issue is that the way things work now, tfidf is used to get key words from the working document. Then those keywords are sent to the various search engines, which return a set of documents. The results are already geared toward the working document. I think personalized ranking can come in through seeing if any nearby pages in the personal space point at any prospective results, or if any results are authored by “favored” authors of the user, or if I have been shown something from that source before and liked it or not, or we might also think in terms of collectives.
The learning problem is actually more complex than this. Even though a WebTop agent is typically associated with a single user, a single user may be associated with multiple working contexts. For example, when I am working on a scholarly document about machine learning, pages about Bayes’ Theorem are relevant and useful, but when I am writing an article about Michael Jordan, they are less so. If a WebTop agent is to successfully recommend articles in accordance with a user’s tastes, it must be able to maintain a number of preference models for the user’s different working contexts, identify which working context is active at the moment, and recommend documents consistent with this context. Solving these problems will produce a flexible, adaptive agent that is able to recommend documents consistent both with a user’s tastes and with her current task at hand.
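One possible way to identify the active working context is sketched below, assuming each context is summarized by a bag of keywords and the working document is matched to the closest one by cosine similarity. The context names, the keyword profiles, and the choice of cosine similarity are illustrative assumptions, not a committed design.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def active_context(working_tokens, contexts):
    """Return the name of the context profile closest to the working document.

    `contexts` maps a (hypothetical) context name to a Counter of keywords.
    """
    doc = Counter(working_tokens)
    return max(contexts, key=lambda name: cosine(doc, contexts[name]))


# Hypothetical profiles for two of a user's working contexts.
contexts = {
    "machine_learning": Counter(["bayes", "theorem", "classifier", "prior"]),
    "basketball": Counter(["jordan", "bulls", "playoffs", "scoring"]),
}
```

A separate preference model could then be kept per context and consulted only when that context is the active one.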
Objective 2: Extend WebTop to take advantage of domain-specific information sources.
[i like this section a lot…may need to emphasize it more in early paragraph?]
Many sorts of documents have implicit measures of similarity that are difficult to extract using only syntactic techniques such as TFIDF. One example, mentioned above, is that of locating the home page of an author cited in a working document. Once this page is located, a WebTop agent can then display the author’s research papers (by querying CiteSeer), his co-authors (also from CiteSeer), his department or institution’s home page (extracted from his home page), and so on. This operation exploits two things: access to a special-purpose database (CiteSeer), and the knowledge that proper names correspond to people, who author papers with other people.
There are many other special-purpose data sources that we can potentially integrate into WebTop. One particular source is the Internet Archive[2], a non-profit organization that has been producing an archive of the WWW since 1996. In this case, the domain-specific knowledge is the fact that the pages being viewed are snapshots from a particular time. Researchers are often interested in tracking a site (for example, news sites after September 11) over time, and so similarity is based on temporal proximity as well as the standard measures, such as shared keywords.
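As a rough illustration of combining temporal proximity with a standard keyword measure, the sketch below scores two archived snapshots. The exponential decay, the 30-day half-life, the equal weighting, and the snapshot record layout are all assumptions made for the example; the proposal does not fix any of them.

```python
from datetime import date


def jaccard(a, b):
    """Keyword overlap: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def temporal_proximity(d1, d2, half_life_days=30.0):
    """Score in (0, 1] that halves every `half_life_days` of separation."""
    gap = abs((d1 - d2).days)
    return 0.5 ** (gap / half_life_days)


def snapshot_similarity(s1, s2, weight=0.5):
    """Weighted mix of keyword overlap and temporal proximity.

    Each snapshot is a dict with hypothetical keys "keywords" and "date".
    """
    keyword_part = jaccard(s1["keywords"], s2["keywords"])
    time_part = temporal_proximity(s1["date"], s2["date"])
    return weight * keyword_part + (1 - weight) * time_part
```

With these choices, two snapshots of a news site with identical keywords score higher when captured a day apart than when captured a year apart.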
We plan to extend WebTop to deal with both of the cases described above, as well as other special-purpose collections as we are able. By doing this, we hope to both provide a more useful tool for users of these collections and also provide a “point of connection” between these collections and the WWW at large.
[perhaps learn some general rules/protocols that would allow a user to define new domain specific sources and mechanisms???
Objective 3: Develop infrastructure and protocols that allow multiple WebTop agents to share and disseminate information.
[I like this, too, but I think the infrastructure and protocol will be used the same for domain specific sources as regular persons (webtop agents). In other words, WebTop downloaders will automatically be registered, with consent, for peer to peer sharing, but they will register the same as a database source would, basically as a web service that can answer questions like: send me documents related to keywordlist, or send me inward links of webdocument, etc.]
One of our primary goals is the extension of WebTop to allow multiple WebTop agents to share and exchange information. By sharing information, users can avoid redundant searches, group their personal webs to form a “community of practice”, and discover previously unknown correlations between information sources.
In order for multiple WebTop agents to communicate and share information, several problems must be solved. We plan to construct a peer-to-peer protocol that allows two WebTop agents to query each other about the existence of other agents, exchange personal webs, request that an agent evaluate a document and recommend similar documents, and respond to these requests. We plan to construct this protocol using Web Services, with XML as the language used to exchange information. A fully implemented multi-agent WebTop system will allow for both explicit sharing, in which a user chooses to see the personal web of another (expert) user, such as a copyright law student sharing Lawrence Lessig’s personal web, and implicit sharing, in which a WebTop agent advertises a need for documents matching its user’s preferences and retrieves matching personal webs autonomously from other agents. [text about exposing what is now hidden knowledge which we create every day…how privacy is one reason it is not exposed, but lack of time and infrastructure also…i.e., I would share most of the stuff in my personal space, but i don’t have time to publish/blog/email it, and there is no way to just say: go for it.]
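To make the idea of an XML-based exchange concrete, the following sketch builds and parses a request message using Python's standard library. The element and attribute names (`webtop-request`, `sender`, `operation`, `param`) and the operation name `related-documents` are hypothetical; no message format has been defined yet.

```python
import xml.etree.ElementTree as ET


def build_request(sender, operation, **params):
    """Serialize a request from one agent to another as an XML string."""
    root = ET.Element("webtop-request", {"sender": sender, "operation": operation})
    for name, value in params.items():
        ET.SubElement(root, "param", {"name": name}).text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_request(xml_text):
    """Recover (sender, operation, params) from a serialized request."""
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.text for p in root.findall("param")}
    return root.get("sender"), root.get("operation"), params


# Hypothetical request: ask a peer for documents related to some keywords.
message = build_request("agent-a", "related-documents", keywords="bayes theorem")
```

The same message shape could serve queries such as requesting the inward links of a document, with only the operation name and parameters changing.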
Several policy questions will need to be addressed in implementing such a system. How can users (and their agents) designate usage conditions for information? For example, I might wish to specify that my personal web can be shared with other colleagues in my department, but that they are not allowed to forward it on. Users of such a system will want to be assured that their private browsing behavior is not made available to the general public. A user might also want to specify that information will only be shared if the recipient is willing to also share her personal web. This will entail the construction of a language that can be used to encode access controls. [designate private/public parts]
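The usage conditions mentioned above (sharing within a group, no forwarding, reciprocity) could be encoded along the following lines. This is only a sketch of one possible representation: the class names, fields, and the way forwarding is modeled (a flag on the request) are assumptions, not a proposed access-control language.

```python
from dataclasses import dataclass, field


@dataclass
class SharingPolicy:
    """Hypothetical usage conditions attached to a personal web."""
    allowed_groups: set = field(default_factory=set)
    allow_forwarding: bool = False
    require_reciprocity: bool = False


@dataclass
class Requester:
    """Hypothetical description of the agent asking for a personal web."""
    name: str
    groups: set
    shares_own_web: bool = False
    via_forward: bool = False  # True if the request arrives through a third agent


def may_share(policy, requester):
    """Return True only if every condition of the policy is satisfied."""
    if not (policy.allowed_groups & requester.groups):
        return False
    if requester.via_forward and not policy.allow_forwarding:
        return False
    if policy.require_reciprocity and not requester.shares_own_web:
        return False
    return True


# Example: share with departmental colleagues only, no forwarding, reciprocal.
policy = SharingPolicy(allowed_groups={"cs-dept"}, require_reciprocity=True)
```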
We are also extremely interested in the potential of a multi-agent WebTop system to form communities of interest, or congregations[cite] of users with shared interests. A congregation is a set of agents that share a long-term interest, such as the members of a church. Although the constituents may change from week to week, the congregation as a whole maintains its identity. Congregations allow new agents to easily locate resources; rather than searching through every agent in a system, a newcomer can search in congregations corresponding to its needs. We feel that the ability of WebTop agents to form congregations will have two benefits: it will allow the system to scale, and it will allow agents to discover previously unrealized connections between distinct personal webs.
2. WebTop to Date
The existing WebTop implementation provides a tree view of the personal web that is tightly integrated with an editor/browser (see Figure 1). The tree view is a zero-input interface [cite], meaning it pushes document context information to the user without the user typing keywords, clicking a mouse, or switching applications to a search engine. The user can instead intermittently glance at the information and follow one of the suggested links, or just continue on with the current task.
When a working document is opened in the editor/browser, WebTop performs a number of operations:
- Finds outward links: The working document is parsed to find its outward links. If the working document is a directory, its outward links include all of its subdirectories and files. If it is an HTML document, its outward links are its hyperlinks. This list will also contain edge links, which are discussed in the next section.
- Finds inward links: If the working document is a web document, Google’s web services are invoked to retrieve all web pages that refer to the working document. In addition, all documents currently in the user’s personal web that refer to the working document are added to the list of inward links.
- Finds content-related links: The most important terms in the working document are identified using a TFIDF-based algorithm [27]. These terms are then used to perform one or more of three possible searches. The first invokes Google’s web service to perform a standard keyword search of the WWW. The second searches only documents within the personal web. The third begins with the working document and crawls all (inward and outward) links from each associated document using a breadth-first search [7]. The user may choose which of these searches are performed through the options dialog.
A user-configurable number of links for each relationship are then displayed, color-coded by type, in the first level of the tree view. For example, the default configuration displays five inward links, five outward links, and five content-related links. A “more” node is provided under the nodes of each type. When expanded, additional relations at the same level in the tree (degree of separation) are shown.
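The breadth-first expansion over inward and outward links can be sketched on a small in-memory link graph. This is an illustration under stated assumptions rather than WebTop's code: the graph is a plain dictionary of outward links, inward links are derived by inverting it (WebTop obtains them from Google and the personal web), and the function names are hypothetical.

```python
from collections import deque


def inward_links(outward):
    """Invert an outward-link map: page -> set of pages that link to it."""
    inward = {}
    for source, targets in outward.items():
        for target in targets:
            inward.setdefault(target, set()).add(source)
    return inward


def expand(working_doc, outward, max_degree):
    """Breadth-first search from the working document.

    Returns a dict mapping each reachable document to its degree of
    separation, following both inward and outward links up to `max_degree`.
    """
    inward = inward_links(outward)
    degree = {working_doc: 0}
    queue = deque([working_doc])
    while queue:
        doc = queue.popleft()
        if degree[doc] == max_degree:
            continue  # do not expand beyond the requested depth
        neighbours = set(outward.get(doc, ())) | inward.get(doc, set())
        for neighbour in sorted(neighbours):
            if neighbour not in degree:
                degree[neighbour] = degree[doc] + 1
                queue.append(neighbour)
    return degree


# Toy link graph: each key links outward to the listed pages.
links = {"home": ["paper"], "fan": ["home"], "paper": ["author"], "blog": ["fan"]}
```

Grouping the result by degree gives the levels of the tree view; a per-type display limit such as the default of five would be applied when rendering each level.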