A framework for linking documents with their usage within the context of Competitive Intelligence

AKANBI Lukman

Department of Computer Science and Engineering

Obafemi Awolowo University, Ile-Ife. Nigeria.

ADAGUNODO Emmanuel Rotimi

Department of Computer Science and Engineering

Obafemi Awolowo University, Ile-Ife. Nigeria.

Abstract. In this work we propose a computational framework that gives decision makers access to the various uses to which a piece of information (a document) has previously been put, with a view to facilitating quick and easy resolution of decisional problems in a competitive intelligence environment. The conceptual model of the system, showing the various modules and entities that constitute the proposed system, is presented. The system relies on the judgment of Competitive Intelligence actors to capture the degree of relevance between a particular document and a decisional problem. A prototype of the proposed system is currently under development.


Keywords: framework, competitive intelligence, document usage index, decision problem

  1. Introduction

Since the advent of the Internet and the introduction of World Wide Web technology, the amount of available information has continued to increase exponentially, driven by developments in information and communication technologies and the success of Internet applications. The identification of useful, reliable and interesting items that satisfy one's informational requests has therefore become more and more difficult and time-consuming. According to Okunoye et al. (2010), the discovery of relevant information for solving decision problems (DP) is pivotal to making the right decision. Such information, however, has to be sought, collected and processed with a view to eliciting knowledge relevant to the problem, and both the collected information and the elicited knowledge have to be represented in a form that facilitates information reuse.

We believe strongly that every search for information is associated with a decision problem, whether explicitly expressed or implicit to the information searcher. Keeping track of the kinds of DP that documents have been used to resolve would ease the process of resolving similar problems in the future. The architecture and framework of a system that assists in capturing the various usages of documents within the context of Competitive Intelligence (C.I.) is proposed in this work. The rest of the paper is organized as follows: Competitive Intelligence is described in section two; document annotation and indexing are described in section three. In section four, the architecture of the proposed system and a description of its various components are presented. Section five concludes the work.

  2. Competitive Intelligence

According to Trigo et al. (2007), Competitive Intelligence can be described as a systematic process of information gathering, processing, analysis and decomposition, conducted within the context of the external environment of an organization's activities. The major goal of this process is to supply the right information, at the right moment, in the correct structure, to the right person, in order to support the best decision possible. The Strategic and Competitive Intelligence Professionals (SCIP) define C.I. as the legal and ethical collection and analysis of information regarding the capabilities, vulnerabilities, and intentions of business competitors (SCIP, 2012). The information so collected is to be used by decision makers in an organization to support them in arriving at the best possible decision.

Dishman and Calof (2008) described C.I. as a process involving the gathering, analysis and communication of environmental information to assist in strategic decision-making, and as such the fundamental basis of the strategic decision-making process. Their study made clear that a wide variety of information sources is utilized: internal and external, qualitative and quantitative, textual as well as human.

The editor of the Competitive Intelligence Magazine noted in his article "Competitive Intelligence – An Overview" (Miller, 2007) that effective C.I. is a continuous cycle, whose steps, as identified in (Herring, 1998), include the following:

  1. Planning & direction (working with decision makers to discover and hone their intelligence needs);
  2. Collection activities (conducted legally and ethically);
  3. Analysis (interpreting data and compiling recommended actions);
  4. Dissemination (presenting findings to decision makers);
  5. Feedback (taking into account the response of decision makers and their needs for continued intelligence).

The key thrusts of C.I. are information gathering, the process of aggregating different information sources, and analysis, the turning of raw data (a collection of facts, figures, and statistics relating to business operations) into actionable intelligence (data organized and interpreted to reveal underlying patterns, trends, and interrelationships). The data thus transformed can be applied to analytical tasks and decision making, which form the basis for strategic management (Miller, 2007; Chen, Chau & Zeng, 2002).

  3. Document Annotation and Indexing

The traditional domain of document annotation covers the annotation of arbitrary textual documents, or parts of them, which can be manual (i.e. performed by one or more people), semi-automatic (based on automatic suggestions), or fully automatic. Manual annotation tools allow users to add annotations to web pages or other resources and share these with others. An example annotation would relate the text “Paris” to an ontology, identifying it as a city and as the capital of France. Automatic tools can perform similar annotations (such as named-entity recognition) without manual intervention (Oren, Moller, Scerri, Handschuh, & Sintek, 2006). Document annotation is a very old concept through which much of the work of business and academia is accomplished. The usual scenario is that of a member of a group producing a document, which is then distributed to the group members for review. Each member reviews their own copy of the document and later returns the annotated document to the author. The author merges the comments into the document and produces a new document for review. After a number of iterations the document is declared final and is typically published outside of the group (Lapique and Regev, 1998).
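To make the annotation concept concrete, the sketch below shows a minimal annotation record in Python. The field names and the record layout are our own illustrative assumptions; the cited tools each use their own schemas.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Minimal manual annotation linking a text span to an ontology concept.

    All field names are illustrative assumptions, not the schema of any
    of the cited annotation tools.
    """
    document_id: str      # the document being annotated
    target_text: str      # the annotated span, e.g. "Paris"
    ontology_class: str   # the concept the span is identified as, e.g. "City"
    relation: str         # an ontology relation, e.g. "capitalOf"
    relation_object: str  # the object of the relation, e.g. "France"
    author: str           # the reviewer who created the annotation

# The example from the text: "Paris" identified as a city and capital of France.
paris = Annotation(
    document_id="doc-42",
    target_text="Paris",
    ontology_class="City",
    relation="capitalOf",
    relation_object="France",
    author="reviewer-1",
)
print(paris)
```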

In addition to older Web-based annotation tools, which include the ComMentor annotation tool (Röscheisen, Mogensen, & Winograd, 1994), the Yawas annotation tool (Denoue and Vignollet, 2000) and the Annotea annotation tool (Kahan, Koivunen, Prud’Hommeaux, & Swick, 2002), more recent ones have emerged, including the AMIE annotation tool (Robert and David, 2006) and AMTEA (Okunoye, David, & Uwadia, 2010). These annotation tools provide platforms for users to create annotations from documents of interest.

According to Manning et al. (2009), the simplest form of document retrieval is for a computer to do a sort of linear scan through documents. This is commonly referred to as grepping through text, after the Unix command grep, which performs the process. The way to avoid linearly scanning the texts for each query is to index the documents in advance. Document indexing is therefore a process of identifying a document with some carefully selected words from the document, called terms.
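The contrast between grepping and indexing in advance can be illustrated with a short sketch over a toy in-memory corpus (our own assumption, for illustration only): the linear scan re-reads every document for each query, while the inverted index is built once and then answers term lookups directly.

```python
from collections import defaultdict

# Toy in-memory corpus (an assumption for illustration).
corpus = {
    "d1": "car maintenance schedule",
    "d2": "automobile maintenance tips",
    "d3": "car insurance rates",
}

# Linear scan ("grepping"): every query re-reads every document.
def grep(term):
    return [doc_id for doc_id, text in corpus.items() if term in text.split()]

# Indexing in advance: build an inverted index once; a query then becomes
# a direct lookup of a term's posting set.
inverted_index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

print(grep("maintenance"))            # ['d1', 'd2'] -- scans whole corpus
print(inverted_index["maintenance"])  # {'d1', 'd2'} -- one lookup
```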

An index to a document acts as a tag by means of which the information content of the document in question may be identified. The index may be a single term or a set of terms which together tag or identify the content of each document. The terms which constitute the allowable vocabulary for indexing documents in a library form the common language which bridges the gap between the information in the documents and the information requirements of the library users (Maron and Kuhns, 1960).

The three basic techniques for searching traditional information retrieval collections are Boolean models, vector space models, and probabilistic models (Baeza-Yates and Ribeiro-Neto, 1999).

As reported in Langville and Meyer (2006), the Boolean model is one of the earliest and simplest retrieval methods and utilises the notion of exact matching to match documents to a user query. Its more refined descendants are still used by most libraries. The adjective Boolean refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. The Boolean model of information retrieval operates by considering which keywords are present or absent in a document; as such, a document is judged as either relevant or irrelevant.
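A minimal sketch of the Boolean model, under the assumption of a toy corpus and a simple whitespace tokenizer: AND, OR, and NOT reduce to set intersection, union, and difference over the posting sets of an inverted index, and each document is either returned or not.

```python
from collections import defaultdict

# Toy corpus and inverted index (illustrative assumptions).
corpus = {
    "d1": "car maintenance schedule",
    "d2": "automobile maintenance tips",
    "d3": "car insurance rates",
}

index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        index[term].add(doc_id)

# The Boolean operators reduce to set operations on posting sets.
def AND(a, b): return index[a] & index[b]
def OR(a, b):  return index[a] | index[b]
def NOT(a):    return set(corpus) - index[a]

# Exact matching is all-or-nothing: "Automobile Maintenance" (d2) is not
# returned for this query -- the behaviour the next paragraph remedies.
print(AND("car", "maintenance"))  # {'d1'}
```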

More advanced fuzzy set theoretic techniques try to remedy this black-and-white Boolean logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine returns all documents that use both words in the title; a relevant document entitled “Automobile Maintenance” will not be returned. Fuzzy Boolean engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user (Langville and Meyer, 2006).
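One simple way to realize this shades-of-gray idea, a hedged illustration only and not the specific fuzzy set machinery of any particular engine, is to score each document by the fraction of query terms it contains, so that a partial match is returned as somewhat relevant rather than excluded.

```python
# Illustrative graded matching; not the actual fuzzy Boolean algorithm of
# any cited engine.
corpus = {
    "d1": "car maintenance schedule",
    "d2": "automobile maintenance tips",   # no exact match for "car"
}

def fuzzy_score(query_terms, text):
    """Degree of relevance in [0, 1]: the fraction of query terms present."""
    terms = set(text.split())
    return sum(t in terms for t in query_terms) / len(query_terms)

query = ["car", "maintenance"]
for doc_id, text in corpus.items():
    print(doc_id, fuzzy_score(query, text))
# d1 -> 1.0 (both terms); d2 -> 0.5 (somewhat relevant, still returned)
```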

Vector space models transform textual data into numeric vectors and matrices, and employ matrix analysis techniques to discover key features and connections in the document collection. Some advanced vector space models address the common text analysis problems of synonymy and polysemy (Langville and Meyer, 2006). Advanced vector space models, such as Latent Semantic Indexing (LSI) proposed in Deerwester et al. (1990), can access the hidden semantic structure in a document collection.
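A minimal vector space sketch, assuming plain term-frequency vectors and cosine similarity (no idf weighting and no dimensionality reduction such as LSI):

```python
import math
from collections import Counter

# Plain term-frequency vectors; an illustrative simplification of the model.
def tf_vector(text):
    return Counter(text.split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v))

doc = tf_vector("car maintenance and car repair")
query = tf_vector("car maintenance")
print(round(cosine(query, doc), 3))  # ~0.802: graded, not all-or-nothing
```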

Probabilistic models attempt to estimate the probability that the user will find a particular document relevant. Retrieved documents are ranked by their odds of relevance (the ratio of the probability that the document is relevant to the query to the probability that it is not). The probabilistic model operates recursively: the underlying algorithm guesses initial parameters and then iteratively improves this initial guess to obtain a final ranking of relevancy probabilities (Langville and Meyer, 2006).
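Stated compactly, the ranking criterion just described is the odds of relevance: for a document d and query q, with R denoting relevance,

```latex
O(R \mid d, q) = \frac{P(R \mid d, q)}{P(\bar{R} \mid d, q)}
```

and retrieved documents are ranked by decreasing O(R | d, q).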

The document annotation techniques described above provide a platform for adding information to documents of interest, whereas the document indexing techniques provide various approaches to indexing documents for the purpose of information retrieval. In this work, document annotation techniques will be employed to obtain document usage from the users’ end, and an index of document usage will be created based on an appropriate document indexing technique.

  4. The Proposed Model

Figure 1 shows the architecture of our proposed system for capturing document usage and integrating it into the document index. There are three basic modules in the system. The first module is the user interface, through which users (Competitive Intelligence actors) interact with the system. Two actions can be performed by the user through this module: the user can either specify a decision problem or submit a query. The results of search activities are also displayed by the user interface unit of the system.

The second module comprises the document index, document usage index and document databases. The document usage index database contains the index generated from the decision problem specification process. The degree of relevance between the DP at hand and a retrieved document, specified by the user after using the document, forms the basis for relating and linking the document usage index to documents. Note that the user interface and the databases are accessible only to the Competitive Intelligence actors. The third module is the public repositories of information, which consist of servers scattered all over the world. Public search engines such as Google and Yahoo have direct access to these public repositories, which they crawl to populate their document index databases.
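Although the paper prescribes no particular implementation, the document usage index can be pictured as a mapping from DP descriptors to (document, degree-of-relevance) pairs. The sketch below is purely illustrative: the function names, the textual DP descriptors and the 0-1 relevance scale are all our own assumptions.

```python
from collections import defaultdict

# Hypothetical in-memory document usage index: for each decision problem (DP)
# descriptor, the documents used on it and the user-judged degree of relevance.
usage_index = defaultdict(list)

def record_usage(dp_descriptor, doc_id, relevance):
    """Link a document to a DP with the degree of relevance judged by the user."""
    usage_index[dp_descriptor].append((doc_id, relevance))

record_usage("competitor pricing strategy", "doc-17", 0.9)
record_usage("competitor pricing strategy", "doc-23", 0.4)

# A similar future DP can then be resolved by consulting past usage first,
# most relevant documents first.
print(sorted(usage_index["competitor pricing strategy"],
             key=lambda pair: pair[1], reverse=True))
```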

Figure 2 shows the flow of activities through the proposed system. The user submits a query to the system; the system processes the query and searches through the document usage index. If documents that match the query terms exist, the system requests the user to specify the DP being handled. If no document matches the user’s request, the system informs the user of the unavailability of the requested information from within the system and offers the option of looking up the required information in the public repositories of documents. If the user indicates his intention to look up the public repositories, the system then requests a specification of the DP being considered. The document usage index is thereafter updated based on the degree of relevance indicated by the user.
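The flow just described can be summarized in a Python sketch. Every function below is a stub standing in for a module of Figure 1; the names and canned responses are our own assumptions, not actual implementations.

```python
# Stubs standing in for the modules of Figure 1 (illustrative assumptions).
usage_index = {"pricing": [("doc-17", 0.9)]}          # toy usage index

def search_usage_index(query):
    return usage_index.get(query, [])

def search_public_repositories(query):
    # Stand-in for looking up public repositories, e.g. via a search engine.
    return [("public-doc-1", None)]

def handle_query(query, specify_dp, want_public_lookup, judge_relevance):
    """Sketch of the Figure 2 flow: search, fall back, update the usage index."""
    results = search_usage_index(query)
    if not results:
        # Inform the user and offer the public-repository option.
        if not want_public_lookup():
            return []
        results = search_public_repositories(query)
    dp = specify_dp()                                  # user specifies the DP
    for doc_id, _ in results:                          # after use of the documents
        usage_index.setdefault(dp, []).append((doc_id, judge_relevance(doc_id)))
    return [doc_id for doc_id, _ in results]

# Toy run with canned user responses.
print(handle_query("pricing",
                   specify_dp=lambda: "competitor pricing strategy",
                   want_public_lookup=lambda: True,
                   judge_relevance=lambda doc_id: 0.8))  # -> ['doc-17']
```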

Figure 1: Architecture of the Proposed Framework

Figure 2: Flow of activities through the proposed system

  5. Conclusion

A framework for developing a system that allows the capture of document usage from the users’ perspective has been presented. The system is to be used by Competitive Intelligence actors within the Competitive Intelligence environment to aid the DP resolution process. When completed, the proposed system will facilitate easy access to what documents have been used for in the past. However, it should be noted that the effective functioning of the system depends on the cooperation of the users during the DP specification process.

List of References

Baeza-Yates, R. and Ribeiro-Neto, B. (1999) Modern Information Retrieval. ACM Press, New York.

Chen, H., Chau, M. and Zeng, D. (2002) CI Spider: a tool for competitive intelligence on the Web. Decision Support Systems, Vol. 34, Issue 1, pp. 1-17.

Denoue, L. and Vignollet, L. (2000) An annotation tool for Web browsers and its applications to information retrieval. In: Proceedings of RIAO 2000, April 2000. Available at rx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.618 (visited 19/09/2012).

Dishman, P. L. and Calof, J. L. (2008) Competitive intelligence: a multiphasic precedent to marketing strategy. European Journal of Marketing, Vol. 42, No. 7/8, pp. 766-785.

Herring, J. P. (1998) "What is Intelligence Analysis?" Competitive Intelligence Magazine, Vol. 1, No. 2, pp. 13-16.

Kahan, J., Koivunen, M.-R., Prud'Hommeaux, E. and Swick, R. R. (2002) Annotea: an open RDF infrastructure for shared Web annotations. Computer Networks, 39(5), pp. 589-608.

Langville, A. N. and Meyer, C. D. (2006) Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton.

Lapique, F. and Regev, G. (1998) An experiment using document annotations in education. In: Proceedings of WebNet 98 – World Conference of the WWW, Internet and Intranet. 000019b/80/17/7a/21.pdf (downloaded 19/09/2012).

Manning, C. D., Raghavan, P. and Schütze, H. (2009) An Introduction to Information Retrieval. Cambridge University Press, Cambridge, England. Online edition.

Miller, S. H. (2007) "Competitive Intelligence – An Overview". Strategic and Competitive Intelligence Professionals, Alexandria, VA. Available at Intelligence%20concurrentielle/Articles/CI%20Overview.pdf (visited 23/09/2013).

Okunoye, O. B., David, A. and Uwadia, C. (2010) AMTEA: tool for creating and exploring annotations in the context of economic intelligence (Competitive Intelligence). In: 11th IEEE International Conference on Information Reuse and Integration (IRI 2010), Las Vegas, United States, pp. 249-252.

Oren, E., Moller, K., Scerri, S., Handschuh, S. and Sintek, M. (2006) What are Semantic Annotations? Technical Report, DERI Galway. Available at: (downloaded 06/10/2012).

Robert, C. A. and David, A. (2006) AMIE: An Annotation Model for Information Research. In: L. Barolli, B. A. Abderazek, T. Grill, T. M. Nguyen and D. Tjondronegoro (eds.) Frontiers in Mobile and Web Computing, Austrian Computer Society, Vol. 216, ISBN 3-85403-216-1, pp. 129-137.

Röscheisen, M., Mogensen, C. and Winograd, T. (1994) Shared Web Annotations as a Platform for Third-Party Value-Added Information Providers: Architecture, Protocols, and Usage Examples. Technical Report STAN-CS-TR-97-1582, Stanford Integrated Digital Library Project, Computer Science Dept., Stanford University. Available at ftp://db.stanford.edu/pub/public_html/public_html/cstr/reports/cs/tr/97/1582/CS-TR-97-1582.pdf (visited January 2013).

SCIP (2012) About SCIP. Strategic and Competitive Intelligence Professionals. Available at temnumber=2214&navItemNumber=492 (retrieved 15/10/2012).

Trigo, M. R., Gouveia, L. B., Quoniam, L. and Riccio, E. L. (2007) Using competitive intelligence as a strategic tool in a higher education context. In: 8th European Conference on Knowledge Management (ECKM), Consorci Escola Industrial de Barcelona (CEIB), Barcelona, Spain, 6-7 September 2007.
