Global 14 Cyberminer WRS Document v1.2

CS/SE 6362 Advanced Software Architectural Design (Spring 2011)

Cyberminer

WRS Document

Submitted to:

Dr. Lawrence Chung

Associate Professor,

Department of Computer Science,

The University of Texas at Dallas,

Richardson, TX -75080

Team Name: Global 14

Date / Version / Description / Author
1/30/2011 / 1.0 / Initial draft / Global 14
4/25/2011 / 1.1 / Updated draft / Global 14
4/26/2011 / 1.2 / Formatted document / Caitlin Fowler

Team Website:

Table of Contents

1. Introduction

1.1 Project Overview

1.2 Purpose

1.3 Scope

2. Issues

2.1 Issues with the Functional Requirements

2.11 OFR1

2.12 OFR2

2.13 OFR3

2.14 OFR4

2.15 OFR5

2.16 OFR6

2.17 OFR7

2.18 OFR8

2.19 OFR9

2.10 OFR10

2.11 OFR11

2.12 OFR12

2.13 OFR13

2.14 OFR14

2.15 OFR15

2.16 OFR16

2.2 Issues with the Non-Functional Requirements

2.21 ONR1

2.22 ONR2

2.23 ONR3

2.24 ONR4

2.25 ONR5

2.26 ONR6

2.27 ONR7

2.28 ONR8

3. Improved Understanding

1. Introduction

1.1 Project Overview

“As system/software architects of a renowned company, your team is to architect a web search engine, Cyberminer, using the simple KWIC software system which you implemented as part of Project I. For this part of the project, you will use an Object-Oriented architectural style, and build a Java applet (or an equivalent), which should be accessible through your team’s web page.” – Project Summary Document.

1.2 Purpose

The product described within this document is Version 1.0 of the CyberMiner Software System. This system uses the KWIC system in the back end to implement a search engine.

1.3 Scope

Although CyberMiner will, for the most part, behave like conventional search engines (e.g. Google), its scope shall differ slightly. The search results will be based either on data that has been previously stored in the application’s database or on data that is input by other users. No crawling of live websites shall be implemented, for time and complexity reasons.

2. Issues

The Project Summary Document gives a list of Functional and Non-Functional requirements that CyberMiner is required to meet. The following sections describe the issues that our team found with each class of requirement. Each requirement is restated and followed by the list of issues that are considered relevant. For each issue discussed, a list of possible solutions is considered, one of which is selected based on a stated rationale.

2.1 Issues with the Functional Requirements

2.11 OFR1

Cyberminer shall accept a list of keywords and return a list of URLs whose descriptions contain any of the given keywords.

Issue 1: It is not clear where input will be accepted from.

Option 1: Input will be accepted from the keyboard (via an input text field) only.

Option 2: Input will be accepted from input files only.

Option 3: Input will be accepted from both the keyboard and input files.

Choice/Rationale: Option 1 will be implemented, as search engines are typically driven by a user’s keyboard input.

Issue 2: It is not clear what languages will be supported.

Option 1: Only English will be supported.

Option 2: Languages other than English which read from left to right and which contain space as a delimiter between words will be supported.

Option 3: All languages will be supported.

Choice/Rationale: Option 1 will be implemented, as this increases the speed with which our team will be able to deliver the product. If necessary, support for other languages can be implemented later.

2.12 OFR2

Cyberminer shall use another software system, the KWIC system, as a component, in order to efficiently maintain a database of URLs and the corresponding descriptions.

Issue 1: The mode of interaction between the two systems needs to be determined.

Option 1: Cyberminer will access the database provided by KWIC.

Option 2: Cyberminer will implement the complete functionality of the KWIC system as a subset of Cyberminer’s functionality.

Option 3: Cyberminer will take input from the user and send it behind the scenes to the KWIC system. The KWIC system will use this input to search its database and return the search results to Cyberminer, which will then display them to the user.

Choice/Rationale: The team decided on Option 3 based on information given to us regarding the required implementation of the KWIC system.

Issue 2: How does the KWIC system efficiently maintain a database of URLs?

This question seems to have multiple parts. First, is the database persistent or transient, and how will Cyberminer access it? By persistent, we mean that either an SQL (or equivalent) database or a file retains the information outside of the KWIC system and allows access to the database even if the KWIC system is not running. Under the transient approach, the KWIC system would hold the descriptor-URL pairs only in memory while the process is running. Cyberminer could then access the information through IPC (or equivalent).

Option 1: A persistent SQL database will be appended to by the KWIC system. Cyberminer will perform searches on that database. The word “efficiently” refers to the efficient accessibility and control of the data within the database. Under this definition an SQL database makes the most sense.

Option 2: A transient database held in KWIC and accessed by Cyberminer through IPC. The word “efficiently” refers to the scale of effort it takes to implement. Considering the use of the system, it makes sense that a persistent database is overkill where simple IPC could be implemented much more easily.

Choice/Rationale: The team selected Option 2 to satisfice the need for ‘efficiency’, and to aid in ease of implementation.

Issue 3: What exactly does the word “efficiently” mean?

Option 1: The word “efficiently” refers to the efficient accessibility and control of the data within the database.

Option 2: Efficiently refers to the space taken up by the database to store the data.

Option 3: Rather than search the database of URLs and descriptions every time there is a search, we might maintain an index of all the URLs in the database. In this case, the KWIC system will only need to look through this index in a linear manner while searching for terms.

Choice/Rationale: Option 3 was selected on the grounds that a linear search will be straightforward and deliver results quickly.
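
A linear pass over such an index might look like the following Java sketch. The row layout (a shifted descriptor paired with its URL), the class name LinearIndexSearch, and the rule that a keyword matches when it is the first word of an indexed row are our assumptions for illustration; they are not stated in the requirements. The rule relies on the fact that every word of a descriptor leads one of its circular shifts.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a linear scan over a KWIC index. */
final class LinearIndexSearch {
    /**
     * Each index row is assumed to be a two-element array: {shiftedDescriptor, url}.
     * Returns the URLs of all rows whose first word is one of the keywords,
     * in the order in which they are first encountered.
     */
    static Set<String> search(List<String[]> index, Set<String> keywords) {
        Set<String> urls = new LinkedHashSet<>();
        for (String[] row : index) {
            // Only the leading word of each shift is compared.
            String firstWord = row[0].split(" ", 2)[0];
            if (keywords.contains(firstWord)) {
                urls.add(row[1]);
            }
        }
        return urls;
    }
}
```

Note that a keyword which only ever leads a row that was dropped from the index (for example a noise word) would not be found by this scan.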

2.13 OFR3

The KWIC system shall accept an ordered set of lines, where each line consists of two parts:

• The URL part, whose syntax is:

URL ::= ‘ identifier ‘.’ identifier ‘.’ [‘edu’ | ‘com’ | ‘org’ | ‘net’]

identifier ::= {letter | digit}+

letter ::= [ ‘a’ | ‘b’ | … | ‘y’ | ‘z’ | ‘A’ | ‘B’ | … | ‘Y’ | ‘Z’]

digit ::= [‘1’ | ‘2’ | … | ’9’ | ‘0’]

• The descriptor part, whose syntax is:

identifier {‘ ‘ identifier}*
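
As an illustration, the grammar above can be checked with regular expressions. This is a minimal sketch: the class and method names are hypothetical, and the URL pattern models only the identifier '.' identifier '.' suffix portion of the rule, so any leading token of the URL rule that is not legible above is not covered.

```java
import java.util.regex.Pattern;

/** Hypothetical validator for the two parts of a KWIC input line. */
final class KwicLineValidator {
    // identifier ::= {letter | digit}+
    private static final String IDENTIFIER = "[A-Za-z0-9]+";

    // Assumed shape: identifier '.' identifier '.' ('edu' | 'com' | 'org' | 'net')
    private static final Pattern URL =
            Pattern.compile(IDENTIFIER + "\\." + IDENTIFIER + "\\.(edu|com|org|net)");

    // descriptor ::= identifier {' ' identifier}*
    private static final Pattern DESCRIPTOR =
            Pattern.compile(IDENTIFIER + "( " + IDENTIFIER + ")*");

    static boolean isValidUrl(String url) {
        return URL.matcher(url).matches();
    }

    static boolean isValidDescriptor(String descriptor) {
        return DESCRIPTOR.matcher(descriptor).matches();
    }
}
```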

2.14 OFR4

The descriptor part of any line shall be “circularly shifted” by repeatedly removing the first word and appending it at the end of the line.
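
The shifting rule can be sketched in Java as follows. The class name CircularShifter is hypothetical, and we assume here that the unshifted descriptor counts as the first shift.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of circular shifting for one descriptor. */
final class CircularShifter {
    /** Returns every circular shift of the descriptor, starting with the descriptor itself. */
    static List<String> shifts(String descriptor) {
        List<String> words = new ArrayList<>(Arrays.asList(descriptor.trim().split(" +")));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            result.add(String.join(" ", words));
            // Remove the first word and append it at the end of the line.
            words.add(words.remove(0));
        }
        return result;
    }
}
```

For the descriptor "Advanced Software Design" this yields "Advanced Software Design", "Software Design Advanced" and "Design Advanced Software".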

2.15 OFR5

The KWIC index system shall output a listing of all circular shifts of the descriptor parts of all lines in ascending alphabetical order, together with their corresponding URLs.
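
A possible way to produce this listing is sketched below. The names KwicListing and Entry are hypothetical. The requirement does not say how upper and lower case compare, so the sketch assumes that the ordering ignores case and uses exact case only to break ties.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: all circular shifts of all lines, sorted, with their URLs. */
final class KwicListing {
    /** One output row: a shifted descriptor plus the URL it came from. */
    static final class Entry {
        final String shift;
        final String url;

        Entry(String shift, String url) {
            this.shift = shift;
            this.url = url;
        }
    }

    static List<Entry> build(Map<String, String> urlToDescriptor) {
        List<Entry> entries = new ArrayList<>();
        for (Map.Entry<String, String> line : urlToDescriptor.entrySet()) {
            String[] words = line.getValue().split(" ");
            for (int start = 0; start < words.length; start++) {
                // Build the shift that begins at word index 'start' and wraps around.
                StringBuilder shift = new StringBuilder();
                for (int k = 0; k < words.length; k++) {
                    if (k > 0) shift.append(' ');
                    shift.append(words[(start + k) % words.length]);
                }
                entries.add(new Entry(shift.toString(), line.getKey()));
            }
        }
        // Assumption: "alphabetical" ignores case; exact case breaks ties.
        entries.sort(Comparator.comparing((Entry e) -> e.shift, String.CASE_INSENSITIVE_ORDER)
                .thenComparing(e -> e.shift));
        return entries;
    }
}
```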

2.16 OFR6

No line in the output list shall start with any noise word such as “a”, “the”, and “of”.

Issue 1: We need to determine a basis for defining what constitutes a noise word. For example, “the” might be an acronym for “The Higher Education” to some users, so it is not necessarily a noise word in the context of a search engine.

Option 1: The set of words considered ‘noise’ will be [a, an, the, A, An, The, AN, THe, THE, of, Of, OF]. Thus, we assume that these keywords will never be terms that the user is interested in searching for.

Option 2: The set of words considered ‘noise’ will be [a, an, the, of, is, at, to, for, and].

Choice/Rationale: The team selected Option 2. These are fairly common words that it is unlikely users will be interested in.
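
The selected noise-word set could be applied to the shifted lines as sketched below. The class and method names are hypothetical. Option 2 lists the words in lower case only, so the sketch assumes that matching ignores case; that assumption should be revisited if capitalised forms such as “THE” must remain searchable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of dropping output lines that start with a noise word. */
final class NoiseWordFilter {
    // The Option 2 noise-word set.
    private static final Set<String> NOISE =
            Set.of("a", "an", "the", "of", "is", "at", "to", "for", "and");

    static boolean startsWithNoiseWord(String shift) {
        String firstWord = shift.trim().split(" ", 2)[0];
        // Assumption: matching ignores case, so "The" is treated like "the".
        return NOISE.contains(firstWord.toLowerCase(Locale.ROOT));
    }

    /** Returns the shifts that do not begin with a noise word, in their original order. */
    static List<String> withoutNoiseLines(List<String> shifts) {
        List<String> kept = new ArrayList<>();
        for (String shift : shifts) {
            if (!startsWithNoiseWord(shift)) {
                kept.add(shift);
            }
        }
        return kept;
    }
}
```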

2.17 OFR7

The KWIC system shall allow for two modes of operation: i) for building the initial KWIC indices; and ii) for growing the indices with later additions.

Issue 1: Where are the shifted URLs sent or stored?

Option 1: Onscreen.

Option 2: In the database, as the key that maps to the URL.

Option 3: In a Session or ViewState variable.

Choice/Rationale: The team selected Options 1 and 2, to provide visibility to the user and to store the relevant information in the database.

Issue 2: How do we build the indices?

Option 1: A file entry (or database query) inputs an original list. Each additional entry will be added from the keyboard. The clear button from phase 1 will still be utilized to clear the database.

Option 2: Only keyboard entry, again using a clear button.

Option 3: Sticking more closely to the letter of the requirements, the UI will have one button for appending to the data set and one button for adding an entry to a cleared data set.

Choice/Rationale: The team selected Option 1, as it allows for the greatest input flexibility when creating the database.

Issue 3: Where will the additions to the indices come from?

Option 1: The KWIC system will generate new indices from the new additions to its database.

Choice/Rationale: Option 1 will be selected by default. There is no other reasonable way to do this.

Issue 4: How do we input the additions to the indices into the KWIC system?

Option 1: Cyberminer will include an admin interface, which can only be signed into by an administrator of the software, for adding new URLs (along with their descriptions) into the KWIC system which shall, in turn, generate the indices, add them to the existing indices and store both the indices and new URL.

Option 2: Works similarly to Option 1, except that the indices are generated by Cyberminer itself and the KWIC system only has to store the new indices and the new URLs.

Choice/Rationale: Option 1 will be implemented, as it is the more robust of the options and allows for greater control over the system.
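
The two modes of operation required by OFR7 could be sketched as one component with a build operation and an add operation. Everything named here (GrowableKwicIndex, build, add, urlsFor) is hypothetical, and an in-memory TreeMap stands in for whatever storage the KWIC system finally uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of an index that can be built once and grown later. */
final class GrowableKwicIndex {
    // Shifted descriptor -> URLs. TreeMap keeps keys in natural (case-sensitive) String order.
    private final TreeMap<String, List<String>> index = new TreeMap<>();

    /** Mode i: discard whatever is stored and build the indices from an initial list. */
    void build(Map<String, String> urlToDescriptor) {
        index.clear();
        urlToDescriptor.forEach(this::add);
    }

    /** Mode ii: grow the existing indices with one later addition. */
    void add(String url, String descriptor) {
        String[] words = descriptor.split(" ");
        for (int start = 0; start < words.length; start++) {
            List<String> shifted = new ArrayList<>();
            for (int k = 0; k < words.length; k++) {
                shifted.add(words[(start + k) % words.length]);
            }
            index.computeIfAbsent(String.join(" ", shifted), key -> new ArrayList<>()).add(url);
        }
    }

    /** Number of distinct shifted descriptors currently indexed. */
    int size() {
        return index.size();
    }

    List<String> urlsFor(String shift) {
        return index.getOrDefault(shift, List.of());
    }
}
```

In this sketch the admin interface would call build for the initial list and add for each later entry.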

2.18 OFR8

Cyberminer shall allow for Case sensitive search: The system shall store the input as given and retrieve the input also as such.

Issue 1: This FR creates a constraint on the users of the application which might not always be necessary or practical, and may also degrade the usability of the system for certain classes of users.

Option 1: Case sensitive search will be the only way to use the system.

Option 2: Case sensitive searching will not be implemented.

Option 3: Case sensitive searching will be implemented, but will be selectable by the user as an option via a checkbox.

Choice/Rationale: Option 3 will be implemented, as it gives the user more customizability without presenting an insurmountable implementation challenge.
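
The checkbox could simply map to a boolean flag passed to the matching routine, as in this sketch. The class name, the method signature and the choice of whole-word matching are assumptions made for illustration.

```java
/** Hypothetical sketch of keyword matching with optional case sensitivity. */
final class CaseOptionMatcher {
    /**
     * Returns true if the descriptor contains the keyword as a whole word.
     * The caseSensitive flag is assumed to mirror the state of the checkbox.
     */
    static boolean matches(String descriptor, String keyword, boolean caseSensitive) {
        for (String word : descriptor.split(" ")) {
            boolean hit = caseSensitive ? word.equals(keyword) : word.equalsIgnoreCase(keyword);
            if (hit) {
                return true;
            }
        }
        return false;
    }
}
```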

2.19 OFR9

Cyberminer shall allow for Hyperlink enforcement: When the user clicks on a URL that has been retrieved as the result of a query, the system shall take the user to the corresponding web site.

Issue 1: Will the website corresponding to the URL be displayed within the same window as Cyberminer (either in a new frame or replacing the Cyberminer view), or will it be opened in another window?

Option 1: A new window will be opened.

Option 2: The current window will navigate to the selected link.

Option 3: The website will be opened on the same page but in a different frame below the search menu.

Choice/Rationale: Option 2 will be utilized, primarily for ease of implementation. This allows for swifter completion of the system.

2.10 OFR10

Cyberminer shall allow for Specifying OR/AND/NOT Search: A keyword-based search is usually an OR search, i.e., a search on any of the keywords given. The system shall allow the user to specify the mode of search, using “OR”, “AND” or “NOT”.

Issue 1: We need to establish operator precedence among the operators.

Option 1: Allow the user to force operator precedence with the use of parentheses, such that the innermost parenthesis has the highest precedence.

Option 2: Use the most commonly used logical operator precedence rules as described in

Choice/Rationale: Both Option 1 and Option 2 will be chosen; this gives the user more control over how searches will be performed.
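
One way to combine parentheses with default precedence is a small recursive-descent evaluator, sketched below. The precedence used here (NOT binds tightest, then AND, then OR) is our assumption of the common convention. The class name BooleanQuery, the upper-case operator spelling and exact whole-word keyword matching are likewise assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: evaluates an OR/AND/NOT query against one descriptor. */
final class BooleanQuery {
    private final String[] tokens;
    private final Set<String> words;
    private int pos = 0;

    private BooleanQuery(String query, String descriptor) {
        // Pad parentheses with spaces so that they become separate tokens.
        this.tokens = query.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        this.words = new HashSet<>(Arrays.asList(descriptor.split(" ")));
    }

    static boolean matches(String query, String descriptor) {
        BooleanQuery q = new BooleanQuery(query, descriptor);
        boolean result = q.parseOr();
        if (q.pos != q.tokens.length) {
            throw new IllegalArgumentException("Unexpected token: " + q.tokens[q.pos]);
        }
        return result;
    }

    // or := and { "OR" and }   (lowest precedence)
    private boolean parseOr() {
        boolean value = parseAnd();
        while (peekIs("OR")) {
            pos++;
            boolean right = parseAnd(); // always parsed so that every token is consumed
            value = value || right;
        }
        return value;
    }

    // and := not { "AND" not }
    private boolean parseAnd() {
        boolean value = parseNot();
        while (peekIs("AND")) {
            pos++;
            boolean right = parseNot();
            value = value && right;
        }
        return value;
    }

    // not := "NOT" not | "(" or ")" | keyword   (highest precedence)
    private boolean parseNot() {
        if (peekIs("NOT")) {
            pos++;
            return !parseNot();
        }
        if (peekIs("(")) {
            pos++;
            boolean value = parseOr();
            if (!peekIs(")")) {
                throw new IllegalArgumentException("Missing ')'");
            }
            pos++;
            return value;
        }
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("Unexpected end of query");
        }
        return words.contains(tokens[pos++]);
    }

    private boolean peekIs(String token) {
        return pos < tokens.length && tokens[pos].equals(token);
    }
}
```

Under these assumptions, "software OR design AND architecture" is read as "software OR (design AND architecture)", while "(software OR design) AND architecture" forces the OR to be evaluated first.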

Issue 2: Are OR/AND/NOT sufficient for most practical uses, or do we need to augment the list based on some practical scenario?

Option 1: The only feasible logical operator missing from the list is XOR, so we add it.

Option 2: We leave the list as-is, since it caters for the search needs of most users.

Choice/Rationale: Option 2 will be implemented, as it satisfies the needs of most users, without needlessly complicating implementation.

Issue 3: How will these options be presented to the user?

Option 1: The user will type the operators into the search box.

Option 2: The user will select a check box indicating the type of search to be used.

Option 3: Provide a link to a third page called “Advanced Search” where these options can be used with some presets.

Choice/Rationale: Option 3 will be implemented. A separate search page will be less confusing for the user.

Issue 4: Will the user be allowed to switch between these operators after search results have been displayed, or will she have to do another search and select a new operator?

Option 1: The user will be allowed to switch by selecting a different option.

Option 2: The user will have to start another search.

Choice/Rationale: Option 2 will be implemented. This ensures that the user has reviewed their input before searching.

2.11 OFR11

Cyberminer shall allow for multiple search engines to run concurrently.

Issue 1: The meaning of “concurrently” is ambiguous and we need to define what it might mean in this context.

Option 1: Cyberminer will be multithreaded.

Option 2: The implementation of Cyberminer will be such that multiple systems can access the database simultaneously. (Cyberminer is a client program that can be run by multiple systems to access a central database)

Choice/Rationale: Option 2 will be implemented. The team interprets “multiple search engines running concurrently” to mean that multiple systems can access Cyberminer at the same time.

2.12 OFR12

Cyberminer shall allow for the deletion of out-of-date URLs and corresponding descriptions from the database.

Issue 1: How do we know that a URL is out of date?

Option 1: We use a preset amount of time that the URL can live in the KWIC database, at the end of which the URL will be considered out of date.

Option 2: A URL will be considered out-of-date if, at the time of Hyperlink enforcement, the destination website cannot be found.

Option 3: Periodically, and at the expiration of a preset time after the URL was originally stored, the KWIC system will “probe” the Internet to see if the corresponding website is still active and, if not, it is considered out of date.

Choice/Rationale: Option 1 will be implemented, as the requirements seem to be looking for a time-to-live for URLs.

Issue 2: At what point do we do the deletion from the database?

Option 1: When the system is restarted.

Option 2: As soon as the URL time-to-live expires.

Option 3: After the URL time-to-live expires, we check to see that the website is still online. If it is not, we delete it. If it is, we reset the time-to-live to the next interval.

Choice/Rationale: Option 1 will be implemented. This allows for some leniency in when a URL is deleted, so users will not be confused.

Issue 3: At what point do we update the date of the URL?

Option 1: We never update the date of the URL.

Option 2: We update the date of the URL when one is clicked to be opened.

Option 3: We update the date after being returned by a search.

Choice/Rationale: Option 1 will be implemented. This ensures that out of date links are removed properly. It also ensures that no URL has an arbitrarily long time-to-live.
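
Taken together, the three choices for OFR12 (a fixed time-to-live, deletion at restart, and a storage date that is never updated) could be sketched as follows. The class UrlExpiry and its methods are hypothetical, the time-to-live value is left to the caller, and an in-memory map stands in for the KWIC database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of time-to-live bookkeeping for stored URLs. */
final class UrlExpiry {
    private final Map<String, Instant> storedAt = new HashMap<>();
    private final Duration timeToLive;

    UrlExpiry(Duration timeToLive) {
        this.timeToLive = timeToLive;
    }

    /** Records the storage date once; later calls for the same URL do not change it. */
    void store(String url, Instant now) {
        storedAt.putIfAbsent(url, now);
    }

    /** Intended to run when the system is restarted; returns how many URLs were removed. */
    int purgeExpired(Instant now) {
        int before = storedAt.size();
        // A URL is out of date once its storage date plus the time-to-live lies before 'now'.
        storedAt.values().removeIf(stored -> stored.plus(timeToLive).isBefore(now));
        return before - storedAt.size();
    }

    boolean contains(String url) {
        return storedAt.containsKey(url);
    }
}
```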

2.13 OFR13

Cyberminer shall allow for listing of the query result in ascending alphabetical order.

Issue 1: Alphabetical order of URL or alphabetical order of URL description?

Option 1: Alphabetical order based on URL.

Option 2: Alphabetical order based on URL description.

Option 3: Both, selectable by clicking either column header.

Choice/Rationale: Option 1 will be implemented, as it most closely follows the requirement.

Issue 2: This requirement equates alphabetical order with the relevance of the search results. This is not always so in practice and we need to decide how to meet this requirement in a way that is practical in the search engine domain.

Option 1: We assume that alphabetical order is synonymous with relevance.

Option 2: We assume that relevance of the search results with respect to the input search terms is not important for this application.

Option 3: We assume that relevance matters, but that it is out of scope for this project for time and complexity-of-implementation reasons. Hence, relevance will be considered for future implementations.

Choice/Rationale: The team selected Option 1 based on the fact that relevance is never mentioned in the design document as an important search criterion.

2.14 OFR14

Cyberminer shall allow for Setting the number of results to show per page.

Issue 1: How will this option be presented to the user?

Option 1: This setting will be presented as a drop-down box.

Option 2: This setting will implement a text input field where the user may input the number of results to show per page.

Choice/Rationale: Option 2 was selected following a vote from the team. This is primarily an aesthetic choice, and has little to do with the core functionality of the system.

2.15 OFR15

Cyberminer shall allow for navigation between pages of the results.

Issue 1: Is there a cap on the total number of results?

Option 1: Yes, as this will improve response time.

Option 2: No, as only the first page needs to be acquired initially; the results for the other pages can be acquired when needed.
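
Fetching one page at a time, with a user-supplied page size as in OFR14, could be sketched like this. The class ResultPager and the 1-based page numbering are assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical sketch of slicing a result list into pages on demand. */
final class ResultPager {
    /** Number of pages needed to show totalResults items, pageSize per page. */
    static int pageCount(int totalResults, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        return (totalResults + pageSize - 1) / pageSize; // ceiling division
    }

    /** Returns the results for a 1-based page number, or an empty list past the last page. */
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1) {
            throw new IllegalArgumentException("pageNumber starts at 1");
        }
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int from = Math.min((pageNumber - 1) * pageSize, results.size());
        int to = Math.min(from + pageSize, results.size());
        return results.subList(from, to);
    }
}
```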

2.16 OFR16

Cyberminer shall also possibly allow for Auto-fill.

Issue 1: Do we auto-fill results or do we auto-fill search terms?

Option 1: As the user types in a search term, Cyberminer will generate a list of “suggestions” to search for, based on previous searches.

Option 2: As the user types in a search term, Cyberminer will generate a list of “suggestions” to search for, based on the current indices in the KWIC database.

Option 3: As the user types in a search term, Cyberminer will generate a list of search results based on the current indices in the KWIC database.

Choice/Rationale: The team selected Options 1-3 here, and they will be implemented as time permits.
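
As a starting point for Option 1, suggestions can be served from a sorted set of previous searches, as in the sketch below. The class SearchSuggester, its methods and the case-sensitive prefix match are assumptions. The same lookup would serve Option 2 if the set were filled from the KWIC indices instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of prefix-based suggestions from previous searches. */
final class SearchSuggester {
    private final TreeSet<String> previousSearches = new TreeSet<>();

    void recordSearch(String term) {
        previousSearches.add(term);
    }

    /** Returns up to 'limit' previously searched terms that start with the typed prefix. */
    List<String> suggest(String prefix, int limit) {
        List<String> suggestions = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous from there.
        for (String term : previousSearches.tailSet(prefix)) {
            if (!term.startsWith(prefix) || suggestions.size() >= limit) {
                break;
            }
            suggestions.add(term);
        }
        return suggestions;
    }
}
```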