Information Cleansing Project

Statement of Requirements

Date: 31st August 2010

Author: Maria Ellis

Document History

Revision History

Revision date / Version / Summary of Changes / Changes marked
22/06/10 / 0.1 / First draft / No
29/06/10 / 0.2 / First draft issued / No
16/07/10 / 0.3 / Comments on first draft / No
22/07/10 / 0.4 / Meetings held 19/7 and 22/7 to review document. Tracked and untracked versions created / Yes
09/08/10 / 0.5 / Updated requirements listing in document / No
13/08/10 / 1.0 / First issue / No
31/08/10 / 1.0 / Final following internal review / No

Contents

Document History
Contents
1.  Introduction & Background
2.  Proposed Solution Criteria
3.  Requirements
4.  Additional Information Required/Questions
5.  Timescales

1.0  Introduction & Background

The Royal Borough of Kensington and Chelsea (RBKC) has approximately 3,500 staff and an unknown volume of unstructured data. Current estimates suggest there are approximately 28 terabytes of data across a variety of environments, including desktops, servers and the SAN. This data may be structured, e.g. held within a database, or unstructured, e.g. documents or spreadsheets. The amount of information has grown over time and has become increasingly divorced from its original purpose. Information is frequently retained, rather than destroyed or archived, in case it might be needed. This is an issue because it takes time to locate information, and even more time is lost when it cannot be located when required.

Failure to address such uncontrolled information places the Council at risk of not meeting its legal obligations and of wasting staff time sifting through obsolete and redundant information.

The long-term vision of the Council, under the auspices of the Space programme, is to drive a greater reliance upon electronic information. As more and more documents, images and messages become electronic, it becomes more important to manage their storage, retrieval, archiving and deletion successfully so that staff can access the right information at the right time.

As part of this, it is proposed to move to Microsoft SharePoint 2010 for the storage and management of documentation. Before this can happen, a data identification exercise needs to take place so that RBKC understands what is held and can create a filing structure to underpin the valuable data before it is moved into the new system. The process of migrating to a new system involves restructuring the data we need and destroying or permanently archiving obsolete data.

More urgent than any of these requirements are the current problems with the SAN, which has capacity issues and is being expanded at significant cost to the Council.

2.0  Proposed Solution Criteria

The Council requires a software tool that makes it easy to identify and remove obsolete and duplicate information and, once information has been cleansed, to build an effective classification structure, or file plan, based on the local government classification scheme or a classification scheme designed by the Council. By speeding up data identification and analysis alongside the work on data management procedures, this will enable the Council to address the issue within a reasonable timescale.

The tool must be able to analyse a given information store, identify and classify all the documents and content in that store, and report on what it finds. The results should show duplicates and near-duplicates and ideally suggest metadata for use in any migration activity. The tool should also be able to suggest a business classification scheme/file plan based on the results of the identification, with the ability for appropriate staff to change the suggested scheme/file plan in line with established business functions and activities.
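
For illustration only, exact duplicates in a file store can be found by grouping files on a hash of their content. The sketch below is a minimal example under stated assumptions and does not describe how any supplier's product works: the function names `file_digest` and `find_exact_duplicates` are hypothetical, SHA-256 is an arbitrary choice of hash, and unreadable files are simply skipped rather than reported.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` by content hash; keep groups of two or more."""
    by_digest = defaultdict(list)
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            try:
                by_digest[file_digest(path)].append(path)
            except OSError:
                # Assumption: unreadable files (e.g. permission denied) are
                # skipped here; a real tool would need to report them.
                continue
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Calling `find_exact_duplicates` on a folder returns a dictionary mapping each content hash to the list of paths that share it. Files with the same name but different content are not grouped, and files with different names but identical content are.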

3.0  Requirements

Req. No: / ID0001 / Requirement: / Automatic extraction of key themes
Type: / Must Have / Description: / The product must be able to analyse the content of documents and present back an analysis consisting of themes which could be used for classification.
Req. No: / ID0002 / Requirement: / Theme based searching
Type: / Must Have / Description: / The product must be capable of searching for themes within and across documents and presenting this back as a list or file.
Req. No: / ID0003 / Requirement: / Classification or taxonomies
Type: / Must Have / Description: / The product must be able to operate with pre-defined classifications or taxonomies but should allow definition of a new taxonomy or classification.
Req. No: / ID0004 / Requirement: / Taxonomy/Business classification scheme creation tools
Type: / Must Have / Description: / The product must be capable of providing tools to create and manipulate a taxonomy or business classification based on the information found.
Req. No: / ID0005 / Requirement: / Automatic summarization of content
Type: / Must Have / Description: / The product must be able to provide a summary of the content found during the analysis.
Req. No: / ID0006 / Requirement: / Summary of document types found
Type: / Must Have / Description: / The product must be able to provide a summary of all document types found during the analysis.
Req. No: / ID0007 / Requirement: / Document Type not supported
Type: / Must Have / Description: / The product must be capable of reporting on document types which it does not support.
Req. No: / ID0008 / Requirement: / Reporting Capability
Type: / Must Have / Description: / The product must have user-defined reporting capabilities.
Req. No: / ID0009 / Requirement: / Duplication reporting
Type: / Must Have / Description: / The product must provide a report of all duplicate documents, listing names, properties and contents across a variety of storage areas and devices, irrespective of the security permissions on particular folders or files, for the purposes of duplicate checking.
Req. No: / ID0010 / Requirement: / Near-duplication reporting
Type: / Must Have / Description: / The product must provide a report of all near-duplicate documents, listing document versions and highlighting degrees of match and differences across a variety of storage areas and devices, irrespective of the security permissions on particular folders or files, for the purposes of duplicate checking.
Req. No: / ID0011 / Requirement: / Duplication management tools
Type: / Must Have / Description: / The product must provide a method of managing discovered duplicate documents and give the user the opportunity to keep, archive or dispose of duplicates.
Req. No: / ID0012 / Requirement: / Migration tools
Type: / Must Have / Description: / The product must provide migration tools either as an integral part of the product or as a standalone product. The migration tool must be capable of providing an interface to Microsoft Office SharePoint Server 2007 and SharePoint 2010.
Req. No: / ID0013 / Requirement: / Microsoft Server environment
Type: / Must Have / Description: / The product must work in a Microsoft Windows Server 2003 R2 or Microsoft Windows Server 2008 R2 environment.
Req. No: / ID0014 / Requirement: / Security integrity
Type: / Must Have / Description: / The product must maintain the security of all documents searched and maintain the security level of the user so that they only have access to the information that they have a right to see.
Req. No: / ID0015 / Requirement: / Document Type comparison
Type: / Must Have / Description: / The product must be capable of comparing documents of all types. As a minimum, this must cover all MS Office document types (up to and including 2010), pst, pdf, jpeg, gif, png, bmp and Lotus Notes, across different environments. The response should confirm the document types and environment types covered.
Req. No: / ID0016 / Requirement: / User Roles
Type: / Must Have / Description: / The product must allow different roles to be set up for different groups of users of the software.
Req. No: / ID0017 / Requirement: / Automation
Type: / Must Have / Description: / The product must be able to automate tasks such as document categorisation, deletion and merging.
Req. No: / ID0018 / Requirement: / Batch jobs
Type: / Must Have / Description: / The product must allow (batch) jobs to be created and run at defined times.
Req. No: / ID0019 / Requirement: / Breakdown of disk space used
Type: / Should Have / Description: / The product must provide a complete breakdown of the disk space used, by defined criteria e.g. file type, file age, size.
Req. No: / ID0020 / Requirement: / Rules
Type: / Could Have / Description: / The product must allow rules to be created, for example for identifying duplicate documents which may be deleted, moved or merged.
Req. No: / ID0021 / Requirement: / Metadata based index
Type: / Could Have / Description: / The product must be able to create a searchable index based on the metadata either found or created by the tool as part of the analysis and for compliance purposes (e.g. non-populated document property fields).
Req. No: / ID0022 / Requirement: / Assigning Metadata
Type: / Could Have / Description: / The product must be able to do bulk metadata assignment to documents based on user-defined parameters applied to document sets.
Req. No: / ID0023 / Requirement: / Web based user interface
Type: / Could Have / Description: / All user interaction must be via a web based interface rather than a client application.
Req. No: / ID0024 / Requirement: / Email Replication
Type: / Could Have / Description: / The product must allow analysis of email messages within an Outlook personal folder (.pst), for example to:
a)  reduce replication, such as by identifying email chains and retaining only the last email in the chain
b)  analyse email content so that email can be reorganised by category
c)  analyse attachments in emails
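
As an illustration of the "degrees of match" asked for in ID0010, the sketch below scores pairs of document texts with the Python standard library's `difflib.SequenceMatcher`. It is a minimal example under stated assumptions, not a description of any supplier's algorithm: the function names, the 0.8 threshold and the pairwise comparison (which would not scale to a large store) are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(text_a, text_b):
    """Return a match ratio between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def near_duplicates(documents, threshold=0.8):
    """Return (name_a, name_b, ratio) for pairs at or above `threshold`.

    `documents` maps a document name to its extracted text. Identical texts
    (ratio 1.0) are excluded so that only near matches are reported.
    """
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(documents.items()), 2):
        ratio = similarity(text_a, text_b)
        if threshold <= ratio < 1.0:
            pairs.append((name_a, name_b, round(ratio, 2)))
    # Strongest matches first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)
```

Given two versions of a document that differ by a single word and one unrelated document, `near_duplicates` would report only the pair of versions, together with their match ratio.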

4.0  Additional Information Required/Questions

1.  Is the speed of processing governed by different file types?

2.  What support for the tool is available?

3.  How quickly can the tool process 1 terabyte of data?

4.  What is the lead time to start work from agreement of the contract?

5.  What logic does the product use to carry out analysis of the themes?

6.  How does the product usefully identify metadata from documents analysed?

7.  Describe the product technical architecture e.g. two-tier, three-tier, multi-tier

8.  Please provide details of the product roadmap and future direction

9.  Which file storage systems is this product compatible with, e.g. NTFS, SharePoint/SQL BLOBs?

10. Please supply a list of all document types that the product works against

11. How does the product present alternative document structures e.g. document libraries/folders?

12. Define the capabilities of the product reporting tool including a list of all standard reports.
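
The answer to question 3 can be related to the estimated 28 terabytes held by the Council with simple arithmetic. The sketch below is illustrative only: the function name `estimated_days` is hypothetical, it assumes a constant processing rate across all file types (which question 1 asks suppliers to confirm or correct), and the figure of 12 hours per terabyte is an invented placeholder, not a supplier claim.

```python
def estimated_days(total_terabytes, hours_per_terabyte, hours_per_day=24.0):
    """Estimate elapsed days to process a data store at a constant rate."""
    if hours_per_terabyte <= 0 or hours_per_day <= 0:
        raise ValueError("rates must be positive")
    return total_terabytes * hours_per_terabyte / hours_per_day


# Hypothetical rate of 12 hours per terabyte, running around the clock.
print(estimated_days(28, 12))  # 14.0
```

If processing could only run for 8 hours a day, the same hypothetical rate would give `estimated_days(28, 12, hours_per_day=8)`, i.e. 42 days.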

5.0  Timescales

The implementation approach will be to conduct the first phase within three months of purchase. Suppliers would be expected to demonstrate how they would support this and subsequent phases.
