February 16, 2007

This document describes the automatic generation of METS and EADs through GenDB and GenX. The two processes are similar in many ways, so for simplicity I refer to METS wherever the processing of a METS and of an EAD coincide, and point out where the two diverge. An additional section at the end describes possible approaches to automatic XTF indexing.

Step 1: Requesting METS generation from the UI

There are three screens from which ProjMgr can issue a METS creation request:

The above screen - PrjMgrMakeEAD - is reachable from the PrjMgrOptions screen and is used to create an EAD and all its associated METS. Since each project has only a limited number of collection-level objects, this screen can show a dynamically generated list with checkboxes.

·  The list contains all objects in the project for which tblObject.ObjectType = collection

The above screen – PrjMgrMakeMETS - is also reachable through the ProjMgrOptions area. In this case, since a project can have a large number of METS, a list of candidates is built from specified criteria. The resulting page lists titles with corresponding checkboxes, similar to what is done for WorkOrders. Candidates include only standalone METS, i.e. no imbedded objects are retrieved.

·  Candidates are objects for which tblObject.ImbeddedObject = false

·  The indexes available in the drop down are the same as for the object search in the opening page of WebGenDB.

The above screen is the top section of the data entry page when the user has a ProjMgr login. The ‘Generate METS’ button does not appear for any lower-privilege login.

When deciding if the button is shown:

·  GenDB checks the user login level, kept in the internal Java class User: UI (User Interface) -> ML (Middle Layer) User.getUserPrivilege > 10

·  GenDB checks the imbedded status: UI -> ML GenSubObj.isImbeddedRoot()

·  GenDB checks whether tblObject.Completed = true, in which case the generation happens on Save.

·  All objects for which METS generation happens on Save should carry a marker that distinguishes them, e.g. the text “Save generates METS” in the top corner. A different screen color is also a possibility, but the proliferation of colors could confuse users.
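The display rule above can be sketched as a single predicate. This is only an illustration: the class and method names are hypothetical, the `standalone` flag stands in for whatever GenSubObj.isImbeddedRoot() actually reports, and hiding the button when Completed = true is my assumption (in that case Save already generates the METS).

```java
/** Sketch of the 'Generate METS' display rule. All names here are hypothetical. */
final class GenerateButtonRule {
    // Threshold quoted in the text: User.getUserPrivilege > 10 means a ProjMgr-level login.
    static final int PROJ_MGR_THRESHOLD = 10;

    /**
     * @param userPrivilege value that User.getUserPrivilege would return
     * @param standalone    true when the object is not imbedded (assumed reading of the imbedded check)
     * @param completed     value of tblObject.Completed
     */
    static boolean showGenerateButton(int userPrivilege, boolean standalone, boolean completed) {
        if (userPrivilege <= PROJ_MGR_THRESHOLD) {
            return false; // lower-privilege logins never see the button
        }
        if (!standalone) {
            return false; // imbedded objects do not get their own METS
        }
        // Assumption: when Completed is true, Save generates the METS, so the button is redundant.
        return !completed;
    }

    /** True when the screen should carry the "Save generates METS" marker. */
    static boolean showSaveGeneratesMarker(boolean standalone, boolean completed) {
        return standalone && completed;
    }
}
```

For example, `GenerateButtonRule.showGenerateButton(20, true, false)` is true, while the same call with a privilege of 10 is false.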

The ProjMgr can set “generation on save” as default behavior for objects in a given DB from the following screen:

By checking the checkbox, the PrjMgr makes METS generation the default at every save done on a data entry screen, regardless of the privilege level and of whether a METS already exists.

·  When the PrjMgr checks the ‘Generate METS on save’ checkbox:

o  GenDB sets tblObject.Completed = true for all existing objects for that project

o  Any new object is created with tblObject.Completed = true

·  When the PrjMgr unchecks the ‘Generate METS on save’ checkbox:

o  All new objects are created with tblObject.Completed = false

o  Existing objects are untouched since they already have METS, for which ‘generate on save’ is the default behavior.

·  Side note: making ‘Generate METS on save’ the default behavior for all privileges is only appropriate for projects in which the object MD is entered all at once, so mainly for simple objects.
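The effect of the checkbox on tblObject.Completed can be illustrated with a small in-memory model. It is a sketch under assumptions: the class names are hypothetical and a list of rows stands in for the tblObject rows of one project; the real implementation would issue database updates instead.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of the 'Generate METS on save' project setting. Names are hypothetical. */
final class GenerateOnSaveSetting {
    /** Stand-in for one tblObject row. */
    static final class ObjectRow {
        final String objId;
        boolean completed; // mirrors tblObject.Completed

        ObjectRow(String objId, boolean completed) {
            this.objId = objId;
            this.completed = completed;
        }
    }

    private final List<ObjectRow> objects = new ArrayList<>();
    private boolean generateOnSave; // state of the PrjMgr checkbox

    /** Called when the PrjMgr checks or unchecks the checkbox. */
    void setGenerateOnSave(boolean checked) {
        generateOnSave = checked;
        if (checked) {
            // Checking the box marks every existing object of the project as Completed.
            for (ObjectRow row : objects) {
                row.completed = true;
            }
        }
        // Unchecking leaves existing rows untouched; only new objects are affected.
    }

    /** New objects inherit the current checkbox state. */
    ObjectRow createObject(String objId) {
        ObjectRow row = new ObjectRow(objId, generateOnSave);
        objects.add(row);
        return row;
    }
}
```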

Step 2: GenDB calling GenX

GenDB starts the process for METS generation if:

·  A submit is done from one of the PrjMgr screens (see the first two images in this doc)

·  A “Generate METS” is issued from a data entry screen. The button appears only under the conditions listed above.

·  A “Save” is issued from a data entry screen for a non-imbedded object when the PrjMgr has chosen the feature “Generate METS on save” for that project.

·  A “Save” is issued for a non-imbedded object for which a METS was already generated, regardless of the “Generate METS on save” choice. It is assumed that any “Save” after a first METS is generated will mostly be minor edits, unlikely to require revision and further PrjMgr approval.

At every “Save” in the data entry screen GenDB checks whether tblObject.Completed = true to determine if a METS needs to be generated.

METS generated from a data entry screen – either “Save” or “Generate METS”:

·  GenDB checks the minimum requirements for METS creation.

o  If the validation is successful, GenDB shows the user an acknowledgment of the submission and a link to the WebGenDB entry page, and possibly others. The acknowledgment also tells the user to check their email for the full report on the METS creation.

o  If the validation is not successful, GenDB shows a list of the problematic fields and a link to the object. While the METS is not generated, all the values entered are saved as usual.

·  GenDB creates a GenX object and invokes GenX.Process (OBJIDs, PJID, DBID, LoginEmail)

METS generated from a PrjMgrMakeEAD or PrjMgrMakeMETS:

·  GenDB shows an acknowledgment of the submitted request, states that final results will be mailed to the user, and offers a link to the WebGenDB entry page. Multiple METS, and above all EADs, can take a long time to complete, which makes mailing the results to the user more reasonable.

·  GenDB parses the list of OBJID for which METS/EAD need to be created.

·  For each candidate GenDB validates the minimum requirements, creating two lists: one of successful validations and one of failed ones with the causes of failure.

·  GenDB mails the user the list of validation failures with their causes. There is no need to send the list of successful ones since they still need to go through GenX and the final METS validation.

·  GenDB creates a GenX object and invokes GenX.Process() with the semicolon-separated (‘;’) list of OBJIDs that passed validation.

·  NOTE: because an email must be associated with the login, GenDB needs an additional column, tblUser.email, which can be filled in at project setup time.
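The batch flow above (validate each candidate, keep two lists, pass only the successes to GenX) could look roughly like the following. This is a sketch, not the GenDB code: the class is hypothetical, and the minimum-requirements check is passed in as a function that returns null on success or a failure cause otherwise.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch of the pre-GenX validation pass. Names are hypothetical. */
final class BatchValidation {
    final List<String> passed = new ArrayList<>();            // OBJIDs to hand to GenX
    final Map<String, String> failed = new LinkedHashMap<>(); // OBJID -> cause, mailed to the user

    /**
     * @param objIds candidates selected on PrjMgrMakeEAD or PrjMgrMakeMETS
     * @param check  stand-in for the minimum-requirements check:
     *               returns null when the object passes, otherwise the cause of failure
     */
    static BatchValidation validate(List<String> objIds, Function<String, String> check) {
        BatchValidation result = new BatchValidation();
        for (String objId : objIds) {
            String cause = check.apply(objId);
            if (cause == null) {
                result.passed.add(objId);
            } else {
                result.failed.put(objId, cause);
            }
        }
        return result;
    }

    /** Builds the semicolon-separated OBJID argument for GenX.Process(). */
    String objIdArgument() {
        return String.join(";", passed);
    }
}
```

The `failed` map preserves the order in which the candidates were checked, so the mailed report lists failures in the same order as the selection screen.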

Step 3: GenX

Once invoked, GenX.Process() does the following:

·  For each OBJID in the semicolon-separated list received in the parameter list:

o  GenX checks whether ExportType = EAD, in which case it creates an EAD and its associated METS

o  NOID – GenX checks for a FileLevel ARK in GenDB

·  if there isn’t one, GenX checks the file url against the reverse resolver and, only if it finds none, mints a new id, binds it to the url, reverse binds the url to the id, and writes it in GenDB.

·  if there is already one, GenX reuses it.

o  NOID – GenX checks whether there is already an ObjectLevel ARK in GenDB.

·  if there is, GenX reuses it.

·  if there isn’t, GenX mints a new one, binds it to the url, reverse binds the url to the new id for the reverse resolver, and writes it in GenDB

o  GenX validates the created METS document against the METS schema using Xerces libraries, and performs whatever EAD validation is feasible against the BPG rules.

o  If validation is successful, GenX sets tblObject.Completed = true for that object, writes the file in the appropriate directory, and adds the OBJID and Title to the list of successfully completed METS/EAD.

o  If validation is unsuccessful, GenX writes the OBJID, the object Title and the reason for failure in the list of unsuccessful METS/EAD.

o  EAD NOTE: the failure to create one imbedded METS causes the failure of the EAD to which it belongs and stops the creation of any additional METS associated with that EAD. Because of the possible rollback, an EAD and its associated METS are written to the destination directory only when all the METS have been checked as valid. GenX could either use a tmp directory or retain the info in memory, depending on the number of METS involved.

·  Once all the OBJIDs have been processed, GenX returns the two lists (successful & unsuccessful) to GenDB, writes them to a log, and mails them to the LoginEmail.
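The two NOID steps in the list above follow a reuse-or-mint pattern, sketched below. Everything here is illustrative: the class is hypothetical, plain maps stand in for GenDB, the NOID binder and the reverse resolver, and the minter is an injected function rather than a real NOID call. Reusing an id found in the reverse resolver and recording it in GenDB is my reading of the FileLevel step, not something the text states outright.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Sketch of ARK reuse-or-mint. Maps stand in for GenDB, the binder and the reverse resolver. */
final class ArkResolver {
    final Map<String, String> fileArksInGenDb = new HashMap<>();   // file key -> ARK
    final Map<String, String> objectArksInGenDb = new HashMap<>(); // OBJID -> ARK
    final Map<String, String> urlByArk = new HashMap<>();          // binder: ARK -> url
    final Map<String, String> arkByUrl = new HashMap<>();          // reverse resolver: url -> ARK
    private final Supplier<String> minter;                         // hypothetical stand-in for the NOID minter

    ArkResolver(Supplier<String> minter) {
        this.minter = minter;
    }

    /** FileLevel: GenDB first, then the reverse resolver, and mint only as a last resort. */
    String fileLevelArk(String fileKey, String url) {
        String ark = fileArksInGenDb.get(fileKey);
        if (ark != null) {
            return ark; // already recorded in GenDB: reuse
        }
        ark = arkByUrl.get(url);
        if (ark == null) {
            ark = mintAndBind(url);
        }
        fileArksInGenDb.put(fileKey, ark); // write back to GenDB
        return ark;
    }

    /** ObjectLevel: reuse the ARK in GenDB, otherwise mint and bind a new one. */
    String objectLevelArk(String objId, String url) {
        String ark = objectArksInGenDb.get(objId);
        if (ark == null) {
            ark = mintAndBind(url);
            objectArksInGenDb.put(objId, ark);
        }
        return ark;
    }

    private String mintAndBind(String url) {
        String ark = minter.get();
        urlByArk.put(ark, url); // bind id -> url
        arkByUrl.put(url, ark); // reverse bind url -> id
        return ark;
    }
}
```

Calling either method twice with the same key returns the same ARK and mints nothing the second time, which is the property the reverse resolver check is meant to guarantee.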

Step 4: XTF indexing – indexing is done at Collection level only (for now)

Once GenDB receives back the list of OBJID for which METS were successfully generated:

·  GenDB checks the database for the collections, if any, to which those objects belong.

·  For each of those collections, but only once for each, GenDB updates a trigger file which is periodically used to reindex collections which have received updates. A cron job will start the process.

·  GenDB mails the LoginEmail a report acknowledging the upcoming reindexing of the collections, possibly specifying when the update will happen.
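A possible shape for the trigger file update is sketched below. The file format is an assumption (one collection name per line), as are the class and method names; the object-to-collection map stands in for the database lookup. A sorted set guarantees that each collection is listed only once, however many of its objects were regenerated.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the reindex trigger file update. Names and file format are assumptions. */
final class ReindexTrigger {
    /**
     * Merges the collections of the regenerated objects into the names already queued.
     *
     * @param objIds              OBJIDs whose METS were successfully generated
     * @param collectionsByObject stand-in for the database lookup: OBJID -> collection names
     * @param alreadyQueued       collection names already present in the trigger file
     */
    static Set<String> collectionsToReindex(Collection<String> objIds,
                                            Map<String, List<String>> collectionsByObject,
                                            Collection<String> alreadyQueued) {
        Set<String> queued = new TreeSet<>(alreadyQueued);
        for (String objId : objIds) {
            // Objects that belong to no collection contribute nothing.
            queued.addAll(collectionsByObject.getOrDefault(objId, Collections.<String>emptyList()));
        }
        return queued;
    }

    /** Rewrites the trigger file with one collection name per line. */
    static void update(Path triggerFile, Collection<String> objIds,
                       Map<String, List<String>> collectionsByObject) throws IOException {
        List<String> queued = Files.exists(triggerFile)
                ? Files.readAllLines(triggerFile)
                : Collections.<String>emptyList();
        Files.write(triggerFile, collectionsToReindex(objIds, collectionsByObject, queued));
    }
}
```

The cron job would then read the file, reindex each listed collection, and empty the file; that part is outside this sketch.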

Keeping track of the XTF indexes:

·  XTF indexes need to be defined and configured beforehand.

o  Solution with index names kept in GenDB.

§  This requires the creation of a new tblXTFIndex and a tblXTFLink since one METS can belong to multiple indexes.

§  Since XTF indexes need to be configured by LSO, tblXTFIndex could also be updated only by LSO, by creating entries directly in the DB, which removes the need for an additional user interface.

§  All XTF configuration files are outside of GenDB in the currently dedicated directory.

o  Solution with index names outside GenDB

§  The connection between the Collection name and the XTF index name is kept in a MySQL database or an XML file. The table structure of the former, or the schema of the latter, must reflect the many-to-many relationship, as in the case above.

§  As above, once LSO has created the configuration files for XTF, updating the file or the MySQL database is also LSO’s responsibility.

§  XTF configuration files stay in the current dedicated directory.
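Whichever storage is chosen, the many-to-many relationship comes down to a set of (collection, index) pairs that can be queried in both directions. The sketch below shows that shape in memory; the class and method names are hypothetical and the pairs play the role of the rows of a tblXTFLink-style table.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** In-memory sketch of the many-to-many link between collections and XTF indexes. */
final class XtfIndexLinks {
    private final Map<String, Set<String>> indexesByCollection = new LinkedHashMap<>();
    private final Map<String, Set<String>> collectionsByIndex = new LinkedHashMap<>();

    /** Records one link row; adding the same pair twice has no further effect. */
    void link(String collection, String indexName) {
        indexesByCollection.computeIfAbsent(collection, k -> new LinkedHashSet<>()).add(indexName);
        collectionsByIndex.computeIfAbsent(indexName, k -> new LinkedHashSet<>()).add(collection);
    }

    /** Indexes that must be refreshed when the given collection changes. */
    Set<String> indexesFor(String collection) {
        return indexesByCollection.getOrDefault(collection, Collections.<String>emptySet());
    }

    /** Collections that feed the given index. */
    Set<String> collectionsIn(String indexName) {
        return collectionsByIndex.getOrDefault(indexName, Collections.<String>emptySet());
    }
}
```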

Frequency of XTF updates:

Some Lucene benchmarks can be found on their web site (http://lucene.apache.org/java/docs/benchmarks.html). I also have some anecdotal information from CDL, whose largest collection using XTF has 230,000 objects, mostly METS with a significant number of large TEI. CDL does an incremental update every week and a clean run every month. We currently have about 43,000 METS across all the collections, so a once-a-day incremental indexing of individual collections seems a reasonable choice.

Default assignment of Collection value to records:

Some projects, above all the ones which will be using the simplified version of the WebGenDB interface, might want to set up a Collection name by default so that all records are automatically created with the given value. Defaulting the Collection simplifies the input screen and ensures that all the records for that project will be properly indexed.

·  The Collection name is added to tblDefaults at Project SetUp time by LSO, according to the ProjMgr specifications.

·  The first time a record is created for that project, a tblRelated record with RelType=collection and RelName corresponding to the default is created. The first and all subsequent records are linked to that Related record.

·  Different projects can have the same Collection name because it is the uniqueness of the XTF index name that ensures that only the right METS are included.

·  The XTF index name is assigned by LSO and remains hidden from the data owners and from the patrons.
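The default Collection behavior described above is a get-or-create on the tblRelated record. A minimal sketch follows, assuming one tblRelated row per project (the text does not say whether projects sharing a Collection name would share the row); the class, field and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of default Collection assignment. Assumes one tblRelated row per project. */
final class DefaultCollectionLinker {
    /** Stand-in for one tblRelated row. */
    static final class RelatedRow {
        final int relatedId;
        final String relType = "collection"; // RelType value quoted in the text
        final String relName;

        RelatedRow(int relatedId, String relName) {
            this.relatedId = relatedId;
            this.relName = relName;
        }
    }

    // Stand-in for tblDefaults: project id -> default Collection name, filled in by LSO.
    final Map<Integer, String> defaultCollectionByProject = new HashMap<>();
    private final Map<Integer, RelatedRow> relatedByProject = new HashMap<>();
    private int nextRelatedId = 1;

    /**
     * Returns the tblRelated row a new record of the project should link to,
     * creating it on first use, or null when the project has no default Collection.
     */
    RelatedRow relatedRowForNewRecord(int projectId) {
        String name = defaultCollectionByProject.get(projectId);
        if (name == null) {
            return null;
        }
        RelatedRow row = relatedByProject.get(projectId);
        if (row == null) {
            row = new RelatedRow(nextRelatedId++, name); // first record of the project creates the row
            relatedByProject.put(projectId, row);
        }
        return row; // the first and all subsequent records link to the same row
    }
}
```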