The SAFTINet Black Box Functional Requirements
Version 1.0
22-July-2011
1 Purpose of the SAFTINet Black Box
The SAFTINet Black Box (SBB) is responsible for accepting data from multiple sources to create linked output data that is compliant with the SAFTINet grid node (see Figure 1). In its initial configuration, SBB will support two linked sources: Clinical data from clinical data partners and Medicaid data from patients associated with clinical data partners.
Figure 1: High-level view of SAFTINet Black Box inputs and outputs.
Key capabilities to be provided by the SBB are:
- Mapping site-specific codes and values into uniform codes and values based on the OMOP CONCEPT_IDs.
- Performing record linkage across two data sources, supporting both clear-text and encrypted identifiers.
- Creating random identifiers while retaining unaltered dates, resulting in a HIPAA limited data set.
- Supporting role-based administrative functions and pre-defined reports for evaluating data quality, source_to_concept_ID mapping performance, and record linkage performance
- Enabling ad-hoc data queries and reporting using clinical data, source to concept ID mapping data, and record linkage data
The SBB will be developed in multiple releases. Section 7 provides an initial view of a proposed (subject-to-change) functional roadmap for SBB releases.
2 High Level System Design
The internal functional modules and processing workflows are illustrated in Figure 2. Data from clinical sources “flow” through the modules in the upper half of the figure. Medicaid data “flow” through the modules in the lower half of the figure. By design, Medicaid data are matched to patients that are present in the clinical data set. That is, Medicaid identifiers are matched to clinical identifiers. To match Medicaid identifiers to clinical identifiers, clinical data must be fully processed PRIOR to processing Medicaid data.
The execution of the Medicaid data processing is optional. In other settings (e.g., DARTNet) that do not have a requirement to link a second data source, the clinical data flow can be used to create a SAFTINet-compliant grid XML data file from an import file that conforms to the SAFTINet ETL XML schema.
Figure 2: Key functional modules within the SAFTINet Black Box.
Description of key functional components
The tasks to be performed in each module are:
- XML Importer
- Accepts data in SAFTINet-specified XML, checks for conformance to the SAFTINet ETL XML schema, and inserts data into the SBB MySQL database. All import errors are logged for review, reporting, and analysis.
- As additional methods for data import are developed, the XML Importer module will be generalized to be a more flexible Data Importer module
- Map Source Codes to OMOP Concept_IDs
- Maps local site codes into OMOP Concept_IDs using a SAFTINet-specific Source-To-Concept_ID mapping table.
- Assigns Concept_ID = 0 to all mapping failures per OMOP convention.
- Supports reporting on data quality and mapping failures.
- PPRL Encryption of Clear Text PHI
- Performs PPRL data encryption using SAFTINet-developed encryption algorithms on all clear text HIPAA identifiers
- Inserts clear-text and encrypted identifiers into SBB database (ID Mapping Table).
- GUID generation
- Generates large random GUID for each patient and each encounter ID
- Inserts GUID identifiers into ID Mapping Table.
- Clear Text Record linkage
- Performs clear text record linkage using code base from Regenstrief or OpenEMPI (TBD)
- Provides record linkage measures for non-matches, near-matches, and matches
- Privacy Protecting Record Linkage
- Performs PPRL record linkage using code base developed by Vijay Thurimella based on methods developed by EA Durham at Vanderbilt University
- Provides record linkage measures for non-matches, near-matches, and matches
- Map to GUID
- Maps clear text identifiers (if present) to GUID identifiers in ID Mapping Table
- Maps encrypted identifiers (if present) to GUID identifiers in ID Mapping Table
- JDBC Exporter
- Creates output data in the SAFTINet-defined grid node JDBC database schema.
- Appends data quality, mapping performance, and record-linkage performance measures data as defined in grid node JDBC database schema.
- Appends site-specific source-to-concept_ID mappings and source mapping failures as defined in grid node JDBC database schema.
3 Architectural Considerations
To meet AHRQ funding obligations and to support the wide distribution of technical work products, the SAFTINet team is committed to creating a system that can be widely deployed by other data contributors interested in participating in an OMOP-based distributed research network. To achieve these goals, the following architectural principles will be applied:
- As much as possible, development will be done using open-source, open-license software. An exception will be made if open-source development significantly increases the development cost or development time. The use of proprietary software must be limited to software that is widely used and easily available, such as Microsoft Windows and Microsoft SQL Server. Since many components to be incorporated into the Black Box have been written in Java, at least some components will require Java unless the time and cost of redeveloping these tools in another language can be justified.
- Existing work by others that exists in the public domain will be leveraged as much as possible. For each component derived from the work of others, explicit permission to incorporate their work will be obtained. Acknowledgement of their contributions will be prominently displayed in the Black Box documentation to ensure proper intellectual and technical credit.
- All development source code will be hosted in a readily available open-access source management system such as SourceForge or GForge. Development, technical, and user documentation will also be available from the same site.
- Detailed technical and user documentation will be required to support end-user acceptance of the Black Box. SAFTINet may enlist client partners to assist in creating documentation that meets the needs of a broad community.
- To minimize installation and configuration barriers, the SAFTINet Black Box will be distributed as a complete virtual machine. Detailed instructions for installation and configuration of the virtual machine will be provided during software distribution. Based on the deployment plans for the grid node, the VM will be deployed using VMWare and VMplayer.
4 Functional Specifications
4.1 System Administration
4.1.1 Users must log into the system using a unique user name and password
4.1.2 Each user account is assigned to one and only one role
4.1.3 Roles include root, SBB administrator, database administrator, data manager, and standard user
4.1.4 Functional capabilities by roles:
Role / Full system access / Add users; Assign roles / Install S/W updates / Alter DBMS structures / Initiate data upload / Review log files / Create reports / View reports
Root / x / x / x / x / x / x / x / x
SBB Admin / x / x / x / x / x / x
Database Admin / x / x / x / x / x
Data Manager / x / x / x / x
Standard User / x
4.1.5 System software updates will be accomplished by installing a new virtual machine image
4.1.5.1 Future functionality: System software updates will allow saving and restoring Black Box data and settings across upgrades.
4.2 Data upload
4.2.1 A role-based qualified user may initiate a data upload.
4.2.1.1 Future functionality: Data uploads can be scheduled with pre-configuration of processing options.
4.2.2 A qualified user may select a data file for uploading using standard file navigation dialogs
4.2.3 System validates selected file for conformance with SAFTINet XML schema
4.2.4 A file that fails XML conformance testing will not be allowed to proceed. The system will provide information on the nature of the XML non-conformance
4.2.5 The system will check the XML input file for a Data Source attribute. If Data Source attribute = “clinical” then all data tables will be cleared of existing data. The user will be warned prior to data tables being cleared. This requirement implements a full data dump input model (to be replaced in later releases)
4.2.5.1 Future functionality: System supports incremental data uploads
4.2.6 The system will insert all data contained in the input file into the internal SBB DBMS
4.2.7 The system will inform the user of a successful upload into the internal SBB DBMS, including the number of new records loaded into each table
4.2.8 A qualified user will be able to clear all data tables. The user will be warned twice prior to executing this function.
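The upload rules in Requirements 4.2.3 through 4.2.7 might be sketched as follows. This is a minimal illustration only. The attribute name `DataSource`, its placement on the root element, the one-child-element-per-table layout, the in-memory `tables` dictionary, and the `confirm` callback are all assumptions made for the example and are not taken from the SAFTINet ETL XML schema. Schema validation is reduced here to a well-formedness check.

```python
import xml.etree.ElementTree as ET


def upload(xml_text, tables, confirm):
    """Load one input file into `tables` (dict: table name -> list of row dicts).

    Returns the number of new records loaded per table (Requirement 4.2.7).
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Requirement 4.2.4: report the nature of the non-conformance and stop.
        raise ValueError(f"XML non-conformance: {err}") from err

    # Hypothetical attribute name standing in for the "Data Source" attribute.
    source = root.get("DataSource")

    if source == "clinical":
        # Requirement 4.2.5: full data dump model; warn before clearing.
        if not confirm("All existing data tables will be cleared. Continue?"):
            return {}
        for rows in tables.values():
            rows.clear()

    loaded = {}
    for table_elem in root:          # assumed: one child element per table
        rows = tables.setdefault(table_elem.tag, [])
        for record in table_elem:    # assumed: one grandchild per record
            rows.append(dict(record.attrib))
        loaded[table_elem.tag] = loaded.get(table_elem.tag, 0) + len(table_elem)
    return loaded
```

Passing the warning as a callback keeps the user-interface decision (dialog, command-line prompt) separate from the load logic, so the same function could serve a scheduled upload later.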
4.3 Terminology Management
4.3.1 A role-based qualified user can add mappings from new site-specific source codes to existing Concept_IDs in the Source Code to Concept_ID file. New source code mappings will be marked for easy identification/querying.
4.3.2 Because Concept_IDs must be common across all members of the DRN, local users cannot enter new Concept_IDs to their local copy of the Source-To-Concept_ID mappings. Only SAFTINet central administration can add new Concept_IDs to the master list of Concept_IDs. SAFTINet central administration will work with the national OMOP terminology team to ensure new SAFTINet Concept_IDs are incorporated into the OMOP national terminology.
4.3.3 A qualified user can change an existing Source Code to Concept_ID mapping to point a local source code to a different existing Concept_ID. Requirement 4.3.2 prevents users from changing an existing source code to a locally created Concept_ID. All changes will be marked for easy identification/querying. New local mappings will be marked for easy identification/querying.
4.3.4 New site-specific source code mappings to existing Concept_IDs (Requirements 4.3.1 and 4.3.3) will be incorporated into a master Source-To-Concept_ID centralized table maintained by SAFTINet.
4.3.5 A qualified user can upload a new set of Source Code to Concept_ID mappings provided by SAFTINet. Uploading a new set of mappings overwrites all existing mappings including any local changes. Requirement 4.3.4 will incorporate local mappings into the new master file / upload.
4.3.5.1 Future functionality: A centralized web service will enable terminology additions to be managed and distributed to all participating nodes automatically.
4.3.6 Source codes that do not match any existing source code in the current Source Code to Concept_ID mappings must be assigned Concept_ID = 0 to conform to the national OMOP terminology convention.
4.3.7 The current Source Code to Concept_ID mappings and markings (Requirements 4.3.1 and 4.3.3) will be included in the Export XML and be available for querying on the grid. Local changes/additions to the current mappings will be detectable via a grid query.
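The mapping rules in this section could be sketched roughly as below. The function names, the dictionary representation of the Source Code to Concept_ID mappings, and the `local_flags` set used to mark local changes are hypothetical and chosen only for illustration. The actual mapping table lives in the SBB database.

```python
UNMAPPED_CONCEPT_ID = 0  # OMOP convention for mapping failures (Requirement 4.3.6)


def map_source_codes(source_codes, source_to_concept):
    """Map each source code to a Concept_ID, using 0 for unmapped codes.

    Returns (mapped, failures): a dict of code -> Concept_ID and the list of
    codes that had no mapping, which can feed the mapping-failure reports.
    """
    mapped = {}
    failures = []
    for code in source_codes:
        concept_id = source_to_concept.get(code, UNMAPPED_CONCEPT_ID)
        mapped[code] = concept_id
        if concept_id == UNMAPPED_CONCEPT_ID:
            failures.append(code)
    return mapped, failures


def set_local_mapping(source_to_concept, local_flags, code, concept_id,
                      known_concept_ids):
    """Add or change a local mapping (Requirements 4.3.1 and 4.3.3).

    Only Concept_IDs already on the master list are accepted (Requirement
    4.3.2). The code is recorded in `local_flags` so that local additions and
    changes can be identified and queried later.
    """
    if concept_id not in known_concept_ids:
        raise ValueError(f"Concept_ID {concept_id} is not on the master list")
    source_to_concept[code] = concept_id
    local_flags.add(code)
```

For example, a site that maps a hypothetical local code "LAB-77" to an existing Concept_ID would call `set_local_mapping` once, after which `map_source_codes` stops reporting that code as a failure.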
4.4 ETL Processing Rules
Information on specific ETL processing rules and conventions will be added to this section as they are discovered. These rules cover the internal processing of input data. The format of the input ETL XML is described in a separate ETL specifications document. Rules on the construction of output XML are discussed in Section 4.8 of this document.
4.4.1 Provider table: Provider records with a Source_Care_Site_Identifier value that does not match an existing value in the Care_Site table will be assigned a value of 0 in the Provider.Source_Care_Site_Identifier attribute.
4.4.2 Care_Site table: Care site records with a Source_Organization_Identifier that does not match an existing value in the Organization table will be assigned a value of 0 in the Care_Site.Source_Organization_Identifier attribute.
4.4.2.1 Future functionality: The Black Box will process address information in Location records to calculate geocoded longitude and latitude values and insert them into Location records.
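Rules 4.4.1 and 4.4.2 share the same shape: a reference that does not resolve in the parent table is replaced with 0. A minimal sketch follows, assuming rows are held as dictionaries and that the parent table exposes its key under a field name supplied by the caller. Both are assumptions for illustration, and in the SBB this would more likely be a database update.

```python
def zero_unmatched_references(child_rows, parent_rows, fk_field, parent_key_field):
    """Set child_rows[*][fk_field] to 0 where it matches no parent key.

    Returns the number of child rows that were changed, which could be
    surfaced in a data quality report.
    """
    known_keys = {row[parent_key_field] for row in parent_rows}
    changed = 0
    for row in child_rows:
        if row.get(fk_field) not in known_keys:
            row[fk_field] = 0
            changed += 1
    return changed
```

The same function would be called once for Provider against Care_Site and once for Care_Site against Organization.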
4.5 PPRL Encryption
4.5.1 All clear text PHI data fields will be processed by domain-specific data normalization routines. Both original and normalized PHI fields will be maintained in the SBB database.
4.5.2 Normalized first name and last name clear text data fields will be encoded using the Double Metaphone and New York State Identification and Intelligence System (NYSIIS) phonetic encoding systems, and the results will be maintained in the SBB database.
4.5.3 Clear text PHI data fields, normalized first and last names, and all phonetic encoding results will be individually encrypted using the PPRL Encryption routine developed by Vijay Thurimella based on the Bloom methodology described by EA Durham. The individual encrypted PHI data fields will be stored in the SBB database along with the associated clear text fields.
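To give a feel for field-level Bloom-style encoding, the sketch below encodes the character bigrams of a normalized field into a set of bit positions using keyed hashes, and compares two encodings with the Dice coefficient. This is a generic textbook-style illustration and is not the SAFTINet PPRL Encryption routine. The normalization rule, the filter size, the number of hash functions, and the use of HMAC-SHA256 are all assumptions, and the phonetic encoding step (4.5.2) is omitted.

```python
import hashlib
import hmac


def normalize(value):
    """Toy normalization: upper-case and keep only letters and digits."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def bigrams(text):
    """Return the set of character bigrams of `text`, padded at both ends."""
    padded = f"_{text}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, key, size=1024, num_hashes=4):
    """Encode one field as the set of bit positions set in a Bloom filter.

    `key` is a shared secret (bytes); sites must use the same key, size, and
    number of hashes for their encodings to be comparable.
    """
    bits = set()
    for gram in bigrams(normalize(value)):
        for i in range(num_hashes):
            message = f"{i}:{gram}".encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice(bits_a, bits_b):
    """Dice coefficient between two encodings (1.0 means identical)."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
```

Because similar strings share most of their bigrams, their encodings share most of their bit positions, which is what allows near-match scoring on encrypted fields without access to the clear text.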
4.6 GUIDs
4.6.1 Each patient and each visit encounter will be associated with a GUID that is guaranteed to be unique across all SAFTINet participating sites. A GUID is also guaranteed to not be reused for another patient or encounter within the current database. GUIDs may be reused across complete database reloads (Requirement 4.2.5).
4.6.1.1 Future functionality: With incremental data loads, GUIDs will remain stable across data loads
4.6.2 All database tables that contain either clear text or encrypted patient or encounter identifiers will have the associated GUIDs inserted into the SBB database
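One possible shape for GUID assignment is sketched below using random version-4 UUIDs from the Python standard library. The class name and the in-memory dictionaries standing in for the ID Mapping Table are hypothetical. Note that random UUIDs make cross-site collisions extremely unlikely rather than strictly impossible, so a hard guarantee of uniqueness across sites would need an additional mechanism not shown here.

```python
import uuid


class IdMappingTable:
    """In-memory stand-in for the ID Mapping Table (illustrative only)."""

    def __init__(self):
        self._guid_by_source = {}  # (kind, source identifier) -> GUID
        self._issued = set()       # every GUID handed out in this database

    def guid_for(self, kind, source_id):
        """Return the GUID for a patient or encounter, creating it if needed.

        `kind` distinguishes identifier spaces, e.g. "patient" or "encounter".
        """
        key = (kind, source_id)
        if key not in self._guid_by_source:
            guid = str(uuid.uuid4())
            while guid in self._issued:  # never reuse a GUID within this database
                guid = str(uuid.uuid4())
            self._issued.add(guid)
            self._guid_by_source[key] = guid
        return self._guid_by_source[key]
```

Repeated calls with the same source identifier return the same GUID within one database, while a fresh `IdMappingTable` after a complete reload would issue new ones, consistent with 4.6.1.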
4.7 Record Linkage
4.7.1 The type of record linkage algorithm to be applied will be determined by the Identifiers Format XML attribute, a required XML element in the input data file. If Identifiers Format = “cleartext” then clear text record linkage will be used. If Identifiers Format = “PPRL”, then PPRL record linkage will be used. Any other value in this attribute is an error.
4.7.2 The system will check the XML input file for a Data Source attribute, a required XML element in the input data file. If Data Source attribute = “medicaid” then the appropriate record linkage algorithms as determined in Requirement 4.7.1 will be applied to each record.
4.7.3 Blocking will be used to minimize processing time. Records will be blocked using the month and day of the birthdate provided in the Medicaid data set. Medicaid records without birthdates will be processed without blocking. (MGK: Vijay – what do you think about this specification?)
4.7.4 For each record to be linked, link scores will be calculated using all identifiers that are non-null in both records being linked. Because identifier fields may be missing, both absolute and relative link scores will be calculated. The absolute link score is the weighted sum over all available identifying fields. The maximum link score is the weighted sum over all available identifying fields assuming a perfect match for all fields. The relative link score is the absolute link score divided by the maximum link score.
4.7.5 A minimal relative link score will be part of a Black Box configuration file.
4.7.6 Record pairs whose relative link score does not exceed the minimal relative link score (Requirement 4.7.5) will not be matched.
4.7.7 All record pairs that exceed the minimal relative link score (Requirement 4.7.5) will be marked as “potential links.”
4.7.8 The single record pair with the highest relative link score that exceeds the pre-established minimal relative link score (Requirements 4.7.4 and 4.7.5) will be determined to be the link match for that record.
4.7.9 A log of all matched and unmatched Medicaid records will be created.
4.7.10 Record linkage using the Fellegi-Sunter algorithm requires pre-computation of linkage weights for each field that could be compared during linkage. Linkage weights are unique to each population and thus must be calculated using data from each participating site. A separate independent utility application will be created to calculate site-specific record weights. The requirements of this application will be described in a separate document. Linkage weights calculated by this application will be part of a Black Box configuration file.
4.7.10.1 Future functionality: Linkage weights can be calculated using the identifying data elements present in the current Black Box database.
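The blocking, scoring, and match-selection rules in this section might be sketched as follows. The sketch makes several simplifying assumptions that are not in the requirements: records are dictionaries, field agreement is an exact equality test, a disagreeing field contributes zero rather than a negative disagreement weight, and the field names and weights are invented for the example.

```python
def link_scores(rec_a, rec_b, weights):
    """Return (absolute, relative) link scores for one record pair.

    Only fields that are non-null in both records are compared. The maximum
    score is the sum of the weights of the compared fields.
    """
    absolute = 0.0
    maximum = 0.0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue
        maximum += weight
        if a == b:
            absolute += weight
    relative = absolute / maximum if maximum > 0 else 0.0
    return absolute, relative


def block_key(record):
    """Blocking key: (month, day) of the birthdate, or None if it is missing."""
    birthdate = record.get("birthdate")  # assumed to be a datetime.date
    return (birthdate.month, birthdate.day) if birthdate else None


def best_link(medicaid_rec, clinical_recs, weights, min_relative):
    """Return (best_match, potential_links) for one Medicaid record.

    Medicaid records without a birthdate are compared against every clinical
    record. `best_match` is None when no pair exceeds `min_relative`.
    """
    key = block_key(medicaid_rec)
    candidates = [c for c in clinical_recs
                  if key is None or block_key(c) == key]
    potential = []
    for candidate in candidates:
        _, relative = link_scores(medicaid_rec, candidate, weights)
        if relative > min_relative:
            potential.append((relative, candidate))
    if not potential:
        return None, []
    best = max(potential, key=lambda pair: pair[0])
    return best[1], [candidate for _, candidate in potential]
```

Dividing by the maximum score is what makes pairs with different numbers of available fields comparable: a pair that agrees on all three of its shared fields scores 1.0, the same as a pair that agrees on all five.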
4.8 Data Output
4.8.1 Data output will be XML that follows the SAFTINet output XML schema. The output XML schema will be based on the SAFTINet-extended OMOP data model.
4.8.2 Location records linked to patient records will have all PHI attribute fields set to NULL in output XML. Location records linked to objects other than patient records (e.g., providers, organizations, care-sites) will include all PHI attribute values in output XML. All location records will include Zip_Code3 values.
4.8.3 Data output will include the current Source_to_Concept_ID mappings as specified in the XML schema
4.8.4 Data output will include failed Source to Concept ID mappings as specified in the XML schema
4.8.5 Data output will include information on record linkage results as specified in the XML schema
4.8.6 Data output will include information on data quality as specified in the XML schema
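The location rule in Requirement 4.8.2 could be applied at export time roughly as shown below. The list of PHI field names is a placeholder, not the SAFTINet list, and the `linked_to` argument is a hypothetical way of telling the function what kind of object the location belongs to.

```python
# Hypothetical PHI field names; the real list comes from the SAFTINet schema.
PHI_FIELDS = ("Address_1", "Address_2", "City", "Zip")


def location_for_output(location, linked_to):
    """Return a copy of a location record prepared for output.

    PHI fields are set to None (NULL) when the location belongs to a patient.
    Zip_Code3 is never touched, so it is always included.
    """
    out = dict(location)
    if linked_to == "patient":
        for field in PHI_FIELDS:
            if field in out:
                out[field] = None
    return out
```

Returning a copy leaves the stored record unchanged, so the clear text values remain available inside the SBB database for linkage and reporting.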
4.9 Reporting
4.9.1 Role-based authorized users will be able to execute pre-defined reports.
4.9.2 Reports may be viewed on the screen, printed to a PDF file, or printed to a user-configured networked printer
4.9.3 All available reports will be visible and executable by user roles with reporting enabled.
4.9.3.1 Future functionality: Reports will be associated with roles. Only reports associated with the current user’s role will be visible and executable by the current user.
4.10 Report Creation
4.10.1 A role-based qualified user will be able to create an ad-hoc report using graphical report-writing software (such as iReport/JasperReports)
4.10.1.1 Future functionality: New reports will be assigned one or more roles that can view and execute the report
4.10.2 User-defined reports can be exported and imported to preserve report definitions across system upgrades and new virtual machines.
5 Development Considerations
Rapid applications development (RAD) methods will be used with 2-week development sprints. Initial development will focus on processing clinical data from end-to-end. Implementation of Medicaid data processing will occur after clinical data processing reaches an initial alpha state. Depending on resources, Medicaid data processing can be developed in parallel with clinical data processing. Both clinical and Medicaid data processing workflows will use a number of shared modules, including XML data import, Source_to_Concept-ID mapping, PHI encryption, and XML data export. Modules unique to clinical data processing include GUID generation. Modules unique to Medicaid data processing include clear-text record linkage, PPRL linkage, and GUID mapping. When possible, common functions should use common modules.
A proposed sequence of 2-week development sprints:
5.1 Clinical data processing
5.1.1 Import XML for single table (patient). Generate error for non-conforming XML.
5.1.2 Import XML for second table (visit). Map one data element using Source_to_Concept-ID file
5.1.3 Import XML for provider and location table. Map all data elements using Source_to_Concept-ID file
5.1.4 Implement method to add new Source_to_Concept-ID mappings. Implement method that captures unmapped Source codes. Assign Concept-ID = 0 to unmapped source codes.
5.1.5 Import XML for all remaining input tables; generate export XML from all input tables. Generate export XML for Source_to_Concept-ID mappings and unmapped source codes.