Loom User Guide
Michael Lang
September 2014
Contents
Overview 1
Features and Capabilities 2
Basic Use Cases 4
Source Cataloging and Profiling with Activescan 4
Data Preparation with Weaver and Hive 5
Batch Data Processing of Log Data with Activescan 7
Advanced Use Cases 7
Data Governance 7
Data Lineage 8
ETL Management 8
Appendix A: Loom Concepts and Model 8
Sources 9
Datasets 10
Transforms and Jobs 10
Activescan 11
Overview
Loom provides core capabilities needed for enterprises to successfully deploy a Hadoop data lake. Loom makes the first phase of the analytic workflow more efficient, enabling analysts to quickly find, understand, and prepare data in a Hadoop data lake, which allows them to spend more time developing analytics. Ultimately, this means more business insights are developed faster, which is the ultimate driver for ROI of the data lake.
The primary purpose of Loom is to serve as a “workbench” for an analyst working in the Hadoop data lake, helping them to:
· Find Data – Search/browse for data in the data lake through Loom’s Workbench
· Explore Data – View data previews and navigate between related datasets
· Understand Data – In addition to data previews, Loom provides the user with valuable metadata that gives extra context to the data and helps a user understand its true meaning, including statistics, business metadata, and lineage
· Prepare Data – Execute transformations to convert data from its original form into the form required for analysis, including cleaning tables and merging tables
Loom includes an automation framework, called Activescan, to assist with many of the underlying data management tasks necessary to support the capabilities above, including cataloging and profiling data.
Features and Capabilities
Loom’s features are categorized into three high-level buckets:
· Metadata management – Enables users to create and manage information about the data in a Hadoop cluster. Enables users to view samples of data.
· Automation (Activescan) – Automates the creation and management of metadata as much as possible. Automates some aspects of data processing.
· Data Preparation (Weaver, Hive) – Enables users to execute transformations in Hadoop needed to prepare data for analysis. Weaver provides a simple, interactive interface for defining transformations.
Metadata is a key underpinning of much of the value delivered by Loom. It is important in helping users find and understand available data. Aspects of metadata that Loom supports include:
· Technical metadata – information about the physical characteristics of some data, including the location, structure, format, and schema. Technical metadata also includes statistics about the data (e.g. row count, # of nulls) and lineage for the data.
· Business metadata – user-generated metadata that describes non-technical aspects of the data, including names, descriptions, tags, and custom properties. Custom properties can be configured by users. Business glossaries allow business terminology to be formally managed independent of any particular set of data. Business glossary terms can be mapped to technical metadata (e.g. table names).
The Loom Workbench provides a simple interface to browse metadata, explore data, and navigate relationships between entities (e.g. lineage relationships). The Workbench also has a search interface that supports keyword search over most metadata fields.
Loom’s Activescan framework automates many common data and metadata management tasks, including:
· Source Cataloging – Loom scans HDFS and Hive and catalogs all new data in supported formats
· Source Profiling – Loom automatically detects characteristics of a source, including format, structure, and schema
· Lineage Detection – Loom detects the lineage of sources that are (1) imported/exported with Sqoop or TDCH, (2) transformed with Hive, or (3) transformed with Loom
· Data Profiling – Loom generates descriptive statistics about tables and columns
· Data Processing – Loom can automatically apply structure to new files and transform them through Weaver. Alternately, Loom can “call out” to external code to process new files.
Finally, Loom enables users to execute the transformations required to prepare their data for analysis using:
· Weaver – Loom’s interface for “data wrangling”, Weaver enables users to modify and combine columns to produce new columns (or overwrite existing ones). Users can combine these operations with filters to transform only subsets of the rows in a table (or delete rows altogether). Users can also make schema level modifications (add/delete/rename columns, convert datatypes).
· HiveQL – users can also execute Hive queries from Loom, providing all the transformation capabilities of Hive (e.g. joins, unions, aggregations, UDFs)
Basic Use Cases
This section summarizes some basic use cases that can be simply developed and demonstrated with Loom:
· Source Cataloging and Profiling with Activescan
· Data Preparation with Hive and Weaver
· Batch Data Processing with Activescan
Each of these use cases can be viewed as stand-alone, but they also build on each other. Activescan catalogs data in the cluster. Users can execute transformations against data that has been cataloged, in order to prepare it for analysis. Users can automate the execution of their transformations for data that will be updated on a regular basis.
In addition to these three high-level use cases, there is a section below which discusses some advanced use cases:
· Data Governance
· Data Lineage
· ETL Management
These use cases require more advanced configuration both of Loom and Hadoop.
Source Cataloging and Profiling with Activescan
This use case provides a demonstration of how Activescan can catalog and profile data in HDFS and Hive. Multiple files are loaded into HDFS. Activescan scans HDFS, discovers the files, profiles the files (for structural metadata), and registers them in Loom. Users can then use data previews and generated metadata to explore the available data. Users can also “activate” sources to augment the metadata generated by Activescan with their own descriptive metadata. The next use case shows how
Scenario:
Many files are being loaded into a Hadoop cluster, which is serving as a “data lake” – a large scale platform for data storage, refinement, and analysis. The specific content of the files or the source of the files is not important for this use case (see ‘Advanced Use Cases’ below for more discussion about data lineage). Cluster administrators want to provide a simplified way for users to explore data in the cluster, giving them “self-service” access to the data lake. Activescan is used to automatically catalog and profile the data. Users can then browse and explore data through the Loom Workbench.
Loom Flow:
0. Set up data - Load files into HDFS of the following formats – log, CSV, Avro, Parquet. Create a Hive database with at least one Table.
1. Configure Activescan – turn source scanner on, configure target directory, start server (restart, if it is already running)
2. Browse Sources – log into the Loom Workbench and browse the available sources
3. OPTIONAL: Activate Sources and Add Metadata – users can activate sources catalogs by Activescan and add their own metadata, including basic descriptive information, custom properties (configurable by an administrator), and business glossaries.
Extensions to this use case:
· See the “data preparation” and “batch data processing” use cases below as examples
Limitations to this use case:
· Activescan will only discover files in supported formats - log, CSV, Avro, Parquet
· Activescan default configurations may need to be changed for files to be cataloged and profiled correctly. See the ‘docs’ folder in the Loom distribution for more details on how to configure Activescan.
Data Preparation with Weaver and Hive
This use case provides a demonstration focused on the data preparation features of Loom. Once data is registered with Loom, either by Activescan, through the API, or through the Workbench (manually), a user can apply structure to the data, and then execute transformations against the data to prepare it for analysis. This use case builds upon the capabilities described above. After a user can successfully explore and describe data in the cluster, the next natural step is to execute their own transformations to clean, combine, aggregate, and otherwise prepare their data for their own particular purpose. Loom provides two tools for this purpose: Weaver and Hive. Weaver lets a user quickly and iteratively “wrangle” an individual table. Hive lets a user execute SQL-style transforms – joins, unions, aggregations, UDFs, etc.
Scenario:
A pair of files have been loaded into HDFS. For the purposes of this use case, it does not matter how the files came to be in HDFS (discussed further below). One file contains general data about books, such as ISBN, title, author, and publisher (we refer to this file as the ‘books’ data). The second file contains user-submitted rating for the same books (we refer to this file as the ‘ratings’ data). An analyst wants to determine whether the words used in the title of a book have predictive power for its rating. That is, given the title of a book, can you successfully predict what rating it will get? Obviously, as a publisher or author, this would be valuable information, as you would seek to optimize your title to generate higher ratings. In order to do this analysis, the data must be prepared. The books file must be cleaned, and then merged with the ratings data. The data can then be analyzed from a variety of tools, including, but not limited to: R, Python, Hive, Pig, MapReduce, Tableau, and Excel.
Loom flow:
- Each file is registered as a Source – the data can be previewed and basic technical metadata can be seen in Loom. Users can augment this with their own descriptive metadata, if desired.
- A Dataset is created for each Source – this puts a schema onto the data and makes it available for transformation and profiling
- The books data is cleaned with Weaver – the title column is the main focus. We need to remove any special characters, normalize possessive nouns (i.e. remove all “ ‘s “), convert everything to lowercase, remove anything in between a pair of parentheses (this often references the publisher or series name), and any extra whitespace. We also need to delete any non-English books, as these would complicate our analysis, without adding value.
- The books data is further prepared with HiveQL and merged with the ratings data – Using Hive, we aggregate all the ratings into a single average rating for each book, join the books and ratings tables, extract the individual words from each title (using a Hive built-in function), split the joined table into one table of ‘good’ books and one table of ‘bad’ books (based on the average rating), and finally produce a count for ‘good’ books and a count for ‘bad’ books of how many times each individual word appears in book titles.
- OPTIONAL: Export the final tables from Hive to a relational database using Sqoop or TDCH. This demonstrates that Loom can automatically track the lineage of this operation, for the case where the analyst plans to use a relational platform for the final analysis.
- OPTIONAL: Import the data from Hive into R, for the case where the analyst plans to use R for the final analysis.
Extensions to this use case:
· Business metadata – For each entity involved in the workflow outlined above, add descriptions, tags, folders, and custom properties (either using default custom properties or by configuring new custom properties)
o Business Glossary – Create a business glossary which defines terminology relevant to the books domain and create mappings into some or all of the tables involved in the workflow
· Statistics – For each table in a Dataset, generate statistics (either manually trigger the statistics scan through the Workbench or configure Activescan to do it automatically)
· Lineage – This is generated automatically along the way for all transforms executed through Loom or for any Sqoop/TDCH job run. For other lineage (e.g. the original source of the books data outside of Hadoop), the API can be used to create the necessary entities. See ‘Advanced Use Case’ below for a discussion of data lineage.
Limitations within this use case:
· Loom can only read data from HDFS in the following formats – CSV, log (regex), Avro, Parquet
· Loom cannot execute transformations other than those described above (HiveQL, Weaver)
· Loom cannot automatically track lineage for transformations executed outside of Loom, except for Sqoop jobs, TDCH jobs, or Hive queries (new in Loom 2.3).
· A “sandbox” cluster (i.e. single-node virtual machine) may struggle executing multiple jobs at the same time (transformations or statistics). Queues can be used to address this.
Batch Data Processing of Log Data with Activescan
This use case provides a demonstration of how Loom can be configured to automatically move new files to Hive and apply a Weaver script. This is a combination of Loom’s data preparation features (namely, Weaver) and Loom’s Activescan framework. This is a natural extension of the ‘data preparation’ use case described above. Once a Weaver transform is defined to clean a specific table, a user will want to apply the same transform to new files that are loaded into the cluster.
Scenario:
A new directory is created in HDFS and a single file loaded, with the intention of loading new files on a scheduled basis (e.g. once a day). Each file represents a set of records in the same logical table. A user has defined a Weaver script for cleaning the data, using the original file. Activescan is configured to automatically execute this Weaver script for every new file that is loaded.
Loom flow:
- HDFS directory and original file are discovered by Activescan and registered as a Source and Table, respectively
- A user creates a Dataset from the Source
- A user defines a Weaver script to clean the Table
- A user defines a ‘directives’ file for the Dataset which references the Weaver script
- New files are loaded into the directory – Activescan automatically detects each file, registers it as a new Table in the existing Source, then creates a corresponding Table in the existing Dataset, and then executes the Weaver script. Lineage is tracked for all operations.
Extensions to this use case:
· Everything described above in the ‘data preparation’ use case