Current Loom Installation Guide

Loom Installation

Valid for: Loom 2.0.0+

Contents

Conceptual Overview of Installation

Prerequisites

First-time Installation

Download and Install Loom

Start Loom

Upgrade

Back Up Current Registry

Download and Install Loom

Stop and Start Loom

Restore Registry

Advanced Configuration

Security, User Impersonation, and Authentication

ActiveScan: Potential Sources

CSV Recognizer

Log File Recognizer

Custom Metadata Properties

Post-Installation “Smoke Test”

Troubleshooting

Steps

Revision History

Conceptual Overview of Installation

● Download the Loom distribution

● Edit configuration files

● Start the Loom server

In these instructions:

● Edit the red text before executing commands.

● Blue text highlights content of interest.

Prerequisites

Consult your system administrator as needed for the following prerequisites.

1. A Hadoop cluster running on Linux machines.

a. Loom 2.0+ has been tested on the following Hadoop distributions. Loom supports MRv2 (YARN) as well as MRv1.

Distributor / Version
Cloudera / CDH 5.1
Hortonworks / HDP 2.1
Teradata / TDH 2.1

b. Operating Systems: Linux. Loom has been run on Ubuntu, CentOS, RHEL, and SLES.

c. Browsers: Chrome and Firefox.

d. JDK: Oracle JDK or OpenJDK, version 6 or 7.

2. Choosing an installation location for Loom

a. On the cluster

i. It is recommended that you install Loom on the NameNode, for simplicity in managing permissions. However, Loom can be run on any node in the cluster.

b. Off the cluster

i. Loom can also be run outside the cluster, on a machine that can communicate with the Hadoop APIs but is not itself running any Hadoop services (commonly known as an “edge” node).

ii. It is not necessary for users on the machine to be able to access HDFS from the command line, but this machine will need to have a copy of the same Hadoop distribution files as the cluster – in particular, the libraries for Hadoop, Hive, and HCatalog.

3. Local Username/Permissions

a. On both the machine where you will be running Loom and on all nodes in the cluster, create a dedicated Linux username for Loom. The alphanumeric ID, numeric user ID (UID), and group ID (GID) for the user must be the same across machines.

i. This user will be referred to as loomuser throughout this document, but it can have any name.

ii. Depending on Loom security settings (see Advanced Configuration > Security), this will be the username interacting directly with Hadoop services.
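One way to confirm that loomuser has the same UID and GID everywhere is to collect the values from each machine and compare them. The following is a minimal sketch, assuming SSH access to each node; the helper names ids_consistent and collect_ids and the host names are hypothetical and not part of the Loom distribution.

```shell
#!/usr/bin/env bash
# ids_consistent: succeed only if every "uid:gid" argument is identical.
# Hypothetical helper for illustration; not part of the Loom distribution.
ids_consistent() {
  local first="$1" pair
  [ -n "$first" ] || return 1
  for pair in "$@"; do
    [ "$pair" = "$first" ] || return 1
  done
  return 0
}

# collect_ids HOST USER: print "uid:gid" for USER on HOST (assumes SSH access).
collect_ids() {
  local host="$1" user="$2"
  ssh "$host" "echo \$(id -u $user):\$(id -g $user)"
}

# Example usage (node1 and node2 are placeholder host names):
#   ids_consistent "$(collect_ids node1 loomuser)" "$(collect_ids node2 loomuser)" \
#     && echo "IDs match" || echo "IDs differ"
```

If the IDs differ, correct them on the affected machines before continuing with the remaining prerequisites.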

b. Grant loomuser sudo privileges.

i. This is not absolutely necessary, but if you choose not to do so, you will need access to another username with sudo privileges in order to change ownership of the directory where Loom is downloaded.

c. On the machine where Loom will be running, grant loomuser ownership of the following local directory:

file:/tmp/loomuser / The default location for local temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.local.scratchdir” property of hive-site.xml.

d. Set HIVE_HOME, HADOOP_HOME, and HCAT_HOME environment variables for loomuser. These variables should be set permanently for loomuser, or specified in loom-server.sh, but should not just be set for the current shell session.

i. These variables should be set to the directories that contain the Hive, Hadoop, and HCatalog “lib” directories, respectively, and should NOT have a trailing slash.

1. The exact values will vary depending on your Hadoop distribution. Examples are below, but you should confirm that the Hive and Hadoop “lib” files are actually located at the paths below.

a. Typical example for Hortonworks

HIVE_HOME=/usr/lib/hive
HCAT_HOME=/usr/lib/hive-hcatalog
HADOOP_HOME=/usr/lib/hadoop

b. Typical example for CDH4 as installed by Cloudera Manager

HIVE_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hive
HCAT_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hcatalog
HADOOP_HOME=/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop

c. For TDH, the following additional environment variable is needed

PATH=$PATH:/opt/teradata/jvm64/jdk7/bin
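The two rules in step 3d (each variable points at the directory that contains “lib”, and has no trailing slash) can be checked from a fresh login shell before Loom is started. This is a minimal sketch; the helper name check_home_var is hypothetical.

```shell
#!/usr/bin/env bash
# check_home_var NAME VALUE
# Hypothetical helper: verifies the rules stated for HIVE_HOME, HADOOP_HOME,
# and HCAT_HOME. Prints a message and returns non-zero on failure.
check_home_var() {
  local name="$1" value="$2"
  if [ -z "$value" ]; then
    echo "$name is not set"; return 1
  fi
  case "$value" in
    */) echo "$name must not end with a slash: $value"; return 1 ;;
  esac
  if [ ! -d "$value/lib" ]; then
    echo "$name has no lib directory: $value/lib"; return 1
  fi
  echo "$name OK"
}

# Example usage, run as loomuser in a new login shell:
#   check_home_var HIVE_HOME   "$HIVE_HOME"
#   check_home_var HADOOP_HOME "$HADOOP_HOME"
#   check_home_var HCAT_HOME   "$HCAT_HOME"
```

Running the checks in a new login shell, rather than the shell where the variables were first exported, helps confirm that the settings are permanent and not limited to the current session.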

4. Hadoop Username/Permissions

a. Grant loomuser read and write access to the following HDFS directory:

hdfs:/user/hive/warehouse / The default location of the Hive warehouse. This may be overridden in the “hive.metastore.warehouse.dir” property of hive-site.xml.

b. Create and grant loomuser ownership of the following HDFS directories:

hdfs:/tmp/hive-loomuser / The default location for temporary files created by Hive when executed by the loomuser userid. This may be overridden in the “hive.exec.scratchdir” property of hive-site.xml.
hdfs:/user/loomuser / The home directory for loomuser on HDFS.

c. Grant loomuser read and write access to any HDFS directories where the user will want to browse, query, or output new data.
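Step 4b can be sketched with the standard hadoop fs commands. The function below only prints the commands so they can be reviewed first; it assumes the default paths listed above and that the commands will be run by an account allowed to create and chown these HDFS paths (often the HDFS superuser, which is an assumption about your cluster). The helper name print_hdfs_setup is hypothetical.

```shell
#!/usr/bin/env bash
# print_hdfs_setup USER
# Prints (does not run) the commands that create the two HDFS directories
# from step 4b and hand ownership to USER.
print_hdfs_setup() {
  local user="$1" dir
  for dir in "/tmp/hive-$user" "/user/$user"; do
    echo "hadoop fs -mkdir -p $dir"
    echo "hadoop fs -chown $user $dir"
  done
}

# Show the commands for the loomuser account used in this guide.
print_hdfs_setup loomuser
```

If hive.exec.scratchdir is overridden in hive-site.xml, substitute that location for /tmp/hive-loomuser before running the printed commands.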

5. Hive

a. Install Hive with a multi-user metastore, such as MySQL or PostgreSQL.

i. If Hive was installed as a demo, it is probably using the default Apache Derby metastore, which is single-user. Your Hadoop distributor should have instructions on switching Hive to use a non-Derby metastore.

6. Networking

a. Ports: The port on which Loom will run (8080 by default, but you can specify any port at runtime) must be exposed such that intended users of Loom will be able to access that port through their web browser.
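Once Loom is running, port exposure can be tested from a user's machine without a browser. This is a minimal sketch that relies on bash's /dev/tcp redirection and the coreutils timeout command; the helper name port_open and the host name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# port_open HOST PORT
# Returns 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so it needs bash and coreutils' timeout.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example usage from a user's workstation (host name is a placeholder):
#   port_open loom-host.example.com 8080 && echo reachable || echo blocked
```

A "blocked" result while Loom is running on the server usually points to a firewall or routing rule between the user and the Loom machine rather than to Loom itself.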

7. Web Browser

a. The latest versions of Firefox and Chrome are compatible with Loom. Internet Explorer is not supported.

First-time Installation

That is, on a cluster where Loom has never been installed:

Download and Install Loom

1. Open an SSH session on the machine where you are going to install Loom.

2. Create a loom directory wherever you want Loom installed (e.g. /usr/local), transfer ownership to loomuser, and cd into it.

loomuser@node:~$ cd /usr/local
loomuser@node:/usr/local$ sudo mkdir loom
loomuser@node:/usr/local$ sudo chown -R loomuser /usr/local/loom
loomuser@node:/usr/local$ cd loom

3. Download Loom x.y.z (for example, 1.2.7) and unzip

loomuser@node:/usr/local/loom$ wget --no-check-certificate http://www.revelytix.com/transfer/loom-x.y.z-distribution.zip; unzip loom-x.y.z-distribution.zip

4. Run the bin/check-setup.sh script.

a. For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following line in bin/check-setup.sh, in order to include certain native dependencies.

loom-x.y.z-distribution/bin/check-setup.sh

# MapR requires native dependencies
JAVA_LIB_PATH="-Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"

b. You only need to run this script once: before the first time you start Loom.

loomuser@node:/usr/local/loom$ loom-x.y.z-distribution/bin/check-setup.sh
# Example output
loomuser@node:/usr/local/loom$ loom-x.y.z-distribution/bin/check-setup.sh
Checking Loom configuration
Checking default loom port ...... port '8080' on host 'localhost' ... OK.
Checking availability of datomic transactor port ...... port '4334' on host 'localhost' ... OK.
Checking default Hadoop FileSystem ...... configured to use hdfs://localhost:8020 ... OK.
Checking default Hadoop JobTracker ...... configured to use JobTracker 'localhost' port '50030' ... OK.
Loom is ready to run.

c. If “default loom port” check fails:

i. The default port for Loom Server is 8080, but Loom can easily be run on a different port. Instructions are included in the documentation below, starting with the phrase “To run this server on a different port...”

d. If “availability of datomic transactor” check fails:

i. This means another application is running on port 4334, 4335, or 4336. If you cannot remove the application, it is possible to configure Loom to start the transactor on a different set of three contiguous ports. Open loom-x.y.z-distribution/lib/datomic/transactor.properties, and set ‘port’ to the first port in the sequence you want to use:

loom-x.y.z-distribution/lib/datomic/transactor.properties

########### free mode config ###############
protocol=free
host=localhost
#free mode will use 3 ports starting with this one:
port=firstport

ii. You may also see this error if you have started Loom on this machine before; as mentioned above, you only need to run check-setup.sh before the first time you start Loom. Once you start Loom, the transactor runs as a background process on ports 4334-4336 and keeps running on these ports between restarts of the Loom server.
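Before choosing a new value for ‘port’, it can help to list the three contiguous ports that would be used and check that each is free. This is a minimal sketch; the helper names transactor_ports and port_in_use are hypothetical, and 4500 is only an example starting port.

```shell
#!/usr/bin/env bash
# transactor_ports FIRST
# Prints the three contiguous ports used when 'port' is set to FIRST.
transactor_ports() {
  local first="$1"
  echo "$first $((first + 1)) $((first + 2))"
}

# port_in_use PORT: succeeds if something on localhost accepts connections
# on PORT (uses bash's /dev/tcp and coreutils' timeout).
port_in_use() {
  timeout 2 bash -c "exec 3<>/dev/tcp/127.0.0.1/$1" 2>/dev/null
}

# Example usage: report whether a candidate range starting at 4500 is free.
#   for p in $(transactor_ports 4500); do
#     port_in_use "$p" && echo "port $p is busy" || echo "port $p is free"
#   done
```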

e. If “default Hadoop FileSystem” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Local Username/Permissions) or HDFS is not running.

f. If “default Hadoop JobTracker” check fails: either you did not set HADOOP_HOME correctly (see Prerequisites > Local Username/Permissions) or JobTracker is not running.

5. Set Loom’s DistributedCache directory

a. In loom-x.y.z-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory. If you do not change it, it defaults to hdfs:/user/${user.name}/loom-dist-cache, where ${user.name} is the name of the user who starts the Loom server.

# Sets the location in HDFS where Loom manages the distributed cache that it
# uses to configure MapReduce jobs that it submits. The Loom server process
# must have permission to write in this location.
loom.dist.cache=distributedcachepath

b. IMPORTANT: distributedcachepath must be BOTH an absolute path (not a URI) for an HDFS folder AND end with a "/". For example:

/user/loom/ ACCEPTABLE
/user/loom NOT ACCEPTABLE
loom/ NOT ACCEPTABLE
hdfs://master:9000/user/loom/ NOT ACCEPTABLE
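The path rule can be expressed as a small check to run before editing loom.properties. This is a minimal sketch; the helper name valid_dist_cache_path is hypothetical.

```shell
#!/usr/bin/env bash
# valid_dist_cache_path PATH
# Returns 0 only if PATH is absolute (starts with "/"), is not a URI
# (contains no "://"), and ends with "/".
valid_dist_cache_path() {
  case "$1" in
    *://*) return 1 ;;   # URI such as hdfs://master:9000/user/loom/
    /*/)   return 0 ;;   # absolute path with a trailing slash
    *)     return 1 ;;   # relative path, or missing trailing slash
  esac
}

# Example usage:
#   valid_dist_cache_path /user/loom/ && echo ACCEPTABLE || echo "NOT ACCEPTABLE"
```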

6. At this point, if you want to take advantage of Loom’s advanced configuration options, see the “Advanced Configuration” section and complete the relevant steps before proceeding to the next step below.

Start Loom

1. For MapR users: you will need to uncomment (i.e. delete the pound sign before) the following line in bin/loom-server.sh, in order to include certain native dependencies.

loom-x.y.z-distribution/bin/loom-server.sh

# MapR requires native dependencies
JAVA_LIB_PATH="-Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"

2. Start the Loom Server.

a. IMPORTANT: always run the loom-server.sh script from the current distribution directory, e.g. /usr/local/loom/loom-x.y.z-distribution. Loom has certain dependencies that require it to be started from the distribution directory.

b. These examples use ‘nohup’ plus ‘&’ to run Loom in the background. You can also run Loom from a ‘screen’ window, if you have the ‘screen’ package installed.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh &
[hit ENTER to regain command-line access]

c. To run this server on a different port, include the port number after loom-server.sh when you start the Loom server.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh <port#> &
# Example
loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ nohup ./bin/loom-server.sh 8081 &

d. Check the contents of nohup.out. Once Loom is running, you will see the message “Loom Server started”.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ tail -f nohup.out
-h gives a list of usages/options
Starting Database...
HADOOP_CP=<HADOOP_CP>
HIVE_CP=<HIVE_CP>
Starting Loom Server...
Starting Loom Server on port 8080
Loom Server started
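Instead of watching tail -f by hand, a script can poll nohup.out for the startup message. This is a minimal sketch; the helper name wait_for_loom and the 120-second default timeout are assumptions, and the message text is taken from the sample output above.

```shell
#!/usr/bin/env bash
# wait_for_loom LOGFILE [TIMEOUT_SECONDS]
# Polls LOGFILE once per second until it contains "Loom Server started"
# or the timeout expires. Returns 0 on success, 1 on timeout.
wait_for_loom() {
  local log="$1" timeout="${2:-120}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if grep -q "Loom Server started" "$log" 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example usage, from the distribution directory after starting the server:
#   wait_for_loom nohup.out 300 && echo "Loom is up" || echo "Timed out; check nohup.out"
```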

Congratulations! You have now installed Loom.

Upgrade

For a cluster where Loom has already been installed:

Back Up Current Registry

1. To make a copy of your existing registry, run the backup.sh script from the distribution directory. By default, <host>=localhost, <port>=8080, and <outputfile>=backup.json.

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ./bin/backup.sh -h <host> -p <port> -o backupfile

a. This will produce a backup.json file in the distribution directory

loomuser@node:/usr/local/loom/loom-x.y.z-distribution$ ls
backup.json bin config data datomic.pid docs lib license logs plugins R README.txt registry schema
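If you take backups more than once, a timestamped output file keeps earlier backups from being overwritten. This is a minimal sketch that only builds the command line from the -h, -p, and -o options shown above; the helper name backup_command and the file naming scheme are assumptions.

```shell
#!/usr/bin/env bash
# backup_command [HOST] [PORT] [STAMP]
# Prints a backup.sh invocation whose output file name carries a timestamp.
# HOST and PORT default to the documented defaults (localhost, 8080);
# STAMP defaults to the current date and time.
backup_command() {
  local host="${1:-localhost}" port="${2:-8080}"
  local stamp="${3:-$(date +%Y%m%d-%H%M%S)}"
  echo "./bin/backup.sh -h $host -p $port -o backup-$stamp.json"
}

# Example usage, from the distribution directory:
#   eval "$(backup_command localhost 8080)"
```

Remember to pass the same file name to restore.sh later in the upgrade.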

Download and Install Loom

1. Open an SSH session on the machine where you have installed Loom.

2. cd into the loom directory.

loomuser@node:~$ cd /usr/local/loom

3. Download Loom a.b.c (for example, 1.0.1) onto the node and unzip.

loomuser@node:/usr/local/loom$ wget --no-check-certificate http://www.revelytix.com/transfer/loom-a.b.c-distribution.zip; unzip loom-a.b.c-distribution.zip

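4. Set Loom’s DistributedCache directory.

a. In loom-a.b.c-distribution/config/loom.properties, set loom.dist.cache to the desired HDFS directory, as described under First-time Installation.

b. IMPORTANT: distributedcachepath must be BOTH an absolute path (not a URI) for an HDFS folder AND end with a "/". For example: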

/user/loom/ ACCEPTABLE
/user/loom NOT ACCEPTABLE
loom/ NOT ACCEPTABLE
hdfs://master:9000/user/loom/ NOT ACCEPTABLE

5. See the “Advanced Configuration” section in this document for instructions on additional configuration options.

Stop and Start Loom

1. Find the PID of the currently running Loom server.

a. If you have sudo permissions:

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ sudo netstat -tnlp | grep <port#>
# Example
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ sudo netstat -tnlp | grep 8080
[sudo] password for loomuser:
tcp6 0 0 :::8080 :::* LISTEN 18139/java

b. If you do not have sudo permissions, you can search the process list instead; the Loom server process will be the first process returned:

loomuser@node:/usr/local/loom$ ps aux | grep revelytix.servlet
# Example
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ps aux | grep revelytix.servlet
loomuser 18139 0.8 13.6 1051396 280792 ? Sl 10:00 1:20 java -XX:PermSize=128m -XX:MaxPermSize=256m -Dtransactor.props=loom-0.6.1-distribution/bin/../lib/datomic/transactor.properties -cp loom-0.6.1-distribution/bin/../config:loom-0.6.1-distribution/bin/../lib/*:loom-0.6.1-distribution/bin/../lib/ext/*:/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/etc/hadoop::/opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/hadoop/client-0.20/* revelytix.servlet
loomuser 19380 0.0 0.0 7624 932 pts/0 S+ 12:30 0:00 grep --color=auto revelytix.servlet

2. Kill the currently running Loom server.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ kill <PID>
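Steps 1b and 2 can be combined so that the PID does not have to be copied by hand. This is a minimal sketch; the helper name loom_pid is hypothetical, and it assumes the ps aux column layout shown in the example above (PID in column 2, command in column 11).

```shell
#!/usr/bin/env bash
# loom_pid: reads `ps aux` output on stdin and prints the PID of the first
# line that mentions revelytix.servlet, skipping the grep process itself.
loom_pid() {
  awk '/revelytix\.servlet/ && $11 != "grep" { print $2; exit }'
}

# Example usage:
#   pid="$(ps aux | loom_pid)"
#   [ -n "$pid" ] && kill "$pid"
```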

3. If you are upgrading Loom, you must stop the transactor processes. You can skip this step if you are simply restarting the Loom server, i.e. using the same distribution.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/stop-database.sh
Stopped Database

4. Start the new Loom server. IMPORTANT: always invoke the loom-server.sh script from the distribution directory, e.g. /usr/local/loom/loom-a.b.c-distribution. Loom has certain dependencies that require it to be started from the distribution directory.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ nohup ./bin/loom-server.sh &
[hit ENTER to regain command-line access]

a. To run this server on a different port, simply specify the port when starting the Loom server.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/loom-server.sh <port#>
# Example
loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/loom-server.sh 8081

5. Do not log into the Lab Bench or attempt to view or register data before finishing the next section.

Restore Registry

1. From the new distribution directory, restore the registry, using the backup.json file you created with the previous distribution. By default, <host>=localhost and <port>=8080.

loomuser@node:/usr/local/loom/loom-a.b.c-distribution$ ./bin/restore.sh /usr/local/loom/loom-x.y.z-distribution/backupfile -h <host> -p <port>

Congratulations! You have now updated Loom.

Advanced Configuration

Security, User Impersonation, and Authentication

1. See loom-x.y.z-distribution/docs/Loom_Security.txt for details. You will need to restart Loom after making any Loom configuration changes, and restart Hadoop services after making any Hadoop configuration changes.

ActiveScan: Potential Sources

1. One of Loom’s features is the ability to detect “Potential Sources”: Loom regularly and recursively scans a specified HDFS directory for new files and displays them on the ‘Sources’ home page of the Loom Lab Bench (browser UI), as well as in the ‘Recent Sources’ column of the ‘Loom’ home page.

2. To turn on ActiveScan: Potential Sources, edit loom-x.y.z-distribution/config/loom.properties:

loom-x.y.z-distribution/config/loom.properties

# Enable active scanning of potential datasets in HDFS.
activeScan.dataset.enabled=true
# Set the top-level directory under which to scan for potential datasets
# in HDFS. May be specified as an absolute hdfs:// URL or a relative
# path that will be resolved against the Loom working directory.
# Defaults to the Loom working directory.
activeScan.dataset.baseDir=HDFSdirectory

a. Example configurations

activeScan.dataset.baseDir=hdfs://node:8020/home/loomuser/loomInput ACCEPTABLE
activeScan.dataset.baseDir=/home/loomuser/loomInput ACCEPTABLE
activeScan.dataset.baseDir=loomInput ACCEPTABLE, if loomuser has a configured working directory
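The two ActiveScan properties can also be set from the command line rather than in an editor. This is a minimal sketch; the helper name set_prop is hypothetical, it only rewrites an uncommented "KEY=" line (or appends one if none exists), and you should back up loom.properties before using it.

```shell
#!/usr/bin/env bash
# set_prop FILE KEY VALUE
# Sets KEY=VALUE in a Java-style properties file: replaces the first
# uncommented "KEY=" line, or appends the setting if no such line exists.
set_prop() {
  local file="$1" key="$2" value="$3" tmp rc
  tmp="$(mktemp)" || return 1
  awk -v k="$key" -v v="$value" '
    index($0, k "=") == 1 && !seen { print k "=" v; seen = 1; next }
    { print }
    END { if (!seen) print k "=" v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example usage, from the distribution directory:
#   set_prop config/loom.properties activeScan.dataset.enabled true
#   set_prop config/loom.properties activeScan.dataset.baseDir /home/loomuser/loomInput
```

As with any Loom configuration change, restart the Loom server afterwards for the new values to take effect.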

3. By default, Loom is set to scan the specified directory every 60 minutes, but you can change this: