EmBayes/R documentation

Escola Politécnica da Universidade de São Paulo

Laboratório de Tomada de Decisão

Author: André Hideaki Saheki

1. Introduction

This document explains the installation and usage of the EmBayes program inside the R environment.

EmBayes is a library developed in the Java Programming Language to create, manipulate, evaluate and learn Bayesian Networks. R is an environment used for statistical computing and graphics.

The interface between EmBayes and R allows Bayesian networks to be used inside the R environment.

2. How to install

2.1. Necessary software

To use the R/EmBayes system, the following software is required:

2.1.1. A Java 2 Virtual Machine, preferably version 1.4 or higher.

2.1.2. R base system, version 1.7 or higher.

2.1.3. The SJava package for R. This package provides the necessary interface between the Java Programming Language and R.

2.2. Installation

2.2.1. Java installation: install this as usual using the setup program or equivalent.

2.2.2. R base installation: install this as usual using the setup program or equivalent.

2.2.3. SJava package installation:

For the Windows OS, uncompress the already compiled package SJava.zip into $RHOME/library.

A compiled version (for Windows) of this package can be downloaded from

This version is compatible with R < 2.0.

For Unix and Linux systems, download the source tarball from the omegahat page and run the following command in the directory where you saved the file:

R INSTALL -c [filename]

e.g., R INSTALL -c SJava_0.67-3.tar.gz

2.3. Patch

2.3.1. SJava patch

In the most recent versions of R (1.9.x), the SJava library must be modified:

In the file $RHOME/library/SJava/R/SJava, change the following call inside the body of the function javaHandlerGenerator (line 716 in SJava 0.65):

restart(T) => try(T)

The precompiled version described above already has the necessary modifications.

3. Usage

3.1. Java environment variable

Setting the environment to use JavaBayes inside R:

3.1.1. Define the JAVA_HOME environment variable so that it points to the installation location of the Java Virtual Machine. Examples: /usr/java/j2sdk1.4, c:\j2sdk1.4.2

3.1.2. Add the Java Runtime Environment bin/client directory to the path. Example:

set PATH=%PATH%;c:\j2sdk1.4.2\jre\bin\client
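The settings of Section 3.1 can be checked from inside R before the library is loaded. The following is a minimal sketch using base R only; check_java_env is a hypothetical helper written for this illustration and is not part of EmBayes or SJava.

```r
# Hypothetical helper (not part of EmBayes or SJava): reports whether the
# environment described in Section 3.1 looks usable.
check_java_env <- function(java_home = Sys.getenv("JAVA_HOME"),
                           path = Sys.getenv("PATH")) {
  # Split PATH on the platform separator (";" on Windows, ":" on Unix)
  entries <- strsplit(path, .Platform$path.sep, fixed = TRUE)[[1]]
  list(
    java_home_set    = nzchar(java_home),
    java_home_exists = nzchar(java_home) && dir.exists(java_home),
    # TRUE if some PATH entry contains jre/bin/client (either slash style)
    client_on_path   = any(grepl("jre[/\\\\]bin[/\\\\]client", entries))
  )
}

# Inspect the current session
print(check_java_env())
```

If any of the three values is FALSE, the environment variables should be corrected before R is started.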

3.2. Loading the files for the R/EmBayes interface

The whole JavaBayes system is loaded using the following command in an R terminal:

source("sources.r")

This also loads and initializes the SJava library and the Java Virtual Machine; these steps are discussed in Sections 3.3 and 3.4, in case it becomes necessary to do them manually.

3.3. Loading the SJava library

This command loads the SJava library and makes its functions available under R.

library(SJava)

3.4. Initializing the Java Virtual Machine inside R

.JavaInit(list(classPath=c([path to JavaBayes])))

where [path to JavaBayes] is the location of the JavaBayes classes, e.g. F:/JavaBayes/bin.

Note: the classPath parameter can be omitted only if the EmBayesR classes are in the R working directory.

Alternatively, set the system environment variable CLASSPATH to the EmBayesR path.
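When the JVM is initialized manually, it can help to confirm that the class directory exists before the configuration list is handed to .JavaInit. The following is a sketch in base R; java_init_config is a hypothetical helper, and the working directory is used only as a stand-in for the real location of the classes.

```r
# Hypothetical helper (not part of SJava or EmBayes): builds the list that
# Section 3.4 passes to .JavaInit(), after checking that each directory exists.
java_init_config <- function(class_dirs) {
  missing_dirs <- class_dirs[!dir.exists(class_dirs)]
  if (length(missing_dirs) > 0) {
    stop("class path directory not found: ",
         paste(missing_dirs, collapse = ", "))
  }
  list(classPath = normalizePath(class_dirs, winslash = "/"))
}

# Stand-in location: the current working directory
config <- java_init_config(getwd())
print(config$classPath)

# With SJava loaded, the JVM would then be started with:
# .JavaInit(config)
```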

4. Functions for Bayesian networks

Methods for working with Bayesian Networks inside R:

network <- newNetwork("newName")

Creates a new network.

Parameters:

newName: String with the name of the network.

network <- loadNetwork("filename")

Loads an existing network from a file.

Parameters:

filename: String with the name of the file containing the network in XMLBIF or BIF format.

network$saveNetwork("filename", "format")

Saves a network to disk.

Parameters:

filename: String with the file name used to save the network.

format: Format used to save the network, either "xmlbif" or "bif".

network$listVars()

Returns a vector containing the variables inside a Bayesian network.

network$addVar("newVariable")

Adds a new variable to a network, with the categories "true" and "false".

Parameters:

newVariable: String with the name of the new variable.

network$addVar("newVariable", c("category1", "category2", ...))

Adds a new variable to a network, with the categories defined by category1, category2, ...

Parameters:

newVariable: String with the name of the new variable.

category1, category2, ...: Strings with the categories of the new variable.

network$addArc("parent", "child")

Adds an arc from parent to child.

Parameters:

parent: String with the name of the parent node.

child: String with the name of the child node.

network$deleteVar("node")

Deletes a node from the network.

Parameters:

node: String with the name of the node to delete.

network$deleteArc("parent", "child")

Removes an arc between parent and child.

Parameters:

parent: String with the name of the parent node.

child: String with the name of the child node.

network$listCat("node")

Returns a list containing the categories of a node.

Parameters:

node: String with the name of the node.

network$query("node")

Calculates the posterior probability of node.

Parameters:

node: String with the name of the node.

network$query(c("node1", "node2", ...))

Calculates the posterior probability of node1, node2, ...

Parameters:

node1, node2, ...: Queried nodes.

network$observeVariable("node", "category")

Inserts an observation into a network.

Parameters:

node: String with the name of the node to observe.

category: String with the observed category.

network$unobserveVariable("node")

Removes an observation from a network.

Parameters:

node: String with the name of the node to unobserve.

network$showNetwork()

Displays the network in a JavaBayes Editor.

dataLearn <- readDiscreteData("filename")

Reads a file containing discrete variables from disk. Returns the records in a data frame.

Parameters:

filename: Full or relative name of the file.

loadedData <- convertRData(dataframe)

Converts data to the format used by EmBayes/R. Returns the converted data, in the format expected by learnNetwork and testNetwork.

Parameters:

dataframe: An R data frame.

learnedNetwork = learnNetwork(algorithm = "TAN", RData, useOnlyStartingStructure = FALSE, discardUnlabeledData = TRUE, maxIterations = 1, likelihoodChange = 0.00001, removeZerosFromStartingPoint = FALSE)

Learns a Bayesian network. Only the algorithm and RData parameters are mandatory. The RData object must be in the format returned by convertRData. The algorithm can be "NB", "TAN" or "EM". For EM, the startNet parameter must also be set to a network that provides the structure; its probability values are then used as the starting point of EM.
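To give an idea of what learning an "NB" (naive Bayes) classifier involves, the following base R sketch estimates the parameters by counting and then classifies one record. It is an illustration of the general technique only, not the EmBayes implementation; the toy data frame and all variable names are invented for the example.

```r
# Toy training data (invented for illustration): a class and one attribute
train <- data.frame(class = c("yes", "yes", "no", "no", "yes"),
                    x     = c("a",   "b",   "a",  "a",  "b"))

# Maximum-likelihood estimates obtained by counting
prior  <- prop.table(table(train$class))                        # P(class)
cond_x <- prop.table(table(train$class, train$x), margin = 1)   # P(x | class)

# Posterior over the class for a record with x = "a" (Bayes' rule)
joint     <- prior * cond_x[, "a"]
posterior <- joint / sum(joint)
print(posterior)

# Predicted class: the one with the highest posterior probability
predicted <- names(posterior)[which.max(posterior)]
print(predicted)
```

With more attributes, naive Bayes multiplies one conditional term per attribute; TAN additionally allows each attribute to depend on one other attribute.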

network$testNetwork(RData)

Tests the classifier represented by a Bayesian Network. The RData parameter is mandatory, and must follow the format returned by convertRData.

5. Functions for Gaussian networks

Methods for working with Gaussian Networks inside R:

network <- newGaussNetwork("newName")

Creates a new network.

Parameters:

newName: String with the name of the network.

network <- loadGaussNetwork("filename")

Loads an existing network from a file.

Parameters:

filename: String with the name of the file containing the network in XMLBIF or BIF format.

network$saveNetwork("filename", "format")

Saves a network to disk.

Parameters:

filename: String with the file name used to save the network.

format: Format used to save the network, either "xmlbif" or "bif".

network$listVars()

Returns a vector containing the variables inside a Gaussian network.

network$addVar("newVariable")

Adds a new variable to a network.

Parameters:

newVariable: String with the name of the new variable.

network$addArc("parent", "child")

Adds an arc from parent to child.

Parameters:

parent: String with the name of the parent node.

child: String with the name of the child node.

network$deleteVar("node")

Deletes a node from the network.

Parameters:

node: String with the name of the node to delete.

network$deleteArc("parent", "child")

Removes an arc between parent and child.

Parameters:

parent: String with the name of the parent node.

child: String with the name of the child node.

network$query("node")

Calculates the posterior probability of node.

Parameters:

node: String with the name of the node.

network$observeVariable("node", value)

Inserts an observation into a network.

Parameters:

node: String with the name of the node to observe.

value: Double with the value of the evidence.
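The effect of a numeric observation can be illustrated on the smallest possible case. The sketch below assumes a linear Gaussian model with two nodes, x -> y, which is the usual meaning of a Gaussian network; the parametrization, the function posterior_x_given_y and all numbers are assumptions made for this example and are not taken from EmBayes.

```r
# Assumed model (illustration only):
#   x ~ N(mu_x, var_x),   y = a + b * x + noise,   noise ~ N(0, var_e)
# Returns the posterior mean and variance of x after observing y = y_obs.
posterior_x_given_y <- function(y_obs, mu_x, var_x, a, b, var_e) {
  mu_y   <- a + b * mu_x            # marginal mean of y
  var_y  <- b^2 * var_x + var_e     # marginal variance of y
  cov_xy <- b * var_x               # covariance between x and y
  list(mean = mu_x + cov_xy / var_y * (y_obs - mu_y),
       var  = var_x - cov_xy^2 / var_y)
}

post <- posterior_x_given_y(y_obs = 3, mu_x = 0, var_x = 1,
                            a = 0, b = 1, var_e = 1)
print(post)
```

Observing the child y moves the mean of x toward the observation and reduces its variance, which is the kind of result a query on x would be expected to reflect after an observation on y.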

network$unobserveVariable("node")

Removes an observation from a network.

Parameters:

node: String with the name of the node to unobserve.

network$showNetwork()

Displays the network in a JavaBayes Editor.

6. Examples

Here is a simple demo of the system.

First set up the environment variables appropriately:

set JAVA_HOME=C:\j2sdk1.4.2

set PATH=%PATH%;C:\Applications\rw1080\bin;C:\j2sdk1.4.2\bin;C:\j2sdk1.4.2\jre\bin\client

Note that the PATH now includes the JVM bin and jre\bin\client directories. The R bin directory is also added because, in this example, R is not installed in the standard location. These values could be set automatically by including them in the autoexec.bat file; here they are assumed to be either typed in or run from a batch file.

Now start R. Consider running the whole graphical user interface:

Rgui

Once inside R, go to the directory containing sources.r:

setwd("path/to/EmBayesR")

Now initialize everything and load the system:

source("sources.r")

Create a discrete network:

net = newNetwork("net1")

Add variables:

net$addVar("v1")

net$addVar("v2", c("c1", "c2", "c3"))

net$addVar("v3")

Show variables:

net$listVars()

Connect variables:

net$addArc("v1", "v2")

net$addArc("v1", "v3")

Insert probability values:

net$setProbabilities("v1", c(0.2, 0.8))

net$setProbabilities("v2", c(0.1, 0.4, 0.2, 0.25, 0.7, 0.35))

net$setProbabilities("v3", c(0.4, 0.3, 0.6, 0.7))
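The order of the values in these vectors is not stated here, but the numbers are consistent with one reading: the state of the parent v1 varies fastest, so that consecutive pairs of values belong to one category of the child. The base R sketch below checks this reading by arranging the values as a matrix and verifying that each conditional distribution sums to 1. The order of the categories of v1 ("true" first) is an assumption.

```r
# Values passed to setProbabilities in the demo
p_v2 <- c(0.1, 0.4, 0.2, 0.25, 0.7, 0.35)
p_v3 <- c(0.4, 0.3, 0.6, 0.7)

# Assumed layout: rows are states of the parent v1, columns are categories
# of the child (matrix() fills column by column, so the parent varies fastest)
cpt_v2 <- matrix(p_v2, nrow = 2,
                 dimnames = list(v1 = c("true", "false"),
                                 v2 = c("c1", "c2", "c3")))
cpt_v3 <- matrix(p_v3, nrow = 2,
                 dimnames = list(v1 = c("true", "false"),
                                 v3 = c("true", "false")))

# Under this layout every row is a distribution P(child | v1)
print(rowSums(cpt_v2))
print(rowSums(cpt_v3))

# The opposite reading (child category varying fastest) is not normalized
alt_v2 <- matrix(p_v2, nrow = 2, byrow = TRUE)
print(rowSums(alt_v2))
```

Under the first layout both rows sum to 1 for v2 and for v3, while the alternative layout gives rows that do not, which supports the parent-fastest reading.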

Now show the network in a JavaBayes Editor:

net$showNetwork()

Note that it is possible to edit the network in the JavaBayes Editor, and the changes will be “propagated” to R – in fact, the JavaBayes Editor and R are sharing the same data structures.

Changes can also be made directly from R:

net$deleteArc("v1", "v3")

net$deleteVar("v3")

Consider a few observations and queries:

net$query("v1")

net$query(c("v1", "v2"))

net$observeVariable("v2", "c1")

net$query("v1")
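What a query computes can be reproduced by hand for the small demo network. The base R sketch below uses the probability values entered earlier for v1 and v2 and applies Bayes' rule directly; it does not call EmBayes. It assumes that the values for v2 are ordered with the state of v1 varying fastest and that "true" is the first category of v1.

```r
# P(v1) and P(v2 | v1) from the demo (layout and category order assumed)
prior_v1 <- c(true = 0.2, false = 0.8)
cpt_v2 <- matrix(c(0.1, 0.4, 0.2, 0.25, 0.7, 0.35), nrow = 2,
                 dimnames = list(v1 = c("true", "false"),
                                 v2 = c("c1", "c2", "c3")))

# Marginal of v2 with no evidence: sum over v1 of P(v1) * P(v2 | v1)
marginal_v2 <- colSums(prior_v1 * cpt_v2)
print(marginal_v2)

# Posterior of v1 after observing v2 = "c1"
joint        <- prior_v1 * cpt_v2[, "c1"]
posterior_v1 <- joint / sum(joint)
print(posterior_v1)
```

Comparing these numbers with the output of the corresponding queries is a quick way to confirm that the probability values were entered in the intended order.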

Open and convert a data file:

dataR1 = readDiscreteData("adult_learn.in")

dataJ1 = convertRData(dataR1)

Learn TAN classifier from data:

netTAN = learnNetwork(learnData = dataJ1, algorithm="TAN", discardUnlabeledData=FALSE, maxIterations=10)

Open and convert a data file for test:

dataR2 = readDiscreteData("adult_test.in")

dataJ2 = convertRData(dataR2)

Test the TAN classifier:

netTAN$testNetwork(dataJ2)

Now look at the network in a JavaBayes Editor:

netTAN$showNetwork()

Another classifier based on Bayesian networks is the Naïve Bayes classifier; note that the data must be reopened before it can be used again:

dataJ1$reopen()

dataJ2$reopen()

netNB = learnNetwork(algorithm = "NB", learnData = dataJ1, maxIterations = 10, likelihoodChange=0.01)

netNB$testNetwork(dataJ2)

netNB$listVars()

netNB$observeVariable("X12", [category])

netNB$query("c")

Open and convert a data file to be used with the EM algorithm; note that this file contains missing data:

dataR3 = readDiscreteData("em_learn.in")

dataJ3 = convertRData(dataR3)

It is possible to learn the parameters of a given network using EM:

netEMTAN = learnNetwork(learnData = dataJ3, algorithm="EM", startNet=netTAN, maxIterations=10)

Here is EM applied to a network read from a file; note that the data must be reopened before it can be used again:

netST = loadNetwork("em_adult.xml")

dataJ3$reopen()

netEMST = learnNetwork(learnData = dataJ3, algorithm="EM", startNet=netST, maxIterations=10)

Now show the network:

netEMST$showNetwork()