CS B553 Homework 7: Parameter and Structure Learning in Chow-Liu Tree Networks

Due date: 4/10/2012

In this assignment you will be looking at a very simple type of Bayesian network model called a Chow-Liu tree. Chow-Liu trees are useful because they provide a class network structures that are easy to learn. (In fact, they predate Bayesian networks.) In this model, the joint distribution over X1,…,Xn is decomposed into first-order conditionals as follows:

where R is the subset of root nodes, and for non-root nodes pai gives the index of i’s only parent. So, we can encode the graph structure through the encoding G=(pa1,…,pan) with pai{0,1,…,n}, where we take the convention that any node with pai=0 is taken to be a root. (Note that we’re allowing the encoding to produce forests rather than constraining the network to be a tree)

We’ll assume all variables are binary. You may wish to use the notation xi0 to indicate the event that Xi=0 and xi1 to indicate the event that Xi=1.

  1. List all of the parameters in the CPTs of a Chow-Liu tree network with structure G=(pa1,…,pan), with node Xr being the root. Gather these into a single parameter vector .
  2. Given a dataset D=( x[1],…,x[M]) of fully observed samples, write down the expression l(D; ,G) giving the log-likelihood of D given  under the structure model G. Use the textbook’s notation M[‘event’] to indicate the number of times ‘event’ occurs in the dataset D.
  3. Give a formula for the maximum-likelihood estimate * given the data D. What is the log-likelihood l(D; *, G)?
  4. Consider the model G0=(0,…,0), i.e., the entirely disconnected network. Now consider adding a new edge j->i by setting pai = j to derive the model G =(0,…,0,j,0,…,0) (j is in the i’th index). What is the change in log-likelihood of the maximum-likelihood parameter estimates when changing from G0 to G? That is, compute l(D; , G) – l(D; 0, G0) where  is the MLE for model G and 0 is the MLE for model G0.
  5. Now consider an arbitrary graph G0 in which Xi is a root node, and suppose that you could add the edge j->i to obtain a graph G that is also a forest. What expression do you get for l(D; , G) – l(D; 0, G0), with  the MLE for G and 0 the MLE for G0?
  6. Consider an algorithm that starts with a completely disconnected network and greedily starts to add edges j->i, ensuring that each new edge does not induce a cycle in the graph. Edges are added in order of largest to smallest improvement in likelihood. Demonstrate that it produces a tree rather than a forest. Give its running time in big-O notation. Furthermore, prove or argue that it produces the Chow-Liu tree with maximum likelihood over all trees (hint: consider how this relates to Prim’s algorithm for minimum spanning trees).