Final project: Bae & Kim

Nov. 27, 2006


Clustering appearance and shape by Jigsaw, and comparing it with Epitome.

Soo Hyun Bae, Woo Young Kim

Georgia Institute of Technology

CS 7495 Fall 2006

Due: November 28, 2006

  1. Overview

This project involves four papers: two are the direct sources of our implementation, and the other two provide the corresponding background methods. Our original intention was to implement the advanced techniques presented in the paper "Epitomic analysis of appearance and shape" by N. Jojic, B. J. Frey and A. Kannan, although an implementation of the epitome learning portion had already been publicly released. While implementing applications such as image segmentation and image denoising with the epitomic model, we found that two recent papers have been proposed to improve the original epitomic analysis: (1) "Video epitomes," by V. Cheung, B. J. Frey and N. Jojic, and (2) "Clustering appearance and shape by learning jigsaws," by A. Kannan, J. Winn and C. Rother.

Paper (1) extends the original work to the spatio-temporal domain while retaining the generative model of the source and its training methods; it also provides more detail on how to handle higher-dimensional sources (e.g., videos). Paper (2) suggests a flexible technique for generating a texture map, called a jigsaw in that paper, and provides more convincing results for images.

Accordingly, we decided to implement the jigsaw algorithm of paper (2) and compare its results with epitome learning and reconstruction. To reach this goal we also had to implement the graph-cut algorithm introduced in "Fast approximate energy minimization via graph cuts" by Y. Boykov, O. Veksler, and R. Zabih, since the jigsaw is learned through iterative EM steps: the offset map, which bridges a given image to the jigsaw, is obtained by an optimization procedure based on graph cuts.

  2. Papers

[1] A. Kannan, J. Winn and C. Rother. Clustering appearance and shape by learning jigsaws. To appear in Advances in Neural Information Processing Systems, Volume 19, 2006.

[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI, 23(11), 2001.

[3] V. Cheung, B. J. Frey, and N. Jojic. Video epitomes. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.

[4] N. Jojic, B. J. Frey, and A. Kannan. Epitomic analysis of appearance and shape. In Proc. IEEE Conf. Computer Vision (ICCV), 2003.

  3. Algorithms

The epitome and jigsaw models are both patch-based probabilistic generative models. Such representations of a given image I are much more compact than the source itself, yet retain most of the constitutive elements needed to reconstruct it. The main difference between the two is that while epitomic analysis and other patch-based models must specify the set of patch dimensions and shapes before learning, the jigsaw model learns the patch sizes and shapes during the learning process itself.

3.1. Epitome

We consider the epitome as a set of Gaussian distributions; that is, each entry has a mean and a variance as its parameters. To learn the parameters of the epitome, we train on a set of patches drawn from a set of images I, and update the parameters by averaging patch values under the appropriate mappings T. The training patches z_k are systematically selected, for example as all possible or partially overlapping patches in I. Figure 1 shows the graphical model of the epitome. According to it, each pixel x in the generated image I is produced by averaging over the patches z_k that share x. Then,
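In symbols (our own reconstruction of the averaging step, since the original equation did not survive; the noise term and its variance symbol are our notation):

```latex
I(x) \;=\; \frac{1}{N} \sum_{k:\, x \in z_k} z_{k,x} \;+\; \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \psi^2)
```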

where N is the number of patches that share x. Each patch is in turn generated from the epitome e using its corresponding mapping T.

Then the entire model has the following joint distribution.
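A sketch of that joint distribution in our notation (a reconstruction in the spirit of [4]; the paper's exact factorization may differ):

```latex
p\big(I, \{z_k\}, \{T_k\}, e\big)
\;=\; p(e)\, \prod_k p(T_k)\, p\big(z_k \mid e, T_k\big)\;
\prod_x p\big(I(x) \mid \{z_k\}\big)
```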

Maximizing the joint distribution leads to the following update rules to be iterated.
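As a sketch of those update rules (our own reconstruction, in the spirit of [4]; the paper's exact forms may differ), with mu and phi the epitome mean and variance:

```latex
\hat{z}_{k,j} \;=\; \mathbb{E}\!\left[ z_{k,j} \mid I, e, T_k \right]
\quad (1)
\qquad
q(T_k) \;\propto\; p(T_k) \prod_{j} \mathcal{N}\!\left( \hat{z}_{k,j};\, \mu(T_k(j)),\, \phi(T_k(j)) \right)
\quad (2)
\mu(i) \;=\; \frac{\sum_k \sum_T q(T_k{=}T) \sum_{j:\,T(j)=i} \hat{z}_{k,j}}
                  {\sum_k \sum_T q(T_k{=}T)\, \big|\{ j : T(j)=i \}\big|}
\quad (3)
\qquad
\phi(i) \;=\; \frac{\sum_k \sum_T q(T_k{=}T) \sum_{j:\,T(j)=i} \big( \hat{z}_{k,j} - \mu(i) \big)^2}
                   {\sum_k \sum_T q(T_k{=}T)\, \big|\{ j : T(j)=i \}\big|}
\quad (4)
```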

The first and second values are computed in the E-step, and the third and last are updated in the M-step.

The value in (3) determines the epitomic mean and (4) the epitomic variance. Note that (1) is an estimated value, computed assuming that the epitome and the mappings are known.

We must, however, decide the patch sizes and shapes before learning, and in practice the patch shape is chosen to be rectangular.

3.2. Jigsaw

The jigsaw is also a patch-based probabilistic generative model, and each entry is again a Gaussian distribution. Although the jigsaw, like the epitome, is learned from training patches of the input images, the patches are not selected manually: the model automatically learns appropriate patch sizes and shapes. In other words, the main difference of the jigsaw from the epitome is its associated offset map L, of the same size as the input image I. Each pixel z in a jigsaw J carries an intensity mean μ(z) and a variance 1/λ(z) (λ is called the precision), analogous to the epitome mean and variance. The generative process joins together pieces of the jigsaw using the hidden offset map, of the same size as the input, and then adds Gaussian noise with variance given by the jigsaw. Figure 2 shows the graphical model of the jigsaw generative process.

Note that the offset map L defines a position in the jigsaw for each pixel in the input.

Going into the details of learning the jigsaw, the model makes the following assumptions.

First, each pixel of the input image I is independent conditioned on the jigsaw J and the offset map L.
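In symbols (our reconstruction; l_i denotes the offset at pixel i, and the jigsaw is indexed modulo its size):

```latex
P(I \mid J, L) \;=\; \prod_{i} \mathcal{N}\!\big( I(i);\; \mu(i - l_i),\; \lambda(i - l_i)^{-1} \big)
```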

To preserve local consistency in the input, so that neighboring pixels tend to have the same offsets, it defines a Markov random field over L,
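of roughly the following form (our reconstruction; γ is a positive consistency weight in our notation):

```latex
P(L) \;=\; \frac{1}{Z}\, \exp\!\Big( -\gamma \sum_{(i,j)\in E} \mathbf{1}\big[\, l_i \neq l_j \,\big] \Big)
```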

where E is the set of edges in the 4-connected grid.

To allow unused pixels in the jigsaw during learning, the model places the following Normal-Gamma prior on the jigsaw parameters.
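In our notation (a reconstruction; μ0, β, a, b are the Normal-Gamma hyperparameters):

```latex
P(J) \;=\; \prod_{z \in J} \mathcal{N}\!\big( \mu(z);\; \mu_0,\; (\beta\lambda(z))^{-1} \big)\;
\mathrm{Gamma}\big( \lambda(z);\; a,\; b \big)
```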

Now, the jigsaw is learned by iteratively maximizing the following joint probability.
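Given the model structure above, the joint factorizes as:

```latex
P(I, J, L) \;=\; P(I \mid J, L)\; P(L)\; P(J)
```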

Just as the epitome alternates between the hidden mappings T and the parameters (the mean and variance of each Gaussian), the jigsaw alternates between the hidden offset map L and its parameters, the mean and precision.

The offset map L is updated by applying the alpha-expansion graph-cut algorithm of [2]. Given L, the means and precisions are updated as follows,
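A sketch of these updates (our reconstruction from the standard Normal-Gamma conjugate updates; the exact constants in [1] may differ):

```latex
\mu(z) \;=\; \frac{\beta\mu_0 + \sum_{i \in X(z)} I(i)}{\beta + |X(z)|}
\qquad
\lambda(z) \;=\; \frac{a + \tfrac{1}{2}|X(z)|}
{\, b + \tfrac{1}{2}\Big[ \beta\big(\mu(z)-\mu_0\big)^2 + \sum_{i \in X(z)} \big( I(i) - \mu(z) \big)^2 \Big] }
```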

where X(z) is the set of image pixels corresponding to pixel z in the jigsaw.
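As a concrete illustration, here is a Python sketch of these updates for a single jigsaw pixel (hand-derived from the standard Normal-Gamma conjugate updates, not our actual MATLAB implementation; the constants may differ slightly from [1], and the hyperparameter defaults are placeholders):

```python
import numpy as np

def update_cell(X, mu0=0.5, beta=1.0, a=2.0, b=0.5):
    """Mean/precision update for one jigsaw pixel z.

    X is the array of image intensities currently mapped to z (i.e. I over
    X(z)); mu0, beta, a, b are the Normal-Gamma hyperparameters.
    """
    n = X.size
    mu = (beta * mu0 + X.sum()) / (beta + n)              # posterior mean
    ssd = ((X - mu) ** 2).sum() + beta * (mu - mu0) ** 2  # scatter term
    lam = (a + 0.5 * n) / (b + 0.5 * ssd)                 # precision estimate
    return mu, lam
```

Note that for an unused cell (X empty) this falls back to the prior: the mean is μ0 and the precision is the prior ratio a/b, which is what allows unused jigsaw pixels during learning.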

  4. Implementation

We implemented the learning process of the jigsaw model and reconstruction with it. As part of learning the jigsaw, we implemented the graph-cut algorithm of paper [2] in order to obtain the offset map.

4.1. Jigsaw

There are two main parts. First, the jigsaw is initialized by setting the precisions to their expected value under the prior, b/a, where b is the shape and a the scale parameter of the Gamma prior of the jigsaw; the means are initialized from the Gaussian fit to the input intensity distribution. This helps us avoid falling into a local maximum far from the globally optimal solution. Beginning with these initial jigsaw parameters, we obtain the offset map L with the graph-cut algorithm. This process is detailed in the next subsection.

After the offset map L is obtained for the input data, the means and precisions are updated using it; the details are given in Section 3.2. We repeat this process until it converges.

4.2. Graph-cut

The focus of paper [2] is the minimization of energy in early vision problems. The goal is to find a labeling f that assigns each pixel p a label f_p, where f is both piecewise smooth and consistent with the observed data. Many vision problems can be naturally formulated as the minimization of two energy terms,

E(f) = E_smooth(f) + E_data(f).

Here, E_smooth(f) measures the extent to which f is not piecewise smooth, while E_data(f) measures the disagreement between f and the observed data. In generating the optimized offset map for a given jigsaw, the label set corresponds to the set of jigsaw offsets, so each label f_p is a jigsaw offset. The data energy sums a per-pixel cost, E_data(f) = sum_p D_p(f_p).

At each pixel location, we calculate the data cost for a given offset. We take the distortion function to be the second norm: the data cost is the squared distance between the pixel value and the corresponding jigsaw value mapped to by the offset,
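that is, in our notation (with μ the jigsaw mean, K the jigsaw side length, and the offset applied modulo the jigsaw size):

```latex
D_p(f_p) \;=\; \big\| I(p) - \mu\big( (p - f_p) \bmod K \big) \big\|_2^{\,2}
```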

The smoothing term considers the distance, in jigsaw coordinates, between the positions that neighboring pixels map to,

We tested several distortion functions and found that a weighted first-order norm works well, as follows:
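One plausible reading of this term, in our notation (w is the weight mentioned in the following sentence):

```latex
V_{p,q}(f_p, f_q) \;=\; w\, \big\| f_p - f_q \big\|_1
```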

here, we set the weight to 0.2 for all neighboring pairs.
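As an illustration, a Python sketch of both cost terms in our own notation (not our actual MATLAB/MEX implementation): the data cost for one offset label over the whole image, and the smoothness cost between the labels of two neighboring pixels.

```python
import numpy as np

def data_cost(I, mu, offset):
    """Squared distortion between each image pixel and the jigsaw mean it
    maps to under a single offset label (dy, dx)."""
    K = mu.shape[0]                                 # jigsaw is K x K
    H, W = I.shape
    ys = (np.arange(H)[:, None] - offset[0]) % K    # jigsaw row per image row
    xs = (np.arange(W)[None, :] - offset[1]) % K    # jigsaw col per image col
    return (I - mu[ys, xs]) ** 2

def smooth_cost(l_p, l_q, w=0.2):
    """Weighted first-order norm between the offset labels of two neighbors;
    zero when the offsets agree."""
    return w * (abs(l_p[0] - l_q[0]) + abs(l_p[1] - l_q[1]))
```

Stacking `data_cost` over all K*K offset labels gives the full per-pixel cost table used by the expansion-move graph cut; that table is exactly the object whose size causes the memory problem discussed in Section 5.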

In paper [1], the authors describe using the alpha-expansion graph-cut algorithm to obtain an optimized offset map for a given jigsaw map. However, from a number of experiments, we conjecture that the alpha-expansion algorithm alone does not work correctly here; in our experience only the general expansion-move graph-cut algorithm works for this offset-map optimization.

4.3. Epitome

For comparison purposes, we implemented reconstruction of the original image with the help of the open-source code for training epitomes. We used the training implementation from "Video epitomes" instead of the original image-epitome code, believing the former performs better.

  5. Results

We focused on comparing the jigsaw results with the epitome, since the jigsaw appears superior to the epitome in that it can automatically learn appropriate patch sizes and shapes. For the purpose of a clear comparison, even though there are applications other than reconstruction, we applied the jigsaw model to reconstruction only, and then compared the results with the epitome.

(Results figures: for each of three test images we show Input / Jigsaw / Reconstruction / Offset map, followed by Epitome / Reconstruction (average) / Reconstruction (no avg.).)

We demonstrate the comparison on three images. We present the jigsaw mean and the image reconstructed with the inferred offset map; the epitome mean and the corresponding reconstructed images are also shown. For the epitome we show two reconstructions, one using only one patch per pixel, the other averaging a number of patches per pixel. The former is the most analogous to jigsaw reconstruction, since the jigsaw algorithm needs only one offset per pixel. As we can see, even when averaging patches with the epitome, the jigsaw does a much better job of reconstruction: the one-patch-per-pixel image is very blocky and the averaged one is blurry, while with the jigsaw we obtain an almost perfect reconstruction.

Note, however, that the jigsaw-mean image still requires an additional clustering step to obtain the jigsaw visualizations shown in the paper. Because the paper's explanation of this step is ambiguous, we have not implemented the clustering part yet.

  6. Problems

There were, however, several problems that we encountered while implementing the papers.

6.1. Out of memory

When implementing the graph-cut algorithm to obtain the appropriate offset map during jigsaw learning, we faced a critical problem of insufficient memory. Recall that we want to minimize the data cost: for every pixel, a cost must be stored for every possible jigsaw offset. The "bird" image is 268 x 179 and the jigsaw size was set to 30 x 30, so the data cost map requires about 345 MB in 'double' precision (268 x 179 pixels x 900 offsets x 8 bytes), which is beyond the practical limit of 32-bit MATLAB. Even with the costs truncated to 32-bit integers, it still requires around 170 MB, which is still not affordable. We therefore implemented the graph-cut algorithm as a MEX function, a C implementation callable from MATLAB. The current implementation only works on a 64-bit Linux system with at least 8 GB of system memory.
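The arithmetic can be checked directly; a Python sketch of our own back-of-the-envelope calculation (decimal megabytes):

```python
# Data-cost storage for the "bird" image: one cost per pixel per offset label.
H, W = 179, 268                         # image size
labels = 30 * 30                        # 30 x 30 jigsaw -> 900 offset labels
mb_double = H * W * labels * 8 / 1e6    # 64-bit doubles
mb_int32 = H * W * labels * 4 / 1e6     # truncated to 32-bit integers
print(round(mb_double, 1), round(mb_int32, 1))   # prints 345.4 172.7
```

The 32-bit-integer figure matches the roughly 170 MB quoted above.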

6.2. Clustering the output jigsaw image

The third image shown in the table above is the same as the image shown in paper [1], but our jigsaw does not match the corresponding jigsaw figure there. In paper [1], the authors mention that they applied a clustering step to determine the jigsaw pieces, but give no further details. We did not implement the clustering algorithm this time around, but we note that such clustering would help present the jigsaw pieces in a more comprehensible representation.