DSP Final Reading Report

歐永俊, b92b02053 (senior, Department of Computer Science and Information Engineering)

Subject: Probabilistic Latent Semantic Indexing

Author: Thomas Hofmann

ACM Special Interest Group on Information Retrieval (ACM SIGIR), 1999

1) Introduction:

First, I want to briefly introduce LSA, a method that maps documents into a low-dimensional "concept" space. A document or query can be represented by a vector in which each component corresponds to a word: the component is 0 when the word does not appear in the document, and otherwise it is the normalized word frequency. From these vectors we construct a term-by-document matrix, in which each document is a high-dimensional column vector. Using SVD, we can reduce the high-dimensional vector space to a low-dimensional latent semantic space; the basis changes from "terms" to "concepts". We claim that documents are relevant to each other when their projections onto the latent semantic space are close (the distance between two vectors is defined by the cosine matching function). Intuitively, similarities between documents, or between documents and queries, are measured better in "concept" space, because documents that share frequently co-occurring terms will have similar representations in the latent space even if they have no terms in common. In this way, LSA performs a kind of noise reduction.
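To make the LSA part concrete, here is a minimal Python sketch (the toy corpus, K = 2, and names such as doc_latent are illustrative assumptions, not from the paper): it builds a small term-by-document count matrix, truncates the SVD, and compares documents with the cosine matching function in the latent space.

import numpy as np

# Toy corpus; each document becomes a column of the term-by-document matrix.
docs = ["car engine repair", "engine oil car", "movie star hollywood", "hollywood movie actor"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

# Truncated SVD keeps the K largest singular values ("concepts").
K = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_latent = (np.diag(s[:K]) @ Vt[:K, :]).T   # each row: one document in the K-dim latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Documents mapped near each other in the latent space are treated as relevant to each other.
print(cosine(doc_latent[0], doc_latent[1]))   # cars/engines vs cars/engines: high
print(cosine(doc_latent[0], doc_latent[2]))   # cars/engines vs movies: low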

On the other hand, the author points out some weaknesses of LSA (mainly its unsatisfactory statistical foundation) and proposes a new method, PLSA (Probabilistic LSA). PLSA has a solid statistical foundation because it is based on the likelihood principle and defines a generative model of the data.

2) Model: Generative model & ML principle:

(Notations and derivations differ slightly from the paper, but the idea is the same.)

Assume K unobserved topics (i.e., K latent classes). The probability of a term t in a document d is a mixture over the latent classes:

P(t | d) = Σ_k P(t | k) P(k | d)

Substitute it into P(d, t) = P(d) P(t | d):

P(d, t) = P(d) Σ_k P(t | k) P(k | d)

The log-likelihood of this generative model is the log probability of the entire collection (N is the number of documents in the collection):

L = Σ_{d=1..N} Σ_t n(d, t) log P(d, t)

Define n(d, t) as the number of times term t occurs in document d (the term frequency).

The parameters of this model are P(t | k) and P(k | d); we can derive the equations for computing these parameters by the Maximum Likelihood principle.

Maximize the log-likelihood function with respect to P(t | k) and P(k | d) (i.e., take the derivative with respect to one parameter at a time and set it to zero), subject to the constraints Σ_t P(t | k) = 1 and Σ_k P(k | d) = 1 (i.e., Lagrangian terms are added to the objective).

So, with multipliers τ_k and ρ_d, the function to maximize becomes

H = L + Σ_k τ_k (1 - Σ_t P(t | k)) + Σ_d ρ_d (1 - Σ_k P(k | d))

After solving the above equations, we get fixed-point equations that can be solved iteratively (i.e., the EM algorithm):

Repeat {

    E-step (for d = 1 to N; for t = 1 to T; for k = 1 to K):
        P(k | d, t) = P(t | k) P(k | d) / Σ_{k'} P(t | k') P(k' | d)

    M-step:
        P(t | k) = Σ_d n(d, t) P(k | d, t) / Σ_d Σ_{t'} n(d, t') P(k | d, t')
        P(k | d) = Σ_t n(d, t) P(k | d, t) / n(d)

} until the convergence criterion is met (e.g., the change in the log-likelihood falls below a threshold).
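The fixed-point equations can be turned into a short program. Below is a minimal Python sketch, assuming a toy random count matrix n(d, t), K = 3 topics, and a fixed number of iterations; it illustrates the EM updates and is not the author's implementation.

import numpy as np

rng = np.random.default_rng(0)
n_dt = rng.integers(0, 5, size=(8, 20)).astype(float)   # toy counts: N documents x T terms
N, T = n_dt.shape
K = 3                                                    # number of latent topics

# Random initialization of P(t|k) and P(k|d), each normalized to sum to 1.
p_t_k = rng.random((K, T)); p_t_k /= p_t_k.sum(axis=1, keepdims=True)
p_k_d = rng.random((N, K)); p_k_d /= p_k_d.sum(axis=1, keepdims=True)

for _ in range(50):                                      # fixed number of EM iterations
    # E-step: P(k|d,t) ∝ P(t|k) P(k|d)
    post = p_k_d[:, :, None] * p_t_k[None, :, :]         # shape (N, K, T)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate P(t|k) and P(k|d) from the expected counts n(d,t) P(k|d,t)
    exp_counts = n_dt[:, None, :] * post                 # shape (N, K, T)
    p_t_k = exp_counts.sum(axis=0)
    p_t_k /= p_t_k.sum(axis=1, keepdims=True) + 1e-12
    p_k_d = exp_counts.sum(axis=2)
    p_k_d /= p_k_d.sum(axis=1, keepdims=True) + 1e-12

# Log-likelihood of the collection under the fitted model (up to the P(d) term).
loglik = (n_dt * np.log(p_k_d @ p_t_k + 1e-12)).sum()
print(loglik)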

Here we see a fundamental difference between PLSA and LSA: the objective function used to determine the optimal approximation. In LSA this is the L2 (Frobenius) norm, which corresponds to an implicit additive Gaussian noise assumption on the counts. In contrast, PLSA relies on the likelihood function and aims at an explicit maximization of the predictive power of the model.

3) Geometry:

From

P(t | d) = Σ_k P(t | k) P(k | d),    k = 1, ..., K latent classes,

we can view the K distributions P(t | k) as spanning a K-dimensional sub-simplex of the word simplex. The model distribution P(t | d) is the projection of the document onto that subspace, with coefficients/weights P(k | d).

Note that the projection is a KL-divergence projection, for the following reason.

Because P(d, t) = P(d) P(t | d), only the P(t | d) part of the log-likelihood depends on the parameters P(t | k) and P(k | d).

Define the empirical word distribution of document d:

P̂(t | d) = n(d, t) / n(d),    where n(d) = Σ_t n(d, t)

From the log-likelihood function, we then get

L = Σ_d n(d) [ Σ_t P̂(t | d) log Σ_k P(t | k) P(k | d) ] + (terms that do not depend on the parameters)

The term in brackets corresponds to the negative cross-entropy between the empirical distribution P̂(t | d) of words in a document and the model distribution Σ_k P(t | k) P(k | d); up to a constant (the entropy of P̂(t | d)), this equals the negative KL divergence between the two. For fixed P(t | k), maximizing the log-likelihood with respect to P(k | d) thus amounts to projecting P̂(t | d) onto the sub-simplex spanned by the P(t | k), where the projection minimizes the KL divergence.
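To make the last step explicit, the bracketed term can be rewritten using a standard identity (this derivation is a reconstruction; the report only states the conclusion):

\[
\sum_t \hat{P}(t \mid d)\,\log\!\Big(\sum_k P(t \mid k)\,P(k \mid d)\Big)
= -\,H\big(\hat{P}(\cdot \mid d)\big)
  - \mathrm{KL}\Big(\hat{P}(\cdot \mid d)\,\Big\|\,\sum_k P(\cdot \mid k)\,P(k \mid d)\Big)
\]

Since the entropy H(P̂(· | d)) does not depend on the parameters, maximizing the bracketed term over P(k | d) is equivalent to minimizing the KL divergence, i.e., performing a KL-divergence projection onto the sub-simplex spanned by the P(t | k).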

4) Experiments

Exp1:

The author uses the TDT-1 collection (15,862 documents of broadcast news stories). Stop words are removed and no stemming is performed. He trains a model with K = 128 latent topics. Below we discuss 4 of the resulting topics.

Each topic is represented by its 10 most probable words (i.e., the words are ordered according to P(t | k)). We can see that the first topic is about "plane", the second about "space shuttle", the third about "family", and the fourth about "Hollywood".
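As a small illustration of how such a list is read off a fitted model (the vocabulary and probabilities below are invented toy numbers, not the paper's TDT-1 results):

import numpy as np

# Toy fitted topic-word matrix P(t|k): one row per topic, one column per vocabulary term.
vocab = ["plane", "airport", "crash", "flight", "shuttle", "space", "nasa", "launch"]
p_t_k = np.array([[0.30, 0.25, 0.20, 0.15, 0.04, 0.03, 0.02, 0.01],
                  [0.02, 0.03, 0.01, 0.04, 0.30, 0.28, 0.20, 0.12]])

top_n = 3   # the paper lists the 10 most probable words; this toy vocabulary is smaller
for k, row in enumerate(p_t_k):
    top_words = [vocab[i] for i in np.argsort(row)[::-1][:top_n]]
    print(f"topic {k}: {top_words}")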

Exp2:

The author proposes to use PLSA for indexing (PLSI). We first compute the latent semantic space representation of the documents, i.e., P(t | k) and P(k | d). Queries (or documents) that were not part of the original collection can then be folded in by a simple calculation: compute P(k | query) by EM iterations in which P(t | k) is held fixed, i.e., we only adapt P(k | query). After that, we calculate similarities/scores between documents and queries using the cosine matching function in the latent semantic space.
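A sketch of this fold-in step in Python (the function name fold_in, the toy counts, and the iteration count are my assumptions; the paper only describes the idea of adapting P(k | query) with P(t | k) held fixed):

import numpy as np

def fold_in(q_counts, p_t_k, n_iter=30):
    """Estimate P(k|query) by EM while keeping the trained P(t|k) fixed."""
    K = p_t_k.shape[0]
    p_k_q = np.full(K, 1.0 / K)                      # uniform initialization
    for _ in range(n_iter):
        # E-step: P(k|q,t) ∝ P(t|k) P(k|q)
        post = p_k_q[:, None] * p_t_k                # shape (K, T)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        # M-step: only the query weights are re-estimated
        p_k_q = (post * q_counts[None, :]).sum(axis=1)
        p_k_q /= p_k_q.sum() + 1e-12
    return p_k_q

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Toy example: 3 topics over a vocabulary of 6 terms, 5 indexed documents.
rng = np.random.default_rng(1)
p_t_k = rng.random((3, 6)); p_t_k /= p_t_k.sum(axis=1, keepdims=True)
p_k_d = rng.random((5, 3)); p_k_d /= p_k_d.sum(axis=1, keepdims=True)

query_counts = np.array([2.0, 0.0, 1.0, 0.0, 0.0, 1.0])                 # term counts of the query
p_k_q = fold_in(query_counts, p_t_k)
scores = [cosine(p_k_q, p_k_d[d]) for d in range(p_k_d.shape[0])]       # matching in latent space
print(scores)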

In the following experiment, the author actually uses a score that is a linear combination of the original vector-space-model score (weighted by λ) and the latent-semantic-space score (weighted by 1 - λ).
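For concreteness, a minimal sketch of this combined score (the function and argument names are mine; λ is a free parameter that would be tuned empirically):

def combined_score(vsm_score, plsa_score, lam=0.5):
    # Linear interpolation: lam weights the original vector-space score,
    # (1 - lam) weights the latent-semantic-space score.
    return lam * vsm_score + (1.0 - lam) * plsa_score

print(combined_score(0.42, 0.73, lam=0.3))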

From the precision-recall curves, we can see that PLSI performs best on the four different collections (compared with the tf baseline and LSI/SVD).

When the models also take idf weights into account, PLSI still performs best.