Finding Natural Communities

Finding Natural Communities Graph G

Subgraphs G1 G2 G3 … Gn
Trees T1 T2 T3 … Tn
Trees TB = T1 T2 T3 … Tn
Cluster CB C2 C3 … Cn

If the match between CB and its best match in some tree CJ is greater than p, then these two communities are considered the same.
If there are f such “same” communities, then CB is considered natural

For the CiteSeer Data, both p and f were set to 70%.

Finding All Natural Communities
By using only one base in our filtering process, we will miss in the region of 1-f of the natural communities in the data by definition.
We must repeat the process with different base treesto arrive at the full set

Using five different bases reduces the probability of missing a natural community to an insignificant number when f = 70%: 0.35 = 0.2%.

Redundant Natural Communities
Due to the nature of the agglomerative merge, it is possible to find many natural communities that differ only by a small amount
We merge all pairs of natural communities that match by at least 70% by taking the larger of the two. This 70% threshold corresponds with our p parameter
Finding Natural Communities on the CiteSeer Data
Given the CiteSeer data, we removed papers referenced by only 1 paper, and papers that reference fewer than 5 papers.
Weset the number of trees to compare over (n) to 100. We found this to be a point of stabilization.
Performing the natural community filter twice on two separate sets of trees, we found 158 natural communities in the 1st run and 150 in the 2nd run.

Match Between Natural Community and its Representative in the Base Tree
Recall that we merge natural communities from several trees into one common base tree by finding its best match in the base tree
This best match consistently hovers around the 70% boundary at which we consider two clusters similar. The percentage average of all such matches is 68%.
If C1 and C2 are the best matches of a base community, how well do C1 and C2 match?

?? > 70% 85% of the time
When ?? < 70, the match is 66% on average

How well do communities from the two experiments match?
After finding the two sets of natural communities from our two experimental runs, we proceeded to examine their similarity.
To compare these two sets of natural communities, we used the concept of a two-way best match:
Given two clusters, C1 from the first set of natural communities and C2 from the second, if the best match of C1 is C2 and the best match of C2 is C1, then these two clusters form a two-way best match, and are considered essentially the same community.
80% of the natural communities formed a two-way match
The remaining 20%, which did not appear in both experimental runs were found to be threshold communities. Show Chart
We can conclude that the natural communities found in the two experimental runs are essentially the same, with the only discrepancies occurring with natural communities that barely made it in one experimental run and barely missed in the other. Due to the fact that we have no true bound for quantifying natural communities, such mismatches concerning natural communities that hover around the boundary f are to be expected.
(*We were also concerned with clusters that were found to be natural in one tree but did not map well into the base tree. Thus, when mapped into, this mapped version of the natural communities no longer qualifies as natural, and may display strengths below 40. However, even though these communities may not map well, in many cases, a two-way match with another set of naturals can still be found due to the distinctiveness of each natural community.*)

Variance in Strength
Table 1 gives the distribution of the natural communities of the two experiments as a function of their strength.

Strength / Number of Communities / Number of Communities
Experiment 1 / Experiment 2
70-75 / 55 / 47
76-80 / 35 / 35
81-85 / 17 / 21
86-90 / 22 / 16
91-100 / 29 / 31

(*The strength of its best match in tends to be slightly different.*)

Although the distribution remains quite constant, the strengths of corresponding natural communities were not found to match on a consistent basis.

We find that the strength of a community depends on which base tree is used.

We believe that for each natural community found there is one theoretical natural community with a core set of vertices and different versions of this natural community in each tree:
C  [C1, C2, …, Cn]
C != C1 != C2
We also expect that the strength of these different versions of natural communities will never be as high as the theoretical natural community C

When we find all natural communities when p=0.8, we find that there is almost no variance in strength.
Variance in Strength May not be a Problem
Given two clusters C1 and C2 of the same natural community C from trees T1 and T2, the best matches of these two clusters with respect to the remaining set of trees have stayed the same in all our tests.
Due to the fact that we find natural communities in 5 different base trees and combine them, the chance of missing a natural community due to its variance in strength is slim