Assignment 4:Web Searching using the HITS algorithm

The purpose of this programming assignment is to implement the HITS algorithm [1] to calculate the Hub and Authority value of a set of web pages related to a certain subject. This information can be used to decide which pages in the set can be considered the greatest authorities on the subject and which pages can be considered best at linking to good pages within the given subject area.

We have scheduled four labs in a room with a limited number of computers. To avoid overcrowded labs, each group should sign up for one lab.Lab lists will be posted on the board outside 1321.

Assignment

Assume that an input file named links.txt consists of ASCII text that looks as follows:

Each row can be interpreted in the following fashion:

The first web page in each row contains links to the rest of the web pages in that row. Theinformation is not complete: some web pages that are linked to by another webpage do notappear at the beginning of a line. Some rows possibly only contain one web address (like the lastone in the above example) signifying that that web page does not link to any other web page.

Note that some links appear in different forms e.g.:

All are linking to the same page despite the fact that they all look different. Another example is:

where one of the links is preceded by 'http//' and the other is not. They are still linking to the same pagethough. Since the text file can contain these different types of links you need to preprocess the file so thatall links that point to the same page have the same name.

Your assignment is to implement the HITS algorithm in AmosII to calculate the hub- and the authority weight of each web page in the file. The output of your implementation should be the 10 most authoritative pages in the collection together with their respective authority weight as well as the 10 most “hubby” pages together with their respective hub weights. A sample output is shown in Listing 1.

Your implementation needs to first preprocess the file so as to recognize the link names which are linking tothe same page and remove all the links for which there is no information about what they are linking to.The only thing you need to check for in order to find different links pointing to the same page are trailing'/', trailing '/index.html' and preceding '

To get started with the assignment, download the skeleton file for the assignment: a4hits.osql (and a string utility file: string.lsp) . Some pre-defined functions are available in the skeleton file. Follow the instructions and complete / replace / correct the code where necessary (marked with TODO) according to the function signatures, descriptions, HINTS, and NOTES. You are encouraged to experiment with explicit or implicit options and alternative implementations.

Amos 1> HITS('links_car_manufacturer.nt');

[Expanding image to 12323413]

Image moved in REMFUNCTION0

"Hubbiness"

<0.14939,"

<0.15862,"

<0.16073,"

<0.16516,"

<0.16695,"

<0.16754,"

<0.16976,"CARROS.WS/TRUCKS.HTM">

<0.17014,"

<0.20189,"

"Authority"

<0.20363,"

<0.16938,"

<0.17169,"

<0.17735,"

<0.18156,"

<0.18426,"

<0.18534,"

<0.19024,"

<0.19225,"

<0.1971,"

<0.21531,"

3.921 s

Listing 1: Sample output of the HITS algorithm in AmosII.

Validation

At the examination, your implementations will be executed in AmosII using:

< 'a4hits.osql';

When the execution is done, we expect the following functionand the sub-functions it callsto be fully and correctly defined:

  • HITS(Charstring file)->Bag of <Number, Charstring>

References

[1] David Gibson, Jon Kleinberg, and Prabhakar Raghavan. Inferring Web Communities from Link Topology. In Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space - Structure in Hypermedia Systems, pages 225-234, June 1998.