Lab 4

(75 pts) Due Sun, Mar 25

One topic of research today is accurately determining the most relevant topic(s) of a document. While this is useful for many reasons (e.g., search engines, finding relevant court or medical documents, and detecting redundancy among documents when, for instance, deciding whether a document contains fake news), determining the topic of a document is not always straightforward.

In this lab you’ll be doing a first step in determining a document’s topic(s). You’ll be reading in a web page with html tags. Below is an example of a very simple web page (note that this is a shortened version of the actual web page I used for the output):

<!DOCTYPE html>

<html>

<head>

<meta charset="utf-8">

<title> Dogs and Puppies</title>

</head>

<body>

<h1> Dog and Puppy Care </h1>

<h2>Links Section </h2>

<p> SPCA, ESPN </p>

<p>Check regularly for more relevant links </p>

<h3> Dogs</h3>

<p>A dog can be a wonderful addition to any home. It's important

to keep your canine companion's health and happiness a top priority.

Below are some useful tips for all dog parents.

</p>

<h3>Feeding</h3>

<p>Premium-quality dry food provides a well-balanced diet for adult

dogs and may be mixed with water, broth or canned food. Your dog may

enjoy cottage cheese, cooked egg or fruits and vegetables, but these

additions should not total more than ten percent of his daily food intake.

</p>

<h3>Puppy Care</h3>

<p>Puppies are adorable, and quite tempting when one is looking to adopt.

However, keep in mind that for the first year the amount of care a

puppy requires is quite high, almost to the level of a toddler. If you

think you will be unable to provide that level of care, please consider

adopting an older dog.

</p>

</body>

</html>

In a web page, the tags themselves indicate to some degree the most relevant information on the page. Text surrounded by the <title> … </title> tags should strongly indicate a web page’s topics, as should text surrounded by <h1> … </h1> tags and, to a slightly lesser extent, text surrounded by <h2> … </h2> tags. Tags <h3>, <h4>, <h5>, and <h6> all indicate that the text is a subheading and, while relevant to the topic, is slightly less likely to be the dominant topic of the document. Text surrounded by <p> and other tags is even less likely to be the dominant topic. And, most likely, the tags themselves have no relevance to the topic of the document and can be stripped out.
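As a rough illustration of this weighting, the sketch below maps the text of a tag to one of three priority tiers. The function name tagPriority is hypothetical (it is not part of the lab files), and the tiers assume the scheme used later in this lab: title, h1, and h2 map to 1, h3 through h6 map to 2, and everything else maps to 3.

```cpp
#include <string>

// Hypothetical helper, not part of the lab files.
// 'tag' is the text of a tag with the leading '<' already removed,
// for example "title", "h2", "p", or "/h1".
int tagPriority(const std::string &tag) {
    // compare(pos, len, str) returns 0 when the first len characters
    // of tag (starting at pos) are equal to str.
    if (tag.compare(0, 5, "title") == 0 ||
        tag.compare(0, 2, "h1") == 0 ||
        tag.compare(0, 2, "h2") == 0) {
        return 1;  // strongest indication of the page's topic
    }
    if (tag.compare(0, 2, "h3") == 0 ||
        tag.compare(0, 2, "h4") == 0 ||
        tag.compare(0, 2, "h5") == 0 ||
        tag.compare(0, 2, "h6") == 0) {
        return 2;  // subheadings
    }
    return 3;      // paragraphs, closing tags, and everything else
}
```

For example, tagPriority("h4") yields 2, while a closing tag such as "/h1" falls through to 3, so text that follows a closing tag is treated as ordinary text until the next opening tag is read.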

Clearly there is much, much more that should be done to accurately determine the topic of a document, but we’ll start by prioritizing the text within the document so that the most relevant text is listed first, followed by slightly less relevant text, and finally by the least relevant text.

For this lab you’ll be writing: the methods for the SNode class, whose nodes consist of a word and the word’s priority; the methods for the SLL class, which create, insert, and print nodes in a linked list that is ordered by each word’s priority; and some of the methods for the WebTopic class. WebTopic reads in a web page, strips out the html tags, determines a priority for each word on the page that isn’t in a tag, and inserts that word and its priority into the web page’s linked list. It also prints out the linked list, broken down into categories.

Once you are done inserting the words into the linked list based on their priority, you will then remove all the words that have little to no real meaning in terms of identifying the topic of the document. For instance, the words “the”, “a”, “and”, etc. are just filler words (we refer to them as stop words) and don’t tell us anything about the topic of the document. I’ve included a list of stop words in the WebTopic class header file. You will remove all of those stop words from the linked list.
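To make the removal step concrete, here is a minimal sketch of deleting every occurrence of one word from a singly linked list. It is an illustration under simplifying assumptions, not the lab's SLL: the Node struct and the pushFront and removeAll free functions are hypothetical, and the sketch does not maintain the last or p2 pointers that the lab's list must keep up to date.

```cpp
#include <string>

// Hypothetical minimal node; the lab's SNode also stores a priority.
struct Node {
    std::string word;
    Node *next;
};

// Adds a new node holding 'word' to the front and returns the new head.
Node *pushFront(Node *head, const std::string &word) {
    return new Node{word, head};
}

// Removes every node whose word equals w (case-sensitive comparison)
// and returns how many nodes were removed. 'head' is taken by
// reference so that the first node can be removed as well.
int removeAll(Node *&head, const std::string &w) {
    int removed = 0;
    Node **link = &head;           // the pointer that leads to the current node
    while (*link != nullptr) {
        if ((*link)->word == w) {
            Node *doomed = *link;
            *link = doomed->next;  // bypass the node, then free it
            delete doomed;
            ++removed;
        } else {
            link = &(*link)->next; // advance only when nothing was removed
        }
    }
    return removed;
}
```

Calling removeAll once per entry in the stop-word array would clear all of them. Because the comparison is case-sensitive, a capitalized "Your" would not match the stop word "your" in this sketch.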

You will be given the following files and must write the accompanying code:

SNode.hpp is defined as follows:

#ifndef SNODE_HPP_
#define SNODE_HPP_

#include <stdlib.h>
#include <iostream>
using namespace std;

class SNode {
    friend class SLL;
    friend class WebTopic;
    string word;   // instead of int data, now the data is a string
    int priority;  // the priority of a node (1, 2, or 3)
    SNode *next;
public:
    SNode(string w, int p);
    ~SNode();
    void printNode();
};

#endif /* SNODE_HPP_ */

(5 pts) Write the accompanying SNode.cpp file

------

SLL.hpp is defined as follows:

#ifndef SLL_HPP_
#define SLL_HPP_

#include "SNode.hpp"
#include <stdlib.h>
#include <iostream>
using namespace std;

class SLL {
    friend class WebTopic;
    SNode *first;
    SNode *last;
    SNode *p2;
    // points to the last node in the list with a priority of 2. If you add
    // another node with a priority of 2 to the list, it will be added after
    // this node.
    // Note that you still only have one list
    int size;
public:
    SLL();
    ~SLL();

    void printSLL();
    // (4 pts) used for testing purposes - not actually needed for this
    // lab, but useful

    void priorityInsert(string s, int p);
    // (8 pts)
    // This method creates a new node with s as the word and p as the priority.
    // If the priority is 1, it adds the new node to the beginning of the list;
    // if it is 3, it adds the node to the end of the list; and if it's 2, it
    // inserts it into the list right after pointer p2 (which is the last node
    // with a priority of 2). In essence, all the nodes with a priority of 1
    // are at the beginning of the list, all the nodes with a priority of 2 are
    // in the middle of the list, and all the nodes with a priority of 3 are at
    // the end of the list.

    void push(string s, int p);
    // (4 pts)
    // pushes a new node (with priority p and word s) onto the end of the list
    // (remember to reset the last pointer) - I called this from the
    // priorityInsert method.

    void addAtFront(string s, int p);
    // (5 pts)
    // adds a new node (made from word s and priority p) to the beginning of
    // the list (remember to reset the first pointer) - I called this from
    // priorityInsert

    void addFirst(string s, int p);
    // (4 pts)
    // adds the very first node (made from word s and priority p) to an empty
    // list - I called this from priorityInsert

    void addAtP2(string s, int p);
    // (6 pts)
    // inserts a new node into the middle of the list right after the
    // priority-2 pointer p2 - I called this from priorityInsert

    int removeAll(string w);
    // (10 pts)
    // removes all occurrences of word w from the linked list.
    // This is used to remove every word in the array of stop words from the
    // linked list - I returned the number of times the word w was removed
    // from the list.

    string pop();
    // (4 pts)
    // removes the last node from the linked list, and returns the word from
    // the node removed - I called this from removeAll
};

#endif /* SLL_HPP_ */

(45 pts, as defined above) Write the accompanying SLL.cpp file with the methods as defined above
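The ordering that priorityInsert is meant to produce can be sketched with the standard library. This is only an illustration, not the required implementation: the lab asks for your own nodes with a p2 pointer, whereas this hypothetical TieredList class sits on top of std::list and instead remembers where the first priority-3 entry is, which gives the same final order. It assumes priorities are 1, 2, or 3.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for the lab's SLL, built on std::list.
class TieredList {
    typedef std::list<std::pair<std::string, int> > Items;
    Items items;
    // Position of the first priority-3 entry, or items.end() if there is none.
    // std::list iterators (including end()) stay valid across insertions.
    Items::iterator firstP3;
public:
    TieredList() : firstP3(items.end()) {}
    // Copying would leave firstP3 pointing into the other list, so forbid it.
    TieredList(const TieredList &) = delete;
    TieredList &operator=(const TieredList &) = delete;

    void priorityInsert(const std::string &word, int priority) {
        if (priority == 1) {
            items.emplace_front(word, priority);     // front of the list
        } else if (priority == 2) {
            items.emplace(firstP3, word, priority);  // just before the 3s
        } else {
            items.emplace_back(word, priority);      // end of the list
            if (firstP3 == items.end()) {
                firstP3 = std::prev(items.end());    // first 3 ever added
            }
        }
    }

    // Renders the list as "word:priority, word:priority, ...".
    std::string toString() const {
        std::string out;
        for (Items::const_iterator it = items.begin(); it != items.end(); ++it) {
            out += it->first + ":" + std::to_string(it->second) + ", ";
        }
        return out;
    }
};
```

Inserting x (3), a (1), c (2), b (1), d (2), y (3) in that order leaves the list as b, a, c, d, x, y: priority-1 words appear in reverse insertion order because each goes to the front, while priority-2 and priority-3 words keep their insertion order.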

------

And finally this is the WebTopic.hpp file (for reading in a web page, stripping out the html tags while determining the priority of the words between the tags, then creating a list of those words ordered by their priority):

#ifndef WEBTOPIC_HPP_
#define WEBTOPIC_HPP_

#include "SNode.hpp"
#include "SLL.hpp"
#include <stdlib.h>
using namespace std;

class WebTopic {
    int currpriority;
    // This priority changes as the web page is traversed, based on the last
    // tag that has been read in
    SLL *wordlist;
    // the list of words ordered by priority
    string file;
    // The name of the web page file being read in

    // (below) the ARRAY of stopwords, to be removed from the linked list
    string stopwords[50] = {
        "a", "along", "although", "am", "among", "and", "are", "as", "at", "be",
        "because", "between", "can", "do", "dont", "either", "for", "got", "has", "have",
        "havent", "he", "i", "in", "is", "isnt", "it", "more", "much", "neither",
        "no", "none", "nor", "not", "of", "one", "or", "that", "the", "they",
        "this", "though", "was", "when", "while", "why", "with", "without", "you", "your"
    };
    int swlen = 50; // the length of the ARRAY stopwords, above

public:
    WebTopic(string filename);
    // constructor

    void ReadFile();
    // reads in the web page, character by character, into a line, setting the
    // current priority based on the latest tag read in. This method calls
    // getPriority when a tag has been read in, and it calls parseString when
    // the text between tags has been read in.

    void parseString(string line);
    // takes a line of characters and breaks the line up into words by creating
    // a new string of only alphabetical characters. Note: I did this by first
    // breaking the string into individual words separated by spaces, and then
    // used the function stripSpace to remove anything that wasn't an
    // alphabetic character, using the built-in isalpha function.

    string stripSpace(string s);
    // Strips out any character that isn't alphabetic and returns the stripped
    // string

    void getPriority(string line);
    // The line is the tag, without the first <. This method uses the line to
    // determine the current priority as follows:
    // If the first 5 characters in the line match the word "title" (I used
    // line.compare for this), or the first 2 characters match either h1 or h2,
    // then the current priority is set to 1. If the first 2 characters are
    // anything between h3 and h6, then the priority is set to 2, and if it's
    // anything else, it's set to 3.

    void printPage();
    // (6 pts) - YOU WRITE
    // prints out the list of words on the web page, ordered by their priority
    // (and listing each word's priority)

    void removeStopWords();
    // (4 pts) - YOU WRITE
    // after the linked list has been created, this method removes all the
    // stopwords in the array of stopwords from the linked list (using the
    // removeAll method)
};

#endif /* WEBTOPIC_HPP_ */

You must fill in the missing code in the WebTopic.cpp source file below:

#include "SLL.hpp"
#include "SNode.hpp"
#include "WebTopic.hpp"
#include <iostream>
#include <stdlib.h>
#include <cstring>
#include <stdio.h>
#include <fstream>
#include <cctype>
using namespace std;

WebTopic::WebTopic(string filename) {
    file = filename;
    wordlist = new SLL();
    currpriority = 3;
}

void WebTopic::getPriority(string line) {
    //cout << endl;
    //cout << line << endl;
    if (line.compare(0,5,"title") == 0 || line.compare(0,2,"h1") == 0 || line.compare(0,2,"h2") == 0) {
        currpriority = 1;
    }
    else if (line.compare(0,2,"h3") == 0 || line.compare(0,2,"h4") == 0 || line.compare(0,2,"h5") == 0 || line.compare(0,2,"h6") == 0) {
        currpriority = 2;
    }
    else {
        currpriority = 3;
    }
    //cout << "Curr Priority: " << currpriority << endl;
}

void WebTopic::printPage() {
    // YOU WRITE
}

// Named ReadFile to match the declaration in WebTopic.hpp and the call in main
void WebTopic::ReadFile() {
    ifstream infile(file.c_str(), ios::in); // open file
    string line = "";
    char c;
    while (infile.get(c)) {
        if (c == '<') {
            // start of a tag: whatever has been collected is text between tags
            if (!line.empty()) {
                //cout << "Line outside of tag: " << endl;
                //cout << line << endl << endl;
                parseString(line);
                line = "";
            }
        }
        else if (c == '>') {
            // end of a tag: whatever has been collected is the tag itself
            //cout << "Line inside of tag: " << endl;
            //cout << line << endl << endl;
            getPriority(line);
            line = "";
        }
        else {
            line += c;
        }
    }
    cout << "*****************************************" << endl;
    cout << "BEFORE REMOVING STOP WORDS" << endl;
    printPage();
    cout << endl;
    removeStopWords();
    cout << "*****************************************" << endl;
    cout << "AFTER REMOVING STOP WORDS" << endl;
    printPage();
    cout << endl;
    infile.close();
}

string WebTopic::stripSpace(string s) {
    unsigned int i = 0;
    while (i < s.length()) {
        if (!isalpha(s[i])) {
            // erase with a single argument removes everything from index i
            // to the end, so the word is cut at its first non-letter
            s.erase(i);
        }
        i++;
    }
    return s;
}

void WebTopic::parseString(string line) {
    char *l = const_cast<char *>(line.c_str());
    char *token = strtok(l, " ");
    while (token != NULL) {
        string s = stripSpace(token);
        if (s != "") {
            wordlist->priorityInsert(s, currpriority);
        }
        token = strtok(NULL, " ");
    }
}

void WebTopic::removeStopWords() {
    // YOU MUST WRITE
}

(10 pts, as defined above) Fill in the two missing methods (printPage and removeStopWords) in the WebTopic.cpp file
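One detail of the provided stripSpace is worth seeing in isolation: calling erase with a single index removes everything from that index onward, so a word is cut at its first non-letter. The sketch below contrasts that behavior with dropping only the non-letters. Both function names are hypothetical and neither is part of the lab files; switching to the second approach would change the sample output (for instance, "companion's" would become "companions" rather than "companion").

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns true for any character that is not a letter.
// Taking unsigned char keeps std::isalpha well-defined for all byte values.
static bool notLetter(unsigned char c) {
    return !std::isalpha(c);
}

// Hypothetical: cuts the word at its first non-letter, which is the
// effect of the single-argument erase used in the provided stripSpace.
std::string truncateAtNonLetter(std::string s) {
    s.erase(std::find_if(s.begin(), s.end(), notLetter), s.end());
    return s;
}

// Hypothetical alternative: removes every non-letter but keeps the
// letters on both sides of it (the erase-remove idiom).
std::string keepLetters(std::string s) {
    s.erase(std::remove_if(s.begin(), s.end(), notLetter), s.end());
    return s;
}
```

With these definitions, truncateAtNonLetter("Premium-quality") gives "Premium", while keepLetters("Premium-quality") gives "Premiumquality".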

------

And my main is:

#include "SLL.hpp"
#include "SNode.hpp"
#include "WebTopic.hpp"
#include <iostream>
#include <stdlib.h>
#include <cstring>
using namespace std;

int main() {
    // or another web page - I haven't tested it extensively, but it should
    // work for any basic html page
    WebTopic *x = new WebTopic("webpage.html");
    x->ReadFile();
    x->printPage();
    return 0;
}

//15 pts for getting everything to work together

Altogether, you should:

  • Write the SNode.cpp file (5 pts)
  • Write the SLL.cpp file methods (45 pts)
  • Write the 2 methods in the WebTopic.cpp file (10 pts)
  • Get everything to work together (15 pts)

------

To turn in (zipped):

  • SNode.hpp //no changes from original
  • SNode.cpp
  • SLL.hpp //no changes from original
  • SLL.cpp
  • WebTopic.hpp
  • WebTopic.cpp
  • WebPageMain.cpp //no changes from original
  • webpage.html //no changes from original

------

And my output for webpage.html:

*****************************************

BEFORE REMOVING STOP WORDS

Priority 1:

Puppies:1, Section:1, Donation:1, Welfare:1, Animal:1, Section:1, Links:1, Organizations:1, Welfare:1, Animal:1, Care:1, Puppy:1

and:1, Dog:1, Puppies:1, and:1, Dogs:1,

Priority 2:

Dogs:2, Feeding:2, your:2, Dog:2, Exercise:2, Puppy:2, Care:2,

Priority 3:

SPCA:3, ESPN:3, Check:3, regularly:3, for:3, more:3, relevant:3, links:3, Make:3, sure:3, you:3, donate:3

to:3, your:3, local:3, animal:3, shelter:3, Animal:3, Welfare:3, Institute:3, American:3, Humane:3, Association:3, Humane:3

Farming:3, Association:3, dog:3, can:3, be:3, a:3, wonderful:3, addition:3, to:3, any:3, home:3, but:3

whether:3, you:3, experienced:3, pet:3, parent:3, or:3, a:3, first:3, adopter:3, it:3, important:3, keep:3

your:3, canine:3, companion:3, health:3, and:3, happiness:3, a:3, top:3, priority:3, are:3, some:3, useful:3

tips:3, for:3, all:3, dog:3, parents:3, Premium:3, dry:3, food:3, provides:3, a:3, well:3, diet:3

for:3, adult:3, and:3, may:3, be:3, mixed:3, with:3, water:3, broth:3, or:3, canned:3, food:3

Your:3, dog:3, may:3, cottage:3, cheese:3, cooked:3, egg:3, or:3, fruits:3, and:3, vegetables:3, but:3

these:3, should:3, not:3, total:3, more:3, than:3, ten:3, percent:3, of:3, his:3, daily:3, food:3

Exercise:3, your:3, dog:3, regularly:3, to:3, keep:3, your:3, dog:3, stimulated:3, and:3, healthy:3, to:3

keep:3, their:3, weight:3, in:3, check:3, Different:3, breeds:3, need:3, different:3, of:3, exercise:3, Before:3

adopting:3, a:3, dog:3, research:3, the:3, energy:3, level:3, of:3, breed:3, to:3, ensure:3, you:3

will:3, be:3, able:3, to:3, provide:3, the:3, proper:3, level:3, of:3, Our:3, newest:3, puppy:3

are:3, adorable:3, and:3, quite:3, tempting:3, when:3, one:3, is:3, looking:3, to:3, adopt:3, keep:3

in:3, mind:3, that:3, for:3, the:3, first:3, year:3, the:3, amount:3, of:3, care:3, a:3

requires:3, is:3, quite:3, high:3, almost:3, to:3, the:3, level:3, of:3, a:3, toddler:3, If:3

you:3, you:3, will:3, be:3, unable:3, to:3, provide:3, that:3, level:3, of:3, care:3, please:3

consider:3, an:3, older:3, dog:3, Copyright:3, DY:3, All:3, rights:3, reserved:3,

removing a

deleting a may cause a memory leak

deleting a may cause a memory leak

deleting a may cause a memory leak

deleting a may cause a memory leak

deleting a may cause a memory leak

deleting a may cause a memory leak

deleting a may cause a memory leak

removing along

removing although

removing am

removing among

removing and

deleting and may cause a memory leak

deleting and may cause a memory leak

deleting and may cause a memory leak

deleting and may cause a memory leak

deleting and may cause a memory leak

deleting and may cause a memory leak

deleting and may cause a memory leak

removing are

deleting are may cause a memory leak

deleting are may cause a memory leak

removing as

removing at

removing be

deleting be may cause a memory leak

deleting be may cause a memory leak

deleting be may cause a memory leak

deleting be may cause a memory leak

removing because

removing between

removing can

deleting can may cause a memory leak

removing do

removing dont

removing either

removing for

deleting for may cause a memory leak

deleting for may cause a memory leak

deleting for may cause a memory leak

deleting for may cause a memory leak

removing got

removing has

removing have

removing havent

removing he

removing i

removing in

deleting in may cause a memory leak

deleting in may cause a memory leak

removing is

deleting is may cause a memory leak

deleting is may cause a memory leak

removing isnt

removing it

deleting it may cause a memory leak

removing more

deleting more may cause a memory leak

deleting more may cause a memory leak

removing much

removing neither

removing no

removing none

removing nor

removing not

deleting not may cause a memory leak

removing of

deleting of may cause a memory leak

deleting of may cause a memory leak

deleting of may cause a memory leak

deleting of may cause a memory leak

deleting of may cause a memory leak

deleting of may cause a memory leak

deleting of may cause a memory leak

removing one

deleting one may cause a memory leak

removing or

deleting or may cause a memory leak

deleting or may cause a memory leak

deleting or may cause a memory leak

removing that

deleting that may cause a memory leak

deleting that may cause a memory leak

removing the

deleting the may cause a memory leak

deleting the may cause a memory leak

deleting the may cause a memory leak

deleting the may cause a memory leak

deleting the may cause a memory leak

removing they

removing this

removing though

removing was

removing when

deleting when may cause a memory leak

removing while

removing why

removing with

deleting with may cause a memory leak

removing without

removing you

deleting you may cause a memory leak

deleting you may cause a memory leak

deleting you may cause a memory leak

deleting you may cause a memory leak

deleting you may cause a memory leak

removing your

deleting your may cause a memory leak

deleting your may cause a memory leak

deleting your may cause a memory leak

deleting your may cause a memory leak

deleting your may cause a memory leak

*****************************************

AFTER REMOVING STOP WORDS

Priority 1:

Puppies:1, Section:1, Donation:1, Welfare:1, Animal:1, Section:1, Links:1, Organizations:1, Welfare:1, Animal:1, Care:1, Puppy:1

Dog:1, Puppies:1, Dogs:1,

Priority 2:

Dogs:2, Feeding:2, Dog:2, Exercise:2, Puppy:2, Care:2,

Priority 3:

SPCA:3, ESPN:3, Check:3, regularly:3, relevant:3, links:3, Make:3, sure:3, donate:3, to:3, local:3, animal:3

shelter:3, Animal:3, Welfare:3, Institute:3, American:3, Humane:3, Association:3, Humane:3, Farming:3, Association:3, dog:3, wonderful:3

addition:3, to:3, any:3, home:3, but:3, whether:3, experienced:3, pet:3, parent:3, first:3, adopter:3, important:3

keep:3, canine:3, companion:3, health:3, happiness:3, top:3, priority:3, some:3, useful:3, tips:3, all:3, dog:3